python – How do I do a case-insensitive string comparison?
Assuming ASCII strings:
string1 = 'Hello'
string2 = 'hello'
if string1.lower() == string2.lower():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")
As of Python 3.3, casefold() is a better alternative:
string1 = 'Hello'
string2 = 'hello'
if string1.casefold() == string2.casefold():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")
If you want a more comprehensive solution that handles more complex Unicode comparisons, see the other answers.
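To see why casefold() is the safer choice, here is a small sketch comparing the two methods on a string containing ß. The example words are my own choice, not something from the question:

```python
# Two spellings of the same German word. lower() leaves "ß" alone,
# while casefold() expands it to "ss".
word_a = "Straße"
word_b = "STRASSE"

lower_match = word_a.lower() == word_b.lower()
casefold_match = word_a.casefold() == word_b.casefold()

print(lower_match)     # False
print(casefold_match)  # True
```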
Comparing strings in a case-insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.
The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":
"ß".lower()
#>>> 'ß'
"ß".upper().lower()
#>>> 'ss'
But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want "BUSSE" and "BUẞE" to compare equal (that's the newer capital form). The recommended way is to use casefold:
str.casefold()
Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.
Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. […]
Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).
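A minimal sketch of that fallback, assuming you want one helper that works whether or not casefold exists. The name fold_case is hypothetical, made up for this illustration:

```python
def fold_case(text):
    """Return a case-folded copy of text, falling back if casefold() is missing."""
    casefold = getattr(text, "casefold", None)  # not present before Python 3.3
    if casefold is not None:
        return casefold()
    # Fallback: turns "ß" into "SS" and then "ss", but misses other distinctions.
    return text.upper().lower()

capital_sharp_s = "BU\u1e9eE"  # "BUẞE", written with LATIN CAPITAL LETTER SHARP S

print(fold_case("BUSSE") == fold_case("Buße"))           # True
print(fold_case("BUSSE") == fold_case(capital_sharp_s))  # True when casefold() exists

# The fallback on its own still treats the capital sharp s as different:
print("BUSSE".upper().lower() == capital_sharp_s.upper().lower())  # False
```

The last line is the "only somewhat" part: .upper() leaves ẞ unchanged and .lower() then gives ß, so the round trip never produces "ss" for it.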
Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê", but it doesn't hold:
"\u00ea" == "e\u0302"  # precomposed "ê" vs. "e" plus a combining circumflex
#>>> False
This is because the accent on the latter is a combining character.
import unicodedata
[unicodedata.name(char) for char in "\u00ea"]
#>>> ['LATIN SMALL LETTER E WITH CIRCUMFLEX']
[unicodedata.name(char) for char in "e\u0302"]
#>>> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']
The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does
unicodedata.normalize("NFKD", "\u00ea") == unicodedata.normalize("NFKD", "e\u0302")
#>>> True
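If you are unsure which form to pick, this short sketch shows how the forms differ. The fi ligature is an example I chose to illustrate the compatibility forms; it does not appear in the question:

```python
import unicodedata

composed = "\u00ea"     # "ê" as a single code point
decomposed = "e\u0302"  # "e" followed by COMBINING CIRCUMFLEX ACCENT
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

# Canonical forms: NFC composes, NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Compatibility forms (NFKC, NFKD) also replace characters such as ligatures.
print(unicodedata.normalize("NFD", ligature) == "fi")   # False
print(unicodedata.normalize("NFKD", ligature) == "fi")  # True
```

So NFKD treats more strings as equal than NFD does, which is usually what you want for loose matching and not what you want if the ligature itself is significant.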
To finish up, here is this expressed as functions:
import unicodedata
def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())
def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
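A quick usage check of those two functions, repeated here so the snippet runs on its own. The test strings are the ones discussed above plus one deliberate mismatch:

```python
import unicodedata

def normalize_caseless(text):
    # Fold case first, then decompose so accents compare consistently.
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)

print(caseless_equal("BUSSE", "Buße"))      # True
print(caseless_equal("\u00ca", "e\u0302"))  # True: "Ê" vs. "e" + combining accent
print(caseless_equal("Busse", "Bus"))       # False
```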
Using Python 2, calling .lower() on each string or Unicode object…
string1.lower() == string2.lower()
…will work most of the time, but indeed doesn't work in the situations @tchrist has described.
Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:
>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True
The Σ character has two lowercase forms, ς and σ, and .lower() won't help compare them case-insensitively.
However, in Python 3 (3.3 is shown here), lower() maps a word-final Σ to ς and any other Σ to σ, so calling lower() on both strings works correctly:
>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True
So if you care about edge cases like the three sigmas in Greek, use Python 3.
(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
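If you would rather not create a file, the same check can be sketched with string literals in Python 3. The casefold() comparison at the end is my addition and is not part of the sessions above:

```python
capitalized = "Σίσυφος"
uppercase = "ΣΊΣΥΦΟΣ"

# lower() picks the final sigma where it belongs, so both results agree.
print(capitalized.lower() == uppercase.lower())        # True on Python 3.3+
print(capitalized.casefold() == uppercase.casefold())  # True

# casefold() also equates the two lowercase sigmas, which lower() keeps apart.
print("ς".lower() == "σ".lower())        # False
print("ς".casefold() == "σ".casefold())  # True
```

The last two lines matter when the inputs are already lowercase but differ only in which sigma they use; lower() cannot reconcile those, while casefold() maps both to σ.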