python – How do I do a case-insensitive string comparison?

Assuming ASCII strings:

string1 = 'Hello'
string2 = 'hello'

if string1.lower() == string2.lower():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")

As of Python 3.3, casefold() is a better alternative:

string1 = 'Hello'
string2 = 'hello'

if string1.casefold() == string2.casefold():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")

If you want a more comprehensive solution that handles more complex Unicode comparisons, see the other answers.

Comparing strings in a case-insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.

The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":

"ß".lower()
#>>> 'ß'

"ß".upper().lower()
#>>> 'ss'

But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want "BUSSE" and "BUẞE" to compare equal (that's the newer capital form). The recommended way is to use casefold:

str.casefold()

Return a casefolded copy of the string. Casefolded strings may be used for
caseless matching.

Casefolding is similar to lowercasing but more aggressive because it is
intended to remove all case distinctions in a string. […]

Do not just use lower(). If casefold() is not available, doing .upper().lower() helps (but only somewhat).
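As a quick check of that advice, here is a minimal sketch that runs the three spellings mentioned above through lower() and casefold(). The variable names are mine, chosen only for illustration, and the non-ASCII letters are written as escapes so the snippet survives copy and paste.

```python
# Three spellings of the same word: ASCII capitals, mixed case with
# ß (U+00DF), and capitals with the capital sharp s ẞ (U+1E9E).
words = ["BUSSE", "Bu\u00dfe", "BU\u1e9eE"]

lowered = {w.lower() for w in words}
folded = {w.casefold() for w in words}

print(len(lowered))  # 2: lower() keeps ß, so "busse" and "buße" both remain
print(folded)        # {'busse'}: every spelling folds to the same string
```

So for this word, lower() still leaves two distinct results, while casefold() collapses all three spellings into one.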

Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê", but it isn't:

"ê" == "ê"
#>>> False

This is because the accent on the latter is a combining character.

import unicodedata

[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E WITH CIRCUMFLEX']

[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does

unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
#>>> True
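If you are unsure which normalization form to pick, a small sketch like the following shows how the four forms behave. The ligature is my own extra test case, not part of the original example: it is there to show where the compatibility forms (NFKC, NFKD) differ from the canonical ones (NFC, NFD).

```python
import unicodedata

precomposed = "\u00ea"  # ê as a single code point
combining = "e\u0302"   # e followed by COMBINING CIRCUMFLEX ACCENT
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI, an extra illustrative case

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    # Every form makes the two spellings of ê compare equal.
    same = (unicodedata.normalize(form, precomposed)
            == unicodedata.normalize(form, combining))
    # Only the compatibility forms expand the ligature to plain "fi".
    print(form, same, ascii(unicodedata.normalize(form, ligature)))
```

All four forms fix the ê comparison. NFKD and NFKC additionally replace compatibility characters such as the ligature, which is usually what you want for loose matching but does lose information.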

To finish up, here it is expressed as functions:

import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
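A short usage sketch follows. The two functions are repeated so the snippet runs on its own. The sample strings and the lookup table at the end are hypothetical, there only to show that the normalized form also works as a dictionary key.

```python
import unicodedata

def normalize_caseless(text):
    # Casefold first, then NFKD-normalize, as in the functions above.
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)

print(caseless_equal("BUSSE", "Bu\u00dfe"))  # True: ß folds to "ss"
print(caseless_equal("\u00ca", "e\u0302"))   # True: Ê vs. e + combining accent
print(caseless_equal("Hello", "Help"))       # False

# Hypothetical lookup table keyed on the normalized form.
translations = {normalize_caseless("Stra\u00dfe"): "street"}
print(translations.get(normalize_caseless("STRASSE")))  # street
```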

Using Python 2, calling .lower() on each string or Unicode object…

string1.lower() == string2.lower()

…will work most of the time, but indeed doesn't work in the situations @tchrist has described.

Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:

>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ

>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True

The Σ character has two lowercase forms, ς and σ, and .lower() won't help compare them case-insensitively.

However, as of Python 3, lower() takes the position in the word into account: a capital Σ at the end of a word lowercases to the final form ς, so calling lower() on both strings works correctly here:

>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ

>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True

So if you care about edge cases like the three sigmas in Greek, use Python 3.

(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
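On Python 3.3 or later, casefold() (mentioned in the answers above) handles the sigmas without depending on where they sit in the word. A minimal sketch, with variable names of my own choosing and the Greek letters written as escapes to avoid encoding problems:

```python
final_sigma = "\u03c2"   # ς
medial_sigma = "\u03c3"  # σ

# lower() leaves both lowercase forms as they are, so they stay different.
print(final_sigma.lower() == medial_sigma.lower())        # False
# casefold() maps ς to σ, so the two forms compare equal.
print(final_sigma.casefold() == medial_sigma.casefold())  # True

first = "\u03a3\u03af\u03c3\u03c5\u03c6\u03bf\u03c2"   # Σίσυφος
second = "\u03a3\u038a\u03a3\u03a5\u03a6\u039f\u03a3"  # ΣΊΣΥΦΟΣ
print(first.casefold() == second.casefold())              # True
```

This matters when the strings being compared are not whole words, for example a lone ς against a lone σ, where lower() has nothing to go on.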
