python – How to strip all whitespace from string

Taking advantage of str.split's behavior with no sep parameter:

>>> s = " \t foo \n bar "
>>> "".join(s.split())
'foobar'

If you just want to remove spaces instead of all whitespace:

>>> s.replace(" ", "")
'\tfoo\nbar'
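The two one-liners above can be wrapped in small helpers so the difference between them is easy to check. This is a minimal sketch; the function names `strip_all_whitespace` and `strip_spaces_only` are made up for illustration and are not part of any library.

```python
def strip_all_whitespace(text):
    """Remove every run of whitespace by splitting on it and re-joining."""
    # str.split() with no separator splits on any whitespace run and
    # discards empty strings, so joining the pieces drops all of it.
    return "".join(text.split())


def strip_spaces_only(text):
    """Remove only the ordinary space character; tabs and newlines stay."""
    return text.replace(" ", "")


sample = " \t foo \n bar "
print(repr(strip_all_whitespace(sample)))  # 'foobar'
print(repr(strip_spaces_only(sample)))     # '\tfoo\nbar'
```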

Premature optimization

Efficiency isn't the primary goal here (writing clear code is), but here are some initial timings:

$ python -m timeit '"".join(" \t foo \n bar ".split())'
1000000 loops, best of 3: 1.38 usec per loop
$ python -m timeit -s 'import re' 're.sub(r"\s+", "", " \t foo \n bar ")'
100000 loops, best of 3: 15.6 usec per loop

Note the regex is cached, so it's not as slow as you'd imagine. Compiling it beforehand helps some, but would only matter in practice if you call this many times:

$ python -m timeit -s 'import re; e = re.compile(r"\s+")' 'e.sub("", " \t foo \n bar ")'
100000 loops, best of 3: 7.76 usec per loop

Even though re.sub is 11.3x slower, remember your bottlenecks are assuredly elsewhere. Most programs would not notice the difference between any of these 3 choices.
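If you would rather reproduce the comparison from a script than from the shell, something like the following should work. It is only a sketch: the function names and the loop count are arbitrary choices made here, and the numbers you get will differ from the ones quoted above depending on machine and Python version.

```python
import re
import timeit

SAMPLE = " \t foo \n bar "
WHITESPACE_RE = re.compile(r"\s+")  # compiled once, reused on every call


def via_split(text=SAMPLE):
    return "".join(text.split())


def via_re_sub(text=SAMPLE):
    # Relies on the re module's internal cache of compiled patterns.
    return re.sub(r"\s+", "", text)


def via_compiled(text=SAMPLE):
    return WHITESPACE_RE.sub("", text)


LOOPS = 10_000  # arbitrary; raise it for steadier numbers
for func in (via_split, via_re_sub, via_compiled):
    seconds = timeit.timeit(func, number=LOOPS)
    print(f"{func.__name__}: {seconds / LOOPS * 1e6:.2f} usec per loop")
```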

For Python 3:

>>> import re
>>> re.sub(r'\s+', '', 'strip my \n\t\r ASCII and \u00A0 \u2003 Unicode spaces')
'stripmyASCIIandUnicodespaces'
>>> # Or, depending on the situation:
>>> re.sub(r'(\s|\u180B|\u200B|\u200C|\u200D|\u2060|\uFEFF)+', '',
... '\uFEFF\t\t\t strip all \u000A kinds of \u200B whitespace \n')
'stripallkindsofwhitespace'

…handles any whitespace characters that you're not thinking of – and believe us, there are plenty.

\s on its own always covers the ASCII whitespace:

  • (regular) space
  • tab
  • new line (\n)
  • carriage return (\r)
  • form feed
  • vertical tab

Additionally:

  • for Python 2 with re.UNICODE enabled,
  • for Python 3 without any extra actions,

\s also covers the Unicode whitespace characters, for example:

  • non-breaking space,
  • em space,
  • ideographic space,

…etc. See the full list here, under Unicode characters with White_Space property.
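To see the difference between the ASCII-only and the Unicode-aware behavior in Python 3, you can compare the default matching with the re.ASCII flag. This is a small sketch; the sample string is made up for the demonstration.

```python
import re

# Non-breaking space, em space and ideographic space between the letters.
text = "a\u00A0b\u2003c\u3000d"

# Python 3 str patterns: \s matches Unicode whitespace by default.
unicode_result = re.sub(r"\s+", "", text)

# With re.ASCII, \s is limited to space, \t, \n, \r, \f and \v.
ascii_result = re.sub(r"\s+", "", text, flags=re.ASCII)

print(repr(unicode_result))   # 'abcd'
print(ascii_result == text)   # True: nothing was removed
```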

However, \s DOES NOT cover characters that are not classified as whitespace but are de facto whitespace, such as, among others:

  • zero-width joiner,
  • Mongolian vowel separator,
  • zero-width non-breaking space (a.k.a. byte order mark),

…etc. See the full list here, under Related Unicode characters without White_Space property.

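So these 6 characters are covered by the list in the second regex, \u180B|\u200B|\u200C|\u200D|\u2060|\uFEFF. Note that the Mongolian vowel separator itself is U+180E; U+180B is the Mongolian free variation selector one, so substitute \u180E if the vowel separator is the character you need to remove.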
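The second regex can also be written as a single precompiled character class, which is a little easier to extend. The sketch below is one way to do it, not the only one: the names `INVISIBLE`, `PATTERN` and `strip_whitespace_like` are invented here, and U+180E is added to the answer's six code points as an assumption about which characters you want gone.

```python
import re

# The six code points from the regex above, plus U+180E (added here as an
# assumption, since U+180E is the Mongolian vowel separator).
INVISIBLE = "\u180B\u200B\u200C\u200D\u2060\uFEFF\u180E"

# \s handles real whitespace; the literal characters handle the rest.
PATTERN = re.compile(r"[\s" + INVISIBLE + r"]+")


def strip_whitespace_like(text):
    """Remove whitespace and the invisible characters listed above."""
    return PATTERN.sub("", text)


sample = "\uFEFF\t\t\t strip all \u000A kinds of \u200B whitespace \n"
print(strip_whitespace_like(sample))  # stripallkindsofwhitespace

# These characters are not whitespace as far as str.isspace() is concerned,
# which is why \s alone leaves them in place.
print("\u200B".isspace())  # False
```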

Alternatively, in Python 2:

import string
"strip my spaces".translate(None, string.whitespace)

And here is the Python 3 version:

import string
"strip my spaces".translate(str.maketrans("", "", string.whitespace))
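One thing to keep in mind with the translate approach is that string.whitespace contains only the six ASCII whitespace characters, so Unicode spaces survive. The sketch below shows this in Python 3 and one possible way to widen the table; the `EXTRA` string is a hypothetical choice made for the example and should be adjusted to the characters you actually expect.

```python
import string

# Deletion table built from the ASCII whitespace characters only.
ASCII_TABLE = str.maketrans("", "", string.whitespace)

text = "strip\tmy \u00A0 spaces\n"
ascii_only = text.translate(ASCII_TABLE)
print(repr(ascii_only))  # 'stripmy\xa0spaces' (the non-breaking space remains)

# Hypothetical extension: also delete a few Unicode spaces.
EXTRA = "\u00A0\u2003\u3000"  # non-breaking, em and ideographic space
WIDER_TABLE = str.maketrans("", "", string.whitespace + EXTRA)
print(repr(text.translate(WIDER_TABLE)))  # 'stripmyspaces'
```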
