python – Regular expression matching a multiline block of text

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
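Here is a minimal sketch of that anchor behaviour, using a made-up three-line string (the sample text and variable names are mine, not from the question). The anchors match positions, so the newlines never end up in the match:

```python
import re

text = "first\nsecond\nthird"  # illustrative sample, not from the question

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# but they are zero-width: the newline characters are never consumed.
lines = re.findall(r"^.+$", text, re.MULTILINE)
print(lines)  # ['first', 'second', 'third']

# ^ on its own matches empty positions: index 0 and right after each "\n".
starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
print(starts)  # [0, 6, 13]

# Without re.MULTILINE the same pattern finds nothing here, because $ only
# matches at the end of the string (or just before a final newline).
print(re.findall(r"^.+$", text))  # []
```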

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
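To see the difference between the two versions, here is a small sketch run against a made-up sample with \r\n line endings (the sample text and the names lf_only and inclusive are my own). One thing to watch for: the dot also matches \r, so the captured groups may carry stray carriage returns that you will probably want to strip.

```python
import re

# Sample text with Windows-style (\r\n) line endings; made up for illustration.
text = "Title\r\n\r\nAAA\r\nBBB\r\n"

lf_only = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
inclusive = re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

# The linefeed-only pattern expects two consecutive "\n" after the title,
# which never occur in "\r\n\r\n", so it finds nothing.
print(lf_only.search(text))  # None

match = inclusive.search(text)
# The dot also matches "\r", so the raw groups can carry stray carriage
# returns; strip() and split() tidy them up.
title = match.group(1).strip()
body_lines = match.group(2).split()
print(title, body_lines)  # Title ['AAA', 'BBB']
```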

BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
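A quick way to convince yourself is to run the same pattern with and without DOTALL. This is only a sketch on a made-up sample; the point is that with DOTALL the first group is no longer confined to a single line.

```python
import re

text = "Title\n\nAAA\nBBB\n\nOther\n\nCCC\n"  # made-up sample

pattern = r"^(.+)\n((?:\n.+)+)"

plain = re.search(pattern, text, re.MULTILINE)
print(repr(plain.group(1)))  # 'Title'
print(repr(plain.group(2)))  # '\nAAA\nBBB'

# With DOTALL the dot also matches newlines, so the greedy (.+) runs past
# the end of the title line and swallows later lines as well.
dotall = re.search(pattern, text, re.MULTILINE | re.DOTALL)
print(repr(dotall.group(1)))
```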

This will work:

>>> import re
>>> rx_sequence = re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)", re.MULTILINE)
>>> rx_blanks = re.compile(r"\W+")  # to remove blanks and newlines
>>> text = """Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...   title, sequence = match.groups()
...   title = title.strip()
...   sequence = rx_blanks.sub("", sequence)
...   print("Title:", title)
...   print("Sequence:", sequence)
...   print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)

  • The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
  • Then (.+?)\n\n means "match as few characters as possible (any character except a newline) until you reach two newlines". The result (without the newlines) is put in the first group.
  • [A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
  • ((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
  • You could add a final \n in the regular expression if you want to enforce a double newline at the end.
  • Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
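As a sketch of that last point, here is the same expression with every \n swapped for (?:\n|\r\n?), run on a made-up sample that uses \r\n line endings. The NL helper name and the shortened sample are my own, not part of the original answer.

```python
import re

NL = r"(?:\n|\r\n?)"  # one newline of any flavour: \n, \r or \r\n

# Same structure as ^(.+?)\n\n((?:[A-Z]+\n)+), with every \n replaced by NL.
rx_sequence = re.compile(
    r"^(.+?)" + NL + NL + r"((?:[A-Z]+" + NL + r")+)", re.MULTILINE
)
rx_blanks = re.compile(r"\W+")  # to remove blanks and newlines

# Made-up sample using \r\n line endings.
text = (
    "Some varying text1\r\n\r\n"
    "AAABBB\r\nCCCDDD\r\n\r\n"
    "Some varying text 2\r\n\r\n"
    "EEE\r\nFFF\r\n"
)

records = [
    (title.strip(), rx_blanks.sub("", sequence))
    for title, sequence in rx_sequence.findall(text)
]
print(records)
# [('Some varying text1', 'AAABBBCCCDDD'), ('Some varying text 2', 'EEEFFF')]
```

One caveat: in Python, ^ in multiline mode only matches right after a \n, so text that uses bare \r line endings would likely need to be normalised first.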

The following is a regular expression matching a multiline block of text:

import re
# startText and endText stand for your literal markers; text is the string to search
result = re.findall(r"(startText)(.+)((?:\n.+)+)(endText)", text)
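
For a concrete run, here is a sketch with hypothetical markers BEGIN and END and a made-up sample string (none of these come from the answer above). The markers go through re.escape in case they contain regex metacharacters.

```python
import re

# Hypothetical markers and sample text, chosen only for illustration.
start_text = "BEGIN"
end_text = "END"
text = "BEGIN header\nline one\nline two END\ntrailing"

pattern = (
    "(" + re.escape(start_text) + r")(.+)((?:\n.+)+)(" + re.escape(end_text) + ")"
)
result = re.findall(pattern, text)
print(result)
# [('BEGIN', ' header', '\nline one\nline two ', 'END')]
```

Note that this pattern needs at least one character after the start marker on the same line and at least one line break before the end marker, so it will not match a block that fits on a single line.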
