python – Regular expression matching a multiline block of text
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
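A small sketch of that anchor behaviour may help. The sample string and variable names here are made up for illustration; only the re calls come from the standard library.

```python
import re

text = "first\nsecond\nthird"

# Without MULTILINE, ^ and $ only anchor at the ends of the whole string,
# so a pattern for a single word-only line finds nothing here.
whole_string = re.findall(r"^\w+$", text)
print(whole_string)  # []

# With MULTILINE, ^ and $ anchor at every line boundary.
lines = re.findall(r"^\w+$", text, re.MULTILINE)
print(lines)  # ['first', 'second', 'third']

# The anchors are zero-width: ^ matches a position, never the newline itself.
starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
print(starts)  # [0, 6, 13]
```

Because the anchors consume nothing, any newline you want to step over has to appear explicitly in the pattern.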
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
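As a rough check, the inclusive pattern can be tried on the same text with LF and with CRLF line endings. The helper parse_block and the sample text are hypothetical. Note that in Python the dot also matches \r, so the captured groups may carry stray carriage returns. This sketch strips them, and it assumes the body lines contain no spaces.

```python
import re

# Newline-tolerant pattern: a title line, a blank line, then the body lines.
rx = re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

def parse_block(text):
    """Hypothetical helper: return (title, body_lines), or None if no match."""
    m = rx.search(text)
    if m is None:
        return None
    # The dot matches "\r" in Python, so the groups can end with a stray
    # carriage return; strip() and split() discard it.
    title = m.group(1).strip()
    body_lines = m.group(2).split()  # assumes body lines contain no spaces
    return title, body_lines

lf_text = "Title\n\nAAA\nBBB\n"
crlf_text = lf_text.replace("\n", "\r\n")

print(parse_block(lf_text))    # ('Title', ['AAA', 'BBB'])
print(parse_block(crlf_text))  # ('Title', ['AAA', 'BBB'])
```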
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
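To see why DOTALL gets in the way, here is a small comparison on an invented two-block sample, using the linefeed-only pattern. With DOTALL the dot runs across line breaks, so the first group swallows most of the text.

```python
import re

pattern = r"^(.+)\n((?:\n.+)+)"
text = "Title\n\nAAA\nBBB\n\nOther\n\nCCC\n"

# Dot stops at newlines: each title/body block is found separately.
plain = re.findall(pattern, text, re.MULTILINE)
print(plain)  # [('Title', '\nAAA\nBBB'), ('Other', '\nCCC')]

# With DOTALL the dot crosses line breaks and the blocks merge into one match.
merged = re.findall(pattern, text, re.MULTILINE | re.DOTALL)
print(len(merged))  # 1
```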
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("",sequence)
...     print("Title:",title)
...     print("Sequence:",sequence)
...     print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
- The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
- Then (.+?)\n\n means "match as few characters as possible (any character except a newline is allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
- [A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
- ((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
- You could add a final \n in the regular expression if you want to enforce a double newline at the end.
- Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
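The last point can be sketched mechanically: build the tolerant pattern by textual substitution and run it on LF and CRLF versions of the same input. The helper extract and the sample text are hypothetical, not part of the answer above.

```python
import re

base = r"^(.+?)\n\n((?:[A-Z]+\n)+)"
NEWLINE = r"(?:\n|\r\n?)"

# Replace every literal "\n" token in the pattern source with the alternation.
portable = base.replace(r"\n", NEWLINE)
rx_portable = re.compile(portable, re.MULTILINE)
rx_blanks = re.compile(r"\W+")  # removes blanks and any kind of newline

def extract(text):
    """Hypothetical helper: return a list of (title, joined_sequence) pairs."""
    return [
        (m.group(1).strip(), rx_blanks.sub("", m.group(2)))
        for m in rx_portable.finditer(text)
    ]

lf_text = "Some title\n\nAAA\nBBB\n\nNext\n\nCCC\n"
crlf_text = lf_text.replace("\n", "\r\n")

print(extract(lf_text))    # [('Some title', 'AAABBB'), ('Next', 'CCC')]
print(extract(crlf_text))  # [('Some title', 'AAABBB'), ('Next', 'CCC')]
```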
The following is a regular expression matching a multiline block of text:
import re
result = re.findall(r'(startText)(.+)((?:\n.+)+)(endText)', input)
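A quick sketch of what this pattern returns, with startText and endText taken as literal placeholder markers and the sample strings invented for illustration:

```python
import re

pattern = r"(startText)(.+)((?:\n.+)+)(endText)"

text = "startText first line\nsecond line\nthird line endText"
result = re.findall(pattern, text)
print(result)
# [('startText', ' first line', '\nsecond line\nthird line ', 'endText')]

# The pattern needs at least one line break between the two markers,
# so a block that sits on a single line is not matched.
single_line = re.findall(pattern, "startText only one line endText")
print(single_line)  # []

# It also needs at least one character before endText on its own line.
marker_alone = re.findall(pattern, "startText a\nb\nendText")
print(marker_alone)  # []
```

Each result is a 4-tuple: the start marker, the rest of the first line, the remaining lines (each with its leading newline), and the end marker.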