Regex on the Texts of Harry Potter

Each tuple in the list corresponds to a chapter in the book and contains the text of the entire chapter. It takes the form (‘C H A P T E R O N E’, ‘THE BOY WHO LIVED’, ‘Mr. and Mrs. Dursley, of number four, Privet Drive, blah, blah, blah…’).

The sub-pattern ([A-Z][ ]){9,}[A-Z] means any UPPERCASE letter followed by a space, repeated 9 or more times, and ending with a final UPPERCASE letter. This is how the chapter numbers are captured. It is followed by \s+, which is not in parentheses and so is not captured; it instructs Python to look for whitespace (line breaks, tabs, spaces, etc.) one or more times.

This again means any UPPERCASE letter and/or line break (…

So the chapter-title pattern above captures all UPPERCASE letters unless the next word is also all UPPERCASE but immediately followed by a period, a condition declared with a positive lookahead, ?=. (… WEASLEY”, also contains a word followed by a period, so now we’d only capture the chapter title as “THE WOES OF”). We need another negative lookahead, (?![a-z']|[A-Z.]), to make sure that all-UPPERCASE words followed by a period are not also followed by lowercase words (chapter text) and are not the last word in the UPPERCASE string (because although the chapter title may contain a period, it never ends in one).

Group #3 is the easiest one: (.*?). … Keep going until forced to stop by the next part.

The last part of the regular expression tells Python at what point to stop capturing text: (?=(?:[A-Z][ ]){9,}|This book… It provides instructions to stop capturing text once a sequence of an UPPERCASE letter and a space is repeated at least 9 times (because thus begins the next chapter, “C H A P T E R T W O”), or once the string “This book…” is reached. This final string marks the end of the book, for each of the seven Harry Potter books ends with this text: This book was art directed by David Saylor and designed by Becky Terhune.
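To see the pieces working together, here is a minimal sketch on a made-up two-chapter string. It is not the full expression from this post: the title group is simplified to plain uppercase words (none of the lookaheads for periods are included), the heading repetition is written as a non-capturing group wrapped in one outer group, and the names sample, pattern and chapters are placeholders chosen for illustration.

```python
import re

# Hypothetical stand-in for a book's text; only the closing credit line
# is taken from the description above.
sample = (
    "C H A P T E R O N E\n\n"
    "FIRST TITLE\n\n"
    "Some text of the first chapter. More text.\n\n"
    "C H A P T E R T W O\n\n"
    "SECOND TITLE\n\n"
    "Some text of the second chapter.\n\n"
    "This book was art directed by David Saylor and designed by Becky Terhune."
)

pattern = re.compile(
    r"((?:[A-Z][ ]){9,}[A-Z])"           # group 1: spaced-out chapter heading
    r"\s+"                               # whitespace, matched but not captured
    r"([A-Z][A-Z ]*[A-Z])"               # group 2: simplified title (assumption)
    r"\s+"                               # whitespace between title and body
    r"(.*?)"                             # group 3: lazy capture of the chapter text
    r"(?=(?:[A-Z][ ]){9,}|This book)",   # stop at the next heading or the credit line
    re.DOTALL,                           # let '.' match line breaks too
)

# findall returns one (heading, title, text) tuple per chapter.
chapters = [(h, t, body.strip()) for h, t, body in pattern.findall(sample)]

for heading, title, body in chapters:
    print(heading, "|", title, "|", body)
```

One design note: in Python, a capturing group that is repeated, such as ([A-Z][ ]){9,}, keeps only its last repetition, so the sketch wraps a non-capturing (?:...) repetition in a single outer group to get the whole heading back.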
