Using Regular Expressions in Genetics with Python

The ? metacharacter finds the preceding character or character group zero or one times.

To match a specific number of repeats, we can use curly bracket notation.

Following a character or group with a single number in the curly brackets will match exactly that number of repeats.

For example CA{3}GATT will match CAAAGATT, but not CAAGATT, or CAAAAGATT.

To specify a range, we can use the same curly bracket syntax with the general pattern {n,x}, which finds the preceding character or group between n and x times inclusive. In Python the braces must not contain a space after the comma; otherwise they are matched as literal characters.

For example, TCG{2,4}A will match TCGGA, TCGGGA, and TCGGGGA, but not TCGA or TCGGGGGA.
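The quantifier examples above can be checked quickly with the re module. The following is a minimal sketch; it uses re.fullmatch rather than a search, so each test string has to match the pattern from its first to its last character, and the variable names are chosen only for illustration.

```python
import re

# Exactly three A's between the C and GATT.
exact = re.compile(r"CA{3}GATT")
for seq in ["CAAGATT", "CAAAGATT", "CAAAAGATT"]:
    print(seq, bool(exact.fullmatch(seq)))

# Between two and four G's, inclusive (no space after the comma).
ranged = re.compile(r"TCG{2,4}A")
for seq in ["TCGA", "TCGGA", "TCGGGA", "TCGGGGA", "TCGGGGGA"]:
    print(seq, bool(ranged.fullmatch(seq)))
```

Only CAAAGATT passes the first check, and only the strings with two, three, or four G's pass the second.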

A summary of these metacharacters can be found in Table 1.

Match a pattern

The re module is used to write regular expressions (regex) in Python.

To load this module, we need to use the import statement.

The following line of code must be included at the top of your code:

import re

In order to use a tool from the regular expression module, it is necessary to prefix it with the module name.

The simplest regular expression function, re.search(), determines whether or not a pattern exists somewhere within a string.

re.search() takes 2 arguments, which are both strings.

The first argument is the pattern you want to search for, and the second argument is the string that you want to search in.

For clarity, in the example below, I have included the 2 arguments as keyword arguments.

It is usually not necessary to include these keyword arguments.

For completeness, I have included the alternative syntactic version commented out.

A simple example is illustrated in the code below; here we are looking for the presence of a tri-nucleotide repeat of A bases within a hard-coded DNA variable.
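The original listing is not reproduced in this text, so the following is a minimal sketch of what such a search could look like. The sequence stored in dna is made up for illustration; pattern and string are the real keyword names accepted by re.search().

```python
import re

# Hypothetical hard-coded DNA sequence, used only for illustration.
dna = "ATCGCGAATTCACAAAGTTTC"

# Keyword-argument form: pattern is what to look for, string is where to look.
match = re.search(pattern="AAA", string=dna)

# Equivalent positional form:
# match = re.search("AAA", dna)

print(match)
# <re.Match object; span=(13, 16), match='AAA'>
```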

The output of many re function calls is a match object.

If we look at the match object above, we can determine whether there was a match.

The span gives the start and end indices in the string where the match occurred, and the match shows the exact string that was matched.

A regular expression search can also be used as part of a conditional statement.

Extracting match object values

Often in our scripts, we want to determine not only whether a match occurred, but also precisely where it occurred.

We may also want to extract the match itself.

Fortunately, Python makes it simple to extract match object values such as the index positions of the match and the exact string that matched.

This can be achieved by employing a few methods on the match object.
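As a rough illustration, the sketch below uses a search result in a conditional statement and then reads values off the match object with its group(), start(), end(), and span() methods. The dna sequence is hypothetical.

```python
import re

dna = "ATCGCGAATTCACAAAGTTTC"  # hypothetical sequence for illustration
match = re.search(r"AAA", dna)

# re.search returns None when nothing matches, so the result can be tested directly.
if match:
    print("Matched text:", match.group())  # the exact string that matched
    print("Start index:", match.start())   # index of the first matched character
    print("End index:", match.end())       # index one past the last matched character
    print("Span:", match.span())           # (start, end) as a tuple
else:
    print("No tri-nucleotide A repeat found")
```

Because the end index is one past the last matched character, dna[match.start():match.end()] gives back the matched text.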

Alternation and Character Groups

re.search() can also be utilised to find more flexible patterns.

By way of an example, the NciI restriction enzyme recognises the nucleotide sequence pattern ‘CCSGG’, where the nucleotide code ‘S’ can be either a C or a G.

Instead of writing two regular expression searches, we can capture this variation using an alternation.

Here, to represent a number of different alternatives, we write the alternatives inside parentheses and separate them using a pipe character (the | metacharacter is also known as the alternation operator, see Table 1).
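A minimal sketch of the CCSGG search written with an alternation follows. The two test sequences are invented for illustration, and the variable names are not taken from the original code.

```python
import re

# 'S' means C or G, so the middle position becomes the alternation (C|G).
nci_site = re.compile(r"CC(C|G)GG")

with_site = "ATTACCGGGTTA"     # hypothetical sequence containing CCGGG
without_site = "ATTACCAGGTTA"  # hypothetical sequence with CCAGG instead

hit = nci_site.search(with_site)
if hit:
    # group(1) holds whichever alternative matched inside the parentheses.
    print("Site", hit.group(), "found at", hit.span(), "with S =", hit.group(1))

print("Second sequence has a site:", bool(nci_site.search(without_site)))
```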

The utility of alternation groups stems from their ability to be used as units of repetitions.

For example, to identify whether a sequence is delimited by a start and stop codon, and therefore has the potential to be an open reading frame, we could write a regular expression built around an alternation group. This regex will search for UAA, UAG, or UGA at the end of the sequence.
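The original expression is not reproduced in this text, so the sketch below is one assumed way to write it. It expects an RNA string, anchors the start codon AUG at the beginning with ^, anchors one of the three stop codons at the end with $, and allows any number of bases in between.

```python
import re

# Start codon, any number of RNA bases, then one of the three stop codons.
orf_pattern = re.compile(r"^AUG[ACGU]*(UAA|UAG|UGA)$")

# Hypothetical test sequences.
candidates = ["AUGGCCUAA", "AUGGCUAG", "AUGGCCUAC"]
for rna in candidates:
    print(rna, bool(orf_pattern.search(rna)))
```

The first two candidates are accepted because they start with AUG and end with a stop codon; the third ends in UAC and is rejected. Note that the second candidate is accepted even though its length is not a multiple of 3.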

To improve upon this regex, and ensure the start and stop codon are in the same frame, we could change the regex so that it checks that the number of characters between the start and stop codon is a multiple of 3.
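A possible form of the frame-aware version is sketched here, again as an assumption rather than the original code. The group ([ACGU]{3}) matches exactly one codon, and repeating it with * forces the bases between the start and stop codon to come in whole triplets.

```python
import re

# Start codon, zero or more complete codons, then a stop codon.
in_frame_orf = re.compile(r"^AUG([ACGU]{3})*(UAA|UAG|UGA)$")

# Hypothetical test sequences.
print(bool(in_frame_orf.search("AUGGCCUAA")))  # one whole codon between start and stop
print(bool(in_frame_orf.search("AUGGCUAG")))   # only two bases between start and stop
```

This sketch only checks the frame. It does not reject sequences that contain an earlier in-frame stop codon.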

Character groups

Character groups can also be used to capture variation within a single pattern.

Consider the consensus N-glycosylation site in proteins.

This sequence motif has the pattern: Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro.

Using the corresponding single letter amino acid codes, with character groups, we can write this pattern as N[^P][ST][^P].

This pattern will identify the single letter code N, followed by any character that is not P (see Table 1, negated character groups), followed by either an S or a T, followed by any character that is not a P.

A pair of square brackets with a list of characters inside them can represent any one of these characters (refer to Table 1).
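The motif described above can be tried out with a short sketch. The protein string below is invented so that it contains one valid site and two near misses; it is not a real protein sequence.

```python
import re

# N, then anything but P, then S or T, then anything but P.
glyco_pattern = re.compile(r"N[^P][ST][^P]")

# Hypothetical protein sequence: NGSA is a valid site,
# NPTQ fails at the second position and NATP fails at the fourth.
protein = "MKNGSAANPTQNATP"

for site in glyco_pattern.finditer(protein):
    print(site.group(), "at index", site.start())
```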

The Power of Regex

The real power of regex is exploited when these tools are used together.

Consider the following scenario.

Many human hereditary neurodegenerative disorders, such as Huntington’s disease (HD), have been linked to the anomalous expansion in the number of trinucleotide repeats in particular genes.

The pathological severity of HD correlates with the number of (CAG)n repeats in exon-1 of the gene htt, which encodes the protein huntingtin.

In Huntington’s disease, a higher number of repeats means an earlier onset of disease and a more rapid disease progression.

The CAG codon specifies glutamine, and HD belongs to a broad class of polyglutamine diseases.

Healthy (wild-type) variants of this gene feature between 6 and 35 tandem repeats, whereas more than 35 repeats virtually assure the disease.

We can use regex in order to decipher the polyglutamine repeat number.

This firstly involves writing a pattern to find the tri-nucleotide repeat number above a set threshold.

The codon CAA also encodes glutamine; therefore, in the htt_pattern we must use the | alternation operator.

We can then use the curly bracket notation discussed above to specify how many times we want to find this pattern.

Here, I have chosen 18 or more times, by deliberately leaving off the upper limit.
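A sketch of how htt_pattern could be written is given below. The exact original pattern is not reproduced in this text, so treat it as an assumption based on the description: an alternation of the two glutamine codons followed by {18,}. The two test strings are synthetic and are not real htt sequence.

```python
import re

# CAG or CAA, repeated 18 or more times (no upper limit after the comma).
htt_pattern = r"(CAG|CAA){18,}"

# Synthetic sequences for illustration only.
short_run = "GCC" + "CAG" * 10 + "CCG"
long_run = "GCC" + "CAG" * 19 + "CAA" + "CAG" + "CCG"

print(re.search(htt_pattern, short_run))  # None: only 10 repeats

match = re.search(htt_pattern, long_run)
print(len(match.group()) // 3, "glutamine codons found")
```

Dividing the length of the matched text by 3 converts it from bases to codons, which gives the repeat number.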

I first searched the NCBI nucleotide database for the htt mRNA sequence, and downloaded it to my working directory.

I then read this sequence in, and used my pattern to determine the length of any glutamine tandem repeats of 18 or more.

For clarity I have highlighted the match in the NCBI FASTA file.

I also used the re.findall() function, as re.search() would only find the first occurrence, and it is possible in this case that many matches could be found.
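The workflow described above might be sketched as follows. This is not the original script: the FASTA text is a made-up record held in a string so that the example runs on its own, whereas a real script would read the text from the downloaded file. The helper read_fasta_sequence is a hypothetical name. One detail worth noting is that re.findall() returns only the contents of the group when the pattern contains a capturing group, so this sketch uses a non-capturing group, (?:...), to get each whole repeat back.

```python
import re

def read_fasta_sequence(text):
    """Join the sequence lines of a single-record FASTA string, skipping the header."""
    lines = text.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))

# Made-up record standing in for the downloaded file; not real htt sequence.
sequence_part = "ATGGCG" + "CAG" * 20 + "CAACAG" + "CCGCCA" + "CAG" * 5 + "TGA"
fasta_text = ">hypothetical_record\n" + "\n".join(
    sequence_part[i:i + 30] for i in range(0, len(sequence_part), 30)
)

sequence = read_fasta_sequence(fasta_text)

# Non-capturing group so that re.findall returns each whole match
# rather than only the last codon captured by the group.
htt_pattern = r"(?:CAG|CAA){18,}"
repeats = re.findall(htt_pattern, sequence)

for repeat in repeats:
    print(len(repeat) // 3, "glutamine codons:", repeat)
```

In this synthetic record only the first run is long enough to be reported; the later run of five CAG codons falls below the threshold of 18.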

Conclusion

This tutorial has provided a brief introduction to how regex can be applied specifically in genetics.

Knowledge of regexes is highly transferable; in particular, the syntax and functionality are broadly similar in Python and in many other mainstream programming languages, such as Perl and R.
