How to Design Powerful Scripts in Genetics

User Input and Command Line Arguments Made Simple with Python

Stephen Fordham, Apr 11

The design of a script should have the user in mind.

Often, when we write a script which solves a particular problem, the chances are it could solve a problem for someone else too.

With this in mind, we should consider how easy our program is to use by other people.

Following the general theme of my tutorials, the examples will be genetics-based, but the ideas in this article can be applied more broadly to any program design.

Consider the following scenario.

I want to search a DNA sequence and determine whether two restriction enzyme sites are present above a specified threshold.

For example, are the restriction enzyme sequences for EcoRI and HindIII present in a DNA sequence more than 5, 10, or 15 times?

I want my Python program to fulfill the following criteria:

1. Be flexible: give the user the ability to search for any restriction site, or indeed any DNA motif, in a DNA sequence, and report how many times it appears when that count exceeds a set threshold.

2. Give the user control: search any DNA sequence the user chooses.

3. Be forgiving: anticipate typical user mistakes, for example typing a DNA sequence that contains ambiguous bases, or entering the bases in lower case.

4. Be extra forgiving: give the user as many attempts as they need to enter a valid DNA sequence.

Interactive user input

In order to get user input into the program, we must use the input() function.

The input() function takes a single string argument, which is the prompt to be displayed to the user, and returns the value typed in as a string.

Here, it is best to type a message, which makes it explicitly clear what you expect the user to do.

In the example below, we are asking the user to enter a DNA sequence.
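The original example is not reproduced here, so the following is only a minimal sketch of such a prompt. The function name `ask_for_sequence` and its `read` parameter are hypothetical; `read` simply defaults to the built-in input() so the function can also be tried without a keyboard.

```python
def ask_for_sequence(read=input):
    """Prompt for a DNA sequence and return exactly what the user typed."""
    # `read` defaults to the built-in input(); it is a parameter only so the
    # function can be exercised without a keyboard (a hypothetical convenience).
    return read("Please enter a DNA sequence using only the bases A, T, G and C: ")
```

Calling `dna = ask_for_sequence()` in a script displays the prompt and waits until the user presses Enter.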

However, mistakes are inevitable, and a wrong strike of a key or typo could occur.

In this situation, we have two choices. We could simply exit the program and make the user start again from the beginning.

To exit immediately, we would run the following code.
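One way to do this, sketched under the assumption that any character other than A, T, G or C counts as invalid, is to call sys.exit() as soon as a bad character is found. The helper name `exit_if_invalid` is hypothetical.

```python
import re
import sys


def exit_if_invalid(dna):
    """Stop the program if `dna` contains anything other than A, T, G or C."""
    if re.search(r"[^ATGC]", dna):
        # sys.exit() raises SystemExit; passing a string prints it to stderr
        # and sets a non-zero exit status.
        sys.exit("Invalid DNA sequence: only A, T, G and C are accepted.")
    return dna
```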

However, this seems unintuitive and feels wrong.

Alternatively, we could give the user another, or as many opportunities as are necessary to enter an acceptable DNA sequence.

This can be regarded as user input validation, in essence checking that the input supplied by the user makes sense.

In this example, we want the user to enter a sequence without any ambiguous bases.

We can use a negated character group and the re.search() function to ensure that the user enters a valid DNA sequence, and the string .upper() method to make sure the sequence is in upper case.
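A validation loop along these lines could look like the sketch below. The function name `get_valid_sequence`, the `read` parameter and the prompt wording are assumptions for illustration; rejecting an empty entry is an extra check the article does not mention.

```python
import re


def get_valid_sequence(read=input):
    """Keep prompting until the user supplies a sequence of A, T, G and C only."""
    dna = read("Please enter a DNA sequence: ").upper()
    # [^ATGC] is a negated character group: it matches any character that is
    # NOT one of the four bases, so a match means the input is invalid.
    while not dna or re.search(r"[^ATGC]", dna):
        dna = read("Invalid sequence. Please use only A, T, G or C: ").upper()
    return dna
```

Because the sequence is converted with .upper() before it is checked, lower-case entries such as `atgc` are accepted and returned as `ATGC`.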

If you would like to see how to use these features of the regular expression module in more depth, please see my article on regular expression in genetics.

When a program captures user input, we must think about its design carefully.

In particular, we want to avoid situations where a user runs a program which requires some input but does not make this clear to the user.

In this situation, the user may turn their attention to something else while the program sits waiting at an input prompt, making no progress.

Whilst user input can make programs more interactive and flexible, it makes it harder to run the programs as part of an unsupervised workflow.

Ideally, we want to engineer a situation where the program has all the arguments specified at the beginning.

To achieve this end, command line arguments are an ideal solution.

Command line arguments

Command line arguments are the strings that you type after the name of the program that you want to run.

In order to use command line arguments in our python scripts, it is necessary to import the sys module.

We can then access the command line arguments using the special list sys.argv.

The first element of sys.argv is the name of the program, here Command_line_arguments.py, and is at index 0 in the list.

The example below illustrates the indices and their respective command line arguments.
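As a small sketch of that mapping, the snippet below lists every element of sys.argv next to its index. The helper name `describe_arguments` is hypothetical.

```python
import sys


def describe_arguments(argv):
    """Return one 'index N: value' line per element of an argv-style list."""
    return [f"index {i}: {value}" for i, value in enumerate(argv)]


# Print the arguments this script was actually started with.
for line in describe_arguments(sys.argv):
    print(line)
```

Run as `python Command_line_arguments.py GA TC 2`, this lists the program name at index 0, followed by GA, TC and 2 at indices 1, 2 and 3.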

We can make the code even more robust by allowing the user to enter the sequences in either lower or upper case on the command line.

We can use the string .upper() method to handle this.
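Reading the arguments and normalising their case might be sketched as follows. The variable names pattern, pattern2 and count_no follow the article; the function name `parse_arguments` and the usage message are assumptions.

```python
import sys


def parse_arguments(argv):
    """Read pattern, pattern2 and count_no from an argv-style list."""
    if len(argv) != 4:
        sys.exit("Usage: python Command_line_arguments.py PATTERN PATTERN2 COUNT")
    pattern = argv[1].upper()    # accept 'ga' as well as 'GA'
    pattern2 = argv[2].upper()
    count_no = int(argv[3])      # arguments always arrive as strings
    return pattern, pattern2, count_no
```

In a script this would be called as `pattern, pattern2, count_no = parse_arguments(sys.argv)`. Note that int() raises ValueError if the third argument is not a whole number.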

In the command line below:

‘GA’ is the first command line argument at index 1, which points to pattern

‘TC’ is the second command line argument at index 2, which points to pattern2, and

2 is the third command line argument at index 3, which points to count_no

My Python script defines a function to search for two restriction enzyme sites within the specified DNA sequence.

I then build up a dictionary where I store the restriction enzyme sequences as keys and their counts as values.

I then iterate over my dictionary and check whether each sequence's count is higher than the threshold specified by the user.

The complete script is shown below in my Github Gist.
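The Gist itself is not reproduced here. The sketch below is not the author's script; it only illustrates the steps just described, with hypothetical function names. It assumes non-overlapping matches are sufficient, which is what str.count() reports.

```python
def count_motifs(dna, pattern, pattern2):
    """Build a dictionary with each motif as key and its count as value."""
    dna = dna.upper()
    # str.count() counts non-overlapping occurrences of the motif.
    return {pattern: dna.count(pattern), pattern2: dna.count(pattern2)}


def motifs_above_threshold(counts, count_no):
    """Keep only the motifs that appear more than count_no times."""
    return {motif: n for motif, n in counts.items() if n > count_no}
```

For the sequence GATCGATCGA with the motifs GA and TC and a threshold of 2, GA appears three times and is kept, while TC appears twice and is dropped.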

Now we can run the script using the command line arguments.

Here, to test the functionality of the program, we will use a small sequence specified by the user, and try to determine whether the tri-nucleotides TGC and ATC are present more than 3 times.

In this example, the user first entered an invalid DNA nucleotide, F.

My program then asked the user to enter a valid DNA sequence and, once this was submitted, returned the result.

Very often though, we might want to search whole genomes, so asking for user input is simply not feasible.

To further demonstrate how useful this program can be, I downloaded the complete E. coli genome from NCBI (GenBank accession number CU928161).

This genome is relatively large: 5,032,268 base pairs, to be exact.

However, the script I designed handled it effortlessly, and returned the result almost instantly.

Here, I checked whether the restriction enzyme sites for EcoRI (GAATTC) and HindIII (AAGCTT) were present more than 50 times in the genome.

If they were, the program told me how many times they appeared.
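The article does not show how the genome was loaded. One possible approach, assuming the download is a single-record FASTA file, is sketched below. The function name `read_fasta_sequence` is hypothetical.

```python
def read_fasta_sequence(path):
    """Return the sequence from a single-record FASTA file as one upper-case string."""
    with open(path) as handle:
        # Skip the '>' header line and join the remaining lines into one string.
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        ).upper()
```

The returned string can then be searched in the same way as a sequence typed in by the user.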

Further to this, we could make the output from the terminal even clearer, so anyone reading the results from the terminal could interpret the data.
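As one illustration of clearer output, each motif could be reported on its own line together with its count and the threshold. The function name `format_report` and the exact wording are assumptions, not the author's output format.

```python
def format_report(counts, count_no):
    """Return one readable line per motif, stating its count and the threshold."""
    lines = []
    for motif, n in counts.items():
        if n > count_no:
            verdict = "above"
        else:
            verdict = "not above"
        lines.append(f"{motif}: found {n} times ({verdict} the threshold of {count_no})")
    return lines
```

Printing each returned line with print() gives a report that can be read without knowing how the script works.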

Conclusion

Command line arguments represent a good way to get input for your Python programs.

All the data that your program needs is supplied at the start.

This tutorial has provided a brief introduction into how to design python scripts with the user in mind.

If we want the user to interact with the program, we should anticipate that mistakes may arise and design our scripts to handle them.
