Tuesday, May 21, 2013

Central Dogma of Bioinformatics and Model Organisms


The overall process of transcribing DNA into mRNA and translating mRNA into a functional protein is known as the central dogma of molecular biology. It forms the backbone of bioinformatics and is represented by the following stages.

DNA Replication
DNA contains the genetic blueprint, which is maintained and passed on through a process called replication. Replication is carried out by DNA polymerase.

Transcription
During transcription, DNA serves as the template for the production of messenger RNA (mRNA). Transcription is carried out by RNA polymerase. In eukaryotes, the mRNA then undergoes splicing and moves from the nucleus to the cytoplasm.
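As a rough illustration of transcription, here is a minimal Java sketch. It assumes the input is the coding (sense) strand written 5' to 3', so the mRNA has the same sequence with T replaced by U. The class and method names are made up for this example.

```java
public class TranscriptionSketch {

    // Transcribes the coding (sense) strand of DNA into mRNA.
    // Assumption: the input is the coding strand, so only T -> U changes.
    public static String transcribe(String codingStrand) {
        String dna = codingStrand.toUpperCase();
        if (!dna.matches("[ACGT]*")) {
            throw new IllegalArgumentException("Not a DNA sequence: " + codingStrand);
        }
        return dna.replace('T', 'U');
    }

    public static void main(String[] args) {
        System.out.println(transcribe("ATGGCCTAA")); // prints AUGGCCUAA
    }
}
```

A real cell reads the template strand and builds the complementary RNA; working from the coding strand is simply the usual shortcut when handling sequences as text.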

Translation
Ultimately, the mRNA finds its way to a ribosome, where it is translated. In prokaryotic cells, which have ribosomes but no nucleus, transcription and translation may be coupled. In eukaryotic cells, the site of transcription (the nucleus) is usually separated from the site of translation (the cytoplasm), so the mRNA must be transported out of the nucleus into the cytoplasm, where ribosomes can bind it. The ribosome reads the mRNA as triplet codons. Transfer RNAs (tRNAs) enter the ribosome-mRNA complex and match the codon in the mRNA to the anticodon in the tRNA, so that the correct amino acid is added in the order encoded by the gene. The amino acids are then linked into the growing peptide chain, and the completed chain forms the protein.
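The codon-by-codon reading described above can be sketched in Java as follows. The codon table here is deliberately partial (only a handful of entries from the standard genetic code), and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TranslationSketch {

    // Partial codon table: just enough entries for the example below.
    static final Map<String, String> CODONS = new HashMap<String, String>();
    static {
        CODONS.put("AUG", "Met"); // also the usual start codon
        CODONS.put("GCC", "Ala");
        CODONS.put("UUU", "Phe");
        CODONS.put("GGC", "Gly");
        CODONS.put("UAA", "Stop");
        CODONS.put("UAG", "Stop");
        CODONS.put("UGA", "Stop");
    }

    // Reads the mRNA three bases at a time and stops at the first stop codon.
    public static List<String> translate(String mrna) {
        List<String> peptide = new ArrayList<String>();
        for (int i = 0; i + 3 <= mrna.length(); i += 3) {
            String codon = mrna.substring(i, i + 3);
            String aminoAcid = CODONS.get(codon);
            if (aminoAcid == null) {
                throw new IllegalArgumentException("Codon not in this partial table: " + codon);
            }
            if (aminoAcid.equals("Stop")) {
                break;
            }
            peptide.add(aminoAcid);
        }
        return peptide;
    }

    public static void main(String[] args) {
        System.out.println(translate("AUGGCCUUUGGCUAA")); // prints [Met, Ala, Phe, Gly]
    }
}
```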
So it is clear that DNA carries the information for proteins, which perform many functions in different parts of an organism. Proteins in living organisms serve, for example, as biological catalysts and as structural components. Proteins that perform a catalytic function are known as enzymes. Each enzyme has a specific region known as the active site, which binds the substrate. The active site has a geometric shape that is complementary to the shape of the substrate molecule, similar to the fit of puzzle pieces. This means that an enzyme reacts specifically with only one compound or a very few similar compounds.

Hence an error in the blueprint (DNA) can change the geometric shape of the active site of the protein. This can inactivate the protein, stopping the reaction it catalyzes or the function it performs, and may eventually cause disease.
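To make the idea of a DNA error concrete, the sketch below compares two short coding-strand fragments codon by codon. The fragments are an illustration that is not part of the original post: they follow the well-known sickle-cell substitution in the beta-globin gene, where GAG (glutamate) becomes GTG (valine). Hemoglobin is not an enzyme, so this only shows how one base change alters one amino acid. All names in the code are hypothetical.

```java
public class PointMutationSketch {

    // Returns the 0-based index of the first codon that differs, or -1 if none does.
    public static int firstChangedCodon(String normal, String mutant) {
        int n = Math.min(normal.length(), mutant.length());
        for (int i = 0; i + 3 <= n; i += 3) {
            if (!normal.substring(i, i + 3).equals(mutant.substring(i, i + 3))) {
                return i / 3;
            }
        }
        return -1;
    }

    // Tiny lookup covering only the DNA codons used in this example.
    public static String aminoAcid(String codon) {
        if (codon.equals("CCT")) return "Pro";
        if (codon.equals("GAG")) return "Glu";
        if (codon.equals("GTG")) return "Val";
        throw new IllegalArgumentException("Codon not covered by this sketch: " + codon);
    }

    public static void main(String[] args) {
        String normal = "CCTGAGGAG"; // Pro-Glu-Glu
        String mutant = "CCTGTGGAG"; // Pro-Val-Glu
        int idx = firstChangedCodon(normal, mutant);
        String before = normal.substring(idx * 3, idx * 3 + 3);
        String after = mutant.substring(idx * 3, idx * 3 + 3);
        // prints: Codon 1: GAG (Glu) -> GTG (Val)
        System.out.println("Codon " + idx + ": " + before + " (" + aminoAcid(before)
                + ") -> " + after + " (" + aminoAcid(after) + ")");
    }
}
```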

So the central dogma acts as the backbone for diagnosing diseases and for designing drugs. To make this work easier and more effective, model organisms are used extensively.


Model organisms

A model organism is an animal, plant or microbe that can be used to study certain biological processes. It serves as a simplified system that is accessible and easily manipulated.

Model organisms are used to obtain information about other species, including humans, that are more difficult to study directly, for example in situations where human experimentation would be unfeasible or unethical. Regardless of their obvious differences in size and lifestyle, these model organisms produce proteins that perform many of the same core functions as in humans.

When scientists discover that a particular gene is associated with a disease in humans, one of the first things they typically do is find out what that gene does in a model organism such as the mouse. The mouse genome is organized similarly to the human genome, and large blocks of genes are even arranged in the same order. This similarity is used extensively to establish disease models by imitating the gene defects seen in humans, and these models can be used to test the efficacy of new drugs.

Because model organisms and humans share similar core functional proteins, the geometric shapes of the active sites are also similar. Thus potential drugs whose shape is complementary to that of the disease-causing protein can be tested on model organisms, and the results are expected to be similar to those in humans. For final confirmation, the drug can be tested on species more closely related to humans, such as chimpanzees, and eventually on humans as well.

Many of the drugs in current use were discovered by experiments conducted in animals and humans. However, many drugs are now being designed with the specific disorder in view. Model organisms play a significant role as a resource in the process of drug designing.

Saturday, January 5, 2013


A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word. It is a powerful and accurate tool. You can use it in any application that deals with natural language text to analyze words/tokens and classify them into categories.
For prerequisites, follow these simple steps:
  1. Download and install the Java JDK and JRE on your system.
  2. Edit the system environment variables by right-clicking My Computer -> Properties -> Advanced System Settings -> Environment Variables. Add the path to the bin directory of your JDK installation to the beginning of your PATH environment variable. For default settings, it will look like this: “C:\Program Files\Java\jdk1.7.0_09\bin;” (without the quotes, of course).
  3. Download the Eclipse IDE that matches your system configuration. You can pick “Eclipse IDE for Java and DSL Developers” if you are not sure which one to choose.
  4. Download the Stanford POS tagger (see the Stanford link under References).
You’re almost ready to go. Let’s set up our work:
  1. Open Eclipse and choose the location of your workspace. This is where all your projects will be stored.
  2. Make a new project and name it anything you want. I’ll go with the name “practise”.
  3. Add a new class to it. You can name it “tagText”.
  4. Go to the directory where you downloaded the Stanford POS tagger and open the folder “models”. Copy a .tagger file and its corresponding .props file. I will assume these are “left3words-wsj-0-18.tagger” and “left3words-wsj-0-18.props”. In your workspace directory, inside your project folder, make a new folder and name it “taggers”. Paste the tagger and props files into this folder.
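If you want to double-check the setup before loading the tagger, a small sanity check like the one below can help. It is only a sketch and assumes the program runs from the project folder (the default working directory when you launch from Eclipse) and that you copied the same file names as above. The class and method names are my own.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CheckTaggerFiles {

    // Returns true only if both <baseName>.tagger and <baseName>.props exist in the folder.
    public static boolean modelFilesPresent(String folder, String baseName) {
        Path tagger = Paths.get(folder, baseName + ".tagger");
        Path props = Paths.get(folder, baseName + ".props");
        return Files.isRegularFile(tagger) && Files.isRegularFile(props);
    }

    public static void main(String[] args) {
        // Assumed names: the "taggers" folder and the model files from the steps above.
        boolean ok = modelFilesPresent("taggers", "left3words-wsj-0-18");
        System.out.println(ok ? "Tagger files found." : "Tagger files missing, check the taggers folder.");
    }
}
```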
Alright people. Now let’s start coding!
Add this code to the tagText.java file you created.
    import java.io.IOException;
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class tagText {
        public static void main(String[] args) throws IOException,
                ClassNotFoundException {

            // Initialize the tagger
            MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger");

            // The sample string
            String sample = "This is a sample text";

            // The tagged string
            String tagged = tagger.tagString(sample);

            // Output the tagged sample string onto your console
            System.out.println("Input: " + sample);
            System.out.println("Output: " + tagged);
        }
    }
We are not done yet. We need to add the Stanford tagger library to the project in Eclipse. To do this:
Right click on your project “practise” -> Build Path -> Configure Build Path -> Click on Add External JARs -> Browse to the location of your download directory of the Stanford POS tagger and select the stanford-postagger.jar file -> Click OK.
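That’s it, guys. Run your code and you should get output like this (the tagged line shown is for the sentence “i can man the controls of this machine”, which the next example uses):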
Loading default properties from trained tagger taggers/left3words-wsj-0-18.tagger
Reading POS tagger model from taggers/left3words-wsj-0-18.tagger … done [2.1 sec].
i/FW can/MD man/VB the/DT controls/NNS of/IN this/DT machine/NN
The “FW”, “MD”, “VB”, etc. next to each word are part-of-speech tags. For example, VB stands for verb (base form). The complete list of tags is the Penn Treebank tag set.
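If you want to work with the tags in your own code, you can split the tagged string back into word/tag pairs. The sketch below assumes the “word/TAG” format shown in the output above (other tagger versions or settings may use a different separator). The class and method names are made up for this example, and it does not need the Stanford library.

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedOutputParser {

    // Splits a tagged string into {word, tag} pairs.
    // The last '/' is used as the separator so words containing '/' still work.
    public static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip empty tokens and tokens without a word part
            }
            pairs.add(new String[] { token.substring(0, slash), token.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tagged = "i/FW can/MD man/VB the/DT controls/NNS of/IN this/DT machine/NN";
        for (String[] pair : parse(tagged)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```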
To play around more with this, you can store lots of English sentences in a file, say “input.txt”, run the tagger on it, and store all the tagged sentences in another file, say “output.txt”.
To accomplish this, add a new class named “tagTextToFile” to your project with the following code:
    import java.io.*;
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class tagTextToFile {

        public static void main(String[] args) throws IOException,
                ClassNotFoundException {

            // Initialize the tagger
            MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger");

            // The sample string
            String sample = "i can man the controls of this machine";

            // The tagged string
            String tagged = tagger.tagString(sample);

            // Output the tagged sample string onto your console
            System.out.println(tagged);

            /* Next we will pick up some sentences from a file input.txt and store the
               tagged sentences in another file output.txt. So make a file input.txt in
               your project folder and write down some sentences, one per line. */

            // Open both files once. try-with-resources (Java 7+) closes them even if
            // an error occurs. The "true" argument appends to output.txt.
            try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
                 BufferedWriter out = new BufferedWriter(new FileWriter("output.txt", true))) {

                // Read input.txt line by line into the string sample
                while ((sample = br.readLine()) != null) {
                    // Tag the string and write it to the file output.txt
                    tagged = tagger.tagString(sample);
                    out.write(tagged);
                    out.newLine();
                }
            }
        }
    }
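Once you have a file full of tagged sentences, one simple thing to try is counting how often each tag appears. The sketch below works on a list of tagged lines held in memory, so it runs without the Stanford library. It assumes the same “word/TAG” format as above, and the class and method names are my own. Reading the lines from output.txt is left out to keep it short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TagCounter {

    // Counts how many times each tag occurs across all tagged lines.
    // A TreeMap keeps the tags in alphabetical order for readable output.
    public static Map<String, Integer> countTags(List<String> taggedLines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : taggedLines) {
            for (String token : line.trim().split("\\s+")) {
                int slash = token.lastIndexOf('/');
                if (slash <= 0) {
                    continue; // skip empty tokens and tokens without a word part
                }
                String tag = token.substring(slash + 1);
                Integer old = counts.get(tag);
                counts.put(tag, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "i/FW can/MD man/VB the/DT controls/NNS of/IN this/DT machine/NN");
        // prints {DT=2, FW=1, IN=1, MD=1, NN=1, NNS=1, VB=1}
        System.out.println(countTags(lines));
    }
}
```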
References:
  1. http://www.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/
  2. http://nlp.stanford.edu/software/tagger.shtml