Saturday, January 5, 2013

Stanford POS Tagger with Eclipse: A Simple Tutorial


A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word. The Stanford POS tagger is a widely used Java implementation. You can use it in any application that deals with natural language text to analyze words/tokens and classify them into categories.
For prerequisites, follow these simple steps:
  1. Download and install the Java JDK and JRE on your system from here.
  2. Edit the system environment variables by right-clicking My Computer -> Properties -> Advanced System Settings -> Environment Variables. Add the path to the bin directory of your JDK installation to the beginning of the PATH environment variable. With the default settings, it will look like this: “C:\Program Files\Java\jdk1.7.0_09\bin;” (without the quotes, of course).
  3. Download the Eclipse IDE from here, picking the build that matches your system configuration. You can pick “Eclipse IDE for Java and DSL Developers” if you are not sure which one to choose.
  4. Download the Stanford POS tagger from here.
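If you want to double-check the PATH edit from step 2, a small program along these lines can look for the Java compiler in each PATH entry. It is only a sketch: the class name PathCheck and the helper findOnPath are made up for this post, and it simply looks for a file called javac or javac.exe.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK or the tagger.
class PathCheck {

    // Returns the first file with one of the given names found in a PATH-style
    // string, or null if none of the entries contains such a file.
    static Path findOnPath(String pathValue, String... fileNames) {
        if (pathValue == null) {
            return null;
        }
        for (String entry : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (entry.isEmpty()) {
                continue;
            }
            for (String name : fileNames) {
                try {
                    Path candidate = Paths.get(entry, name);
                    if (Files.isRegularFile(candidate)) {
                        return candidate;
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this system.
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Path javac = findOnPath(System.getenv("PATH"), "javac", "javac.exe");
        if (javac != null) {
            System.out.println("Found the Java compiler at " + javac);
        } else {
            System.out.println("javac was not found on PATH; re-check step 2.");
        }
    }
}
```

You can run it from Eclipse later on, or compile it by hand once the JDK is installed.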
You’re almost ready to go. Let’s set up our work:
  1. Open Eclipse and choose the location of your workspace. This is where all your projects will be stored.
  2. Make a new project and name it anything you want. I’ll go with the name “practise”.
  3. Add a new class to it. You can name it “tagText”.
  4. Go to the directory where you downloaded the Stanford POS tagger, and open the folder “models”. Copy a .tagger file and its corresponding .props file. I will assume these are “left3words-wsj-0-18.tagger” and “left3words-wsj-0-18.props”. In your workspace directory, inside your project folder, make a new folder and name it “taggers”. Go to this folder and paste the tagger and props files.
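A quick way to confirm that the model files landed in the right place is a check like the one below. Treat it as a sketch: ModelCheck and modelFilesPresent are names invented here, and it assumes the program is started with the project folder as its working directory (which is what Eclipse normally does), so that “taggers” resolves to the folder you just created.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Stanford tagger.
class ModelCheck {

    // True only if both <baseName>.tagger and <baseName>.props exist in the folder.
    static boolean modelFilesPresent(Path folder, String baseName) {
        return Files.isRegularFile(folder.resolve(baseName + ".tagger"))
                && Files.isRegularFile(folder.resolve(baseName + ".props"));
    }

    public static void main(String[] args) {
        Path folder = Paths.get("taggers");
        String baseName = "left3words-wsj-0-18";

        // Printing the working directory shows where relative paths are resolved.
        System.out.println("Working directory: " + Paths.get("").toAbsolutePath());

        if (modelFilesPresent(folder, baseName)) {
            System.out.println("Model files found in " + folder.toAbsolutePath());
        } else {
            System.out.println("Model files are missing from " + folder.toAbsolutePath());
        }
    }
}
```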
Alright people. Now let’s start coding!
Add this code to the tagText.java file you created.
    import java.io.IOException;
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class tagText {
        public static void main(String[] args) throws IOException,
                ClassNotFoundException {

            // Initialize the tagger
            MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger");

            // The sample string
            String sample = "This is a sample text";

            // The tagged string
            String tagged = tagger.tagString(sample);

            // Output the tagged sample string onto your console
            System.out.println("Input: " + sample);
            System.out.println("Output: " + tagged);
        }
    }
We are not done yet. We need to add the Stanford tagger library to the project’s build path in Eclipse. To do this:
Right-click your project “practise” -> Build Path -> Configure Build Path -> open the Libraries tab -> click Add External JARs -> browse to the directory where you downloaded the Stanford POS tagger and select the stanford-postagger.jar file -> click OK.
[Screenshot: importing the library into Eclipse]
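If Eclipse still underlines the MaxentTagger import in red, the JAR probably did not make it onto the build path. A tiny check like this one can confirm it at run time. It is a sketch only: ClasspathCheck and isOnClasspath are names made up for this post; the class name it looks for is the one imported in the code above.

```java
// Hypothetical helper, not part of the Stanford tagger.
class ClasspathCheck {

    // True if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String taggerClass = "edu.stanford.nlp.tagger.maxent.MaxentTagger";
        if (isOnClasspath(taggerClass)) {
            System.out.println("The tagger library is on the classpath.");
        } else {
            System.out.println("The tagger library is missing; re-check the build path.");
        }
    }
}
```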
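That’s it, guys. Run your code and you should see output like the following. (This capture was made with the sentence “i can man the controls of this machine” as the sample string, so the tagged words will differ if you keep the sample string from the code above.)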
Loading default properties from trained tagger taggers/left3words-wsj-0-18.tagger
Reading POS tagger model from taggers/left3words-wsj-0-18.tagger … done [2.1 sec].
i/FW can/MD man/VB the/DT controls/NNS of/IN this/DT machine/NN
[Screenshot: the output you will get]
The “FW”, “MD”, “VB”, etc. next to each word are part-of-speech tags from the Penn Treebank tag set. For example, VB stands for a verb in its base form and MD for a modal. The complete list of tags can be found here.
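Since the tagger hands back one plain string, you will usually want to split it into word and tag pairs before doing anything else with it. Here is a minimal sketch of that, assuming the word/TAG format shown in the output above. TaggedTokens, parse and describe are names invented for this post, and the description table only covers the handful of Penn Treebank tags that appear in the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Stanford tagger.
class TaggedTokens {

    // Descriptions for the tags that appear in the example output only.
    private static final Map<String, String> DESCRIPTIONS = new HashMap<String, String>();
    static {
        DESCRIPTIONS.put("DT", "determiner");
        DESCRIPTIONS.put("FW", "foreign word");
        DESCRIPTIONS.put("IN", "preposition or subordinating conjunction");
        DESCRIPTIONS.put("MD", "modal");
        DESCRIPTIONS.put("NN", "noun, singular or mass");
        DESCRIPTIONS.put("NNS", "noun, plural");
        DESCRIPTIONS.put("VB", "verb, base form");
    }

    // Splits "word/TAG word/TAG ..." into {word, tag} pairs.
    // The last '/' is used so that words containing a slash stay intact.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                continue; // skip empty or malformed tokens
            }
            pairs.add(new String[] {token.substring(0, slash), token.substring(slash + 1)});
        }
        return pairs;
    }

    // Returns a description for a tag, or "unknown tag" if it is not in the table.
    static String describe(String tag) {
        String description = DESCRIPTIONS.get(tag);
        return description != null ? description : "unknown tag";
    }

    public static void main(String[] args) {
        String tagged = "i/FW can/MD man/VB the/DT controls/NNS of/IN this/DT machine/NN";
        for (String[] pair : parse(tagged)) {
            System.out.println(pair[0] + " -> " + pair[1] + " (" + describe(pair[1]) + ")");
        }
    }
}
```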
To play around more with this, you can store lots of English sentences in a file, say “input.txt”, one per line, run the tagger on each of them, and store the tagged sentences in another file, say “output.txt”.
To accomplish this, add a new class named “tagTextToFile” to your project with the following code:
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class tagTextToFile {

        public static void main(String[] args) throws IOException,
                ClassNotFoundException {

            // Initialize the tagger
            MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger");

            // The sample string
            String sample = "i can man the controls of this machine";

            // The tagged string
            String tagged = tagger.tagString(sample);

            // Output the tagged sample string onto your console
            System.out.println(tagged);

            /* Next we will pick up some sentences from a file input.txt and store the
               tagged sentences in another file output.txt. So make a file input.txt
               and write down some sentences, one per line. */

            // try-with-resources (Java 7+) closes both files even if tagging fails.
            // The writer is opened once, in append mode, instead of once per line.
            try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
                 BufferedWriter out = new BufferedWriter(new FileWriter("output.txt", true))) {

                // Read input.txt line by line, tag each line, and write it to output.txt
                while ((sample = br.readLine()) != null) {
                    tagged = tagger.tagString(sample);
                    out.write(tagged);
                    out.newLine();
                }
            }
        }
    }
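Once output.txt exists, one simple thing to try is counting how often each tag shows up. The sketch below does that with plain Java. TagCounts and countTags are names made up for this post, it assumes the word/TAG format shown earlier, and it reads the file with the platform default charset because that is what FileWriter used when writing it.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Stanford tagger.
class TagCounts {

    // Counts how many times each tag occurs in lines of "word/TAG" tokens.
    static Map<String, Integer> countTags(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                int slash = token.lastIndexOf('/');
                if (slash <= 0 || slash == token.length() - 1) {
                    continue; // skip empty or malformed tokens
                }
                String tag = token.substring(slash + 1);
                Integer current = counts.get(tag);
                counts.put(tag, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Path output = Paths.get("output.txt");
        if (!Files.isRegularFile(output)) {
            System.out.println("output.txt not found; run tagTextToFile first.");
            return;
        }
        List<String> lines = Files.readAllLines(output, Charset.defaultCharset());
        for (Map.Entry<String, Integer> entry : countTags(lines).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

A TreeMap is used so the tags print in alphabetical order; swap in a HashMap if the order does not matter to you.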
References:
  1. http://www.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/
  2. http://nlp.stanford.edu/software/tagger.shtml
