A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word. You can use it in any application that deals with natural language text to analyze words/tokens and classify them into categories.
For the prerequisites, follow these simple steps:
- Download and install the Java JDK and JRE on your system from here.
- Edit the system environment variables by right-clicking on My Computer -> Properties -> Advanced System Settings -> Environment Variables. Copy the path to the bin directory of your JDK installation to the beginning of your PATH environment variable. For default settings, this will look like this: “C:\Program Files\Java\jdk1.7.0_09\bin;” (without the quotes, of course).
- Download the Eclipse IDE from here, depending upon your system configuration. You can pick “Eclipse IDE for Java and DSL Developers” if you are not sure which one to choose.
- Download Stanford POS tagger from here.
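If you want to confirm the Java setup before going further, here is a minimal sketch you can run once Eclipse is installed. The class name CheckJava and both helper methods are made up for this check and are not part of the tutorial's project. It prints which Java runtime is running and whether any PATH entry mentions "jdk"; that second test is only a rough heuristic, assuming your JDK folder has "jdk" in its name as in the default path above.

```java
import java.io.File;

// Hypothetical helper class, not part of the tutorial's project.
class CheckJava {
    // Builds a one-line summary of the Java runtime that is currently running.
    public static String describeRuntime() {
        return "Java " + System.getProperty("java.version")
                + " at " + System.getProperty("java.home");
    }

    // Returns true if any PATH entry contains the given fragment, ignoring case.
    public static boolean pathContains(String path, String fragment) {
        if (path == null) {
            return false;
        }
        for (String entry : path.split(File.pathSeparator)) {
            if (entry.toLowerCase().contains(fragment.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(describeRuntime());
        // Rough heuristic: assumes the JDK folder name contains "jdk"
        System.out.println("PATH mentions a JDK: "
                + pathContains(System.getenv("PATH"), "jdk"));
    }
}
```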
You’re almost ready to go. Let’s set up our work:
- Open Eclipse and choose the location of your workspace. This is where all your projects will be stored.
- Make a new project and name it anything you want. I’ll go with the name “practise”.
- Add a new class to it. You can name it “tagText”.
- Go to the directory where you downloaded the Stanford POS tagger and open the folder “models”. Copy a .tagger file and its corresponding .props file. I will assume these are “left3words-wsj-0-18.tagger” and “left3words-wsj-0-18.props”. In your workspace directory, inside your project folder, make a new folder and name it “taggers”. Go to this folder and paste the tagger and props files.
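The code later in this post loads the model through the relative path “taggers/left3words-wsj-0-18.tagger”, so it helps to check that the files really landed where Java will look for them. Below is a small sketch of such a check. The class CheckTaggerFiles and its method are names I made up for illustration, and it assumes the default Eclipse behaviour of running a program with the project folder as its working directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the Stanford tagger.
class CheckTaggerFiles {
    // Returns true if both <baseName>.tagger and <baseName>.props exist in the folder.
    public static boolean modelPresent(String folder, String baseName) {
        Path model = Paths.get(folder, baseName + ".tagger");
        Path props = Paths.get(folder, baseName + ".props");
        return Files.isRegularFile(model) && Files.isRegularFile(props);
    }

    public static void main(String[] args) {
        // Relative paths are resolved against the working directory
        // (assumed here to be the Eclipse project folder).
        System.out.println("Working directory: " + Paths.get("").toAbsolutePath());
        System.out.println("Model files found: "
                + modelPresent("taggers", "left3words-wsj-0-18"));
    }
}
```

If this prints false, compare the working directory it reports with the place where you created the “taggers” folder.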
Alright people. Now let’s start coding!
Add/write this code to the tagText.java file you created.
import java.io.IOException;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class tagText {
    public static void main(String[] args) throws IOException,
            ClassNotFoundException {
        // Load the trained model from the "taggers" folder
        MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger");
        // The sentence we want to tag
        String sample = "This is a sample text";
        // Tag it: every word comes back with its tag attached
        String tagged = tagger.tagString(sample);
        // Print the original and the tagged sentence
        System.out.println("Input: " + sample);
        System.out.println("Output: " + tagged);
    }
}
We are not done yet. We need to import the Stanford tagger library into Eclipse. To do this:
Right click on your project “practise” -> Build Path -> Configure Build Path -> Click on Add External JARs -> Browse to the location of your download directory of the Stanford POS tagger and select the stanford-postagger.jar file -> Click OK.
Figure: Import library to Eclipse
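If Eclipse still underlines the MaxentTagger import in red, the JAR is probably not on the build path yet. As a quick check, here is a small sketch that asks Java whether it can find the class by name. The class CheckClasspath and its method are hypothetical names for this check; the only thing taken from the tagger is the class name already used in the import above.

```java
// Hypothetical helper class for checking the build path.
class CheckClasspath {
    // Returns true if the named class can be found on the current classpath.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, CheckClasspath.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "edu.stanford.nlp.tagger.maxent.MaxentTagger";
        System.out.println(name + " found: " + isOnClasspath(name));
    }
}
```

Run it from the same project: it should report true once stanford-postagger.jar has been added, and false otherwise.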
That’s it, guys. Run your code and you should see output like this (the run shown here tagged the sentence “i can man the controls of this machine”):
Loading default properties from trained tagger taggers/left3words-wsj-0-18.tagger
Reading POS tagger model from taggers/left3words-wsj-0-18.tagger … done [2.1 sec].
i/FW can/MD man/VB the/DT controls/NNS of/IN this/DT machine/NN
Figure: The output you will get
The “FW”, “MD”, “VB”, etc. next to each word are part-of-speech tags. For example, VB stands for a verb in its base form, MD for a modal, DT for a determiner and NN for a singular noun. The complete list of tags can be found here.
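Once you have the tagged string, you will usually want the words and tags separately. Here is a minimal sketch of how that could be done, assuming the word/TAG format shown in the output above (other versions of the tagger may use a different separator). The class TaggedOutputParser is my own illustration and not part of the Stanford library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the Stanford library.
class TaggedOutputParser {
    // Splits tagger output such as "man/VB the/DT" into {word, tag} pairs.
    public static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the last slash so a word that itself contains "/" stays whole
            int cut = token.lastIndexOf('/');
            if (cut <= 0) {
                continue; // skip empty tokens and tokens without a word or a slash
            }
            pairs.add(new String[] { token.substring(0, cut), token.substring(cut + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tagged = "i/FW can/MD man/VB the/DT controls/NNS of/IN this/DT machine/NN";
        for (String[] pair : parse(tagged)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

From here it is easy to, say, keep only the nouns by checking whether the tag starts with “NN”.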
To play around more with this, you can store lots of English sentences in a file, say “input.txt”, one sentence per line, then run the tagger on each line and store all the tagged sentences in another file, say “output.txt”.
To accomplish this, add a new class named “tagTextToFile” to your project with the following code:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class tagTextToFile {
    public static void main(String[] args) throws IOException,
            ClassNotFoundException {
        String tagged;
        // Load the trained model
        MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger");
        // Tag a single sentence and print it
        String sample = "i can man the controls of this machine";
        tagged = tagger.tagString(sample);
        System.out.println(tagged);
        // Tag input.txt line by line, appending each result to output.txt
        BufferedReader br = new BufferedReader(new FileReader("input.txt"));
        BufferedWriter out = new BufferedWriter(new FileWriter("output.txt", true));
        while ((sample = br.readLine()) != null) {
            tagged = tagger.tagString(sample);
            out.write(tagged);
            out.newLine();
        }
        out.close();
        br.close();
    }
}
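If you want to test the file handling on its own, without loading the model, here is a rough sketch of the same read-a-line, write-a-line loop. The method fakeTag is a made-up stand-in that just sticks “/X” on every word; in the real program you would call tagger.tagString(line) there instead. The class name LineByLine is also my own. Note that this version uses try-with-resources (Java 7 and later) so the files are closed even if something fails, and that it overwrites “output.txt” rather than appending to it.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical class for trying out the file loop without the tagger.
class LineByLine {
    // Stand-in for tagger.tagString(line): appends a made-up "/X" tag to every word.
    public static String fakeTag(String line) {
        StringBuilder sb = new StringBuilder();
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word).append("/X");
        }
        return sb.toString();
    }

    // Reads inputFile line by line and writes one processed line per input line.
    public static void process(String inputFile, String outputFile) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(inputFile), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(outputFile), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(fakeTag(line)); // swap in tagger.tagString(line) here
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (!Files.exists(Paths.get("input.txt"))) {
            System.out.println("input.txt not found in the working directory");
            return;
        }
        process("input.txt", "output.txt");
    }
}
```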
References:
- http://www.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/
- http://nlp.stanford.edu/software/tagger.shtml