Saturday, January 8, 2011
Two years after finishing my B.A. in Computer Science I started working towards my Master’s degree. I can honestly say it has been an adventure. At my college, in order to keep the standards as high as those of some universities, we had to take an exam that covered around 80% of the core computer courses of my B.A. and all of the core computer courses of my M.Sc. It was tough, but I passed that exam with a score of 79 (this is considered a high score; the highest score I ever heard of was 89, and the woman who received it took 3 months off work and became a teacher’s aide in one of the courses).
The last thing left before I finish my degree is a final project (we don’t have a thesis in my college). I chose a project in Python because ESRI started supporting Python programming as of version 9.3 and I thought it could be useful to know, and because I wanted to do something new.
The first project I tried was about extending the NLTK (Natural Language Toolkit) library. The project was about finding semantic similarities between words using context relations: the text is turned into triplets of (word1, word2, relation). For example, from the text “That dog is ugly” the extracted context relation is (dog, ugly, adj). I finished that work only to find that the NLTK library didn’t implement context relation extraction; they only had a demo with three texts. It took me a lot of time to realize I didn’t have the necessary skill to implement that myself. If anyone wants my code it can be found here, but please let me know if you use it and where.
The next project I was given was also about finding similarities between words, but this time with vectors. The first step was extracting text from Yahoo Answers into text files organized by category; I did that in C# (TODO: write a post about that). The next step was using distance functions to calculate the distance/similarity between two words.
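To give an idea of what a distance/similarity function over word vectors can look like, here is a minimal sketch using cosine similarity. Cosine similarity is only one possible choice and is my assumption here, and the vectors are made-up counts for illustration, not data from the project.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical word vectors (made-up counts, for illustration only)
dog = [3, 0, 1]
cat = [2, 0, 1]
car = [0, 4, 0]

dog_cat = cosine_similarity(dog, cat)  # close to 1: similar direction
dog_car = cosine_similarity(dog, car)  # 0.0: nothing in common
```

A value near 1 means the two vectors point in almost the same direction, and a value of 0 means they share no non-zero component.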
For me, Python was very similar to the C programming I did in my B.A., just without memory allocations. The first thing I had to get used to was that indentation (tabs) is used not as a visual aid for the programmer (as in most languages) but to mark blocks, for example:
Is different from:
The second script just won’t run (TODO: give the message).
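The difference between a correctly indented script and a badly indented one can be sketched like this. Both scripts are hypothetical stand-ins kept as strings, so the broken one can be checked with compile() without stopping the example; the helper name indentation_is_valid is made up for this sketch.

```python
# Two hypothetical scripts held as strings so both can be checked safely.
good_script = (
    "total = 0\n"
    "for n in [1, 2, 3]:\n"
    "    total = total + n\n"   # indented: part of the loop body
    "print(total)\n"            # not indented: runs once, after the loop
)

bad_script = (
    "total = 0\n"
    "for n in [1, 2, 3]:\n"
    "total = total + n\n"       # not indented: the loop has no body
)

def indentation_is_valid(source):
    """Return True if the source compiles, False on an IndentationError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False

good_ok = indentation_is_valid(good_script)  # True
bad_ok = indentation_is_valid(bad_script)    # False
```

Python rejects the second script before running a single line of it, because the for statement must be followed by an indented block.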
The second thing I had to get used to is that methods inside a class must take the current object as their first parameter, which by convention is called self. That parameter is like the keyword this in C# and denotes the current object being used.
Calling the method is done “regularly”:
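As a minimal sketch of the self parameter and of an ordinary method call (the Greeter class and all of its names are made up for illustration):

```python
class Greeter(object):
    """Hypothetical class, used only to illustrate the self parameter."""

    def __init__(self, name):
        # self is the object being created; store the name on it
        self.name = name

    def greet(self, greeting):
        # self refers to the object the method was called on
        return greeting + ", " + self.name

greeter = Greeter("Monty")
message = greeter.greet("Hello")  # self is passed automatically
```

Note that greet is declared with two parameters but called with one argument: Python fills in self with the object on the left of the dot.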
The syntax is fairly simple:
|Operation||Syntax||Example|
|Comment||#||#This is a comment|
|Multiple row comment 1||'''||''' This is a comment '''|
|Multiple row comment 2||"""||""" This is a comment """|
|Reading input||<variable> = raw_input("<Text to print>")||option = raw_input("Please enter…")|
|Adding Strings||<text1> + <text2>||str = "1" + "2" (result: str = "12")|
|Multiplying Strings||<text> * <int1>||str = "1"*3 (result: str = "111")|
|Joining Strings||<string>.join||' '.join(['Monty', 'Python'])|
|Splitting Strings||<string>.split|| |
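The string operations from the table can be tried out with a short sketch like the one below. The variable names are my own. Reading input is left out so the snippet runs without a keyboard; raw_input exists in Python 2 only, and Python 3 calls the same function input.

```python
# Adding (concatenating) two strings
added = "1" + "2"                        # "12"

# Multiplying a string by an integer repeats it
repeated = "1" * 3                       # "111"

# Joining a list of strings, using the string as the separator
joined = " ".join(["Monty", "Python"])   # "Monty Python"

# Splitting a string into a list (on whitespace by default)
parts = joined.split()                   # ["Monty", "Python"]

# Splitting on an explicit separator
fields = "a,b,c".split(",")              # ["a", "b", "c"]
```

join and split are opposites of each other: joining the result of a split with the same separator gives back the original string.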
Indexes are a special thing in Python: you can use negative as well as positive indexes. The following diagram demonstrates this:
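The same idea in code, as a small sketch (the example word is my own choice): positive indexes count from the start of the string beginning at 0, and negative indexes count from the end beginning at -1.

```python
word = "Python"

first = word[0]                  # "P": positive indexes start at 0
third = word[2]                  # "t"
last = word[-1]                  # "n": -1 is the last character
second_last = word[-2]           # "o"
also_first = word[-len(word)]    # "P": -6 is the same position as 0
```

Any index outside the range -len(word) to len(word) - 1 raises an IndexError.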
keywords: Python, NLTK, natural language toolkit, semantic similarities, context relation, string, index, variable