Location Extraction

Posted in Uncategorized by Juan Wajnerman on the October 7th, 2008

In this post I’ll describe what I have been doing for Instedd during the last couple of weeks. In one of our projects we need to classify a series of articles according to the geographical location they talk about. This process is known as geotagging, and it is really important in the biosurveillance field.

Geotagging items is not a new thing, and many web sites already support adding geographic information to the objects they handle. For example, Flickr allows you to set the coordinates where a picture was taken. Wikipedia also has structured information that contains the latitude and longitude for articles about a place in the world. In addition, specs like GeoRSS can be used to augment the information given by a feed. However, even though all these new geo-related features are being widely adopted, there is still a lot of information out there that would need a human to read the text to understand which places it mentions.

So, we decided to automate this process as much as possible, extracting the information from the text itself. This is known as “location extraction”, and is actually a branch of a more general task named “entity extraction”.
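To make the idea concrete, here is a minimal sketch of one very simple way to extract locations: looking up known place names (a gazetteer) in the text. This is only an illustration and is not necessarily the approach used in the project; the `GAZETTEER` table, its approximate coordinates and the `extract_locations` function are hypothetical names made up for this example.

```python
import re

# Hypothetical, tiny gazetteer: lowercase place name -> approximate (latitude, longitude).
GAZETTEER = {
    "buenos aires": (-34.61, -58.38),
    "phnom penh": (11.56, 104.92),
    "geneva": (46.20, 6.14),
}


def extract_locations(text):
    """Return (name, (lat, lon)) pairs for gazetteer entries found in text,
    ordered by where they appear."""
    lowered = text.lower()
    hits = []
    for name, coords in GAZETTEER.items():
        # Match whole words only, so a name is not found inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
            hits.append((match.start(), name, coords))
    hits.sort()
    return [(name, coords) for _, name, coords in hits]


article = "Cases were reported in Phnom Penh and later in Geneva."
for name, (lat, lon) in extract_locations(article):
    print(name, lat, lon)
```

A plain lookup like this cannot tell apart places that share a name, or place names that are also ordinary words, which is one reason the problem is usually treated as a form of entity extraction.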


Firefox text selection tips

Posted in Uncategorized by Juan Wajnerman on the March 29th, 2008

Firefox has many settings that are not available through the standard “Preferences”
dialog. However, these can be configured by navigating to the “about:config” address.

In that list we can find two useful options related to text selection, which have different
default values depending on the platform (Windows, Linux, Mac).

The first option, “layout.word_select.stop_at_punctuation”, controls whether the selection
made when you double-click or use ctrl+arrows to select text stops at dots, slashes and
other symbols. Enabling this option is especially useful when trying to change some part of
the URL in the address bar. Otherwise the entire URL is selected, which can still be
done by triple-clicking on the address.

This option is enabled by default in the Windows version of Firefox, but not in the Linux
version. There is a long discussion about how it should behave in future versions.
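The effect of this option can be illustrated with a toy model of double-click selection. This is a simplified sketch and not Firefox's actual algorithm: it assumes that whitespace always ends a word and that, when the option is on, any non-alphanumeric character does too. The `select_word` function is a hypothetical name used only here.

```python
def select_word(text: str, index: int, stop_at_punctuation: bool) -> str:
    """Return what a double-click at `index` would select in this toy model."""

    def is_boundary(ch: str) -> bool:
        if ch.isspace():
            return True
        # With the option on, dots, slashes and other symbols also end the word.
        return stop_at_punctuation and not ch.isalnum()

    if is_boundary(text[index]):
        return text[index]  # clicked on a separator: select just that character

    # Grow the selection left and right until a boundary is reached.
    start = index
    while start > 0 and not is_boundary(text[start - 1]):
        start -= 1
    end = index + 1
    while end < len(text) and not is_boundary(text[end]):
        end += 1
    return text[start:end]


url = "http://www.example.com/some/page"
print(select_word(url, 11, stop_at_punctuation=True))   # only "example"
print(select_word(url, 11, stop_at_punctuation=False))  # the entire URL
```

In this model, double-clicking on “example” picks just that part of the URL when the option is on, and the whole address when it is off.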

The second option, “layout.word_select.eat_space_to_next_word”, controls whether the
surrounding spaces are selected when double-clicking on a word in the page text. I think
this option should always be set to false, because selecting the extra space is really
annoying, especially when copying and pasting usernames and passwords. This time the value
I consider more useful is the default in Linux, and the opposite happens in Windows. I have
no exact idea of what happens in the Mac world ;-) .
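Besides “about:config”, Firefox also reads a “user.js” file from the profile folder at startup, where each line has the form `user_pref("name", value);`. The post itself does not use this file. The following is a small sketch that generates those lines for the two options with the values preferred above; the `format_user_prefs` helper is a hypothetical name for this example, and where to save the output depends on your own profile location.

```python
import json


def format_user_prefs(prefs):
    """Format a dict of preference names and values as user.js lines."""
    lines = []
    for name, value in prefs.items():
        # json.dumps quotes strings and writes booleans as true/false,
        # which matches the JavaScript syntax used in user.js.
        lines.append(f"user_pref({json.dumps(name)}, {json.dumps(value)});")
    return "\n".join(lines) + "\n"


prefs = {
    "layout.word_select.stop_at_punctuation": True,
    "layout.word_select.eat_space_to_next_word": False,
}
print(format_user_prefs(prefs), end="")
```

The printed lines can be appended to “user.js” in the profile folder, which is handy for keeping the same selection behavior across platforms.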