Short story on scaling an NLP problem without using a ton of hardware.
Posted by Javier Mansilla 2 years, 8 months ago
The basic idea of relation extraction is to detect the things mentioned in a text (so-called Mentions, or Entity-Occurrences), and then decide, for each pair of those things, whether the text expresses the target relation between them. In our case, we needed to find where companies were mentioned, and then determine whether a given sentence said that Company-A was funding Company-B.
In order to detect those funding events we needed to be sure we captured every mention of a company. And although the NER we used caught most of them, there are always some folks who name their company #waywire or 8th Story, names that a NER does not easily pick up.
A good solution is to build a Gazetteer containing all the company names we can get. The idea of working with Gazettes is that each time one of the Gazette entries is seen in a text, it’s automatically considered a mention of a given object, i.e., an Entity-Occurrence.
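To make the Gazette idea concrete, here is a minimal sketch of a lookup over tokenized text. It is not the code we ran; the function names `build_gazetteer` and `find_mentions`, the whitespace tokenization, and the case-insensitive longest-match policy are all assumptions made for illustration.

```python
def build_gazetteer(names):
    """Map each lowercased, whitespace-tokenized name to its original spelling."""
    return {tuple(name.lower().split()): name for name in names}


def find_mentions(tokens, gazetteer):
    """Return (start, end, name) for every longest gazetteer match in tokens."""
    max_len = max((len(key) for key in gazetteer), default=0)
    lowered = [token.lower() for token in tokens]
    mentions = []
    i = 0
    while i < len(lowered):
        match = None
        # Try the longest span first so "8th Story" wins over a shorter entry.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in gazetteer:
                match = (i, i + n, gazetteer[key])
                break
        if match:
            mentions.append(match)
            i = match[1]  # continue right after the matched span
        else:
            i += 1
    return mentions


gazetteer = build_gazetteer(["#waywire", "8th Story", "Yiftee"])
tokens = "Yiftee and 8th Story both raised money last year".split()
mentions = find_mentions(tokens, gazetteer)
# mentions == [(0, 1, "Yiftee"), (2, 4, "8th Story")]
```

Every match becomes an Entity-Occurrence with no further checks, which is exactly why the quality of the Gazette entries matters so much.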
From an encyclopedic source we got more than 300K entries. Great!
The next challenge was that… well, in the text to process, a company could be mentioned in a different way than the official name stated in the encyclopedic source. For instance, it would be more natural to find mentions of “Yiftee” than of “Yiftee Inc.”
So, after incorporating a basic scheme for alternative names (i.e., substrings of the original long name), the number of entries grew to 600K.
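As an illustration of that kind of expansion, the sketch below derives shorter variants from each official name. The exact scheme is an assumption: here a variant is any leading run of tokens, and `alternative_names` and `expand_gazetteer` are hypothetical helper names.

```python
def alternative_names(name):
    """Return shorter variants of a name (assumed here: leading-token prefixes)."""
    tokens = name.split()
    return [" ".join(tokens[:n]) for n in range(1, len(tokens))]


def expand_gazetteer(names):
    """Map each official name and each variant to the official names it came from."""
    expanded = {}
    for name in names:
        for variant in [name] + alternative_names(name):
            expanded.setdefault(variant, set()).add(name)
    return expanded


expanded = expand_gazetteer(["Yiftee Inc.", "Hope Street Media"])
# Keys: "Yiftee Inc.", "Yiftee", "Hope Street Media", "Hope", "Hope Street"
```

Note that the same expansion that yields the useful “Yiftee” also yields the risky “Hope”, so the list grows in both good and bad entries.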
After that, when we felt confident about our gazette and wanted to start processing text, we faced several issues:
we weren’t able to handle a gazette of that size at a reasonable speed
we were getting tons of poor-quality Entity-Occurrences (i.e., most of the time a human reader would say that the text was wrongly labeled as an Entity-Occurrence)
tons of poor-quality Entity-Occurrences implied quadratic tons of potential funding evidence to check (roughly one candidate per pair of Entity-Occurrences in the same sentence)
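The quadratic growth in the last point is easy to see with a small sketch. It assumes unordered pairs within one sentence; since funding has a direction, counting ordered pairs would double these numbers. The names `candidate_pairs` and `pair_count` are illustrative.

```python
from itertools import combinations


def candidate_pairs(occurrences):
    """All unordered pairs of Entity-Occurrences found in one sentence."""
    return list(combinations(occurrences, 2))


def pair_count(n):
    """Number of unordered pairs among n occurrences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


pairs = candidate_pairs(["A", "B", "C", "D"])
# 4 occurrences give 6 pairs; doubling to 8 occurrences gives 28.
```

So a few spurious occurrences per sentence multiply the number of candidates that the relation classifier has to look at.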
So, knowing that we were trading away some recall, we decided to add several levels of filters. Let the pruning start!
The first step was to add a second encyclopedic source, not to augment the list but to add confidence, keeping only the intersection of the two sources.
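A minimal sketch of that intersection follows. The normalization (lowercasing and collapsing whitespace) is an assumption, since two sources rarely spell names identically, and the entry "Some Other Co" is made up for the example.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial spelling differences match."""
    return " ".join(name.lower().split())


def intersect_sources(source_a, source_b):
    """Keep the names from source_a that also appear in source_b."""
    known_b = {normalize(name) for name in source_b}
    return [name for name in source_a if normalize(name) in known_b]


source_a = ["Yiftee Inc.", "Hope Street Media", "Some Other Co"]
source_b = ["yiftee inc.", "Hope  Street Media"]
confirmed = intersect_sources(source_a, source_b)
# confirmed == ["Yiftee Inc.", "Hope Street Media"]
```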
Next, with a precomputed count of word frequencies, we filtered out all the company names that were too likely to occur as normal text (we picked a threshold and tuned it a bit before leaving it fixed).
With that very same idea of word frequencies, we pruned the company sub-names (the substrings of the original long company names) with a higher threshold; so, for a company listed as “Hope Street Media” we didn't end up with a dangerous entry for “Hope”, while for “Yiftee Inc.” we did keep “Yiftee” on the final list.
With all that done, we reduced the list to about 100K entries, which still captured a really good portion of the names to work with while greatly reducing the issues mentioned above.
The last step was to pick a sample of documents, preprocess them, and simply hand-check the most frequently found Gazette items, creating a blacklist for the cases where it was obvious that most occurrences were not mentions of the company but just natural usage of those words in the language.
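The two frequency filters can be sketched as follows. Everything numeric here is made up: the word counts, the pseudo-count for unseen words, and both thresholds. Scoring a name with a naive unigram log-probability is also an assumption, as is reading "higher threshold" as "stricter", meaning a sub-name has to be rarer than a full name to survive. "Apple" is only an illustrative full name.

```python
import math

# Hypothetical word counts from a background corpus (made-up numbers).
WORD_COUNTS = {"hope": 500, "street": 800, "media": 600, "inc.": 50, "apple": 300}
TOTAL_WORDS = 1_000_000
UNSEEN_COUNT = 0.5  # assumed pseudo-count for words missing from the table


def log_prob_as_text(name):
    """Naive unigram log-probability that `name` shows up as ordinary text."""
    return sum(
        math.log(WORD_COUNTS.get(word.lower(), UNSEEN_COUNT) / TOTAL_WORDS)
        for word in name.split()
    )


def prune(names, max_log_prob):
    """Keep only the names whose log-probability is below the threshold."""
    return [name for name in names if log_prob_as_text(name) < max_log_prob]


# Assumed thresholds: sub-names must be rarer than full names to be kept.
FULL_NAME_MAX = -10.0
SUB_NAME_MAX = -12.0

full_names = prune(["Apple", "Hope Street Media", "Yiftee Inc."], FULL_NAME_MAX)
sub_names = prune(["Hope", "Hope Street", "Yiftee"], SUB_NAME_MAX)
# full_names == ["Hope Street Media", "Yiftee Inc."]
# sub_names == ["Hope Street", "Yiftee"]
```

With these toy numbers a single common word such as “Hope” is dropped, while a word that never appears in the background corpus, such as “Yiftee”, is kept.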
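A rough sketch of how the hand-check list could be produced: count how often each Gazette item was found in the sample, review the top of the ranking by hand, and remove whatever ends up blacklisted. The sample data and the names "Path" and "Box" are placeholders, and the blacklist itself comes from a human decision, not from the code.

```python
from collections import Counter


def most_found_items(mentions_per_document, top_n):
    """Rank Gazette items by how many times they were found in the sample."""
    counts = Counter()
    for mentions in mentions_per_document:
        counts.update(mentions)
    return counts.most_common(top_n)


def apply_blacklist(gazetteer, blacklist):
    """Drop the items a human reviewer marked as mostly ordinary words."""
    return {name for name in gazetteer if name not in blacklist}


# Placeholder sample: the Gazette items found in each preprocessed document.
sample = [["Yiftee", "Path", "Path"], ["Path", "Box"], ["Box", "Path"]]
to_review = most_found_items(sample, top_n=2)
# to_review == [("Path", 4), ("Box", 2)]

# Suppose the reviewer decides both are usually plain words in this sample.
final_gazetteer = apply_blacklist({"Yiftee", "Path", "Box"}, {"Path", "Box"})
# final_gazetteer == {"Yiftee"}
```

Looking only at the most frequent items keeps the manual effort small while removing the entries that generate the most noise.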
We finished very satisfied with the results and also with the lessons learnt. Hopefully some of the tips above can help you.
Recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. http://en.wikipedia.org/wiki/Precision_and_recall