The easy way using Data Science
Posted by Gonzalo Garcia Berrotarán 2 years, 10 months ago
When we discussed Information Extraction and IEPY with professional peers, we noticed that the approach was often unknown to those who could benefit from it the most: people with large volumes of unstructured or poorly structured text, where it is very costly to go through the text manually to extract relationships between entities. In the VC industry, for example, the relationships include funding, acquisitions, and the creation or opening of offices, and the entities are companies, investment funds, people, and so on.
To create an example directed at those with perhaps less of a technical background, we processed the news articles from TechCrunch News, the main technology blog in the United States. We sought the funding relationships in U.S. companies. We published the result and found some interesting things:
VC Industry and Specialized Press
News about funding may be published because specialized journalists investigated it, or because the companies themselves pushed it and managed to place it within mainstream media content.
Checking the funding-related content in the TechCrunch News posts and comparing it to more complete databases can therefore show us the editorial policies these journalists follow, or how effective companies are at placing their own content.
For example, in the funded companies vs. average funding chart (currently one of the main discussion topics) you can see a growing gap between the events covered in the more general database (CrunchBase) and those covered by TechCrunch News.
Since last year there has been a tendency to cover events where the funding amount was greater than the CrunchBase average. Judging by this data, above-average funding events appear to attract more attention from journalists than below-average ones.
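The comparison behind that gap can be sketched in a few lines. This is only an illustration: the records below are made-up toy values, not data from the demo, and the `average_funding_gap` helper is a hypothetical name introduced here. It assumes each funding event from the general database carries an amount and a flag saying whether the press covered it.

```python
from statistics import mean

# Hypothetical toy records: (company, amount in USD, covered by the press?).
# These values are invented for illustration and are not from the demo.
events = [
    ("A", 2_000_000, False),
    ("B", 5_000_000, False),
    ("C", 20_000_000, True),
    ("D", 50_000_000, True),
    ("E", 3_000_000, False),
]


def average_funding_gap(events):
    """Return (overall average, covered-only average, difference)."""
    all_amounts = [amount for _, amount, _ in events]
    covered_amounts = [amount for _, amount, covered in events if covered]
    overall_avg = mean(all_amounts)
    covered_avg = mean(covered_amounts)
    return overall_avg, covered_avg, covered_avg - overall_avg


overall_avg, covered_avg, gap = average_funding_gap(events)
print(f"overall: {overall_avg:,.0f}  covered: {covered_avg:,.0f}  gap: {gap:,.0f}")
```

A positive gap means the covered events are, on average, larger than the events in the general database, which is the pattern described above.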
Geographical distribution of event coverage
Some of the highlights we can see include:
The low coverage in the Massachusetts area. Although it is close to NY and has almost the same number of events in CrunchBase, a funding event in Massachusetts is almost half as likely to be reported as one in NY. (Does it make sense to hire PR agencies in one area instead of another?)
The significantly high coverage in a state such as Utah, which is comparable to that of the main ones (NY and CA) despite a much lower volume of funding events.
The consistent media coverage of events in the two main areas of the industry: CA and NY.
And so on.
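A per-state comparison like the one above boils down to a coverage rate: the share of a state's funding events in the general database that also show up in the news. The sketch below is a minimal illustration with made-up counts, not the demo's figures, and `coverage_rate_by_state` is a hypothetical helper name.

```python
from collections import Counter


def coverage_rate_by_state(all_event_states, covered_event_states):
    """Map each state to the fraction of its events that were covered."""
    total = Counter(all_event_states)
    covered = Counter(covered_event_states)
    # Counter returns 0 for states with no covered events.
    return {state: covered[state] / count for state, count in total.items()}


# Hypothetical toy data: one state label per funding event.
all_event_states = ["CA"] * 10 + ["NY"] * 5 + ["MA"] * 5
covered_event_states = ["CA"] * 5 + ["NY"] * 2 + ["MA"] * 1

rates = coverage_rate_by_state(all_event_states, covered_event_states)
for state, rate in sorted(rates.items()):
    print(f"{state}: {rate:.0%}")
```

In this toy data NY and MA have the same number of events, but an MA event is half as likely to be covered, which is the kind of difference the highlights describe.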
In summary, what was the advantage of this approach?
If you wanted an overall view, you could include content from other blogs like Gigaom, VentureBeat, TWSJ, Forbes Tech, Mashable, Wired, The Verge, etc. without extra effort once the tool has learned to identify and predict relationships (e.g. funding to companies).
And of course, as the demo outlines, we were able to read several thousand news articles, extract the information to build a database, and make the demo without awakening in us the deep murderous rage that reading ~100k articles in search of that relationship can provoke.
Disclaimer: this post and the demo don't pretend to be a comprehensive analysis of the VC industry; they only aim to show what information extraction can be used for.