Precognox at CILC 2013 - The Use of Corpora in Natural Language Processing

Fri 22 March, 2013

Our computational linguist Zoltán Varjú presented his thoughts on the use of corpora in natural language processing at the V International Conference on Corpus Linguistics (CILC 2013). Here is the abstract:

Some researchers suggest[1] that even less sophisticated algorithms give better results when they are applied to large, web-scale corpora. Corpus-based language models bring empirical evidence into linguistic inquiries, and statistical methods have become the state-of-the-art techniques in natural language processing and linguistics[2].

On the other hand, we have to face methodological questions when we use web corpora. In most cases, the industry unconsciously relies on Leech's notion of representativeness[3] and aims to use a corpus that is big enough to support generalizations about the whole language. However, usage determines sampling, and we cannot generalize outside the domain of our data. One of the most striking examples of this vicious circle is named entity recognition, which is a notoriously domain-specific task.

Although we aim for fully automatic solutions, we are very far from such applications. The human factor in processing corpora is still important, and we need more elaborate methods. One promising direction is crowdsourcing, which reduces the time and cost of annotation[4], but the cost of expertise in data curation cannot be avoided. Titles like [5] show that the industry is interested in standard practices and needs guidance to move beyond ad hoc, domain-specific solutions.

[1] Alon Halevy – Peter Norvig – Fernando Pereira: The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, March/April 2009, pp. 8-12
[2] Peter Norvig: Colorless green ideas learn furiously: Chomsky and the two cultures of statistical learning, Significance, August 2012, Vol. 9, Issue 4, pp. 30-33
[3] Leech, G. (1991). The state of the art in corpus linguistics. In Aijmer, K. & B. Altenberg (eds.), English corpus linguistics: studies in honour of Jan Svartvik. London: Longman. 8–29.
[4] Chris Callison-Burch and Mark Dredze (eds): Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, Association for Computational Linguistics, 2010, http://www.aclweb.org/anthology/W10-07
[5] James Pustejovsky – Amber Stubbs: Natural Language Annotation for Machine Learning, O'Reilly Media, 2012

A new year, a brand new name

Tue 22 January, 2013

A new year and a brand new name: Precognox. The company name has changed, but our commitment to state-of-the-art search and text mining solutions and professional J2EE development is the same.

Due to the name change, we have set up a new English website at Precognox.com.

If you are wondering where this weird name comes from, it was inspired by Minority Report, a film from the early 2000s. We strongly believe that data can give you insight into the future and help you make better decisions, and we hope we can help you with that.

Company day in Budapest

Wed 19 December, 2012

We held our regular company day in Budapest. The event started with a brief overview of the year and the challenges ahead in 2013. After the serious talks we had lunch and an in-house bowling competition.

Our new colleague Gábor Németh

Thu 29 November, 2012

Our development team has gained a new talent, Gábor Németh. Gábor is a senior Java developer with broad experience at multinational and domestic companies.

Career highlights:
Magyar Telekom - senior developer
Wirecard AG (Germany) - software developer
Loyalty Partner GmbH (Germany) - software developer

We believe our new colleague will strengthen our products and services.

Mind reading instead of searching

Tue 30 October, 2012

What will the future bring us? HVG, the most prestigious weekly newspaper in Hungary, devoted a special issue (HVG Extra, Jövő 2.0 - Future 2.0) to this question. Our colleagues Endre Jóföldi (CEO) and Zoltán Varjú speculated on recent developments in search and language technology in their contribution to the issue. They think that technology is becoming so sophisticated, and the collection of contextual data (location, user behavior, etc.) so extensive, that it will be possible to "read our minds" and serve up relevant search results even without typing queries...


