
Berlin Buzzwords - the buzz that matters

Mon 10 June, 2013

Our CTO Károly Kása has just returned from Berlin Buzzwords, Europe’s hottest geeky conference on scalability, search and data analysis. The two-day conference was dominated by presenters who are not just experts but lead developers of the most important open source tools in big data and enterprise search. Don’t worry if you couldn’t get to Berlin for first-hand information about the latest developments: the organizers will make every talk available online. The friendly atmosphere made it easy to get in touch with like-minded attendees during the breaks.

Company workshop

Tue 16 April, 2013

Last Friday we held a company workshop for our developers. The main themes were Solr updates, new Java features, and Android development.

Precognox joins LinguaPark Cluster

Tue 9 April, 2013

Precognox has joined LinguaPark Cluster (http://www.linguapark.hu/), Hungary's largest cluster of language experts in the fields of translation, translation and language technologies, and communication. We are very proud to join a cluster with members from both academia and industry, in a field that is one of the top priorities of the European Union.

At The CEBIT Business Persons' Meeting

Tue 26 March, 2013

In business, no tool can entirely replace personal contact. Our executive and development managers took part in twelve pre-arranged meetings at the FutureMatch business persons' meeting, organized by CEBIT in Hannover in March.

We met CEOs of South Korean, Italian, Croatian, Serbian, Spanish, Brazilian and Polish companies, and even one Hungarian company. Several of these meetings are likely to bear fruit in the future.

Precognox at CILC 2013 - The Use of Corpora in Natural Language Processing

Fri 22 March, 2013

Our computational linguist Zoltan Varju presented his thoughts on the use of corpora in natural language processing at the 5th International Conference on Corpus Linguistics (CILC 2013). Here is the abstract:

Some researchers suggest[1] that, in corpus analysis, even less sophisticated algorithms give better results when applied to large, web-scale corpora. Corpus-based language models bring empirical evidence into linguistic inquiry, and statistical methods have become the state of the art in natural language processing and linguistics[2].

On the other hand, we have to face methodological questions when using web corpora. In most cases, the industry unconsciously relies on Leech's notion of representativeness[3] and aims to use a corpus that is big enough to support generalizations about the whole language. However, usage determines sampling, and we cannot generalize outside the domain of our data. One of the most striking examples of this vicious circle is named entity recognition, a notoriously domain-specific task.

Although we aim for fully automatic solutions, we are still very far from such applications. The human factor in processing corpora remains important, and we need more elaborate methods. One promising direction is crowdsourcing, which reduces the time and cost of annotation[4], but the cost of expertise in data curation cannot be avoided. Titles like [5] show that the industry is interested in standard practices and needs guidance to move beyond ad hoc, domain-specific solutions.

[1] Alon Halevy – Peter Norvig – Fernando Pereira: The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, March/April 2009, p. 8-12
[2] Peter Norvig: Colorless green ideas learn furiously: Chomsky and the two cultures of statistical learning, Significance, August 2012, Vol. 9, Issue 4, p. 30-33
[3] Leech, G. (1991). The state of the art in corpus linguistics. In Aijmer, K. & B. Altenberg (eds.), English corpus linguistics: studies in honour of Jan Svartvik. London: Longman. 8–29.
[4] Chris Callison-Burch and Mark Dredze (eds): Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, Association for Computational Linguistics, 2010, http://www.aclweb.org/anthology/W10-07
[5] James Pustejovsky – Amber Stubbs: Natural Language Annotation for Machine Learning, O'Reilly Media, 2012
