We held a company workshop for our developers last Friday. The main themes were Solr updates, new Java features, and Android development.
Precognox has joined the LinguaPark Cluster (http://www.linguapark.hu/), the largest cluster of language experts in Hungary in the fields of translation, language technologies, and communication. We are very proud to join a cluster with members from both academia and industry in a field that is one of the top priorities within the European Union.
In business life, no tool can entirely substitute for personal contacts. Our executive and development managers took part in twelve pre-arranged meetings at the FutureMatch business matchmaking event at CeBIT in Hannover in March.
We met the CEOs of South Korean, Italian, Croatian, Serbian, Spanish, Brazilian, Polish, and even one Hungarian company. Several of these meetings are likely to come to fruition in the future.
Our computational linguist Zoltan Varju presented his thoughts on the use of corpora in natural language processing at the 5th International Conference on Corpus Linguistics (CILC 2013). Here is the abstract:
Some researchers suggest that in corpus analysis, even less sophisticated algorithms can give better results when applied to large, web-scale corpora. Corpus-based language models bring empirical evidence into linguistic inquiries, and statistical methods have become the state-of-the-art techniques in natural language processing and linguistics.
On the other hand, we have to face methodological questions when using web corpora. In most cases, the industry unconsciously relies on Leech's notion of representativeness and aims to use a corpus big enough to allow generalizations about the language as a whole. However, usage determines sampling, and we cannot generalize outside the domain of our data. One of the most striking examples of this vicious circle is named entity recognition, which is a notoriously domain-specific task.
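The domain-specificity point can be illustrated with a deliberately simple sketch (all data below is hypothetical toy data, not from any real corpus): a gazetteer-style tagger that memorizes the entities seen in a news-domain training set does fine on in-domain text, but its recall collapses on text from a different domain.

```python
# Toy illustration of domain-specific NER: a gazetteer tagger memorizes
# the entities it saw during training and misses everything else.

def build_gazetteer(annotated_tokens):
    """Collect every token labeled as an entity in the training data."""
    return {tok for tok, label in annotated_tokens if label == "ENT"}

def tag(tokens, gazetteer):
    """Tag a token as ENT only if it was seen as an entity before."""
    return [(tok, "ENT" if tok in gazetteer else "O") for tok in tokens]

def recall(gold, predicted):
    """Fraction of gold-standard entities the tagger found."""
    gold_ents = [i for i, (_, lab) in enumerate(gold) if lab == "ENT"]
    hits = sum(1 for i in gold_ents if predicted[i][1] == "ENT")
    return hits / len(gold_ents) if gold_ents else 0.0

# Hypothetical in-domain training and test data (business news).
news_train = [("Nokia", "ENT"), ("shares", "O"), ("fell", "O"),
              ("while", "O"), ("Siemens", "ENT"), ("rallied", "O")]
news_test = [("Nokia", "ENT"), ("and", "O"), ("Siemens", "ENT"),
             ("merged", "O")]

# Hypothetical out-of-domain test data (biomedical text).
bio_test = [("aspirin", "ENT"), ("inhibits", "O"), ("COX-2", "ENT"),
            ("expression", "O")]

gaz = build_gazetteer(news_train)
in_domain = recall(news_test, tag([t for t, _ in news_test], gaz))
out_domain = recall(bio_test, tag([t for t, _ in bio_test], gaz))
print(in_domain, out_domain)  # 1.0 0.0 on this toy data
```

Real NER systems are of course far richer than a gazetteer, but the same effect shows up with statistical taggers: the features they learn are sampled from one domain's usage, so performance drops when the test domain differs.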
Although we aim at fully automatic solutions, we are still very far from such applications. The human factor in processing corpora remains important, and we need more elaborate methods. One promising direction is crowdsourcing, which reduces the time and cost of annotation, though the cost of expertise in data curation cannot be avoided. Titles like Natural Language Annotation for Machine Learning and the proceedings of the NAACL HLT 2010 workshop on creating language data with Mechanical Turk show that the industry is interested in standard practices and needs guidance to overcome ad hoc, domain-specific solutions.
Alon Halevy, Peter Norvig, Fernando Pereira: The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, March/April 2009, pp. 8–12
Peter Norvig: Colorless green ideas learn furiously: Chomsky and the two cultures of statistical learning. Significance, August 2012, Vol. 9, Issue 4, pp. 30–33
Geoffrey Leech: The state of the art in corpus linguistics. In Karin Aijmer & Bengt Altenberg (eds.), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman, 1991, pp. 8–29
Chris Callison-Burch & Mark Dredze (eds.): Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010, http://www.aclweb.org/anthology/W10-07
James Pustejovsky & Amber Stubbs: Natural Language Annotation for Machine Learning. O'Reilly Media, 2012
A new year, and a brand new name: Precognox. The company name has changed, but our commitment to state-of-the-art search and text mining solutions and professional J2EE development remains the same.
Due to the name change, we have set up a new English website at Precognox.com.
If you are wondering where this unusual name comes from: its etymology was inspired by Minority Report, a film from the early 2000s. We strongly believe that data can give you insight into the future, on which you can base better decisions, and we hope we can help you with that.