Finding the needle in the haystack

Thu 5 November, 2015

Recognizing names is easy for us. But what about machines? And what if you would like to recognize names in different languages? Meltwater asked us to help their system extract names from the first few lines of news articles (we call them by-lines), and we are currently working on solutions for Brazilian Portuguese, Arabic and Chinese.

The two faces of names

Recognizing names is a Janus-faced problem. Have a look at the following screenshot, taken from LeMonde.fr.

Even if you don’t speak French, you can still recognize the name of the author. Humans know that it is a tradition to mention the author’s name somewhere near the title. They are also aware that names follow special orthographic rules (e.g. the parts of a name usually start with a capital letter) and that there is a plethora of common given names in Western European languages. Things get complicated with languages that use non-Latin scripts, such as Arabic or Russian. Machines lack this cultural and linguistic awareness, so it is our task to make it explicit for them.

Teaching machines to recognize names

Our team works with translators who are either native speakers of the target language or have native-like fluency. In every case, we work with real-world data. First we annotate the data by tagging the start and end positions of names. Then we split the data into two distinct sets: one for development and one for testing. The development data is our primary source of information, and we use it to learn about the nature of names in a given language. What kinds of names does the language use? How many words are in a name? Does the language use prefixes, suffixes, infixes, etc.? Is there a convention for ascribing authorship, such as “by” in English? The test data is used for continuous evaluation of the system, i.e. literally to see whether we can find the names in the by-lines. With these methods we can achieve high precision with a very low false positive rate.
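To make the questions above concrete, here is a minimal sketch of a rule-based by-line matcher in Python. It is not the system built for Meltwater. The cue words in `CUES`, the function name `find_author` and the "two to four capitalised words" rule are all assumptions chosen for illustration. A real system would derive such rules from the development data, and it would need a different strategy for scripts without capital letters, such as Arabic or Chinese.

```python
import re

# Hypothetical authorship cues ("by" in English, "par" in French, "por" in Portuguese).
CUES = {"by", "par", "por"}

# A token is a run of letters, optionally joined by hyphens or apostrophes.
TOKEN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def find_author(text, max_words=4):
    """Return (start, end, name) for the first cue followed by a name, or None.

    A "name" here is simply two to `max_words` capitalised words that follow
    the cue and are separated only by whitespace (an illustrative assumption).
    """
    tokens = list(TOKEN.finditer(text))
    for i, tok in enumerate(tokens):
        if tok.group().lower() not in CUES:
            continue
        name, prev = [], tok
        for nxt in tokens[i + 1 : i + 1 + max_words]:
            gap = text[prev.end() : nxt.start()]
            if not gap.isspace() or not nxt.group()[0].isupper():
                break
            name.append(nxt)
            prev = nxt
        if len(name) >= 2:
            start, end = name[0].start(), name[-1].end()
            return start, end, text[start:end]
    return None


print(find_author("Par Jean Dupont, le 5 novembre 2015"))
```

The function returns start and end positions, so its output can be compared directly with annotated test data of the kind described above.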

We are happy to work with multiple languages

We love working on multilingual solutions! Our company is located in the middle of Europe and we have grown accustomed to the challenges of a multilingual environment. Even our biggest client is a company offering translation memory and online services to professional translators.

About Meltwater

Meltwater helps companies make better, more informed decisions based on insights from the outside. More than 23,000 companies use the Meltwater media intelligence platform to stay on top of billions of online conversations, extract relevant insights, and use them to strategically manage their brand and stay ahead of their competition. With 50 offices on six continents, Meltwater is dedicated to personal, global service built on local expertise. Meltwater also operates the Meltwater Entrepreneurial School of Technology (MEST), a nonprofit organization devoted to nurturing future generations of entrepreneurs. For more information, follow Meltwater on Twitter, Facebook, LinkedIn, YouTube, or visit www.meltwater.com.

KConnect at ICT 2015 Exhibition

Wed 21 October, 2015

KConnect participated in the interactive ICT 2015 Exhibition (Digital Agenda for Europe) this week. Unarguably, this member of our team is the hardest among the exhibitors.

KConnect provides medical-specific multilingual text processing services, consisting of semantic annotation, semantic search, search log analysis, document classification and machine translation. Here is a short video on how it works.

We are ready to help you if you have a health-related search or text mining problem.

Tagging documents with informative key-phrases for cognitive computing

Fri 16 October, 2015

Extracting key-words and key-phrases is a common task in Enterprise Search and Information Retrieval. Key-phrases are commonly used by search engines and other indices to categorize texts, build facets, or locate specific data in documents.

We have a problem

Although key-phrase extraction is a common task both in industry and in academia, the precision and recall values of state-of-the-art systems are usually under 30% and 40%, respectively. These measures are often calculated using a set of key-phrases extracted by humans as a baseline. Precision is the fraction of true positives over the sum of true positives and false positives; recall is the fraction of true positives over the sum of true positives and false negatives. Although human annotators do their best to extract key-phrases, it is not an easy task, and sometimes two humans do not fully agree on which phrases are the most relevant in a given text. The picture looks bleak, yet we can see key-phrase extractors almost everywhere. Why? First, the relatively low precision and recall values are not that bad; they are in fact pretty good compared to other Information Retrieval systems. Second, users find most of the key-phrases natural, or at least somewhat informative.
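The two measures defined above can be sketched in a few lines of Python. The function name `precision_recall` and the example phrases are made up for illustration, and the sketch assumes exact string matching between extracted and human-selected key-phrases.

```python
def precision_recall(extracted, reference):
    """Compare automatically extracted key-phrases with human-selected ones."""
    extracted, reference = set(extracted), set(reference)
    true_pos = len(extracted & reference)   # found by both
    false_pos = len(extracted - reference)  # extracted, but not chosen by humans
    false_neg = len(reference - extracted)  # chosen by humans, but missed
    precision = true_pos / (true_pos + false_pos) if extracted else 0.0
    recall = true_pos / (true_pos + false_neg) if reference else 0.0
    return precision, recall


# Illustrative data only: two of four extracted phrases match the three reference phrases.
extracted = ["graph", "pagerank", "word", "text"]
reference = ["pagerank", "key-phrase extraction", "graph"]
precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: a near-miss such as "key-phrase" versus "key-phrase extraction" counts as both a false positive and a false negative, which is one reason the reported values tend to be low.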

Designing our new key-phrase extraction algorithm was a long journey. We wanted a more or less language-independent solution, since in Europe chances are high that you have to deal with multiple languages. First, we studied data-driven approaches, which usually employ a language model (basically huge frequency tables of words or n-grams) and compare the word or n-gram frequencies of a given text to this model. Although these approaches are very good, you have to build a new model for every new language. Also, comparing frequencies can be computationally intensive, which is a problem if you want to extract key-phrases in real time.

Our solution

Finally, we found TextRank, an unsupervised algorithm for key-phrase extraction. TextRank runs on the text itself and requires no other input. It is a graph-based algorithm inspired by Google’s PageRank. The basic idea is that you can build a graph from the words of a text. Let every word be a node in the graph, and draw an edge between two words if they are neighbors (i.e. one occurs directly before the other) or if there is no more than one word between them. Once the graph is built, you can compute the PageRank value of each node. PageRank is a good measure because it combines authority (how many nodes link to a node) and hubness (well-connected nodes that make authority nodes accessible to many less important nodes).
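The construction described above can be sketched in Python as follows. This is a simplified, unweighted and undirected illustration, not our production code. The damping factor of 0.85, the fixed number of iterations and the function names are assumptions, and the sketch skips the filtering of function words that a practical extractor would apply.

```python
import re
from collections import defaultdict


def build_graph(text, max_gap=1):
    """Link words that are adjacent or have at most `max_gap` words between them."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + 2 + max_gap, len(words))):
            if words[j] != word:  # no self-loops
                neighbours[word].add(words[j])
                neighbours[words[j]].add(word)
    return neighbours


def pagerank(neighbours, damping=0.85, iterations=50):
    """Plain power iteration; every node in the graph has at least one neighbour."""
    nodes = list(neighbours)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / len(nodes)
            + damping * sum(rank[m] / len(neighbours[m]) for m in neighbours[node])
            for node in nodes
        }
    return rank


def top_words(text, k=5):
    """Return the k words with the highest PageRank value."""
    rank = pagerank(build_graph(text))
    return sorted(rank, key=rank.get, reverse=True)[:k]


print(top_words("Graph-based ranking builds a graph from words and ranks the words of the graph."))
```

Because the graph is built from the text alone, the same code runs on any language that can be split into words, which is the property that made TextRank attractive to us.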

Our linguistic team modified TextRank, so now it takes some linguistic information into consideration (e.g. we are weighting the edges and we are using a directed graph). Although we haven’t reached better evaluation results than the state-of-the-art systems, we felt that our solution gives very natural key-phrases.

Think like a human and evaluate your results

Graph-based models are becoming popular models of human cognition. Studies have shown that at least a significant portion of language can be modeled by graph structures. For example, Griffiths et al. describe an experiment that asked participants to respond with the first word that came to mind starting with a specific letter. From a technical point of view, think of this task as giving suggestions based on the initial character. At first, one might think that the most frequent words starting with that character would predict the answers best. It turned out that word association networks are better at this task. To show this, Griffiths and his colleagues built a network from a word association database and computed the PageRank value of each of its nodes. Other experiments suggest that we organize semantic information into networks (Tenenbaum), and that language development can be described in terms of growing language networks in our mind (Ke and Yao). Language graphs have also been applied to measure the coherence of speech of patients with thought disorders. From this point of view, TextRank can be interpreted as a way of determining the central elements of a text.

We can’t build a product on our intuition, so we designed an online study that asked participants to rate key-phrases on a scale from “Totally relevant” to “Totally irrelevant”. We found that 7.6% of them are totally relevant, 46.4% are rather relevant, 32.4% are somewhat relevant, 13.2% are rather irrelevant and 0.4% are totally irrelevant.

It may seem that we designed an experiment just to support our claims. This is partly true. We think the good old precision and recall values are useful if you can establish a baseline. But asking humans to find a few key-phrases is asking them to rank alternatives. TextRank and our modification of the algorithm do the same thing: they rank every word according to its PageRank value. In our evaluation we wanted to know whether humans accept automatically extracted phrases as key-phrases, even if they would assign them a lower rank.

Quality matters: Battle tested solutions

Thu 8 October, 2015

We love software, especially if it is safe, reliable, and does its job. Our philosophy of design is simple: know your product. So we are using SCRUM/Agile methodology, rigorous testing and we evaluate every machine learning solution.

Some part of our team


Engineering is a curious mixture of science and craft. Our team has almost thirty members, some of them with 8-10 years of experience. The team structure is flat, and junior staff are fully integrated into the work under the guidance of their senior peers. We are a Java shop and our engineers are keen to learn about the Java ecosystem, from the low-level details to designing architectures. As craftsmen, we know that even cutting-edge technology can be used in the wrong way. So we introduced the SCRUM methodology. We work closely with our clients, which means short iterations that provide opportunities for feedback and for rethinking the direction during development.

Quality matters

Using the SCRUM methodology and test-driven development reduces the likelihood of bugs, though it can’t eliminate them. We have a separate Quality Assurance Team whose main purpose is to exhaustively test every product made by our software development team. When required, we hand testing over to independent testers. Our team can also be hired to carry out independent tests.


In the era of big data, you can’t avoid machine learning (ML) applications. These often require labeled data for supervised learning tasks. Our evaluation team has experienced annotators who can prepare labeled data for training and testing. We see evaluation as part of the quality assurance and testing process, hence we don’t sell products without an evaluation report on their ML parts.

There is no software without bugs. But it does matter where those bugs lie! The SCRUM methodology, quality assurance and evaluation help us to avoid the critical ones.


Automatic Detection of Emotions in Text

Wed 30 September, 2015

Our research is the first attempt to offer a solution for detecting emotions in Hungarian texts. In general, emotion analysis is most popular in the behavioral sciences and psychology; in recent years, however, it has also started to spread in the field of NLP (Natural Language Processing).

Plutchik wheel of human emotions

The background

It is important to distinguish emotion analysis from the widely used sentiment analysis. Sentiment analysis typically classifies a text as positive, negative or neutral, whereas emotion analysis aims to extract emotional states from it. Detecting emotions is extremely hard: they come and go quickly, and they are usually associated with extra-linguistic cues such as facial expressions and tone of voice.

In the Internet era, it is becoming more and more important to analyze and extract emotions from texts, not just because the task is fascinating and challenging for NLP experts, but also because it is increasingly important in the economy, for example when we would like to measure customer satisfaction.

Our research group hypothesizes that words with emotional meaning or content should be the best markers of the speaker’s/writer’s emotional intent, so we have constructed a Hungarian Emotion Dictionary. The dictionary consists of sub-dictionaries, each based on Ekman's six basic emotions, namely sadness, anger, disgust, fear, surprise and joy. Our team manually annotated several blog posts and their comments to test the efficiency of using our dictionaries for emotion analysis.

How can we use emotion analysis?

During the local elections in 2014, we analyzed Hungarian tweets related to the mayoral candidates in Budapest. We found that anger is the best predictor of winning! We were surprised, since most studies (like this classic from Bollen et al.) found the number of mentions and/or positive sentiment to be the best predictors of success. The number of Hungarian Twitter users is very small, and less fine-grained solutions like sentiment analysis or the frequency of mentions could give us a misleading picture, since most of the tweets were neutral and mentions of small-party candidates were very rare. So we analyzed the tweets with our emotion dictionaries and gave each candidate an emotion score that reflects the relative proportion of each emotion in the tweets mentioning him or her. Of the six basic emotions, it was the mean square error of anger that was in accord with the results of opinion polls and, later, the final outcome of the election.
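A dictionary-based emotion score of this kind can be sketched in Python as follows. The tiny English word lists are placeholders standing in for the Hungarian Emotion Dictionary. The function name `emotion_scores`, the whitespace tokenization and the punctuation stripping are assumptions made for the example, and the comparison with opinion polls is not reproduced here.

```python
from collections import Counter

# Placeholder sub-dictionaries for Ekman's six basic emotions (illustrative words only).
EMOTION_WORDS = {
    "sadness": {"sad", "unhappy"},
    "anger": {"angry", "furious", "outrage"},
    "disgust": {"disgusting", "gross"},
    "fear": {"afraid", "scared"},
    "surprise": {"surprised", "unexpected"},
    "joy": {"happy", "glad"},
}


def emotion_scores(tweets, lexicon=EMOTION_WORDS):
    """Return each emotion's share of all emotion-word hits in the given tweets."""
    counts = Counter({emotion: 0 for emotion in lexicon})
    for tweet in tweets:
        for token in tweet.lower().split():
            token = token.strip(".,!?;:\"'()")
            for emotion, words in lexicon.items():
                if token in words:
                    counts[emotion] += 1
    total = sum(counts.values())
    return {emotion: (count / total if total else 0.0) for emotion, count in counts.items()}


# Made-up tweets mentioning one hypothetical candidate.
tweets = ["I am angry and furious!", "So happy today", "no emotion here"]
print(emotion_scores(tweets))
```

Computing one such score vector per candidate yields the relative proportions described above. Hungarian is a heavily inflected language, so a real implementation would also have to lemmatize tokens before looking them up.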

The Economist’s R-word index is one of the best-known indicators of the economy. It is very simple: it tracks the frequency of the term “recession” in the Wall Street Journal and in the Financial Times, yet it is mostly accurate. We created a corpus, i.e. a collection of articles from various news sites and blogs. We found no correlation between GDP and the frequency of “recession” and its Hungarian synonyms. However, the levels of fear and anger usually increase before GDP starts to decline.
