Visualizing Star Wars movie scripts - relationships matter

Wed 20 January, 2016

A long time ago, in a galaxy far, far away, our data analysts were talking about the upcoming Star Wars movie. One of them had never seen any episode of the two trilogies, so they decided to make the movie more accessible to this poor fellow.

Relationships matter

The Star Wars universe is full of strange-looking characters. They talk and beep a lot, and sometimes it is hard to keep track of who talked to whom. So we took the movie scripts and started coding hard.

See the amazing Star Wars universe...

 

 

Finding the needle in the haystack

Thu 5 November, 2015

Recognizing names is so easy for us. But what about machines? And what if you would like to recognize names in different languages? Meltwater asked us to help their system to extract names from the first few lines of news articles (we call them by-lines) and we are currently working on solutions for Brazilian Portuguese, Arabic and Chinese.

The two faces of names

Recognizing names is a Janus-faced problem. Have a look at the following screenshot, taken from LeMonde.fr.


Even if you don’t speak French, you can still recognize the name of the author. Humans know that it is a tradition to mention the author’s name somewhere near the title, and they are also aware that names follow special orthographic rules (e.g. the parts of a name usually start with a capital letter) and that there is a plethora of common given names in Western European languages. Things get more complicated with languages that use non-Latin alphabets, like Arabic or Russian. Machines lack this cultural and linguistic awareness, so it is our task to teach it to them.

Teaching machines to recognize names

Our team works with translators who are either native speakers of the target language or speakers with native-like fluency. In every case, we work with real-world data. First we prepare the data by annotating it (the start and end positions of names are tagged). Then we split the data into two distinct sets: one for development and one for testing. The development data is our primary source of information and it is used to learn about the nature of names in a given language. What kind of names does the language use? How many words are in a name? Does the language use prefixes, suffixes, infixes, etc.? Is there a convention to ascribe authorship, for example “by” in English? The test data is used for the continuous evaluation of the system, i.e. literally to see if we can find the names in the by-lines. With these methods we can achieve high precision, that is, a very low number of false positives among the names we extract.
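The workflow above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny annotated sample, the helper names (`split_dev_test`, `extract_names`, `precision`) and the toy rule that looks for capitalized words after “By”. It is not the system built for Meltwater; it only shows how annotated spans, a development/test split and a precision check fit together.

```python
import random
import re

# Hypothetical annotated by-lines: (text, [(start, end), ...]), end exclusive.
ANNOTATED = [
    ("By Jane Doe", [(3, 11)]),
    ("By John Smith and Mary Jones", [(3, 13), (18, 28)]),
    ("Updated at noon", []),
    ("By Ann Lee, staff writer", [(3, 10)]),
]

def split_dev_test(samples, test_ratio=0.5, seed=0):
    """Shuffle the samples and split them into development and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Toy rule (an assumption): a name is a run of two or more capitalized words.
NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")

def extract_names(text):
    """Return (start, end) spans of candidate names in an English by-line."""
    if not text.lower().startswith("by "):
        return []
    return [m.span() for m in NAME.finditer(text, 3)]

def precision(samples):
    """True positives divided by all extracted spans."""
    tp = fp = 0
    for text, gold in samples:
        for span in extract_names(text):
            if span in gold:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if tp + fp else 0.0

dev, test = split_dev_test(ANNOTATED)
print("precision on test set:", precision(test))
```

In practice the rules would be derived from the development set only, and the test set would stay untouched until evaluation.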

We are happy to work with multiple languages

We love working on multilingual solutions! Our company is located in the middle of Europe and we are accustomed to the challenges of a multilingual environment. Even our biggest client is a company offering translation memory and online services to professional translators.

About Meltwater

Meltwater helps companies make better, more informed decisions based on insights from the outside. More than 23,000 companies use the Meltwater media intelligence platform to stay on top of billions of online conversations, extract relevant insights, and use them to strategically manage their brand and stay ahead of their competition. With 50 offices on six continents, Meltwater is dedicated to personal, global service built on local expertise. Meltwater also operates the Meltwater Entrepreneurial School of Technology (MEST), a nonprofit organization devoted to nurturing future generations of entrepreneurs. For more information, follow Meltwater on Twitter, Facebook, LinkedIn, YouTube, or visit www.meltwater.com.

KConnect at ICT 2015 Exhibition

Wed 21 October, 2015

KConnect participated in the interactive ICT 2015 Exhibition (Digital Agenda for Europe) this week. Unarguably, this member of our team is the hardest among the exhibitors.

KConnect provides medical-specific multilingual text processing services, consisting of semantic annotation, semantic search, search log analysis, document classification and machine translation. Here is a short video on how it works.

We are ready to help you if you have a health-related search or text mining problem.

Tagging documents with informative key-phrases for cognitive computing

Fri 16 October, 2015

Extracting key-words and key-phrases is a common task in Enterprise Search and Information Retrieval. Key-phrases are commonly used by search engines and other indices to categorize texts, build facets, or locate specific data in documents.

We have a problem

Although key-phrase extraction is a common task both in industry and in academia, the precision and recall values of state-of-the-art systems are usually under 30% and 40%, respectively. These measures are often calculated by using a set of key-phrases extracted by humans as the reference. Precision is the number of true positives divided by the sum of true positives and false positives; recall is the number of true positives divided by the sum of true positives and false negatives. Although human annotators do their best to extract key-phrases, it is not an easy task, and sometimes two humans do not fully agree on which phrases are the most relevant in a given text. The picture looks miserable, yet we see key-phrase extractors almost everywhere. Why? First, the relatively low precision and recall values are not that bad; they are even pretty good compared to other Information Retrieval systems. Second, users find most of the key-phrases natural, or at least somewhat informative.
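As a small worked example of the two measures defined above, here is a Python sketch. The function name `precision_recall` and the sample phrases are made up for illustration; the human-selected set plays the role of the reference.

```python
def precision_recall(extracted, gold):
    """Compare extracted key-phrases with a human-made reference set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)   # phrases found by both the system and the human
    fp = len(extracted - gold)   # phrases only the system proposed
    fn = len(gold - extracted)   # phrases the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: 2 of 4 extracted phrases match, 1 of 3 gold phrases is missed.
system = ["semantic search", "text mining", "graph", "page"]
human = ["semantic search", "text mining", "key-phrase extraction"]
p, r = precision_recall(system, human)
print(f"precision={p:.2f} recall={r:.2f}")
```

Exact string matching is a strict choice; a near-synonymous phrase counts as a miss, which is one reason the reported values tend to be low.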

Designing our new key-phrase extraction algorithm was a long journey. We wanted a more or less language-independent solution, since in Europe chances are high that you have to deal with multiple languages. First, we studied data-driven approaches that usually employ a language model (basically huge frequency tables of words or n-grams) and compare the word or n-gram frequencies of a given text to this model. Although these approaches work very well, you have to build a new model for every new language. Also, comparing frequencies can be computationally intensive, which is a problem if you want to extract key-phrases in real time.
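To make the frequency-comparison idea concrete, here is a rough Python sketch. It is only one possible scoring scheme, chosen for illustration (a log ratio of relative frequencies with crude add-one smoothing); the function name `keyness` and the background counts are hypothetical and not taken from any particular system.

```python
import math
from collections import Counter

def keyness(tokens, background_counts, background_total):
    """Score each word by how much more frequent it is in the text
    than in a background frequency table (log ratio)."""
    counts = Counter(tokens)
    total = len(tokens)
    vocabulary = len(background_counts) + 1
    scores = {}
    for word, count in counts.items():
        p_text = count / total
        # Add-one smoothing so unseen words do not cause a division by zero.
        p_background = (background_counts.get(word, 0) + 1) / (background_total + vocabulary)
        scores[word] = math.log(p_text / p_background)
    return scores

# Hypothetical background model: a plain word-frequency table.
background = {"the": 1000, "graph": 2, "ranks": 1, "nodes": 3}
tokens = ["the", "graph", "the", "graph", "ranks", "the", "nodes"]
scores = keyness(tokens, background, sum(background.values()))
print(sorted(scores, key=scores.get, reverse=True))
```

The background table is the language-specific part: it has to be rebuilt for every new language, which is the drawback mentioned above.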

Our solution

Finally, we found TextRank, an unsupervised algorithm for key-phrase extraction. TextRank runs on the text itself and requires no other input. It is a graph-based algorithm inspired by Google’s PageRank. The basic idea is that you can build a graph from the words of a text. Let every word be a node in the graph, and draw an edge between two words if they are neighbors (i.e. one occurs directly before the other) or if there is no more than one word between them. Once the graph is built, you can compute the PageRank value of each node. PageRank is a good measure because it combines authority (how many nodes link to a node) and hubness (well-connected nodes that make authority nodes accessible to many less important nodes).
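The graph construction and ranking described above can be sketched in a few lines of plain Python. This is a simplified illustration, not our production code and not the full TextRank algorithm: it uses an unweighted, undirected graph, skips tokenization and part-of-speech filtering, and the names `build_graph` and `pagerank` as well as the damping factor and iteration count are assumptions.

```python
from collections import defaultdict

def build_graph(words, max_gap=1):
    """Link two words if they are adjacent or separated by at most
    `max_gap` other words."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + max_gap + 2, len(words))):
            if words[j] != word:
                graph[word].add(words[j])
                graph[words[j]].add(word)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Plain power iteration over an undirected graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor shares its rank equally among its own neighbors.
            incoming = sum(rank[n] / len(graph[n]) for n in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

text = "graph based ranking builds a graph of words and ranks words in the graph"
ranks = pagerank(build_graph(text.split()))
for word in sorted(ranks, key=ranks.get, reverse=True)[:3]:
    print(word, round(ranks[word], 3))
```

Words that co-occur with many different words collect rank from all of them, which is why repeated, well-connected terms tend to come out on top.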

Our linguistic team modified TextRank, so now it takes some linguistic information into consideration (e.g. we are weighting the edges and we are using a directed graph). Although we haven’t reached better evaluation results than the state-of-the-art systems, we felt that our solution gives very natural key-phrases.

Think like a human and evaluate your results

Graph-based models are becoming popular models of human cognition. Studies have shown that at least a significant portion of language can be modeled by graph structures. For example, Griffiths et al. describe an experiment in which participants were asked to respond with the first word that came to mind starting with a specific letter. From a technical point of view, think of this task as giving suggestions based on the initial character. At first, one might think that the most frequent words starting with that character would best predict the answers. It turned out that word association networks are better at this task. To show this, Griffiths and his colleagues built a network from a word association database and computed the PageRank value of each of its nodes. Other experiments suggest that we organize semantic information into networks (Tenenbaum), and that language development can be described in terms of growing language networks in our mind (Ke and Yao). Language graphs have also been applied to measure the coherence of speech of patients with thought disorders. From this point of view, TextRank can be interpreted as a way of determining the central elements of a text.

We can’t build a product on our intuition, so we designed an online study which asked participants to rate key-phrases on a scale from “Totally relevant” to “Totally irrelevant”. We found that 7.6% of them were rated totally relevant, 46.4% rather relevant, 32.4% somewhat relevant, 13.2% rather irrelevant and 0.4% totally irrelevant.

It may seem that we designed an experiment just to support our claims. This is partly true. We think the good old precision and recall values are fine if you can establish a baseline. But asking humans to find a few key-phrases is asking them to rank alternatives. TextRank and our modification of the algorithm do the same thing: they rank every word according to its PageRank value. In our evaluation we wanted to know whether humans accept automatically extracted phrases as key-phrases, even if they would give them a lower rank themselves.

Quality matters: Battle tested solutions

Thu 8 October, 2015

We love software, especially if it is safe, reliable, and does its job. Our philosophy of design is simple: know your product. So we use the SCRUM/Agile methodology, we test rigorously, and we evaluate every machine learning solution.

Part of our team

Craftsmanship

Engineering is a curious mixture of science and craft. Our team has almost thirty members, some of them with 8-10 years of experience. The team structure is flat; junior staff are fully integrated into the work under the guidance of their senior peers. We are a Java shop and our engineers are keen to learn about the Java ecosystem, from the low-level details to designing architectures. As craftsmen, we know that even cutting-edge technology can be used in the wrong way. So we introduced the SCRUM methodology. We work closely with our clients, which means short iterations that provide opportunities for feedback and for rethinking the direction of the development along the way.
 

Quality matters

Using the SCRUM methodology and test-driven development reduces the likelihood of bugs, though it can’t eliminate them. We have a separate Quality Assurance team whose main purpose is to exhaustively test every product made by our software development team. When required, we hand testing over to independent testers. Our team can also be hired to do independent testing.
 

Evaluate

In the era of big data, you can’t avoid machine learning (ML) applications. These often require labeled data for supervised learning tasks. Our evaluation team has experienced annotators who can prepare labeled data for training and testing. We see evaluation as part of the quality assurance and testing process, hence we don’t sell products without an evaluation report on their ML parts.
There is no software without bugs. But it does matter where those bugs lie! The SCRUM methodology, quality assurance and evaluation help us avoid the critical ones.

 
