The participants of the event had accomplished four different projects.
clustering the most popular recipes by ingredients within the ''Receptkereso”, our recipe-finder app
gender classification of Twitter users
analysis of the facebook sites of the teachers' protesting movement and the protesting meetings before the Election Office
supplementing our company's Hungarian GATE plugin with sentiment - and emotion dictionary and with Hungarian language named entity recognition gazetteeres
We can report that the Hackathon day went by in very good spirits, it ended with a tasty pizza and friendly discussion.
A long time ago, in a galaxy far, far away our data analysts were talking about the upcoming new Star Wars movie. One of them has never seen any eposide of the two trilogies before, so they decided to make the movie more accessible to this poor fellow.
The Star Wars universe is full of strange-looking characters. They are talking and beeping a lot, and sometimes it is hard to track down who talked to whom. So we took the movie scripts and started coding hard.
Recognizing names is so easy for us. But what about machines? And what if you would like to recognize names in different languages? Meltwater asked us to help their system to extract names from the first few lines of news articles (we call them by-lines) and we are currently working on solutions for Brazilian Portuguese, Arabic and Chinese.
Recognizing names is a Janus-faced problem. Have a look at the following screenshot, taken from LeMonde.fr.
Even if you don’t speak French, you can still recognize the name of its author. Humans know that it is a tradition to mention the author’s name somewhere near the title, and they are also aware that names follow special orthographic rules, i.e. parts of a name usually start with a capital letter and that there is a plethora of common given names in Western-European languages. Things are getting complicated with languages using non-Latin alphabets, like Arabic, Russian. Machines are lacking the ability of this cultural and linguistic awareness, so it is our task to make them clear.
Our team is working with translators who are either native speakers of a target language or they are speakers with native-like fluency. In every case, we are working with real-world data. First we prepare the data by annotations (start and end positions of names were tagged). Then we split our data into two distinct data sets: one for development and one for testing. The development data is our primary source of information and it is used to learn about the nature of names in a given language. What kind of names does a given language use? How many words are in a name? Does a given language use prefixes, suffixes, infixes, etc.? Is there a convention to ascribe authorship, for example “by” in English? The test data is used for the continuous evaluation of the system, i.e. literally to see if we can find names in the by-lines. Employing these methods we can achieve high precision with really low false positive rate.
We love working on multilingual solutions! Our company is located in the middle of Europe and we have accustomed to the challenges of a multilingual environment. Even our biggest client is a company offering translation memory and online services to professional translators.
Meltwater helps companies make better, more informed decisions based on insights from the outside. More than 23,000 companies use the Meltwater media intelligence platform to stay on top of billions of online conversations, extract relevant insights, and use them to strategically manage their brand and stay ahead of their competition. With 50 offices on six continents, Meltwater is dedicated to personal, global service built on local expertise. Meltwater also operates the Meltwater Entrepreneurial School of Technology (MEST), a nonprofit organization devoted to nurturing future generations of entrepreneurs. For more information, follow Meltwater on Twitter, Facebook, LinkedIn, YouTube, or visit www.meltwater.com.
KConnect provides medical-specific multilingual text processing services, consisting of semantic annotation, semantic search, search log analysis, document classification and machine translation. Here is a short video on how it works.
We are ready to help you if you have a health related search or text mining problem.
Extracting key-words and key-phrases is a common task in Enterprise Search and Information Retrieval. Key-phrases are commonly used by search engines and other indices to categorize texts, build facets, or locate specific data in documents.
Despite the fact that key-phrase extraction is a common task both in industry and in academy, the precision and recall values of the state-of-the-art systems are usually under 30% and 40%. These measures have often been calculated by using a set of key-phrases extracted by humans as a baseline. Precision is the fraction of true positives over the sum of true positives and false positives, recall is the fraction of true positives over the sum of true positives and false negatives. Although human annotators do their best to extract key-phrases, it is not an easy task and sometimes there is no full agreement between two humans on which phrases are the most relevant in a given text. The picture looks to be miserable, but we can see key-phrase extractor almost everywhere. Why? First, the relatively low precision and recall values are not that bad, even they are pretty good compared to other Information Retrieval systems. Second, users find most of the key-phrases natural, or at least somehow informative.
Designing our new key-phrase extracting algorithm was a long journey. We wanted a more or less language independent solution, since in Europe, chances are high that you have to deal with multiple languages. First, we studied data-driven approaches that usually employ a language model (basically huge frequency tables of words or n-grams) and compares the word or n-gram frequencies of a given text to this model. Although these approaches are very good, you have to build a new model for every new language. Also, comparing frequencies can be computationally intensive, even if you want to extract key-phrases in real time.
Finally, we found TextRank, an unsupervised algorithm for key-phrase extraction. TextRank runs on texts and it requires no other inputs. It is a graph-based algorithm inspired by Google’s PageRank. The basic idea is that you can build a graph from the words of a text. Let every word be a node in the graph, you can draw an edge between two words, if they are neighbors (i.e. occurring in a sequence, one before the other) or there is no more than one word between them. This way you can build up a graph, so you can compute the PageRank value of each node. PageRank is a good measure, because it combines the authority (how many node links to it) and hubness (well connected nodes that make authority nodes accessible for many-many less important nodes).
Our linguistic team modified TextRank, so now it takes some linguistic information into consideration (e.g. we are weighting the edges and we are using a directed graph). Although we haven’t reached better evaluation results than the state-of-the-art systems, we felt that our solution gives very natural key-phrases.
Graph-based models are becoming popular models of human cognition. Studies showed that at least a significant portion of language can be modeled by graph structures. E.g. Griffiths et al. describes an experiment that asked participants to respond with the first word that came into their mind that starts with a specific letter. From a technical point of view, think of this task as giving suggestions based on the initial character. At first, one can think that the most frequent word starting with that character would be great to predict the answers. It has been found that word association networks are better at this task. Griffiths and his fellows built a network from a word association database and computed the PageRank value for each of its nodes to show that. Other experiments suggests that we organize semantic information into networks (Tenenbaum), and language development can be described in terms of growing language networks in our mind (Ke and Yao). Language graphs have been applied to measure the coherence of speech of patients with thought disorders. From this point of view, TextRank can be interpreted as a way of determining the central elements of a text.
We can’t build a product on our intuition, so we designed an online study which asked participants to rate key-phrases on a scale from “Totally relevant” to “Totally irrelevant” and we found 7.6% percent of them are totally relevant, 46.4% are rather relevant, 32.4% somehow relevant, 13.2% are rather irrelevant and 0.4% are totally irrelevant.
It seems that we designed an experiment just to support our claims. This is partly true. We think the good old precision and recall values are good if you can make a baseline. But asking humans to find a few key-phrases is asking them to rank alternatives. TextRank and our modification of the algorithm do the same thing, it ranks every word according to its PageRank value. During our evaluation task we would like to know if humans can accept automatically extracted phrases, i.e. although humans can accept them as key-phrases they render them to a lower rank.