Analyzing discourse on recent issues in Hungary

, Precognox
, Precognox

The Hungarian government passed a restrictive bill against the Central European University. This was followed by a storm of social media posts and protesters flowed on the streets to express their opinion that was backed by torrents of articles in the media. We are touched by the recent issues in our country and we wanted to see the national and international discourse around it.



Global discourse on lex-CEU

We collected data from Twitter to visualize the discourse (topic models) and to show the geographic distribution of the participants of the discussion.

, Precognox

You can find the visualization of the topics here.

You can find the 3D visualization of the geographic distribution of tweets here.

We used the Twitter API to collect 7822 tweets written in English containing one of the terms ‘CEU’, ‘Central European University’ or the hashtag #istandwithceu. We used the Stanford CoreNLP tool for lemmatization and named entity extraction. Topic modelling was done by the gensim package and the interactive topic model visualization is generated with the pyldavis library. For the interactive globe visualization, we did the same search, which gave us 9745 tweets. We extracted the geo-location data from tweets to map them on Google’s Globe WebGL.

, Precognox

Local discourse on mass protests

After the CEU, the government targeted NGOs with a new proposal which requires civil organizations accepting financial help from abroad to register at the court as a “foregin funded organization” despite the fact that they are already obliged to publish their books as any other NGO in the world. Citizens responded with peaceful mass protests and the online media followed the story closely. However, the pro-government media interpreted the news in a very different way.

We collected articles related to lex CEU, the anti-NGO bill and the protest from four Hungarian news site (two independent: and and two pro-government: and We analyzed 513 articles that appeared between 01.04.2017 and 13.04.2017. We found no significant differences between the two groups at the level of text statistics (lexical diversity and the length of the articles).

Below, you can have a look at the top 150 most frequent words of each site. The word clouds were made by using Processing and the WordCram library.

, Precognox

There are no big differences between the word frequencies. So we examined the keywords of each site with the help of the fantastic AntConc software for corpus linguistics. This word cloud shows that there is a great divide between pro-government and independent media.

, Precognox

We found that the volume of coverage is much less on the pro-government sites (162 vs 351 number of articles). We used Latent Dirichlet Allocation to analyze the topics of the articles and we found that although the two sides covers the same issues, there are two big topics which can be identified by lda due to the different linguistic features of the two groups. While the pro-government media prefers terms like Soros University (CEU), Soros funded NGOs, and foreign actors, the independent media is using a more neutral language and it is using the official names of persons and institutions. You can find our interactive visualization of the topics here.

, Precognox

Topic 1 is mainly about the mass demonstrations against lex CEU and the proposed anti-NGO bill. Surprisingly, this topic contains articles exclusively from and

, Precognox

Topic 2 is also about the mass demonstrations against lex CEU and the proposed anti-NGO bill. However this topic contains articles exclusively from and due to the very different terms.


A walk into the semantic space

Having found dramatic differences between the independent and pro-government media, we were wondering how strong is the difference between the languages used by these sites. We trained a word2vec model on the corpus and plotted its 3D t-SNE projection using the threejs R package to see how the words used in the articles are related to each other.

, Precognox

You can find our visualization here.

We plotted only the most frequent five hundred words from each site. There are commonly used words in the top 500, these words occupy the central part of the plot. It seems that origo has no distinct language, as we can barely see yellow dots on the plot. Given the recent story of the site, we cannot wonder; origo has been bought by a group standing close to the government and most of its staffs left it, so the site recruited new people and started to collaborate sites on the right side of the political spectrum. Although 444 and Index covered the same stories, it seems the two sites developed their own languages.


Follow the narratives, but don’t be a solutionist

Tracking down narratives on an issue and visualizing your findings is super easy in 2017 thanks to the open source community. We love technology and we are happy whenever it can help us to see the big picture. We see there are two narratives on the same topic, but closing the gap between the two groups and starting a rational discussion between citizens is not about technology. There is no app that can help us. We hope that the followers of these distinct narratives can find a common ground and start a discussion in the real life before they lost the ability to understand each other.

Read on

Bill Bishop: The Big Sort: Why the Clustering of Like-Minded America is Tearing Us Apart, Mariner Books, 2009

Eli Pariser: The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think, Penguin Books, 2012

George Lakoff: Don’t Think of an Elephant!: Know Your Values and Frame the Debate–The Essential Guide for Progressives, Chelsea Green Publishing, 2004

Norman Fairclough: Language and Power, 3rd Edition, Routledge, 2014