Getting started with Natural Language Processing in Polish

Corpus
Tokenizing
Quantitative analysis: corpus
- Frequent words (unigrams)
  - Word cloud
- Ngrams
  - Bigrams
  - Sixgrams
Morphological analysis
- Usage
- Lemmatization
Conclusions
Citations

There are countless tutorials on how to do NLP on the internet. Why would you care to read another one? Well, we will be looking into analysing Polish, a language for which NLP tooling is still rather underdeveloped in comparison to English.

Polish is much more complex than English: it has 7 grammatical cases (declension, deklinacja), 3 genders (masculine, feminine, neuter) and 11 different templates of verb conjugation, not to mention the exceptions [1] [2] [3]. English is comparatively simple: verb conjugation is easy, nouns pluralise easily, and grammatical gender is almost absent.

Corpus

For the sake of this analysis, we will look at a collection of articles obtained (the nice word for scraped) from one of the leading right-wing news websites, wpolityce.pl.

Sample article from the dataset.

The articles were collected over a period of 15 days ending 10th June 2019; only articles linked from the front page were collected. In total there are 1453 unique articles. I shall later publish my insights on collecting datasets from the wild.

Tokenizing

Tokens are the smallest units of language most commonly taken into account in text analysis. You may think of a token as a word, which in most cases it is. Things get trickier in English with its contractions like I'd or we've, but Polish is much easier in this respect.

„|Gazeta|Wyborcza|”|wielokrotnie|na| |swoich|łamach|wspierała|działania|Fundacji|Nie|lękajcie|się|Marka|Lisińskiego|.

Sample tokenization of a sentence from the corpus; tokens are delimited with vertical bars. Note the empty token between the words na and swoich: it is a dirty whitespace character, which we'll deal with shortly.

In Python, you can tokenize using either NLTK's nltk.tokenize package or spaCy's tokenizer with pretrained language models. I used the latter with the xx_ent_wiki_sm multi-language model. On manual inspection, it appeared to split the text into tokens in a plausible manner.
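To make the idea of a token concrete without installing any model, here is a minimal regex-based sketch. It is not spaCy's algorithm, only a rough stand-in: the pattern and the tokenize helper are my own assumptions, and it will, for example, split an abbreviation like r. into two tokens.

```python
import re

# A token is either a run of word characters (Unicode-aware, so Polish
# diacritics are covered) or a single non-space, non-word character.
# Whitespace, including the non-breaking space, is skipped entirely.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens (rough sketch)."""
    return TOKEN_RE.findall(text)


sentence = "„Gazeta Wyborcza” wielokrotnie na\xa0swoich łamach."
print("|".join(tokenize(sentence)))
```

Unlike the spaCy output above, this sketch silently drops the dirty whitespace instead of emitting it as a token, which hides the problem rather than exposing it.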

The distribution of numbers of tokens for an article looks as follows:

The distribution of the number of tokens per article: the vertical axis shows the number of articles falling into a particular bucket, the horizontal axis is the length of the article in tokens.

Quantitative analysis: corpus

Reading the entire dataset without any cleaning yields mediocre results. The raw corpus contains:

badly interpreted non-breaking spaces from latin1 encoding ("\xa0")
whitespace-only tokens ("\n", "\n \n" and others)
punctuation, including inconsistent variants of the same mark (several kinds of quote characters, hyphen and minus used interchangeably)
stopwords: common words that skew the word distribution but do not bring any value to the analysis.
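The four problems listed above can be handled in a single filtering pass. The sketch below is only an illustration: the STOPWORDS set is a tiny hypothetical sample (a real Polish stopword list is far longer), and the PUNCTUATION set is my own guess at the characters worth dropping.

```python
import string

# Hypothetical, heavily shortened stopword list used only for illustration.
STOPWORDS = {"i", "w", "na", "z", "się", "że", "to", "do"}

# ASCII punctuation plus typographic quotes, en/em dashes and the ellipsis.
PUNCTUATION = set(string.punctuation) | set("„”“’«»\u2013\u2014\u2026")


def clean_tokens(tokens):
    """Drop whitespace-only, punctuation-only and stopword tokens."""
    cleaned = []
    for token in tokens:
        # Normalise stray non-breaking spaces, then trim the token.
        token = token.replace("\xa0", " ").strip()
        if not token:
            continue  # whitespace-only token
        if all(char in PUNCTUATION for char in token):
            continue  # punctuation, including repeated marks like "--"
        if token.lower() in STOPWORDS:
            continue
        cleaned.append(token)
    return cleaned


raw = ["„", "Gazeta", "Wyborcza", "”", "wielokrotnie", "na", "\xa0",
       "swoich", "łamach", "\n \n", "--", "."]
print(clean_tokens(raw))
```

Stopwords are compared in lower case, but the surviving tokens keep their original casing, so names like PiS stay recognisable in the counts.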

Frequent words (unigrams)

10 most frequent tokens in a clean dataset

word	frequency
PiS	1444
proc	1198
wyborach	835
powiedział	726
PAP	709
PSL	676
r.	653
Polski	652
Polsce	625
Europejskiej	601
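A table like the one above takes only a few lines once the tokens are clean. This is a minimal sketch using collections.Counter from the standard library; the most_frequent helper and the toy token list are made up for the example and are not the corpus data.

```python
from collections import Counter


def most_frequent(tokens, n=10):
    """Return the n most common tokens with their counts."""
    return Counter(tokens).most_common(n)


# Toy input standing in for the cleaned corpus tokens.
tokens = ["PiS", "wyborach", "PiS", "PSL", "PiS", "wyborach"]
for word, frequency in most_frequent(tokens, n=2):
    print(f"{word}\t{frequency}")
```

Counting is case-sensitive here, so Polski and polski would be tallied separately.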

Word cloud

The raw data above can be visualised using a word cloud. The idea is to represent the frequency of a word with its relative size in the visualisation, framing it in a visually appealing way.

You can find plenty of online generators that will easily let you tweak the shape, colors and fonts. Check out my word cloud of the 50 most common words from the corpus, generated using the generator from worditout.com.

Sample word cloud.

Concluding our unigram analysis, the most common topics revolved around PiS (the abbreviated name of the Polish ruling party) and elections (wyborach is a form of the word for elections, proc. is an abbreviation of percent).

Ngrams

An ngram is just a sequence of n consecutive tokens; for n=2 we also use the word bigram. You can get the most frequent ngrams of a particular length using either nltk.ngrams or trivial custom code. For lengths 2-3 you are highly likely to fetch popular full names. Longer ngrams are likely to catch parts of frequently repeated sentences, like promos, ads and references.
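The "trivial custom code" could look like the sketch below. The function names and the toy token list are my own; the sliding window is built by zipping the token list with shifted copies of itself.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens."""
    # zip stops at the shortest slice, so incomplete windows are dropped.
    return zip(*(tokens[i:] for i in range(n)))


def most_frequent_ngrams(tokens, n, top=10):
    """Count ngrams of length n and return the most common ones."""
    return Counter(ngrams(tokens, n)).most_common(top)


# Toy input; in practice this would be the token list of one article.
tokens = ["Koalicji", "Europejskiej", "i", "Parlamentu", "Europejskiego",
          "oraz", "Koalicji", "Europejskiej"]
print(most_frequent_ngrams(tokens, n=2, top=1))
```

If you feed it the whole corpus as one flat list, some ngrams will span the boundary between two articles, so it is probably safer to count per article and sum the counters.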

Bigrams

As expected, we are catching frequent proper names of entities in the text. Below is a breakdown of the most frequent bigrams from the corpus:

frequency	bigram	interpretation
377	Koalicji Europejskiej	political party
348	Parlamentu Europejskiego	name of institution
322	PAP EOT	source/author alias
261	4 czerwca	important date discussed at the time
223	Koalicja Europejska	political party
136	Andrzej Duda	full name of President of Poland
126	Jarosław Kaczyński	full name of important politician
126	Jana Pawła	most likely: prefix from John Paul II
125	Pawła II	most likely: suffix from John Paul II

Sixgrams

Longer ngrams expose frequent phrases used throughout the corpus. Polish has a lot of cases and persons, so this method is not of much help to find frequent phrases in the language itself; you are more likely to find repeated parts of sentences or adverts.

frequency	sixgram
61	Kup nasze pismo w kiosku lub
61	nasze pismo w kiosku lub skorzystaj
61	pismo w kiosku lub skorzystaj z
61	w kiosku lub skorzystaj z bardzo
61	kiosku lub skorzystaj z bardzo wygodnej

Clearly, there is an issue here. Our distributions and counts are skewed by advert texts repeated across many of the articles. Ideally we should factor these phrases out of our corpus, either at the data collection stage (improving the parser to annotate or omit these phrases) or during data preprocessing (which is what we're doing in this article).
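One way to find such phrases during preprocessing is to count in how many articles each long ngram appears, rather than how often it appears overall. The sketch below is an assumption on my part, not what was used for this article: the boilerplate_ngrams helper and the min_share threshold are hypothetical and would need tuning on real data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def boilerplate_ngrams(articles, n=6, min_share=0.03):
    """Return ngrams that occur in at least min_share of the articles."""
    document_frequency = Counter()
    for tokens in articles:
        # set() makes each ngram count once per article.
        document_frequency.update(set(ngrams(tokens, n)))
    threshold = min_share * len(articles)
    return {gram for gram, count in document_frequency.items()
            if count >= threshold}


advert = ["Kup", "nasze", "pismo", "w", "kiosku", "lub"]
articles = [
    advert + ["a"],
    ["b"] + advert,
    ["c", "d", "e", "f", "g", "h", "i"],
]
print(boilerplate_ngrams(articles, n=6, min_share=0.5))
```

The flagged ngrams can then be reviewed by hand and the matching sentences removed before counting.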

A much better idea is to perform ngram analysis on lemmatized text. This way you may mine more knowledge about the language, not just repeated phrases.

Morphological analysis

In the previous section we saw that the different forms of a word make learning about the language harder: we are capturing the entire word form, with its particular gender, tense and case. This becomes particularly disruptive in Polish.

One potential solution is called morphological analysis. This is the process of mapping a word to all of its potential dictionary base forms (lexemes). Sample lemmatization:

Original word	Lemma	Tag
Dyrektywa	dyrektywa	subst:sg:nom:f
PSD2	PSD2	ign
zawiera	zawierać	fin:sg:ter:imperf
przepisy	przepis	subst:pl:nom:m3
odnoszące	odnosić	pact:sg:nom:n:imperf:aff
się	się	qub
do	do	prep:gen
płatności	płatność	subst:sg:gen:f
elektronicznych	elektroniczny	adj:pl:gen:m1:pos
realizowanych	realizować	ppas:pl:gen:m1:imperf:aff
wewnątrz	wewnątrz	adv:pos
Unii	unia	subst:sg:gen:f
Europejskiej	europejski	adj:sg:gen:f:pos
.	.	interp

We will not be implementing this part from scratch. Instead, we can use resources from IPI PAN (the Institute of Computer Science of the Polish Academy of Sciences), specifically a tool called Morfeusz.

Morfeusz is a morphosyntactic analyzer which you can use to find all the lexemes and forms of a word. The output will be slightly different from the table above: for every input word, Morfeusz will output all of its possible dictionary forms.

Usage

I assume you are running either a recent Ubuntu or Fedora. After fetching the right version from Morfeusz download page do the following:

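tar xzfv <path to archive>
# install the shared library where the linker configuration below expects it
sudo cp morfeusz/lib/libmorfeusz2.so /usr/local/lib
sudo chmod a+x /usr/local/lib/libmorfeusz2.so
# "sudo echo ... > file" would perform the redirect as the normal user, so use tee
echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/local.conf
sudo ldconfig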

Now we can download and install the Python egg from the Morfeusz download page.

easy_install <path_to_downloaded_egg>

The best thing about the Python package is that you can retrieve the result from the analyzer using just a couple of lines of Python:

import morfeusz2

m = morfeusz2.Morfeusz()
# Note the British spelling: the method is analyse(), not analyze().
text = "Dyrektywa PSD2 zawiera przepisy odnoszące się do płatności elektronicznych realizowanych wewnątrz Unii Europejskiej."
result = m.analyse(text)
# Each element is (start, end, (segment, lemma, tag, names, qualifiers)).
# The beginning of the result:
# [(0, 1, ('Dyrektywa', 'dyrektywa', 'subst:sg:nom:f', ['nazwa_pospolita'], [])),
#  (1, 2, ('PSD2', 'PSD2', 'ign', [], [])),
#  (2, 3, ('zawiera', 'zawierać', 'fin:sg:ter:imperf', [], [])),
#  (3, 4, ('przepisy', 'przepis', 'subst:pl:nom.acc.voc:m3', ['nazwa_pospolita'], [])),
#  (4, 5, ('odnoszące', 'odnosić', 'pact:pl:nom.acc.voc:m2.m3.f.n:imperf:aff', [], [])),
#  (4, 5, ('odnoszące', 'odnosić', 'pact:sg:nom.acc.voc:n:imperf:aff', [], [])),
#  ...]

Putting it all in a pandas DataFrame is a trivial task as well:

import pandas as pd
colnames = ["token_start", "token_end", "segment", "lemma", "interp", "common", "qualifiers"]
df = pd.DataFrame.from_records([(elem[0], elem[1], *elem[2]) for elem in result], columns=colnames)
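The start and end indices are what tie alternative interpretations to the same word: two rows with the same (start, end) pair are competing readings of one segment. The sketch below groups them using only the standard library. The rows are copied from the sample output above as literals so that it runs without Morfeusz installed; interpretations_by_segment is a helper name I made up.

```python
from collections import defaultdict

# First rows of the sample analyser output, copied as literals.
result = [
    (0, 1, ("Dyrektywa", "dyrektywa", "subst:sg:nom:f", ["nazwa_pospolita"], [])),
    (1, 2, ("PSD2", "PSD2", "ign", [], [])),
    (2, 3, ("zawiera", "zawierać", "fin:sg:ter:imperf", [], [])),
    (3, 4, ("przepisy", "przepis", "subst:pl:nom.acc.voc:m3", ["nazwa_pospolita"], [])),
    (4, 5, ("odnoszące", "odnosić", "pact:pl:nom.acc.voc:m2.m3.f.n:imperf:aff", [], [])),
    (4, 5, ("odnoszące", "odnosić", "pact:sg:nom.acc.voc:n:imperf:aff", [], [])),
]


def interpretations_by_segment(result):
    """Group (lemma, tag) pairs by the segment they interpret."""
    grouped = defaultdict(list)
    for start, end, (segment, lemma, tag, _names, _qualifiers) in result:
        grouped[(start, end, segment)].append((lemma, tag))
    return dict(grouped)


grouped = interpretations_by_segment(result)
for (start, end, segment), readings in grouped.items():
    if len(readings) > 1:
        print(f"{segment} is ambiguous: {len(readings)} readings")
```

In this sample both readings of odnoszące share the lemma odnosić, so the ambiguity only affects the tag; in general the lemmas can differ too, which is why a tagger is needed to pick one.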


Lemmatization

You have already seen the output of a lemmatizer at the beginning of the previous section. In contrast to Morfeusz's output, this time we want to map each word to its single most probable dictionary form.

In order to obtain a single tag instead of a list of tags for each token, you will need to go deep into the land of morphological taggers. Your options are:

WCRFT - has a working demo. Requires some skill to get working, as its dependencies are non-trivial to put together
PANTERA - doesn't appear to be actively maintained
Concraft - implemented in Haskell, looks to be the easiest one to get running

If you are just doing a casual analysis of text, you may want to consider outsourcing the whole tagging process to a RESTful API, like http://nlp.pwr.wroc.pl/redmine/projects/nlprest2/wiki/Asynapi.

You can easily reverse-engineer the exact calls to the API using Firefox/Chrome dev tools. This way you can find the correct values for the parameters.

As with any API provided to the public, you should notify the provider about your intentions and remember about some sane throttling of requests. Moreover, it's considered good manners to set the user parameter to something meaningful (like email address).
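Sane throttling can be as simple as enforcing a minimum delay between consecutive requests. The sketch below is generic and does not call the tagging service at all: the Throttle class, the interval and the commented-out send_to_tagger call are all hypothetical, and the real request format has to be taken from the API's own documentation.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# A tiny interval keeps the demo fast; a real run would use a second or more.
throttle = Throttle(min_interval=0.01)
texts = ["Ala ma kota.", "Kot ma Alę."]
processed = []
for text in texts:
    throttle.wait()
    # send_to_tagger(text, user="you@example.com")  # hypothetical API call
    processed.append(text)
print(len(processed))
```

The clock and sleep functions are injectable so that the waiting logic can be checked without actually sleeping.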

Conclusions

We went through the basic steps of analyzing a corpus and touched on more advanced topics, like morphological analysis and lemmatization. If you're doing a quick and dirty analysis, outsourcing those basic tasks to ready-made tools and APIs sounds like the best option. If you need to run large-scale experiments, it makes more sense to run the tools on your own hardware. The lemmatized text might become a powerful input to more sophisticated models, like recurrent neural networks.