Hello

My name is Tomasz Dziopa and this is my space for scribbling thoughts.

Put your docker images on a diet

Why big = bad?

Large images are bad in two ways:

  • they take up a lot of disk space. If you rebuild large images frequently, your build cache and image store grow quickly and you'll be pruning them all the time.
  • they take a lot of time to download. This is bad, because:
    • not everyone has an unlimited broadband connection! It sucks when you have to pull a fat Docker image over your cafe's shitty wifi or your mobile data allowance.
    • bandwidth can get costly. If you host your images in a private registry at your cloud provider, it's quite likely that they'll charge you for egress bytes.

So - how big is it?

docker image list | grep <image name>

will list all available tags for your image with their uncompressed sizes. This is how much the image actually takes on your hard drive.

docker manifest inspect <image> | jq ".layers | map(.size) | add / 1024 / 1024"

will output a single number: the total size in MB of the compressed layers. This is how much data needs to be downloaded over the network in the worst case (no layers cached locally).

Best practices

Generic advice is all about understanding how Docker image layers work. Each layer contains only a set of differences from the previous layer - and this applies not only to adding and updating files, but also to removing them.

In order to keep your layer sizes small, only persist changes which are relevant to the functioning of the final container. Do not persist package manager caches, downloaded installers or temporary build artifacts.

A common strategy is to chain the install and cleanup commands into a single layer definition.

Instead of doing this:

FROM ubuntu:20.04                    # 66 MB
RUN apt update                      # 27 MB
RUN apt install -y curl libpq-dev   # 16 MB
# total size: 109 MB

do:

FROM ubuntu:20.04                     # 66 MB
# removing apt cache after successful installation of dependencies
RUN apt update \                     # 16 MB
    && apt install -y curl libpq-dev \
    && rm -rf /var/lib/apt/lists/*
# total size: 82 MB

Another common approach is to use a multi-stage build. The idea is to perform the potentially heavy operations (e.g. compiling a binary from sources) in a separate stage, and then copy just the resulting artifacts into a new, "clean" image. Example:

# Stage 1: build the Rust binary.
FROM rust:1.40 as builder
WORKDIR /usr/src/myapp
COPY . .
RUN cargo install --path .

# Stage 2: install runtime dependencies and copy the binary into a clean image.

FROM debian:buster-slim
RUN apt-get update \
    && apt-get install -y extra-runtime-dependencies \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /usr/local/cargo/bin/myapp /usr/local/bin/myapp
CMD ["myapp"]

Since every package manager has its own arguments and best practices, the easiest way to manage this complexity is to use a tool which can find quick wins in your Dockerfile. The open-source hadolint analyzes your Dockerfile for common issues.

Removing cruft

If you've been developing with your Dockerfile for some time, there's a good chance you have experimented with different libraries and dependencies, some of which serve the same purpose. The general advice here is: take a step back and analyze every dependency in your Dockerfile, then throw away each and every one which is not essential to the purpose of the image.

Another very effective approach to eliminating cruft is to examine the contents of the image layers. I cannot recommend the dive CLI enough - it lets you look at the changes applied by each layer and highlights the largest and repeatedly modified files in the image.

By examining the docker build trace you may also find that you're installing GUI-specific packages (e.g. GNOME icons, extra fonts) which shouldn't be required for your console-only application. A good practice is to analyze the reverse dependencies of these packages and look for the top-level package which triggered the installation of the unexpected fonts. E.g. the openjdk-11-jre package bundles a suite of dependencies for developing UI apps. Some tips on debugging the issue:

  • just try removing the suspicious system package. The package manager should resolve the unmet dependencies and list the potential top-level offender as scheduled for removal.
  • use tools like apt-rdepends <package> to see a flattened list of packages which depend on <package>.

Base image choice

Choosing the base image is one of the very first decisions people make when creating a new image. It's quite common that the choice is driven by e.g. some example image on GitHub, or whatever is already being used by your colleagues or organization.

Changing the base image can have a significant impact on the overall image size, so it's worth revisiting from time to time. Here's a quick walkthrough of common points for consideration:

  • choosing a -slim image variant. Some distros (e.g. Debian) offer a stripped-down version, with features that are redundant for most images removed. For non-interactive console applications, you usually do not need documentation and man pages.
  • alpine images - Alpine is a container-optimized Linux distribution which uses its own package manager (apk) and an alternative libc implementation, musl. The resulting images are usually very small, although there are some limitations:
    • if your application relies on glibc, you will need to recompile it against musl.
    • Alpine's own package repository is sometimes lacking in comparison to Debian's, and versions may not be updated as frequently
    • community support is relatively small
  • distroless images - the slimming-down approach taken to the extreme, with virtually all non-essential components of the operating system (including the shell!) stripped out of Debian images. Conceptually, the application and all its dependencies should be copied into the distroless image from a previous build stage (see the example Dockerfile, more usages in kubernetes)
    • this could be a good choice if your application is statically linked or has a well-defined set of dependencies (think: a limited set of shared libraries).

Image specialization

A common issue with rapidly evolving codebases is that the images are not specialized - a single image serves as a common execution environment for very distinct use cases. For the purpose of this section, let's call these "furball" images.

The best practice is to identify the key use cases of the particular components and build dedicated images for those use cases. This way you can also encourage better separation of concerns in your codebase.

Assuming all the images retain all of their dependencies, in the best case the summed size of all specialized images will be exactly the same as that of the "furball" image. But the benefits are substantial:

  • specialization may encourage a deeper cleanup of the dependencies
  • specialization may encourage bigger changes in the organization of the codebase, like factoring the code into components
  • running a specialized workflow will require fetching a much smaller image
  • since you usually don't work on all components at the same time, you'll experience a significant improvement in quality of life - you'll be downloading just the dependencies used by your component!

Runtime tools

There have been multiple attempts to minify Docker images by analyzing the runtime usage of files within the container. This is how it's supposed to work:

  1. Spin up the container with a sidecar container tracking every access to a file inside the filesystem.
  2. Perform actions on the running container covering all of the use cases of the image.
  3. Let the tool export an image containing just the files which were accessed in the previous step.

The results of this workflow can be very good, although your mileage may vary depending on the use case. Some points for consideration:

  • the image is no longer the result of a docker build
  • the slimming-down workflow may remove files which affect the functionality or security of the image => how much do you trust the tool that the resulting image is secure and correct?
  • the image may not be suitable for any use case other than the one exercised during the slimming-down process. This may be a showstopper for images used for experimentation.

Getting started with Natural Language Processing in Polish

There are countless NLP tutorials on the internet. Why would you care to read another one? Well, we will be looking into analysing Polish, for which the tooling is still rather underdeveloped in comparison to English.

Polish is much more complex than English: it has 7 grammatical cases (deklinacja), 3 genders (masculine, feminine, neuter) and 11 different patterns of verb conjugation, not to mention the exceptions [1] [2] [3]. English is rather simple in comparison: verb conjugation is easy, nouns pluralise regularly, and grammatical gender barely exists.

Corpus

For the sake of this analysis, we will look at a collection of articles obtained (the nice word for scraped) from one of the leading right-wing news websites, wpolityce.pl.

Sample article from the dataset.

The articles were collected over a period of 15 days ending on 10th June 2019; only articles linked from the front page were collected. In total there are 1453 unique articles. I shall later publish my insights on collecting datasets from the wild.

Tokenizing

Tokens are the smallest units of language most commonly taken into account when doing text analysis. You may think of a token as a word, which in most cases it is. Things get trickier in English with its contractions like I'd or we've, but Polish is much easier in this respect.

„|Gazeta|Wyborcza|”|wielokrotnie|na| |swoich|łamach|wspierała|działania|Fundacji|Nie|lękajcie|się|Marka|Lisińskiego|.

Sample tokenization of a sentence from the corpus; tokens are delimited with vertical bars. Please note the empty token between the words na and swoich - it's a dirty whitespace token, which we'll deal with shortly.

In Python, you can tokenize using either NLTK's nltk.tokenize package or spaCy's tokenizer with its pretrained language models. I have used the latter with the xx_ent_wiki_sm multilanguage model; in empirical analysis it appeared to split up the tokens in a plausible manner.
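A minimal sketch of that setup, assuming the xx_ent_wiki_sm model has already been downloaded (python -m spacy download xx_ent_wiki_sm):

import spacy

# Load spaCy's multilanguage model; its tokenizer handles Polish text reasonably well.
nlp = spacy.load("xx_ent_wiki_sm")

doc = nlp(
    "„Gazeta Wyborcza” wielokrotnie na swoich łamach wspierała "
    "działania Fundacji Nie lękajcie się Marka Lisińskiego."
)
tokens = [token.text for token in doc]
print(tokens)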

The distribution of numbers of tokens for an article looks as follows:

The distribution of the number of tokens per article: the vertical axis shows the number of articles falling into a particular bucket, the horizontal axis is the token length of the article.
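A histogram like the one above can be produced with a few lines of matplotlib. In the sketch below, articles is a placeholder for your own list of article texts and nlp is the spaCy pipeline loaded earlier:

import matplotlib.pyplot as plt

# Number of tokens per article, computed with the spaCy pipeline from the previous snippet.
token_counts = [len(nlp(article)) for article in articles]

plt.hist(token_counts, bins=50)
plt.xlabel("tokens per article")
plt.ylabel("number of articles")
plt.show()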

Quantitative analysis: corpus

Reading the entire dataset without any cleaning yields mediocre results. The resulting corpus contains (a minimal cleanup sketch follows the list):

  • badly interpreted non-breaking space from latin1 encoding ("\xa0")
  • whitespace-only tokens ("\n", "\n \n" and others)
  • punctuation, often in inconsistent variants (multiple kinds of quote characters, hyphen/minus used interchangeably)
  • stopwords: common words that skew the word distribution but do not bring any value to the analysis.
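A cleanup pass covering the issues above could look roughly like this; the stopword list is a tiny placeholder (plug in a published Polish stopword list in practice), and raw_tokens stands for the flat list of tokens produced by the tokenizer:

import string

# Placeholder stopword list - swap in a proper Polish stopword list.
STOPWORDS = {"i", "w", "na", "z", "do", "się", "nie", "to", "że", "o"}

def clean(tokens):
    for token in tokens:
        token = token.replace("\xa0", " ").strip()   # broken non-breaking spaces
        if not token:                                # whitespace-only tokens
            continue
        if all(char in string.punctuation + "„”—–" for char in token):
            continue                                 # punctuation-only tokens
        if token.lower() in STOPWORDS:               # stopwords
            continue
        yield token

cleaned_tokens = list(clean(raw_tokens))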

Frequent words (unigrams)

The 10 most frequent tokens in the cleaned dataset:

word           frequency
PiS            1444
proc           1198
wyborach        835
powiedział      726
PAP             709
PSL             676
r.              653
Polski          652
Polsce          625
Europejskiej    601
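Counts like these can be produced with collections.Counter; cleaned_tokens below is assumed to be the flat token list from the cleanup sketch:

from collections import Counter

# Count unigram frequencies over the cleaned corpus and show the top 10.
unigram_counts = Counter(cleaned_tokens)
for word, frequency in unigram_counts.most_common(10):
    print(word, frequency)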

Word cloud

The raw data above can be visualised using a word cloud. The idea is to represent the frequency of a word by its relative size in the visualisation, framing it in a visually attractive and appealing way.

You can find plenty of online generators that let you easily tweak the shape, colors and fonts. Check out my word cloud of the 50 most common words from the corpus, generated with the generator from worditout.com.

Concluding our unigram analysis: the most common topics revolve around PiS (the abbreviation of the Polish ruling party's name) and the elections (wyborach is an inflected form of "elections", proc. is an abbreviated "percent").

Ngrams

An ngram is just a sequence of n consecutive tokens; for n=2 we also use the word bigram. You can get the most frequent ngrams (sequences of tokens of a particular length) using either nltk.ngrams or trivial custom code. For lengths 2-3 you are highly likely to fetch popular full names, while longer ngrams are likely to catch parts of frequently repeated sentences, like promos, ads and references.
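A short sketch of the counting, assuming cleaned_tokens is the flat token list from the cleanup step; change n to 6 to reproduce the sixgram table further down:

from collections import Counter
from nltk.util import ngrams

# Count bigrams (n=2) over the cleaned token stream and print the top 10.
bigram_counts = Counter(ngrams(cleaned_tokens, 2))
for bigram, frequency in bigram_counts.most_common(10):
    print(frequency, " ".join(bigram))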

Bigrams

As expected, we are catching frequent proper names of entities in the text. Below is a breakdown of 10 most frequent bigrams from the corpus:

frequency   bigram                     interpretation
377         Koalicji Europejskiej      political party
348         Parlamentu Europejskiego   name of an institution
322         PAP EOT                    source/author alias
261         4 czerwca                  important date discussed at the time
223         Koalicja Europejska        political party
136         Andrzej Duda               full name of the President of Poland
126         Jarosław Kaczyński         full name of an important politician
126         Jana Pawła                 most likely: prefix of John Paul II
125         Pawła II                   most likely: suffix of John Paul II

Sixgrams

Longer ngrams expose frequent phrases used throughout the corpus. Polish has a lot of cases and persons, so this method is not of much help to find frequent phrases in the language itself; you are more likely to find repeated parts of sentences or adverts.

frequency   sixgram
61          Kup nasze pismo w kiosku lub
61          nasze pismo w kiosku lub skorzystaj
61          pismo w kiosku lub skorzystaj z
61          w kiosku lub skorzystaj z bardzo
61          kiosku lub skorzystaj z bardzo wygodnej

Clearly, there is an issue here: our distributions and counts are skewed by the advert text present in most of the articles. Ideally we should factor these phrases out of our corpus, either at the data collection stage (improving the parser to annotate or omit these phrases) or during data preprocessing (which is what we're doing in this article).

A much better idea is to perform the ngram analysis on lemmatized text. This way you may mine more knowledge about the language itself, not just repeated phrases.

Morphological analysis

In the previous paragraph we saw that the different inflected forms of words make learning about the language harder - we are capturing information about the entire word form, with its particular gender, tense and case. This becomes particularly disruptive in Polish.

One potential solution is called morphological analysis. This is the process of mapping a word to all of its potential dictionary base forms (lexemes). Sample lemmatization:

Original word     Lemma           Tag
Dyrektywa         dyrektywa       subst:sg:nom:f
PSD2              PSD2            ign
zawiera           zawierać        fin:sg:ter:imperf
przepisy          przepis         subst:pl:nom:m3
odnoszące         odnosić         pact:sg:nom:n:imperf:aff
się               się             qub
do                do              prep:gen
płatności         płatność        subst:sg:gen:f
elektronicznych   elektroniczny   adj:pl:gen:m1:pos
realizowanych     realizować      ppas:pl:gen:m1:imperf:aff
wewnątrz          wewnątrz        adv:pos
Unii              unia            subst:sg:gen:f
Europejskiej      europejski      adj:sg:gen:f:pos
.                 .               interp

We will not be implementing this part from scratch; instead we can use resources from IPI PAN (the Polish Academy of Sciences), specifically a tool called Morfeusz.

Morfeusz is a morphosyntactic analyzer which you can use to find all of a word's lexemes and forms. The output will be slightly different from the table above: for every input word, Morfeusz will output all of its possible dictionary forms.

Usage

I assume you are running either a recent Ubuntu or Fedora. After fetching the right version from the Morfeusz download page, do the following:

tar xzvf <path to archive>
sudo cp morfeusz/lib/libmorfeusz2.so /usr/local/lib
sudo chmod a+x /usr/local/lib/libmorfeusz2.so
echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/local.conf
sudo ldconfig

Now we can download and install the Python egg from the Morfeusz download page.

easy_install <path_to_downloaded_egg>

The best thing about the Python package is that you can retrieve the result from the analyzer using just a couple of lines of Python:

import morfeusz2
m = morfeusz2.Morfeusz()
result = m.analyze(
    "Dyrektywa PSD2 zawiera przepisy odnoszące się do płatności "
    "elektronicznych realizowanych wewnątrz Unii Europejskiej."
)
# the result (truncated):
[(0, 1, ('Dyrektywa', 'dyrektywa', 'subst:sg:nom:f', ['nazwa_pospolita'], [])),
 (1, 2, ('PSD2', 'PSD2', 'ign', [], [])),
 (2, 3, ('zawiera', 'zawierać', 'fin:sg:ter:imperf', [], [])),
 (3, 4, ('przepisy', 'przepis', 'subst:pl:nom.acc.voc:m3', ['nazwa_pospolita'], [])),
 (4, 5, ('odnoszące', 'odnosić', 'pact:pl:nom.acc.voc:m2.m3.f.n:imperf:aff', [], [])),
 (4, 5, ('odnoszące', 'odnosić', 'pact:sg:nom.acc.voc:n:imperf:aff', [], [])),
 ...]

Putting it all in a pandas DataFrame is a trivial task as well:

import pandas as pd
colnames = ["token_start", "token_end", "segment", "lemma", "interp", "common", "qualifiers"]
df = pd.DataFrame.from_records([(elem[0], elem[1], *elem[2]) for elem in result], columns=colnames)

Helpful links on tackling Morfeusz:

Lemmatization

You have already seen the output of a lemmatizer at the beginning of the previous section. In contrast to Morfeusz's output, this time we want to obtain a mapping from each word to its single most probable dictionary form.
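A crude baseline - not a real tagger - is to just keep the first analysis Morfeusz returns for every segment. The sketch below assumes result holds the Morfeusz output from the previous section; it will be wrong whenever a form is ambiguous, which is exactly why taggers exist:

# Naive disambiguation: keep the first lemma seen for each surface form.
naive_lemmas = {}
for start, end, (segment, lemma, tag, _, _) in result:
    naive_lemmas.setdefault(segment, lemma)

print(naive_lemmas)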

In order to obtain proper, single tags instead of a list of candidate tags for each token, you will need to go deep into morphological tagger land. Your options are:

  • WCRFT - actually has a working demo here. Requires some skill to get it to work; the dependencies are non-trivial to put together,
  • PANTERA - doesn't appear to be actively maintained,
  • Concraft - implemented in Haskell, looks to be the easiest one to get running.

If you are just doing a casual analysis of text, you may want to consider outsourcing the whole tagging process to a RESTful API, like http://nlp.pwr.wroc.pl/redmine/projects/nlprest2/wiki/Asynapi.

You can easily reverse-engineer the exact calls to the API using the Firefox/Chrome dev tools. This way you can find the correct values for the parameters.

As with any API provided to the public, you should notify the provider about your intentions and remember about some sane throttling of requests. Moreover, it's considered good manners to set the user parameter to something meaningful (like your email address).
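A heavily hedged sketch of what such a client could look like - the endpoint URL, payload field names and call structure below are assumptions for illustration, not the documented API; check the wiki page above and your dev-tools trace for the real values:

import time
import requests

API_URL = "http://nlp.pwr.wroc.pl/"   # placeholder - use the real endpoint from the wiki

for article in articles:              # `articles` is your own list of texts
    response = requests.post(API_URL, data={
        "text": article,
        "user": "you@example.com",    # identify yourself to the provider
    })
    response.raise_for_status()
    time.sleep(1)                     # sane throttling between requests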

Conclusions

We went through the basic steps of analyzing a corpus and touched on more advanced topics, like morphological analysis and lemmatization. If you're doing a quick and dirty analysis, outsourcing these basic tasks to ready-made tools and APIs sounds like the best option. In case you need to run large-scale experiments, it makes more sense to run the tools on your own hardware. The lemmatized text might become a powerful input to more sophisticated models, like recurrent neural networks.

Citations

You're my type: Python, meet static typing

(originally published: 2020-09-19)

If you’re writing Python code in a production environment, it’s quite likely that you have been using type hints or static type checking. Why do we need these tools? How can you use them when you’re developing your own project? I hope to answer these questions in this blogpost.

The why

Python was designed as a dynamically typed language - the developer should be able to solve problems quickly, iterating in the interactive REPL, while the type system works things out transparently for them.

def checkConstraintsList2(solution, data):
    """
    This is an actual excerpt from a BEng thesis, python2 style.
    Trying to infer the data types is painful, so is debugging or
    extending this snippet.
    :param solution:
    :param data:
    """
    for slot in range(0, data.periodsPerDay * data.daysNum):
        for lecture in solution[slot]:
            for constraint in data.getConstraintsForCourse(lecture[0]):
                if slot == solution.mapKeys(constraint):
                    print "Violation lecture", lecture, "slot", slot

An ugly Python 2-style snippet from an ML project. The non-trivial data structures make reading it a horrible experience.

This approach works great for tiny, non-critical codebases with a limited number of developers. But since even Google was at some point a tiny, non-critical codebase, I'd advise you not to follow this path.

The how

PEP 484 added type comments as a standardized way of adding type information to Python code.

def weather(celsius): # type: (Optional[float]) -> Union[str, Dict[str, str]]
    if celsius is None:
        return "I don’t know what to say"
    return (
        "It’s rather warm"
        if celsius > 20 else {"opinion": "Bring back summer :/"}
    )

The separation of the identifier and type hint makes it hard to read.

This used to be the only way to add types to a Python codebase until Python 3.6 was released, adding optional type hint syntax to its grammar:

from typing import Dict, Optional, Union

def weather(celsius: Optional[float]) -> Union[str, Dict[str, str]]:
    if celsius is None:
        return "I don’t know what to say"
    return (
        "It’s rather warm"
        if celsius > 20 else {"opinion": "Bring back summer :/"}
    )

Much better. Please note that this code example introduced some additional import statements. These are not free, since the Python interpreter needs to load code for each import, resulting in some barely noticeable overhead, depending on your codebase.

If you're looking to speed up your code, Cython uses a slightly altered syntax of type annotations to compile Python to machine code for the sake of speed.

The wow

Type hints might be helpful for the developer, but humans commit errors all the time - you should definitely use a type checker to validate whether your assumptions are correct (and catch some bugs before they hit you!).
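A tiny sketch of the kind of bug a checker catches before runtime - the function below is a made-up example, and running mypy over it reports that name may be None, so the .title() call gets flagged:

from typing import Optional

def greet(name: Optional[str]) -> str:
    # A type checker (e.g. mypy) flags the call below: `name` may be None,
    # and None has no .title() method.
    return "Hello, " + name.title()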

How do you introduce types in your codebase?

Most type checkers follow an approach of gradually introducing the enforcement of type correctness. The type checker can infer variable and return types to some degree, but in this approach it's the developer's responsibility to gradually increase the coverage of strict type checking, usually module by module.

Additionally, you can either control particular features of the type system being enforced (e.g. forbidding redefinition of a variable with a different type) or select one of the predefined strictness levels. The main issue is that this is a manual process - you need to define the order of modules in which you want to annotate your codebase and, well, manually do it.

Pytype follows a completely different approach - it type checks the entire codebase by default, but takes a very permissive view of type correctness: if it's valid Python, it's OK. This definitely makes sense for older or unmaintained projects with minimal or no type hints, since it allows you to catch basic type errors very quickly, with no changes to the codebase. That sounds very tempting, but the long-term solution should be to apply stricter type checking - plain valid Python's type system is just way too permissive.

Mypy
  • Modes: feature flags for strictness checks
  • Applying to existing code: gradual
  • Implementation: daemon mode, incremental updates
  • IDE integration: incomplete LSP plugin
  • Extra points: Guido-approved

pytype
  • Modes: lenient - if it runs, it's valid
  • Applying to existing code: runs on the entire codebase, type hints completely optional
  • Implementation: a set of CLI tools
  • IDE integration: maybe someday
  • Extra points: merge-pyi automatically merges type stubs into your codebase

Pyright
  • Modes: multiple levels of strictness, enforced per project and directory
  • Applying to existing code: gradual
  • Implementation: TypeScript, daemon mode, incremental updates
  • IDE integration: LSP, Pylance vscode extension, vim
  • Extra points: snappy vscode integration, no Python required - runs on Node.js

Pyre (https://github.com/facebook/pyre-check)
  • Modes: permissive / strict modes
  • Applying to existing code: gradual
  • Implementation: daemon mode with watchman, incremental updates
  • IDE integration: VSCode, vim, emacs
  • Extra points: built-in Pysa, a static security analysis tool

Third-party libraries

Not all the libraries you use are type-annotated - that's a sad fact. There are two solutions to this problem: either annotate them yourself, or use type stubs. When you use a library, you only care about the exported data structures and function signature types. Some type checkers use an official collection of type annotations maintained as part of the Python project - typeshed (see an example type stub file). If you find a project that doesn't have type hints, you can contribute your annotations to that repo, for the benefit of everyone!

class TcpWSGIServer(BaseWSGIServer):
    def bind_server_socket(self) -> None: ...
    def getsockname(self) -> Tuple[str, Tuple[str, int]]: ...
    def set_socket_options(self, conn: SocketType) -> None: ...

The ... is an Ellipsis object. Since it's a builtin constant, the code above is valid Python, albeit useless (besides being a type stub).

Some static type checkers create pyi stubs behind the scenes. Pytype allows you to combine the inferred type stubs with the code using a single command. This is really neat - you can seed the initial types in your code with a single command - although the quality of the annotations is quite weak: since Pytype is a lenient tool, you will see the Any wildcard type a lot.

Type hints: beyond static type checking

Type correctness is not the only use case for type hints. Here you can find some neat projects making good use of type annotations:

Pydantic

Pydantic allows you to specify and validate your data model using intuitive type-annotated classes. It's super fast and is the foundation of the FastAPI web framework (request/response models and validation).

from typing import List, Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Item(BaseModel):
    name: str
    tax: Optional[float] = None
    tags: List[str] = []


@app.post("/items/", response_model=Item)
async def create_item(item: Item):
    return item

You get request and response validation for free.

Typer

Typer generates CLIs from plain type-annotated functions, taking care of argument parsing, validation and boilerplate for you.
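A minimal sketch - the hello function below is a made-up example, while typer.run and typer.echo are the library's standard entry points:

import typer

def hello(name: str, count: int = 1, shout: bool = False):
    # Typer turns the annotated parameters into CLI arguments and options.
    for _ in range(count):
        greeting = f"Hello {name}"
        typer.echo(greeting.upper() if shout else greeting)

if __name__ == "__main__":
    typer.run(hello)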

Hypothesis

Hypothesis is a library enabling property-based testing for pytest. Instead of crafting custom unit test examples, you specify strategies that generate the test cases for you. Or… you can use inferred strategies, which will sample the space of all valid values of a given type!

from hypothesis import given, infer

@given(username=infer, article_id=infer, comment=infer)
def test_adding_comment(username: str, article_id: int, comment: str):
    with mock_article(article_id):
        comment_id = add_comment(username, comment)
        assert comment_id is not None

Inferred strategies are a good start, but you should limit the search space of your examples to match your assumptions (would you expect your article_ids to be negative?)

Outro

Static type checking gives developers an additional layer of validation of their code against common type errors. With the wonders of static analysis, you can significantly reduce the number of bugs, make contributing to the code easier and catch some non-trivial security issues.

that's all

Thanks