How To Become a Data Engineer

The demand for data engineers is growing rapidly: it reportedly increased by 45% in 2019. The median salary for data engineers in the SF Bay Area is around $160k. So the question is: how do you become a data engineer?

What Data Engineering is

Data engineering is closely related to data, as you can see from its name. But while data analytics usually means extracting insights from existing data, data engineering means building the infrastructure to deliver, store and process that data. According to The AI Hierarchy of Needs, the data engineering process sits at the very bottom: Collect, Move & Store, Data Preparation. So if your organization wants to be data/AI-driven, it should hire or train data engineers.

Introduction to Apache Airflow

Apache Airflow is an advanced tool for building complex data pipelines; it is a Swiss Army knife for any data engineer. If you look at the open positions for data engineers, you will see that experience with Apache Airflow is a must-have.

Apache Airflow was created back in 2014 at Airbnb by Maxime Beauchemin, who is also the author of Apache Superset. Since January 2019, Airflow has been a Top-Level Project of the Apache Software Foundation.

The main idea of this post is to show you how to install Apache Airflow, describe its main concepts and build your first data pipeline using this wonderful tool. Let's dive into it.


A Directed Acyclic Graph (DAG) is the core object in Apache Airflow; it describes the dependencies among your tasks.
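To make the idea of a DAG concrete, here is a minimal sketch that uses only the Python standard library (graphlib, Python 3.9+) and not Airflow's own API. The task names are hypothetical. It shows how declared dependencies determine a valid execution order and why cycles are not allowed.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# A topological sort yields an order in which every task runs
# only after all of its upstream tasks have finished.
run_order = list(TopologicalSorter(dependencies).static_order())

# A cycle makes the graph impossible to schedule, hence "acyclic".
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
except CycleError:
    cycle_detected = True
else:
    cycle_detected = False
```

A scheduler such as Airflow performs a similar ordering over the tasks you declare, with retries, scheduling and monitoring on top.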

How to Work with PostgreSQL in Python

PostgreSQL is one of the most popular open source databases. If you are building a web application, you need a database. The Python community likes PostgreSQL much as the PHP community likes MySQL. To "speak" to a PostgreSQL database, pythonistas usually use the psycopg2 library. It is written in C on top of libpq. Let's dive into it.


pip install psycopg2

Installing the psycopg2 package builds it from source, so you need a C compiler (gcc) and the development files for:

  • libpq
  • libssl

Alternatively, you can install the precompiled binary. In this case you need to execute:

pip install psycopg2-binary

Getting started
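Once the driver is installed, a first round trip to the database might look like the sketch below. The function name and the DSN values are hypothetical, and the example assumes a PostgreSQL server is reachable with those credentials.

```python
def fetch_server_version(dsn):
    """Return the version string reported by the PostgreSQL server."""
    import psycopg2  # imported lazily so the sketch loads without the driver

    conn = psycopg2.connect(dsn)
    try:
        # The cursor context manager closes the cursor when the block exits.
        with conn.cursor() as cur:
            cur.execute("SELECT version()")
            return cur.fetchone()[0]
    finally:
        # Always release the connection, even if the query fails.
        conn.close()


# Hypothetical DSN; replace the values with your own before calling:
# fetch_server_version("dbname=test user=postgres host=localhost")
```

Closing the connection in a finally block keeps the sketch safe to reuse in scripts that run many queries.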

Introduction to Window Functions in SQL

It is interesting that many people working with data have no clue about window functions in SQL. For a long time I preferred coding in Python and pandas instead of using window functions. Today I would like to introduce you to the concept of a "window" and how it relates to extracting data from a SQL database.

Window functions are applied to a subset (window) of rows related to one another. Unlike a GROUP BY operation, window functions do not reduce the number of rows. Aggregate functions such as AVG, SUM and COUNT can be used as window functions as well. Window functions are usually used for analytical tasks. The following example queries are run on a PostgreSQL database.


<function>(<expression>) OVER (
  <window range> -- optional
)
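The difference from GROUP BY is easiest to see side by side. The sketch below is an assumption-laden illustration: it uses Python's built-in sqlite3 module (SQLite 3.25 or newer) instead of PostgreSQL so that it runs without a server, and the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salary (name TEXT, department TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO salary VALUES (?, ?, ?)",
    [("Ann", "IT", 100), ("Bob", "IT", 200), ("Cid", "HR", 150)],
)

# GROUP BY collapses the rows: one result row per department.
grouped = conn.execute(
    "SELECT department, AVG(amount) FROM salary "
    "GROUP BY department ORDER BY department"
).fetchall()

# The window function keeps every row and attaches the department average.
windowed = conn.execute(
    "SELECT name, amount, AVG(amount) OVER (PARTITION BY department) "
    "FROM salary ORDER BY name"
).fetchall()

conn.close()
```

Here grouped has two rows while windowed still has three, one per employee, each carrying the average of its department.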

Looks creepy 😲 We need more practice. Let's assume that we have a salary table:

What is new in Python 3.8

Python 3.8 is scheduled for release in October 2019, but you can try it now. The latest available version is currently Python 3.8b2, which is feature-frozen. So what's new in Python 3.8?

f-string =

In Python 3.8 you can print debug information more concisely using f-strings, which were introduced in Python 3.6. You just need to add = after the expression:

>>> a = 123
>>> b = 456
>>> print(f'{a=} and {b=}')
a=123 and b=456

Assignment Expressions aka walrus operator :=

PEP 572 is legendary because Guido decided to step down as BDFL after it was accepted. The idea behind the := operator is to make code more readable and less nested. For example, say you want to check whether a key is present in a dictionary and assign its value to a variable. Typically you would do the following:

params = {'foo': 'bar'}
x = params.get('foo')
if x:
    print(x)

But in Python 3.8 it can be expressed as:

params = {'foo': 'bar'}
if x := params.get('foo'):
    print(x)
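Another place where the operator tends to help is binding a result and testing it in one step. The following sketch uses a hypothetical helper, first_number, to show the pattern with a regular expression match. Note the parentheses around the assignment, which are required when it is combined with a comparison.

```python
import re


def first_number(text):
    """Return the first integer found in text, or None if there is none."""
    # Bind the match object and test it in a single expression.
    if (match := re.search(r"\d+", text)) is not None:
        return int(match.group())
    return None


found = first_number("order 42 shipped")
missing = first_number("no digits here")
```

Comparing against None explicitly avoids a subtle trap of the plain truthiness check, which would also skip valid but falsy values such as 0 or an empty string.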

Positional-only arguments

Keyword-only arguments (PEP 3102) were proposed back in 2006 and arrived with Python 3. The idea behind this PEP is to prevent some arguments from being passed positionally to functions or methods. Let's see how it looks:

def test(a, b, *, key1, key2):
    pass

test(1, 2, 3, 4)
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
TypeError: test() takes 2 positional arguments but 4 were given

The Python interpreter raises a TypeError because the last two arguments must be passed as keyword arguments. The right way to invoke the function is:

test(1, 2, key1=3, key2=4)

Okay, but how do you impose a restriction for positional-only arguments? Hooray! Python 3.8 now lets you do this using the / symbol:

def position_only(a, b, c, d=0, /):
    pass

position_only(1, 2, 3)  # OK
position_only(1, 2, c=3)  # raises exception

Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
TypeError: position_only() got some positional-only arguments passed as keyword arguments: 'c'
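Both markers can be combined in one signature. The sketch below uses a hypothetical clamp helper: everything before / must be passed positionally, and everything after * must be passed by keyword.

```python
def clamp(value, low, high, /, *, strict=False):
    """Limit value to the range [low, high]; raise instead if strict is set."""
    if strict and not low <= value <= high:
        raise ValueError(f"{value} is outside [{low}, {high}]")
    return max(low, min(value, high))


clamped = clamp(15, 0, 10)

# Passing the keyword-only flag positionally is rejected.
try:
    clamp(15, 0, 10, True)
except TypeError:
    positional_flag_rejected = True
else:
    positional_flag_rejected = False

# Passing the positional-only parameters by name is rejected as well.
try:
    clamp(value=15, low=0, high=10)
except TypeError:
    keyword_value_rejected = True
else:
    keyword_value_rejected = False
```

One practical benefit of this design is that the names value, low and high can later be renamed without breaking any caller.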

Shared memory for IPC

Python developers can now use shared memory for interprocess communication. Previously, objects were pickled/unpickled and transported via a socket. Python 3.8 introduces a new module called shared_memory in the multiprocessing package. Let's see how to use it:

from multiprocessing import shared_memory
a = shared_memory.ShareableList(range(5))
a.shm.name
>>> 'wnsm_bd6b5302'

We use the name of the shared memory block to connect to it from a different Python shell:

from multiprocessing import shared_memory
b = shared_memory.ShareableList(name='wnsm_bd6b5302')
b
>>> ShareableList([0, 1, 2, 3, 4], name='wnsm_bd6b5302')

Wow! Great feature indeed.
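Shared memory blocks are not freed automatically, so it is worth showing the full life cycle. The sketch below attaches a second handle by name within a single process for brevity (in practice the second handle would live in another process), writes through it, and then releases the block.

```python
from multiprocessing import shared_memory

# Create the block; the system picks a unique name for it.
owner = shared_memory.ShareableList(range(5))
try:
    # Attach a second handle to the same block using its name.
    peer = shared_memory.ShareableList(name=owner.shm.name)
    peer[0] = 42            # a write through one handle...
    first_item = owner[0]   # ...is visible through the other
    peer.shm.close()        # each handle closes its own view
finally:
    owner.shm.close()
    owner.shm.unlink()      # the creator frees the block exactly once
```

Every process that attaches should call close(), while unlink() should be called only once, typically by the process that created the block.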


Python 3.8 introduced TypedDict, a type annotation for dictionaries with a fixed set of string keys and typed values:

from typing import TypedDict

class Movie(TypedDict):
    title: str
    year: int

movie: Movie = {
    'title': 'Catch me if you can',
    'year': 2002
}

Now you can use Movie in type annotations:

from typing import List

def my_function(films: List[Movie]):
    pass
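A slightly fuller sketch of how such an annotation might be used is shown below. The function titles_after and the sample data are hypothetical. Keep in mind that TypedDict is checked by static type checkers only; at runtime the values are plain dictionaries.

```python
from typing import List, TypedDict


class Movie(TypedDict):
    title: str
    year: int


def titles_after(films: List[Movie], year: int) -> List[str]:
    """Return the titles of films released after the given year."""
    return [film["title"] for film in films if film["year"] > year]


films: List[Movie] = [
    {"title": "Catch me if you can", "year": 2002},
    {"title": "Jaws", "year": 1975},
]
recent = titles_after(films, 2000)

# At runtime a TypedDict instance is an ordinary dict.
is_plain_dict = type(films[0]) is dict
```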


Those of you who are familiar with Java already know what final means. For those who are not: final prohibits changing an object. For example, if you decorate your class with final, nobody should inherit from it.

from typing import Final, final

@final
class FinalClass:
    pass

# type checker will fail because you cannot inherit from FinalClass
class InheritedFromFinal(FinalClass):
    pass

You can declare the final variable as:

birth_year: Final = 1989
birth_year = 1901  # IDE/type checker will warn you that the value should not be changed

Now you can also prohibit method overriding:

class MyClass:
    @final
    def prohibited(self):
        pass

class SecondClass(MyClass):
    def prohibited(self):  # it is prohibited!!11
        pass
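One caveat is worth a short sketch: the final decorator is a hint for static type checkers and is not enforced by the interpreter. The class names below are hypothetical.

```python
from typing import final


@final
class Base:
    pass


# A type checker such as mypy reports an error on this line,
# but the interpreter itself creates the subclass without complaint.
class Child(Base):
    pass


child_created = isinstance(Child(), Base)
```

In other words, the protection only takes effect if a type checker is part of your workflow.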

These are just some of my favourite features. If you would like to know more, take a look at the changelog.

Introduction to pandas: data analytics in Python

pandas is probably the most popular library for data analysis in the Python programming language. It is a high-level abstraction over the lower-level NumPy, whose core is written in C. I use pandas on a daily basis and really enjoy it because of its eloquent syntax and rich functionality. I hope this note will help you dive into the realm of data analysis with Python and pandas.

DataFrame and Series

In order to master pandas you have to start from scratch with two main data structures: DataFrame and Series. If you don't understand them well you won't understand pandas.


A Series is an object similar to Python's built-in list, but it differs in that each element has an associated label, the so-called index. This distinctive feature makes it resemble an associative array or dictionary (hash map).


>>> import pandas as pd
>>> my_series = pd.Series([5, 6, 7, 8, 9, 10])
>>> my_series
0     5
1     6
2     7
3     8
4     9
5    10
dtype: int64
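The dictionary-like behaviour becomes clearer with explicit labels. The following sketch assumes pandas is installed; the labels a, b and c are arbitrary.

```python
import pandas as pd

# Pass explicit labels instead of relying on the default 0..n-1 index.
my_series = pd.Series([5, 6, 7], index=["a", "b", "c"])

# Look up a value by its label, as you would with a dictionary key.
value_b = my_series["b"]

# Boolean filtering keeps the labels of the matching elements.
labels_above_five = my_series[my_series > 5].index.tolist()
```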

How To Deploy a Telegram Bot

Part 1 — How To Create a Telegram Bot Using Python 

This is the second part of my small tutorial about creating a Telegram bot using Python and Django. Today I am going to show you how to deploy our Django app on a Digital Ocean VPS running Ubuntu 14.04 LTS.

What you are going to learn:

  • How to deploy a django app (or any WSGI app) on Digital Ocean using SSH access
  • How to work with gunicorn and nginx
  • How to control your processes using supervisord
  • Setting up virtualenv using pyenv
  • Adding our app to system autorestart (init.d)

Although we are going to work with a Django app, these instructions apply to any Python WSGI web application, so do not hesitate to use them with Pyramid, Flask, Bottle or any other popular web framework in the Python ecosystem. In this post I am going to use a VPS (Virtual Private Server) from the popular Digital Ocean hosting. By the way, if you sign up using my link, you will instantly get a $10 credit on your account, which can be spent on small servers for your own purposes.

How To Create a Telegram Bot Using Python


Over the past year, Telegram has introduced tons of new features including in-app games, bots, Telegraph and Instant Views, channels, groups and many more. What is going to be the next killer feature? Nobody knows. In this humble note I would like to show you how to create a simple Telegram bot using a popular programming language called Python.

Bots are great at many things, especially at automating boring tasks. What your future bot does is up to your imagination, but today we are going to create one that communicates with Planet Python, a popular Python news aggregator. The bot will simply parse the latest content and send it back to you via Telegram.

Our app will be a Django app, and its source code is available on my GitHub in the planetpython_telegrambot repo. Feel free to fork it and do whatever you want with it :)

Celery Best Practices: practical approach

Celery is the most advanced task queue in the Python ecosystem and is usually considered the de facto standard for processing tasks concurrently in the background. The latest version is 4.0.2, and the community around Celery is large (it includes big companies such as Mozilla, Instagram and Yandex) and constantly evolving. Hence, investing your time in this technology pays off.

This article is not about how to set up Celery; installation is pretty straightforward and instructions can be found on the official website. Instead, I would like to draw your attention to best practices for working with Celery.

Tips & Tricks with Celery
