I usually use RQ with Django because RQ is one of the most popular and straightforward task queues in the Python ecosystem. A few months ago I stumbled upon a situation where I needed to test code that used Django signals. The scenario was simple: when a signal is emitted (a Django model object has been deleted), a receiver listening to that signal enqueues an RQ job using delay. The problem is that the corresponding object is created by a pytest fixture and is deleted when the test finishes. The first straightforward solution was to patch the RQ job, but if we add more receivers in the future we must not forget to patch them all (which hurts readability and code clarity). I decided to apply another solution that replaces the connection class based on a condition: having FAKE_REDIS set to True inside settings.py.
Take a look at the code (the get_queue helper below is assumed to come from django-rq):
import typing as t

from django.conf import settings
from django_rq.queues import get_queue  # assumed import: get_queue is provided by django-rq
from rq.decorators import job as rq_job_decorator


def async_job(queue_name: str, *args: t.Any, **kwargs: t.Any) -> t.Any:
    """
    The same as RQ's job decorator, but it automatically replaces the
    ``connection`` argument with a fake one if ``settings.FAKE_REDIS``
    is set to ``True``.
    """
    class LazyAsyncJob:
        def __init__(self, f: t.Callable[..., t.Any]) -> None:
            self.f = f
            self.job: t.Optional[t.Callable[..., t.Any]] = None

        def setup_connection(self) -> t.Callable[..., t.Any]:
            # Build the real RQ job lazily, on first use
            if self.job:
                return self.job
            if settings.FAKE_REDIS:
                from fakeredis import FakeRedis
                queue = get_queue(queue_name, connection=FakeRedis())  # type: ignore
            else:
                queue = get_queue(queue_name)
            RQ = getattr(settings, 'RQ', {})
            default_result_ttl = RQ.get('DEFAULT_RESULT_TTL')
            if default_result_ttl is not None:
                kwargs.setdefault('result_ttl', default_result_ttl)
            return rq_job_decorator(queue, *args, **kwargs)(self.f)

        def delay(self, *args: t.Any, **kwargs: t.Any) -> t.Any:
            self.job = self.setup_connection()
            return self.job.delay(*args, **kwargs)  # type: ignore

        def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:
            self.job = self.setup_connection()
            return self.job(*args, **kwargs)

    return LazyAsyncJob
In order to use this code you have to decorate all your jobs with the async_job decorator:
@async_job('default')
def process_image():
    pass
To apply the FAKE_REDIS setting to all tests, use the following fixture:
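A minimal sketch of such a fixture, assuming pytest-django is installed (its settings fixture is what patches Django settings per test):

import pytest

@pytest.fixture(autouse=True)
def fake_redis(settings):
    # ``settings`` comes from pytest-django; autouse applies this to every test
    settings.FAKE_REDIS = True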
Python has a wonderful package for recording logs called logging. You can find it in the standard library. Many developers think that it is overcomplicated and not pythonic. In this post I will try to convince you that this is not true. We are going to explore how the package is structured, discover its main concepts and look at code examples. At the end of this post you will find code which sends log records to Telegram. Let's get started.
Why Logs
Logs are an X-ray of your program's runtime. The more detailed your logs are, the easier it is to debug non-standard cases. One of the most popular examples is a webserver's access log. Take a look at nginx's log output from this blog:
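The exact output will vary, but a line in nginx's default combined format looks roughly like this (the values below are made up for illustration):

93.180.71.3 - - [17/May/2015:08:05:32 +0000] "GET /about/ HTTP/1.1" 200 5316 "-" "Mozilla/5.0 (X11; Linux x86_64)"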
Nginx also has an error log which contains errors encountered while processing HTTP requests. Your program's logs can likewise be divided into normal logs, debug logs and error logs.
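For illustration, here is how the standard logging module expresses those categories as levels (a minimal sketch):

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

logger.debug('debug log: detailed internal state')  # verbose diagnostics
logger.info('normal log: request processed')        # routine events
logger.error('error log: something went wrong')     # failures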
MySQL is one of the most popular relational databases in the world. I started my career as a web developer and used PHP with MySQL intensively in the past. When I transitioned to Python I wanted to keep working with MySQL. This post is a small guide for those who want to query MySQL using Python. I have written a similar article about PostgreSQL & Python.
Installation
We are going to use the PyMySQL package. It is a pure-Python implementation, hence you do not need to install any additional C libraries or headers. To install the package run:
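pip install PyMySQL

Once installed, connecting and running a query looks roughly like this (host, credentials, and database name below are placeholders for your own server):

import pymysql

# Placeholder connection parameters; adjust to your MySQL server.
connection = pymysql.connect(host='localhost', user='user',
                             password='secret', database='mydb')
try:
    with connection.cursor() as cursor:
        cursor.execute('SELECT VERSION()')
        print(cursor.fetchone())
finally:
    connection.close()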
The demand for data engineers is growing rapidly. According to hired.com, the demand increased by 45% in 2019. The median salary for data engineers in the SF Bay Area is around $160k. So the question is: how do you become a data engineer?
What Data Engineering is
Data engineering is closely related to data, as you can see from its name. But while data analytics usually means extracting insights from existing data, data engineering means building the infrastructure to deliver, store and process that data. According to The AI Hierarchy of Needs, data engineering sits at the very bottom: Collect, Move & Store, Data Preparation. So if your organization wants to be data/AI-driven, it should hire or train data engineers.
Apache Airflow is an advanced tool for building complex data pipelines; it is a Swiss Army knife for any data engineer. If you look at open positions for data engineers, you will see that experience with Apache Airflow is a must-have.
Apache Airflow was created back in 2014 at Airbnb by Maxime Beauchemin, who is also the author of Apache Superset. Since January 2019 Airflow has been a top-level project under the Apache Software Foundation.
The main idea of this post is to show you how to install Apache Airflow, describe its main concepts and build your first data pipeline using this wonderful tool. Let's dive into it.
DAG
A Directed Acyclic Graph (DAG) is the core object in Apache Airflow; it describes the dependencies among your tasks.
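To make this concrete, here is a minimal sketch of a DAG with two dependent tasks (the dag_id, schedule and bash commands are illustrative; the imports follow the Airflow 1.x layout):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='my_first_dag',
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
)

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

extract >> load  # ``load`` runs only after ``extract`` succeeds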
PostgreSQL is one of the most popular open source databases. If you are building a web application you need a database. The Python community likes PostgreSQL much as the PHP community likes MySQL. In order "to speak" with a PostgreSQL database, pythonistas usually use the psycopg2 library. It is written in the C programming language on top of libpq. Let's dive into it.
Installation
pip install psycopg2
If you install psycopg2 this way, you need a compiler (gcc) and additional development headers:
libpq
libssl
But you can also install a precompiled binary; in this case you need to execute:
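pip install psycopg2-binary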
It is interesting that many people working with data have no clue about window functions in SQL. For a long time I preferred coding in Python and pandas instead of using window functions. Today I would like to introduce you to the concept of a "window" and how it relates to extracting data from a SQL database.
Window functions are applied to a subset (window) of rows that are related to one another. In contrast to a GROUP BY operation, window functions do not reduce the number of rows. Aggregate functions like AVG, SUM and COUNT can be used as window functions as well. Usually, window functions are used for analytical tasks. The following example queries are run against a PostgreSQL database.
Syntax
<function>(<expression>) OVER (
<window>
<sorting>
<window range> -- optional
)
Looks creepy 😲 We need more practice. Let's assume that we have a salary table:
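For illustration, suppose the table has employee, department, and salary columns (the exact schema here is an assumption). Annotating every row with its department's average salary then looks like this:

SELECT
    employee,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) AS department_avg
FROM salary;

Note that, unlike GROUP BY, every original row is preserved; the aggregate is simply attached to each of them.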
Python 3.8 is going to be released in October 2019, but you can get a taste of it now. Currently the latest available version is Python 3.8b2 (which is feature-frozen). So what's new in Python 3.8?
f-string =
In Python 3.8 you can output debug information more eloquently using the f-string feature (introduced back in Python 3.6). You just need to add = to it:
>>> a = 123
>>> b = 456
>>> print(f'{a=} and {b=}')
a=123 and b=456
Assignment Expressions aka walrus operator :=
PEP572 is legendary because Guido decided to step down as BDFL after it was accepted. The idea behind the := operator is to make code more readable and less nested. For example, suppose you want to check if a key is present in a dictionary and assign its value to a variable. Typically you would do the following:
params = {'foo': 'bar'}
x = params.get('foo')
if x:
    print(x)
But in Python 3.8 it can be expressed as:
params = {'foo': 'bar'}
if x := params.get('foo'):
    print(x)
Positional-only arguments
Keyword-only arguments (PEP3102) were introduced in Python 3 back in 2006. The idea behind that PEP is to restrict some arguments of functions or methods from being passed positionally. Let's see how it looks:
def test(a, b, *, key1, key2):
    pass
test(1,2,3,4)
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    test(1,2,3,4)
TypeError: test() takes 2 positional arguments but 4 were given
The Python interpreter warns that the last two arguments should be passed as keyword arguments. The right way to invoke the function is:
test(1, 2, key1=3, key2=4)
Okay, but how do you impose a restriction the other way, for positional-only arguments? Hooray! Python 3.8 lets you do this using the / symbol:
def position_only(a, b, c, d=0, /):
    pass
position_only(1, 2, 3) # OK
position_only(1, 2, c=3) # raises exception
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    position_only(1,2,c=3)
TypeError: position_only() got some positional-only arguments passed as keyword arguments: 'c'
Shared memory for IPC
Now Python developers can use a shared memory mechanism for interprocess communication. Previously, objects were pickled/unpickled and transported via a socket. In Python 3.8 a new module called shared_memory was introduced in the multiprocessing package. Let's see how to use it:
>>> from multiprocessing import shared_memory
>>> a = shared_memory.ShareableList(range(5))
>>> a.shm.name
'wnsm_bd6b5302'
We use the name of the shared memory block in order to connect to it from a different Python shell:
>>> from multiprocessing import shared_memory
>>> b = shared_memory.ShareableList(name='wnsm_bd6b5302')
>>> b
ShareableList([0, 1, 2, 3, 4], name='wnsm_bd6b5302')
Wow! Great feature indeed.
TypedDict
Python 3.8 introduced TypedDict, a type annotation for dictionaries with a fixed set of string keys:
from typing import TypedDict

class Movie(TypedDict):
    title: str
    year: int

movie: Movie = {
    'title': 'Catch me if you can',
    'year': 2002
}
Now you can use Movie in type annotations:
from typing import List

def my_function(films: List[Movie]):
    pass
final
Those of you who are familiar with Java already know what final means. For those who do not: final prohibits changing an object. For example, if you decorate your class with @final, nobody should inherit from it.
from typing import Final, final

@final
class FinalClass:
    pass

# the type checker will fail here because you cannot inherit from FinalClass
class InheritedFromFinal(FinalClass):
    pass
You can declare a final variable like this:
birth_year: Final = 1989
birth_year = 1901 # IDE/type checker will warn you that the value should not be changed
Now you can also prohibit method overriding:
class MyClass:
    @final
    def prohibited(self):
        pass

class SecondClass(MyClass):
    def prohibited(self):  # it is prohibited!!11
        pass
These are just a few of my favourite features; if you would like to know more, take a look at the changelog.
pandas is probably the most popular library for data analysis in the Python programming language. The library is a high-level abstraction over low-level NumPy, which is written in pure C. I use pandas on a daily basis and really enjoy it because of its eloquent syntax and rich functionality. I hope this note helps you dive into the data analysis realm with Python and pandas.
DataFrame and Series
In order to master pandas you have to start with its two main data structures: DataFrame and Series. If you don't understand them well, you won't understand pandas.
Series
A Series is an object similar to Python's built-in list data structure, but it differs in that each element has an associated label, the so-called index. This distinctive feature makes it look like an associative array or dictionary (a hashmap representation).
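A quick sketch of that duality (the values and labels here are arbitrary):

import pandas as pd

# Each value gets an index label, dictionary-style.
s = pd.Series([5, 6, 7], index=['a', 'b', 'c'])
print(s['b'])     # 6 -- access by label, like a dict
print(s.iloc[1])  # 6 -- access by position, like a list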
How to deploy a Django app (or any WSGI app) on Digital Ocean using SSH access
How to work with gunicorn and nginx
How to control your processes using supervisord
Setting up virtualenv using pyenv
Adding our app to system autorestart (init.d)
Even though we are going to work with a Django app, these instructions are applicable to any Python WSGI web application, so do not hesitate to use them with Pyramid, Flask, Bottle or any other popular web framework in the Python ecosystem. In this post I am going to use a VPS (Virtual Private Server) on the popular Digital Ocean hosting. By the way, if you sign up using my link, you will instantly get a $10 credit on your account; the money can be used to buy small servers and use them for your own purposes.
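As a taste of what is ahead, serving a Django project with gunicorn boils down to a single command (myproject here is a placeholder for your actual project package):

gunicorn myproject.wsgi:application --bind 127.0.0.1:8000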