In this post, I will talk about my experience with the AWS Solutions Architect Associate certification and how I prepared for it. | Continue reading
Efficient memory usage is critical for good Spark performance, and Spark has its own internal memory model. | Continue reading
In data management, we are still in the Wild West - new trends emerge every day. How do you stay relevant in the industry? | Continue reading
As we continue to move our applications from servers and virtual machines to containers, Kubernetes is inevitable. | Continue reading
M-Motivation, M-Money, M-Mastery, M-Mystery | Continue reading
We are investigating possible ways to keep our application on AWS Lambda up and running under a DDoS attack. | Continue reading
Have you heard of microservices? Of course you have - any housewife already knows how to deploy them on a k8s cluster. Here's some thinking about them. | Continue reading
The Data Lake and the Data Warehouse: they seem similar, but there are differences. | Continue reading
Data partitioning is critical to data processing performance, especially when processing large volumes of data in Spark. Here are some partitioning tips. | Continue reading
Specifying the exact amount of resources in Python using built-in modules | Continue reading
Here are the details of the ORC data format for Hive and why the GC overhead limit exceeded error occurs. | Continue reading
In this post, we will figure out how to write concurrent applications in Python 3.5 | Continue reading
Synchrony vs. asynchrony: why was asynchrony created in the first place? What is the event loop, and how is it all connected with cooperative multitasking? This post answers these questions. | Continue reading
Data analysis is valuable because it lets you be more confident that future results will be reliable, correctly interpreted, and correctly applied | Continue reading
This post on descriptive statistics covers the basic concepts used to describe a data sample | Continue reading
The agile development process suits both the development team and the customer. It is known for its main idea of "check and adapt". | Continue reading
For highly loaded systems, serverless is a simple way to scale almost without limit, and for side projects it is a great opportunity for free hosting. | Continue reading
Your manager has come to you, but you have not estimated the time for your tasks? In this post, I will explain general methods for estimating the time for a task or project | Continue reading
Have you ever thought about the basic rules of hygiene and safety in software engineering? | Continue reading
It is much easier to write code or do a code review with a strictly defined, practical style guide for Python, like PEP 8 but better | Continue reading
A guide to bucketing - an optimization technique that uses buckets to determine data partitioning and avoid data shuffles. | Continue reading
Apache Spark is the major talking point in Big Data pipelines, boasting performance 10-100x faster than comparable tools. These speeds can be achieved using the tips described here. | Continue reading
If your application needs to measure elapsed time, you need a timer that will give the right answer even if the user changes the time on the system clock | Continue reading
Apache Spark is the major talking point in Big Data pipelines, boasting performance 10-100x faster than comparable tools. But how achievable are these speeds if you use a slow Python interpreter? | Continue reading
When we talk about working with data, we are usually working in a system that belongs to one of two types - Schema on Read or Schema on Write. How do they differ? | Continue reading
Azure Blob Storage is a Microsoft solution for storing objects in the cloud. It is optimized for storing large amounts of data and can be easily accessed by your Python/Spark application | Continue reading
Hypothesis testing is an essential procedure in statistics. A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. | Continue reading
Continuous Integration/Continuous Delivery | Continue reading
The evaluation of the major data formats and storage engines for the Big Data ecosystem has shown the pros and cons of each of them for various metrics. In this post, I'll try to compare the CSV, JSON, Parquet, and Avro formats using Apache Spark. | Continue reading
This post closely examines the components of a Spark application, looks at how these components work together, and shows how Spark applications run on a YARN cluster. | Continue reading
In this post we will go through Apache Spark core concepts such as RDD and DAG | Continue reading
Spark has been part of Hadoop since 2.0 and is one of the most useful technologies for Python Big Data engineers. Before going in depth into what Apache Spark consists of, we will briefly look at the Hadoop platform and what YARN does there. | Continue reading
Understand when to use the Pearson product-moment correlation, what range of values its coefficient can take and how to measure the strength of association. | Continue reading
To assess and describe the distribution of characteristics, we need to know a couple of things: the values that are typical for the distribution under study, and how typical they are. | Continue reading
There are many distributions, but here we will be talking about the most common and widely used ones. | Continue reading
Knowing probability and its applications is important for working effectively on data science problems, and this post will remind you what probability actually is. | Continue reading
In this post, we will be talking about blocking and non-blocking network I/O in order to explain the concept of asynchronous programming | Continue reading
It may seem that there is no difference between concurrency and parallelism, but that impression comes from missing the essence of the matter. Let's try to understand how they differ. | Continue reading
__context__ vs __cause__ attributes in exception handling in Python 3 | Continue reading
What is the definition of a good software engineer? This question is meant to be personal: it focuses on the thoughts of the person you are asking. I will share my thoughts in this post. | Continue reading
Today we will talk about Unit Tests, which are placed at the bottom of the testing pyramid and have the shortest feedback cycle. | Continue reading
How to resolve Cython and NumPy dependencies at the setup step. | Continue reading
In this post, I will explain general methods for estimating the time for a task or project and show a step-by-step algorithm for doing so | Continue reading
A simple post about the distinction between the customer's problem, the created product, and the architecture solution. | Continue reading
In this post, I'll explain what the best software engineering principles are for me. | Continue reading
Python interview questions. Part III. Senior | Continue reading