Sampling Efficiently from Groups

Suppose that you have to sample a student at random in a school. However, you cannot go into a classroom and just pick a student. All you are allowed to do is to pick a classroom, and then let the teacher pick a student at random. If the classrooms contain a variable number of st … | Continue reading


@lemire.me | 4 years ago

Rounding Integers to Even, Efficiently

When dividing a numerator n by a divisor d, most programming languages round “down”. It means that 1/2 is 0. Mathematicians will insist that 1/2 is 1/2 and claim that you are really computing floor(1/2). But let me think like a programmer. So 3/2 is 1. If you always want to round up, yo … | Continue reading


@lemire.me | 4 years ago
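
The rounding behaviour described in the entry above can be illustrated with a short sketch. It assumes unsigned 64-bit operands and a nonzero divisor; the function names are hypothetical, and the round-to-even variant is one possible formulation, not necessarily the one the truncated post arrives at.

```cpp
#include <cstdint>

// Round down: the default behaviour of unsigned integer division in C++.
std::uint64_t div_floor(std::uint64_t n, std::uint64_t d) {
  return n / d;
}

// Round up: add one whenever the division leaves a remainder.
std::uint64_t div_ceil(std::uint64_t n, std::uint64_t d) {
  return n / d + (n % d != 0);
}

// Round to nearest, with exact ties going to the even quotient.
std::uint64_t div_round_even(std::uint64_t n, std::uint64_t d) {
  std::uint64_t q = n / d;
  std::uint64_t r = n % d;
  // r > d - r means the fractional part exceeds one half;
  // r == d - r is an exact tie, broken toward the even quotient.
  if (r > d - r || (r == d - r && (q & 1) != 0)) {
    ++q;
  }
  return q;
}
```

Comparing r against d - r, instead of comparing 2 * r against d, avoids an overflow when d is close to the largest representable value.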

Multiplying backward for profit

Most programming languages have integer types with arithmetic operations like multiplications, additions and so forth. Our main processors support 64-bit integers which means that you can deal with rather large integers. However, you cannot represent everything with a 64-bit inte … | Continue reading


@lemire.me | 4 years ago

Simdjson 0.3: the fastest JSON parser in the world is even better

Last year (2019), we released the simdjson library. It is a C++ library available under a liberal license (Apache) that can parse JSON documents very fast. How fast? We reach and exceed 3 gigabytes per second in many instances. It can also parse millions of small JSON documents pe … | Continue reading


@lemire.me | 4 years ago

Avoiding cache line overlap by replacing a 256-bit store with two 128-bit stores

Memory is organized in cache lines, frequently blocks of 64 bytes. On Intel and AMD processors, you can store and load memory in blocks of various sizes, such as 64 bits, 128 bits or 256 bits. In the old days, and on some limited devices today, reading and storing to memory requi … | Continue reading


@lemire.me | 4 years ago

Number of atoms in the universe versus floating-point values

It is estimated that there are about 10^80 atoms in the universe. The estimate for the total number of electrons is similar. It is a huge number and it far exceeds the maximal value of a single-precision floating-point type in our current computers (which is about 10^38). Yet the m … | Continue reading


@lemire.me | 4 years ago
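
To make the comparison in the entry above concrete, here is a minimal sketch that checks whether a magnitude fits within the finite range of a single-precision float. The function name is hypothetical; the limits come from the standard `std::numeric_limits` facility.

```cpp
#include <cmath>
#include <limits>

// True when |x| is no larger than the largest finite single-precision
// value (about 3.4 * 10^38 on IEEE 754 systems).
bool fits_in_float_range(double x) {
  return std::fabs(x) <=
         static_cast<double>(std::numeric_limits<float>::max());
}
```

Under this check, a value near 10^80 is out of range for a float, although it is still well within the range of a double.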

Fast Float Parsing in Practice

In our work parsing JSON documents as quickly as possible, we found that one of the most challenging problems is to parse numbers. That is, you want to take the string “1.3553e142” and convert it quickly to a double-precision floating-point number. You can use the strtod function … | Continue reading


@lemire.me | 4 years ago

Will calling “free” or “delete” in C/C++ release the memory to the system?

In the C programming language, we typically manage memory manually. A typical heap allocation is a call to malloc followed by a call to free. In C++, you have more options, but it is the same routines under the hood. // allocate N kB data = malloc(N*1024); // do something with th … | Continue reading


@lemire.me | 4 years ago

Research should not stop with the research paper

The practice of academic research is based on the production of formal documents that undergo formal review by peers. We routinely evaluate academics for jobs and promotions based on their publication output. When asked about their contribution to science, many academics are h … | Continue reading


@lemire.me | 4 years ago

Cost of a thread in C++ under Linux

Almost all our computers are made of several processing cores. Thus it can be efficient to “parallelize” expensive processing in a multicore manner. That is, instead of using a single core to do all of the work, you divide the work among multiple cores. A standard way to approach … | Continue reading


@lemire.me | 4 years ago

Filling large arrays with zeroes quickly in C++

Travis Downs reports that some C++ compilers have trouble filling up arrays with values at high speed. Typically, to fill an array with some value, C++ programmers invoke std::fill. We might assume that for a task as simple as filling an array with zeroes, the C++ standard li … | Continue reading


@lemire.me | 4 years ago

Allocating large blocks of memory: bare-metal C++ speeds

In a previous post, I benchmarked the allocation of large blocks of memory using idiomatic C++. I got a depressing result: the speed could be lower than 2 GB/s. For comparison, the disk in my laptop has greater bandwidth. Methodologically, I benchmarked the “new” operator in C++ … | Continue reading


@lemire.me | 4 years ago

How fast can you allocate a large block of memory in C++?

In C++, the most basic memory allocation code is just a call to the new operator: char *buf = new char[s]; According to a textbook interpretation, we just allocated s bytes. If you benchmark this line of code, you might find that it is almost entirely free on a per-byte basis for la … | Continue reading


@lemire.me | 4 years ago

I Teach Database Design

Most software runs on top of databases. These databases are organized logically, with a schema, that is a formal description. You have entities (your user), attributes (your user’s name) and relationships between them. Typical textbook database design comes from an era when it wa … | Continue reading


@lemire.me | 4 years ago

Xor Filters: Faster and Smaller Than Bloom Filters

In software, you frequently need to check whether some object is in a set. For example, you might have a list of forbidden Web addresses. As someone enters a new Web address, you may want to check whether it is part of your black list. Or maybe you have a large list of already u … | Continue reading


@lemire.me | 4 years ago

Are 64-bit random identifiers free from collision?

It is common in software systems to map objects to unique identifiers. For example, you might map all web pages on the Internet to a unique identifier. Often, these identifiers are integers. For example, many people like to use 64-bit integers. If you assign two 64-bit integers at … | Continue reading


@lemire.me | 4 years ago
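
As a rough way to reason about the question raised in the entry above, the standard birthday approximation estimates the probability that at least two of n uniformly random 64-bit identifiers coincide as 1 - exp(-n(n-1)/2^65). The sketch below is an illustration under that assumption (independent, uniformly distributed identifiers); the function name is hypothetical and the truncated post may take a different route.

```cpp
#include <cmath>

// Approximate probability that at least two of n identifiers, drawn
// independently and uniformly from 2^64 values, are equal.
double collision_probability_64(double n) {
  if (n < 2.0) {
    return 0.0;  // fewer than two identifiers cannot collide
  }
  const double space = std::ldexp(1.0, 64);  // 2^64 possible values
  const double pairs = n * (n - 1.0) / 2.0;  // number of distinct pairs
  // -expm1(-x) computes 1 - exp(-x) accurately even for tiny x.
  return -std::expm1(-pairs / space);
}
```

With this approximation, the probability stays tiny for millions of identifiers but becomes substantial once n approaches 2^32, the square root of the number of possible values.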

Amazon’s new ARM servers: Graviton 2

Most servers on the Internet run on x64 processors, mostly made by Intel. Meanwhile, most smartphones run ARM processors. From a business perspective, these are different technologies. The x64 processors are mostly designed by only two companies (Intel and AMD), with one very lar … | Continue reading


@lemire.me | 4 years ago

AMD Zen 2 and branch mispredictions

Intel makes some of the very best processors money can buy. For a long time, its main rival (AMD) failed to compete. However, AMD's latest generation of processors (Zen 2) appears to roughly match Intel, at a lower price point. In several benchmarks that I care about, my good old Int … | Continue reading


@lemire.me | 4 years ago

Instructions per Cycle: AMD Zen 2 versus Intel

The performance of a processor is determined by several factors. For example, processors with a higher frequency tend to do more work per unit of time. Physics makes it difficult to produce processors that have higher frequency. Modern processors can execute many instructions per … | Continue reading


@lemire.me | 4 years ago

Memory Parallelism: AMD Rome versus Intel

When thinking about “parallelism”, most programmers think about having multiple processors. However, even a single core in a modern processor has plenty of parallelism. It can execute many instructions per cycle and, importantly, it can issue multiple memory requests concurrently … | Continue reading


@lemire.me | 4 years ago

A Short History of Technology

-312: The aqueduct is invented. -100: Paper is invented. 100: Paper production at scale begins. 900: Gunpowder has been invented. 1040: The modern compass has been invented. 1439: Gutenberg’s printing press. 1451: Christopher Columbus’ boats have lateens, triangular sails that allo … | Continue reading


@lemire.me | 4 years ago

Cloud Computing: A Story of Incentives

Many businesses today run “in the cloud”. What this often means is that they have abstracted out the hardware entirely. Large corporations like Amazon, Google, Microsoft or IBM operate the servers. The business only needs to access the software, remotely. In theory, this means th … | Continue reading


@lemire.me | 4 years ago

Unrolling your loops can improve branch prediction

Modern processors predict branches (e.g., if-then clauses), often many cycles ahead of time. When predictions are incorrect, the processor has to start again, an expensive process. Adding a hard-to-predict branch can multiply your running time. Does it mean that you should only … | Continue reading


@lemire.me | 4 years ago

Adding a predictable branch to existing code can increase branch mispredictions

Software is full of “branches”. They often take the form of if-then clauses in code. Modern processors try to predict the result of branches often long before evaluating them. Hard-to-predict branches are a challenge performance-wise because when a processor fails to predict corr … | Continue reading


@lemire.me | 4 years ago

Parsing numbers in C++: streams, strtod, from_chars

When programming, we often want to convert strings (e.g., “1.0e2”) into numbers (e.g., 100). In C++, we have many options. In a previous post, I reported that it is an expensive process when using the standard approach (streams). Many people pointed out to me that there are faste … | Continue reading


@lemire.me | 4 years ago

How expensive is it to parse numbers from a string in C++?

In software, we frequently have to parse numbers from strings. Numbers are typically represented in computers as 32-bit or 64-bit words whereas strings are variable-length sequences of bytes. We need to go from one representation to the other. For example, given the 3-character s … | Continue reading


@lemire.me | 4 years ago

Mispredicted branches can multiply your running times

Modern processors are superscalar, meaning that they can execute many instructions at once. For example, some processors can retire four or six instructions per cycle. Furthermore, many of these processors can initiate instructions out-of-order: they can start working on instruct … | Continue reading


@lemire.me | 4 years ago

Benchmarking is hard: processors learn to predict branches

A lot of software is an intricate web of branches (if–then clauses). For performance reasons, modern processors predict the results of these branches. In my previous post, I showed how the bulk of your running time could be due to mispredicted branches. My benchmark consisted in writ … | Continue reading


@lemire.me | 4 years ago

A fast alternative to the modulo reduction

Suppose you want to pick an integer at random in a set of N elements. Your computer has functions to generate random 32-bit integers. How do you transform such numbers into indexes no larger than N? Suppose you have a hash table with a capacity N. Again, you need to transform you … | Continue reading


@lemire.me | 4 years ago
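
One widely used way to map a 32-bit value into a range without a modulo is a multiply followed by a shift. The sketch below shows that approach as an illustration of the problem stated in the entry above; the excerpt is truncated before it names a technique, so treating this as the post's method is an assumption, and the function name is hypothetical.

```cpp
#include <cstdint>

// Map a 32-bit value x to the range [0, N) using a 64-bit multiplication
// and a shift instead of the modulo operator. N must be nonzero for the
// range to be non-empty.
std::uint32_t reduce(std::uint32_t x, std::uint32_t N) {
  // x * N is at most (2^32 - 1) * N, so the top 32 bits are below N.
  return static_cast<std::uint32_t>(
      (static_cast<std::uint64_t>(x) * N) >> 32);
}
```

Unlike x % N, this mapping uses the most significant bits of x, so it is only a fair reduction when all bits of x are well distributed.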

Doubling the speed of std::uniform_int_distribution in the GNU C++ library

The standard way in C++ to generate a random integer in a range is to call the std::uniform_int_distribution function. The current implementation of std::uniform_int_distribution in the GNU C++ library (libstdc++) to generate an integer in the interval [0,range] looks as follows. … | Continue reading


@lemire.me | 4 years ago
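
For readers unfamiliar with the facility discussed in the entry above, here is a minimal usage sketch of std::uniform_int_distribution producing an integer in the closed interval [0, range]. The helper name and the choice of std::mt19937 as the generator are assumptions made for illustration; this shows only how the standard interface is called, not the library internals the post examines.

```cpp
#include <cstdint>
#include <random>

// Draw an integer uniformly from the closed interval [0, range]
// using the supplied generator.
std::uint32_t random_in_closed_range(std::mt19937& gen, std::uint32_t range) {
  std::uniform_int_distribution<std::uint32_t> dist(0, range);
  return dist(gen);
}
```

A caller would typically construct the generator once, for example `std::mt19937 gen(seed);`, and reuse it across calls.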

How far can you scale interleaved binary searches?

The binary search is the standard textbook approach when searching through sorted arrays. In a previous post, I showed how you can do multiple binary searches faster if you interleave them. The core idea is that while the processor is waiting for data from one memory access, yo … | Continue reading


@lemire.me | 4 years ago

Speeding up independent binary searches by interleaving them

Given a long list of sorted values, how do you find the location of a particular value? A simple strategy is to first look at the middle of the list. If your value is larger than the middle value, look at the last half of the list; if not, look at the first half of … | Continue reading


@lemire.me | 4 years ago
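
The halving strategy described in the entry above can be sketched as follows. This is a plain, non-interleaved binary search written for illustration; the function name is hypothetical, and the sketch assumes a sorted std::vector of int.

```cpp
#include <cstddef>
#include <vector>

// Return the index of the first element that is not less than target,
// or sorted.size() when every element is smaller than target.
std::size_t binary_search_index(const std::vector<int>& sorted, int target) {
  std::size_t lo = 0;
  std::size_t hi = sorted.size();
  while (lo < hi) {
    std::size_t mid = lo + (hi - lo) / 2;  // middle of the remaining span
    if (sorted[mid] < target) {
      lo = mid + 1;  // target is larger: keep the last half
    } else {
      hi = mid;      // otherwise: keep the first half
    }
  }
  return lo;
}
```

To test membership, a caller would check that the returned index is in bounds and that the element at that index equals the target.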

Passing integers by reference can be expensive

In languages like C++, you can pass values to functions in two ways. You can pass by value: the value is semantically “copied” before being passed to the function. Any change you make to the value within the function will be gone after the function terminates. The value is epheme … | Continue reading


@lemire.me | 4 years ago
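
A minimal sketch of the distinction drawn in the entry above, with hypothetical function names: the first function modifies a private copy, while the second modifies the caller's variable.

```cpp
// Pass by value: x is a copy, so the caller's variable is unchanged.
void add_one_by_value(int x) {
  x += 1;
}

// Pass by reference: x refers to the caller's variable, so the change
// is visible after the function returns.
void add_one_by_reference(int& x) {
  x += 1;
}
```

Starting from `int a = 1;`, calling `add_one_by_value(a)` leaves `a` equal to 1, whereas calling `add_one_by_reference(a)` sets it to 2.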

How fast can scancount be?

In an earlier post, I described the following problem. Suppose that you have tens of arrays of integers. You wish to find all integers that are in more than 3 (say) of the sets. This is neither a union nor an intersection. A simple algorithm to solve this problem is to use a count … | Continue reading


@lemire.me | 4 years ago
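
The threshold problem stated in the entry above can be illustrated with a counter-per-value sketch. Because the excerpt is cut off, the details here are assumptions: every value is smaller than a known bound `universe`, no array contains duplicates, and there are fewer than 256 arrays so that 8-bit counters cannot overflow. The function name and signature are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Return, in increasing order, every value that appears in more than
// `threshold` of the input arrays.
std::vector<std::uint32_t> scancount(
    const std::vector<std::vector<std::uint32_t>>& arrays,
    std::uint32_t universe, std::uint8_t threshold) {
  // One counter per possible value, all starting at zero.
  std::vector<std::uint8_t> counters(universe, 0);
  for (const auto& array : arrays) {
    for (std::uint32_t value : array) {
      ++counters[value];
    }
  }
  // Scan the counters and keep the values above the threshold.
  std::vector<std::uint32_t> result;
  for (std::uint32_t value = 0; value < universe; ++value) {
    if (counters[value] > threshold) {
      result.push_back(value);
    }
  }
  return result;
}
```

The cost is proportional to the total input size plus the size of the universe, which is why the counter array's footprint matters when the universe is large.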

Faster threshold queries with cache-sensitive scancount

Suppose that you are given 100 sorted arrays of integers. You can compute their union or their intersection. It is a common setup in data indexing: the integers might be unique identifiers. But there is more than just intersections and unions… What if you want all values that app … | Continue reading


@lemire.me | 4 years ago

Microbenchmarking Calls for Idealized Conditions

Programmers use software benchmarking to measure the speed of software. We need to distinguish system benchmarking, where one seeks to measure the performance of a system, from microbenchmarking, where one seeks to measure the performance of a small, isolated operation. For examp … | Continue reading


@lemire.me | 4 years ago

A new release of simdjson: runtime dispatching, 64-bit ARM support and more

JSON is a ubiquitous data exchange format. It is found everywhere on the Internet. To consume JSON, software uses tools called JSON parsers. Earlier this year, we released the first version of our JSON parsing library, simdjson. It is arguably the fastest standard-compliant parse … | Continue reading


@lemire.me | 4 years ago

Java BufferedReader Is CPU Bound

In an earlier post, I asked how fast the getline function in C++ could run through the lines in a text file. The answer was about 2 GB/s. That is slower than some of the best disk drives and network connections. If you take into account that software rarely only needs to “just” ac … | Continue reading


@lemire.me | 4 years ago

Programming competition with $1000 in prizes: make my code readable

Colm MacCárthaigh is organizing a programming competition with three prizes: $500, $300, $200. The objective? Produce the most readable, easy to follow, and well tested implementation of the nearly divisionless random integer algorithm. The scientific reference is Fast Random I … | Continue reading


@lemire.me | 4 years ago

Parsing JSON Quickly on Tiny Chips (ARM Cortex-A72 Edition)

I own an inexpensive card-size ROCKPro64 computer ($60). It has ARM Cortex-A72 processors, the same processors you find in the recently released Raspberry Pi 4. These are processors similar to those you find in your smartphone, although they are far less powerful than the best … | Continue reading


@lemire.me | 4 years ago

Parsing JSON Using SIMD Instructions on the Apple A12 Processor

Most modern processors have “SIMD instructions”. These instructions operate over wide registers, doing many operations at once. For example, you can easily subtract 16 values from 16 other values in a single SIMD instruction. It is a form of parallelism that can drastically improve th … | Continue reading


@lemire.me | 4 years ago

A fast 16-bit random number generator?

In software, we often need to generate random numbers. Commonly, we use pseudo-random number generators. A simple generator is wyhash. It is a multiplication followed by an XOR: uint64_t wyhash64_x; uint64_t wyhash64() { wyhash64_x += 0x60bee2bee120fc15; __uint128_t tmp; tmp = (_ … | Continue reading


@lemire.me | 4 years ago

Bounding the cost of the intersection between a small array and a large array

Consider the scenario where you are given a small sorted array of integers (e.g., [1,10,100]) and a large sorted array ([1,2,13,51,…]). You want to compute the intersection between these two arrays. A simple approach would be to take each value in the small array and do a binary … | Continue reading


@lemire.me | 4 years ago

How fast is getline in C++?

A standard way to read a text file in C++ is to call the getline function. To iterate over all lines in a file and sum up their lengths, you might do as follows: while(getline(is, line)) { x += line.size(); } How fast is this? On a Skylake processor with a recent GNU GCC compiler wi … | Continue reading


@lemire.me | 4 years ago
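
The loop quoted inline in the entry above can be wrapped into a self-contained function as follows. The function names and the string-based helper are hypothetical additions for illustration; only the `while (getline(is, line))` loop comes from the excerpt.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Read the stream line by line and return the sum of the line lengths,
// not counting the newline characters.
std::size_t sum_line_lengths(std::istream& is) {
  std::string line;
  std::size_t x = 0;
  while (std::getline(is, line)) {
    x += line.size();
  }
  return x;
}

// Convenience helper: run the same loop over an in-memory string.
std::size_t sum_line_lengths_in_text(const std::string& text) {
  std::istringstream is(text);
  return sum_line_lengths(is);
}
```

For a file, a caller would pass a `std::ifstream` to `sum_line_lengths`, since it also derives from `std::istream`.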

What should we do with “legacy” Java 8 applications?

Java is a mature programming language. It was improved over many successive versions. Mostly, new Java versions did not break your code. Thus Java was a great, reliable platform. For some reason, the Oracle engineers decided to break things after Java 8. You cannot “just” upgrade … | Continue reading


@lemire.me | 4 years ago

Nearly Divisionless Random Integer Generation on Various Systems

It is common in software to need random integers within a range of values. For example, you may need to pick an object at random in an array. Random shuffling algorithms require such random integers. Typically, you generate a regular integer (say a 64-bit integer). From these in … | Continue reading


@lemire.me | 4 years ago

Measuring the system clock frequency using loops (Intel and ARM)

In my previous post, Bitset decoding on Apple’s A12, I showed that Apple’s latest ARM-based processor can decode set bits out of a stream of words using 3 cycles per set bit. This compares favourably to Intel processors since I never could get one of them to do better than 3.5 cy … | Continue reading


@lemire.me | 5 years ago