Learning how to write "Less Slow" code in C++20, from numerical micro-kernels and SIMD to coroutines, ranges, and polymorphic state machines

ashvardanian/less_slow.cpp

Learning to Write Less Slow C, C++, and Assembly Code

The benchmarks in this repository don't aim to cover every topic entirely, but they help form a mindset and intuition for performance-oriented software design. They also show how to use some non-STL but de facto standard C++ libraries, importing them via CMake and compiling them from source. For higher-level abstractions and languages, check out less_slow.rs and less_slow.py.

Much modern code suffers from common pitfalls, such as bugs, security vulnerabilities, and performance bottlenecks. University curricula often teach outdated concepts, while bootcamps oversimplify crucial software development principles.

Less Slow C++

This repository offers practical examples of writing efficient C and C++ code. It leverages C++20 features and is designed primarily for GCC and Clang compilers on Linux, though it may work on other platforms. The topics range from basic micro-kernels executing in a few nanoseconds to more complex constructs involving parallel algorithms, coroutines, and polymorphism. Some of the highlights include:

  • 100x cheaper random inputs?! Discover how input generation sometimes costs more than the algorithm.
  • 40x faster trigonometry: Speed up standard library functions like std::sin in just 3 lines of code.
  • 4x faster logic with std::ranges: Reduce stack usage and reuse registers more efficiently.
  • Compiler optimizations beyond -O3: Learn about less obvious flags and techniques for another 2x speedup.
  • Multiplying matrices? Check how a 3x3x3 GEMM can be 70% slower than 4x4x4, despite 60% fewer ops.
  • How many if conditions are too many? Test your CPU's branch predictor with just 10 lines of code.
  • Prefer recursion to iteration? Measure the depth at which your algorithm will SEGFAULT.
  • How not to build state machines: Compare std::variant, virtual functions, and C++20 coroutines.
  • Scaling to many cores? Learn how to use OpenMP, Intel's oneTBB, or your custom thread pool.
  • How to handle JSON avoiding memory allocations? Is it easier with C or C++ libraries?
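As an illustration of the trigonometry item in the list above, one common way to trade accuracy for speed is to truncate the Maclaurin series of the sine after three terms. The sketch below shows that general technique under stated assumptions; it is not copied from less_slow.cpp, and the names `sine_maclaurin` and `sine_error` are hypothetical.

```cpp
#include <cmath>

// Hypothetical helper, not taken from the repository.
// Three-term Maclaurin series: sin(x) ~ x - x^3/3! + x^5/5!
// Assumption: the argument is small, roughly within [-pi/2, pi/2].
inline double sine_maclaurin(double x) noexcept {
    double const x2 = x * x;
    double const x3 = x2 * x;
    double const x5 = x3 * x2;
    return x - x3 / 6.0 + x5 / 120.0;
}

// Absolute difference between the approximation and std::sin,
// useful for checking how much accuracy is given up.
inline double sine_error(double x) noexcept {
    return std::abs(sine_maclaurin(x) - std::sin(x));
}
```

The first omitted term is x^7/5040, so the error grows quickly with the magnitude of the argument. Whether the loss of precision is acceptable depends on the application, and the actual speedup should be measured rather than assumed.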

To get started, open the less_slow.cpp source file and read through the code snippets and comments. Follow the instructions below to run the code in your environment and compare your results with the numbers in the comments as you go.

Running the Benchmarks

  • If you are on Windows, it's recommended that you set up a Linux environment using WSL.
  • If you are on macOS, consider using the non-native distribution of Clang from Homebrew or MacPorts.
  • If you are on Linux, make sure to install CMake and a recent version of GCC or Clang to support C++20 features.

If you are familiar with C++ and want to review code and measurements as you read, you can clone the repository and execute the following commands.

git clone https://github.com/ashvardanian/less_slow.cpp.git # Clone the repository
cd less_slow.cpp                                            # Change the directory
cmake -B build_release -D CMAKE_BUILD_TYPE=Release          # Generate the build files
cmake --build build_release --config Release                # Build the project
build_release/less_slow                                     # Run the benchmarks

The build will pull and compile several third-party dependencies from source:

  • Google's Benchmark is used for profiling.
  • Intel's oneTBB is used as the Parallel STL backend.
  • Eric Niebler's range-v3 replaces std::ranges.
  • Victor Zverovich's fmt replaces std::format.
  • Ash Vardanian's StringZilla replaces std::string.
  • Hana Dusíková's CTRE replaces std::regex.
  • Niels Lohmann's json is used for JSON deserialization.
  • Yaoyuan Guo's yyjson is used for faster JSON processing.
  • Google's Abseil replaces STL's associative containers.

To control the output or run specific benchmarks, use the following flags:

build_release/less_slow --benchmark_format=json             # Output in JSON format
build_release/less_slow --benchmark_out=results.json        # Save the results to a file instead of `stdout`
build_release/less_slow --benchmark_filter=std_sort         # Run only benchmarks containing `std_sort` in their name

To enhance stability and reproducibility, disable Simultaneous Multi-Threading (SMT) on your CPU and use the --benchmark_enable_random_interleaving=true flag, which shuffles and interleaves benchmark runs, as described in the Google Benchmark documentation.

build_release/less_slow --benchmark_enable_random_interleaving=true

Google Benchmark supports User-Requested Performance Counters through libpfm. Note that collecting these may require sudo privileges.

sudo build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_format=json --benchmark_perf_counters="CYCLES,INSTRUCTIONS"

Alternatively, use the Linux perf tool for performance counter collection:

sudo perf stat taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_filter=super_sort
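The `taskset` argument in the command above is a hexadecimal CPU affinity mask in which the lowest-order bit stands for CPU 0. The sketch below is only a reading aid for such masks and is not part of the repository; the helper name `excluded_cpus` is hypothetical. It lists the CPU indices whose bits are cleared in a mask passed without the `0x` prefix.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper, not taken from the repository.
// Returns the indices of CPUs whose bits are cleared in a hexadecimal
// affinity mask. The rightmost hex digit covers CPUs 0 to 3.
// Characters that are not hex digits are treated as 0 in this sketch.
inline std::vector<std::size_t> excluded_cpus(std::string_view hex_mask) {
    std::vector<std::size_t> excluded;
    std::size_t const digits = hex_mask.size();
    for (std::size_t i = 0; i != digits; ++i) {
        char const c = hex_mask[digits - 1 - i]; // Walk from the least significant digit
        unsigned value = 0;
        if (c >= '0' && c <= '9') value = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') value = static_cast<unsigned>(c - 'a') + 10u;
        else if (c >= 'A' && c <= 'F') value = static_cast<unsigned>(c - 'A') + 10u;
        for (std::size_t bit = 0; bit != 4; ++bit)
            if (!(value & (1u << bit))) excluded.push_back(i * 4 + bit);
    }
    return excluded;
}
```

For a single `EFFF` group the function reports CPU 12, so the repeating `EFFF` pattern leaves out CPU 12 of every group of 16 logical CPUs. The source does not say why those CPUs are excluded, so adapt the mask to the core layout of your own machine.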