Add an efficient reservoir sampling aggregator #1214

marcusb · 2024-12-24T17:32:15Z

This aggregator uses Li's "Algorithm L", a simple yet efficient
sampling method, with modifications to support a monoidal setting.
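
For readers unfamiliar with it, here is a minimal sketch of the plain, non-monoidal form of Algorithm L over an iterator. It is not the code from this PR: the object name `AlgorithmLSketch`, the method `sample`, and its parameters are hypothetical, and the monoidal modifications mentioned above are not shown.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Illustrative sketch only; names are hypothetical, not Algebird's API.
object AlgorithmLSketch {

  /** Returns a uniform random sample of up to k items from `items`. */
  def sample[T](items: Iterator[T], k: Int, rng: Random): Vector[T] = {
    require(k > 0, "k must be positive")

    // Fill the reservoir with the first k items (fewer if the input is short).
    val reservoir = ArrayBuffer.empty[T]
    while (reservoir.size < k && items.hasNext) reservoir += items.next()

    // Uniform draw in (0, 1], so that log(u) is always finite.
    def u(): Double = 1.0 - rng.nextDouble()

    var w = math.exp(math.log(u()) / k)
    while (items.hasNext) {
      // Geometrically distributed number of items to pass over.
      var skip = math.floor(math.log(u()) / math.log1p(-w))
      while (skip > 0 && items.hasNext) { items.next(); skip -= 1 }
      if (items.hasNext) {
        // The next item replaces a uniformly chosen reservoir slot.
        reservoir(rng.nextInt(k)) = items.next()
        w *= math.exp(math.log(u()) / k)
      }
    }
    reservoir.toVector
  }
}
```

For example, `AlgorithmLSketch.sample((1 to 1000).iterator, 10, new Random(42))` yields ten distinct values from the range. The random draws happen only once per replacement rather than once per item, which is where the method saves work compared with a priority-queue approach.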

A JMH benchmark was added for both this and the old priority-queue
algorithm. In a single-threaded benchmark on an Intel Core i9-10885H,
this algorithm can outperform the old one by an order of magnitude or
more, depending on the parameters.

Because of this, the new algorithm was made the default for
Aggregator.reservoirSample().

Unit tests were added for both algorithms. These are probabilistic, and
each test case is expected to fail about 0.1% of the time (the p-value
threshold is set to 0.001).
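
As a rough guide to how often a full test run might fail spuriously: assuming the test cases are independent (an assumption, not something stated in this PR) and writing m for the number of test cases (a hypothetical symbol), the probability of at least one false failure per run is:

```latex
P(\text{at least one failure}) = 1 - (1 - 0.001)^{m} \approx 0.001\,m
\qquad \text{for } m \ll 1000
```

For instance, with m = 10 this is about 1% per run.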

CLAassistant · 2024-12-24T17:32:21Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

marcusb · 2024-12-24T19:27:46Z

JMH benchmark results (Intel Core i9-10885H):

[info] Benchmark                                      (collectionSize)  (sampleRate)   Mode  Cnt        Score        Error  Units
[info] ReservoirSamplingBenchmark.timeAlgorithmL                   100         0.001  thrpt    3  1043859.778 ± 466398.508  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                   100          0.01  thrpt    3  1063744.046 ± 122691.603  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                   100           0.1  thrpt    3   296958.161 ±  21380.964  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                 10000         0.001  thrpt    3    28119.689 ±   2855.057  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                 10000          0.01  thrpt    3     9707.608 ±   5480.480  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                 10000           0.1  thrpt    3     3174.108 ±   1343.076  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL               1000000         0.001  thrpt    3      225.023 ±   1131.246  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL               1000000          0.01  thrpt    3       78.130 ±      4.710  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL               1000000           0.1  thrpt    3       28.881 ±     10.076  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue               100         0.001  thrpt    3   203221.750 ± 179576.474  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue               100          0.01  thrpt    3   208366.903 ±  35692.574  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue               100           0.1  thrpt    3   147470.590 ±   4329.993  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue             10000         0.001  thrpt    3     2230.555 ±     68.464  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue             10000          0.01  thrpt    3     1913.923 ±    221.179  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue             10000           0.1  thrpt    3      975.181 ±     72.015  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue           1000000         0.001  thrpt    3       20.913 ±      1.756  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue           1000000          0.01  thrpt    3       15.828 ±      0.473  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue           1000000           0.1  thrpt    3        4.783 ±      1.291  ops/s

marcusb · 2024-12-24T19:29:03Z

@regadas

marcusb · 2024-12-24T20:36:04Z

marcusb · 2024-12-31T17:30:02Z

Another large performance boost by optimizing the aggregation for IndexedSeq.

This aggregator uses Li's "Algorithm L", a simple yet efficient sampling method, with modifications to support a monoidal setting.

A JMH benchmark was added for both this and the old priority-queue algorithm. In a single-threaded benchmark on an Intel Core i9-10885H, this algorithm can outperform the old one by an order of magnitude or more, depending on the parameters. Because of this, the new algorithm was made the default for Aggregator.reservoirSample().

Unit tests were added for both algorithms. These are probabilistic, and each test case is expected to fail about 0.1% of the time (the p-value threshold is set to 0.001).

Optimized overloads of the aggregation methods append/appendAll were added that operate on IndexedSeqs. An IndexedSeq has efficient random access, which lets the algorithm skip over items without examining each one, so sublinear runtime can be achieved.
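
To illustrate why random access helps, here is a minimal sketch (not the PR's implementation) that jumps directly to each accepted index and counts how many elements it actually reads. The names `IndexedSkipSketch` and `sampleCounting` are hypothetical, and the read counter exists only for demonstration.

```scala
import scala.util.Random

// Illustrative sketch only; names are hypothetical, not Algebird's API.
object IndexedSkipSketch {

  /** Returns the sample together with the number of elements read from xs. */
  def sampleCounting[T](xs: IndexedSeq[T], k: Int, rng: Random): (Vector[T], Int) = {
    require(k > 0, "k must be positive")
    val n = xs.length

    // The first k elements (or all of them, if fewer) seed the reservoir.
    val reservoir = xs.take(k).toBuffer
    var reads = reservoir.size

    // Uniform draw in (0, 1], so that log(u) is always finite.
    def u(): Double = 1.0 - rng.nextDouble()

    var w = math.exp(math.log(u()) / k)
    // Index of the last element considered; kept as a Double so that
    // very large skips cannot overflow an Int.
    var i = (k - 1).toDouble
    var done = n <= k
    while (!done) {
      // Jump straight past the skipped elements instead of visiting them.
      i += math.floor(math.log(u()) / math.log1p(-w)) + 1.0
      if (i < n) {
        reservoir(rng.nextInt(k)) = xs(i.toInt)
        reads += 1
        w *= math.exp(math.log(u()) / k)
      } else done = true
    }
    (reservoir.toVector, reads)
  }
}
```

Calling `IndexedSkipSketch.sampleCounting(0 until 1000000, 10, new Random(7))` returns ten distinct values and a read count far below the collection size. The expected number of reads is on the order of k(1 + ln(n/k)), which is what makes the runtime sublinear in n.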

marcusb force-pushed the sampling branch 4 times, most recently from 3d8bd4b to a0b526b Compare December 24, 2024 19:51

marcusb force-pushed the sampling branch 2 times, most recently from 036d5c2 to 431d5b7 Compare December 24, 2024 20:44

marcusb force-pushed the sampling branch from 7d1cfdf to 10198ac Compare December 31, 2024 20:36

marcusb force-pushed the sampling branch from 10198ac to 545993d Compare January 3, 2025 01:50
