Add an efficient reservoir sampling aggregator #1214

Open: marcusb wants to merge 1 commit into develop

marcusb commented Dec 24, 2024

This aggregator uses Li's "Algorithm L", a simple yet efficient
sampling method, with modifications to support a monoidal setting.
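
For readers unfamiliar with Algorithm L, here is a minimal Python sketch of the textbook single-stream version. It is not the Scala code in this PR, and it does not show the monoidal modifications mentioned above; the function and helper names are made up for illustration.

```python
import math
import random
from itertools import islice


def _open_unit(rng):
    """Uniform draw from the open interval (0, 1), so log() is always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def _skip_count(rng, w):
    """Number of items to pass over before the next replacement.

    This is a geometric draw with success probability w, computed in O(1).
    """
    if w >= 1.0:
        return 0
    if w <= 0.0:
        return math.inf  # w underflowed: no further replacement will happen
    ratio = math.log(_open_unit(rng)) / math.log1p(-w)
    return math.floor(ratio) if math.isfinite(ratio) else math.inf


def algorithm_l(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    it = iter(stream)
    reservoir = list(islice(it, k))
    if len(reservoir) < k:
        return reservoir  # stream shorter than k: keep everything

    w = math.exp(math.log(_open_unit(rng)) / k)
    skip = _skip_count(rng, w)
    for item in it:
        if skip > 0:
            skip -= 1  # passed over without drawing any random numbers
            continue
        reservoir[rng.randrange(k)] = item  # replace a random slot
        w *= math.exp(math.log(_open_unit(rng)) / k)
        skip = _skip_count(rng, w)
    return reservoir
```

The point of the method is that random numbers are drawn only when an item is actually placed in the reservoir, rather than once per input item as in a priority-queue approach.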

A JMH benchmark was added for both this algorithm and the old
priority-queue algorithm. In a single-threaded benchmark on an Intel
Core i9-10885H, the new algorithm can outperform the old one by an
order of magnitude or more, depending on the parameters.

Because of this, the new algorithm was made the default for
Aggregator.reservoirSample().

Unit tests were added for both algorithms. These are probabilistic, so
each test case is expected to fail spuriously about 0.1% of the time
(the significance level is set to 0.001).
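
To illustrate what a probabilistic test at the 0.001 level can look like, here is a hedged Python sketch of a chi-square goodness-of-fit check. It is an assumption about the general approach, not the test code in this PR; `draw_one` is a hypothetical interface for a size-1 sampler, and the critical value is taken from standard chi-square tables for 9 degrees of freedom.

```python
import random
from collections import Counter

N_ITEMS = 10
# Upper 0.001 critical value of the chi-square distribution with
# N_ITEMS - 1 = 9 degrees of freedom (standard table value).
CHI2_CRITICAL = 27.877


def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic for observed vs. expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def passes_uniformity_test(draw_one, trials=20_000, rng=None):
    """Check that draw_one(items, rng) picks each item about equally often.

    draw_one is a hypothetical callable returning one sampled item.
    """
    if rng is None:
        rng = random.Random()
    items = list(range(N_ITEMS))
    counts = Counter(draw_one(items, rng) for _ in range(trials))
    observed = [counts[x] for x in items]
    expected = [trials / N_ITEMS] * N_ITEMS
    return chi_square_statistic(observed, expected) <= CHI2_CRITICAL
```

A correct sampler, for example `lambda items, rng: rng.choice(items)`, should pass in about 99.9% of runs, which is where the expected 0.1% spurious failure rate per test case comes from.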

marcusb commented Dec 24, 2024

JMH benchmark results (Intel Core i9-10885H):

[info] Benchmark                                      (collectionSize)  (sampleRate)   Mode  Cnt        Score        Error  Units
[info] ReservoirSamplingBenchmark.timeAlgorithmL                   100         0.001  thrpt    3  1043859.778 ± 466398.508  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                   100          0.01  thrpt    3  1063744.046 ± 122691.603  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                   100           0.1  thrpt    3   296958.161 ±  21380.964  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                 10000         0.001  thrpt    3    28119.689 ±   2855.057  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                 10000          0.01  thrpt    3     9707.608 ±   5480.480  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL                 10000           0.1  thrpt    3     3174.108 ±   1343.076  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL               1000000         0.001  thrpt    3      225.023 ±   1131.246  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL               1000000          0.01  thrpt    3       78.130 ±      4.710  ops/s
[info] ReservoirSamplingBenchmark.timeAlgorithmL               1000000           0.1  thrpt    3       28.881 ±     10.076  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue               100         0.001  thrpt    3   203221.750 ± 179576.474  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue               100          0.01  thrpt    3   208366.903 ±  35692.574  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue               100           0.1  thrpt    3   147470.590 ±   4329.993  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue             10000         0.001  thrpt    3     2230.555 ±     68.464  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue             10000          0.01  thrpt    3     1913.923 ±    221.179  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue             10000           0.1  thrpt    3      975.181 ±     72.015  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue           1000000         0.001  thrpt    3       20.913 ±      1.756  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue           1000000          0.01  thrpt    3       15.828 ±      0.473  ops/s
[info] ReservoirSamplingBenchmark.timePriorityQeueue           1000000           0.1  thrpt    3        4.783 ±      1.291  ops/s
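
To read the speedup off the table, the short Python snippet below divides the Algorithm L throughput by the priority-queue throughput for each parameter pair. The numbers are copied from the JMH output above; the variable names are made up for this sketch.

```python
# Throughput in ops/s, keyed by (collectionSize, sampleRate),
# copied from the JMH output above.
algorithm_l = {
    (100, 0.001): 1043859.778, (100, 0.01): 1063744.046, (100, 0.1): 296958.161,
    (10000, 0.001): 28119.689, (10000, 0.01): 9707.608, (10000, 0.1): 3174.108,
    (1000000, 0.001): 225.023, (1000000, 0.01): 78.130, (1000000, 0.1): 28.881,
}
priority_queue = {
    (100, 0.001): 203221.750, (100, 0.01): 208366.903, (100, 0.1): 147470.590,
    (10000, 0.001): 2230.555, (10000, 0.01): 1913.923, (10000, 0.1): 975.181,
    (1000000, 0.001): 20.913, (1000000, 0.01): 15.828, (1000000, 0.1): 4.783,
}

# Speedup of Algorithm L over the priority queue for each parameter pair.
speedups = {key: algorithm_l[key] / priority_queue[key] for key in algorithm_l}

for (size, rate), ratio in sorted(speedups.items()):
    print(f"{size:>8} {rate:<6} {ratio:5.1f}x")
```

With only 3 measurement iterations per row, several error bars are wide, so these ratios are rough indications rather than precise figures.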

marcusb commented Dec 24, 2024

@regadas

marcusb force-pushed the sampling branch 4 times, most recently from 3d8bd4b to a0b526b (December 24, 2024 19:51)

marcusb commented Dec 24, 2024

[Image: Reservoir Sampling Benchmark]

marcusb force-pushed the sampling branch 2 times, most recently from 036d5c2 to 431d5b7 (December 24, 2024 20:44)

marcusb commented Dec 31, 2024

Optimizing the aggregation for IndexedSeq gives another large performance boost.

[Image: Reservoir Sampling Benchmark(1)]

Optimized overloads of the aggregation methods append/appendAll were
added that operate on IndexedSeqs. An IndexedSeq has efficient random
access, which lets the sampler skip over items without examining each
one, so sublinear runtime can be achieved.
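
The effect of random access can be sketched in Python as follows. This is an illustration of the idea only, not the append/appendAll overloads from this PR: the function name, the helper names, and the returned read counter are all assumptions made for the example.

```python
import math
import random


def _open_unit(rng):
    """Uniform draw from the open interval (0, 1), so log() is always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def _skip_count(rng, w):
    """Geometric number of items to jump over, given acceptance weight w."""
    if w >= 1.0:
        return 0
    if w <= 0.0:
        return math.inf
    ratio = math.log(_open_unit(rng)) / math.log1p(-w)
    return math.floor(ratio) if math.isfinite(ratio) else math.inf


def sample_indexed(items, k, rng=None):
    """Sample up to k items from a sequence with O(1) indexing.

    Returns (sample, reads), where reads counts the elements actually accessed.
    """
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return [], 0
    n = len(items)
    reservoir = [items[i] for i in range(min(k, n))]
    reads = len(reservoir)
    if n <= k:
        return reservoir, reads

    w = math.exp(math.log(_open_unit(rng)) / k)
    i = k - 1  # index of the last element examined
    while True:
        i += _skip_count(rng, w) + 1  # jump straight to the next chosen index
        if i >= n:
            return reservoir, reads
        reservoir[rng.randrange(k)] = items[i]
        reads += 1
        w *= math.exp(math.log(_open_unit(rng)) / k)
```

Because skipped elements are never touched, the number of reads grows roughly like k(1 + ln(n/k)) instead of n. For example, `sample_indexed(range(1_000_000), 10)` typically reads on the order of a hundred elements.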