Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Improve histogram bin logic #18761

Merged
merged 4 commits into from
Nov 19, 2024
Merged

Conversation

mcrumiller
Copy link
Contributor

@mcrumiller mcrumiller commented Sep 16, 2024

Closes #18650.

This is a revamp of hist and how it deals with bin edges when data is not provided, with the intent of better aligning the defaults with numpy. Bins are still left half-open intervals, i.e. of the form (a, b]. Here are the main changes:

  1. If neither bin_count nor bins is provided, 10 bins are created. The bins are all equally-spaced, spanning from the minimum value to the maximum value. The lowest bin's left edge is increase by 0.1% the span of the entire data (in line with pandas, and our existing implementation) to include the lowest value. This way, all values are included in a call to series.hist() with no arguments, or bin_count provided:
    pl.Series([2, 5, 8]).hist(bin_count=4)
    # shape: (4, 3)
    # ┌────────────┬──────────────┬───────┐
    # │ breakpoint ┆ category     ┆ count │
    # │ ---        ┆ ---          ┆ ---   │
    # │ f64        ┆ cat          ┆ u32   │
    # ╞════════════╪══════════════╪═══════╡
    # │ 3.5        ┆ (1.994, 3.5] ┆ 1     │
    # │ 5.0        ┆ (3.5, 5.0]   ┆ 1     │
    # │ 6.5        ┆ (5.0, 6.5]   ┆ 0     │
    # │ 8.0        ┆ (6.5, 8.0]   ┆ 1     │
    # └────────────┴──────────────┴───────┘
  2. If the series has only a single (non-null) value, a span of 1.0 is used, regardless of the number of bins (when bin_counts is used). The default bin count is still 10.
    pl.Series([5]).hist()
    # shape: (10, 3)
    # ┌────────────┬────────────┬───────┐
    # │ breakpoint ┆ category   ┆ count │
    # │ ---        ┆ ---        ┆ ---   │
    # │ f64        ┆ cat        ┆ u32   │
    # ╞════════════╪════════════╪═══════╡
    # │ 4.6        ┆ (4.5, 4.6] ┆ 0     │
    # │ 4.7        ┆ (4.6, 4.7] ┆ 0     │
    # │ 4.8        ┆ (4.7, 4.8] ┆ 0     │
    # │ 4.9        ┆ (4.8, 4.9] ┆ 0     │
    # │ 5.0        ┆ (4.9, 5.0] ┆ 0     │
    # │ 5.1        ┆ (5.0, 5.1] ┆ 1     │
    # │ 5.2        ┆ (5.1, 5.2] ┆ 0     │
    # │ 5.3        ┆ (5.2, 5.3] ┆ 0     │
    # │ 5.4        ┆ (5.3, 5.4] ┆ 0     │
    # │ 5.5        ┆ (5.4, 5.5] ┆ 0     │
    # └────────────┴────────────┴───────┘
  3. If the series has no values, bins range over the unit interval.
    pl.Series([], dtype=pl.Int8).hist()
    # shape: (10, 3)
    # ┌────────────┬────────────┬───────┐
    # │ breakpoint ┆ category   ┆ count │
    # │ ---        ┆ ---        ┆ ---   │
    # │ f64        ┆ cat        ┆ u32   │
    # ╞════════════╪════════════╪═══════╡
    # │ 0.1        ┆ (0.0, 0.1] ┆ 0     │
    # │ 0.2        ┆ (0.1, 0.2] ┆ 0     │
    # │ 0.3        ┆ (0.2, 0.3] ┆ 0     │
    # │ 0.4        ┆ (0.3, 0.4] ┆ 0     │
    # │ 0.5        ┆ (0.4, 0.5] ┆ 0     │
    # │ 0.6        ┆ (0.5, 0.6] ┆ 0     │
    # │ 0.7        ┆ (0.6, 0.7] ┆ 0     │
    # │ 0.8        ┆ (0.7, 0.8] ┆ 0     │
    # │ 0.9        ┆ (0.8, 0.9] ┆ 0     │
    # │ 1.0        ┆ (0.9, 1.0] ┆ 0     │
    # └────────────┴────────────┴───────┘
  4. If bins is provided, it is always respected.
  5. If bins is provided with only a single edge, then it is zero-width, and no values are binned. The result is empty:
    pl.Series([1, 2, 3]).hist(bins=[2])
    # shape: (0, 3)
    # ┌────────────┬──────────┬───────┐
    # │ breakpoint ┆ category ┆ count │
    # │ ---        ┆ ---      ┆ ---   │
    # │ f64        ┆ cat      ┆ u32   │
    # ╞════════════╪══════════╪═══════╡
    # └────────────┴──────────┴───────┘
  6. You cannot provide both bins and bin_counts:
    pl.Series([2, 5, 8]).hist(bin_count=4)
    # polars.exceptions.ComputeError: can only provide one of `bin_count` or `bins`

This implementation is a bit cleaner in that it separates the bin-creation logic from the binning logic. It also includes a check for uniformity of bins, even if bins are provided by the user, enabling the fast path to be used in more cases. The fast path and the slower path are now a bit more obvious in separate functions, and I think that the new slow should be a bit faster than the older slow path. The new fast path revives an additional check that fixes a few floating point issues when dealing with integers but it's a bit improved (performance wise) compared to the old one.

I do think that hist could use some improvements with a include_lower parameter to avoid the "ugly" lowest bin which is extended by 0.1%, but that can be for another day.

@mcrumiller mcrumiller changed the title Add initial test fix: Improve histogram bin logic Sep 16, 2024
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Sep 16, 2024
Copy link

codecov bot commented Sep 18, 2024

Codecov Report

Attention: Patch coverage is 96.02649% with 6 lines in your changes missing coverage. Please review.

Project coverage is 79.58%. Comparing base (8cb7839) to head (c653d16).
Report is 51 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-ops/src/chunked_array/hist.rs 96.02% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18761      +/-   ##
==========================================
+ Coverage   79.55%   79.58%   +0.02%     
==========================================
  Files        1544     1544              
  Lines      213240   213307      +67     
  Branches     2441     2441              
==========================================
+ Hits       169643   169757     +114     
+ Misses      43048    43001      -47     
  Partials      549      549              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@mcrumiller mcrumiller marked this pull request as ready for review September 18, 2024 13:20
@ritchie46
Copy link
Member

@alexander-beedie can you take a look at this one?

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 22, 2024

@alexander-beedie can you take a look at this one?

Looks like an improvement to me, but I'll see if I can run it by one of our developers (the one who spotted the panic) for bonus feedback tomorrow 👍

@Julian-J-S
Copy link
Contributor

Julian-J-S commented Sep 24, 2024

just sharing my thoughts:

Why deviate from numpy

numpy interval is opened the other side.

  • polars result: 1, 1, 0, 1
  • numpy result: 1, 0, 1, 1
pl.Series([2, 5, 8]).hist(bin_count=4)
shape: (4, 3)
┌────────────┬──────────────┬───────┐
│ breakpointcategorycount │
│ ---------   │
│ f64catu32   │
╞════════════╪══════════════╪═══════╡
│ 3.5        ┆ (1.994, 3.5] ┆ 1     │
│ 5.0        ┆ (3.5, 5.0]   ┆ 1     │
│ 6.5        ┆ (5.0, 6.5]   ┆ 0     │
│ 8.0        ┆ (6.5, 8.0]   ┆ 1     │
└────────────┴──────────────┴───────┘


np.histogram([2, 5, 8], bins=4)
(array([1, 0, 1, 1]), array([2. , 3.5, 5. , 6.5, 8. ]))

Return Struct with Edges (feature)

The current "category" is almost unusable since it is a string/cat.
Would love to see a struct with the edges here instead or additionally.
This also would remove the "strange" 1.994 hack",
Then just document how the intervals are defined and return struct with x1 & x2 as edges 😄
You could also completely remove that "breakpoint" column and use the struct values instead. simpler and more performance I would think 🤔

example

┌────────────┬──────────────┬───────┐
│ breakpoint ┆ EDGES        ┆ count │
│ ---        ┆ ---          ┆ ---   │
│ f64        ┆ STRUCT       ┆ u32   │
╞════════════╪══════════════╪═══════╡
│ 3.5        ┆ {2.0, 3.5}   ┆ 1     │
│ 5.0        ┆ {3.5, 5.0}   ┆ 1     │
│ 6.5        ┆ {5.0, 6.5}   ┆ 0     │
│ 8.0        ┆ {6.5, 8.0}   ┆ 1     │
└────────────┴──────────────┴───────┘

I can also open a feature request if this is too much for this PR?

@mcrumiller
Copy link
Contributor Author

@Julian-J-S I believe @MarcoGorelli has an issue open for exactly the point you make. I did not want to address that in this PR as I didn't want to cover too much ground but I wholeheartedly agree.

Regarding your other issue, I think the bounds difference from numpy is to align with pandas. In my opinion, the type of interval is probably not super important to the histogram, but I do agree about the edge bins being ugly.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 30, 2024

Yup, lets split the follow-up here - this PR can improve what is present right now (by fixing the panic and returning more reasonable bins), and then we can look to potentially retool it more comprehensively as a separate (and likely breaking) issue/PR.

Copy link
Collaborator

@alexander-beedie alexander-beedie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, working through my backlog...😅 @ritchie46 The improved behaviour gets my vote; further changes can be left to a subsequent PR. As the function is marked as unstable (and this makes several fixes) I don't see a problem merging 👍

@Julian-J-S
Copy link
Contributor

Julian-J-S commented Nov 14, 2024

Just my opinion/thoughts 😄
I have questions ❓
I try to see everything as "what value to this bring to the polars user?".

  • empty series -> range over unit interval
    • serious question: why? What is the reasoning? ❔
    • I can see no value in this. This is completely arbitrary and the result 100% useless. If I have NO DATA why return 10 unrelated rows with count=0?
    • this can also be dangerous. Imagine you have a bug in the pipeline or corrupt data which leads to all data being filtered out. I would expect to raise an error instead of completing with unrelated results 💥 💥
    • IMO this should raise or an empty results (like shown when only single edge provided)
  • 10 as "magic" correct bin count?
    • as a programmer we should specify what we want. Either the bin_count or the bins
    • there is no default magic correct number.
    • I am on board if we say
      • if no bin_count or bins provided we use an "intelligent algorithm" to determine the intervals but using always 10 seems bad.

On another note:
hist is marked as "unstable".
This is a great chance to rework hist with the many great ideas we have in mind 😄. Otherwise we might annoy the user with multiple breaking changes soon.
#18645
#18761 (comment)

@mcrumiller
Copy link
Contributor Author

I'm happy to update the logic in this PR, but I think we decided first to fix the obviously broken behavior before revamping the "working but odd" behavior.

@alexander-beedie
Copy link
Collaborator

I'm happy to update the logic in this PR, but I think we decided first to fix the obviously broken behavior before revamping the "working but odd" behavior.

Yup; let's get the fixes out first and then follow-up with the behavioural improvements 👌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

hist panics after creating zero bins
4 participants