Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix arrow-json encoding with dictionary including nulls #6503

Merged
merged 3 commits into from
Oct 7, 2024

Conversation

samuelcolvin
Copy link
Contributor

@samuelcolvin samuelcolvin commented Oct 2, 2024

Rationale for this change

While dictionary values are generally not null (it's more efficient to have a null in the keys), there's nothing to stop them from being nulls, and currently datafusion-functions-json generates dictionaries with null values.

Until now they caused arrow-json to panic, that is now fixed.

What changes are included in this PR?

Use make_encoder_impl, not make_encoder in DictionaryEncoder to avoid unnecessary and incorrect nulls.is_none() assertion.

Add a test.

Are there any user-facing changes?

no.


Related: we've copied arrow-json and customised it to support the unions generated by datafusion-functions-json, and also to inline JSON data within JSON based no field metadata, would you be inerested in accepting logic to allow customising of JSON encoding?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 2, 2024
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can revert this move if you prefer, but I think the previous naming was weird.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a nice improvement to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually seems mod.rs naming is no longer the preferred convention: https://doc.rust-lang.org/reference/items/modules.html#module-source-filenames

Note: Prior to rustc 1.30, using mod.rs files was the way to load a module with nested children. It is encouraged to use the new naming convention as it is more consistent, and avoids having many files named mod.rs within a project.

I don't have any particular strong feelings about this, as it would be nice to keep consistent (as reader already has mod.rs) 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree keeping consistency is a good thing

@samuelcolvin
Copy link
Contributor Author

See #5318 (comment) for discussion of the nulls.is_none() assertion.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @samuelcolvin and @mbrobbel

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a nice improvement to me

arrow-json/src/writer/encoder.rs Show resolved Hide resolved
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually seems mod.rs naming is no longer the preferred convention: https://doc.rust-lang.org/reference/items/modules.html#module-source-filenames

Note: Prior to rustc 1.30, using mod.rs files was the way to load a module with nested children. It is encouraged to use the new naming convention as it is more consistent, and avoids having many files named mod.rs within a project.

I don't have any particular strong feelings about this, as it would be nice to keep consistent (as reader already has mod.rs) 👍

Comment on lines +1809 to +1835
#[test]
fn test_writer_null_dict() {
let keys = Int32Array::from_iter(vec![Some(0), None, Some(1)]);
let values = Arc::new(StringArray::from_iter(vec![Some("a"), None]));
let dict = DictionaryArray::new(keys, values);

let schema = SchemaRef::new(Schema::new(vec![Field::new(
"my_dict",
DataType::Dictionary(DataType::Int32.into(), DataType::Utf8.into()),
true,
)]));

let array = Arc::new(dict) as ArrayRef;
let batch = RecordBatch::try_new(schema, vec![array]).unwrap();

let mut json = Vec::new();
let write_builder = WriterBuilder::new().with_explicit_nulls(true);
let mut writer = write_builder.build::<_, JsonArray>(&mut json);
writer.write(&batch).unwrap();
writer.close().unwrap();

let json_str = str::from_utf8(&json).unwrap();
assert_eq!(
json_str,
r#"[{"my_dict":"a"},{"my_dict":null},{"my_dict":null}]"#
)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit a117eed into apache:master Oct 7, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 7, 2024

Thanks @samuelcolvin and @Jefffrey and @mbrobbel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants