Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV parsing does not take into account non raw_strings (it fails if ',', '[' or ']' are inside quotes) #1445

Open
sjanel opened this issue Nov 20, 2024 · 10 comments

Comments

@sjanel
Copy link
Collaborator

sjanel commented Nov 20, 2024

Hi !

I wanted to use glaze to parse the currencies from this CSV with this structure:

struct CurrencyCSV {
  vector<string> Entity;
  vector<string> Currency;
  vector<string> AlphabeticCode;
  vector<string> NumericCode;
  vector<string> MinorUnit;
  vector<string> WithdrawalDate;
};

However, the parser interprets , (and [ ]) if they are inside brackets in a value. For instance, these lines fail:

"MOLDOVA, REPUBLIC OF",Russian Ruble,RUR,810,,1993-12
"FALKLAND ISLANDS (THE) [MALVINAS]",Falkland Islands Pound,FKP,238,2,

I think that this behavior could be expected if .raw_string is true, but it should work if we set .raw_string to false.

WDYT ?

If I have the time, I will try to make a PR.

@sjanel sjanel changed the title CSV parsing does not take into account non raw_strings (it fails if "," are inside quotes) CSV parsing does not take into account non raw_strings (it fails if ',', '[' or ']' are inside quotes) Nov 20, 2024
@stephenberry
Copy link
Owner

Actively working on this issue here: #1446

@stephenberry stephenberry linked a pull request Nov 21, 2024 that will close this issue
@stephenberry
Copy link
Owner

stephenberry commented Nov 21, 2024

I think the string parsing has been fixed. It also supports escaped quotes per the CSV specification, which were needed to parse this document. However, there is a bug where it parses the WithdrawalDate into the MinorUnit. I have to work on some other stuff at the moment. If you want to try to fix this from the csv_currency branch, that would be great! Otherwise, I'll get back to this when I'm able.

sjanel added a commit to sjanel/glaze that referenced this issue Nov 21, 2024
@sjanel
Copy link
Collaborator Author

sjanel commented Nov 21, 2024

I think the string parsing has been fixed. It also supports escaped quotes per the CSV specification, which were needed to parse this document. However, there is a bug where it parses the WithdrawalDate into the MinorUnit. I have to work on some other stuff at the moment. If you want to try to fix this from the csv_concurrency branch, that would be great! Otherwise, I'll get back to this when I'm able.

I think I found the bug, and fixed it in this commit. Feel free to take it into your branch. It's because we are skipping the trailing ',' in both the from<CSV for string_t and in the main loop of line 616, whereas for other type parsing we don't skip the commas, so I decided to just remove it from the string_t and it seems to work.

However, I did not see any test for rowwise parsing and I cannot make my commented test pass (see csv_test.cpp:606-607). Am I calling it incorrectly ?

@stephenberry
Copy link
Owner

Thanks, I got your fix on the branch so that it parses column wise correctly. I'll look at your commented test now.

@stephenberry
Copy link
Owner

I updated the rowwise test on the csv_currency branch. The first issue was that it was still reading in a column wise CSV file. Now it write out the data in rowwise format and then tries to read that back in. The outstanding issue is that our string writing for CSV does not escape quotes.

@stephenberry stephenberry removed a link to a pull request Nov 21, 2024
@stephenberry
Copy link
Owner

I'm going to merge the current fixes for column wise support. But, I'm going to keep this issue alive until rowwise support and escaped writing has been added. I just want to get these changes merged as a first step.

@stephenberry
Copy link
Owner

I could add you as a contributor to Glaze if you'd like to make branches directly in Glaze for these CSV fixes.

@sjanel
Copy link
Collaborator Author

sjanel commented Nov 22, 2024

I updated the rowwise test on the csv_currency branch. The first issue was that it was still reading in a column wise CSV file. Now it write out the data in rowwise format and then tries to read that back in. The outstanding issue is that our string writing for CSV does not escape quotes.

oh, my bad! I thought it only changed the way the csv file was parsed internally, I did not know the existence of rowwise CSV actually! Anyway, it would be great to have an example in the README and in the unit test.

@sjanel
Copy link
Collaborator Author

sjanel commented Nov 22, 2024

I could add you as a contributor to Glaze if you'd like to make branches directly in Glaze for these CSV fixes.

Thanks for the proposition, why not, I really enjoy glaze so if you need help for some items I would be glad to help when I have the time.

@stephenberry
Copy link
Owner

Cool, I sent you an invite. I find it easier to make pull requests and work from within a repository. No pressure to contribute, but any help is appreciated.

The CSV code needs some work, particularly with writing strings with escaped quotes and handling CSV files without headers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants