Improve handling of prettified responses without correct content-type encoding #1110
Conversation
How is this different from accessing `response.apparent_encoding`? The main problem with `apparent_encoding` is that it breaks our output streaming behavior. How are we making sure the response body is still processed in chunks as opposed to being buffered first?
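To make the streaming concern concrete, here is a minimal sketch (not HTTPie's actual code) of chunk-by-chunk decoding once an encoding has been chosen; `chunks` stands in for the response's byte iterator:

```python
import codecs

def decode_stream(chunks, encoding='utf-8'):
    # Decode each chunk as it arrives instead of buffering the whole body.
    # The incremental decoder carries state across chunk boundaries, so a
    # multi-byte character split between two chunks still decodes correctly.
    decoder = codecs.getincrementaldecoder(encoding)(errors='replace')
    for chunk in chunks:
        yield decoder.decode(chunk)
    yield decoder.decode(b'', final=True)
```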
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1110      +/-   ##
==========================================
- Coverage   97.28%   97.18%   -0.11%
==========================================
  Files          67       71       +4
  Lines        4235     4475     +240
==========================================
+ Hits         4120     4349     +229
- Misses        115      126      +11
```

Continue to review full report at Codecov.
The workaround/solution is done in three parts:

The only potential problem I could see with part 2 is that the guessed encoding is not the right one (because there is not enough data to properly guess it). I think we are in quite good shape with the new implementation, and I do not have a solution for such a hypothetical problem for now. I had a hard time adding pure stream tests, so I simply added tests to cover the reported problems; coverage is still good (I did not introduce uncovered code).
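For illustration, a first-chunk guess could look like the sketch below, assuming the charset-normalizer library and its `from_bytes(...).best()` API; the helper name and fallback are illustrative:

```python
from charset_normalizer import from_bytes

def guess_encoding(first_chunk: bytes, fallback: str = 'utf-8') -> str:
    # With only one chunk, a short or ASCII-heavy sample can easily
    # produce the wrong answer -- the hypothetical problem noted above.
    match = from_bytes(first_chunk).best()
    return match.encoding if match is not None else fallback
```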
Does this ignore the charset specified in `Content-Type`? I'm also afraid that relying on the first chunk only (i.e., the first line) won't be very reliable. I'd imagine applying some heuristics to each chunk until we build confidence (e.g., look for a non-whitespace / non-ASCII chunk). This also fails:

```python
import responses  # the response-mocking library used in HTTPie's test suite
# `http` is HTTPie's test helper for invoking the CLI


# windows-1250 response with a correctly specified charset
@responses.activate
def test_POST_encoding_detection_from_content():
    url = 'http://example.org'  # Note: URL never fetched
    body = 'Všichni lidé jsou si rovni.'.encode('windows-1250')
    responses.add(responses.POST, url, body=body,
                  content_type='text/plain; charset=windows-1250')
    r = http('--form', 'POST', url)
    assert 'Všichni lidé jsou si rovni.' in r
```

The first line can also be something like …
We do not want to use the whole body; that would kill the streaming behavior. A concern I had was: OK, let's guess the encoding from chunk 1, and it returns UTF-8. If I then try to guess the encoding using more data (more chunks) and the detected encoding changes, that will be a mess to handle. I'm not sure how to work around that. What do you have in mind when you say "use as many bytes as possible"? I mean, the encoding is needed right after the first chunk; I do not see yet how to get more data without storing the full body (and losing the streaming behavior). Even "part of the body" is subjective; maybe fetching and storing the first N chunks would lessen the problem, as sketched below. I completely agree that the currently proposed approach is too naive, let's improve it :)
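A minimal sketch of that "first N chunks" idea, with illustrative names (`detect` stands in for whatever guessing routine is used):

```python
import codecs

def iter_decoded(chunks, detect, min_bytes=4096):
    # Buffer only until enough bytes are available for a confident guess,
    # then commit to that encoding and stream the rest through untouched.
    chunks = iter(chunks)
    head = b''
    for chunk in chunks:
        head += chunk
        if len(head) >= min_bytes:
            break
    decoder = codecs.getincrementaldecoder(detect(head))(errors='replace')
    yield decoder.decode(head)
    for chunk in chunks:
        yield decoder.decode(chunk)
    yield decoder.decode(b'', final=True)
```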
Indeed. I'll fix it.
I pushed a patch to take the provided charset into account first, and then fall back to detection if none is specified or if the encoding detected by …
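Roughly, that precedence could be sketched as follows (hypothetical helper; `detect` again stands in for the content-based guess):

```python
import codecs

def choose_encoding(declared: str, sample: bytes, detect) -> str:
    # Trust an explicitly declared charset when it names a real codec;
    # only fall back to content-based detection otherwise.
    if declared:
        try:
            codecs.lookup(declared)
            return declared
        except LookupError:
            pass
    return detect(sample)
```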
I added streaming tests. And as you both pointed out, …
I reworked the code to consume at least 512 bytes (not yet pushed). I'll integrate your test files to demonstrate (or not) the robustness of the new implementation. Thanks for your input, very valuable :)
@Ousret I pushed my changes. I am not sure the new tests are good. Have a look at 7dd8b63 and the new test … To launch the tests: …
I recommend taking a step back and really thinking about the problem, the use cases, and the different streaming-related modes in which HTTPie operates.

Q: What problem are we trying to solve?
Q: How can we solve the problem?
Q: What are the constraints?
Q: Okay, but are there any scenarios in which HTTPie fully loads a response before outputting it, and altering data is okay?
Q: Right, so what if we focus only on the buffered and … modes?
Q: Okay, if we go with this, what would the practical implications be?
Q: Is there anything else that should be considered?
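One way to read the buffered-only suggestion, as a sketch rather than HTTPie's actual design (`detect_encoding` is hypothetical, e.g. the `guess_encoding()` sketch above; `response` is a requests response):

```python
def render_body(response, prettify: bool):
    if prettify:
        # Prettified output is fully buffered anyway, so inspecting the
        # whole body and re-decoding it cannot break streaming.
        body = response.content
        return body.decode(detect_encoding(body), errors='replace')
    # Streaming mode: pass raw chunks through untouched.
    return response.iter_content(chunk_size=8192)
```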
Successfully waited for python-charset-normalizer-2.0.4-1.fc33 to appear in the f33-build repo. If you include #1119, you should be able to run the tests on Fedora 33 and 34.
The current Fedora problem is: …
Yes, I was checking whether it works with that new version. I'll revert that change.
So I adapted the current implementation: …

It seems to fix all the reported issues, but I am still not very satisfied 🤔 @jakubroztocil can you think of other scenarios that would break with those changes? I would be happy to add more specific tests to ensure there are no regressions.
It uses …
Supersedes #594.
Fixes #1022 and related already-closed issues.
Fixes #358.
Fixes #627.