
Improve handling of prettified responses without correct content-type encoding #1110

Merged: 16 commits into httpie:master, Sep 29, 2021

Conversation

@BoboTiG (Contributor) commented Jul 16, 2021

Supersedes #594.
Fixes #1022 and related already-closed issues.
Fixes #358.
Fixes #627.

@BoboTiG BoboTiG requested a review from jkbrzt July 16, 2021 14:54
@BoboTiG BoboTiG marked this pull request as draft July 17, 2021 16:36
@jkbrzt (Member) left a comment

How is this different from accessing response.apparent_encoding? The main problem with apparent_encoding is that it breaks our output streaming behavior.

How are we making sure the response body is still processed in chunks as opposed to being buffered first?

@codecov-commenter commented Jul 29, 2021

Codecov Report

Merging #1110 (5e7a37d) into master (4d7d6b6) will decrease coverage by 0.10%.
The diff coverage is 96.29%.

❗ Current head 5e7a37d differs from pull request most recent head 58ab8aa. Consider uploading reports for the commit 58ab8aa to get more accurate results

@@            Coverage Diff             @@
##           master    #1110      +/-   ##
==========================================
- Coverage   97.28%   97.18%   -0.11%     
==========================================
  Files          67       71       +4     
  Lines        4235     4475     +240     
==========================================
+ Hits         4120     4349     +229     
- Misses        115      126      +11     
Impacted Files Coverage Δ
httpie/models.py 100.00% <ø> (+2.63%) ⬆️
tests/test_binary.py 100.00% <ø> (ø)
tests/test_output.py 99.47% <ø> (ø)
tests/conftest.py 75.75% <50.00%> (-11.20%) ⬇️
tests/test_ssl.py 89.88% <63.63%> (-5.05%) ⬇️
httpie/output/formatters/colors.py 92.66% <83.33%> (-0.92%) ⬇️
httpie/output/utils.py 97.87% <97.87%> (ø)
httpie/__init__.py 100.00% <100.00%> (ø)
httpie/cli/argparser.py 95.60% <100.00%> (-0.76%) ⬇️
httpie/cli/constants.py 100.00% <100.00%> (ø)
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9c89c70...58ab8aa.

@BoboTiG (Contributor, Author) commented Jul 29, 2021

> How is this different from accessing response.apparent_encoding? The main problem with apparent_encoding is that it breaks our output streaming behavior.
>
> How are we making sure the response body is still processed in chunks as opposed to being buffered first?

The workaround/solution comes in three parts:

  1. We do not need the response encoding to display headers. The old way was a blocker for the new implementation: to display headers, we would fetch the body just to get its encoding. Now, since headers should contain only US-ASCII characters, we use the terminal encoding, falling back to UTF-8 (which seems the most robust choice).
  2. It seems we never need to fetch the whole body directly; we use either HTTPResponse.iter_body() or HTTPResponse.iter_lines(). I reworked both to store the first chunk in HTTPResponse._body and use it to guess the encoding of the whole body. Why does this work? Because nothing needs to be encoded until after the first chunk has been retrieved, so we can guess the encoding from that first chunk.
  3. The encoding is detected only when really needed. The old implementation retrieved the encoding in EncodedStream.__init__(), which is not possible with the new one. I introduced EncodedStream.output_encoding, which manages the detection only when required and only if not already done (see the sketch below).

The only potential problem I can see with part 2 is that the guessed encoding may not be the right one (because the first chunk may not contain enough data for a correct guess). I think the new implementation is quite good, and I do not have a solution for that hypothetical problem for now.
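A minimal sketch of parts 2 and 3 combined, assuming charset_normalizer as the detector; the class shape and attribute names are illustrative, not the PR's exact code:

from charset_normalizer import from_bytes

class EncodedStream:
    def __init__(self, first_chunk: bytes, declared_encoding: str = None):
        self._first_chunk = first_chunk      # stored by iter_body()/iter_lines()
        self._declared = declared_encoding   # charset from Content-Type, if any
        self._output_encoding = None

    @property
    def output_encoding(self) -> str:
        # Detect lazily: only on first access, and only if not already done.
        if self._output_encoding is None:
            if self._declared:
                self._output_encoding = self._declared
            else:
                best = from_bytes(self._first_chunk).best()
                self._output_encoding = best.encoding if best else 'utf-8'
        return self._output_encoding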

I had a hard time adding pure streaming tests, so I simply added tests covering the reported problems; the covered code is still good (I did not introduce uncovered code).

@BoboTiG BoboTiG marked this pull request as ready for review July 29, 2021 15:59
@jkbrzt (Member) commented Jul 30, 2021

Does this ignore the Content-Type charset when specified? I believe the auto-detection should only kick in when the message doesn’t say.

I’m also afraid that relying on the first chunk only (i.e., the first line) won’t be very reliable. I’d imagine applying some heuristics to each chunk until we build confidence (e.g., look for a non-whitespace / non-ASCII chunk).

This also fails:

import responses  # HTTP-mocking library used by the test suite

from .utils import http  # HTTPie's test helper for invoking the CLI (project convention may differ)

# windows-1250 response with a correctly specified charset
@responses.activate
def test_POST_encoding_detection_from_content():
    url = 'http://example.org'  # Note: URL never fetched (mocked by responses)
    body = 'Všichni lidé jsou si rovni.'.encode('windows-1250')
    responses.add(responses.POST, url, body=body,
                  content_type='text/plain; charset=windows-1250')
    r = http('--form', 'POST', url)
    assert 'Všichni lidé jsou si rovni.' in r

The first line can also be something like <?xml version="1.0" encoding="windows-1250"?>, for example, which the current implementation would detect as UTF-8 (even with a properly set charset in Content-Type).
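For illustration, here is that failure mode made concrete, assuming charset_normalizer as the detector (the exact guess may vary by version, but the declaration is pure ASCII, so no content-based guess over that line alone can find windows-1250):

from charset_normalizer import from_bytes

first_line = b'<?xml version="1.0" encoding="windows-1250"?>'
best = from_bytes(first_line).best()
print(best.encoding if best else None)  # an ASCII-compatible guess, never windows-1250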


@BoboTiG (Contributor, Author) commented Jul 30, 2021

> I would not recommend building confidence yourself using chunks, but rather using as many bytes as possible; performance should not be a concern most of the time. If you can use the whole body, do it to ensure the best possible guess.

We do not want to use the whole body; that would kill the streaming behavior.

A concern I had: OK, let's guess the encoding from chunk 1. It returns UTF-8. But if I then try to guess the encoding using more data (more chunks) and the detected encoding changes, that will be a mess to handle.

I'm not sure how to work around that.

What do you have in mind when you say "use as many bytes as possible"? The encoding is needed right after the first chunk, and I do not yet see how to get more data without storing the full body (and losing the streaming behavior). Even "part of the body" is subjective; maybe fetching and storing the first N chunks would reduce the problem.

I completely agree that the currently proposed approach is too naive, let's improve it :)

@BoboTiG (Contributor, Author) commented Jul 30, 2021

> Does this ignore the Content-Type charset when specified? I believe the auto-detection should only kick in when the message doesn’t say.

Indeed. I'll fix it.


@BoboTiG (Contributor, Author) commented Jul 31, 2021

I pushed a patch that takes the provided charset into account first, then falls back to detection if none is specified or if the encoding detected by requests is empty.

@BoboTiG (Contributor, Author) commented Aug 2, 2021

I added streaming tests. And, as you both pointed out, test_encoding_detection_from_content_with_no_declared_encoding_at_all() is failing because, for now, we rely on the first chunk of data to guess the encoding: ascii is detected, and it does not work.


@BoboTiG (Contributor, Author) commented Aug 3, 2021

I reworked the code to consume at least 512 bytes before guessing (not yet pushed; a sketch of the idea follows). I'll integrate your test files to demonstrate (or not) the robustness of the new implementation. Thanks for your input, very valuable :)
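A minimal sketch of that approach, assuming charset_normalizer and a 512-byte threshold (the helper name and parameters are hypothetical, not the PR's code):

import codecs

from charset_normalizer import from_bytes

def iter_decoded(chunks, min_bytes=512, fallback='utf-8'):
    """Buffer chunks until at least min_bytes are seen, guess once, then keep streaming."""
    it = iter(chunks)
    buffered = b''
    for chunk in it:
        buffered += chunk
        if len(buffered) >= min_bytes:
            break
    best = from_bytes(buffered).best()
    encoding = best.encoding if best else fallback
    # An incremental decoder copes with multi-byte sequences split across chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors='replace')
    yield decoder.decode(buffered)
    for chunk in it:
        yield decoder.decode(chunk)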

@BoboTiG (Contributor, Author) commented Aug 3, 2021

@Ousret I pushed my changes. I am not sure the new tests are good.

Have a look at 7dd8b63 and the new test test_encoding_detection_from_content_with_no_declared_encoding_at_all. It uses from_bytes(file.read_bytes()).best().encoding to compare what HTTPie found with what charset_normalizer would find using the entire file. But from what I see, the results are not good (sometimes the encoding is None). Could you tell me what I've done wrong 🙏?

To run the tests:

$ git submodule update
$ PYTEST_ADDOPTS="-k test_encoding_detection_from_content_with_no_declared_encoding_at_all" make test


@jkbrzt (Member) commented Aug 3, 2021

I recommend taking a step back and really thinking about the problem, the use cases, and the different streaming-related modes in which HTTPie operates.

Q: What problem are we trying to solve?
A: Poor user experience when a response doesn’t specify charset, and the output gets misencoded.

Q: How can we solve the problem?
A: Use a charset detection library as a fallback.

Q: What are the constraints?
A: HTTPie uses a streaming model, which is incompatible with robust charset detection. We also don’t alter data unless the user asks for it.

Q: Okay, but are there any scenarios in which HTTPie fully loads a response before outputting it and altering data is okay?
A: Yes. When “prettifying” a response, it’s buffered first. Also, with --stream, each line can be considered a standalone document.

Q: Right, so what if we focus only on the buffered and --stream modes?
A: Yes, in fact, that would capture the majority of use cases. One can also argue that charset detection naturally fits the prettification concept.

Q: Okay, if we go with this, what would the practical implications be?
A: When a user makes a simple request on the terminal (http pie.dev/get), the charset detection fallback would be used (by default, we prettify, therefore buffer). Same with --stream. If the user disables prettifying (http --pretty=none pie.dev/get), we stream, so no auto-detection is possible (i.e., no change in behavior here).

Q: Is there anything else that should be considered?
A: Yes, this change alters a core behavior, so it needs to be approached holistically, thinking through all the possible scenarios. Perhaps the same needs to be applied to requests, etc. This Q&A is not comprehensive. The behavior also needs to be documented (a new section on charset handling would be a good idea).
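Read literally, the Q&A maps onto a small decision like this (a sketch with hypothetical names, not HTTPie's actual API):

def charset_detection_mode(pretty: str, stream: bool) -> str:
    """Map HTTPie's output modes onto charset-detection strategies."""
    if stream:
        return 'per-line'    # with --stream, each line is a standalone document
    if pretty != 'none':
        return 'whole-body'  # prettified output is buffered first, so a full guess is safe
    return 'off'             # raw streaming: no buffering, no auto-detection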

@hroncok (Contributor) commented Aug 4, 2021

Successfully waited for python-charset-normalizer-2.0.4-1.fc33 to appear in the f33-build repo
Successfully waited for python-charset-normalizer-2.0.4-1.fc34 to appear in the f34-build repo

If you include #1119, you should be able to run the tests on Fedora 33 and 34.

@hroncok (Contributor) commented Aug 5, 2021

The current Fedora problem is:

No matching package to install: 'python3dist(requests) >= 2.26'
No matching package to install: 'python3dist(requests[socks]) >= 2.26'

@BoboTiG (Contributor, Author) commented Aug 5, 2021

Yes, I was checking whether it works with that new version. I'll revert that change.

@BoboTiG (Contributor, Author) commented Aug 5, 2021

So I adapted the current implementation:

  • When the Content-Type header is set: no behavior change.

  • For streamed responses, the encoding is guessed from the content retrieved so far (best effort). That means the detected encoding can differ from one streamed line to the next (see the sketch below).

It seems to fix all the reported issues, but I am still not very satisfied 🤔

@jakubroztocil do you see other scenarios that would be broken by these changes? I would be happy to add more specific tests to ensure no regressions.
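A sketch of that best-effort behavior, assuming charset_normalizer; the helper is hypothetical, not the merged code:

from charset_normalizer import from_bytes

def decode_streamed_lines(lines, declared=None):
    """Decode each line, guessing from all the content retrieved so far."""
    seen = b''
    for line in lines:
        seen += line
        if declared:                       # a declared charset always wins
            encoding = declared
        else:
            best = from_bytes(seen).best()  # best effort over the content so far
            encoding = best.encoding if best else 'utf-8'
        # NB: without a declared charset, the guess can change between lines.
        yield line.decode(encoding, errors='replace')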

@jkbrzt (Member) commented Aug 6, 2021

> http pie.dev/get will use the EncodedStream class. The body is always streamed line by line, which leads to the same issues as with --stream. It is the same scenario if we pass --pretty=none: each line needs to be decoded before being buffered.
>
> --pretty=all|colors|format will use the BufferedPrettyStream class; here, as you said, the body is buffered.

It uses BufferedPrettyStream because --pretty=all is the default for terminal output. https://httpie.io/docs#terminal-output

@BoboTiG BoboTiG changed the title Improve handling of responses without correct content-type charset Improve handling of responses without correct content-type encoding Sep 28, 2021
@BoboTiG BoboTiG changed the title Improve handling of responses without correct content-type encoding Improve handling of prettified responses without correct content-type encoding Sep 29, 2021
@jkbrzt jkbrzt merged commit 71adcd9 into httpie:master Sep 29, 2021
@BoboTiG BoboTiG deleted the mickael/dev-169-properly-detecting-response-encoding branch September 29, 2021 19:09