
Improve handling of prettified responses without correct content-type encoding #1110

Merged: 16 commits into httpie:master, Sep 29, 2021

Conversation

@BoboTiG (Contributor) commented Jul 16, 2021

Supersedes #594.
Fixes #1022 and related already-closed issues.
Fixes #358.
Fixes #627.

@BoboTiG BoboTiG requested a review from jkbrzt July 16, 2021 14:54
@BoboTiG BoboTiG marked this pull request as draft July 17, 2021 16:36
@jkbrzt (Member) left a comment

How is this different from accessing response.apparent_encoding? The main problem with apparent_encoding is that it breaks our output streaming behavior.

How are we making sure the response body is still processed in chunks as opposed to being buffered first?

@codecov-commenter commented Jul 29, 2021

Codecov Report

Merging #1110 (5e7a37d) into master (4d7d6b6) will decrease coverage by 0.10%.
The diff coverage is 96.29%.

❗ Current head 5e7a37d differs from pull request most recent head 58ab8aa. Consider uploading reports for the commit 58ab8aa to get more accurate results

@@            Coverage Diff             @@
##           master    #1110      +/-   ##
==========================================
- Coverage   97.28%   97.18%   -0.11%     
==========================================
  Files          67       71       +4     
  Lines        4235     4475     +240     
==========================================
+ Hits         4120     4349     +229     
- Misses        115      126      +11     
Impacted Files Coverage Δ
httpie/models.py 100.00% <ø> (+2.63%) ⬆️
tests/test_binary.py 100.00% <ø> (ø)
tests/test_output.py 99.47% <ø> (ø)
tests/conftest.py 75.75% <50.00%> (-11.20%) ⬇️
tests/test_ssl.py 89.88% <63.63%> (-5.05%) ⬇️
httpie/output/formatters/colors.py 92.66% <83.33%> (-0.92%) ⬇️
httpie/output/utils.py 97.87% <97.87%> (ø)
httpie/__init__.py 100.00% <100.00%> (ø)
httpie/cli/argparser.py 95.60% <100.00%> (-0.76%) ⬇️
httpie/cli/constants.py 100.00% <100.00%> (ø)
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9c89c70...58ab8aa.

@BoboTiG (Contributor, Author) commented Jul 29, 2021

> How is this different from accessing response.apparent_encoding? The main problem with apparent_encoding is that it breaks our output streaming behavior.
>
> How are we making sure the response body is still processed in chunks as opposed to being buffered first?

The workaround/solution comes in three parts:

  1. We do not need the response encoding to display headers. The old way was a blocker for the new implementation: to display headers, we would fetch the body just to get its encoding. Now, since headers should contain only US-ASCII characters, we use the terminal encoding, falling back to UTF-8 (which seems the most robust choice).
  2. It seems we never need to fetch the whole body directly; we use either HTTPResponse.iter_body() or HTTPResponse.iter_lines(). I reworked both to store the first chunk in HTTPResponse._body and use it to guess the encoding of the whole body. Why does this work? Because nothing needs to be encoded until after the first chunk has been retrieved, so we can guess the encoding from that first chunk.
  3. The encoding is detected only when really needed. The old implementation retrieved the encoding in EncodedStream.__init__(), which is not possible with the new one. I introduced EncodedStream.output_encoding, which manages the detection only when required and only if not already done (see the sketch below).

The only potential problem I can see with part 2 is that the guessed encoding may not be the right one (because the first chunk may not contain enough data for a correct guess). I think the new implementation is quite good, and I do not have a solution for that hypothetical problem for now.
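A minimal sketch of parts 2 and 3 combined, assuming charset_normalizer as the detector; the class shape and attribute names are illustrative, not the PR's exact code:

from charset_normalizer import from_bytes

class EncodedStream:
    def __init__(self, first_chunk: bytes, declared_encoding: str = None):
        self._first_chunk = first_chunk      # stored by iter_body()/iter_lines()
        self._declared = declared_encoding   # charset from Content-Type, if any
        self._output_encoding = None

    @property
    def output_encoding(self) -> str:
        # Detect lazily: only on first access, and only if not already done.
        if self._output_encoding is None:
            if self._declared:
                self._output_encoding = self._declared
            else:
                best = from_bytes(self._first_chunk).best()
                self._output_encoding = best.encoding if best else 'utf-8'
        return self._output_encoding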

I had a hard time adding pure streaming tests, so I simply added tests covering the reported problems; the covered code is still good (I did not introduce uncovered code).

@BoboTiG BoboTiG marked this pull request as ready for review July 29, 2021 15:59
@jkbrzt (Member) commented Jul 30, 2021

Does this ignore the Content-Type charset when specified? I believe the auto-detection should only kick in when the message doesn’t say.

I’m also afraid that relying on the first chunk only (i.e., the first line) won’t be very reliable. I’d imagine applying some heuristics to each chunk until we build confidence (e.g., look for a non-whitespace / non-ASCII chunk).

This also fails:

import responses  # HTTP-mocking library used by the test suite

from .utils import http  # HTTPie's test helper for invoking the CLI (project convention may differ)

# windows-1250 response with a correctly specified charset
@responses.activate
def test_POST_encoding_detection_from_content():
    url = 'http://example.org'  # Note: URL never fetched (mocked by responses)
    body = 'Všichni lidé jsou si rovni.'.encode('windows-1250')
    responses.add(responses.POST, url, body=body,
                  content_type='text/plain; charset=windows-1250')
    r = http('--form', 'POST', url)
    assert 'Všichni lidé jsou si rovni.' in r

The first line can also be something like <?xml version="1.0" encoding="windows-1250"?>, for example, which the current implementation would detect as UTF-8 (even with a properly set charset in Content-Type).
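For illustration, here is that failure mode made concrete, assuming charset_normalizer as the detector (the exact guess may vary by version, but the declaration is pure ASCII, so no content-based guess over that line alone can find windows-1250):

from charset_normalizer import from_bytes

first_line = b'<?xml version="1.0" encoding="windows-1250"?>'
best = from_bytes(first_line).best()
print(best.encoding if best else None)  # an ASCII-compatible guess, never windows-1250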


@BoboTiG (Contributor, Author) commented Jul 30, 2021

> I would not recommend building confidence yourself using chunks, but rather using as many bytes as possible; performance should not be a concern most of the time. If you can use the whole body, do it to ensure the best possible guess.

We do not want to use the whole body; that would kill the streaming behavior.

A concern I had: OK, let's guess the encoding from chunk 1. It returns UTF-8. But if I then try to guess the encoding using more data (more chunks) and the detected encoding changes, that will be a mess to handle.

I'm not sure how to work around that.

What do you have in mind when you say "use as many bytes as possible"? The encoding is needed right after the first chunk, and I do not yet see how to get more data without storing the full body (and losing the streaming behavior). Even "part of the body" is subjective; maybe fetching and storing the first N chunks would reduce the problem.

I completely agree that the currently proposed approach is too naive, let's improve it :)

@BoboTiG (Contributor, Author) commented Jul 30, 2021

> Does this ignore the Content-Type charset when specified? I believe the auto-detection should only kick in when the message doesn’t say.

Indeed. I'll fix it.


@BoboTiG (Contributor, Author) commented Jul 31, 2021

I pushed a patch that takes the provided charset into account first, then falls back to detection if none is specified or if the encoding detected by requests is empty.

@BoboTiG (Contributor, Author) commented Aug 2, 2021

I added streaming tests. And, as you both pointed out, test_encoding_detection_from_content_with_no_declared_encoding_at_all() is failing because, for now, we rely on the first chunk of data to guess the encoding: ascii is detected, and it does not work.


@BoboTiG (Contributor, Author) commented Aug 3, 2021

I reworked the code to consume at least 512 bytes before guessing (not yet pushed; a sketch of the idea follows). I'll integrate your test files to demonstrate (or not) the robustness of the new implementation. Thanks for your input, very valuable :)
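A minimal sketch of that approach, assuming charset_normalizer and a 512-byte threshold (the helper name and parameters are hypothetical, not the PR's code):

import codecs

from charset_normalizer import from_bytes

def iter_decoded(chunks, min_bytes=512, fallback='utf-8'):
    """Buffer chunks until at least min_bytes are seen, guess once, then keep streaming."""
    it = iter(chunks)
    buffered = b''
    for chunk in it:
        buffered += chunk
        if len(buffered) >= min_bytes:
            break
    best = from_bytes(buffered).best()
    encoding = best.encoding if best else fallback
    # An incremental decoder copes with multi-byte sequences split across chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors='replace')
    yield decoder.decode(buffered)
    for chunk in it:
        yield decoder.decode(chunk)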

@BoboTiG (Contributor, Author) commented Aug 3, 2021

@Ousret I pushed my changes. I am not sure the new tests are good.

Have a look at 7dd8b63 and the new test test_encoding_detection_from_content_with_no_declared_encoding_at_all. It uses from_bytes(file.read_bytes()).best().encoding to compare what HTTPie found with what charset_normalizer would find using the entire file. But from what I see, the results are not good (sometimes the encoding is None). Could you tell me what I've done wrong 🙏?

To run the tests:

$ git submodule update
$ PYTEST_ADDOPTS="-k test_encoding_detection_from_content_with_no_declared_encoding_at_all" make test


@jkbrzt (Member) commented Aug 3, 2021

I recommend taking a step back and really thinking about the problem, the use cases, and the different streaming-related modes in which HTTPie operates.

Q: What problem are we trying to solve?
A: Poor user experience when a response doesn’t specify charset, and the output gets misencoded.

Q: How can we solve the problem?
A: Use a charset detection library as a fallback.

Q: What are the constraints?
A: HTTPie uses a streaming model, which is incompatible with robust charset detection. We also don’t alter data unless the user asks for it.

Q: Okay, but are there any scenarios in which HTTPie fully loads a response before outputting it and altering data is okay?
A: Yes. When “prettifying” a response, it’s buffered first. Also, with --stream, each line can be considered a standalone document.

Q: Right, so what if we focus only on the buffered and --stream modes?
A: Yes, in fact, that would capture the majority of use cases. One can also argue that charset detection naturally fits the prettification concept.

Q: Okay, if we go with this, what would the practical implications be?
A: When a user makes a simple request on the terminal (http pie.dev/get), the charset detection fallback would be used (by default, we prettify, therefore buffer). Same with --stream. If the user disables prettifying (http --pretty=none pie.dev/get), we stream, so no auto-detection is possible (i.e., no change in behavior here).

Q: Is there anything else that should be considered?
A: Yes, this change alters a core behavior, so it needs to be approached holistically, thinking through all the possible scenarios. Perhaps the same needs to be applied to requests, etc. This Q&A is not comprehensive. The behavior also needs to be documented (a new section on charset handling would be a good idea).
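Read literally, the Q&A maps onto a small decision like this (a sketch with hypothetical names, not HTTPie's actual API):

def charset_detection_mode(pretty: str, stream: bool) -> str:
    """Map HTTPie's output modes onto charset-detection strategies."""
    if stream:
        return 'per-line'    # with --stream, each line is a standalone document
    if pretty != 'none':
        return 'whole-body'  # prettified output is buffered first, so a full guess is safe
    return 'off'             # raw streaming: no buffering, no auto-detection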

@hroncok (Contributor) commented Aug 4, 2021

Successfully waited for python-charset-normalizer-2.0.4-1.fc33 to appear in the f33-build repo
Successfully waited for python-charset-normalizer-2.0.4-1.fc34 to appear in the f34-build repo

If you include #1119, you should be able to run the tests on Fedora 33 and 34.

@hroncok (Contributor) commented Aug 5, 2021

The current Fedora problem is:

No matching package to install: 'python3dist(requests) >= 2.26'
No matching package to install: 'python3dist(requests[socks]) >= 2.26'

@BoboTiG (Contributor, Author) commented Aug 5, 2021

Yes, I was checking whether it works with that new version. I'll revert that change.

@BoboTiG (Contributor, Author) commented Aug 5, 2021

So I adapted the current implementation:

  • When the Content-Type header is set: no behavior change.

  • For streamed responses, the encoding is guessed from the content retrieved so far (best effort). That means the detected encoding can differ from one streamed line to the next (see the sketch below).

It seems to fix all the reported issues, but I am still not very satisfied 🤔

@jakubroztocil do you see other scenarios that would be broken by these changes? I would be happy to add more specific tests to ensure no regressions.
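A sketch of that best-effort behavior, assuming charset_normalizer; the helper is hypothetical, not the merged code:

from charset_normalizer import from_bytes

def decode_streamed_lines(lines, declared=None):
    """Decode each line, guessing from all the content retrieved so far."""
    seen = b''
    for line in lines:
        seen += line
        if declared:                       # a declared charset always wins
            encoding = declared
        else:
            best = from_bytes(seen).best()  # best effort over the content so far
            encoding = best.encoding if best else 'utf-8'
        # NB: without a declared charset, the guess can change between lines.
        yield line.decode(encoding, errors='replace')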

@jkbrzt (Member) commented Aug 6, 2021

> http pie.dev/get will use the EncodedStream class. The body is always streamed line by line, which leads to the same issues as with --stream. It is the same scenario if we pass --pretty=none: each line needs to be decoded before being buffered.
>
> --pretty=all|colors|format will use the BufferedPrettyStream class; here, as you said, the body is buffered.

It uses BufferedPrettyStream because --pretty=all is the default for terminal output. https://httpie.io/docs#terminal-output

@BoboTiG BoboTiG changed the title Improve handling of responses without correct content-type charset Improve handling of responses without correct content-type encoding Sep 28, 2021
@BoboTiG BoboTiG changed the title Improve handling of responses without correct content-type encoding Improve handling of prettified responses without correct content-type encoding Sep 29, 2021
@jkbrzt jkbrzt merged commit 71adcd9 into httpie:master Sep 29, 2021
@BoboTiG BoboTiG deleted the mickael/dev-169-properly-detecting-response-encoding branch September 29, 2021 19:09