Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better handling of responses without correct content-type charset #1022

Closed
Almad opened this issue Jan 20, 2021 · 12 comments · Fixed by #1110
Closed

Better handling of responses without correct content-type charset #1022

Almad opened this issue Jan 20, 2021 · 12 comments · Fixed by #1110
Assignees
Labels
enhancement New feature or enhancement needs product design We like the idea, but we want to explore the problem deeper, and consider the solution holistically

Comments

@Almad
Copy link

Almad commented Jan 20, 2021

Problem

If server doesn't provide correct content-type charset, we're defaulting to latin1 because requests underlying does that. This is undesirable for user experience.

Possible solutions

  • Default to utf8
  • For well-known content-types, handle them as defined in respective RFC (utf8 for application/json, first line or BOM for XML, meta tag in HTML)
  • Use chardet to detect from the body

Considerations

  • Streaming mode and chunks
  • Discrepancies between streaming and display mode: should this only be done for readability when showing to the terminal, or also when piping to other commands? Current decision: only doing this for terminal display. Consider flag for enforcing this for piping as well
  • chardet requires downloading the whole body, so it would download a whole video to tell you it's binary data
@Ousret

This comment was marked as spam.

@Almad
Copy link
Author

Almad commented May 16, 2021

@Ousret Excellent! It's a bit unclear from the discussion: will that be merged into the requests or would we need to monkeypatch around?

@Ousret

This comment was marked as spam.

@Ousret

This comment was marked as spam.

@Ousret

This comment was marked as spam.

@BoboTiG
Copy link
Contributor

BoboTiG commented Jul 15, 2021

@Ousret I was trying the move from chardet to charset-normalizer but there is no behavior change:

>>> import requests
>>> r = requests.get('https://zoek.officielebekendmakingen.nl/kst-34200-14/metadata_owms.xml')
>>> r.headers['Content-Type']
'text/xml'
>>> r.encoding
'ISO-8859-1'  # <-- not good

# Check 1
>>> r.content[:38]
b'<?xml version="1.0" encoding="UTF-8"?>'

# Check 2
>>> requests.compat.chardet
<module 'charset_normalizer' from '/.../lib/python3.9/site-packages/charset_normalizer/__init__.py'>

# Check 3
>>> requests.compat.chardet.detect(r.content)
{'encoding': 'utf-8', 'language': 'English', 'confidence': 0.975}

So the new module is working fine, I had no doubt about that ;)
But requests does not seem to use it? Maybe I did something wrong while testing.

@Ousret

This comment was marked as spam.

@BoboTiG
Copy link
Contributor

BoboTiG commented Jul 15, 2021

Oh I see. Unfortunately it will not be so easy:

$ python -m httpie 'https://zoek.officielebekendmakingen.nl/kst-34200-14/metadata_owms.xml'
HTTP/1.1 200 OK
Cache-Control: private
Content-Disposition: inline; filename=metadata_owms.xml
Content-Encoding: gzip
Content-Length: 871
Content-Security-Policy: frame-ancestors 'self'
Content-Type: text/xml
Date: Thu, 15 Jul 2021 15:00:26 GMT
Expect-CT: enforce, max-age=30
Permissions-Policy: geolocation=(), midi=(), notifications=(), push=(), microphone=(), camera=(), magnetometer=(), gyroscope=(), speaker=(), vibrate=(), fullscreen=(), payment=()
Referrer-Policy: strict-origin-when-cross-origin
Server: 
Strict-Transport-Security: max-age=31536000; includeSubDomains;
Vary: Accept-Encoding
X-AspNet-Version: 
X-AspNetMvc-Version: 
X-Content-Security-Policy: frame-ancestors 'self'
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Powered-By: 
X-XSS-Protection: 1; mode=block
x-webkit-csp: frame-ancestors 'self'


__main__.py: error: RuntimeError: The content for this response was already consumed

apparent_encoding will consume content, and that is not something we want for HTTPie: https://github.com/httpie/httpie/blob/41c251ec7c537059b44c69b75b89c6dd671a25fc/httpie/models.py#L83-L91

Out of curiosity, do you see a workaround? :D

@BoboTiG
Copy link
Contributor

BoboTiG commented Jul 15, 2021

The root cause seems to be in get_encoding_from_headers(): it will return ISO-8859-1 when text is part of the content type. It is not bad, it is just that we are based on that value to prevent fetching the whole body of a response.

@BoboTiG
Copy link
Contributor

BoboTiG commented Jul 15, 2021

psf/requests#2086 is interesting to follow.

@Ousret

This comment was marked as spam.

@BoboTiG
Copy link
Contributor

BoboTiG commented Jul 15, 2021

Yes, let me try something :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or enhancement needs product design We like the idea, but we want to explore the problem deeper, and consider the solution holistically
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants