Better handling of responses without correct content-type charset #1022

Almad · 2021-01-20T14:38:50Z

Problem

If the server doesn't provide a correct charset in Content-Type, we default to latin1 because the underlying requests library does that. This makes for a poor user experience.

Possible solutions

Default to utf8
For well-known content types, handle them as defined in the respective RFC (utf8 for application/json, first line or BOM for XML, meta tag in HTML)
Use chardet to detect from the body
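
The first two options could be combined roughly as follows. This is only a sketch under assumptions, not HTTPie's or requests' code: `guess_encoding`, its parameters, and the regexes are hypothetical names made up for illustration, and the sniffing is deliberately simplistic (it only looks at the first bytes of the body that the caller passes in).

```python
import codecs
import re

# BOMs are checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]
# Hypothetical, simplified patterns; real XML/HTML sniffing has more cases.
_XML_DECL = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')
_HTML_META = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def guess_encoding(mime, head, default="utf-8"):
    """Guess a text encoding from the MIME type and the first bytes of the body."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    if mime == "application/json" or mime.endswith("+json"):
        return "utf-8"
    if mime.endswith("/xml") or mime.endswith("+xml"):
        match = _XML_DECL.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    if mime == "text/html":
        match = _HTML_META.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    # Nothing recognised: fall back to the default (utf8 per the first option).
    return default
```

Only a small prefix of the body is needed here, so this would not require downloading the whole response.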

Considerations

Streaming mode and chunks
Discrepancies between streaming and display mode: should this only be done for readability when showing output in the terminal, or also when piping to other commands? Current decision: only do this for terminal display. Consider a flag to enforce it for piping as well
chardet requires downloading the whole body, so it would download a whole video to tell you it's binary data
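
For the streaming point, one thing to keep in mind is that a multi-byte character can be split across two chunks. A minimal sketch of chunk-wise decoding using only the standard library is shown below; `decode_chunks` is a hypothetical helper name, and it assumes the encoding has already been chosen by some other means.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Yield decoded text for each byte chunk without buffering the whole body."""
    # The incremental decoder keeps incomplete byte sequences between calls.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is left; a truncated sequence becomes U+FFFD.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

For example, `"".join(decode_chunks([b"caf\xc3", b"\xa9!"]))` reassembles the `é` that was split between the two chunks.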

Almad · 2021-05-16T09:40:28Z

@Ousret Excellent! It's a bit unclear from the discussion: will that be merged into requests, or would we need to monkeypatch around it?

BoboTiG · 2021-07-15T14:53:40Z

@Ousret I tried the move from chardet to charset-normalizer, but there is no behavior change:

>>> import requests
>>> r = requests.get('https://zoek.officielebekendmakingen.nl/kst-34200-14/metadata_owms.xml')
>>> r.headers['Content-Type']
'text/xml'
>>> r.encoding
'ISO-8859-1'  # <-- not good

# Check 1
>>> r.content[:38]
b'<?xml version="1.0" encoding="UTF-8"?>'

# Check 2
>>> requests.compat.chardet
<module 'charset_normalizer' from '/.../lib/python3.9/site-packages/charset_normalizer/__init__.py'>

# Check 3
>>> requests.compat.chardet.detect(r.content)
{'encoding': 'utf-8', 'language': 'English', 'confidence': 0.975}

So the new module is working fine, I had no doubt about that ;)
But requests does not seem to use it? Maybe I did something wrong while testing.

BoboTiG · 2021-07-15T15:05:04Z

Oh I see. Unfortunately it will not be so easy:

$ python -m httpie 'https://zoek.officielebekendmakingen.nl/kst-34200-14/metadata_owms.xml'
HTTP/1.1 200 OK
Cache-Control: private
Content-Disposition: inline; filename=metadata_owms.xml
Content-Encoding: gzip
Content-Length: 871
Content-Security-Policy: frame-ancestors 'self'
Content-Type: text/xml
Date: Thu, 15 Jul 2021 15:00:26 GMT
Expect-CT: enforce, max-age=30
Permissions-Policy: geolocation=(), midi=(), notifications=(), push=(), microphone=(), camera=(), magnetometer=(), gyroscope=(), speaker=(), vibrate=(), fullscreen=(), payment=()
Referrer-Policy: strict-origin-when-cross-origin
Server: 
Strict-Transport-Security: max-age=31536000; includeSubDomains;
Vary: Accept-Encoding
X-AspNet-Version: 
X-AspNetMvc-Version: 
X-Content-Security-Policy: frame-ancestors 'self'
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Powered-By: 
X-XSS-Protection: 1; mode=block
x-webkit-csp: frame-ancestors 'self'


__main__.py: error: RuntimeError: The content for this response was already consumed

apparent_encoding will consume content, and that is not something we want for HTTPie: https://github.com/httpie/httpie/blob/41c251ec7c537059b44c69b75b89c6dd671a25fc/httpie/models.py#L83-L91

Out of curiosity, do you see a workaround? :D

BoboTiG · 2021-07-15T15:08:38Z

The root cause seems to be in get_encoding_from_headers(): it returns ISO-8859-1 when text is part of the content type and no charset is given. That is not bad in itself; it is just that we rely on that value to avoid fetching the whole body of a response.
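
One possible direction, sketched here with the standard library only, is to tell apart a charset that the server actually declared from a value that was merely filled in as a default. The names `declared_charset` and `pick_encoding` are hypothetical and are not part of requests or HTTPie; this is an illustration, not the fix that was eventually merged.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param() returns None when the header carries no charset parameter.
    return msg.get_param("charset")


def pick_encoding(content_type, sniffed=None, default="utf-8"):
    """Prefer an explicit charset, then a sniffed one, then the default."""
    return declared_charset(content_type) or sniffed or default
```

With the response above, `declared_charset("text/xml")` is `None`, so a value sniffed from the XML declaration could take precedence instead of ISO-8859-1.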

BoboTiG · 2021-07-15T15:12:48Z

psf/requests#2086 is interesting to follow.

BoboTiG · 2021-07-15T15:28:17Z

Yes, let me try something :)

jkbrzt added the enhancement (New feature or enhancement) and needs product design (We like the idea, but we want to explore the problem deeper, and consider the solution holistically) labels Feb 17, 2021

BoboTiG self-assigned this Jul 15, 2021

BoboTiG mentioned this issue Jul 16, 2021

Improve handling of prettified responses without correct content-type encoding #1110

Merged

jkbrzt closed this as completed in #1110 Sep 29, 2021
