Bug 428054 - urllib and urllib2 also saving headers when using urllib.urlretrieve & urllib2.read
Status: CLOSED UPSTREAM
Product: Fedora
Classification: Fedora
Component: python
Version: rawhide
Hardware: i686 Linux
Priority: low
Severity: medium
Assigned To: James Antill
QA Contact: Fedora Extras Quality Assurance
Reported: 2008-01-08 17:00 EST by Ivo Manca
Modified: 2008-01-08 23:33 EST

Doc Type: Bug Fix
Last Closed: 2008-01-08 23:33:14 EST


Attachments
python download comparison script (1.18 KB, text/plain)
2008-01-08 17:00 EST, Ivo Manca
Description Ivo Manca 2008-01-08 17:00:46 EST
Hey all,

I'm having trouble with Python's urllib and urllib2. Most files download
correctly, but one file _always_ fails when using urllib.urlretrieve or
urllib2.read.
It doesn't fail completely, though: the headers sent by the server are written
to the output as well, rendering the output file useless.
Is this a feature I'm not aware of, or is it just a bug?

I've made a small Python script (attached) to list the differences between the
results of wget, urllib.urlretrieve and urllib2.read (since I was afraid it was
a urllib limitation).
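
For readers without access to the attachment, the comparison can be sketched roughly as follows. This is not the attached script: it is a minimal Python 3 sketch that only computes the md5 and size columns for files that have already been downloaded, and the names md5_and_size and report are made up for illustration.

```python
import hashlib
import os


def md5_and_size(path, chunk_size=64 * 1024):
    """Return (hex md5, size in bytes) for the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large downloads do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


def report(paths):
    """Format one 'md5  filename  filesize' row per file."""
    rows = []
    for path in paths:
        md5_hex, size = md5_and_size(path)
        rows.append("%s  %s  %d" % (md5_hex, os.path.basename(path), size))
    return "\n".join(rows)
```

Calling report() on test.urllib, test.urllib2 and test.wget would produce rows in the same layout as the tables below; differing md5 or size values then point at the download that went wrong.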

Added info:
rpm -q lists:
-----------------------------------------------------------------------------------------------
fedora-release-8-5
python-2.5.1-19.fc9
python-libs-2.5.1-19.fc9
wget-1.10.2-17.fc9
python-urlgrabber-3.0.0-3.fc8


Download tests:

Static http site (http://www.example.com/)
-----------------------------------------------------------------------------------------------
md5                               filename      filesize
32e8347a8caee51bd4474c4fbb7025c5  test.urllib   438
32e8347a8caee51bd4474c4fbb7025c5  test.urllib2  438
32e8347a8caee51bd4474c4fbb7025c5  test.wget     438

Seems to be correct, right?

Static https download link
(http://download.ing.be/homebank/security/windows/HBSecurity333.exe)
-----------------------------------------------------------------------------------------------
md5                               filename      filesize
45bb388af9bf7aeb110d95ca988c7ff6  test.urllib   806641
45bb388af9bf7aeb110d95ca988c7ff6  test.urllib2  806641
45bb388af9bf7aeb110d95ca988c7ff6  test.wget     806641

Also seems to be correct, so no error with https!

Static https download link
(https://helixcommunity.org/projects/player/files/download/2479)
-----------------------------------------------------------------------------------------------
md5                               filename      filesize
464cf8972fe8fa9f9a84c1eb7b3d9357  test.urllib   6645410
464cf8972fe8fa9f9a84c1eb7b3d9357  test.urllib2  6645410
413756781140113a62c7950950cf9de6  test.wget     6645239

Huh? Different md5sum and filesizes?
When I view the .urllib and .urllib2 versions, I notice:

"Content-disposition: attachment; filename="RealPlayer-10.0.9.809-20070726.i586.rpm"
Content-length: 6645239
Connection: close
Content-Type: application/octet-stream

"

at the top of the file. Why are the headers coming through? Shouldn't the
urlretrieve and read methods filter these out?
If not, how am I supposed to filter them out, not knowing _if_ they're
coming through?
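
One possible client-side workaround for that last question is sketched below, under the assumption that the leaked text always has the shape shown above: a run of "Name: value" lines followed by a blank line. The helpers split_leaked_headers and leaked_length_matches are hypothetical names, not part of urllib or urllib2, and the code is Python 3.

```python
import re

# A header-looking line: token, colon, value (e.g. "Content-length: 6645239").
_HEADER_LINE = re.compile(rb"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+:[ \t]*[^\r\n]*$")


def split_leaked_headers(data):
    """Return (leaked_header_lines, body).

    If `data` does not start with header-like lines followed by a blank
    line, the list is empty and `data` is returned unchanged.
    """
    pos = 0
    lines = []
    while True:
        end = data.find(b"\n", pos)
        if end == -1:
            return [], data          # no blank line found: not a header block
        line = data[pos:end].rstrip(b"\r")
        pos = end + 1
        if line == b"":
            break                    # blank line ends the header block
        if not _HEADER_LINE.match(line):
            return [], data          # ordinary payload, leave it untouched
        lines.append(line)
    if not lines:
        return [], data
    return lines, data[pos:]


def leaked_length_matches(lines, body):
    """True if a leaked Content-length header equals len(body)."""
    for line in lines:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            value = value.strip()
            return value.isdigit() and int(value) == len(body)
    return False
```

A text payload that legitimately begins with "Name: value" lines would be misdetected, so it seems safer to strip only when leaked_length_matches() also returns True. In the failing download above, the leaked Content-length (6645239) equals the size of the wget result, which is the kind of agreement this check looks for.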
Comment 1 Ivo Manca 2008-01-08 17:00:46 EST
Created attachment 291094
python download comparison script
Comment 2 James Antill 2008-01-08 23:33:14 EST
 In theory you could say this is a bug in urllib2. However, I'm not going to
deviate from upstream on the behaviour, and I wouldn't be surprised if you find
it easier to get the above server at helixcommunity.org fixed instead. You can
see the problem if you do:

curl --trace trace -o data https://helixcommunity.org/projects/player/files/download/2479

...then you can see at the start of the trace file:

<= Recv header, 17 bytes (0x11)
0000: 48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d HTTP/1.1 200 OK.
0010: 0a                                              .

...this is the response line, note it ends with 0x0d 0x0a as it should. Then you
have the headers:

<= Recv header, 37 bytes (0x25)
0000: 44 61 74 65 3a 20 57 65 64 2c 20 30 39 20 4a 61 Date: Wed, 09 Ja
0010: 6e 20 32 30 30 38 20 30 34 3a 30 36 3a 35 32 20 n 2008 04:06:52 
0020: 47 4d 54 0d 0a                                  GMT..
<= Recv header, 32 bytes (0x20)
0000: 53 65 72 76 65 72 3a 20 41 70 61 63 68 65 2f 32 Server: Apache/2
0010: 2e 30 2e 35 32 20 28 43 65 6e 74 4f 53 29 0d 0a .0.52 (CentOS)..
<= Recv header, 25 bytes (0x19)
0000: 58 2d 50 6f 77 65 72 65 64 2d 42 79 3a 20 50 48 X-Powered-By: PH
0010: 50 2f 34 2e 33 2e 39 0d 0a                      P/4.3.9..

...and ditto for the first three headers, but then:

<= Recv header, 70 bytes (0x46)
0000: 50 33 50 3a 20 70 6f 6c 69 63 79 72 65 66 3d 22 P3P: policyref="
0010: 68 74 74 70 73 3a 2f 2f 77 77 77 2e 68 65 6c 69 https://www.heli
0020: 78 63 6f 6d 6d 75 6e 69 74 79 2e 6f 72 67 2f 77 xcommunity.org/w
0030: 33 63 2f 70 33 70 2e 78 6d 6c 22 2c 20 43 50 3d 3c/p3p.xml", CP=
0040: 22 41 44 4d 61 0a                               "ADMa.
<= Recv header, 22 bytes (0x16)
0000: 4f 55 52 20 49 4e 44 20 44 53 50 20 49 44 43 20 OUR IND DSP IDC 
0010: 43 4f 52 22 0d 0a                               COR"..

...here the P3P header has a bare 0x0a byte in it (with no 0x0d before it),
which curl is "confused about" and urllib2 gets a lot more confused about.
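
To check a captured header block for this condition without reading a hex dump, something like the following Python 3 sketch could be used. The function name bare_lf_offsets is made up here, and the input is assumed to be the raw header bytes exactly as received.

```python
import re

# An LF (0x0a) that is not immediately preceded by a CR (0x0d).
_BARE_LF = re.compile(rb"(?<!\r)\n")


def bare_lf_offsets(raw_header_block):
    """Return the byte offset of every bare LF in the raw header bytes."""
    return [match.start() for match in _BARE_LF.finditer(raw_header_block)]
```

For a well-formed response the list is empty; for the response traced above it would contain the offset of the LF that follows CP="ADMa.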

 The relevant part of the HTTP std. (RFC 2616) is:

19.3 Tolerant Applications
[...]
   The line terminator for message-header fields is the sequence CRLF.
   However, we recommend that applications, when parsing such headers,
   recognize a single LF as a line terminator and ignore the leading CR.

...so presumably urllib follows this strictly, treats the bare LF as the end of
the P3P header, and thus sees the next line as:

<= Recv header, 22 bytes (0x16)
0000: 4f 55 52 20 49 4e 44 20 44 53 50 20 49 44 43 20 OUR IND DSP IDC 
0010: 43 4f 52 22 0d 0a                               COR"..

...which is obviously not a valid header, and so it ends the headers there and
treats whatever follows as the body. The next line being:

<= Recv header, 85 bytes (0x55)
0000: 43 6f 6e 74 65 6e 74 2d 64 69 73 70 6f 73 69 74 Content-disposit
0010: 69 6f 6e 3a 20 61 74 74 61 63 68 6d 65 6e 74 3b ion: attachment;
[...]

...etc.
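
The guess above can be reproduced with a small model of such a parser. This is a Python 3 sketch written for illustration, not the urllib2 or httplib source; parse_headers_lf_tolerant is a hypothetical name, and the rule "stop at the first line that is not a header" is the assumption being illustrated.

```python
def parse_headers_lf_tolerant(raw):
    """Split `raw` (header block plus body, status line excluded) into
    (headers, rest).

    Every LF ends a line and a preceding CR is dropped. Parsing stops at a
    blank line, or at the first line that is neither "Name: value" nor a
    continuation line starting with space/tab. That line is discarded and
    everything after it is handed back as the body.
    """
    headers = []
    pos = 0
    while pos < len(raw):
        end = raw.find(b"\n", pos)
        if end == -1:
            break
        line = raw[pos:end].rstrip(b"\r")
        pos = end + 1
        if line == b"":
            break                                # normal end of headers
        if line[:1] in (b" ", b"\t") and headers:
            headers[-1] += b" " + line.strip()   # folded continuation line
            continue
        name, sep, _value = line.partition(b":")
        if not sep or b" " in name:
            break                                # invalid line: stop here
        headers.append(line)
    return headers, raw[pos:]
```

Fed a block shaped like the trace, the model stops at the OUR IND DSP IDC COR" line and returns a body that begins with the remaining header lines, which is consistent with the files the reporter saw starting at Content-disposition.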

 I've looked and I can't find wording that explicitly disallows putting a LF in
a header, and there is some wording that kind of indicates it might be ok ...
but it's _very_ unusual, and the above wording in 19.3 strongly suggests you
shouldn't be doing it, IMO.

 So I'd say your three options are:

1. Tell helixcommunity to fix their server's P3P response header.
2. Use Python bindings for something like curl, which goes to extreme lengths to
parse broken HTTP responses found in the wild.
3. Try to get upstream to accept a change request so that, once it has seen a
CRLF, it no longer uses the "search for LF, and delete the CR if there" mode.
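
What option 3 might look like is sketched below in Python 3. It is only one reading of the idea, not a patch for urllib2: the function name split_header_lines is hypothetical, and the input is assumed to be the header block plus body with the status line already removed.

```python
def split_header_lines(raw_block):
    """Split a raw header block plus body into (header_lines, body).

    If any CRLF is present, only CRLF terminates a line, so a stray LF
    stays inside the header value it belongs to. Otherwise the block is
    assumed to use LF-only line endings.
    """
    separator = b"\r\n" if b"\r\n" in raw_block else b"\n"
    # The first empty line (two separators in a row) ends the headers.
    head, _, body = raw_block.partition(separator + separator)
    return head.split(separator), body
```

With this rule the P3P header keeps its embedded LF as part of the value, and Content-disposition, Content-length and the rest are still parsed as headers rather than ending up in the saved file.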
