428054 – urllib and urllib2 also saving headers when using urllib.urlretrieve & urllib2.read

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	python
Sub Component:
Version:	rawhide
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	James Antill
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-01-08 22:00 UTC by Ivo Manca
Modified:	2008-01-09 04:33 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-01-09 04:33:14 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
python download comparison script (1.18 KB, text/plain) 2008-01-08 22:00 UTC, Ivo Manca

Description Ivo Manca 2008-01-08 22:00:46 UTC

Hey all,

I'm having trouble with Python's urllib and urllib2. Most files download
correctly, but one file _always_ fails when using urllib.urlretrieve or
urllib2.read.
It doesn't fail completely, though: the headers sent by the server end up in
the output, rendering the output file useless.
Is this an intended feature, or is it just a bug?

I've made a small Python script (attached) to list the differences between the
results of wget, urllib.urlretrieve and urllib2.read (since I was afraid it was
a urllib limitation).
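
The attached script is not reproduced here. As a rough idea of the comparison
step only, the following is a minimal sketch in Python 3 that prints the same
"md5 / filename / filesize" columns for files that have already been
downloaded. The names summarize and report are hypothetical and are not taken
from the attachment.

```python
import hashlib
import os

def summarize(path):
    """Return (md5 hex digest, file name, size in bytes) for one file."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large downloads are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.basename(path), os.path.getsize(path)

def report(paths):
    """Format one 'md5  filename  filesize' row per file."""
    rows = [summarize(p) for p in paths]
    return "\n".join("%s  %-12s  %d" % row for row in rows)
```

Calling report(["test.urllib", "test.urllib2", "test.wget"]) on the three
downloaded files would give rows in the layout used in the tables below.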

Added info:
rpm -q lists:
-----------------------------------------------------------------------------------------------
fedora-release-8-5
python-2.5.1-19.fc9
python-libs-2.5.1-19.fc9
wget-1.10.2-17.fc9
python-urlgrabber-3.0.0-3.fc8


Download tests:

Static http site (http://www.example.com/)
-----------------------------------------------------------------------------------------------
md5                               filename      filesize
32e8347a8caee51bd4474c4fbb7025c5  test.urllib   438
32e8347a8caee51bd4474c4fbb7025c5  test.urllib2  438
32e8347a8caee51bd4474c4fbb7025c5  test.wget     438

Seems to be correct, right?

Static https download link
(http://download.ing.be/homebank/security/windows/HBSecurity333.exe)
-----------------------------------------------------------------------------------------------
md5                               filename      filesize
45bb388af9bf7aeb110d95ca988c7ff6  test.urllib   806641
45bb388af9bf7aeb110d95ca988c7ff6  test.urllib2  806641
45bb388af9bf7aeb110d95ca988c7ff6  test.wget     806641

Also seems to be correct, so no error with https!

Static https download link
(https://helixcommunity.org/projects/player/files/download/2479)
-----------------------------------------------------------------------------------------------
md5                               filename      filesize
464cf8972fe8fa9f9a84c1eb7b3d9357  test.urllib   6645410
464cf8972fe8fa9f9a84c1eb7b3d9357  test.urllib2  6645410
413756781140113a62c7950950cf9de6  test.wget     6645239

Huh? Different md5sum and filesizes?
When I view the .urllib and .urllib2 versions, I notice:

"Content-disposition: attachment; filename="RealPlayer-10.0.9.809-20070726.i586.rpm"
Content-length: 6645239
Connection: close
Content-Type: application/octet-stream

"

at the top of the file. Why are the headers coming through? Shouldn't the
urlretrieve and read methods filter them out?
If they aren't supposed to be filtered, how am I supposed to filter them out
myself, not knowing _whether_ they're coming through?
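
One possible client-side workaround, offered only as a heuristic sketch in
Python 3: treat the start of the downloaded data as leaked headers only if it
looks like a header block and its Content-length value matches the number of
bytes that follow it. The function name strip_leaked_headers and the 8192-byte
probe size are assumptions made for this illustration, not part of urllib or
urllib2.

```python
import re

# A header line is assumed to be "name:" with no whitespace in the name.
HEADER_LINE = re.compile(rb"[^\s:]+:")

def strip_leaked_headers(data, probe=8192):
    """Return data without a leading leaked header block, if one is detected."""
    head, sep, _ = data[:probe].partition(b"\r\n\r\n")
    if not sep:
        return data  # no blank line near the start: nothing to strip
    body_start = len(head) + len(sep)
    declared = None
    for line in head.split(b"\r\n"):
        if not HEADER_LINE.match(line):
            return data  # not a header block: leave the data alone
        name, _, value = line.partition(b":")
        if name.lower() == b"content-length" and value.strip().isdigit():
            declared = int(value.strip())
    # Only strip when the declared length matches what actually follows.
    if declared is not None and declared == len(data) - body_start:
        return data[body_start:]
    return data
```

The length check keeps the heuristic conservative: a file that merely happens
to start with text resembling headers is returned unchanged unless the
numbers also agree.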

Comment 1 Ivo Manca 2008-01-08 22:00:46 UTC

Created attachment 291094 [details]
python download comparison script

Comment 2 James Antill 2008-01-09 04:33:14 UTC

 In theory you could say this is a bug in urllib2. However, I'm not going to
deviate from upstream on the behaviour, and I wouldn't be surprised if you found
it easier to get the above server at helixcommunity.org fixed instead. You can
see the problem if you do:

curl --trace trace -o data https://helixcommunity.org/projects/player/files/download/2479

...then you can see at the start of the trace file:

<= Recv header, 17 bytes (0x11)
0000: 48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d HTTP/1.1 200 OK.
0010: 0a                                              .

...this is the response line, note it ends with 0x0d 0x0a as it should. Then you
have the headers:

<= Recv header, 37 bytes (0x25)
0000: 44 61 74 65 3a 20 57 65 64 2c 20 30 39 20 4a 61 Date: Wed, 09 Ja
0010: 6e 20 32 30 30 38 20 30 34 3a 30 36 3a 35 32 20 n 2008 04:06:52 
0020: 47 4d 54 0d 0a                                  GMT..
<= Recv header, 32 bytes (0x20)
0000: 53 65 72 76 65 72 3a 20 41 70 61 63 68 65 2f 32 Server: Apache/2
0010: 2e 30 2e 35 32 20 28 43 65 6e 74 4f 53 29 0d 0a .0.52 (CentOS)..
<= Recv header, 25 bytes (0x19)
0000: 58 2d 50 6f 77 65 72 65 64 2d 42 79 3a 20 50 48 X-Powered-By: PH
0010: 50 2f 34 2e 33 2e 39 0d 0a                      P/4.3.9..

...and ditto for the first three headers, but then:

<= Recv header, 70 bytes (0x46)
0000: 50 33 50 3a 20 70 6f 6c 69 63 79 72 65 66 3d 22 P3P: policyref="
0010: 68 74 74 70 73 3a 2f 2f 77 77 77 2e 68 65 6c 69 https://www.heli
0020: 78 63 6f 6d 6d 75 6e 69 74 79 2e 6f 72 67 2f 77 xcommunity.org/w
0030: 33 63 2f 70 33 70 2e 78 6d 6c 22 2c 20 43 50 3d 3c/p3p.xml", CP=
0040: 22 41 44 4d 61 0a                               "ADMa.
<= Recv header, 22 bytes (0x16)
0000: 4f 55 52 20 49 4e 44 20 44 53 50 20 49 44 43 20 OUR IND DSP IDC 
0010: 43 4f 52 22 0d 0a                               COR"..

...here the P3P header has a bare 0x0a byte in it (with no 0x0d before it),
which curl is "confused about" and urllib2 gets a lot more confused about.

 The relevant part of the HTTP standard (RFC 2616) is:

19.3 Tolerant Applications
[...]
   The line terminator for message-header fields is the sequence CRLF.
   However, we recommend that applications, when parsing such headers,
   recognize a single LF as a line terminator and ignore the leading CR.

...so presumably urllib follows this recommendation strictly, treats the lone LF
as a line terminator, and thus sees the next line as:

<= Recv header, 22 bytes (0x16)
0000: 4f 55 52 20 49 4e 44 20 44 53 50 20 49 44 43 20 OUR IND DSP IDC 
0010: 43 4f 52 22 0d 0a                               COR"..

...which is obviously not a valid header line (it has no "name:" prefix), and so
it ends the headers there. Everything after it is treated as the body, starting
with the next line:

<= Recv header, 85 bytes (0x55)
0000: 43 6f 6e 74 65 6e 74 2d 64 69 73 70 6f 73 69 74 Content-disposit
0010: 69 6f 6e 3a 20 61 74 74 61 63 68 6d 65 6e 74 3b ion: attachment;
[...]

...etc.
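
 To make the above concrete, here is a small Python 3 simulation of a parser
that treats every LF as a line terminator, as 19.3 recommends. It is a sketch
of the presumed behaviour, not urllib2's actual code: read_line and
split_response are hypothetical names, the response is shortened, and dropping
the malformed line is an assumption chosen to match the report, where the
saved files start at Content-disposition.

```python
import re

# A header line is assumed to be "name:" with no whitespace in the name.
HEADER_LINE = re.compile(rb"[^\s:]+:")

def read_line(raw, pos):
    """Return (line without terminator, next position); any LF ends a line."""
    end = raw.find(b"\n", pos)
    if end == -1:
        return raw[pos:], len(raw)
    return raw[pos:end].rstrip(b"\r"), end + 1

def split_response(raw):
    """Split raw bytes into (status line, header lines, body)."""
    status_line, pos = read_line(raw, 0)
    headers = []
    while pos < len(raw):
        line, pos = read_line(raw, pos)
        if line == b"":
            break  # blank line: normal end of the header block
        if line[:1] in (b" ", b"\t") and headers:
            headers[-1] += b" " + line.strip()  # folded continuation line
        elif HEADER_LINE.match(line):
            headers.append(line)
        else:
            break  # malformed line: dropped, and the header block ends here
    return status_line, headers, raw[pos:]

# Shortened response with the same bare LF inside the P3P header.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Server: Apache/2.0.52 (CentOS)\r\n"
       b'P3P: policyref="https://www.helixcommunity.org/w3c/p3p.xml", CP="ADMa\n'
       b'OUR IND DSP IDC COR"\r\n'
       b"Content-length: 4\r\n"
       b"\r\n"
       b"DATA")
status_line, headers, body = split_response(raw)
```

With this input the header list stops after the truncated P3P line, and body
begins with the Content-length header instead of the payload.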

 I've looked and I can't find wording that explicitly disallows putting an LF in
a header, and there is some wording that kind of indicates it might be OK. But
it's _very_ unusual, and the above wording in 19.3 strongly suggests you
shouldn't be doing it, IMO.

 So I'd say your three options are:

1. Tell helixcommunity to fix their server's P3P response header.
2. Use Python bindings for something like curl, which goes to extreme lengths to
parse broken HTTP responses found in the wild.
3. Try to get upstream to accept a change request so that, once it has seen a
CRLF, it doesn't then use the "search for LF, and delete the CR if there" mode.
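
 Option 3 could look roughly like the following Python 3 sketch: decide on the
line terminator from the status line and then stick to it, so a bare LF inside
a header value stays part of that value. This is only an illustration of the
idea, not a patch against urllib2 or httplib, and split_headers_sticky is a
hypothetical name.

```python
def split_headers_sticky(raw):
    """Split raw bytes into (status line, header lines, body).

    The terminator used by the status line (CRLF or bare LF) is assumed
    to be the terminator for every following header line.
    """
    first_lf = raw.index(b"\n")
    sep = b"\r\n" if raw[first_lf - 1:first_lf] == b"\r" else b"\n"
    # The header block ends at the first empty line in the chosen style.
    head, _, body = raw.partition(sep + sep)
    lines = head.split(sep)
    return lines[0], lines[1:], body

# Shortened response with the same bare LF inside the P3P header.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Server: Apache/2.0.52 (CentOS)\r\n"
       b'P3P: policyref="https://www.helixcommunity.org/w3c/p3p.xml", CP="ADMa\n'
       b'OUR IND DSP IDC COR"\r\n'
       b"Content-length: 4\r\n"
       b"\r\n"
       b"DATA")
status_line, headers, body = split_headers_sticky(raw)
```

Here the P3P header keeps its embedded LF, Content-length is parsed as a
header, and body contains only the payload. Servers that send bare LF from the
status line onwards are still handled, because the status line then selects LF
as the terminator.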
