Hey all,

I'm having trouble with Python's urllib and urllib2. Most files download correctly, but one file _always_ fails when using urllib.urlretrieve or urllib2.urlopen().read(). It doesn't fail completely, though: it writes the headers sent by the server into the output, rendering the output file useless.

Is this a feature (for me), or is it just a bug?

I've made a small Python script (attached) to list the differences between the results of wget, urllib.urlretrieve and urllib2.urlopen().read() (since I was afraid it was a urllib limitation).

Added info:

rpm -q lists:
-----------------------------------------------------------------------------------------------
fedora-release-8-5
python-2.5.1-19.fc9
python-libs-2.5.1-19.fc9
wget-1.10.2-17.fc9
python-urlgrabber-3.0.0-3.fc8

Download tests:

Static http site (http://www.example.com/)
-----------------------------------------------------------------------------------------------
md5                               filename      filesize
32e8347a8caee51bd4474c4fbb7025c5  test.urllib   438
32e8347a8caee51bd4474c4fbb7025c5  test.urllib2  438
32e8347a8caee51bd4474c4fbb7025c5  test.wget     438

Seems to be correct, right?

Static https download link (http://download.ing.be/homebank/security/windows/HBSecurity333.exe)
-----------------------------------------------------------------------------------------------
md5                               filename      filesize
45bb388af9bf7aeb110d95ca988c7ff6  test.urllib   806641
45bb388af9bf7aeb110d95ca988c7ff6  test.urllib2  806641
45bb388af9bf7aeb110d95ca988c7ff6  test.wget     806641

Also seems to be correct, so no error with https!

Static https download link (https://helixcommunity.org/projects/player/files/download/2479)
-----------------------------------------------------------------------------------------------
md5                               filename      filesize
464cf8972fe8fa9f9a84c1eb7b3d9357  test.urllib   6645410
464cf8972fe8fa9f9a84c1eb7b3d9357  test.urllib2  6645410
413756781140113a62c7950950cf9de6  test.wget     6645239

Huh? Different md5sums and file sizes?
When I view the .urllib and .urllib2 versions, I notice:

"Content-disposition: attachment; filename="RealPlayer-10.0.9.809-20070726.i586.rpm"
Content-length: 6645239
Connection: close
Content-Type: application/octet-stream

"

at the top of the file.

Why are the headers coming through? Shouldn't urlretrieve and read() filter these out? And if they aren't supposed to, how am I supposed to filter them out myself, not knowing _if_ they're coming through?
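On the "how do I filter them out" question, one possible workaround is to look at the start of the saved file and cut off anything that looks like a header block. The sketch below is only a heuristic written for Python 3; the function name `split_leaked_headers` is made up for illustration and is not part of urllib or urllib2. It assumes the leaked lines look like the ones quoted above ("Name: value" lines followed by a blank line), so a file whose real content starts that way would be misdetected.

```python
import re

# One header-looking line: a name, a colon, a value, then CRLF or a bare LF.
# (No "^" anchor: pattern.match(data, pos) already anchors at pos.)
_HEADER_LINE = re.compile(br"[A-Za-z0-9-]+:[^\r\n]*\r?\n")
# The blank line that ends a header block.
_BLANK_LINE = re.compile(br"\r?\n")


def split_leaked_headers(data):
    """Return (leaked_headers, payload) for the bytes of a downloaded file.

    Heuristic, hypothetical helper: if the data starts with one or more
    "Name: value" lines followed by a blank line, that prefix is treated as
    leaked HTTP headers. Otherwise the data is returned unchanged.
    """
    pos = 0
    found_header = False
    while True:
        match = _HEADER_LINE.match(data, pos)
        if match is None:
            break
        found_header = True
        pos = match.end()
    blank = _BLANK_LINE.match(data, pos)
    if found_header and blank is not None:
        end = blank.end()
        return data[:end], data[end:]
    return b"", data
```

If the four quoted lines were sent with CRLF line endings, they plus the blank line add up to 171 bytes, which is exactly the difference between 6645410 and 6645239. Calling `split_leaked_headers(open("test.urllib", "rb").read())` would then be expected to return that prefix and the 6645239-byte payload.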
Created attachment 291094 [details] python download comparison script
In theory you could say this is a bug in urllib2; however, I'm not going to deviate from upstream on the behaviour, and I wouldn't be surprised if you find it easier to get the above server at helixcommunity.org fixed instead. You can see the problem if you do:

curl --trace trace -o data https://helixcommunity.org/projects/player/files/download/2479

...then you can see at the start of the trace file:

<= Recv header, 17 bytes (0x11)
0000: 48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d HTTP/1.1 200 OK.
0010: 0a                                              .

...this is the status line of the response; note that it ends with 0x0d 0x0a (CRLF), as it should. Then you have the headers:

<= Recv header, 37 bytes (0x25)
0000: 44 61 74 65 3a 20 57 65 64 2c 20 30 39 20 4a 61 Date: Wed, 09 Ja
0010: 6e 20 32 30 30 38 20 30 34 3a 30 36 3a 35 32 20 n 2008 04:06:52
0020: 47 4d 54 0d 0a                                  GMT..
<= Recv header, 32 bytes (0x20)
0000: 53 65 72 76 65 72 3a 20 41 70 61 63 68 65 2f 32 Server: Apache/2
0010: 2e 30 2e 35 32 20 28 43 65 6e 74 4f 53 29 0d 0a .0.52 (CentOS)..
<= Recv header, 25 bytes (0x19)
0000: 58 2d 50 6f 77 65 72 65 64 2d 42 79 3a 20 50 48 X-Powered-By: PH
0010: 50 2f 34 2e 33 2e 39 0d 0a                      P/4.3.9..

...and ditto for the first three headers, which also end with CRLF. But then:

<= Recv header, 70 bytes (0x46)
0000: 50 33 50 3a 20 70 6f 6c 69 63 79 72 65 66 3d 22 P3P: policyref="
0010: 68 74 74 70 73 3a 2f 2f 77 77 77 2e 68 65 6c 69 https://www.heli
0020: 78 63 6f 6d 6d 75 6e 69 74 79 2e 6f 72 67 2f 77 xcommunity.org/w
0030: 33 63 2f 70 33 70 2e 78 6d 6c 22 2c 20 43 50 3d 3c/p3p.xml", CP=
0040: 22 41 44 4d 61 0a                               "ADMa.
<= Recv header, 22 bytes (0x16)
0000: 4f 55 52 20 49 4e 44 20 44 53 50 20 49 44 43 20 OUR IND DSP IDC
0010: 43 4f 52 22 0d 0a                               COR"..

...here the P3P header has a bare 0x0a (LF) byte in the middle of its value, which curl is "confused about" and urllib2 gets a lot more confused about. The relevant part of the HTTP std. (RFC 2616) is:

 19.3 Tolerant Applications
 [...]
 The line terminator for message-header fields is the sequence CRLF.
 However, we recommend that applications, when parsing such headers,
 recognize a single LF as a line terminator and ignore the leading CR.

...so presumably urllib follows this recommendation strictly, and thus sees the next line as:

<= Recv header, 22 bytes (0x16)
0000: 4f 55 52 20 49 4e 44 20 44 53 50 20 49 44 43 20 OUR IND DSP IDC
0010: 43 4f 52 22 0d 0a                               COR"..

...which is obviously not a valid HTTP header (it has no "name:" part), and so it ends the headers there. The next line being:

<= Recv header, 85 bytes (0x55)
0000: 43 6f 6e 74 65 6e 74 2d 64 69 73 70 6f 73 69 74 Content-disposit
0010: 69 6f 6e 3a 20 61 74 74 61 63 68 6d 65 6e 74 3b ion: attachment;
[...]

...etc. That is why Content-disposition and the headers after it end up at the top of the downloaded file.

I've looked and I can't find wording that explicitly disallows putting an LF in a header, and there is some wording that kind of indicates it might be OK... but it's _very_ unusual, and the above wording in 19.3 strongly suggests you shouldn't be doing it, IMO.

So I'd say your three options are:

1. Tell helixcommunity to fix their server's P3P response header.

2. Use Python bindings for something like curl, which goes to extreme lengths to parse broken HTTP responses found in the wild.

3. Try to get upstream to accept a change request so that, once it has seen a CRLF line ending, it no longer uses the "search for LF, and delete the CR if there" mode.
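To make the failure mode concrete, here is a toy model of the parsing described above, written for Python 3. It is a simplified sketch, not the actual urllib2/httplib code: the function `parse_response` and its `accept_bare_lf` flag are invented for illustration, folded (continuation) header lines are not handled, and the sample response is abbreviated from the trace with a placeholder `Content-length: 4` header and `DATA` body. With `accept_bare_lf=True` it mimics the "LF ends a line, drop the CR" behaviour; with `accept_bare_lf=False` it is a crude stand-in for option 3, where only CRLF ends a line.

```python
def parse_response(raw, accept_bare_lf=True):
    """Split raw response bytes into (status_line, headers, body).

    Toy model only. Header parsing stops at the blank line, or at the
    first line that does not look like "name: value"; that offending
    line is discarded and everything after it is treated as the body.
    """
    newline = b"\n" if accept_bare_lf else b"\r\n"
    first_end = raw.index(newline)
    status_line = raw[:first_end].rstrip(b"\r")
    pos = first_end + len(newline)
    headers = []
    while True:
        end = raw.find(newline, pos)
        if end == -1:
            break  # ran out of data before the end of the headers
        line = raw[pos:end].rstrip(b"\r")  # drop the CR before a bare LF
        pos = end + len(newline)
        if line == b"":
            break  # blank line: proper end of the header block
        name, sep, value = line.partition(b":")
        if not sep or b" " in name:
            break  # not a header line: give up, the rest becomes the body
        headers.append((name.decode("ascii"), value.strip().decode("ascii")))
    return status_line, headers, raw[pos:]


# Abbreviated from the trace; the last header and the body are placeholders.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache/2.0.52 (CentOS)\r\n"
    b'P3P: policyref="https://www.helixcommunity.org/w3c/p3p.xml", CP="ADMa\n'
    b'OUR IND DSP IDC COR"\r\n'
    b"Content-length: 4\r\n"
    b"\r\n"
    b"DATA"
)

if __name__ == "__main__":
    for lenient in (True, False):
        status, headers, body = parse_response(RAW, accept_bare_lf=lenient)
        print(lenient, [name for name, _ in headers], body)
```

In the lenient mode the header list stops at P3P and the body starts with the Content-length line, which matches what shows up in the downloaded file. In the CRLF-only mode the LF stays inside the P3P value and the body is just the payload.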