Bug 674817

Summary: wget: unable to resolve host address “http://...”
Product: [Fedora] Fedora
Component: wget
Version: rawhide
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Reporter: Milos Malik <mmalik>
Assignee: Karsten Hopp <karsten>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: karsten, micah
Doc Type: Bug Fix
Last Closed: 2011-02-03 23:41:04 UTC

Description Milos Malik 2011-02-03 13:07:59 UTC
Description of problem:


Version-Release number of selected component (if applicable):
wget-1.12-2.fc12.i686 (Fedora 12)
wget-1.12-1.4.el6.x86_64 (RHEL-6.0)

How reproducible:
always

Steps to Reproduce:
$ wget -c http%3A%2F%2Ftdn.howestreet.com%2Faudio%2Fjohn_rubino_2011_0202.mp3
--2011-02-03 14:03:10--  http://[http://tdn.howestreet.com/audio/john_rubino_2011_0202.mp3]/
Resolving http://tdn.howestreet.com/audio/john_rubino_2011_0202.mp3... failed: Name or service not known.
wget: unable to resolve host address “http://tdn.howestreet.com/audio/john_rubino_2011_0202.mp3”
$ echo $?
4

Actual results:
the file is not downloaded

Expected results:
the file is downloaded

Additional info:
It seems that wget thinks that “http://tdn.howestreet.com/audio/john_rubino_2011_0202.mp3” is a host address.

Comment 1 Micah Cowan 2011-02-03 18:26:53 UTC
This is obviously not a bug. You're asking for a website whose "hostname" begins with "http://". No browser I've ever seen would handle your broken URL any differently from wget. Wget thinks it's a host address, because you told it it's a host address (by percent-encoding it).

Presumably, you didn't actually want to percent-encode all that. Try it again with

  wget -c http://tdn.howestreet.com/audio/john_rubino_2011_0202.mp3

which is what I assume you actually meant to do.
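To illustrate the point above: once the ":" and "/" characters are percent-encoded, the argument no longer starts with a literal scheme prefix, so a parser has no reason to treat it as anything other than a scheme-less host name. That matches the log in the report, where wget prepended its own "http://" and put the whole argument in the host position. The sketch below is a minimal illustration only. has_known_scheme is a hypothetical helper, not wget's url_scheme, and it assumes a case-sensitive match on three prefixes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (NOT wget's url_scheme): returns 1 when the string
   starts with a literal "http://", "https://" or "ftp://" prefix, else 0.
   Assumption: the comparison is case-sensitive to keep the sketch short. */
int has_known_scheme(const char *url)
{
    static const char *const prefixes[] = { "http://", "https://", "ftp://" };
    size_t i;

    assert(url != NULL);
    for (i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
        /* Only a literal prefix counts; "%3A%2F%2F" is not "://". */
        if (strncmp(url, prefixes[i], strlen(prefixes[i])) == 0)
            return 1;
    }
    return 0;
}
```

With this helper, the plain URL from this comment is recognized as having a scheme, while the percent-encoded argument from the report is not, because its first characters are "http%3A" rather than "http://".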

Comment 2 Karsten Hopp 2011-02-03 22:56:12 UTC
Micah: Would it be that bad to run the URL through url_unescape before trying to figure out the hostname and path?
The following works for me:

diff -urN wget-1.12/src/url.c wget-1.12_new/src/url.c
--- wget-1.12/src/url.c 2009-09-22 05:05:53.000000000 +0200
+++ wget-1.12_new/src/url.c     2011-02-03 23:48:51.000000000 +0100
@@ -547,6 +547,10 @@
   if (url_scheme (url) != SCHEME_INVALID)
     return NULL;
 
+  if (strchr (url, '%'))
+    {
+      url_unescape (url);
+    }
   /* Look for a ':' or '/'.  The former signifies NcFTP syntax, the
      latter Netscape.  */
   p = strpbrk (url, ":/");

Comment 3 Micah Cowan 2011-02-03 23:18:38 UTC
The problem with that is it breaks legitimate URLs in order to support illegitimate ones. URLs are allowed to have percent-encoded characters in the hostname portion: this can particularly happen if the URL was translated from an IRI (internationalized resource identifier), without automatically punycoding the hostname.

There are also people out there who set up DNS registrations with bizarre hostnames, which could conceivably include colons and slashes (I wouldn't be surprised if there are hostnames out there that include those characters). Percent-encoding would be the only way to reach such a server (even though such hostnames are obviously non-conforming - but there are plenty of those).

In other words, it's even conceivable that someone (silly) really would have a hostname that starts "http://", and this would be the legal way to specify that.
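The ordering concern described in this comment can be sketched as follows: decoding the whole string before locating the host changes where the host ends, whereas locating the host first and decoding it afterwards preserves an encoded "/" as part of the host. This is a simplified illustration under stated assumptions, not wget's code path. All function names are hypothetical, the host name "a%2Fb.example" is made up, and the host is assumed to run from "://" to the next "/" (no userinfo or port handling).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Decode %XX escapes in place; malformed escapes are copied unchanged. */
void percent_decode(char *s)
{
    char *out = s;

    assert(s != NULL);
    while (*s) {
        if (s[0] == '%' && isxdigit((unsigned char)s[1])
            && isxdigit((unsigned char)s[2])) {
            char hex[3] = { s[1], s[2], '\0' };
            *out++ = (char)strtol(hex, NULL, 16);
            s += 3;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}

/* Copy the text between "://" and the next '/' into buf.
   Returns 0 on success, -1 if there is no "://" or buf is too small. */
int copy_host(const char *url, char *buf, size_t n)
{
    const char *start = strstr(url, "://");
    size_t len;

    if (start == NULL)
        return -1;
    start += 3;
    len = strcspn(start, "/");
    if (len >= n)
        return -1;
    memcpy(buf, start, len);
    buf[len] = '\0';
    return 0;
}

/* Split first, then decode only the host: an encoded '/' stays in the host. */
int host_split_then_decode(const char *url, char *buf, size_t n)
{
    if (copy_host(url, buf, n) != 0)
        return -1;
    percent_decode(buf);
    return 0;
}

/* Decode the whole string first, then split: an encoded '/' now ends the host. */
int host_decode_then_split(const char *url, char *buf, size_t n)
{
    char tmp[256];

    if (strlen(url) >= sizeof tmp)
        return -1;
    strcpy(tmp, url);
    percent_decode(tmp);
    return copy_host(tmp, buf, n);
}
```

For the made-up input "http://a%2Fb.example/file", the first function yields the host "a/b.example", while the second yields just "a" and silently moves the rest into the path. That difference is the kind of breakage for unusual but legal URLs that this comment warns about.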

Comment 4 Karsten Hopp 2011-02-03 23:41:04 UTC
@Micah: Thanks for the explanation. I didn't know that those 'bizarre hostnames' are allowed.

@Milos: FYI: Micah is the upstream maintainer of wget and monitors our wget bugzillas. Thanks a lot for that!

Closing per comment #1

Comment 5 Micah Cowan 2011-02-04 00:34:37 UTC
"Former maintainer", actually. :)

Giuseppe Scrivano is the current maintainer of wget (I'm no longer active in its development). So if there are any lingering doubts, it'd be best to bring this up with him (via the bug-wget mailing list, not personal email, of course); just be sure to link here for context.