This bug can be reproduced with htdig-3.2.0-0.10.b6.el6.
Miroslav, could you please elaborate on the results you get when reproducing this problem with htdig on RHEL6? How do your actual results compare to the ones I originally reported in bug 435741? Looking at the patches and spec file in the source RPM for htdig (htdig-3.2.0-0.10.b6.el6.src.rpm), I can see that the patch for bug 435741 is in there and is being applied to the source, so I find it very puzzling that the problem still exists. Can you confirm that version 3.2.0-0.10.b6.el6 is indeed the one and only version of htdig running on your system, and that htdig is printing its own debugging output when run with -vvv and configured with an external_parsers attribute pointing at a parser it can't execute?
Sorry, that should be -vvvv (4 v's) above, not -vvv. Some useful commands to check things out would be the following, for which I'd be interested in seeing the output:

$ which htdig                                  (should be /usr/bin/htdig)
$ rpm -qf $(which htdig)                       (should be htdig-3.2.0-0.10.b6.el6)
$ grep external_parsers /etc/htdig/htdig.conf
$ htdig -vvvv
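Along the same lines, here is a small hedged helper (not part of htdig itself; the parser path is the one from the test config quoted in this bug) that checks whether the configured external parser actually exists and is executable:

```shell
#!/bin/sh
# Hypothetical helper: verify that the external parser configured in
# htdig.conf exists and is executable. The path below is the deliberately
# missing parser from this bug's test setup, not a standard htdig component.
parser=/usr/local/bin/htmlparserfake
if [ -x "$parser" ]; then
    echo "parser OK: $parser"
else
    echo "parser missing or not executable: $parser"
fi
```

With the test setup from this bug, where the parser intentionally does not exist, this prints the "missing" branch.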
Gilles, I have only htdig-3.2.0-0.10.b6.el6 installed.

# cat /etc/htdig/htdig.conf
base_dir: /var/lib/htdig
common_dir: /usr/share/htdig
translate_latin1: false
start_url: http://www.google.com
external_parsers: text/html /usr/local/bin/htmlparserfake

# htdig -vvv
ht://dig Start Time: Fri Feb 18 11:32:59 2011
0:1:http://www.google.com/
New server: www.google.com, 80
 - Persistent connections: enabled
 - HEAD before GET: enabled
 - Timeout: 30
 - Connection space: 0
 - Max Documents: -1
 - TCP retries: 1
 - TCP wait time: 5
 - Accept-Language:
Trying to retrieve robots.txt file
Making HTTP request on http://www.google.com/robots.txt
Header line: HTTP/1.1 200 OK
Header line: Content-Length: 5570
Header line: Content-Type: text/plain
Header line: Last-Modified: Mon, 14 Feb 2011 19:41:32 GMT
Header line: Date: Fri, 18 Feb 2011 16:33:00 GMT
Header line: Expires: Fri, 18 Feb 2011 16:33:00 GMT
Header line: Cache-Control: private, max-age=0
Header line: Vary: Accept-Encoding
Header line: X-Content-Type-Options: nosniff
Header line: Server: sffe
Header line: X-XSS-Protection: 1; mode=block
Request time: 1 secs
Header line: HTTP/1.1 200 OK
Header line: Content-Type: text/plain
Header line: Last-Modified: Mon, 14 Feb 2011 19:41:32 GMT
Header line: Date: Fri, 18 Feb 2011 16:33:00 GMT
Header line: Expires: Fri, 18 Feb 2011 16:33:00 GMT
Header line: Cache-Control: private, max-age=0
Header line: Vary: Accept-Encoding
Header line: X-Content-Type-Options: nosniff
Header line: Server: sffe
Header line: X-XSS-Protection: 1; mode=block
Header line: Transfer-Encoding: chunked
Request time: 0 secs
Parsing robots.txt file using myname = htdig
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow: /search
[snip]
1:1:http://www.google.com/ skipped
pick: www.google.com, # servers = 1
> www.google.com supports HTTP persistent connections (infinite)
0:2:0:http://www.google.com/:
Making HTTP request on http://www.google.com/
Header line: HTTP/1.1 302 Found
Header line: Location: http://www.google.cz/
Header line: Cache-Control: private
Header line: Content-Type: text/html; charset=UTF-8
Header line: Set-Cookie: PREF=ID=2aeef165b7e5cb44:FF=0:TM=1298046780:LM=1298046780:S=6Y13eJy7ZFw3AqXT; expires=Sun, 17-Feb-2013 16:33:00 GMT; path=/; domain=.google.com
Header line: Set-Cookie: NID=44=IHIeVx1KPEnPu8mPKVOc_OOaXVjYcxe_wSsIVvPkszdiulFy-t9bu1OKlSucqxPe_XinFRBrR49g8l31V3YfDiCvp1MaCl7WzBW6ruJqMrP-3bBYFcoPgij83296pztb; expires=Sat, 20-Aug-2011 16:33:00 GMT; path=/; domain=.google.com; HttpOnly
Header line: Date: Fri, 18 Feb 2011 16:33:00 GMT
Header line: Server: gws
Header line: Content-Length: 218
Header line: X-XSS-Protection: 1; mode=block
Request time: 0 secs
 redirect
redirect: http://www.google.cz/
Rejected: URL not in the limits!
pick: www.google.com, # servers = 1
> www.google.com supports HTTP persistent connections (infinite)
ht://dig End Time: Fri Feb 18 11:33:00 2011

According to our test case there should be this error message:

External parser error: Can't execute /usr/local/bin/htmlparserfake

But I see none.

---------------------------

To your last post:

# which htdig
/usr/bin/htdig
# rpm -qf $(which htdig)
htdig-3.2.0-0.10.b6.el6.x86_64
# grep external_parsers /etc/htdig/htdig.conf
external_parsers: text/html /usr/local/bin/htmlparserfake
# htdig -vvvv | grep External
(no output)

The whole output: http://fpaste.org/IvUR/
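As a side note, the expected "Can't execute" message would come from the exec of the parser failing with ENOENT. That underlying failure can be seen directly from a shell, without htdig, by running the nonexistent parser path from the config above (a sketch, assuming /usr/local/bin/htmlparserfake is absent, as in this test setup):

```shell
#!/bin/sh
# Running the nonexistent parser path directly reproduces the underlying
# exec failure (ENOENT) behind htdig's "Can't execute" message.
# The exact error wording varies by shell; the exit status is 127.
/usr/local/bin/htmlparserfake </dev/null
echo "exit status: $?"
```

Status 127 is the conventional shell exit code for a command that could not be found or exec'd.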
The problem is with the test case you give: htdig never even gets around to attempting to parse an HTML document, so no attempt is made to call the external parser. The start_url of http://www.google.com redirects to http://www.google.cz/, which causes htdig to give this error:

Rejected: URL not in the limits!
pick: www.google.com, # servers = 1

This is because, by default, limit_urls_to is set to the value of start_url, so that htdig doesn't stray off to other sites. You should try a simple start_url that gives you a single HTML page with no redirects as your test case, e.g.:

start_url: http://www.htdig.org/author.html

Then you should get the error:

execv: No such file or directory
External parser error: Can't execute /usr/local/bin/htmlparserfake

By the way, your first attribute, base_dir, isn't a standard htdig attribute name. I think you mean "database_dir".
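Putting both fixes together, a corrected version of the test config might look like this (a sketch based on the attributes quoted in this bug; the start_url is the redirect-free example page suggested above, and the parser path remains the deliberately missing test parser):

```
database_dir:     /var/lib/htdig
common_dir:       /usr/share/htdig
translate_latin1: false
start_url:        http://www.htdig.org/author.html
external_parsers: text/html /usr/local/bin/htmlparserfake
```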
Thanks very much! I fixed the start_url and database_dir. I didn't have much time to dig into this, as the test wasn't originally written by me. The test now passes. Closing as not a bug.