Created attachment 314850 [details]
strace with problematic /etc/hosts that causes failing name resolution, see around line 1109

We have an /etc/hosts with total size 15409 bytes, 19 lines, longest line is 3738 bytes.

With that /etc/hosts file in place, the MySQL client libraries (used through Perl DBD::mysql) fail to resolve anything, including DNS names not in /etc/hosts. The process does not segfault; it just confusingly claims there is no such hostname. The strace makes us suspect this is due to libresolv in glibc failing.

When the /etc/hosts lines are shortened, name resolution by the MySQL client works. This holds even with a huge overall /etc/hosts of 1.9 MB in place, as long as it has only short lines.

Version:

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
# rpm -q glibc
glibc-2.5-24

How reproducible: Every time.

A possibly related bug was filed by someone with Ubuntu here:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/130693

It appears that a bug around long lines in /etc/hosts was fixed back in RHEL 3:
https://bugzilla.redhat.com/show_bug.cgi?id=140378
http://sources.redhat.com/ml/libc-hacker/2004-11/msg00058.html

... but either the bug has crept back in from upstream glibc, or it's a different bug with similar effects.
Created attachment 314851 [details] strace with ok /etc/hosts that shows name resolution succeeding
(In reply to comment #0)
> We have an /etc/hosts with total size 15409 bytes, 19 lines, longest line is
> 3738 bytes.

Can you attach your /etc/hosts? If you can't attach it as-is because you don't want people to see your hostnames, replace them with some semi-random names.
I've tried to reproduce this, but haven't succeeded with an over 16KB /etc/hosts with ~ 4KB longest line. We really need your /etc/hosts, perhaps mangled in some way to hide the original host names or IPs, but with the same number of chars, different hostnames, etc. Also, do you use nscd or not? Can you reproduce it with simple getent hosts XXX or getent ahosts XXX ?
Ok, sorry I forgot about this for so long. I no longer had the original /etc/hosts and had to recreate the problem on a different server. It still happens on RHEL 5.5 the same way.

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
# rpm -q glibc
glibc-2.5-49
glibc-2.5-49

I wasn't using nscd before, or now.

"getent hosts corndog" does not cause the problem. (But it doesn't really seem to resolve the address per se, either; it just dumps the matching line from /etc/hosts.)

I will attach all the files I used. Please let me know if you have any trouble reproducing the problem.
Created attachment 405174 [details] The /etc/hosts file that causes the failure
Created attachment 405175 [details] An /etc/hosts file that is similar but has no very long line and works
Created attachment 405176 [details] Perl program to demonstrate the resolve failure with DBD::mysql
Created attachment 405177 [details] strace of failure
Created attachment 405178 [details] strace of successful run
Created attachment 405179 [details] strace of getent hosts run that succeeded
Oh, and here's what the Perl test script runs look like.

A successful name resolution (but failed MySQL connection because the connection information is bogus):

$ ./resolve-test.pl
DBI connect('database=somedb;host=corndog','someuser',...) failed: Can't connect to MySQL server on 'corndog' (111) at ./resolve-test.pl line 7

A failed name resolution with the fat hosts file:

$ ./resolve-test.pl
DBI connect('database=somedb;host=corndog','someuser',...) failed: Unknown MySQL server host 'corndog' (-1) at ./resolve-test.pl line 7
Reassigning to mysql component.

As far as I can tell this is a problem with mysql-libs. F14 exhibits this problem, and the problem persists regardless of what version of glibc or perl modules are installed. However, if mysql-libs is updated to mysql-5.5.8-10 from F15, the test works.

A failing test will report something like this:

DBI connect('database=somedb;host=corndog','someuser',...) failed: Unknown MySQL server host 'corndog' (-1) at /tmp/test line 7

Note the "Unknown MySQL server ..."

A succeeding test will (after a long wait) report something like this:

DBI connect('database=somedb;host=corndog','someuser',...) failed: Can't connect to MySQL server on 'corndog' (110) at /tmp/test line 7

Note that it resolved the name but was unable to connect.

P.S. Make sure your nsswitch.conf only uses files for host lookups...
Jeff, very interesting find. Your conclusion seems sound. I'm not using any system with this combination of MySQL + long /etc/hosts anymore, so I guess I'll just say happy day once the latest mysql-libs is everywhere so people won't run into the bug anymore. :) Thanks for the update.
Hm. While I'm not looking at the mysql code at the moment, it wouldn't surprise me a bit if they had some hand-rolled code in there instead of using libresolv at all. Jon, had you seen failures with the long /etc/hosts file and any component *other* than mysql?

> I'll just say happy day once the latest mysql-libs is everywhere

That's gonna be a long time as far as RHEL is concerned :-(
I poked around in the mysql 5.1.x sources and found that the "Unknown MySQL server" error is issued if gethostbyname_r() fails, entirely independently of what the actual errno is. 5.5.x has replaced that whole code sequence with a getaddrinfo call, which probably explains the difference in behavior. Eyeballing the gethostbyname_r() call, my attention is drawn to the buf/buflen arguments. The gethostbyname_r man page says that it will return ERANGE if the buffer is "too small", which would fit the reported symptom, but nowhere is it suggested what "too small" might be. mysql 5.1.x is using a fixed buffer size, which is either sizeof(struct hostent_data) or 2048 depending on a nest of #ifdef's that I don't feel like deciphering right now. If gethostbyname_r() is expecting to fit a line of /etc/hosts into that buffer, then I think we have our explanation. Anybody know that code offhand?
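For illustration, here is a minimal sketch of the usual way callers cope with ERANGE from gethostbyname_r(): retry with a larger scratch buffer instead of treating the first failure as "unknown host". This is not taken from the mysql sources; the helper name resolve_with_retry, the doubling strategy, and the 1 MiB cap are all assumptions made for the example.

```c
#define _GNU_SOURCE /* gethostbyname_r() is a GNU extension */
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* Arbitrary upper bound for the scratch buffer (an assumption). */
#define MAX_SCRATCH (1024 * 1024)

/*
 * Hypothetical helper: resolve `name` to its first IPv4 address,
 * doubling the scratch buffer each time glibc reports ERANGE.
 * Returns 0 on success, -1 on failure with an h_errno-style code in *herr.
 */
int resolve_with_retry(const char *name, size_t initial_len,
                       struct in_addr *out, int *herr)
{
    struct hostent he;
    struct hostent *result = NULL;
    size_t len = initial_len > 0 ? initial_len : 64;
    char *buf;
    int rc;

    assert(name != NULL && out != NULL && herr != NULL);

    buf = malloc(len);
    if (buf == NULL) {
        *herr = NETDB_INTERNAL;
        return -1;
    }

    /* ERANGE means "buffer too small", not "no such host": grow and retry. */
    while ((rc = gethostbyname_r(name, &he, buf, len, &result, herr)) == ERANGE) {
        char *bigger;

        if (len >= MAX_SCRATCH)
            break; /* give up; rc stays ERANGE */
        len *= 2;
        bigger = realloc(buf, len);
        if (bigger == NULL) {
            *herr = NETDB_INTERNAL;
            rc = ENOMEM;
            break;
        }
        buf = bigger;
    }

    /* rc != 0 is an error; rc == 0 with a NULL result means "not found". */
    if (rc != 0 || result == NULL) {
        free(buf);
        return -1;
    }
    if (result->h_addrtype != AF_INET || result->h_addr_list[0] == NULL) {
        *herr = NO_DATA;
        free(buf);
        return -1;
    }

    /* Copy the address out before freeing the buffer it points into. */
    memcpy(out, result->h_addr_list[0], sizeof *out);
    free(buf);
    return 0;
}
```

A caller would pass something like resolve_with_retry("corndog", 2048, &addr, &herr) and only report an unknown host when the call returns -1. If the fixed 2048-byte buffer really is the culprit, a loop like this would let the lookup proceed rather than fail on the first ERANGE.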
I recently found and fixed a bug in x86_64 glibc which sounds somewhat similar to what is being described here:
http://sourceware.org/bugzilla/show_bug.cgi?id=14307

The root cause there was also an ERANGE error, returned because the initial temporary buffer tried was too small (512 bytes, of which 400 were used for some internal struct).

Did you verify whether this problem indeed only occurs on x86_64 and not on 32-bit x86? If so, the solution could well be to increase the fixed buffer size used by mysql.
Since RHEL5 is now in maintenance mode, this bug is not going to get fixed there. AFAICT newer versions of mysql don't have the issue.