Red Hat Bugzilla – Bug 161181
resolver fails to handle truncated UDP replies
Last modified: 2007-11-30 17:11:08 EST
Description of problem:
I am trying to connect to remote host:

gklab-59-001:~> rdesktop tkepczyx-mobl1.ger.corp.intel.com.
ERROR: tkepczyx-mobl1.ger.corp.intel.com.: unable to resolve host

nslookup output:

gklab-59-001:~> nslookup tkepczyx-mobl1.ger.corp.intel.com.
;; Truncated, retrying in TCP mode.
Server: 172.28.168.7
Address: 172.28.168.7#53

Non-authoritative answer:
Name: tkepczyx-mobl1.ger.corp.intel.com
Address: 172.28.37.68

dig output:

gklab-59-001:~> dig tkepczyx-mobl1.ger.corp.intel.com.
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.3.1 <<>> tkepczyx-mobl1.ger.corp.intel.com.
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17848
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 20, ADDITIONAL: 20

;; QUESTION SECTION:
;tkepczyx-mobl1.ger.corp.intel.com. IN A

;; ANSWER SECTION:
tkepczyx-mobl1.ger.corp.intel.com. 384 IN A 172.28.37.68

(AUTHORITY section is long and skipped here as I consider this information sensitive). I will attach strace.

Version-Release number of selected component (if applicable):
1.4.0, FC4 fully upgraded to latest updates.

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:
Cannot connect to remote host using its name.

Expected results:
Connect to remote host using its name.

Additional info:
I guess the problem is due to large response which requires fallback to DNS over TCP and this is not correctly handled.
Created attachment 115731 [details] strace -i -s 1024 rdesktop tkepczyx-mobl1.ger.corp.intel.com.
I've just tried a simple C++ program:

#include <netdb.h>
#include <cstdio>

char name[] = "tkepczyx-mobl1.ger.corp.intel.com";

int main()
{
    struct hostent *he;

    /* Resolve the name through the libc resolver. */
    he = gethostbyname(name);
    printf("hostent: %p\n", (void *)he);

    /* On failure gethostbyname() returns NULL and sets h_errno. */
    if (he == NULL)
        printf("h_errno: %d\n", h_errno);
    return 0;
}

which also fails with h_errno = TRY_AGAIN, while dig and nslookup still work. This points to a problem in the library, so I am reassigning to glibc.
Reassigning to glibc maintainer.
A few other hints:
- the host I am trying to reach has 20 NS records associated with it; other hosts with fewer NS records (2-3) work fine
- the problem did not exist in FC3 (but I am not 100% sure that there were no DNS changes in the meantime)
- I tried telnet and ssh to the same host with a similar result
Can you reproduce it with some publicly accessible DNS?
No. But setting up a test zone with one A entry and 20 or so NS entries should do the trick. I used the following GENERATE statements to save typing:

$GENERATE 1-50 @ NS nameserver${0,2}
$GENERATE 1-50 nameserver${0,2} A 192.168.253.${200}

I actually confirmed the fault with this kind of setup on the x86_64 platform. I also tried CentOS 4, which ships RHEL's glibc rebuilt from source (glibc-2.3.4-2.9), and it works fine. I can also add that adding the above lines to my usual setup completely screwed up my NFS client, which uses hostnames.
*** Bug 165802 has been marked as a duplicate of this bug. ***
I think I fixed this now upstream. The next rawhide build will probably have it (look out for this bug number in the rpm changelog). Once it is available, consider trying it.
Yeah, glibc-2.3.90-9 and above should fix this.
Sorry guys, I've just tried it on glibc-2.3.90-10 i686 and it does not work.
If you are using nscd, have you flushed the nscd cache (i.e. nscd -i hosts)? Or stop nscd before testing. Then, please attach a new strace -i -s 1024 log. The one in #1 showed

connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("172.28.168.7")}, 28) = -1 EINVAL

and similar errors, which is exactly what has been fixed in 2.3.90-9 and above. Now the last argument to connect in this case will be 16 and it shouldn't fail with EINVAL.
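For readers wondering where 28 and 16 come from: on Linux these values match the sizes of struct sockaddr_in6 and struct sockaddr_in respectively. The thread does not say this explicitly, so treat it as an inference from the strace line. The sketch below shows the usual way to pick the length passed to connect() from the address family; addrlen_for_family is a hypothetical name, not a glibc function.

```cpp
#include <netinet/in.h>
#include <sys/socket.h>

// Hypothetical helper: return the address length that should be passed
// to connect() for a socket address of the given family, or 0 if the
// family is not handled here.
socklen_t addrlen_for_family(int family)
{
    switch (family) {
    case AF_INET:  return sizeof(struct sockaddr_in);   // 16 bytes on Linux
    case AF_INET6: return sizeof(struct sockaddr_in6);  // 28 bytes on Linux
    default:       return 0;
    }
}
```

Passing the AF_INET6 length together with an AF_INET address is the kind of mismatch that would produce the EINVAL seen in the first strace.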
Created attachment 118202 [details] strace -i -s 1024 ping tkepczyx-mobl1.ger.corp.intel.com
Now I am not sure whether it is glibc or ping. Telnet and rdesktop somehow work.
In the ping case it might be a SELinux policy issue. Look at your logs for audit messages.
I guess this may be it:

type=AVC msg=audit(1125311948.892:16533349): avc: denied { name_connect } for pid=4163 comm="ping" dest=53 scontext=user_u:system_r:ping_t tcontext=system_u:object_r:dns_port_t tclass=tcp_socket
type=SYSCALL msg=audit(1125311948.892:16533349): arch=40000003 syscall=102 success=no exit=-13 a0=3 a1=bfed214c a2=b35ff4 a3=b7fb1690 items=0 pid=4163 auid=43270 uid=43270 gid=32602 euid=43270 suid=43270 fsuid=43270 egid=32602 sgid=32602 fsgid=32602 comm="ping" exe="/bin/ping"
Then the glibc bug is fixed. Whether this is a bug in selinux policy or whether use of nscd in this case is mandatory is something I'll leave to the selinux maintainers to decide.
Why would ping be trying to tcp connect to port 53?
When a UDP resolver query fails, it is retried over TCP. That is what happened here: the UDP query returned a so-called "truncated" result (i.e. more data than a UDP datagram can contain), so the query was retried over TCP and that connection was denied by SELinux. This probably happened "behind the scenes" in the resolver library.
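As an illustration of how a client can tell that a UDP reply was truncated: the DNS message header defined in RFC 1035 carries a TC (truncation) flag, and a reply with that flag set is the cue to repeat the query over TCP. The sketch below only inspects a raw reply buffer; it is not glibc's resolver code, and dns_reply_truncated is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: report whether a raw DNS reply has the TC
// (truncation) flag set. The fixed DNS header is 12 bytes long:
// bytes 0-1 hold the query ID and bytes 2-3 hold the flags.
// In byte 2 the bits are QR (0x80), Opcode (0x78), AA (0x04),
// TC (0x02) and RD (0x01).
bool dns_reply_truncated(const std::uint8_t *msg, std::size_t len)
{
    if (msg == nullptr || len < 12)
        return false;               // too short to contain a DNS header
    return (msg[2] & 0x02) != 0;    // test the TC bit
}
```

A caller that gets true back would reopen the query on a TCP socket to port 53, which is the connection SELinux denied for ping in this report.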
OK, added allow $1 dns_port_t:tcp_socket name_connect; to the can_ldap macro, which will allow all domains that use DNS to use either UDP or TCP to resolve. Dan
Per request on fedora-list: I have a fully yum-updated FC3 machine with bind set up as a caching DNS. This bug is present in glibc-2.3.5.

# ping en.wikipedia.org
ping: unknown host en.wikipedia.org

I updated the machine temporarily with:

binutils-2.16.91.0.2-4.i386.rpm
glibc-2.3.90-10.i386.rpm
glibc-2.3.90-10.i686.rpm
glibc-common-2.3.90-10.i386.rpm
glibc-devel-2.3.90-10.i386.rpm
glibc-headers-2.3.90-10.i386.rpm

The machine boots and appears stable, the truncation message from bind is still present, and ping works correctly (as expected). Reverting the machine to a glibc-2.3.5 setup and the earlier binutils once again breaks ping.
Fixed in selinux-policy-*-1.27.1-2.1
can someone please backport the glibc fix into FC4? thanks
The SE Linux issue is resolved, so now it's apparently just a glibc issue.
I've got a fully updated system on x86_64 (with glibc-2.3.5-10.3 and selinux-policy-targeted-1.27.1-2.22), and both ping and ssh work fine for a host with lots of nameservers configured (as described in #6) and for which dig reports a retry in TCP mode. I believe this bug can be closed now.
What was the upstream bug number for the GLIBC side of this bug? I see a big list of upstream BZ numbers in the 2.3.6-1 rev. Is this fix one of them? Thanks, ccb
Closing bugs in MODIFIED state from prior Fedora releases. If this bug persists in a current Fedora release (such as Fedora Core 5 or later), please reopen and set the version appropriately.