Bug 958652

Summary: getaddrinfo returns EAI_SYSTEM but errno is zero
Product: [Fedora] Fedora
Reporter: Jan Kaluža <jkaluza>
Component: glibc
Assignee: Siddhesh Poyarekar <spoyarek>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 19
CC: codonell, fweimer, jakub, law, mnewsome, pfrankli, schwab, spoyarek, zbyszek
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glibc-2.17-11.fc19
Doc Type: Bug Fix
Last Closed: 2013-06-29 18:05:30 UTC
Type: Bug
Bug Blocks: 954007
Attachments:
Description                    Flags
addrinfo.c                     none
strace with -11 return code    none
strace with -2 return code     none
nsswitch.conf                  none

Description Jan Kaluža 2013-05-02 06:41:25 UTC
While fixing httpd Bug 954007, I found that getaddrinfo returns the EAI_SYSTEM error while errno is set to 0. This looks suspicious to me, and I think it is a bug in getaddrinfo (or somewhere lower-level).

For a description of the configuration in which this happens, please check the description of Bug 954007. Note that I'm not able to reproduce it myself, but the reporter of Bug 954007 is.

In the code of APR (the library that httpd uses), you can see this particular getaddrinfo call at line 365:

http://svn.apache.org/viewvc/apr/apr/trunk/network_io/unix/sockaddr.c?revision=1083931&view=markup#l365

errno returned at line 378 is 0 for the original reporter.
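
The failure mode described above can be guarded against on the caller's side. The sketch below is not the APR code or its upstream patch; `lookup_status` and `resolve` are hypothetical names, and it assumes glibc, where the EAI_* constants are negative and so cannot collide with positive errno values. It returns a non-zero status for every failed lookup, even when getaddrinfo reports EAI_SYSTEM with errno left at 0.

```c
#include <errno.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not APR's code): map a getaddrinfo() result to a
 * status that is 0 only on success. */
int lookup_status(int gai_rc, int saved_errno)
{
    if (gai_rc == 0)
        return 0;               /* lookup succeeded */
    if (gai_rc == EAI_SYSTEM && saved_errno != 0)
        return saved_errno;     /* errno carries the real cause */
    /* A plain EAI_* code, or EAI_SYSTEM with errno == 0: fall back to the
     * EAI_* value itself so the failure is never reported as 0. */
    return gai_rc;
}

/* Hypothetical wrapper: resolve `host` and return a status as above. */
int resolve(const char *host, struct addrinfo **res)
{
    struct addrinfo hints;
    memset(&hints, 0, sizeof hints);
    hints.ai_socktype = SOCK_STREAM;

    errno = 0;                  /* do not report a stale errno */
    int rc = getaddrinfo(host, NULL, &hints, res);
    return lookup_status(rc, errno);
}
```

With this mapping, the combination reported here (return value -11, errno 0) yields -11 rather than 0, so a caller that treats 0 as success does not mistake the failure for a successful lookup.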

Comment 1 Zbigniew Jędrzejewski-Szmek 2013-05-02 17:45:27 UTC
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=958934

Comment 2 Siddhesh Poyarekar 2013-05-10 05:41:55 UTC
Could you try this with an isolated reproducer on the system where you're able to replicate the problem?  By isolated I mean a program that does nothing other than call getaddrinfo and check the results (errno and the return value).

Comment 3 Jan Kaluža 2013-05-10 06:41:18 UTC
Zbigniew,

can you please compile the attached source code using:

gcc addrinfo.c -o addrinfo

Then run it as "./addrinfo".

Please try it on the machine where httpd crashed for you and paste the output here.

Comment 4 Jan Kaluža 2013-05-10 07:36:45 UTC
Created attachment 745981 [details]
addrinfo.c

Comment 5 Zbigniew Jędrzejewski-Szmek 2013-05-10 16:00:01 UTC
(Original installation with #954007)
error: 0 -11 0
error: 2 -2 0
error: 10 -11 0

(Second container with #958934)
error: 0 -2 0
error: 2 -2 0
error: 10 -2 0

Comment 6 Jan Kaluža 2013-05-11 05:52:23 UTC
Siddhesh,

as you can see, getaddrinfo returns the EAI_SYSTEM (-11) error with errno 0 for Zbigniew:

fprintf(stderr, "error: %d %d %d\n", family, error, errno);
error: 0 -11 0
error: 2 -2 0
error: 10 -11 0

Zbigniew,

please keep the installation with #954007 around. I think Siddhesh may have additional questions about the environment later.
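
For readers without access to the attachment, the three columns in the output are the address family, the getaddrinfo return value, and errno. The following is a sketch of a reproducer of that shape, not the attached addrinfo.c: the function names are hypothetical, and the host name to look up and the SOCK_STREAM hint are assumptions, since the thread does not show them.

```c
#include <errno.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Look up `host` for one address family.  On failure, print
 * "error: <family> <return value> <errno>", the same three columns as the
 * fprintf() quoted above.  Returns getaddrinfo()'s result. */
int try_family(FILE *out, const char *host, int family)
{
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = family;
    hints.ai_socktype = SOCK_STREAM;    /* assumption, see lead-in */

    errno = 0;                          /* do not report a stale errno */
    int error = getaddrinfo(host, NULL, &hints, &res);
    int saved_errno = errno;            /* fprintf() may change errno */
    if (error != 0)
        fprintf(out, "error: %d %d %d\n", family, error, saved_errno);
    else
        freeaddrinfo(res);
    return error;
}

/* Try the three families seen in the output: 0, 2 and 10 are AF_UNSPEC,
 * AF_INET and AF_INET6 on Linux.  Returns the number of failed lookups. */
int try_all_families(FILE *out, const char *host)
{
    const int families[] = { AF_UNSPEC, AF_INET, AF_INET6 };
    int failures = 0;
    for (size_t i = 0; i < sizeof families / sizeof families[0]; i++)
        if (try_family(out, host, families[i]) != 0)
            failures++;
    return failures;
}
```

Calling `try_all_families(stderr, host)` from a small `main`, with a host name that triggers the problem, prints one line per failing family in the format shown in the comments above.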

Comment 7 Siddhesh Poyarekar 2013-05-13 06:08:48 UTC
Thanks, the problem seems to be the same as:

http://sourceware.org/bugzilla/show_bug.cgi?id=15339

for which I already have a fix.  Zbigniew, would you be able to install and test a scratch package?

Comment 8 Siddhesh Poyarekar 2013-05-13 09:10:49 UTC
This is the scratch build to test:

http://koji.fedoraproject.org/koji/taskinfo?taskID=5371240

Comment 9 Zbigniew Jędrzejewski-Szmek 2013-05-13 15:44:56 UTC
(In reply to comment #8)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=5371240

I installed glibc-common-2.17-7.fc19.0.test.x86_64, glibc-2.17-7.fc19.0.test.x86_64, glibc-debuginfo-common-2.17-7.fc19.0.test.x86_64, glibc-debuginfo-2.17-7.fc19.0.test.x86_64, since I don't have the other packages in the build.

I don't see any change:
error: 0 -11 0
error: 2 -2 0
error: 10 -11 0

Also, AFAICT, the network in my container is working fine: I have an IP and a route and three reachable nameservers in /etc/resolv.conf, yum downloads packages...

Comment 10 Siddhesh Poyarekar 2013-05-14 07:27:57 UTC
OK, thanks for testing that.  Can you run that program under strace and attach the results?  Also, I'd like to know how you've set nsswitch.conf and if you have nscd running.  If nscd is running, then please keep it disabled whenever you're doing these tests.  Use this strace command:

strace -xvv -s 255 ./addrinfo

The strace may contain confidential information about your network (dns servers, network configuration, etc.) so I hope you can at least send it to me personally, if not attached to the bug report.

Comment 11 Zbigniew Jędrzejewski-Szmek 2013-05-14 19:20:41 UTC
(In reply to comment #10)
> Also, I'd like to know how you've set nsswitch.conf and
> if you have nscd running.
I don't have nscd running.

OK, I think I found the culprit: nss-myhostname. I had an old version installed which was broken (it had an unresolved symbol). If I remove myhostname from /etc/nsswitch.conf, I get the following results from addrinfo:

error: 0 -2 0
error: 2 -2 0
error: 10 -2 0

Sorry guys for that, it seems to be entirely my fault.
I'll attach nsswitch.conf and the straces just in case.

Comment 12 Zbigniew Jędrzejewski-Szmek 2013-05-14 19:21:18 UTC
Created attachment 747875 [details]
strace with -11 return code

Comment 13 Zbigniew Jędrzejewski-Szmek 2013-05-14 19:21:43 UTC
Created attachment 747876 [details]
strace with -2 return code

Comment 14 Zbigniew Jędrzejewski-Szmek 2013-05-14 19:22:53 UTC
Created attachment 747877 [details]
nsswitch.conf

Comment 15 Siddhesh Poyarekar 2013-05-15 08:46:20 UTC
Thanks, could you elaborate on what exactly was wrong with nss-myhostname?  I'd like to try and replicate it to make sure that it's not something that glibc ought to have handled.  The strace does not show any errors in actually reading the plugin file.

Comment 16 Zbigniew Jędrzejewski-Szmek 2013-05-15 11:37:18 UTC
> what exactly was wrong with nss-myhostname?
A systemd bug, fixed in http://cgit.freedesktop.org/systemd/systemd/commit/?id=1e335af70: the .so file needed a symbol (log_<something>) that couldn't be resolved, so the module could not be loaded. It should be trivial to recreate by adding a call to any function that is not defined in any of the libraries.

> make sure that it's not something that glibc ought to have handled
If anything should be changed, I think that glibc is the only place. The module wasn't even loaded, so it has nothing to say in this matter.
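
Whether a module fails to load because of an unresolved symbol can be checked outside of getaddrinfo. The sketch below is a standalone diagnostic and makes no claim about how glibc loads NSS modules internally; `module_loads` is a hypothetical name. It relies on `dlopen` with `RTLD_NOW`, which resolves all undefined symbols at load time and fails if any is missing.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical diagnostic: returns 1 if `path` can be loaded with all
 * symbols resolved, 0 otherwise.  On failure the loader's message is
 * copied into errbuf (when errbuf is non-NULL and len > 0). */
int module_loads(const char *path, char *errbuf, size_t len)
{
    /* RTLD_NOW forces immediate symbol resolution, so a missing symbol
     * makes dlopen() fail here instead of at the first call. */
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *msg = dlerror();
        if (errbuf != NULL && len > 0)
            snprintf(errbuf, len, "%s", msg != NULL ? msg : "unknown error");
        return 0;
    }
    dlclose(handle);
    return 1;
}
```

Passing the path of the installed nss-myhostname module (the exact file name depends on the system and is not given in full in this thread) and printing `errbuf` on failure would show the loader's complaint. On older glibc versions the program has to be linked with `-ldl`.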

Comment 17 Siddhesh Poyarekar 2013-05-15 14:06:39 UTC
Even easier to reproduce: just move libnss_myhostname.so away and put myhostname in nsswitch.conf.  I'll take this because it is related to upstream 15339.  As I had feared, the fix is not complete.

Comment 18 Siddhesh Poyarekar 2013-05-15 15:11:13 UTC
Please try this build once it is done.  My local testing seems to indicate that this is fixed:

http://koji.fedoraproject.org/koji/taskinfo?taskID=5384387

Comment 19 Zbigniew Jędrzejewski-Szmek 2013-05-15 20:44:58 UTC
(In reply to comment #18)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=5384387
error: 0 -2 0
error: 2 -2 0
error: 10 -2 0

> My local testing seems to indicate that this is fixed:
So it seems.

Comment 20 Siddhesh Poyarekar 2013-05-31 13:31:16 UTC
The patch is upstream and will make it into rawhide with the next rebase.  Do you need an f19 backport?

commit 3d04f5db20c8f0d1ba3881b5f5373586a18cf188
Author: Siddhesh Poyarekar <siddhesh>
Date:   Tue May 21 21:54:41 2013 +0530

    Set EAI_SYSTEM only when h_errno is NETDB_INTERNAL
    
    Fixes BZ #15339.
    
    NSS_STATUS_UNAVAIL may mean that a necessary input resource is not
    available.  This could occur in a number of cases including when the
    network is down, system runs out of file descriptors, etc.  The
    correct differentiator in such a case is the h_errno, which gives the
    nature of failure.  In case of failures other than a simple 'not
    found', we set h_errno as NETDB_INTERNAL and let errno be the
    identifier for the exact error.
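
The rule stated in the commit message can be illustrated as a small mapping from h_errno to a getaddrinfo return value. This is a sketch, not the glibc source: `eai_from_h_errno` is a hypothetical name, and the choices for every value other than NETDB_INTERNAL (EAI_AGAIN for TRY_AGAIN, EAI_NONAME for the rest) are assumptions made for illustration.

```c
/* On glibc, NETDB_INTERNAL is only declared with default/misc features. */
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <netdb.h>

/* Hypothetical helper illustrating the commit's rule: EAI_SYSTEM is
 * returned only when h_errno is NETDB_INTERNAL, in which case errno is
 * expected to identify the exact error. */
int eai_from_h_errno(int herr)
{
    switch (herr) {
    case NETDB_INTERNAL:
        return EAI_SYSTEM;      /* caller should consult errno */
    case TRY_AGAIN:
        return EAI_AGAIN;       /* assumption: temporary failure */
    default:
        return EAI_NONAME;      /* assumption: treat the rest as "not found" */
    }
}
```

Under this rule a plain "not found" maps to EAI_NONAME, which is -2 on glibc and consistent with the output in comment 19, while EAI_SYSTEM (-11) is reserved for failures where errno says what went wrong.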

Comment 21 Jan Kaluža 2013-05-31 15:34:50 UTC
I was hoping you would fix that in F19; otherwise I will have to backport the APR patch (it's upstream too). That is not a problem, but the real bug is in glibc, and there could be other projects using glibc the way httpd does.

Comment 22 Fedora Update System 2013-06-25 17:31:00 UTC
glibc-2.17-11.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/glibc-2.17-11.fc19

Comment 23 Fedora Update System 2013-06-26 17:09:26 UTC
Package glibc-2.17-11.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.17-11.fc19'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-11737/glibc-2.17-11.fc19
then log in and leave karma (feedback).

Comment 24 Fedora Update System 2013-06-29 18:05:30 UTC
glibc-2.17-11.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.