Bug 1226534
Summary: | evolution hangs in getaddrinfo call while reading netlink socket
---|---
Product: | [Fedora] Fedora
Component: | glibc
Version: | 22
Status: | CLOSED CURRENTRELEASE
Severity: | unspecified
Priority: | unspecified
Hardware: | Unspecified
OS: | Unspecified
Reporter: | Erik van Pienbroek <erik-fedora>
Assignee: | Carlos O'Donell <codonell>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | arjun.is, codonell, erik-fedora, fweimer, gjasny, jakub, law, mnewsome, pfrankli
Doc Type: | Bug Fix
Type: | Bug
Last Closed: | 2015-11-29 20:01:13 UTC
Description
Erik van Pienbroek
2015-05-30 11:35:24 UTC
Created attachment 1032444
backtrace as generated by the gstack tool
Created attachment 1032534
output of gdb command 'bt full'
The getaddrinfo call is blocking because the netlink request is blocking. We use the netlink interface to determine if we have ipv4 and ipv6 interfaces. The library sends a request to the kernel and expects a result back. Blocking at ../sysdeps/unix/sysv/linux/check_pf.c:166 means it blocked on the read of the netlink socket waiting for an answer from the kernel. I don't see any problem with glibc. Do you have a smaller reproducer that uses netlink to determine interfaces and it doesn't hang?

Thanks for your response. I don't have a smaller reproducer yet, but now that you mention that it probably is a kernel issue I'll try upgrading my kernel first. According to https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/net/core/rtnetlink.c there have been several changes to the kernel netlink code recently so that might help.

Do you think dumping kernel info using /proc/sysrq-trigger can be used to obtain more details about the reason why the netlink recvmsg call doesn't return?

(In reply to Erik van Pienbroek from comment #4)
> Thanks for your response.
> I don't have a smaller reproducer yet, but now that you mention that it probably is a kernel issue I'll try upgrading my kernel first. According to https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/net/core/rtnetlink.c there have been several changes to the kernel netlink code recently so that might help.
>
> Do you think dumping kernel info using /proc/sysrq-trigger can be used to obtain more details about the reason why the netlink recvmsg call doesn't return?

Yes, absolutely, seeing where the kernel is stuck *might* be useful, but sometimes the buffers are unconnected so you don't know why it's stuck, but it might give you a hint.
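The netlink exchange described above (a request sent to the kernel, then a read that waits for the answer) can be approximated outside glibc as a smaller reproducer. The following is a rough sketch only, not the code in check_pf.c: the function names are made up for illustration, the numeric constants are copied from linux/netlink.h and linux/rtnetlink.h, and the receive timeout is an addition that glibc does not have. It is there so that the script reports a missing kernel reply instead of hanging.

```python
import socket
import struct

# Values from <linux/netlink.h> and <linux/rtnetlink.h>.
NLMSG_ERROR = 2
NLMSG_DONE = 3
RTM_NEWADDR = 20
RTM_GETADDR = 22
NLM_F_REQUEST = 0x001
NLM_F_DUMP = 0x300  # NLM_F_ROOT | NLM_F_MATCH

NLMSGHDR = "=IHHII"  # nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSGHDR_LEN = struct.calcsize(NLMSGHDR)  # 16 bytes


def build_getaddr_request(seq=1):
    """Build an RTM_GETADDR dump request: nlmsghdr plus a padded rtgenmsg."""
    payload = struct.pack("Bxxx", socket.AF_UNSPEC)  # rtgen_family + 3 pad bytes
    header = struct.pack(NLMSGHDR, NLMSGHDR_LEN + len(payload), RTM_GETADDR,
                         NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + payload


def parse_messages(data):
    """Split one netlink datagram into a list of (type, payload) pairs."""
    messages = []
    offset = 0
    while offset + NLMSGHDR_LEN <= len(data):
        length, msg_type, _flags, _seq, _pid = struct.unpack_from(
            NLMSGHDR, data, offset)
        if length < NLMSGHDR_LEN or length > len(data) - offset:
            break  # truncated or malformed message
        messages.append((msg_type, data[offset + NLMSGHDR_LEN:offset + length]))
        offset += (length + 3) & ~3  # messages are aligned to 4 bytes
    return messages


def probe_families(timeout=2.0):
    """Return the set of address families that have at least one address.

    Raises socket.timeout if the kernel does not answer within `timeout`
    seconds, which is the situation reported in this bug.
    """
    seen = set()
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                       socket.NETLINK_ROUTE) as sock:
        sock.bind((0, 0))         # port id 0: let the kernel assign one
        sock.settimeout(timeout)  # hypothetical safeguard, not in glibc
        sock.sendto(build_getaddr_request(), (0, 0))  # port id 0 is the kernel
        while True:
            for msg_type, payload in parse_messages(sock.recv(65536)):
                if msg_type == NLMSG_DONE:
                    return seen
                if msg_type == NLMSG_ERROR:
                    raise OSError("netlink returned an error message")
                if msg_type == RTM_NEWADDR and payload:
                    seen.add(payload[0])  # first byte is ifaddrmsg.ifa_family
```

On a Linux host, calling `probe_families()` and checking the result for `socket.AF_INET` and `socket.AF_INET6` gives roughly the information getaddrinfo is after. If the call raises a timeout on an affected kernel, that points at the kernel side rather than at glibc.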
Using /proc/sysrq-trigger gives me this trace for the hanging getaddrinfo/recvmsg thread:

[295196.249529] Call Trace:
[295196.249532] [<ffffffff8177d117>] schedule+0x37/0x90
[295196.249535] [<ffffffff8178002c>] schedule_timeout+0x17c/0x230
[295196.249538] [<ffffffff811fd615>] ? __kmalloc_node_track_caller+0x245/0x320
[295196.249541] [<ffffffff810dc7f4>] ? prepare_to_wait_exclusive+0x54/0x80
[295196.249544] [<ffffffff81656b99>] __skb_recv_datagram+0x4b9/0x520
[295196.249547] [<ffffffff8164e8b7>] ? skb_queue_tail+0x47/0x60
[295196.249550] [<ffffffff81656c60>] ? skb_recv_datagram+0x60/0x60
[295196.249553] [<ffffffff81656c3f>] skb_recv_datagram+0x3f/0x60
[295196.249556] [<ffffffff81695cdb>] netlink_recvmsg+0x5b/0x3b0
[295196.249560] [<ffffffff81646dbc>] sock_recvmsg+0x7c/0xc0
[295196.249563] [<ffffffff81647c66>] ___sys_recvmsg+0xf6/0x230
[295196.249566] [<ffffffff816475c6>] ? SYSC_sendto+0x1b6/0x200
[295196.249570] [<ffffffff81648931>] __sys_recvmsg+0x51/0xa0
[295196.249573] [<ffffffff81781408>] ? int_check_syscall_exit_work+0x34/0x3d
[295196.249576] [<ffffffff81648992>] SyS_recvmsg+0x12/0x20
[295196.249579] [<ffffffff817811c9>] system_call_fastpath+0x12/0x17

My guess is that the kernel waits for an exclusive lock to become available on the netlink socket which never happens for some reason. I'll try digging more into it later.

This indeed turned out to be a kernel bug: https://lkml.org/lkml/2015/10/5/771
It is solved in recent kernels so this bug can be closed.
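For anyone wanting to collect a similar trace, here is a small sketch of the sysrq step. It rests on two assumptions: the thread does not say which sysrq command was used, so the `t` command (dump all tasks) is a guess, and both function names are invented for this example. Writing to /proc/sysrq-trigger needs root, and the output lands in the kernel log. The second helper drops frames marked with `?`, which the kernel prints for addresses it found on the stack but could not confirm as part of the call chain.

```python
import re

# Matches one stack frame such as
#   [<ffffffff81695cdb>] netlink_recvmsg+0x5b/0x3b0
# with an optional "? " marker in front of the symbol.
FRAME = re.compile(r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def request_task_dump(trigger_path="/proc/sysrq-trigger", command="t"):
    """Ask the kernel to log task states and stacks (assumed command: 't')."""
    with open(trigger_path, "w") as trigger:
        trigger.write(command)


def reliable_frames(trace_text):
    """Return the symbols of frames that are not marked unreliable ('?')."""
    return [symbol for unsure, symbol in FRAME.findall(trace_text) if not unsure]
```

Applied to the trace above, `reliable_frames` leaves the chain from `system_call_fastpath` down to `schedule`, which makes it easier to see that the thread is waiting inside `netlink_recvmsg`.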