Bug 1226534

Summary: evolution hangs in getaddrinfo call while reading netlink socket
Product: [Fedora] Fedora Reporter: Erik van Pienbroek <erik-fedora>
Component: glibcAssignee: Carlos O'Donell <codonell>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 22CC: arjun.is, codonell, erik-fedora, fweimer, gjasny, jakub, law, mnewsome, pfrankli
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-29 20:01:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
backtrace as generated by the gstack tool
none
output of gdb command 'bt full' none

Description Erik van Pienbroek 2015-05-30 11:35:24 UTC
The evolution mail program doesn't work stable on my Fedora 22 environment. It happens very frequently that the program keeps hanging in one thread. I've found out that one of the threads spawned by evolution/glib can become stuck in a getaddrinfo/recvmsg call which seems like a potential glibc issue to me.

Here's the stack trace of the stuck thread in question:

Thread 12 (Thread 0x7f55dffff700 (LWP 2871)):
#0  0x000000396d102edd in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000396d11f8ee in __check_pf (pid=-4344, fd=35) at ../sysdeps/unix/sysv/linux/check_pf.c:166
#2  0x000000396d11f8ee in __check_pf (seen_ipv4=seen_ipv4@entry=0x7f55dfffea32, seen_ipv6=seen_ipv6@entry=0x7f55dfffea33, in6ai=in6ai@entry=0x7f55dfffea40, in6ailen=in6ailen@entry=0x7f55dfffea48) at ../sysdeps/unix/sysv/linux/check_pf.c:324
#3  0x000000396d0eada1 in __GI_getaddrinfo (name=<optimized out>, name@entry=0x7f55e834b9a0 "mail.foo.bar", service=service@entry=0x0, hints=hints@entry=0x3fa057cae0, pai=pai@entry=0x7f55dfffeb70) at ../sysdeps/posix/getaddrinfo.c:2366
#4  0x0000003fa02882c0 in do_lookup_by_name (task=0x522bbb0 [GTask], source_object=<optimized out>, task_data=0x7f55e834b9a0, cancellable=<optimized out>) at gthreadedresolver.c:79
#5  0x0000003fa0285895 in g_task_thread_pool_thread (thread_data=0x522bbb0, pool_data=<optimized out>) at gtask.c:1231
#6  0x0000003f9f271a3e in g_thread_pool_thread_proxy (data=<optimized out>) at gthreadpool.c:307
#7  0x0000003f9f2710a5 in g_thread_proxy (data=0x7f55f4004230) at gthread.c:764
#8  0x000000396d807555 in start_thread (arg=0x7f55dffff700) at pthread_create.c:333
#9  0x000000396d101f3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

According to lsof fd 35 is a netlink socket

I've attached the complete backtrace and dump of all local variables to this bugzilla ticket.

The mail server in question (which is looked up in the getaddrinfo call) is only reachable after a VPN connection to my internal company network is up. This VPN connection wasn't up at the time of the stuck thread. When the VPN connection is up I don't have any stability issues with evolution.

The computer which shows this issue has both IPv4 and IPv6 set up and fully working.

The nsswitch.conf on this computer has the following entry:
  hosts:      files dns myhostname mymachines

When I use the tool 'dig' to query my local DNS server then both an A lookup as an MX lookup on the domain name in question both complete directly and return the error status NXDOMAIN to indicate that the domain name is unknown. This is expected as the domain name in question is only known when I use my company's DNS server (for which a VPN connection is required).

Package versions:
glibc-2.21-5.fc22.x86_64
glib2-2.44.1-1.fc22.x86_64
evolution-data-server-3.16.2-2.fc22.x86_64
evolution-3.16.2.1-1.fc22.x86_64

Comment 1 Erik van Pienbroek 2015-05-30 11:38:50 UTC
Created attachment 1032444 [details]
backtrace as generated by the gstack tool

Comment 2 Erik van Pienbroek 2015-05-30 17:01:01 UTC
Created attachment 1032534 [details]
output of gdb command 'bt full'

Comment 3 Carlos O'Donell 2015-06-03 04:18:29 UTC
The getaddrinfo call is blocking because the netlink request is blocking. We use the netlink interface to determine if we have ipv4 and ipv6 interfaces. The library sends a request to the kernel and expects a result back.

Blocking at ../sysdeps/unix/sysv/linux/check_pf.c:166 means it blocked on the read of the netlink socket waiting for an answer from the kernel.

I don't see any problem with glibc. Do you have a smaller reproducer that uses netlink to determine interfaces and it doesn't hang?

Comment 4 Erik van Pienbroek 2015-06-03 08:26:01 UTC
Thanks for your response.
I don't have a smaller reproducer yet, but now that you mention that it probably is a kernel issue I'll try upgrading my kernel first. According to https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/net/core/rtnetlink.c there have been several changes to the kernel netlink code recently so that might help.

Do you think dumping kernel info using /proc/sysrq-trigger can be used to obtain more details about the reason why the netlink recvmsg call doesn't return?

Comment 5 Carlos O'Donell 2015-06-03 13:34:55 UTC
(In reply to Erik van Pienbroek from comment #4)
> Thanks for your response.
> I don't have a smaller reproducer yet, but now that you mention that it
> probably is a kernel issue I'll try upgrading my kernel first. According to
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/net/core/
> rtnetlink.c there have been several changes to the kernel netlink code
> recently so that might help.
> 
> Do you think dumping kernel info using /proc/sysrq-trigger can be used to
> obtain more details about the reason why the netlink recvmsg call doesn't
> return?

Yes, absolutely, seeing where the kernel is stuck *might* be useful, but sometimes the buffers are unconnected so you don't know why it's stuck, but it might give you a hint.

Comment 6 Erik van Pienbroek 2015-06-04 07:29:34 UTC
Using /proc/sysrq-trigger gives me this trace for the hanging getaddrinfo/recvmsg thread:

[295196.249529] Call Trace:
[295196.249532]  [<ffffffff8177d117>] schedule+0x37/0x90
[295196.249535]  [<ffffffff8178002c>] schedule_timeout+0x17c/0x230
[295196.249538]  [<ffffffff811fd615>] ? __kmalloc_node_track_caller+0x245/0x320
[295196.249541]  [<ffffffff810dc7f4>] ? prepare_to_wait_exclusive+0x54/0x80
[295196.249544]  [<ffffffff81656b99>] __skb_recv_datagram+0x4b9/0x520
[295196.249547]  [<ffffffff8164e8b7>] ? skb_queue_tail+0x47/0x60
[295196.249550]  [<ffffffff81656c60>] ? skb_recv_datagram+0x60/0x60
[295196.249553]  [<ffffffff81656c3f>] skb_recv_datagram+0x3f/0x60
[295196.249556]  [<ffffffff81695cdb>] netlink_recvmsg+0x5b/0x3b0
[295196.249560]  [<ffffffff81646dbc>] sock_recvmsg+0x7c/0xc0
[295196.249563]  [<ffffffff81647c66>] ___sys_recvmsg+0xf6/0x230
[295196.249566]  [<ffffffff816475c6>] ? SYSC_sendto+0x1b6/0x200
[295196.249570]  [<ffffffff81648931>] __sys_recvmsg+0x51/0xa0
[295196.249573]  [<ffffffff81781408>] ? int_check_syscall_exit_work+0x34/0x3d
[295196.249576]  [<ffffffff81648992>] SyS_recvmsg+0x12/0x20
[295196.249579]  [<ffffffff817811c9>] system_call_fastpath+0x12/0x17

My guess is that the kernel waits for an exclusive lock to become available on the netlink socket which never happens for some reason.. I'll try digging more into it later

Comment 7 Erik van Pienbroek 2015-11-29 20:01:13 UTC
This indeed turned out to be a kernel bug: https://lkml.org/lkml/2015/10/5/771
It is solved in recent kernels so this bug can be closed