Description of problem:
When a client running multiple threads tries to connect to an IPv6 server socket, very often only one of the threads connects successfully. The rest fail with an error. This is easily reproduced when clients connect on the ::1 loopback interface.

Version-Release number of selected component (if applicable):

How reproducible:
When running a client program with 3 threads, very often one thread connects successfully and a second thread fails to connect almost immediately. The third thread blocks and fails to connect after ~3 minutes, as if waiting for a timeout to expire.

Steps to Reproduce:
1. mkdir -p /tmp/tests
2. Untar test.tar in /tmp/tests
3. Run /tmp/tests/maketest.sh

maketest.sh builds the test case and starts a loop of clients. The loop terminates when a client fails to connect to the server. The server socket is set to port 3999, which can be changed by editing server.c and threadclient.c.

Actual results:
Very often only one thread connects successfully. The remaining threads fail.

Expected results:
All threads connect successfully.

Additional info:
Also attached is a Wireshark log captured when the client fails to connect.
Created attachment 159092 [details] Wireshark log captured when the error happens
Created attachment 159093 [details] testcase to reproduce the problem
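The attached test case is not reproduced in this report. As a rough illustration of the scenario described above, here is a minimal sketch of a multi-threaded client that connects to ::1 on port 3999. The function names (make_loopback_addr, connect_once, run_clients) and the 16-thread cap are assumptions made for this sketch and are not taken from threadclient.c.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_THREADS 16  /* arbitrary cap for this sketch */

/* Fill `addr` with the IPv6 loopback address (::1) and `port`. */
int make_loopback_addr(uint16_t port, struct sockaddr_in6 *addr)
{
    memset(addr, 0, sizeof(*addr));
    addr->sin6_family = AF_INET6;
    addr->sin6_port = htons(port);
    return inet_pton(AF_INET6, "::1", &addr->sin6_addr) == 1 ? 0 : -1;
}

/* Thread body: one connect() attempt. Returns NULL on success,
 * a non-NULL pointer on failure. */
void *connect_once(void *arg)
{
    const struct sockaddr_in6 *addr = arg;
    intptr_t rc = 0;
    int fd = socket(AF_INET6, SOCK_STREAM, 0);

    if (fd < 0) {
        perror("socket() failed");
        return (void *)(intptr_t)1;
    }
    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
        perror("connect() failed");
        rc = 1;
    }
    close(fd);
    return (void *)rc;
}

/* Start `nthreads` concurrent connects to ::1:`port`.
 * Returns the number of threads that failed, or -1 on bad arguments. */
int run_clients(int nthreads, uint16_t port)
{
    struct sockaddr_in6 addr;
    pthread_t tids[MAX_THREADS];
    int created[MAX_THREADS];
    int failed = 0;

    if (nthreads < 0 || nthreads > MAX_THREADS ||
        make_loopback_addr(port, &addr) != 0)
        return -1;

    for (int i = 0; i < nthreads; i++)
        created[i] = pthread_create(&tids[i], NULL, connect_once, &addr) == 0;

    for (int i = 0; i < nthreads; i++) {
        void *ret = NULL;
        if (!created[i] || pthread_join(tids[i], &ret) != 0 || ret != NULL)
            failed++;
    }
    return failed;
}
```

With a server listening on port 3999, calling run_clients(3, 3999) repeatedly from main() until it returns non-zero approximates the loop that maketest.sh is described as running.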
We are seeing the same intermittent failure when using a site-local IPv6 address to connect to the server. Output from the client:

#./client fec0::212:79ff:fe8f:ddcb
Host name passed : fec0::212:79ff:fe8f:ddcb
Connecting to host: fec0::212:79ff:fe8f:ddcb
Host name passed : fec0::212:79ff:fe8f:ddcb
Connecting to host: fec0::212:79ff:fe8f:ddcb
Host name passed : fec0::212:79ff:fe8f:ddcb
Connecting to host: fec0::212:79ff:fe8f:ddcb
connect() failed: Connection reset by peer
connect() failed: Connection timed out
Thread 1 returns error: 1
Thread 2 returns error: 1
Test Failed

Only one of the client threads is able to connect to the server socket. The second thread fails to connect immediately, and the third thread waits for 3 minutes and then fails. This is the same behavior seen when connecting to ::1.
I've reproduced the connection timeout problem with a single-threaded client. I updated the test case to create only one thread and started a loop to run the test. It takes longer to reproduce the timeout this way than with the multithreaded client.

The client does the following:
1. Connect to a server socket
2. Send data
3. Receive data

When this client runs in a loop, after some time it fails to connect to the server socket with a connection timeout (3 minutes) error. Attached is a Wireshark file that contains the IPv6 packet capture taken when the client fails with a connection timeout. To view the failure scenario, see the last part of the capture for port 46205, or set the Wireshark filter to (ipv6.addr eq ::1 and ipv6.addr eq ::1) and (tcp.port eq 46205 and tcp.port eq 3999).

One interesting thing in the error case is the large number of "TCP previous segment lost" entries for the SYN/ACK segments sent from the server to the client. The client somehow fails to receive even a single acknowledgment from the server, and there are many RST segments from the client to the server.
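The three client steps listed above (connect, send, receive) can be sketched roughly as follows. This is not the code from the attached test case: connect_ipv6 and exchange are hypothetical helper names, and the sketch assumes the server sends back a single reply that fits in one recv() call.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 1: resolve `host` as an IPv6 address and connect to `port`.
 * Returns a connected socket, or -1 on failure. */
int connect_ipv6(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET6;      /* IPv6 only, as in the report */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                   /* connected */
        perror("connect() failed");  /* e.g. reset or timed out */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

/* Steps 2 and 3: send all of `msg`, then read one reply into `reply`.
 * Returns the number of bytes received, or -1 on error. */
ssize_t exchange(int fd, const char *msg, char *reply, size_t cap)
{
    size_t len = strlen(msg), off = 0;

    while (off < len) {
        ssize_t n = send(fd, msg + off, len - off, 0);
        if (n < 0)
            return -1;
        off += (size_t)n;
    }
    return recv(fd, reply, cap, 0);
}
```

A reproducer loop would call connect_ipv6("::1", "3999"), then exchange() on the returned socket, close it, and repeat until connect_ipv6() returns -1.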
Created attachment 160039 [details] Wireshark capture log for a single threaded client failure To view the failure scenario, in the attached file see the last part of the capture for port 46205 or set the Wireshark filter to (ipv6.addr eq ::1 and ipv6.addr eq ::1) and (tcp.port eq 46205 and tcp.port eq 3999).
Note to self: after a little digging, it appears this problem is caused by tcp_connect being called twice in succession on the same socket. I'm not sure what path we're following to get there, but I'll figure that out shortly. We call connect twice on the same socket, which leads to tcp_transmit_skb being called twice with the same SYN frame sent each time, resulting in an RST coming back from the server when it detects duplicate source port numbers.
Thomas, since I was looking at this, I'm reassigning to myself. If you'd rather I didn't, please let me know, and I'll gladly send it back your way. Thanks!
Note to self: further investigation shows that it's actually tcp_transmit_skb that we're calling twice, the second call coming from tcp_rcv_state_process, which is odd, as we're not waiting for a SYN/ACK frame from the peer. Investigating...
Created attachment 291141 [details] potential upstream fix... When checking for existing established connections during an ipv6 connect we wouldn't use the correct local port in the hash check. I'm not sure this is exactly the bug being seen here, but it might be.
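To illustrate why hashing with the wrong local port matters, here is a toy user-space model. It is not the kernel code and not the attached patch: the table layout, hash function, and all names are invented for this sketch, and addresses are reduced to 32-bit ids. The point it shows is only that a lookup hashed with a port other than the candidate local port searches the wrong bucket and can miss an existing connection with the same 4-tuple.

```c
#include <stdbool.h>
#include <stdint.h>

#define NBUCKETS 64        /* power of two so the hash can be masked */
#define BUCKET_CAP 8       /* arbitrary per-bucket capacity */

/* Toy connection 4-tuple (hypothetical, not a kernel structure). */
struct tuple {
    uint32_t saddr, daddr;
    uint16_t lport, dport;
};

struct bucket {
    struct tuple conns[BUCKET_CAP];
    int count;
};

/* Toy hash over the 4-tuple; any change in lport changes the bucket
 * when the low bits of the port differ. */
unsigned tuple_hash(uint32_t saddr, uint16_t lport,
                    uint32_t daddr, uint16_t dport)
{
    return (saddr ^ daddr ^ lport ^ dport) & (NBUCKETS - 1);
}

/* Record an established connection in the bucket chosen by its tuple. */
void table_insert(struct bucket *table, struct tuple t)
{
    struct bucket *b = &table[tuple_hash(t.saddr, t.lport, t.daddr, t.dport)];
    if (b->count < BUCKET_CAP)
        b->conns[b->count++] = t;
}

/* Returns true if a connection identical to `t` is found.
 * `hash_port` is the port fed to the hash: pass t.lport for a correct
 * check, or a different value to mimic hashing with the wrong port. */
bool is_established(const struct bucket *table, struct tuple t,
                    uint16_t hash_port)
{
    const struct bucket *b =
        &table[tuple_hash(t.saddr, hash_port, t.daddr, t.dport)];

    for (int i = 0; i < b->count; i++) {
        const struct tuple *c = &b->conns[i];
        if (c->saddr == t.saddr && c->daddr == t.daddr &&
            c->lport == t.lport && c->dport == t.dport)
            return true;
    }
    return false;
}
```

In this model, a check that misses the existing entry would let a second socket reuse the same source port toward the same peer, which is consistent with the duplicate SYN and RST behavior described earlier in this report.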
Thanks Dave, that looks like a reasonable match. I'll try it out today.
So far it looks like we have a winner. With this patch, I've been running the reproducer for an hour without a failure.
Created attachment 291195 [details] backport of Dave M. referenced patch This patch does indeed seem to solve the problem.
Sushma, can you confirm that this patch solves the problem for you? Let me know and I can spin an rpm for you. Thanks!
Neil, Thanks for looking into this! Sure, I'd be glad to try out the RPM. Thanks!
OK, I'll have rpms for you in a bit. What arch(es) do you need this built for?
x86_64, IA-64 and IA-32 rpms would be great. Thanks!
Test kernels available at: http://people.redhat.com/nhorman/rpms/bz248052-kernels.tbz2 Please test and confirm that they solve the described problem for you. Thanks!
Neil, the patch fixes the problem on our systems. Our tests are running fine. Thanks for fixing it! Is this fix for a linux kernel component that other distros take advantage of?
Sure, it's a patch to the IPv6 code path and it is upstream. Any distro can pick it up if they like.
in 2.6.18-74.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
Quick question. Does this patch work for RHEL 5.1 as well? Also, when I go to the el5 link mentioned in comment 25, I don't see 2.6.18-74.el5 any more. If I pick up 83.el5, will the fix be included?

We are seeing a problem on an AMD x86_64 system where the system hangs with the following stack in dmesg. The test kernel with this fix was installed on this system, which is running RHEL 5.1 (2.6.18-65.el5.bz248052 #1 SMP Wed Jan 9 16:05:55 EST 2008 x86_64 x86_64 x86_64 GNU/Linux):

BUG: soft lockup - CPU#0 stuck for 10s! [events/0:14]
CPU 0:
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc cpufreq_ondemand dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac ipv6 parport_pc lp parport sg k8_edac shpchp k8temp hwmon serio_raw tg3 edac_mc pcspkr dm_snapshot dm_zero dm_mirror dm_mod usb_storage sata_svw libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 14, comm: events/0 Tainted: G M 2.6.18-65.el5.bz248052 #1
RIP: 0010:[<ffffffff800743b8>] [<ffffffff800743b8>] __smp_call_function+0x6a/0x8b
RSP: 0018:ffff8100bff13d90 EFLAGS: 00000297
RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000000000ff RSI: 00000000000000bf RDI: 00000000000000c0
RBP: 0000000000000000 R08: 0000000000000004 R09: 000000000000003c
R10: ffff8100bff13cf0 R11: ffff810161ef8800 R12: 0000000000000000
R13: 0000000000000000 R14: 000000000000000e R15: 0000000000000286
FS: 00002aaaaaac6f40(0000) GS:ffffffff8039c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000569f000 CR3: 0000000000201000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff80070b8d>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff800744e6>] smp_call_function+0x32/0x47
 [<ffffffff80070b8d>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff80091126>] on_each_cpu+0x10/0x22
 [<ffffffff8006feaf>] mcheck_timer+0x1c/0x6c
 [<ffffffff8004b580>] run_workqueue+0x94/0xe5
 [<ffffffff80047eb2>] worker_thread+0x0/0x122
 [<ffffffff80047fa2>] worker_thread+0xf0/0x122
 [<ffffffff80089af8>] default_wake_function+0x0/0xe
 [<ffffffff800322e8>] kthread+0xfe/0x132
 [<ffffffff8005cfb1>] child_rip+0xa/0x11
 [<ffffffff800321ea>] kthread+0x0/0x132
 [<ffffffff8005cfa7>] child_rip+0x0/0x11
Sure, the patch will work for 5.1, although it won't be included there. The above stack trace appears to be completely unrelated to this particular bug.
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot1--available now on partners.redhat.com. Please test and confirm that your issue is fixed. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot3--available now on partners.redhat.com. Please test and confirm that your issue is fixed. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html