|Summary:||Clients unable to connect to an IPv6 server socket|
|Product:||Red Hat Enterprise Linux 5||Reporter:||Sushma <sushma_fernandes>|
|Component:||kernel||Assignee:||Neil Horman <nhorman>|
|Status:||CLOSED ERRATA||QA Contact:||Martin Jenner <mjenner>|
|Version:||5.0||CC:||davem, denise.eckstein, dzickus, nhorman, poelstra|
|Fixed In Version:||RHBA-2008-0314||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2008-05-21 14:46:15 UTC||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
|Bug Blocks:||246139, 296411, 372911, 420521, 422431, 422441, 425461|
Description Sushma 2007-07-12 20:16:09 UTC
Description of problem:
When a client running multiple threads tries to connect to an IPv6 server socket, very often only one of the threads is able to connect successfully. The rest of them error out. This is very reproducible when clients try to connect on the ::1 loopback interface.

Version-Release number of selected component (if applicable):

How reproducible:
When running a client program with 3 threads, very often one thread connects successfully and another thread fails to connect almost immediately. The third thread blocks and fails to connect after ~3 minutes, almost as if waiting for some timeout to happen.

Steps to Reproduce:
1. mkdir -p /tmp/tests
2. Untar test.tar in /tmp/tests
3. Run /tmp/tests/maketest.sh

Please execute maketest.sh to build the testcase and start a loop of clients. The loop terminates when a client fails to connect to the server. The server socket is set to port 3999; this can be changed by editing server.c and threadclient.c.

Actual results:
Very often only one thread connects successfully. The rest of the threads fail.

Expected results:
All the threads are able to connect successfully.

Additional info:
Also attached is a Wireshark log captured when the client fails to connect.
Comment 1 Sushma 2007-07-12 20:18:21 UTC
Created attachment 159092 [details] Wireshark log captured when the error happens
Comment 2 Sushma 2007-07-12 20:19:24 UTC
Created attachment 159093 [details] testcase to reproduce the problem
Comment 4 Sushma 2007-07-19 18:01:45 UTC
We are seeing the same intermittent failure behavior when trying to use a site-local IPv6 address to connect to the server. Output from the client:

# ./client fec0::212:79ff:fe8f:ddcb
Host name passed : fec0::212:79ff:fe8f:ddcb
Connecting to host: fec0::212:79ff:fe8f:ddcb
Host name passed : fec0::212:79ff:fe8f:ddcb
Connecting to host: fec0::212:79ff:fe8f:ddcb
Host name passed : fec0::212:79ff:fe8f:ddcb
Connecting to host: fec0::212:79ff:fe8f:ddcb
connect() failed: Connection reset by peer
connect() failed: Connection timed out
Thread 1 returns error: 1
Thread 2 returns error: 1
Test Failed

Only one of the client threads is able to connect to the server socket. The second thread fails to connect immediately, and the third thread waits for 3 minutes and then fails. This is the same behavior as seen when connecting to ::1.
Comment 5 Sushma 2007-07-26 16:44:22 UTC
I've reproduced the connection timeout problem with a single-threaded client. I updated the test case to create only one thread and started a loop to run the test. It takes longer to reproduce the timeout in this case than with the multithreaded client. The client does the following:
1. Connect to a server socket
2. Send data
3. Receive data

Running this client in a loop, after some time the client fails to connect to the server socket with a connection timeout (3 minutes) error. Attached is a Wireshark file that contains the IPv6 packet capture when the client fails with a connection timeout. To view the failure scenario, see the last part of the capture for port 46205 in the attached file, or set the Wireshark filter to (ipv6.addr eq ::1 and ipv6.addr eq ::1) and (tcp.port eq 46205 and tcp.port eq 3999). An interesting thing to note in the error case is the large number of "TCP previous segment lost" messages for the SYN/ACKs sent from the server to the client. The client somehow fails to receive even a single acknowledgment from the server, and there are many RST messages from the client to the server.
Comment 6 Sushma 2007-07-26 16:50:03 UTC
Created attachment 160039 [details] Wireshark capture log for a single threaded client failure To view the failure scenario, in the attached file see the last part of the capture for port 46205 or set the Wireshark filter to (ipv6.addr eq ::1 and ipv6.addr eq ::1) and (tcp.port eq 46205 and tcp.port eq 3999).
Comment 9 Neil Horman 2007-12-21 21:18:49 UTC
Note to self: doing a little digging here, it appears this problem is being caused because tcp_connect is being called twice successively on the same socket. Not sure what path we're following to get there, but I'll figure that out shortly. We call connect twice on the same socket, which leads to tcp_transmit_skb being called twice, with the same SYN frame being sent each time, resulting in an RST coming back from the server when it detects duplicate source port numbers.
Comment 10 Neil Horman 2008-01-04 01:55:05 UTC
Thomas, since I was looking at this, I'm reassigning to myself. If you'd rather I didn't, please let me know, and I'll gladly send it back your way. Thanks!
Comment 11 Neil Horman 2008-01-04 21:33:09 UTC
Note to self: further investigation shows that it's actually tcp_transmit_skb that we're calling twice, the second call coming from tcp_rcv_state_process, which is odd, as we're not waiting for a SYN/ACK frame from the peer. Investigating...
Comment 13 David Miller 2008-01-09 13:30:54 UTC
Created attachment 291141 [details] potential upstream fix... When checking for existing established connections during an IPv6 connect, we wouldn't use the correct local port in the hash check. I'm not sure this is exactly the bug being seen here, but it might be.
Comment 14 Neil Horman 2008-01-09 15:05:32 UTC
Thanks Dave, that looks like a reasonable match. I'll try it out today.
Comment 15 Neil Horman 2008-01-09 19:32:04 UTC
So far it looks like we have a winner. With this patch, I've been running the reproducer for an hour without a failure.
Comment 16 Neil Horman 2008-01-09 20:13:49 UTC
Created attachment 291195 [details] Backport of the patch Dave M. referenced. This patch does indeed seem to solve the problem.
Comment 17 Neil Horman 2008-01-09 20:17:49 UTC
Sushma, can you confirm that this patch solves the problem for you? Let me know and I can spin an RPM for you. Thanks!
Comment 18 Sushma 2008-01-09 20:26:40 UTC
Neil, Thanks for looking into this! Sure, I'd be glad to try out the RPM. Thanks!
Comment 19 Neil Horman 2008-01-09 20:50:27 UTC
k, I'll have rpms for you in a bit. What arch(es) do you need this built for?
Comment 20 Sushma 2008-01-09 21:16:05 UTC
x86_64, IA-64 and IA-32 rpms would be great. Thanks!
Comment 21 Neil Horman 2008-01-10 00:26:29 UTC
Test kernels available at: http://people.redhat.com/nhorman/rpms/bz248052-kernels.tbz2 Please test and confirm that they solve the described problem for you. Thanks!
Comment 22 Sushma 2008-01-23 04:41:41 UTC
Neil, the patch fixes the problem on our systems. Our tests are running fine. Thanks for fixing it! Is this fix for a Linux kernel component that other distros take advantage of?
Comment 23 Neil Horman 2008-01-23 11:51:10 UTC
Sure, it's a patch to the IPv6 codepath and is upstream. Any distro can pick it up if they like.
Comment 25 Don Zickus 2008-01-24 16:08:46 UTC
In 2.6.18-74.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 27 Sushma 2008-02-26 06:33:39 UTC
Quick question: does this patch work for RHEL 5.1 as well? Also, when I go to the el5 link mentioned in comment 25, I no longer see 2.6.18-74.el5. If I pick up 83.el5, will the fix be included? We are seeing a problem on an AMD x86_64 system wherein the system hangs with the following stack in dmesg. The test kernel with this fix was applied on this system, and the system is running RHEL 5.1 (2.6.18-65.el5.bz248052 #1 SMP Wed Jan 9 16:05:55 EST 2008 x86_64 x86_64 x86_64 GNU/Linux):

BUG: soft lockup - CPU#0 stuck for 10s! [events/0:14]
CPU 0:
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc cpufreq_ondemand dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac ipv6 parport_pc lp parport sg k8_edac shpchp k8temp hwmon serio_raw tg3 edac_mc pcspkr dm_snapshot dm_zero dm_mirror dm_mod usb_storage sata_svw libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 14, comm: events/0 Tainted: G M 2.6.18-65.el5.bz248052 #1
RIP: 0010:[<ffffffff800743b8>] [<ffffffff800743b8>] __smp_call_function+0x6a/0x8b
RSP: 0018:ffff8100bff13d90 EFLAGS: 00000297
RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000000000ff RSI: 00000000000000bf RDI: 00000000000000c0
RBP: 0000000000000000 R08: 0000000000000004 R09: 000000000000003c
R10: ffff8100bff13cf0 R11: ffff810161ef8800 R12: 0000000000000000
R13: 0000000000000000 R14: 000000000000000e R15: 0000000000000286
FS: 00002aaaaaac6f40(0000) GS:ffffffff8039c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000569f000 CR3: 0000000000201000 CR4: 00000000000006e0
Call Trace:
 [<ffffffff80070b8d>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff800744e6>] smp_call_function+0x32/0x47
 [<ffffffff80070b8d>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff80091126>] on_each_cpu+0x10/0x22
 [<ffffffff8006feaf>] mcheck_timer+0x1c/0x6c
 [<ffffffff8004b580>] run_workqueue+0x94/0xe5
 [<ffffffff80047eb2>] worker_thread+0x0/0x122
 [<ffffffff80047fa2>] worker_thread+0xf0/0x122
 [<ffffffff80089af8>] default_wake_function+0x0/0xe
 [<ffffffff800322e8>] kthread+0xfe/0x132
 [<ffffffff8005cfb1>] child_rip+0xa/0x11
 [<ffffffff800321ea>] kthread+0x0/0x132
 [<ffffffff8005cfa7>] child_rip+0x0/0x11
Comment 28 Neil Horman 2008-02-26 12:12:16 UTC
Sure, the patch will work for 5.1, although it won't be included there. The above stack trace appears to be completely unrelated to this particular bug.
Comment 29 John Poelstra 2008-03-21 03:56:21 UTC
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot1, available now on partners.redhat.com. Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add a *keyword* of PartnerVerified (leaving the existing keywords unmodified).

If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
Comment 30 John Poelstra 2008-04-02 21:37:49 UTC
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot3, available now on partners.redhat.com. Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add a *keyword* of PartnerVerified (leaving the existing keywords unmodified).

If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
Comment 32 errata-xmlrpc 2008-05-21 14:46:15 UTC
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html