Bug 477012

Summary: network hangs with xen_vnif in FV RHEL5 guest
Product: Red Hat Enterprise Linux 5
Reporter: Jeff Layton <jlayton>
Component: kernel
Assignee: Herbert Xu <herbert.xu>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: low
Version: 5.3
CC: ddutile, dzickus, steved, xen-maint
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:57:36 UTC
Attachments:
net: Handle non-linear packets in skb_checksum_setup (flags: none)

Description Jeff Layton 2008-12-18 16:57:25 UTC
Earlier this year, I got a lot of network hangs when the RHEL5.3-ish kernels switched to xen_vnif. I was never able to reproduce them reliably and, after working with Don Dutile, we eventually figured the problem had been resolved somewhere else in the kernel.

I've just recently switched my RHEL5 xen guest back to xen_vnif and the problem has returned. The kernel is 2.6.18-128.el5.jtltest.57debug, which is basically -128.el5 with some extra patches (mostly filesystem stuff).

This time, I have a little more info...

The symptoms are basically that ssh sessions will go dead -- they'll just stop responding. I've gotten the box into this state and here's what I see on the network. This is a single keypress on one console session:

629.482804 10.11.228.36 -> 10.11.231.179 SSH Encrypted request packet len=48
629.483148 10.11.231.179 -> 10.11.228.36 TCP ssh > 40845 [ACK] Seq=369 Ack=673 Win=104 Len=0 TSV=528999 TSER=621174332

...so we got an ACK here.

While the box was in this state I could also log in on the serial console, and I was able to get some lsof and strace output from the sshd process that holds the socket. Relevant fds from lsof:

sshd    2441 root    3u  IPv6               8296            TCP dhcp231-179.rdu.redhat.com:ssh->barsoom.rdu.redhat.com:40844 (ESTABLISHED)
sshd    2441 root    4u  unix 0xffff81000acbd448           8527 socket
sshd    2441 root    5r  FIFO                0,6           8551 pipe
sshd    2441 root    6w  FIFO                0,6           8551 pipe
sshd    2441 root    7u   CHR                5,2            716 /dev/ptmx
sshd    2441 root    8u   CHR                5,2            716 /dev/ptmx
sshd    2441 root    9u   CHR                5,2            716 /dev/ptmx

...here's the strace from that single keypress:

2441  select(9, [3 5 8], [], NULL, NULL) = 1 (in [3])
2441  rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
2441  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
2441  read(3, "U\374\320\27\350\257\337\7\r\376(\377\257\31\337\371\233\t\20\26\261N\346j\306\244MO\213\262'\311K"..., 16384) = 48

...encrypted data comes into socket.

2441  select(9, [3 5 8], [7], NULL, NULL) = 1 (out [7])
2441  rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
2441  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
2441  write(7, "\r"..., 1)              = 1

...data decrypted -- sent to pty.

2441  select(9, [3 5 8], [], NULL, NULL) = 1 (in [8])
2441  rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
2441  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
2441  read(8, "\r\n"..., 16384)         = 2

...full-duplex echo of the character to send back to client (with extra lf on end).

2441  select(9, [3 5 8], [3], NULL, NULL) = 1 (out [3])
2441  rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
2441  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
2441  write(3, "\22z\302\213J]\343.,\336\233|(D\245`\23D\371\367\345\336\20$LoW\26\t.\250\255\305"..., 48) = 48

...encrypted data sent on socket.

2441  select(9, [3 5 8], [], NULL, NULL) = 1 (in [8])
2441  rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
2441  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
2441  read(8, "\33]0;root@dhcp231-179:~\7"..., 16384) = 23
2441  select(9, [3 5 8], [3], NULL, NULL) = 2 (in [8], out [3])
2441  rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
2441  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
2441  read(8, "[root@dhcp231-179 ~]# "..., 16384) = 22
2441  write(3, "\352I8\302\335\262\274\355J\324N\375\221\17\232\277\21\211|n^&Z\244}\3\3245\342\34\"+\304"..., 64) = 64
2441  select(9, [3 5 8], [3], NULL, NULL) = 1 (out [3])
2441  rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
2441  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
2441  write(3, "C\3\356\351\22m\224\256B?\33q\344\313\327\263Pp\251\246)\0\22o\202\271\247\372\236\245>\222\264"..., 64) = 64
2441  select(9, [3 5 8], [], NULL, NULL <unfinished ...>

...next shell prompt displayed and sent on socket.

The client never receives any of the data sent by the server here. From this it looks like an occasional problem sending packets on a particular socket. The curious thing is that new sessions can be started and will generally work for a while. Once they're hung though, they stay that way.

I can try to reproduce again and collect more info if you can think of anything that would be helpful.

Comment 2 Herbert Xu 2009-02-10 06:05:13 UTC
What does netstat -nto show on the server? Also any chance you can let me log into the server (via serial console presumably) when this is happening? Thanks!

Comment 3 Jeff Layton 2009-02-10 13:02:30 UTC
I switched the guest back to using xen_vnif, and the first ssh session into the box hung within a few seconds:

# netstat -nto
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       Timer
tcp        0   1968 ::ffff:10.11.231.179:22     ::ffff:10.11.12.60:46360    ESTABLISHED on (27.84/7/0)

...guest kernel is:

2.6.18-131.el5.jtltest.61debug

...and host kernel is 2.6.18-128.el5virttest5xen. That one is a -128.el5 kernel with the patches for bug 470035.

I'll leave this box in this state for the time being. Find me on IRC if you want access to it.
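
For readers unfamiliar with the Timer column in the netstat output above: as far as I know, net-tools prints it as a timer state followed by (seconds remaining/retransmissions/probes). If so, "on (27.84/7/0)" together with a Send-Q of 1968 means 1968 bytes are still unacknowledged, the retransmission timer fires in about 28 seconds, and the data has already been retransmitted 7 times. A minimal Python sketch for pulling those numbers out of a line follows. The function name parse_timer and the dictionary keys are made up for illustration, and the meaning of the third counter is an assumption.

```python
import re

# Matches the Timer column as printed by `netstat -o`, e.g. "on (27.84/7/0)".
# Assumed field order: seconds until the timer fires, retransmissions so far,
# and a third counter (probes sent).
TIMER_RE = re.compile(
    r"(?P<state>\w+) \((?P<remaining>[\d.]+)/(?P<retrans>\d+)/(?P<probes>\d+)\)\s*$"
)

def parse_timer(line):
    """Return the timer fields from one `netstat -nto` line, or None if absent."""
    m = TIMER_RE.search(line)
    if m is None:
        return None
    return {
        "state": m.group("state"),
        "remaining_s": float(m.group("remaining")),
        "retransmits": int(m.group("retrans")),
        "probes": int(m.group("probes")),
    }

# The line reported in this comment.
line = ("tcp        0   1968 ::ffff:10.11.231.179:22     "
        "::ffff:10.11.12.60:46360    ESTABLISHED on (27.84/7/0)")
info = parse_timer(line)
print(info)
```

A retransmit count that keeps climbing while Send-Q never drains is what one would expect if the same segment is being dropped on every attempt.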

Comment 4 Jeff Layton 2009-02-10 13:03:25 UTC
btw: the jtltest guest kernel is basically a -131.el5 kernel with some NFS and CIFS patches. Nothing that should affect the lower networking layers.

Comment 5 Herbert Xu 2009-02-12 11:15:16 UTC
11:08 <herbert> ok i think it's a bug in netfront that's causing an incorrectly laid out packet to be sent to netback
11:08 <herbert> which then discards it because it fails one of the sanity checks, e.g., by crossing a page boundary
11:08 <herbert> as the same packet is then retransmitted over and over again it'll never make it across, thus stalling the connection
11:09 <herbert> so what we need to find out now is exactly how the packet is broken
11:09 <herbert> could you please rebuild the netback module on dom0 after adding #define DEBUG to the top of the file?
11:09 <herbert> that way we can get the backend to print out what exactly is wrong with the packet
11:09 <herbert> thanks!
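
The theory in the log above can be made concrete with a small model. The following Python sketch is not netback code. It only illustrates the kind of sanity check mentioned, assuming 4096-byte pages and a request described by an offset within a page plus a size. The names PAGE_SIZE, crosses_page_boundary, backend_accepts and retransmit are all made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86

def crosses_page_boundary(offset, size):
    """True if a buffer starting at `offset` within a page runs past the end of that page."""
    return offset + size > PAGE_SIZE

def backend_accepts(offset, size):
    """Toy stand-in for a backend sanity check: reject data that spans two pages."""
    return not crosses_page_boundary(offset, size)

def retransmit(offset, size, attempts):
    """Resend the same layout `attempts` times and count how many copies get through."""
    return sum(1 for _ in range(attempts) if backend_accepts(offset, size))

# A well-formed request is accepted every time.
print(retransmit(offset=0, size=1448, attempts=7))
# A request that spans a page boundary is rejected every time, because a
# retransmission repeats the same layout.
print(retransmit(offset=4000, size=1448, attempts=7))
```

The point of the model is the second call: a check that depends only on the packet layout fails identically on every retransmission, so the connection can never make progress.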

Comment 6 Jeff Layton 2009-02-12 13:33:15 UTC
Ok, added:

#define DEBUG

to the top of drivers/xen/netback/netback.c, rebuilt that driver and installed netback.ko and netloop.ko in the dom0's kernel dir.

I logged into the rhel5 guest and poked around a bit, but didn't see any interesting messages. If I didn't do the correct thing, please send a patch and I'll give it another go.

Comment 7 Herbert Xu 2009-02-13 08:18:30 UTC
Created attachment 331815
net: Handle non-linear packets in skb_checksum_setup

This patch fixes the problem on Jeff's machine.
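
The attachment is not reproduced in this report, so the following does not show the patch. It is only a toy model, in Python, of what "non-linear" means in the patch title: some of a packet's bytes sit in a directly addressable head area and the rest in separate fragments, so code that reads a header field straight out of the head must first make sure the field is actually there. The class ToyPacket and its methods are hypothetical and do not correspond to kernel functions.

```python
class ToyPacket:
    """Toy model of a packet split into a linear head and extra fragments."""

    def __init__(self, head, frags):
        self.head = bytearray(head)              # directly addressable bytes
        self.frags = [bytes(f) for f in frags]   # remaining bytes, in order

    def pull(self, needed):
        """Move bytes from the fragments into the head until it holds `needed` bytes.

        Returns False if the whole packet is shorter than `needed`.
        """
        while len(self.head) < needed and self.frags:
            missing = needed - len(self.head)
            frag = self.frags.pop(0)
            self.head += frag[:missing]
            if len(frag) > missing:
                # Put back the part of the fragment that was not needed.
                self.frags.insert(0, frag[missing:])
        return len(self.head) >= needed

    def read_field(self, offset, length):
        """Read a header field, pulling it into the head first if necessary."""
        if not self.pull(offset + length):
            raise ValueError("packet too short for requested field")
        return bytes(self.head[offset:offset + length])


# 4 bytes in the head, the rest in two fragments. The 2-byte field at offset 6
# lies entirely outside the head, so slicing the head directly would return
# nothing useful.
pkt = ToyPacket(b"\x00\x01\x02\x03", [b"\x04\x05\x06", b"\x07\x08\x09"])
print(pkt.read_field(6, 2))
```

Code that assumes every header lives in the head works for linear packets and misreads non-linear ones, which is the class of bug the patch title points at.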

Comment 8 Jeff Layton 2009-02-16 11:30:38 UTC
I've been testing this and can confirm that it seems to work well. No connection hangs since this patch has been in place.

Comment 13 errata-xmlrpc 2009-09-02 08:57:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html