Bug 484590

Summary: Running Openswan ipsec vpn server with rhel-5.3 kernel-2.6.18-128.el5 causes crash
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: rc
Keywords: Regression, ZStream
Reporter: Tuomo Soini <tis>
Assignee: Neil Horman <nhorman>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: alain.richard, anton, bill-bugzilla.redhat.com, bj+bugzilla, cfalconer, davem, dennisml, dhoward, dzickus, emcnabb, jpirko, jplans, nhorman, pasteur, pjjw, pveiga, pwouters, rdassen, tgraf, yuri
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:10:21 UTC
Bug Blocks: 496044
Attachments (description, flags):
  udp: Fix rcv socket locking (none)
  backport of herberts patch (none)
  kernel crash on heavily loaded system with 2.6.18-128.1.6.el5 with proposed fix applied (none)
  Diff from 2.6.18-128.1.6.el5 to kernel crashed on comment #26 (none)
  Possible fix for remaining locking issue (none)

Description Tuomo Soini 2009-02-08 20:09:39 UTC
Description of problem:

There is a regression in the rhel-5.3 kernel when running openswan for ipsec VPNs.  The system locks up after 2 hours to 2 days, depending on the number of ipsec tunnels. This lockup happens on every system I tested with this kernel version.

Version-Release number of selected component (if applicable):

kernel-2.6.18-128.el5

How reproducible:

Always.

Steps to Reproduce:
1. start ipsec service
2. wait some hours
3. crash
  
Actual results:

BUG: soft lockup - CPU#0 stuck for 10s! [pluto:6996]
CPU 0:
Modules linked in: krng ansi_cprng chainiv rng authenc hmac cryptomgr deflate zlib_deflate ccm serpent blowfish twofish ecb xcbc crypto_hash cbc md5 sha256 sha512 des aes_generic testmgr_cipher testmgr crypto_blkcipher aes_x86_64 ah6 ah4 esp6 xfrm6_esp esp4 xfrm4_esp aead crypto_algapi xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_tunnel ipcomp ipcomp6 xfrm6_tunnel tunnel6 af_key softdog autofs4 w83627hf lm85 hwmon_vid i2c_isa hidp l2cap bluetooth cls_fw sch_sfq act_police cls_u32 sch_ingress sch_htb ip6table_filter ip6_tables xt_realm iptable_raw xt_comment xt_policy ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_TCPMSS ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_iprange ipt_hashlimit ipt_ECN ipt_ecn ipt_DSCP ipt_dscp ipt_CLUSTERIP ipt_ah ipt_addrtype ip_nat_tftp ip_nat_snmp_basic ip_nat_sip ip_nat_pptp ip_nat_irc ip_nat_h323 ip_nat_ftp ip_nat_amanda ip_conntrack_tftp ip_conntrack_sip ip_conntrack_pptp ip_conntrack_netbios_ns ip_conntrack_irc ip_conntrack_h323 ip_conntrack_ftp ts_kmp ip_conntrack_amanda xt_tcpmss xt_pkttype xt_physdev bridge xt_NFQUEUE xt_multiport xt_MARK xt_mark xt_mac xt_limit xt_length xt_helper xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY xt_tcpudp xt_state iptable_nat ip_nat ip_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables sunrpc ipv6 xfrm_nalgo crypto_api 8021q bonding ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi cpufreq_ondemand powernow_k8 freq_table dm_mirror dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac lp i2c_amd8111 i2c_amd756 i2c_core amd_rng k8temp tg3 hwmon serio_raw ide_cd libphy usblp parport_pc parport e100 mii k8_edac edac_mc cdrom sg pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache shpchp sata_sil libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 6996, comm: pluto Not tainted 2.6.18-128.el5 #1
RIP: 0010:[<ffffffff80064cb2>]  [<ffffffff80064cb2>] .text.lock.spinlock+0x0/0x30
RSP: 0018:ffff81006eea7bb0  EFLAGS: 00000286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff810156273380
RDX: 0000000000000000 RSI: 000000000000005c RDI: ffff8101562733c0
RBP: 0000000000000202 R08: ffff81006eea7a38 R09: 0000000000000000
R10: ffff81006ee3dbc0 R11: 0000000000000000 R12: ffff810156273380
R13: 0000100000000011 R14: 0000000400000000 R15: 0000000000000000
FS:  00002b22b1f92a60(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b22b40d8fc0 CR3: 000000015a345000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff800308a8>] release_sock+0x6b/0xaa
 [<ffffffff80052155>] udp_sendmsg+0x4de/0x5ce
 [<ffffffff80054af1>] sock_sendmsg+0xf3/0x110
 [<ffffffff8009db21>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80214ffa>] sys_sendto+0x11c/0x14f
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Expected results:

The system runs stably, just like with 2.6.18-92.1.22.el5.

Additional info:

I tested with the 2.6.18-130.el5 test kernel. It has no fix for this problem, although it may have taken somewhat longer to lock up.

Comment 1 RHEL Program Management 2009-02-09 18:46:54 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 2 Tuomo Soini 2009-02-10 08:17:48 UTC
Reverting Patch22973: linux-2.6-net-udp-possible-recursive-locking.patch seems to fix the problem.

I got suspicious when reading that patch and found the following changes in it:

 
+               bh_unlock_sock(sk);
                ret = udp_encap_rcv(sk, skb);
+               bh_lock_sock(sk);
                if (ret == 0) {

and

                        /* process the ESP packet */
+                       bh_unlock_sock(sk);
                        ret = xfrm4_rcv_encap(skb, up->encap_type);
+                       bh_lock_sock(sk);
                        return -ret;

I'm quite sure one of these two cases can be reached on a code path where the lock was not previously taken. In that case the patch adds a lock to an unlocked code path without a matching unlock.
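
The concern in this comment can be illustrated with a small user-space model. This is only a sketch of the hypothesis, not kernel code: `toy_lock`, `toy_encap_path` and `toy_held_after_encap` are hypothetical names, and the boolean flag merely stands in for the socket lock that `bh_unlock_sock()` and `bh_lock_sock()` operate on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the socket lock: only records whether it is held. */
struct toy_lock {
    bool held;
};

/* Same shape as the patched code: release, do the work, acquire again.
 * Nothing here checks whether the caller actually held the lock. */
static void toy_encap_path(struct toy_lock *lock)
{
    assert(lock != NULL);
    lock->held = false;   /* stands in for bh_unlock_sock(sk) */
    /* ... the encapsulated packet would be handled here ... */
    lock->held = true;    /* stands in for bh_lock_sock(sk) */
}

/* Returns whether the lock is held on exit, given its state on entry. */
bool toy_held_after_encap(bool held_on_entry)
{
    struct toy_lock lock = { .held = held_on_entry };

    toy_encap_path(&lock);
    return lock.held;
}
```

In this model a caller that entered with the lock held leaves with it held, which is balanced. A caller that entered without the lock also leaves with it held, so the next attempt to take that lock would spin, which is at least consistent with the soft lockup in the spinlock code shown in the description.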

Comment 3 Tuomo Soini 2009-02-12 19:04:12 UTC
Still unfixed in 2.6.18-128.1.1.el5.

And now it's fully verified: reverting linux-2.6-net-udp-possible-recursive-locking.patch fixes this problem. A system which didn't survive 4 hours with the patch has now been up 44 hours without it.

Comment 4 RHEL Program Management 2009-02-16 15:41:34 UTC
Updating PM score.

Comment 5 Tuomo Soini 2009-02-23 08:11:43 UTC
Uptime on the vpn-server used for debugging this problem is now 12 days, so removing that patch really does stabilize the system.

Comment 6 Herbert Xu 2009-02-24 10:08:50 UTC
Created attachment 333018 [details]
udp: Fix rcv socket locking

My patch was cherry-picked without its follow-up fix.  We need to apply this patch on top of it.

commit 93821778def10ec1e69aa3ac10adee975dad4ff3
Author: Herbert Xu <herbert.org.au>
Date:   Mon Sep 15 11:48:46 2008 -0700

    udp: Fix rcv socket locking
    
    The previous patch in response to the recursive locking on IPsec
    reception is broken as it tries to drop the BH socket lock while in
    user context.
    
    This patch fixes it by shrinking the section protected by the
    socket lock to sock_queue_rcv_skb only.  The only reason we added
    the lock is for the accounting which happens in that function.
    
    Signed-off-by: Herbert Xu <herbert.org.au>
    Signed-off-by: David S. Miller <davem>
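
As a rough illustration of the idea in the commit message above (not the actual patch), the following user-space sketch takes the lock only around the queueing step and runs the encapsulation handler with no lock held. All names (`toy_sock`, `toy_udp_rcv`, and so on) are hypothetical, and the integer `lock_depth` is only a stand-in for the real socket lock.

```c
#include <assert.h>

/* Hypothetical model of a socket, for illustration only. */
struct toy_sock {
    int lock_depth;     /* 0 = unlocked, 1 = locked */
    int queued;         /* packets accounted for on the receive queue */
    int locked_encap;   /* times the encap handler ran under the lock */
};

static void toy_lock(struct toy_sock *sk)
{
    assert(sk->lock_depth == 0);   /* would spin forever if already held */
    sk->lock_depth = 1;
}

static void toy_unlock(struct toy_sock *sk)
{
    assert(sk->lock_depth == 1);   /* unlocking an unheld lock is a bug */
    sk->lock_depth = 0;
}

/* Stand-in for the encapsulation (IPsec) receive handler. */
static void toy_encap_rcv(struct toy_sock *sk)
{
    if (sk->lock_depth)
        sk->locked_encap++;
}

/* Stand-in for sock_queue_rcv_skb: the accounting needs the lock. */
static void toy_queue_skb(struct toy_sock *sk)
{
    assert(sk->lock_depth == 1);
    sk->queued++;
}

/* Receive one packet; is_encap selects the encapsulation path. */
void toy_udp_rcv(struct toy_sock *sk, int is_encap)
{
    if (is_encap) {
        toy_encap_rcv(sk);   /* no socket lock held here */
        return;
    }
    toy_lock(sk);            /* lock covers only the queueing step */
    toy_queue_skb(sk);
    toy_unlock(sk);
}
```

Because the lock and unlock sit next to each other in one function, every path through `toy_udp_rcv` leaves the lock in the state it found it, and there is no unlock/relock pair that depends on what the caller held.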

Comment 7 Herbert Xu 2009-02-24 10:09:52 UTC
Neil, could you please take a look at this? Thanks!

Comment 8 Neil Horman 2009-02-24 11:37:47 UTC
Yeah, got it covered.  Thanks!

Comment 11 Neil Horman 2009-02-25 14:42:11 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-132.el5.bz484590.x86_64.rpm

Test kernel available.  Please verify this fixes the problem if you would.  Thanks!

Comment 12 Tuomo Soini 2009-02-25 17:51:33 UTC
Can you provide the src.rpm or just the patch to test? I don't want to run a development kernel on a production box.

Comment 13 Neil Horman 2009-02-25 17:56:06 UTC
Created attachment 333193 [details]
backport of herberts patch

Sure, here you go.  Although keep in mind, what I gave you was the 5.3 kernel + this patch.

Comment 14 Tuomo Soini 2009-02-25 20:55:49 UTC
Oh, I got confused by the -132 release number. I'm now booting the gateway machine into a kernel with the test patch, and I'll make sure minicom logs the serial console in case something happens. I'll give feedback tomorrow once I know whether it helped.

Comment 15 Neil Horman 2009-02-25 21:06:10 UTC
ok, please let me know when you have results.  thanks.

Comment 17 Tuomo Soini 2009-02-26 07:01:02 UTC
Machine lasted overnight, up 9:04 now. It seems to be fixed, but the time is too short to be 100% sure.

Comment 18 Neil Horman 2009-02-26 11:40:10 UTC
Ok, let it run the rest of the day. If it stays up, we'll call it fixed and move forward.  Thanks!

Comment 19 Tuomo Soini 2009-02-26 18:02:56 UTC
Now up 20:08, I'd call it fixed.

Comment 20 RHEL Program Management 2009-02-26 21:42:05 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 21 Don Zickus 2009-03-04 20:01:57 UTC
in kernel-2.6.18-133.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 23 Bjoern Engels 2009-03-25 11:03:57 UTC
I'm seeing exactly the same problem on two of our VPN gateways.
Don, do your later test kernels ( > kernel-2.6.18-133.el5) from http://people.redhat.com/dzickus/el5/ include the fix, too?

Comment 24 Tuomo Soini 2009-04-01 19:17:20 UTC
This patch was missing from 2.6.18-128.1.6.el5. What's the problem?

Comment 25 Bjoern Engels 2009-04-02 07:29:15 UTC
With kernel-2.6.18-133.el5 our VPN gateways have an uptime of 7 days now, after crashing in about half a day with kernel-2.6.18-128.el5.
Please fix this in your official kernels. At the moment our only choice is between a crashing kernel and one that doesn't include patches for several CVEs.

Comment 26 Tuomo Soini 2009-04-04 06:11:01 UTC
Created attachment 338145 [details]
kernel crash on heavily loaded system with 2.6.18-128.1.6.el5 with proposed fix applied

I tried the fix patch on heavily loaded firewall machines with lots of network load, and the system crashed very fast, in much less than an hour.

It is of course possible that the cause was another regression in 2.6.18-128.1.6.el5, but I don't think so. 2.6.18-128.1.1.el5 with the original broken patch removed is very stable.

Comment 28 Evan McNabb 2009-04-13 15:48:26 UTC
Tuomo,

Can you please verify this against kernel-2.6.18-133.el5 or later? We have a report in comment #25 that this did fix the problem.

Also, this has been proposed for an errata kernel, but in order to get full traction please file an official support ticket through your Red Hat support channel.

Comment 29 Tuomo Soini 2009-04-13 17:19:47 UTC
I request that you read comment #26 again, and check the kernel oops with the proposed patch.

Comment 31 Evan McNabb 2009-04-14 13:29:10 UTC
Tuomo, I did see comment #26. However, I figured it might be good to verify on our build since we had another report of it being fixed.

Setting this back to ASSIGNED for engineering to look at oopses.

Comment 32 Tuomo Soini 2009-04-14 13:45:34 UTC
On a lightly loaded system the fix patch works for more than ten days. But on a heavily loaded system the crash happens within seconds.

Comment 33 Neil Horman 2009-04-14 14:24:47 UTC
Um, I'm sorry, but I'll need this to be reproduced on our kernel to do anything at all about it.  I don't know what other changes are in this custom kernel, so I have no idea what else might be going on, nor do I have access to the changes to build a version of the kernel with symbols to debug.  If it's reproduced on our kernel, then I can run crash on the vmcore with our debuginfo package (assuming a vmcore can be sent in) and figure out what's going wrong.

I'm setting this back to ON_QA until it's reproduced on our kernel.

Comment 34 Tuomo Soini 2009-04-14 15:30:53 UTC
Created attachment 339514 [details]
Diff from 2.6.18-128.1.6.el5 to kernel crashed on comment #26

I hope this diff gives you an idea of our changes...

Comment 35 Tuomo Soini 2009-04-14 15:36:03 UTC
Oh, and I'll say it again: comment #3 still applies. Reverting that change fixes it.

Comment 36 Tuomo Soini 2009-04-14 15:50:23 UTC
Has anybody ever hit the recursive locking issue that the broken patch was originally meant to fix?

Comment 37 Tuomo Soini 2009-04-15 10:37:09 UTC
Created attachment 339664 [details]
Possible fix for remaining locking issue

I checked the original backport and noticed the following comment:

* Note of backporting
Since the implementation of udp_queue_rcv_skb() in RHEL 5 differs
from upstream, I added a lock to the function to close the gap.

And there was indeed one locking statement added by the original backport which didn't get removed when the kernel.org fix patch was backported. This is no wonder, because the upstream patch never had that lock statement, so the upstream fix had no need to remove it.

Neil: Could this fix the issue?

Comment 38 Tuomo Soini 2009-04-16 06:52:01 UTC
I have started running a kernel with my extra patch from comment #37 applied on top of the previous patches. I'll report later whether testing was successful. From the location of the lock, I guess the crash was caused by a NATed vpn client connecting.

Comment 40 Tuomo Soini 2009-04-17 17:55:56 UTC
The kernel with my extra fix seems to be quite stable. No problems in 35 hours. But the patch hasn't yet been tested with NATed ipsec vpn road warriors.

Comment 41 Tuomo Soini 2009-04-21 09:32:13 UTC
My patch from comment #37, applied over Neil's backport, fixes this issue with IPsec NAT-Traversal road warriors. Without my patch, IPsec NAT-Traversal causes the crash shown in comment #26.

Comment 42 Tuomo Soini 2009-04-27 06:43:53 UTC
I now have 11 days uptime on my test domU and 5 days on a real production vpn-gw. With the extra patch from comment #37 applied, this is finally stable. Please apply.

Comment 43 Tuomo Soini 2009-04-27 18:26:05 UTC
2.6.18-141.el5 is still missing the important fix from comment #37.

Comment 45 Tuomo Soini 2009-04-28 05:25:15 UTC
Why was the status set to POST? The bug fixed by comment 37 was local to the rhel kernel only; it was caused by the backport.

Comment 46 Jiri Pirko 2009-04-28 07:25:39 UTC
(In reply to comment #45)
> Why was the status set to POST? The bug fixed by comment 37 was local to the
> rhel kernel only; it was caused by the backport.

FYI, POST status means the patch was posted to an internal mailing list for review.

Comment 47 Tuomo Soini 2009-04-28 07:36:58 UTC
Excellent. Thank you. I was just a little confused by the bugzilla help text about POST status :-)

Comment 48 Yuri Arabadji 2009-05-05 09:02:14 UTC
5 days uptime with Tuomo's patch with pretty heavy traffic - no oopses so far.

Comment 49 Don Zickus 2009-05-06 17:16:38 UTC
in kernel-2.6.18-144.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 52 Bill McGonigle 2009-05-13 20:58:28 UTC
I think I'm hitting this too - Don, I see -146 in your directory is new since your last comment - is that the best one to test now?

Comment 53 Don Zickus 2009-05-13 21:08:48 UTC
Bill, yes the builds are cumulative, so -146.el5 includes everything in -144.el5 plus additional patches.

Comment 59 Dennis Jacobfeuerborn 2009-07-10 16:12:24 UTC
When will this fix become part of an officially released kernel?

Comment 60 Jiri Pirko 2009-07-10 16:30:24 UTC
(In reply to comment #59)
> When will this fix become part of an officially released kernel?  
It already is, since kernel-2.6.18-128.1.8.el5.

Comment 61 Dennis Jacobfeuerborn 2009-07-10 17:09:38 UTC
But wasn't the patch from comment 37 applied only to kernel-2.6.18-144.el5 and wouldn't that suggest that it's not in 128.1.8? Or am I interpreting the kernel release numbers wrong?

Comment 62 Jiri Pirko 2009-07-10 17:16:48 UTC
(In reply to comment #61)
> But wasn't the patch from comment 37 applied only to kernel-2.6.18-144.el5 and
> wouldn't that suggest that it's not in 128.1.8? Or am I interpreting the kernel
> release numbers wrong?  
Right, that one went into kernel-2.6.18-128.1.9.el5. Both patches for this are in both the 5.4 and 5.3.z kernel trees.
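
The release numbering question can be made concrete with a small sketch. It assumes, based only on this thread, that releases of the form 128.x.y belong to the 5.3.z branch and that plain numbers such as 144 belong to the main (5.4) tree, and it uses the two builds named here for the comment #37 patch (128.1.9 and 144). The function name `has_comment37_fix` is hypothetical, and this is not how RPM itself compares versions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if the given release (e.g. "128.1.9" or "144") is at or past
 * the build this thread names for the comment #37 patch on its own branch,
 * and 0 otherwise. The branch rule is an assumption drawn from the thread. */
int has_comment37_fix(const char *release)
{
    int r[3] = {0, 0, 0};
    int fields;

    assert(release != NULL);
    fields = sscanf(release, "%d.%d.%d", &r[0], &r[1], &r[2]);
    if (fields < 1)
        return 0;   /* not a numeric release string */

    if (r[0] == 128 && fields > 1) {
        /* 5.3.z branch: compare against 128.1.9 */
        return r[1] > 1 || (r[1] == 1 && r[2] >= 9);
    }

    /* main tree: compare against 144 */
    return r[0] >= 144;
}
```

The point of the branch check is that 128.1.9 and 144 are not ordered against each other: a z-stream build can carry a fix even though its leading number is lower than that of the main-tree build where the fix first appeared.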

Comment 63 Evan McNabb 2009-07-10 17:17:49 UTC
http://rhn.redhat.com/errata/RHSA-2009-0473.html

Comment 64 Priscila 2009-07-15 14:37:46 UTC
The problem still occurs with kernel 2.6.18-128.1.16.el5 x86_64.

Is there a fix?

Comment 66 Tuomo Soini 2009-07-16 10:13:22 UTC
I have never seen it with kernel 2.6.18-128.1.10.el5 or later. Are you absolutely sure about the kernel version in use?

Comment 68 errata-xmlrpc 2009-09-02 08:10:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html