Description of problem:
There is a regression in the RHEL 5.3 kernel when running Openswan for IPsec VPNs. The system locks up after 2 hours to 2 days, depending on the number of IPsec tunnels. This lockup happens on every system I tested with this kernel version.
Version-Release number of selected component (if applicable):
kernel-2.6.18-128.el5
How reproducible:
Always.
Steps to Reproduce:
1. Start the ipsec service
2. Wait some hours
3. Crash
Actual results:
BUG: soft lockup - CPU#0 stuck for 10s! [pluto:6996]
CPU 0:
Modules linked in: krng ansi_cprng chainiv rng authenc hmac cryptomgr deflate zlib_deflate ccm serpent blowfish twofish ecb xcbc crypto_hash cbc md5 sha256 sha512 des aes_generic testmgr_cipher testmgr crypto_blkcipher aes_x86_64 ah6 ah4 esp6 xfrm6_esp esp4 xfrm4_esp aead crypto_algapi xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_tunnel ipcomp ipcomp6 xfrm6_tunnel tunnel6 af_key softdog autofs4 w83627hf lm85 hwmon_vid i2c_isa hidp l2cap bluetooth cls_fw sch_sfq act_police cls_u32 sch_ingress sch_htb ip6table_filter ip6_tables xt_realm iptable_raw xt_comment xt_policy ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_TCPMSS ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_iprange ipt_hashlimit ipt_ECN ipt_ecn ipt_DSCP ipt_dscp ipt_CLUSTERIP ipt_ah ipt_addrtype ip_nat_tftp ip_nat_snmp_basic ip_nat_sip ip_nat_pptp ip_nat_irc ip_nat_h323 ip_nat_ftp ip_nat_amanda ip_conntrack_tftp ip_conntrack_sip ip_conntrack_pptp ip_conntrack_netbios_ns ip_conntrack_irc ip_conntrack_h323 ip_conntrack_ftp ts_kmp ip_conntrack_amanda xt_tcpmss xt_pkttype xt_physdev bridge xt_NFQUEUE xt_multiport xt_MARK xt_mark xt_mac xt_limit xt_length xt_helper xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY xt_tcpudp xt_state iptable_nat ip_nat ip_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables sunrpc ipv6 xfrm_nalgo crypto_api 8021q bonding ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi cpufreq_ondemand powernow_k8 freq_table dm_mirror dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac lp i2c_amd8111 i2c_amd756 i2c_core amd_rng k8temp tg3 hwmon serio_raw ide_cd libphy usblp parport_pc parport e100 mii k8_edac edac_mc cdrom sg pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache shpchp sata_sil libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 6996, comm: pluto Not tainted 2.6.18-128.el5 #1
RIP: 0010:[<ffffffff80064cb2>] [<ffffffff80064cb2>] .text.lock.spinlock+0x0/0x30
RSP: 0018:ffff81006eea7bb0 EFLAGS: 00000286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff810156273380
RDX: 0000000000000000 RSI: 000000000000005c RDI: ffff8101562733c0
RBP: 0000000000000202 R08: ffff81006eea7a38 R09: 0000000000000000
R10: ffff81006ee3dbc0 R11: 0000000000000000 R12: ffff810156273380
R13: 0000100000000011 R14: 0000000400000000 R15: 0000000000000000
FS: 00002b22b1f92a60(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b22b40d8fc0 CR3: 000000015a345000 CR4: 00000000000006e0
Call Trace:
 [<ffffffff800308a8>] release_sock+0x6b/0xaa
 [<ffffffff80052155>] udp_sendmsg+0x4de/0x5ce
 [<ffffffff80054af1>] sock_sendmsg+0xf3/0x110
 [<ffffffff8009db21>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80214ffa>] sys_sendto+0x11c/0x14f
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Expected results:
System working stably, just like with 2.6.18-92.1.22.el5.
Additional info:
I tested with the 2.6.18-130.el5 test kernel. It did not fix this problem, although it may have taken somewhat longer to reach the lockup.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Reverting Patch22973: linux-2.6-net-udp-possible-recursive-locking.patch seems to fix the problem. I got suspicious when reading that patch and found the following changes in it:
+ bh_unlock_sock(sk);
  ret = udp_encap_rcv(sk, skb);
+ bh_lock_sock(sk);
  if (ret == 0) {
and
  /* process the ESP packet */
+ bh_unlock_sock(sk);
  ret = xfrm4_rcv_encap(skb, up->encap_type);
+ bh_lock_sock(sk);
  return -ret;
I'm quite sure one of these two cases can be reached on a code path where the lock was not previously held. In that case the patch takes a lock on a previously unlocked code path without a matching unlock.
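To make the concern above concrete, here is a minimal user-space sketch in C. It is not kernel code: struct toy_lock, toy_lock_acquire(), toy_lock_release(), patched_receive() and lock_state_preserved() are hypothetical names invented for illustration, and the "lock" is just a flag so that its state can be inspected. It models only the unlock / call / re-lock shape of the hunks quoted above.

```c
#include <assert.h>   /* only needed for the checks suggested in the note below */
#include <stdbool.h>

/* Hypothetical stand-in for the socket lock: tracks held / not held. */
struct toy_lock { bool held; };

static void toy_lock_acquire(struct toy_lock *l) { l->held = true; }
static void toy_lock_release(struct toy_lock *l) { l->held = false; }

/* Same shape as the patched hunks: unlock, call out, lock again. */
static int patched_receive(struct toy_lock *l)
{
    int ret;

    toy_lock_release(l);   /* models bh_unlock_sock(sk) */
    ret = 0;               /* placeholder for the real call, e.g. udp_encap_rcv() */
    toy_lock_acquire(l);   /* models bh_lock_sock(sk) */
    return ret;
}

/* True if the lock is in the same state on return as it was on entry. */
static bool lock_state_preserved(bool held_on_entry)
{
    struct toy_lock l = { .held = held_on_entry };

    patched_receive(&l);
    return l.held == held_on_entry;
}
```

In this model, lock_state_preserved(true) holds but lock_state_preserved(false) does not: entered without the lock, the function returns with it held. With a real non-recursive spinlock, the next attempt to take that lock would spin, which would be consistent with the soft lockup in the report, although that link is an inference and not something the sketch proves.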
Still unfixed in 2.6.18-128.1.1.el5. And now it's fully verified: reverting linux-2.6-net-udp-possible-recursive-locking.patch fixes this problem. A system which didn't survive 4 hours with the patch has now been up 44 hours without it.
Updating PM score.
Uptime on the VPN server used for debugging this problem is now 12 days, so removing that patch really does stabilize the system.
Created attachment 333018 [details] udp: Fix rcv socket locking
My patch was cherry-picked without its follow-up fix. We need to apply this patch on top of it.
commit 93821778def10ec1e69aa3ac10adee975dad4ff3
Author: Herbert Xu <herbert.org.au>
Date: Mon Sep 15 11:48:46 2008 -0700
udp: Fix rcv socket locking
The previous patch in response to the recursive locking on IPsec reception is broken as it tries to drop the BH socket lock while in user context. This patch fixes it by shrinking the section protected by the socket lock to sock_queue_rcv_skb only. The only reason we added the lock is for the accounting which happens in that function.
Signed-off-by: Herbert Xu <herbert.org.au>
Signed-off-by: David S. Miller <davem>
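The idea in the commit message, holding the lock only around the queueing step, can be sketched in user space as follows. This is an illustration under stated assumptions, not the attached patch: struct rx_model and the rx_* functions are hypothetical, the lock is a plain flag, and the two "steps" only record whether the flag was set when they ran.

```c
#include <assert.h>   /* only needed for the checks suggested in the note below */
#include <stdbool.h>

/* Hypothetical model of a socket: a lock flag plus what was observed. */
struct rx_model {
    bool lock_held;          /* stand-in for the BH socket lock */
    bool held_during_encap;  /* was the lock held while decapsulating? */
    bool held_during_queue;  /* was the lock held while queueing? */
    int  queued;             /* packets accounted for */
};

/* Placeholder for the encapsulation/IPsec step; records the lock state. */
static void rx_encap_step(struct rx_model *sk)
{
    sk->held_during_encap = sk->lock_held;
}

/* Placeholder for sock_queue_rcv_skb(); records the lock state. */
static void rx_queue_step(struct rx_model *sk)
{
    sk->held_during_queue = sk->lock_held;
    sk->queued++;
}

/* Receive path with the locked section shrunk to the queueing step. */
static void rx_receive_narrow(struct rx_model *sk)
{
    rx_encap_step(sk);       /* runs without the lock */
    sk->lock_held = true;    /* lock ... */
    rx_queue_step(sk);       /* ... only around the accounting ... */
    sk->lock_held = false;   /* ... and unlock in the same function */
}
```

After one call to rx_receive_narrow() on a zero-initialized struct rx_model, held_during_encap is false, held_during_queue is true, and lock_held is false again. Because the lock and unlock are paired inside one function, the result no longer depends on what the caller held on entry.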
Neil, could you please take a look at this? Thanks!
Yeah, got it covered. Thanks!
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-132.el5.bz484590.x86_64.rpm Test kernel available. Please verify this fixes the problem if you would. Thanks!
Can you provide the src.rpm or just the patch to test? I don't want to run a development kernel on a production box.
Created attachment 333193 [details] backport of herberts patch Sure, here you go. Although keep in mind, what I gave you was the 5.3 kernel + this patch.
Oh. I got confused because of the -132 release number. I'm now booting the gateway machine into a kernel with the test patch, and I'll make sure minicom logs the serial console if something happens. I'll give feedback tomorrow when I know whether it helped or not.
ok, please let me know when you have results. thanks.
Machine lasted overnight, up 9:04 now. It seems to be fixed, but the time is too short to be 100% sure.
Ok, let it run the rest of the day; if it stays up, we'll call it fixed and move forward. Thanks!
Now up 20:08, I'd call it fixed.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-133.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
I'm seeing exactly the same problem on two of our VPN gateways. Don, do your later test kernels ( > kernel-2.6.18-133.el5) from http://people.redhat.com/dzickus/el5/ include the fix, too?
This patch was missing from 2.6.18-128.1.6.el5. What's the problem?
With kernel-2.6.18-133.el5 our VPN gateways have an uptime of 7 days now, after crashing in about half a day with kernel-2.6.18-128.el5. Please fix this in your official kernels - at the moment we have the choice to run a crashing kernel or one that doesn't include patches for several CVEs.
Created attachment 338145 [details] kernel crash on heavily loaded system with 2.6.18-128.1.6.el5 with proposed fix applied
I tried the fix patch on heavily loaded firewall machines with lots of network load, and the system crashed very fast, in much less than an hour. It is of course possible that the cause was another regression in 2.6.18-128.1.6.el5, but I don't think so. 2.6.18-128.1.1.el5 with the original broken patch removed is very stable.
Tuomo, can you please verify this against kernel-2.6.18-133.el5 or later? We have a report in comment #25 that this did fix the problem. Also, this has been proposed for an errata kernel, but in order to get full traction please file an official support ticket through your Red Hat support channel.
I request that you read comment #26 again and check the kernel oops with the proposed patch.
Tuomo, I did see comment #26. However, I figured it might be good to verify on our build since we had another report of it being fixed. Setting this back to ASSIGNED for engineering to look at oopses.
On a lightly loaded system the fix patch works for more than ten days. But on a heavily loaded system the crash happens within seconds.
Um, I'm sorry, but I'll need this to be reproduced on our kernel to do anything at all about it. I don't know what other changes are in this custom kernel, so I have no idea what else might be going on, nor do I have access to the changes needed to build a version of the kernel with symbols for debugging. If it's reproduced on our kernel, then I can run crash on the vmcore with our debuginfo package (assuming a vmcore can be sent in) and figure out what's going wrong. I'm setting this back to ON_QA until it's reproduced on our kernel.
Created attachment 339514 [details] Diff from 2.6.18-128.1.6.el5 to kernel crashed on comment #26 Hope this diff will give you idea about our changes...
Oh, and I'll say it again: comment #3 still applies. Reverting that change fixes it.
Has anybody ever actually hit the recursive locking issue that the initial, broken patch was meant to fix?
Created attachment 339664 [details] Possible fix for remaining locking issue
I checked the original backport and noticed that it contained the following comment:
* Note of backporting
Since the implementation of udp_queue_rcv_skb() in RHEL 5 differs from upstream, I added a lock to the function to close the gap.
And there was indeed one locking statement added by the original backport which didn't get removed when the kernel.org fix was backported. This is no wonder: the upstream patch never had that lock statement, so the upstream fix had no need to remove it.
Neil: Could this fix the issue?
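As a purely hypothetical illustration of why a single leftover lock statement can matter, here is a small user-space model in C. The attachment itself is not quoted in this thread, so this is not its content; struct nr_lock and the nr_* functions are invented names, and the assumption that the leftover statement sits in a caller of code that now takes the lock itself is mine.

```c
#include <assert.h>   /* only needed for the checks suggested in the note below */
#include <stdbool.h>

/* Hypothetical non-recursive lock, modelled as a flag. */
struct nr_lock { bool held; };

/* Returns false where a real non-recursive spinlock would spin forever. */
static bool nr_try_acquire(struct nr_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void nr_release(struct nr_lock *l) { l->held = false; }

/* Inner step that takes and drops the lock itself. */
static bool nr_queue_locked(struct nr_lock *l)
{
    if (!nr_try_acquire(l))
        return false;      /* would deadlock against ourselves */
    /* accounting would happen here */
    nr_release(l);
    return true;
}

/* Caller that still carries a lock statement from an earlier patch. */
static bool nr_receive_with_leftover(struct nr_lock *l)
{
    bool ok;

    (void)nr_try_acquire(l);   /* the leftover lock statement */
    ok = nr_queue_locked(l);   /* inner acquire now fails */
    nr_release(l);
    return ok;
}

/* Same caller once the leftover statement is removed. */
static bool nr_receive_clean(struct nr_lock *l)
{
    return nr_queue_locked(l);
}
```

Starting from an unheld lock, nr_receive_clean() returns true while nr_receive_with_leftover() returns false, the model's way of flagging a self-deadlock. A fix that only moves locking inward cannot help with a statement it does not know about, which is the situation described above.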
I've started running a kernel with my extra patch from comment #37 applied over the previous patches. I'll report later whether testing was successful. From the lock location, I guess the crash was caused by a NATed VPN client connecting.
The kernel with my extra fix seems to be quite stable. No problems in 35 hours. But the patch hasn't yet been tested with NATed IPsec VPN road warriors.
My patch from comment #37, applied over Neil's backport, fixes this issue with IPsec NAT-Traversal road warriors. Without my patch, IPsec NAT-Traversal causes the crash shown in comment #26.
I now have 11 days of uptime on my test domU and 5 days on a real production VPN gateway. With the extra patch from comment #37 applied, this is finally stable. Please apply.
2.6.18-141.el5 is still missing the important fix from comment #37.
Why was status set to POST? Bug fixed by comment 37 was local to rhel kernel only. Bug fixed by it was local caused by backport.
(In reply to comment #45)
> Why was status set to POST? Bug fixed by comment 37 was local to rhel kernel
> only. Bug fixed by it was local caused by backport.
FYI, POST status means the patch was posted to an internal mailing list for review.
Excellent. Thank you. I was just a little confused because of the Bugzilla help text about the POST status :-)
5 days uptime with Tuomo's patch with pretty heavy traffic - no oopses so far.
in kernel-2.6.18-144.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
I think I'm hitting this too - Don, I see -146 in your directory is new since your last comment - is that the best one to test now?
Bill, yes the builds are cumulative, so -146.el5 includes everything in -144.el5 plus additional patches.
When will this fix become part of an officially released kernel?
(In reply to comment #59)
> When will this fix become part of an officially released kernel?
It already is, since kernel-2.6.18-128.1.8.el5.
But wasn't the patch from comment 37 applied only to kernel-2.6.18-144.el5 and wouldn't that suggest that it's not in 128.1.8? Or am I interpreting the kernel release numbers wrong?
(In reply to comment #61)
> But wasn't the patch from comment 37 applied only to kernel-2.6.18-144.el5 and
> wouldn't that suggest that it's not in 128.1.8? Or am I interpreting the kernel
> release numbers wrong?
Right, that one went into kernel-2.6.18-128.1.9.el5. Both patches for this are in both the 5.4 and 5.3.z kernel trees.
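Since the release numbers caused some confusion here, the following C sketch splits a release string into its parts. It assumes, based only on the releases named in this thread, that update-stream (5.3.z) releases look like 128.1.9.el5 with three numeric fields, and that development builds look like 144.el5 with one. The names struct el5_release, parse_el5_release() and releases_comparable() are hypothetical, and the input is the release part only, without the kernel-2.6.18- prefix.

```c
#include <assert.h>   /* only needed for the checks suggested in the note below */
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical parsed form of a release such as "128.1.9.el5" or "144.el5". */
struct el5_release {
    int  base;        /* e.g. 128 or 144 */
    int  z_major;     /* e.g. 1 in 128.1.9; 0 if absent */
    int  z_minor;     /* e.g. 9 in 128.1.9; 0 if absent */
    bool is_zstream;  /* true when all three numeric fields are present */
};

static struct el5_release parse_el5_release(const char *rel)
{
    struct el5_release r = { 0, 0, 0, false };
    /* sscanf stops at the first field that is not a number, e.g. "el5". */
    int n = sscanf(rel, "%d.%d.%d", &r.base, &r.z_major, &r.z_minor);

    r.is_zstream = (n == 3);
    if (!r.is_zstream)
        r.z_major = r.z_minor = 0;
    return r;
}

/*
 * Conservative check: only compare two development builds with each other,
 * or two update releases that share the same base.
 */
static bool releases_comparable(struct el5_release a, struct el5_release b)
{
    if (!a.is_zstream && !b.is_zstream)
        return true;
    return a.is_zstream && b.is_zstream && a.base == b.base;
}
```

With these assumptions, 128.1.9.el5 and 144.el5 are reported as not comparable, which matches the answer above: a fix can be present in 128.1.9 even though 128 is smaller than 144, because the two numbers come from different trees.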
http://rhn.redhat.com/errata/RHSA-2009-0473.html
Using kernel 2.6.18-128.1.16.el5 x86_64, the problem still occurs. Is there a fix?
Never seen it with kernel 2.6.18-128.1.10.el5+. Are you absolutely sure about the version of the kernel used?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html