Bug 484590
Summary: | Running Openswan ipsec vpn server with rhel-5.3 kernel-2.6.18-128.el5 causes crash | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Tuomo Soini <tis> |
Component: | kernel | Assignee: | Neil Horman <nhorman> |
Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.3 | CC: | alain.richard, anton, bill-bugzilla.redhat.com, bj+bugzilla, cfalconer, davem, dennisml, dhoward, dzickus, emcnabb, jpirko, jplans, nhorman, pasteur, pjjw, pveiga, pwouters, rdassen, tgraf, yuri |
Target Milestone: | rc | Keywords: | Regression, ZStream |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2009-09-02 08:10:21 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 496044 | ||
Attachments: |
Description
Tuomo Soini
2009-02-08 20:09:39 UTC
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

Reverting Patch22973: linux-2.6-net-udp-possible-recursive-locking.patch seems to fix the problem. I got suspicious when I was reading that patch and found the following changes in it:

    + bh_unlock_sock(sk);
      ret = udp_encap_rcv(sk, skb);
    + bh_lock_sock(sk);
      if (ret == 0) {

and

      /* process the ESP packet */
    + bh_unlock_sock(sk);
      ret = xfrm4_rcv_encap(skb, up->encap_type);
    + bh_lock_sock(sk);
      return -ret;

I'm quite sure one of these two cases can be reached on a path where the lock was not previously held. That adds a lock to an unlocked code path with no matching unlock.

Still unfixed in 2.6.18-128.1.1.el5.

And now it's fully verified: reverting linux-2.6-net-udp-possible-recursive-locking.patch fixes this problem. A system which didn't survive 4 hours with the patch has now been up 44 hours without it.

Updating PM score.

Uptime on the VPN server used for debugging this problem is now 12 days, so removing that patch really stabilizes the system.

Created attachment 333018 [details]
udp: Fix rcv socket locking
My patch was cherry-picked without its follow-up fix. We need to apply this patch on top of it.
commit 93821778def10ec1e69aa3ac10adee975dad4ff3
Author: Herbert Xu <herbert.org.au>
Date: Mon Sep 15 11:48:46 2008 -0700
udp: Fix rcv socket locking
The previous patch in response to the recursive locking on IPsec
reception is broken as it tries to drop the BH socket lock while in
user context.
This patch fixes it by shrinking the section protected by the
socket lock to sock_queue_rcv_skb only. The only reason we added
the lock is for the accounting which happens in that function.
Signed-off-by: Herbert Xu <herbert.org.au>
Signed-off-by: David S. Miller <davem>
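The difference between the two patches can be sketched in user space. This is a toy model only, not kernel code: `struct toy_sock`, `toy_lock()`, `toy_unlock()`, `encap_rcv_first_patch()` and `queue_rcv_follow_up()` are hypothetical names invented for illustration, and the "lock" just records misuse rather than spinning. It assumes the first patch's unlock/relock pair can be reached without the lock held, as the commit message describes for user context.

```c
#include <stdbool.h>

/* Toy stand-in for a socket lock (hypothetical, not kernel code).
 * It records misuse instead of crashing or deadlocking. */
struct toy_sock {
    bool locked;       /* is the lock currently held? */
    int  bad_unlocks;  /* unlocks attempted while the lock was not held */
    int  queued;       /* packets accounted by the queueing step */
};

static void toy_lock(struct toy_sock *sk)
{
    sk->locked = true;
}

static void toy_unlock(struct toy_sock *sk)
{
    if (!sk->locked)
        sk->bad_unlocks++;
    sk->locked = false;
}

/* Shape of the cherry-picked hunk: release and retake the lock around
 * the encapsulation handler, assuming the caller already holds it. */
static void encap_rcv_first_patch(struct toy_sock *sk)
{
    toy_unlock(sk);
    /* ESP processing would run here, outside the lock */
    toy_lock(sk);
}

/* Shape described by the follow-up commit: only the queueing step,
 * where the accounting happens, runs under the lock. */
static void queue_rcv_follow_up(struct toy_sock *sk)
{
    toy_lock(sk);
    sk->queued++;
    toy_unlock(sk);
}
```

Starting from an unlocked `toy_sock`, `encap_rcv_first_patch()` records one bad unlock and returns with the lock held, while `queue_rcv_follow_up()` takes and releases the lock itself and returns with it released.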
Neil, could you please take a look at this? Thanks!

Yeah, got it covered. Thanks!

http://people.redhat.com/nhorman/rpms/kernel-2.6.18-132.el5.bz484590.x86_64.rpm
Test kernel available. Please verify that this fixes the problem, if you would. Thanks!

Can you provide the src.rpm or just the patch to test? I don't want to run a development kernel on a production box.

Created attachment 333193 [details]
backport of herberts patch
Sure, here you go. Although keep in mind, what I gave you was the 5.3 kernel + this patch.
Oh. I got confused because of release -132. I'm booting the gateway machine into a kernel with the test patch now, and I'll make sure minicom logs the serial console in case something happens. I'll give feedback tomorrow when I know whether it helped.

OK, please let me know when you have results. Thanks.

Machine lasted overnight, up 9:04 now. It seems to be fixed, but the time is too short to be 100% sure.

OK, let it run the rest of the day. If it stays up, we'll call it fixed and move forward. Thanks!

Now up 20:08, I'd call it fixed.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

in kernel-2.6.18-133.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

I'm seeing exactly the same problem on two of our VPN gateways. Don, do your later test kernels (> kernel-2.6.18-133.el5) from http://people.redhat.com/dzickus/el5/ include the fix, too?

This patch was missing from 2.6.18-128.1.6.el5. What's the problem?

With kernel-2.6.18-133.el5 our VPN gateways have an uptime of 7 days now, after crashing in about half a day with kernel-2.6.18-128.el5. Please fix this in your official kernels. At the moment we have the choice of running a crashing kernel or one that doesn't include patches for several CVEs.

Created attachment 338145 [details]
kernel crash on heavily loaded system with 2.6.18-128.1.6.el5 with proposed fix applied
I tried the fix patch on heavily loaded firewall machines with lots of network load, and the system crashed very fast, in much less than an hour.
It is of course possible that the cause was another regression in 2.6.18-128.1.6.el5, but I don't think so. 2.6.18-128.1.1.el5 with the original broken patch removed is very stable.
Tuomo, can you please verify this against kernel-2.6.18-133.el5 or later? We have a report in comment #25 that this did fix the problem. Also, this has been proposed for an errata kernel, but in order to get full traction please file an official support ticket through your Red Hat support channel.

I request that you read comment #26 again, and check the kernel oops with the proposed patch.

Tuomo, I did see comment #26. However, I figured it might be good to verify on our build since we had another report of it being fixed.

Setting this back to ASSIGNED for engineering to look at the oopses.

On a lightly loaded system the fix patch works for more than ten days. But on a heavily loaded system the crash happens within seconds.

Um, I'm sorry, but I'll need this to be reproduced on our kernel to do anything at all about it. I don't know what other changes are in this custom kernel, so I have no idea what else might be going on, nor do I have access to the changes to build a symboled version of the kernel to debug. If it's reproduced on our kernel, then I can run crash on the vmcore with our debuginfo package (assuming a vmcore can be sent in) and figure out what's going wrong. I'm setting this back to ON_QA until it's reproduced on our kernel.

Created attachment 339514 [details]
Diff from 2.6.18-128.1.6.el5 to kernel crashed on comment #26

Hope this diff will give you an idea about our changes...

Oh, and I'll say it again: comment #3 still applies. Reverting that change fixes it. Has anybody ever hit the recursive locking issue which was initially fixed with the broken patch?

Created attachment 339664 [details]
Possible fix for remaining locking issue
I checked the original backport and noticed the following comment:
* Note of backporting
Since the implementation of udp_queue_rcv_skb() in RHEL 5 differs
from upstream, I added a lock to the function to close the gap.
And there was indeed one locking statement, added by the original backport of the patch, which didn't get removed when the kernel.org fix patch was backported. This is no wonder, because the original upstream patch didn't have that lock statement, so the upstream fix had no need to remove it.
Neil: Could this fix the issue?
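A leftover lock statement of this kind can be illustrated with a small trace checker. This is a sketch under stated assumptions: `lock_op`, `trace_result` and `check_lock_trace()` are hypothetical names, the lock is modeled as non-recursive, and the traces in the usage note are invented, since the report does not say where the stray statement sits relative to the other lock calls.

```c
#include <stddef.h>

/* One lock or unlock call in a hypothetical code path. */
enum lock_op { OP_LOCK, OP_UNLOCK };

enum trace_result {
    TRACE_OK,             /* every lock is matched by an unlock */
    TRACE_UNLOCK_UNHELD,  /* unlock while the lock was not held */
    TRACE_LOCK_HELD,      /* lock while already held (non-recursive lock) */
    TRACE_LEFT_HELD       /* path ended with the lock still held */
};

/* Walk a sequence of lock operations and report the first imbalance. */
static enum trace_result check_lock_trace(const enum lock_op *ops, size_t n)
{
    int held = 0;

    for (size_t i = 0; i < n; i++) {
        if (ops[i] == OP_LOCK) {
            if (held)
                return TRACE_LOCK_HELD;
            held = 1;
        } else {
            if (!held)
                return TRACE_UNLOCK_UNHELD;
            held = 0;
        }
    }
    return held ? TRACE_LEFT_HELD : TRACE_OK;
}
```

A path that only locks and unlocks around the queueing step, `{ OP_LOCK, OP_UNLOCK }`, checks out as `TRACE_OK`. Putting one extra `OP_LOCK` in front of it, as a leftover statement might, gives `TRACE_LOCK_HELD`.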
I started running a kernel with my extra patch from comment #37 applied over the previous patches. I'll report later whether testing was successful. From the lock location I guess the crash was caused by a NATed VPN client connecting.

The kernel with my extra fix seems to be quite stable. No problems in 35 hours. But the patch hasn't yet been tested with NATed IPsec VPN road warriors.

My patch from comment #37, applied over Neil's backport, fixes this issue with IPsec NAT-Traversal road warriors. Without my patch, IPsec NAT-Traversal causes the crash shown in comment #26.

I now have 11 days uptime on my test domU and 5 days on a real production vpn-gw. With the extra patch from comment #37 applied, this is finally stable. Please apply.

2.6.18-141.el5 is still missing the important fix from comment #37.

Why was status set to POST? The bug fixed by comment 37 was local to the RHEL kernel only. It was caused by the backport.

(In reply to comment #45)
> Why was status set to POST? Bug fixed by comment 37 was local to rhel kernel
> only. Bug fixed by it was local caused by backport.
FYI, POST status means the patch was posted to the internal mailing list for review.

Excellent. Thank you. I was just a little confused by the Bugzilla help about POST status :-)

5 days uptime with Tuomo's patch with pretty heavy traffic - no oopses so far.

in kernel-2.6.18-144.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

I think I'm hitting this too. Don, I see -146 in your directory is new since your last comment. Is that the best one to test now?

Bill, yes, the builds are cumulative, so -146.el5 includes everything in -144.el5 plus additional patches.

When will this fix become part of an officially released kernel?
(In reply to comment #59)
> When will this fix become part of an officially released kernel?
It already is, since kernel-2.6.18-128.1.8.el5.

But wasn't the patch from comment 37 applied only to kernel-2.6.18-144.el5, and wouldn't that suggest that it's not in 128.1.8? Or am I interpreting the kernel release numbers wrong?

(In reply to comment #61)
> But wasn't the patch from comment 37 applied only to kernel-2.6.18-144.el5 and
> wouldn't that suggest that it's not in 128.1.8? Or am I interpreting the kernel
> release numbers wrong?
Right, that one went into kernel-2.6.18-128.1.9.el5. Both patches for this are in both the 5.4 and 5.3.z kernel trees.

Using kernel 2.6.18-128.1.16.el5 x86_64, the problem still occurs. Is there a correction for this?

Never seen it with kernel 2.6.18-128.1.10.el5+. Are you absolutely sure about the version of the kernel used?

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html