Bug 466303

Summary: IPSec kernel lockup.
Product: [Fedora] Fedora
Component: kernel
Version: 9
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Michael H. Warfield <mhw>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: bojan, kernel-maint
Fixed In Version: 2.6.26.6-79.fc9
Doc Type: Bug Fix
Last Closed: 2008-10-23 16:39:03 UTC

Description Michael H. Warfield 2008-10-09 16:53:55 UTC
Description of problem:

Running IPSec (OpenSWAN) on the 2.6.26.5-45 kernel, the system dies at random times.  No panic, no aiieee.  The network is dead and user space is dead.  It is possible to reboot the system using Alt-SysRq if that has been enabled.  The problem is NOT present in the 2.6.26.3-29 kernel.  The changelog for 2.6.26.6 describes a serious problem in the earlier fix for IPSec-related recursive locking, which dropped the bottom-half (BH) socket lock while in user context.

From the changelog:

commit fc69b36cd5d05d78c7aa34fd490e8f156be9e5f6
Author: Herbert Xu <herbert.org.au>
Date:   Mon Sep 15 11:48:46 2008 -0700

    udp: Fix rcv socket locking
    
    [ Upstream commit 93821778def10ec1e69aa3ac10adee975dad4ff3 ]
    
    The previous patch in response to the recursive locking on IPsec
    reception is broken as it tries to drop the BH socket lock while in
    user context.
    
    This patch fixes it by shrinking the section protected by the
    socket lock to sock_queue_rcv_skb only.  The only reason we added
    the lock is for the accounting which happens in that function.
    
    Signed-off-by: Herbert Xu <herbert.org.au>
    Signed-off-by: David S. Miller <davem>
    Signed-off-by: Greg Kroah-Hartman <gregkh>

It is uncertain whether this fixes the problem or whether some other problem is lurking in this new lock.
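
For readers unfamiliar with the hazard the commit message refers to, here is a small userspace model of it. This is not kernel code and not the actual patch: the Sock class, the "ESP:" prefix, and the function names are all hypothetical, chosen only to show why holding a non-reentrant lock across a receive path that can re-enter itself is a problem, and why shrinking the locked section to the queueing step avoids it. It models only the recursive-acquisition hazard, not the BH-versus-user-context issue.

```python
import threading


class Sock:
    """Toy stand-in for a socket: a non-reentrant lock, a queue, a byte counter."""

    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, like a spinlock
        self.queue = []
        self.rmem = 0                 # accounting that the lock protects


def queue_rcv(sock, pkt):
    """Stand-in for the queueing step; the only part that needs the lock."""
    sock.queue.append(pkt)
    sock.rmem += len(pkt)


def rcv_wide(sock, pkt):
    """Lock held across the whole receive path, including decapsulation."""
    if not sock.lock.acquire(blocking=False):
        # A real spinlock would spin forever here; raise instead of hanging.
        raise RuntimeError("recursive lock attempt")
    try:
        if pkt.startswith(b"ESP:"):
            # Hypothetical encapsulated packet: unwrap and re-enter the path.
            return rcv_wide(sock, pkt[4:])
        queue_rcv(sock, pkt)
    finally:
        sock.lock.release()


def rcv_narrow(sock, pkt):
    """Lock held only around the queueing and accounting step."""
    if pkt.startswith(b"ESP:"):
        return rcv_narrow(sock, pkt[4:])
    with sock.lock:
        queue_rcv(sock, pkt)
```

In this model, rcv_wide() fails on any encapsulated packet because the re-entered call finds the lock already held, while rcv_narrow() handles the same packet because the lock is taken only after unwrapping is finished.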

Version-Release number of selected component (if applicable):

2.6.26.3 - Not broken
2.6.26.5 - Broken
2.6.26.6 - Undetermined

How reproducible:

Highly reliable, just time consuming.

Steps to Reproduce:
1. Set up an IPSec connection between two machines
2. Restart OpenSWAN on one of the machines a few times (dozen or so)
3. Machine eventually freezes

Alternative...  Just bring up the connection and wait for a few hours.  System will eventually die on its own, it just takes longer.
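
The restart step above can be automated. The following is a minimal sketch, not something taken from the original report: it assumes OpenSWAN is controlled through the "ipsec" init script (so that "service ipsec restart" works), and the function name, defaults, and the injectable run/sleep parameters are hypothetical. The default count follows the "dozen or so" restarts in step 2.

```python
import subprocess
import time

# Assumption: OpenSWAN is managed via the "ipsec" init script on this system.
RESTART_CMD = ["service", "ipsec", "restart"]


def restart_loop(count=12, interval=300.0, run=subprocess.call, sleep=time.sleep):
    """Restart the IPSec service `count` times, `interval` seconds apart.

    Returns the list of exit codes from each restart. `run` and `sleep`
    are injectable so the loop can be dry-run without touching the system.
    """
    codes = []
    for i in range(count):
        codes.append(run(RESTART_CMD))
        if i < count - 1:
            sleep(interval)  # no pause needed after the final restart
    return codes
```

Run it as root on one of the two peers, for example restart_loop(count=12, interval=300). Since the lockup takes user space down with it, the script cannot report the failure itself; it only automates the trigger, and the freeze has to be observed from the console or from the other peer.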


Additional info:

I'm getting ready to test a kernel.org 2.6.26.6 kernel.  When that's built and tested, I'll add those test results.

Comment 1 Michael H. Warfield 2008-10-09 17:01:53 UTC
Another IPSec related fix in 2.6.26.6 which could possibly account for the problem:

commit b047cf6dfa81ca03b62f2e3ae63793ef5c300158
Author: Herbert Xu <herbert.org.au>
Date:   Tue Sep 30 02:03:19 2008 -0700

    ipsec: Fix pskb_expand_head corruption in xfrm_state_check_space
    
    [ Upstream commit d01dbeb6af7a0848063033f73c3d146fec7451f3 ]
    
    We're never supposed to shrink the headroom or tailroom.  In fact,
    shrinking the headroom is a fatal action.
    
    Signed-off-by: Herbert Xu <herbert.org.au>
    Signed-off-by: David S. Miller <davem>
    Signed-off-by: Greg Kroah-Hartman <gregkh>
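
As a rough illustration of the rule stated in that commit message (never shrink headroom or tailroom), the arithmetic can be sketched as follows. This is a model under assumptions, not the kernel function: the name expansion_deltas and its parameters are hypothetical, and the sketch only shows the idea of clamping a negative delta to zero before asking for a buffer expansion.

```python
def expansion_deltas(need_head, have_head, need_tail, have_tail):
    """Return (extra_head, extra_tail) to request, or None if nothing is needed.

    A negative delta means the buffer already has more room than required.
    Passing it on would ask for the room to shrink, so it is clamped to 0.
    """
    nhead = need_head - have_head
    ntail = need_tail - have_tail
    if nhead <= 0 and ntail <= 0:
        return None  # enough room on both ends, no expansion at all
    return max(nhead, 0), max(ntail, 0)
```

For example, needing 32 bytes of headroom with 16 available, while already having 8 spare bytes of tailroom, yields (16, 0) rather than (16, -8).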

Comment 2 Dave Jones 2008-10-09 17:18:00 UTC
A 2.6.26.6 based Fedora kernel should be in updates-testing really soon.

Comment 3 Michael H. Warfield 2008-10-09 20:28:53 UTC
That's good.  It took over 3 hours to build a stock kernel.org kernel for 2.6.26.6, but it has now been running in my VMware test environment for over an hour, with the IPSec environment restarted every 5 minutes.  I won't claim that's definitive, but the 2.5.26.5 kernel would never have lasted this long.  Looking good and looking forward to laying my hands on those kernel rpms.

Comment 4 Michael H. Warfield 2008-10-09 20:30:21 UTC
Ignore the typo in the past message...  2.6.26.6 not 2.5.26.5.  Duh.

Comment 5 Chuck Ebbert 2008-10-10 01:13:32 UTC
I'm going to assume that 2.6.26.6 fixes the problem.

Comment 6 Michael H. Warfield 2008-10-12 17:27:41 UTC
I think that's a very good assumption.

I have not had a single lockup, IPSec related or otherwise, in any of my testbeds running 2.6.26.6, either stock kernel.org kernels or the 2.6.26.6-67 from Koji.  I've just built my own 2.6.27-3 kernels for F9 from the Koji srpm and will be testing that next.

Comment 7 Fedora Update System 2008-10-14 19:50:56 UTC
kernel-2.6.26.6-46.fc8 has been submitted as an update for Fedora 8.
http://admin.fedoraproject.org/updates/kernel-2.6.26.6-46.fc8

Comment 8 Fedora Update System 2008-10-14 20:00:55 UTC
kernel-2.6.26.6-71.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.26.6-71.fc9

Comment 9 Fedora Update System 2008-10-20 20:28:59 UTC
kernel-2.6.26.6-79.fc9 has been pushed to the Fedora 9 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update kernel'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8929

Comment 10 Fedora Update System 2008-10-23 16:38:27 UTC
kernel-2.6.26.6-79.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.