Bug 166531

Summary: IPSec VPN Tunnels cause kernel panic when run over PPPoE (ADSL)
Product: Red Hat Enterprise Linux 3
Reporter: David Herselman <bbs2web>
Component: kernel
Assignee: David Miller <davem>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3.0
CC: petrides
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-10-19 18:55:38 UTC

Description David Herselman 2005-08-23 00:08:33 UTC
Description of problem:


Version-Release number of selected component (if applicable): 2.4.21-32.0.1.EL


How reproducible: Always (can take up to 3 days)


Steps to Reproduce:
1. Configure a network-to-network IPSec VPN tunnel over PPPoE (ADSL)
2. Does not require high utilisation
3. Wait for up to a maximum of 3 days (usually 1-2 days)
  
Actual results: Kernel panic
Log entries showing kernel panic:
Jul 27 17:24:32 unix-01 racoon: INFO: isakmp.c:1387:isakmp_open(): 10.0.0.1
[500] used as isakmp port (fd=8)
Jul 27 17:24:32 unix-01 racoon: INFO: isakmp.c:1387:isakmp_open(): 192.168.4.1
[500] used as isakmp port (fd=9)
Jul 27 17:24:32 unix-01 racoon: INFO: isakmp.c:1387:isakmp_open(): 127.0.0.1
[500] used as isakmp port (fd=10)
Jul 27 17:24:35 unix-01 kernel: KERNEL: assertion (x->km.state == 
XFRM_STATE_DEAD) failed at xfrm_state.c(193)
Jul 27 17:24:35 unix-01 kernel: KERNEL: assertion (x->km.state == 
XFRM_STATE_DEAD) failed at xfrm_state.c(193)
Jul 27 17:24:35 unix-01 kernel: ------------[ cut here ]------------
Jul 27 17:24:35 unix-01 kernel: kernel BUG at xfrm_state.c:54!
Jul 27 17:24:35 unix-01 kernel: invalid operand: 0000
Jul 27 17:24:35 unix-01 kernel: esp4 ah4 cls_u32 sch_sfq sch_cbq ipt_TOS 
ipt_limit ip_nat_irc ppp_synctty ppp_async ppp_generic slhc ipt_state ipt_owner 
ipt_REDIRECT ipt_REJECT ipt_LOG iptab


And another:
Jul 30 05:59:07 unix-01 kernel: KERNEL: assertion (x->km.state == 
XFRM_STATE_DEAD) failed at xfrm_state.c(193)
<nothing else logged>


Expected results:


Additional info:
Recompiled the stock 2.4.21-32.0.1.EL Red Hat kernel with the XFRM_State patch
from the 2.6.11.7 changelog, but the system still locks up after the same amount
of time (although the 'kernel BUG at xfrm_state.c:54' messages have disappeared):
ICMP frag. IPSec deadlock:
http://lists.openswan.org/pipermail/users/2005-April/004540.html

Syslog from one of the machines affected by this problem:
Aug 18 02:45:18 unix-01 racoon: INFO: pfkey.c:1394:pk_recvexpire(): IPsec-SA 
expired: AH/Tunnel 196.25.242.202->165.146.30.88 spi=200953471(0xbfa4e7f)
Aug 18 02:45:18 unix-01 racoon: INFO: pfkey.c:1394:pk_recvexpire(): IPsec-SA 
expired: ESP/Tunnel 196.25.242.202->165.146.30.88 spi=847614(0xceefe)
Aug 18 02:45:18 unix-01 racoon: INFO: pfkey.c:1394:pk_recvexpire(): IPsec-SA 
expired: AH/Tunnel 165.146.30.88->196.25.242.202 spi=18140664(0x114cdf8)
Aug 18 08:07:37 unix-01 syslogd 1.4.1: restart.
Aug 18 08:07:37 unix-01 syslog: syslogd startup succeeded
Aug 18 08:07:37 unix-01 kernel: klogd 1.4.1, log source = /proc/kmsg started.


NB: The same system works flawlessly when we switch the connection over to a 
fractional T1 link (diginet) instead of using PPPoE (ADSL)...

These bugzilla cases sound similar:
 151044 - Code too different to compare directly
    https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151044
 118885 - Appears to have been patched already
    https://bugzilla.redhat.com/bugzilla/long_list.cgi?buglist=118885

Previously entered as Bugzilla 164730
  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=164730

Comment 1 David Herselman 2005-08-25 06:37:08 UTC
I have tried to implement a workaround whereby a cron job restarts the tunnels
every 3 hours. The systems are staying up longer now, but 2 crashed this
morning with the following:

syslog:
Aug 24 18:15:00 unix-01 modprobe: modprobe: Can't locate module ripemd160
Aug 24 18:15:00 unix-01 modprobe: modprobe: Can't locate module cast128
Aug 24 18:15:00 unix-01 modprobe: modprobe: Can't locate module lzs
Aug 24 18:15:01 unix-01 modprobe: modprobe: Can't locate module lzjh
Aug 24 18:15:01 unix-01 kernel: KERNEL: assertion (x->km.state == 
XFRM_STATE_DEAD) failed at xfrm_state.c(193)
Aug 25 08:00:12 unix-01 syslogd 1.4.1: restart.
Aug 25 08:00:12 unix-01 syslog: syslogd startup succeeded
Aug 25 08:00:12 unix-01 kernel: klogd 1.4.1, log source = /proc/kmsg started.

Screen:
Kernel bug at xfrm_state.c:54!
invalid operand : 0000
ide_cd cdrom esp4 ah4 cls_u32 sch_sfq sch_cbq ipt_TOS (did not finish
writing all of these down; there were a couple more)

CPU1
EIP: 0060 [<c028b15a>] Not tainted
EFLAGS:0010202

EIP is at xfrm_state_gc_destroy [kernel] 0x1a (2.4.21-32.0.1.ELsmp/i686)

(Then there were a whole bunch of numbers)

Kernel panic: Fatal exception



2nd system that crashed, also running IPSec network-to-network VPN over PPPoE:
Aug 25 07:46:14 unix-01 pppd[3077]: LCP terminated by peer
Aug 25 07:46:14 unix-01 pppoe[3078]: Session 4481 terminated -- received PADT 
from peer
Aug 25 07:46:14 unix-01 pppoe[3078]: Sent PADT
Aug 25 07:46:14 unix-01 pppd[3077]: Modem hangup
Aug 25 07:46:14 unix-01 pppd[3077]: Connection terminated.
Aug 25 07:46:14 unix-01 pppd[3077]: Connect time 1440.2 minutes.
Aug 25 07:46:14 unix-01 pppd[3077]: Sent 112961249 bytes, received 345804065 
bytes.
Aug 25 07:46:14 unix-01 pppd[3077]: Exit.
Aug 25 07:46:14 unix-01 adsl-connect: ADSL connection lost; attempting re-
connection.
Aug 25 07:46:14 unix-01 /etc/hotplug/net.agent: NET unregister event not 
supported
Aug 25 07:46:18 unix-01 kernel: KERNEL: assertion (x->km.state == 
XFRM_STATE_DEAD) failed at xfrm_state.c(193)
Aug 25 08:14:43 unix-01 syslogd 1.4.1: restart.
Aug 25 08:14:43 unix-01 syslog: syslogd startup succeeded
Aug 25 08:14:43 unix-01 kernel: klogd 1.4.1, log source = /proc/kmsg started.




Sounds extremely relevant to the following kernel Bug posting:
  http://www.uwsg.indiana.edu/hypermail/linux/net/0307.3/0030.html
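
For reference, the cron-based workaround mentioned at the top of this comment
could be sketched roughly as follows. This is only an illustration: the use of
ifdown/ifup, the interface name ipsec0 and the wrapper path in the crontab line
are assumptions, not the exact script that was running on these machines.

```shell
#!/bin/sh
# Sketch of a periodic tunnel restart. The ifdown/ifup commands and the
# interface names are assumptions; substitute whatever brings your tunnels up.

# Commands used to stop and start a tunnel; overridable for dry runs.
: "${IFDOWN:=/sbin/ifdown}"
: "${IFUP:=/sbin/ifup}"

# restart_tunnels IFACE...
# Takes each interface down and back up, reporting the outcome per interface.
# Stops at the first interface that fails to come back up.
restart_tunnels() {
    for iface in "$@"; do
        "$IFDOWN" "$iface"
        if "$IFUP" "$iface"; then
            echo "restarted $iface"
        else
            echo "failed to restart $iface" >&2
            return 1
        fi
    done
}

# Hypothetical crontab entry (every 3 hours) calling a wrapper script that
# sources this file and runs: restart_tunnels ipsec0
# 0 */3 * * * /usr/local/sbin/restart-ipsec-tunnels
```

Setting IFDOWN=echo and IFUP=echo before calling restart_tunnels gives a dry
run that only prints what would be restarted.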


Comment 2 David Herselman 2005-08-25 07:06:41 UTC
Herbert's patch from the above posting has already been applied to the current
system's kernel. Again, this only affects systems running IPSec tunnels over
PPPoE connections: we switched one of the servers over to its backup route
(fractional T1 (diginet)) and it hasn't locked up once.

Comment 3 David Herselman 2005-08-29 07:02:47 UTC
Is there any additional information I can supply to assist with resolving this
problem? We've set up a RHEL4 test server running the same config, so we'll see
shortly whether this is specific to RHEL3.

One concern is how many people are actually doing this (especially via
dynamic IP PPPoE connections), given:
  1. The ifup-ipsec and ifdown-ipsec scripts being broken for net-to-net VPNs
  2. Racoon missing an init script
  3. Having to hack together a simple script to handle the changing IPs which
     updates the 'DST=' entry in /etc/sysconfig/network-scripts/ifcfg-ipsec?
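
As an illustration of item 3, a minimal sketch of such a helper is shown below.
The function name update_dst and its exit codes are made up for this example,
and how the peer's current address is obtained is deliberately left to the
caller.

```shell
#!/bin/sh
# update_dst CONFIG_FILE NEW_IP
# Rewrites the DST= line of an ifcfg-ipsec* style file in place.
# Exit status: 0 = file updated, 1 = address unchanged, 2 = error.
# (Hypothetical helper; name and exit codes are assumptions.)
update_dst() {
    cfg=$1
    new_ip=$2

    [ -f "$cfg" ] || { echo "no such file: $cfg" >&2; return 2; }

    # Loose sanity check: only digits and dots are accepted.
    case $new_ip in
        ''|*[!0-9.]*) echo "not an IPv4 address: $new_ip" >&2; return 2 ;;
    esac

    grep -q '^DST=' "$cfg" || { echo "no DST= line in $cfg" >&2; return 2; }

    current=$(sed -n 's/^DST=//p' "$cfg" | head -n 1)
    if [ "$current" = "$new_ip" ]; then
        return 1
    fi

    # Build the new contents outside the network-scripts directory so that no
    # stray ifcfg-* file appears there, then overwrite the original in place.
    tmp=$(mktemp /tmp/update_dst.XXXXXX) || return 2
    if sed "s/^DST=.*/DST=${new_ip}/" "$cfg" > "$tmp" && cat "$tmp" > "$cfg"; then
        rm -f "$tmp"
        return 0
    fi
    rm -f "$tmp"
    return 2
}
```

A caller would typically restart the tunnel only when update_dst returns 0, so
that an unchanged peer address does not trigger a needless restart.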



Comment 4 David Herselman 2005-08-29 07:08:19 UTC
Could this possibly have something to do with IP addresses changing when the 
PPPoE connections re-establish?
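
In the second crash log in comment 1, the assertion was logged 4 seconds after
the PPPoE session dropped (07:46:14 versus 07:46:18), which fits this theory
but is only one data point. A rough way to check other incidents is to pull
just the relevant lines out of syslog. The function name below is made up for
this sketch, and the match strings are taken from the log excerpts in this
report.

```shell
#!/bin/sh
# reconnect_timeline LOGFILE
# Prints, in file order, only the syslog lines that mark a PPPoE drop or the
# xfrm assertion, so that their relative timing can be compared by eye.
# (Hypothetical helper; the patterns are copied from the logs in this report.)
reconnect_timeline() {
    grep -F -e 'ADSL connection lost' \
            -e 'assertion (x->km.state' "$1"
}
```

Example: reconnect_timeline /var/log/messages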

Comment 5 David Herselman 2005-09-25 15:41:17 UTC
No feedback from anyone out there and I was under pressure to get this 
resolved... Got it working by installing kernel 2.6 from RHEL4.1 on the RHEL3 
servers.

Required packages:
kernel-2.6.9-11.EL.i686.rpm
lvm2-2.01.08-1.0.RHEL4.i386.rpm
depend/device-mapper-1.01.01-1.RHEL4.i386.rpm
depend/glibc-2.3.4-2.9.i686.rpm
depend/glibc-common-2.3.4-2.9.i386.rpm
depend/ipsec-tools-0.3.3-6.i386.rpm
depend/l2tpd-0.69-12jdl.i386.rpm
depend/libselinux-1.19.1-8.i386.rpm
depend/mkinitrd-4.2.1.3-1.i386.rpm
depend/module-init-tools-3.1-0.pre5.3.i386.rp
depend/nscd-2.3.4-2.9.i386.rpm


Installed like this:
rpm -e piranha
rpm -Uvh --nodeps lvm2-2.01.08-1.0.RHEL4.i386.rpm
rpm -Uvh depend/*.rpm
rpm -ivh kernel-2.6.9-11.EL.i686.rpm
vi /etc/lilo.conf
lilo


Comment 6 David Herselman 2005-09-25 15:43:07 UTC
Didn't get to test the following patch from Bugzilla #168458:
  http://sourceforge.net/mailarchive/forum.php?thread_id=3866075&forum_id=32000

Comment 7 RHEL Program Management 2007-10-19 18:55:38 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.