Bug 671516 - connect() call to socket hangs when IPsec tunnel down
Summary: connect() call to socket hangs when IPsec tunnel down
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openswan
Version: 5.4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Avesh Agarwal
QA Contact: BaseOS QE Security Team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-01-21 18:19 UTC by deepak.dg.gupta
Modified: 2018-11-14 14:58 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-01-18 18:55:39 UTC
Target Upstream Version:



Description deepak.dg.gupta 2011-01-21 18:19:41 UTC
Description of problem:

When the IPsec tunnel between a RHEL 5.4 server (2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux) and an external gateway is taken down for some reason, a subsequent connect() call to
a socket on the other end of the openswan IPsec tunnel hangs for more than 30
seconds.  The socket is set to non-blocking.




Version-Release number of selected component (if applicable):

RHEL Kernel:

2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Openswan version:

openswan-2.6.21-5.el5_5.3

How reproducible:

100% reproducible.  Happens every time.


Steps to Reproduce:
1.  Establish an IPsec tunnel between the RHEL server mentioned above and an
external gateway
2.  Take the tunnel down
3.  Try to connect() to a socket on the other end of the tunnel (a minimal
reproducer sketch follows below)
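
The following is a minimal reproducer sketch (my own illustration, not code taken from the affected application). The peer address 10.155.199.101:7772 is copied from the strace output below, so adjust it for your setup. It times how long the connect() call itself blocks on a socket that has already been set to non-blocking:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);  /* non-blocking */

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(7772);                  /* port from strace below */
    inet_pton(AF_INET, "10.155.199.101", &peer.sin_addr);

    time_t start = time(NULL);
    int rc = connect(fd, (struct sockaddr *)&peer, sizeof(peer));
    int saved = errno;
    time_t elapsed = time(NULL) - start;

    /* Expected: rc == -1, errno == EINPROGRESS, elapsed ~0 s.
       Observed with the tunnel down: connect() itself blocks for 30+ s. */
    printf("connect() returned %d (errno=%d: %s) after %ld s\n",
           rc, saved, strerror(saved), (long)elapsed);
    close(fd);
    return 0;
}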
  
Actual results:

connect() hangs for over 30 seconds.

Expected results:

The connect() call should fail right away, since the socket is set to non-blocking.
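
For reference, the usual non-blocking connect pattern (a sketch of what the application presumably relies on, not code from the report) is that connect() returns -1 with errno EINPROGRESS immediately, and the eventual success or failure is collected later with poll() plus getsockopt(SO_ERROR):

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Wait up to timeout_ms for a pending non-blocking connect() to finish.
   Returns 0 on success, -1 with errno set on failure or timeout. */
static int finish_nonblocking_connect(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    int n = poll(&pfd, 1, timeout_ms);
    if (n == 0) {
        errno = ETIMEDOUT;              /* connection attempt timed out */
        return -1;
    }
    if (n < 0)
        return -1;                      /* poll() itself failed */

    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    if (err != 0) {
        errno = err;                    /* the deferred connect() error */
        return -1;
    }
    return 0;                           /* connection established */
}

In the failing case reported here, however, the thread never gets that far: connect() itself sleeps in the kernel (wait channel xfrm_l in the ps output below) for more than 30 seconds before returning.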

Additional info:

Output from the ps command showing that the thread in question is stuck on the
wait channel xfrm_l, which is part of the kernel's IPsec (xfrm) code.


bmc-hp1-0-0-0:/root-# ps -aefLl | grep __http_stack_asr
1 S nxuser   10847 19118 10905  0   15  75   0 - 134939 xfrm_l 17:28 ?       00:00:00 /opt/proxy/bin/http_stack Http Stack Bmc__http_stack_asr_1 257 asr-1 10847 localhost 108
bmc-hp1-0-0-0:/root-# date
Fri Jan 21 19:13:25 CET 2011
bmc-hp1-0-0-0:/root-# ps -aefLl | grep __http_stack_asr
1 S nxuser   10847 19118 10905  0   15  75   0 - 134939 xfrm_l 17:28 ?       00:00:00 /opt/proxy/bin/http_stack Http Stack Bmc__http_stack_asr_1 257 asr-1 10847 localhost 108
bmc-hp1-0-0-0:/root-# ps -aefLl | grep __http_stack_asr
1 S nxuser   10847 19118 10905  0   15  75   0 - 134939 xfrm_l 17:28 ?       00:00:00 /opt/proxy/bin/http_stack Http Stack Bmc__http_stack_asr_1 257 asr-1 10847 localhost 108
1 S nxuser   10847 19118 10905  0   15  75   0 - 134939 xfrm_l 17:28 ?       00:00:00 /opt/proxy/bin/http_stack Http Stack Bmc__http_stack_asr_1 257 asr-1 10847 localhost 108
bmc-hp1-0-0-0:/root-# date
Fri Jan 21 19:13:48 CET 2011
bmc-hp1-0-0-0:/root-# ps -aefLl | grep __http_stack_asr
1 S nxuser   10847 19118 10905  0   15  75   0 - 134939 xfrm_l 17:28 ?       00:00:00 /opt/proxy/bin/http_stack Http Stack Bmc__http_stack_asr_1 257 asr-1 10847 localhost 108
bmc-hp1-0-0-0:/root-# date
Fri Jan 21 19:13:54 CET 2011
bmc-hp1-0-0-0:/root-# date
Fri Jan 21 19:13:56 CET 2011
bmc-hp1-0-0-0:/root-# ps -aefLl | grep __http_stack_asr
1 S nxuser   10847 19118 10905  0   15  75   0 - 134811 -     17:28 ?        00:00:00 /opt/proxy/bin/http_stack Http Stack Bmc__http_stack_asr_1 257 asr-1 10847 localhost 108
bmc-hp1-0-0-0:/root-# date
Fri Jan 21 19:14:00 CET 2011
bmc-hp1-0-0-0:/root-#



gdb info:
> (gdb) up
> #5  0xf7e16bac in stSanityCheck () at dispatch.c:1351
> 1351    in dispatch.c
> (gdb) p d
> $7 = (dispatcher_t *) 0x9be6ef4
> (gdb) p d->cpuInfo
> $8 = {pid = 19205, cpu = 0, _idx = 0, _cpu = {0, 0, 0, 0, 0}, admissionRate = 100, selfAdmissionRate = 100, agentsAdmissionRate = 100, highTreshold = 90,
>   emergencyTreshold = 99, gainUpFactor = 1, gainDownFactor = 1, loadFactor = 0, count = 0, _oldtime = {tv_sec = 0, tv_usec = 0}, olduclicks = 0, oldsclicks = 0,
>   procFileFd = 0, procFileName = "/proc/19041/task/19205/stat", '\0'
> <repeats 228 times>, alarmTimer = 0, alarmRaised = false}
> (gdb) thread 8
> [Switching to thread 8 (Thread 4088118160 (LWP 19205))]#0  0xffffe410 
> in __kernel_vsyscall ()
> (gdb) bt
> #0  0xffffe410 in __kernel_vsyscall ()
> #1  0xf7bb3fa8 in connect () from /lib/libpthread.so.0
> #2  0xf7e2dfd2 in stCreateSock (dstAddr=0xf5903378,
> srcAddr=0xf59032f8) at sock.c:519
> #3  0xf7e2ff65 in stNewSocket (dispatcher=0x9be6ef4, 
> dstAddr=0xf5903378, srcAddr=0xf59032f8, fd=-1) at socket.c:242
> #4  0xf7e2ec74 in stGetStreamSocketByAddr (stream=0xf59031ac, ssl=0) 
> at
> sock.c:734
> #5  0xf7e2ef80 in stGetStreamSocket (stream=0xf59031ac, ssl=0) at
> sock.c:813
> #6  0x0806438f in ?? ()
> #7  0xf59031ac in ?? ()
> #8  0x00000000 in ?? ()
> (gdb)

Comment 1 deepak.dg.gupta 2011-01-21 21:06:43 UTC
Here's some more information I gathered by running strace on this process:

I see that the socket is being set to O_NONBLOCK and then connect() is being called to the IP at the other end of the IPsec tunnel:

[pid 20528] fcntl64(58, F_SETFL, O_RDWR|O_NONBLOCK <unfinished ...>
[pid 20502] gettimeofday( <unfinished ...>
[pid 20528] <... fcntl64 resumed> )     = 0
[pid 20502] <... gettimeofday resumed> {1295642530, 830768}, NULL) = 0
[pid 20528] connect(58, {sa_family=AF_INET, sin_port=htons(7772), sin_addr=inet_addr("10.155.199.101")}, 128 <unfinished ...>


and then, much later in the strace output, I see the same thread (20528) return from the connect() as follows:

[pid 20528] <... connect resumed> )     = -1 ESRCH (No such process)


The time elapsed between these two strace segments was more than 30 seconds.

Comment 2 deepak.dg.gupta 2011-01-21 21:22:36 UTC
And here's the output of "ipsec auto --status".  There are two tunnels configured, but both are down:

000 "ag-10.155.199.101": 10.155.155.78<10.155.155.78>[C=US, ST=OH, O=ALU, OU=MSG, CN=BMC1, E=BMC1,+S=C]---10.155.155.5...10.155.199.101<10.155.199.101>[C=US, ST=OH, O=ALU, OU=MSG, CN=AG01, E=AG01,+S=C]; prospective erouted; eroute owner: #0
000 "ag-10.155.199.101":     myip=unset; hisip=unset; mycert=bmc;
000 "ag-10.155.199.101":   CAs: 'C=US, ST=OH, L=COLUMBUS, O=ALU, OU=MSG, CN=CA, E=CA'...'%any'
000 "ag-10.155.199.101":   ike_life: 86400s; ipsec_life: 28800s; rekey_margin: 600s; rekey_fuzz: 100%; keyingtries: 0
000 "ag-10.155.199.101":   policy: RSASIG+ENCRYPT+TUNNEL+PFS+UP+IKEv2ALLOW; prio: 32,32; interface: bond1:alrt;
000 "ag-10.155.199.101":   newest ISAKMP SA: #0; newest IPsec SA: #0;
000 "ag-10.155.199.101":   IKE algorithms wanted: AES_CBC(7)_128-SHA1(2)-MODP2048(14); flags=-strict
000 "ag-10.155.199.101":   IKE algorithms found:  AES_CBC(7)_128-SHA1(2)_160-14,
000 "ag-10.155.199.101":   ESP algorithms wanted: AES(12)_128-SHA2_256(5); flags=-strict
000 "ag-10.155.199.101":   ESP algorithms loaded: AES(12)_128-SHA2_256(5)_256
000 "ag-10.155.199.102": 10.155.155.78<10.155.155.78>[C=US, ST=OH, O=ALU, OU=MSG, CN=BMC1, E=BMC1,+S=C]---10.155.155.5...10.155.199.102<10.155.199.102>[C=US, ST=OH, O=ALU, OU=MSG, CN=AG02, E=AG02,+S=C]; prospective erouted; eroute owner: #0
000 "ag-10.155.199.102":     myip=unset; hisip=unset; mycert=bmc;
000 "ag-10.155.199.102":   CAs: 'C=US, ST=OH, L=COLUMBUS, O=ALU, OU=MSG, CN=CA, E=CA'...'%any'
000 "ag-10.155.199.102":   ike_life: 86400s; ipsec_life: 28800s; rekey_margin: 600s; rekey_fuzz: 100%; keyingtries: 0
000 "ag-10.155.199.102":   policy: RSASIG+ENCRYPT+TUNNEL+PFS+UP+IKEv2ALLOW; prio: 32,32; interface: bond1:alrt;
000 "ag-10.155.199.102":   newest ISAKMP SA: #0; newest IPsec SA: #0;
000 "ag-10.155.199.102":   IKE algorithms wanted: AES_CBC(7)_128-SHA1(2)-MODP2048(14); flags=-strict
000 "ag-10.155.199.102":   IKE algorithms found:  AES_CBC(7)_128-SHA1(2)_160-14,
000 "ag-10.155.199.102":   ESP algorithms wanted: AES(12)_128-SHA2_256(5); flags=-strict
000 "ag-10.155.199.102":   ESP algorithms loaded: AES(12)_128-SHA2_256(5)_256
000
000 #297: "ag-10.155.199.101":500 STATE_MAIN_I1 (sent MI1, expecting MR1); EVENT_RETRANSMIT in 6s; nodpd; idle; import:admin initiate
000 #297: pending Phase 2 for "ag-10.155.199.101" replacing #0
000 #296: "ag-10.155.199.102":500 STATE_MAIN_I1 (sent MI1, expecting MR1); EVENT_RETRANSMIT in 6s; nodpd; idle; import:admin initiate
000 #296: pending Phase 2 for "ag-10.155.199.102" replacing #0
000

Comment 3 Chris Williams 2012-01-18 18:55:39 UTC
Issue resolved in Customer Portal case 00410706. Closing NOTABUG. Feel free to open a case with Red Hat Support via the Customer Portal if this is still an issue.

