Bug 1688390

Summary: OCP haproxy router creating new session on old PIDs after FINISH signal has been sent.
Product: OpenShift Container Platform Reporter: Ryan Howe <rhowe>
Component: NetworkingAssignee: Dan Mace <dmace>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED NOTABUG Docs Contact:
Severity: high    
Priority: unspecified CC: amcdermo, aos-bugs, bperkins
Version: 3.11.0   
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1690146 (view as bug list) Environment:
Last Closed: 2019-10-23 14:47:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1690146    

Description Ryan Howe 2019-03-13 16:39:26 UTC
Description of problem:

503 errors are seen coming from router when making a connection. The OpenShift router with nbthreads configured is routing to old endpoints that no longer exist, due to creating new sessions linked to old pids that were send the FINISH (SIGTTOU and SIGUSR1) signal. 


At this time this is only seen when nbthread is configured. 


Version-Release number of selected component (if applicable):
OpenShift 3.9 -> 3.11 haproxy router 
Haproxy 1.8.8

Additional info:

- PID 51938 in pod:
    51938   Mar05  250:12  /usr/sbin/haproxy
     |
     v
- Links to pid 4104 on the host
1000000+   4104  2.9  0.1 355740 60468 ?        Ssl  Mar05 250:10 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 21158 51857
     |
     v
- lsof has the following established connections
haproxy     4104 10000000   15u     IPv4         1005375260         0t0        TCP y.y.y.18:https->x.x.x.14:40834 (ESTABLISHED)
haproxy     4104 10000000   19u     IPv4         1005375226         0t0        TCP y.y.y.18:https->x.x.x.13:47778 (ESTABLISHED)
haproxy     4104 10000000   21u     IPv4         1005365865         0t0        TCP y.y.y.18:https->x.x.x.13:46924 (ESTABLISHED)
haproxy     4104 10000000   27u     IPv4         1005379692         0t0        TCP y.y.y.18:https->x.x.x.14:40676 (ESTABLISHED)
haproxy     4104 10000000   29u     IPv4         1005366039         0t0        TCP y.y.y.18:https->x.x.x.13:46978 (ESTABLISHED)
haproxy     4104 10000000   52u     IPv4         1005345463         0t0        TCP y.y.y.18:https->x.x.x.14:59186 (ESTABLISHED)


- The PID for 51938 was sent finish signal  (likely by another pid but remains in list). The finish signal was sent at or before Mar 9th:
 
   1000000+  41881  2.7  0.1 355488 56384 ?        Ssl  Mar09  73:43 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 21158 51938 65730

- Haproxy logs show new sessions with this old pid:

  Mar 11 15:49:23 localhost haproxy[51938]: x.x.x.14:42972 [11/Mar/2019:15:49:23.872] fe_sni~ be_edge_http:myapp2:myapp2-bg/<NOSRV> 0/-1/-1/-1/0 503 3278 - - SCNN 1/1/0/0/0 0/0 "HEAD /lae2cae.html HTTP/1.1"
  Mar 11 15:49:19 localhost haproxy[51938]: x.x.x.13:50820 [11/Mar/2019:15:49:19.651] fe_sni~ be_edge_http:myapp:myapp-test/<NOSRV> 0/-1/-1/-1/0 503 3278 - - SCNN 2/1/0/0/0 0/0 "HEAD /lae2cae.html HTTP/1.1"


- Info from haproxy 
$ echo "show info" | socat /var/lib/haproxy/run/haproxy.sock stdio
Name: HAProxy
Version: 1.8.8
Release_date: 2018/04/19
Nbthread: 4
Nbproc: 1
Process_num: 1
Pid: 44561

Comment 2 Dan Mace 2019-03-14 21:45:37 UTC
Thanks for the detailed report, Ryan. Will let you know if we need any more information to continue diagnosis.