Bug 1743291
| Summary: | Stale haproxy processes are not removed properly | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | jooho lee <jlee> |
| Component: | Networking | Assignee: | Andrew McDermott <amcdermo> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | urgent | CC: | amcdermo, aos-bugs, bbennett, bperkins, ggore, mrobson, rhowe, steven.barre |
| Version: | 3.11.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-12-11 14:27:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description jooho lee 2019-08-19 14:35:56 UTC
Can we confirm the version of haproxy in use by running `haproxy -vv` from the pod that is running the router? I believe I do see a case where, although we set 'timeout tunnel 5m', the old haproxy does not disappear:
Reloading haproxy every 5s:
Wed 30 Oct 2019 05:06:11 PM GMT -- oldpids: 10529
Wed 30 Oct 2019 05:06:16 PM GMT -- oldpids: 10565
Wed 30 Oct 2019 05:06:21 PM GMT -- oldpids: 10600
Wed 30 Oct 2019 05:06:26 PM GMT -- oldpids: 10634
Wed 30 Oct 2019 05:06:31 PM GMT -- oldpids: 10782
Wed 30 Oct 2019 05:06:36 PM GMT -- oldpids: 10818
Wed 30 Oct 2019 05:06:41 PM GMT -- oldpids: 11083 10818
Wed 30 Oct 2019 05:06:46 PM GMT -- oldpids: 11116 10818
Wed 30 Oct 2019 05:06:51 PM GMT -- oldpids: 11146 10818
Wed 30 Oct 2019 05:06:56 PM GMT -- oldpids: 11188 10818
...
Wed 30 Oct 2019 05:14:51 PM GMT -- oldpids: 14792 10818
Wed 30 Oct 2019 05:14:56 PM GMT -- oldpids: 14827 10818
Wed 30 Oct 2019 05:15:01 PM GMT -- oldpids: 14864 10818
Wed 30 Oct 2019 05:15:06 PM GMT -- oldpids: 14897 10818
Wed 30 Oct 2019 05:15:11 PM GMT -- oldpids: 14936 10818
Wed 30 Oct 2019 05:15:16 PM GMT -- oldpids: 14975 10818
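For context before digging into the lingering PID: each iteration above performs a standard haproxy soft reload. A minimal sketch of what such a reload amounts to, assuming the stock haproxy flags (the paths here are illustrative, not the router's actual ones):

# Start a new haproxy with the current config; the PIDs passed to -sf
# are asked to soft-stop, i.e. they keep running until their existing
# connections drain, and only then exit.
$ haproxy -f ./haproxy.cfg -p ./haproxy.pid -sf $(cat ./haproxy.pid 2>/dev/null)

Every process still draining at a given moment shows up in the oldpids list above.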
More than 5m later, PID 10818 is still lingering; it does have established connections:
$ lsof -p 10818
lsof: WARNING: can't stat() fuse file system /keybase
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
haproxy 10818 aim cwd DIR 8,49 40960 3670017 /home/aim
haproxy 10818 aim rtd DIR 253,0 4096 2 /
haproxy 10818 aim txt REG 8,49 8382936 11929158 /home/aim/haproxy-1.8/haproxy
haproxy 10818 aim mem REG 253,0 6697832 1447906 /usr/lib64/libc-2.29.so
haproxy 10818 aim mem REG 253,0 823192 1448140 /usr/lib64/libpthread-2.29.so
haproxy 10818 aim mem REG 253,0 61872 1447908 /usr/lib64/libdl-2.29.so
haproxy 10818 aim mem REG 253,0 260200 1462867 /usr/lib64/libcrypt.so.2.0.0
haproxy 10818 aim mem REG 253,0 436736 1443329 /usr/lib64/ld-2.29.so
haproxy 10818 aim 0u CHR 1,3 0t0 1031 /dev/null
haproxy 10818 aim 1u CHR 1,3 0t0 1031 /dev/null
haproxy 10818 aim 2u CHR 1,3 0t0 1031 /dev/null
haproxy 10818 aim 3u a_inode 0,14 0 13328 [eventpoll]
haproxy 10818 aim 4u IPv4 21803422 0t0 UDP *:50737
haproxy 10818 aim 7r FIFO 0,13 0t0 21802735 pipe
haproxy 10818 aim 8w FIFO 0,13 0t0 21802735 pipe
haproxy 10818 aim 9u a_inode 0,14 0 13328 [eventpoll]
haproxy 10818 aim 10u a_inode 0,14 0 13328 [eventpoll]
haproxy 10818 aim 11u a_inode 0,14 0 13328 [eventpoll]
haproxy 10818 aim 13u IPv4 21798818 0t0 TCP localhost:4242->localhost:45632 (ESTABLISHED)
haproxy 10818 aim 14u IPv4 21798820 0t0 TCP localhost:50362->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 18u IPv4 21795782 0t0 UDP *:34593
haproxy 10818 aim 19u IPv4 21795785 0t0 TCP localhost:4242->localhost:45642 (ESTABLISHED)
haproxy 10818 aim 22u IPv4 21795790 0t0 UDP *:36178
haproxy 10818 aim 23u IPv4 21795791 0t0 TCP localhost:50376->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 29u IPv4 21805253 0t0 UDP *:42369
haproxy 10818 aim 35u IPv4 21806129 0t0 TCP localhost:4242->localhost:45670 (ESTABLISHED)
haproxy 10818 aim 38u IPv4 21806133 0t0 TCP localhost:50406->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 67u IPv4 21795803 0t0 TCP localhost:4242->localhost:45736 (ESTABLISHED)
haproxy 10818 aim 71u IPv4 21795807 0t0 TCP localhost:4242->localhost:45744 (ESTABLISHED)
haproxy 10818 aim 73u IPv4 21795809 0t0 TCP localhost:4242->localhost:45748 (ESTABLISHED)
haproxy 10818 aim 77u IPv4 21795814 0t0 TCP localhost:50484->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 80u IPv4 21795817 0t0 TCP localhost:50490->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 83u IPv4 21795819 0t0 TCP localhost:50496->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 97u IPv4 21806193 0t0 TCP localhost:4242->localhost:45796 (ESTABLISHED)
haproxy 10818 aim 100u IPv4 21806198 0t0 TCP localhost:4242->localhost:45802 (ESTABLISHED)
haproxy 10818 aim 101u IPv4 21806200 0t0 TCP localhost:50532->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 102u IPv4 21806201 0t0 TCP localhost:4242->localhost:45806 (ESTABLISHED)
haproxy 10818 aim 132u IPv4 21806225 0t0 TCP localhost:4242->localhost:45862 (ESTABLISHED)
haproxy 10818 aim 141u IPv4 21806233 0t0 TCP localhost:50614->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 152u IPv4 21806241 0t0 TCP localhost:50636->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 153u IPv4 21806242 0t0 TCP localhost:50638->localhost:cslistener (ESTABLISHED)
haproxy 10818 aim 161u IPv4 21802809 0t0 TCP localhost:4242->localhost:45926 (ESTABLISHED)
haproxy 10818 aim 162u IPv4 21802811 0t0 TCP localhost:50656->localhost:cslistener (ESTABLISHED)
Given the tunnel timeout of 5m, I would expect them to be closed. They do eventually close, but it seems to take longer than 5m:
Wed 30 Oct 2019 05:16:42 PM GMT -- oldpids: 15575 10818
Wed 30 Oct 2019 05:16:47 PM GMT -- oldpids: 15612
Wed 30 Oct 2019 05:16:52 PM GMT -- oldpids: 15647
Wed 30 Oct 2019 05:16:58 PM GMT -- oldpids: 15685
Wed 30 Oct 2019 05:17:03 PM GMT -- oldpids: 15711
Wed 30 Oct 2019 05:17:08 PM GMT -- oldpids: 15760
Wed 30 Oct 2019 05:17:13 PM GMT -- oldpids: 16003
Investigating why this appears to be longer than the stated 5m value.
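For reference, the timeout under discussion is haproxy's `timeout tunnel` directive; a minimal sketch of where it might be set (the router's generated config may place it differently):

defaults
    mode http
    timeout tunnel 5m

It bounds the maximum inactivity time on the client and server side once a connection is in tunnel mode (e.g. websockets), and activity on the tunnel resets it.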
I misunderstood what `timeout tunnel 1h` actually means. I thought that, irrespective of active traffic, any old (post-reload) haproxy processes would be terminated after 1h. It is the opposite: only if there has been _zero_ activity for $DURATION can the process be terminated. I created some debugging scripts to show the behaviour: https://github.com/frobware/haproxy-hacks/commit/b3c67adf05defea58e3c83d16a97db96bd827038

Start up haproxy, reloading every 5s:

$ while :; do ~/haproxy-hacks/BZ1743291/reload-proxy ; sleep 5; done
Thu 31 Oct 2019 05:45:00 PM GMT -- oldpids:
Thu 31 Oct 2019 05:45:05 PM GMT -- oldpids: 8555
Thu 31 Oct 2019 05:45:10 PM GMT -- oldpids: 8574
Thu 31 Oct 2019 05:45:15 PM GMT -- oldpids: 8592
Thu 31 Oct 2019 05:45:20 PM GMT -- oldpids: 8610

Run the backend:

$ go run server.go

Make some connections:

$ go run client.go

From the constant reloading we see a PID that appears to be stuck/persistent:

Thu 31 Oct 2019 05:46:06 PM GMT -- oldpids: 8794
Thu 31 Oct 2019 05:46:11 PM GMT -- oldpids: 8815
Thu 31 Oct 2019 05:46:16 PM GMT -- oldpids: 8834
Thu 31 Oct 2019 05:46:21 PM GMT -- oldpids: 8853
...
Fri 01 Nov 2019 08:29:55 AM GMT -- oldpids: 20369 2279
Fri 01 Nov 2019 08:30:00 AM GMT -- oldpids: 20369 2304
Fri 01 Nov 2019 08:30:05 AM GMT -- oldpids: 20369 2327
Fri 01 Nov 2019 08:30:10 AM GMT -- oldpids: 20369 2348
Fri 01 Nov 2019 08:30:15 AM GMT -- oldpids: 20369 2373
Fri 01 Nov 2019 08:30:20 AM GMT -- oldpids: 20369 2394
Fri 01 Nov 2019 08:30:25 AM GMT -- oldpids: 20369 2415

If we adjust the 'timeout tunnel' to 30s and now connect with a different client, entering "hello" only (and not closing the connection):

$ websocat -E -n -t ws://127.0.0.1:4242/echo
hello
handled by "127.0.0.1:9000"; you said "hello\n"

We see PID 4581 hang around for ~30s:

Fri 01 Nov 2019 08:35:39 AM GMT -- oldpids: 20369 4507
Fri 01 Nov 2019 08:35:44 AM GMT -- oldpids: 20369 4541
Fri 01 Nov 2019 08:35:49 AM GMT -- oldpids: 20369 4563
Fri 01 Nov 2019 08:35:54 AM GMT -- oldpids: 20369 4581
Fri 01 Nov 2019 08:35:59 AM GMT -- oldpids: 20369 4631 4581
Fri 01 Nov 2019 08:36:04 AM GMT -- oldpids: 20369 4654 4581
Fri 01 Nov 2019 08:36:09 AM GMT -- oldpids: 20369 4680 4581
Fri 01 Nov 2019 08:36:14 AM GMT -- oldpids: 20369 4714 4581
Fri 01 Nov 2019 08:36:19 AM GMT -- oldpids: 20369 4733 4581
Fri 01 Nov 2019 08:36:24 AM GMT -- oldpids: 20369 4755
Fri 01 Nov 2019 08:36:29 AM GMT -- oldpids: 20369 4795

PID 4581 is eventually removed because there has been no traffic in the last 30s: we typed a single "hello" and then left the connection open and idle.

SOLUTION/FIX: a separate process/effort/entity needs to periodically clean up old haproxy PIDs that are still running longer than the configured tunnel timeout. These processes have established connections with active traffic (possibly websocket/keep-alive traffic?).

With respect to the memory size of the processes, it's not clear how the 1.4G value was obtained. Is this virtual memory (as reported in top) or resident? How/where was this number obtained? I experimented with ballooning memory requests to make memory leaks more obvious by creating an interposer library for malloc/calloc. By adding 1MB to each malloc() allocation I didn't see haproxy processes grow in size, particularly not the processes that hung around post-reload.

The library and usage can be found here: https://github.com/frobware/haproxy-hacks/tree/master/BZ1743291#memory-leaks
Note: you can kill all old haproxy processes using:

$ killall --older-than 1h haproxy

(In reply to Andrew McDermott from comment #7)
> Note: you can kill all old haproxy processes using:
>
> $ killall --older-than 1h haproxy

To be clear: additional work will be required to ensure that this doesn't kill every single haproxy process; the intent here was to highlight the `--older-than` option available in killall.

Based on the observations in https://bugzilla.redhat.com/show_bug.cgi?id=1743291#c6, I'm closing this as NOTABUG: processes that still have active connections will not be terminated.

There is the `hard-stop-after` option that can help auto-kill long-lived processes: https://cbonte.github.io/haproxy-dconv/1.7/configuration.html#3.1-hard-stop-after

"This may be used to ensure that the instance will quit even if connections remain opened during a soft-stop (for example with long timeouts for a proxy in tcp mode). It applies both in TCP and HTTP mode."

Using a 'timeout tunnel' of 30s, I see the old process living on and on (PID 9361 outlives the 30s timeout because its tunnelled connection still has active traffic):

$ while :; do ~/haproxy-hacks/BZ1743291/reload-proxy ; sleep 5; done
Fri 01 Nov 2019 07:04:52 PM GMT -- oldpids:
Fri 01 Nov 2019 07:04:57 PM GMT -- oldpids: 9302
Fri 01 Nov 2019 07:05:02 PM GMT -- oldpids: 9320
Fri 01 Nov 2019 07:05:07 PM GMT -- oldpids: 9338
Fri 01 Nov 2019 07:05:13 PM GMT -- oldpids: 9361
Fri 01 Nov 2019 07:05:18 PM GMT -- oldpids: 9475 9361
Fri 01 Nov 2019 07:05:23 PM GMT -- oldpids: 9499 9361
Fri 01 Nov 2019 07:05:28 PM GMT -- oldpids: 9518 9361
Fri 01 Nov 2019 07:05:33 PM GMT -- oldpids: 9540 9361
Fri 01 Nov 2019 07:05:38 PM GMT -- oldpids: 9561 9361
Fri 01 Nov 2019 07:05:43 PM GMT -- oldpids: 9580 9361
Fri 01 Nov 2019 07:05:48 PM GMT -- oldpids: 9604 9361
Fri 01 Nov 2019 07:05:53 PM GMT -- oldpids: 9632 9361
Fri 01 Nov 2019 07:05:58 PM GMT -- oldpids: 9652 9361
Fri 01 Nov 2019 07:06:03 PM GMT -- oldpids: 9670 9361
Fri 01 Nov 2019 07:06:08 PM GMT -- oldpids: 9693 9361
Fri 01 Nov 2019 07:06:13 PM GMT -- oldpids: 9714 9361
Fri 01 Nov 2019 07:06:18 PM GMT -- oldpids: 9732 9361
Fri 01 Nov 2019 07:06:23 PM GMT -- oldpids: 9750 9361
Fri 01 Nov 2019 07:06:28 PM GMT -- oldpids: 9770 9361
Fri 01 Nov 2019 07:06:33 PM GMT -- oldpids: 9789 9361
Fri 01 Nov 2019 07:06:39 PM GMT -- oldpids: 9808 9361
Fri 01 Nov 2019 07:06:44 PM GMT -- oldpids: 9827 9361
Fri 01 Nov 2019 07:06:49 PM GMT -- oldpids: 9851 9361
Fri 01 Nov 2019 07:06:54 PM GMT -- oldpids: 9873 9361
Fri 01 Nov 2019 07:06:59 PM GMT -- oldpids: 9891 9361
Fri 01 Nov 2019 07:07:04 PM GMT -- oldpids: 9915 9361

Adding:

global
  hard-stop-after 30s

$ while :; do ~/haproxy-hacks/BZ1743291/reload-proxy ; sleep 5; done
Fri 01 Nov 2019 07:10:08 PM GMT -- oldpids:
Fri 01 Nov 2019 07:10:13 PM GMT -- oldpids: 10782
Fri 01 Nov 2019 07:10:18 PM GMT -- oldpids: 10800
Fri 01 Nov 2019 07:10:23 PM GMT -- oldpids: 10819
Fri 01 Nov 2019 07:10:28 PM GMT -- oldpids: 10846
Fri 01 Nov 2019 07:10:34 PM GMT -- oldpids: 10871
Fri 01 Nov 2019 07:10:39 PM GMT -- oldpids: 10999 10871
Fri 01 Nov 2019 07:10:44 PM GMT -- oldpids: 11019 10871
Fri 01 Nov 2019 07:10:49 PM GMT -- oldpids: 11038 10871
Fri 01 Nov 2019 07:10:54 PM GMT -- oldpids: 11056 10871
Fri 01 Nov 2019 07:10:59 PM GMT -- oldpids: 11074 10871
Fri 01 Nov 2019 07:11:04 PM GMT -- oldpids: 11092 10871
Fri 01 Nov 2019 07:11:09 PM GMT -- oldpids: 11110
Fri 01 Nov 2019 07:11:14 PM GMT -- oldpids: 11133
Fri 01 Nov 2019 07:11:19 PM GMT -- oldpids: 11157

Note how PID 10871, soft-stopped at 07:10:34, is gone by 07:11:09, roughly 30s later, even though the tunnelled connection was still open.
So all of the behaviour we see is by design; the `hard-stop-after` option exists to mitigate this very problem. Perhaps this option can be introduced to help clean up long-lived processes. Maybe it shouldn't be applied every hour, but could be applied on, say, a 24h period.
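A minimal sketch of what that could look like in the router's haproxy configuration (the 24h value is illustrative):

global
    hard-stop-after 24h

With this set, any soft-stopped process still alive 24h after a reload is terminated unconditionally, open connections or not.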