Description of problem:

The customer sees one of the haproxy processes repeatedly being killed by the kernel OOM killer, but there are many haproxy processes (13 of them). Some of them have been running for more than 3 hours even though "timeout tunnel 1h" is configured. My understanding was that this option disconnects all connections after 1 hour, even established ones, after which the process should shut down. How can I restrict how long old haproxy processes keep running?

Version-Release number of selected component (if applicable):

How reproducible:
It happens at the customer site, but I could not reproduce it.

Steps to Reproduce:
1.
2.
3.

Actual results:
Many haproxy processes are still running after the timeout.

Expected results:
Only a few haproxy processes are running within the timeout (1h).

Additional info:

haproxy processes:
~~~
1000000+  18841 100  3.8 1532768 1253540 ? Ssl 13:08  31:52 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1910 1896 1808 1560 1364 1336 836 696 270
1000000+  29556 39.3 3.8 1531348 1245524 ? Ssl 13:31   3:34 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1938 1924 1910 1560 1364 1336 836 696 270
1000000+  30384 47.2 3.7 1524476 1231048 ? Ssl 13:33   3:18 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1952 1938 1924 1910 1560 1364 1336 836 696 270
1000000+  31215 64.5 3.7 1522756 1220768 ? Ssl 13:35   3:12 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1966 1952 1938 1924 1910 1560 1364 1336 836 696 270
1000000+  32177 36.5 3.7 1519532 1224764 ? Ssl 08:26 114:36 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1546 1532 1378 1364 1336 836 696 270
1000000+  36476 121  3.7 1534556 1243476 ? Ssl 13:37   3:36 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1980 1966 1952 1938 1924 1910 1560 1364 1336 836 696 270
1000000+  37452 142  3.9 1545760 1279024 ? Ssl 13:39   0:41 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1994 1980 1966 1952 1938 1924 1910 1560 1364 1336 836 696 270
1000000+  56818 11.6 3.7 1519700 1230852 ? Ssl Aug15 176:20 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 682 668 612 598 584 495 463 353 270
1000000+  57732 27.2 3.8 1532008 1259488 ? Ssl 04:16 153:33 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1350 1336 836 696 270
1000000+  97148 11.7 3.7 1519948 1229212 ? Ssl Aug15 166:07 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 822 808 794 780 766 752 738 710 696 270
1000000+ 113878 28.0 3.7 1522832 1231596 ? Ssl 01:16 208:28 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1318 836 696 270
1000000+ 122472 104  3.7 1530860 1241156 ? Ssl 12:04  99:25 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 1896 1878 1864 1836 1808 1794 1776 1724 1696 1682 1560 1364 1336 836 696 270
1000000+ 123785 5.6  3.7 1517188 1215052 ? Ssl Aug15  93:03 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -x /var/lib/haproxy/run/haproxy.sock -sf 21
~~~

audit.log shows a memory violation (sig 6):
~~~
./var/log/audit/audit.log:type=ANOM_ABEND msg=audit(1565975876.607:679473): auid=4294967295 uid=1000000000 gid=0 ses=4294967295 subj=system_u:system_r:container_t:s0:c0,c1 pid=121726 comm="haproxy" reason="memory violation" sig=6
~~~
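For context, a minimal sketch of how `timeout tunnel` is typically declared in an haproxy configuration; the surrounding settings are illustrative and not taken from the customer's actual haproxy.config:

~~~
defaults
  mode http
  timeout client  30s
  timeout server  30s
  # Applies once a connection has been switched to tunnel mode
  # (e.g. after a WebSocket upgrade): it is an inactivity timeout,
  # measured from the last byte transferred, not a wall-clock limit
  # on how long the old process may live.
  timeout tunnel  1h
~~~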
Can we confirm the version of haproxy in use by running `haproxy -vv` from the pod that is running the router?
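For example, assuming `oc` access to the router pod (the namespace and pod name below are placeholders):

~~~
$ oc exec -n <router-namespace> <router-pod> -- haproxy -vv
~~~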
I believe I do see a case where, although we set 'timeout tunnel 5m', the old haproxy does not disappear.

Reloading haproxy every 5s:

~~~
Wed 30 Oct 2019 05:06:11 PM GMT -- oldpids: 10529
Wed 30 Oct 2019 05:06:16 PM GMT -- oldpids: 10565
Wed 30 Oct 2019 05:06:21 PM GMT -- oldpids: 10600
Wed 30 Oct 2019 05:06:26 PM GMT -- oldpids: 10634
Wed 30 Oct 2019 05:06:31 PM GMT -- oldpids: 10782
Wed 30 Oct 2019 05:06:36 PM GMT -- oldpids: 10818
Wed 30 Oct 2019 05:06:41 PM GMT -- oldpids: 11083 10818
Wed 30 Oct 2019 05:06:46 PM GMT -- oldpids: 11116 10818
Wed 30 Oct 2019 05:06:51 PM GMT -- oldpids: 11146 10818
Wed 30 Oct 2019 05:06:56 PM GMT -- oldpids: 11188 10818
...
Wed 30 Oct 2019 05:14:51 PM GMT -- oldpids: 14792 10818
Wed 30 Oct 2019 05:14:56 PM GMT -- oldpids: 14827 10818
Wed 30 Oct 2019 05:15:01 PM GMT -- oldpids: 14864 10818
Wed 30 Oct 2019 05:15:06 PM GMT -- oldpids: 14897 10818
Wed 30 Oct 2019 05:15:11 PM GMT -- oldpids: 14936 10818
Wed 30 Oct 2019 05:15:16 PM GMT -- oldpids: 14975 10818
~~~

More than 5m later pid 10818 is still lingering - it does have established connections:

~~~
$ lsof -p 10818
lsof: WARNING: can't stat() fuse file system /keybase
      Output information may be incomplete.
COMMAND   PID USER  FD      TYPE   DEVICE SIZE/OFF     NODE NAME
haproxy 10818  aim cwd       DIR     8,49    40960  3670017 /home/aim
haproxy 10818  aim rtd       DIR    253,0     4096        2 /
haproxy 10818  aim txt       REG     8,49  8382936 11929158 /home/aim/haproxy-1.8/haproxy
haproxy 10818  aim mem       REG    253,0  6697832  1447906 /usr/lib64/libc-2.29.so
haproxy 10818  aim mem       REG    253,0   823192  1448140 /usr/lib64/libpthread-2.29.so
haproxy 10818  aim mem       REG    253,0    61872  1447908 /usr/lib64/libdl-2.29.so
haproxy 10818  aim mem       REG    253,0   260200  1462867 /usr/lib64/libcrypt.so.2.0.0
haproxy 10818  aim mem       REG    253,0   436736  1443329 /usr/lib64/ld-2.29.so
haproxy 10818  aim   0u      CHR      1,3      0t0     1031 /dev/null
haproxy 10818  aim   1u      CHR      1,3      0t0     1031 /dev/null
haproxy 10818  aim   2u      CHR      1,3      0t0     1031 /dev/null
haproxy 10818  aim   3u  a_inode     0,14        0    13328 [eventpoll]
haproxy 10818  aim   4u     IPv4 21803422      0t0      UDP *:50737
haproxy 10818  aim   7r     FIFO     0,13      0t0 21802735 pipe
haproxy 10818  aim   8w     FIFO     0,13      0t0 21802735 pipe
haproxy 10818  aim   9u  a_inode     0,14        0    13328 [eventpoll]
haproxy 10818  aim  10u  a_inode     0,14        0    13328 [eventpoll]
haproxy 10818  aim  11u  a_inode     0,14        0    13328 [eventpoll]
haproxy 10818  aim  13u     IPv4 21798818      0t0      TCP localhost:4242->localhost:45632 (ESTABLISHED)
haproxy 10818  aim  14u     IPv4 21798820      0t0      TCP localhost:50362->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim  18u     IPv4 21795782      0t0      UDP *:34593
haproxy 10818  aim  19u     IPv4 21795785      0t0      TCP localhost:4242->localhost:45642 (ESTABLISHED)
haproxy 10818  aim  22u     IPv4 21795790      0t0      UDP *:36178
haproxy 10818  aim  23u     IPv4 21795791      0t0      TCP localhost:50376->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim  29u     IPv4 21805253      0t0      UDP *:42369
haproxy 10818  aim  35u     IPv4 21806129      0t0      TCP localhost:4242->localhost:45670 (ESTABLISHED)
haproxy 10818  aim  38u     IPv4 21806133      0t0      TCP localhost:50406->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim  67u     IPv4 21795803      0t0      TCP localhost:4242->localhost:45736 (ESTABLISHED)
haproxy 10818  aim  71u     IPv4 21795807      0t0      TCP localhost:4242->localhost:45744 (ESTABLISHED)
haproxy 10818  aim  73u     IPv4 21795809      0t0      TCP localhost:4242->localhost:45748 (ESTABLISHED)
haproxy 10818  aim  77u     IPv4 21795814      0t0      TCP localhost:50484->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim  80u     IPv4 21795817      0t0      TCP localhost:50490->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim  83u     IPv4 21795819      0t0      TCP localhost:50496->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim  97u     IPv4 21806193      0t0      TCP localhost:4242->localhost:45796 (ESTABLISHED)
haproxy 10818  aim 100u     IPv4 21806198      0t0      TCP localhost:4242->localhost:45802 (ESTABLISHED)
haproxy 10818  aim 101u     IPv4 21806200      0t0      TCP localhost:50532->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim 102u     IPv4 21806201      0t0      TCP localhost:4242->localhost:45806 (ESTABLISHED)
haproxy 10818  aim 132u     IPv4 21806225      0t0      TCP localhost:4242->localhost:45862 (ESTABLISHED)
haproxy 10818  aim 141u     IPv4 21806233      0t0      TCP localhost:50614->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim 152u     IPv4 21806241      0t0      TCP localhost:50636->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim 153u     IPv4 21806242      0t0      TCP localhost:50638->localhost:cslistener (ESTABLISHED)
haproxy 10818  aim 161u     IPv4 21802809      0t0      TCP localhost:4242->localhost:45926 (ESTABLISHED)
haproxy 10818  aim 162u     IPv4 21802811      0t0      TCP localhost:50656->localhost:cslistener (ESTABLISHED)
~~~

Given the tunnel timeout of 5m I would expect these connections to be closed. They do close eventually, but it seems to take longer than 5m:

~~~
Wed 30 Oct 2019 05:16:42 PM GMT -- oldpids: 15575 10818
Wed 30 Oct 2019 05:16:47 PM GMT -- oldpids: 15612
Wed 30 Oct 2019 05:16:52 PM GMT -- oldpids: 15647
Wed 30 Oct 2019 05:16:58 PM GMT -- oldpids: 15685
Wed 30 Oct 2019 05:17:03 PM GMT -- oldpids: 15711
Wed 30 Oct 2019 05:17:08 PM GMT -- oldpids: 15760
Wed 30 Oct 2019 05:17:13 PM GMT -- oldpids: 16003
~~~

Investigating why this appears to be longer than the stated 5m value.
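For context, the oldpids lists above come from reloading haproxy with `-sf`, which tells the new process to ask the listed old processes to soft-stop (finish their existing connections) once it has bound its sockets. A minimal sketch of such a reload (paths illustrative, not the exact reload-proxy script):

~~~
$ /usr/sbin/haproxy -f haproxy.cfg -p haproxy.pid -sf $(cat haproxy.pid)
~~~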
I misunderstood what `timeout tunnel 1h` actually means. I thought that, irrespective of active traffic, any old (post-reload) haproxy processes would be terminated after 1h. It is the opposite: only if there has been _zero_ activity for $DURATION can the process be terminated.

I created some debugging scripts here to show the behaviour: https://github.com/frobware/haproxy-hacks/commit/b3c67adf05defea58e3c83d16a97db96bd827038

## Start up haproxy, reloading every 5 seconds

~~~
$ while :; do ~/haproxy-hacks/BZ1743291/reload-proxy ; sleep 5; done
Thu 31 Oct 2019 05:45:00 PM GMT -- oldpids:
Thu 31 Oct 2019 05:45:05 PM GMT -- oldpids: 8555
Thu 31 Oct 2019 05:45:10 PM GMT -- oldpids: 8574
Thu 31 Oct 2019 05:45:15 PM GMT -- oldpids: 8592
Thu 31 Oct 2019 05:45:20 PM GMT -- oldpids: 8610
~~~

Run the backend:

~~~
$ go run server.go
~~~

Make some connections:

~~~
$ go run client.go
~~~

From the constant reloading we see a PID that appears to be stuck/persistent:

~~~
Thu 31 Oct 2019 05:46:06 PM GMT -- oldpids: 8794
Thu 31 Oct 2019 05:46:11 PM GMT -- oldpids: 8815
Thu 31 Oct 2019 05:46:16 PM GMT -- oldpids: 8834
Thu 31 Oct 2019 05:46:21 PM GMT -- oldpids: 8853
...
Fri 01 Nov 2019 08:29:55 AM GMT -- oldpids: 20369 2279
Fri 01 Nov 2019 08:30:00 AM GMT -- oldpids: 20369 2304
Fri 01 Nov 2019 08:30:05 AM GMT -- oldpids: 20369 2327
Fri 01 Nov 2019 08:30:10 AM GMT -- oldpids: 20369 2348
Fri 01 Nov 2019 08:30:15 AM GMT -- oldpids: 20369 2373
Fri 01 Nov 2019 08:30:20 AM GMT -- oldpids: 20369 2394
Fri 01 Nov 2019 08:30:25 AM GMT -- oldpids: 20369 2415
~~~

If we adjust the tunnel timeout to 30s and now connect with a different client, entering "hello" only (and not closing the connection):

~~~
$ websocat -E -n -t ws://127.0.0.1:4242/echo
hello
handled by "127.0.0.1:9000"; you said "hello\n"
~~~

We see PID 4581 hang around for ~30s:

~~~
Fri 01 Nov 2019 08:35:39 AM GMT -- oldpids: 20369 4507
Fri 01 Nov 2019 08:35:44 AM GMT -- oldpids: 20369 4541
Fri 01 Nov 2019 08:35:49 AM GMT -- oldpids: 20369 4563
Fri 01 Nov 2019 08:35:54 AM GMT -- oldpids: 20369 4581
Fri 01 Nov 2019 08:35:59 AM GMT -- oldpids: 20369 4631 4581
Fri 01 Nov 2019 08:36:04 AM GMT -- oldpids: 20369 4654 4581
Fri 01 Nov 2019 08:36:09 AM GMT -- oldpids: 20369 4680 4581
Fri 01 Nov 2019 08:36:14 AM GMT -- oldpids: 20369 4714 4581
Fri 01 Nov 2019 08:36:19 AM GMT -- oldpids: 20369 4733 4581
Fri 01 Nov 2019 08:36:24 AM GMT -- oldpids: 20369 4755
Fri 01 Nov 2019 08:36:29 AM GMT -- oldpids: 20369 4795
~~~

PID 4581 is eventually removed because there has been no traffic in the last 30s - we typed a single "hello" and left the connection open.

SOLUTION/FIX: A separate process/effort/entity needs to periodically clean up old haproxy PIDs that have been running longer than the configured tunnel timeout. These processes have established connections with active traffic (possibly websocket/keep-alive traffic?).

With respect to the memory size of the processes, it's not clear how the 1.4G value was obtained. Is this virtual memory (as reported in top) or resident? How/where was this number obtained? I did experiment with ballooning memory requests to make memory leaks more obvious by creating an interposer library for malloc/calloc. By adding 1MB to each malloc() allocation I didn't see haproxy processes grow in size -- particularly not the processes that hung around post-reload. The library and usage can be found here: https://github.com/frobware/haproxy-hacks/tree/master/BZ1743291#memory-leaks
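On the memory question, one way to see both the virtual and resident sizes per process (an illustrative command, not how the 1.4G figure was obtained; VSZ and RSS are reported in KiB):

~~~
$ ps -C haproxy -o pid,vsz,rss,etime,args
~~~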
Note: you can kill all old haproxy processes using:

$ killall --older-than 1h haproxy
(In reply to Andrew McDermott from comment #7)
> Note: you can kill all old haproxy processes using:
>
> $ killall --older-than 1h haproxy

To be clear: additional work will be required to ensure that this doesn't kill every single haproxy process - the intent here was to highlight the `--older-than` option available in killall.
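A rough sketch of what that additional work might look like: a hypothetical cleanup script, not something that exists in the router today. The pid-file path matches the router command line above, and the current pid(s) are read from it so they are never killed:

~~~
#!/bin/sh
# Hypothetical sketch: terminate old haproxy processes that have outlived
# the configured tunnel timeout, skipping the current process(es) recorded
# in the pid file.
PIDFILE=/var/lib/haproxy/run/haproxy.pid
MAX_AGE=3600   # seconds; matches "timeout tunnel 1h"

current=$(cat "$PIDFILE" 2>/dev/null)

for pid in $(pgrep -x haproxy); do
  # never touch the pid(s) recorded in the pid file
  echo "$current" | grep -qw "$pid" && continue
  # elapsed time in seconds for this pid
  age=$(ps -o etimes= -p "$pid" | tr -d ' ')
  [ -n "$age" ] && [ "$age" -gt "$MAX_AGE" ] && kill -TERM "$pid"
done
~~~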
Based on the observations in https://bugzilla.redhat.com/show_bug.cgi?id=1743291#c6, I'm closing this as NOTABUG - processes that still have active connections will not be terminated.
There is the `hard-stop-after` option that can help auto-kill long-lived processes:

https://cbonte.github.io/haproxy-dconv/1.7/configuration.html#3.1-hard-stop-after

"This may be used to ensure that the instance will quit even if connections remain opened during a soft-stop (for example with long timeouts for a proxy in tcp mode). It applies both in TCP and HTTP mode."

Using a tunnel timeout of 30s I see the old process living on and on:

~~~
$ while :; do ~/haproxy-hacks/BZ1743291/reload-proxy ; sleep 5; done
Fri 01 Nov 2019 07:04:52 PM GMT -- oldpids:
Fri 01 Nov 2019 07:04:57 PM GMT -- oldpids: 9302
Fri 01 Nov 2019 07:05:02 PM GMT -- oldpids: 9320
Fri 01 Nov 2019 07:05:07 PM GMT -- oldpids: 9338
Fri 01 Nov 2019 07:05:13 PM GMT -- oldpids: 9361
Fri 01 Nov 2019 07:05:18 PM GMT -- oldpids: 9475 9361
Fri 01 Nov 2019 07:05:23 PM GMT -- oldpids: 9499 9361
Fri 01 Nov 2019 07:05:28 PM GMT -- oldpids: 9518 9361
Fri 01 Nov 2019 07:05:33 PM GMT -- oldpids: 9540 9361
Fri 01 Nov 2019 07:05:38 PM GMT -- oldpids: 9561 9361
Fri 01 Nov 2019 07:05:43 PM GMT -- oldpids: 9580 9361
Fri 01 Nov 2019 07:05:48 PM GMT -- oldpids: 9604 9361
Fri 01 Nov 2019 07:05:53 PM GMT -- oldpids: 9632 9361
Fri 01 Nov 2019 07:05:58 PM GMT -- oldpids: 9652 9361
Fri 01 Nov 2019 07:06:03 PM GMT -- oldpids: 9670 9361
Fri 01 Nov 2019 07:06:08 PM GMT -- oldpids: 9693 9361
Fri 01 Nov 2019 07:06:13 PM GMT -- oldpids: 9714 9361
Fri 01 Nov 2019 07:06:18 PM GMT -- oldpids: 9732 9361
Fri 01 Nov 2019 07:06:23 PM GMT -- oldpids: 9750 9361
Fri 01 Nov 2019 07:06:28 PM GMT -- oldpids: 9770 9361
Fri 01 Nov 2019 07:06:33 PM GMT -- oldpids: 9789 9361
Fri 01 Nov 2019 07:06:39 PM GMT -- oldpids: 9808 9361
Fri 01 Nov 2019 07:06:44 PM GMT -- oldpids: 9827 9361
Fri 01 Nov 2019 07:06:49 PM GMT -- oldpids: 9851 9361
Fri 01 Nov 2019 07:06:54 PM GMT -- oldpids: 9873 9361
Fri 01 Nov 2019 07:06:59 PM GMT -- oldpids: 9891 9361
Fri 01 Nov 2019 07:07:04 PM GMT -- oldpids: 9915 9361
~~~

Adding:

~~~
global
  hard-stop-after 30s
~~~

~~~
$ while :; do ~/haproxy-hacks/BZ1743291/reload-proxy ; sleep 5; done
Fri 01 Nov 2019 07:10:08 PM GMT -- oldpids:
Fri 01 Nov 2019 07:10:13 PM GMT -- oldpids: 10782
Fri 01 Nov 2019 07:10:18 PM GMT -- oldpids: 10800
Fri 01 Nov 2019 07:10:23 PM GMT -- oldpids: 10819
Fri 01 Nov 2019 07:10:28 PM GMT -- oldpids: 10846
Fri 01 Nov 2019 07:10:34 PM GMT -- oldpids: 10871
Fri 01 Nov 2019 07:10:39 PM GMT -- oldpids: 10999 10871
Fri 01 Nov 2019 07:10:44 PM GMT -- oldpids: 11019 10871
Fri 01 Nov 2019 07:10:49 PM GMT -- oldpids: 11038 10871
Fri 01 Nov 2019 07:10:54 PM GMT -- oldpids: 11056 10871
Fri 01 Nov 2019 07:10:59 PM GMT -- oldpids: 11074 10871
Fri 01 Nov 2019 07:11:04 PM GMT -- oldpids: 11092 10871
Fri 01 Nov 2019 07:11:09 PM GMT -- oldpids: 11110
Fri 01 Nov 2019 07:11:14 PM GMT -- oldpids: 11133
Fri 01 Nov 2019 07:11:19 PM GMT -- oldpids: 11157
~~~

So all of the behaviour we see is by design; the option exists to mitigate this very problem. Perhaps this option can be introduced to help clean up long-lived processes. Maybe it shouldn't be done every hour, but could be done on, say, a 24h period.
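For reference, a minimal sketch of how `hard-stop-after` could sit alongside the existing tunnel timeout (values illustrative, not a proposal for the actual router configuration):

~~~
global
  # upper bound on how long an old process may linger after a
  # reload/soft-stop, regardless of remaining active connections
  hard-stop-after 1h

defaults
  # inactivity timeout for tunnelled (e.g. WebSocket) connections
  timeout tunnel 1h
~~~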