Description of problem:
Core dumps are starting to appear in the /var/lib/haproxy/conf/ directory after the upgrade to version 3.11.154. This is happening on 4 different clusters.

Version-Release number of selected component (if applicable):
OpenShift 3.11.154

How reproducible:
Frequently

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Attaching one core dump and one sosreport of a host.
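For triage on an affected node, standard tooling can confirm which binary produced the cores and pull a quick backtrace. The commands below are a generic sketch, not taken from the report; the core filename and the haproxy binary path are illustrative and need matching debuginfo for useful symbols:

$ file /var/lib/haproxy/conf/core.*
$ gdb -q -batch -ex bt /usr/sbin/haproxy /var/lib/haproxy/conf/core.12345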
Could this issue be related to this old one? https://bugzilla.redhat.com/show_bug.cgi?id=1692583
As extracted from the sosreport, the version of the openshift-enterprise-haproxy-router-container image used in the cluster is v3.11.154-1, and the haproxy package used is haproxy18-1.8.17-3.el7.x86_64, as indicated in:
https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/ose-cli/images/v3.11.154-1
We have found this KCS article, which explains a similar situation: https://access.redhat.com/solutions/4010101
We are in the process of testing an older image.
Adding some data that might be interesting for RCA.

(gdb) bt
#0  __GI_raise (sig=997) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:88
#2  0x00007fa4a0d9cec7 in __libc_message (do_abort=2, fmt=0x7fa4a0eaf3f8 "next size (normal)") at ../sysdeps/unix/sysv/linux/libc_fatal.c:157
#3  0x00007fa4a0da56b9 in munmap_chunk (p=0x3e8) at malloc.c:2825
#4  free_atfork (mem=0x3e5, caller=<optimized out>) at arena.c:199
#5  0x00007fa49c1730a3 in ?? ()
#6  0x00007fa49c1730b0 in ?? ()
#7  0x3838353530303030 in ?? ()
#8  0x3064303766313138 in ?? ()
#9  0x00007fa49c173000 in ?? ()
#10 0x00007fa49c1730b0 in ?? ()
#11 0x0000000000000000 in ?? ()

(gdb) frame 0
#0  __GI_raise (sig=997) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55          return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);

(gdb) list 45,65
45      #include <errno.h>
46      #include <signal.h>
47      #include <stdarg.h>
48      #include <sys/resource.h>
49      #include <sys/wait.h>
50      #include <time.h>
51      #include <syslog.h>
52      #include <grp.h>
53      #ifdef USE_CPU_AFFINITY
54      #include <sched.h>
55      #ifdef __FreeBSD__
56      #include <sys/param.h>
57      #include <sys/cpuset.h>
58      #include <pthread_np.h>
59      #endif
60      #endif
61
62      #ifdef DEBUG_FULL
63      #include <assert.h>
64      #endif
65      #if defined(USE_SYSTEMD)

(gdb) frame 1
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:88
88          __libc_lock_unlock_recursive (lock);

(gdb) l 80,98
80      {
81        /* This stage is special: we must allow repeated calls of
82           `abort' when a user defined handler for SIGABRT is installed.
83           This is risky since the `raise' implementation might also
84           fail but I don't see another possibility. */
85        int save_stage = stage;
86
87        stage = 0;
88        __libc_lock_unlock_recursive (lock);
89
90        raise (SIGABRT);
91
92        __libc_lock_lock_recursive (lock);
93        stage = save_stage + 1;
94      }
95
96      /* There was a handler installed. Now remove it. */
97      if (stage == 3)
98      {
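A side note on the garbage frames (my reading, not stated in the original comment): the values in frames #7 and #8 are printable ASCII rather than code addresses, which suggests the stack was overwritten while glibc was formatting its error message. Decoding the little-endian bytes:

$ python3 -c "print((bytes.fromhex('3838353530303030')[::-1]+bytes.fromhex('3064303766313138')[::-1]).decode())"
00005588811f70d0

That string matches the buf value captured inside malloc_printerr in the full backtrace below, which supports the theory that this partially-corrupted trace and the clean one come from the same malloc-corruption abort path.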
The proposed workaround of using an older version of the router image (v3.11.69-3), which uses an older haproxy (haproxy18-1.8.14-2.el7.x86_64) not affected by this bug, is itself affected by another bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1660598
So the workaround is not viable.
Full backtrace:

(gdb) bt
#0  0x00007fa4a0d5a377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:90
#2  0x00007fa4a0d9cec7 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fa4a0eaf3f8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fa4a0da56b9 in malloc_printerr (ar_ptr=0x7fa4a10eb760 <main_arena>, ptr=<optimized out>, str=0x7fa4a0eaf520 "double free or corruption (!prev)", action=3) at malloc.c:4967
#4  _int_free (av=0x7fa4a10eb760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
#5  0x000055887f50f808 in deinit_log_buffers () at src/log.c:1387
#6  0x000055887f50d3cb in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2484
#7  0x00007fa4a1e64ea5 in start_thread (arg=0x7fa49c175700) at pthread_create.c:307
#8  0x00007fa4a0e228cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(gdb) bt full
#0  0x00007fa4a0d5a377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
        resultvar = 0
        pid = 997
        selftid = 1000
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7fa4a0d3676d, sa_sigaction = 0x7fa4a0d3676d}, sa_mask = {__val = {5, 140345051106417, 3, 140344970126990, 2, 140345051102759, 1, 140345051111473, 3, 140344970126964, 12, 140345051111477, 2, 140344970127776, 140344970127776, 140344970129536}}, sa_flags = 6, sa_restorer = 0x7fa49c172ba0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007fa4a0d9cec7 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fa4a0eaf3f8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
        ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fa49c173090, reg_save_area = 0x7fa49c172fa0}}
        ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fa49c173090, reg_save_area = 0x7fa49c172fa0}}
        fd = 2
        on_2 = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#3  0x00007fa4a0da56b9 in malloc_printerr (ar_ptr=0x7fa4a10eb760 <main_arena>, ptr=<optimized out>, str=0x7fa4a0eaf520 "double free or corruption (!prev)", action=3) at malloc.c:4967
        buf = "00005588811f70d0"
        cp = <optimized out>
        ar_ptr = 0x7fa4a10eb760 <main_arena>
        ptr = <optimized out>
        str = 0x7fa4a0eaf520 "double free or corruption (!prev)"
        action = 3
#4  _int_free (av=0x7fa4a10eb760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
        size = <optimized out>
        fb = <optimized out>
        nextchunk = <optimized out>
        nextsize = <optimized out>
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        errstr = 0x7fa4a0eaf520 "double free or corruption (!prev)"
        locked = <optimized out>
#5  0x000055887f50f808 in deinit_log_buffers () at src/log.c:1387
No locals.
#6  0x000055887f50d3cb in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2484
        ptif = <optimized out>
        ptdf = 0x5588808dad40
        start_lock = 0
#7  0x00007fa4a1e64ea5 in start_thread (arg=0x7fa49c175700) at pthread_create.c:307
        __res = <optimized out>
        pd = 0x7fa49c175700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140344970139392, -8544586916138423878, 0, 8392704, 0, 140344970139392, 8512576625762527674, 8512582523388455354}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#8  0x00007fa4a0e228cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.
I'm moving this to the RHSCL haproxy component since this is a crash in the haproxy18-1.8.17-3.el7 package from the rhaos-3.11-rhel-7 branch.

(gdb) bt
#0  0x00007fa4a0d5a377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:90
#2  0x00007fa4a0d9cec7 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fa4a0eaf3f8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fa4a0da56b9 in malloc_printerr (ar_ptr=0x7fa4a10eb760 <main_arena>, ptr=<optimized out>, str=0x7fa4a0eaf520 "double free or corruption (!prev)", action=3) at malloc.c:4967
#4  _int_free (av=0x7fa4a10eb760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
#5  0x000055887f50f808 in deinit_log_buffers () at src/log.c:1387
#6  0x000055887f50d3cb in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2484
#7  0x00007fa4a1e64ea5 in start_thread (arg=0x7fa49c175700) at pthread_create.c:307
#8  0x00007fa4a0e228cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Looks like perhaps this:

commit 7c49711d6041d1afc42d5b310ddfd7d6f6817c3c
Author: Olivier Houchard <ohouchard>
Date:   Thu Mar 7 14:19:24 2019 +0100

    BUG/MEDIUM: logs: Only attempt to free startup_logs once.

    deinit_log_buffers() can be called once per thread, however
    startup_logs is common to all threads. So only attempt to free it
    once.

    This should be backported to 1.9 and 1.8.

Mentioned here:

HAProxy 1.8.18/19 thread crashes during reload #58
https://github.com/haproxy/haproxy/issues/58

The fix doesn't appear to be in v1.8.17, and from looking at rhpkg there hasn't been anything backported to haproxy18-1.8.17-3.el7.

ptalbert@ocho:~/git/upstream/haproxy$ git tag --contains 7c49711d6041d1afc42d5b310ddfd7d6f6817c3c
v2.0-dev2
v2.0-dev3
v2.0-dev4
v2.0-dev5
v2.0-dev6
v2.0-dev7
v2.0.0
v2.1-dev0
v2.1-dev1
v2.1-dev2
v2.1-dev3
v2.1-dev4
v2.1-dev5
v2.1.0
v2.2-dev0
v2.2-dev1
v2.2-dev2

ptalbert@ocho:~/git/upstream/haproxy$ git tag --contains 4c82743abd299f0aa8105e98ec92b76375a7f344
v1.8.20
v1.8.21
v1.8.22
v1.8.23
v1.8.24

ptalbert@ocho:~/git/upstream/haproxy$ git tag --contains 6fb9fadc5311cd00beb522a35596d3aa63f24fdb
v1.8.20
v1.8.21
v1.8.22
v1.8.23
v1.8.24
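For context, the failure mode that commit message describes reduces to a pointer shared by all threads being freed once per thread. A minimal C sketch of the pattern and of a once-only fix (names and the atomic-exchange mechanism are illustrative only; see upstream commit 7c49711d for the real change):

    #include <stdlib.h>

    static char *startup_logs;        /* shared across threads, allocated once */
    static __thread char *logheader;  /* per-thread buffer */

    /* Buggy shape: deinit runs once per thread, so every thread frees the
     * shared pointer and the second free trips glibc's
     * "double free or corruption" abort. */
    void deinit_log_buffers_buggy(void)
    {
        free(logheader);     /* fine: thread-local */
        free(startup_logs);  /* BUG: executed once per thread */
    }

    /* Fixed shape: atomically claim the shared pointer so exactly one
     * thread frees it; later threads read NULL, and free(NULL) is a no-op. */
    void deinit_log_buffers_fixed(void)
    {
        free(logheader);
        free(__atomic_exchange_n(&startup_logs, NULL, __ATOMIC_SEQ_CST));
    }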
I think this should be under RHEL 7 and haproxy, if not in the 3.11.z OSE product, though it is not straightforward to me since it falls in the rhaos-3.11-rhel-7 branch, which seems to be OSE specific. Feel free to bounce it back to OSE 3.11.z if needed.
Build with the backported fix from 7c49711d included:

build (rhaos-3.11-rhel-7-candidate, /rpms/haproxy:38b8a5b6430c5231718749497bf7794fa81005a2)
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27328503

The builds completed successfully and are available via a scratch repo that I have stood up on my people.redhat.com page:

yum-config-manager --add-repo https://people.redhat.com/kwalker/repos/bz1810573/rhbz1810573.repo

If the team wants to go with this path, then testing in the environment should be as simple as injecting the above repo into a layered image and (I believe) pointing the cluster's "dc/router -n default" at the test build, as described in:
https://access.redhat.com/solutions/4010101
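A sketch of what that layered-image test could look like, following the KCS article. The base image name/tag and the availability of yum-config-manager inside the image are assumptions (if the tool is absent, dropping the .repo file into /etc/yum.repos.d/ works too); the repo URL is the one above:

    FROM registry.redhat.io/openshift3/ose-haproxy-router:v3.11.154
    USER 0
    RUN yum-config-manager --add-repo https://people.redhat.com/kwalker/repos/bz1810573/rhbz1810573.repo && \
        yum -y update haproxy18 && \
        yum clean all
    USER 1001

After pushing the image somewhere the cluster can pull from, the router can be pointed at it (the image reference is a placeholder):

$ oc -n default set image dc/router router=<registry>/<test-router-image>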
I tested this new build, but directly on my desktop, not a cluster. I was able to reproduce the SIGABRT on my desktop; after switching to the new build I was not able to generate a SIGABRT.

My scripts are captured here:
https://github.com/frobware/haproxy-hacks/tree/master/BZ1810573

These are my reproducer steps (assuming you take the scripts from my repo):

# Start our haproxy backends
$ server node.js &

# Start haproxy
$ ./reload-haproxy

# Verify that haproxy is running
$ pgrep haproxy

# I am using hey to generate load
$ hey -c 1000 -m GET -z 10h http://localhost:4242

# Repeatedly reload haproxy in a while-1 loop
$ ./while-1-reload

Now we look for core dumps from haproxy:

$ while :; do coredumpctl list | cat | grep -w 6; sleep 2; date; done
Thu 19 Mar 2020 04:40:10 PM GMT
Thu 2020-03-19 15:27:19 GMT 25111 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 15:29:11 GMT 12877 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT 13511 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT 13535 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT 13547 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT 13559 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT 13645 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT 13633 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy

I let this run and run, accumulating SIGABRTs:

Thu 2020-03-19 16:45:02 GMT 7019 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT 6983 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT 7001 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT 7097 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:46:12 GMT 28840 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy

At 16:47 I then switched to the new binary. To do this, swap out the line in the reload script that has:

: ${HAPROXY:=$topdir/haproxy18-1.8.17-3.el7.x86_64/haproxy}

for:

: ${HAPROXY:=$topdir/haproxy18-1.8.17-4.el7.x86_64/haproxy}

I then let this run for 1 hour and did not see haproxy core dump with signal 6 (ABRT) until "Thu 2020-03-19 17:50:17", which was the time I switched back to the old binary ("haproxy18-1.8.17-3").
Thu 2020-03-19 16:45:02 GMT 7019 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT 6983 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT 7001 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT 7097 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:46:12 GMT 28840 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:50:17 GMT 19924 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy

And letting this run and run with haproxy18-1.8.17-3 I see:

Thu 19 Mar 2020 06:09:58 PM GMT
Thu 2020-03-19 17:50:17 GMT 19924 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:13 GMT 6218 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:36 GMT 13489 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:37 GMT 13453 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:50 GMT 17775 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:50 GMT 17797 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:52:54 GMT 6079 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:53:21 GMT 14453 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:54:04 GMT 27993 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:54:04 GMT 28015 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:54:06 GMT 28518 1000 1000 6 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy

Looking at the last SIGABRT in gdb I see:

$ coredumpctl gdb 29965
           PID: 29965 (haproxy)
           UID: 1000 (aim)
           GID: 1000 (aim)
        Signal: 6 (ABRT)
     Timestamp: Thu 2020-03-19 18:01:02 GMT (9min ago)
  Command Line: /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy -f /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy.cfg -p /var/tmp/haproxy/run/haproxy.pid -x /var/tmp/haproxy/run/haproxy.sock -sf 29945 29916 29898 29880 29862 29844 29826 29808 29790 29772 29754 29735 15785
    Executable: /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
 Control Group: /user.slice/user-1000.slice/user/gnome-terminal-server.service
          Unit: user
     User Unit: gnome-terminal-server.service
         Slice: user-1000.slice
     Owner UID: 1000 (aim)
       Boot ID: 9cdcee89fdea41ad960e682904ffef7d
    Machine ID: 3ff4ac170c9447d4a10d272740ea39c8
      Hostname: spicy
       Storage: /var/lib/systemd/coredump/core.haproxy.1000.9cdcee89fdea41ad960e682904ffef7d.29965.1584640862000000.lz4
       Message: Process 29965 (haproxy) of user 1000 dumped core.

                Stack trace of thread 29969:
                #0  0x00007fd17b101e35 __GI_raise (libc.so.6)
                #1  0x00007fd17b0ec895 __GI_abort (libc.so.6)
                #2  0x00007fd17b14508f __libc_message (libc.so.6)
                #3  0x00007fd17b14c40c malloc_printerr (libc.so.6)
                #4  0x00007fd17b14de74 _int_free (libc.so.6)
                #5  0x00007fd17b1510bb tcache_thread_shutdown (libc.so.6)
                #6  0x00007fd17b69a4e6 start_thread (libpthread.so.0)
                #7  0x00007fd17b1c6163 __clone (libc.so.6)

                Stack trace of thread 29975:
                #0  0x00007fd17b6a374d __lll_lock_wait (libpthread.so.0)
                #1  0x00007fd17b69cdc4 __GI___pthread_mutex_lock (libpthread.so.0)
                #2  0x00007fd17b74e7ca _dl_open (ld-linux-x86-64.so.2)
                #3  0x00007fd17b2002e1 do_dlopen (libc.so.6)
                #4  0x00007fd17b200e09 __GI__dl_catch_exception (libc.so.6)
                #5  0x00007fd17b200ea3 __GI__dl_catch_error (libc.so.6)
                #6  0x00007fd17b2003e7 dlerror_run (libc.so.6)
                #7  0x00007fd17b20047a __GI___libc_dlopen_mode (libc.so.6)
                #8  0x00007fd17b6a524b pthread_cancel_init (libpthread.so.0)
                #9  0x00007fd17b6a5464 _Unwind_ForcedUnwind (libpthread.so.0)
                #10 0x00007fd17b6a3566 __GI___pthread_unwind (libpthread.so.0)
                #11 0x00007fd17b69b7b2 __do_cancel (libpthread.so.0)
                #12 0x000055be4631e485 n/a (/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy)

GNU gdb (GDB) Fedora 8.3-7.fc30
Reading symbols from /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy...
Missing separate debuginfo for /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/d0/de6c87b2ee9a0f0afc9b1ad1283b64c73793f9.debug
Reading symbols from .gnu_debugdata for /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy...
(No debugging symbols found in .gnu_debugdata for /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy)
[New LWP 29969]
[New LWP 29975]
[New LWP 29970]
[New LWP 29974]
[New LWP 29972]
[New LWP 29977]
[New LWP 29965]
[New LWP 29976]
[New LWP 29968]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50          return ret;
[Current thread is 1 (Thread 0x7fd179bd9700 (LWP 29969))]

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fd17b0ec895 in __GI_abort () at abort.c:79
#2  0x00007fd17b14508f in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fd17b253a7e "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fd17b14c40c in malloc_printerr (str=str@entry=0x7fd17b255728 "double free or corruption (fasttop)") at malloc.c:5366
#4  0x00007fd17b14de74 in _int_free (av=0x7fd17b28aba0 <main_arena>, p=0x55be4683fb10, have_lock=<optimized out>) at malloc.c:4278
#5  0x00007fd17b1510bb in tcache_thread_shutdown () at malloc.c:2978
#6  __malloc_arena_thread_freeres () at arena.c:952
#7  0x00007fd17b1543e0 in __libc_thread_freeres () at thread-freeres.c:38
#8  0x00007fd17b69a4e6 in start_thread (arg=<optimized out>) at pthread_create.c:493
#9  0x00007fd17b1c6163 in clone () at ../sysdeps/unix/sysv/

This abort comes from tcache_thread_shutdown(), not deinit_log_buffers() as mentioned in comment #12; comment #9, though, shows a SIGABRT with a different trace. So, although the stack trace is not identical to the one the patch addresses, I'm seeing that the new build fixes the SIGABRT.

Conclusion:
-- I ran haproxy18-1.8.17-3 for ~45 mins and saw many SIGABRTs.
-- I ran haproxy18-1.8.17-4 for ~45 mins and saw ZERO SIGABRTs.
-- I ran haproxy18-1.8.17-3 again and immediately saw many SIGABRTs.

Other observations: I'm driving reload way harder/faster than we normally would, and I also see quite a lot of SEGVs in either version (-3 or -4):

Thu 2020-03-19 17:56:29 GMT 10738 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:56:34 GMT 11719 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:56:37 GMT 12862 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:56:40 GMT 13620 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:56:45 GMT 15325 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:57:05 GMT 21969 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:57:37 GMT 31938 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:58:35 GMT 18224 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:58:35 GMT 18205 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:58:36 GMT 18537 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:58:38 GMT 19113 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:59:08 GMT 28852 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:59:28 GMT 2588 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 18:00:13 GMT 14684 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy

Switching to haproxy18-1.8.17-4 I also see SEGVs:

Thu 2020-03-19 18:01:05 GMT 30665 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 18:37:20 GMT 9801 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:27 GMT 11912 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:37 GMT 15092 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:38 GMT 15294 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:48 GMT 18500 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:50 GMT 18960 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:51 GMT 19480 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy

I don't want to muddy the waters, though, because it seems to me that haproxy18-1.8.17-4 fixes the SIGABRTs. For completeness, here is a SEGV backtrace from each build:

$ coredumpctl gdb 30665
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000056273701bc33 in listener_accept ()
[Current thread is 1 (Thread 0x7f20c666f700 (LWP 30672))]
(gdb) bt
#0  0x000056273701bc33 in listener_accept ()
#1  0x000056273703e10f in fd_process_cached_events ()
#2  0x0000562736fea2f2 in run_thread_poll_loop ()
#3  0x00007f20ca1344c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007f20c9c60163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

$ coredumpctl gdb 19480
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005557c4edecb3 in listener_accept ()
[Current thread is 1 (Thread 0x7fb4add98700 (LWP 19490))]
(gdb) bt
#0  0x00005557c4edecb3 in listener_accept ()
#1  0x00005557c4f0118f in fd_process_cached_events ()
#2  0x00005557c4ead342 in run_thread_poll_loop ()
#3  0x00007fb4b30604c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007fb4b2b8c163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Is the SIGABRT the only core dump the customer is seeing?
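(For anyone re-running this locally, the reproducer above condenses to roughly the following. The ports, script names, and the hey invocation come from the steps in this comment; everything else is an illustrative sketch, and it assumes the scripts from the frobware/haproxy-hacks repo are on PATH or in the current directory.)

#!/bin/bash
# Reproducer sketch for the reload-under-load SIGABRT.
server node.js &                                    # test backends (from the repo scripts)
./reload-haproxy                                    # start haproxy
hey -c 1000 -m GET -z 10h http://localhost:4242 &   # sustained client load
./while-1-reload &                                  # hammer reloads in a loop
while :; do                                         # watch for SIGABRT (signal 6) cores
    coredumpctl list | grep -w 6
    sleep 2; date
done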
Thanks Kyle for the quick patch! Andrew, thanks a lot for your extensive tests and troubleshooting. Answering your question: the customer sees the same behavior you describe in your tests. We only took one core to analyze, as they were being generated quite often, but we can take some others to see if we are missing something else. Anyway, as you say, it looks like this may solve the issue. Let me get some more cores to analyze.
Following Kyle's lead, I've created a new haproxy 1.8 build which adds the following commit to the rhaos-3.11-rhel-7 branch:
http://git.haproxy.org/?p=haproxy.git;a=patch;h=7c49711d6041d1afc42d5b310ddfd7d6f6817c3c

Here's the build and repo:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27428159
http://brew-task-repos.usersys.redhat.com/repos/scratch/dmace/haproxy/1.8.17/4.el7/haproxy-1.8.17-4.el7-scratch.repo

I verified the haproxy binary is byte-equivalent to Kyle's, but haven't yet run Andrew's testing against the new binary.

Here's an OpenShift 3.11 PR which incorporates the scratch RPM to test for regressions:
https://github.com/openshift/origin/pull/24756

It would be useful if the customer could test Kyle's public RPM to ensure it solves their problem. If the new build has no regressions, passes Andrew's manual testing, and ideally passes the customer's testing, I'd feel pretty comfortable pushing out a new build into the supported haproxy image.
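For reference, one way to check that kind of byte-equivalence between two scratch RPMs (a sketch; the extracted binary path may differ for the SCL-style haproxy18 package):

$ mkdir a b
$ (cd a && rpm2cpio ../kyle/haproxy18-1.8.17-4.el7.x86_64.rpm | cpio -idm)
$ (cd b && rpm2cpio ../dmace/haproxy18-1.8.17-4.el7.x86_64.rpm | cpio -idm)
$ cmp a/usr/sbin/haproxy b/usr/sbin/haproxy && echo identical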
Now that I know where the upstream 1.8 source actually lives, I can see that the fix was already backported into the 1.8 tree:
http://git.haproxy.org/?p=haproxy-1.8.git;a=commit;h=4c82743abd299f0aa8105e98ec92b76375a7f344

We already have a patch pending to upgrade to 1.8.23 (https://github.com/openshift/origin/pull/24533), so let's finish that work and fix both bugs in the process.
The haproxy 1.8.23 build is here:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1145906

Will work on getting the new router image released.
I took the RPMs from this build:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1145906

I updated my scripts (https://github.com/frobware/haproxy-hacks/tree/master/BZ1810573) to select haproxy-1.8.23 by default.

I reran the testing I did in comment #19 for 30 mins and saw zero SIGABRTs. I still see some SEGVs, but with 1.8.23 the rate is reduced:

Starting at: Tue 24 Mar 2020 06:00:00 PM GMT

$ while :; do coredumpctl list | grep haproxy; sleep 2; date; done
Fri 2020-03-20 12:52:54 GMT 22408 1000 1000 11 missing /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Tue 2020-03-24 18:07:09 GMT 2948 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.x86_64/haproxy
Tue 2020-03-24 18:08:35 GMT 12941 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.x86_64/haproxy
Tue 2020-03-24 18:18:52 GMT 25386 1000 1000 11 present /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.x86_64/haproxy

Looking at each SEGV I see:

$ coredumpctl gdb 2948
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000564ef4d07873 in listener_accept ()
[Current thread is 1 (Thread 0x7f16e2183700 (LWP 2953))]
(gdb) bt
#0  0x0000564ef4d07873 in listener_accept ()
#1  0x0000564ef4d2a88f in fd_process_cached_events ()
#2  0x0000564ef4cd659a in run_thread_poll_loop ()
#3  0x00007f16e3c454c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007f16e3771163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

$ coredumpctl gdb 12941
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005643e98bb873 in listener_accept ()
[Current thread is 1 (Thread 0x7fd501579700 (LWP 12943))]
(gdb) bt
#0  0x00005643e98bb873 in listener_accept ()
#1  0x00005643e98de88f in fd_process_cached_events ()
#2  0x00005643e988a59a in run_thread_poll_loop ()
#3  0x00007fd50283a4c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007fd502366163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

$ coredumpctl gdb 25386
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000559c82e6e873 in listener_accept ()
[Current thread is 1 (Thread 0x7fe6eea10700 (LWP 25388))]
(gdb) bt
#0  0x0000559c82e6e873 in listener_accept ()
#1  0x0000559c82e9188f in fd_process_cached_events ()
#2  0x0000559c82e3d59a in run_thread_poll_loop ()
#3  0x00007fe6efcd14c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007fe6ef7fd163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
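A quick way to tally what coredumpctl has accumulated, by signal and binary, while these runs are going (the awk field positions match the list output shown above; this is an illustrative one-liner, not from the original comment):

$ coredumpctl list --no-legend | awk '{print $(NF-2), $NF}' | sort | uniq -c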
I bumped the PR that is running origin CI against haproxy-1.8.23 to use the build from:
- https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1145906

New commits in the PR:
- https://github.com/openshift/origin/pull/24533/commits/4e1ee2a536626f89ee52d90f5e085a49c74bc248

The purpose of the origin PR is to provide some sanity testing against 1.8.23.
haproxy18-1.8.23-2 will be shipped in openshift-enterprise-haproxy-router-container-v3.11.195-1.
Verified with "atomic-openshift-3.11.200-1.git.0.ecc3aa3.el7.x86_64" and it passed. The haproxy package has been updated to "haproxy18-1.8.23-2.el7.x86_64". The tested HAProxy version shows no regressions.
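For completeness, the package version can be confirmed inside a running router pod like this (the pod name is illustrative, and the deploymentconfig label selector is an assumption):

$ oc -n default get pods -l deploymentconfig=router
$ oc -n default exec router-1-abcde -- rpm -q haproxy18
haproxy18-1.8.23-2.el7.x86_64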
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1285