Bug 1810573 - Openshift 3.11 router haproxy core dump
Summary: Openshift 3.11 router haproxy core dump
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Dan Mace
QA Contact: Brandon Perkins
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-05 14:25 UTC by Mario Abajo
Modified: 2023-09-07 22:12 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-07 17:45:54 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:1285 0 None None None 2020-04-07 17:45:59 UTC

Description Mario Abajo 2020-03-05 14:25:22 UTC
Description of problem:
Core dumps are starting to appear in the /var/lib/haproxy/conf/ directory after upgrading to version 3.11.154.
This is happening on 4 different clusters.

Version-Release number of selected component (if applicable):
Openshift 3.11.154

How reproducible:
frequently

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Attaching one core dump and one sosreport of a host

Comment 4 Mario Abajo 2020-03-05 15:57:56 UTC
Could this issue be related to this old one?

https://bugzilla.redhat.com/show_bug.cgi?id=1692583

Comment 5 Mario Abajo 2020-03-05 16:30:50 UTC
As extracted from the sosreport, the image used in the cluster is:
  openshift-enterprise-haproxy-router-container v3.11.154-1
The haproxy package it ships is "haproxy18-1.8.17-3.el7.x86_64", as indicated in:
  https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/ose-cli/images/v3.11.154-1

Comment 6 Mario Abajo 2020-03-06 16:00:50 UTC
We found this KCS article, which describes a similar situation:
https://access.redhat.com/solutions/4010101

We are in the process of testing an older image.

Comment 9 Victor Hernando 2020-03-17 16:03:40 UTC
Adding some data that might be interesting for RCA.

(gdb) bt     
#0  __GI_raise (sig=997) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:88
#2  0x00007fa4a0d9cec7 in __libc_message (do_abort=2, fmt=0x7fa4a0eaf3f8 "next size (normal)") at ../sysdeps/unix/sysv/linux/libc_fatal.c:157
#3  0x00007fa4a0da56b9 in munmap_chunk (p=0x3e8) at malloc.c:2825
#4  free_atfork (mem=0x3e5, caller=<optimized out>) at arena.c:199
#5  0x00007fa49c1730a3 in ?? ()
#6  0x00007fa49c1730b0 in ?? ()
#7  0x3838353530303030 in ?? ()
#8  0x3064303766313138 in ?? ()
#9  0x00007fa49c173000 in ?? ()
#10 0x00007fa49c1730b0 in ?? ()
#11 0x0000000000000000 in ?? ()

(gdb) frame 0
#0  __GI_raise (sig=997) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) list 45,65
45	#include <errno.h>
46	#include <signal.h>
47	#include <stdarg.h>
48	#include <sys/resource.h>
49	#include <sys/wait.h>
50	#include <time.h>
51	#include <syslog.h>
52	#include <grp.h>
53	#ifdef USE_CPU_AFFINITY
54	#include <sched.h>
55	#ifdef __FreeBSD__
56	#include <sys/param.h>
57	#include <sys/cpuset.h>
58	#include <pthread_np.h>
59	#endif
60	#endif
61	
62	#ifdef DEBUG_FULL
63	#include <assert.h>
64	#endif
65	#if defined(USE_SYSTEMD)

(gdb) frame 1
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:88
88	      __libc_lock_unlock_recursive (lock);
(gdb) l 80,98
80	    {
81	      /* This stage is special: we must allow repeated calls of
82		 `abort' when a user defined handler for SIGABRT is installed.
83		 This is risky since the `raise' implementation might also
84		 fail but I don't see another possibility.  */
85	      int save_stage = stage;
86	
87	      stage = 0;
88	      __libc_lock_unlock_recursive (lock);
89	
90	      raise (SIGABRT);
91	
92	      __libc_lock_lock_recursive (lock);
93	      stage = save_stage + 1;
94	    }
95	
96	  /* There was a handler installed.  Now remove it.  */
97	  if (stage == 3)
98	    {

Comment 10 Mario Abajo 2020-03-18 09:39:57 UTC
The proposed workaround, using an older router image (v3.11.69-3) that ships an older haproxy (haproxy18-1.8.14-2.el7.x86_64) not affected by this bug, runs into a different bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1660598

So the workaround is not viable.

Comment 12 Kyle Walker 2020-03-18 13:30:33 UTC
Full backtrace:

(gdb) bt
#0  0x00007fa4a0d5a377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:90
#2  0x00007fa4a0d9cec7 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fa4a0eaf3f8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fa4a0da56b9 in malloc_printerr (ar_ptr=0x7fa4a10eb760 <main_arena>, ptr=<optimized out>, str=0x7fa4a0eaf520 "double free or corruption (!prev)", action=3) at malloc.c:4967
#4  _int_free (av=0x7fa4a10eb760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
#5  0x000055887f50f808 in deinit_log_buffers () at src/log.c:1387
#6  0x000055887f50d3cb in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2484
#7  0x00007fa4a1e64ea5 in start_thread (arg=0x7fa49c175700) at pthread_create.c:307
#8  0x00007fa4a0e228cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(gdb)  bt full
#0  0x00007fa4a0d5a377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
        resultvar = 0
        pid = 997
        selftid = 1000
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7fa4a0d3676d, sa_sigaction = 0x7fa4a0d3676d}, sa_mask = {__val = {5, 140345051106417, 3, 140344970126990, 2, 140345051102759, 1, 140345051111473, 3, 140344970126964, 12, 
              140345051111477, 2, 140344970127776, 140344970127776, 140344970129536}}, sa_flags = 6, sa_restorer = 0x7fa49c172ba0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007fa4a0d9cec7 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fa4a0eaf3f8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
        ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fa49c173090, reg_save_area = 0x7fa49c172fa0}}
        ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fa49c173090, reg_save_area = 0x7fa49c172fa0}}
        fd = 2
        on_2 = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#3  0x00007fa4a0da56b9 in malloc_printerr (ar_ptr=0x7fa4a10eb760 <main_arena>, ptr=<optimized out>, str=0x7fa4a0eaf520 "double free or corruption (!prev)", action=3) at malloc.c:4967
        buf = "00005588811f70d0"
        cp = <optimized out>
        ar_ptr = 0x7fa4a10eb760 <main_arena>
        ptr = <optimized out>
        str = 0x7fa4a0eaf520 "double free or corruption (!prev)"
        action = 3
#4  _int_free (av=0x7fa4a10eb760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
        size = <optimized out>
        fb = <optimized out>
        nextchunk = <optimized out>
        nextsize = <optimized out>
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        errstr = 0x7fa4a0eaf520 "double free or corruption (!prev)"
        locked = <optimized out>
#5  0x000055887f50f808 in deinit_log_buffers () at src/log.c:1387
No locals.
#6  0x000055887f50d3cb in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2484
        ptif = <optimized out>
        ptdf = 0x5588808dad40
        start_lock = 0
#7  0x00007fa4a1e64ea5 in start_thread (arg=0x7fa49c175700) at pthread_create.c:307
        __res = <optimized out>
        pd = 0x7fa49c175700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140344970139392, -8544586916138423878, 0, 8392704, 0, 140344970139392, 8512576625762527674, 8512582523388455354}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {
              prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#8  0x00007fa4a0e228cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.

Comment 13 Patrick Talbert 2020-03-18 13:31:49 UTC
I'm moving this to the RHSCL haproxy component since this is a crash in the haproxy18-1.8.17-3.el7 package from the rhaos-3.11-rhel-7 branch.


(gdb) bt
#0  0x00007fa4a0d5a377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fa4a0d5ba68 in __GI_abort () at abort.c:90
#2  0x00007fa4a0d9cec7 in __libc_message (do_abort=do_abort@entry=2, 
    fmt=fmt@entry=0x7fa4a0eaf3f8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fa4a0da56b9 in malloc_printerr (ar_ptr=0x7fa4a10eb760 <main_arena>, ptr=<optimized out>, 
    str=0x7fa4a0eaf520 "double free or corruption (!prev)", action=3) at malloc.c:4967
#4  _int_free (av=0x7fa4a10eb760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
#5  0x000055887f50f808 in deinit_log_buffers () at src/log.c:1387
#6  0x000055887f50d3cb in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2484
#7  0x00007fa4a1e64ea5 in start_thread (arg=0x7fa49c175700) at pthread_create.c:307
#8  0x00007fa4a0e228cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


This looks like it may be the relevant upstream fix:

commit 7c49711d6041d1afc42d5b310ddfd7d6f6817c3c
Author: Olivier Houchard <ohouchard>
Date:   Thu Mar 7 14:19:24 2019 +0100

    BUG/MEDIUM: logs: Only attempt to free startup_logs once.
    
    deinit_log_buffers() can be called once per thread, however startup_logs
    is common to all threads. So only attempt to free it once.
    
    This should be backported to 1.9 and 1.8.



Mentioned here:

HAProxy 1.8.18/19 thread crashes during reload #58
https://github.com/haproxy/haproxy/issues/58



The fix doesn't appear to be in v1.8.17, and from looking at rhpkg nothing has been backported to haproxy18-1.8.17-3.el7.

ptalbert@ocho:~/git/upstream/haproxy$ git tag --contains 7c49711d6041d1afc42d5b310ddfd7d6f6817c3c
v2.0-dev2
v2.0-dev3
v2.0-dev4
v2.0-dev5
v2.0-dev6
v2.0-dev7
v2.0.0
v2.1-dev0
v2.1-dev1
v2.1-dev2
v2.1-dev3
v2.1-dev4
v2.1-dev5
v2.1.0
v2.2-dev0
v2.2-dev1
v2.2-dev2
ptalbert@ocho:~/git/upstream/haproxy$ git tag --contains 4c82743abd299f0aa8105e98ec92b76375a7f344
v1.8.20
v1.8.21
v1.8.22
v1.8.23
v1.8.24
ptalbert@ocho:~/git/upstream/haproxy$ git tag --contains 6fb9fadc5311cd00beb522a35596d3aa63f24fdb
v1.8.20
v1.8.21
v1.8.22
v1.8.23
v1.8.24

Comment 15 Kyle Walker 2020-03-18 13:42:52 UTC
I think this should be under RHEL 7 and haproxy, if not the 3.11.z OSE product, though it isn't straightforward to me since the package comes from the rhaos-3.11-rhel-7 branch, which seems to be OSE-specific. Feel free to bounce it back to OSE 3.11.z if needed.

Comment 17 Kyle Walker 2020-03-18 14:38:25 UTC
Build with the backported fix from 7c49711d included:

    build (rhaos-3.11-rhel-7-candidate, /rpms/haproxy:38b8a5b6430c5231718749497bf7794fa81005a2) | Task Info | Brew
    https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27328503

The builds completed successfully, and are available via a scratch repo that I have stood up on my people.redhat.com page:

    yum-config-manager --add-repo https://people.redhat.com/kwalker/repos/bz1810573/rhbz1810573.repo

If the team wants to go with this path, testing in the environment should be as simple as injecting the above repo into a layered image and (I believe) pointing the cluster's "dc/router -n default" at the test build as described in:

    https://access.redhat.com/solutions/4010101

Comment 19 Andrew McDermott 2020-03-19 18:46:01 UTC
I tested this new build, though directly on my desktop rather than on a
cluster. I was able to reproduce the SIGABRT with the old build; after
switching to the new build I could no longer generate one.

My scripts are captured here:

  https://github.com/frobware/haproxy-hacks/tree/master/BZ1810573

These are my reproducer steps (assuming you take the scripts from my repo):

# Start our haproxy backends
$ server node.js &

# Start haproxy
$ ./reload-haproxy

# Verify that haproxy is running
$ pgrep haproxy

# I am using hey to generate load 
$ hey -c 1000 -m GET -z 10h http://localhost:4242

# Repeatedly reload haproxy in a while-true loop:
$ ./while-1-reload

Now we look for coredumps from haproxy:

$ while :; do coredumpctl list | cat | grep -w 6; sleep 2; date; done

Thu 19 Mar 2020 04:40:10 PM GMT
Thu 2020-03-19 15:27:19 GMT   25111  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 15:29:11 GMT   12877  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT   13511  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT   13535  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT   13547  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT   13559  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT   13645  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:36:44 GMT   13633  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy

I let this keep running, accumulating SIGABRTs:

Thu 2020-03-19 16:45:02 GMT    7019  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT    6983  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT    7001  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT    7097  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:46:12 GMT   28840  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy

At 16:47 I then switched to the new binary. To do this, edit the
reload script, swapping out the line that has:

  : ${HAPROXY:=$topdir/haproxy18-1.8.17-3.el7.x86_64/haproxy}

for

  : ${HAPROXY:=$topdir/haproxy18-1.8.17-4.el7.x86_64/haproxy}

I then let this run for 1 hour and did not see a haproxy core dump with
signal 6 (SIGABRT) until "Thu 2020-03-19 17:50:17", which is when I
switched back to the old binary ("haproxy18-1.8.17-3").

Thu 2020-03-19 16:45:02 GMT    7019  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT    6983  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT    7001  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:45:03 GMT    7097  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 16:46:12 GMT   28840  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:50:17 GMT   19924  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy

Letting this continue to run with haproxy18-1.8.17-3, I see:

Thu 19 Mar 2020 06:09:58 PM GMT
Thu 2020-03-19 17:50:17 GMT   19924  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:13 GMT    6218  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:36 GMT   13489  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:37 GMT   13453  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:50 GMT   17775  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:51:50 GMT   17797  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:52:54 GMT    6079  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:53:21 GMT   14453  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:54:04 GMT   27993  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:54:04 GMT   28015  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:54:06 GMT   28518  1000  1000   6 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haprox

Looking at the last SIGABRT in gdb I see:

$ coredumpctl gdb 29965
           PID: 29965 (haproxy)
           UID: 1000 (aim)
           GID: 1000 (aim)
        Signal: 6 (ABRT)
     Timestamp: Thu 2020-03-19 18:01:02 GMT (9min ago)
  Command Line: /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy -f /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy.cfg -p /var/tmp/haproxy/run/haproxy.pid -x /var/tmp/haproxy/run/haproxy.sock -sf 29945 29916 29898 29880 29862 29844 29826 29808 29790 29772 29754 29735 15785
    Executable: /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
 Control Group: /user.slice/user-1000.slice/user/gnome-terminal-server.service
          Unit: user
     User Unit: gnome-terminal-server.service
         Slice: user-1000.slice
     Owner UID: 1000 (aim)
       Boot ID: 9cdcee89fdea41ad960e682904ffef7d
    Machine ID: 3ff4ac170c9447d4a10d272740ea39c8
      Hostname: spicy
       Storage: /var/lib/systemd/coredump/core.haproxy.1000.9cdcee89fdea41ad960e682904ffef7d.29965.1584640862000000.lz4
       Message: Process 29965 (haproxy) of user 1000 dumped core.
                
                Stack trace of thread 29969:
                #0  0x00007fd17b101e35 __GI_raise (libc.so.6)
                #1  0x00007fd17b0ec895 __GI_abort (libc.so.6)
                #2  0x00007fd17b14508f __libc_message (libc.so.6)
                #3  0x00007fd17b14c40c malloc_printerr (libc.so.6)
                #4  0x00007fd17b14de74 _int_free (libc.so.6)
                #5  0x00007fd17b1510bb tcache_thread_shutdown (libc.so.6)
                #6  0x00007fd17b69a4e6 start_thread (libpthread.so.0)
                #7  0x00007fd17b1c6163 __clone (libc.so.6)
                
                Stack trace of thread 29975:
                #0  0x00007fd17b6a374d __lll_lock_wait (libpthread.so.0)
                #1  0x00007fd17b69cdc4 __GI___pthread_mutex_lock (libpthread.so.0)
                #2  0x00007fd17b74e7ca _dl_open (ld-linux-x86-64.so.2)
                #3  0x00007fd17b2002e1 do_dlopen (libc.so.6)
                #4  0x00007fd17b200e09 __GI__dl_catch_exception (libc.so.6)
                #5  0x00007fd17b200ea3 __GI__dl_catch_error (libc.so.6)
                #6  0x00007fd17b2003e7 dlerror_run (libc.so.6)
                #7  0x00007fd17b20047a __GI___libc_dlopen_mode (libc.so.6)
                #8  0x00007fd17b6a524b pthread_cancel_init (libpthread.so.0)
                #9  0x00007fd17b6a5464 _Unwind_ForcedUnwind (libpthread.so.0)
                #10 0x00007fd17b6a3566 __GI___pthread_unwind (libpthread.so.0)
                #11 0x00007fd17b69b7b2 __do_cancel (libpthread.so.0)
                #12 0x000055be4631e485 n/a (/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy)

GNU gdb (GDB) Fedora 8.3-7.fc30
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy...
Missing separate debuginfo for /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/d0/de6c87b2ee9a0f0afc9b1ad1283b64c73793f9.debug
Reading symbols from .gnu_debugdata for /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy...
(No debugging symbols found in .gnu_debugdata for /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy)
[New LWP 29969]
[New LWP 29975]
[New LWP 29970]
[New LWP 29974]
[New LWP 29972]
[New LWP 29977]
[New LWP 29965]
[New LWP 29976]
[New LWP 29968]

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	  return ret;
[Current thread is 1 (Thread 0x7fd179bd9700 (LWP 29969))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fd17b0ec895 in __GI_abort () at abort.c:79
#2  0x00007fd17b14508f in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fd17b253a7e "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fd17b14c40c in malloc_printerr (str=str@entry=0x7fd17b255728 "double free or corruption (fasttop)") at malloc.c:5366
#4  0x00007fd17b14de74 in _int_free (av=0x7fd17b28aba0 <main_arena>, p=0x55be4683fb10, have_lock=<optimized out>) at malloc.c:4278
#5  0x00007fd17b1510bb in tcache_thread_shutdown () at malloc.c:2978
#6  __malloc_arena_thread_freeres () at arena.c:952
#7  0x00007fd17b1543e0 in __libc_thread_freeres () at thread-freeres.c:38
#8  0x00007fd17b69a4e6 in start_thread (arg=<optimized out>) at pthread_create.c:493
#9  0x00007fd17b1c6163 in clone () at ../sysdeps/unix/sysv/

This abort comes from tcache_thread_shutdown(), not deinit_log_buffers()
as in comment #12. And comment #9 shows a SIGABRT with yet another
trace. So, although this stack trace is not identical to the one the
patch addresses, I'm seeing that the new build fixes the SIGABRTs.

Conclusion:

-- I ran haproxy18-1.8.17-3 for ~45 mins and saw many SIGABRTs.

-- I ran haproxy18-1.8.17-4 for ~45 mins and saw ZERO SIGABRTs.

-- I ran haproxy18-1.8.17-3 again and immediately saw many SIGABRTs.

Other observations:

I'm driving reloads much harder/faster than we normally would, and I
also see quite a lot of SEGVs with either version (-3 or -4):

Thu 2020-03-19 17:56:29 GMT   10738  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:56:34 GMT   11719  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:56:37 GMT   12862  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:56:40 GMT   13620  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:56:45 GMT   15325  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:57:05 GMT   21969  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:57:37 GMT   31938  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:58:35 GMT   18224  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:58:35 GMT   18205  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:58:36 GMT   18537  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:58:38 GMT   19113  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:59:08 GMT   28852  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 17:59:28 GMT    2588  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 18:00:13 GMT   14684  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy

Switching to haproxy18-1.8.17-4 I also see SEGVs:

Thu 2020-03-19 18:01:05 GMT   30665  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.x86_64/haproxy
Thu 2020-03-19 18:37:20 GMT    9801  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:27 GMT   11912  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:37 GMT   15092  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:38 GMT   15294  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:48 GMT   18500  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:50 GMT   18960  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Thu 2020-03-19 18:37:51 GMT   19480  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy

I don't want to muddy the waters though because it seems to me that
haproxy18-1.8.17-4 fixes the SIGABRTs.

For completeness, here's a SEGV backtrace:

$ coredumpctl gdb 30665
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-3.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000056273701bc33 in listener_accept ()
[Current thread is 1 (Thread 0x7f20c666f700 (LWP 30672))]
(gdb) bt
#0  0x000056273701bc33 in listener_accept ()
#1  0x000056273703e10f in fd_process_cached_events ()
#2  0x0000562736fea2f2 in run_thread_poll_loop ()
#3  0x00007f20ca1344c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007f20c9c60163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

$ coredumpctl gdb 19480
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005557c4edecb3 in listener_accept ()
[Current thread is 1 (Thread 0x7fb4add98700 (LWP 19490))]
(gdb) bt
#0  0x00005557c4edecb3 in listener_accept ()
#1  0x00005557c4f0118f in fd_process_cached_events ()
#2  0x00005557c4ead342 in run_thread_poll_loop ()
#3  0x00007fb4b30604c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007fb4b2b8c163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Is the SIGABRT the only core dump the customer is seeing?

Comment 20 Mario Abajo 2020-03-20 14:04:57 UTC
Thanks, Kyle, for the quick patch!
Andrew, thanks a lot for your extensive tests and troubleshooting. To answer your question: the customer sees the same behavior you describe in your tests. We only took one core to analyze, as they were being generated quite often, but maybe we can collect some others to see if we are missing something else. Anyway, as you say, it looks like this may solve the issue. Let me get some more cores to analyze.

Comment 21 Dan Mace 2020-03-24 12:12:24 UTC
Following Kyle's lead, I've created a new haproxy 1.8 build which adds the following commit to the rhaos-3.11-rhel-7 branch:

    http://git.haproxy.org/?p=haproxy.git;a=patch;h=7c49711d6041d1afc42d5b310ddfd7d6f6817c3c

Here's the build and repo:

    https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27428159
    http://brew-task-repos.usersys.redhat.com/repos/scratch/dmace/haproxy/1.8.17/4.el7/haproxy-1.8.17-4.el7-scratch.repo

I verified the haproxy binary is byte-equivalent to Kyle's, but haven't yet run Andrew's testing against the new binary.

Here's an OpenShift 3.11 PR which incorporates the scratch RPM to test for regressions:

    https://github.com/openshift/origin/pull/24756

It would be useful if the customer could test Kyle's public RPM to ensure it solves their problem. If the new build has no regressions, and passes Andrew's manual testing, and ideally passes the customer's testing, I'd feel pretty comfortable pushing out a new build into the supported haproxy image.

Comment 22 Dan Mace 2020-03-24 16:46:09 UTC
Now that I know where the upstream 1.8 source actually lives, I can see that the fix was already backported into the 1.8 tree:

    http://git.haproxy.org/?p=haproxy-1.8.git;a=commit;h=4c82743abd299f0aa8105e98ec92b76375a7f344

We already have a patch pending to upgrade to 1.8.23 (https://github.com/openshift/origin/pull/24533), so let's finish that work and fix both bugs in the process.

Comment 23 Dan Mace 2020-03-24 17:43:57 UTC
The haproxy 1.8.23 build is here: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1145906

Will work on getting the new router image released.

Comment 24 Andrew McDermott 2020-03-24 18:33:12 UTC
I took the RPMs from this build: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1145906

I updated my scripts https://github.com/frobware/haproxy-hacks/tree/master/BZ1810573 to select haproxy-1.8.23 by default.

I reran the testing I did in comment #19 for 30 minutes and saw zero SIGABRTs.

I still see some SEGVs but with 1.8.23 the rate is reduced:

Starting at: Tue 24 Mar 2020 06:00:00 PM GMT

$ while :; do coredumpctl list | grep haproxy; sleep 2; date; done
Fri 2020-03-20 12:52:54 GMT   22408  1000  1000  11 missing   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.17-4.el7.x86_64/haproxy
Tue 2020-03-24 18:07:09 GMT    2948  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.x86_64/haproxy
Tue 2020-03-24 18:08:35 GMT   12941  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.x86_64/haproxy
Tue 2020-03-24 18:18:52 GMT   25386  1000  1000  11 present   /home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.x86_64/haproxy

Looking at each SEGV I see:

$ coredumpctl gdb 2948
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000564ef4d07873 in listener_accept ()
[Current thread is 1 (Thread 0x7f16e2183700 (LWP 2953))]
(gdb) bt
#0  0x0000564ef4d07873 in listener_accept ()
#1  0x0000564ef4d2a88f in fd_process_cached_events ()
#2  0x0000564ef4cd659a in run_thread_poll_loop ()
#3  0x00007f16e3c454c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007f16e3771163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

$ coredumpctl gdb 12941
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005643e98bb873 in listener_accept ()
[Current thread is 1 (Thread 0x7fd501579700 (LWP 12943))]
(gdb) bt
#0  0x00005643e98bb873 in listener_accept ()
#1  0x00005643e98de88f in fd_process_cached_events ()
#2  0x00005643e988a59a in run_thread_poll_loop ()
#3  0x00007fd50283a4c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007fd502366163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

$ coredumpctl gdb 25386
Core was generated by `/home/aim/repos/github/frobware/haproxy-hacks/BZ1810573/haproxy18-1.8.23-2.el7.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000559c82e6e873 in listener_accept ()
[Current thread is 1 (Thread 0x7fe6eea10700 (LWP 25388))]
(gdb) bt
#0  0x0000559c82e6e873 in listener_accept ()
#1  0x0000559c82e9188f in fd_process_cached_events ()
#2  0x0000559c82e3d59a in run_thread_poll_loop ()
#3  0x00007fe6efcd14c0 in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007fe6ef7fd163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Comment 25 Andrew McDermott 2020-03-24 18:36:51 UTC
I bumped the PR that is running origin CI against haproxy-1.8.23 to use the build from:

- https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1145906

New commits in the PR:

- https://github.com/openshift/origin/pull/24533/commits/4e1ee2a536626f89ee52d90f5e085a49c74bc248

The purpose of the origin PR is to provide some sanity testing against 1.8.23.

Comment 26 Dan Mace 2020-03-25 12:42:21 UTC
haproxy18-1.8.23-2 will be shipped in openshift-enterprise-haproxy-router-container-v3.11.195-1.

Comment 36 Arvind iyengar 2020-04-03 09:25:15 UTC
Verified with "atomic-openshift-3.11.200-1.git.0.ecc3aa3.el7.x86_64" and it passed. The haproxy package has been updated to "haproxy18-1.8.23-2.el7.x86_64". The tested HAProxy version shows no regressions.

Comment 38 errata-xmlrpc 2020-04-07 17:45:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1285

