1692788 – keepalived crashes in a loop when the vrrp interface does not exist

Summary: keepalived crashes in a loop when the vrrp interface does not exist

Keywords:
Status:	CLOSED DUPLICATE of bug 1693706
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	keepalived
Sub Component:
Version:	8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	8.0
Assignee:	Ryan O'Hara
QA Contact:	Brandon Perkins
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-03-26 12:28 UTC by Michele Baldessari
Modified:	2019-05-22 04:58 UTC (History)
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-05-10 16:47:56 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Description Michele Baldessari 2019-03-26 12:28:25 UTC

Description of problem:
keepalived goes into a crash loop if the interface on which vrrp is configured does not exist (which might be triggered by an ovs restart, for example).

Version-Release number of selected component (if applicable):
keepalived-2.0.10-1.el8.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Use the following keepalived.conf
global_defs {
  notification_email {
    root
  }
  notification_email_from keepalived
  smtp_server localhost
  smtp_connect_timeout 30
  router_id undercloud-0
}
 
static_ipaddress { }
 
vrrp_script haproxy {
  script   "test -S /var/lib/haproxy/stats && echo "show info" | socat /var/lib/haproxy/stats stdio"
  interval 2
  weight   2   
}

vrrp_instance 51 {
  virtual_router_id 51
  # Advert interval
  advert_int 1
  # for electing MASTER, highest priority wins.
  priority  101 
  state     MASTER
 
  interface br-ctlplane
  virtual_ipaddress {
      192.168.24.3 dev br-ctlplane
  }
 
  track_script { haproxy }
}

vrrp_instance 52 {
  virtual_router_id 52
  
  # Advert interval
  advert_int 1
  # for electing MASTER, highest priority wins.
  priority  101
  state     MASTER
  interface br-ctlplane
  virtual_ipaddress {
      192.168.24.2 dev br-ctlplane
  }
  
  track_script { haproxy }
}     

2. systemctl start keepalived
3. Observe the crash:
Mar 26 13:18:27 rhel8.int.rhx systemd-coredump[9615]: Process 9613 (keepalived) of user 0 dumped core.
                                                      
                                                      Stack trace of thread 9613:
                                                      #0  0x00007f77d5f6593f raise (libc.so.6)
                                                      #1  0x00007f77d5f4fd5e abort (libc.so.6)
                                                      #2  0x00007f77d5fa8d57 __libc_message (libc.so.6)
                                                      #3  0x00007f77d5faf68c malloc_printerr (libc.so.6)
                                                      #4  0x00007f77d5fb1027 _int_free (libc.so.6)
                                                      #5  0x00005616b1e8bccd free_global_data (keepalived)
                                                      #6  0x00005616b1ea8155 vrrp_terminate_phase2 (keepalived)
                                                      #7  0x00005616b1ea8361 stop_vrrp (keepalived)
                                                      #8  0x00005616b1ea86ee stop_vrrp (keepalived)
                                                      #9  0x00005616b1ea8c5f start_vrrp_child (keepalived)
                                                      #10 0x00005616b1ea8cb6 vrrp_respawn_thread (keepalived)
                                                      #11 0x00005616b1ed7623 thread_call (keepalived)
                                                      #12 0x00005616b1e8ada6 keepalived_main (keepalived)
                                                      #13 0x00007f77d5f51813 __libc_start_main (libc.so.6)
                                                      #14 0x00005616b1e890ee _start (keepalived)


Note that the crash repeats continuously:
[root@rhel8 var]# coredumpctl list |grep keepalived|wc -l
3223

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {0, 18446744073709551615 <repeats 12 times>, 140152668463007, 0, 532575944823}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007f77d5f4fd5e in __GI_abort () at abort.c:100
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {18446744073709551615 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f77d5fa8d57 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f77d60b6178 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
        ap = {{gp_offset = 24, fp_offset = 0, overflow_arg_area = 0x7fffb654ba70, reg_save_area = 0x7fffb654ba00}}
        fd = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#3  0x00007f77d5faf68c in malloc_printerr (str=str@entry=0x7f77d60b7e10 "double free or corruption (fasttop)") at malloc.c:5364
No locals.
#4  0x00007f77d5fb1027 in _int_free (av=0x7f77d62ecc60 <main_arena>, p=0x5616b25588f0, have_lock=<optimized out>) at malloc.c:4244
        idx = 0
        old = <optimized out>
        old2 = <optimized out>
        size = <optimized out>
        fb = 0x7f77d62ecc70 <main_arena+16>
        nextchunk = <optimized out>
        nextsize = <optimized out>
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        __PRETTY_FUNCTION__ = "_int_free"
#5  0x00005616b1e8bccd in free_global_data (data=0x5616b2553820) at global_data.c:325
No locals.
#6  0x00005616b1ea8155 in vrrp_terminate_phase2 (exit_status=exit_status@entry=3) at vrrp_daemon.c:261
        usage = {ru_utime = {tv_sec = 94655481232080, tv_usec = 94655474374192}, ru_stime = {tv_sec = 94655481220944, tv_usec = 94655474377042}, {ru_maxrss = 94655481258176, __ru_maxrss_word = 94655481258176}, {ru_ixrss = 94655474389450, __ru_ixrss_word = 94655474389450}, {ru_idrss = 94655481209472, __ru_idrss_word = 94655481209472}, {
            ru_isrss = 94655474398014, __ru_isrss_word = 94655474398014}, {ru_minflt = 0, __ru_minflt_word = 0}, {ru_majflt = 94655481268912, __ru_majflt_word = 94655481268912}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 94655481208496, __ru_inblock_word = 94655481208496}, {ru_oublock = 94655481220928, 
            __ru_oublock_word = 94655481220928}, {ru_msgsnd = 94655474261594, __ru_msgsnd_word = 94655474261594}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 94655481192528, __ru_nsignals_word = 94655481192528}, {ru_nvcsw = 0, __ru_nvcsw_word = 0}, {ru_nivcsw = 94655474262307, __ru_nivcsw_word = 94655474262307}}
#7  0x00005616b1ea8361 in stop_vrrp (status=status@entry=3) at vrrp_daemon.c:429
No locals.
#8  0x00005616b1ea86ee in stop_vrrp (status=3) at ../../lib/bitops.h:49
No locals.
#9  start_vrrp (old_global_data=old_global_data@entry=0x0) at vrrp_daemon.c:467
No locals.
#10 0x00005616b1ea8c5f in start_vrrp_child () at vrrp_daemon.c:1002
        pid = <optimized out>
        syslog_ident = <optimized out>
        pid = <optimized out>
        syslog_ident = <optimized out>
#11 0x00005616b1ea8cb6 in vrrp_respawn_thread (thread=<optimized out>) at vrrp_daemon.c:832
No locals.
#12 0x00005616b1ed7623 in thread_call (thread=0x5616b25590a0) at scheduler.c:1720
No locals.
#13 process_threads (m=0x5616b2558f40) at scheduler.c:1720
        thread = 0x5616b25590a0
        thread_list = <optimized out>
        thread_type = <optimized out>
#14 0x00005616b1ed7ff5 in launch_thread_scheduler (m=<optimized out>) at scheduler.c:1815
No locals.
#15 0x00005616b1e8ada6 in keepalived_main (argc=2, argv=<optimized out>) at main.c:1897
        report_stopped = true
        uname_buf = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "rhel8.int.rhx", '\000' <repeats 51 times>, release = "4.18.0-80.el8.x86_64", '\000' <repeats 44 times>, version = "#1 SMP Wed Mar 13 12:02:46 UTC 2019", '\000' <repeats 29 times>, machine = "x86_64", '\000' <repeats 58 times>, 
          domainname = "(none)", '\000' <repeats 58 times>}
        end = 0x7fffb654be46 ".int.rhx"
#16 0x00007f77d5f51813 in __libc_start_main (main=0x5616b1e890b0 <main>, argc=2, argv=0x7fffb654c628, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffb654c618) at ../csu/libc-start.c:308
        self = <optimized out>
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {0, 4084918265210392554, 94655474077888, 140736252397088, 0, 0, 7737893304674471914, 7670267560097533930}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x7fffb654c640, 0x7f77d8540150}, data = {prev = 0x0, cleanup = 0x0, canceltype = -1235958208}}}
        not_first_call = <optimized out>
#17 0x00005616b1e890ee in _start ()
No symbol table info available.

I think it is fine if keepalived keeps retrying the vrrp instance (after all, we configured it for that interface), but it should not crash: the constant core dumping is effectively bringing the whole system down.
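As a rough pre-flight check for the situation described above, the sketch below lists the interfaces named by `interface` statements in a keepalived.conf and reports which of them are absent. It is only an illustration: the function names are hypothetical, the regular expression assumes one `interface <name>` statement per line (as in the configuration above), and reading `/sys/class/net` is an assumption about where a Linux host exposes its interfaces. It does not address the crash itself.

```python
import re
from pathlib import Path

# Matches an "interface <name>" statement at the start of a line.
# Commented-out lines and "... dev <name>" entries are not matched.
_INTERFACE_RE = re.compile(r"^[ \t]*interface[ \t]+(\S+)", re.MULTILINE)


def configured_interfaces(conf_text):
    """Return interface names from `interface` statements, in order, without duplicates."""
    names = []
    for name in _INTERFACE_RE.findall(conf_text):
        if name not in names:
            names.append(name)
    return names


def system_interfaces(sysfs_root="/sys/class/net"):
    """Return the set of interface names listed under sysfs (assumed Linux layout)."""
    root = Path(sysfs_root)
    return {entry.name for entry in root.iterdir()} if root.is_dir() else set()


def missing_interfaces(conf_text, present):
    """Return configured interfaces that are not in the `present` collection."""
    return [name for name in configured_interfaces(conf_text) if name not in present]
```

A possible use would be to call `missing_interfaces(Path("/etc/keepalived/keepalived.conf").read_text(), system_interfaces())` before starting the service and to log any names it returns.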

Comment 1 Ryan O'Hara 2019-05-10 16:43:48 UTC

I am unable to reproduce this.

# rpm -q keepalived
keepalived-2.0.7-2.el8.x86_64

# cat /etc/keepalived/keepalived.conf
global_defs {
    router_id MESA-01
}

vrrp_instance VRRP-01 {
    interface foo
    priority 141
    advert_int 1
    state BACKUP
    virtual_router_id 31
    virtual_ipaddress {
        10.15.85.31
    }
}

# systemctl start keepalived
# journalctl -afu keepalived
May 10 12:34:55 mesa-virt-01_RHEL8 systemd[1]: Starting LVS and VRRP High Availability Monitor...
May 10 12:34:55 mesa-virt-01_RHEL8 systemd[1]: keepalived.service: Can't open PID file /var/run/keepalived.pid (yet?) after start: No such file or directory
May 10 12:34:55 mesa-virt-01_RHEL8 Keepalived[30315]: Starting VRRP child process, pid=30316
May 10 12:34:55 mesa-virt-01_RHEL8 Keepalived[30315]: Keepalived_vrrp exited with permanent error CONFIG. Terminating
May 10 12:34:55 mesa-virt-01_RHEL8 systemd[1]: Started LVS and VRRP High Availability Monitor.

No core dump for non-existent interface.

Seems more likely that you're hitting the double-free bug that was fixed here:

https://bugzilla.redhat.com/show_bug.cgi?id=1693706

Comment 2 Michele Baldessari 2019-05-10 16:47:56 UTC

Agreed, this looks to be the same root cause. I will reopen if that turns out not to be the case.

*** This bug has been marked as a duplicate of bug 1693706 ***

Comment 3 Ryan O'Hara 2019-05-10 17:22:54 UTC

OK, with a more recent version of keepalived I can recreate this problem.

# rpm -q keepalived
keepalived-2.0.10-1.el8.x86_64

After being started, keepalived repeatedly dies (coredump) and logs the following:

May 10 13:09:08 mesa-virt-01_RHEL8 Keepalived[30730]: Starting VRRP child process, pid=31311
May 10 13:09:08 mesa-virt-01_RHEL8 Keepalived_vrrp[31311]: Registering Kernel netlink reflector
May 10 13:09:08 mesa-virt-01_RHEL8 Keepalived_vrrp[31311]: Registering Kernel netlink command channel
May 10 13:09:08 mesa-virt-01_RHEL8 Keepalived_vrrp[31311]: Opening file '/etc/keepalived/keepalived.conf'.
May 10 13:09:08 mesa-virt-01_RHEL8 Keepalived_vrrp[31311]: (Line 17) WARNING - interface foo for vrrp_instance VRRP-01 doesn't exist
May 10 13:09:08 mesa-virt-01_RHEL8 Keepalived_vrrp[31311]: Non-existent interface specified in configuration
May 10 13:09:08 mesa-virt-01_RHEL8 Keepalived[30730]: Keepalived_vrrp exited due to signal 6
May 10 13:09:08 mesa-virt-01_RHEL8 Keepalived[30730]: VRRP child process(31311) died: Respawning

Note that you can stop keepalived from respawning by using the -R option.

But the reason for the coredump has nothing to do with the non-existent interface; it is caused by the smtp configuration and a double free when keepalived stops. See rhbz#1693706. I've tested with the latest build for 8.1 and it works as expected. Closing this as a duplicate.
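To gauge how often the VRRP child is dying, the journal output can be scanned for the "exited due to signal" message shown above. The following is a minimal sketch, assuming the message format is exactly as in these logs; the function names are hypothetical and the journal text is expected to be supplied by the caller (for example, saved `journalctl -u keepalived` output).

```python
import re
import signal
from collections import Counter

# Message format taken from the log lines in this report.
_SIGNAL_RE = re.compile(r"Keepalived_vrrp exited due to signal (\d+)")


def signal_exits(journal_text):
    """Count VRRP child exits per signal number found in the journal text."""
    return Counter(int(m.group(1)) for m in _SIGNAL_RE.finditer(journal_text))


def describe(signum):
    """Return the symbolic name for a signal number, or the number as a string if unknown."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return str(signum)
```

Signal 6 is SIGABRT, which matches the `abort` frame in the stack trace from the description.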
