Bug 2075837

Summary: virtnwfilterd crashes during a start->reload->restart cycle while a running guest has a network filter configured
Product: Red Hat Enterprise Linux 9
Reporter: yafu <yafu>
Component: libvirt
Assignee: Michal Privoznik <mprivozn>
libvirt sub component: General
QA Contact: yalzhang <yalzhang>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: jdenemar, jsuchane, lmen, mprivozn, virt-maint, xuzhang
Version: 9.1
Keywords: Triaged, Upstream
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: libvirt-8.4.0-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-15 10:04:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version: 8.4.0
Embargoed:

Description yafu 2022-04-15 15:30:53 UTC
Description of problem:
virtnwfilterd crashes during a start->reload->restart cycle while a running guest has a network filter configured.

Version-Release number of selected component (if applicable):
libvirt-8.2.0-1.el9.x86_64

How reproducible:
50%

Steps to Reproduce:
1. Start a guest with a network filter configured on an interface:
# virsh start r9

# virsh dumpxml r9
...
 <interface type='bridge'>
      <mac address='52:54:00:2f:32:97'/>
      <source bridge='virbr0'/>
      <target dev='if-0'/>
      <model type='rtl8139'/>
      <filterref filter='disallow-arp'/>
      <alias name='net2'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x03' function='0x0'/>
    </interface>
...


2. Stop, start, reload, and restart virtnwfilterd in a loop:
# for i in {1..10}; do systemctl stop virtnwfilterd; sleep 5; systemctl start virtnwfilterd; systemctl reload virtnwfilterd; systemctl restart virtnwfilterd; done

3. Check for core dumps:
# coredumpctl list
TIME                         PID UID GID SIG     COREFILE EXE                       SIZE
Fri 2022-04-15 11:25:37 EDT 2445   0   0 SIGSEGV present  /usr/sbin/virtnwfilterd 607.6K
Fri 2022-04-15 11:25:55 EDT 3396   0   0 SIGSEGV present  /usr/sbin/virtnwfilterd 608.2K
Fri 2022-04-15 11:26:02 EDT 3732   0   0 SIGSEGV present  /usr/sbin/virtnwfilterd 607.1K
Fri 2022-04-15 11:26:15 EDT 4367   0   0 SIGSEGV present  /usr/sbin/virtnwfilterd 608.3K
Fri 2022-04-15 11:26:21 EDT 4701   0   0 SIGSEGV present  /usr/sbin/virtnwfilterd 606.8K

4. Check the backtrace:
Core was generated by `/usr/sbin/virtnwfilterd --timeout 120'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd2f5619a04 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7fd2b67fc640 (LWP 4725))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-27.el9_0.x86_64 krb5-libs-1.19.1-15.el9_0.x86_64 openldap-2.4.59-3.el9.x86_64 openssl-libs-3.0.1-18.el9.x86_64 zlib-1.2.11-31.el9.x86_64
(gdb) t a a bt

Thread 4 (Thread 0x7fd2f41d1f80 (LWP 4701)):
#0  0x00007fd2f43c30c4 in __do_global_dtors_aux () from /lib64/libcrypt.so.2
#1  0x00007fd2f5fc201b in _dl_fini () at dl-fini.c:138
#2  0x00007fd2f55ce475 in __run_exit_handlers () from /lib64/libc.so.6
#3  0x00007fd2f55ce5f0 in exit () from /lib64/libc.so.6
#4  0x00007fd2f55b6e57 in __libc_start_call_main () from /lib64/libc.so.6
#5  0x00007fd2f55b6efc in __libc_start_main_impl () from /lib64/libc.so.6
#6  0x00005581ed357f35 in _start ()

Thread 3 (Thread 0x7fd2b5ffb640 (LWP 4718)):
#0  0x00007fd2f56b9edf in poll () from /lib64/libc.so.6
#1  0x00007fd2f5aa359c in g_main_context_poll (priority=<optimized out>, n_fds=1, fds=0x7fd2a0009e10, timeout=<optimized out>, context=0x7fd2a000d360) at ../glib/gmain.c:4434
#2  g_main_context_iterate.constprop.0 (context=context@entry=0x7fd2a000d360, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4126
#3  0x00007fd2f5a4c5d3 in g_main_context_iteration (context=0x7fd2a000d360, may_block=may_block@entry=1) at ../glib/gmain.c:4196
#4  0x00007fd2f5a4c621 in glib_worker_main (data=<optimized out>) at ../glib/gmain.c:6089
#5  0x00007fd2f5a79d72 in g_thread_proxy (data=0x7fd2a0009400) at ../glib/gthread.c:826
#6  0x00007fd2f56166ca in start_thread () from /lib64/libc.so.6
#7  0x00007fd2f55b63f0 in clone3 () from /lib64/libc.so.6

Thread 2 (Thread 0x7fd2b57fa640 (LWP 4719)):
#0  0x00007fd2f56b9edf in poll () from /lib64/libc.so.6
#1  0x00007fd2f5aa359c in g_main_context_poll (priority=<optimized out>, n_fds=2, fds=0x7fd2a0021b80, timeout=<optimized out>, context=0x7fd2a001e9b0) at ../glib/gmain.c:4434
#2  g_main_context_iterate.constprop.0 (context=0x7fd2a001e9b0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4126
#3  0x00007fd2f5a4e463 in g_main_loop_run (loop=0x7fd2a001eaa0) at ../glib/gmain.c:4329
#4  0x00007fd2f58da5ca in gdbus_shared_thread_func (user_data=0x7fd2a001e980) at ../gio/gdbusprivate.c:280
#5  0x00007fd2f5a79d72 in g_thread_proxy (data=0x7fd2a0019800) at ../glib/gthread.c:826
#6  0x00007fd2f56166ca in start_thread () from /lib64/libc.so.6
#7  0x00007fd2f55b63f0 in clone3 () from /lib64/libc.so.6

Thread 1 (Thread 0x7fd2b67fc640 (LWP 4725)):
#0  0x00007fd2f5619a04 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007fd2f5c69089 in virMutexLock (m=<optimized out>) at ../src/util/virthread.c:91
#2  0x00007fd2f5c690ae in virLockGuardLock (m=0x30) at ../src/util/virthread.c:102
#3  0x00007fd2e4490c69 in nwfilterStateReload () at ../src/nwfilter/nwfilter_driver.c:330
#4  0x00007fd2f5e0f040 in virStateReload () at ../src/libvirt.c:759
#5  0x00005581ed3580e3 in daemonReloadHandlerThread (opaque=<optimized out>) at ../src/remote/remote_daemon.c:456
#6  0x00007fd2f5c71bb9 in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:256
#7  0x00007fd2f56166ca in start_thread () from /lib64/libc.so.6
#8  0x00007fd2f55b63f0 in clone3 () from /lib64/libc.so.6


Actual results:


Expected results:


Additional info:

Comment 1 Jaroslav Suchanek 2022-04-21 13:14:59 UTC
Daniel, would you please look into this one? Thanks.

Comment 2 Michal Privoznik 2022-04-22 14:28:24 UTC
Current master has slightly different line numbers, but the offending code is this:

static int
nwfilterStateReload(void)
{
    if (!driver)
        return -1;

    if (!driver->privileged)
        return 0;

    virNWFilterDHCPSnoopEnd(NULL);
    /* shut down all threads -- they will be restarted if necessary */
    virNWFilterLearnThreadsTerminate(true);

    VIR_WITH_MUTEX_LOCK_GUARD(&driverMutex) {
        VIR_WITH_MUTEX_LOCK_GUARD(&driver->updateLock) {                    /* THIS is the line that causes SIGSEGV */
            virNWFilterObjListLoadAllConfigs(driver->nwfilters, driver->configDir);
        }

        virNWFilterBuildAll(driver, false);
    }

    return 0;
}



And I think I know where the problem is. The problematic line is the one that locks driver->updateLock: the stack trace shows that driver is NULL at that point (indeed, the faulting address m=0x30 is the offset of the updateLock member within the driver struct). The problem is that after the driverMutex lock is acquired, we have to check again whether driver == NULL. The check at the beginning of the function is not sufficient, because it was done outside the lock that guards the driver variable.

Comment 3 Michal Privoznik 2022-04-22 14:46:45 UTC
Patch posted onto the list:

https://listman.redhat.com/archives/libvir-list/2022-April/230337.html

Comment 5 Michal Privoznik 2022-05-19 14:50:26 UTC
v3 (though the patch itself is labelled v2, because it implements a different approach):

https://listman.redhat.com/archives/libvir-list/2022-May/231663.html

Comment 6 Michal Privoznik 2022-05-23 12:47:34 UTC
Merged upstream as:

commit a54391fbc268e19252f5a6b2d91a6f8d84990243
Author:     Michal Prívozník <mprivozn>
AuthorDate: Mon Apr 25 10:31:56 2022 +0200
Commit:     Michal Prívozník <mprivozn>
CommitDate: Mon May 23 14:44:50 2022 +0200

    remote_daemon: Don't run virStateCleanup() if virStateReload() is still running
    
    When a SIGHUP is received a thread is spawned that runs
    virStateReload(). However, if SIGINT is received while the former
    thread is still running then we may get into problematic
    situation: the cleanup code in main() sees drivers initialized
    and thus calls virStateCleanup(). So now we have two threads, one
    running virStateReload() the other virStateCleanup(). In this
    situation it's very likely that a race condition occurs and
    either of threads causes SIGSEGV.
    
    To fix this, unmark drivers as initialized in the
    virStateReload() thread for the time the function runs.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2075837
    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: Martin Kletzander <mkletzan>

v8.3.0-210-ga54391fbc2

Comment 7 yalzhang@redhat.com 2022-06-01 03:34:23 UTC
Reproduced the bug on libvirt-8.3.0-1.el9.x86_64 and pre-verified the fix on v8.4.0-rc2-13-g8b8fd1bc67 using the steps in comment 0.

Comment 10 yalzhang@redhat.com 2022-06-13 05:48:16 UTC
Tested on libvirt-8.4.0-1.el9.x86_64 with the steps in comment 0; the issue is fixed.

Comment 12 errata-xmlrpc 2022-11-15 10:04:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: libvirt security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8003