Bug 1046594

Summary: Libvirtd crashed when loading/unloading ixgbe(82599) module repeatedly
Product: Red Hat Enterprise Linux 7
Reporter: Hu Jianwei <jiahu>
Component: netcf
Assignee: Laine Stump <laine>
Status: CLOSED CURRENTRELEASE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Version: 7.0
CC: acathrow, dallan, ddugger, dyuan, hannsj_uhl, honzhang, jamorgan, jane.lv, jkachuck, jmiao, john.ronciak, jvillalo, jwilleford, mjenner, mzhan, ruwang, xiaolong.wang
Target Milestone: rc
Target Release: 7.0
Hardware: x86_64
OS: Linux
Fixed In Version: netcf-0.2.3-6.el7
Doc Type: Bug Fix
Last Closed: 2014-06-13 11:13:40 UTC
Type: Bug
Attachments:
Description             Flags
all_thread_backtrace    none
valgrind_libvirtd       none
kernel message log      none
libvirtd log            none

Description Hu Jianwei 2013-12-26 09:07:49 UTC
Created attachment 841790 [details]
all_thread_backtrace

Description of problem:
Libvirtd crashed when loading/unloading ixgbe(82599) module repeatedly

Version-Release number of selected component (if applicable):
libvirt-1.1.1-16.el7.x86_64
qemu-kvm-1.5.3-30.el7.x86_64
kernel-3.10.0-64.el7.x86_64

How reproducible:
90%

Steps to Reproduce:
1. load/unload ixgbe module many times.
[root@ibm-x3850x5-05 ~]#for i in {1..10}; do echo $i;modprobe -r ixgbe; sleep 1; modprobe ixgbe max_vfs=63; sleep 1; done

2. Check the libvirtd status
[root@ibm-x3850x5-05 ~]# systemctl status libvirtd
libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled)
   Active: failed (Result: signal) since Thu 2013-12-26 01:15:51 EST; 1min 38s ago
  Process: 15267 ExecStart=/usr/sbin/libvirtd $LIBVIRTD_ARGS (code=killed, signal=ABRT)
 Main PID: 15267 (code=killed, signal=ABRT)

Dec 26 01:15:51 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com libvirtd[15267]: 7f0e34000000-7f0e34623000 rw-p 00000000 00:00 0
Dec 26 01:15:51 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com libvirtd[15267]: 7f0e34623000-7f0e38000000 ---p 00000000 00:00 0
Dec 26 01:15:51 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com libvirtd[15267]: 7f0e38000000-7f0e38036000 rw-p 00000000 00:00 0
Dec 26 01:15:51 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com systemd[1]: libvirtd.service: main process exited, code=killed, status=6/ABRT
Dec 26 01:15:51 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com systemd[1]: Unit libvirtd.service entered failed state.
Dec 26 01:16:44 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com dnsmasq[16463]: reading /etc/resolv.conf
Dec 26 01:16:44 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com dnsmasq[16463]: using nameserver 10.66.127.10#53
Dec 26 01:16:44 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com dnsmasq[16463]: using nameserver 10.66.78.111#53
Dec 26 01:16:44 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com dnsmasq[16463]: using nameserver 10.66.78.117#53
Dec 26 01:16:44 ibm-x3850x5-05.qe.lab.eng.nay.redhat.com dnsmasq[16463]: using local addresses only for unqualified names

Actual results:
As shown in the steps above. See the attachments for detailed logs.

Expected results:
Reloading the ixgbe module should not affect libvirtd's status.

Comment 1 Hu Jianwei 2013-12-26 09:08:18 UTC
Created attachment 841791 [details]
valgrind_libvirtd

Comment 2 Hu Jianwei 2013-12-26 09:08:57 UTC
Created attachment 841792 [details]
kernel message log

Comment 3 Hu Jianwei 2013-12-26 09:09:31 UTC
Created attachment 841793 [details]
libvirtd log

Comment 5 Laine Stump 2014-01-07 15:19:21 UTC
I was able to reproduce this on Fedora 20 (which has the same netcf version). The problem is the function aug_get_mac() - it doesn't initialize the local char *path to NULL, but then unconditionally frees it on error; the problem is that one error condition can be encountered prior to path getting set.

I'm changing the component to netcf and will be posting a patch upstream shortly.

Comment 6 Laine Stump 2014-01-07 15:22:09 UTC
Note that reproducing the bug will be quicker if you run this loop concurrently with the loop that is loading/unloading the network card driver (btw, on my system I had an 82576 card / igb driver, so you don't necessarily need an 82599):

  while true; do virsh iface-list; done

Comment 7 Laine Stump 2014-01-08 11:12:54 UTC
A patch has been posted to the upstream netcf mailing list:

https://lists.fedorahosted.org/pipermail/netcf-devel/2014-January/000853.html

I tested this patch by attaching gdb to the libvirtd process (so I would easily see when it crashed) and running these two shell scripts simultaneously in different windows:

---

  while true; do virsh iface-list; done
---

  i=1
  while true; do
    echo $i; i=$(expr $i + 1)
    modprobe -r igb
    usleep 1
    modprobe igb max_vfs=7
    usleep 1
  done

Without the patched netcf, libvirtd would crash in under 30 seconds. With the patch, I ran the test for several minutes with no crashes (virsh iface-list will periodically fail because the interface config is in an inconsistent state across multiple calls to the virInterface API, but that is an unsolvable (and acceptable) problem).

Comment 8 Laine Stump 2014-01-08 14:27:15 UTC
Pushed this upstream:

commit 8ed36d22fbc792474ca9c3b06c8a326b1fb5af08
Author: Laine Stump <laine@laine.org>
Date:   Tue Jan 7 20:12:06 2014 +0200

    eliminate use of uninitialized data when getting mac address
    
    If the call to get_augeas() at the top of aug_get_mac() failed, we
    would goto error and FREE(path), which would not have been
    initialized. And if by some magic of fate we happened to get past
    that, we would return garbage for the return code, since r was also
    not initialized. This patch initializes both path and r to fix the
    crash documented in Bug 1046594.
    
    Although it doesn't directly impact the referenced bug, a quick audit
    of other functions in the same file showed that defnode() had the same
    problem with uninitialized "r". Beyond that, I also defensively
    initialized the pointer to mac address to NULL both in aug_get_mac()
    as well as two of its callers, to make future audits of the code
    easier, and to shut up both valgrind and whatever static analyzers
    might be run on the code.
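
The pattern described in the commit message can be illustrated with a minimal C sketch. This is not the netcf source: early_setup(), lookup_mac(), the simulate_failure flag, the path format, and the placeholder address are all hypothetical names and values chosen for illustration. It only shows why initializing the pointer and the return code makes an early "goto error" safe.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an early setup step that can fail before
 * anything has been allocated (the role get_augeas() plays above). */
static int early_setup(int simulate_failure)
{
    return simulate_failure ? -1 : 0;
}

/* Hypothetical lookup: returns 0 and sets *mac on success, -1 on error. */
static int lookup_mac(const char *ifname, int simulate_failure, char **mac)
{
    static const char placeholder[] = "00:00:00:00:00:00";
    char *path = NULL;  /* free(NULL) is a no-op, so early exits are safe */
    int r = -1;         /* pessimistic default: every error path returns -1 */
    size_t len;

    assert(ifname != NULL && mac != NULL);
    *mac = NULL;

    if (early_setup(simulate_failure) < 0)
        goto error;     /* path has not been assigned yet */

    /* Illustrative path format only; not taken from netcf. */
    len = strlen("/sys/class/net/") + strlen(ifname) + strlen("/address") + 1;
    path = malloc(len);
    if (path == NULL)
        goto error;
    snprintf(path, len, "/sys/class/net/%s/address", ifname);

    /* A real implementation would query the address via path here;
     * this sketch just hands back a placeholder value. */
    *mac = malloc(sizeof(placeholder));
    if (*mac == NULL)
        goto error;
    memcpy(*mac, placeholder, sizeof(placeholder));
    r = 0;

 error:                 /* shared cleanup, reached on success and failure */
    free(path);
    return r;
}
```

With simulate_failure set to 1, lookup_mac() returns -1 and leaves *mac as NULL. If path were left uninitialized, that same early exit would pass an indeterminate pointer to free(), which is undefined behavior and may abort the process, as in this report.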

Comment 9 Laine Stump 2014-01-21 13:16:05 UTC
Hanns-Joachim Uhl: Why do you consider this BZ to block an update to the ixgbe driver? Did you encounter this crash while testing?

The problem is actually driver-agnostic (so it is just as likely to happen with the old ixgbe as the new ixgbe). It can be completely avoided when testing by simply not running virt-manager (which happens to frequently call netcf) during the testing of the new driver (or at least during parts that involve unloading/loading a netdev driver). Depending on your tests, that may be less than ideal, but likely not a large problem.

(That said, a netcf build is likely coming this week, but lack of a build probably shouldn't prevent testing of the ixgbe driver)

Comment 10 Laine Stump 2014-01-21 13:20:09 UTC
To test that the fix is working properly, you will need to restart libvirtd.service after updating the netcf package (there is no method of automatically triggering that other than mandating offline updates). So:


   yum update netcf-*.x86_64.rpm
   systemctl restart libvirtd.service
   (now run the test scripts in Comment 7)

Comment 14 John Ronciak 2014-01-22 22:05:32 UTC
So how is this then blocking BZ "Bug 726818 - [Intel 7.0 FEAT] Update ixgbe driver to latest upstream."? It should not be any more, correct?

Comment 15 Hanns-Joachim Uhl 2014-01-23 09:06:54 UTC
(In reply to John Ronciak from comment #14)
> So how is this then blocking BZ "Bug 726818 - [Intel 7.0 FEAT] Update ixgbe
> driver to latest upstream."? It should not be any more, correct?
.
... correct ...

Comment 16 Laine Stump 2014-01-23 10:21:21 UTC
Since Bug 965845 was also in the "blocks" list, and it is just a duplicate of Bug 726818, I've also removed it from the blocks list.

Comment 17 Jincheng Miao 2014-01-24 08:38:04 UTC
With the latest netcf-0.2.3-6.el7.x86_64, running the two scripts from Comment 7 produced no crash.
So I am changing the status to VERIFIED.

Comment 18 Ludek Smid 2014-06-13 11:13:40 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.