Bug 1249158

Summary: virsh start failures due to OLD libnl3 version
Product: Red Hat Enterprise Linux 7 Reporter: Anand Nande <anande>
Component: libnl3 Assignee: Thomas Haller <thaller>
Status: CLOSED ERRATA QA Contact: Desktop QE <desktop-qa-list>
Severity: medium Docs Contact:
Priority: urgent    
Version: 7.1 CC: dcbw, hannsj_uhl, lmiksik, rhodain, thaller, tpelka, vbenes
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: libnl3-3.2.21-9.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1268767 (view as bug list) Environment:
Last Closed: 2015-11-19 14:52:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1268767    
Attachments:
Description Flags
[PATCH] dist-git patch with backports from upstream none
python script to test the backports none

Description Anand Nande 2015-07-31 16:32:30 UTC
Description of problem:

On rare occasions, 'virsh start' fails for a customer on RHEL 7.1 with messages like these:

Jul 30 17:45:13 hplcp051-host02 journal: cannot connect to netlink socket with protocol 0: Address already in use
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open file '/var/run/libvirt/hostdevmgr/eth0_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open file '/var/run/libvirt/qemu/eth0_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open file '/var/run/libvirt/hostdevmgr/eth1_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open file '/var/run/libvirt/qemu/eth1_vf7': No such file or directory 

A web search for these messages found multiple reports of similar issues on RHEL 6.* loads, with the suggestion that the bug is in libnl and that updating to a newer libnl version fixes it.

A developer in the customer's organization found that the libnl3 included in RHEL 7.1 has the issue. He tested against a privately installed newer libnl3 and verified that the issue reproduces with the official libnl3 but can NOT be reproduced with the newer one.

The customer is requesting a libnl3 update in RHEL 7.1 that includes the fix identified by this changelog entry:

* Thu May 22 2014 Thomas Haller <thaller> - 3.2.24-3
- retry local port on ADDRINUSE (rh #1097175)

The following is a detailed mail by their developer:

# Problem Description:
=======================
The intermittent problem we are trying to solve is that "virsh start" fails
about 1 out of 50 times in a very specific scenario. This has so far only
been seen in our "high_speed" VM configuration which has 16 bridges and
4 sriov interfaces. We start the VMs in parallel 8 at a time, usually
4 on a blade. This problem has only been seen on RHEL7.

The error messages are:

# Our cm log:
2015-07-30 17:45:27.93 sc_vm_mgmt: ERROR: failure on:
        ssh host02 /opt/vcp/sbin/vm_mgmt -t create -w 180 -v 13-s04c02h0
        -f 1-11GB ERROR: vm_mgmt virsh start 13-s04c02h0, failure=1,
        output=error: Failed to start domain 13-s04c02h0

# /var/log/messages:
Jul 30 17:45:13 hplcp051-host02 journal: cannot connect to netlink socket
        with protocol 0: Address already in use
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open
        file '/var/run/libvirt/hostdevmgr/eth0_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open
        file '/var/run/libvirt/qemu/eth0_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open
        file '/var/run/libvirt/hostdevmgr/eth1_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open
        file '/var/run/libvirt/qemu/eth1_vf7': No such file or directory

################################################################
# Recreating the Problem:

To reproduce this issue, we need to create and destroy VMs repeatedly.
This is needed because the failure rate is quite low. We are unlikely
to hit this problem if the sample size is too small.

We also discovered that the problem typically does not occur on
subsequent VM creates unless we clear the mac/vlan information
for the VF before reattempting the create. Discovering this was key
to our ability to reproduce the problem.

This is an important issue because customers transitioning to
this software from an older release will start all of their VMs in
parallel for the first time on this load. That was the situation when this
problem was first reported. We did not originally recognize that
subsequent VM creates were unlikely to hit this problem, so we were
initially unable to reproduce the problem.

We ran this test in hp051 (c7000, 12 Gen8 blades).
Four "high_speed" VMs were defined on each host. Then we
ran a script to:

        Loop 5 times:
                create the 48 VMs  (8 at a time in parallel)
                destroy the 48 VMs (8 at a time in parallel)
                reset the VF mac/vlan (ip link set ethX vf Y vlan 0 mac 0)

        We hit the VM create failure 5 times in total.
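
        The loop above can be sketched in Python as follows. This is a
        minimal single-host sketch, not the script that was actually run:
        the function names, the domain and VF lists, and the injectable
        "runner" are all hypothetical, and the real test drove 12 hosts.
        The VF reset command is copied from the report as written.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(argv):
    """Run one command; return True if it exited with status 0."""
    return subprocess.run(argv, capture_output=True).returncode == 0


def run_batch(commands, parallel=8, runner=run_command):
    """Run commands with at most `parallel` in flight; return the failed ones."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        results = list(pool.map(runner, commands))
    return [cmd for cmd, ok in zip(commands, results) if not ok]


def reproduce(domains, vfs, loops=5, parallel=8, runner=run_command):
    """Create, destroy, and reset VFs `loops` times; return failed starts."""
    failed_starts = []
    for _ in range(loops):
        failed_starts += run_batch(
            [["virsh", "start", d] for d in domains], parallel, runner)
        run_batch([["virsh", "destroy", d] for d in domains], parallel, runner)
        # Reset as given in the report: ip link set ethX vf Y vlan 0 mac 0
        for dev, vf in vfs:
            runner(["ip", "link", "set", dev, "vf", str(vf),
                    "vlan", "0", "mac", "0"])
    return failed_starts
```

        Passing a different "runner" (for example one that wraps each
        command in ssh to a remote host) would be one way to extend the
        sketch to several blades. Without the VF reset step, the report
        suggests the failure is unlikely to recur.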

################################################################
# Seeking a Solution:

There were several notes on the web indicating a similar problem was
seen in RHEL6 due to a problem with "libnl". I decided to look at
the RHEL7 libnl package and to see if there are newer
versions that may address this problem.

# Convenience library for kernel netlink sockets
# The rpm delivered with our VM509.00 load is:
libnl3-3.2.21-8.el7.x86_64.rpm

I found "libnl3-3.2.25-4.fc21.x86_64.rpm" on the web.

http://rpm.pbone.net/index.php3/stat/4/idpl/28329013/dir/fedora_21
                /com/libnl3-3.2.25-4.fc21.x86_64.rpm.html

        The changelog mentions "retry local port on ADDRINUSE (rh"
        so I thought it would be worth a try in lab to see if
        it has any impact on our problem symptoms.
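
        For illustration, the general idea behind "retry local port on
        ADDRINUSE" can be sketched as below. This is not libnl's
        implementation; the function name and the list of candidate
        ports are hypothetical. It only shows the pattern: if binding
        the netlink socket to one local port ID fails with EADDRINUSE,
        try another ID instead of giving up.

```python
import errno


def bind_with_retry(sock, candidate_ports, groups=0):
    """Bind `sock` to the first free netlink port ID in `candidate_ports`.

    Returns the port that was bound. EADDRINUSE moves on to the next
    candidate; any other error is raised unchanged.
    """
    for port in candidate_ports:
        try:
            # Python netlink addresses are (port_id, multicast_groups) tuples.
            sock.bind((port, groups))
            return port
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise
    # Every candidate was taken: report the same error the caller saw before.
    raise OSError(errno.EADDRINUSE, "no free netlink port among candidates")
```

        On Linux this could be called with a socket created by
        socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
        socket.NETLINK_ROUTE); NETLINK_ROUTE is protocol 0, matching
        the "protocol 0" in the log message.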

################################################################
# Testing with the libnl3-3.2.25 RPM

I reran my previous test but updated half of the hosts with this
newer libnl library. I downloaded "libnl3-3.2.25-4.fc21.x86_64.rpm"
and installed it on hosts 9-12. I did not modify the libnl library
on hosts 1-6. I rebooted hosts 9-12 to ensure nothing was running
that referenced the old libnl.

Then I reran my test. (I increased the number of loops this time)

        Loop 20 times:
                create the 48 VMs  (8 at a time in parallel)
                destroy the 48 VMs (8 at a time in parallel)
                reset the VF mac/vlan (ip link set ethX vf Y vlan 0 mac 0)

        We hit the VM create failure 8 times. Every single occurrence
        was on host01 through host06. The hosts with the updated
        libnl did not have any failures.
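
        As a rough sanity check on these numbers, the arithmetic below
        estimates how likely it would be to see zero failures on the
        updated hosts by chance alone. It is a sketch under stated
        assumptions: the per-start failure rate is taken from the first
        test (5 failures in 5 loops of 48 starts), each updated host
        (9-12) is assumed to run 4 VMs per loop, and starts are treated
        as independent. None of these figures beyond the counts quoted
        in the report are measured values.

```python
def failure_rate(failures, starts):
    """Observed fraction of 'virsh start' attempts that failed."""
    return failures / starts


def prob_zero_failures(rate, starts):
    """Chance of seeing no failure in `starts` independent attempts."""
    return (1.0 - rate) ** starts


# First test: 5 loops of 48 VM starts, 5 failures in total.
first_rate = failure_rate(5, 5 * 48)

# Assumption: hosts 9-12 (4 hosts), 4 VMs each, 20 loops.
updated_starts = 4 * 4 * 20

p_clean = prob_zero_failures(first_rate, updated_starts)
print(f"observed rate: {first_rate:.4f} per start")
print(f"chance of a clean run at that rate: {p_clean:.4%}")
```

        The observed rate of 5 in 240 is 1 in 48, in line with the
        "about 1 out of 50" estimate, and under these assumptions a
        clean run on the updated hosts would be unlikely if the newer
        libnl3 had no effect.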

################################################################

Comment 2 Thomas Haller 2015-08-14 16:41:40 UTC
Created attachment 1063105 [details]
[PATCH] dist-git patch with backports from upstream

Took a whole bunch of patches related to this issue.

Comment 4 Thomas Haller 2015-08-24 16:18:25 UTC
Created attachment 1066510 [details]
python script to test the backports

Attached a script that tests several implications of the backports.



Call as
  python libnl3-test-rh1249158.py a
  python libnl3-test-rh1249158.py d
  python libnl3-test-rh1249158.py c
  python libnl3-test-rh1249158.py d


See that libnl3-3.2.21-8.el7 fails in all cases.
See that libnl3-3.2.21-9.el7 passes.


(All these patches are also on master in upstream libnl3, so master passes too.)

Comment 9 errata-xmlrpc 2015-11-19 14:52:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2105.html