Description of problem:
On rare occasions, 'virsh start' fails for a customer on RHEL 7.1 with messages like these:
Jul 30 17:45:13 hplcp051-host02 journal: cannot connect to netlink socket with protocol 0: Address already in use
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open file '/var/run/libvirt/hostdevmgr/eth0_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open file '/var/run/libvirt/qemu/eth0_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open file '/var/run/libvirt/hostdevmgr/eth1_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open file '/var/run/libvirt/qemu/eth1_vf7': No such file or directory
A web search for these messages found multiple reports of similar issues on RHEL 6.* loads, with the suggestion that the bug was in libnl and that updating to a newer libnl version fixes the issue.
A developer in the customer's organization found that the libnl3 included in RHEL 7.1 does have the issue: he can reproduce it with the official libnl3 and can NOT reproduce it with a private, newer libnl3.
The customer is requesting a libnl3 update in RHEL 7.1 that includes the fix identified in the changelog as follows:
* Thu May 22 2014 Thomas Haller <thaller> - 3.2.24-3
- retry local port on ADDRINUSE (rh #1097175)
The following is a detailed mail by their developer:
# Problem Description:
=======================
The intermittent problem we are trying to solve is that "virsh start" fails
about 1 out of 50 times in a very specific scenario. This has so far only
been seen in our "high_speed" VM configuration which has 16 bridges and
4 sriov interfaces. We start the VMs in parallel 8 at a time, usually
4 on a blade. This problem has only been seen on RHEL7.
The error messages are:
# Our cm log:
2015-07-30 17:45:27.93 sc_vm_mgmt: ERROR: failure on:
ssh host02 /opt/vcp/sbin/vm_mgmt -t create -w 180 -v 13-s04c02h0
-f 1-11GB ERROR: vm_mgmt virsh start 13-s04c02h0, failure=1,
output=error: Failed to start domain 13-s04c02h0
# /var/log/messages:
Jul 30 17:45:13 hplcp051-host02 journal: cannot connect to netlink socket
with protocol 0: Address already in use
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open
file '/var/run/libvirt/hostdevmgr/eth0_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open
file '/var/run/libvirt/qemu/eth0_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open
file '/var/run/libvirt/hostdevmgr/eth1_vf7': No such file or directory
Jul 30 17:45:13 hplcp051-host02 journal: Failed to open
file '/var/run/libvirt/qemu/eth1_vf7': No such file or directory
################################################################
# Recreating the Problem:
To reproduce this issue, we need to create and destroy VMs repeatedly.
This is needed because the failure rate is quite low. We are unlikely
to hit this problem if the sample size is too small.
We also discovered that the problem typically does not occur on
subsequent VM creates unless we clear the mac/vlan information
for the VF before reattempting the create. Discovering this was key
to our ability to reproduce the problem.
This is an important issue because customers transitioning to
this software from an older release will start all of their VMs in
parallel for the first time on this load. This is the case when this
problem was first reported. We did not originally recognize that
subsequent VM creates were unlikely to hit this problem, so we were
initially unable to reproduce the problem.
We ran this test in hp051 (c7000, 12 Gen8 blades).
Four "high_speed" VMs were defined on each host. Then we
ran a script to:
Loop 5 times:
create the 48 VMs (8 at a time in parallel)
destroy the 48 VMs (8 at a time in parallel)
reset the VF mac/vlan (ip link set ethX vf Y vlan 0 mac 0)
We hit the VM create failure 5 times in total.
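
The loop above can be sketched in Python roughly as follows. This is a minimal sketch and not the script used in the report: it assumes everything runs locally on one host (the report drives 12 hosts over ssh through its own vm_mgmt tool), the function names are hypothetical, and the all-zero MAC address is an assumed expansion of the "mac 0" shorthand in the report.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run_cmd(argv):
    """Run one command and report whether it exited with status 0."""
    return subprocess.run(argv, capture_output=True).returncode == 0


def run_parallel(commands, width, run=run_cmd):
    """Run commands `width` at a time and return the ones that failed."""
    failed = []
    for batch in batches(commands, width):
        with ThreadPoolExecutor(max_workers=width) as pool:
            results = list(pool.map(run, batch))
        failed += [cmd for cmd, ok in zip(batch, results) if not ok]
    return failed


def stress(vms, vfs, loops, width=8, run=run_cmd):
    """Create and destroy all VMs `loops` times, resetting the VFs each pass.

    `vfs` is a list of (device, vf_index) pairs. Returns the failed
    "virsh start" commands, one entry per failure.
    """
    create_failures = []
    for _ in range(loops):
        starts = [["virsh", "start", vm] for vm in vms]
        create_failures += run_parallel(starts, width, run)
        run_parallel([["virsh", "destroy", vm] for vm in vms], width, run)
        for dev, vf in vfs:
            # Assumed expansion of "ip link set ethX vf Y vlan 0 mac 0".
            run(["ip", "link", "set", dev, "vf", str(vf),
                 "vlan", "0", "mac", "00:00:00:00:00:00"])
    return create_failures
```

Passing `run` as a parameter keeps the batching logic testable without libvirt; on a real host the default `run_cmd` executes the commands directly.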
################################################################
# Seeking a Solution:
There were several notes on the web indicating a similar problem was
seen in RHEL6 due to a problem with "libnl". I decided to look at
the RHEL7 libnl package to see if there are newer
versions that may address this problem.
# Convenience library for kernel netlink sockets
# The rpm delivered with our VM509.00 load is:
libnl3-3.2.21-8.el7.x86_64.rpm
I found "libnl3-3.2.25-4.fc21.x86_64.rpm" on the web.
http://rpm.pbone.net/index.php3/stat/4/idpl/28329013/dir/fedora_21
/com/libnl3-3.2.25-4.fc21.x86_64.rpm.html
The changelog mentions "retry local port on ADDRINUSE (rh"
so I thought it would be worth a try in lab to see if
it has any impact on our problem symptoms.
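
For illustration, the general idea behind "retry local port on ADDRINUSE" can be sketched as below. This is not libnl's code: the helper names are hypothetical, and the port numbering (process id in the low 22 bits, a counter in the upper bits) is only an assumption chosen to show how several sockets in one process can get distinct local ports.

```python
import errno


def candidate_ports(pid, slots=1024):
    """Yield candidate local port ids for one process.

    Assumed scheme for illustration: pid in the low 22 bits,
    a per-socket counter in the upper 10 bits.
    """
    base = pid & 0x3FFFFF
    for n in range(slots):
        yield base | (n << 22)


def bind_with_retry(bind, ports):
    """Try each port in turn, skipping the ones that are already in use.

    `bind` is any callable that binds to one port and raises OSError on
    failure. Errors other than EADDRINUSE are propagated unchanged.
    """
    for port in ports:
        try:
            bind(port)
            return port
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise
    raise OSError(errno.EADDRINUSE, "no free local port among the candidates")
```

On Linux this could be wired to a real socket with something like `sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)` and `bind_with_retry(lambda p: sock.bind((p, 0)), candidate_ports(os.getpid()))`; NETLINK_ROUTE is protocol 0, the protocol named in the error message. Without the retry, the first EADDRINUSE is returned to the caller, which matches the failure seen in the logs.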
################################################################
# Testing with the libnl3-3.2.25 RPM
I reran my previous test but updated half of the hosts with this
newer libnl library. I downloaded "libnl3-3.2.25-4.fc21.x86_64.rpm"
and installed it on hosts 9-12. I did not modify the libnl library
on hosts 1-6. I rebooted hosts 9-12 to ensure nothing was running
that referenced the old libnl.
Then I reran my test. (I increased the number of loops this time)
Loop 20 times:
create the 48 VMs (8 at a time in parallel)
destroy the 48 VMs (8 at a time in parallel)
reset the VF mac/vlan (ip link set ethX vf Y vlan 0 mac 0)
We hit the VM create failure 8 times. Every single occurrence
was on host01 through host06. The hosts with the updated
libnl did not have any failures.
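
As a rough check of how strong this result is, the numbers in the two runs can be turned into failure rates. The sketch below uses only the counts stated above (48 VMs, 4 per host, hosts 1-6 unmodified, hosts 9-12 updated) and assumes, as a simplification, that every create fails independently with the same probability.

```python
def failure_rate(failures, creates):
    """Fraction of create attempts that failed."""
    return failures / creates


# First run: 5 loops over 48 VMs, 5 failed creates.
run1_creates = 5 * 48
run1_rate = failure_rate(5, run1_creates)

# Second run, unmodified hosts 1-6: 6 hosts x 4 VMs x 20 loops, 8 failures.
old_creates = 6 * 4 * 20
old_rate = failure_rate(8, old_creates)

# Second run, updated hosts 9-12: 4 hosts x 4 VMs x 20 loops, 0 failures.
new_creates = 4 * 4 * 20

# Probability of seeing no failure at all on the updated hosts if they
# still failed at the rate measured on the unmodified hosts.
p_zero = (1 - old_rate) ** new_creates

print(f"run 1: {run1_rate:.4f} failures per create")
print(f"run 2, old libnl: {old_rate:.4f} failures per create")
print(f"chance of 0 failures in {new_creates} creates at that rate: {p_zero:.4f}")
```

The first run works out to 1 failure in 48 creates, in line with the "about 1 out of 50" estimate, and under the independence assumption a clean result on the updated hosts would be unlikely (well under 1%) if the new library had no effect.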
################################################################
Created attachment 1066510
python script to test the backports
Attached a script that tests several implications of the backports.
Call as
python libnl3-test-rh1249158.py a
python libnl3-test-rh1249158.py d
python libnl3-test-rh1249158.py c
python libnl3-test-rh1249158.py d
Note that the script fails in all cases on libnl3-3.2.21-8.el7
and passes on libnl3-3.2.21-9.el7.
(All of these patches are also on master in upstream libnl3, so master passes too.)
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2015-2105.html