Bug 955429

Summary: displayNetwork must have an IP address on host
Product: Red Hat Enterprise Virtualization Manager Reporter: Kevein Liu <yaliu>
Component: ovirt-engine Assignee: Moti Asayag <masayag>
Status: CLOSED ERRATA QA Contact: GenadiC <gcheresh>
Severity: low Docs Contact:
Priority: low    
Version: 3.1.3 CC: acathrow, cwei, danken, dyuan, gcheresh, iheim, jkt, kcleveng, lbopf, lpeer, lyarwood, masayag, mhuth, mkalinin, mprivozn, myakove, mzhan, Rhev-m-bugs, s.kieske, sputhenp, tdosek, ydu, yeylon
Target Milestone: --- Keywords: Triaged
Target Release: 3.4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: network
Fixed In Version: av3 Doc Type: Bug Fix
Doc Text:
Previously, virtual machines failed to start with the error 'libvirtError: internal error ifname "vnet20" not in key map'. This happened because the display network to which the virtual machine was assigned did not have an IP address configured on the host. Now, the engine blocks "setupNetwork" of a display network with no address, and the scheduler attempts to start virtual machines only on hosts on which the display network is configured with an IP address.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-06-09 14:58:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1019461, 1078909, 1142926    
Attachments:
Description Flags
Problematic OVF file none

Description Kevein Liu 2013-04-23 03:28:05 UTC
Description of problem:

When trying to start a VM, it failed with the following error:
~~~
Thread-314::ERROR::2013-04-22 08:27:18,770::vm::680::vm.Vm::(_startUnderlyingVm) vmId=`cb5e3ac6-f351-4708-9729-c5287f991783`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 642, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1475, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2645, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error ifname "vnet20" not in key map
~~~

Version-Release number of selected component (if applicable):
* libvirt-0.10.2-18.el6_4.2.x86_64
* vdsm-4.10.2-1.8.el6ev.x86_64
* rhevm-3.1.0-43.el6ev.noarch

How reproducible:
Because the issue occurs only occasionally, we don't know how to reproduce it at this time.

Steps to Reproduce:
N/A
  
Actual results:
VM failed to start 

Expected results:
VM should start successfully.

Additional info:
The customer has uploaded the log collector output to our FTP server:
  ftp://dropbox.redhat.com/sosreport-LogCollector-m2-20130422090712-b1c2.tar.xz

Comment 2 yanbing du 2013-04-23 09:49:54 UTC
Hi Kevein,
I'm trying to reproduce this bug.
As there are no specific reproduction steps, I just tried to start a VM which has more than 20 NICs in RHEVM, but I can't reproduce it. :(
I'm using:
# rpm -q libvirt
libvirt-0.10.2-18.el6_4.4.x86_64
# rpm -q vdsm
vdsm-4.10.2-1.9.el6ev.x86_64

I've just attached the VM XML file; could you help check it?
BTW, could you please provide the domain XML of a VM which encounters this bug?
Thanks!

Comment 3 Michal Privoznik 2013-04-23 13:39:30 UTC
Unfortunately, I cannot access the logs. Kevein, can you please attach logs to the BZ? Hopefully, I will get more insight from the logs.

Comment 10 Kevein Liu 2013-04-24 10:03:03 UTC
Created attachment 739374 [details]
Problematic OVF file

Comment 12 Michal Privoznik 2013-04-24 13:21:33 UTC
I've managed to find and dig out libvirt logs. Here's a short snippet which is causing the trouble:

2013-04-22 00:25:15.071+0000: 43401: error : virNetDevGetIPv4Address:834 : Unable to get IPv4 address for interface vlan111: Cannot assign requested address
2013-04-22 00:25:15.071+0000: 43401: debug : virFileClose:72 : Closed fd 159
2013-04-22 00:25:15.071+0000: 43401: error : qemuBuildCommandLine:6130 : XML error: listen network 'vdsm-vlan111' had no usable address
2013-04-22 00:25:15.071+0000: 43401: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet20" not in key map

So there are two problems:
1) we are overwriting previously reported error
2) why doesn't "vdsm-vlan111" have any usable address

For the first problem I've just posted a patch:

https://www.redhat.com/archives/libvir-list/2013-April/msg01738.html
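
The first problem can be illustrated with a small sketch. This is not the posted patch; the `ErrorReporter` class and `start_vm` function are hypothetical stand-ins for a "last error" slot and a start path whose cleanup reports its own error. It only shows why the caller saw the "not in key map" message instead of the real cause, and how saving and restoring the first error avoids that.

```python
class ErrorReporter:
    """Hypothetical 'last error' slot: each report replaces the previous one."""

    def __init__(self):
        self.last_error = None

    def report(self, message):
        self.last_error = message


def start_vm(reporter, preserve_first_error):
    """Simulate a failed start followed by a cleanup step that also fails."""
    # The real cause of the failure is reported first.
    reporter.report("listen network 'vdsm-vlan111' had no usable address")
    saved = reporter.last_error

    # Cleanup runs after the failure and reports a secondary error,
    # which overwrites the slot.
    reporter.report('ifname "vnet20" not in key map')

    if preserve_first_error:
        # Put the original error back so the caller sees the real cause.
        reporter.last_error = saved
    return reporter.last_error


print(start_vm(ErrorReporter(), preserve_first_error=False))
print(start_vm(ErrorReporter(), preserve_first_error=True))
```

With `preserve_first_error=False` the caller sees only the secondary error, which matches what vdsm logged in the traceback.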

For the second problem, unfortunately, there's not an XML of the network in the logs so I don't know why it doesn't have any usable address. Kevein, Dan and others - do you have any bright idea in case 'virsh net-dumpxml vdsm-vlan111' doesn't work (even if it does - are we guaranteed it is the very same network?). I think the best solution is to gather logs immediately when the error occurs again. And by logs I mean not only libvirt/vdsm logs, but routing table, iptables, ebtables listings as well.
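
A minimal sketch of how the data listed above could be gathered in one go when the error occurs again. The command list is an assumption: it covers the routing table, iptables, ebtables and the network XML mentioned above, plus `ip addr show`, which is an addition not requested in this comment. The function names are hypothetical, and the script would need to run as root on the hypervisor.

```python
import subprocess

# Commands to capture; "vdsm-vlan111" is the network name from the libvirt log.
DIAGNOSTIC_COMMANDS = {
    "routes": ["ip", "route", "show"],
    "addresses": ["ip", "addr", "show"],  # addition, not in the original request
    "iptables": ["iptables-save"],
    "ebtables": ["ebtables", "-L"],
    "network_xml": ["virsh", "-r", "net-dumpxml", "vdsm-vlan111"],
}


def run_command(argv):
    """Run one command and return its combined output, or a note if it could not run."""
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "could not run {}: {}".format(" ".join(argv), exc)
    return proc.stdout


def collect_diagnostics(commands=DIAGNOSTIC_COMMANDS, runner=run_command):
    """Return a dict mapping each label to the output of its command."""
    return {label: runner(argv) for label, argv in commands.items()}


if __name__ == "__main__":
    for label, output in collect_diagnostics().items():
        print("===== {} =====".format(label))
        print(output)
```

Running it right after a failed VM start keeps the network state and the libvirt/vdsm logs from the same moment together.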

Comment 17 Dan Kenigsberg 2013-04-25 10:23:36 UTC
(In reply to comment #16)
> 
> On[e] thing is vlan111 doesn't have any IPv4 address
> assigned, the other is if it should have one. But I think, once we find
> setupNetwork we will know the XML immediately, isn't that right Dan?

Correct. My guess is that the vlan111 network was configured on host with no IP address. The problem is twofold:

1. Engine should block setupNetwork of display network with no address.

2. Engine should avoid starting VMs on hosts whose displayNetwork has no IP address, for example hosts that somehow lost their address (e.g. because of a bad DHCP server).
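
The two items can be sketched as follows. This is not engine code: the `NetworkOnHost` and `Host` types and both function names are hypothetical, and the rule that DHCP counts as acceptable at setup time is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NetworkOnHost:
    boot_protocol: str = "none"      # "static", "dhcp" or "none"
    address: Optional[str] = None    # IPv4 address last reported by the host


@dataclass
class Host:
    name: str
    networks: Dict[str, NetworkOnHost] = field(default_factory=dict)


def check_setup_network(name, config, display_network):
    """Item 1: return an error string if the display network would get no address."""
    if name != display_network:
        return None
    if config.boot_protocol == "dhcp":
        return None
    if config.boot_protocol == "static" and config.address:
        return None
    return "display network %r needs a static address or DHCP" % name


def hosts_with_display_address(hosts, display_network):
    """Item 2: keep only hosts that currently report an address on the display network."""
    usable = []
    for host in hosts:
        net = host.networks.get(display_network)
        if net is not None and net.address:
            usable.append(host)
    return usable
```

Item 1 is a one-time check at configuration time; item 2 has to be evaluated on every VM start, because a DHCP-configured host can lose its address later.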

> 
> BTW any reason for these comments to be private?

Privacy is viral :-(

Comment 21 Kevein Liu 2013-04-26 07:01:49 UTC
Hi,

This issue happened again. Could anyone provide a checklist of the information I should collect for further investigation?

Thank you!

Comment 22 Dan Kenigsberg 2013-04-27 20:14:23 UTC
(In reply to comment #21)
> This issue happened again. Could anyone provide a checklist of the
> information I should collect for further investigation?

What is their cluster displayNetwork? Still this vlan111 network? What is the IP configuration for this network on the cluster hosts (static/dhcp/none)? And in particular, on the host that fails to start the VM?

The admin has to ensure that the displayNetwork has an IP address on each and every host.
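
One way an admin could check this on a host is sketched below. The "Cannot assign requested address" text in the libvirt log is the message for errno EADDRNOTAVAIL, which Linux returns from the SIOCGIFADDR ioctl when an interface has no IPv4 address; the sketch performs a check in the same spirit. The function name is hypothetical, the code is Linux-only, and the interface name is the one from the logs.

```python
import errno
import fcntl
import socket
import struct

SIOCGIFADDR = 0x8915  # Linux ioctl number for "get interface IPv4 address"


def ipv4_address(ifname):
    """Return the primary IPv4 address of ifname, or None if it has none."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # struct ifreq starts with the interface name (at most 15 bytes plus NUL).
        request = struct.pack("256s", ifname.encode()[:15])
        try:
            reply = fcntl.ioctl(sock.fileno(), SIOCGIFADDR, request)
        except OSError as exc:
            # EADDRNOTAVAIL: interface exists but has no IPv4 address.
            # ENODEV: no such interface.
            if exc.errno in (errno.EADDRNOTAVAIL, errno.ENODEV):
                return None
            raise
        # Bytes 20..23 of the reply hold sin_addr of the returned sockaddr_in.
        return socket.inet_ntoa(reply[20:24])
    finally:
        sock.close()


if __name__ == "__main__":
    address = ipv4_address("vlan111")
    print("vlan111:", address if address else "no IPv4 address")
```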

Comment 29 Michal Privoznik 2013-04-29 10:41:07 UTC
Mark et al.,

I am still not fully convinced where the real bug is. I know we've moved from libvirt to ovirt-engine, but I'd like to be 100% sure. Which means, we need logs from network setup process. Do you think it is possible to gather logs from setupNetwork command in vdsm.log? I still can't find it anywhere. I know we have thousands of logs here, but none of them contains that kind of info.

I just want to make sure somebody really did start a network without an IP address. The other possibility is that the network was started with an IP address assigned, but something has taken it away. Either libvirt itself, or ...

Comment 30 Mark Huth 2013-04-30 01:00:52 UTC
Hi Michal,

I did a grep of all the vdsm.log* files on the hypervisor and setupNetwork wasn't matched in any of them (and I'm sure the relevant logs hadn't been rotated away).

From the rhev-prio-list "New critical issues from China Zhuji" email thread ...

<thread>
> Then customer found the "rhevm" is not the Display network, and made
> it as Display network. So the issue was repaired.

So if I understand correctly the original conclusion of engineering is
correct and what should be fixed is to avoid (and ATM I don't say how)
running Virtual Machines on a host whose Display-Network does
not have an IP + report this problematic status to the user.
</thread>

We are not sure how the display network was changed in the RHEVM WebUI. The customer is sure they didn't change it away from rhevm, but when they changed it back to the rhevm network, the problem was resolved (i.e. VMs could be started again). So it seems the problem was on the RHEVM side, in that it was passing the wrong displayNetwork to the hypervisor and thus preventing VMs from starting.

I hope that information is helpful.  If not, please let me know.

-- Mark

Comment 31 Moti Asayag 2013-08-22 12:56:26 UTC
(In reply to Dan Kenigsberg from comment #17)
> (In reply to comment #16)
> > 
> > On[e] thing is vlan111 doesn't have any IPv4 address
> > assigned, the other is if it should have one. But I think, once we find
> > setupNetwork we will know the XML immediately, isn't that right Dan?
> 
> Correct. My guess is that the vlan111 network was configured on host with no
> IP address. The problem is twofold:
> 
> 1. Engine should block setupNetwork of display network with no address.
> 

This can be achieved by requiring a Static or DHCP boot protocol for the display network. I'd also suggest requiring it in the attach/update network API of the engine, which seems to be commonly used by customers.

> 2. Engine should avoid starting VMs whose displayNetwork has no IP address
> on hosts that somehow lost their address (i.e. bad dhcp server)
> 

This is trickier, since the notorious race between the DHCP server's response and the getVdsCaps issued after the network command completes may occur, and we might block the operation when we shouldn't. We could, however, consider a specific refreshCapabilities call to verify the actual address of the display network when none is reported. However, when running multiple VMs at once, that will cost the host more resources. Once Bug 999947 is implemented, it will be simpler to validate that the display network is properly configured with an IP address.
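
A rough sketch of the "re-check before blocking" idea, to make the race concrete. Nothing here is engine code: `wait_for_display_address` and its `get_address` callback are hypothetical, with the callback standing in for a capabilities refresh that returns the display network's address or None. The retry count and delay are arbitrary.

```python
import time


def wait_for_display_address(get_address, attempts=5, delay=2.0, sleep=time.sleep):
    """Poll get_address() until it returns an address or the attempts run out.

    Returns the address, or None if the display network still has no
    address after the last attempt.
    """
    for attempt in range(attempts):
        address = get_address()
        if address:
            return address
        if attempt < attempts - 1:
            # Give the DHCP server time to answer before asking again.
            sleep(delay)
    return None


if __name__ == "__main__":
    # Simulated host: the DHCP lease shows up on the third refresh.
    replies = iter([None, None, "10.0.111.5"])
    print(wait_for_display_address(lambda: next(replies), delay=0.1))
```

Each poll is an extra round trip to the host, which is the resource cost mentioned above when many VMs are started at once.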

> > 
> > BTW any reason for these comments to be private?
> 
> Privacy is viral :-(

Comment 32 lpeer 2013-08-27 11:44:33 UTC
Lowering priority given comment #31.
Following the best practice of having an IP address on the display network should work.

Comment 33 Moti Asayag 2014-03-09 13:57:39 UTC
With the suggested fix, a host which has no boot protocol configured for its display network will not be selected by the scheduler to run VMs.

Comment 34 GenadiC 2014-03-18 09:28:48 UTC
Verified in AV3 that the VM is started when the display network has a configured IP address, and fails on CanDoAction when the display network doesn't have an IP address.

Comment 35 errata-xmlrpc 2014-06-09 14:58:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0506.html