Description of problem:
An attempt to start a VM failed with the following error:
Thread-314::ERROR::2013-04-22 08:27:18,770::vm::680::vm.Vm::(_startUnderlyingVm) vmId=`cb5e3ac6-f351-4708-9729-c5287f991783`::The vm start process failed
Traceback (most recent call last):
File "/usr/share/vdsm/vm.py", line 642, in _startUnderlyingVm
File "/usr/share/vdsm/libvirtvm.py", line 1475, in _run
File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2645, in createXML
if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error ifname "vnet20" not in key map
Version-Release number of selected component (if applicable):
The issue happens only occasionally, so we don't know how to reproduce it at this time.
Steps to Reproduce:
Unknown; the issue is intermittent.

Actual results:
VM fails to start.

Expected results:
VM starts successfully.
Customer has uploaded the Log-collector to our ftp server:
I'm trying to reproduce this bug.
As there are no specific reproduction steps, I just tried starting a VM with more than 20 NICs in RHEVM, but couldn't reproduce it. :(
# rpm -q libvirt
# rpm -q vdsm
I just attached the VM XML file; could you help check it?
BTW, could you please provide the domain XML that hit this bug?
Unfortunately, I cannot access the logs. Kevein, can you please attach logs to the BZ? Hopefully, I will get more insight from the logs.
Created attachment 739374 [details]
Problematic OVF file
I've managed to find and dig out libvirt logs. Here's a short snippet which is causing the trouble:
2013-04-22 00:25:15.071+0000: 43401: error : virNetDevGetIPv4Address:834 : Unable to get IPv4 address for interface vlan111: Cannot assign requested address
2013-04-22 00:25:15.071+0000: 43401: debug : virFileClose:72 : Closed fd 159
2013-04-22 00:25:15.071+0000: 43401: error : qemuBuildCommandLine:6130 : XML error: listen network 'vdsm-vlan111' had no usable address
2013-04-22 00:25:15.071+0000: 43401: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet20" not in key map
So there are two problems:
1) we are overwriting a previously reported error
2) why doesn't "vdsm-vlan111" have any usable address?
For the first problem I've just posted a patch:
For the second problem, unfortunately, there's no XML of the network in the logs, so I don't know why it doesn't have any usable address. Kevein, Dan and others - do you have any bright ideas in case 'virsh net-dumpxml vdsm-vlan111' doesn't work (and even if it does - are we guaranteed it is the very same network?)? I think the best solution is to gather logs immediately when the error occurs again. And by logs I mean not only libvirt/vdsm logs, but the routing table, iptables and ebtables listings as well.
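As a small illustration of the missing-address condition, here is a hedged sketch (my own helper, not part of libvirt or vdsm) that checks whether canned `ip -4 addr show <iface>` output contains an IPv4 address - the same condition behind the virNetDevGetIPv4Address error above:

```python
import re

def iface_ipv4(ip_addr_output):
    """Return the first IPv4 address found in `ip -4 addr show <iface>`
    output, or None when the interface has no address -- the condition
    that makes virNetDevGetIPv4Address fail.
    (Hypothetical helper for illustration only.)"""
    match = re.search(r"\binet (\d+(?:\.\d+){3})/\d+", ip_addr_output)
    return match.group(1) if match else None

# Example with canned output (not a live query):
sample = ("4: vlan111: <BROADCAST,MULTICAST,UP> mtu 1500\n"
          "    inet 10.0.111.5/24 brd 10.0.111.255 scope global vlan111\n")
print(iface_ipv4(sample))   # -> 10.0.111.5
print(iface_ipv4("4: vlan111: <BROADCAST,MULTICAST,UP> mtu 1500\n"))  # -> None
```

On a broken host, running `ip -4 addr show vlan111` next to `virsh net-dumpxml vdsm-vlan111` would show whether the address was ever there.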
(In reply to comment #16)
> On[e] thing is vlan111 doesn't have any IPv4 address
> assigned, the other is if it should have one. But I think, once we find
> setupNetwork we will know the XML immediately, isn't that right Dan?
Correct. My guess is that the vlan111 network was configured on host with no IP address. The problem is twofold:
1. Engine should block setupNetwork of display network with no address.
2. Engine should avoid starting VMs whose displayNetwork has no IP address on hosts that somehow lost their address (i.e. bad dhcp server)
> BTW any reason for these comments to be private?
Privacy is viral :-(
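Dan's point 2 above could be sketched roughly as follows. All names here (the exception class, the function, the shape of the capabilities dict) are hypothetical illustrations, not the actual ovirt-engine or vdsm API:

```python
class DisplayNetworkError(Exception):
    """Hypothetical error type for this sketch."""

def validate_display_network(host_nets, display_network):
    """Block a VM start when the candidate host has no IP address on the
    cluster's display network.  `host_nets` imitates a getVdsCaps-style
    map of network name -> details; the shape and all names are
    assumptions, not the real engine/vdsm API."""
    net = host_nets.get(display_network)
    if net is None:
        raise DisplayNetworkError(
            "display network %r is not configured on this host" % display_network)
    if not net.get("addr"):
        raise DisplayNetworkError(
            "display network %r has no IP address on this host" % display_network)

# Example: rhevm has an address, vlan111 does not.
caps = {"rhevm": {"addr": "192.168.1.10"}, "vlan111": {"addr": ""}}
validate_display_network(caps, "rhevm")        # passes silently
try:
    validate_display_network(caps, "vlan111")
except DisplayNetworkError as e:
    print(e)
```

In this bug's terms, a check like this would have refused to start the VM on the host where vlan111 lost its address, instead of failing deep inside libvirt.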
This issue happened again. Could anyone provide a checklist of the information I should gather for further investigation?
(In reply to comment #21)
> This issue happened again. Could anyone provide a checklist of the
> information I should gather for further investigation?
What is their cluster displayNetwork? Still this vlan111 network? What is the IP configuration for this network on the cluster hosts (static/dhcp/none)? And in particular, on the host that fails to start the VM?
The admin has to ensure that the displayNetwork has an IP address on each and every host.
Mark et al.,
I am still not fully convinced where the real bug is. I know we've moved from libvirt to ovirt-engine, but I'd like to be 100% sure. Which means we need logs from the network setup process. Do you think it is possible to gather logs from the setupNetwork command in vdsm.log? I still can't find it anywhere. I know we have thousands of logs here, but none of them contains that kind of info.
I just want to make sure somebody really did start a network without an IP address. The other possibility is that the network was started with an IP address assigned, but something took it away. Either libvirt itself, or ...
I did a grep of all the vdsm.log* files on the hypervisor and setupNetwork wasn't matched in any of them (and I'm sure the relevant logs hadn't been rotated away).
From the rhev-prio-list "New critical issues from China Zhuji" email thread ...
> Then customer found the "rhevm" is not the Display network, and made
> it as Display network. So the issue was repaired.
So if I understand correctly, the original conclusion of engineering is
correct, and what should be fixed is to avoid (and ATM I don't say how)
running Virtual Machines on a host whose Display Network does not have
an IP, and to report this problematic status to the user.
We are not sure how the display network was changed in the RHEVM WebUI. The customer is sure they didn't change it away from rhevm, but when they changed it back to the rhevm network, the problem was resolved (i.e. VMs could be started again). So it seems the problem was on the RHEVM side, in that it was passing the wrong displayNetwork to the hypervisor, thus preventing VMs from starting.
I hope that information is helpful. If not, please let me know.
(In reply to Dan Kenigsberg from comment #17)
> (In reply to comment #16)
> > On[e] thing is vlan111 doesn't have any IPv4 address
> > assigned, the other is if it should have one. But I think, once we find
> > setupNetwork we will know the XML immediately, isn't that right Dan?
> Correct. My guess is that the vlan111 network was configured on host with no
> IP address. The problem is twofold:
> 1. Engine should block setupNetwork of display network with no address.
This can be achieved by requiring a Static or DHCP boot protocol for the display network. I'd also suggest requiring it in the engine's attach/update network API, which customers seem to use commonly.
> 2. Engine should avoid starting VMs whose displayNetwork has no IP address
> on hosts that somehow lost their address (i.e. bad dhcp server)
This is trickier, since the notorious race between the DHCP server's response and the getVdsCaps call issued after the network command completes may occur, and we might block the operation when we shouldn't. We can, however, consider a specific refreshCapabilities call to verify the actual address of the display network when it is missing. However, running multiple VMs at once will then cost the host more resources. When Bug 999947 is implemented, it will be simpler to validate that the display network is properly configured with an IP address.
> > BTW any reason for these comments to be private?
> Privacy is viral :-(
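The re-query idea above (polling host capabilities to dodge the DHCP race) could look roughly like this. `get_caps` stands in for whatever getVdsCaps/refreshCapabilities wrapper the engine would use; the function and the capabilities shape are my assumptions, not the real API:

```python
import time

def wait_for_display_address(get_caps, display_network,
                             timeout=30.0, interval=2.0, sleep=time.sleep):
    """Re-poll host capabilities until the display network reports an
    IP address, to paper over the DHCP-response race discussed above.
    `get_caps` is any callable returning a map of network name ->
    details (hypothetical stand-in for getVdsCaps).  Returns the
    address, or None if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while True:
        addr = get_caps().get(display_network, {}).get("addr")
        if addr:
            return addr
        if time.monotonic() >= deadline:
            return None
        sleep(interval)
```

To avoid the per-VM resource cost mentioned above, a check like this would more plausibly run once per host activation than on every VM start.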
lowering priority given comment#31.
Following the best practice of having an IP address on the display network should work.
With the suggested fix, a host which has no boot protocol configured for its display network will be selected by the scheduler to run VMs.
Verified in AV3 that the VM starts when the display network has an IP configured, and fails with a CanDoAction error when the display network has no IP.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.