Created attachment 696686
tcpdump and console log
On a local installation of OpenStack on RHEL 6.4 beta, I have several images, and networking works on all of them when libvirt_use_virtio_for_bridges=False.
When I set libvirt_use_virtio_for_bridges=True, networking stops working for one of my images: the VM just keeps sending out DHCP requests.
The image in question is an Ubuntu Precise image, which was sourced here
Attached is the output of network traffic from the VM as it boots up, and the tail of the console log.
I have seen rumours before of certain Ubuntu releases having fubar virtio support, which I believe is why this flag was created in the first place. No one has ever been able to tell me which versions were fubar, though. I'm surprised to see trouble with such a new Ubuntu release. Might want to raise this with the Ubuntu maintainers.
FYI, this BZ is tracking the RFE for configuring the NIC model per guest image: https://bugzilla.redhat.com/show_bug.cgi?id=887968
This will allow for something like this:
# glance image-update \
--property vif_model=e1000 \
Derek, can you check with the Ubuntu maintainers and verify that this is just a bug in their virtio support? Sounds like it's not a nova bug here.
Email sent, they are looking into it.
I'm not able to recreate this locally. I've verified functional virtio guest network interfaces in 3 different ways:
A.) Run an Ubuntu 12.10 instance, and run devstack inside that, using nested KVM and libvirt to boot the latest released 12.04 image (20130222). Kernel/libvirt versions:
For full setup information, I did:
1.) boot instance with user-data from 
2.) modify /etc/nova/nova.conf, adding 'libvirt_use_virtio_for_bridges=true' to the 'DEFAULT' section (I think it is the default, though). Then kill and restart nova-network and nova-net.
3.) download amd64 disk1.img at , then glance upload, and start an instance
4.) ssh into instance, then:
$ ethtool -i eth0 | grep driv
5.) verify that libvirt.xml for that instance has:
B.) Launch an instance in an internal Canonical OpenStack that we have. It runs the Ubuntu Cloud Archive versions of OpenStack, with 12.04 kernels and libvirt. Inside that instance:
$ ethtool -i eth0 | head -n 2
C.) Launch a 12.04 instance on HP's public cloud, and ssh to it.
Image was: 67074 | Ubuntu Precise 12.04 LTS Server 64-bit 20121026
$ ethtool -i eth0 | head -n 2
$ uname -r
So it seems that virtio networking is not simply broken. It's working in several different scenarios.
Did I miss something?
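For anyone repeating these checks across several guests, the driver lookup can be scripted. A minimal sketch, assuming `ethtool -i` prints a "driver: <name>" line and that the Linux virtio NIC driver reports itself as virtio_net; the function names nic_driver and is_virtio are made up here:

```shell
#!/bin/sh
# Sketch only: nic_driver and is_virtio are hypothetical helper names.

# Reads `ethtool -i <iface>` output on stdin and prints the value of
# the "driver:" line, e.g. "virtio_net" or "e1000".
nic_driver() {
    awk -F': *' '$1 == "driver" { print $2; exit }'
}

# Succeeds (exit status 0) only if the driver read from stdin is virtio_net.
is_virtio() {
    [ "$(nic_driver)" = "virtio_net" ]
}
```

Inside a guest this would be used as `ethtool -i eth0 | is_virtio && echo "eth0 is a virtio NIC"`.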
Wow, this keeps coming back
With GSO and virtio_net, we avoid adding checksums to packets sent from the host to the guest.
This works fine for most applications, but dhclient is different: it uses a raw socket and rejects packets which don't have a valid checksum.
We've carried a patch for a long, long time to have dhclient look at auxdata for the CSUMNOTREADY flag and ignore the lack of a checksum if the flag is set:
As best I can tell from e.g. https://code.launchpad.net/~ubuntu-branches/ubuntu/precise/isc-dhcp/precise Ubuntu still doesn't have this patch. Upstream ISC is definitely partly to blame for that, but I thought this issue was well enough known that Ubuntu would have had the patch by now.
Now ... when you're using the userspace virtio_net implementation in qemu, you'll never notice the problem because of this hack:
i.e. qemu specifically added checksums to DHCP packets
With the newer vhost-net, that codepath isn't used. Instead, the virtio_net implementation is now in the kernel. And the kernel developers, the contrary lot that they are, refused to include the hack there:
Another detail: this is only an issue when the DHCP packet is sent from the host to the guest. If you're using an external DHCP server, it's not a problem. OpenStack manages its own DHCP server, as does libvirt's NATing virtual network, so both are affected.
libvirt uses a recently added --checksum-fill iptables mangle rule (thanks to mst for the hint here) to implement the equivalent of qemu's hack using iptables for this case:
Summary: we need to make OpenStack similarly use --checksum-fill
Ok, I've just confirmed this with Derek by adding this rule:
iptables -A POSTROUTING -t mangle -p udp --dport bootpc -j CHECKSUM --checksum-fill
and the Ubuntu VMs were able to get an IP again
The rule we want OpenStack to add should be more specific to the interface involved etc., but that's the general idea
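As a rough sketch of what a more interface-specific version of that rule could look like, the helper below only prints the iptables command rather than running it. The function name checksum_fill_rule and the bridge name br100 are made up for illustration; -o is the standard iptables option for matching the outgoing interface. This is not necessarily how the eventual OpenStack change is written.

```shell
#!/bin/sh
# Sketch only: checksum_fill_rule is a hypothetical helper name.
# It builds, but does not execute, an interface-scoped variant of the
# --checksum-fill rule tested above.
checksum_fill_rule() {
    dev="${1:-}"
    if [ -z "$dev" ]; then
        echo "usage: checksum_fill_rule <interface>" >&2
        return 1
    fi
    # -o restricts the match to packets leaving via the given interface;
    # --dport bootpc is UDP port 68, i.e. replies headed to DHCP clients.
    printf 'iptables -t mangle -A POSTROUTING -o %s -p udp --dport bootpc -j CHECKSUM --checksum-fill\n' "$dev"
}
```

For example, `checksum_fill_rule br100` prints the command for a bridge called br100 (an assumed name); it can be reviewed first and then run as root.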
Ok, need to file a bug against upstream OpenStack for this, covering both Grizzly and Folsom. Should also file a bug against Ubuntu's isc-dhcp package (dhclient), and probably Debian's too, since I presume Ubuntu inherits this flaw from the Debian packages.
Related ubuntu/openstack and debian bugs:
The isc-dhcp issue seems to be fixed in raring (bug 930962). I'll look into getting it SRU'd to 12.04 and even 12.10.
Could you confirm that ubuntu cloud images of raring do not exhibit this issue?
Just download one from http://cloud-images.ubuntu.com/server/raring .
This looks to me like it was fixed in nova trunk at https://review.openstack.org/#/c/18336/ .
(In reply to comment #10)
> This looks to me like it was fixed in nova trunk at
> https://review.openstack.org/#/c/18336/ .
Cool, thanks for pointing that out. We should consider proposing that for Folsom stable branch upstream too
Nice find Scott, thanks!
(/me should have grepped for checksum-fill :)
(In reply to comment #9)
> Could you confirm that ubuntu cloud images of raring do not exhibit this
> issue?
> Just download one from http://cloud-images.ubuntu.com/server/raring .
Confirmed, the problem is gone in raring.
(In reply to comment #11)
> (In reply to comment #10)
> > This looks to me like it was fixed in nova trunk at
> > https://review.openstack.org/#/c/18336/ .
> Cool, thanks for pointing that out. We should consider proposing that for
> Folsom stable branch upstream too
It is on the Folsom branch already: https://review.openstack.org/#/c/18450/
> using openstack-nova-compute-2012.2.2-9.el6ost.noarch
4bfc8f1165b05c2cc7c5506641b9b85fa8e1e144 is in 2012.2.3
please test with openstack-nova-2012.2.3-1.el6ost
RHEL 6.4, FED17 & Ubuntu, both on Quantum and Nova
ping and SSH to VMs
| ec417926-7229-4dc2-b64d-1d085a7f7fce | VM-FED17 | ACTIVE | net-vlan202=188.8.131.52 |
| 2e971ceb-3f01-4acd-b99e-2a4480b2800f | VM-ubonto | ACTIVE | net-vlan202=184.108.40.206 |
[root@puma34 ~(keystone_admin_tenant1)]$ ping 220.127.116.11
PING 18.104.22.168 (22.214.171.124) 56(84) bytes of data.
64 bytes from 126.96.36.199: icmp_seq=1 ttl=64 time=0.291 ms
64 bytes from 188.8.131.52: icmp_seq=2 ttl=64 time=0.222 ms
64 bytes from 184.108.40.206: icmp_seq=3 ttl=64 time=0.251 ms
--- 220.127.116.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2593ms
rtt min/avg/max/mdev = 0.222/0.254/0.291/0.033 ms
[root@puma34 ~(keystone_admin_tenant1)]$ ping 18.104.22.168
PING 22.214.171.124 (126.96.36.199) 56(84) bytes of data.
64 bytes from 188.8.131.52: icmp_seq=1 ttl=64 time=0.339 ms
Ubuntu image taken from:
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.