Created attachment 696686
tcpdump and console log

On a local installation of OpenStack on RHEL 6.4 beta, I have several images. Networking works on all of them when libvirt_use_virtio_for_bridges=False.

When I set libvirt_use_virtio_for_bridges=True, networking stops working for one of my images; the VM just keeps sending out DHCP requests.

The image in question is an Ubuntu precise image, sourced from
http://uec-images.ubuntu.com/precise/20130211/precise-server-cloudimg-amd64-disk1.img

Using openstack-nova-compute-2012.2.2-9.el6ost.noarch.

Attached is the output of network traffic from the VM as it boots up, and the tail of the console log.
I have seen rumours before of certain Ubuntu releases having fubar virtio support, which I believe is why this flag was created in the first place. No one has ever been able to tell me which versions were fubar though. I'm surprised to see trouble with such a new Ubuntu release. Might want to raise this with the Ubuntu maintainers.

FYI, this BZ is tracking an RFE for configuring the NIC model per guest image:

https://bugzilla.redhat.com/show_bug.cgi?id=887968

This will allow for something like this:

  # glance image-update \
      --property vif_model=e1000 \
      f16-x86_64-openstack-sda
Derek, can you check with ubuntu maintainers and verify that this is just a bug with their virtio support? Sounds like it's not a nova bug here.
Email sent, they are looking into it.
I'm not able to recreate this locally. I've verified functional virtio guest network interfaces in 3 different ways:

A.) Run an ubuntu 12.10 instance, and run devstack inside that, using nested kvm and libvirt to boot the latest released 12.04 image (20130222 [1]).

kernel/libvirt versions:
  kernel: linux-image-3.5.0-25-generic
  libvirt: 0.9.13-0ubuntu12.2

For the full setup, I did:
1.) boot instance with user-data from [2]
2.) modify /etc/nova/nova.conf, adding 'libvirt_use_virtio_for_bridges=true' to the 'DEFAULT' section (I think it is the default, though). Then kill and re-start nova-network and nova-net.
3.) download the amd64 disk1.img at [1], then glance upload, and start an instance
4.) ssh into the instance, then:
  $ ethtool -i eth0 | grep driv
  driver: virtio_net
5.) verify that libvirt.xml for that instance has:
  <interface type="bridge">
    <mac address="fa:16:3e:3e:ce:0a"/>
    <model type="virtio"/>
    <source bridge="br100"/>
    <filterref filter="nova-instance-instance-00000001-fa163e3ece0a"/>
  </interface>

B.) Launch an instance in an internal Canonical openstack that we have. It runs the ubuntu cloud archive versions of openstack, 12.04 kernels and libvirt. Inside that instance:
  $ ethtool -i eth0 | head -n 2
  driver: virtio_net
  version: 1.0.0

C.) Launch an instance of 12.04 on HP's public cloud, and ssh to it. The image was:
  67074 | Ubuntu Precise 12.04 LTS Server 64-bit 20121026
Then, inside:
  $ ethtool -i eth0 | head -n 2
  driver: virtio_net
  version:
  $ uname -r
  3.2.0-32-virtual

So it seems that virtio networking is not simply broken. It's working in several different scenarios. Did I miss something?

[1] http://cloud-images.ubuntu.com/releases/precise/release-20130222/
[2] https://gist.github.com/smoser/4795358
Wow, this keeps coming back.

With GSO and virtio_net, we avoid adding checksums to packets sent from the host to the guest.

This works fine for most applications, but dhclient is different: it uses a raw socket and rejects packets which don't have a valid checksum.

We've carried a patch for a long, long time to have dhclient look at auxdata for the CSUMNOTREADY flag and ignore the lack of a checksum if the flag is set:

http://pkgs.fedoraproject.org/cgit/dhcp.git/tree/dhcp-4.2.2-xen-checksum.patch

As best I can tell from e.g.

https://code.launchpad.net/~ubuntu-branches/ubuntu/precise/isc-dhcp/precise

Ubuntu still doesn't have this patch. Upstream ISC is definitely partly to blame for that, but I thought this issue was well enough known that Ubuntu would have had the patch by now.

Now ... when you're using the userspace virtio_net implementation in qemu, you'll never notice the problem because of this hack:

http://git.qemu.org/?p=qemu.git;a=commit;h=1d41b0c

i.e. qemu specifically added checksums to DHCP packets.

With the newer vhost-net, that codepath isn't used. Instead, the virtio_net implementation is now in the kernel. And the kernel developers, the contrary lot that they are, refused to include the hack there:

http://www.spinics.net/lists/kvm/msg37660.html

Another detail is that this is only an issue when it's a DHCP packet sent from the host to the guest. If you're using an external DHCP server, it's not a problem. OpenStack manages its own dhcp server, as does libvirt's NATing virtual network.

libvirt uses a recently added --checksum-fill iptables mangle rule (thanks to mst for the hint here) to implement the equivalent of qemu's hack using iptables for this case:

http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=fd5b15f

Summary: we need to make OpenStack similarly use --checksum-fill.
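To illustrate why an unfilled checksum makes a raw-socket reader drop the packet, here is a minimal Python sketch of IPv4 UDP checksum validation. It is not dhclient's code. The addresses, ports and payload are made up, and the "offloaded" case is modelled in a simplified way, by leaving only the pseudo-header sum in the checksum field (an assumption about what a not-yet-finalised checksum looks like on the wire).

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: bytes, dst_ip: bytes, udp_len: int) -> bytes:
    """IPv4 pseudo-header for UDP (protocol number 17)."""
    return src_ip + dst_ip + struct.pack("!BBH", 0, 17, udp_len)


def build_udp(src_ip, dst_ip, sport, dport, payload, fill=True) -> bytes:
    """Build a UDP segment; fill=False models an offloaded checksum."""
    length = 8 + len(payload)
    segment = struct.pack("!HHHH", sport, dport, length, 0) + payload
    pseudo = pseudo_header(src_ip, dst_ip, length)
    if fill:
        csum = ~ones_complement_sum(pseudo + segment) & 0xFFFF
        csum = csum or 0xFFFF  # 0 is reserved for "no checksum"
    else:
        # Simplified model: only the pseudo-header sum is present.
        csum = ones_complement_sum(pseudo)
    return segment[:6] + struct.pack("!H", csum) + segment[8:]


def checksum_ok(src_ip, dst_ip, segment) -> bool:
    """Validate the way a strict receiver would."""
    (stored,) = struct.unpack("!H", segment[6:8])
    if stored == 0:  # IPv4 UDP: sender did not compute a checksum
        return True
    pseudo = pseudo_header(src_ip, dst_ip, len(segment))
    return ones_complement_sum(pseudo + segment) == 0xFFFF


# Hypothetical DHCP-like reply from server port 67 to client port 68.
SRC, DST = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
filled = build_udp(SRC, DST, 67, 68, b"dhcp-offer", fill=True)
unfilled = build_udp(SRC, DST, 67, 68, b"dhcp-offer", fill=False)
print(checksum_ok(SRC, DST, filled), checksum_ok(SRC, DST, unfilled))
```

Filling in the checksum before the packet reaches the guest, which is what qemu's hack and the --checksum-fill rule do, turns the second case into the first.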
Ok, I've just confirmed this with Derek by adding this rule:

  iptables -A POSTROUTING -t mangle -p udp --dport bootpc -j CHECKSUM --checksum-fill

and the Ubuntu VMs were able to get an IP again.

The rule we want OpenStack to add should be more specific to the interface involved etc., but that's the general idea.
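As a rough sketch of what "more specific to the interface" could look like, the Python helper below builds the same rule restricted to one outgoing bridge with -o. The function name is hypothetical and this is not nova's code. br100 is the bridge name from the libvirt.xml quoted earlier in this bug, and 68 is the numeric value of the bootpc port.

```python
import shlex
from typing import List


def checksum_fill_rule(out_iface: str, dport: int = 68) -> List[str]:
    """Build iptables arguments for a per-interface CHECKSUM rule.

    Hypothetical helper: it only builds the argument list and does
    not run iptables, so it needs no root privileges.
    """
    # Linux interface names are at most 15 characters, no whitespace.
    if not out_iface or len(out_iface) > 15 or any(c.isspace() for c in out_iface):
        raise ValueError("invalid interface name: %r" % out_iface)
    return [
        "iptables", "-t", "mangle", "-A", "POSTROUTING",
        "-o", out_iface,
        "-p", "udp", "-m", "udp", "--dport", str(dport),
        "-j", "CHECKSUM", "--checksum-fill",
    ]


# Show the command for the bridge used in the reproduction above.
print(shlex.join(checksum_fill_rule("br100")))
```

Matching on the outgoing interface keeps the rule from touching UDP traffic to port 68 on unrelated interfaces of the host.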
Ok, we need to file a bug against upstream OpenStack for this, targeting both Grizzly and Folsom.

We should also file a bug against Ubuntu's dhcp package (dhclient), and probably Debian's too, since I presume Ubuntu inherits this flaw from the Debian packages.
Related ubuntu/openstack and debian bugs:

https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/930962
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1029430
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=353161

The isc-dhcp issue seems to be fixed in raring (bug 930962). I'll look into getting it SRU'd to 12.04 and even 12.10.

Could you confirm that ubuntu cloud images of raring do not exhibit this issue?
Just download one from http://cloud-images.ubuntu.com/server/raring .
This looks to me like it was fixed in nova trunk at https://review.openstack.org/#/c/18336/ .
(In reply to comment #10)
> This looks to me like it was fixed in trunk nova trunk at
> https://review.openstack.org/#/c/18336/ .

Cool, thanks for pointing that out. We should consider proposing that for the Folsom stable branch upstream too.
Nice find Scott, thanks! (/me should have grepped for checksum-fill :)
(In reply to comment #9)
> Could you confirm that ubuntu cloud images of raring do not exhibit this
> issue?
> Just download one from http://cloud-images.ubuntu.com/server/raring .

Confirmed, the problem is gone in raring:
http://cloud-images.ubuntu.com/raring/20130307/raring-server-cloudimg-amd64-disk1.img
(In reply to comment #11)
> (In reply to comment #10)
> > This looks to me like it was fixed in trunk nova trunk at
> > https://review.openstack.org/#/c/18336/ .
>
> Cool, thanks for pointing that out. We should consider proposing that for
> Folsom stable branch upstream too

It is on the Folsom branch already:
https://review.openstack.org/#/c/18450/

> using openstack-nova-compute-2012.2.2-9.el6ost.noarch

4bfc8f1165b05c2cc7c5506641b9b85fa8e1e144 is in 2012.2.3.
Please test with openstack-nova-2012.2.3-1.el6ost.
Tested RHEL 6.4, Fedora 17 and Ubuntu guests on both Quantum and Nova networking: ping and SSH to the VMs.

| ec417926-7229-4dc2-b64d-1d085a7f7fce | VM-FED17  | ACTIVE  | net-vlan202=88.66.66.34 |
| 2e971ceb-3f01-4acd-b99e-2a4480b2800f | VM-ubonto | ACTIVE  | net-vlan202=88.66.66.11 |
+--------------------------------------+-----------+---------+-------------------------+

[root@puma34 ~(keystone_admin_tenant1)]$ ping 88.66.66.11
PING 88.66.66.11 (88.66.66.11) 56(84) bytes of data.
64 bytes from 88.66.66.11: icmp_seq=1 ttl=64 time=0.291 ms
64 bytes from 88.66.66.11: icmp_seq=2 ttl=64 time=0.222 ms
64 bytes from 88.66.66.11: icmp_seq=3 ttl=64 time=0.251 ms
^C
--- 88.66.66.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2593ms
rtt min/avg/max/mdev = 0.222/0.254/0.291/0.033 ms

[root@puma34 ~(keystone_admin_tenant1)]$ ping 88.66.66.34
PING 88.66.66.34 (88.66.66.34) 56(84) bytes of data.
64 bytes from 88.66.66.34: icmp_seq=1 ttl=64 time=0.339 ms

Ubuntu image taken from:
http://uec-images.ubuntu.com/precise/current/precise-server-cloudimg-i386-disk1.img
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-0706.html