Bug 1221006
Summary: IP address dropped somehow, causing vagrant to hang
Product: [Fedora] Fedora
Component: vagrant
Version: rawhide
Hardware: Unspecified
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: James (purpleidea) <jshubin>
Assignee: Josef Stribny <jstribny>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: aweiteka, hhorak, jshubin, jstribny, madam, mattdm, ncoghlan, pschiffe, rbarlow, thrcka, tkimura, vondruch, walters
Fixed In Version: vagrant-1.7.2-7.fc22, vagrant-1.7.2-7.fc21.1
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-06-30 10:52:07 UTC
Description
James (purpleidea)
2015-05-13 06:17:26 UTC
@aweiteka if you can please confirm that you also still have this issue, it would be appreciated. I remember you spent some time debugging the internals a bit, and your comments would be appreciated. For reference, I don't have this issue when I'm using the upstream Vagrant packages, version 1.6.5 with vagrant-libvirt 0.0.26.

Comment 3 (Josef Stribny):
James, I looked into it a bit and found out that your issue is still there with the latest upstream Vagrant AND also with Vagrant 1.6.5.

Can you check that the following option is not causing the issue?

    :libvirt__dhcp_enabled => false

I tried to comment it out and it started to work. That would explain why the VM didn't get an IP. It's used only when creating a new network.

---
(In reply to Josef Stribny from comment #3)
> Can you check that the following option is not causing the issue?
>
>     :libvirt__dhcp_enabled => false
>
> I tried to comment it out and it started to work.

Here's the thing... You *unfortunately* need two IPs to get a sane setup:

1) One that works with DHCP, so that vagrant can find the machine in the first place.
2) A second, static one, so that you can do reliable static networking and the IPs are consistent across reboots.

I don't know how else to get this to work...

---
(In reply to Josef Stribny from comment #3)
And to answer more specifically... Look at the IPs in the machine.
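For context, the two-IP layout under discussion (a DHCP-managed interface that vagrant uses, plus a statically addressed one) can be sketched in a Vagrantfile roughly as follows. This is an illustration only; the box name and network name are hypothetical, not taken from the report:

```ruby
# Sketch of a vagrant-libvirt setup with two interfaces (illustrative names):
Vagrant.configure("2") do |config|
  config.vm.box = "centos-7.1-docker"   # hypothetical box name

  # eth0 is created implicitly by vagrant-libvirt on its management
  # network and gets its address via DHCP, so "vagrant ssh" can work.

  # eth1: static address so it stays consistent across reboots.
  config.vm.network :private_network,
    :ip => "192.168.123.100",
    :libvirt__network_name => "omv",      # hypothetical network name
    :libvirt__dhcp_enabled => false       # the option under discussion
end
```

A Vagrantfile like this is configuration consumed by Vagrant itself, so it is shown here only as a fragment.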
For the setup I gave you, you should have one IP of 192.168.123.100 and another, random, DHCP-given IP. Sometimes this works; sometimes no IPs are present. Keep up/down-ing and you'll see it reproduces. Changing that option doesn't change it. In particular, when it does work, they're both on the same interface, instead of one on eth0 and one on eth1... Feels like a race condition, perhaps?

Comment 6 (Josef Stribny):
Are you able to put together a minimal reproducer? Just taking the networking stuff and nothing else? This should be reported to upstream.

Comment 7 (James):
(In reply to Josef Stribny from comment #6)
> Are you able to put together a minimal reproducer? Just taking the
> networking stuff and nothing else? This should be reported to upstream.

To be honest, I don't have many cycles for this right now, sorry. I'd like for vagrant in Fedora to succeed and be accessible, but there has to be a maintainer dedicated to making that happen, and it can't be me at the moment. If you can follow up and help fix this bug, I would appreciate it; if not, I expect someone else will end up hitting this, and hopefully patches will come out of that.

Cheers,
James

---
> I don't have many cycles for this right now, sorry

Putting together a clear (minimal) reproducer is key for any issue, not just this one. Unfortunately, you reported issues that are based on your oh-my-vagrant project, which I and many other people are not that familiar with. I only asked for a stripped-down, minimal Vagrantfile, so it's clear what you are trying to accomplish and what fails.

> I'd like for vagrant in fedora to succeed and be accessible

Me too, but this one is not Fedora-specific; I already told you I hit the issue with upstream packages. In order to submit it upstream and work with them on fixing it, a clear report of what's wrong is needed.

> If you can follow up and help fix this bug, I would appreciate it

If things like this can be tracked upstream, many more people can resolve them.
I would love to help fix upstream bugs as well, but if you look at the upstream issue trackers, this bug is just one of many.

> I expect someone else will end up hitting this and hopefully patches will come out of that

That's why I would like to see a proper report in the upstream tracker :).

---
(In reply to James (purpleidea) from comment #7)
> To be honest, I don't have many cycles for this right now, sorry. I'd like
> for vagrant in fedora to succeed and be accessible, but there has to be a
> maintainer dedicated to making that happen, and it can't be me at the moment.

Well, the request to have a minimal reproducer is valid. It is even *essential* to have a minimal reproducer. Quite frankly, it is not necessarily the vagrant-libvirt maintainer's job to try to untangle the involved setup of omv and extract the core issue. It is at least as much the job of the omv maintainer, imho. ;) Like last time, when we debugged an issue together and it turned out that some special thing omv did triggered a bug in vagrant that others are unlikely to hit.

Coming back to the issue: can this issue be related? https://github.com/pradels/vagrant-libvirt/issues/312

vagrant-libvirt does strange things with respect to dhcp and networking. I noticed that with my own setups. From all of my previous research, I share Josef's impression that your ":libvirt__dhcp_enabled => false" is wrong. vagrant-libvirt is actually intended to run correctly with statically configured interfaces. It needs the one (default) interface with dhcp for vagrant ssh etc.
Iirc, vagrant-libvirt achieves static configuration by first doing ifdown on an interface that was originally brought up with dhcp (this ifdown also kills the dhclient), then putting the static config in place, and then doing ifup again. In several (of your) boxes this fails with various results; see the issue cited above. E.g. ifdown fails, so dhclient is still running and re-adds the dynamic IP address after a while. On other boxes, it fails to bring up the interface at all. So a problem seems to be that vagrant-libvirt's actions are not independent enough of the state of the network config in the box.

Can't say it better now -- it's been a while since I last looked, but I thought I'd share it anyway...

Cheers - Michael

---
I'd love to have time to dig deeper into this, but I've got to do other work first. If I have time on the weekend I will, but otherwise maybe someone else can. Cheers

---
I've also noticed this behavior when using OMV. Sometimes when I vagrant up my machine, it'll get no IP addresses. I've also noticed it getting both IP addresses on the same interface (which causes other issues). I also don't have a simple reproducer, however ☹

---
Created attachment 1028451 [details]
Simple reproducer

You should expect two interfaces: one with a static IP address, and the other getting its address from DHCP, which vagrant uses.
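The ifdown/rewrite/ifup sequence Michael describes can be sketched as the following ordered command list. This is an illustration of the described behavior, not vagrant-libvirt's actual code; the helper name is hypothetical:

```ruby
# Sketch (assumed behavior, per Michael's description above): the steps
# vagrant-libvirt runs in the guest to switch an interface from DHCP to a
# static address. If step 1 silently fails, dhclient keeps running and
# later re-adds the dynamic address alongside the static one.
def static_config_commands(iface, ip, netmask)
  static = "BOOTPROTO=none IPADDR=#{ip} NETMASK=#{netmask}"
  [
    "ifdown #{iface}",                       # 1. take it down (kills dhclient)
    "write ifcfg-#{iface}: #{static}",       # 2. drop in the static config
    "ifup #{iface}"                          # 3. bring it back up, now static
  ]
end

commands = static_config_commands("eth1", "192.168.123.100", "255.255.255.0")
```

The point of modeling it as an ordered list is that each step depends on the previous one succeeding, which matches the failure modes described (ifdown failing, or ifup never completing).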
Some of the time you see this:

[root@test1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:5c:c7:be brd ff:ff:ff:ff:ff:ff
    inet 192.168.121.214/24 brd 192.168.121.255 scope global dynamic eth0
       valid_lft 3563sec preferred_lft 3563sec
    inet 192.168.123.100/24 brd 192.168.123.255 scope global eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:eb:c3:31 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever

Some of the time, the machine doesn't finish the vagrant up, and you see this:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN qlen 1000
    link/ether 52:54:00:83:b2:b3 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:b9:c3:3b brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever

Also:

$ cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by dracut initrd
NAME="eth0"
ONBOOT=yes
NETBOOT=yes
UUID="bd5b2625-fdce-41d4-997a-13bf1b70deca"
IPV6INIT=yes
BOOTPROTO=dhcp
TYPE=Ethernet
#VAGRANT-BEGIN
# The contents below are automatically generated by Vagrant. Do not modify.
NM_CONTROLLED=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.123.100
NETMASK=255.255.255.0
DEVICE=eth0
PEERDNS=no
#VAGRANT-END

So it might be related to dracut messing things up somehow... IDK.

NOTE: the image used is here: https://download.gluster.org/pub/gluster/purpleidea/vagrant/centos-7.1-docker/

Comment 13 (Nick Coghlan):
I think I may be seeing this as well, using the current vagrant and vagrant-libvirt packages on Fedora 21 as the host OS:

$ rpm -qa vagrant vagrant-libvirt
vagrant-libvirt-0.0.24-4.fc21.noarch
vagrant-1.7.2-5.fc21.1.noarch

Vagrant guest systems are all also Fedora 21, configured using this example as omv.yaml: https://github.com/purpleidea/oh-my-vagrant/blob/master/examples/kubernetes-ansible.yaml

The symptoms I see inside the VMs are slightly different from those James reports in his simple reproducer. In my case, eth0 is clearly the DHCP-controlled interface created for remote access by vagrant, with eth1 as the separate statically configured interface intended for communication between the VMs:

$ cat /etc/sysconfig/network-scripts/ifcfg-eth1
#VAGRANT-BEGIN
# The contents below are automatically generated by Vagrant. Do not modify.
NM_CONTROLLED=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.123.100
NETMASK=255.255.255.0
DEVICE=eth1
PEERDNS=no
#VAGRANT-END

That interface isn't showing any IPv4 address in ifconfig or "ip addr". Dropping the network interface and bringing it back up from inside the VM isn't having any effect either. The settings for both the omv network (which all the eth1 interfaces are connected to) and the vagrant-libvirt network (which all the eth0 interfaces are connected to) look fine in virt-manager, and they're both up and running.
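One detail worth noting in the ifcfg-eth0 contents James posted earlier: the file ends up with two BOOTPROTO lines (dhcp from dracut, none appended by Vagrant). ifcfg files are read as shell-style key=value assignments, so the last assignment wins and the appended block silently turns off DHCP on eth0. A small sketch of that last-wins behavior (the parser below is illustrative, not the actual initscripts code):

```ruby
# Sketch: ifcfg files behave like sourced shell variable assignments, so a
# key assigned twice takes its *last* value. With Vagrant's block appended
# to ifcfg-eth0, BOOTPROTO=none overrides the original BOOTPROTO=dhcp.
def parse_ifcfg(text)
  text.each_line.with_object({}) do |line, vars|
    line = line.strip
    next if line.empty? || line.start_with?("#")  # skip comments/markers
    key, value = line.split("=", 2)
    vars[key] = value.delete('"') if key && value # later keys overwrite earlier
  end
end

ifcfg = <<~EOF
  # Generated by dracut initrd
  NAME="eth0"
  BOOTPROTO=dhcp
  #VAGRANT-BEGIN
  BOOTPROTO=none
  IPADDR=192.168.123.100
  #VAGRANT-END
EOF

settings = parse_ifcfg(ifcfg)
```

This is consistent with the symptom of eth0 ending up static (or down) when Vagrant appends its block to the wrong interface's file.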
(In reply to Nick Coghlan from comment #13)
> That interface isn't showing any IPv4 address in ifconfig or "ip addr".
> Dropping the network interface and bringing it back up from inside the VM
> isn't having any effect either.

Can you debug the reason why the interface isn't showing the IP? It should (hopefully) be a straightforward networking issue; maybe vagrant is setting it up wrong, and that's why it isn't working?

---
It's possible to get a static address with one interface using libvirt, by pre-configuring a binding between the MAC address and a DHCP lease before booting the VM.
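The MAC-to-address binding mentioned above is done with a <host> entry inside the network's <dhcp> element. A sketch of such a libvirt network definition follows; the network name, MAC and addresses here are illustrative, not taken from the report:

```xml
<!-- Sketch: libvirt network handing out a fixed DHCP lease to one guest.
     Name, MAC address and IP ranges are illustrative. -->
<network>
  <name>omv</name>
  <forward mode='nat'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.123.2' end='192.168.123.254'/>
      <!-- this guest always receives 192.168.123.100 -->
      <host mac='52:54:00:5c:c7:be' name='test1' ip='192.168.123.100'/>
    </dhcp>
  </ip>
</network>
```

With a definition like this, the guest needs only one DHCP interface, yet its address stays stable across reboots.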
There are some examples here: http://libvirt.org/formatnetwork.html

---
Trying this on my personal laptop running Fedora 22, and tweaking the omv.yaml file to use the OMV Fedora 22 Vagrant boxes (rather than the OMV Fedora 21 boxes), I get slightly different symptoms from those I saw with Fedora 21 as the host and guest:

* the eth1 definition without an IPv4 address is still present in ifconfig
* there's no ifcfg-eth1 network script at all (neither system generated nor vagrant generated)

Attempting to restart the network services with "sudo systemctl restart network" gives the following result on all 3 machines:

May 28 02:18:18 localhost.localdomain systemd[1]: Starting LSB: Bring up/down networking...
-- Subject: Unit network.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit network.service has begun starting up.
May 28 02:18:18 localhost.localdomain network[2542]: Bringing up loopback interface:  Could not load file '/etc/sysconfig/network-scripts/ifcfg-lo'
May 28 02:18:18 localhost.localdomain network[2542]: Could not load file '/etc/sysconfig/network-scripts/ifcfg-lo'
May 28 02:18:18 localhost.localdomain network[2542]: Could not load file '/etc/sysconfig/network-scripts/ifcfg-lo'
May 28 02:18:18 localhost.localdomain network[2542]: Could not load file '/etc/sysconfig/network-scripts/ifcfg-lo'
May 28 02:18:18 localhost.localdomain network[2542]: [  OK  ]
May 28 02:18:18 localhost.localdomain network[2542]: Bringing up interface eth0:  Error: Connection activation failed: Connection 'eth0' is already active on eth0
May 28 02:18:18 localhost.localdomain network[2542]: [FAILED]
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain systemd[1]: network.service: control process exited, code=exited status=1
May 28 02:18:18 localhost.localdomain systemd[1]: Failed to start LSB: Bring up/down networking.
-- Subject: Unit network.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit network.service has failed.
--
-- The result is failed.

---
I can confirm a Fedora 22 host with Fedora 21 guests shows the same symptoms as the Fedora 21 host with Fedora 21 guests: there's an ifcfg-eth1 network script present, and the network service appears to be running as expected, but the eth1 interface has no IPv4 address.

---
Correction to the previous post: if I make sure I'm using the Vagrant box from https://getfedora.org/en/cloud/download/, the network connection comes up fine (regardless of whether I use the simple reproducer or OMV itself). I've only done this twice, so an intermittent failure might still exist. However, it does mean it's specifically the default OMV Vagrant box from https://download.gluster.org/pub/gluster/purpleidea/vagrant/ that exhibited the problem with the network failing to come up correctly (and that was repeatable every time).

---
For me, this issue looks like this. I'm using oh-my-vagrant and a rhel-7 box.
After provisioning, I see this in the file /etc/sysconfig/network-scripts/ifcfg-eth0:

# Generated by dracut initrd
NAME="eth0"
ONBOOT=yes
NETBOOT=yes
UUID="4c227002-aa87-42a4-b904-190d5ba80fdf"
IPV6INIT=yes
BOOTPROTO=dhcp
TYPE=Ethernet
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
PEERDNS=yes
PEERROUTES=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
#VAGRANT-BEGIN
# The contents below are automatically generated by Vagrant. Do not modify.
NM_CONTROLLED=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.91.102
NETMASK=255.255.255.0
DEVICE=eth0
PEERDNS=no
#VAGRANT-END

I don't have an /etc/sysconfig/network-scripts/ifcfg-eth1 file.

I'm able to work around this: instead of letting oh-my-vagrant create the libvirt network, I create it manually, without DHCP and NAT enabled (the "RTNETLINK answers: File exists" error message is probably because there are 2 network devices with NAT). Then I create an /etc/sysconfig/network-scripts/ifcfg-eth1 file with the content vagrant added to ifcfg-eth0 (with the correct DEVICE), remove the content vagrant added to ifcfg-eth0, and bring up eth1 with "# ifup eth1" .. and it works.

Comment 20 (Takayoshi Kimura):
The problem is Vagrant: it configures the 2nd interface based on the assumption that the network interfaces are lo, eth0 and eth1. Fedora, CentOS and RHEL have docker0, lo, eth0 and eth1, so Vagrant misconfigures the eth1 interface and messes up eth0.

The upstream pull request is here, but it has not been merged yet:

https://github.com/mitchellh/vagrant/pull/5706

---
(In reply to Takayoshi Kimura from comment #20)
> The problem is Vagrant: it configures the 2nd interface based on the
> assumption that the network interfaces are lo, eth0 and eth1.

Wow, thanks! This patch works great. I highly recommend patching our Fedora versions of vagrant to include this.
This also explains why certain boxes didn't work -- the boxes that didn't work were those that had a docker0 interface.

Cheers,
James

---
This is now fixed both for Fedora 22 (update in stable) and 21 (update just pushed to stable).
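The root cause Takayoshi identified can be illustrated with a small sketch: choosing "the 2nd configurable interface" by raw list position breaks as soon as an extra device like docker0 appears, because interface lists sort docker0 before eth0. Filtering to the ethN devices first does not have that problem. This is illustrative code, not Vagrant's actual implementation:

```ruby
# Sketch of the failure mode from comment 20 (not Vagrant's real code):
# picking an interface by position assumes the list is [lo, eth0, eth1].
def second_interface_by_position(interfaces)
  (interfaces - ["lo"])[1]   # index-based: breaks when docker0 is present
end

# Name-based selection is robust to extra devices like docker0.
def second_interface_by_name(interfaces)
  interfaces.select { |i| i.start_with?("eth") }.sort[1]
end

plain_box  = ["eth0", "eth1", "lo"].sort              # no docker installed
docker_box = ["docker0", "eth0", "eth1", "lo"].sort   # docker0 shifts indices
```

On the docker box, the positional lookup returns eth0 instead of eth1, which matches the observed symptom: Vagrant writing the second network's static configuration into eth0's file and leaving eth1 unconfigured.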