Bug 1221006
Summary: IP address dropped somehow, causing vagrant to hang
Product: [Fedora] Fedora
Component: vagrant
Version: rawhide
Hardware: Unspecified
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: James (purpleidea) <jshubin>
Assignee: Josef Stribny <jstribny>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: aweiteka, hhorak, jshubin, jstribny, madam, mattdm, ncoghlan, pschiffe, rbarlow, thrcka, tkimura, vondruch, walters
Fixed In Version: vagrant-1.7.2-7.fc22, vagrant-1.7.2-7.fc21.1
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-06-30 10:52:07 UTC
Description
James (purpleidea)
2015-05-13 06:17:26 UTC
@aweiteka if you can please confirm that you also still have this issue, it would be appreciated. I remember you spent some time debugging the internals a bit, and your comments would be appreciated. For reference, I don't have this issue when I'm using the upstream Vagrant packages, version 1.6.5 with vagrant-libvirt 0.0.26.

Comment 3 (Josef Stribny):
James, I looked into it a bit and found out that your issue is still there with the latest upstream Vagrant AND also with Vagrant 1.6.5.

Can you check that the following option is not causing the issue?

    :libvirt__dhcp_enabled => false

I tried to comment it out and it started to work. That would explain why the VM didn't get an IP. It's used only when creating a new network.

---
(In reply to Josef Stribny from comment #3)
> Can you check that the following option is not causing the issue?
>
>     :libvirt__dhcp_enabled => false
>
> I tried to comment it out and it started to work.

Here's the thing... You *unfortunately* need two IPs to get a sane setup:

1) One that works with DHCP, so that vagrant can find the machine in the first place.
2) A second, static one, so that you can do reliable static networking and the IPs are consistent across reboots.

I don't know how else to get this to work...

---
(In reply to Josef Stribny from comment #3)
And to answer more specifically... Look at the IPs in the machine.
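For context, the two-IP layout under discussion (a DHCP-managed interface that vagrant uses, plus a statically addressed one) can be sketched in a Vagrantfile roughly as follows. This is an illustration only; the box name and network name are hypothetical, not taken from the report:

```ruby
# Sketch of a vagrant-libvirt setup with two interfaces (illustrative names):
Vagrant.configure("2") do |config|
  config.vm.box = "centos-7.1-docker"   # hypothetical box name

  # eth0 is created implicitly by vagrant-libvirt on its management
  # network and gets its address via DHCP, so "vagrant ssh" can work.

  # eth1: static address so it stays consistent across reboots.
  config.vm.network :private_network,
    :ip => "192.168.123.100",
    :libvirt__network_name => "omv",      # hypothetical network name
    :libvirt__dhcp_enabled => false       # the option under discussion
end
```

A Vagrantfile like this is configuration consumed by Vagrant itself, so it is shown here only as a fragment.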
For the setup I gave you, you should have one IP of 192.168.123.100 and another, random, DHCP-given IP. Sometimes this works; sometimes no IPs are present. Keep up/down-ing and you'll see it reproduces. Changing that option doesn't change it. In particular, when it does work, they're both on the same interface, instead of one on eth0 and one on eth1... Feels like a race condition, perhaps?

Comment 6 (Josef Stribny):
Are you able to put together a minimal reproducer? Just taking the networking stuff and nothing else? This should be reported to upstream.

Comment 7 (James):
(In reply to Josef Stribny from comment #6)
> Are you able to put together a minimal reproducer? Just taking the
> networking stuff and nothing else? This should be reported to upstream.

To be honest, I don't have many cycles for this right now, sorry. I'd like for vagrant in Fedora to succeed and be accessible, but there has to be a maintainer dedicated to making that happen, and it can't be me at the moment. If you can follow up and help fix this bug, I would appreciate it; if not, I expect someone else will end up hitting this, and hopefully patches will come out of that.

Cheers,
James

---
> I don't have many cycles for this right now, sorry

Putting together a clear (minimal) reproducer is key for any issue, not just this one. Unfortunately, you reported issues that are based on your oh-my-vagrant project, which I and many other people are not that familiar with. I only asked for a stripped-down, minimal Vagrantfile, so it's clear what you are trying to accomplish and what fails.

> I'd like for vagrant in fedora to succeed and be accessible

Me too, but this one is not Fedora-specific; I already told you I hit the issue with upstream packages. In order to submit it upstream and work with them on fixing it, a clear report of what's wrong is needed.

> If you can follow up and help fix this bug, I would appreciate it

If things like this can be tracked upstream, many more people can resolve them.
I would love to help fix upstream bugs as well, but if you look at the upstream issue trackers, this bug is just one of many.

> I expect someone else will end up hitting this and hopefully patches will come out of that

That's why I would like to see a proper report in the upstream tracker :).

---
(In reply to James (purpleidea) from comment #7)
> To be honest, I don't have many cycles for this right now, sorry. I'd like
> for vagrant in fedora to succeed and be accessible, but there has to be a
> maintainer dedicated to making that happen, and it can't be me at the moment.

Well, the request to have a minimal reproducer is valid. It is even *essential* to have a minimal reproducer. Quite frankly, it is not necessarily the vagrant-libvirt maintainer's job to try to untangle the involved setup of omv and extract the core issue. It is at least as much the job of the omv maintainer, imho. ;) Like last time, when we debugged an issue together and it turned out that some special thing omv did triggered a bug in vagrant that others are unlikely to hit.

Coming back to the issue: can this issue be related? https://github.com/pradels/vagrant-libvirt/issues/312

vagrant-libvirt does strange things with respect to dhcp and networking. I noticed that with my own setups. From all of my previous research, I share Josef's impression that your ":libvirt__dhcp_enabled => false" is wrong. vagrant-libvirt is actually intended to run correctly with statically configured interfaces. It needs the one (default) interface with dhcp for vagrant ssh etc.
Iirc, vagrant-libvirt achieves static configuration by first doing ifdown on an interface that was originally brought up with dhcp (this ifdown also kills the dhclient), then putting the static config in place, and then doing ifup again. In several (of your) boxes this fails with various results; see the issue cited above. E.g. ifdown fails, so dhclient is still running and re-adds the dynamic IP address after a while. On other boxes, it fails to bring up the interface at all. So a problem seems to be that vagrant-libvirt's actions are not independent enough of the state of the network config in the box.

Can't say it better now -- it's been a while since I last looked, but I thought I'd share it anyway...

Cheers - Michael

---
I'd love to have time to dig deeper into this, but I've got to do other work first. If I have time on the weekend I will, but otherwise maybe someone else can. Cheers

---
I've also noticed this behavior when using OMV. Sometimes when I vagrant up my machine, it'll get no IP addresses. I've also noticed it getting both IP addresses on the same interface (which causes other issues). I also don't have a simple reproducer, however ☹

---
Created attachment 1028451 [details]
Simple reproducer

You should expect two interfaces: one with a static IP address, and the other getting its address from DHCP, which vagrant uses.
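The ifdown/rewrite/ifup sequence Michael describes can be sketched as the following ordered command list. This is an illustration of the described behavior, not vagrant-libvirt's actual code; the helper name is hypothetical:

```ruby
# Sketch (assumed behavior, per Michael's description above): the steps
# vagrant-libvirt runs in the guest to switch an interface from DHCP to a
# static address. If step 1 silently fails, dhclient keeps running and
# later re-adds the dynamic address alongside the static one.
def static_config_commands(iface, ip, netmask)
  static = "BOOTPROTO=none IPADDR=#{ip} NETMASK=#{netmask}"
  [
    "ifdown #{iface}",                       # 1. take it down (kills dhclient)
    "write ifcfg-#{iface}: #{static}",       # 2. drop in the static config
    "ifup #{iface}"                          # 3. bring it back up, now static
  ]
end

commands = static_config_commands("eth1", "192.168.123.100", "255.255.255.0")
```

The point of modeling it as an ordered list is that each step depends on the previous one succeeding, which matches the failure modes described (ifdown failing, or ifup never completing).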
Some of the time you see this:

[root@test1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:5c:c7:be brd ff:ff:ff:ff:ff:ff
    inet 192.168.121.214/24 brd 192.168.121.255 scope global dynamic eth0
       valid_lft 3563sec preferred_lft 3563sec
    inet 192.168.123.100/24 brd 192.168.123.255 scope global eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:eb:c3:31 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever

Some of the time, the machine doesn't finish the vagrant up, and you see this:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN qlen 1000
    link/ether 52:54:00:83:b2:b3 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:b9:c3:3b brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever

Also:

$ cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by dracut initrd
NAME="eth0"
ONBOOT=yes
NETBOOT=yes
UUID="bd5b2625-fdce-41d4-997a-13bf1b70deca"
IPV6INIT=yes
BOOTPROTO=dhcp
TYPE=Ethernet
#VAGRANT-BEGIN
# The contents below are automatically generated by Vagrant. Do not modify.
NM_CONTROLLED=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.123.100
NETMASK=255.255.255.0
DEVICE=eth0
PEERDNS=no
#VAGRANT-END

So it might be related to dracut messing things up somehow... IDK.

NOTE: the image used is here: https://download.gluster.org/pub/gluster/purpleidea/vagrant/centos-7.1-docker/

Comment 13 (Nick Coghlan):
I think I may be seeing this as well, using the current vagrant and vagrant-libvirt packages on Fedora 21 as the host OS:

$ rpm -qa vagrant vagrant-libvirt
vagrant-libvirt-0.0.24-4.fc21.noarch
vagrant-1.7.2-5.fc21.1.noarch

Vagrant guest systems are all also Fedora 21, configured using this example as omv.yaml: https://github.com/purpleidea/oh-my-vagrant/blob/master/examples/kubernetes-ansible.yaml

The symptoms I see inside the VMs are slightly different from those James reports in his simple reproducer. In my case, eth0 is clearly the DHCP-controlled interface created for remote access by vagrant, with eth1 as the separate statically configured interface intended for communication between the VMs:

$ cat /etc/sysconfig/network-scripts/ifcfg-eth1
#VAGRANT-BEGIN
# The contents below are automatically generated by Vagrant. Do not modify.
NM_CONTROLLED=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.123.100
NETMASK=255.255.255.0
DEVICE=eth1
PEERDNS=no
#VAGRANT-END

That interface isn't showing any IPv4 address in ifconfig or "ip addr". Dropping the network interface and bringing it back up from inside the VM isn't having any effect either. The settings for both the omv network (which all the eth1 interfaces are connected to) and the vagrant-libvirt network (which all the eth0 interfaces are connected to) look fine in virt-manager, and they're both up and running.
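One detail worth noting in the ifcfg-eth0 contents James posted earlier: the file ends up with two BOOTPROTO lines (dhcp from dracut, none appended by Vagrant). ifcfg files are read as shell-style key=value assignments, so the last assignment wins and the appended block silently turns off DHCP on eth0. A small sketch of that last-wins behavior (the parser below is illustrative, not the actual initscripts code):

```ruby
# Sketch: ifcfg files behave like sourced shell variable assignments, so a
# key assigned twice takes its *last* value. With Vagrant's block appended
# to ifcfg-eth0, BOOTPROTO=none overrides the original BOOTPROTO=dhcp.
def parse_ifcfg(text)
  text.each_line.with_object({}) do |line, vars|
    line = line.strip
    next if line.empty? || line.start_with?("#")  # skip comments/markers
    key, value = line.split("=", 2)
    vars[key] = value.delete('"') if key && value # later keys overwrite earlier
  end
end

ifcfg = <<~EOF
  # Generated by dracut initrd
  NAME="eth0"
  BOOTPROTO=dhcp
  #VAGRANT-BEGIN
  BOOTPROTO=none
  IPADDR=192.168.123.100
  #VAGRANT-END
EOF

settings = parse_ifcfg(ifcfg)
```

This is consistent with the symptom of eth0 ending up static (or down) when Vagrant appends its block to the wrong interface's file.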
(In reply to Nick Coghlan from comment #13)
> That interface isn't showing any IPv4 address in ifconfig or "ip addr".
> Dropping the network interface and bringing it back up from inside the VM
> isn't having any effect either.

Can you debug the reason why the interface isn't showing the IP? It should (hopefully) be a straightforward networking issue; maybe vagrant is setting it up wrong, and that's why it isn't working?

---
It's possible to get a static address with one interface using libvirt, by pre-configuring a binding between the MAC address and a DHCP lease before booting the VM.
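The MAC-to-address binding mentioned above is done with a <host> entry inside the network's <dhcp> element. A sketch of such a libvirt network definition follows; the network name, MAC and addresses here are illustrative, not taken from the report:

```xml
<!-- Sketch: libvirt network handing out a fixed DHCP lease to one guest.
     Name, MAC address and IP ranges are illustrative. -->
<network>
  <name>omv</name>
  <forward mode='nat'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.123.2' end='192.168.123.254'/>
      <!-- this guest always receives 192.168.123.100 -->
      <host mac='52:54:00:5c:c7:be' name='test1' ip='192.168.123.100'/>
    </dhcp>
  </ip>
</network>
```

With a definition like this, the guest needs only one DHCP interface, yet its address stays stable across reboots.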
There are some examples here: http://libvirt.org/formatnetwork.html

---
Trying this on my personal laptop running Fedora 22, and tweaking the omv.yaml file to use the OMV Fedora 22 Vagrant boxes (rather than the OMV Fedora 21 boxes), I get slightly different symptoms from those I saw with Fedora 21 as the host and guest:

* the eth1 definition without an IPv4 address is still present in ifconfig
* there's no ifcfg-eth1 network script at all (neither system generated nor vagrant generated)

Attempting to restart the network services with "sudo systemctl restart network" gives the following result on all 3 machines:

May 28 02:18:18 localhost.localdomain systemd[1]: Starting LSB: Bring up/down networking...
-- Subject: Unit network.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit network.service has begun starting up.
May 28 02:18:18 localhost.localdomain network[2542]: Bringing up loopback interface:  Could not load file '/etc/sysconfig/network-scripts/ifcfg-lo'
May 28 02:18:18 localhost.localdomain network[2542]: Could not load file '/etc/sysconfig/network-scripts/ifcfg-lo'
May 28 02:18:18 localhost.localdomain network[2542]: Could not load file '/etc/sysconfig/network-scripts/ifcfg-lo'
May 28 02:18:18 localhost.localdomain network[2542]: Could not load file '/etc/sysconfig/network-scripts/ifcfg-lo'
May 28 02:18:18 localhost.localdomain network[2542]: [  OK  ]
May 28 02:18:18 localhost.localdomain network[2542]: Bringing up interface eth0:  Error: Connection activation failed: Connection 'eth0' is already active on eth0
May 28 02:18:18 localhost.localdomain network[2542]: [FAILED]
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain network[2542]: RTNETLINK answers: File exists
May 28 02:18:18 localhost.localdomain systemd[1]: network.service: control process exited, code=exited status=1
May 28 02:18:18 localhost.localdomain systemd[1]: Failed to start LSB: Bring up/down networking.
-- Subject: Unit network.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit network.service has failed.
--
-- The result is failed.

---
I can confirm a Fedora 22 host with Fedora 21 guests shows the same symptoms as the Fedora 21 host with Fedora 21 guests: there's an ifcfg-eth1 network script present, and the network service appears to be running as expected, but the eth1 interface has no IPv4 address.

---
Correction to the previous post: if I make sure I'm using the Vagrant box from https://getfedora.org/en/cloud/download/, the network connection comes up fine (regardless of whether I use the simple reproducer or OMV itself). I've only done this twice, so an intermittent failure might still exist. However, it does mean it's specifically the default OMV Vagrant box from https://download.gluster.org/pub/gluster/purpleidea/vagrant/ that exhibited the problem with the network failing to come up correctly (and that was repeatable every time).

---
For me, this issue looks like this. I'm using oh-my-vagrant and a rhel-7 box.
After provisioning, I see this in the file /etc/sysconfig/network-scripts/ifcfg-eth0:

# Generated by dracut initrd
NAME="eth0"
ONBOOT=yes
NETBOOT=yes
UUID="4c227002-aa87-42a4-b904-190d5ba80fdf"
IPV6INIT=yes
BOOTPROTO=dhcp
TYPE=Ethernet
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
PEERDNS=yes
PEERROUTES=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
#VAGRANT-BEGIN
# The contents below are automatically generated by Vagrant. Do not modify.
NM_CONTROLLED=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.91.102
NETMASK=255.255.255.0
DEVICE=eth0
PEERDNS=no
#VAGRANT-END

I don't have an /etc/sysconfig/network-scripts/ifcfg-eth1 file.

I'm able to work around this: instead of letting oh-my-vagrant create the libvirt network, I create it manually, without DHCP and NAT enabled (the "RTNETLINK answers: File exists" error message is probably because there are 2 network devices with NAT). Then I create an /etc/sysconfig/network-scripts/ifcfg-eth1 file with the content vagrant added to ifcfg-eth0 (with the correct DEVICE), remove the content vagrant added to ifcfg-eth0, and bring up eth1 with "# ifup eth1" .. and it works.

Comment 20 (Takayoshi Kimura):
The problem is Vagrant: it configures the 2nd interface based on the assumption that the network interfaces are lo, eth0 and eth1. Fedora, CentOS and RHEL have docker0, lo, eth0 and eth1, so Vagrant misconfigures the eth1 interface and messes up eth0.

The upstream pull request is here, but it has not been merged yet:

https://github.com/mitchellh/vagrant/pull/5706

---
(In reply to Takayoshi Kimura from comment #20)
> The problem is Vagrant: it configures the 2nd interface based on the
> assumption that the network interfaces are lo, eth0 and eth1.

Wow, thanks! This patch works great. I highly recommend patching our Fedora versions of vagrant to include this.
This also explains why certain boxes didn't work -- the boxes that didn't work were those that had a docker0 interface.

Cheers,
James

---
This is now fixed both for Fedora 22 (update in stable) and 21 (update just pushed to stable).
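The root cause Takayoshi identified can be illustrated with a small sketch: choosing "the 2nd configurable interface" by raw list position breaks as soon as an extra device like docker0 appears, because interface lists sort docker0 before eth0. Filtering to the ethN devices first does not have that problem. This is illustrative code, not Vagrant's actual implementation:

```ruby
# Sketch of the failure mode from comment 20 (not Vagrant's real code):
# picking an interface by position assumes the list is [lo, eth0, eth1].
def second_interface_by_position(interfaces)
  (interfaces - ["lo"])[1]   # index-based: breaks when docker0 is present
end

# Name-based selection is robust to extra devices like docker0.
def second_interface_by_name(interfaces)
  interfaces.select { |i| i.start_with?("eth") }.sort[1]
end

plain_box  = ["eth0", "eth1", "lo"].sort              # no docker installed
docker_box = ["docker0", "eth0", "eth1", "lo"].sort   # docker0 shifts indices
```

On the docker box, the positional lookup returns eth0 instead of eth1, which matches the observed symptom: Vagrant writing the second network's static configuration into eth0's file and leaving eth1 unconfigured.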