Bug 1194623 - Bringing up the network on the cloud image is racy [NEEDINFO]
Summary: Bringing up the network on the cloud image is racy
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: initscripts
Version: 21
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Lukáš Nykrýn
QA Contact: Fedora Extras Quality Assurance
Reported: 2015-02-20 11:58 UTC by Sitsofe Wheeler
Modified: 2015-12-02 17:29 UTC (History)
Last Closed: 2015-12-02 09:16:58 UTC
sitsofe: needinfo? (lnykryn)


Attachments
Attaching journal output for problem boot (191.75 KB, text/plain)
2015-02-20 11:59 UTC, Sitsofe Wheeler

Description Sitsofe Wheeler 2015-02-20 11:58:18 UTC
Description of problem:
Bringing up the network on the Fedora 21 cloud image is racy and will sometimes fail.

Version-Release number of selected component (if applicable):
initscripts-9.56.1-6.fc21.x86_64

How reproducible:
About three boots out of every four.

Steps to Reproduce:
1. Start Fedora 21 cloud image.
2. After the VM has started, ping the IP address where it should appear.

Actual results:
Request timeout for icmp_seq 33
Request timeout for icmp_seq 34
Request timeout for icmp_seq 35

Expected results:
Request timeout for icmp_seq 33
64 bytes from 10.1.3.160: icmp_seq=34 ttl=64 time=0.578 ms
64 bytes from 10.1.3.160: icmp_seq=35 ttl=64 time=0.492 ms

Additional info:
The affected system is running on VMware Fusion 7. Here's a snippet of the journal around the affected time:
Feb 20 11:46:56  network[487]: Bringing up interface eth0:  ERROR    : [/etc/sysconfig/network-scripts/ifup-eth] Device eth0 does not seem to be present, delaying initialization.
Feb 20 11:46:56  /etc/sysconfig/network-scripts/ifup-eth[613]: Device eth0 does not seem to be present, delaying initialization.
Feb 20 11:46:56  network[487]: [FAILED]
Feb 20 11:46:56  kernel: fbcon: svgadrmfb (fb0) is primary device
Feb 20 11:46:56  kernel: Console: switching to colour frame buffer device 160x48
Feb 20 11:46:56  kernel: [drm] Initialized vmwgfx 2.6.1 20140704 for 0000:00:0f.0 on minor 0
Feb 20 11:46:56  systemd[1]: network.service: control process exited, code=exited status=1
Feb 20 11:46:56  systemd[1]: Failed to start LSB: Bring up/down networking.
Feb 20 11:46:56  systemd[1]: Unit network.service entered failed state.
Feb 20 11:46:56  systemd[1]: network.service failed.
Feb 20 11:46:56  systemd[1]: Starting Multi-User System.
Feb 20 11:46:56  systemd[1]: Reached target Multi-User System.
Feb 20 11:46:56  systemd[1]: Starting Update UTMP about System Runlevel Changes...
Feb 20 11:46:56  systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Feb 20 11:46:56  systemd[1]: Starting Network is Online.
Feb 20 11:46:56  systemd[1]: Reached target Network is Online.
Feb 20 11:46:56  systemd[1]: Started Update UTMP about System Runlevel Changes.
Feb 20 11:46:56  systemd[1]: Startup finished in 947ms (kernel) + 709ms (initrd) + 2.171s (userspace) = 3.827s.
Feb 20 11:46:56  kernel: e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
Feb 20 11:46:56  kernel: e1000: Copyright (c) 1999-2006 Intel Corporation.
Feb 20 11:46:57  kernel: e1000 0000:02:01.0 eth0: (PCI:66MHz:32-bit) 00:0c:29:98:37:8a
Feb 20 11:46:57  kernel: e1000 0000:02:01.0 eth0: Intel(R) PRO/1000 Network Connection

If you log into a VT and run
ifup eth0
the network will be brought up correctly:
Feb 20 11:53:09 login[524]: ROOT LOGIN ON tty1
Feb 20 11:53:21 kernel: e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Feb 20 11:53:21 dhclient[708]: DHCPREQUEST on eth0 to 255.255.255.255 port 67 (xid=0x2c91ef80)
Feb 20 11:53:21 dhclient[708]: DHCPACK from 10.1.0.2 (xid=0x2c91ef80)
Feb 20 11:53:24 NET[758]: /usr/sbin/dhclient-script : updated /etc/resolv.conf
Feb 20 11:53:24 dhclient[708]: bound to 10.1.3.160 -- renewal in 33622 seconds.

A variant of this issue on Hyper-V 2012 R2 (Bug #1095387) was originally seen back on Fedora 20.
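
The ordering visible in the journal snippet above (ifup-eth gives up on eth0, and only afterwards does the kernel register eth0) can be checked mechanically. The following is a minimal Python sketch, not part of initscripts: `find_late_devices` and both regular expressions are hypothetical and written only against the message formats quoted in this report, so other drivers or versions may log differently.

```python
import re

# Message printed by ifup-eth when the device is missing (format taken from this report).
MISSING_RE = re.compile(r"Device (\S+) does not seem to be present")
# Kernel line announcing a registered interface,
# e.g. "kernel: e1000 0000:02:01.0 eth0: ..." (assumed format, from this report).
KERNEL_RE = re.compile(r"kernel: \S+ \S+ (\w+): ")


def find_late_devices(journal_lines):
    """Return devices that ifup gave up on before the kernel announced them."""
    gave_up = {}  # device name -> index of the first "not present" line
    late = []
    for index, line in enumerate(journal_lines):
        missing = MISSING_RE.search(line)
        if missing:
            gave_up.setdefault(missing.group(1), index)
            continue
        announced = KERNEL_RE.search(line)
        if announced:
            device = announced.group(1)
            # The device showed up only after ifup had already failed on it.
            if device in gave_up and device not in late:
                late.append(device)
    return late
```

Feeding it the lines of a saved journal (one string per line) returns the list of interfaces that lost the race on that boot; an empty list means the device was announced before the network script looked for it, or never appeared in the log at all.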

Comment 1 Sitsofe Wheeler 2015-02-20 11:59:17 UTC
Created attachment 993873
Attaching journal output for problem boot

Comment 2 Lukáš Nykrýn 2015-02-20 12:45:36 UTC
Yep, this is a common issue. Unfortunately there is nothing we can do: the problem is that the device appears after the network initscript has run, and this can't be fixed through ordering because we don't know that the device will eventually appear.

There are some workarounds:
1) DEVTIMEOUT=
This tells initscripts to give the device some time to appear instead of printing "Device eth0 does not seem to be present".
2) Don't use passive network-scripts
NetworkManager or networkd can react to events that happen after the system has booted.
3) Write a custom udev rule
You can set ONBOOT=no for that device and write a udev rule which will run ifup on that device after it appears.
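
The idea behind the first workaround, giving the device a bounded amount of time to appear before giving up, can be sketched as a polling loop. This is a minimal Python illustration under stated assumptions, not the initscripts implementation: `wait_for_device`, its `poll_interval` default and the `sysfs_root` parameter are hypothetical, and it relies only on Linux exposing network interfaces as entries under /sys/class/net.

```python
import os
import time


def wait_for_device(name, timeout, sysfs_root="/sys/class/net", poll_interval=0.1):
    """Poll until <sysfs_root>/<name> exists or `timeout` seconds have elapsed.

    Returns True if the device appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    path = os.path.join(sysfs_root, name)
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A call such as `wait_for_device("eth0", 2)` would wait at most about two seconds. A fixed timeout only narrows the race window; an event-driven approach (workarounds 2 and 3) removes the need to guess how long the driver takes to load.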

In the past (and in RHEL 6), initscripts had a hotplug script which was able to handle udev events after the system had booted, but it was extremely racy and we dropped it in favour of NM.

Comment 3 Sitsofe Wheeler 2015-02-20 15:38:13 UTC
Lukáš:
I see - perhaps there could be a release note mentioning this explicitly for cloud images? I've been playing with
DEVTIMEOUT=2
in /etc/sysconfig/network and it seems to help but I need the patch from Bug #1180837 for it to work.

Could a "solution" be to load the network card drivers forcefully at initramfs time so the devices are given the best possible chance of existing before the initscript runs? This could be done only for cloud installs... CC'ing jzb.

Comment 4 Lukáš Nykrýn 2015-02-24 09:00:23 UTC
> Could a "solution" be to load the network card drivers forcefully at
> initramfs time so the devices are given the best possible chance of existing
> before the initscript runs? This could be done only for cloud installs...
> CC'ing jzb.

No, the solution for this is not to use network-scripts at all.
Even in RHEL we recommend that users use them only for legacy stuff
https://access.redhat.com/solutions/783533

If you want to have something small in your images just give networkd a shot.
It is installed anyway as part of the systemd package and it is really easy to configure. For basic setups you probably just need something like

/etc/systemd/network/my.network
[Match]
Name=eth*

[Network]
DHCP=both


But anyway I have built initscripts-9.56.1-7.fc21 with the mentioned patch.

Comment 5 Sitsofe Wheeler 2015-02-24 19:09:31 UTC
Lukáš:
OK, what you're saying sounds pretty conclusive - initscripts should not be used on new systems for network bring up (unless you are in one of the few situations mentioned in https://access.redhat.com/solutions/783533).

jzb:
Going forwards can Fedora cloud images stop defaulting to the use of network-scripts for network bring up and instead switch to networkd?

Comment 6 Dusty Mabe 2015-03-03 04:20:57 UTC
There could be one hiccup with regard to using systemd-networkd. While not a common situation, it is possible to statically configure networking using the network-interfaces directive of cloud-init [1]. Doing this populates the files in the network-scripts directory, which are not compatible with systemd-networkd.

This is something that should be considered before making such a move. 

[1] - http://cloudinit.readthedocs.org/en/latest/topics/datasources.html

Comment 7 Lukáš Nykrýn 2015-03-03 08:00:39 UTC
My plan is to write a generator which would convert ifcfg files to networkd configuration, so that should solve your problem.
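
To make the kind of conversion described here concrete, the following is a minimal Python sketch. It is not the generator planned in this comment; `parse_ifcfg` and `ifcfg_to_networkd` are hypothetical names, only plain KEY=value lines are handled (no shell expansion), and only the basic Ethernet keys DEVICE, BOOTPROTO, IPADDR, PREFIX and GATEWAY are mapped. Mapping BOOTPROTO=dhcp to DHCP=ipv4 is an assumption made for this sketch.

```python
def parse_ifcfg(text):
    """Parse simple KEY=value lines; shell expansions are not handled."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes around the value.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def ifcfg_to_networkd(text):
    """Render a minimal .network file for a basic Ethernet ifcfg file."""
    cfg = parse_ifcfg(text)
    lines = ["[Match]", "Name=" + cfg["DEVICE"], "", "[Network]"]
    if cfg.get("BOOTPROTO", "").lower() == "dhcp":
        # Assumption: BOOTPROTO=dhcp is treated as IPv4 DHCP only.
        lines.append("DHCP=ipv4")
    else:
        if "IPADDR" not in cfg or "PREFIX" not in cfg:
            raise ValueError("static setup needs IPADDR and PREFIX")
        lines.append("Address={}/{}".format(cfg["IPADDR"], cfg["PREFIX"]))
        if "GATEWAY" in cfg:
            lines.append("Gateway=" + cfg["GATEWAY"])
    return "\n".join(lines) + "\n"
```

The returned text could be written to a file under /etc/systemd/network/ with a .network suffix. Anything beyond basic Ethernet (bonding, VLANs, bridges, multiple addresses) is deliberately out of scope here.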

Comment 8 Dusty Mabe 2015-03-05 04:26:52 UTC
Lukas, is there a tracker for this work that we can follow?

Sitsofe, it looks like the work for systemd-networkd in cloud-init, etc. won't be ready for Fedora 22. Have you managed to work around it at all? If so, what are your workarounds?


A few more questions:

Does this only happen on Hyper-V and VMware Fusion? Other options are KVM/Xen/VirtualBox.

If you soft reboot the VM does networking come up fine on the 2nd try?

Comment 9 Sitsofe Wheeler 2015-03-05 09:06:24 UTC
Dusty:
I have managed to work around it by using the patch mentioned in #1180837 and adding
DEVTIMEOUT=2
to /etc/sysconfig/network . DEVTIMEOUT=1 was enough to solve the problem on VMware Fusion but there were still issues on Hyper-V that went away with DEVTIMEOUT=2.

I don't know if this happens on KVM/Xen/VirtualBox as my machines are currently all Hyper-V and VMware/ESXi based.

If I soft reboot, sometimes the networking comes up and sometimes it doesn't. This issue happens frequently (one in five to ten times), but you can boot from cold and have things be fine, then soft reboot and end up without networking. Likewise, you can boot from cold, hit this problem and have no networking, then reboot from within the VM and things are fine.

I think for now a note saying "This is what can happen, workaround by doing this or that" should suffice. I suspect it only impacts people who have a comparatively fast and empty boot - if enough concurrent services are starting such that the networking script is always delayed by even a few tenths of a second I can well believe you will never see this issue.

Comment 10 Dusty Mabe 2015-06-11 00:08:54 UTC
(In reply to Lukáš Nykrýn from comment #7)
> My plan is to write a generator which would convert ifcfg files to networkd
> configuration, so that should solve your problem.

Hey Lukas, Did you make any progress on this? Is there a tracker for the work?

Comment 11 Lukáš Nykrýn 2015-06-11 06:56:47 UTC
https://github.com/lnykryn/ifcfg-generator
But it is not usable yet: it supports only basic Ethernet, and I have only written the code and not tested it.

Comment 12 Dusty Mabe 2015-06-13 16:40:38 UTC
(In reply to Lukáš Nykrýn from comment #11)
> https://github.com/lnykryn/ifcfg-generator
> But it is not usable yet: it supports only basic Ethernet, and I have only
> written the code and not tested it.

Progress! That is awesome. Do you think there is any chance we could have something in place for F23? We'd like to consider using systemd-networkd as the default for the F23 cloud image.

Comment 13 Sitsofe Wheeler 2015-08-02 07:13:29 UTC
This issue popped up again a month after I filed this but in a separate bug (Bug #1204612) and a default workaround was established there (cloud images from Fedora 22 default to DEVTIMEOUT=10).

Comment 14 Sitsofe Wheeler 2015-08-02 08:23:24 UTC
Adding NEEDINFO for comment #12.

Comment 15 Fedora End Of Life 2015-11-04 10:58:38 UTC
This message is a reminder that Fedora 21 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 21. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '21'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 21 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 16 Sitsofe Wheeler 2015-11-09 21:36:33 UTC
I've moved over to systemd-networkd which can react to network devices appearing so this is no longer an issue for me after customisation.

Comment 17 Fedora End Of Life 2015-12-02 09:17:02 UTC
Fedora 21 changed to end-of-life (EOL) status on 2015-12-01. Fedora 21 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

