Description of problem: On a particular piece of hardware (ThinkPad T540p) I have network down 90% of times I boot netinst. I can't reproduce it on a different hardware or in a VM, it's just this particular hardware. However, I have seen it twice even on a freshly installed Server system, once on that ThinkPad, and once in a VM. Petr Schindler saw this as well once, in a VM. But this (on an installed system) is extremely hard to reproduce, it happens very rarely. The problem is just related to a particular boot - if I reboot, the network starts up properly. So, it seems something is wrong in the NetworkManager, because it affects both installed systems and netinst. Fortunately, the netinst + T540p combination fails very often, so I'm able to provide some logs. In all cases, running `dhclient <interface>` fixes the problem. This is how it looks like: $ ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 54:ee:75:46:cb:54 brd ff:ff:ff:ff:ff:ff 3: wlp4s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether cc:3d:82:48:95:2c brd ff:ff:ff:ff:ff:ff Version-Release number of selected component (if applicable): F22 Beta RC2 Server netinst, or an installed system How reproducible: extremely hard, unless running with my T540p + netinst, then very often (at the moment) Steps to Reproduce: 1. boot F22 Beta RC2 Server netinst 2. see in anaconda that network is down 3. check in terminal with `ip a` 4. fix in anaconda by manually enabling the device, or running `dhclient <interface>`
Created attachment 1014690 [details] netinst - anaconda.log
Created attachment 1014691 [details] netinst - ifcfg.log
Created attachment 1014692 [details] netinst - program.log
Created attachment 1014693 [details] netinst - syslog
Created attachment 1014694 [details] netinst - journal
Created attachment 1014695 [details] netinst - screenshot
Created attachment 1014696 [details] installed system - ip -a
Created attachment 1014697 [details] installed system - journal
Created attachment 1014698 [details] installed system - rpm-qa
We have tried a different T540p laptop and it happens there as well! It is still probably a race condition, but on certain hardware (T540p) it seems to happen very frequently, and on a different hardware (X220) it doesn't seem to happen at all.
This seems too racy to propose for Beta, but it seems to be a serious issue in the Server use case (imagine your server sometimes boots up without bringing up network), which warrants at least a FinalBlocker discussion.
Discussed at today's blocker review meeting [1]. This bug was rejected as Final Blocker - this is a provisional rejection: as we haven't seen any other reports of this it seems like it might be a very unusual case (perhaps related to both the hardware and the router?) and so not worthy of blocker status, it can be re-proposed if further research suggests it's more common than we thought [1] http://meetbot.fedoraproject.org/fedora-blocker-review/2015-04-20/
I can also confirm this on my T540p laptop. It does not occur on an installed system that I can identify, though it's possible it's related to an earlier blocker on the same hardware: BZ #1199098
After reading through bug 1199098 it's worth mentioning that in our office we also have IPv6 connectivity. But it's only local IPv6 (not routed to the Internet) and other computers don't have any issues, we only see it with T540p.
(In reply to Kamil Páral from comment #14) > After reading through bug 1199098 it's worth mentioning that in our office > we also have IPv6 connectivity. But it's only local IPv6 (not routed to the > Internet) and other computers don't have any issues, we only see it with > T540p. Kamil, that's the exact same situation that I was in when I hit 1199098. The fact that it's not actually routed externally is irrelevant. From the perspective of the affected machine, it has IPv6 connectivity. I should note that the "fix" for 1199098 wasn't a complete one; it was reducing the likelihood of hitting it by breaking the connection loop. It was hoped that would make it easier to spot the bug, but it apparently loosened up the affected race enough that I haven't hit it again.
Looking at the logs, three things draw my attention: (In reply to Kamil Páral from comment #4) > Created attachment 1014693 [details] > netinst - syslog 1) There could be a race in NetworkManager marking "startup complete" too early and do not activate 'Wired connection 1' when enp0s25 becomes ready. 11:08:55,029 INFO NetworkManager: <info> (enp0s25): device state change: unavailable -> disconnected (reason 'carrier-changed') [20 30 40] 11:08:55,030 INFO NetworkManager: <info> startup complete Would it be possible to run NetworkManager with debug logs? 2) There's a jump in the clock that may cause problems to dhclient: 11:09:02,973 INFO kernel:[ 32.984366] Btrfs loaded 12:05:23,584 WARNING chronyd: Forward time jump detected! (see bug 1183807) (In reply to Kamil Páral from comment #8) > Created attachment 1014697 [details] > installed system - journal 3) he DHCP transaction is started in this log Apr 15 13:40:26 localhost.localdomain NetworkManager[735]: <info> Activation (eth0) Beginning DHCPv4 transaction (timeout in 45 seconds) ... Apr 15 13:40:34 localhost.localdomain dhclient[1068]: DHCPREQUEST on eth0 to 255.255.255.255 port 67 (xid=0xe972a31e) but the log ends prematurely, so we can't see if the transaction could succeed.
Great findings, Jirka. We were mostly able to reproduce this problem just with T540p and netinst. Is there any way to make NM run in debug mode on netinst? E.g. some kernel cmdline argument? (We could try to generate our own netinst with modified config files, if no easier approach is available.)
(In reply to Kamil Páral from comment #17) > Great findings, Jirka. We were mostly able to reproduce this problem just > with T540p and netinst. Is there any way to make NM run in debug mode on > netinst? E.g. some kernel cmdline argument? (We could try to generate our > own netinst with modified config files, if no easier approach is available.) I am afraid that kernel command line won't help here. You need to pass an argument to NetworkManager process. It can be done (a) on command line: NetworkManager --no-daemon --log-level=DEBUG --log-domains=ALL which can be set in /lib/systemd/system/NetworkManager.service (b) or you can use NetworkManager configuration file /etc/NetworkManager/NetworkManager.conf and add this to the file: [logging] level=DEBUG domains=ALL
(In reply to Jirka Klimes from comment #16) > 2) > There's a jump in the clock that may cause problems to dhclient: > 11:09:02,973 INFO kernel:[ 32.984366] Btrfs loaded > 12:05:23,584 WARNING chronyd: Forward time jump detected! > (see bug 1183807) Reported as bug 1225479. However, that doesn't seem to be the cause of this issue, because I have no IP address assigned even before reaching the anaconda main hub screen. Investigating further...
After debugging with Jirka Klimes and Radek Vykydal in person, it seems that there is a race condition between anaconda and NetworkManager, and it is only seen on certain hardware (probably with somewhat slower NICs). The issue is that anaconda starts NetworkManager, modifies the current configuration with ONBOOT=no (which can be relevant to bug 963952), and dumps the config file. However, there seems to be a race and sometimes NM reads the ONBOOT=no directive before actually having initialized the device, and it skips doing that in that case. Attaching fresh logs with F22 DVD x86_64.
Created attachment 1031129 [details] anaconda.log
Created attachment 1031130 [details] ifcfg.log
Created attachment 1031131 [details] journal
Created attachment 1031132 [details] syslog
And the following logs are with NM debug messages enabled (through updates.img).
Created attachment 1031133 [details] anaconda.log (NM debug)
Created attachment 1031134 [details] ifcfg.log (NM debug)
Created attachment 1031135 [details] journal (NM debug)
Created attachment 1031136 [details] syslog (NM debug)
Radek has created these update images: * sleep 10 seconds on start: https://rvykydal.fedorapeople.org/updates.onboot_sleep10.img * don't set ONBOOT=no: https://rvykydal.fedorapeople.org/updates.onboot_remove.img I have tested both and both fix the problem with T540p - the network is active automatically. It seems to confirm that the problem is in a race condition between anaconda and NM. There is a slight difference - when using the second updates.img, the following is printed on tty1: "Not asking for VNC because we don't have a network" It is not present when using the first updates.img (sleep 10). I'm not sure if it is significant, though, because booting with inst.vnc works out of the box even with vanilla F22 DVD installer (the network is probably brought up by dracut).
I can confirm this is fixed with F23 Alpha.