Bug 1212009 - on certain hardware the network is not initialized during anaconda start
Summary: on certain hardware the network is not initialized during anaconda start
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: 22
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Radek Vykydal
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-04-15 11:55 UTC by Kamil Páral
Modified: 2015-08-11 09:54 UTC (History)
10 users (show)

Fixed In Version: anaconda-23.13-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-11 09:54:55 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
netinst - anaconda.log (7.00 KB, text/plain)
2015-04-15 11:56 UTC, Kamil Páral
no flags Details
netinst - ifcfg.log (2.28 KB, text/plain)
2015-04-15 11:56 UTC, Kamil Páral
no flags Details
netinst - program.log (38.77 KB, text/plain)
2015-04-15 11:56 UTC, Kamil Páral
no flags Details
netinst - syslog (91.05 KB, text/plain)
2015-04-15 11:56 UTC, Kamil Páral
no flags Details
netinst - journal (168.62 KB, text/plain)
2015-04-15 11:56 UTC, Kamil Páral
no flags Details
netinst - screenshot (520.71 KB, image/jpeg)
2015-04-15 11:57 UTC, Kamil Páral
no flags Details
installed system - ip -a (551 bytes, text/plain)
2015-04-15 12:00 UTC, Kamil Páral
no flags Details
installed system - journal (154.40 KB, text/plain)
2015-04-15 12:00 UTC, Kamil Páral
no flags Details
installed system - rpm-qa (21.03 KB, text/plain)
2015-04-15 12:00 UTC, Kamil Páral
no flags Details
anaconda.log (2.82 KB, text/plain)
2015-05-28 09:48 UTC, Kamil Páral
no flags Details
ifcfg.log (2.11 KB, text/plain)
2015-05-28 09:48 UTC, Kamil Páral
no flags Details
journal (160.54 KB, text/plain)
2015-05-28 09:48 UTC, Kamil Páral
no flags Details
syslog (107.04 KB, text/plain)
2015-05-28 09:48 UTC, Kamil Páral
no flags Details
anaconda.log (NM debug) (2.88 KB, text/plain)
2015-05-28 09:49 UTC, Kamil Páral
no flags Details
ifcfg.log (NM debug) (2.11 KB, text/plain)
2015-05-28 09:50 UTC, Kamil Páral
no flags Details
journal (NM debug) (488.66 KB, text/plain)
2015-05-28 09:50 UTC, Kamil Páral
no flags Details
syslog (NM debug) (356.23 KB, text/plain)
2015-05-28 09:50 UTC, Kamil Páral
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 963952 0 unspecified CLOSED Failure to connect to wired ethernet on reboots 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1016970 0 unspecified CLOSED wired ethernet is Off by default on F20 beta TC2 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1199098 0 unspecified CLOSED Wired networking does not work on Lenovo T540p 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1225479 0 unspecified CLOSED entering main hub shifts system clock forward by 56 minutes or 1h 40min 2021-02-22 00:41:40 UTC

Internal Links: 963952 1016970 1199098 1225479

Description Kamil Páral 2015-04-15 11:55:26 UTC
Description of problem:
On a particular piece of hardware (ThinkPad T540p) I have network down 90% of times I boot netinst. I can't reproduce it on a different hardware or in a VM, it's just this particular hardware.

However, I have seen it twice even on a freshly installed Server system, once on that ThinkPad, and once in a VM. Petr Schindler saw this as well once, in a VM. But this (on an installed system) is extremely hard to reproduce, it happens very rarely. The problem is just related to a particular boot - if I reboot, the network starts up properly.

So, it seems something is wrong in the NetworkManager, because it affects both installed systems and netinst. Fortunately, the netinst + T540p combination fails very often, so I'm able to provide some logs.

In all cases, running `dhclient <interface>` fixes the problem.

This is how it looks like:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 54:ee:75:46:cb:54 brd ff:ff:ff:ff:ff:ff
3: wlp4s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether cc:3d:82:48:95:2c brd ff:ff:ff:ff:ff:ff


Version-Release number of selected component (if applicable):
F22 Beta RC2 Server netinst, or an installed system

How reproducible:
extremely hard, unless running with my T540p + netinst, then very often (at the moment)

Steps to Reproduce:
1. boot F22 Beta RC2 Server netinst
2. see in anaconda that network is down
3. check in terminal with `ip a`
4. fix in anaconda by manually enabling the device, or running `dhclient <interface>`

Comment 1 Kamil Páral 2015-04-15 11:56:35 UTC
Created attachment 1014690 [details]
netinst - anaconda.log

Comment 2 Kamil Páral 2015-04-15 11:56:40 UTC
Created attachment 1014691 [details]
netinst - ifcfg.log

Comment 3 Kamil Páral 2015-04-15 11:56:45 UTC
Created attachment 1014692 [details]
netinst - program.log

Comment 4 Kamil Páral 2015-04-15 11:56:50 UTC
Created attachment 1014693 [details]
netinst - syslog

Comment 5 Kamil Páral 2015-04-15 11:56:55 UTC
Created attachment 1014694 [details]
netinst - journal

Comment 6 Kamil Páral 2015-04-15 11:57:26 UTC
Created attachment 1014695 [details]
netinst - screenshot

Comment 7 Kamil Páral 2015-04-15 12:00:19 UTC
Created attachment 1014696 [details]
installed system - ip -a

Comment 8 Kamil Páral 2015-04-15 12:00:26 UTC
Created attachment 1014697 [details]
installed system - journal

Comment 9 Kamil Páral 2015-04-15 12:00:32 UTC
Created attachment 1014698 [details]
installed system - rpm-qa

Comment 10 Kamil Páral 2015-04-15 12:15:56 UTC
We have tried a different T540p laptop and it happens there as well! It is still probably a race condition, but on certain hardware (T540p) it seems to happen very frequently, and on a different hardware (X220) it doesn't seem to happen at all.

Comment 11 Kamil Páral 2015-04-15 12:29:40 UTC
This seems too racy to propose for Beta, but it seems to be a serious issue in the Server use case (imagine your server sometimes boots up without bringing up network), which warrants at least a FinalBlocker discussion.

Comment 12 Petr Schindler 2015-04-20 18:29:31 UTC
Discussed at today's blocker review meeting [1].

This bug was rejected as Final Blocker - this is a provisional rejection: as we haven't seen any other reports of this it seems like it might be a very unusual case (perhaps related to both the hardware and the router?) and so not worthy of blocker status, it can be re-proposed if further research suggests it's more common than we thought

[1] http://meetbot.fedoraproject.org/fedora-blocker-review/2015-04-20/

Comment 13 Stephen Gallagher 2015-04-20 19:08:42 UTC
I can also confirm this on my T540p laptop. It does not occur on an installed system that I can identify, though it's possible it's related to an earlier blocker on the same hardware: BZ #1199098

Comment 14 Kamil Páral 2015-04-21 06:43:12 UTC
After reading through bug 1199098 it's worth mentioning that in our office we also have IPv6 connectivity. But it's only local IPv6 (not routed to the Internet) and other computers don't have any issues, we only see it with T540p.

Comment 15 Stephen Gallagher 2015-04-21 11:47:04 UTC
(In reply to Kamil Páral from comment #14)
> After reading through bug 1199098 it's worth mentioning that in our office
> we also have IPv6 connectivity. But it's only local IPv6 (not routed to the
> Internet) and other computers don't have any issues, we only see it with
> T540p.

Kamil, that's the exact same situation that I was in when I hit 1199098. The fact that it's not actually routed externally is irrelevant. From the perspective of the affected machine, it has IPv6 connectivity.

I should note that the "fix" for 1199098 wasn't a complete one; it was reducing the likelihood of hitting it by breaking the connection loop. It was hoped that would make it easier to spot the bug, but it apparently loosened up the affected race enough that I haven't hit it again.

Comment 16 Jirka Klimes 2015-04-30 13:33:06 UTC
Looking at the logs, three things draw my attention:
(In reply to Kamil Páral from comment #4)
> Created attachment 1014693 [details]
> netinst - syslog
1) There could be a race in NetworkManager marking "startup complete" too early and do not activate 'Wired connection 1' when enp0s25 becomes ready.
11:08:55,029 INFO NetworkManager: <info>  (enp0s25): device state change: unavailable -> disconnected (reason 'carrier-changed') [20 30 40]
11:08:55,030 INFO NetworkManager: <info>  startup complete

Would it be possible to run NetworkManager with debug logs?

2)
There's a jump in the clock that may cause problems to dhclient:
11:09:02,973 INFO kernel:[   32.984366] Btrfs loaded
12:05:23,584 WARNING chronyd: Forward time jump detected!
(see bug 1183807)

(In reply to Kamil Páral from comment #8)
> Created attachment 1014697 [details]
> installed system - journal
3)
he DHCP transaction is started in this log
Apr 15 13:40:26 localhost.localdomain NetworkManager[735]: <info>  Activation (eth0) Beginning DHCPv4 transaction (timeout in 45 seconds)
...
Apr 15 13:40:34 localhost.localdomain dhclient[1068]: DHCPREQUEST on eth0 to 255.255.255.255 port 67 (xid=0xe972a31e)

but the log ends prematurely, so we can't see if the transaction could succeed.

Comment 17 Kamil Páral 2015-04-30 13:47:29 UTC
Great findings, Jirka. We were mostly able to reproduce this problem just with T540p and netinst. Is there any way to make NM run in debug mode on netinst? E.g. some kernel cmdline argument? (We could try to generate our own netinst with modified config files, if no easier approach is available.)

Comment 18 Jirka Klimes 2015-05-05 08:26:08 UTC
(In reply to Kamil Páral from comment #17)
> Great findings, Jirka. We were mostly able to reproduce this problem just
> with T540p and netinst. Is there any way to make NM run in debug mode on
> netinst? E.g. some kernel cmdline argument? (We could try to generate our
> own netinst with modified config files, if no easier approach is available.)

I am afraid that kernel command line won't help here. You need to pass an argument to NetworkManager process.
It can be done
(a) on command line:
NetworkManager --no-daemon --log-level=DEBUG --log-domains=ALL
which can be set in /lib/systemd/system/NetworkManager.service

(b) or you can use NetworkManager configuration file
/etc/NetworkManager/NetworkManager.conf
and add this to the file:
[logging]
level=DEBUG
domains=ALL

Comment 19 Kamil Páral 2015-05-27 13:22:23 UTC
(In reply to Jirka Klimes from comment #16)
> 2)
> There's a jump in the clock that may cause problems to dhclient:
> 11:09:02,973 INFO kernel:[   32.984366] Btrfs loaded
> 12:05:23,584 WARNING chronyd: Forward time jump detected!
> (see bug 1183807)

Reported as bug 1225479. However, that doesn't seem to be the cause of this issue, because I have no IP address assigned even before reaching the anaconda main hub screen. Investigating further...

Comment 20 Kamil Páral 2015-05-28 09:46:46 UTC
After debugging with Jirka Klimes and Radek Vykydal in person, it seems that there is a race condition between anaconda and NetworkManager, and it is only seen on certain hardware (probably with somewhat slower NICs).

The issue is that anaconda starts NetworkManager, modifies the current configuration with ONBOOT=no (which can be relevant to bug 963952), and dumps the config file. However, there seems to be a race and sometimes NM reads the ONBOOT=no directive before actually having initialized the device, and it skips doing that in that case. Attaching fresh logs with F22 DVD x86_64.

Comment 21 Kamil Páral 2015-05-28 09:48:47 UTC
Created attachment 1031129 [details]
anaconda.log

Comment 22 Kamil Páral 2015-05-28 09:48:50 UTC
Created attachment 1031130 [details]
ifcfg.log

Comment 23 Kamil Páral 2015-05-28 09:48:53 UTC
Created attachment 1031131 [details]
journal

Comment 24 Kamil Páral 2015-05-28 09:48:57 UTC
Created attachment 1031132 [details]
syslog

Comment 25 Kamil Páral 2015-05-28 09:49:40 UTC
And the following logs are with NM debug messages enabled (through updates.img).

Comment 26 Kamil Páral 2015-05-28 09:49:56 UTC
Created attachment 1031133 [details]
anaconda.log (NM debug)

Comment 27 Kamil Páral 2015-05-28 09:50:01 UTC
Created attachment 1031134 [details]
ifcfg.log (NM debug)

Comment 28 Kamil Páral 2015-05-28 09:50:10 UTC
Created attachment 1031135 [details]
journal (NM debug)

Comment 29 Kamil Páral 2015-05-28 09:50:17 UTC
Created attachment 1031136 [details]
syslog (NM debug)

Comment 30 Kamil Páral 2015-05-28 11:04:44 UTC
Radek has created these update images:
* sleep 10 seconds on start: https://rvykydal.fedorapeople.org/updates.onboot_sleep10.img
* don't set ONBOOT=no: https://rvykydal.fedorapeople.org/updates.onboot_remove.img

I have tested both and both fix the problem with T540p - the network is active automatically. It seems to confirm that the problem is in a race condition between anaconda and NM.

There is a slight difference - when using the second updates.img, the following is printed on tty1:
"Not asking for VNC because we don't have a network"
It is not present when using the first updates.img (sleep 10). I'm not sure if it is significant, though, because booting with inst.vnc works out of the box even with vanilla F22 DVD installer (the network is probably brought up by dracut).

Comment 31 Kamil Páral 2015-08-11 09:54:55 UTC
I can confirm this is fixed with F23 Alpha.


Note You need to log in before you can comment on or make changes to this bug.