Bug 1600823 - NetworkManager sometimes fails to start due to 'ordering cycle' problem
Summary: NetworkManager sometimes fails to start due to 'ordering cycle' problem
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: dnf
Version: rawhide
Hardware: All
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Marek Blaha
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker
: 1622202 (view as bug list)
Depends On:
Blocks: F29BetaBlocker
TreeView+ depends on / blocked
 
Reported: 2018-07-13 06:44 UTC by Adam Williamson
Modified: 2020-06-15 14:44 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-11 00:13:21 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
/var/log tarball from affected boot (599.35 KB, application/x-gzip)
2018-07-13 19:55 UTC, Adam Williamson
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1247083 0 low CLOSED dnf timers require network access and thus should require systemd network target 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1601108 1 None None None 2021-01-20 06:05:38 UTC

Internal Links: 1247083 1601108

Description Adam Williamson 2018-07-13 06:44:23 UTC
Since Fedora-Rawhide-20180711.n.0 (it seems), some openQA tests have been failing due to the installed system having no network connection. A few tests seem to fail for each set of tests run; which ones fail seems to be random, this seems like a quasi-random problem.

I managed to hack a bit of debugging into the tests and reproduce the issue, in this test:

https://openqa.stg.fedoraproject.org/tests/328212

At https://openqa.stg.fedoraproject.org/tests/328212#step/_console_avc_crash/7 - just after boot of the installed system - we can see the problem: the ping of an IP that should be reachable (it's actually the 'worker host', the host on which the test VM is running) fails, and `systemctl status NetworkManager.service` shows that it is not running. We capture the whole contents of /var/log from the system - I will upload that as an attachment here, as it will eventually be garbage-collected from the openQA instance. In the journal there, we see this:

Jul 12 17:38:02 localhost.localdomain systemd[1]: network-online.target: Found ordering cycle on NetworkManager-wait-online.service/start
Jul 12 17:38:02 localhost.localdomain systemd[1]: network-online.target: Found dependency on NetworkManager.service/start
Jul 12 17:38:02 localhost.localdomain systemd[1]: network-online.target: Found dependency on basic.target/start
Jul 12 17:38:02 localhost.localdomain systemd[1]: network-online.target: Found dependency on dnf-makecache.timer/start
Jul 12 17:38:02 localhost.localdomain systemd[1]: network-online.target: Found dependency on network-online.target/start
Jul 12 17:38:02 localhost.localdomain systemd[1]: network-online.target: Job NetworkManager-wait-online.service/start deleted to break ordering cycle starting with network-online.target/start

I believe that because NetworkManager.service has 'Also=NetworkManager-wait-online.service', when NetworkManager-wait-online.service/start is deleted, NetworkManager.service/start is also deleted, so NetworkManager.service does not start.

While writing this bug report it occurred to me to look at the logs from a case where the bug did *not* happen, i.e. where the network *did* come up on startup, and lo and behold, we see what looks like basically the same cycle, only represented and broken differently:

Jul 12 17:27:50 localhost.localdomain systemd[1]: dnf-makecache.timer: Found ordering cycle on network-online.target/start
Jul 12 17:27:50 localhost.localdomain systemd[1]: dnf-makecache.timer: Found dependency on network.target/start
Jul 12 17:27:50 localhost.localdomain systemd[1]: dnf-makecache.timer: Found dependency on NetworkManager.service/start
Jul 12 17:27:50 localhost.localdomain systemd[1]: dnf-makecache.timer: Found dependency on basic.target/start
Jul 12 17:27:50 localhost.localdomain systemd[1]: dnf-makecache.timer: Found dependency on dnf-makecache.timer/start
Jul 12 17:27:50 localhost.localdomain systemd[1]: dnf-makecache.timer: Job network-online.target/start deleted to break ordering cycle starting with dnf-makecache.timer/start

so it seems that sometimes systemd decides the cycle is "in" dnf-makecache.timer (and hence disables it, meaning network will work), but at other times decides the cycle is "in" NetworkManager-wait-online.service and hence disables *that* instead.

From older logs I can find it rather looks like the cycle existed before the 20180711.n.0 compose, but for some reason it was never broken by removing NetworkManager-wait-online.service before that compose; I haven't figured out why not yet.

I guess the basic actual *problem* here is that dnf-makecache.timer has these lines:

Wants=network-online.target
After=network-online.target
WantedBy=basic.target

so it wants to start as part of basic.target, but after network-online.target...which AIUI is impossible, those are both systemd special targets and by design, network-online.target comes *after* basic.target. So systemd is faced with an impossible request and has to "pick a choice". Thus, I'm assigning this to dnf.

I believe this showed up in dnf 3.x, because in dnf 2.x, dnf-makecache.timer doesn't have these mutually-exclusive demands, it just has WantedBy=basic.target , but not After=network-online.target . So that at least explains why this only showed up relatively recently, but still, I would've expected it to start being a problem in the first successful compose with DNF 3, which was 20180627.n.0. Oh, well.

Proposing as an F29 Beta blocker as this can unexpectedly cause the network to not come up after installation.

Comment 1 Adam Williamson 2018-07-13 06:58:31 UTC
To state it more precisely, dnf-makecache.timer wanting to start After network-online.target means it must start After NetworkManager.service , but NetworkManager.service apparently must start after basic.target ...and dnf-makecache.timer is part of basic.target. Thus we have a loop: dnf-makecache requires NM requires basic requires dnf-makecache requires NM...

Comment 3 Adam Williamson 2018-07-13 19:55:38 UTC
Created attachment 1458779 [details]
/var/log tarball from affected boot

Comment 4 Lukas Ruzicka 2018-07-16 16:34:16 UTC
Discussed at the 2018-07-16 blocker review meeting [1]:

AcceptedBlocker (Beta) - accepted as a violation of all network-dependent criteria (e.g. updates, browser...) in the case where this prevents NetworkManager from starting

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2018-07-16/

Comment 5 Adam Williamson 2018-07-16 23:46:03 UTC
My fix was committed upstream, so I've backported it to the Rawhide package and sent a build to avoid further composes being affected by this:

https://koji.fedoraproject.org/koji/taskinfo?taskID=28348860

Comment 6 Adam Williamson 2018-08-11 00:13:21 UTC
This is fixed now, don't think we need to worry about it any more.

Comment 7 Zbigniew Jędrzejewski-Szmek 2018-08-22 16:30:40 UTC
Also https://github.com/rpm-software-management/dnf/pull/1180.

Comment 8 Marek Blaha 2018-11-07 14:37:56 UTC
*** Bug 1622202 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.