Bug 1264364

Summary: During installation there is no dns-server set: systemd-tmpfiles creates /etc/resolv.conf as a broken symlink, NetworkManager does not overwrite it
Product: [Fedora] Fedora Reporter: Joost van der Sluis <joost>
Component: systemd Assignee: systemd-maint
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhide CC: awilliam, dcbw, dshea, johannbg, jsynacek, lkundrak, lnykryn, lpoetter, msekleta, pschindl, psimerda, robatino, s, systemd-maint, walters, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-24 01:33:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1230431    

Description Joost van der Sluis 2015-09-18 10:06:44 UTC
Description of problem:

Anaconda fails to set up the base repository on a network installation, because it cannot resolve mirrors.fedoraproject.org.

Version-Release number of selected component (if applicable):

rawhide boot.iso from September 17th 2015.

How reproducible:

Always, but only tested on one piece of hardware. (Laptop, wireless-network does not work, so the wired-network is being used)

Steps to Reproduce:
1. Boot from boot.iso from usb-flash-drive
2. Select install from boot menu
3. Select language

Actual results:

The "Installation Source" item under Software shows an error saying that it cannot set up the base repository.

Expected results:

A valid connection to the installation-source

Additional info:

On the console it is not possible to resolve using 'nslookup fedoraproject.org', but 'nslookup fedoraproject.org <dnsserver>' works.

/etc/resolv.conf is a symlink to ../run/systemd/resolve/resolv.conf, which does not exist.
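A quick way to confirm this state is to check whether the path is a symlink whose target is missing. The following is a minimal Python sketch, not part of the original report; the helper names `classify` and `demo` are made up for illustration, and the demo rebuilds the layout in a temporary directory instead of touching the real /etc.

```python
import os
import tempfile


def classify(path):
    """Return 'missing', 'broken symlink', 'symlink' or 'present' for path."""
    if os.path.islink(path):
        # os.path.exists() follows symlinks, so it is False for a dangling link.
        return "symlink" if os.path.exists(path) else "broken symlink"
    return "present" if os.path.exists(path) else "missing"


def demo():
    """Recreate the reported layout inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "etc"))
        link = os.path.join(root, "etc", "resolv.conf")
        # Same relative target as in the report; nothing creates the target here.
        os.symlink("../run/systemd/resolve/resolv.conf", link)
        return classify(link), os.readlink(link)


if __name__ == "__main__":
    print(demo())
```

On an affected installer image, calling `classify("/etc/resolv.conf")` would be expected to report a broken symlink.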

Comment 1 Adam Williamson 2015-09-18 10:10:44 UTC
This is the bug Rawhide nightly openQA tests hit last night, I think.

The systemd network stuff should not be relevant here, the installer environment does not use it.

Comment 2 Adam Williamson 2015-09-18 15:57:11 UTC
davidshea reminds me that we already ran into this back in February - https://bugzilla.redhat.com/show_bug.cgi?id=1197204 . Seems like it came back. systemd is taking over resolv.conf when it shouldn't, because we're not using networkd.

Proposing as an F24 Alpha blocker, as it breaks networking in the installer: "When using a release-blocking dedicated installer image, the installer must be able to use either HTTP or FTP repositories (or both) as package sources."

Comment 3 Adam Williamson 2015-09-24 23:11:25 UTC
With systemd.log_level=debug I get this:

Sep 24 22:57:57 localhost systemd-sysusers[1290]: Group systemd-resolve already exists.
Sep 24 22:57:57 localhost systemd-sysusers[1290]: User systemd-resolve already exists.
Sep 24 22:57:57 localhost systemd-tmpfiles[1299]: Entry "/etc/resolv.conf" does not match any include prefix, skipping.
Sep 24 22:57:59 localhost systemd-tmpfiles[1410]: Running create action for entry L /etc/resolv.conf
Sep 24 22:57:59 localhost systemd-tmpfiles[1410]: Created symlink "/etc/resolv.conf".
Sep 24 22:57:59 localhost systemd-tmpfiles[1410]: Running remove action for entry L /etc/resolv.conf
Sep 24 22:58:00 localhost NetworkManager[1427]: <info>  DNS: using resolv.conf manager 'none'
Sep 24 22:58:04 localhost NetworkManager[1427]: <warn>  could not commit DNS changes: Could not stat /etc/resolv.conf: No such file or directory

Interesting thing is, though, that's all the same on F23, except that on F23, the NetworkManager error doesn't happen.

So I guess one of two things is happening:

1) On F23 NetworkManager overwrites /etc/resolv.conf if it's a broken symlink, but on F24 it doesn't.
2) On F23 systemd's 'remove action' removes the broken symlink, but on F24 it doesn't.

I'll see if I can figure out which it is...

Comment 4 Adam Williamson 2015-09-24 23:27:21 UTC
OK, so it seems to be 1); it's actually NetworkManager that changed. But systemd is still arguably doing something it shouldn't.

I tested like this:

1. boot installer
2. systemctl stop NetworkManager.service
3. rm -f /etc/resolv.conf
4. /usr/bin/systemd-tmpfiles --create --remove --boot --exclude-prefix=/dev
5. systemctl start NetworkManager.service

for both F23 Beta and current Rawhide, steps 1-4 have the same result: after 4, /etc/resolv.conf exists and is a broken symlink to ../run/systemd/resolve/resolv.conf . But on F23 Beta, after step 5, /etc/resolv.conf has changed into a regular file with valid contents (and networking works), whereas with Rawhide, after step 5, it's still the busted symlink, networking doesn't work, and the NetworkManager error about "No such file or directory" is in the journal.

Note that step 4) is what systemd-tmpfiles-setup.service does. I can't call 'systemctl restart systemd-tmpfiles-setup.service' as it's configured not to allow it (it says it 'may be requested by dependency only').

If the idea of https://bugzilla.redhat.com/show_bug.cgi?id=1197204 was that http://cgit.freedesktop.org/systemd/systemd/commit/tmpfiles.d?id=6921bf11fac23d93658b4c3f91d7b63a7f5b36c6 would somehow prevent the symlink being created before NetworkManager got a shot at creating the file, I don't think that's working.

Comment 5 Adam Williamson 2015-09-24 23:34:38 UTC
I think this probably showed up with the jump from NetworkManager-1.0.6-2.fc24 to NetworkManager-1.2.0-0.1.20150903gitde5d981.fc24 .

Comment 6 Adam Williamson 2015-09-25 00:14:19 UTC
Aha, yeah, so here's the smoking gun:

http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/src/dns-manager/nm-dns-manager.c?id=583568e1

despite being from 2014, that patch is only on 1.2.0, it's not in 1.0.6. NetworkManager doesn't overwrite systemd's busted symlink because, well, it's been told explicitly not to overwrite symlinks besides its own.
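To make the policy concrete, here is a simplified Python model of "do not overwrite symlinks besides our own". It is only a sketch of the behaviour described in this comment, not NetworkManager's actual code; `may_replace` and the `own` path are hypothetical names chosen for the example.

```python
import os
import tempfile


def may_replace(path, own_target):
    """Decide whether a DNS writer may replace path.

    Simplified policy: missing paths and regular files may be written;
    a symlink is only touched if it points at the writer's own file.
    """
    if not os.path.islink(path):
        return True
    return os.readlink(path) == own_target


def demo():
    """Check the three interesting cases in a throwaway directory."""
    own = "/run/hypothetical-nm/resolv.conf"  # placeholder, not NM's real path
    with tempfile.TemporaryDirectory() as root:
        foreign = os.path.join(root, "foreign")
        mine = os.path.join(root, "mine")
        os.symlink("../run/systemd/resolve/resolv.conf", foreign)
        os.symlink(own, mine)
        return (
            may_replace(os.path.join(root, "absent"), own),
            may_replace(foreign, own),
            may_replace(mine, own),
        )


if __name__ == "__main__":
    print(demo())
```

Under this model the dangling systemd link counts as "foreign", so the writer leaves it alone even though nothing is behind it.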

It does seem odd that something similar popped up in February (when NetworkManager probably didn't have the code to *explicitly* not overwrite symlinks, or at least not *this* code) and was somehow solved by the systemd change. I'll see if I can look into that, but regardless, it seems like our choices now are something like:

1) really make systemd not create this symlink before NetworkManager gets a shot / when systemd-networkd is not enabled
2) Make NetworkManager stomp on systemd's symlink somehow; make it OK to overwrite *broken* symlinks, perhaps?

Comment 7 Adam Williamson 2015-09-25 00:28:20 UTC
Aha, OK, I see the missing piece of the puzzle, why this briefly popped up in February then went away: the NetworkManager changes were briefly backported to Fedora - on 2015-01-21 - then removed again on 2015-03-05. So that all adds up now. The systemd config file change that was made for 1197204 may have solved the Atomic case Colin had(?), but it doesn't look like it *ever* solved the 'no network on boot.iso' case.

This bug has *always* been present when we've had the systemd that creates the symlink, and the newer NetworkManager code that refuses to overwrite symlinks - i.e. between 2015-02-18 (when systemd-219 landed) and 2015-03-05 (when the NetworkManager backport was reverted), and again since 2015-09-04 (when NetworkManager 1.2.0 snapshots landed in Rawhide and brought the NM change back into Fedora).

Comment 8 Dan Williams 2015-09-25 17:16:44 UTC
We did briefly have the symlinked resolv.conf behavior in F22 because we mistakenly included the patches.  After discussion somewhere we decided it was too big a change for F22 and reverted it since the whole symlink thing was very new at the time.

(In reply to awilliam from comment #6)
> 1) really make systemd not create this symlink before NetworkManager gets a
> shot / when systemd-networkd is not enabled

Is the issue that systemd is creating busted symlinks to a resolv.conf file?  If so, it shouldn't do that, or at the very least it should be cleaning up after itself if it finds a busted symlink.  Why does the systemd tmpfiles setup create the hardlink at all, when it's not clear at that point if systemd should even be managing resolv.conf?  ISTM that the thing that's actually going to be touching /run/systemd/resolve/resolv.conf should be the thing that sets up the links.

> 2) Make NetworkManager stomp on systemd's symlink somehow; make it OK to
> overwrite *broken* symlinks, perhaps?

NM won't do that, because if something symlinked resolv.conf that indicates that something else wants to handle resolv.conf.  And NM won't (and shouldn't) attempt to stomp all over that, since that's Not Nice.  I guess we could hardcode something that stomps all over broken systemd links, but that also seems hackish and really not nice.

Comment 9 Adam Williamson 2015-09-25 19:58:42 UTC
Dan: "Is the issue that systemd is creating busted symlinks to a resolv.conf file?"

Essentially, yep. systemd-tmpfiles-setup.service creates a symlink (not hard link) at /etc/resolv.conf pointing to /run/systemd/resolve/resolv.conf at boot time if the system does not already have any file present at /etc/resolv.conf . This is the case for the anaconda environment.

Now there seems to be some lack of clarity about whether systemd's exact current behaviour is actually *intended*. The previous bugs here are:

https://bugzilla.redhat.com/show_bug.cgi?id=1197204
https://bugzilla.redhat.com/show_bug.cgi?id=1116651

and to me it's not entirely clear what the systemd folks intend to happen. So I'll try and get some input from Colin, Lennart or someone else to this bug.

Considering the matter purely on its merits, I agree that it's strange for tmpfiles to be creating this symlink really at all. I don't understand what case they think they're fixing which couldn't equally well be fixed by networkd or resolved or whatever it is that actually provides /run/systemd/resolve/resolv.conf creating the symlink.

Comment 10 Colin Walters 2015-09-25 20:15:20 UTC
- resolved could create the file when it starts, not via tmpfiles.d

- NetworkManager could learn about resolved-built-but-not-enabled, i.e.:
  if (networkmanager_should_own_dns && readlink ("/etc/resolv.conf") == "/run/systemd/resolve/resolv.conf")
    unlink ("/etc/resolv.conf")
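A runnable version of that check might look like the Python sketch below. It is an illustration only: `points_at` and `release_resolved_link` are made-up names, and `nm_should_own_dns` stands in for whatever condition NetworkManager would really use. One detail worth noting is that the link seen in the installer has a relative target (../run/systemd/resolve/resolv.conf), so a literal string comparison against the absolute path would not match; the sketch normalizes the target first.

```python
import os
import tempfile


def points_at(link, expected):
    """True if link is a symlink whose (lexically normalized) target is expected."""
    if not os.path.islink(link):
        return False
    target = os.readlink(link)
    # A relative target is interpreted relative to the link's directory.
    resolved = os.path.normpath(os.path.join(os.path.dirname(link), target))
    return resolved == os.path.normpath(expected)


def release_resolved_link(link, expected, nm_should_own_dns):
    """Remove the link only if NM should own DNS and the link is resolved's."""
    if nm_should_own_dns and points_at(link, expected):
        os.unlink(link)
        return True
    return False


def demo():
    """Run the check against a copy of the installer layout in a temp dir."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "etc"))
        link = os.path.join(root, "etc", "resolv.conf")
        expected = os.path.join(root, "run", "systemd", "resolve", "resolv.conf")
        os.symlink("../run/systemd/resolve/resolv.conf", link)
        removed = release_resolved_link(link, expected, nm_should_own_dns=True)
        return removed, os.path.lexists(link)


if __name__ == "__main__":
    print(demo())
```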

Comment 11 Adam Williamson 2015-09-25 20:43:31 UTC
like Dan said, option b) seems pretty hacky to me. why should NM have to have magic knowledge about how systemd works? and what happens when systemd changes how it works but no-one remembers to tell NM?

a) sounds like the most obvious fix, to me, though of course it'd be good to know why this was put in tmpfiles in the first place to make sure it wouldn't regress something else.

Comment 12 Adam Williamson 2015-09-26 14:37:09 UTC
Lennart suggested an option c):

NM should create the symlink to its version in its %post script, if the file doesn't exist yet. That way, if you install a fresh new Fedora, NM will win, as it is installed in the first set of RPMs, and its %post scripts will run before tmpfiles of the systemd package, which will only run on the first reboot.
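The ordering argument can be sketched in a few lines of Python. This is a toy model under stated assumptions: both steps use the same "create only if absent" rule, `NM_TARGET` is a placeholder rather than a real NetworkManager path, and `demo` is a hypothetical helper.

```python
import os
import tempfile

RESOLVED_TARGET = "../run/systemd/resolve/resolv.conf"
NM_TARGET = "../run/hypothetical-nm/resolv.conf"  # placeholder target


def link_if_absent(link, target):
    """Create link -> target only when nothing exists at link.

    os.path.lexists() does not follow symlinks, so a dangling link
    still counts as "something is already there".
    """
    if os.path.lexists(link):
        return False
    os.symlink(target, link)
    return True


def demo(nm_post_runs_first):
    """Return the final link target for the given ordering."""
    with tempfile.TemporaryDirectory() as root:
        link = os.path.join(root, "resolv.conf")
        if nm_post_runs_first:
            link_if_absent(link, NM_TARGET)    # package %post step
        link_if_absent(link, RESOLVED_TARGET)  # later tmpfiles-style fallback
        return os.readlink(link)


if __name__ == "__main__":
    print(demo(True), demo(False))
```

Whichever step runs first decides the target; the later step is a no-op.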

Note this bug is not actually about the situation on a freshly-installed new Fedora, it's about the situation in the installer environment. But this approach would indeed happen to fix that case too (as long as lorax didn't strip either the source or target of the symlink).

Still, that approach seems like systemd essentially requiring extra work of NM for a change systemd made that has no clear benefit. I'm still not understanding what creating the systemd symlink in tmpfiles gets anyone that networkd creating it when needed would not achieve.

Comment 13 Dan Williams 2015-09-30 18:59:04 UTC
Lennart said:
----------------------

resolved is activated on demand, not by default.

Again: we only put this in place if nothing else did before, and that
pretty late... It's really just the fallback if nothing else cared to
add it.

We want this behaviour so that stateless systems (where /etc is
unpopulated) get something into place for /etc/resolv.conf, if really
nothing else wants it.

The question you should ask is really: why didn't anything take
possession of the symlink before us, even though we do it so
late and only as a fallback?

Nah, we are not breaking stuff. We are just filling in gaps if nothing
else takes possession of this. Fedora just needs to make up its mind
and make sure that something takes possession of this, so that it's not
resolved which eventually does because nothing else did...

It's first come first serve. And systemd here is coming really really
really late to the game... Everybody can come first...

I mean, you are welcome to blame systemd if systemd would create the
link in %post or so, or remove pre-existing links. But we don't... we
create the link at the very latest instance, and we never change
anything that is already there...
-----------------

A couple of points in response to this:

1) why is it necessary to symlink resolv.conf at all if it's missing?

2) if it's necessary, why is it OK to symlink it to something that doesn't exist at all, and how is that behavior better/worse than if there is no /etc/resolv.conf?

3) Why is the link being created when resolved isn't installed or is disabled?  Why shouldn't the resolved service (if it's enabled by systemd) do that itself somehow?

4) In this case, nothing took possession of the symlink because there was *nothing* to write to resolv.conf; it would literally be empty, and there's no point in writing it.

NM only writes resolv.conf when there's something to put into it.  I suppose we can change that and if no symlink exists, and no /etc/resolv.conf exists, NM could make the symlink when it starts up.

Comment 14 Adam Williamson 2015-09-30 19:08:42 UTC
"NM only writes resolv.conf when there's something to put into it.  I suppose we can change that and if no symlink exists, and no /etc/resolv.conf exists, NM could make the symlink when it starts up."

That would not help this case.

Let's be clear about what exactly happens here. We are talking about the anaconda environment, which starts out as basically an image that has been built by installing packages into a chroot, and then had various files stripped from it by lorax.

'At rest', it contains no /etc/resolv.conf , and has systemd-tmpfiles-setup.service and NetworkManager.service both configured to start when it's booted.

When it actually gets booted, systemd-tmpfiles-setup.service runs *first*, before NetworkManager. It sees that there is no /etc/resolv.conf , and creates the symlink to /run/systemd/resolve/resolv.conf at that point. tmpfiles does not care if the target of the symlink exists or if resolved is enabled; it creates the symlink any time it's run during boot and there is no existing /etc/resolv.conf .

NetworkManager.service runs *later*, after the symlink has already been created. NM 1.0.x would overwrite the existing broken symlink, but NM 1.2.x does not, and so name resolution fails for anaconda and install doesn't work.
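The sequence above can be replayed as a small Python simulation. It is a rough model, not the real services: `tmpfiles_setup` and `nm_write_dns` are hypothetical stand-ins, the `overwrite_foreign_symlink` flag distinguishes the 1.0.x-like behaviour (True) from the 1.2.x-like behaviour (False), and 192.0.2.1 is a documentation-only address used as a placeholder nameserver.

```python
import os
import tempfile


def tmpfiles_setup(link):
    """Boot-time fallback: create the link if nothing exists at the path."""
    if not os.path.lexists(link):
        os.symlink("../run/systemd/resolve/resolv.conf", link)


def nm_write_dns(link, overwrite_foreign_symlink):
    """Write DNS config; return False if a symlink blocks the write."""
    if os.path.islink(link):
        if not overwrite_foreign_symlink:
            return False
        # Remove the link first: open() would otherwise follow it.
        os.unlink(link)
    with open(link, "w") as handle:
        handle.write("nameserver 192.0.2.1\n")
    return True


def boot(overwrite_foreign_symlink):
    """Return (nm_wrote, is_symlink, target_or_file_exists) after 'boot'."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "etc"))
        link = os.path.join(root, "etc", "resolv.conf")
        tmpfiles_setup(link)  # runs first
        wrote = nm_write_dns(link, overwrite_foreign_symlink)  # runs later
        return wrote, os.path.islink(link), os.path.exists(link)


if __name__ == "__main__":
    print("1.0.x-like:", boot(True))
    print("1.2.x-like:", boot(False))
```

In the second case the path is still a symlink with no target behind it, which matches the "No such file or directory" error in the journal.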

Comment 15 Adam Williamson 2015-10-05 18:21:59 UTC
Further thought I've had on this: could systemd split the tmpfiles config line out into a separate file? That opens up various other potential fixes:

* lorax could strip that file from the installer environment
* Fedora could split systemd-resolved into a subpackage and put the tmpfiles config file into that
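As a rough illustration of what splitting the line out would involve, the Python sketch below separates tmpfiles.d entries for one path from the rest of a config. The `SAMPLE` text is illustrative and is not claimed to match the file systemd ships; `split_tmpfiles_config` is a made-up helper. It relies only on the tmpfiles.d convention that each entry is a whitespace-separated line with the type first and the path second, and that comment lines start with '#'.

```python
SAMPLE = """\
# illustrative content, not the shipped file
L /etc/os-release - - - - ../usr/lib/os-release
L /etc/resolv.conf - - - - ../run/systemd/resolve/resolv.conf
"""


def split_tmpfiles_config(text, path):
    """Return (kept_lines, split_out_lines) for entries whose path field is path."""
    kept, split_out = [], []
    for line in text.splitlines():
        fields = line.split()
        is_entry = bool(fields) and not fields[0].startswith("#")
        if is_entry and len(fields) >= 2 and fields[1] == path:
            split_out.append(line)
        else:
            kept.append(line)
    return kept, split_out


if __name__ == "__main__":
    kept, split_out = split_tmpfiles_config(SAMPLE, "/etc/resolv.conf")
    print("main file:", kept)
    print("separate file:", split_out)
```

With the entry in its own file, an image builder or a subpackage could drop just that file without touching the other /etc entries.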

Comment 16 Adam Williamson 2015-10-05 19:04:07 UTC
*** Bug 1268974 has been marked as a duplicate of this bug. ***

Comment 17 Petr Schindler 2015-11-23 17:41:21 UTC
Discussed at 2015-11-23 blocker review meeting: [1]. 

The decision was delayed to next blocker bug review: this is/was a bad bug, but does not seem to be affecting current nightly images; adamw will investigate what changed and see if bug needs to remain open

[1] http://meetbot.fedoraproject.org/fedora-blocker-review/2015-11-23/f24-blocker-review.2015-11-23-17.00.html

Comment 18 Adam Williamson 2015-11-24 01:33:03 UTC
Oh, right, now I remember how this was fixed:

https://github.com/rhinstaller/lorax/commit/e3d8b01afa7c3019a1fb9c8140ed13594bc43c55

so in the installer environment we just completely nerf tmpfiles's /etc configuration and don't let it do anything in there. That doesn't seem to have broken anything noticeable yet.

So I think we can consider this fixed for now.

Comment 19 Red Hat Bugzilla 2023-09-14 03:05:31 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days