Bug 1156198

Summary: Problematic dependency chain (glibc->basesystem->fedora-release->fedora-release-workstation->NetworkManager-config-connectivity-fedora->NetworkManager) in image creation causes broken 21 Beta TC4 32-bit Workstation
Product: [Fedora] Fedora Reporter: Mike Ruckman <mruckman>
Component: fedora-release Assignee: Dennis Gilmore <dennis>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 21 CC: admiller, awilliam, dennis, jdisnard, johannbg, jsynacek, kdudka, kzak, lnykryn, mruckman, msekleta, ooprala, ovasik, pasik, pbrady, p, redhat.bugs, robatino, satellitgo, s, systemd-maint, twaugh, vpavlin, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: AcceptedBlocker
Fixed In Version: fedora-release-21-1 Doc Type: Bug Fix
Last Closed: 2014-10-31 02:43:35 UTC Type: Bug
Bug Blocks: 1043124    
Attachments:
Description                       Flags
console output during boot        none
systemd.log_level=debug output    none

Description Mike Ruckman 2014-10-23 19:15:40 UTC
Created attachment 950050
console output during boot

Description of problem:
i386 Live images fail to start up. The boot process never makes it past plymouth, and the logs show multiple services failing to start. After some time spent attempting to start services, systemd shuts the machine down.

Version-Release number of selected component (if applicable):
Fedora-Live-Workstation-i686-21_Beta-TC4.iso

How reproducible:
Always

Steps to Reproduce:
1. Launch Live Image
2. Wait.

Actual results:
System shuts down after timeout.

Expected results:
System should boot to the DE.

Additional info:
This has been confirmed multiple times. x86_64 doesn't seem to be affected.

Comment 1 Fedora Blocker Bugs Application 2014-10-23 19:58:40 UTC
Proposed as a Blocker for 21-beta by Fedora user roshi using the blocker tracking app because:

 This is a pretty clear violation of the Beta requirements: All release-blocking images must boot in their supported configurations.

Comment 2 Mike Ruckman 2014-10-23 20:03:59 UTC
Created attachment 950057
systemd.log_level=debug output

Attached more verbose logs.

Comment 3 Adam Williamson 2014-10-23 20:15:58 UTC
There are some errors during package installation when composing 32-bit WS live images that do not occur when composing 64-bit WS lives.

Compare TC4 logs:

32-bit: https://kojipkgs.fedoraproject.org//work/tasks/4390/7894390/root.log
64-bit: https://kojipkgs.fedoraproject.org//work/tasks/4392/7894392/root.log

some errors occur in both, but these occur in 32-bit only:

Installing: device-mapper                ################### [ 179/1386]/var/tmp/rpm-tmp.Q8ApXX: line 1: groupadd: command not found
DEBUG util.py:283:  /var/tmp/rpm-tmp.Q8ApXX: line 4: useradd: command not found

Installing: libssh2                      ################### [ 181/1386]/var/tmp/rpm-tmp.CRBM0R: line 1: groupadd: command not found
DEBUG util.py:283:  /var/tmp/rpm-tmp.CRBM0R: line 3: useradd: command not found
DEBUG util.py:283:  warning: user tss does not exist - using root
DEBUG util.py:283:  warning: group tss does not exist - using root
DEBUG util.py:283:  warning: user tss does not exist - using root
DEBUG util.py:283:  warning: group tss does not exist - using root

Installing: libpwquality                 ################### [ 191/1386]warning: group dbus does not exist - using root

Installing: dbus                         ################### [ 192/1386]warning: group polkitd does not exist - using root
DEBUG util.py:283:  warning: group polkitd does not exist - using root

Installing: polkit-pkla-compat           ################### [ 193/1386]/var/tmp/rpm-tmp.WxBAJp: line 1: groupadd: command not found
DEBUG util.py:283:  /var/tmp/rpm-tmp.WxBAJp: line 2: useradd: command not found
DEBUG util.py:283:  warning: user polkitd does not exist - using root
DEBUG util.py:283:  warning: user polkitd does not exist - using root
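
The comparison described in this comment (errors that show up in the 32-bit root.log but not in the 64-bit one) can be sketched in a few lines of Python. This is only an illustration, not the method actually used here: the function names are made up, the regular expression covers just the two message shapes quoted above, and the one-line 64-bit sample is invented because the 64-bit log is not quoted in this bug.

```python
import re

# Matches the two message shapes quoted in this comment:
#   "line N: <cmd>: command not found"
#   "warning: user|group <name> does not exist"
ERROR_RE = re.compile(
    r"(?:line \d+: (?P<cmd>\S+): command not found"
    r"|warning: (?P<kind>user|group) (?P<name>\S+) does not exist)"
)


def error_signatures(log_text):
    """Return the set of normalized error signatures found in a compose log."""
    found = set()
    for match in ERROR_RE.finditer(log_text):
        if match.group("cmd"):
            found.add(("command not found", match.group("cmd")))
        else:
            found.add(("missing " + match.group("kind"), match.group("name")))
    return found


def only_in_first(log_a, log_b):
    """Signatures present in log_a but absent from log_b, sorted."""
    return sorted(error_signatures(log_a) - error_signatures(log_b))


# Lines copied from the 32-bit excerpt above.
log_32 = """\
/var/tmp/rpm-tmp.Q8ApXX: line 1: groupadd: command not found
DEBUG util.py:283:  /var/tmp/rpm-tmp.Q8ApXX: line 4: useradd: command not found
DEBUG util.py:283:  warning: user tss does not exist - using root
DEBUG util.py:283:  warning: group tss does not exist - using root
"""

# Invented stand-in for the 64-bit log (NOT taken from the real log).
log_64 = "DEBUG util.py:283:  warning: group tss does not exist - using root\n"

for signature in only_in_first(log_32, log_64):
    print(signature)
```

Normalizing each message to a (kind, name) pair keeps the random rpm-tmp.XXXXXX script names from making every line look unique.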

Comment 4 Adam Williamson 2014-10-23 20:28:20 UTC
OK, yeah. This is looking likely.

I built a test 32-bit live with the systemd debug shell enabled. From that I can read the journal and see that dbus does indeed fail to start, complaining about users and groups:

dbus-daemon[802]: Unknown username "polkitd" in message bus configuration file
dbus-daemon[802]: Unknown username "polkitd" in message bus configuration file
dbus-daemon[802]: Failed to start message bus: Could not get UID and GID for username "dbus"

So I think it's becoming clear where to go with this, let me finish my toast and I'll trace it further.

Comment 5 Adam Williamson 2014-10-23 20:59:39 UTC
It seems we may be dealing with one or more circular dep loops here. Here's one I've spotted:

coreutils requires openssl-libs which requires crypto-policies which requires coreutils

Comment 6 Adam Williamson 2014-10-23 21:00:05 UTC
note - shadow-utils requires coreutils, so loops involving coreutils (and coreutils' deps in general) are interesting.

Comment 7 Adam Williamson 2014-10-23 21:09:17 UTC
Between f20 and f21, coreutils started building against openssl. That adds rather a lot of deps (more than just the loop I noted above) and may well be the source of the problem here. I'm running a scratch coreutils build which doesn't use openssl right now, and will test a live compose with it.

Comment 8 Adam Williamson 2014-10-23 21:15:25 UTC
the bits of coreutils which actually wind up linked against openssl are md5sum, sha1sum, sha224sum, sha256sum, sha384sum, sha512sum, and (for some reason) sort.

Comment 9 Adam Williamson 2014-10-23 21:43:02 UTC
OK, looks like I nailed it.

I did a build of coreutils which does not build against openssl. Then I built a 32-bit live image with it. With that coreutils, there are almost no errors during package install (just one from polkit which looks like an issue where a subpackage installs before the main package has created the polkitd user), and the image boots successfully.

Basically, coreutils' linking against openssl causes a problem:

* Lots of things have Requires(pre): shadow-utils to create users/groups, including dbus, which *other* things require and which needs to get installed quite early
* shadow-utils requires coreutils
* coreutils requires openssl-libs, which itself has quite a big dep chain, including some of the things from bullet #1

When yum hits this kind of no-win situation, where A needs B but B needs C which needs D which needs A, or whatever, it winds up getting resolved basically arbitrarily. As of right now it seems that for 32-bit Workstation live images it gets resolved in favour of dbus, so dbus installs before shadow-utils and can't create its user. (dbus redirects its user and group creation commands to /dev/null, which is why we don't see the errors.) But I don't think we could rely on it being that way forever, and we don't know how it gets resolved in the creation of each of the other live images, or of other images. Basically, as long as we have this mess, we could have serious borkage any time we're creating something that involves deploying a typical package set from scratch.
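
A minimal sketch of why no valid order exists, using only the edges named in this bug (dbus needs shadow-utils, shadow-utils needs coreutils, and the coreutils / openssl-libs / crypto-policies loop from comment 5). It uses Python's standard graphlib module (3.9+) and is not how yum or rpm actually order a transaction; the REQUIRES table and install_order() are illustrative names.

```python
from graphlib import CycleError, TopologicalSorter

# package -> set of packages it requires (edges taken from this bug only)
REQUIRES = {
    "dbus": {"shadow-utils"},
    "shadow-utils": {"coreutils"},
    "coreutils": {"openssl-libs"},
    "openssl-libs": {"crypto-policies"},
    "crypto-policies": {"coreutils"},
}


def install_order(requires):
    """Return (order, None) if a dependency-respecting order exists,
    or (None, cycle) if the requirements contain a loop."""
    try:
        return list(TopologicalSorter(requires).static_order()), None
    except CycleError as err:
        # args[1] is the list of nodes forming the detected cycle
        return None, err.args[1]


order, cycle = install_order(REQUIRES)
print("order:", order)
print("cycle:", cycle)

# Same graph with the coreutils -> openssl-libs edge dropped,
# which is what the no-openssl coreutils build amounts to.
without_openssl = dict(REQUIRES)
without_openssl["coreutils"] = set()
print("order without openssl:", install_order(without_openssl)[0])
```

With the loop present a strict sorter simply refuses; a real installer has to break the loop somewhere, and where it breaks it decides whether shadow-utils is in place before dbus's scriptlets run.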

I think dropping openssl support from coreutils is probably OK as a short-term solution. What that does is make coreutils use its built-in hashing code (for the *sum utilities) instead of openssl - it doesn't lead to a loss of functionality, just code duplication.

Still, in the long term it's good for things not to be re-inventing stuff that should be shared, so I can see why we would want the openssl support in coreutils. To have it, though, we need to break the dep problem somehow. I can't immediately see a super simple way to do that. For Beta I'd suggest we just go with the no-openssl coreutils.

I am +1 blocker on this, the 32-bit live should boot obviously, and this could certainly be causing the same problem in other images, or other problems we haven't noticed yet (even though the 64-bit image *boots*, it still has errors during early package install that really shouldn't happen).

Comment 10 Dennis Gilmore 2014-10-23 21:52:57 UTC
+1 blocker, Adam's research seems sane. This is something that could, and likely will, affect system installs using anaconda as well, resulting in non-booting installs as well as non-booting lives.

Comment 11 Kalev Lember 2014-10-23 22:02:30 UTC
I think adamw's plan to build coreutils without openssl support is solid. We just can't afford to have large dependency cycles in low level packages such as coreutils.

+1 blocker.

Comment 12 Mike Ruckman 2014-10-23 22:13:36 UTC
+1 blocker

Comment 13 Adam Williamson 2014-10-23 22:21:48 UTC
another loop:
coreutils -> openssl-libs -> krb5-libs -> coreutils

Comment 14 Adam Williamson 2014-10-23 22:22:11 UTC
Marking AcceptedBlocker.

Comment 15 Adam Williamson 2014-10-23 22:30:02 UTC
Aha. Thanks to the glory of rpmdep.pl, I found the real smoking gun.

openssl-libs requires libcom_err. libcom_err requires glibc. glibc requires basesystem, which requires setup, which requires fedora-release, which requires fedora-release-(product), which for Workstation is fulfilled by fedora-release-workstation, which requires NetworkManager-config-connectivity-fedora, which requires NetworkManager, which pulls in a whole bunch of stuff.
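
The chain above can be written down as a small graph and walked mechanically. The following is a hedged sketch, not output from rpmdep: the REQUIRES table is transcribed by hand from this comment (with the fedora-release-(product) requirement simplified to a direct edge to fedora-release-workstation), and find_chain() is a hypothetical helper doing a plain breadth-first search.

```python
from collections import deque

# package -> packages it requires, transcribed from the chain in this comment
REQUIRES = {
    "openssl-libs": ["libcom_err"],
    "libcom_err": ["glibc"],
    "glibc": ["basesystem"],
    "basesystem": ["setup"],
    "setup": ["fedora-release"],
    # really a requirement on fedora-release-(product); simplified here
    "fedora-release": ["fedora-release-workstation"],
    "fedora-release-workstation": ["NetworkManager-config-connectivity-fedora"],
    "NetworkManager-config-connectivity-fedora": ["NetworkManager"],
}


def find_chain(requires, start, target):
    """Return the shortest requirement chain from start to target, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        pkg = queue.popleft()
        if pkg == target:
            chain = []
            while pkg is not None:
                chain.append(pkg)
                pkg = parents[pkg]
            return chain[::-1]
        for dep in requires.get(pkg, []):
            if dep not in parents:
                parents[dep] = pkg
                queue.append(dep)
    return None


chain = find_chain(REQUIRES, "openssl-libs", "NetworkManager")
print(" -> ".join(chain))
```

Removing any single edge along that path makes find_chain() return None, which is a quick way to see which dependency would have to go to cut NetworkManager out of the early install set.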

That actually may explain why this occurs on the Workstation live specifically. There are other smaller loops like the ones I noticed above, but this is the one which has the really big consequences.

Comment 16 Adam Williamson 2014-10-23 22:32:13 UTC
try this:

rpmdep -dot openssl-libs.dot openssl-libs
dot -Tps openssl-libs.dot -o openssl-libs.ps
gimp openssl-libs.ps

Comment 17 Fedora Update System 2014-10-23 22:42:31 UTC
coreutils-8.22-20.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/coreutils-8.22-20.fc21

Comment 18 Adam Williamson 2014-10-23 23:07:26 UTC
Wow, there's a *real* object lesson in 'what could possibly go wrong' here.

Compare:

Beta TC3 Workstation 32-bit: https://kojipkgs.fedoraproject.org//work/tasks/8252/7818252/root.log
Beta TC4 Workstation 32-bit: https://kojipkgs.fedoraproject.org//work/tasks/4390/7894390/root.log

Notice how vastly different the order of package installation is. In TC3, a whole pile of libraries gets installed before we get anywhere near dbus, openssl-libs, or fedora-release-workstation, which is installed at [ 970/1383]. glibc is still early (it's [  92/1383]), but it doesn't cause yum to have to try and resolve a massive problem by dragging in that whole messy NetworkManager dep chain. NetworkManager-glib is [ 797/1383] in TC3. In TC4 it's [ 201/1386], right in the middle of all the error messages.
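
Pulling the install position of each package out of two root.log files makes this kind of comparison repeatable. A minimal sketch, assuming the "Installing: name ### [ N/TOTAL]" line shape seen in comment 3; the two sample lines are rebuilt from the numbers quoted in this comment rather than copied from the logs, and both function names are illustrative.

```python
import re

# "Installing: <name>   ######## [ 179/1386]"
INSTALL_RE = re.compile(r"Installing: (\S+)\s+#+ \[\s*(\d+)/(\d+)\]")


def install_positions(log_text):
    """Map package name -> (position, total) for every Installing line."""
    return {
        name: (int(pos), int(total))
        for name, pos, total in INSTALL_RE.findall(log_text)
    }


def position_shift(log_a, log_b):
    """Map package name -> (position in log_a, position in log_b)
    for packages that appear in both logs."""
    a = install_positions(log_a)
    b = install_positions(log_b)
    return {name: (a[name][0], b[name][0]) for name in a.keys() & b.keys()}


# Sample lines reconstructed from the numbers quoted in this comment.
tc3 = "Installing: NetworkManager-glib          ################### [ 797/1383]\n"
tc4 = "Installing: NetworkManager-glib          ################### [ 201/1386]\n"

for name, (before, after) in position_shift(tc3, tc4).items():
    print(f"{name}: {before} -> {after}")
```

Sorting the result by the size of the jump would surface the packages whose ordering changed most between two composes.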

In other words - the change in fedora-release 21-0.16 to have fedora-release require "system-release-product", which wasn't even *mentioned in the package changelog*, both caused the whole Generic mess - https://bugzilla.redhat.com/show_bug.cgi?id=1154235 - and caused this bug by massively affecting package ordering during installation.

I'm gonna file this one away to point at next time someone's telling me how their trivial change can't possibly break anything...;)

Comment 19 Fedora Update System 2014-10-24 00:05:17 UTC
NetworkManager-0.9.10.0-9.git20140704.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/NetworkManager-0.9.10.0-9.git20140704.fc21

Comment 20 Adam Williamson 2014-10-24 00:16:48 UTC
So my immediate fix for this is to drop NetworkManager-config-connectivity-fedora's dependency on NetworkManager. It doesn't need that dep, all it contains is a configuration file. I've tested that this produces a working image whose compose log looks a lot more like TC3's than TC4's.

However, I'm re-assigning the bug to fedora-release: as kalev mentioned, there was some kind of plan to have fedora-release-workstation depend on a bunch of stuff that 'defines' the Workstation product. That is not going to be viable so long as we have this dep chain from glibc up to fedora-release-workstation; it would have to be broken somewhere.

I suspect glibc's Requires(pre): basesystem is not actually necessary, but I certainly don't want to tinker with that right before a Beta release. It's been in there forever, I think back to like 2003 at least. It's sufficiently old that I can't manage to track its addition from the git commit history (back in 2007 it was changed from Prereq: basesystem libgcc to Requires(pre): basesystem libgcc, but I'm still trying to track down when the Prereq: form was introduced).

Comment 21 Adam Williamson 2014-10-24 00:26:00 UTC
correction to c#18 - the 'system-release-product' change was mentioned in the package changelog, but missed from the f21 branch git log, and from the update description.

Comment 22 Adam Williamson 2014-10-24 00:43:57 UTC
so I just did some spelunking. glibc's dependency on basesystem dates back to somewhere between Red Hat Linux 6.2 and Red Hat Linux 7.0. The changelog of RHL 7.0's glibc package, however, leaves something to be desired:

 %changelog
 * %{date} Jakub Jelinek <jakub>
 - build from CVS archive

so I'm not sure I can manage to find the actual explanation of why glibc does it. But I think it may be reasonable to consider the possibility that maybe it doesn't need to any more, after...erm...14 years.

Comment 23 Ondrej Vasik 2014-10-24 07:34:46 UTC
AFAIK, the basesystem package is there only to enforce the right installation order (basesystem, filesystem, setup, ... glibc ...), so that the filesystem layout and basic users/groups are on the system for other dependent packages. There is no other reason for that package. I'm not sure whether it is still required; we could try dropping it completely from Rawhide to see whether the dependency-order hack is still needed.

Comment 24 Pádraig Brady 2014-10-24 08:20:29 UTC
FYI, the openssl use by coreutils is for speed. Upstream vendors' architecture-specific development is concentrated in the openssl project, so we get 50% speedups etc. on common architectures.

Comment 25 Fedora Update System 2014-10-24 17:25:49 UTC
NetworkManager-0.9.10.0-10.git20140704.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/NetworkManager-0.9.10.0-10.git20140704.fc21

Comment 26 Adam Williamson 2014-10-24 20:48:53 UTC
The NM -9 build in Beta RC1 is confirmed to fix this - the 32-bit WS live boots, as do all other tested images.

Comment 27 Fedora Update System 2014-10-31 02:43:35 UTC
NetworkManager-0.9.10.0-10.git20140704.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 28 Fedora Update System 2014-11-18 04:28:06 UTC
fedora-release-21-1 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/fedora-release-21-1

Comment 29 Fedora Update System 2014-11-22 00:45:35 UTC
fedora-release-21-1, fedora-repos-21-1 have been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.

Comment 30 Richard 2015-03-11 13:46:48 UTC
Hi,

I just had the "group dbus does not exist - using root" error when creating a livecd on an x86_64 RHEL 6.x clone.

I found it was due to the host system using nscd to cache credentials.
Running "service nscd stop" before the build has helped :-)

Maybe that'll help someone else.

Cheers,

Rich