Bug 1837809

Summary: "FIPS module installed state definition is modified" changes cause systemctl segfaults during buildroot population
Product: [Fedora] Fedora
Reporter: Adam Williamson <awilliam>
Component: nosync
Assignee: Mikolaj Izdebski <mizdebsk>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 31
CC: crypto-team, fedoraproject, fweimer, java-sig-commits, jorton, kevin, mboddu, mizdebsk, pbrobinson, praiskup, rjones, tmraz, zbyszek
Hardware: All   
OS: Linux   
Last Closed: 2020-11-03 23:16:31 UTC
Type: Bug
Bug Blocks: 910269    
Attachments (Description: Flags):
core.udevadm: none
core.systemd-tmpfile: none
coredump-systemd-random: none
All coredumps: none

Description Adam Williamson 2020-05-20 03:31:19 UTC
As Richard Jones reported on devel@:

https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/J2R7RFV2BDLRGKJMGRAMH2VX6Z457DKJ/

recent Rawhide builds of packages where systemd is installed to the buildroot tend to show a series of crashes when systemctl or other systemd-related commands run during buildroot population, like this:

DEBUG util.py:602:  Install  392 Packages
DEBUG util.py:602:  Total download size: 383 M
DEBUG util.py:602:  Installed size: 1.3 G
DEBUG util.py:602:  Downloading Packages:
DEBUG util.py:602:  --------------------------------------------------------------------------------
DEBUG util.py:602:  Total                                            50 MB/s | 383 MB     00:07     
DEBUG util.py:602:  Running transaction check
DEBUG util.py:602:  Transaction check succeeded.
DEBUG util.py:602:  Running transaction test
DEBUG util.py:602:  Transaction test succeeded.
DEBUG util.py:602:  Running transaction
DEBUG util.py:602:  /usr/bin/systemctl: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory
DEBUG util.py:602:  /sbin/udevadm: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory
DEBUG util.py:602:  /sbin/udevadm: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory
DEBUG util.py:602:  /sbin/udevadm: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory
DEBUG util.py:602:  /var/tmp/rpm-tmp.pYK3f6: line 6: 2005996 Segmentation fault      (core dumped) /usr/bin/systemctl --no-reload preset dbus.socket
DEBUG util.py:602:  /var/tmp/rpm-tmp.pYK3f6: line 13: 2005999 Segmentation fault      (core dumped) /usr/bin/systemctl --no-reload preset --global dbus.socket
DEBUG util.py:602:  /var/tmp/rpm-tmp.heiSf7: line 6: 2006029 Segmentation fault      (core dumped) /usr/bin/systemctl --no-reload preset dbus-broker.service
DEBUG util.py:602:  /var/tmp/rpm-tmp.heiSf7: line 13: 2006032 Segmentation fault      (core dumped) /usr/bin/systemctl --no-reload preset --global dbus-broker.service
DEBUG util.py:602:  /var/tmp/rpm-tmp.NScHA9: line 9: 2006061 Segmentation fault      (core dumped) /usr/bin/systemctl --no-reload preset polkit.service
DEBUG util.py:602:  /var/tmp/rpm-tmp.05AcI8: line 6: 2006069 Segmentation fault      (core dumped) /usr/bin/systemctl --no-reload preset dm-event.socket

The 'libibverbs.so.1' errors turned out to be a different bug; this report is about the segfaults.

I looked into the segfaults and found they appeared to show up for the first time when openssl-1.1.1g-2.fc33 landed. So I did an openssl-1.1.1g-4.fc33 build with the change from -2.fc33 reverted:

https://src.fedoraproject.org/rpms/openssl/c/1bc9545b387216d41afda4a9080b39c1bbb8a207?branch=master

and it seems to have fixed the problem: I re-ran the nbdkit build and this time the root.log shows no segfaults (I also fixed the libibverbs issue elsewhere):

https://kojipkgs.fedoraproject.org//work/tasks/7666/44717666/root.log

so it seems the changes from this commit:

https://src.fedoraproject.org/rpms/openssl/c/89a24d69fca3f59d40038cc30e9bbf74cd38a6e1?branch=master

caused the problem. Perhaps they make basic systemd commands like systemctl need some other library which we can't rely on having been installed yet when we're populating a buildroot like this?

I'm filing the bug to alert openssl maintainers to the issue even though my build "fixed" it for now, as obviously the changes were made for a reason, and you probably want to try and fix the bug while retaining the changes rather than simply leave them reverted...

Comment 1 Tomas Mraz 2020-05-20 08:40:13 UTC
The problem is that the change should not alter anything in the buildroot. I do not see how it could ever cause a segfault like this. Yes, openssl will try to open and read /proc/sys/crypto/fips_enabled, but if the file is not present or does not contain 1, it should just behave as it always did.
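
For illustration only, the check described here can be sketched as below. This is not OpenSSL's actual code: the helper name `fips_flag_is_set` and the choice to pass the path as a parameter are assumptions made for the sketch. It shows only the stated behaviour, where a missing file or a value other than 1 leaves everything as before.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not OpenSSL's actual code.
 * Returns 1 only if `path` can be opened and its first byte is '1'.
 * Every failure is treated as "FIPS mode off", i.e. the old behaviour. */
int fips_flag_is_set(const char *path)
{
    char c = '0';
    int fd;
    ssize_t n;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;              /* flag file missing or unreadable */
    n = read(fd, &c, 1);
    close(fd);
    return n == 1 && c == '1';
}
```

A caller would pass "/proc/sys/crypto/fips_enabled"; in a buildroot where that file is absent the helper returns 0.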

Comment 2 Tomas Mraz 2020-05-20 08:45:04 UTC
Florian, could calling any of secure_getenv(), open(), or read() from a shared library constructor be problematic?

Comment 3 Florian Weimer 2020-05-20 09:20:50 UTC
No, everything in glibc should be initialized by this point, especially since libcrypto links against libc. I looked at the problematic OpenSSL version, and nothing in the FIPS initialization sequence stands out, either.

I would have to look at a coredump to debug this further, sorry.

Comment 4 Tomas Mraz 2020-05-20 09:35:24 UTC
I'm now trying to reproduce the segfault, and I'll try to make it produce a usable coredump.

Comment 5 Tomas Mraz 2020-05-20 09:54:09 UTC
Hmm, it is not reproducible in mock :(

What to do now?

Comment 6 Florian Weimer 2020-05-20 09:57:41 UTC
(In reply to Tomas Mraz from comment #5)
> Hmm, it is not reproducible in mock :(
> 
> What to do now?

Perhaps file a releng ticket and hope that they can scrape a coredump off the builder? https://pagure.io/releng/issues

Comment 7 Adam Williamson 2020-05-20 15:45:54 UTC
I agree it's strange - I'd seen this change before I did my Koji research and it wasn't very high on my list of suspects as it seemed pretty innocuous. And I can't reproduce it in mock either :( But it really *does* seem to be the culprit - it's not just nbdkit, I've checked some other builds that have happened since I reverted the change and the bug seems to have gone from those too.

I guess there must be some wrinkle involving the builder environment here somehow, and yeah, we may need to try and get the coredump out from the builders. CCing nirik.

Comment 8 Mohan Boddu 2020-05-20 16:41:20 UTC
Created attachment 1690305 [details]
core.udevadm

Comment 9 Mohan Boddu 2020-05-20 16:42:08 UTC
Created attachment 1690306 [details]
core.systemd-tmpfile

Comment 10 Mohan Boddu 2020-05-20 16:42:54 UTC
Created attachment 1690318 [details]
coredump-systemd-random

Comment 11 Mohan Boddu 2020-05-20 16:44:21 UTC
Created attachment 1690319 [details]
All coredumps

Comment 12 Mohan Boddu 2020-05-20 16:45:29 UTC
@adamwill gave me a recent Koji task (https://koji.fedoraproject.org/koji/buildinfo?buildID=1508895) and I grabbed the coredumps for the x86_64 task (https://koji.fedoraproject.org/koji/taskinfo?taskID=44714165).

I have attached the coredumps for that x86_64 task.

Comment 13 Florian Weimer 2020-05-20 17:11:51 UTC
I'm guessing this is probably related:

(gdb) print __environ[3]
$10 = 0x7ffd61cdced5 "LD_PRELOAD=/var/tmp/tmp.mock.fe1m8f16/$LIB/nosync.so"

Where can we get a copy of that file?

Comment 14 Florian Weimer 2020-05-20 17:16:00 UTC
I assume it's from the nosync package because that implementation is clearly buggy: it assumes that its ELF constructor has run if open is called, which is not a valid assumption for an interposing function: The interposition relationship is not taken into account for ELF constructor ordering.
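
The ordering hazard can be modelled in a single program. The sketch below is a model under stated assumptions, not nosync's code: all names are hypothetical, and GCC/Clang constructor priorities are used to force the order in which a preloaded wrapper's constructor runs after another library's constructor. The wrapper records the too-early call instead of crashing.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*real_fn)(const char *path);

/* Stand-in for the real libc function the wrapper forwards to. */
static int real_function(const char *path)
{
    (void)path;
    return 0;
}

static real_fn real_ptr;          /* NULL until wrapper_init() has run */
static int early_call_hit_null;   /* set when a call arrives too early */

/* Wrapper in the style described above. A buggy version would call
 * real_ptr unconditionally; this model records the condition instead of
 * jumping through a null pointer. */
int wrapper(const char *path)
{
    if (real_ptr == NULL) {
        early_call_hit_null = 1;
        return -1;
    }
    return real_ptr(path);
}

/* Models another library's constructor. Priority 101 runs before 102. */
__attribute__((constructor(101)))
static void other_library_init(void)
{
    wrapper("/proc/sys/crypto/fips_enabled");
}

/* Models the preload library's own constructor, which runs later. */
__attribute__((constructor(102)))
static void wrapper_init(void)
{
    real_ptr = real_function;
}

/* Accessor so callers can check what happened before main(). */
int early_call_was_unready(void)
{
    return early_call_hit_null;
}
```

In this model the first call reaches the wrapper while its pointer is still NULL; calls made after both constructors have run are forwarded normally.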

Comment 15 Tomas Mraz 2020-05-20 17:24:27 UTC
OK, so it is the open() call in the constructor that triggers this. I suppose we then need to fix nosync, because there is basically no way around the requirement to call open() on /proc/sys/crypto/fips_enabled in the constructor.
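
One common way to make an interposing wrapper safe against early calls is to resolve the real function lazily on first use rather than in a constructor. The following is a minimal sketch of that general technique and not necessarily what the eventual nosync fix does. The name `wrapped_open` is hypothetical (a real preload library would export `open`), and clearing O_SYNC/O_DSYNC is an assumption about nosync-like behaviour.

```c
#define _GNU_SOURCE             /* for RTLD_NEXT */
#include <assert.h>
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stddef.h>
#include <sys/types.h>

typedef int (*open_fn)(const char *, int, ...);

static open_fn real_open;       /* NULL until first use */

/* Look up the next open() in search order the first time it is needed.
 * The unsynchronised store is tolerated in this sketch because every
 * thread would store the same pointer; production code would use
 * pthread_once() or an atomic. */
static open_fn get_real_open(void)
{
    if (real_open == NULL)
        real_open = (open_fn)dlsym(RTLD_NEXT, "open");
    assert(real_open != NULL);
    return real_open;
}

/* Hypothetical wrapper. In a real preload library it would be named
 * open(); it is renamed so the sketch builds as an ordinary program. */
int wrapped_open(const char *path, int flags, ...)
{
    mode_t mode = 0;

    if (flags & O_CREAT) {      /* the mode argument is only present then */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    flags &= ~(O_SYNC | O_DSYNC);   /* assumed nosync-like behaviour */
    return get_real_open()(path, flags, mode);
}
```

Because the lookup happens inside the wrapper itself, it no longer matters whether any constructor of the wrapper's library has run yet. On glibc older than 2.34 the program may need to be linked with -ldl.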

Comment 16 Florian Weimer 2020-05-20 17:39:11 UTC
(In reply to Tomas Mraz from comment #15)
> OK, so it is the open() call in the constructor that triggers this. I suppose
> we then need to fix nosync, because there is basically no way around the
> requirement to call open() on /proc/sys/crypto/fips_enabled in the
> constructor.

Yes, it needs to be fixed in the nosync package: https://github.com/kjn/nosync/pull/4

Comment 17 Adam Williamson 2020-05-20 17:56:43 UTC
Thanks guys! I can do a nosync build with your PR backported and then we can test restoring the openssl change, if you like?

Comment 18 Florian Weimer 2020-05-20 17:58:38 UTC
(In reply to Adam Williamson from comment #17)
> Thanks guys! I can do a nosync build with your PR backported and then we can
> test restoring the openssl change, if you like?

Sure, I'd also appreciate a review of the patch itself (although it seems to work as expected in cursory tests).

Comment 19 Tomas Mraz 2020-05-20 18:02:49 UTC
Florian, thank you very much for the investigation and nosync patch.

Comment 20 Adam Williamson 2020-05-20 18:03:30 UTC
Florian: I don't think I'm qualified to review the patch :) Tomas would be a better choice I guess.

Comment 21 Tomas Mraz 2020-05-20 18:07:36 UTC
The patch looks good, I've provided a review on the github PR.

Comment 22 Adam Williamson 2020-05-20 18:12:16 UTC
Hum, looking at this a bit harder I think the nosync that gets used is actually from the mock *host* environment, i.e. whatever the builders are running in this case. So I think we'd need to send a nosync update for whatever release the builders are running and get it pushed stable (or at least installed on the builders)...

Comment 23 Adam Williamson 2020-05-20 18:14:51 UTC
BTW, this is probably why we couldn't reproduce in mock - the nosync stuff is just skipped over if you don't have nosync installed on the host. It may well reproduce if you install nosync on the host (and make sure the build uses an affected openssl somehow).

Comment 24 Fedora Update System 2020-05-20 18:31:35 UTC
FEDORA-2020-eb7b7b9aa8 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2020-eb7b7b9aa8

Comment 25 Fedora Update System 2020-05-20 18:31:36 UTC
FEDORA-2020-329ce47baf has been submitted as an update to Fedora 32. https://bodhi.fedoraproject.org/updates/FEDORA-2020-329ce47baf

Comment 26 Adam Williamson 2020-05-20 18:33:05 UTC
OK, so as you can see I've rebuilt nosync for 31, 32 (and Rawhide). Kevin, Mohan, can we update the builders to the new build now, or should we wait for it to go stable?

Comment 27 Kevin Fenzi 2020-05-20 21:34:02 UTC
We can do it before then. 

Mohan: can you do this? I'd say just tag the f31 build into f31-infra-candidate, let it get signed and land in f31-infra-stg and then move to 'f31-infra' and 'dnf --refresh -y update nosync' on builders.

Comment 28 Richard W.M. Jones 2020-05-20 22:13:15 UTC
(In reply to Florian Weimer from comment #13)
> I'm guessing this is probably related:
> 
> (gdb) print __environ[3]
> $10 = 0x7ffd61cdced5 "LD_PRELOAD=/var/tmp/tmp.mock.fe1m8f16/$LIB/nosync.so"

A loop-mounted nbdkit could offer a better solution.  Purely
by coincidence (not knowing about nosync or its use in Fedora Koji)
I wrote a special nbdkit plugin to handle Koji Fedora/RISC-V builds,
which has the same flush-dropping behaviour:

https://rwmj.wordpress.com/2020/03/21/new-nbdkit-remote-tmpfs-tmpdisk-plugin/
http://libguestfs.org/nbdkit-tmpdisk-plugin.1.html
https://github.com/libguestfs/nbdkit/blob/0632acc76bfeb7d70d3eefa42fc842ce6b7be4f8/plugins/tmpdisk/tmpdisk.c#L182

Comment 29 Fedora Update System 2020-05-21 04:16:30 UTC
FEDORA-2020-eb7b7b9aa8 has been pushed to the Fedora 31 testing repository.
In a short time you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-eb7b7b9aa8`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-eb7b7b9aa8

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 30 Fedora Update System 2020-05-21 05:23:45 UTC
FEDORA-2020-329ce47baf has been pushed to the Fedora 32 testing repository.
In a short time you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-329ce47baf`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-329ce47baf

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 31 Pavel Raiskup 2020-05-21 06:11:09 UTC
Could we have an update for EPEL7+ please?  (mock is still supported on el7+)

Comment 32 Tomas Mraz 2020-05-21 08:42:29 UTC
The nosync package from EPEL-7 (nosync-1.0-2.el7) should not have this bug, as it was introduced in the 1.1 upstream version. What is the nosync package version which you use?

Comment 33 Pavel Raiskup 2020-05-21 08:57:31 UTC
> it was introduced in the 1.1 upstream version.

Good to know, thanks.  I didn't check this.

> What is the nosync package version which you use?

The default one.  I now see there's no nosync package for el8, yet.
So scratch my previous request.

Comment 34 Mohan Boddu 2020-05-21 19:43:57 UTC
(In reply to Kevin Fenzi from comment #27)
> We can do it before then. 
> 
> Mohan: can you do this? I'd say just tag the f31 build into
> f31-infra-candidate, let it get signed and land in f31-infra-stg and then
> move to 'f31-infra' and 'dnf --refresh -y update nosync' on builders.

This has been a roller coaster ride because nosync is multilib, which is not supported in the infra repos. Anyway, all the builders are now updated to nosync-1.1-8.fc31.

Comment 35 Adam Williamson 2020-05-21 19:45:29 UTC
Awesome, thanks! I'll unrevert the openssl change then try an nbdkit build again and see if it works.

Comment 36 Adam Williamson 2020-05-21 21:04:53 UTC
Fix looks good. I did an openssl -5 build with the change reapplied and fired an nbdkit scratch build against it; the root.log from the x86_64 build doesn't show the bug:

https://kojipkgs.fedoraproject.org//work/tasks/6821/44786821/root.log

I guess we can close this when the updates go stable.

Comment 37 Tomas Mraz 2020-05-22 07:45:36 UTC
Thank you, Adam, for all the initial detective work and the verification of the fix.

Comment 38 Fedora Update System 2020-05-29 02:26:42 UTC
FEDORA-2020-eb7b7b9aa8 has been pushed to the Fedora 31 stable repository.
If the problem still persists, please make a note of it in this bug report.

Comment 39 Fedora Update System 2020-05-29 04:09:17 UTC
FEDORA-2020-329ce47baf has been pushed to the Fedora 32 stable repository.
If the problem still persists, please make a note of it in this bug report.

Comment 40 Ben Cotton 2020-11-03 16:59:08 UTC
This message is a reminder that Fedora 31 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '31'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 31 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.