1490632 – Service tries to start but fails in qemu VM

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	rng-tools
Sub Component:
Version:	29
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Neil Horman
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	AcceptedFreezeException
Depends On:
Blocks:	F27FinalFreezeException

Reported:	2017-09-11 22:16 UTC by Adam Williamson
Modified:	2023-09-14 04:07 UTC
CC List:	22 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-11-04 19:18:03 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
proposed change to rngd to allow conditional starting based on entropy availability (3.58 KB, patch) 2017-10-27 19:28 UTC, Neil Horman

Description Adam Williamson 2017-09-11 22:16:13 UTC

In current openQA tests for F27 and Rawhide, the test that checks whether any services tried to start up but failed consistently fails due to rngd.service failing to start, e.g.:

https://openqa.stg.fedoraproject.org/tests/158342#step/base_services_start/13

Looking at the logs, I see:

Sep 11 08:28:43 localhost.localdomain rngd[669]: Failed to init entropy source 0: Hardware RNG Device
Sep 11 08:28:43 localhost.localdomain rngd[669]: Failed to init entropy source 1: TPM RNG Device
Sep 11 08:28:43 localhost.localdomain rngd[669]: Failed to init entropy source 2: Intel RDRAND Instruction RNG
Sep 11 08:28:43 localhost.localdomain rngd[669]: can't open any entropy source
Sep 11 08:28:43 localhost.localdomain rngd[669]: Maybe RNG device modules are not loaded
Sep 11 08:28:43 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rngd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Sep 11 08:28:43 localhost.localdomain systemd[1]: rngd.service: Main process exited, code=exited, status=1/FAILURE
Sep 11 08:28:43 localhost.localdomain systemd[1]: rngd.service: Unit entered failed state.
Sep 11 08:28:43 localhost.localdomain audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rngd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Sep 11 08:28:43 localhost.localdomain systemd[1]: rngd.service: Failed with result 'exit-code'.

openQA runs tests via qemu, directly, not with libvirt. A typical qemu command to run a test is:

/usr/bin/qemu-kvm -serial file:serial0 -soundhw ac97 -vga std -global isa-fdc.driveA= -m 2048 -cpu Nehalem -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1,serial=1 -drive file=raid/l1,cache=unsafe,if=none,id=hd1,format=qcow2 -drive media=cdrom,if=none,id=cd0,format=raw,file=/var/lib/openqa/share/factory/iso/Fedora-KDE-Live-x86_64-Rawhide-20170911.n.0.iso -device scsi-cd,drive=cd0,bus=scsi0.0 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 2 -enable-kvm -no-shutdown -vnc :95,share=force-shared -qmp unix:qmp_socket,server,nowait -monitor unix:hmp_socket,server,nowait -S -monitor telnet:127.0.0.1:20052,server,nowait

Proposing as a Beta freeze exception. We had a similar case in the past and rejected it as a blocker:

https://bugzilla.redhat.com/show_bug.cgi?id=892178

on the grounds that it was hardware-dependent, so I won't propose this one as a blocker.

Comment 1 Adam Williamson 2017-09-11 22:17:47 UTC

This changed somewhere between Fedora-Rawhide-20170731.n.0 (where the test passed) and Fedora-Rawhide-20170822.n.0 (where it first failed due to this rngd bug). I don't have a more precise delta than that right now.

Comment 2 Kamil Páral 2017-09-18 17:47:29 UTC

Discussed at blocker review meeting [1]:

It's a good thing not to have failing services on VMs.

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2017-09-18

Comment 3 Adam Williamson 2017-10-25 19:25:11 UTC

Transferring to Final, wasn't fixed for Beta.

Comment 4 Adam Williamson 2017-10-25 19:27:45 UTC

Still happens with 6.1 in current Rawhide, btw.

Comment 5 Neil Horman 2017-10-25 19:48:30 UTC

This is likely part of commit 5e6f8f90eda9ff312653e42a183d8933419de8c0. For quite some time, rng-tools had a bug in which the service would start even if there was no entropy source, quietly not feeding entropy into the kernel entropy pool. I fixed that with the above commit (which admittedly does a lot of other things too, since I only recently took over rng-tools upstream after it had fallen into disrepair).

Regardless, it would be my contention (without knowing more), that this is actually working as designed. To make a determination on that, I would ask:

1) Is there a hwrng assigned via SR-IOV or other passthrough to this guest, and is its corresponding driver module loaded? You can tell by looking at /sys/devices/virtual/misc/hw_random/rng_available and seeing if there is any value in that file

2) Does this cpu support the RDRAND instruction in a qemu guest?

3) Does this guest have a tpm device available to it?

If the answer to all three questions is no, then this is working as designed, because rngd cannot find an entropy source to draw random data from, and so it exits.
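The three questions above can be sketched as a small check script. This is only an illustration and not rngd's actual detection logic: the `rng_available` path and `/dev/tpm0` come from this report, while reading the `rdrand` flag from `/proc/cpuinfo` and the function name `find_entropy_sources` are assumptions made for the example.

```python
from pathlib import Path


def find_entropy_sources(root="/"):
    """Return the names of entropy sources that appear to be present.

    `root` lets the checks run against a fake filesystem tree for testing.
    """
    base = Path(root)
    found = []

    # 1) Kernel hwrng framework: a non-empty rng_available lists usable devices.
    rng_available = base / "sys/devices/virtual/misc/hw_random/rng_available"
    if rng_available.is_file() and rng_available.read_text().strip():
        found.append("hwrng")

    # 2) RDRAND (assumption): look for the "rdrand" flag in /proc/cpuinfo.
    cpuinfo = base / "proc/cpuinfo"
    if cpuinfo.is_file():
        for line in cpuinfo.read_text().splitlines():
            if line.startswith("flags"):
                flags = line.split(":", 1)[-1].split()
                if "rdrand" in flags:
                    found.append("rdrand")
                break

    # 3) TPM: presence of the /dev/tpm0 device node.
    if (base / "dev/tpm0").exists():
        found.append("tpm")

    return found


if __name__ == "__main__":
    sources = find_entropy_sources()
    print("entropy sources found:", ", ".join(sources) or "none")
```

If the list comes back empty on a guest, the answer to all three questions is no and rngd exiting is the expected outcome.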

That said, I'm open to finding a way to make this work on virt guests. Commonly, the -r option is used with /dev/urandom, and encoded in the service file to provide a pseudo random device when no true rng is available, but I don't want to do that by default, as it masks true rngs when they are available, and doesn't provide a truly random source of entropy. If you have suggestions as to another solution, I'm happy to listen.

Comment 6 Adam Williamson 2017-10-25 20:02:39 UTC

What I want is for *the service not to fail*. It's fine if rngd doesn't actually run. But this should be done in such a way that the service does not show up as failed. I'm not sure off the top of my head what would be the best way to implement that, it depends a bit on the details of how the service is set up and how the actual executable behaves.

Comment 7 Adam Williamson 2017-10-25 20:33:45 UTC

So, here's some detail on what we did about this last time around.

In rng-tools-5-4, Zbigniew made this change:

http://pkgs.fedoraproject.org/rpms/rng-tools/c/95fb228e859df8162028819da0b6d31e9e1a708a?branch=master

The intent there is to make rngd exit with code 66 when no hardware is available, and tell the rngd systemd service that exit code 66 should be treated as a 'successful' exit.

In most cases at least, this seems to have more or less solved the problem - it results in the service showing as 'inactive' rather than 'failed'.
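To make the mechanism concrete, here is a deliberately simplified model of how a service manager could classify the daemon's exit. It is not systemd's real logic; the function name `service_result` and the two state strings are assumptions for illustration, and only the exit code 66 comes from the change described above.

```python
def service_result(exit_code, success_exit_status=frozenset()):
    """Simplified model of classifying a main-process exit.

    Exit code 0 and any code listed in `success_exit_status` count as a
    clean exit (the unit ends up "inactive"); anything else is "failed".
    """
    if exit_code == 0 or exit_code in success_exit_status:
        return "inactive"
    return "failed"


NO_HARDWARE = 66  # exit code used by the patch when no entropy source exists

if __name__ == "__main__":
    # Without the special case, a "no hardware" exit looks like any failure.
    print(service_result(NO_HARDWARE))
    # With 66 whitelisted, the same exit is treated as a clean stop.
    print(service_result(NO_HARDWARE, frozenset({NO_HARDWARE})))
```

The point of the design is that the daemon still exits non-zero and still logs why; only the unit's final state changes.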

However, one user complained that rngd still logged several error messages before it eventually exited:

https://bugzilla.redhat.com/show_bug.cgi?id=892178#c41

and we (QA) reported an apparent case where the service still failed:

https://bugzilla.redhat.com/show_bug.cgi?id=892178#c39

Zbigniew then made a *second* change, this one:

http://pkgs.fedoraproject.org/rpms/rng-tools/c/8d6b73f8ff5c33695c5462f3019ea1c4684e3107?branch=master

He never commented on the bug again, so we have to infer exactly what this was meant to fix, but from the commit log and the package changelog, I *believe* the intent of this further change was to make rngd quit faster when no hardware was present, to avoid the extra log messages. That is, the intent of this second change was just to try and address the 'excessive logging' problem. I don't think it was intended to address anything else, and I think it still intended that rngd should exit with code 66 in this case.

Now when Neil updated the package to the new upstream release 6, he dropped both the patches from the spec file:

http://pkgs.fedoraproject.org/rpms/rng-tools/c/e46e2a500f787117b438528d0662a75d89fd244c?branch=master

so that's why this started happening again.

I suspect that just restoring (and rediffing if necessary) Zbigniew's *first* patch - which changes the exit code - should solve the main part of the problem again (it'll stop the service showing up as failed). Neil says he doesn't like Zbigniew's second change, but if all that was intended to do was avoid the 'unnecessary' log messages, I think we can live without it.

Comment 8 Neil Horman 2017-10-25 20:34:38 UTC

I actually disagree that it's *fine* if rngd doesn't run.  If it's enabled by default it should do the job it's written to do, not just fail quietly and pretend like everything is ok.  I agree that we can probably re-implement a middle ground in which we log a message and don't report failure on the console, but as I think about it, I'm still not overly comfortable with that.  The best solution may be to simply not enable it by default on systems, if the majority of users are going to see it fail.  Let people opt into it instead.  I'm not sure yet.

Comment 9 Neil Horman 2017-10-25 20:36:16 UTC

I'm definitely not just rediffing the original patch, as the entire code base has been largely rewritten.

Comment 10 Adam Williamson 2017-10-25 20:46:06 UTC

"I actually disagree that it's *fine* if rngd doesn't run. If it's enabled by default it should do the job it's written to do, not just fail quietly and pretend like everything is ok."

From the point of view of the distribution, I believe what we really mean is simply this: "On systems with the necessary hardware, we want to have rngd running and doing its job". That's the 'meaning' behind enabling the service by default. It doesn't "mean" that we expect rngd to always do the job it was written to do in all scenarios, that's just not what we really intend at all.

I don't quite get what you mean by "I'm definitely not just rediffing the original patch, as the entire code base has been largely rewritten.", as the *first* patch is very trivial and the code it touches actually hasn't changed much at all. I just rediffed it. It took 30 seconds. I have a scratch build now: https://koji.fedoraproject.org/koji/taskinfo?taskID=22697756

"The best solution may be to simply not enable it by default on systems, if the majority of users are going to see it fail."

I really don't think that makes any sense at all. I've no idea if a 'majority' of users will see the service fail or not, that would require knowing what hardware (and VMs etc.) Fedora users use, and we really don't know that with anything like the required level of detail.

Fundamentally I just can't see anything wrong at all with the proposed behaviour here: have rngd exit with a special exit code if no hardware is available, and have the service treat this as a 'successful' exit. I just can't see any practical downsides to that at all. The reported status via systemctl is going to be perfectly correct: it will report that the service is 'inactive (dead)', and the log messages will make it clear that it's 'inactive' because the necessary hardware was not available. Nothing about that seems 'wrong' or 'misleading' or in any other way problematic to me. Zbigniew provided an example of just how it looks, in the other bug:

$ systemctl status rngd
● rngd.service - Hardware RNG Entropy Gatherer Daemon
Loaded: loaded (/usr/lib/systemd/system/rngd.service; enabled; vendor preset: enabled)
Active: inactive (dead) since pon 2014-12-15 08:29:55 EST; 5min ago
Process: 663 ExecStart=/sbin/rngd -f (code=exited, status=66)
Main PID: 663 (code=exited, status=66)

gru 15 08:29:55 fedora21 systemd[1]: Started Hardware RNG Entropy Gatherer Daemon.
gru 15 08:29:55 fedora21 rngd[663]: Unable to open file: /dev/tpm0
gru 15 08:29:55 fedora21 rngd[663]: can't open any entropy source
gru 15 08:29:55 fedora21 rngd[663]: Maybe RNG device modules are not loaded

that just seems perfectly fine to me.

Comment 11 Neil Horman 2017-10-26 14:32:06 UTC

Just because it was written the way it was previously doesn't mean I'm going to accept it that way moving forward.  I'm happy that you managed to shoehorn the patch back in place, but that changes nothing.  The more I think about it, the more I become convinced that allowing systemd to assume that an exit due to the lack of an entropy source is a success is a bad idea.  I say that because it masks a legitimate problem in the case where a functional rngd is needed.  To put it into a table:

Entropy source avail  |   RNGD needed  |    Result
----------------------|----------------|---------------------------------------
Yes                   |   Yes          | RNGD must be started, but users
                      |                | who need it should know to manually
                      |                | enable it via systemd
----------------------|----------------|---------------------------------------
Yes                   |   No           | There is no need to start RNGD here
----------------------|----------------|---------------------------------------
No                    |   Yes          | Starting RNGD won't help the situation
                      |                | here, as it's guaranteed to fail
----------------------|----------------|---------------------------------------
No                    |   No           | Starting RNGD is worthless as it will
                      |                | fail, but that's irrelevant as it's not
                      |                | needed anyway, so why bother

Rows 2 and 4 are effectively don't-care states: rngd would be nice to run to add entropy to the kernel pool, but it's not necessary to the operation of the system.  Rows 1 and 3 are significant, however.  By defaulting rngd to be enabled using the method we discussed, we make life very convenient for row 1, as everything "just works", but we've made life very difficult for row 3, because the service will seem like it started but will actually remain inactive, and it will be up to those users to figure out what we've done, which will require figuring out why systemd is behaving as it is and what rngd is actually telling it.  On the other hand, if we allow the service to fail, and don't enable it by default, then a lack of entropy source is immediately recognized for what it is in row 3, allowing for clear understanding of the problem (and fixing thereof).  Row 1's life is made a bit more difficult, but that can easily be rectified in a kickstart file, or by manually enabling the service after install.  That seems like the better trade-off to me, and so the more I think about it, the more I think the right solution is to not enable rngd by default and let users turn it on themselves in cases where they need it, and see the failures as they happen.

I appreciate that you don't see anything wrong with Zbigniew's solution here, but the more I think about it, the more I disagree with it.

Comment 12 Neil Horman 2017-10-26 15:24:01 UTC

https://pagure.io/fedora-release/pull-request/118

Created a pull request to disable rngd by default

Comment 13 Adam Williamson 2017-10-26 15:44:07 UTC

Who exactly is in row 3? As in, real-world usage?

Comment 14 Adam Williamson 2017-10-26 15:46:25 UTC

Also, I think you're failing to understand that many people don't know that they "need" rngd, or in fact don't "need" it, but would benefit from it despite not having any idea what it is or even that it exists at all. We should be taking advantage of the best entropy source available to the system, automatically, without requiring that users know about the existence of hardware entropy providers and the existence of a service called 'rngd' that has to be turned on in order to actually take advantage of one.

I'm fairly sure that if your pull request is merged, the percentage of people who have HW entropy providers who are actually taking advantage of them with Fedora will drop from 100% to, say, 1-5%.

Comment 15 Simo Sorce 2017-10-26 16:05:18 UTC

It seems to me the problem here is understanding and executing on intent.

Problem:
We clearly have 2 intents here, so the system should differentiate based on that: 1) automatically getting rngd entropy if the system can do it, 2) being able to *require* that rngd is on for cases where it is really critical to have it.

Possible solution:
Have 2 different service file that clearly convey this distinction.

One, which I'll name opportunistic-rngd(1), is a service enabled by default which will try to make rngd work, but will quietly exit if that is not possible.

The other, which I'll name required-rngd(2), is a service that trumps the opportunistic one and needs to be explicitly enabled. If this service fails it doesn't do so quietly, and perhaps even has the option to cause the system to shut down, go to emergency mode, or whatever, but it is not quiet.

Outcome:
Having good entropy by default *if available* is definitely worthwhile [so we want (1)]; we only need to solve the problem for those people who actually *require* it (2) and would rather have some other dependent service also fail if that is not the case.

HTH

note: (1) and (2) names are arbitrary and even the implementation via service files is just an idea, what matters is the outcome.

Comment 16 Neil Horman 2017-10-27 13:53:09 UTC

Adam, regarding row 3, honestly, I don't know, but I have to assume they exist, just like you asserted that we have to cover all hardware combinations since we have no idea what the majority use case is in comment 10.  If you're looking for a real world use case for row 3, I'd cite perhaps providers like Amazon, who have racks of cheap hardware that don't have entropy sources, but may be hosting users who are generating certificates on the VM to host web sites with.  I don't know, and that's the point: we have to consider all of these cases in light of no data.  But I'm simply not going to hide a failure case in support of making it look good.

That said, Simo, I like the idea.  I'd propose the following:

1) Create a check-entropy service, run as Type oneshot, gated on ConditionFirstBoot which runs rngd only to list the available entropy sources.  If at least one is found, it enables the proper rngd service.  This service always exits successfully


2) Modify the rngd service file to be disabled by default, and only get enabled via the service in (1)

I think that would give Adam what he wants, in that he will never see a service fail because we can check if there is an entropy source first, and it will allow the actual rngd service to fail properly if it was manually started at a later time.

The only question I would have here is whether a systemd unit can be enabled and started by another unit.  If that's possible, I would be ok with this solution.

Thoughts?

Comment 17 Simo Sorce 2017-10-27 14:17:20 UTC

I know we do start units in freeIPA from a script, such that the ipa.service unit is enabled but none of the services we start from it are.

This is slightly frowned upon in systemd circles, but so far it works, so you have this method if nothing else works.

That said I would look into various conditionals to know if this is a good idea.
systemd also has things like generators that may help in this case (the generator would generate the rngd service only if the preconditions are met or some such).

Comment 18 Simo Sorce 2017-10-27 14:18:02 UTC

https://www.freedesktop.org/software/systemd/man/systemd.generator.html

Comment 19 Neil Horman 2017-10-27 19:28:39 UTC

Created attachment 1344506
proposed change to rngd to allow conditional starting based on entropy availability

I'm not sure a generator is really what's needed here, though it probably could be used.  This is more what I had in mind.  I've not tested the firstboot functionality yet, but running systemctl start entropy-check.service does the right things on my VM (in that it detects an entropy source and enables/starts rngd).  I can look into using a generator, but so far this looks good to me.  Let me know what you think.

Comment 20 Adam Williamson 2017-10-27 21:00:35 UTC

I'm honestly not a huge fan of that, it seems rather hacky - I suspect it'd make zbyszek cry a bit. I really don't want to get too fancy with this. If you're absolutely convinced that you don't want to just treat "no hardware" as an OK exit for the service's purpose, I'd rather just leave things as they are, honestly; I can work around it in openQA if we definitely decide that it's behaving the way you want it to behave. I still think that approach is the best one, I just don't believe the idea that there are people out there who are trusting that "rngd.service didn't fail means we have hardware RNG!", I frankly don't believe a single user like that exists. But if you're convinced it's a case worth caring about, then let's just keep things simple and stick with the service failing, unless someone has a less icky implementation to suggest.

Comment 21 Neil Horman 2017-10-27 21:52:36 UTC

I'm not sure why you think this would be hacky, but silently ignoring the fact that the entropy daemon wasn't providing entropy to the kernel isn't.  As Simo notes, we do this in other packages already, so there's some precedent for it, despite the reaction that the systemd people would have.  I can look at using generators to make this cleaner, but this seems like the way to go to me.  I sympathize with your desire to always have rngd run when there is entropy to be gathered, and not have it fail where there isn't, but this really seems like the best of both worlds to me.  I'll keep trying to clean it up and commit it when I'm happy with it.

Comment 22 Adam Williamson 2017-10-27 23:38:20 UTC

Services that run 'systemctl' are, I think, fundamentally hacky. Also as Simo notes, the systemd folks really don't like the way ipa.service works. Also it just seems like a lot of work for a fairly marginal gain.

Once again, I don't agree with the characterization of the behaviour I'd prefer as "silently ignoring the fact that the entropy daemon wasn't providing entropy to the kernel". Nothing is 'silently ignored'. The daemon exits with a non-zero exit code, and the service does not show as active. Anyone who actually needs to be sure that rngd is providing entropy should be checking if rngd.service is *active*, not just that it didn't fail.

Comment 23 Adam Williamson 2017-10-27 23:38:53 UTC

For the record there's even a systemctl command *specifically for checking if services are active*, 'systemctl is-active'.
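A minimal sketch of such a check from a script, assuming a systemd host: `systemctl is-active` exits 0 when the unit is active and non-zero otherwise. The helper name `unit_is_active` and the injectable `run` parameter are assumptions added so the function can be exercised without systemd.

```python
import subprocess


def unit_is_active(unit, run=subprocess.run):
    """Return True only if `systemctl is-active` reports the unit as active.

    `run` defaults to subprocess.run and can be replaced with a stub.
    """
    result = run(
        ["systemctl", "is-active", "--quiet", unit],
        check=False,  # a non-zero exit is an answer here, not an error
    )
    return result.returncode == 0


if __name__ == "__main__":
    if unit_is_active("rngd.service"):
        print("rngd is running")
    else:
        print("rngd is NOT running (failed, inactive, or not installed)")
```

This distinguishes "running" from both "failed" and "inactive", which is the property an admin who truly depends on rngd would need to test.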

Comment 24 Neil Horman 2017-10-28 01:45:40 UTC

Ok, I'm sorry, I'm officially no longer interested in what you think.  I've tried to be accommodating, but you apparently can't let anything go without having the last word.  So, you've presented a problem for fixing, and I'm fixing it in the way I feel as the maintainer is best.  Your input is no longer needed.

Comment 25 Adam Williamson 2017-10-28 02:11:28 UTC

It has nothing to do with 'having the last word'. It has a lot to do with trying to keep the distribution sane. But if you want to maintain this hacky approach forever (I'm sure systemd upstream doesn't support services that enable other services in this way, for instance, and I'm not at all sure how reliable ConditionFirstBoot is, I don't know if we have *anything* else that uses it), that's entirely up to you, sure. It's just that yesterday you were saying you thought everything was fine the way it was, and I said I'd rather leave it that way than "fix" it this messy way. But if you prefer this, whatevs. I'll make sure to report any bugs it causes.

Comment 26 Zbigniew Jędrzejewski-Szmek 2017-10-29 09:50:57 UTC

> Created attachment 1344506

In general I don't think this is a good approach. It makes the whole thing _much_ more complicated. And in the end the effect is almost exactly the same as the state in F27.

This approach also has the downside that it's static: hardware availability is checked once after installation, and anything added later is ignored. We really should make things "just work" when hardware is hotplugged. It's also really nice if one can take a disk or a VM image and plug it into a different machine and have things "just work".

(Also, having an executable called "check-*" in /usr/bin that modifies system configuration and starts services is a potential pitfall.)

(Also, it'd be necessary to patch fedora-release to enable entropy-check.service in presets.)

> https://www.freedesktop.org/software/systemd/man/systemd.generator.html

That is a possibility, technically, but it doesn't make much sense. Generators are used to turn on multiple units based on external configuration. Adding a generator to tell a single unit whether to run or not, when this unit can decide on its own better anyway, just isn't useful. Rngd is able to decide this on its own, and *has* to decide on its own anyway, because it checks what hardware is available.

--

Based on all the discussion, I think that the requirements that would make all sides happy can be summarized as:
- autodetect hwrngs and use automatically, by default [normal users]
- if no hwrngs are available, log and exit cleanly (with success) [users and QA]
- make it possible to easily make hwrng presence mandatory and fail verbosely if they are not found [advanced users]

The state in F27 satisfied the first two points, and we need a way to implement the third. I can see two easy ways to achieve this:

a) add a configuration file, e.g. /etc/rngd.conf, with an optional line like:
   # Uncomment the line below to make rngd.service fail if no hardware is present
   # hardware_required = true
... with matching code in rngd.

b) move the SuccessExitStatus=66 to a separate file, /usr/lib/systemd/system/rngd.service.d/soft-failure.conf, with contents like:
   # "Mask" this file, using
   #     mkdir /etc/systemd/system/rngd.service.d/
   #     ln -s /dev/null /etc/systemd/system/rngd.service.d/soft-failure.conf
   # to make rngd.service fail if no hardware is present

I'd go with b) because it is so simple and self-documenting.
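A rough sketch of what option b) amounts to on disk, written as two helpers so it can be tried against a scratch directory. The helper names are hypothetical, the `[Service]` section wrapper is an assumption about how the drop-in would be laid out, and the `rngd.service.d` directory name follows the drop-in naming that systemd documents for units. On a real system `vendor_root` would correspond to /usr/lib/systemd/system and `admin_root` to /etc/systemd/system.

```python
from pathlib import Path

DROP_IN_DIR = "rngd.service.d"
DROP_IN_NAME = "soft-failure.conf"
# Assumed drop-in body: treat exit code 66 ("no hardware") as a clean exit.
DROP_IN_BODY = "[Service]\nSuccessExitStatus=66\n"


def install_soft_failure(vendor_root):
    """Write the vendor drop-in that makes exit code 66 count as success."""
    drop_in_dir = Path(vendor_root) / DROP_IN_DIR
    drop_in_dir.mkdir(parents=True, exist_ok=True)
    path = drop_in_dir / DROP_IN_NAME
    path.write_text(DROP_IN_BODY)
    return path


def mask_soft_failure(admin_root):
    """Mask the drop-in so a missing entropy source is a hard failure again."""
    drop_in_dir = Path(admin_root) / DROP_IN_DIR
    drop_in_dir.mkdir(parents=True, exist_ok=True)
    link = drop_in_dir / DROP_IN_NAME
    if link.is_symlink() or link.exists():
        link.unlink()
    # Same effect as: ln -s /dev/null <admin_root>/rngd.service.d/soft-failure.conf
    link.symlink_to("/dev/null")
    return link
```

After changing either file on a live system, `systemctl daemon-reload` would be needed for systemd to pick up the new configuration.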

Comment 27 Simo Sorce 2017-10-29 14:05:55 UTC

Oh I did not think of (b), it looks way cleaner and simpler, love it!
Neil please consider using this.

Comment 28 Adam Williamson 2017-10-30 16:41:53 UTC

I like that in theory, but I can see one objection to it from Neil's POV. If you are this mythical admin who believes that if they enable a service and it doesn't fail that must mean it's running and working, how would you discover this? You wouldn't see that rngd.service had failed, therefore you'd have no reason to investigate it and find this mechanism. You'd just assume that all your systems had hardware RNG because rngd.service hadn't failed on any of them.

I still don't really believe this hypothetical sysadmin exists, but if they *do*, I'm not sure approach b) *really* solves their problem - it provides a mechanism they could use, but I don't see how we expect them to *discover* that this mechanism exists and that they must use it.

Comment 29 Zbigniew Jędrzejewski-Szmek 2017-10-30 17:10:23 UTC

Such a hypothetical admin has also infinite time to read documentation and check everything... so she is bound to know to perform the manual steps necessary ;)

But more seriously, there is a mechanism to discover that rngd is not running. If there is not enough entropy, services that use /dev/random will stall and/or you'll receive warnings in dmesg. Such a situation is an occasional issue in VMs and other entropy-constrained systems, so admins have to be aware and prepared to deal with it anyway.
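As one way to watch for this condition, the kernel exposes its entropy estimate in /proc/sys/kernel/random/entropy_avail. The sketch below reads that file; the function name and the 256-bit threshold are assumptions chosen for illustration, and the meaning of the reported value has changed across kernel versions, so treat the comparison as indicative only.

```python
from pathlib import Path

ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def entropy_is_low(threshold_bits=256, path=ENTROPY_AVAIL):
    """Return True if the kernel's entropy estimate is below `threshold_bits`.

    `threshold_bits` is an arbitrary illustrative cutoff, not a kernel constant.
    """
    available = int(Path(path).read_text().strip())
    return available < threshold_bits


if __name__ == "__main__":
    state = "low" if entropy_is_low() else "ok"
    print("kernel entropy estimate:", state)
```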

Comment 30 Adam Williamson 2017-11-03 18:38:27 UTC

So it looks like Neil first implemented his proposed entropy-check.service approach, then changed it to remove that service and instead do the 'check for available entropy source' in %post:

http://pkgs.fedoraproject.org/rpms/rng-tools/c/2aa45beb753b7401fedcbfa3ccd0a4b005510f56?branch=master

Sorry, Neil, but again that seems like a bad idea. Specifically, package %post doesn't only get run on the actual system the package is deployed on. For instance, live images: live images are built by running an install via anaconda, including installing all the packages. So when we build a live image, this 'entropy check' will run during the live image build process, and decide whether rngd.service should be enabled on the live image - for everyone who ever boots the live image, or installs from it - based on whether we happen to have an HW entropy source available in the live image build environment. That's clearly wrong.

Comment 31 Kevin Fenzi 2017-11-04 20:14:53 UTC

Hey. Whatever you did in rng-tools-6.1-2.fc28 it broke rawhide composes. 

All the live media installs hung with: 

anaconda.log: 

10:26:51,909 CRT exception: Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/pyanaconda/threading.py", line 252, in run
    threading.Thread.run(self)
  File "/usr/lib64/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation.py", line 366, in doInstall
    installation_queue.start()
  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 304, in start
    item.start()
  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 304, in start
    item.start()
  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 472, in start
    self.run_task()
  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 438, in run_task
    self._task(*self._task_args, **self._task_kwargs)
  File "/usr/lib64/python3.6/site-packages/pyanaconda/payload/dnfpayload.py", line 947, in install
    if errors.errorHandler.cb(exc) == errors.ERROR_RAISE:
  File "/usr/lib64/python3.6/site-packages/pyanaconda/errors.py", line 305, in cb
    raise NonInteractiveError("Non interactive installation failed: %s" % exn)
pyanaconda.errors.NonInteractiveError: Non interactive installation failed: DNF error: Non-fatal POSTIN scriptlet failure in rpm package rng-tools

packaging.log: 

10:26:51,859 INF packaging: Installed: rng-tools-6.1-2.fc28.x86_64 1509635197 ca92320279cdff24dd10840dbea60c32d536e13780e858989bed974685ec6c3e
10:26:51,866 INF packaging: Configuring (running scriptlet for): rng-tools-6.1-2.fc28.x86_64 1509635197 ca92320279cdff24dd10840dbea60c32d536e13780e858989bed974685ec6c3e
10:26:51,906 ERR dnf.rpm: Non-fatal POSTIN scriptlet failure in rpm package rng-tools

I have untagged rng-tools-6.1-2.fc28. Please try and address this before pushing a new build. Thanks.

Comment 32 Adam Williamson 2017-11-07 17:15:29 UTC

Kevin: that'll be the thing I mentioned in #c30 - the %post script that attempts to enable/disable the service based on hardware availability.

BTW, to modify my analysis in #c30: I just noticed the %post block isn't run only on first package install, but on install and update. So that makes things different, but in a way, weirder.

If we ignore the fact that it actually breaks the compose process and imagine that live image compose actually worked, then we'd get live images with rngd.service disabled, because the live compose environment doesn't have the hardware (most likely). So when you installed from a live image, rngd.service would be disabled. Then if there happened to be any update to rng-tools, rngd.service would suddenly get enabled (assuming your system actually had the hardware) without you doing anything.

Then imagine if you happened to update rng-tools via a chroot or something for some reason; rngd would get turned off again.

It also, of course, precludes the sysadmin's ability to disable rngd.service in any meaningful way on a system which has the hardware, because it would be turned back on by the %post script any time rng-tools was updated. They'd have to mask the service, which wouldn't be at all obvious.

Comment 33 Paul W. Frields 2017-11-07 17:28:44 UTC

Neil, would it be possible to drop back and consult with Zbigniew (if needed) about a more robust solution?  Comment 26 looks like it has some useful info and might avoid pitfalls that sometimes come from using systemd oddly in packaging context.

Comment 34 Adam Williamson 2017-11-09 01:07:01 UTC

FWIW, I've tweaked openQA so it's not affected by this any more. It turned out to be pretty simple: there's an openQA setting I can use to enable the virtIO 'hardware' RNG device in the VM used for the test. I've enabled that setting for the test that checks that all system services started correctly, so now rngd.service doesn't fail because it finds the virtio RNG, and all is cool. So I have no reason to want this changed any more, I'm now fine with the 'service fails if hardware isn't found' behaviour.

Comment 35 D. Hugh Redelmeier 2017-11-17 08:00:33 UTC

The following comments are from a picky possible consumer of this feature.

I think that Simo's #15 is pretty reasonable.

Linux entropy always seemed opportunistic: use anything you can get your hands on.  Where this fails is that it isn't easy to know how much entropy you've got.  The kernel guesses, but I don't think that there is any reasonable rigour in that guess.  And it matters for security.

Having rngd quietly give up when there are no sources known to it is quite consistent with the rest of Linux behaviour.  That's good: no surprises (except to those who didn't know the overarching behaviour).

It would be good for fastidious sysadmins (really? and unicorns) and fastidious programs to be able to specify that they need rngd.  It seems as if a systemd dependency on the proposed required-rngd could do the job.

Having the existing mechanism fail does not seem like a good change.

(I came here from https://bugzilla.redhat.com/show_bug.cgi?id=1490632 )

Comment 36 Adam Williamson 2017-11-22 21:21:29 UTC

So, as per #c31, nirik untagged the most recent build from rawhide as it broke composes.

However, the change has not been either reverted or fixed in git, so any time we get a rebuild for any reason, we'll get composes broken again.

I don't want to tread on Neil's toes too hard here (I obviously annoyed him enough already), but this is not a sustainable situation: the changes must either be reverted or fixed in git to prevent another broken package build appearing.

Comment 37 raynard 2018-01-24 00:05:34 UTC

I tend to agree with #26 (b), or something quite close.  It seems like having rngd exit with a known exit code when no entropy sources are available was a perfect solution -- it allows:

1) the distribution to choose either default (require entropy, or maintain the previous behavior), and
2) the local administrator to choose the opposite with a simple /etc/systemd/system/rngd.service.d/something.conf SuccessExitStatus= override or masking link.

I read through the comments, and it seemed like there was a moment of hovering near that conclusion, but then a wild exit ramp turn toward creating extra interconnected units, and I couldn't quite tell why -- have I missed something?

Sorry I've run into this so late, we just happened upon a convergence of a vm having no hw entropy source plus rngd's change in behavior in RHEL7 while updating some environment templates today, and we don't necessarily control the hardware/virt.env., or even configs of systems which will run this stuff.

I'd rather have kept the old behavior (agreeing w/ #35 and indirectly #15), but I'd be nearly as happy to just add a drop-in to my config templates, and I do see some value in considering an rngd failure as something an admin needs to examine (a la the spirit of #14 and #28).
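
To make the exit-code idea concrete, here is a simplified sketch of how a dedicated "no entropy source" exit code would interact with a SuccessExitStatus= override. The numeric code is hypothetical (no value is named in this bug), and the function models only the basic rule that a non-zero exit counts as failure unless it is listed as a success status.

```python
# Hypothetical exit code rngd could use for "no entropy source found".
# The value 66 is an assumption made purely for illustration.
NO_ENTROPY_EXIT = 66

def unit_failed(exit_code, success_exit_status=frozenset()):
    """Simplified model: the unit fails if the main process exits non-zero
    and the code is not listed in SuccessExitStatus=."""
    return exit_code != 0 and exit_code not in success_exit_status

# Default: a missing entropy source is reported as a failed unit.
print(unit_failed(NO_ENTROPY_EXIT))
# With a drop-in listing the code, the same exit is treated as success.
print(unit_failed(NO_ENTROPY_EXIT, {NO_ENTROPY_EXIT}))
# Any other non-zero exit is still a failure.
print(unit_failed(1, {NO_ENTROPY_EXIT}))
```

Either default can be shipped by the distribution, and the local administrator can flip it with a single drop-in line without touching the package.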

Comment 38 Adam Williamson 2018-02-15 21:17:23 UTC

...aaand what I warned of in #c36 has now happened. The package was rebuilt for the f28 mass rebuild and is now breaking Rawhide composes again:

pyanaconda.errors.NonInteractiveError: Non interactive installation failed: DNF error: Non-fatal POSTIN scriptlet failure in rpm package rng-tools

https://koji.fedoraproject.org/koji/taskinfo?taskID=25077681

Comment 39 Adam Williamson 2018-02-15 21:32:02 UTC

For now I have reverted to the 6.1-1 state - i.e. a clean build of upstream 6.1, no attempts to 'fix' this bug - bumped to 6.1-4, and rebuilt. I don't want to essay anything more radical as Neil clearly thought I was treading on his toes, but we couldn't leave the package in the broken state.

Comment 40 Pat Riehecky 2018-03-19 21:25:45 UTC

Perhaps adding `ConditionFileNotEmpty=/sys/devices/virtual/misc/hw_random/rng_available` to  rngd.service would fix this?

My systems without a hardware rng do have this file empty.
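
For anyone who wants to check their own machines, a minimal sketch of the same test follows. It reads the file's content instead of relying on its reported size, which is a choice made for this example, and it treats a missing or unreadable file as "no device". The function name is made up, and the sketch says nothing about whether the TPM or RDRAND sources show up in this file.

```python
from pathlib import Path

# Path taken from the suggestion above.
RNG_AVAILABLE = Path("/sys/devices/virtual/misc/hw_random/rng_available")

def hwrng_listed(path: Path = RNG_AVAILABLE) -> bool:
    """Return True if the file exists and lists at least one device name."""
    try:
        return bool(path.read_text().strip())
    except OSError:
        # Missing or unreadable file: treat as no hardware RNG available.
        return False

print("hardware RNG listed:", hwrng_listed())
```

Passing a different path makes the function easy to try against a scratch file before pointing it at sysfs.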

Comment 41 Gerald Cox 2018-06-20 05:48:40 UTC

Changing to rawhide since this has already spanned several releases.  It is occurring in F28.

Comment 42 Adam Williamson 2018-06-20 22:27:56 UTC

rngd considers three 'entropy sources': "Hardware RNG Device", "TPM RNG Device" and "Intel RDRAND Instruction RNG". Are we sure that /sys/devices/virtual/misc/hw_random/rng_available will always be non-empty when *any one* of those three is available?

Comment 43 raynard 2018-06-21 00:06:20 UTC

Sorry to pile on, but... Even if (what amounts to) a "magic file" on the sys or proc or whichever virtual fs were reliable, y'know who would know for sure whether rngd found a suitable entropy source and, bugs notwithstanding, would never be wrong about it? rngd

Was there a technical reason the predictable exit code approach was dropped?

Comment 44 Gerald Cox 2018-06-21 01:07:12 UTC

I agree with #c8 at least until an approach can be decided upon and implemented to gracefully handle situations where the hardware does not exist.

We should never give any process default status in such a situation.

Comment 45 Jan Kurik 2018-08-14 11:21:54 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.

Comment 46 Ben Cotton 2019-10-31 20:44:08 UTC

This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 47 Adam Williamson 2019-11-04 19:18:03 UTC

It seems Neil just left things in the initial state, which is fine so far as I'm concerned. So let's just close this.

Comment 48 Red Hat Bugzilla 2023-09-14 04:07:42 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.

