Bug 1715899
| Summary: | Kernel pool runs out of entropy | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Vilém Maršík <vmarsik> |
| Component: | rng-tools | Assignee: | Neil Horman <nhorman> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Vilém Maršík <vmarsik> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.7 | CC: | nhorman, rvr |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Unspecified | ||
| Last Closed: | 2020-07-15 17:19:32 UTC | Type: | Bug |
Could not reproduce this on my 8.1 installation, just on 7.6 and 7.7.

Can I have access to this system to try to reproduce it manually? If I had to take an initial guess, I would expect that this system has a TPM version 1.2 device on board, which we deprecated in rng-tools because more recent hardware exports TPM entropy through /dev/hwrng. If this is an older system, that change may have led to us turning off the TPM entropy source, which we shouldn't have done, but I need to confirm that.

OK, so I've looked at this and I've noted a few things:

1) The FIPS test is taking longer. This is due to a few bugs that were fixed upstream between the 6.3 and 6.7 releases. I can try to backport them individually, but it might be easier and safer just to do an upstream rebase. Not sure if that's worthwhile though, looking for input here.

2) The introduction of the jitterentropy entropy source is meant to provide entropy early during boot, when other entropy sources aren't available, but it is, by design, slow and CPU bound. Running extractions from the entropy pool, as this test does, creates a pessimal case for that entropy source and causes CPU contention with your extract process, leading to lower entropy counts. Running rngd with the jitter daemon disabled will fix this, and can avoid us having to do (1) above. Let me know what you think (I've modified the test case to do that, and it passes in all cases).

3) I'm not sure I'm comfortable with the entropy expectations of the test itself. The assertion that the entropy count in the kernel should always be at half full or more is something we can hope for, but depending on how much entropy is drained during the dd operation and the speed at which rngd can refill it, it's not always feasible. While it may have worked in the past, there's no guarantee that at any specific moment in time the entropy pool won't be low.
Especially with the CPU-based jitterentropy source running, it may take a few milliseconds to refill it, which isn't catastrophic, and if you check the pool size right after you do a drain operation (especially after a few in a row), it may just be low for a moment, and that's OK.

So, in summary, there are a few issues going on here:

A) We have a few bugs in rng-tools, which are irritating but not catastrophic. Some of them can be fixed by updating to the latest upstream rng-tools (this will fix some of the latency in the FIPS test).

B) We are expecting entropy counts to remain high when we are running rngd in a configuration meant to provide entropy during early boot, but not to provide high levels of entropy during periods of extreme entropy need (i.e. we are running with jitterentropy enabled). This can be fixed by running rngd in the tests with the -x 5 option (which disables jitterentropy). This will allow the rdrand source to provide more entropy, more quickly, and avoid the entropy level check failures.

C) The entropy check failure is somewhat erroneous, as the expectation of always having a half full entropy pool can't be guaranteed at any specific moment in time. I would suggest that a more realistic test would be to fail if the entropy count falls below the requested amount of entropy (1024) for more than two consecutive iterations, which would suggest that there is insufficient entropy to satisfy the next request.

Thoughts?

Just as some additional information, we could modify rng-tools to mark jitter entropy as a slow entropy-producing source and modify the gathering loop to try fast sources first, falling back to the slow sources only when they are the only ones available. That might be a happy medium for RHEL 7, as it would solve your problem above on systems that have other, faster sources available, but still allow early boot to get entropy on systems with no other source of randomness.
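The fast-first gathering idea described above can be illustrated with a small shell sketch. This is not rngd's code: the source list, the `name:speed` encoding, the `AVAILABLE` variable and the stub `collect_from` are all hypothetical names chosen for illustration only.

```shell
#!/bin/sh
# Hypothetical source list as "name:speed" words; these names are
# illustrative and are not rngd's internal data structures.
SOURCES="rdrand:fast hwrng:fast jitter:slow"

# Stub collector (assumption): reports 32 bytes if the source is listed
# in AVAILABLE, a space-separated list set by the caller, otherwise 0.
collect_from() {
    case " ${AVAILABLE:-} " in
        *" $1 "*) echo 32 ;;
        *) echo 0 ;;
    esac
}

# One gathering pass: poll the fast sources first; poll the slow sources
# only if the fast pass collected nothing. Prints the names of the
# sources that delivered data, separated by spaces.
gather_once() (
    used=""
    for pass in fast slow; do
        for entry in $SOURCES; do
            name=${entry%%:*}
            speed=${entry##*:}
            [ "$speed" = "$pass" ] || continue
            got=$(collect_from "$name")
            if [ "$got" -gt 0 ]; then
                used="$used $name"
            fi
        done
        # Skip the slow pass as soon as a fast source produced something.
        if [ -n "$used" ]; then
            break
        fi
    done
    echo "${used# }"
)
```

With `AVAILABLE="rdrand jitter"` the pass uses only `rdrand`; with `AVAILABLE="jitter"` (for example a guest with no hardware source) it falls back to `jitter`.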
Let me know what you think.

Hi Neil,

It is not a few milliseconds before the slow JITTER source can feed the entropy; it is multiple seconds before rngd with the fast RDSEED could compensate for that slow consumption. Reading 1k blocks with delays of 1s or 100ms is not very fast.

"a more realistic test be to fail the test if the entropy count falls below the requested amount of entropy (1024) for more than two consecutive iterations" - yes, we are not there yet, but the pool still got almost empty - 38 after 300ms, but with 1s delays, the minimum is further. RDRAND alone should feed entropy many orders of magnitude faster than this. And rngd already had enough time to kick in.

My greatest concern is entropy running to zero, and a drastic performance drop in processes relying on randomness. I have not reproduced it yet, but the numbers are close to that. Do you think it cannot happen?

If this was caused by waiting for the slow JITTER source, then yes, it makes sense to disable it when RDRAND is present, and prevent entropy shortage under stress. On the other hand, depending solely on a proprietary blackbox is not the most secure solution. Can we have both - not fully relying on RDRAND, while not getting blocked for seconds with JITTER?

Thanks

I'm not sure what you mean by "a blackbox" here. There's nothing opaque about the entropy sources. And no, I'm not sure entropy can't run to zero; in fact, I guarantee you that it can. But that's the case in any situation, regardless of the entropy sources we employ. What I'm saying is that it's not really a fair test to just assume that we will always have a half full entropy pool, because we have no standards for how fast we can fill or drain entropy. It's simply a matter of how fast you drain vs. how fast you can fill it back up. Just because your test drains it in 1024 byte blocks doesn't mean that some other user won't drain it at 10x that amount. We don't have any global expectation of drain and fill rates.
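The alternative pass criterion quoted above (fail only when the count stays below the requested 1024 for more than two consecutive iterations) can be sketched as a small helper. The function name and argument order are hypothetical; the samples would come from reading /proc/sys/kernel/random/entropy_avail after each dd.

```shell
#!/bin/sh
# check_entropy_samples THRESHOLD MAX_LOW SAMPLE...
# Hypothetical helper: returns 0 unless more than MAX_LOW consecutive
# samples fall below THRESHOLD, in which case it returns 1.
check_entropy_samples() (
    threshold=$1
    max_low=$2
    shift 2
    low_run=0
    for sample in "$@"; do
        if [ "$sample" -lt "$threshold" ]; then
            low_run=$((low_run + 1))
            # Too many low readings in a row: treat as a real shortage.
            if [ "$low_run" -gt "$max_low" ]; then
                exit 1
            fi
        else
            # A healthy reading resets the run of low samples.
            low_run=0
        fi
    done
    exit 0
)
```

Applied to the first readings in the report, `check_entropy_samples 1024 2 2081 1058 38 3104` passes under this rule, because only one sample (38) is below 1024 before the pool recovers.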
I've submitted a patch upstream to avoid using the slow entropy sources when the fast ones are providing sufficient entropy. I'll try backporting that, and you can see if that passes your tests (it should), but I maintain that we probably need to come to an agreement on what valid drain and fill rate measurements are.

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=21993161

Here's a test build for you. I've backported two upstream patches:

- The first adds a flag to slow entropy sources, marking them as being slow to deliver entropy.
- The second modifies the entropy gathering loop, such that by default it only attempts to collect entropy from sources that are not marked as slow. Only if we iterate over all available entropy sources and do not collect any entropy do we then include slower sources in the entropy collection process.

This should allow systems without high speed entropy sources (i.e. virt guests) to continue using jitterentropy when no other source is available, but on systems with other high speed sources available, entropy will be collected at the rates at which those high speed sources can produce it, not being bound by any latency the slower sources might introduce in the nominal case.

Please let me know if it passes your tests. If so, we can pursue the acks on this bz and get it integrated.

By "a blackbox" I mean the internal implementation of the RDRAND instruction. This cannot be audited and was disputed in the past. A bug there could have security implications, and relying purely on that blackbox might not be safe.

We have no expectations of drain and fill rates, but this is about performance. I don't have any customer use case at hand, but can imagine e.g. generating crypto keys for network encryption with a high number of incoming connections. In that case, much more than 10k/s might be needed. The throughput of RDRAND is in hundreds of MBps (see e.g. https://github.com/randombit/botan/issues/911 ), so why should we be that much slower here? If the user wants random numbers and has a HWRNG, he will probably expect to reach a corresponding performance.

Thanks for the build, will test it.

ping, updates?

Closing for lack of update, please reopen if you get around to testing this.

Sorry for the delay, could not reproduce at the time. Now seeing the problem again:
https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2020/06/43249/4324918/8377931/111209020/taskout.log
https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2020/06/43214/4321443/8371411/111136678/taskout.log

Could you rebuild your patches again, so that I can try them? They seem to have expired in the meantime.

Thanks, but how do I get the RPM? I only got to https://brewweb.engineering.redhat.com/brew/buildrootinfo?buildrootID=6042457 , but see no "Built RPMs" there.

They've been reclaimed, you need to download them more quickly. You have anywhere from 3 to 7 days to grab scratch builds. I've started a new build for you:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29548200
Please collect the rpms ASAP for your testing.

Hi, it looks like your change did not help. Starting with the original version:

[root@ibm-x3250m5-01 hwrng]# make run
(...)
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: entropy-pool
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ 18:33:30 ] :: [ PASS ] :: Starting rngd.service (Expected 0, got 0)
:: [ 18:33:30 ] :: [ PASS ] :: rngd.service is active (Expected 0, got 0)
:: [ 18:33:30 ] :: [ FAIL ] :: Available entropy at least 2048 (Assert: "1931" should be greater than "2047")
:: [ 18:33:31 ] :: [ FAIL ] :: Available entropy at least 2048 (Assert: "913" should be greater than "2047")
:: [ 18:33:32 ] :: [ FAIL ] :: Available entropy at least 2048 (Assert: "6" should be greater than "2047")
:: [ 18:33:33 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3079" should be greater than "2047")
:: [ 18:33:34 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3103" should be greater than "2047")
:: [ 18:33:35 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3090" should be greater than "2047")
:: [ 18:33:36 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "2727" should be greater than "2047")
:: [ 18:33:38 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3082" should be greater than "2047")
:: [ 18:33:39 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3081" should be greater than "2047")
:: [ 18:33:40 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3072" should be greater than "2047")
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: Duration: 11s
:: Assertions: 9 good, 3 bad
:: RESULT: FAIL (entropy-pool)
(...)
[root@ibm-x3250m5-01 hwrng]# uname -r
3.10.0-1151.el7.x86_64
[root@ibm-x3250m5-01 hwrng]# rpm -q rng-tools
rng-tools-6.3.1-5.el7.x86_64
[root@ibm-x3250m5-01 hwrng]# rpm -U http://download.eng.bos.redhat.com/brewroot/work/tasks/8225/29548225/rng-tools-6.3.1-6.el7.x86_64.rpm
[root@ibm-x3250m5-01 hwrng]# systemctl restart rngd
[root@ibm-x3250m5-01 hwrng]# make run
(...)
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: entropy-pool
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ 18:42:28 ] :: [ PASS ] :: Starting rngd.service (Expected 0, got 0)
:: [ 18:42:28 ] :: [ PASS ] :: rngd.service is active (Expected 0, got 0)
:: [ 18:42:28 ] :: [ FAIL ] :: Available entropy at least 2048 (Assert: "1773" should be greater than "2047")
:: [ 18:42:29 ] :: [ FAIL ] :: Available entropy at least 2048 (Assert: "751" should be greater than "2047")
:: [ 18:42:30 ] :: [ FAIL ] :: Available entropy at least 2048 (Assert: "7" should be greater than "2047")
:: [ 18:42:31 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3079" should be greater than "2047")
:: [ 18:42:32 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3072" should be greater than "2047")
:: [ 18:42:33 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3072" should be greater than "2047")
:: [ 18:42:34 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "2579" should be greater than "2047")
:: [ 18:42:35 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3072" should be greater than "2047")
:: [ 18:42:36 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3072" should be greater than "2047")
:: [ 18:42:38 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3072" should be greater than "2047")
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: Duration: 11s
:: Assertions: 9 good, 3 bad
:: RESULT: FAIL (entropy-pool)
(...)
[root@ibm-x3250m5-01 hwrng]# uname -r
3.10.0-1151.el7.x86_64
[root@ibm-x3250m5-01 hwrng]# rpm -q rng-tools
rng-tools-6.3.1-6.el7.x86_64
[root@ibm-x3250m5-01 hwrng]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 Beta (Maipo)

Is there anything more we can do?

I'm not really sure. Can you paste the output of /proc/cpuinfo here please?

ping?

OK, still no response for 3 weeks, closing again.
There you are, sorry for the delay:

# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Xeon(R) CPU E3-1220L v3 @ 1.10GHz
stepping : 3
microcode : 0x28
cpu MHz : 1404.742
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
bogomips : 2194.76
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Xeon(R) CPU E3-1220L v3 @ 1.10GHz
stepping : 3
microcode : 0x28
cpu MHz : 1418.237
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
bogomips : 2194.76
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Xeon(R) CPU E3-1220L v3 @ 1.10GHz
stepping : 3
microcode : 0x28
cpu MHz : 1456.103
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
bogomips : 2194.76
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Xeon(R) CPU E3-1220L v3 @ 1.10GHz
stepping : 3
microcode : 0x28
cpu MHz : 1392.993
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
bogomips : 2194.76
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

Can anything else be done here?
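The flags in the cpuinfo output above can be checked mechanically rather than by eye. The following is a minimal sketch; `has_cpu_flag` is a hypothetical helper name, and it assumes input in the usual /proc/cpuinfo layout, with each `flags` entry at the start of its own line.

```shell
#!/bin/sh
# has_cpu_flag FLAG
# Hypothetical helper: reads cpuinfo-formatted text on stdin and prints
# "yes" if FLAG appears as a whole word on a "flags" line, else "no".
has_cpu_flag() {
    if grep -E '^flags[[:space:]]*:' \
        | tr ' \t' '\n\n' \
        | grep -qxF -- "$1"; then
        echo yes
    else
        echo no
    fi
}
```

Usage would be `has_cpu_flag rdrand < /proc/cpuinfo`. For the flags pasted above, this should report yes for `rdrand` and no for `rdseed`, since `rdseed` does not appear in the list.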
Description of problem:
Kernel runs out of entropy with rngd running, when 1k blocks are read from /dev/random with a 0.1s / 1s delay in between. The machine has RDRAND, which should feed entropy many orders of magnitude faster.

Version-Release number of selected component (if applicable):
rng-tools-6.3.1-4.el7.x86_64
3.10.0-1052.el7.x86_64
Same happens with 7.6 versions
Code from 7.5 does not show this

How reproducible:
100% after each reboot

Steps to Reproduce:
1. Install & run rng-tools-CoreOS-rng-tools-Sanity-hwrng-1.0-8.noarch
2. Check results from the "entropy-pool" part
3. Reboot before running it again

Actual results:
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: entropy-pool
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ 09:58:14 ] :: [ PASS ] :: Starting rngd.service (Expected 0, got 0)
:: [ 09:58:14 ] :: [ PASS ] :: rngd.service is active (Expected 0, got 0)
:: [ 09:58:14 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "2081" should be greater than "2047")
:: [ 09:58:14 ] :: [ FAIL ] :: Available entropy at least 2048 (Assert: "1058" should be greater than "2047")
:: [ 09:58:14 ] :: [ FAIL ] :: Available entropy at least 2048 (Assert: "38" should be greater than "2047")
:: [ 09:58:15 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3104" should be greater than "2047")
:: [ 09:58:15 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3090" should be greater than "2047")
:: [ 09:58:15 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3084" should be greater than "2047")
:: [ 09:58:16 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3081" should be greater than "2047")
:: [ 09:58:16 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3072" should be greater than "2047")
:: [ 09:58:16 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3100" should be greater than "2047")
:: [ 09:58:16 ] :: [ PASS ] :: Available entropy at least 2048 (Assert: "3088" should be greater than "2047")
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: Duration: 2s
:: Assertions: 10 good, 2 bad
:: RESULT: FAIL (entropy-pool)

Expected results:
Entropy should not run out with rngd running.

Additional info:
Code that runs the test:

    rlPhaseStartTest "entropy-pool"
        rlRun "systemctl start rngd.service" 0 "Starting rngd.service"
        rlRun "systemctl -q is-active rngd.service" 0 "rngd.service is active"
        for i in {1..10}; do
            # Drain 1 KiB from the blocking pool, then sample what is left.
            dd if=/dev/random of=/dev/null bs=1024 count=1
            ENTROPY=$(</proc/sys/kernel/random/entropy_avail)
            rlAssertGreater "Available entropy at least 2048" "$ENTROPY" 2047
            if [ "$ENTROPY" -gt 2047 ]; then
                rlReport "entropy_avail" "PASS" "$ENTROPY"
            else
                rlReport "entropy_avail" "FAIL" "$ENTROPY"
            fi
            sleep 0.1
        done
    rlPhaseEnd

Increasing the sleep from 0.1s to 1s decreases this only slightly; the problem still appears.
"rngd --list" shows RDRAND + JITTER as enabled sources, plus TPM and NIST as disabled.
Machine is a 12-core HP Z4 with Skylake CPU.
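To see how low the pool goes and how quickly it refills, rather than only a pass/fail at one instant, the counter can be sampled at a finer interval while the test runs. This is a rough sketch; `sample_entropy` and `min_of` are hypothetical helper names, and fractional `sleep` values assume a `sleep` that accepts them, as the test above already does.

```shell
#!/bin/sh
# sample_entropy FILE COUNT DELAY
# Hypothetical helper: prints COUNT readings of FILE, one per line,
# sleeping DELAY seconds between consecutive reads.
sample_entropy() (
    file=$1
    count=$2
    delay=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$file"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$delay"
        fi
    done
)

# min_of: reads integers on stdin (one per line), prints the smallest.
min_of() {
    sort -n | head -n 1
}
```

For example, `sample_entropy /proc/sys/kernel/random/entropy_avail 50 0.02 | min_of` run in a second terminal during the dd loop would print the lowest count observed over roughly one second.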