1663318 – NetworkManager DHCP sometimes fails in VMs using user-mode networking, broke between Fedora 29 Beta (20180919) and Fedora-29-20181015.n.0

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	bind
Sub Component:
Version:	29
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Petr Menšík
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-01-03 19:52 UTC by Adam Williamson
Modified:	2019-01-24 03:31 UTC
CC List:	27 users
Fixed In Version:	bind-9.11.4-13.P2.fc29 bind-9.11.4-13.P2.fc28
Clone Of:
Environment:
Last Closed:	2019-01-18 02:14:11 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Adam Williamson 2019-01-03 19:52:00 UTC

This is another fun openQA virt issue!

We recently got a couple of quite powerful boxes as new openQA worker hosts, so I configured them to have 30 worker processes each. This means that, sometimes, 30 qemu processes start more or less simultaneously, many of which use user-mode networking (a few use tap via openvswitch).

It seems that quite often, in a few of the VMs using user-mode, networking just...doesn't work right at all. DHCP requests time out, and attempting to ping 10.0.2.2 (the host) gives "Network is unreachable". This obviously typically results in the test failing, which is not good.

I've tried with both qemu 3.0.0 and 3.1.0 and they both do this.

Prior to getting these new machines, the maximum number of workers we were running on any one system was 10, and I don't recall running into this issue before now.

I can't find any relevant log messages in the host system logs or the logs openQA saves (which I think include at least some qemu output). Can certainly try hacking things up to catch additional debugging output from qemu if someone can say what would be useful.

Comment 1 Adam Williamson 2019-01-03 20:22:56 UTC

Looking at the records a bit more closely, I'm no longer confident this is only happening on the new worker boxes, actually. Here, for instance, is what looks like the same thing happening on qa09:

https://openqa.stg.fedoraproject.org/tests/451530

which is one of the older, 10-worker boxes.

When this happens in openQA now, the test is set to dump the contents of the journal out via a serial console, which openQA logs, so you can see the journal contents here:

https://openqa.stg.fedoraproject.org/tests/451530/file/serial0.txt

where we see these lines showing the DHCP failure:

19:10:50,816 WARNING NetworkManager:<warn>  [1546542650.8162] dhcp4 (ens5): request timed out
19:10:50,817 INFO NetworkManager:<info>  [1546542650.8163] dhcp4 (ens5): state changed unknown -> timeout
19:10:50,817 DEBUG NetworkManager:<debug> [1546542650.8163] device[0x55f03e5dd210] (ens5): new DHCPv4 client state 2
19:10:50,817 DEBUG NetworkManager:<debug> [1546542650.8164] device[0x55f03e5dd210] (ens5): DHCPv4 failed (ip_state conf)
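
For anyone triaging a saved serial0.txt, a quick filter along these lines can pull out the interfaces whose DHCPv4 request timed out. This is only a sketch: the function name `dhcp4_timeouts` is made up here, and the pattern simply matches the NetworkManager message format quoted above.

```sh
#!/bin/sh
# Hypothetical helper (not part of openQA or NetworkManager): print the name
# of each interface for which NetworkManager logged
# "dhcp4 (IFACE): request timed out".
dhcp4_timeouts() {
    # $1: path to a journal dump such as serial0.txt
    sed -n 's/.*dhcp4 (\([^)]*\)): request timed out.*/\1/p' "$1" | sort -u
}

# Usage: dhcp4_timeouts serial0.txt
```

An empty result just means no such message was logged; it says nothing about why DHCP failed.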

you can also see where the test tries to ping the host and fails (this is the check it uses after the test has failed, to see whether the network is broken so it should send logs out via the serial console):

https://openqa.stg.fedoraproject.org/tests/451530#step/_boot_to_anaconda/19

Comment 2 Richard W.M. Jones 2019-01-03 21:28:34 UTC

So as a test you might try something like:

$ LIBGUESTFS_BACKEND=direct parallel -i guestfish --network -a /dev/null run -- `seq 1 20`

This will run 20 copies of the libguestfs appliance, enabling the network (--network),
and using the direct backend so it's using qemu and user-mode networking.  It should
neither fail nor print any output, if it's working.

(You might need to play with the parallel -j and -l options, as well as starting more
or fewer jobs by adjusting the seq command).
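
If parallel isn't handy, roughly the same stress test can be approximated with plain shell job control. A minimal sketch, assuming a made-up helper name `run_copies`; it only counts non-zero exit statuses and does not check for unexpected output.

```sh
#!/bin/sh
# Hypothetical helper: start COUNT copies of a command at once, wait for all
# of them, and print how many exited with a non-zero status.
run_copies() {
    count=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" >&2 &            # keep the command's output off our stdout
        pids="$pids $!"
        i=$((i + 1))
    done
    failures=0
    for pid in $pids; do
        wait "$pid" || failures=$((failures + 1))
    done
    echo "$failures"
}

# Usage, with the guestfish invocation from the comment above:
#   export LIBGUESTFS_BACKEND=direct
#   run_copies 20 guestfish --network -a /dev/null run
```

A printed 0 means every copy exited cleanly; anything the copies print still shows up on stderr.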

Comment 3 Adam Williamson 2019-01-03 21:39:05 UTC

handy, thanks. Currently trying to get tcpdump output as requested by dgilbert. Will play around with that next (maybe tomorrow). thanks again!

Comment 4 Adam Williamson 2019-01-04 05:06:54 UTC

I tried using `tcpdump -i any` to capture output from tcpdump while running `dhclient --timeout 25`, and it doesn't seem to capture anything at all: the dhclient call seems to return immediately, too.

I'll try Richard's thing, but I'm also going to see if I can try it with different virtual NIC models; curious if it may be specific to virtio-net...

Comment 5 Richard W.M. Jones 2019-01-04 08:38:19 UTC

I'm afraid virtio-net is baked in:
https://github.com/libguestfs/libguestfs/blob/6b80c5fb51f08d3e62393e6722655bbcd940f4e7/lib/launch-direct.c#L669

While virtual NIC models might make a difference, my impression is that virtio-net is going to be the
most widely used, fastest (since not emulated), most tested and therefore least buggy of all of them.

Comment 6 Adam Williamson 2019-01-05 00:46:37 UTC

So, hum. I've been poking around at this a bit in openQA today.

The NIC model doesn't seem to make any difference - same results with e1000e and rtl8139.

tcpdump wasn't coming out right because it's not included in the installer environment. With it installed and the tests tweaked a bit, an interesting result: it seems like running just 'dhclient -v --timeout 25' *works*. It assigns an IP, and after that, the system can ping and curl and stuff just fine. It's only NetworkManager's initial DHCP attempt that fails, for some reason.

Continuing to investigate...

Comment 7 Adam Williamson 2019-01-05 01:09:46 UTC

ooo:

https://openqa.stg.fedoraproject.org/tests/456741#step/_boot_to_anaconda/28

"/usr/libexec/nm-dhcp-helper: interface name too long (is 27)"

when I replicate the way NetworkManager runs dhclient at a console. this seems promising!

Comment 8 Adam Williamson 2019-01-05 01:13:27 UTC

oh, bah, no, just a typo. le sigh.

Comment 9 Adam Williamson 2019-01-05 01:47:21 UTC

OK, so without the typo, still an interesting result: it seems like calling dhclient the way anaconda does it looks like it succeeds, *but then running ip addr still shows no IP address assigned*:

https://openqa.stg.fedoraproject.org/tests/456766#step/_boot_to_anaconda/30

note that dhclient has just run and said it bound to 10.0.2.15, but 'ip addr' shows ens5 still with no IPv4 address.

One possibility is that the "helper script" (nm-dhcp-helper) is being very *unhelpful* and somehow interfering with the interface actually being properly configured, or unconfiguring it again immediately, or something...

One confounding thing, though, is that NetworkManager has not changed in Rawhide during the timeframe where it seems this bug showed up (which seems to be around early December, best as I can tell ATM). Neither has dhcp. Kernel may possibly be involved somehow...

Comment 10 Adam Williamson 2019-01-05 01:49:10 UTC

oh, another thing that may be involved is dbus; nm-dhcp-helper seems to use dbus to talk to NM, and we both switched to dbus-broker by default and got a version upgrade to dbus relatively recently in Rawhide.

Comment 11 Adam Williamson 2019-01-05 02:45:56 UTC

Another data point: I ran the set of tests I'm using to play with this - basically just the same test duplicated 12 times - on an F29 release ISO, and got two failures:

https://openqa.stg.fedoraproject.org/tests/overview?version=29&groupid=1&distri=fedora&build=Fedora-29-TCPDUMP

guess I'll play around with different versions of qemu or something next...

Comment 12 Adam Williamson 2019-01-11 00:35:08 UTC

OK, so, here's a thing: if I run the test on F*28*, the bug doesn't happen. I tried twice (with 12 tests in each run), and both times, not one of the tests failed in this way. I've never had a *single* clean run (no test failed this way) with F29 or Rawhide, let alone two in a row.

So I think this *is* actually something on the guest side, but it broke during F29 somewhere, not post-F29. I'll scrounge around for any F29 nightlies I have lying around to see if I can pinpoint when exactly it broke.

Comment 13 Adam Williamson 2019-01-11 01:41:49 UTC

So I think the narrowest I can pinpoint it right now is this: F29 Beta is OK, does not seem to have this bug. Fedora-29-20181015.n.0 is NOT OK, it *does* have this bug.

I don't have any installer nightlies from the period between Beta and 20181015.n.0, so I can't narrow it down any tighter at present, will have to see if anyone else has any nightlies from that period lying around :/

Comment 14 Adam Williamson 2019-01-11 05:11:25 UTC

So I've been building installer images based on the 29 Beta tree, but with single updated packages that match the versions in 20181015.n.0, to try and isolate the cause. But so far no dice :(

I've tried with the NetworkManager from 20181015.n.0, the dhcp from 20181015.n.0, and the kernel from 20181015.n.0 - but each time the image worked fine. So it seems like the cause is none of those three. dbus did not change between those two composes, so it's not dbus either. I'm running out of ideas as to what it *could* be. Anyone got any ideas?

Assigning to NetworkManager just for now as we're pretty sure it's not qemu at this point, and NM devs might have some thoughts...

Note you can find the two composes in question here:

https://dl.fedoraproject.org/pub/alt/stage/29_Beta-1.5 (Beta, compose date is 20180919, does not have the bug)
https://kojipkgs.fedoraproject.org/compose/branched/Fedora-29-20181015.n.0/compose (20181015.n.0 nightly, has the bug)

Comment 15 Thomas Haller 2019-01-11 06:54:35 UTC

Hi,

> https://dl.fedoraproject.org/pub/alt/stage/29_Beta-1.5 (Beta, compose date is 20180919, does not have the bug)

has NetworkManager-1.12.2-2.fc29.x86_64.rpm    

> https://kojipkgs.fedoraproject.org/compose/branched/Fedora-29-20181015.n.0/compose (20181015.n.0 nightly, has the bug)

has NetworkManager-1.12.4-1.fc29

I am not aware of a known issue introduced between these versions.


Looking at serial0.txt (1.14.4-2.fc30) in comment 1, I don't see anything particularly suspicious either (aside from the DHCP timeout).


There is

  <warn>  [1546542605.1542] secret-key: failure to generate good random data for secret-key (use non-persistent key)

but I think that is not enough reason for getting such frequent failures. I presume the failure also happens when you boot only one VM (not many in parallel)?


Where are other logfiles of a failed run available? I am not familiar with openQA or how to run these images. Is there an easy way to reproduce this myself, or could you provide more logs?

Thanks for enabling level=DEBUG logging (in comment 1), that is helpful/necessary. But if you happen to do new runs and modify the image, please configure level=TRACE and make sure that journald's rate limiting is disabled (if that's reasonably simple to configure). See [1] for comments how to achieve that.


[1] https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/contrib/fedora/rpm/NetworkManager.conf


No ideas so far.
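
For reference, the settings described in [1] could be dropped into an image along these lines. This is a sketch under assumptions: the function name, the drop-in file names and the idea of writing into an image root directory are all illustrative; the [logging] keys follow [1], and the rate-limit options are the ones documented in journald.conf(5), where setting either to 0 turns rate limiting off.

```sh
#!/bin/sh
# Hypothetical helper: write NetworkManager TRACE logging and journald
# "no rate limiting" drop-ins below the given root directory.
# The file names are arbitrary; only the keys matter.
write_nm_debug_config() {
    root=$1
    mkdir -p "$root/etc/NetworkManager/conf.d" \
             "$root/etc/systemd/journald.conf.d"
    cat > "$root/etc/NetworkManager/conf.d/95-debug-logging.conf" <<'EOF'
[logging]
level=TRACE
domains=ALL
EOF
    cat > "$root/etc/systemd/journald.conf.d/no-rate-limit.conf" <<'EOF'
[Journal]
RateLimitIntervalSec=0
RateLimitBurst=0
EOF
}

# Usage: write_nm_debug_config /path/to/image-root
```

Both services read their configuration at startup, so the files need to be in place before NetworkManager and journald start.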

Comment 16 Adam Williamson 2019-01-11 07:02:49 UTC

"I am not aware of a known issue introduced between these versions."

Yeah, as I said, I actually tested an image built from Beta but with NM 1.12.4-1.fc29 added, that works OK. It doesn't seem like NetworkManager is directly the cause of this, but the bug was previously assigned to qemu and it definitely doesn't seem to be qemu either, and I can't really assign it to "wehaventgotaclue", so I thought I'd ping it over to NM just to bring you in at least :)

"Where are other logfiles of a failed run available? I am not familiar with openQA or how to run these images. Is there an easy way to reproduce this myself, or could you provide more logs?"

what logs do you need? the ones I have at present are all in serial0.txt - though actually, now I know just running 'dhclient' gets the network up, I could ditch the 'send the logs out via the serial line' thing and go back to uploading them properly.

"Thanks for enabling level=DEBUG logging (in comment 1), that is helpful/necessary."

I didn't actually do that; I think it must be set by default in the installer images / environment, or something.

"But if you happen to do new runs and modify the image, please configure level=TRACE and make sure that journald's rate limiting is disabled (if that's reasonably simple to configure). See [1] for comments how to achieve that."

I'll try and get that done somehow.

Comment 17 Adam Williamson 2019-01-11 07:03:45 UTC

"but I think that is not enough reason for getting such frequent failures. I presume the failure also happens when you boot only one VM (not many in parallel)?"

I haven't actually confirmed that yet (because doing a bunch of them in parallel is a lot quicker in determining whether a given image has the bug or not). I suspect the answer is "yes", but I'll check it tomorrow.

Comment 18 Adam Williamson 2019-01-11 20:33:41 UTC

So this seems to be bind!

I just did a diff between the packages in the Beta installer image and the packages in the 1015.n.0 installer image, marked off possible suspects on the list, and planned to work through them all one at a time...fortunately it turned out to be one that starts with a 'b' and not zlib :P
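
The package comparison described here can be approximated with sort and comm, given one package NVR per line for each image. A sketch only: how the two lists were actually produced is not stated in this bug, and `new_or_changed_packages` and the file names are illustrative.

```sh
#!/bin/sh
# Hypothetical helper: print entries that appear in the second package list
# but not in the first, i.e. packages that were added or changed version.
new_or_changed_packages() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # comm needs both inputs sorted with the same collation
    LC_ALL=C sort -u "$1" > "$old_sorted"
    LC_ALL=C sort -u "$2" > "$new_sorted"
    LC_ALL=C comm -13 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}

# Usage: new_or_changed_packages beta-packages.txt 20181015-packages.txt
```

Each line of output is a candidate to swap into a Beta-based image one at a time, as was done for NetworkManager, dhcp and the kernel earlier.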

An image built from the Beta tree but with bind-9.11.4-10.P2.fc29 added has the bug. The build of bind that's in the Beta is bind-9.11.4-5.P1.fc29 . So somewhere between 5.P1 and 10.P2, this bug appeared, somehow. Re-assigning to bind.

Comment 19 Adam Williamson 2019-01-12 00:13:33 UTC

OK, so further tests suggest this broke between bind-9.11.4-6.P1.fc29 and bind-9.11.4-7.P1.fc29 . -7.P1.fc29 wasn't actually run as an official build, so I did a scratch build from commit 595af1f3d54fa8efa75a2f814a80338e815577ad as a test. So to be as precise as I can be right now, this broke between:

6.P1.fc29: https://src.fedoraproject.org/rpms/bind/c/37943d075e242293cc171728930fd5a5d74783be?branch=master

and:

7.P1.fc29: https://src.fedoraproject.org/rpms/bind/c/595af1f3d54fa8efa75a2f814a80338e815577ad?branch=master

I'll dig into this some more tomorrow or Monday, I'm off out now.

Comment 20 Adam Williamson 2019-01-12 08:55:30 UTC

Latest finding: I did a build of bind 9.11.4-11.P2.fc30 - the build currently in Rawhide - but with --disable-crypto-rand . I was expecting that to not have the bug, but interestingly it does. So just using the argument that's meant to 'disable' the crypto-random thing that was added in -7 doesn't seem to help; perhaps some of the 'architectural' changes in the patch, which alter things a bit even when the new feature is disabled, are at fault here?

Comment 21 Adam Williamson 2019-01-12 16:19:07 UTC

Petr, since you wrote the patch in question, can you take a look at this? It's quite long and following everything it does is a slog for an outsider :) I'm going to look at it and try tweaking some bits that look *possibly* suspicious on Monday, but of course you may be able to figure it out faster than me! Thanks.

Comment 22 Adam Williamson 2019-01-12 22:53:54 UTC

Well hey, it looks like my first guess was lucky here.

I did a build of 9.11.4-11.P2.fc30 but with this block dropped from bind-9.11-rt46047.patch:

      /* Protect ourselves against unseeded PRNG */
      if (RAND_status() != 1) {
              FATAL_ERROR(__FILE__, __LINE__,
                          "OpenSSL pseudorandom number generator "
                          "cannot be initialized (see the `PRNG not "
                          "seeded' message in the OpenSSL FAQ)");
      }

and that makes the problem go away. So I suspect that error is happening during NM's dhclient invocation with nm-dhcp-helper and whatnot, and making the interface bring-up fail.

The obvious guess is that we're basically getting starved of entropy here, which kinda ties in with the 'it seems to happen with multiple simultaneous guests' thing - obviously if we have a dozen or two installer tests all starting up, reaching this point, and expecting enough entropy to seed their PRNGs, the host might run out of enough entropy to feed all of them, I guess.

Note that the block in question is not ifdef'ed - that is, it takes effect whether you have the new crypto-random feature that the patch introduces enabled or not. I'm not sure whether that's a bug, or intentional.
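
One rough way to test the entropy-starvation guess from inside a guest is to look at the kernel's entropy estimate around the time dhclient runs. A minimal sketch: `entropy_avail` is a made-up helper, and the estimate is only a hint, since nothing in this bug establishes which kernel interface OpenSSL's seeding actually uses.

```sh
#!/bin/sh
# Hypothetical helper: print the kernel's current entropy estimate in bits.
# The optional argument overrides the path, which is handy for testing.
entropy_avail() {
    cat "${1:-/proc/sys/kernel/random/entropy_avail}"
}

# Usage inside the guest, e.g. just before and after the DHCP attempt:
#   echo "entropy_avail=$(entropy_avail)"
```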

Comment 23 Richard W.M. Jones 2019-01-14 12:28:09 UTC

Hang on though, guests should be getting entropy from /dev/urandom which never* runs out.
Can you check the qemu command line of the guests and see if there is an
-object rng-random,filename=XXX and what XXX points to?

Comment 24 Daniel Berrangé 2019-01-14 12:35:40 UTC

(In reply to Richard W.M. Jones from comment #23)
> Hang on though, guests should be getting entropy from /dev/urandom which
> never* runs out.
> Can you check the qemu command line of the guests and see if there is an
> -object rng-random,filename=XXX and what XXX points to?

Or indeed check that a 'virtio-rng' device even exists on the QEMU command line. I'm not sure how openqa configures its QEMU guests...

Comment 25 Adam Williamson 2019-01-14 16:25:49 UTC

It's configurable. But in this case we are using virtio-rng. The entire qemu command line is:

/usr/bin/qemu-system-x86_64 -vga std -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw ac97 -global isa-fdc.driveA= -m 2048 -cpu Nehalem -device virtio-rng-pci -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -boot once=d,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 2 -enable-kvm -no-shutdown -vnc :92,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/2/raid/hd0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/2/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0

https://wiki.qemu.org/Features/VirtIORNG states:

"As of QEMU 1.3, the default backend is to use the host's /dev/random as a source of entropy."

Not /dev/urandom...

The question I had is, why are we on this path at all? What in bring-up of a network interface via NM relies on starting bind, and why?

Comment 26 Richard W.M. Jones 2019-01-14 16:33:20 UTC

Yes /dev/random was an unfortunate default in older versions of libvirt.  Cole
fixed this a while back:

commit 67f2b72723c242969c5282fcb9acf00cc01f2a54 (tag: v1.3.4-rc1)
Author: Cole Robinson <crobinso>
Date:   Wed Apr 13 15:09:30 2016 -0400

    conf: Drop restrictions on rng backend path
    
    Currently we only allow /dev/random and /dev/hwrng as host input
    for <rng><backend model='random'/> device. This was added after
    various upstream discussions in commit 4932ef45
    
    However this restriction has generated quite a few complaints over
    the years, so a new discussion was initiated:
    
    http://www.redhat.com/archives/libvir-list/2016-April/msg00987.html
    
    Several people suggested removing the restriction, and nobody really
    spoke up to defend it. So this patch drops the path restriction
    entirely
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1074464

This comes back to how these guests are being created by the CI
environment, but you'll probably want to change that so it creates
the guests with /dev/urandom as the backing for the RNG.

Comment 27 Adam Williamson 2019-01-14 17:03:02 UTC

We're concerned with the qemu default here, nothing to do with libvirt at all.

I can tweak openQA such that it uses /dev/urandom (in fact I just did that and it does seem to help), however, it still seems odd to me that initialization of a network interface relies on the availability of entropy. That doesn't, from a 'common sense' point of view, seem to make sense (and apart from anything else, wound up taking me four days to debug...)
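
For illustration, pointing the virtio-rng device at /dev/urandom on a raw qemu command line looks roughly like the following. This is not the actual openQA change, which is not shown in this bug; the helper name and the `rng0` id are arbitrary.

```sh
#!/bin/sh
# Hypothetical helper: print qemu arguments that back a virtio-rng device
# with the given host file (default /dev/urandom). "rng0" is an arbitrary id.
virtio_rng_args() {
    backend=${1:-/dev/urandom}
    printf '%s\n' "-object rng-random,filename=${backend},id=rng0 -device virtio-rng-pci,rng=rng0"
}

# Usage: these arguments would take the place of the bare
# "-device virtio-rng-pci" in the command line from comment 25:
#   qemu-system-x86_64 ... $(virtio_rng_args) ...
```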

Comment 28 Daniel Berrangé 2019-01-14 17:10:18 UTC

Another thing to bear in mind is that QEMU is the only hypervisor that provides a virtio-rng like solution.  

IOW, if something makes Fedora hang due to lack of entropy, then virtio-rng is not a general solution. The hang will still likely exist when run in VMware, Hyper-V, Xen, or the many public clouds based on QEMU that don't enable virtio-rng.

Other hypervisors have largely punted on the problem saying CPU "rdrand" support is good enough. I'll note the "Nehalem" CPU model openqa has set here does *not* have rdrand support either as it is too old - IIRC IvyBridge or newer is needed.
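
Whether a given guest actually sees rdrand can be checked from the flags line in /proc/cpuinfo. A small sketch, with a made-up helper name:

```sh
#!/bin/sh
# Hypothetical helper: succeed if the given CPU flag is listed in cpuinfo.
# The optional second argument overrides the path, which is handy for testing.
has_cpu_flag() {
    grep -E '^flags[[:space:]]*:' "${2:-/proc/cpuinfo}" | grep -qw -- "$1"
}

# Usage inside the guest:
#   has_cpu_flag rdrand && echo "rdrand available" || echo "no rdrand"
```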

Comment 29 Petr Menšík 2019-01-14 17:24:21 UTC

(In reply to Adam Williamson from comment #22)
> Well hey, it looks like my first guess was lucky here.
> 
> I did a build of 9.11.4-11.P2.fc30 but with this block dropped from
> bind-9.11-rt46047.patch:
> 
>       /* Protect ourselves against unseeded PRNG */
>       if (RAND_status() != 1) {
>               FATAL_ERROR(__FILE__, __LINE__,
>                           "OpenSSL pseudorandom number generator "
>                           "cannot be initialized (see the `PRNG not "
>                           "seeded' message in the OpenSSL FAQ)");
>       }
> 
> and that makes the problem go away. So I suspect that error is happening
> during NM's dhclient invocation with nm-dhcp-helper and whatnot, and making
> the interface bring-up fail.
> 
> The obvious guess is that we're basically getting starved of entropy here,
> which kinda ties in with the 'it seems to happen with multiple simultaneous
> guests' thing - obviously if we have a dozen or two installer tests all
> starting up, reaching this point, and expecting enough entropy to seed their
> PRNGs, the host might run out of enough entropy to feed all of them, I guess.
> 
> Note that the block in question is not ifdef'ed - that is, it takes effect
> whether you have the new crypto-random feature that the patch introduces
> enabled or not. I'm not sure whether that's a bug, or intentional.

This is not intentional; it should be possible to disable it. Note, however, that bind is built twice: once as the normal bind build (build directory), and then as a library-only, single-threaded build to satisfy the dhcp dependency (export-libs directory).

However, RAND_status() manual page states it should never get out of random data if system supports urandom. This is the reason why there is no runtime option to disable it; it should not be required.

I admit I did not check how well dhcp obtains random data when entropy is depleted. There was a similar issue with the bind-chroot subpackage, when /dev/urandom was missing in the chroot. It usually works, but fails when /dev/random is depleted, which might happen much more easily on a VM without a "hardware" entropy module. One reason might be that /dev/urandom is unreachable when this happens. Could it be missing in the boot image? If it is not missing, is access to it allowed by SELinux? I have to check those possibilities. Is there an strace of dhclient in this scenario?
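
The first of those possibilities is cheap to check from a console in the installer environment. A minimal sketch with a made-up helper name; it only tests that the node exists as a character device and is readable by the caller, so it does not by itself settle the SELinux question.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the given path is a readable character device.
readable_char_device() {
    [ -c "$1" ] && [ -r "$1" ]
}

# Usage in the installer environment:
#   readable_char_device /dev/urandom || echo "/dev/urandom missing or unreadable"
```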

(In reply to Richard W.M. Jones from comment #23)
> Hang on though, guests should be getting entropy from /dev/urandom which
> never* runs out.
> Can you check the qemu command line of the guests and see if there is an
> -object rng-random,filename=XXX and what XXX points to?

Well, the real /dev/random should only use real random data; I think it should not require an rng-random device change.
Starting rngd from rng-tools should help, but that would be just a workaround, not a proper fix.

Comment 30 Adam Williamson 2019-01-14 17:31:47 UTC

"However, RAND_status() manual page states it should never get out of random data if system supports urandom"

Where does it say that? My copy says just:

"RAND_status() indicates whether or not the random generator has been sufficiently seeded. If not, functions such as RAND_bytes(3) will fail."

Nothing about /dev/urandom. There's an implication in the overall text that the seeding is done from "trusted system entropy sources", but what those *are* does not appear to be stated in that page.

Comment 31 Petr Menšík 2019-01-14 17:42:45 UTC

(In reply to Adam Williamson from comment #30)
> "However, RAND_status() manual page states it should never get out of random
> data if system supports urandom"
> 
> Where does it say that? My copy says just:
> 
> "RAND_status() indicates whether or not the random generator has been
> sufficiently seeded. If not, functions such as RAND_bytes(3) will fail."
> 
> Nothing about /dev/urandom. There's an implication in the overall text that
> the seeding is done from "trusted system entropy sources", but what those
> *are* does not appear to be stated in that page.

It was there in Fedora 27. It is not present in the Fedora 29 manual, however.

       OpenSSL makes sure that the PRNG state is unique for each thread. On
       systems that provide "/dev/urandom", the randomness device is used to
       seed the PRNG transparently. However, on all other systems, the
       application is responsible for seeding the PRNG by calling RAND_add(),
       RAND_egd(3) or RAND_load_file(3).

Comment 32 Adam Williamson 2019-01-14 17:45:07 UTC

Might be interesting to git blame that change :)

Comment 33 Petr Menšík 2019-01-14 19:27:19 UTC

Anyway, the patch is indeed missing additional #ifdef added later in upstream [1]. I think it already disables usage of RAND_bytes from OpenSSL, but it can fail on this fatal check anyway.

It might be worth to disable crypto-rand for DHCP library build, because there might be more limited source of entropy more often.

1. https://gitlab.isc.org/isc-projects/bind9/commit/8a98277811e

Comment 34 Adam Williamson 2019-01-14 19:54:04 UTC

"Anyway, the patch is indeed missing additional #ifdef added later in upstream [1]."

OK - so that #ifdef would've made my test where I used a bind built with `--disable-crypto-rand` work, but would not have solved the problem for our actual official builds and images.

"It might be worth to disable crypto-rand for DHCP library build, because there might be more limited source of entropy more often."

I don't really have a sufficient overview to judge this, but if there is no security sensitivity on the DHCP library path, it seems reasonable on the face of it.

In the meantime I got an openQA patch to use /dev/urandom merged which should hopefully address the issue for openQA purposes, I am pushing that out to our openQA instances as we speak.

Comment 35 Petr Menšík 2019-01-15 21:42:33 UTC

A normal address request by dhclient does not call RAND_bytes even once. It always checks RAND_status(), however, which can fail in the described scenario.
Because dhclient can run quite early, a lack of good entropy might be common.

I think there is no reason to demand high-quality random data for dhclient, which never uses it; I also doubt the dhcp server would use it. When trying a simple setup with dynamic DNS update, it never called RAND_bytes. I did not try every possible configuration, but I guess keys would be generated by ddns-confgen, where a secure random generator would be used. I am turning off crypto random for the dhcp-specific build.

Comment 36 Adam Williamson 2019-01-15 22:27:52 UTC

Sounds good. Thanks a lot!

Comment 37 Fedora Update System 2019-01-16 18:48:05 UTC

bind-9.11.4-13.P2.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-15b8002c67

Comment 38 Fedora Update System 2019-01-16 18:48:42 UTC

bind-9.11.4-13.P2.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2019-200865dc76

Comment 39 Fedora Update System 2019-01-17 02:11:29 UTC

bind-9.11.4-13.P2.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-15b8002c67

Comment 40 Fedora Update System 2019-01-17 02:50:29 UTC

bind-9.11.4-13.P2.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-200865dc76

Comment 41 Fedora Update System 2019-01-18 02:14:11 UTC

bind-9.11.4-13.P2.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 42 Fedora Update System 2019-01-24 03:31:55 UTC

bind-9.11.4-13.P2.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.
