Bug 1391530

Summary: RHV engine loses IP address after DHCP lease expires
Product: Red Hat Quickstart Cloud Installer
Reporter: Tasos Papaioannou <tpapaioa>
Component: Installation - RHEV
Assignee: Fabian von Feilitzsch <fabian>
Status: CLOSED ERRATA
QA Contact: Tasos Papaioannou <tpapaioa>
Severity: medium
Docs Contact: Dan Macpherson <dmacpher>
Priority: unspecified
Version: 1.1
CC: bthurber, fabian, jmatthew, qci-bugzillas
Target Milestone: ---
Keywords: Triaged
Target Release: 1.1
Hardware: All
OS: All
Last Closed: 2017-02-28 01:40:48 UTC
Type: Bug

Description Tasos Papaioannou 2016-11-03 13:39:14 UTC
Description of problem:

12 hours after a RHV engine+hypervisor deployment, the provisioning NIC on the RHV engine loses its IP address.

Version-Release number of selected component (if applicable):

QCI-1.1-RHEL-7-20161101.t.0

How reproducible:

100%

Steps to Reproduce:
1.) Deploy RHV.
2.) Wait 12 hours (the default lease time of the Satellite's dhcpd service).
3.) Verify that the RHV engine/manager can no longer be accessed by its provisioning network IP address. Accessing the system via console shows that dhclient is no longer running:

# ps -eF|grep dhclient|grep -v grep
#


Actual results:

RHV engine loses IP address 12 hours after successful deployment.

Expected results:

No loss of IP address.

Additional info:

After the deployment completes, but before the DHCP lease expires, dhclient on the RHV engine is using the NetworkManager dhcp helper script, config file, pid file, and lease file, even though NetworkManager was disabled during the deployment:

# ps -eF|grep dhclient
root       771     1  0 28204 15836   0 19:24 ?        00:00:00 /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03-eth0.lease -cf /var/lib/NetworkManager/dhclient-eth0.conf eth0

An strace of this dhclient process shows that dhclient fails when trying to renew the lease because it can't communicate back to the NetworkManager service:

****
771   20:08:30 select(22, [5 6], [], NULL, {18591, 385808}) = 0 (Timeout)
771   01:18:22 sendto(3, "<30>Nov  3 01:18:22 dhclient[771]: DHCPREQUEST on eth0 to 192.168.100.1 port 67 (xid=0x6ba6d292)", 96, MSG_NOSIGNAL, NULL, 0) = 96
[...]
771   01:18:22 sendto(3, "<30>Nov  3 01:18:22 dhclient[771]: DHCPACK from 192.168.100.1 (xid=0x6ba6d292)", 78, MSG_NOSIGNAL, NULL, 0) = 78
771   01:18:22 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fac9b1d6b50) = 32090
[...]
32090 01:18:22 execve("/usr/libexec/nm-dhcp-helper", ["/usr/libexec/nm-dhcp-helper"], [...]
[...]
32090 01:18:22 write(2, "Error: could not connect to NetworkManager D-Bus socket: Could not connect: Connection refused\n", 95) = 95
32090 01:18:22 write(2, "Fatal error occured, killing dhclient instance with pid 771.\n", 61) = 61
32090 01:18:22 kill(771, SIGTERM)       = 0
****
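
The first two strace lines above are consistent with each other: the select() timeout appears to be dhclient sleeping until its next renewal attempt. The arithmetic can be checked with a short sketch. The calendar date of the first timestamp (Nov 2, 2016) is an assumption inferred from the "Nov  3 01:18:22" syslog text, since strace prints only the time of day, and the function name is illustrative.

```python
from datetime import datetime, timedelta


def wakeup_time(start: datetime, tv_sec: int, tv_usec: int) -> datetime:
    """Return when a select() call that began at `start` would time out."""
    return start + timedelta(seconds=tv_sec, microseconds=tv_usec)


# Assumed date: strace shows only 20:08:30; Nov 2, 2016 is inferred from the
# syslog message logged after the timeout.
select_started = datetime(2016, 11, 2, 20, 8, 30)

# Timeout taken from the strace line: {18591, 385808} (seconds, microseconds).
renewal = wakeup_time(select_started, 18591, 385808)
print(renewal)  # 2016-11-03 01:18:21.385808
```

The result falls within a second of the 01:18:22 DHCPREQUEST, which fits strace's whole-second timestamps.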

Stopping and disabling NetworkManager is not enough to ensure that dhclient stops using these files. After rebooting or manually restarting eth0, dhclient no longer uses the NetworkManager files:

root     14465     1  0 28204 12780   0 13:25 ?        00:00:00 /sbin/dhclient -H mac525400843f0d -1 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0
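
The difference between the two dhclient command lines can be checked mechanically. The following is a minimal sketch, not part of the deployment code: it looks for dhclient's -sf (script file) option and compares it with the NetworkManager helper path seen in the first process listing. The function names are hypothetical.

```python
import shlex

# Helper script path taken from the first ps listing in this report.
NM_HELPER = "/usr/libexec/nm-dhcp-helper"


def dhclient_script(cmdline: str):
    """Return the value of dhclient's -sf option, or None if it is absent."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args[:-1]):
        if arg == "-sf":
            return args[i + 1]
    return None


def started_by_networkmanager(cmdline: str) -> bool:
    """True if dhclient was launched with the NetworkManager helper script."""
    return dhclient_script(cmdline) == NM_HELPER


before = ("/sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper "
          "-pf /var/run/dhclient-eth0.pid "
          "-lf /var/lib/NetworkManager/dhclient-5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03-eth0.lease "
          "-cf /var/lib/NetworkManager/dhclient-eth0.conf eth0")
after = ("/sbin/dhclient -H mac525400843f0d -1 -q "
         "-lf /var/lib/dhclient/dhclient--eth0.lease "
         "-pf /var/run/dhclient-eth0.pid eth0")

print(started_by_networkmanager(before))  # True
print(started_by_networkmanager(after))   # False
```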

If the deployment rebooted the engine or restarted the interface after disabling NetworkManager, the IP address would no longer disappear 12 hours after a successful deployment.

I don't see this issue on the hypervisor or on a self-hosted engine.

Comment 2 Fabian von Feilitzsch 2016-11-16 21:09:58 UTC
https://github.com/fusor/ansible-ovirt/pull/11

Restarting the network service seems to make dhclient pick up different networking scripts.
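
As a rough illustration of that ordering (not the contents of the pull request, which are not reproduced here), the sequence could be sketched as below. The assumption is that the legacy service is named "network", as on RHEL 7; the function names are hypothetical, and the sketch only prints the commands unless dry_run is set to False.

```python
import subprocess


def remediation_commands(service: str = "network"):
    """Build the command sequence: disable NetworkManager, then restart networking."""
    return [
        ["systemctl", "stop", "NetworkManager"],
        ["systemctl", "disable", "NetworkManager"],
        # Restarting the network service after NetworkManager is stopped is
        # what makes dhclient start again without the NetworkManager helper.
        ["systemctl", "restart", service],
    ]


def apply(commands, dry_run: bool = True):
    """Print each command, or run it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


apply(remediation_commands())  # dry run: prints the three commands
```

The order matters: restarting the network service before NetworkManager is stopped would leave the original dhclient process in place.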

Comment 3 John Matthews 2016-11-22 13:39:27 UTC
Expected in 11/21 ISO

Comment 4 Tasos Papaioannou 2016-11-22 19:34:01 UTC
Verified on QCI-1.1-RHEL-7-20161121.t.0. After a successful RHV (engine+hypervisor) deployment, the engine is using the default non-NetworkManager dhclient configuration:

# ps -eF | grep dhclient | grep -v grep
root     14355     1  0 28206 12780   0 18:28 ?        00:00:00 /sbin/dhclient -H mac52540099e2a1 -1 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0
#

Comment 7 errata-xmlrpc 2017-02-28 01:40:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335