Bug 1188423

Summary: RHEL / CentOS 7-based instances lose their default IPv4 gateway
Product: Red Hat Enterprise Linux 7
Reporter: Joe <joe>
Component: dhcp
Assignee: Pavel Zhukov <pzhukov>
Status: CLOSED WONTFIX
QA Contact: qe-baseos-daemons
Severity: medium
Priority: medium
Version: 7.0
CC: ihrachys, kbsingh, srevivo
Target Milestone: pre-dev-freeze
Keywords: FastFix
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-07-18 08:27:52 UTC
Type: Bug
Bug Blocks: 1380362    
Attachments:
dhcp pcap (flags: none)

Description Joe 2015-02-02 20:41:14 UTC
Description of problem:

This isn't exactly an RDO bug but I'm not sure where else to file this. Per the summary in this thread:

http://lists.openstack.org/pipermail/openstack-operators/2015-January/006032.html

I had RHEL and CentOS 7 based instances losing their default IPv4 gateway after a random amount of time. The problem appears to be caused by the `valid_lft` and `preferred_lft` lifetimes that are now set on the interface's address, which is new in version 7.

After posting the message, I had someone email me off-list and confirm they had the same issue.

The patch that I have used for existing instances can be found here:

https://gist.github.com/jtopjian/589217cee0ba8f09825c

Version-Release number of selected component (if applicable):

* RHEL and CentOS 7 instances
* Havana and Icehouse OpenStack clouds using dnsmasq 2.59


How reproducible:

Happens randomly, but with the right environment, it's inevitable.


Steps to Reproduce:

Run RHEL or CentOS 7 in an OpenStack environment, best with the default DHCP lease of 60 seconds, and just wait. You can also run `watch ip a` and when the connection drops, you'll see `valid_lft` between 0-2 seconds.
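
For an unattended run, the `watch ip a` check above could be replaced by a small logger. This is only a sketch: the function names, the `eth0` default, and the 5-second threshold are assumptions chosen for illustration, and it relies on the one-line output format of `ip -4 -o addr show`.

```shell
# Print the remaining valid_lft (seconds, or "forever") for each IPv4
# address found in `ip -4 -o addr show` output read from stdin.
extract_valid_lft() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "valid_lft") {
                v = $(i + 1)
                sub(/sec$/, "", v)   # "42sec" -> "42"; "forever" is unchanged
                print v
            }
    }'
}

# Poll one interface once per second and log a timestamped line when the
# lifetime is at or below the threshold, or when no IPv4 address is left.
# Runs until interrupted. Interface and threshold defaults are assumptions.
watch_valid_lft() {
    local dev="${1:-eth0}"
    local threshold="${2:-5}"
    local lft now
    while sleep 1; do
        lft=$(ip -4 -o addr show dev "$dev" | extract_valid_lft | head -n 1)
        now=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
        case "$lft" in
            '')       echo "$now $dev has no IPv4 address" ;;
            *[!0-9]*) ;;  # "forever" or other non-numeric value: nothing to log
            *)        if [ "$lft" -le "$threshold" ]; then
                          echo "$now $dev valid_lft=${lft}s"
                      fi ;;
        esac
    done
}
```

Running `watch_valid_lft eth0 5 >> lft.log` in the background would leave a record of when the lifetime approached zero, even if the SSH session is cut.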

Actual results:

The instance loses its IPv4 address and thus its gateway. The DHCP lease is renewed soon after, but the gateway is never re-added.

Expected results:

`valid_lft` should never hit 0.

Comment 1 Ihar Hrachyshka 2015-04-20 14:22:49 UTC
dhclient-script belongs to the dhcp package in RHEL, not RDO. Moving to the appropriate component.

Comment 3 Jiri Popelka 2015-04-20 18:02:04 UTC
(In reply to Joe from comment #0)
> The patch that I have used for existing instances can be found here:
> https://gist.github.com/jtopjian/589217cee0ba8f09825c

I don't want to remove the lifetimes completely;
they were added because of bug #1032809.

> Run RHEL or CentOS 7 in an OpenStack environment, best with the default DHCP
> lease of 60 seconds, and just wait. You can also run `watch ip a` and when
> the connection drops, you'll see `valid_lft` between 0-2 seconds.
> 
> Actual results:
> 
> The instance loses its IPv4 address and thus its gateway. The DHCP lease is
> renewed soon after, but the gateway is never re-added.

I still don't understand what's happening there.
My only idea is that the address is not properly renewed (the client gets no response to its unicast DHCPREQUESTs), and then during rebinding (when it sends broadcast DHCPREQUESTs) the address is removed before rebinding finishes.

I'd need to see either dhclient output or some packet dump to know more.

But if giving the address some more lifetime works around the problem, then I'm probably fine with that.

Could you add the following line somewhere after '# ### MAIN' to see if that helps?

[[ "${new_dhcp_lease_time}" -lt "4294967235" ]] && new_dhcp_lease_time=$((new_dhcp_lease_time + 60))
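
The constant in that line is 2^32 - 1 - 60, so the guard appears to keep the padded value from exceeding the 32-bit maximum of 4294967295. The same logic can be sketched as a standalone function for testing outside dhclient-script; the name `pad_lease_time` and the configurable grace period are illustrative assumptions, not part of the proposed patch.

```shell
# Return the lease time padded by a grace period (default 60 seconds),
# leaving it unchanged when padding would reach or exceed the 32-bit
# maximum. Mirrors the one-line guard above; the function name and the
# optional second argument are hypothetical.
pad_lease_time() {
    local lease="$1"
    local grace="${2:-60}"
    local max=4294967295
    if [ "$lease" -lt "$((max - grace))" ]; then
        echo "$((lease + grace))"
    else
        echo "$lease"
    fi
}
```

With the 60-second lease from the report, `pad_lease_time 60` gives the address 120 seconds of lifetime, so a renewal that arrives slightly late should no longer find the address already expired.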

Comment 4 Jiri Popelka 2015-04-21 08:45:13 UTC
I've added this commit upstream (Fedora):
http://pkgs.fedoraproject.org/cgit/dhcp.git/commit/?id=d12e0eb05e510268ce9b8dcb839e27d5eca9aff5

But it'd still be nice to see some dhclient output or packet dump when the problem occurs.

Comment 5 Joe 2015-04-21 14:18:32 UTC
Hi Jiri,

Thank you for looking this over and for adding the additional time.

I'm setting up a test instance in my OpenStack cloud, will run tcpdump, and wait for this issue to happen. It usually manifests itself within 24 hours.

From reviewing my notes, the core issue is that when the timeout happens, the instance's default gateway is dropped. After the timeout has lapsed and the late DHCP renewal arrives, the instance re-adds its IP address but not its default gateway.

I'll post some log entries and a packet dump once I have them.

Thanks,
Joe
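
The symptom described in this comment (address restored, default route missing) could be detected and repaired from a console session with something like the sketch below. The function names are hypothetical, and the gateway and interface must be supplied by the caller; this is a manual stopgap under those assumptions, not the fix discussed in this bug.

```shell
# Succeed if the `ip -4 route show default` output on stdin contains a
# default route.
has_default_route() {
    grep -q '^default '
}

# Re-add the default route when it is missing. The gateway address and
# interface are caller-supplied assumptions; requires root.
ensure_default_route() {
    local gateway="$1"
    local dev="$2"
    if ip -4 route show default | has_default_route; then
        return 0
    fi
    echo "default route missing; re-adding via $gateway on $dev" >&2
    ip -4 route replace default via "$gateway" dev "$dev"
}
```

For example, `ensure_default_route 10.0.0.1 eth0` (both values are placeholders) does nothing while a default route exists and restores it otherwise.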

Comment 6 Joe 2015-04-21 20:36:36 UTC
Created attachment 1017114 [details]
dhcp pcap

Comment 7 Joe 2015-04-21 20:40:43 UTC
Hi Jiri,

I was able to semi-reproduce this problem. While leaving "watch ip a" running, my session was cut and the last output on the screen was a valid_lft and preferred_lft of 1 second. I think it's safe to assume that there was a timeout.

What's odd about this case is that when I logged back in, I had a default route. Normally the default route doesn't exist and I have to log in through an out-of-band method to re-add it.

I used the latest CentOS 7 image from here:

http://cloud.centos.org/centos/7/images/

I haven't yet checked whether there has been a dhclient update since I first ran into this issue.

I'm currently re-running the tests to see if I will run into an occurrence where the default gateway is _not_ re-added.

Attached is a pcap file of DHCP traffic. AFAICT, the bump in connectivity happened at packet 1523.

Thanks,
Joe

Comment 8 Joe 2015-04-23 19:52:41 UTC
I was able to reproduce this problem entirely yesterday -- lost gateway and all. It was at the end of the day when I noticed it had happened, so I just killed tcpdump and went home.

When I came back in today, I noticed the gateway had returned! I have no idea when, though. I even have a screenshot of my console showing both no default route and a default route if you'd like to double-check. :)

The packet dump looks the same as the last one I uploaded. I'd be happy to still upload it, though.

Comment 16 Pavel Zhukov 2019-07-18 08:27:52 UTC
This bug was evaluated by the sub-system and was not considered a priority for the release, so it is being closed as WONTFIX. Feel free to re-open the bug if there is a business reason to deliver a fix for this issue.
Note that the workaround for this issue is included in Red Hat Enterprise Linux 8. Please check whether the issue exists there and open a new bug if it does.