1188423 – RHEL / Centos 7-based instances lose their default IPv4 gateway

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	dhcp
Sub Component:
Version:	7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	pre-dev-freeze
Target Release:	---
Assignee:	Pavel Zhukov
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1380362

Reported:	2015-02-02 20:41 UTC by Joe
Modified:	2019-07-18 08:37 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-18 08:27:52 UTC
Target Upstream Version:
Embargoed:

Attachments
dhcp pcap (26.40 KB, application/x-gzip), 2015-04-21 20:36 UTC, Joe

Description Joe 2015-02-02 20:41:14 UTC

Description of problem:

This isn't exactly an RDO bug but I'm not sure where else to file this. Per the summary in this thread:

http://lists.openstack.org/pipermail/openstack-operators/2015-January/006032.html

I had RHEL- and CentOS 7-based instances losing their default IPv4 gateway after a random amount of time. It looks like the problem is caused by `valid_lft` and `preferred_lft` times being set on the interfaces, which is new in version 7.

After posting the message, I had someone email me off-list and confirm they had the same issue.

The patch that I have used for existing instances can be found here:

https://gist.github.com/jtopjian/589217cee0ba8f09825c

Version-Release number of selected component (if applicable):

* RHEL and CentOS 7 instances
* Havana and Icehouse OpenStack clouds using dnsmasq 2.59


How reproducible:

Happens randomly, but with the right environment, it's inevitable.


Steps to Reproduce:

Run RHEL or CentOS 7 in an OpenStack environment, best with the default DHCP lease of 60 seconds, and just wait. You can also run `watch ip a` and when the connection drops, you'll see `valid_lft` between 0-2 seconds.
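For anyone who would rather log this than watch a terminal, here is a rough sketch of the same check as a script. The function names and the `eth0` interface are placeholders I made up for illustration; only `ip addr`, `ip route`, and the `valid_lft` field come from the report above.

```shell
#!/bin/bash
# Hypothetical helpers; names and the default interface are assumptions.

# Read `ip addr` output on stdin and print each valid_lft value,
# with the trailing "sec" removed ("forever" is passed through as-is).
parse_valid_lft() {
    grep -o 'valid_lft [0-9a-z]*' | awk '{ sub(/sec$/, "", $2); print $2 }'
}

# Print one timestamped line with the remaining lifetime of the first
# IPv4 address on the interface and the current default route, if any.
log_lease_state() {
    local dev="${1:-eth0}"
    local lft route
    lft=$(ip -4 addr show dev "${dev}" | parse_valid_lft | head -n 1)
    route=$(ip -4 route show default)
    printf '%s valid_lft=%s default_route=%s\n' \
        "$(date -u +%FT%TZ)" "${lft:-none}" "${route:-MISSING}"
}

# Example loop (not run here):
#   while sleep 1; do log_lease_state eth0; done >> lease.log
```

A line with `valid_lft=none` or `default_route=MISSING` in the log would mark the moment the address or the gateway went away.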

Actual results:

The instance loses its IPv4 address and thus its gateway. The DHCP lease is renewed soon after, but the gateway is never re-added.

Expected results:

`valid_lft` should never hit 0.

Comment 1 Ihar Hrachyshka 2015-04-20 14:22:49 UTC

dhclient-script belongs to the dhcp package in RHEL, not RDO. Moving to the appropriate component.

Comment 3 Jiri Popelka 2015-04-20 18:02:04 UTC

(In reply to Joe from comment #0)
> The patch that I have used for existing instances can be found here:
> https://gist.github.com/jtopjian/589217cee0ba8f09825c

I don't want to remove the lifetimes completely;
they were added because of bug #1032809.

> Run RHEL or CentOS 7 in an OpenStack environment, best with the default DHCP
> lease of 60 seconds, and just wait. You can also run `watch ip a` and when
> the connection drops, you'll see `valid_lft` between 0-2 seconds.
> 
> Actual results:
> 
> The instance loses its IPv4 address and thus its gateway. The DHCP lease is
> renewed soon after, but the gateway is never re-added.

I still don't understand what's happening there.
The only idea I have is that the address is not properly renewed (the client gets no response to its unicast DHCPREQUESTs) and then, during rebinding (when it sends broadcast DHCPREQUESTs), the address is removed before rebinding finishes.

I'd need to see either dhclient output or some packet dump to know more.
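One possible way to collect such a packet dump is sketched below. The interface name and output file are placeholders, and the wrapper function is made up for this example; the filter simply matches the standard DHCP ports (67 for the server, 68 for the client).

```shell
#!/bin/bash
# Assumptions for this sketch: interface and file name are placeholders.
DHCP_IFACE="${DHCP_IFACE:-eth0}"
DHCP_PCAP="${DHCP_PCAP:-dhcp.pcap}"
# Capture filter matching DHCP/BOOTP traffic only.
DHCP_FILTER='udp port 67 or udp port 68'

capture_dhcp() {
    # -n: skip name resolution; -w: write raw packets to a file.
    tcpdump -n -i "${DHCP_IFACE}" -w "${DHCP_PCAP}" "${DHCP_FILTER}"
}

# Run as root inside the instance and leave it going until the
# address drops (not run here):
#   capture_dhcp
```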

But if giving the address some more lifetime works around the problem, then I'm probably fine with that.

Could you add the following line somewhere after '# ### MAIN' to see if that helps?

[[ "${new_dhcp_lease_time}" -lt "4294967235" ]] && new_dhcp_lease_time=$((new_dhcp_lease_time + 60))
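For readers wondering about the constant: 4294967235 is 4294967295 (the largest 32-bit unsigned value, which DHCP uses to mean an infinite lease) minus 60, so the guard presumably just keeps the added minute from pushing the value past 32 bits. Below is the same logic spelled out as a small function. The names `UINT32_MAX`, `GRACE`, and `extend_lease_time` are invented for this sketch and are not part of dhclient-script.

```shell
#!/bin/bash
# Illustrative only; these names do not exist in dhclient-script.
UINT32_MAX=4294967295
GRACE=60

# Print the lease time with GRACE seconds added, unless adding it
# could exceed the 32-bit limit, in which case print it unchanged.
extend_lease_time() {
    local lease="$1"
    if [[ "${lease}" -lt $((UINT32_MAX - GRACE)) ]]; then
        lease=$((lease + GRACE))
    fi
    echo "${lease}"
}

extend_lease_time 60          # prints 120
extend_lease_time 4294967295  # prints 4294967295 (left untouched)
```

With the 60-second lease from the report, the address lifetime handed to the kernel would become 120 seconds, which leaves room for a late renewal.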

Comment 4 Jiri Popelka 2015-04-21 08:45:13 UTC

I've added this commit upstream (Fedora)
http://pkgs.fedoraproject.org/cgit/dhcp.git/commit/?id=d12e0eb05e510268ce9b8dcb839e27d5eca9aff5

But it'd still be nice to see some dhclient output or packet dump when the problem occurs.

Comment 5 Joe 2015-04-21 14:18:32 UTC

Hi Jiri,

Thank you for looking this over and for adding the additional time.

I'm setting up a test instance in my OpenStack cloud, will run tcpdump, and wait for this issue to happen. It usually manifests itself within 24 hours.

From reviewing my notes, the core issue is that when the timeout happens, the default gateway of the instance/VM is dropped. After the timeout has lapsed and the late DHCP renewal arrives at the instance, the instance re-adds its IP but not its default gateway.

I'll post some log entries and a packet dump once I have them.

Thanks,
Joe

Comment 6 Joe 2015-04-21 20:36:36 UTC

Created attachment 1017114
dhcp pcap

Comment 7 Joe 2015-04-21 20:40:43 UTC

Hi Jiri,

I was able to semi-reproduce this problem. While leaving "watch ip a" running, my session was cut and the last output on the screen was a valid_lft and preferred_lft of 1 second. I think it's safe to assume that there was a timeout.

What's odd about this case is that when I logged back in, I had a default route. Normally the default route doesn't exist and I have to log in through an out-of-band method to re-add it.

I used the latest CentOS 7 image from here:

http://cloud.centos.org/centos/7/images/

I haven't yet looked if there has been a dhclient update since I first ran into this issue.

I'm currently re-running the tests to see if I will run into an occurrence where the default gateway is _not_ re-added.

Attached is a pcap file of DHCP traffic. AFAICT, the bump in connectivity happened at packet 1523.

Thanks,
Joe

Comment 8 Joe 2015-04-23 19:52:41 UTC

I was able to reproduce this problem entirely yesterday -- lost gateway and all. It was at the end of the day when I noticed it had happened, so I just killed tcpdump and went home.

When I came back in today, I noticed the gateway had returned! I have no idea when, though. I even have a screenshot of my console showing both no default route and a default route if you'd like to double-check. :)

The packet dump looks the same as the last one I uploaded. I'd be happy to still upload it, though.

Comment 16 Pavel Zhukov 2019-07-18 08:27:52 UTC

This bug was evaluated by the sub-system and was not considered as a priority for the release, so it's being closed now as WONTFIX. Feel free to re-open the bug if there is a business reason to deliver a fix for this issue.
Note that the workaround for this issue is included in Red Hat Enterprise Linux 8. Please check whether the issue exists there and open a new bug if that is the case.
