Bug 1188423

Summary:

RHEL / CentOS 7-based instances lose their default IPv4 gateway

Product:

Red Hat Enterprise Linux 7

Reporter:

Joe <joe>

Component:

dhcp

Assignee:

Pavel Zhukov <pzhukov>

Status:

CLOSED WONTFIX

QA Contact:

qe-baseos-daemons

Severity:

medium

Priority:

medium

Version:

7.0

CC:

ihrachys, kbsingh, srevivo

Target Milestone:

pre-dev-freeze

Keywords:

FastFix

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2019-07-18 08:27:52 UTC

Type:

Bug

Bug Blocks:

1380362

Attachments:

Description	Flags
dhcp pcap	none

Description Joe 2015-02-02 20:41:14 UTC

Description of problem:

This isn't exactly an RDO bug but I'm not sure where else to file this. Per the summary in this thread:

http://lists.openstack.org/pipermail/openstack-operators/2015-January/006032.html

I had RHEL and CentOS 7 based instances losing their default IPv4 gateway after a random amount of time. The problem appears to be caused by `valid_lft` and `preferred_lft` lifetimes being set on the interfaces, which is new in version 7.

After posting the message, I had someone email me off-list and confirm they had the same issue.

The patch that I have used for existing instances can be found here:

https://gist.github.com/jtopjian/589217cee0ba8f09825c

Version-Release number of selected component (if applicable):

* RHEL and CentOS 7 instances
* Havana and Icehouse OpenStack clouds using dnsmasq 2.59


How reproducible:

Happens randomly, but with the right environment, it's inevitable.


Steps to Reproduce:

Run RHEL or CentOS 7 in an OpenStack environment, best with the default DHCP lease of 60 seconds, and just wait. You can also run `watch ip a` and when the connection drops, you'll see `valid_lft` between 0-2 seconds.
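
The `watch ip a` check can also be scripted so that a low lifetime is logged instead of having to be caught on screen. The following is a minimal sketch, not part of the original report: the function names `extract_valid_lft` and `lft_is_low`, the interface name `eth0`, and the 2-second threshold are assumptions chosen for illustration.

```shell
# Reads `ip addr` output on stdin and prints one valid_lft value per
# address, in seconds ("forever" is printed unchanged).
extract_valid_lft() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "valid_lft") {
                v = $(i + 1)
                sub(/sec$/, "", v)
                print v
            }
        }
    }'
}

# Succeeds when $1 is a number of seconds at or below the threshold $2
# (default 2). Non-numeric lifetimes such as "forever" are never "low".
lft_is_low() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -le "${2:-2}" ]
}

# Example polling loop (eth0 is an assumed interface name):
#   while sleep 1; do
#       for lft in $(ip -4 addr show dev eth0 | extract_valid_lft); do
#           lft_is_low "$lft" && echo "$(date -u) valid_lft=${lft}s"
#       done
#   done
```

Keeping the parsing in a function that reads stdin means it can be checked against saved `ip a` output without waiting for the problem to recur.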

Actual results:

The instance loses its IPv4 address and thus its gateway. The DHCP lease is renewed soon after, but the gateway is never re-added.

Expected results:

`valid_lft` should never hit 0.

Comment 1 Ihar Hrachyshka 2015-04-20 14:22:49 UTC

dhclient-script belongs to the dhcp package in RHEL, not RDO. Moving to the appropriate component.

Comment 3 Jiri Popelka 2015-04-20 18:02:04 UTC

(In reply to Joe from comment #0)
> The patch that I have used for existing instances can be found here:
> https://gist.github.com/jtopjian/589217cee0ba8f09825c

I don't want to remove the lifetimes completely;
they were added because of bug #1032809.

> Run RHEL or CentOS 7 in an OpenStack environment, best with the default DHCP
> lease of 60 seconds, and just wait. You can also run `watch ip a` and when
> the connection drops, you'll see `valid_lft` between 0-2 seconds.
> 
> Actual results:
> 
> The instance loses its IPv4 address and thus its gateway. The DHCP lease is
> renewed soon after, but the gateway is never re-added.

I still don't understand what's happening there.
The only idea I have is that the address is not properly renewed (the client doesn't get any response to its unicast DHCPREQUESTs) and then, during rebinding (sending broadcast DHCPREQUESTs), the address is removed before rebinding finishes.
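
To put rough numbers on the renew/rebind hypothesis: RFC 2131 gives default timers of T1 = 0.5 x lease (start renewing by unicast) and T2 = 0.875 x lease (start rebinding by broadcast), with some random fuzz applied. The sketch below only computes those defaults; the function names are hypothetical, and whether the DHCP server in this environment overrides T1/T2 is not known from this report.

```shell
# Default DHCP timers per RFC 2131, in whole seconds from lease start.
# $1 = lease time in seconds.
dhcp_t1() { echo $(( $1 / 2 )); }       # renewal:   0.5   * lease
dhcp_t2() { echo $(( $1 * 7 / 8 )); }   # rebinding: 0.875 * lease, rounded down

lease=60
echo "lease=${lease}s T1=$(dhcp_t1 "$lease")s T2=$(dhcp_t2 "$lease")s"
```

With the 60-second lease mentioned in the report, this leaves only the last few seconds of the lease for rebinding to succeed before the address lifetime runs out.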

I'd need to see either dhclient output or some packet dump to know more.

But if giving the address some more lifetime works around the problem, then I'm probably fine with that.

Could you add the following line somewhere after '# ### MAIN' to see if that helps?

[[ "${new_dhcp_lease_time}" -lt "4294967235" ]] && new_dhcp_lease_time=$((new_dhcp_lease_time + 60))
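
For reference, the constant in that line is 4294967295 - 60, where 4294967295 is the largest value of the 32-bit DHCP lease-time field (conventionally the "infinite" lease). The guard presumably keeps the padded value from exceeding that range. A stand-alone sketch of the same arithmetic, using the hypothetical function name `pad_lease_time`:

```shell
# Adds 60 seconds of slack to a lease time unless doing so would exceed
# the 32-bit maximum (4294967295); in that case the value is left alone.
# $1 = lease time in seconds.
pad_lease_time() {
    if [ "$1" -lt 4294967235 ]; then
        echo $(( $1 + 60 ))
    else
        echo "$1"
    fi
}

pad_lease_time 60           # ordinary lease: padded by 60
pad_lease_time 4294967295   # infinite lease: left unchanged
```

This only illustrates the proposed line; it does not reproduce the rest of dhclient-script.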

Comment 4 Jiri Popelka 2015-04-21 08:45:13 UTC

I've added this commit upstream (Fedora)
http://pkgs.fedoraproject.org/cgit/dhcp.git/commit/?id=d12e0eb05e510268ce9b8dcb839e27d5eca9aff5

But it'd still be nice to see some dhclient output or packet dump when the problem occurs.

Comment 5 Joe 2015-04-21 14:18:32 UTC

Hi Jiri,

Thank you for looking this over and for adding the additional time.

I'm setting up a test instance in my OpenStack cloud, will run tcpdump, and wait for this issue to happen. It usually manifests itself within 24 hours.
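
A capture limited to DHCP traffic might look like the sketch below. It is not the exact command used for the attached pcap: the wrapper name `capture_dhcp`, the interface `eth0`, and the output file name are assumptions.

```shell
# BPF filter matching DHCP traffic (server port 67, client port 68).
DHCP_FILTER='udp and (port 67 or port 68)'

# $1 = interface, $2 = output pcap file. Usually needs root.
# -n disables name resolution; -w writes raw packets to the file.
capture_dhcp() {
    tcpdump -n -i "$1" -w "$2" "$DHCP_FILTER"
}

# Example (assumed names):
#   capture_dhcp eth0 dhcp.pcap
```

Restricting the filter to ports 67 and 68 keeps the file small enough to leave running until the problem shows up.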

From reviewing my notes, the core issue is that when the timeout happens, the default gateway of the instance/VM is dropped. After the timeout has elapsed and the late DHCP renewal arrives at the instance, the instance re-adds its IP address but not its default gateway.

I'll post some log entries and a packet dump once I have them.

Thanks,
Joe

Comment 6 Joe 2015-04-21 20:36:36 UTC

Created attachment 1017114
dhcp pcap

Comment 7 Joe 2015-04-21 20:40:43 UTC

Hi Jiri,

I was able to semi-reproduce this problem. While I left "watch ip a" running, my session was cut, and the last output on the screen showed a valid_lft and preferred_lft of 1 second. I think it's safe to assume that there was a timeout.

What's odd about this case is that when I logged back in, I had a default route. Normally the default route doesn't exist and I have to log in through an out-of-band method to re-add it.
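
Since the route sometimes returns on its own, it may help to record when the default route is present rather than checking by hand. The following is a rough sketch under stated assumptions: the function names `has_default_route` and `default_gateway` are hypothetical, and the helpers only parse `ip -4 route show` output passed on stdin.

```shell
# Succeeds if the routing table read from stdin contains a default route.
has_default_route() {
    grep -q '^default '
}

# Prints the gateway address of the first default route read from stdin.
default_gateway() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) {
            if ($i == "via") { print $(i + 1); exit }
        }
    }'
}

# Example logging loop:
#   while sleep 5; do
#       ip -4 route show | has_default_route || echo "$(date -u) no default route"
#   done
```

Logging timestamps this way would show how long the gateway was missing, which could then be lined up against the packet capture.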

I used the latest CentOS 7 image from here:

http://cloud.centos.org/centos/7/images/

I haven't yet checked whether there has been a dhclient update since I first ran into this issue.

I'm currently re-running the tests to see if I will run into an occurrence where the default gateway is _not_ re-added.

Attached is a pcap file of DHCP traffic. AFAICT, the bump in connectivity happened at packet 1523.

Thanks,
Joe

Comment 8 Joe 2015-04-23 19:52:41 UTC

I was able to reproduce this problem entirely yesterday -- lost gateway and all. It was at the end of the day when I noticed it had happened, so I just killed tcpdump and went home.

When I came back in today, I noticed the gateway had returned! I have no idea when, though. I even have a screenshot of my console showing both no default route and a default route if you'd like to double-check. :)

The packet dump looks the same as the last one I uploaded. I'd be happy to still upload it, though.

Comment 16 Pavel Zhukov 2019-07-18 08:27:52 UTC

This bug was evaluated by the sub-system and was not considered a priority for the release, so it's being closed now as WONTFIX. Feel free to re-open the bug if there is a business reason to deliver a fix for this issue.
Note that the workaround for this issue is included in Red Hat Enterprise Linux 8. Please check whether the issue exists there and open a new bug if it does.