1205369 – DHCP client takes time adjustment as a "best to reconfigure the interface" in order to deal with dhcp timers.

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	dhcp
Sub Component:
Version:	7.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	---
Assignee:	Pavel Zhukov
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1380362

Reported:	2015-03-24 19:11 UTC by Leonid Natapov
Modified:	2019-03-18 09:12 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-11-16 14:08:54 UTC
Target Upstream Version:
Embargoed:

Description Leonid Natapov 2015-03-24 19:11:10 UTC

dhcp-libs-4.2.5-36.el7.x86_64
dhcp-common-4.2.5-36.el7.x86_64
dhcp-4.2.5-36.el7.x86_64


Description of problem:

I am running leap second vulnerability tests on an HA deployment (RHOS6 A2).
The test runs a script that inserts a leap second and then checks whether everything still works.

The HA deployment has 3 controllers and 2 compute nodes.
After the leap second script had been running for some time on one of the controllers,
I noticed that the controller rebooted. Investigation showed that it reboots because the heartbeat interface goes away and another cluster node fences the node whose heartbeat NIC is gone.

It happens only when the leap second script is running and the interface uses DHCP.

It looks like a bug in the DHCP client, which can't deal with the leap second.
dhclient seems to treat the leap second as a case where it is "best to reconfigure the interface" in order to deal with DHCP timers.
In our case, the cluster's heartbeat interface relies on DHCP. A fault in the DHCP server would take the cluster down.

DHCP server dies -> dhclient attempts a renewal -> renewal fails -> IP goes away -> cluster loses connectivity to its heartbeat NIC -> reboot.

To confirm that this is the problem, I reconfigured the cluster's heartbeat interface to use a static IP instead of DHCP and re-ran the test.
No reboot occurred. I repeated the test several times with no reboot; everything works fine.

Here is the /var/log/messages output before the host reboots.
As you can see, the failure is related to the corosync component.
------------------------------------------------------------------------------
    Apr  9 03:00:00 macf04da2732fb1 systemd: Time has been changed
        Apr  9 03:00:01 macf04da2732fb1 dhclient[1619]: DHCPDISCOVER on eno1 to 255.255.255.255 port 67 interval 8 (xid=0x8d3ac2)
        Apr  9 03:00:01 macf04da2732fb1 dhclient[1619]: DHCPREQUEST on eno1 to 255.255.255.255 port 67 (xid=0x8d3ac2)
        Apr  9 03:00:01 macf04da2732fb1 dhclient[1619]: DHCPOFFER from 192.168.0.1
        Apr  9 03:00:01 macf04da2732fb1 dhclient[1619]: DHCPACK from 192.168.0.1 (xid=0x8d3ac2)
        Apr 10 02:59:50 macf04da2732fb1 systemd: Time has been changed
        Apr 10 02:59:50 macf04da2732fb1 corosync[2777]: [TOTEM ] A processor failed, forming new configuration.
        Apr 10 02:59:50 macf04da2732fb1 corosync[2777]: [TOTEM ] The network interface is down.
        Apr 10 02:59:50 macf04da2732fb1 corosync[2777]: [TOTEM ] adding new UDPU member {192.168.0.3}
        Apr 10 02:59:50 macf04da2732fb1 corosync[2777]: [TOTEM ] adding new UDPU member {192.168.0.5}
        Apr 10 02:59:50 macf04da2732fb1 corosync[2777]: [TOTEM ] adding new UDPU member {192.168.0.6}
        Apr 10 02:59:50 macf04da2732fb1 Delay(ceilometer-delay)[105652]: INFO: Delay is running OK
        Apr 10 02:59:50 macf04da2732fb1 NET[106166]: /usr/sbin/dhclient-script : updated /etc/resolv.conf
        Apr 10 02:59:50 macf04da2732fb1 dhclient[1619]: bound to 192.168.0.5 -- renewal in -86093 seconds.
        Apr 10 02:59:50 macf04da2732fb1 dhclient[1619]: DHCPDISCOVER on eno1 to 255.255.255.255 port 67 interval 4 (xid=0x78f489f2)
        Apr 10 02:59:50 macf04da2732fb1 dhclient[1619]: DHCPREQUEST on eno1 to 255.255.255.255 port 67 (xid=0x78f489f2)
        Apr 10 02:59:50 macf04da2732fb1 dhclient[1619]: DHCPOFFER from 192.168.0.1
        Apr 10 02:59:50 macf04da2732fb1 dhclient[1619]: DHCPACK from 192.168.0.1 (xid=0x78f489f2)
        Apr 10 02:59:52 macf04da2732fb1 systemd-logind: Power key pressed.
        Apr 10 02:59:52 macf04da2732fb1 systemd-logind: Powering Off...
        Apr 10 02:59:52 macf04da2732fb1 systemd-logind: System is powering down.
        Apr 10 02:59:52 macf04da2732fb1 systemd: Stopping Session 86 of user root.
        Connection to macf04da2732fb1 closed by remote host.

Comment 2 Ofer Blaut 2015-03-25 06:35:09 UTC

RHEL version - Red Hat Enterprise Linux Server release 7.1 (Maipo)

Comment 3 Jiri Popelka 2015-03-25 10:46:53 UTC

Looks like this might be related to bug #1093803.
Leonid, could you check that bug whether it looks similar ?
Also, can I get the 'script that inserts a leap second' somehow or would the leap-a-day.c from https://access.redhat.com/articles/199563 be sufficient ?

Comment 4 Prarit Bhargava 2015-03-25 11:02:08 UTC

(In reply to Jiri Popelka from comment #3)
> Looks like this might be related to bug #1093803.
> Leonid, could you check that bug whether it looks similar ?
> Also, can I get the 'script that inserts a leap second' somehow or would the
> leap-a-day.c from https://access.redhat.com/articles/199563 be sufficient ?

Jiri, the test in that link should work.

P.

Comment 5 Jiri Kortus 2015-03-27 16:19:18 UTC

I analyzed this issue together with pholica and we found out that it is not related to the time change due to the leap second addition/removal, but to the time change related to the test program (leap-a-day.c) that changes the date one day forward (in addition to the leap second manipulation). 

When the date change (+1 day in this case) occurs just after the DHCP ack and before the newly offered address is bound, it results in a negative lease renewal time *). The lease is therefore considered invalid, so dhclient removes the route related to the interface address and sends a new DHCP request. This causes a short connectivity outage (probably a couple of seconds), during which the machine in question appears to be dead, and it is restarted before it has any chance to renew the lease (and set up the route).

However, when dhclient removes the route, the IP address remains set on the interface, so this inconsistency between the IP address on the interface and the related route in the routing table might be considered a bug. Nevertheless the real impact is virtually zero since it probably can't occur in practice (it would involve very short DHCP lease times and/or big forward time shifts).


*) Apr 10 02:59:50 macf04da2732fb1 dhclient[1619]: bound to 192.168.0.5 -- renewal in -86093 seconds.
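
The negative renewal time can be illustrated with a small arithmetic sketch. It assumes the renewal delay is derived from wall-clock time as "lease start + renewal interval - current time"; this is an illustration of the effect described above, not dhclient's actual code. The function name and all numeric values are hypothetical.

```python
ONE_DAY = 86400  # seconds


def seconds_until_renewal(lease_start: int, t1: int, now: int) -> int:
    """Delay until renewal when it is computed from wall-clock time (assumed model)."""
    return lease_start + t1 - now


lease_start = 1_000_000  # hypothetical wall-clock time of the DHCP ack
t1 = 300                 # hypothetical renewal interval in seconds

# Bind happens one second after the ack, with no clock change in between.
normal = seconds_until_renewal(lease_start, t1, now=lease_start + 1)

# Same bind, but the wall clock was moved one day forward in between.
jumped = seconds_until_renewal(lease_start, t1, now=lease_start + 1 + ONE_DAY)

print(normal)  # 299
print(jumped)  # -86101
```

With a renewal interval of a few minutes, a one-day forward jump yields a value of the same order as the -86093 seconds seen in the log.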

Comment 6 Pavel Holica 2015-03-27 16:38:31 UTC

I'd like to add a note that the route issue was also hit when a large time jump occurred after the bind (between requests), but the jump still has to be large enough (definitely more than a second).

Comment 7 Jiri Popelka 2015-03-27 17:26:00 UTC

(In reply to Jiri Kortus from comment #5)
> I analyzed this issue together with pholica and we found out that it is not
> related to the time change due to the leap second addition/removal, but to
> the time change related to the test program (leap-a-day.c) that changes the
> date one day forward (in addition to the leap second manipulation). 

Thanks, I came to a similar conclusion.

> Nevertheless the real impact is virtually zero since it probably can't occur
> in practice (it would involve very short DHCP lease times and/or big forward
> time shifts).

Lowering the severity then.

Comment 10 Dave Maley 2015-04-22 17:18:58 UTC

(In reply to Jiri Kortus from comment #5)
> I analyzed this issue together with pholica and we found out that it is not
> related to the time change due to the leap second addition/removal, but to
> the time change related to the test program (leap-a-day.c) that changes the
> date one day forward (in addition to the leap second manipulation). 

https://access.redhat.com/articles/199563 indicates that using the -s option would lead to this behaviour.  

*      -s:     Each iteration, set the date to 10 seconds before midnight GMT.
*              This speeds up the number of leapsecond transitions tested,
*              but because it calls settimeofday frequently, advancing the
*              time by 24 hours every ~16 seconds, it may cause application
*              disruption.
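
The size of the jump produced by the behaviour described for -s can be sketched as follows. This is only a model of the quoted description ("set the date to 10 seconds before midnight GMT"), not code taken from leap-a-day.c; the function name and the sample timestamp are hypothetical.

```python
SECONDS_PER_DAY = 86400


def next_set_target(now: int, lead: int = 10) -> int:
    """Epoch time `lead` seconds before the next midnight GMT after `now`."""
    next_midnight = now - (now % SECONDS_PER_DAY) + SECONDS_PER_DAY
    return next_midnight - lead


now = 20 * SECONDS_PER_DAY + 1  # hypothetical: 1 second past midnight GMT
target = next_set_target(now)

print(target - now)  # 86389
```

A forward jump of 86389 seconds matches the step in the log from "Apr  9 03:00:01" to "Apr 10 02:59:50".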

Would it be possible to re-test w/out the -s flag?
