
Bug 1632967

Summary: [Azure] cloud-init dhcp.py dhcp_discovery() race with dhclient with preprovisioned VM in Azure
Product: Red Hat Enterprise Linux 7
Reporter: Jason Zions <jason.zions>
Component: cloud-init
Assignee: Eduardo Otubo <eterrell>
Status: CLOSED ERRATA
QA Contact: Yuxin Sun <yuxisun>
Severity: high
Docs Contact:
Priority: urgent
Version: 7.6
CC: danis, eterrell, jason.zions, jgreguske, ribarry, yujiang, yuxisun
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: cloud-init-18.2-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1644447 (view as bug list) Environment:
Last Closed: 2019-08-06 12:51:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1644447, 1691987

Description Jason Zions 2018-09-25 23:48:59 UTC
Description of problem:
In /usr/lib/python2.7/site-packages/cloudinit/net/dhcp.py, dhcp_discovery() starts dhclient specifically so it will capture the DHCP leases in dhcp.leases. The function copies the dhclient binary and starts it with options naming unique lease and pid files. The function then waits for both the lease and pid files to appear before using the contents of the pid file to kill the dhclient instance.

There’s a behavior difference between the Ubuntu and RHEL versions of dhclient:
•	On Ubuntu, dhclient writes the DHCP lease response, forks/daemonizes, then writes the pid file with the daemonized process ID.
•	On RHEL, dhclient writes a pid file with the pre-daemon pid, writes the DHCP lease response, forks/daemonizes, then overwrites the pid file with the new (daemonized) pid.

On RHEL, there’s a race between dhcp_discovery() and dhclient:
1.	dhclient writes the pid file and lease file
2.	dhclient forks; the parent process exits
3.	dhcp_discovery() sees that the pid file and lease file exist
4.	dhcp_discovery() tries to kill the process named in the pid file, but it already exited in step 2
5.	dhclient child starts, daemonizes, and writes its pid in the pid file
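
The failure in step 4 can be reproduced in isolation. The sketch below is not cloud-init code: `simulate_stale_pid` and `kill_pid_from_file` are hypothetical helpers, and a short-lived Python child stands in for the dhclient parent that exits in step 2. It only shows that sending SIGKILL to the pid left behind in the pid file fails with ESRCH ("No such process").

```python
import errno
import os
import signal
import subprocess
import sys


def simulate_stale_pid(pid_file):
    """Write the pid of a process that has already exited (steps 1-2)."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # the stand-in "parent" has exited and been reaped
    with open(pid_file, "w") as f:
        f.write(str(child.pid))
    return child.pid


def kill_pid_from_file(pid_file):
    """Naive step 4: SIGKILL whatever pid the file names right now."""
    with open(pid_file) as f:
        pid = int(f.read().strip())
    # Raises OSError with errno ESRCH if that process no longer exists.
    os.kill(pid, signal.SIGKILL)
```

Calling `kill_pid_from_file()` on a file written by `simulate_stale_pid()` raises `OSError` (`ProcessLookupError` on Python 3), which matches the error reported under "Actual results".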


Version-Release number of selected component (if applicable):


How reproducible:
Fairly common

Steps to Reproduce:
1. Create a VM in Azure from a "preprovisioned pool" of VMs

Actual results:
dhcp_discovery() in dhcp.py throws an error when trying to send SIGKILL to a non-existent process

Expected results:


Additional info:

Comment 2 Jason Zions 2018-09-25 23:52:34 UTC
Simplest fix seems to be waiting until the pid in the pid file created by dhclient actually identifies an existing process whose parent is pid 1 (i.e. the forked child instance of dhclient post-lease acquisition). We have a patch which does exactly this and it resolves the issue.
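
A minimal sketch of that idea follows, assuming a Linux /proc filesystem. It is an illustration, not the submitted patch: `parse_ppid`, `read_ppid`, and `wait_for_daemonized_pid` are hypothetical names, and the attempt count and delay are arbitrary defaults. The loop re-reads the pid file on every pass, so it picks up the pid that the daemonized child writes after the fork.

```python
import time


def parse_ppid(stat_contents):
    """Extract the parent pid from the text of /proc/<pid>/stat.

    The command name (field 2) is parenthesised and may contain spaces,
    so split after the last ')' before counting fields.
    """
    after_comm = stat_contents.rsplit(")", 1)[1].split()
    return int(after_comm[1])  # after_comm[0] is state, [1] is ppid


def read_ppid(pid):
    """Return the parent pid of `pid`, or None if it cannot be read."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            return parse_ppid(f.read())
    except (IOError, OSError, IndexError, ValueError):
        return None


def wait_for_daemonized_pid(pid_file, get_ppid=read_ppid,
                            attempts=1000, delay=0.01, init_pid=1):
    """Poll until pid_file names a live process parented by init_pid.

    Returns that pid, or None if the attempts run out.
    """
    for _ in range(attempts):
        try:
            with open(pid_file) as f:
                pid = int(f.read().strip())
        except (IOError, OSError, ValueError):
            pid = None  # file missing, empty, or caught mid-write
        if pid is not None and get_ppid(pid) == init_pid:
            return pid
        time.sleep(delay)
    return None
```

Only once this returns a pid would the caller read the lease file and send SIGKILL. `get_ppid` is a parameter so the polling logic can be exercised without a real dhclient.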

Comment 3 Rick Barry 2018-09-26 13:18:43 UTC
(In reply to Jason Zions from comment #2)
> Simplest fix seems to be waiting until the pid in the pid file created by
> dhclient actually identifies an existing process whose parent is pid 1 (i.e.
> the forked child instance of dhclient post-lease acquisition). We have a
> patch which does exactly this and it resolves the issue.

Thanks for the bug report and your debugging effort to identify the problem. We'll take a look.

Comment 4 Rick Barry 2018-10-08 14:32:30 UTC
Jason, was this the issue that was observed during the fast provisioning testing a few weeks ago? If this issue is resolved, do the tests pass?

Comment 5 Dan 2018-10-08 22:50:21 UTC
@Rick - yes, this was the issue seen and the fix resolved the issue with the fast prov testing.

Comment 6 Jason Zions 2018-10-22 18:43:09 UTC
(In reply to Rick Barry from comment #4)
> Jason, was this the issue that was observed during the fast provisioning
> testing a few weeks ago? If this issue is resolved, do the tests pass?

Dan is correct; with this issue resolved, fast provisioning passed our tests. I've submitted a patch to repair this issue, https://code.launchpad.net/~jasonzio/cloud-init/+git/cloud-init/+merge/357427

Comment 7 Dan 2018-10-24 03:47:18 UTC
Hi Rick, Eduardo, we have submitted a patch here:
https://bugs.launchpad.net/cloud-init/+bug/1794399

We are waiting on the merge decision, and will update. Once this has been accepted, we will need a test package based on the existing 18.2 RHEL 7.6 package.

Thanks!

Comment 9 Eduardo Otubo 2019-01-18 15:14:18 UTC
(In reply to Jason Zions from comment #8)
> Two commits have been merged:
> 
> https://code.launchpad.net/~jasonzio/cloud-init/+git/cloud-init/+merge/360905

The commit:

commit fdadcb5fae51f4e6799314ab98e3aec56c79b17c
Author: Jason Zions <jasonzio>
Date:   Tue Jan 15 21:37:17 2019 +0000

    net: Wait for dhclient to daemonize before reading lease file

Applied cleanly

> https://code.launchpad.net/~jasonzio/cloud-init/+git/cloud-init/+merge/361757

But this one:

commit f19dc8fa62d4fd8de33311c3c75c5b6da440bebe
Author: Jason Zions <jasonzio>
Date:   Tue Jan 15 17:05:47 2019 +0000

    [Azure] Increase retries when talking to Wireserver during metadata walk

This one looks like it depends on many other commits. Is this commit absolutely mandatory to solve this BZ? If so, I'll need a little more time to work on the backport.

Comment 10 Jason Zions 2019-01-18 18:07:19 UTC
This is a one-line change. If you're seeing more, that's because I did a pull from master to my branch but failed to do a rebase, for which I apologize. The single commit you need is 26f2e40:

diff --git a/cloudinit/sources/DataSourceAzure.py b/cloudinit/sources/DataSourceAzure.py
index 46efca4..a4f998b 100644
--- a/cloudinit/sources/DataSourceAzure.py
+++ b/cloudinit/sources/DataSourceAzure.py
@@ -416,7 +416,7 @@ class DataSourceAzure(sources.DataSource):
                     raise sources.InvalidMetaDataException(msg)
                 ret = self._reprovision()
             imds_md = get_metadata_from_imds(
-                self.fallback_interface, retries=3)
+                self.fallback_interface, retries=10)
             (md, userdata_raw, cfg, files) = ret
             self.seed = cdev
             crawled_data.update({
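
To illustrate why the retry count matters, here is a generic sketch. It does not reproduce `get_metadata_from_imds`; `call_with_retries` is a hypothetical helper, and treating `retries` as the number of extra attempts after the first is an assumption made for this example only.

```python
import time


def call_with_retries(fetch, retries, delay=0.0):
    """Call fetch() until it succeeds, allowing `retries` extra attempts."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay)  # back off before the next attempt
    raise last_error
```

Under that assumption, an endpoint that fails five times before answering exhausts `retries=3` but succeeds with `retries=10`.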

Comment 12 Yuxin Sun 2019-01-24 02:45:11 UTC
Waiting for MSFT to provide a fast provisioning image.

@Daniel, please update here once a fast provisioning image is available in Azure. Thanks!

Comment 13 Miroslav Rezanina 2019-01-31 08:11:26 UTC
Fix included in cloud-init-18.2-2.el7

Comment 15 Dan 2019-04-29 15:22:32 UTC
We retested the package on the faster provisioning system, but the 18.5 package is missing the commit below. Can you please add it and send a new package?

commit fdadcb5fae51f4e6799314ab98e3aec56c79b17c
Author: Jason Zions
Date: 1/15/2019 1:37 PM
net: Wait for dhclient to daemonize before reading lease file

cloud-init uses dhclient to fetch the DHCP lease so it can extract DHCP options. dhclient creates the leasefile, then writes to it; simply waiting for the leasefile to appear creates a race between dhclient and cloud-init. Instead, wait for dhclient to be parented by init. At that point, we know it has written to the leasefile, so it's safe to copy the file and kill the process.

cloud-init creates a temporary directory in which to execute dhclient, and deletes that directory after it has killed the process. If cloud-init abandons waiting for dhclient to daemonize, it will still attempt to delete the temporary directory, but will not report an exception should that attempt fail.

LP: #1794399
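
The cleanup behavior described in the commit message can be sketched as a context manager. This is an assumption-laden illustration, not the code from the commit: `quiet_tempdir` is a hypothetical name, and it relies on `shutil.rmtree(..., ignore_errors=True)` to swallow any failure while deleting the directory.

```python
import contextlib
import shutil
import tempfile


@contextlib.contextmanager
def quiet_tempdir(prefix="dhcp-sketch-"):
    """Yield a temporary directory and always try to delete it afterwards.

    Deletion errors are ignored, so a failed cleanup never masks the
    outcome of the work done inside the `with` block.
    """
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)
```

The directory is removed whether the body finishes normally or gives up early, and no exception is raised if the removal itself fails.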

Comment 16 Yuxin Sun 2019-05-29 02:47:40 UTC
1. MSFT:

Copied from https://bugzilla.redhat.com/show_bug.cgi?id=1687565#c44:

 Jason Zions 2019-05-24 19:19:13 UTC
...
Our testing has confirmed that the cloud-init-18.5-2.el7 package WORKS for pre-provisioned (PPS) and non-PPS VM creation in Azure.
- When used with RHEL 7.6, kernel version 3.10.0-957.12.2.el7.x86_64 (or newer, we expect) is REQUIRED.
- When this cloud-init package is installed on RHEL 7.6 with the 3.10.0-957.10.2.el7.x86_64 kernel, we observed some fraction of created VMs to have network instability (failed nslookup operations, problems with ssh connections to the VM, etc).

2. Red Hat QE:
Regression tests passed. No blocking issues found.
Test run: https://polarion.engineering.redhat.com/polarion/#/project/RedHatEnterpriseLinux7/testrun?id=Azure_cloud-init-18_5-2_el7_x86_64_RHEL-7_6-ond-2018103108_tier1%202019-05-23%2004-18-57

Based on the results above, setting bug status to VERIFIED.

Comment 18 errata-xmlrpc 2019-08-06 12:51:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2123

Comment 19 Eduardo Otubo 2019-10-30 11:34:17 UTC
Clearing NEEDINFO flag