Bug 1632967
| Summary: | [Azure] cloud-init dhcp.py dhcp_discovery() race with dhclient with preprovisioned VM in Azure | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jason Zions <jason.zions> | |
| Component: | cloud-init | Assignee: | Eduardo Otubo <eterrell> | |
| Status: | CLOSED ERRATA | QA Contact: | Yuxin Sun <yuxisun> | |
| Severity: | high | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 7.6 | CC: | danis, eterrell, jason.zions, jgreguske, ribarry, yujiang, yuxisun | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | cloud-init-18.2-2.el7 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1644447 (view as bug list) | Environment: | ||
| Last Closed: | 2019-08-06 12:51:00 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1644447, 1691987 | |||
|
Description
Jason Zions
2018-09-25 23:48:59 UTC
Simplest fix seems to be waiting until the pid in the pid file created by dhclient actually identifies an existing process whose parent is pid 1 (i.e. the forked child instance of dhclient post-lease acquisition). We have a patch which does exactly this and it resolves the issue.

(In reply to Jason Zions from comment #2)
> Simplest fix seems to be waiting until the pid in the pid file created by
> dhclient actually identifies an existing process whose parent is pid 1 (i.e.
> the forked child instance of dhclient post-lease acquisition). We have a
> patch which does exactly this and it resolves the issue.

Thanks for the bug report and your debugging effort to identify the problem. We'll take a look.

Jason, was this the issue that was observed during the fast provisioning testing a few weeks ago? If this issue is resolved, do the tests pass?

@Rick - yes, this was the issue seen and the fix resolved the issue with the fast prov testing.

(In reply to Rick Barry from comment #4)
> Jason, was this the issue that was observed during the fast provisioning
> testing a few weeks ago? If this issue is resolved, do the tests pass?

Dan is correct; with this issue resolved, fast provisioning passed our tests.

I've submitted a patch to repair this issue, https://code.launchpad.net/~jasonzio/cloud-init/+git/cloud-init/+merge/357427

Hi Rick, Eduardo, we have submitted a patch here: https://bugs.launchpad.net/cloud-init/+bug/1794399
We are waiting on the merge decision, and will update. Once this has been accepted, we will need a test package based on the existing 18.2 RHEL 7.6 package. Thanks!

Two commits have been merged:
https://code.launchpad.net/~jasonzio/cloud-init/+git/cloud-init/+merge/360905
https://code.launchpad.net/~jasonzio/cloud-init/+git/cloud-init/+merge/361757
Launchpad bug 1794399 has been resolved.
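
The wait proposed in comment #2 can be sketched roughly as follows. This is a minimal illustration, not the merged cloud-init patch: the helper names `read_ppid` and `wait_for_daemonized`, the timeout values, and the `proc_root` parameter are all assumptions introduced here. It relies only on the Linux `/proc/<pid>/stat` layout, where the parent pid is the second field after the parenthesised command name.

```python
import os
import time


def read_ppid(pid, proc_root="/proc"):
    """Return the parent pid of `pid`, or None if the process is gone.

    Hypothetical helper: reads the Linux /proc/<pid>/stat file.
    """
    try:
        with open(os.path.join(proc_root, str(pid), "stat")) as f:
            stat = f.read()
    except OSError:
        return None
    # The command name (field 2) is parenthesised and may contain spaces,
    # so split after the last ')'. What remains is: state, ppid, ...
    fields = stat.rsplit(")", 1)[-1].split()
    return int(fields[1]) if len(fields) > 1 else None


def wait_for_daemonized(pid_file, timeout=10.0, interval=0.01,
                        proc_root="/proc"):
    """Poll until pid_file names a live process whose parent is pid 1.

    Returns the pid on success, or None if the timeout expires.
    The timeout and interval defaults are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(pid_file) as f:
                pid = int(f.read().strip())
        except (OSError, ValueError):
            pid = None  # file missing, empty, or only partially written
        if pid is not None and read_ppid(pid, proc_root) == 1:
            return pid
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

A caller would only read the lease file after `wait_for_daemonized` returns a pid; a `None` result means the wait was abandoned and the lease file should not be trusted.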
(In reply to Jason Zions from comment #8)
> Two commits have been merged:
>
> https://code.launchpad.net/~jasonzio/cloud-init/+git/cloud-init/+merge/360905

The commit:

commit fdadcb5fae51f4e6799314ab98e3aec56c79b17c
Author: Jason Zions <jasonzio>
Date: Tue Jan 15 21:37:17 2019 +0000

net: Wait for dhclient to daemonize before reading lease file

Applied cleanly.

> https://code.launchpad.net/~jasonzio/cloud-init/+git/cloud-init/+merge/361757

But this one:

commit f19dc8fa62d4fd8de33311c3c75c5b6da440bebe
Author: Jason Zions <jasonzio>
Date: Tue Jan 15 17:05:47 2019 +0000

[Azure] Increase retries when talking to Wireserver during metadata walk

It looks like this one depends on a lot of other commits. Is this commit absolutely mandatory to solve this BZ? If so, I'll have to spend a little more time working on the backport.

This is a one-line change. If you're seeing more, that's because I did a pull from master to my branch but failed to do a rebase, for which I apologize. The single commit you need is 26f2e40:
diff --git a/cloudinit/sources/DataSourceAzure.py b/cloudinit/sources/DataSourceAzure.py
index 46efca4..a4f998b 100644
--- a/cloudinit/sources/DataSourceAzure.py
+++ b/cloudinit/sources/DataSourceAzure.py
@@ -416,7 +416,7 @@ class DataSourceAzure(sources.DataSource):
raise sources.InvalidMetaDataException(msg)
ret = self._reprovision()
imds_md = get_metadata_from_imds(
- self.fallback_interface, retries=3)
+ self.fallback_interface, retries=10)
(md, userdata_raw, cfg, files) = ret
self.seed = cdev
crawled_data.update({
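
To illustrate what raising `retries` from 3 to 10 buys, here is a generic retry helper. It is only a sketch: `call_with_retries` is a hypothetical name, the choice of `OSError` and the delay value are assumptions, and it says nothing about how `get_metadata_from_imds` is implemented internally.

```python
import time


def call_with_retries(fetch, retries=10, delay=1.0):
    """Call fetch() until it succeeds, retrying up to `retries` times.

    Hypothetical helper: one initial attempt plus `retries` retries.
    Re-raises the last error if every attempt fails.
    """
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError as exc:  # assumed stand-in for a transient network error
            last_exc = exc
            if attempt < retries:
                time.sleep(delay)
    raise last_exc
```

With a service that needs several seconds to become reachable after reprovisioning, a budget of 3 retries can be exhausted before the first successful response, while 10 leaves more headroom.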
Waiting for MSFT to provide a fast provisioning image. @Daniel, please update here when a fast provisioning image is available in Azure. Thanks!

Fix included in cloud-init-18.2-2.el7

We retested the package on the faster provisioning system, but the 18.5 package is missing the commit below. Can you please add it and send a new package?

commit fdadcb5fae51f4e6799314ab98e3aec56c79b17c
Author: Jason Zions
Date: 1/15/2019 1:37 PM

net: Wait for dhclient to daemonize before reading lease file

cloud-init uses dhclient to fetch the DHCP lease so it can extract DHCP options. dhclient creates the leasefile, then writes to it; simply waiting for the leasefile to appear creates a race between dhclient and cloud-init. Instead, wait for dhclient to be parented by init. At that point, we know it has written to the leasefile, so it's safe to copy the file and kill the process.

cloud-init creates a temporary directory in which to execute dhclient, and deletes that directory after it has killed the process. If cloud-init abandons waiting for dhclient to daemonize, it will still attempt to delete the temporary directory, but will not report an exception should that attempt fail.

LP: #1794399

1. MSFT: Copy https://bugzilla.redhat.com/show_bug.cgi?id=1687565#c44 here:

Jason Zions 2019-05-24 19:19:13 UTC
...
Our testing has confirmed that the cloud-init-18.5-2.el7 package WORKS for pre-provisioned (PPS) and non-PPS VM creation in Azure.
- When used with RHEL 7.6, kernel version 3.10.0-957.12.2.el7.x86_64 (or newer, we expect) is REQUIRED.
- When this cloud-init package is installed on RHEL 7.6 with the 3.10.0-957.10.2.el7.x86_64 kernel, we observed some fraction of created VMs to have network instability (failed nslookup operations, problems with ssh connections to the VM, etc).

2. Red Hat QE: Regression test passed. No blocker issue found.
Test run: https://polarion.engineering.redhat.com/polarion/#/project/RedHatEnterpriseLinux7/testrun?id=Azure_cloud-init-18_5-2_el7_x86_64_RHEL-7_6-ond-2018103108_tier1%202019-05-23%2004-18-57

According to the results above, set bug status to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:2123

Clearing NEEDINFO flag
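
The temporary-directory behaviour described in the fdadcb5 commit message quoted earlier in this thread (always attempt the cleanup, but do not report an exception if it fails) can be approximated with the standard library. This is a sketch under assumptions: `run_in_temp_dir` and the directory prefix are hypothetical names, and the actual cloud-init code may structure this differently.

```python
import shutil
import tempfile


def run_in_temp_dir(work):
    """Run work(tmpdir) in a fresh temporary directory, then remove it.

    Hypothetical helper. Cleanup is best-effort: ignore_errors=True means
    a failed removal (for example, a process still writing into the
    directory) does not raise.
    """
    tmpdir = tempfile.mkdtemp(prefix="dhcp-sketch-")
    try:
        return work(tmpdir)
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)
```

The `finally` clause ensures the removal is attempted even when `work` raises, which mirrors the case where waiting for dhclient to daemonize is abandoned.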