Bug 2178793
| Summary: | cloud-init: backport fix for instance id detection on Azure [rhel-8.9] | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Chris Patterson <cpatterson> | ||||
| Component: | cloud-init | Assignee: | Ani Sinha <anisinha> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Huijuan Zhao <huzhao> | ||||
| Severity: | unspecified | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 8.7 | CC: | andavis, anisinha, bdas, eesposit, eterrell, huzhao, jgreguske, litian, mrezanin, xiachen, xiliang, yacao, yuxisun | ||||
| Target Milestone: | rc | Keywords: | Triaged | ||||
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | cloud-init-23.1.1-3.el8 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2023-11-14 15:31:09 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
|
Description
Chris Patterson
2023-03-15 18:04:43 UTC
Hi Chris, Thanks for raising the issue and the detail test steps, I can reproduce it[1] per comment 0. We will rebase to cloud-init-23.1 in the next rhel versions(8.9/9.3) which already included this patch. And the rhel-8.8/9.2(cloud-init-22.1) are in the exception+ phase now(Mainly fix urgent/important issues), could we fix this BZ in the rhel-8.8/9.2 first batch update? And on which Z streams would you prefer to backport the patch? Thanks! Huijuan [1] Steps to Reproduce: 1. create RHEL VM on Azure (config serial console and password login, as can not get IP after demonstration wrong config and reboot) 2. reboot or clean reboot for demonstration purposes 2.1 mv /usr/sbin/dhclient /usr/sbin/dhclient.x 2.2 cp /bin/false /usr/sbin/dhclient 3. reboot 4. grep config-users /var/log/cloud-init.log Actual results: Failed to get DHCP ip address # grep config-users /var/log/cloud-init.log <First boot where provisioning runs config-users-groups> 2023-03-22 03:02:23,247 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance 2023-03-22 03:02:23,248 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/36716c36-d59d-0842-9daa-850144b550bf/sem/config_users_groups'>) 2023-03-22 03:02:25,119 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully <Boot with /bin/false for dhclient, config-users unexpectedly re-runs> 2023-03-22 03:13:23,943 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance 2023-03-22 03:13:23,944 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/iid-datasource/sem/config_users_groups'>) 2023-03-22 03:13:24,039 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully > could we fix this BZ in the rhel-8.8/9.2 first batch update? And on which Z streams would you prefer to backport the patch?
Whatever makes sense to you, I don't have a strong opinion. I have seen a limited number of customer reports, but the impact may be quite undesirable in situations where the customer may have rotated credentials only to have their original user/keys re-installed, etc.
Hi Chris, Thank you for the update. I tested the scratch rebase cloud-init-23.1.1 build, the issue is fixed. But there are some error logs[3] per the test steps of demonstration purposes in Description, could you please help to take a look if this is expected? Test results: [1] After reboot(with /bin/false for dhclient), instance can get IP address successfully. [2] # grep config-users /var/log/cloud-init.log <First boot where provisioning runs config-users-groups> 2023-03-27 07:19:21,547 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance 2023-03-27 07:19:21,551 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/2da2ad7d-cc4c-c54d-9df3-fa09d35cb682/sem/config_users_groups'>) <Boot with /bin/false for dhclient, config-users-groups does not run> 2023-03-27 07:55:59,907 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully 2023-03-27 08:15:49,706 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance 2023-03-27 08:15:49,706 - helpers.py[DEBUG]: config-users-groups already ran (freq=once-per-instance) 2023-03-27 08:15:49,706 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups previously ran <cloud-init clean, reboot with /bin/false for dhclient, config-users-groups re-run successfully> 2023-03-27 07:19:22,647 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully 2023-03-27 07:55:59,841 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance 2023-03-27 07:55:59,843 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/2da2ad7d-cc4c-c54d-9df3-fa09d35cb682/sem/config_users_groups'>) [3] # cat /var/log/cloud-init.log <The below failed log was printed repeatedly for about 5 mins> azure.py[ERROR]: Command failed: cmd=['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient. pid', 'eth0', '-sf', '/bin/true'] stderr='' stdout='' exit_code=1 <Below was the last print> 7746 2023-03-27 08:15:46,703 - azure.py[ERROR]: Command failed: cmd=['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient. pid', 'eth0', '-sf', '/bin/true'] stderr='' stdout='' exit_code=1 7747 2023-03-27 08:15:46,704 - handlers.py[DEBUG]: finish: azure-ds/obtain-dhcp-lease: FAIL: obtain dhcp lease 7748 2023-03-27 08:15:46,704 - handlers.py[DEBUG]: finish: azure-ds/_setup_ephemeral_networking: FAIL: _setup_ephemeral_networking 7749 2023-03-27 08:15:46,704 - azure.py[DEBUG]: Could not retrieve pps configuration from IMDS: 'extended' 7750 2023-03-27 08:15:46,704 - azure.py[DEBUG]: Could not retrieve pps configuration from IMDS: 'extended' 7751 2023-03-27 08:15:46,704 - azure.py[DEBUG]: Could not retrieve pps configuration from IMDS: 'extended' 7752 2023-03-27 08:15:46,719 - azure.py[INFO]: PPS type: None 7753 2023-03-27 08:15:46,720 - handlers.py[DEBUG]: start: azure-ds/validate_imds_network_metadata: validate_imds_network_metadata 7754 2023-03-27 08:15:46,720 - handlers.py[DEBUG]: start: azure-ds/get_hv_netvsc_macs_normalized: get_hv_netvsc_macs_normalized Thanks Huijuan It looks like there is whitespace in the /run/dhclient.pid string? That path is hardcoded: https://github.com/canonical/cloud-init/blob/main/cloudinit/net/dhcp.py#L130 vs the string above '/run/dhclient. pid' -- was there perhaps whitespace introduced in a patch? Sorry, the whitespace was introduced by the wrong copy. The error log is like below: 2023-03-27 08:15:46,703 - azure.py[ERROR]: Command failed: cmd=['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', 'eth0', '-sf', '/bin/true'] stderr='' stdout='' exit_code=1 That seems unexpected. Can you attach the full log? Which image & vm size are you using? Created attachment 1953963 [details]
cloud-init.log
(In reply to Chris Patterson from comment #6) > That seems unexpected. Can you attach the full log? Which image & vm size > are you using? The image is rhel-9.2, vm size is Standard_DS1_v2. Attached the full cloud-init.log, please help to take a look, thanks! My mistake. Testing on RHEL 9.1 with slightly different patch level (22.1-5.el9) didn't have the retry behavior. It's fine (and expected) for your version. I expect that the last build in the logs (root202303221206) does not include the backported patch? (In reply to Chris Patterson from comment #9) > My mistake. Testing on RHEL 9.1 with slightly different patch level > (22.1-5.el9) didn't have the retry behavior. It's fine (and expected) for > your version. I expect that the last build in the logs (root202303221206) > does not include the backported patch? Thanks Chris for the update. So the test results in comment 3 is expected, right? And regarding the "I expect that the last build in the logs (root202303221206) does not include the backported patch?", are you asking if the patch[1] is in the last build(root202303221206)? Yes, it is. The build is a rhel rebase 23.1.1 build which already includes the patch[1](which was merged to 22.4 in upstream). [1] https://github.com/canonical/cloud-init/commit/b861ea8a5e1fd0eb33096f60f54eeff42d80d3bd OK, I assume you perhaps ran a cloud-init clean when updating the rpm? Assuming so, it looks good. If you want to spin up a new instance, upgrade without cleaning cloud-init, force dhclient failure, then send me an updated log I will give it a review. (In reply to Chris Patterson from comment #11) > OK, I assume you perhaps ran a cloud-init clean when updating the rpm? > Assuming so, it looks good. > > If you want to spin up a new instance, upgrade without cleaning cloud-init, > force dhclient failure, then send me an updated log I will give it a review. Also meet the error logs when reboot without cleaning cloud-init, the cloud-init.log in comment 7 includes this log(there are total twice reboot in the log, the first reboot with cleaning cloud-init, the second reboot without cleaning cloud-init). I think maybe this is because this patch[1], cloud-init re-config network on every boot from 21.4, so this is also expected result? [1] Set Azure to apply networking config every BOOT (#1023) https://github.com/canonical/cloud-init/commit/dc22786980a05129c5971e68ae37b1a9f76f882d You are right, I didn't notice that. Looks good! (In reply to Chris Patterson from comment #13) > You are right, I didn't notice that. Looks good! Thanks Chris for the confirmation. If I understand correctly, the test results in comment 3 are expected, the bug is fixed in the scratch rebase 23.1.1 build. Please correct me if anything wrong. We will release this fix in the rhel-8.9 and 9.3(with cloud-init-23.1.1). And we need to discuss in team to determine on which Z-stream to backport the fix. Tested with cloud-init-23.1.1-3.el8 in rhel-8.9, it includes the patch[1] mentioned in description, and the test result is almost same with comment 3. Only one difference: Failed to get dhcp ip address in rhel-8.9 after setting "cp /bin/false /usr/sbin/dhclient" and reboot. Can get dhcp ip successful after setting /usr/sbin/dhclient correctly and reboot. [1] https://github.com/canonical/cloud-init/commit/b861ea8a5e1fd0eb33096f60f54eeff42d80d3bd I think this is caused by the negative test "cp /bin/false /usr/sbin/dhclient", here is the log: -------------- 1350 2023-05-25 06:00:08,725 - dhcp.py[DEBUG]: Performing a dhcp discovery on eth0 1351 2023-05-25 06:00:08,725 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, captu re=True) 1352 2023-05-25 06:00:08,729 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', 'eth0', '-sf', '/bin/true'] with allowed return codes [0] (shell=False, capture=True) 1353 2023-05-25 06:00:08,732 - azure.py[ERROR]: Command failed: cmd=['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient. pid', 'eth0', '-sf', '/bin/true'] stderr='' stdout='' exit_code=1 1354 2023-05-25 06:00:09,733 - __init__.py[DEBUG]: Stable ifnames disabled by net.ifnames=0 in /proc/cmdline -------------- Moving to VERIFIED, please reopen the BZ if you find there is still issue. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: cloud-init security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6943 |