RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 2178793 - cloud-init: backport fix for instance id detection on Azure [rhel-8.9]
Summary: cloud-init: backport fix for instance id detection on Azure [rhel-8.9]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: cloud-init
Version: 8.7
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Ani Sinha
QA Contact: Huijuan Zhao
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-03-15 18:04 UTC by Chris Patterson
Modified: 2023-11-14 16:41 UTC (History)
13 users (show)

Fixed In Version: cloud-init-23.1.1-3.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-14 15:31:09 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
cloud-init.log (1.05 MB, text/plain)
2023-03-27 15:58 UTC, Huijuan Zhao
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-151986 0 None None None 2023-03-15 18:08:09 UTC
Red Hat Product Errata RHSA-2023:6943 0 None None None 2023-11-14 15:31:57 UTC

Internal Links: 2179017

Description Chris Patterson 2023-03-15 18:04:43 UTC
Description of problem:
cloud-init may re-configure VM after falling back to non-Azure datasource on Azure.

Some error cases will cause the instance ID to fall back to a default value of "iid-datasource".  This will cause cloud-init to reconfigure the VM thinking it is a new instance.

Version-Release number of selected component (if applicable):
All cloud-init versions < 22.4

How reproducible:
The customer cases we have seen have been due to DHCP failures for various reasons (either due to image misconfiguration or platform issues).

Steps to Reproduce:
1. create RHEL VM on Azure
2. reboot # clean reboot for demonstration purposes
2. mv /usr/sbin/dhclient /usr/sbin/dhclient.x
2. cp /bin/false /usr/sbin/dhclient
3. reboot
4. grep config-users /var/log/cloud-init.log

We should see something like:

<First boot where provisioning runs config-users-groups>
2023-02-24 16:46:08,014 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance
2023-02-24 16:46:08,017 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/05a707cf-665b-b547-8fd4-a88de9aa6215/sem/config_users_groups'>)
2023-02-24 16:46:09,851 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully

<Normal subsequent boot where config-users-groups does not run>
2023-03-15 15:14:03,278 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance
2023-03-15 15:14:03,279 - helpers.py[DEBUG]: config-users-groups already ran (freq=once-per-instance)
2023-03-15 15:14:03,279 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups previously ran

<Boot with /bin/false for dhclient, config-users unexpectedly re-runs>
2023-03-15 17:30:49,484 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance
2023-03-15 17:30:49,487 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/iid-datasource/sem/config_users_groups'>)
2023-03-15 17:30:49,491 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully

The upstream patch that should address this failure:
https://github.com/canonical/cloud-init/commit/b861ea8a5e1fd0eb33096f60f54eeff42d80d3bd

Comment 1 Huijuan Zhao 2023-03-22 03:43:51 UTC
Hi Chris,

Thanks for raising the issue and the detail test steps, I can reproduce it[1] per comment 0.

We will rebase to cloud-init-23.1 in the next rhel versions(8.9/9.3) which already included this patch.

And the rhel-8.8/9.2(cloud-init-22.1) are in the exception+ phase now(Mainly fix urgent/important issues), could we fix this BZ in the rhel-8.8/9.2 first batch update? And on which Z streams would you prefer to backport the patch? 

Thanks!
Huijuan


[1]
Steps to Reproduce:
1. create RHEL VM on Azure (config serial console and password login, as can not get IP after demonstration wrong config and reboot)
2. reboot or clean reboot for demonstration purposes
   2.1 mv /usr/sbin/dhclient /usr/sbin/dhclient.x
   2.2 cp /bin/false /usr/sbin/dhclient
3. reboot
4. grep config-users /var/log/cloud-init.log

Actual results:
Failed to get DHCP ip address 

# grep config-users /var/log/cloud-init.log

<First boot where provisioning runs config-users-groups>
2023-03-22 03:02:23,247 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance
2023-03-22 03:02:23,248 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/36716c36-d59d-0842-9daa-850144b550bf/sem/config_users_groups'>)
2023-03-22 03:02:25,119 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully

<Boot with /bin/false for dhclient, config-users unexpectedly re-runs>
2023-03-22 03:13:23,943 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance
2023-03-22 03:13:23,944 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/iid-datasource/sem/config_users_groups'>)
2023-03-22 03:13:24,039 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully

Comment 2 Chris Patterson 2023-03-22 12:02:13 UTC
> could we fix this BZ in the rhel-8.8/9.2 first batch update? And on which Z streams would you prefer to backport the patch?

Whatever makes sense to you, I don't have a strong opinion.  I have seen a limited number of customer reports, but the impact may be quite undesirable in situations where the customer may have rotated credentials only to have their original user/keys re-installed, etc.

Comment 3 Huijuan Zhao 2023-03-27 15:34:27 UTC
Hi Chris, 

Thank you for the update.

I tested the scratch rebase cloud-init-23.1.1 build, the issue is fixed. 
But there are some error logs[3] per the test steps of demonstration purposes in Description, could you please help to take a look if this is expected?

Test results:
[1] After reboot(with /bin/false for dhclient), instance can get IP address successfully.

[2] # grep config-users /var/log/cloud-init.log

<First boot where provisioning runs config-users-groups>
2023-03-27 07:19:21,547 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance
2023-03-27 07:19:21,551 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/2da2ad7d-cc4c-c54d-9df3-fa09d35cb682/sem/config_users_groups'>)

<Boot with /bin/false for dhclient, config-users-groups does not run>
2023-03-27 07:55:59,907 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully
2023-03-27 08:15:49,706 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance
2023-03-27 08:15:49,706 - helpers.py[DEBUG]: config-users-groups already ran (freq=once-per-instance)
2023-03-27 08:15:49,706 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups previously ran

<cloud-init clean, reboot with /bin/false for dhclient, config-users-groups re-run successfully>
2023-03-27 07:19:22,647 - handlers.py[DEBUG]: finish: init-network/config-users-groups: SUCCESS: config-users-groups ran successfully
2023-03-27 07:55:59,841 - handlers.py[DEBUG]: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance
2023-03-27 07:55:59,843 - helpers.py[DEBUG]: Running config-users-groups using lock (<FileLock using file '/var/lib/cloud/instances/2da2ad7d-cc4c-c54d-9df3-fa09d35cb682/sem/config_users_groups'>)


[3] # cat /var/log/cloud-init.log

<The below failed log was printed repeatedly for about 5 mins>
azure.py[ERROR]: Command failed: cmd=['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.        pid', 'eth0', '-sf', '/bin/true'] stderr='' stdout='' exit_code=1 


<Below was the last print>
   7746 2023-03-27 08:15:46,703 - azure.py[ERROR]: Command failed: cmd=['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.        pid', 'eth0', '-sf', '/bin/true'] stderr='' stdout='' exit_code=1
   7747 2023-03-27 08:15:46,704 - handlers.py[DEBUG]: finish: azure-ds/obtain-dhcp-lease: FAIL: obtain dhcp lease
   7748 2023-03-27 08:15:46,704 - handlers.py[DEBUG]: finish: azure-ds/_setup_ephemeral_networking: FAIL: _setup_ephemeral_networking
   7749 2023-03-27 08:15:46,704 - azure.py[DEBUG]: Could not retrieve pps configuration from IMDS: 'extended'
   7750 2023-03-27 08:15:46,704 - azure.py[DEBUG]: Could not retrieve pps configuration from IMDS: 'extended'
   7751 2023-03-27 08:15:46,704 - azure.py[DEBUG]: Could not retrieve pps configuration from IMDS: 'extended'
   7752 2023-03-27 08:15:46,719 - azure.py[INFO]: PPS type: None
   7753 2023-03-27 08:15:46,720 - handlers.py[DEBUG]: start: azure-ds/validate_imds_network_metadata: validate_imds_network_metadata
   7754 2023-03-27 08:15:46,720 - handlers.py[DEBUG]: start: azure-ds/get_hv_netvsc_macs_normalized: get_hv_netvsc_macs_normalized


Thanks
Huijuan

Comment 4 Chris Patterson 2023-03-27 15:41:51 UTC
It looks like there is whitespace in the /run/dhclient.pid string?  That path is hardcoded: https://github.com/canonical/cloud-init/blob/main/cloudinit/net/dhcp.py#L130 vs the string above '/run/dhclient.        pid' -- was there perhaps whitespace introduced in a patch?

Comment 5 Huijuan Zhao 2023-03-27 15:51:17 UTC
Sorry, the whitespace was introduced by the wrong copy.

The error log is like below:

2023-03-27 08:15:46,703 - azure.py[ERROR]: Command failed: cmd=['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', 'eth0', '-sf', '/bin/true'] stderr='' stdout='' exit_code=1

Comment 6 Chris Patterson 2023-03-27 15:54:17 UTC
That seems unexpected.  Can you attach the full log? Which image & vm size are you using?

Comment 7 Huijuan Zhao 2023-03-27 15:58:02 UTC
Created attachment 1953963 [details]
cloud-init.log

Comment 8 Huijuan Zhao 2023-03-27 16:02:54 UTC
(In reply to Chris Patterson from comment #6)
> That seems unexpected.  Can you attach the full log? Which image & vm size
> are you using?

The image is rhel-9.2, vm size is Standard_DS1_v2. Attached the full cloud-init.log, please help to take a look, thanks!

Comment 9 Chris Patterson 2023-03-28 17:55:20 UTC
My mistake.  Testing on RHEL 9.1 with slightly different patch level (22.1-5.el9) didn't have the retry behavior.  It's fine (and expected) for your version.  I expect that the last build in the logs (root202303221206) does not include the backported patch?

Comment 10 Huijuan Zhao 2023-03-29 01:02:03 UTC
(In reply to Chris Patterson from comment #9)
> My mistake.  Testing on RHEL 9.1 with slightly different patch level
> (22.1-5.el9) didn't have the retry behavior.  It's fine (and expected) for
> your version.  I expect that the last build in the logs (root202303221206)
> does not include the backported patch?

Thanks Chris for the update. So the test results in comment 3 is expected, right?

And regarding the "I expect that the last build in the logs (root202303221206) does not include the backported patch?", are you asking if the patch[1] is in the last build(root202303221206)? 
Yes, it is. The build is a rhel rebase 23.1.1 build which already includes the patch[1](which was merged to 22.4 in upstream).

[1] https://github.com/canonical/cloud-init/commit/b861ea8a5e1fd0eb33096f60f54eeff42d80d3bd

Comment 11 Chris Patterson 2023-03-29 19:36:10 UTC
OK, I assume you perhaps ran a cloud-init clean when updating the rpm? Assuming so, it looks good.

If you want to spin up a new instance, upgrade without cleaning cloud-init, force dhclient failure, then send me an updated log I will give it a review.

Comment 12 Huijuan Zhao 2023-03-30 01:34:56 UTC
(In reply to Chris Patterson from comment #11)
> OK, I assume you perhaps ran a cloud-init clean when updating the rpm?
> Assuming so, it looks good.
> 
> If you want to spin up a new instance, upgrade without cleaning cloud-init,
> force dhclient failure, then send me an updated log I will give it a review.

Also meet the error logs when reboot without cleaning cloud-init, the cloud-init.log in comment 7 includes this log(there are total twice reboot in the log, the first reboot with cleaning cloud-init, the second reboot without cleaning cloud-init).

I think maybe this is because this patch[1], cloud-init re-config network on every boot from 21.4, so this is also expected result?
[1] Set Azure to apply networking config every BOOT (#1023)
https://github.com/canonical/cloud-init/commit/dc22786980a05129c5971e68ae37b1a9f76f882d

Comment 13 Chris Patterson 2023-03-30 17:57:33 UTC
You are right, I didn't notice that. Looks good!

Comment 14 Huijuan Zhao 2023-03-30 22:35:10 UTC
(In reply to Chris Patterson from comment #13)
> You are right, I didn't notice that. Looks good!

Thanks Chris for the confirmation. If I understand correctly, the test results in comment 3 are expected, the bug is fixed in the scratch rebase 23.1.1 build. Please correct me if anything wrong.

We will release this fix in the rhel-8.9 and 9.3(with cloud-init-23.1.1). And we need to discuss in team to determine on which Z-stream to backport the fix.

Comment 25 Huijuan Zhao 2023-05-25 07:00:45 UTC
Tested with cloud-init-23.1.1-3.el8 in rhel-8.9, it includes the patch[1] mentioned in description, and the test result is almost same with comment 3. 
Only one difference: Failed to get dhcp ip address in rhel-8.9 after setting "cp /bin/false /usr/sbin/dhclient" and reboot. Can get dhcp ip successful after setting /usr/sbin/dhclient correctly and reboot. 

[1] https://github.com/canonical/cloud-init/commit/b861ea8a5e1fd0eb33096f60f54eeff42d80d3bd

I think this is caused by the negative test "cp /bin/false /usr/sbin/dhclient", here is the log:
--------------
   1350 2023-05-25 06:00:08,725 - dhcp.py[DEBUG]: Performing a dhcp discovery on eth0
   1351 2023-05-25 06:00:08,725 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, captu        re=True)
   1352 2023-05-25 06:00:08,729 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid',         'eth0', '-sf', '/bin/true'] with allowed return codes [0] (shell=False, capture=True)
   1353 2023-05-25 06:00:08,732 - azure.py[ERROR]: Command failed: cmd=['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.        pid', 'eth0', '-sf', '/bin/true'] stderr='' stdout='' exit_code=1
   1354 2023-05-25 06:00:09,733 - __init__.py[DEBUG]: Stable ifnames disabled by net.ifnames=0 in /proc/cmdline
--------------


Moving to VERIFIED, please reopen the BZ if you find there is still issue.

Comment 27 errata-xmlrpc 2023-11-14 15:31:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: cloud-init security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6943


Note You need to log in before you can comment on or make changes to this bug.