Bug 1803928

Summary: [RHEL8.3] Race condition of starting cloud-init and NetworkManager
Product: Red Hat Enterprise Linux 8 Reporter: Ladislav Jozsa <ljozsa>
Component: cloud-initAssignee: Eduardo Otubo <eterrell>
Status: CLOSED ERRATA QA Contact: xiachen
Severity: high Docs Contact:
Priority: high    
Version: 8.2CC: eterrell, ftrivino, huzhao, jgreguske, jwboyer, kaolivei, lmiksik, pvoborni, ribarry, vkuznets, vpoliset, xiachen, yacao, yuxisun
Target Milestone: rcKeywords: ZStream
Target Release: 8.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: cloud-init-19.4-3.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1826262 1831646 2006693 (view as bug list) Environment:
Last Closed: 2020-11-04 02:40:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1789719, 1826262, 1831646    

Description Ladislav Jozsa 2020-02-17 17:49:59 UTC
Description of problem:
A race condition occurs when starting cloud-init. Sometimes it's started before NetworkManager starts resulting in unreachable instance when running in OpenStack.

Version-Release number of selected component (if applicable):
cloud-init-18.5-9.src.rpm (RHEL-8.2.0-20200207.n.0)

How reproducible:
sometimes

Steps to Reproduce:
1. Launch an instance on OpenStack (http://rhos-d.infra.prod.upshift.rdu2.redhat.com) with RHEL-8.2.0-20200207.n.0 image, can be the same as soon as cloud-init matches
2. Attempt to log in (or check openstack console log show <uuid>)


Actual results:
Sometimes we get into non-working state:

[[0;32m  OK  [0m] Reached target Network.
         Starting Dynamic System Tuning Daemon...
         Starting Hostname Service...
[[0;32m  OK  [0m] Started Hostname Service.
         Starting Network Manager Script Dispatcher Service...
[   22.082582] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   22.092830] cloud-init[797]: Cloud-init v. 18.5 running 'init' at Tue, 11 Feb 2020 17:24:51 +0000. Up 21.99 seconds.
[   22.096952] cloud-init[797]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[   22.100207] cloud-init[797]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[   22.103408] cloud-init[797]: ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
[   22.106786] cloud-init[797]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[   22.109515] cloud-init[797]: ci-info: |  eth0  | False |     .     |     .     |   .   | fa:16:3e:6d:fe:0c |
[   22.112389] cloud-init[797]: ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
[   22.115274] cloud-init[797]: ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |
[   22.118561] cloud-init[797]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[   22.121644] cloud-init[797]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[   22.124375] cloud-init[797]: ci-info: +-------+-------------+---------+-----------+-------+
[   22.127132] cloud-init[797]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[   22.129836] cloud-init[797]: ci-info: +-------+-------------+---------+-----------+-------+
[   22.132641] cloud-init[797]: ci-info: +-------+-------------+---------+-----------+-------+
[   22.135363] cloud-init[797]: 2020-02-11 17:24:51,261 - util.py[WARNING]: No active metadata service found
[[0;32m  OK  [0m] Started Network Manager Script Dispatcher Service.
[[0;32m  OK  [0m] Started Network Manager Wait Online.
[[0;32m  OK  [0m] Started Dynamic System Tuning Daemon.
[[0;32m  OK  [0m] Started Initial cloud-init job (metadata service crawler).
         Starting OpenSSH server daemon...
[[0;32m  OK  [0m] Reached target Network is Online.
         Starting System Logging Service...
         Starting Crash recovery kernel arming...
         Starting Notify NFS peers of a restart...
         Starting Permit User Sessions...


Sometimes it works:

[[0;32m  OK  [0m] Started Hostname Service.
         Starting Network Manager Script Dispatcher Service...
[   11.516135] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[[0;32m  OK  [0m] Started Network Manager Script Dispatcher Service.
[[0;32m  OK  [0m] Started Network Manager Wait Online.
[[0;32m  OK  [0m] Started Dynamic System Tuning Daemon.
[   12.360228] cloud-init[742]: Cloud-init v. 18.5 running 'init' at Wed, 12 Feb 2020 11:05:46 +0000. Up 11.97 seconds.
[   12.363435] cloud-init[742]: ci-info: +++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
[   12.366925] cloud-init[742]: ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
[   12.370133] cloud-init[742]: ci-info: | Device |  Up  |           Address            |      Mask     | Scope  |     Hw-Address    |
[   12.373383] cloud-init[742]: ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
[   12.376795] cloud-init[742]: ci-info: |  eth0  | True |         10.0.154.137         | 255.255.252.0 | global | fa:16:3e:b5:f9:03 |
[   12.380451] cloud-init[742]: ci-info: |  eth0  | True | fe80::f816:3eff:feb5:f903/64 |       .       |  link  | fa:16:3e:b5:f9:03 |
[   12.384549] cloud-init[742]: ci-info: |   lo   | True |          127.0.0.1           |   255.0.0.0   |  host  |         .         |
[   12.387948] cloud-init[742]: ci-info: |   lo   | True |           ::1/128            |       .       |  host  |         .         |
[   12.391372] cloud-init[742]: ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
[   12.394757] cloud-init[742]: ci-info: ++++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++++
[   12.397952] cloud-init[742]: ci-info: +-------+-----------------+--------------+-----------------+-----------+-------+
[   12.401026] cloud-init[742]: ci-info: | Route |   Destination   |   Gateway    |     Genmask     | Interface | Flags |
[   12.404052] cloud-init[742]: ci-info: +-------+-----------------+--------------+-----------------+-----------+-------+
[   12.407339] cloud-init[742]: ci-info: |   0   |     0.0.0.0     | 10.0.155.254 |     0.0.0.0     |    eth0   |   UG  |
[   12.410376] cloud-init[742]: ci-info: |   1   |    10.0.152.0   |   0.0.0.0    |  255.255.252.0  |    eth0   |   U   |
[   12.413578] cloud-init[742]: ci-info: |   2   | 169.254.169.254 | 10.0.152.41  | 255.255.255.255 |    eth0   |  UGH  |
[   12.416569] cloud-init[742]: ci-info: +-------+-----------------+--------------+-----------------+-----------+-------+
[   12.419499] cloud-init[742]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[   12.421897] cloud-init[742]: ci-info: +-------+-------------+---------+-----------+-------+
[   12.424975] cloud-init[742]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[   12.428249] cloud-init[742]: ci-info: +-------+-------------+---------+-----------+-------+
[   12.431559] cloud-init[742]: ci-info: |   1   |  fe80::/64  |    ::   |    eth0   |   U   |
[   12.434824] cloud-init[742]: ci-info: |   3   |   ff00::/8  |    ::   |    eth0   |   U   |
[   12.438151] cloud-init[742]: ci-info: +-------+-------------+---------+-----------+-------+
[[0;32m  OK  [0m] Started Initial cloud-init job (metadata service crawler).
         Starting Permit User Sessions...
         Starting OpenSSH server daemon...
[[0;32m  OK  [0m] Reached target Cloud-config availability.
[[0;32m  OK  [0m] Reached target Network is Online.
         Starting Crash recovery kernel arming...
         Starting System Logging Service...
         Starting Notify NFS peers of a restart...
         Starting Apply the settings specified in cloud-config...

If it is not working and we attempt to log in as cloud-user.155.12 we're getting:


cloud-user.upshift.rdu2.redhat.com: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

Expected results:

Able to ssh in. Network manager started before cloud-init.


Additional info:
If you need any assistance with OpenStack please reach out to me or kaolivei.

Comment 34 Trivino 2020-04-21 10:33:01 UTC
rhel-guest-image-8.2-bug1803928-0420 works for me.

I disabled the workaround and configured a simple pipeline where I spawn tons of instances using the scratch image. All instances (~300 VMs) are accessible.

These are the pipelines:

https://idm-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/blue/organizations/jenkins/ftrivino-dev-pipeline/detail/bug1803928/3/pipeline/33

https://idm-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/blue/organizations/jenkins/ftrivino-dev-pipeline/detail/bug1803928/4/pipeline/


Additional info.
-----------------

The console log for a particular instance using rhel-guest-image-8.2-bug1803928-0420 image shows: 

...
...
[[0;32m  OK  [0m] Stopped OpenSSH server daemon.
[[0;32m  OK  [0m] Stopped target sshd-keygen.target.
         Stopping sshd-keygen.target.
[[0;32m  OK  [0m] Reached target sshd-keygen.target.
         Starting OpenSSH server daemon...
[[0;32m  OK  [0m] Started OpenSSH server daemon.
[   30.249663] cloud-init[942]: Cloud-init v. 18.5 running 'modules:config' at Tue, 21 Apr 2020 10:09:40 +0000. Up 29.85 seconds.
[[0;32m  OK  [0m] Started Apply the settings specified in cloud-config.
         Starting Execute cloud user/final scripts...
ci-info: +++Authorized keys from /home/cloud-user/.ssh/authorized_keys for user cloud-user+++
ci-info: +---------+-------------------------------------------------+---------+------------+
ci-info: | Keytype |                Fingerprint (md5)                | Options |  Comment   |
ci-info: +---------+-------------------------------------------------+---------+------------+
ci-info: | ssh-rsa | 35:e4:6b:01:0d:6f:9b:66:d0:5b:bc:65:09:c7:eb:b1 |    -    | jslave-idm |
ci-info: +---------+-------------------------------------------------+---------+------------+
<14>Apr 21 06:09:42 ec2: 
<14>Apr 21 06:09:42 ec2: #############################################################
<14>Apr 21 06:09:42 ec2: -----BEGIN SSH HOST KEY FINGERPRINTS-----
<14>Apr 21 06:09:42 ec2: 256 SHA256:poGl7+PHFJZ55uH7WYK6dTrXoZEmOmgVXMhaRaaQWI0 no comment (ECDSA)
<14>Apr 21 06:09:42 ec2: 256 SHA256:3R5CUaSIsWWiexq/bGKMHGBj1ANteO+AtXsZpYXXMu8 no comment (ED25519)
<14>Apr 21 06:09:42 ec2: 3072 SHA256:mN/JhUeBlvnS2+EBAjJbQFyALPjp6bMd6iWJQlHBMS8 no comment (RSA)
<14>Apr 21 06:09:42 ec2: -----END SSH HOST KEY FINGERPRINTS-----
<14>Apr 21 06:09:42 ec2: #############################################################
-----BEGIN SSH HOST KEY KEYS-----
...
...
-----END SSH HOST KEY KEYS-----
[   31.199284] cloud-init[1343]: Cloud-init v. 18.5 running 'modules:final' at Tue, 21 Apr 2020 10:09:41 +0000. Up 30.94 seconds.
[   31.204040] cloud-init[1343]: Cloud-init v. 18.5 finished at Tue, 21 Apr 2020 10:09:42 +0000. Datasource DataSourceOpenStack [net,ver=2].  Up 31.17 seconds
[   31.282796] echo[1911]: try restart NetworkManager.service
[[0;32m  OK  [0m] Stopped Network Manager Wait Online.
         Stopping Network Manager Wait Online...
         Stopping Network Manager...
[[0;32m  OK  [0m] Stopped Network Manager.
         Starting Network Manager...
[[0;32m  OK  [0m] Started Network Manager.
         Starting Network Manager Wait Online...
[[0;32m  OK  [0m] Started Execute cloud user/final scripts.
[[0;32m  OK  [0m] Reached target Cloud-init target.
[[0;32m  OK  [0m] Started Network Manager Wait Online.

Red Hat Enterprise Linux 8.2 (Ootpa)
Kernel 4.18.0-193.el8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

bug1 login: 


ftrivino@t480s:~>openstack server show bug1.ipadomain.test_c7aef6 

+-----------------------------+-----------------------------------------------------------------------------+
| Field                       | Value                                                                       |
+-----------------------------+-----------------------------------------------------------------------------+
| OS-DCF:diskConfig           | MANUAL                                                                      |
| OS-EXT-AZ:availability_zone | nova                                                                        |
| OS-EXT-STS:power_state      | Running                                                                     |
| OS-EXT-STS:task_state       | None                                                                        |
| OS-EXT-STS:vm_state         | active                                                                      |
| OS-SRV-USG:launched_at      | 2020-04-21T10:09:07.000000                                                  |
| OS-SRV-USG:terminated_at    | None                                                                        |
| accessIPv4                  |                                                                             |
| accessIPv6                  |                                                                             |
| addresses                   | provider_net_cci_6=10.0.154.218                                             |
| config_drive                |                                                                             |
| created                     | 2020-04-21T10:08:50Z                                                        |
| flavor                      | ci.m1.medium (df66ba33-2528-4cdd-9a01-f770961bc25b)                         |
| hostId                      | 6589ef6f29b0136fa57af8a7856b997e2af00991cd8e33a32dbf6173                    |
| id                          | 67cdda7f-07bd-4408-b42f-9540cce6f562                                        |
| image                       | rhel-guest-image-8.2-bug1803928-0420 (9cfc3340-bf9d-412a-a4a9-ef3aab889fd4) |
| key_name                    | idm-jenkins                                                                 |
| name                        | bug1.ipadomain.test_c7aef6                                                  |
| progress                    | 0                                                                           |
| project_id                  | d11b9f000a1a4682905f244022d9c995                                            |
| properties                  |                                                                             |
| security_groups             | name='default'                                                              |
| status                      | ACTIVE                                                                      |
| updated                     | 2020-04-21T10:09:07Z                                                        |
| user_id                     | 8168cf6928354163271027d8f7eea126703fec4f1717195acbef886b4db3b03e            |
| volumes_attached            |                                                                             |
+-----------------------------+-----------------------------------------------------------------------------+

Comment 38 Rick Barry 2020-04-21 13:00:11 UTC
*** Bug 1789719 has been marked as a duplicate of this bug. ***

Comment 50 Trivino 2020-06-10 14:42:39 UTC
I have tested RHEL-8.3.0-x86_64-nightly-latest that points to RHEL-8.3.0-20200608.n.0 (one compose after the one is solving the problem) with _no WA_ here:

https://idm-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/blue/organizations/jenkins/load_pipeline/detail/master/37/pipeline/28

the test provisioned > 40 VMs with no issues at all.

I verify the bug fix is in place and works fine.

Comment 54 errata-xmlrpc 2020-11-04 02:40:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: cloud-init security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4650