Bug 1980679 - On an Azure IPI installation MCO fails to create new nodes
Summary: On an Azure IPI installation MCO fails to create new nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Benjamin Gilbert
QA Contact: HuijingHei
URL:
Whiteboard:
Depends On: 1981999
Blocks: 1982002 1982003 1982004
 
Reported: 2021-07-09 08:54 UTC by Victor Medina
Modified: 2021-10-18 17:39 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned as: 1982002
Environment:
Last Closed: 2021-10-18 17:39:01 UTC
Target Upstream Version:
Embargoed:


Attachments
Serial console log - try 1 (93.75 KB, text/plain)
2021-07-09 08:54 UTC, Victor Medina


Links
GitHub openshift/os pull 580 (open): rhcos-afterburn-checkin: wait for Ignition fetch and kargs stages (last updated 2021-07-13 23:10:42 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:39:08 UTC)

Description Victor Medina 2021-07-09 08:54:30 UTC
Created attachment 1799913 [details]
Serial console log - try 1

After an Azure OpenShift 4.7 IPI installation, the client deleted a worker node machine with 'oc -n openshift-machine-api delete machine <MACHINE>'.

Since then, MCO has not been able to recreate a new node. It complains about several issues:
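
For reference, the flow that triggers the problem can be reproduced roughly as follows (a sketch; the machine names are placeholders):

---
# Delete a worker machine; the machine-api controller creates a replacement
oc -n openshift-machine-api delete machine <MACHINE>

# Watch the replacement; in this case it never progresses past Provisioned
oc -n openshift-machine-api get machine -w

# Inspect the stuck machine for provider-side errors
oc -n openshift-machine-api describe machine <NEW-MACHINE>
---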

.- First issue: intermittent...

ignition[1003]: failed to open config device: open /dev/disk/by-id/ata-Virtual_CD: no medium found

It looked like it was creating a Gen2 machine, but upon inspection:

---
(oc: ocpazrd08) $ oc -n openshift-machine-api get machine
NAME                                               PHASE         TYPE               REGION               ZONE   AGE
ocpazrd08-rh7h8-infra-germanywestcentral1-txt2c    Running       Standard_D8s_v3    germanywestcentral   1      12d
ocpazrd08-rh7h8-infra-germanywestcentral2-kmgr5    Running       Standard_D8s_v3    germanywestcentral   2      12d
ocpazrd08-rh7h8-infra-germanywestcentral3-qwx22    Running       Standard_D8s_v3    germanywestcentral   3      12d
ocpazrd08-rh7h8-master-0                           Running       Standard_D8s_v3    germanywestcentral   3      12d
ocpazrd08-rh7h8-master-1                           Running       Standard_D8s_v3    germanywestcentral   1      12d
ocpazrd08-rh7h8-master-2                           Running       Standard_D8s_v3    germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral1-5c9qf   Running       Standard_E32s_v3   germanywestcentral   1      12d
ocpazrd08-rh7h8-worker-germanywestcentral1-hzt4h   Running       Standard_E32s_v3   germanywestcentral   1      12d
ocpazrd08-rh7h8-worker-germanywestcentral1-t466g   Running       Standard_D4s_v3    germanywestcentral   1      12d
ocpazrd08-rh7h8-worker-germanywestcentral1-x6drr   Running       Standard_E32s_v3   germanywestcentral   1      12d
ocpazrd08-rh7h8-worker-germanywestcentral2-jv49t   Running       Standard_E32s_v3   germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral2-scffd   Running       Standard_D4s_v3    germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral2-v8b2z   Running       Standard_E32s_v3   germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral2-w6qp4   Running       Standard_E32s_v3   germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral3-kw4xr   Running       Standard_E32s_v3   germanywestcentral   3      12d
ocpazrd08-rh7h8-worker-germanywestcentral3-s56ww   Provisioned   Standard_E32s_v3   germanywestcentral   3      43s
ocpazrd08-rh7h8-worker-germanywestcentral3-tftbg   Running       Standard_E32s_v3   germanywestcentral   3      12d
ocpazrd08-rh7h8-worker-germanywestcentral3-vz74n   Running       Standard_E32s_v3   germanywestcentral   3      12d
---

The machine is actually a V1:

---
az vm get-instance-view -g 014396-ocpazrd08-gwc-dev-rg -n ocpazrd08-rh7h8-worker-germanywestcentral3-s56ww |egrep -i generation
    "hyperVGeneration": "V1",
---
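
The serial console output reviewed below can be pulled with the Azure CLI (a sketch, assuming boot diagnostics are enabled on the VM; the VM name is a placeholder):

---
# Enable boot diagnostics if needed, then fetch the serial log
az vm boot-diagnostics enable -g 014396-ocpazrd08-gwc-dev-rg -n <VM-NAME>
az vm boot-diagnostics get-boot-log -g 014396-ocpazrd08-gwc-dev-rg -n <VM-NAME>
---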


By reviewing the serial output...

.- ata-Virtual_CD seems to be working now on another instance:
---
[   11.523060] systemd[1]: Started Copy CoreOS Firstboot Networking Config.
[  OK  ] Started Copy CoreOS Firstboot Networking Config.
[   11.530599] ignition[1070]: op(1): [started]  mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure016828109"
[   11.544844] systemd[1]: Starting dracut initqueue hook...
         Starting dracut initqueue hook...
[   11.567030] UDF-fs: INFO Mounting volume 'UDF Volume', timestamp 2021/07/08 00:00 (1000)
[   11.575413] ignition[1070]: op(1): [finished] mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure016828109"
[   11.586506] systemd[1]: tmp-ignition\x2dazure016828109.mount: Succeeded.
[   11.594714] ignition[1070]: op(2): [started]  unmounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure016828109"
[   11.617865] ignition[1070]: op(2): [finished] unmounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure016828109"
---

.- Some issues with the DNS resolver? But it eventually succeeded:
---
[   12.260822] NetworkManager[1109]: <info>  [1625656477.9794] dhcp4 (eth0): option next_server          => '168.63.129.16'
[   12.269632] ignition[1070]: GET error: Get "https://api-int.ocpazrd08.cloud.internal:22623/config/worker": dial tcp: lookup api-int.ocpazrd08.cloud.internal on [::1]:53: read udp [::1]:57936->[::1]:53: read: connection refused
[   12.284566] NetworkManager[1109]: <info>  [1625656477.9794] dhcp4 (eth0): option private_245          => 'a8:3f:81:10'
[   12.293260] NetworkManager[1109]: <info>  [1625656477.9794] dhcp4 (eth0): option requested_broadcast_address => '1'

...

[   13.043335] ignition[1070]: GET https://api-int.ocpazrd08.cloud.internal:22623/config/worker: attempt #4
[   13.052241] ignition[1070]: GET error: Get "https://api-int.ocpazrd08.cloud.internal:22623/config/worker": dial tcp: lookup api-int.ocpazrd08.cloud.internal on [::1]:53: read udp [::1]:40542->[::1]:53: read: connection refused
[   14.644480] ignition[1070]: GET https://api-int.ocpazrd08.cloud.internal:22623/config/worker: attempt #5
[   14.655047] ignition[1070]: GET error: Get "https://api-int.ocpazrd08.cloud.internal:22623/config/worker": dial tcp: lookup api-int.ocpazrd08.cloud.internal on [::1]:53: read udp [::1]:40966->[::1]:53: read: connection refused
[*     ] A start job is running for Ignition (fetch) (12s / no limit)
[   17.846598] ignition[1070]: GET https://api-int.ocpazrd08.cloud.internal:22623/config/worker: attempt #6
[   17.916702] ignition[1070]: GET result: OK
[   17.948924] ignition[1070]: warning at $.systemd.units.14.contents, line 1 col 230286: unit "cbk_flush_route_cache.service" is enabled, but has no install section so enable does nothing
[   17.965094] ignition[1070]: fetched base config from "system"
---

Machine provisioning never completes.

---
oc -n openshift-machine-api get machine
NAME                                               PHASE      TYPE               REGION               ZONE   AGE
ocpazrd08-rh7h8-infra-germanywestcentral1-txt2c    Running    Standard_D8s_v3    germanywestcentral   1      14d
ocpazrd08-rh7h8-infra-germanywestcentral2-kmgr5    Running    Standard_D8s_v3    germanywestcentral   2      14d
ocpazrd08-rh7h8-infra-germanywestcentral3-qwx22    Running    Standard_D8s_v3    germanywestcentral   3      14d
ocpazrd08-rh7h8-master-0                           Running    Standard_D8s_v3    germanywestcentral   3      15d
ocpazrd08-rh7h8-master-1                           Running    Standard_D8s_v3    germanywestcentral   1      15d
ocpazrd08-rh7h8-master-2                           Running    Standard_D8s_v3    germanywestcentral   2      15d
ocpazrd08-rh7h8-worker-germanywestcentral1-5c9qf   Running    Standard_E32s_v3   germanywestcentral   1      14d
ocpazrd08-rh7h8-worker-germanywestcentral1-hzt4h   Running    Standard_E32s_v3   germanywestcentral   1      14d
ocpazrd08-rh7h8-worker-germanywestcentral1-t466g   Running    Standard_D4s_v3    germanywestcentral   1      15d
ocpazrd08-rh7h8-worker-germanywestcentral1-x6drr   Running    Standard_E32s_v3   germanywestcentral   1      14d
ocpazrd08-rh7h8-worker-germanywestcentral2-jv49t   Running    Standard_E32s_v3   germanywestcentral   2      14d
ocpazrd08-rh7h8-worker-germanywestcentral2-scffd   Running    Standard_D4s_v3    germanywestcentral   2      15d
ocpazrd08-rh7h8-worker-germanywestcentral2-v8b2z   Running    Standard_E32s_v3   germanywestcentral   2      14d
ocpazrd08-rh7h8-worker-germanywestcentral2-w6qp4   Running    Standard_E32s_v3   germanywestcentral   2      14d
ocpazrd08-rh7h8-worker-germanywestcentral3-4xr8w   Deleting   Standard_E32s_v3   germanywestcentral   3      19h
ocpazrd08-rh7h8-worker-germanywestcentral3-bg57p   Deleting   Standard_E32s_v3   germanywestcentral   3      19h
ocpazrd08-rh7h8-worker-germanywestcentral3-kw4xr   Running    Standard_E32s_v3   germanywestcentral   3      14d
ocpazrd08-rh7h8-worker-germanywestcentral3-r7xnb                                                             19h
ocpazrd08-rh7h8-worker-germanywestcentral3-tftbg   Running    Standard_E32s_v3   germanywestcentral   3      14d
ocpazrd08-rh7h8-worker-germanywestcentral3-vz74n   Running    Standard_E32s_v3   germanywestcentral   3      14d
---

Comment 1 Victor Medina 2021-07-09 08:55:38 UTC
Created attachment 1799914 [details]
Serial console - try 2

Comment 2 Yu Qi Zhang 2021-07-13 17:26:14 UTC
I believe Gen 2 Azure VMs are not supported today. Passing over to the CoreOS team to take a look at the Ignition failures as well.

Comment 4 Benjamin Gilbert 2021-07-13 18:07:40 UTC
This is indeed a problem on gen 1.  The CD device doesn't exist at all on gen 2.

> ignition[1003]: failed to open config device: open /dev/disk/by-id/ata-Virtual_CD: no medium found

Good find.  It appears that the RHCOS Azure checkin (which causes the virtual CD to be removed) is racing with Ignition fetch.

> [   12.269632] ignition[1070]: GET error: Get "https://api-int.ocpazrd08.cloud.internal:22623/config/worker": dial tcp: lookup api-int.ocpazrd08.cloud.internal on [::1]:53: read udp [::1]:57936->[::1]:53: read: connection refused

This part is probably normal.  By design, Ignition may start accessing the network before networking is fully available.  It'll retry until it succeeds.
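
The linked fix (openshift/os pull 580) addresses the race by ordering the checkin unit after the Ignition fetch and kargs stages, so the provisioning CD is only detached once Ignition has read its config. A sketch of the relevant ordering, matching what the verification in Comment 17 later shows:

---
# /usr/lib/dracut/modules.d/30rhcos-afterburn-checkin/rhcos-afterburn-checkin.service (excerpt)
[Unit]
# Do not check in (which detaches the provisioning CD) until Ignition has
# fetched its config and any kargs-triggered reboot has completed
After=ignition-fetch.service
After=coreos-kargs-reboot.service
---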

Comment 6 Benjamin Gilbert 2021-07-15 20:44:26 UTC
This has landed in Git; waiting for bootimage bump.

Comment 7 Micah Abbott 2021-08-27 13:49:59 UTC
Boot image bump is merged; moving to MODIFIED.

Comment 12 HuijingHei 2021-09-02 02:01:50 UTC
Thanks Benjamin for your confirmation; changing status to VERIFIED.

Comment 13 RHCOS Bug Bot 2021-09-02 16:36:31 UTC
The fix for this bug will not be delivered to customers until it lands in an updated bootimage.  That process is tracked in bug 1981999, which is in state ASSIGNED.  Moving this bug back to POST.

Comment 14 RHCOS Bug Bot 2021-09-22 18:37:26 UTC
The fix for this bug has landed in a bootimage bump, as tracked in bug 1981999 (now in status MODIFIED).  Moving this bug to MODIFIED.

Comment 17 HuijingHei 2021-09-27 09:23:27 UTC
Launched 4.9.0-0.nightly-2021-09-25-094414 on Azure:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-25-094414   True        False         58m     Cluster version is 4.9.0-0.nightly-2021-09-25-094414

$ oc get nodes
$ oc debug node/worker-node
sh-4.4# chroot /host
sh-4.4# grep ^After /usr/lib/dracut/modules.d/30rhcos-afterburn-checkin/rhcos-afterburn-checkin.service
After=ignition-fetch.service
After=coreos-kargs-reboot.service

sh-4.4# journalctl -u ignition-fetch | grep -i start; journalctl | grep coreos-kargs-reboot; journalctl -u rhcos-afterburn-checkin | grep -i start
Sep 27 08:00:08 localhost systemd[1]: Starting Ignition (fetch)...
...
Sep 27 08:00:15 localhost systemd[1]: Started Ignition (fetch).
Sep 27 08:00:16 localhost systemd[1]: Starting Afterburn (Check In - from the initramfs)...
Sep 27 08:00:45 localhost systemd[1]: Started Afterburn (Check In - from the initramfs).

sh-4.4# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="49.84.202109241334-0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 49.84.202109241334-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.9/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.9"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.9"
OPENSHIFT_VERSION="4.9"
RHEL_VERSION="8.4"
OSTREE_VERSION='49.84.202109241334-0'

Comment 18 HuijingHei 2021-09-27 13:31:58 UTC
Thanks @mnguyen; changing status to VERIFIED according to Comment 17.

Comment 20 errata-xmlrpc 2021-10-18 17:39:01 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

