Created attachment 1799913 [details]
Serial console log - try 1

After an Azure OpenShift 4.7 IPI installation, the customer deleted a worker machine with 'oc -n openshift-machine-api delete machine <MACHINE>'. Since then, the MCO has not been able to recreate the node. It fails with several issues.

.- First issue (intermittent):

ignition[1003]: failed to open config device: open /dev/disk/by-id/ata-Virtual_CD: no medium found

This looked like a Gen2 machine was being created, but upon inspection:

---
(oc: ocpazrd08) $ oc -n openshift-machine-api get machine
NAME                                               PHASE         TYPE               REGION               ZONE   AGE
ocpazrd08-rh7h8-infra-germanywestcentral1-txt2c    Running       Standard_D8s_v3    germanywestcentral   1      12d
ocpazrd08-rh7h8-infra-germanywestcentral2-kmgr5    Running       Standard_D8s_v3    germanywestcentral   2      12d
ocpazrd08-rh7h8-infra-germanywestcentral3-qwx22    Running       Standard_D8s_v3    germanywestcentral   3      12d
ocpazrd08-rh7h8-master-0                           Running       Standard_D8s_v3    germanywestcentral   3      12d
ocpazrd08-rh7h8-master-1                           Running       Standard_D8s_v3    germanywestcentral   1      12d
ocpazrd08-rh7h8-master-2                           Running       Standard_D8s_v3    germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral1-5c9qf   Running       Standard_E32s_v3   germanywestcentral   1      12d
ocpazrd08-rh7h8-worker-germanywestcentral1-hzt4h   Running       Standard_E32s_v3   germanywestcentral   1      12d
ocpazrd08-rh7h8-worker-germanywestcentral1-t466g   Running       Standard_D4s_v3    germanywestcentral   1      12d
ocpazrd08-rh7h8-worker-germanywestcentral1-x6drr   Running       Standard_E32s_v3   germanywestcentral   1      12d
ocpazrd08-rh7h8-worker-germanywestcentral2-jv49t   Running       Standard_E32s_v3   germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral2-scffd   Running       Standard_D4s_v3    germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral2-v8b2z   Running       Standard_E32s_v3   germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral2-w6qp4   Running       Standard_E32s_v3   germanywestcentral   2      12d
ocpazrd08-rh7h8-worker-germanywestcentral3-kw4xr   Running       Standard_E32s_v3   germanywestcentral   3      12d
ocpazrd08-rh7h8-worker-germanywestcentral3-s56ww   Provisioned   Standard_E32s_v3   germanywestcentral   3      43s
ocpazrd08-rh7h8-worker-germanywestcentral3-tftbg   Running       Standard_E32s_v3   germanywestcentral   3      12d
ocpazrd08-rh7h8-worker-germanywestcentral3-vz74n   Running       Standard_E32s_v3   germanywestcentral   3      12d
---

The machine is actually a V1:

---
az vm get-instance-view -g 014396-ocpazrd08-gwc-dev-rg -n ocpazrd08-rh7h8-worker-germanywestcentral3-s56ww | egrep -i generation
  "hyperVGeneration": "V1",
---

Reviewing the serial output:

.- ata-Virtual_CD seems to be working now on another instance:

---
[   11.523060] systemd[1]: Started Copy CoreOS Firstboot Networking Config.
[  OK  ] Started Copy CoreOS Firstboot Networking Config.
[   11.530599] ignition[1070]: op(1): [started] mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure016828109"
[   11.544844] systemd[1]: Starting dracut initqueue hook...
         Starting dracut initqueue hook...
[   11.567030] UDF-fs: INFO Mounting volume 'UDF Volume', timestamp 2021/07/08 00:00 (1000)
[   11.575413] ignition[1070]: op(1): [finished] mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure016828109"
[   11.586506] systemd[1]: tmp-ignition\x2dazure016828109.mount: Succeeded.
[   11.594714] ignition[1070]: op(2): [started] unmounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure016828109"
[   11.617865] ignition[1070]: op(2): [finished] unmounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure016828109"
---

.- Some issues with the DNS resolver, but it eventually succeeded:
---
[   12.260822] NetworkManager[1109]: <info> [1625656477.9794] dhcp4 (eth0): option next_server => '168.63.129.16'
[   12.269632] ignition[1070]: GET error: Get "https://api-int.ocpazrd08.cloud.internal:22623/config/worker": dial tcp: lookup api-int.ocpazrd08.cloud.internal on [::1]:53: read udp [::1]:57936->[::1]:53: read: connection refused
[   12.284566] NetworkManager[1109]: <info> [1625656477.9794] dhcp4 (eth0): option private_245 => 'a8:3f:81:10'
[   12.293260] NetworkManager[1109]: <info> [1625656477.9794] dhcp4 (eth0): option requested_broadcast_address => '1'
...
[   13.043335] ignition[1070]: GET https://api-int.ocpazrd08.cloud.internal:22623/config/worker: attempt #4
[   13.052241] ignition[1070]: GET error: Get "https://api-int.ocpazrd08.cloud.internal:22623/config/worker": dial tcp: lookup api-int.ocpazrd08.cloud.internal on [::1]:53: read udp [::1]:40542->[::1]:53: read: connection refused
[   14.644480] ignition[1070]: GET https://api-int.ocpazrd08.cloud.internal:22623/config/worker: attempt #5
[   14.655047] ignition[1070]: GET error: Get "https://api-int.ocpazrd08.cloud.internal:22623/config/worker": dial tcp: lookup api-int.ocpazrd08.cloud.internal on [::1]:53: read udp [::1]:40966->[::1]:53: read: connection refused
[*     ] A start job is running for Ignition (fetch) (12s / no limit)
[   17.846598] ignition[1070]: GET https://api-int.ocpazrd08.cloud.internal:22623/config/worker: attempt #6
[   17.916702] ignition[1070]: GET result: OK
[   17.948924] ignition[1070]: warning at $.systemd.units.14.contents, line 1 col 230286: unit "cbk_flush_route_cache.service" is enabled, but has no install section so enable does nothing
[   17.965094] ignition[1070]: fetched base config from "system"
---

Machine provisioning is never completed:

---
oc -n openshift-machine-api get machine
NAME                                               PHASE      TYPE               REGION               ZONE   AGE
ocpazrd08-rh7h8-infra-germanywestcentral1-txt2c    Running    Standard_D8s_v3    germanywestcentral   1      14d
ocpazrd08-rh7h8-infra-germanywestcentral2-kmgr5    Running    Standard_D8s_v3    germanywestcentral   2      14d
ocpazrd08-rh7h8-infra-germanywestcentral3-qwx22    Running    Standard_D8s_v3    germanywestcentral   3      14d
ocpazrd08-rh7h8-master-0                           Running    Standard_D8s_v3    germanywestcentral   3      15d
ocpazrd08-rh7h8-master-1                           Running    Standard_D8s_v3    germanywestcentral   1      15d
ocpazrd08-rh7h8-master-2                           Running    Standard_D8s_v3    germanywestcentral   2      15d
ocpazrd08-rh7h8-worker-germanywestcentral1-5c9qf   Running    Standard_E32s_v3   germanywestcentral   1      14d
ocpazrd08-rh7h8-worker-germanywestcentral1-hzt4h   Running    Standard_E32s_v3   germanywestcentral   1      14d
ocpazrd08-rh7h8-worker-germanywestcentral1-t466g   Running    Standard_D4s_v3    germanywestcentral   1      15d
ocpazrd08-rh7h8-worker-germanywestcentral1-x6drr   Running    Standard_E32s_v3   germanywestcentral   1      14d
ocpazrd08-rh7h8-worker-germanywestcentral2-jv49t   Running    Standard_E32s_v3   germanywestcentral   2      14d
ocpazrd08-rh7h8-worker-germanywestcentral2-scffd   Running    Standard_D4s_v3    germanywestcentral   2      15d
ocpazrd08-rh7h8-worker-germanywestcentral2-v8b2z   Running    Standard_E32s_v3   germanywestcentral   2      14d
ocpazrd08-rh7h8-worker-germanywestcentral2-w6qp4   Running    Standard_E32s_v3   germanywestcentral   2      14d
ocpazrd08-rh7h8-worker-germanywestcentral3-4xr8w   Deleting   Standard_E32s_v3   germanywestcentral   3      19h
ocpazrd08-rh7h8-worker-germanywestcentral3-bg57p   Deleting   Standard_E32s_v3   germanywestcentral   3      19h
ocpazrd08-rh7h8-worker-germanywestcentral3-kw4xr   Running    Standard_E32s_v3   germanywestcentral   3      14d
ocpazrd08-rh7h8-worker-germanywestcentral3-r7xnb                                                             19h
ocpazrd08-rh7h8-worker-germanywestcentral3-tftbg   Running    Standard_E32s_v3   germanywestcentral   3      14d
ocpazrd08-rh7h8-worker-germanywestcentral3-vz74n   Running    Standard_E32s_v3   germanywestcentral   3      14d
---
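For anyone reproducing this, the stuck machine and its serial console can be inspected directly. A minimal sketch (assuming boot diagnostics are enabled on the VM; the resource group and machine names are reused from above, and the machine-api deployment/container names are the usual ones, not taken from this report):

---
# Watch the replacement machine; in this bug it never leaves Provisioned.
oc -n openshift-machine-api get machine -w

# Fetch the VM's serial console log via the Azure CLI
# (only works if boot diagnostics are enabled on the VM).
az vm boot-diagnostics get-boot-log \
  -g 014396-ocpazrd08-gwc-dev-rg \
  -n ocpazrd08-rh7h8-worker-germanywestcentral3-s56ww

# For machines stuck in Provisioned/Deleting, the machine controller
# logs usually show why reconciliation stalls.
oc -n openshift-machine-api logs deployment/machine-api-controllers \
  -c machine-controller | grep germanywestcentral3
---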
Created attachment 1799914 [details]
Serial console - try 2
I believe Gen 2 Azure VMs are not supported today. Passing over to the CoreOS team to take a look at the ignition failures as well.
This is indeed a problem on gen 1. The CD device doesn't exist at all on gen 2.

> ignition[1003]: failed to open config device: open /dev/disk/by-id/ata-Virtual_CD: no medium found

Good find. It appears that the RHCOS Azure checkin (which causes the virtual CD to be removed) is racing with Ignition fetch.

> [ 12.269632] ignition[1070]: GET error: Get "https://api-int.ocpazrd08.cloud.internal:22623/config/worker": dial tcp: lookup api-int.ocpazrd08.cloud.internal on [::1]:53: read udp [::1]:57936->[::1]:53: read: connection refused

This part is probably normal. By design, Ignition may start accessing the network before networking is fully available. It'll retry until it succeeds.
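For context, the fix that eventually landed (verified later in this bug) orders the initramfs Afterburn check-in, which triggers detaching the provisioning ISO, after Ignition's fetch stage. A minimal sketch of checking that ordering on a booted RHCOS node, assuming the dracut module path shown in the verification below:

---
# The check-in unit must start only after Ignition has read its config
# from the virtual CD; otherwise the CD can be ejected mid-fetch.
grep '^After=' \
  /usr/lib/dracut/modules.d/30rhcos-afterburn-checkin/rhcos-afterburn-checkin.service
# Expected once the fix is in the bootimage:
#   After=ignition-fetch.service
#   After=coreos-kargs-reboot.service
---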
This has landed in Git; waiting for bootimage bump.
Boot image bump is merged; moving to MODIFIED.
Thanks, Benjamin, for the confirmation; changing status to VERIFIED.
The fix for this bug will not be delivered to customers until it lands in an updated bootimage. That process is tracked in bug 1981999, which is in state ASSIGNED. Moving this bug back to POST.
The fix for this bug has landed in a bootimage bump, as tracked in bug 1981999 (now in status MODIFIED). Moving this bug to MODIFIED.
Launched 4.9.0-0.nightly-2021-09-25-094414 on Azure:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-25-094414   True        False         58m     Cluster version is 4.9.0-0.nightly-2021-09-25-094414

$ oc get nodes
$ oc debug node/worker-node
sh-4.4# chroot /host
sh-4.4# grep ^After /usr/lib/dracut/modules.d/30rhcos-afterburn-checkin/rhcos-afterburn-checkin.service
After=ignition-fetch.service
After=coreos-kargs-reboot.service
sh-4.4# journalctl -u ignition-fetch | grep -i start; journalctl | grep coreos-kargs-reboot; journalctl -u rhcos-afterburn-checkin | grep -i start
Sep 27 08:00:08 localhost systemd[1]: Starting Ignition (fetch)...
...
Sep 27 08:00:15 localhost systemd[1]: Started Ignition (fetch).
Sep 27 08:00:16 localhost systemd[1]: Starting Afterburn (Check In - from the initramfs)...
Sep 27 08:00:45 localhost systemd[1]: Started Afterburn (Check In - from the initramfs).
sh-4.4# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="49.84.202109241334-0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 49.84.202109241334-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.9/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.9"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.9"
OPENSHIFT_VERSION="4.9"
RHEL_VERSION="8.4"
OSTREE_VERSION='49.84.202109241334-0'
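As a complementary check (a sketch only, not part of the verification above), the journal on the same debug shell should show Ignition finishing its reads of the virtual CD before the Afterburn check-in starts:

---
# Ignition's mount/unmount of the virtual CD should precede the check-in
# on a gen 1 VM; on gen 2 there is no ata-Virtual_CD device at all.
journalctl | grep -E 'ata-Virtual_CD|Afterburn \(Check In'
---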
Thanks @mnguyen; changing status to VERIFIED per Comment 17.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759