In roughly 4-8% of CI runs, etcd fails to come up. The problem was traced to a race condition that was thought to be fixed, but bootstrap is using an RHCOS image with an outdated kubelet, so the bug is still present. Discussion is mostly in this thread: https://coreos.slack.com/archives/C01CQA76KMX/p1642786244483900

The key conclusion: the 4.10 installer is using Red Hat Enterprise Linux CoreOS 410.84.202112040202-0 for bootstrap, which has the wrong kubelet:

[core@test1-k59fj-bootstrap ~]$ sudo kubelet --version
Kubernetes v1.22.1+6859754

The suspected cause is out-of-date metadata in https://github.com/openshift/installer/blob/release-4.10/data/data/coreos/rhcos.json.
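For anyone wanting to check which build the installer branch currently pins, a minimal sketch follows. It assumes the 4.10 rhcos.json uses the CoreOS stream-metadata layout; the recursive-descent jq query avoids hardcoding exact paths, which may differ between installer releases.

# Dump every "release" field from the pinned bootimage metadata;
# a current pin should show a single, up-to-date RHCOS build ID.
curl -s https://raw.githubusercontent.com/openshift/installer/release-4.10/data/data/coreos/rhcos.json \
  | jq -r '.. | .release? // empty' | sort -u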
This bug has been reported fixed in a new RHCOS build and is ready for QE verification. To mark the bug verified, set the Verified field to Tested. This bug will automatically move to MODIFIED once the fix has landed in a new bootimage.
Preverified on RHCOS 410.84.202201241447-0:

[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
● ostree://064c92e49da0e5dd9dbc5ca8be7495ffb3f703e0f2c55c5b7a59d17d19d35a2b
    Version: 410.84.202201241447-0 (2022-01-24T14:51:10Z)

[core@cosa-devsh ~]$ rpm -qa | grep kube
openshift-hyperkube-4.10.0-202201230027.p0.g06791f6.assembly.stream.el8.x86_64

[core@cosa-devsh ~]$ kubelet --version
Kubernetes v1.23.0+06791f6
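A small sketch like the following could automate this spot check on future builds. EXPECTED is a placeholder for illustration, not a value from this bug; set it to the kubelet version the build under test is supposed to ship.

# Hedged sketch: fail fast if the node's kubelet does not match the
# version expected for this RHCOS build. EXPECTED is illustrative.
EXPECTED="v1.23.0"
ACTUAL="$(kubelet --version | awk '{print $2}')"
case "$ACTUAL" in
  "$EXPECTED"*) echo "ok: kubelet $ACTUAL" ;;
  *) echo "mismatch: got $ACTUAL, want ${EXPECTED}*" >&2; exit 1 ;;
esac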
The fix for this bug has landed in a bootimage bump, as tracked in bug 2043297 (now in status MODIFIED). Moving this bug to MODIFIED.
Verified on 4.10.0-0.nightly-2022-02-02-000921:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-02-000921   True        False         3m50s   Cluster version is 4.10.0-0.nightly-2022-02-02-000921

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-b7fit7k-72292-4wkqh-master-0         Ready    master   23m   v1.23.3+b63be7f
ci-ln-b7fit7k-72292-4wkqh-master-1         Ready    master   23m   v1.23.3+b63be7f
ci-ln-b7fit7k-72292-4wkqh-master-2         Ready    master   23m   v1.23.3+b63be7f
ci-ln-b7fit7k-72292-4wkqh-worker-a-82gp9   Ready    worker   16m   v1.23.3+b63be7f
ci-ln-b7fit7k-72292-4wkqh-worker-b-fvvls   Ready    worker   14m   v1.23.3+b63be7f
ci-ln-b7fit7k-72292-4wkqh-worker-c-jnrnq   Ready    worker   14m   v1.23.3+b63be7f

$ oc debug node/ci-ln-b7fit7k-72292-4wkqh-worker-a-82gp9
Starting pod/ci-ln-b7fit7k-72292-4wkqh-worker-a-82gp9-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# kubelet --version
Kubernetes v1.23.3+b63be7f
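As a quicker alternative to debugging individual nodes, the kubelet version each node registered with can be read straight from node status. The sketch below prints the distinct versions across the cluster; a single line of output means every node runs the same kubelet.

# Print the set of kubelet versions reported in .status.nodeInfo;
# one line of output means all nodes agree.
oc get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}' \
  | tr ' ' '\n' | sort -u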
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056