Description of problem:
IPI on Azure, with etcd encryption enabled and FIPS enabled. Upgrade from 4.5.6 to 4.6.0-0.nightly-2020-08-18-165040 failed because one node went NotReady.

Version-Release number of selected component (if applicable):
4.5.6 -> 4.6.0-0.nightly-2020-08-18-165040

How reproducible:
Met once so far.

Steps to Reproduce:
1. Set up a 4.5.6 cluster. After setup, openshift-apiserver is not Available, so apply the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1825219#c51, then upgrade.

openshift-apiserver    4.6.0-0.nightly-2020-08-18-165040    False    False    False    8h

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.6     True        False         4h27m   Cluster version is 4.5.6

$ oc get node
NAME                                                STATUS   ROLES    AGE   VERSION
zhsunupgrade819-ctcx8-master-0                      Ready    master   72m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-1                      Ready    master   71m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-2                      Ready    master   72m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-lsccw   Ready    worker   56m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn   Ready    worker   56m   v1.18.3+002a51f

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.6     True        False         False      4h27m
cloud-credential                           4.5.6     True        False         False      4h56m
cluster-autoscaler                         4.5.6     True        False         False      4h43m
config-operator                            4.5.6     True        False         False      4h43m
console                                    4.5.6     True        False         False      4h9m
csi-snapshot-controller                    4.5.6     True        False         False      4h24m
dns                                        4.5.6     True        False         False      4h50m
etcd                                       4.5.6     True        False         False      4h49m
image-registry                             4.5.6     True        False         False      4h34m
ingress                                    4.5.6     True        False         False      4h34m
insights                                   4.5.6     True        False         False      4h44m
kube-apiserver                             4.5.6     True        False         False      4h49m
kube-controller-manager                    4.5.6     True        False         False      4h49m
kube-scheduler                             4.5.6     True        False         False      4h48m
kube-storage-version-migrator              4.5.6     True        False         False      4h21m
machine-api                                4.5.6     True        False         False      4h41m
machine-approver                           4.5.6     True        False         False      4h45m
machine-config                             4.5.6     True        False         False      4h42m
marketplace                                4.5.6     True        False         False      4h19m
monitoring                                 4.5.6     True        False         False      3m13s
network                                    4.5.6     True        False         False      4h51m
node-tuning                                4.5.6     True        False         False      4h51m
openshift-apiserver                        4.5.6     True        False         False      3h22m
openshift-controller-manager               4.5.6     True        False         False      4h44m
openshift-samples                          4.5.6     True        False         False      4h43m
operator-lifecycle-manager                 4.5.6     True        False         False      4h50m
operator-lifecycle-manager-catalog         4.5.6     True        False         False      4h51m
operator-lifecycle-manager-packageserver   4.5.6     True        False         False      4h19m
service-ca                                 4.5.6     True        False         False      4h51m
storage                                    4.5.6     True        False         False      4h44m

2. Upgrade to 4.6.0-0.nightly-2020-08-18-165040:
$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-18-165040 --force --allow-explicit-upgrade

Actual results:
One node is in NotReady status, stuck at "FIPS mode initialized". After rebooting it, the node comes back in Ready state, and the cluster then upgrades to 4.6.0-0.nightly-2020-08-18-165040 successfully.

Before rebooting the NotReady node:

sh-4.4# ssh -i openshift-qe.pem core@zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn
FIPS mode initialized
Red Hat Enterprise Linux CoreOS 45.82.202008101249-0
  Part of OpenShift 4.5, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).
  WARNING: Direct SSH access to machines is not recommended; instead, make
  configuration changes via `machineconfig` objects:
    https://docs.openshift.com/container-platform/4.5/architecture/architecture-rhcos.html

---
[systemd]
Failed Units: 5
  crio-f4f3ecc7efafaab5f8fba7052ae9999ae6e5e449a5af64d4829d26d55fcc770f.scope
  afterburn-checkin.service
  chronyd.service
  systemd-coredump
  systemd-coredump

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.6     True        True          3h42m   Unable to apply 4.6.0-0.nightly-2020-08-18-165040: the cluster operator monitoring has not yet successfully rolled out

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      26s
cloud-credential                           4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
cluster-autoscaler                         4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
config-operator                            4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
console                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h24m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      40m
dns                                        4.5.6                               True        True          False      8h
etcd                                       4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
image-registry                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      42m
ingress                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h26m
insights                                   4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-apiserver                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-controller-manager                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-scheduler                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-18-165040   True        False         False      41m
machine-api                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
machine-approver                           4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
machine-config                             4.5.6                               False       False         True       3h9m
marketplace                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h25m
monitoring                                 4.5.6                               False       False         True       3h23m
network                                    4.5.6                               True        True          True       8h
node-tuning                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h25m
openshift-apiserver                        4.6.0-0.nightly-2020-08-18-165040   True        False         False      139m
openshift-controller-manager               4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h24m
openshift-samples                          4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h25m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-18-165040   True        False         False      166m
service-ca                                 4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
storage                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h26m

$ oc get node
NAME                                                STATUS     ROLES    AGE   VERSION
zhsunupgrade819-ctcx8-master-0                      Ready      master   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-1                      Ready      master   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-2                      Ready      master   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-lsccw   Ready      worker   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn   NotReady   worker   8h    v1.18.3+002a51f

Expected results:
The upgrade is successful.

Additional info:
must-gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.7896037696304830913.zip
cluster: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/107252/artifact/workdir/install-dir/auth/kubeconfig/*view*/
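For reference, this is the usual way FIPS and etcd encryption were turned on for this cluster; a minimal sketch (fips is set in install-config.yaml at install time, and the aescbc patch below is the standard method to enable etcd encryption on 4.5; the exact values in our job's install-config may differ):

# install-config.yaml (excerpt)
fips: true

# enable etcd encryption after install
$ oc patch apiserver cluster --type merge -p '{"spec":{"encryption":{"type":"aescbc"}}}'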
Created attachment 1711968 [details]
coredump in zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn
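For anyone triaging the attachment: the dumps can be listed and inspected on the node with the standard systemd tooling; a sketch from a shell on the node (the PID argument is illustrative):

sh-4.4# coredumpctl list
sh-4.4# coredumpctl info <PID>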
I met a similar issue when upgrading from 4.5.0-0.nightly-2020-08-23-191713 to 4.6.0-0.nightly-2020-08-23-185640: two workers went NotReady during the upgrade.

$ oc get node
NAME                                        STATUS     ROLES    AGE     VERSION
ip-10-0-55-19.us-east-2.compute.internal    NotReady   worker   6h44m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-58-5.us-east-2.compute.internal     Ready      worker   6h45m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-58-77.us-east-2.compute.internal    Ready      master   6h54m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-60-223.us-east-2.compute.internal   Ready      master   6h54m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-77-95.us-east-2.compute.internal    Ready      master   6h54m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-79-138.us-east-2.compute.internal   NotReady   worker   6h45m   v1.19.0-rc.2+3e083ac-dirty

After a manual reboot, the nodes came back to Ready state with @Sunil's help; the related coredumps are attached. The OS version, for the record:

ip-10-0-79-138.us-east-2.compute.internal   Ready   worker   7h29m   v1.19.0-rc.2+3e083ac-dirty   10.0.79.138   <none>   Red Hat Enterprise Linux CoreOS 46.82.202008231640-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-87.rhaos4.6.git5f02fa2.el8-dev
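Note on the reboot workaround: the kubelet is down on a NotReady node, so `oc debug node/...` is not usable and the reboot has to happen out of band. On AWS that can be done through the EC2 API; a sketch (the instance ID is illustrative, and the exact method used here may have differed):

$ aws ec2 reboot-instances --instance-ids i-0123456789abcdef0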
Created attachment 1712329 [details]
0824-coresystemd
There was a recent issue where FIPS enablement raced and could prevent boot: https://bugzilla.redhat.com/show_bug.cgi?id=1862957, fixed by https://github.com/openshift/installer/pull/4066.

*** This bug has been marked as a duplicate of bug 1862957 ***
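For anyone verifying against a build with that fix: a quick way to confirm a node actually came up with FIPS active (a sketch using standard RHEL 8 tooling; any node name from the cluster works):

$ oc debug node/ip-10-0-79-138.us-east-2.compute.internal -- chroot /host cat /proc/sys/crypto/fips_enabled
1   # prints 1 when FIPS is active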