Created attachment 1737973 [details]
cluster-monitoring-operator describe

Version:
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.6.0-0.nightly-2020-12-08-021151
built from commit f5ba6239853f0904704c04d8b1c04c78172f1141
release image registry.svc.ci.openshift.org/ocp/release@sha256:bd84091070e50e41cd30bcda6c6bd2b821ad48a0ee9aa7637165db31e7ad51dd

Platform:
IPI Baremetal

What happened?
After the deploy finished and was reported successful, "oc get clusterversion" returns an error:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-08-021151   True        False         43m     Error while reconciling 4.6.0-0.nightly-2020-12-08-021151: the workload openshift-monitoring/cluster-monitoring-operator has not yet successfully rolled out

The cluster-monitoring-operator pod is stuck in CreateContainerConfigError status (the kube-rbac-proxy container failed to start - see the attached cluster-monitoring-operator describe):

$ oc get pods -n openshift-monitoring
NAME                                           READY   STATUS                       RESTARTS   AGE
cluster-monitoring-operator-866c9df665-tpm9m   1/2     CreateContainerConfigError   0          100m

The pod events report:
"Error: container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root"

All operators are reported as Available:

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      61m
cloud-credential                           4.6.0-0.nightly-2020-12-08-021151   True        False         False      108m
cluster-autoscaler                         4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
config-operator                            4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
console                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      65m
csi-snapshot-controller                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
dns                                        4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
etcd                                       4.6.0-0.nightly-2020-12-08-021151   True        False         False      93m
image-registry                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      58m
ingress                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      70m
insights                                   4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
kube-apiserver                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
kube-controller-manager                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
kube-scheduler                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
kube-storage-version-migrator              4.6.0-0.nightly-2020-12-08-021151   True        False         False      70m
machine-api                                4.6.0-0.nightly-2020-12-08-021151   True        False         False      79m
machine-approver                           4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
machine-config                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
marketplace                                4.6.0-0.nightly-2020-12-08-021151   True        False         False      93m
monitoring                                 4.6.0-0.nightly-2020-12-08-021151   True        False         False      70m
network                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
node-tuning                                4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
openshift-apiserver                        4.6.0-0.nightly-2020-12-08-021151   True        False         False      74m
openshift-controller-manager               4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
openshift-samples                          4.6.0-0.nightly-2020-12-08-021151   True        False         False      56m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-12-08-021151   True        False         False      73m
service-ca                                 4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
storage                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m

What did you expect to happen?
After the deploy, all pods should be in Running/Completed state.

How to reproduce it (as minimally and precisely as possible)?
1. Deploy OCP 4.6 - disconnected, baremetal network IPv4, provisioning network IPv6
2. oc get clusterversion
3. oc get pods -A | grep -vE "Run|Comp"

Anything else we need to know?
1. It happened in 3 deploys out of 4. In the one deploy that did not hit this problem, the pod was reported as restarted twice, but was in Running state by the end of the deployment.
2. While running must-gather there were errors (attached).
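For reference, the kubelet refuses to start a container when runAsNonRoot is true, the image declares a symbolic (non-numeric) user such as "nobody", and no numeric runAsUser is provided, because it cannot prove that user is non-root. Below is a minimal sketch of a securityContext that reproduces the same error message; the pod and image names are illustrative and not taken from the cluster-monitoring-operator deployment.

apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot-demo            # illustrative name
spec:
  containers:
  - name: proxy
    # Illustrative image whose Dockerfile ends with "USER nobody"
    # (a symbolic, non-numeric user).
    image: example.com/demo/nonroot-image:latest
    securityContext:
      runAsNonRoot: true
      # With no numeric runAsUser set here, the kubelet cannot verify that
      # the image's symbolic user is non-root, and the pod fails with
      # CreateContainerConfigError ("container has runAsNonRoot and image
      # has non-numeric user (nobody), cannot verify user is non-root").
      # Adding an explicit numeric UID, e.g. "runAsUser: 65534", lets the
      # check pass.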
Created attachment 1737975 [details] errors reported by must-gather
must-gather http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1906130-must-gather.tar.gz
Thanks for the report! If an issue is not clearly specific to baremetal, reports should generally go against the failing operator. This looks like a duplicate of BZ1904538, which the monitoring team has fixed.

*** This bug has been marked as a duplicate of bug 1904538 ***