test: [sig-arch][Early] Managed cluster should start all core operators is failing frequently in CI, see search results:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-arch%5C%5D%5C%5BEarly%5C%5D+Managed+cluster+should+start+all+core+operators

Example job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1334868183173042176

Might be the same as or different from https://bugzilla.redhat.com/show_bug.cgi?id=1874513, which is closed.
From the linked example job, the error message for this test case was:

fail [github.com/openshift/origin/test/extended/operators/operators.go:53]: Dec 4 16:10:54.561: ClusterVersion Failing=True: WorkloadNotAvailable: deployment openshift-monitoring/cluster-monitoring-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-monitoring-operator-cd5cb559" has timed out progressing.)

Looking at that pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1334868183173042176/artifacts/e2e-vsphere/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-monitoring-operator-")).status.containerStatuses[] | select(.ready == false)'
{
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355",
  "imageID": "",
  "lastState": {},
  "name": "kube-rbac-proxy",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root",
      "reason": "CreateContainerConfigError"
    }
  }
}

Word on the street is that deleting the pod will recover it, but I'm not sure who would be best to root-cause the issue itself. Sending it to the node folks in case they have ideas...
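As a quick sanity check (a sketch, not something I've run against this image, and pulling it would also need the cluster pull secret), one can inspect the image's configured USER with podman. Given the error message, I'd expect a symbolic user like "nobody" rather than the numeric UID that runAsNonRoot needs:

# Assumption: the image's USER is "nobody", per the error message above;
# a numeric UID here would have satisfied the runAsNonRoot check.
$ podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355
$ podman image inspect --format '{{.Config.User}}' \
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355
nobody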
The error seems to be coming from https://github.com/kubernetes/kubernetes/blob/5648200571889140ad246feb82c8f80a5946f167/pkg/kubelet/kuberuntime/security_context.go#L88 (credit to Stack Overflow for finding that link for me: https://stackoverflow.com/questions/49720308/kubernetes-podsecuritypolicy-set-to-runasnonroot-container-has-runasnonroot-and). The kubelet refuses to start a runAsNonRoot container when the effective user comes from the image as a name rather than a numeric UID, because it has no way to prove that a named user is non-root.

I believe this is an error in the image/pod spec definition, not a problem with the kubelet. Sending to the monitoring team for further triage.
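To confirm the pod-spec side of that, the same gather-extra artifact can be queried for each container's securityContext, following the jq pattern from the earlier comment. This is a sketch; I'd expect kube-rbac-proxy to show runAsNonRoot: true with no numeric runAsUser, but I haven't verified the output:

# Assumption: runAsNonRoot is set without a numeric runAsUser on kube-rbac-proxy.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1334868183173042176/artifacts/e2e-vsphere/gather-extra/pods.json \
    | jq -r '.items[]
        | select(.metadata.name | startswith("cluster-monitoring-operator-"))
        | .spec.containers[]
        | {name, securityContext}'

Note that pod-level .spec.securityContext can also set runAsNonRoot, so that field may be worth checking as well.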
*** Bug 1905109 has been marked as a duplicate of this bug. ***
I have a suspicion that this is related to https://github.com/openshift/cluster-monitoring-operator/pull/990; investigating. As this prevents the cluster from starting, I'm setting the blocker flag and raising the urgency.
It seems to be failing consistently on 4.6 as well, for example, https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-oauth-server-release-4.6-e2e-gcp
I was able to recover from this situation by simply deleting the pod:

$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS                       RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running                      0          38m
...
pod/cluster-monitoring-operator-5f98f58d55-c9p2k   1/2     CreateContainerConfigError   0          52m
...

$ oc delete po cluster-monitoring-operator-5f98f58d55-c9p2k -n openshift-monitoring
pod "cluster-monitoring-operator-5f98f58d55-c9p2k" deleted

$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running   0          41m
...
pod/cluster-monitoring-operator-5f98f58d55-nkcs9   2/2     Running   0          2m37s
...
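If you don't want to copy the pod name, restarting the deployment's rollout should achieve the same thing by letting the Deployment controller replace the stuck pod (I haven't tried this in this exact scenario):

# Untested alternative: replace the stuck pod via the Deployment controller.
$ oc -n openshift-monitoring rollout restart deployment/cluster-monitoring-operator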
*** Bug 1902320 has been marked as a duplicate of this bug. ***
*** Bug 1906130 has been marked as a duplicate of this bug. ***
I no longer see this failure present in any master branch jobs. Marking VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Has this problem been fixed in OCP 4.6?
@Richard: yes, see bug 1906836; it's been fixed in 4.6.12.