Bug 1904538
| Field | Value |
|---|---|
| Summary | [sig-arch][Early] Managed cluster should start all core operators: monitoring: container has runAsNonRoot and image has non-numeric user (nobody) |
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.6 |
| Target Release | 4.7.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Dusty Mabe <dustymabe> |
| Assignee | Sergiusz Urbaniak <surbania> |
| QA Contact | Junqi Zhao <juzhao> |
| CC | aabhishe, akrzos, alegrand, anpicker, aos-bugs, erooth, fiezzi, kakkoyun, lcosic, lshilin, lszaszki, mbukatov, ngirard, pchavan, pehunt, pkrupa, rheinzma, rtheis, sdodson, sjenning, spasquie, ssonigra, surbania, tsweeney, wking |
| Doc Type | No Doc Update |
| Environment | [sig-arch][Early] Managed cluster should start all core operators |
| Last Closed | 2021-02-24 15:38:11 UTC |
| Type | Bug |
| Bug Blocks | 1906836 |
Description (Dusty Mabe, 2020-12-04 17:34:05 UTC)
From the linked example job, the error message for this test case was:

```
fail [github.com/openshift/origin/test/extended/operators/operators.go:53]: Dec  4 16:10:54.561: ClusterVersion Failing=True: WorkloadNotAvailable: deployment openshift-monitoring/cluster-monitoring-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-monitoring-operator-cd5cb559" has timed out progressing.)
```

Looking at that pod:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1334868183173042176/artifacts/e2e-vsphere/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-monitoring-operator-")).status.containerStatuses[] | select(.ready == false)'
{
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355",
  "imageID": "",
  "lastState": {},
  "name": "kube-rbac-proxy",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root",
      "reason": "CreateContainerConfigError"
    }
  }
}
```

Word on the street is that deleting the pod will recover it, but I'm not sure who would be best to root-cause the issue itself. Sending it to the node folks in case they have ideas. The error seems to be coming from https://github.com/kubernetes/kubernetes/blob/5648200571889140ad246feb82c8f80a5946f167/pkg/kubelet/kuberuntime/security_context.go#L88 (credit to Stack Overflow for finding that link for me: https://stackoverflow.com/questions/49720308/kubernetes-podsecuritypolicy-set-to-runasnonroot-container-has-runasnonroot-and).

I believe this is an error in the image/pod spec definition, and not a problem with the kubelet.
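For context, the check the error points at can be sketched as follows. This is a simplified illustration of the logic in the linked `security_context.go`, not the upstream function signature: when `runAsNonRoot` is set and the pod spec gives no numeric `runAsUser`, the kubelet falls back to the image's `USER`, and a symbolic name like `nobody` cannot be proven non-root without resolving it, so container creation fails.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// verifyRunAsNonRoot is an illustrative sketch of the kubelet check in
// pkg/kubelet/kuberuntime/security_context.go (names and parameters here
// are simplified, not the upstream signature).
//   runAsNonRoot: the securityContext.runAsNonRoot field.
//   runAsUser:    securityContext.runAsUser, nil if unset in the pod spec.
//   imageUser:    the USER from the image config, e.g. "nobody" or "65534".
func verifyRunAsNonRoot(runAsNonRoot bool, runAsUser *int64, imageUser string) error {
	if !runAsNonRoot {
		return nil // nothing to verify
	}
	// An explicit numeric runAsUser in the pod spec can be checked directly.
	if runAsUser != nil {
		if *runAsUser == 0 {
			return errors.New("container has runAsNonRoot and image will run as root")
		}
		return nil
	}
	// Otherwise fall back to the image's USER. A symbolic name such as
	// "nobody" cannot be verified as non-root here, which is exactly the
	// CreateContainerConfigError seen in this bug.
	uid, err := strconv.ParseInt(imageUser, 10, 64)
	if err != nil {
		return fmt.Errorf("container has runAsNonRoot and image has non-numeric user (%s), cannot verify user is non-root", imageUser)
	}
	if uid == 0 {
		return errors.New("container has runAsNonRoot and image will run as root")
	}
	return nil
}

func main() {
	// Reproduces the failure mode from this bug: runAsNonRoot set, no
	// explicit runAsUser, and an image whose USER is "nobody".
	fmt.Println(verifyRunAsNonRoot(true, nil, "nobody"))
	// An explicit numeric runAsUser satisfies the check without consulting
	// the image's USER at all.
	uid := int64(65534)
	fmt.Println(verifyRunAsNonRoot(true, &uid, "nobody"))
}
```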
Sending to the monitoring team for further triage.

*** Bug 1905109 has been marked as a duplicate of this bug. ***

I have a suspicion that this is related to https://github.com/openshift/cluster-monitoring-operator/pull/990, investigating.

As this prevents the cluster from starting, setting the blocker flag and raising urgency.

It seems to be failing consistently on 4.6 as well, for example: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-oauth-server-release-4.6-e2e-gcp

I was able to recover out of this situation by simply deleting the pod:

```
$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS                       RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running                      0          38m
...
pod/cluster-monitoring-operator-5f98f58d55-c9p2k   1/2     CreateContainerConfigError   0          52m
...
$ oc delete po cluster-monitoring-operator-5f98f58d55-c9p2k -n openshift-monitoring
pod "cluster-monitoring-operator-5f98f58d55-c9p2k" deleted
$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running   0          41m
...
pod/cluster-monitoring-operator-5f98f58d55-nkcs9   2/2     Running   0          2m37s
...
```

*** Bug 1902320 has been marked as a duplicate of this bug. ***

*** Bug 1906130 has been marked as a duplicate of this bug. ***

I no longer see this failure present in any master branch jobs. Marking VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633

Has this problem been fixed on OCP 4.6?

@Richard yes, see bug 1906836; it's been fixed in 4.6.12.
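More generally, this waiting state can be avoided by giving the container an explicit numeric `runAsUser` alongside `runAsNonRoot`, so the kubelet never has to resolve the image's symbolic `USER`. The fragment below is a hypothetical sketch only (the pod name and the 65534 UID are illustrative); it is not necessarily the change made for this bug.

```yaml
# Hypothetical sketch: pin a numeric UID so the kubelet's runAsNonRoot
# check can pass without resolving the image's symbolic USER ("nobody").
apiVersion: v1
kind: Pod
metadata:
  name: kube-rbac-proxy-example   # illustrative name
spec:
  containers:
  - name: kube-rbac-proxy
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355
    securityContext:
      runAsNonRoot: true
      runAsUser: 65534   # conventional "nobody" UID; any nonzero UID satisfies the check
```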