Bug 1904538 - [sig-arch][Early] Managed cluster should start all core operators: monitoring: container has runAsNonRoot and image has non-numeric user (nobody)
Summary: [sig-arch][Early] Managed cluster should start all core operators: monitoring: container has runAsNonRoot and image has non-numeric user (nobody)
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1902320 1905109 1906130
Depends On:
Blocks: 1906836
 
Reported: 2020-12-04 17:34 UTC by Dusty Mabe
Modified: 2021-01-19 17:13 UTC
CC: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-arch][Early] Managed cluster should start all core operators
Last Closed:
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift kube-rbac-proxy pull 36 None closed Bug 1904538: Dockerfile.ocp: specify numeric uid 2021-01-27 10:47:50 UTC
Red Hat Knowledge Base (Solution) 5663021 None None None 2020-12-23 14:39:00 UTC

Comment 1 W. Trevor King 2020-12-04 22:29:14 UTC
From the linked example job, the error message for this test-case was:

  fail [github.com/openshift/origin/test/extended/operators/operators.go:53]: Dec  4 16:10:54.561: ClusterVersion Failing=True: WorkloadNotAvailable: deployment openshift-monitoring/cluster-monitoring-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-monitoring-operator-cd5cb559" has timed out progressing.)

Looking at that pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1334868183173042176/artifacts/e2e-vsphere/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-monitoring-operator-")).status.containerStatuses[] | select(.ready == false)'
{
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355",
  "imageID": "",
  "lastState": {},
  "name": "kube-rbac-proxy",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root",
      "reason": "CreateContainerConfigError"
    }
  }
}

Word on the street is that deleting the pod will recover it, but I'm not sure who would be best to root-cause the issue itself.  Sending it to the node folks in case they have ideas...

Comment 3 Peter Hunt 2020-12-07 17:07:58 UTC
The error seems to be coming from https://github.com/kubernetes/kubernetes/blob/5648200571889140ad246feb82c8f80a5946f167/pkg/kubelet/kuberuntime/security_context.go#L88
(credit to stackoverflow for finding that link for me: https://stackoverflow.com/questions/49720308/kubernetes-podsecuritypolicy-set-to-runasnonroot-container-has-runasnonroot-and)

I believe this is an error in the image/pod spec definition, not a problem with the kubelet. Sending to the monitoring team for further triage.
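For context, the kubelet check linked above can be paraphrased in a few lines of Go. This is a simplified sketch of the upstream logic, not the verbatim function: when runAsNonRoot is set and no explicit runAsUser is given, the kubelet falls back to the USER recorded in the image metadata, and that value must be numeric because the kubelet cannot resolve a name like "nobody" without the image's /etc/passwd.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// verifyRunAsNonRoot sketches the kubelet's check from
// pkg/kubelet/kuberuntime/security_context.go: an explicit numeric
// runAsUser wins; otherwise the image's USER string is the only
// evidence, and it must parse as a non-zero integer.
func verifyRunAsNonRoot(runAsNonRoot bool, runAsUser *int64, imageUser string) error {
	if !runAsNonRoot {
		return nil
	}
	if runAsUser != nil {
		if *runAsUser == 0 {
			return errors.New("container has runAsNonRoot and image will run as root")
		}
		return nil
	}
	uid, err := strconv.ParseInt(imageUser, 10, 64)
	if err != nil {
		// This is the branch hit by this bug: USER nobody is non-numeric.
		return fmt.Errorf("container has runAsNonRoot and image has non-numeric user (%s), cannot verify user is non-root", imageUser)
	}
	if uid == 0 {
		return errors.New("container has runAsNonRoot and image will run as root")
	}
	return nil
}

func main() {
	// Image built with `USER nobody` -> the error string from this bug.
	fmt.Println(verifyRunAsNonRoot(true, nil, "nobody"))
	// Image built with a numeric `USER 65534` -> passes.
	fmt.Println(verifyRunAsNonRoot(true, nil, "65534"))
}
```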

Comment 4 Sergiusz Urbaniak 2020-12-08 07:47:38 UTC
*** Bug 1905109 has been marked as a duplicate of this bug. ***

Comment 5 Sergiusz Urbaniak 2020-12-08 07:50:10 UTC
I have a suspicion that this is related to https://github.com/openshift/cluster-monitoring-operator/pull/990, investigating.

As this prevents the cluster from starting, setting the blocker flag and raising urgency.
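The eventual fix (kube-rbac-proxy PR 36 in the links above, "Dockerfile.ocp: specify numeric uid") amounts to replacing a symbolic USER with a numeric one. A minimal sketch of the idea; the UID shown here is illustrative, not necessarily the exact value from the PR:

```dockerfile
# Illustrative only: a symbolic user cannot be verified by the kubelet
# when runAsNonRoot is set, because name-to-UID resolution would need
# the image's /etc/passwd at admission time.
# USER nobody            <- triggers CreateContainerConfigError

# A numeric UID can be checked directly against 0, so this passes.
USER 65534
```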

Comment 6 Lukasz Szaszkiewicz 2020-12-08 14:22:01 UTC
It seems to be failing consistently on 4.6 as well, for example, https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-oauth-server-release-4.6-e2e-gcp

Comment 7 Alex Krzos 2020-12-08 21:50:44 UTC
I was able to recover from this situation by simply deleting the pod:

$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS                       RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running                      0          38m
...
pod/cluster-monitoring-operator-5f98f58d55-c9p2k   1/2     CreateContainerConfigError   0          52m
...
$ oc delete po cluster-monitoring-operator-5f98f58d55-c9p2k -n openshift-monitoring
pod "cluster-monitoring-operator-5f98f58d55-c9p2k" deleted
$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running   0          41m
...
pod/cluster-monitoring-operator-5f98f58d55-nkcs9   2/2     Running   0          2m37s
...

Comment 8 Simon Pasquier 2020-12-09 09:06:40 UTC
*** Bug 1902320 has been marked as a duplicate of this bug. ***

Comment 12 Stephen Benjamin 2020-12-10 19:12:07 UTC
*** Bug 1906130 has been marked as a duplicate of this bug. ***

Comment 14 Scott Dodson 2020-12-14 20:36:09 UTC
I no longer see this failure in any master branch jobs. Marking VERIFIED.

