Bug 1904538
| Field | Value |
|---|---|
| Summary | [sig-arch][Early] Managed cluster should start all core operators: monitoring: container has runAsNonRoot and image has non-numeric user (nobody) |
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.6 |
| Target Release | 4.7.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Dusty Mabe <dustymabe> |
| Assignee | Sergiusz Urbaniak <surbania> |
| QA Contact | Junqi Zhao <juzhao> |
| CC | aabhishe, akrzos, alegrand, anpicker, aos-bugs, erooth, fiezzi, kakkoyun, lcosic, lshilin, lszaszki, mbukatov, ngirard, pchavan, pehunt, pkrupa, rheinzma, rtheis, sdodson, sjenning, spasquie, ssonigra, surbania, tsweeney, wking |
| Doc Type | No Doc Update |
| Environment | [sig-arch][Early] Managed cluster should start all core operators |
| Last Closed | 2021-02-24 15:38:11 UTC |
| Type | Bug |
| Bug Blocks | 1906836 |
Description (Dusty Mabe, 2020-12-04 17:34:05 UTC)
From the linked example job, the error message for this test case was:

```
fail [github.com/openshift/origin/test/extended/operators/operators.go:53]: Dec  4 16:10:54.561: ClusterVersion Failing=True: WorkloadNotAvailable: deployment openshift-monitoring/cluster-monitoring-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-monitoring-operator-cd5cb559" has timed out progressing.)
```

Looking at that pod:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1334868183173042176/artifacts/e2e-vsphere/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-monitoring-operator-")).status.containerStatuses[] | select(.ready == false)'
{
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355",
  "imageID": "",
  "lastState": {},
  "name": "kube-rbac-proxy",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root",
      "reason": "CreateContainerConfigError"
    }
  }
}
```

Word on the street is that deleting the pod will recover it, but I'm not sure who would be best to root-cause the issue itself. Sending it to the node folks in case they have ideas. The error seems to be coming from https://github.com/kubernetes/kubernetes/blob/5648200571889140ad246feb82c8f80a5946f167/pkg/kubelet/kuberuntime/security_context.go#L88 (credit to Stack Overflow for finding that link for me: https://stackoverflow.com/questions/49720308/kubernetes-podsecuritypolicy-set-to-runasnonroot-container-has-runasnonroot-and).

I believe this is an error in the image/pod spec definition, and not a problem with the kubelet.
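For context, the check the error points at can be sketched as follows. This is a simplified illustration of the logic in the linked `security_context.go`, not the upstream function signature: when `runAsNonRoot` is set and the pod spec gives no numeric `runAsUser`, the kubelet falls back to the image's `USER`, and a symbolic name like `nobody` cannot be proven non-root without resolving it, so container creation fails.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// verifyRunAsNonRoot is an illustrative sketch of the kubelet check in
// pkg/kubelet/kuberuntime/security_context.go (names and parameters here
// are simplified, not the upstream signature).
//   runAsNonRoot: the securityContext.runAsNonRoot field.
//   runAsUser:    securityContext.runAsUser, nil if unset in the pod spec.
//   imageUser:    the USER from the image config, e.g. "nobody" or "65534".
func verifyRunAsNonRoot(runAsNonRoot bool, runAsUser *int64, imageUser string) error {
	if !runAsNonRoot {
		return nil // nothing to verify
	}
	// An explicit numeric runAsUser in the pod spec can be checked directly.
	if runAsUser != nil {
		if *runAsUser == 0 {
			return errors.New("container has runAsNonRoot and image will run as root")
		}
		return nil
	}
	// Otherwise fall back to the image's USER. A symbolic name such as
	// "nobody" cannot be verified as non-root here, which is exactly the
	// CreateContainerConfigError seen in this bug.
	uid, err := strconv.ParseInt(imageUser, 10, 64)
	if err != nil {
		return fmt.Errorf("container has runAsNonRoot and image has non-numeric user (%s), cannot verify user is non-root", imageUser)
	}
	if uid == 0 {
		return errors.New("container has runAsNonRoot and image will run as root")
	}
	return nil
}

func main() {
	// Reproduces the failure mode from this bug: runAsNonRoot set, no
	// explicit runAsUser, and an image whose USER is "nobody".
	fmt.Println(verifyRunAsNonRoot(true, nil, "nobody"))
	// An explicit numeric runAsUser satisfies the check without consulting
	// the image's USER at all.
	uid := int64(65534)
	fmt.Println(verifyRunAsNonRoot(true, &uid, "nobody"))
}
```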
Sending to the monitoring team for further triage.

*** Bug 1905109 has been marked as a duplicate of this bug. ***

I have a suspicion that this is related to https://github.com/openshift/cluster-monitoring-operator/pull/990, investigating.

As this prevents the cluster from starting, setting the blocker flag and raising urgency.

It seems to be failing consistently on 4.6 as well, for example: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-oauth-server-release-4.6-e2e-gcp

I was able to recover out of this situation by simply deleting the pod:

```
$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS                       RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running                      0          38m
...
pod/cluster-monitoring-operator-5f98f58d55-c9p2k   1/2     CreateContainerConfigError   0          52m
...
$ oc delete po cluster-monitoring-operator-5f98f58d55-c9p2k -n openshift-monitoring
pod "cluster-monitoring-operator-5f98f58d55-c9p2k" deleted
$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running   0          41m
...
pod/cluster-monitoring-operator-5f98f58d55-nkcs9   2/2     Running   0          2m37s
...
```

*** Bug 1902320 has been marked as a duplicate of this bug. ***

*** Bug 1906130 has been marked as a duplicate of this bug. ***

I no longer see this failure present in any master branch jobs. Marking VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633

Has this problem been fixed on OCP 4.6?

@Richard yes, see bug 1906836; it's been fixed in 4.6.12.
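More generally, this waiting state can be avoided by giving the container an explicit numeric `runAsUser` alongside `runAsNonRoot`, so the kubelet never has to resolve the image's symbolic `USER`. The fragment below is a hypothetical sketch only (the pod name and the 65534 UID are illustrative); it is not necessarily the change made for this bug.

```yaml
# Hypothetical sketch: pin a numeric UID so the kubelet's runAsNonRoot
# check can pass without resolving the image's symbolic USER ("nobody").
apiVersion: v1
kind: Pod
metadata:
  name: kube-rbac-proxy-example   # illustrative name
spec:
  containers:
  - name: kube-rbac-proxy
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355
    securityContext:
      runAsNonRoot: true
      runAsUser: 65534   # conventional "nobody" UID; any nonzero UID satisfies the check
```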