test: [sig-arch][Early] Managed cluster should start all core operators is failing frequently in CI, see search results:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-arch%5C%5D%5C%5BEarly%5C%5D+Managed+cluster+should+start+all+core+operators

Example job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1334868183173042176

Might be the same as or different from https://bugzilla.redhat.com/show_bug.cgi?id=1874513, which is closed.
From the linked example job, the error message for this test case was:

fail [github.com/openshift/origin/test/extended/operators/operators.go:53]: Dec 4 16:10:54.561: ClusterVersion Failing=True: WorkloadNotAvailable: deployment openshift-monitoring/cluster-monitoring-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-monitoring-operator-cd5cb559" has timed out progressing.)

Looking at that pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1334868183173042176/artifacts/e2e-vsphere/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-monitoring-operator-")).status.containerStatuses[] | select(.ready == false)'
{
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355",
  "imageID": "",
  "lastState": {},
  "name": "kube-rbac-proxy",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root",
      "reason": "CreateContainerConfigError"
    }
  }
}

Word on the street is that deleting the pod will recover it, but I'm not sure who would be best to root-cause the issue itself. Sending it to the node folks in case they have ideas...
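As a quick sanity check (a sketch, not something I've run against this image, and pulling it would also need the cluster pull secret), one can inspect the image's configured USER with podman. Given the error message, I'd expect a symbolic user like "nobody" rather than the numeric UID that runAsNonRoot needs:

# Assumption: the image's USER is "nobody", per the error message above;
# a numeric UID here would have satisfied the runAsNonRoot check.
$ podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355
$ podman image inspect --format '{{.Config.User}}' \
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bcee0d9746272b095b38b10c8b50edc99db655b6ee75be5d878e63df4c99a355
nobody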
The error seems to be coming from https://github.com/kubernetes/kubernetes/blob/5648200571889140ad246feb82c8f80a5946f167/pkg/kubelet/kuberuntime/security_context.go#L88 (credit to Stack Overflow for finding that link for me: https://stackoverflow.com/questions/49720308/kubernetes-podsecuritypolicy-set-to-runasnonroot-container-has-runasnonroot-and). The kubelet refuses to start a runAsNonRoot container when the effective user comes from the image as a name rather than a numeric UID, because it has no way to prove that a named user is non-root.

I believe this is an error in the image/pod spec definition, not a problem with the kubelet. Sending to the monitoring team for further triage.
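To confirm the pod-spec side of that, the same gather-extra artifact can be queried for each container's securityContext, following the jq pattern from the earlier comment. This is a sketch; I'd expect kube-rbac-proxy to show runAsNonRoot: true with no numeric runAsUser, but I haven't verified the output:

# Assumption: runAsNonRoot is set without a numeric runAsUser on kube-rbac-proxy.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1334868183173042176/artifacts/e2e-vsphere/gather-extra/pods.json \
    | jq -r '.items[]
        | select(.metadata.name | startswith("cluster-monitoring-operator-"))
        | .spec.containers[]
        | {name, securityContext}'

Note that pod-level .spec.securityContext can also set runAsNonRoot, so that field may be worth checking as well.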
*** Bug 1905109 has been marked as a duplicate of this bug. ***
I have a suspicion that this is related to https://github.com/openshift/cluster-monitoring-operator/pull/990; investigating. As this prevents the cluster from starting, I'm setting the blocker flag and raising the urgency.
It seems to be failing consistently on 4.6 as well, for example, https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-oauth-server-release-4.6-e2e-gcp
I was able to recover from this situation by simply deleting the pod:

$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS                       RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running                      0          38m
...
pod/cluster-monitoring-operator-5f98f58d55-c9p2k   1/2     CreateContainerConfigError   0          52m
...

$ oc delete po cluster-monitoring-operator-5f98f58d55-c9p2k -n openshift-monitoring
pod "cluster-monitoring-operator-5f98f58d55-c9p2k" deleted

$ oc get all -n openshift-monitoring
NAME                                               READY   STATUS    RESTARTS   AGE
pod/alertmanager-main-0                            5/5     Running   0          41m
...
pod/cluster-monitoring-operator-5f98f58d55-nkcs9   2/2     Running   0          2m37s
...
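If you don't want to copy the pod name, restarting the deployment's rollout should achieve the same thing by letting the Deployment controller replace the stuck pod (I haven't tried this in this exact scenario):

# Untested alternative: replace the stuck pod via the Deployment controller.
$ oc -n openshift-monitoring rollout restart deployment/cluster-monitoring-operator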
*** Bug 1902320 has been marked as a duplicate of this bug. ***
*** Bug 1906130 has been marked as a duplicate of this bug. ***
I no longer see this failure present in any master branch jobs. Marking VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Has this problem been fixed in OCP 4.6?
@Richard: yes, see bug 1906836; it's been fixed in 4.6.12.