Bug 1906130

Summary: cluster-monitoring-operator pod stuck in CreateContainerConfigError after installer successfully finished deploy
Product: OpenShift Container Platform
Reporter: Lubov <lshilin>
Component: Installer
Assignee: Beth White <beth.white>
Installer sub component: OpenShift on Bare Metal IPI
QA Contact: Amit Ugol <augol>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: stbenjam
Version: 4.6.z
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-12-10 19:12:09 UTC
Type: Bug
Attachments:
- cluster-monitoring-operator describe (flags: none)
- errors reported by must-gather (flags: none)

Description Lubov 2020-12-09 17:46:58 UTC
Created attachment 1737973 [details]
cluster-monitoring-operator describe

Version:

$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.6.0-0.nightly-2020-12-08-021151
built from commit f5ba6239853f0904704c04d8b1c04c78172f1141
release image registry.svc.ci.openshift.org/ocp/release@sha256:bd84091070e50e41cd30bcda6c6bd2b821ad48a0ee9aa7637165db31e7ad51dd

Platform:
IPI baremetal

What happened?
After the deploy finished and was reported as successful, oc get clusterversion returns an error:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-08-021151   True        False         43m     Error while reconciling 4.6.0-0.nightly-2020-12-08-021151: the workload openshift-monitoring/cluster-monitoring-operator has not yet successfully rolled out
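
(Added note, not from the original report: a quick way to pull just the failing-workload message out of the ClusterVersion object, assuming the standard "Failing" condition that the cluster-version operator sets.)

$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}'
Error while reconciling 4.6.0-0.nightly-2020-12-08-021151: the workload openshift-monitoring/cluster-monitoring-operator has not yet successfully rolled out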

The cluster-monitoring-operator pod is stuck in CreateContainerConfigError status (the kube-rbac-proxy container failed to start; see the attached cluster-monitoring-operator describe output):
$ oc get pods -n openshift-monitoring
NAME                                           READY   STATUS                       RESTARTS   AGE
cluster-monitoring-operator-866c9df665-tpm9m   1/2     CreateContainerConfigError   0          100m

The pod events report: "Error: container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root"
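
(For context, an added note rather than part of the original report: the kubelet refuses to start a container when the pod spec sets runAsNonRoot: true but the image declares its user by name, e.g. USER nobody, instead of a numeric UID, because it cannot prove the named user maps to a non-root UID. Below is a minimal sketch that trips the same check; the pod name and image are placeholders, and the image is assumed to end its Dockerfile with a non-numeric USER.)

$ cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-check-demo            # hypothetical name
spec:
  containers:
  - name: demo
    image: example.com/demo:latest    # assumed to set USER nobody (non-numeric)
    securityContext:
      runAsNonRoot: true              # kubelet must verify a non-root UID
      # runAsUser: 65534              # a numeric UID here satisfies the check
EOF

Setting a numeric runAsUser in the pod spec (or a numeric USER in the image) is the usual way out of this state.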

All operators are reported as Available
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      61m
cloud-credential                           4.6.0-0.nightly-2020-12-08-021151   True        False         False      108m
cluster-autoscaler                         4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
config-operator                            4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
console                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      65m
csi-snapshot-controller                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
dns                                        4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
etcd                                       4.6.0-0.nightly-2020-12-08-021151   True        False         False      93m
image-registry                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      58m
ingress                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      70m
insights                                   4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
kube-apiserver                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
kube-controller-manager                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
kube-scheduler                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
kube-storage-version-migrator              4.6.0-0.nightly-2020-12-08-021151   True        False         False      70m
machine-api                                4.6.0-0.nightly-2020-12-08-021151   True        False         False      79m
machine-approver                           4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
machine-config                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
marketplace                                4.6.0-0.nightly-2020-12-08-021151   True        False         False      93m
monitoring                                 4.6.0-0.nightly-2020-12-08-021151   True        False         False      70m
network                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
node-tuning                                4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
openshift-apiserver                        4.6.0-0.nightly-2020-12-08-021151   True        False         False      74m
openshift-controller-manager               4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
openshift-samples                          4.6.0-0.nightly-2020-12-08-021151   True        False         False      56m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-12-08-021151   True        False         False      73m
service-ca                                 4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
storage                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m

What did you expect to happen?
After deploy, all pods should be in Running/Completed state

How to reproduce it (as minimally and precisely as possible)?
1. Deploy OCP 4.6 (disconnected; baremetal network IPv4, provisioning network IPv6)
2. oc get clusterversion
3. oc get pods -A | grep -vE "Run|Comp" (a triage sketch for the failing pod follows below)
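
A triage sketch for step 3 (added here, not part of the original steps): once a pod shows CreateContainerConfigError, the kubelet's reason sits in the waiting state of the container status as well as in the pod events. The pod name below is the one from this report:

$ oc -n openshift-monitoring get pod cluster-monitoring-operator-866c9df665-tpm9m \
    -o jsonpath='{.status.containerStatuses[*].state.waiting.message}'
container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root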

Anything else we need to know?
1. It happened in 3 deploys out of 4.
In the one deploy without the problem, the pod was reported as restarted twice, but was in Running state by the end of deployment.

2. While running must-gather, there were errors (attached).
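
(Side note, not from the report: must-gather is collected with the standard command below; --dest-dir names the local output directory.)

$ oc adm must-gather --dest-dir=./must-gather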

Comment 1 Lubov 2020-12-09 17:49:22 UTC
Created attachment 1737975 [details]
errors reported by must-gather

Comment 3 Stephen Benjamin 2020-12-10 19:12:09 UTC
Thanks for the report! Unless an issue is clearly specific to bare metal, reports should generally be filed against the failing operator.

Looks like this was a dupe of BZ1904538; the monitoring team has fixed it.
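
(Added note: bug 1904538's actual patch isn't quoted here, but the usual shape of a fix for this error class is pinning a numeric UID on the affected container. The command below is a hypothetical illustration only; on a managed cluster the cluster-version operator would revert a hand-edit like this.)

$ oc -n openshift-monitoring patch deployment cluster-monitoring-operator --type=strategic \
    -p '{"spec":{"template":{"spec":{"containers":[{"name":"kube-rbac-proxy","securityContext":{"runAsUser":65534}}]}}}}'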

*** This bug has been marked as a duplicate of bug 1904538 ***