Bug 1906130

Summary: cluster-monitoring-operator pod stuck in CreateContainerConfigError after installer successfully finished deploy
Product: OpenShift Container Platform
Reporter: Lubov <lshilin>
Component: Installer
Assignee: Beth White <beth.white>
Installer sub component: OpenShift on Bare Metal IPI
QA Contact: Amit Ugol <augol>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: stbenjam
Version: 4.6.z
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-12-10 19:12:09 UTC
Type: Bug
Attachments:
- cluster-monitoring-operator describe (flags: none)
- errors reported by must-gather (flags: none)

Description Lubov 2020-12-09 17:46:58 UTC
Created attachment 1737973 [details]
cluster-monitoring-operator describe

Version:

$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.6.0-0.nightly-2020-12-08-021151
built from commit f5ba6239853f0904704c04d8b1c04c78172f1141
release image registry.svc.ci.openshift.org/ocp/release@sha256:bd84091070e50e41cd30bcda6c6bd2b821ad48a0ee9aa7637165db31e7ad51dd

Platform:
IPI baremetal

What happened?
After the deploy finished and was reported as successful, oc get clusterversion returns an error:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-08-021151   True        False         43m     Error while reconciling 4.6.0-0.nightly-2020-12-08-021151: the workload openshift-monitoring/cluster-monitoring-operator has not yet successfully rolled out
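
(Added note, not from the original report: a quick way to pull just the failing-workload message out of the ClusterVersion object, assuming the standard "Failing" condition that the cluster-version operator sets.)

$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}'
Error while reconciling 4.6.0-0.nightly-2020-12-08-021151: the workload openshift-monitoring/cluster-monitoring-operator has not yet successfully rolled out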

The cluster-monitoring-operator pod is stuck in CreateContainerConfigError status (the kube-rbac-proxy container failed to start; see the attached cluster-monitoring-operator describe output):
$ oc get pods -n openshift-monitoring
NAME                                           READY   STATUS                       RESTARTS   AGE
cluster-monitoring-operator-866c9df665-tpm9m   1/2     CreateContainerConfigError   0          100m

The pod events report: "Error: container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root"
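
(For context, an added note rather than part of the original report: the kubelet refuses to start a container when the pod spec sets runAsNonRoot: true but the image declares its user by name, e.g. USER nobody, instead of a numeric UID, because it cannot prove the named user maps to a non-root UID. Below is a minimal sketch that trips the same check; the pod name and image are placeholders, and the image is assumed to end its Dockerfile with a non-numeric USER.)

$ cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-check-demo            # hypothetical name
spec:
  containers:
  - name: demo
    image: example.com/demo:latest    # assumed to set USER nobody (non-numeric)
    securityContext:
      runAsNonRoot: true              # kubelet must verify a non-root UID
      # runAsUser: 65534              # a numeric UID here satisfies the check
EOF

Setting a numeric runAsUser in the pod spec (or a numeric USER in the image) is the usual way out of this state.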

All operators are reported as Available
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      61m
cloud-credential                           4.6.0-0.nightly-2020-12-08-021151   True        False         False      108m
cluster-autoscaler                         4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
config-operator                            4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
console                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      65m
csi-snapshot-controller                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
dns                                        4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
etcd                                       4.6.0-0.nightly-2020-12-08-021151   True        False         False      93m
image-registry                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      58m
ingress                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      70m
insights                                   4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
kube-apiserver                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
kube-controller-manager                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
kube-scheduler                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
kube-storage-version-migrator              4.6.0-0.nightly-2020-12-08-021151   True        False         False      70m
machine-api                                4.6.0-0.nightly-2020-12-08-021151   True        False         False      79m
machine-approver                           4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
machine-config                             4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
marketplace                                4.6.0-0.nightly-2020-12-08-021151   True        False         False      93m
monitoring                                 4.6.0-0.nightly-2020-12-08-021151   True        False         False      70m
network                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
node-tuning                                4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
openshift-apiserver                        4.6.0-0.nightly-2020-12-08-021151   True        False         False      74m
openshift-controller-manager               4.6.0-0.nightly-2020-12-08-021151   True        False         False      92m
openshift-samples                          4.6.0-0.nightly-2020-12-08-021151   True        False         False      56m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-12-08-021151   True        False         False      94m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-12-08-021151   True        False         False      73m
service-ca                                 4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m
storage                                    4.6.0-0.nightly-2020-12-08-021151   True        False         False      95m

What did you expect to happen?
After deploy, all pods should be in Running/Completed state

How to reproduce it (as minimally and precisely as possible)?
1. Deploy OCP 4.6 (disconnected; baremetal network IPv4, provisioning network IPv6)
2. oc get clusterversion
3. oc get pods -A | grep -vE "Run|Comp" (a triage sketch for the failing pod follows below)
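
A triage sketch for step 3 (added here, not part of the original steps): once a pod shows CreateContainerConfigError, the kubelet's reason sits in the waiting state of the container status as well as in the pod events. The pod name below is the one from this report:

$ oc -n openshift-monitoring get pod cluster-monitoring-operator-866c9df665-tpm9m \
    -o jsonpath='{.status.containerStatuses[*].state.waiting.message}'
container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root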

Anything else we need to know?
1. It happened in 3 deploys out of 4.
In the one deploy without the problem, the pod was reported as restarted twice, but was in Running state by the end of deployment.

2. While running must-gather, there were errors (attached).
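
(Side note, not from the report: must-gather is collected with the standard command below; --dest-dir names the local output directory.)

$ oc adm must-gather --dest-dir=./must-gather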

Comment 1 Lubov 2020-12-09 17:49:22 UTC
Created attachment 1737975 [details]
errors reported by must-gather

Comment 3 Stephen Benjamin 2020-12-10 19:12:09 UTC
Thanks for the report! Unless an issue is clearly specific to bare metal, reports should generally be filed against the failing operator.

Looks like this was a dupe of BZ1904538; the monitoring team has fixed it.
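
(Added note: bug 1904538's actual patch isn't quoted here, but the usual shape of a fix for this error class is pinning a numeric UID on the affected container. The command below is a hypothetical illustration only; on a managed cluster the cluster-version operator would revert a hand-edit like this.)

$ oc -n openshift-monitoring patch deployment cluster-monitoring-operator --type=strategic \
    -p '{"spec":{"template":{"spec":{"containers":[{"name":"kube-rbac-proxy","securityContext":{"runAsUser":65534}}]}}}}'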

*** This bug has been marked as a duplicate of bug 1904538 ***