Bug 1902320

Summary: cluster-monitoring-operator fails on CreateContainerConfigError

Product: OpenShift Container Platform
Component: Monitoring
Version: 4.6
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Keywords: UpcomingSprint
Reporter: Martin Bukatovic <mbukatov>
Assignee: Simon Pasquier <spasquie>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, erooth, frank.lamon, itbrown, jarrpa, kakkoyun, lcosic, mloibl, pkrupa, spasquie, surbania
Last Closed: 2020-12-09 09:06:40 UTC
Type: Bug

Attachments: cluster-monitoring-operator pod file

Description Martin Bukatovic 2020-11-27 18:07:24 UTC
Description of problem
======================

When I install OCP 4.6 with OCS 4.6 on GCP, the cluster-monitoring-operator pod fails with CreateContainerConfigError.

Version-Release number of selected component
============================================

OCP 4.6.0-0.nightly-2020-11-26-234822
OCS 4.6.0-160.ci

How reproducible
================

2/2

Steps to Reproduce
==================

1. Install an OCP/OCS cluster on GCP
2. Check the cluster dashboard in the OCP Console
3. Check the pods in the openshift-monitoring namespace (CLI equivalents are sketched below)
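
CLI equivalents of steps 2 and 3, as a minimal sketch (the checks above were done in the web console; the actual `oc get pods` output is shown under "Actual results" below):

```
# Step 2: check the health of the monitoring ClusterOperator.
$ oc get clusteroperators monitoring

# Step 3: list the pods in the openshift-monitoring namespace.
$ oc get pods -n openshift-monitoring
```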

Actual results
==============

The following alert fires in the OCP Console:

```
100% of the cluster-monitoring-operator/cluster-monitoring-operator targets in openshift-monitoring namespace are down.
```

And the cluster-monitoring-operator pod is not running:

```
$ oc get pods -n openshift-monitoring
NAME                                           READY   STATUS                       RESTARTS   AGE
alertmanager-main-0                            5/5     Running                      0          109m
alertmanager-main-1                            5/5     Running                      0          109m
alertmanager-main-2                            5/5     Running                      0          109m
cluster-monitoring-operator-769d997849-6xzdk   1/2     CreateContainerConfigError   0          150m
grafana-6754564857-wvbwf                       2/2     Running                      0          137m
kube-state-metrics-6b86844c5d-r8ctl            3/3     Running                      0          144m
node-exporter-2rg2w                            2/2     Running                      0          144m
node-exporter-4blr5                            2/2     Running                      0          138m
node-exporter-bzxzv                            2/2     Running                      0          138m
node-exporter-d7h8x                            2/2     Running                      0          144m
node-exporter-sndkw                            2/2     Running                      0          144m
node-exporter-wlfnt                            2/2     Running                      0          138m
openshift-state-metrics-66454d8fcc-m29l6       3/3     Running                      0          144m
prometheus-adapter-6c7cc44f88-7qm24            1/1     Running                      0          138m
prometheus-adapter-6c7cc44f88-ts2pm            1/1     Running                      0          138m
prometheus-k8s-0                               6/6     Running                      1          109m
prometheus-k8s-1                               6/6     Running                      1          109m
prometheus-operator-57d46dd98c-8lg8f           2/2     Running                      0          114m
telemeter-client-7dbf4cfdc-jtj5d               3/3     Running                      0          144m
thanos-querier-75d567b696-q7fvd                5/5     Running                      0          137m
thanos-querier-75d567b696-zhgkj                5/5     Running                      0          137m
```

Expected results
================

The cluster-monitoring-operator pod is running.

Additional info
===============

Must-gather data is referenced below.

Effect of using OCS for monitoring storage
------------------------------------------

I tested both with and without OCS-backed persistent storage configured for OCP Monitoring, and the bug occurs in both cases, so the storage configuration appears to have no effect.
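
To make explicit what "using OCS for monitoring storage" means here: OCP Monitoring only uses persistent (OCS-backed) storage when the cluster-monitoring-config ConfigMap requests it; otherwise the Prometheus and Alertmanager pods run on ephemeral emptyDir volumes. A minimal way to check which case a cluster is in (the storage class name ocs-storagecluster-ceph-rbd is the usual OCS default and is an assumption here):

```
# If this ConfigMap does not exist, monitoring runs without persistent storage.
$ oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml

# With OCS-backed storage, config.yaml contains a volumeClaimTemplate whose
# storageClassName is typically ocs-storagecluster-ceph-rbd, and PVCs exist:
$ oc -n openshift-monitoring get pvc
```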

Description of cluster-monitoring-operator pod
----------------------------------------------

Output of the following command is attached:

```
$ oc describe pod/cluster-monitoring-operator-769d997849-6xzdk -n openshift-monitoring > cluster-monitoring-operator-769d997849-6xzdk.describe
```

From this output, I would highlight the Events section:

```
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  158m (x7 over 161m)   default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  157m (x2 over 157m)   default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Warning  FailedScheduling  156m (x5 over 157m)   default-scheduler  0/3 nodes are available: 3 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Normal   Scheduled         155m                  default-scheduler  Successfully assigned openshift-monitoring/cluster-monitoring-operator-769d997849-6xzdk to mbukatov-11-27a-96gj9-master-2.c.ocs4-283313.internal
  Normal   AddedInterface    155m                  multus             Add eth0 [10.129.0.10/23]
  Normal   Pulling           155m                  kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1311a289dd7226d1b2783571f32f9c2e87fe1a66208ee08d06208e6ab2344984"
  Normal   Pulled            155m                  kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1311a289dd7226d1b2783571f32f9c2e87fe1a66208ee08d06208e6ab2344984" in 10.378011984s
  Normal   Created           155m                  kubelet            Created container cluster-monitoring-operator
  Normal   Started           155m                  kubelet            Started container cluster-monitoring-operator
  Warning  Failed            154m (x10 over 155m)  kubelet            Error: container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root
  Normal   Pulled            47s (x719 over 155m)  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:04c3f0ad9fc07192783d5ac5ce8acab865e4f382a143c773b1d8ccb08252c3a9" already present on machine
```
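
The "runAsNonRoot and image has non-numeric user (nobody)" error points at the security context the pod was admitted with rather than at the image itself. A minimal way to confirm which SCC was applied, using the standard openshift.io/scc pod annotation (pod name taken from above):

```
# The SCC chosen at admission time is recorded as a pod annotation.
$ oc -n openshift-monitoring get pod cluster-monitoring-operator-769d997849-6xzdk \
    -o yaml | grep 'openshift.io/scc'

# The effective container security context that triggers the error.
$ oc -n openshift-monitoring get pod cluster-monitoring-operator-769d997849-6xzdk \
    -o jsonpath='{.spec.containers[*].securityContext}{"\n"}'
```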

Comment 3 Martin Bukatovic 2020-11-27 18:38:26 UTC
Full list of related alerts:

- Pod Namespace openshift-monitoring/Pod cluster-monitoring-operator-769d997849-xkrmg has been in a non-ready state for longer than 15 minutes.
- Deployment Namespace openshift-monitoring/Deployment cluster-monitoring-operator has not matched the expected number of replicas for longer than 15 minutes.
- Pod Namespace openshift-monitoring/Pod cluster-monitoring-operator-769d997849-xkrmg container Container kube-rbac-proxy has been in waiting state for longer than 1 hour.
- 100% of the cluster-monitoring-operator/cluster-monitoring-operator targets in Namespace openshift-monitoring namespace are down.

Comment 5 Junqi Zhao 2020-11-30 02:18:53 UTC
Created attachment 1734715 [details]
cluster-monitoring-operator pod file

Comment 6 Simon Pasquier 2020-11-30 08:54:02 UTC
For some reason, the cluster-monitoring-operator pod ended up with the "nonroot" SCC while it should normally get the "restricted" one. Does OCS manipulate SCCs and related bindings by any chance?
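
For reference, a few ways to see which SCCs exist and who is allowed to use the "nonroot" one (a sketch; the exact bindings depend on the cluster):

```
# List all SecurityContextConstraints on the cluster.
$ oc get scc

# Show users and groups added directly to the nonroot SCC.
$ oc describe scc nonroot

# Show which subjects can "use" the nonroot SCC through RBAC.
$ oc adm policy who-can use securitycontextconstraints nonroot
```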

Comment 7 Martin Bukatovic 2020-11-30 09:40:09 UTC
Jose, could you help us direct Simon's question to the proper subteam of the OCS Dev group?

At the same time, we also need to rule out an issue in ocs-ci. As noted in the bug report, I have already ruled out any effect of the persistent storage configuration for OCP Monitoring.

Comment 9 Itzik Brown 2020-12-06 09:45:47 UTC
This also happens with OCP on OpenStack:
OCP - 4.6.0-0.nightly-2020-12-04-165039
OSP - RHOS-16.1-RHEL-8-20201021.n.0

Comment 11 frank.lamon 2020-12-22 14:57:02 UTC
We are seeing exactly the same issue when upgrading from 4.5.23 to 4.6.8.

We also see a very similar issue (the SCC changing to nonroot on the prometheus pods) when upgrading from 4.6.6 to 4.6.8.