Description of problem
======================
When I install OCP 4.6 with OCS 4.6 on GCP, I see that the
cluster-monitoring-operator pod fails with CreateContainerConfigError.

Version-Release number of selected component
============================================
OCP 4.6.0-0.nightly-2020-11-26-234822
OCS 4.6.0-160.ci

How reproducible
================
2/2

Steps to Reproduce
==================
1. Install OCP/OCS cluster on GCP
2. Check cluster dashboard in OCP Console
3. Check pods in openshift-monitoring namespace

Actual results
==============
There is the following alert in OCP Console:

```
100% of the cluster-monitoring-operator/cluster-monitoring-operator targets in openshift-monitoring namespace are down.
```

And the cluster-monitoring-operator pod is not running:

```
$ oc get pods -n openshift-monitoring
NAME                                           READY   STATUS                       RESTARTS   AGE
alertmanager-main-0                            5/5     Running                      0          109m
alertmanager-main-1                            5/5     Running                      0          109m
alertmanager-main-2                            5/5     Running                      0          109m
cluster-monitoring-operator-769d997849-6xzdk   1/2     CreateContainerConfigError   0          150m
grafana-6754564857-wvbwf                       2/2     Running                      0          137m
kube-state-metrics-6b86844c5d-r8ctl            3/3     Running                      0          144m
node-exporter-2rg2w                            2/2     Running                      0          144m
node-exporter-4blr5                            2/2     Running                      0          138m
node-exporter-bzxzv                            2/2     Running                      0          138m
node-exporter-d7h8x                            2/2     Running                      0          144m
node-exporter-sndkw                            2/2     Running                      0          144m
node-exporter-wlfnt                            2/2     Running                      0          138m
openshift-state-metrics-66454d8fcc-m29l6       3/3     Running                      0          144m
prometheus-adapter-6c7cc44f88-7qm24            1/1     Running                      0          138m
prometheus-adapter-6c7cc44f88-ts2pm            1/1     Running                      0          138m
prometheus-k8s-0                               6/6     Running                      1          109m
prometheus-k8s-1                               6/6     Running                      1          109m
prometheus-operator-57d46dd98c-8lg8f           2/2     Running                      0          114m
telemeter-client-7dbf4cfdc-jtj5d               3/3     Running                      0          144m
thanos-querier-75d567b696-q7fvd                5/5     Running                      0          137m
thanos-querier-75d567b696-zhgkj                5/5     Running                      0          137m
```

Expected results
================
cluster-monitoring-operator pod is running.

Additional info
===============
Must-gather data is referenced below.

Effect of using OCS for monitoring storage
------------------------------------------
I tested both with and without OCS-backed persistent storage for OCP
monitoring, and there seems to be no effect on the bug.

Description of cluster-monitoring-operator pod
----------------------------------------------
Output of the following command is attached:

```
$ oc describe pod/cluster-monitoring-operator-769d997849-6xzdk -n openshift-monitoring > cluster-monitoring-operator-769d997849-6xzdk.describe
```

From this output, I would highlight the Events section:

```
Type     Reason            Age                   From               Message
----     ------            ----                  ----               -------
Warning  FailedScheduling  158m (x7 over 161m)   default-scheduler  no nodes available to schedule pods
Warning  FailedScheduling  157m (x2 over 157m)   default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
Warning  FailedScheduling  156m (x5 over 157m)   default-scheduler  0/3 nodes are available: 3 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
Normal   Scheduled         155m                  default-scheduler  Successfully assigned openshift-monitoring/cluster-monitoring-operator-769d997849-6xzdk to mbukatov-11-27a-96gj9-master-2.c.ocs4-283313.internal
Normal   AddedInterface    155m                  multus             Add eth0 [10.129.0.10/23]
Normal   Pulling           155m                  kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1311a289dd7226d1b2783571f32f9c2e87fe1a66208ee08d06208e6ab2344984"
Normal   Pulled            155m                  kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1311a289dd7226d1b2783571f32f9c2e87fe1a66208ee08d06208e6ab2344984" in 10.378011984s
Normal   Created           155m                  kubelet            Created container cluster-monitoring-operator
Normal   Started           155m                  kubelet            Started container cluster-monitoring-operator
Warning  Failed            154m (x10 over 155m)  kubelet            Error: container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root
Normal   Pulled            47s (x719 over 155m)  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:04c3f0ad9fc07192783d5ac5ce8acab865e4f382a143c773b1d8ccb08252c3a9" already present on machine
```
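The "runAsNonRoot and image has non-numeric user (nobody)" failure is a kubelet check: when the effective securityContext sets runAsNonRoot: true but no numeric runAsUser was injected, and the image declares its user by name ("nobody") rather than by UID, the kubelet cannot prove the user is non-root and refuses to start the container. The "restricted" SCC injects a numeric UID from the namespace range, while "nonroot" does not, which fits the symptoms. As a sketch of how to confirm which SCC admitted the pod (the pod name is the one from this report; OpenShift records the admitting SCC in the openshift.io/scc annotation):

```
# Which SCC admitted the pod? Expected: "restricted"; suspected here: "nonroot".
oc -n openshift-monitoring get pod cluster-monitoring-operator-769d997849-6xzdk \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'

# Effective securityContext per container: under "restricted" a numeric
# runAsUser should appear; under "nonroot" only runAsNonRoot=true, which the
# kubelet cannot verify against the image's symbolic user "nobody".
oc -n openshift-monitoring get pod cluster-monitoring-operator-769d997849-6xzdk \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.securityContext}{"\n"}{end}'
```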
Full list of related alerts:

- Pod Namespace openshift-monitoring/Pod cluster-monitoring-operator-769d997849-xkrmg has been in a non-ready state for longer than 15 minutes.
- Deployment Namespace openshift-monitoring/Deployment cluster-monitoring-operator has not matched the expected number of replicas for longer than 15 minutes.
- Pod Namespace openshift-monitoring/Pod cluster-monitoring-operator-769d997849-xkrmg container Container kube-rbac-proxy has been in waiting state for longer than 1 hour.
- 100% of the cluster-monitoring-operator/cluster-monitoring-operator targets in Namespace openshift-monitoring namespace are down.
Created attachment 1734715: cluster-monitoring-operator pod file
For some reason, the cluster-monitoring-operator pod ended up with the "nonroot" SCC, while it should normally get "restricted". Does OCS manipulate SCCs and related bindings by any chance?
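One way to chase this down on a live cluster or in the must-gather: SCC admission chooses, among the SCCs the pod's service account is allowed to use, the one with the highest priority (falling back to restrictiveness), so either a changed priority or a broadened grant on "nonroot" could explain the assignment. A sketch of the checks:

```
# Compare SCC priorities: admission prefers a higher .priority, so any SCC
# with a non-nil priority can win over "restricted" (which has none).
oc get scc -o custom-columns='NAME:.metadata.name,PRIORITY:.priority,RUNASUSER:.runAsUser.type'

# Who is allowed to use the nonroot SCC? If system:authenticated or the
# monitoring service account shows up here, that would explain the assignment.
oc adm policy who-can use securitycontextconstraints nonroot

# Legacy grants are stored directly on the SCC object as users/groups.
oc get scc nonroot -o jsonpath='{.users}{"\n"}{.groups}{"\n"}'
```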
Jose, could you help us direct Simon's question to the proper subteam of the OCS Dev group? At the same time, we also need to rule out an issue in ocs-ci. As noted in the bug report, I have already ruled out any effect of the persistent storage configuration for OCP Monitoring; see the sketch below for how the two cases differ.
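For reference, persistent storage for OCP monitoring is configured via the cluster-monitoring-config ConfigMap, so the two cases can be told apart with a single command (a sketch, assuming the default names):

```
# If this ConfigMap is absent, monitoring uses its default ephemeral storage;
# if present, an OCS-backed setup shows a volumeClaimTemplate with a Ceph
# storage class under the prometheusK8s / alertmanagerMain sections.
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml
```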
This also happens with OCP on OpenStack:

OCP - 4.6.0-0.nightly-2020-12-04-165039
OSP - RHOS-16.1-RHEL-8-20201021.n.0
We are seeing this exact same issue when upgrading from 4.5.23 to 4.6.8, and also a very similar issue (SCC changing to nonroot on Prometheus pods) when upgrading from 4.6.6 to 4.6.8.
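Since the SCC is chosen at pod admission time and stays recorded on the pod, one thing worth testing after any offending SCC priority or grant has been corrected is simply recreating the stuck pod so admission runs again. A sketch (pod name from the original report; the app label is an assumption, substitute your current pod):

```
# Re-run SCC admission by deleting the stuck pod; the ReplicaSet recreates it
# and it should come back under "restricted" once the SCC problem is fixed.
oc -n openshift-monitoring delete pod cluster-monitoring-operator-769d997849-6xzdk

# Confirm the replacement pod's admitting SCC via its annotation.
oc -n openshift-monitoring get pods -l app=cluster-monitoring-operator \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.metadata.annotations.openshift\.io/scc}{"\n"}{end}'
```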