Description of problem
======================
When I install OCP 4.6 with OCS 4.6 on GCP, I see that the
cluster-monitoring-operator pod fails with CreateContainerConfigError.

Version-Release number of selected component
============================================
OCP 4.6.0-0.nightly-2020-11-26-234822
OCS 4.6.0-160.ci

How reproducible
================
2/2

Steps to Reproduce
==================
1. Install OCP/OCS cluster on GCP
2. Check cluster dashboard in OCP Console
3. Check pods in openshift-monitoring namespace

Actual results
==============
There is the following alert in OCP Console:

```
100% of the cluster-monitoring-operator/cluster-monitoring-operator targets in openshift-monitoring namespace are down.
```

And the cluster-monitoring-operator pod is not running:

```
$ oc get pods -n openshift-monitoring
NAME                                           READY   STATUS                       RESTARTS   AGE
alertmanager-main-0                            5/5     Running                      0          109m
alertmanager-main-1                            5/5     Running                      0          109m
alertmanager-main-2                            5/5     Running                      0          109m
cluster-monitoring-operator-769d997849-6xzdk   1/2     CreateContainerConfigError   0          150m
grafana-6754564857-wvbwf                       2/2     Running                      0          137m
kube-state-metrics-6b86844c5d-r8ctl            3/3     Running                      0          144m
node-exporter-2rg2w                            2/2     Running                      0          144m
node-exporter-4blr5                            2/2     Running                      0          138m
node-exporter-bzxzv                            2/2     Running                      0          138m
node-exporter-d7h8x                            2/2     Running                      0          144m
node-exporter-sndkw                            2/2     Running                      0          144m
node-exporter-wlfnt                            2/2     Running                      0          138m
openshift-state-metrics-66454d8fcc-m29l6       3/3     Running                      0          144m
prometheus-adapter-6c7cc44f88-7qm24            1/1     Running                      0          138m
prometheus-adapter-6c7cc44f88-ts2pm            1/1     Running                      0          138m
prometheus-k8s-0                               6/6     Running                      1          109m
prometheus-k8s-1                               6/6     Running                      1          109m
prometheus-operator-57d46dd98c-8lg8f           2/2     Running                      0          114m
telemeter-client-7dbf4cfdc-jtj5d               3/3     Running                      0          144m
thanos-querier-75d567b696-q7fvd                5/5     Running                      0          137m
thanos-querier-75d567b696-zhgkj                5/5     Running                      0          137m
```

Expected results
================
cluster-monitoring-operator pod is running.

Additional info
===============
Must-gather data is referenced below.

Effect of using OCS for monitoring storage
------------------------------------------
I tested both with and without OCS-backed persistent storage for OCP
monitoring, and there seems to be no effect on the bug.

Description of cluster-monitoring-operator pod
----------------------------------------------
Output of the following command is attached:

```
$ oc describe pod/cluster-monitoring-operator-769d997849-6xzdk -n openshift-monitoring > cluster-monitoring-operator-769d997849-6xzdk.describe
```

From this output, I would highlight the Events section:

```
Type     Reason            Age                   From               Message
----     ------            ----                  ----               -------
Warning  FailedScheduling  158m (x7 over 161m)   default-scheduler  no nodes available to schedule pods
Warning  FailedScheduling  157m (x2 over 157m)   default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
Warning  FailedScheduling  156m (x5 over 157m)   default-scheduler  0/3 nodes are available: 3 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
Normal   Scheduled         155m                  default-scheduler  Successfully assigned openshift-monitoring/cluster-monitoring-operator-769d997849-6xzdk to mbukatov-11-27a-96gj9-master-2.c.ocs4-283313.internal
Normal   AddedInterface    155m                  multus             Add eth0 [10.129.0.10/23]
Normal   Pulling           155m                  kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1311a289dd7226d1b2783571f32f9c2e87fe1a66208ee08d06208e6ab2344984"
Normal   Pulled            155m                  kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1311a289dd7226d1b2783571f32f9c2e87fe1a66208ee08d06208e6ab2344984" in 10.378011984s
Normal   Created           155m                  kubelet            Created container cluster-monitoring-operator
Normal   Started           155m                  kubelet            Started container cluster-monitoring-operator
Warning  Failed            154m (x10 over 155m)  kubelet            Error: container has runAsNonRoot and image has non-numeric user (nobody), cannot verify user is non-root
Normal   Pulled            47s (x719 over 155m)  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:04c3f0ad9fc07192783d5ac5ce8acab865e4f382a143c773b1d8ccb08252c3a9" already present on machine
```
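The "runAsNonRoot and image has non-numeric user (nobody)" failure is a kubelet check: when the effective securityContext sets runAsNonRoot: true but no numeric runAsUser was injected, and the image declares its user by name ("nobody") rather than by UID, the kubelet cannot prove the user is non-root and refuses to start the container. The "restricted" SCC injects a numeric UID from the namespace range, while "nonroot" does not, which fits the symptoms. As a sketch of how to confirm which SCC admitted the pod (the pod name is the one from this report; OpenShift records the admitting SCC in the openshift.io/scc annotation):

```
# Which SCC admitted the pod? Expected: "restricted"; suspected here: "nonroot".
oc -n openshift-monitoring get pod cluster-monitoring-operator-769d997849-6xzdk \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'

# Effective securityContext per container: under "restricted" a numeric
# runAsUser should appear; under "nonroot" only runAsNonRoot=true, which the
# kubelet cannot verify against the image's symbolic user "nobody".
oc -n openshift-monitoring get pod cluster-monitoring-operator-769d997849-6xzdk \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.securityContext}{"\n"}{end}'
```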
Full list of related alerts:

- Pod Namespace openshift-monitoring/Pod cluster-monitoring-operator-769d997849-xkrmg has been in a non-ready state for longer than 15 minutes.
- Deployment Namespace openshift-monitoring/Deployment cluster-monitoring-operator has not matched the expected number of replicas for longer than 15 minutes.
- Pod Namespace openshift-monitoring/Pod cluster-monitoring-operator-769d997849-xkrmg container Container kube-rbac-proxy has been in waiting state for longer than 1 hour.
- 100% of the cluster-monitoring-operator/cluster-monitoring-operator targets in Namespace openshift-monitoring namespace are down.
Created attachment 1734715: cluster-monitoring-operator pod file
For some reason, the cluster-monitoring-operator pod ended up with the "nonroot" SCC, while it should normally get "restricted". Does OCS manipulate SCCs and related bindings by any chance?
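One way to chase this down on a live cluster or in the must-gather: SCC admission chooses, among the SCCs the pod's service account is allowed to use, the one with the highest priority (falling back to restrictiveness), so either a changed priority or a broadened grant on "nonroot" could explain the assignment. A sketch of the checks:

```
# Compare SCC priorities: admission prefers a higher .priority, so any SCC
# with a non-nil priority can win over "restricted" (which has none).
oc get scc -o custom-columns='NAME:.metadata.name,PRIORITY:.priority,RUNASUSER:.runAsUser.type'

# Who is allowed to use the nonroot SCC? If system:authenticated or the
# monitoring service account shows up here, that would explain the assignment.
oc adm policy who-can use securitycontextconstraints nonroot

# Legacy grants are stored directly on the SCC object as users/groups.
oc get scc nonroot -o jsonpath='{.users}{"\n"}{.groups}{"\n"}'
```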
Jose, could you help us direct Simon's question to the proper subteam of the OCS Dev group? At the same time, we also need to rule out an issue in ocs-ci. As noted in the bug report, I have already ruled out any effect of the persistent storage configuration for OCP Monitoring; see the sketch below for how the two cases differ.
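For reference, persistent storage for OCP monitoring is configured via the cluster-monitoring-config ConfigMap, so the two cases can be told apart with a single command (a sketch, assuming the default names):

```
# If this ConfigMap is absent, monitoring uses its default ephemeral storage;
# if present, an OCS-backed setup shows a volumeClaimTemplate with a Ceph
# storage class under the prometheusK8s / alertmanagerMain sections.
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml
```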
This also happens with OCP on OpenStack:

OCP - 4.6.0-0.nightly-2020-12-04-165039
OSP - RHOS-16.1-RHEL-8-20201021.n.0
We are seeing this exact same issue when upgrading from 4.5.23 to 4.6.8, and also a very similar issue (SCC changing to nonroot on Prometheus pods) when upgrading from 4.6.6 to 4.6.8.
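Since the SCC is chosen at pod admission time and stays recorded on the pod, one thing worth testing after any offending SCC priority or grant has been corrected is simply recreating the stuck pod so admission runs again. A sketch (pod name from the original report; the app label is an assumption, substitute your current pod):

```
# Re-run SCC admission by deleting the stuck pod; the ReplicaSet recreates it
# and it should come back under "restricted" once the SCC problem is fixed.
oc -n openshift-monitoring delete pod cluster-monitoring-operator-769d997849-6xzdk

# Confirm the replacement pod's admitting SCC via its annotation.
oc -n openshift-monitoring get pods -l app=cluster-monitoring-operator \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.metadata.annotations.openshift\.io/scc}{"\n"}{end}'
```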