Bug 1963833

Summary: Cluster monitoring operator crashlooping on single node clusters due to segfault
Product: OpenShift Container Platform
Reporter: Omer Tuchfeld <otuchfel>
Component: Monitoring
Assignee: Omer Tuchfeld <otuchfel>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.8
CC: alegrand, anpicker, aos-bugs, erooth, kakkoyun, pkrupa, pmuller, sasha, wking
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 23:09:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Omer Tuchfeld 2021-05-24 07:10:08 UTC
Description of problem:
Cluster monitoring operator crashlooping on single node clusters due to segfault

Version-Release number of selected component (if applicable):
4.8-fc5

How reproducible:
100% on single-node clusters; see the test grid [1], where the failures start on 2021/05/21

Steps to Reproduce:
Run the single-node live ISO CI job, e.g. [3]

Actual results:
Monitoring operator pod crashlooping

Expected results:
Monitoring operator pod not crashlooping

Additional info:
The crash seems to have been introduced by this PR [2]


[1] https://testgrid.k8s.io/redhat-single-node#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-single-node-live-iso

[2] https://github.com/openshift/cluster-monitoring-operator/pull/1151#issuecomment-846809313 

[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-single-node-live-iso/1395892352853217280

Comment 1 Junqi Zhao 2021-05-24 07:46:35 UTC
Same bug as bug 1963775.

Comment 5 Junqi Zhao 2021-05-25 07:40:12 UTC
Checked with 4.8.0-0.nightly-2021-05-25-041803; the cluster-monitoring-operator pod is running normally now.
# oc get no
NAME                                         STATUS   ROLES           AGE   VERSION
ip-10-0-139-220.us-east-2.compute.internal   Ready    master,worker   64m   v1.21.0-rc.0+ee60d07
# oc get pod -n openshift-monitoring
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            5/5     Running   0          50m
cluster-monitoring-operator-77d7fdc7cb-jvvtz   2/2     Running   3          62m
grafana-6589f84cd6-7g7dv                       2/2     Running   0          50m
kube-state-metrics-69cc98557f-pgpc5            3/3     Running   0          62m
node-exporter-8dqvh                            2/2     Running   0          62m
openshift-state-metrics-5f54b4ff58-p5rjs       3/3     Running   0          62m
prometheus-adapter-789fbc4d86-6qfhl            1/1     Running   0          54m
prometheus-k8s-0                               7/7     Running   1          50m
prometheus-operator-fd77ffdd8-t22s9            2/2     Running   0          50m
telemeter-client-79976c8cd5-vkd7g              3/3     Running   0          61m
thanos-querier-9d758d9d-kh49k                  5/5     Running   0          50m
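
The check above can also be scripted. What follows is a minimal sketch, not part of the original verification: it assumes the default column layout of `oc get pod` (NAME, READY, STATUS, RESTARTS, AGE), and the names PodRow, parse_pod_table and unhealthy are hypothetical helpers introduced only for illustration. The sample text is a subset of the rows pasted in this comment.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    name: str
    ready: int     # containers currently ready
    total: int     # containers in the pod
    status: str
    restarts: int


def parse_pod_table(text: str) -> List[PodRow]:
    """Parse default `oc get pod` column output (assumes a header line first)."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the NAME/READY/... header
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore blank or truncated lines
        name, ready, status, restarts = fields[0], fields[1], fields[2], fields[3]
        ready_count, total_count = ready.split("/")
        rows.append(
            PodRow(name, int(ready_count), int(total_count), status, int(restarts))
        )
    return rows


def unhealthy(rows: List[PodRow]) -> List[PodRow]:
    """Pods that are not Running or have containers that are not ready."""
    return [r for r in rows if r.status != "Running" or r.ready != r.total]


# Subset of the output pasted in this comment.
SAMPLE = """\
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            5/5     Running   0          50m
cluster-monitoring-operator-77d7fdc7cb-jvvtz   2/2     Running   3          62m
prometheus-k8s-0                               7/7     Running   1          50m
"""

rows = parse_pod_table(SAMPLE)
for row in rows:
    print(row.name, row.status, f"{row.ready}/{row.total}", "restarts:", row.restarts)
print("unhealthy pods:", [r.name for r in unhealthy(rows)])
```

A crashlooping operator would typically show a STATUS other than Running (for example CrashLoopBackOff) together with a growing RESTARTS count, so both columns are worth inspecting. The restart count alone is not treated as a failure in this sketch.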

Comment 6 Simon Pasquier 2021-05-25 10:18:58 UTC
*** Bug 1963775 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2021-07-27 23:09:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438