Bug 1635103 - [3.11] cluster-monitoring-operator pod CrashLoopBackOff
Summary: [3.11] cluster-monitoring-operator pod CrashLoopBackOff
Keywords:
Status: CLOSED DUPLICATE of bug 1634227
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-02 06:38 UTC by Weihua Meng
Modified: 2018-10-02 15:10 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-02 15:10:14 UTC
Target Upstream Version:
Embargoed:



Description Weihua Meng 2018-10-02 06:38:46 UTC
Description of problem:
cluster-monitoring-operator pod CrashLoopBackOff

Version-Release number of selected component (if applicable):
openshift v3.11.18

How reproducible:
Always

Steps to Reproduce:
1. Install OCP v3.11

Actual results:
Install succeeded, but the cluster-monitoring-operator pod goes into CrashLoopBackOff.

# oc get pod -n openshift-monitoring
NAME                                           READY     STATUS             RESTARTS   AGE
cluster-monitoring-operator-56bb5946c4-d5b5b   0/1       CrashLoopBackOff   23         2h


# oc describe pod/cluster-monitoring-operator-56bb5946c4-gb7lf

Events:
  Type     Reason     Age               From                                  Message
  ----     ------     ----              ----                                  -------
  Normal   Scheduled  10m               default-scheduler                     Successfully assigned openshift-monitoring/cluster-monitoring-operator-56bb5946c4-gb7lf to ip-172-18-8-81.ec2.internal
  Normal   Pulling    10m               kubelet, ip-172-18-8-81.ec2.internal  pulling image "registry.reg-aws.openshift.com:443/openshift3/ose-cluster-monitoring-operator:v3.11"
  Normal   Pulled     10m               kubelet, ip-172-18-8-81.ec2.internal  Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/ose-cluster-monitoring-operator:v3.11"
  Normal   Created    1m (x5 over 10m)  kubelet, ip-172-18-8-81.ec2.internal  Created container
  Normal   Pulled     1m (x4 over 5m)   kubelet, ip-172-18-8-81.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-cluster-monitoring-operator:v3.11" already present on machine
  Normal   Started    1m (x5 over 10m)  kubelet, ip-172-18-8-81.ec2.internal  Started container
  Warning  BackOff    3s (x8 over 4m)   kubelet, ip-172-18-8-81.ec2.internal  Back-off restarting failed container


# oc logs pod/cluster-monitoring-operator-56bb5946c4-gb7lf

I1002 04:52:59.433584       1 decoder.go:224] decoding stream as YAML
I1002 04:53:00.624866       1 tasks.go:37] running task Updating Telemeter client
I1002 04:53:00.624964       1 decoder.go:224] decoding stream as YAML
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x10d344d]

goroutine 13 [running]:
github.com/openshift/cluster-monitoring-operator/pkg/manifests.(*Factory).TelemeterClientServiceMonitor(0xc4203d4a20, 0xc420098420, 0xc4208d3cc0, 0x6ae93a)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/manifests/manifests.go:1441 +0x11d
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TelemeterClientTask).Run(0xc4203e10a0, 0x1, 0x1)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/telemeter.go:36 +0x33
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TaskRunner).ExecuteTask(0xc4208d3eb8, 0xc4203d4ba0, 0xf, 0xc4208d3d60)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/tasks.go:48 +0x34
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TaskRunner).RunAll(0xc4208d3eb8, 0xc4200f31c0, 0x351c07d935ca6caa)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/tasks.go:38 +0x141
github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).sync(0xc420348500, 0xc4203e21e0, 0x2e, 0x11a2f40, 0xc42042a060)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:251 +0x828
github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).processNextWorkItem(0xc420348500, 0xc42003bf00)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:201 +0xfb
github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).worker(0xc420348500)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:171 +0x15a
created by github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).Run
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:130 +0x1e2


Expected results:
Pod does not crash.

Additional info:

Comment 1 lserven 2018-10-02 15:09:09 UTC
I have been speaking out-of-band with N. Harrison Ripps. This issue is fundamentally the same as https://bugzilla.redhat.com/show_bug.cgi?id=1634227. The root cause of both issues is that 3.11 OCP images are incorrectly being built from the master branch of the Cluster Monitoring Operator rather than the release-3.11 branch. The commit that causes this crash should never have ended up in 3.11; the master branch switched to 4.0 development some time ago.

Comment 2 lserven 2018-10-02 15:10:14 UTC

*** This bug has been marked as a duplicate of bug 1634227 ***

