
Bug 1634227

Summary: cluster-monitoring-operator pods in CrashLoopBackOff, image wrongly packaged 4.0 Telemeter client
Product: OpenShift Container Platform
Reporter: Junqi Zhao <juzhao>
Component: Release
Assignee: lserven
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Priority: high
Version: 3.11.0
CC: aos-bugs, jokerman, minden, mmccomas, smunilla, wmeng
Keywords: TestBlocker
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-01-10 09:04:01 UTC
Type: Bug
Attachments: cluster-monitoring-operator pod logs

Description Junqi Zhao 2018-09-29 05:19:14 UTC
Created attachment 1488287 [details]
cluster-monitoring-operator pod logs

Description of problem:
cluster-monitoring-operator pods are in CrashLoopBackOff; the image wrongly packaged the 4.0 Telemeter client.
# oc -n openshift-monitoring get pod
.....
cluster-monitoring-operator-56bb5946c4-49v8n   0/1       CrashLoopBackOff   12         1h

# oc -n openshift-monitoring logs cluster-monitoring-operator-56bb5946c4-49v8n
I0929 03:13:21.844419       1 tasks.go:37] running task Updating Telemeter client
I0929 03:13:21.844509       1 decoder.go:224] decoding stream as YAML
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x10d344d]

goroutine 36 [running]:
github.com/openshift/cluster-monitoring-operator/pkg/manifests.(*Factory).TelemeterClientServiceMonitor(0xc4204ccc60, 0xc4200ae840, 0xc420a51cc0, 0x6ae93a)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/manifests/manifests.go:1441 +0x11d
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TelemeterClientTask).Run(0xc42036a0c0, 0x1, 0x1)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/telemeter.go:36 +0x33
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TaskRunner).ExecuteTask(0xc420a51eb8, 0xc4204cce40, 0xf, 0xc420a51d60)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/tasks.go:48 +0x34
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TaskRunner).RunAll(0xc420a51eb8, 0xc420473a80, 0x53f961cb21d1086d)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/tasks.go:38 +0x141
github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).sync(0xc42017bf00, 0xc4203b0d50, 0x2e, 0x11a2f40, 0xc42042b790)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:251 +0x828
github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).processNextWorkItem(0xc42017bf00, 0xc420037f00)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:201 +0xfb
github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).worker(0xc42017bf00)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:171 +0x15a
created by github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).Run
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:130 +0x1e2
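The crash above is an ordinary Go nil-pointer dereference: the 3.11 operator code dereferences a field that the mismatched 4.0 manifest assets leave unset. A minimal, self-contained sketch of the failure pattern is below; the `ServiceMonitor`/`MonitorSpec` types and `loadServiceMonitor` helper are hypothetical stand-ins for illustration, not the operator's real structs.

```go
package main

import "fmt"

// Hypothetical stand-ins for the operator's manifest types. The real
// operator decodes embedded YAML assets (decoder.go in the log above)
// into typed Kubernetes objects.
type MonitorSpec struct {
	Endpoints []string
}

type ServiceMonitor struct {
	Spec *MonitorSpec
}

// loadServiceMonitor simulates what a mismatched 4.0 asset layout can
// produce inside a 3.11 image: decoding "succeeds", but a nested pointer
// the 3.11 code expects to be populated is left nil.
func loadServiceMonitor(assetMatches bool) *ServiceMonitor {
	if !assetMatches {
		return &ServiceMonitor{} // Spec stays nil
	}
	return &ServiceMonitor{Spec: &MonitorSpec{Endpoints: []string{"https"}}}
}

// describePanic dereferences the possibly-nil Spec and reports the
// recovered runtime error, mirroring the SIGSEGV in the operator log.
func describePanic() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	sm := loadServiceMonitor(false)
	_ = len(sm.Spec.Endpoints) // selecting through the nil Spec panics here
	return ""
}

func main() {
	fmt.Println(describePanic())
}
```

Running this prints the same runtime error string as the operator log, which is why the container exits and kubelet backs it off into CrashLoopBackOff.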

Version-Release number of selected component (if applicable):
ose-cluster-monitoring-operator/images/v3.11.17-1
# openshift version
openshift v3.11.17

How reproducible:
Always

Steps to Reproduce:
1. Deploy cluster monitoring upon openshift v3.11.17

Actual results:
cluster-monitoring-operator pods are in CrashLoopBackOff; the image wrongly packaged the 4.0 Telemeter client.

Expected results:
cluster-monitoring-operator pods should be running normally.

Additional info:

Comment 1 Junqi Zhao 2018-09-29 05:21:56 UTC
Please don't build 3.11 images from the master branch; master is now 4.0.
This issue blocks the etcd monitoring function.

Comment 2 minden 2018-10-01 11:50:28 UTC
Assigning to lserven.

Comment 3 lserven 2018-10-01 11:54:26 UTC
This issue is a result of a bug introduced in https://github.com/openshift/cluster-monitoring-operator/pull/103.

A follow-up PR was made to correct the issue: https://github.com/openshift/cluster-monitoring-operator/pull/109. This PR depended on upstream fixes in https://github.com/openshift/telemeter/pull/35, but the patches were merged out of order.

Finally, one last PR https://github.com/openshift/cluster-monitoring-operator/pull/110 was made to correct everything. From my tests, CMO master is stable again.

Please note that CMO master is 4.0 and _not_ 3.11.

Please verify again to confirm that the issue is fixed.

Comment 4 lserven 2018-10-02 15:10:14 UTC
*** Bug 1635103 has been marked as a duplicate of this bug. ***

Comment 5 lserven 2018-10-02 15:12:02 UTC
As noted in 1635103, the root cause of these issues is that 3.11 OCP images were incorrectly being built from the master branch of the Cluster Monitoring Operator rather than the release-3.11 branch. The commit that caused this crash should never have ended up in 3.11; the master branch switched to 4.0 development some time ago.

Comment 6 lserven 2018-10-02 15:20:56 UTC
I have just made a PR to the OCP images repo to fix the branch for cluster monitoring operator images for 3.11.

Comment 7 Junqi Zhao 2018-10-10 03:22:35 UTC
The issue is fixed; the 4.0 Telemeter client has been removed from the 3.11 branch.

Images:
ose-cluster-monitoring-operator-v3.11.20-1

Please change the status to ON_QA, then I will close it.

Comment 8 lserven 2018-10-10 06:34:31 UTC
Thanks!

Comment 9 DeShuai Ma 2018-10-15 02:20:39 UTC
Per comment 7, moving to VERIFIED.

Comment 11 errata-xmlrpc 2019-01-10 09:04:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024