Bug 1635103 - [3.11] cluster-monitoring-operator pod CrashLoopBackOff
Summary: [3.11] cluster-monitoring-operator pod CrashLoopBackOff
Keywords:
Status: CLOSED DUPLICATE of bug 1634227
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-02 06:38 UTC by Weihua Meng
Modified: 2018-10-02 15:10 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-02 15:10:14 UTC
Target Upstream Version:
Embargoed:



Description Weihua Meng 2018-10-02 06:38:46 UTC
Description of problem:
cluster-monitoring-operator pod CrashLoopBackOff

Version-Release number of selected component (if applicable):
openshift v3.11.18

How reproducible:
Always

Steps to Reproduce:
1. Install OCP v3.11

Actual results:
Install succeeded, but the cluster-monitoring-operator pod goes into CrashLoopBackOff.

# oc get pod -n openshift-monitoring
NAME                                           READY     STATUS             RESTARTS   AGE
cluster-monitoring-operator-56bb5946c4-d5b5b   0/1       CrashLoopBackOff   23         2h


# oc describe pod/cluster-monitoring-operator-56bb5946c4-gb7lf

Events:
  Type     Reason     Age               From                                  Message
  ----     ------     ----              ----                                  -------
  Normal   Scheduled  10m               default-scheduler                     Successfully assigned openshift-monitoring/cluster-monitoring-operator-56bb5946c4-gb7lf to ip-172-18-8-81.ec2.internal
  Normal   Pulling    10m               kubelet, ip-172-18-8-81.ec2.internal  pulling image "registry.reg-aws.openshift.com:443/openshift3/ose-cluster-monitoring-operator:v3.11"
  Normal   Pulled     10m               kubelet, ip-172-18-8-81.ec2.internal  Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/ose-cluster-monitoring-operator:v3.11"
  Normal   Created    1m (x5 over 10m)  kubelet, ip-172-18-8-81.ec2.internal  Created container
  Normal   Pulled     1m (x4 over 5m)   kubelet, ip-172-18-8-81.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-cluster-monitoring-operator:v3.11" already present on machine
  Normal   Started    1m (x5 over 10m)  kubelet, ip-172-18-8-81.ec2.internal  Started container
  Warning  BackOff    3s (x8 over 4m)   kubelet, ip-172-18-8-81.ec2.internal  Back-off restarting failed container


# oc logs pod/cluster-monitoring-operator-56bb5946c4-gb7lf

I1002 04:52:59.433584       1 decoder.go:224] decoding stream as YAML
I1002 04:53:00.624866       1 tasks.go:37] running task Updating Telemeter client
I1002 04:53:00.624964       1 decoder.go:224] decoding stream as YAML
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x10d344d]

goroutine 13 [running]:
github.com/openshift/cluster-monitoring-operator/pkg/manifests.(*Factory).TelemeterClientServiceMonitor(0xc4203d4a20, 0xc420098420, 0xc4208d3cc0, 0x6ae93a)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/manifests/manifests.go:1441 +0x11d
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TelemeterClientTask).Run(0xc4203e10a0, 0x1, 0x1)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/telemeter.go:36 +0x33
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TaskRunner).ExecuteTask(0xc4208d3eb8, 0xc4203d4ba0, 0xf, 0xc4208d3d60)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/tasks.go:48 +0x34
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TaskRunner).RunAll(0xc4208d3eb8, 0xc4200f31c0, 0x351c07d935ca6caa)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/tasks.go:38 +0x141
github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).sync(0xc420348500, 0xc4203e21e0, 0x2e, 0x11a2f40, 0xc42042a060)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:251 +0x828
github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).processNextWorkItem(0xc420348500, 0xc42003bf00)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:201 +0xfb
github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).worker(0xc420348500)
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:171 +0x15a
created by github.com/openshift/cluster-monitoring-operator/pkg/operator.(*Operator).Run
	/go/src/github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:130 +0x1e2


Expected results:
Pod does not crash.

Additional info:

Comment 1 lserven 2018-10-02 15:09:09 UTC
I have been speaking out-of-band with N. Harrison Ripps. This issue is fundamentally the same as https://bugzilla.redhat.com/show_bug.cgi?id=1634227. The root cause of both issues is that 3.11 OCP images are incorrectly being built from the master branch of the Cluster Monitoring Operator rather than the release-3.11 branch. The commit that causes this crash should never have ended up in 3.11; the master branch switched to 4.0 development some time ago.

Comment 2 lserven 2018-10-02 15:10:14 UTC

*** This bug has been marked as a duplicate of bug 1634227 ***

