Bug 1963775 - cluster-monitoring-operator shows as failed during SNO deployment
Summary: cluster-monitoring-operator shows as failed during SNO deployment
Keywords:
Status: CLOSED DUPLICATE of bug 1963833
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-24 01:42 UTC by Alexander Chuzhoy
Modified: 2021-05-25 10:18 UTC
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-25 10:18:58 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Alexander Chuzhoy 2021-05-24 01:42:22 UTC
Version:
4.8.0-0.nightly-2021-05-21-233425


Steps to reproduce:

Try to deploy SNO with ipv4.

Result:
The monitoring operator appears as failed.

The cluster-monitoring-operator container crash-loops; the last task logged before the crash is "Updating Thanos Querier", followed by:
panic: runtime error: invalid memory address or nil pointer dereference
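
The full panic is visible in the crashed container's termination message (see the pod describe output below); it can also be pulled directly from the previous container instance with a command along these lines (pod name taken from this cluster):

oc logs -n openshift-monitoring cluster-monitoring-operator-fdb9d949c-44w8r -c cluster-monitoring-operator --previous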



oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          76m     Unable to apply 4.8.0-0.nightly-2021-05-21-233425: the cluster operator monitoring has not yet successfully rolled out


oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE                                          NAME                                                                         READY   STATUS             RESTARTS   AGE
openshift-monitoring                               cluster-monitoring-operator-fdb9d949c-44w8r                                  1/2     CrashLoopBackOff   12         41m



oc describe pod -n openshift-monitoring cluster-monitoring-operator-fdb9d949c-44w8r
Name:                 cluster-monitoring-operator-fdb9d949c-44w8r
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 openshift-master-0.qe1.kni.lab.eng.bos.redhat.com/10.19.134.13
Start Time:           Sun, 23 May 2021 20:59:17 -0400
Labels:               app=cluster-monitoring-operator
                      pod-template-hash=fdb9d949c
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "10.128.0.96"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "10.128.0.96"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: restricted
                      workload.openshift.io/warning:
                        the node "openshift-master-0.qe1.kni.lab.eng.bos.redhat.com" does not have resource "management.workload.openshift.io/cores"
Status:               Running
IP:                   10.128.0.96
IPs:
  IP:           10.128.0.96
Controlled By:  ReplicaSet/cluster-monitoring-operator-fdb9d949c
Containers:
  kube-rbac-proxy:
    Container ID:  cri-o://d40eaff5abbbbb282377648874bc863f40701d00426b7e2a016edb9f3b8f27b4
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73daea39b02fbf384a6c0fdc5db7b6034d45112004633d72f508b31c6c5f1c3f
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73daea39b02fbf384a6c0fdc5db7b6034d45112004633d72f508b31c6c5f1c3f
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --logtostderr
      --secure-listen-address=:8443
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
      --upstream=http://127.0.0.1:8080/
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Running
      Started:      Sun, 23 May 2021 20:59:21 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/tls/private from cluster-monitoring-operator-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-monitoring-operator-token-zqgsw (ro)
  cluster-monitoring-operator:
    Container ID:  cri-o://d4ba04b17e939f768d02b40daaba636b44aeebb0caf132c4931e76ae73717b0b
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:478333df826fbd4534d5bfc8f27ea5b01bb531d62cb18832a9da4d6a8bcc538f
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:478333df826fbd4534d5bfc8f27ea5b01bb531d62cb18832a9da4d6a8bcc538f
    Port:          <none>
    Host Port:     <none>
    Args:
      -namespace=openshift-monitoring
      -namespace-user-workload=openshift-user-workload-monitoring
      -configmap=cluster-monitoring-config
      -release-version=$(RELEASE_VERSION)
      -logtostderr=true
      -v=2
      -images=prometheus-operator=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:01bcc9143ee529339cba78255b0eef80014022d43b78df6c00f3b90949a3e54b
      -images=prometheus-config-reloader=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e27249e9080ed72bd2a720e62391dcd81f589a565978e7830aaa34f15daee4f
      -images=configmap-reloader=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5897a3a33b597a6d97912f3715642de762b64f0c39c982975d0417672351a1b5
      -images=prometheus=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1bfa5629bfde2d2e045a600e0f83d3a47ad3740b2051e0f6f87ee02467af9330
      -images=alertmanager=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4cb695db9dd455904b6f23b9b1201040286fa37d98aeb8fe1302f5c0f1794e83
      -images=grafana=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:493ef314d2c1977ff66d65d3d905377f0a297d030d895078512bfbc878a8781f
      -images=oauth-proxy=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:013f14899294e4d6e18aab7ea9d0b6d98db99e477f49607d9287dc5caba3ec5d
      -images=node-exporter=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b5f57488e90465919487e47abffda690f346c108180b159b51cd693dbba197b1
      -images=kube-state-metrics=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d427b8a5548b85b8fe8a2a673dfec821bf898197d8530c8372fd4053accb3179
      -images=openshift-state-metrics=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31ec557de595a7caf68a638a0363ac91dda60a85649b6b34d91d849b435384b0
      -images=kube-rbac-proxy=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73daea39b02fbf384a6c0fdc5db7b6034d45112004633d72f508b31c6c5f1c3f
      -images=telemeter-client=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:354c44628376adcc8e68592cdf4a67ba83f453d8a2deca013fa720be43beb1f4
      -images=prom-label-proxy=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:833933f54d0d72bf2a6195b05800155a955c77ab961c468a1d724b777acf7cbb
      -images=k8s-prometheus-adapter=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
      -images=thanos=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55aaf9dc4d1e8495b543d886688db12b72b69a2826d89d3b0f44f0dbed30f86d
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:    tasks.go:46] ran task 8 of 16: Updating node-exporter
I0524 01:37:50.173099       1 tasks.go:46] ran task 11 of 16: Updating prometheus-adapter
I0524 01:37:50.240853       1 tasks.go:46] ran task 12 of 16: Updating Telemeter client
I0524 01:37:51.211841       1 tasks.go:46] ran task 1 of 16: Updating Prometheus Operator
I0524 01:37:51.663992       1 tasks.go:46] ran task 4 of 16: Updating Grafana
I0524 01:37:53.381295       1 tasks.go:46] ran task 14 of 16: Updating Thanos Querier
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x17f5c58]

goroutine 296 [running]:
github.com/openshift/cluster-monitoring-operator/pkg/client.(*Client).DeletePodDisruptionBudget(0xc000155970, 0x0, 0x0, 0x0)
  /go/src/github.com/openshift/cluster-monitoring-operator/pkg/client/client.go:451 +0x98
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*PrometheusUserWorkloadTask).destroy(0xc000758738, 0x226e214, 0xc000208070)
  /go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/prometheus_user_workload.go:311 +0x4ba
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*PrometheusUserWorkloadTask).Run(0xc000758738, 0xc000000000, 0x0)
  /go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/prometheus_user_workload.go:44 +0x6c
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TaskRunner).ExecuteTask(...)
  /go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/tasks.go:66
github.com/openshift/cluster-monitoring-operator/pkg/tasks.(*TaskRunner).RunAll.func1(0x0, 0x0)
  /go/src/github.com/openshift/cluster-monitoring-operator/pkg/tasks/tasks.go:45 +0x1d1
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000c81140, 0xc00044bb80)
  /go/src/github.com/openshift/cluster-monitoring-operator/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
  /go/src/github.com/openshift/cluster-monitoring-operator/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66

      Exit Code:    2
      Started:      Sun, 23 May 2021 21:37:48 -0400
      Finished:     Sun, 23 May 2021 21:38:00 -0400
    Ready:          False
    Restart Count:  12
    Requests:
      cpu:     10m
      memory:  75Mi
    Environment:
      RELEASE_VERSION:  4.8.0-0.nightly-2021-05-21-233425
    Mounts:
      /etc/cluster-monitoring-operator/telemetry from telemetry-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-monitoring-operator-token-zqgsw (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  telemetry-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      telemetry-config
    Optional:  false
  cluster-monitoring-operator-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-monitoring-operator-tls
    Optional:    true
  cluster-monitoring-operator-token-zqgsw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-monitoring-operator-token-zqgsw
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
                 node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       41m                  default-scheduler  Successfully assigned openshift-monitoring/cluster-monitoring-operator-fdb9d949c-44w8r to openshift-master-0.qe1.kni.lab.eng.bos.redhat.com
  Normal   AddedInterface  41m                  multus             Add eth0 [10.128.0.96/23]
  Normal   Pulled          41m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:73daea39b02fbf384a6c0fdc5db7b6034d45112004633d72f508b31c6c5f1c3f" already present on machine
  Normal   Created         41m                  kubelet            Created container kube-rbac-proxy
  Normal   Started         41m                  kubelet            Started container kube-rbac-proxy
  Normal   Pulled          39m (x5 over 41m)    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:478333df826fbd4534d5bfc8f27ea5b01bb531d62cb18832a9da4d6a8bcc538f" already present on machine
  Normal   Created         39m (x5 over 41m)    kubelet            Created container cluster-monitoring-operator
  Normal   Started         39m (x5 over 41m)    kubelet            Started container cluster-monitoring-operator
  Warning  BackOff         98s (x174 over 41m)  kubelet            Back-off restarting failed container
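
For context on the trace in the termination message: client.(*Client).DeletePodDisruptionBudget is invoked from PrometheusUserWorkloadTask.destroy with a nil PodDisruptionBudget (the 0x0 arguments in the goroutine dump), so dereferencing the object inside the client panics; presumably no PDB is built for the single-node topology, which is why only SNO hits this. Below is a minimal, self-contained Go sketch of that failure mode and a nil-guard; the type and function names are simplified stand-ins, not the operator's actual code, and the real fix is in the PR referenced in comment 2.

package main

import "fmt"

// PodDisruptionBudget is a simplified stand-in for the policy/v1 object;
// only the fields needed for this sketch are included.
type PodDisruptionBudget struct {
	Namespace string
	Name      string
}

// deletePodDisruptionBudget mirrors the shape of the call in the trace:
// it dereferences pdb to read namespace/name, so a nil pdb panics with
// "invalid memory address or nil pointer dereference" unless guarded.
func deletePodDisruptionBudget(pdb *PodDisruptionBudget) error {
	if pdb == nil {
		// Nil guard: nothing to delete, treat it as a no-op instead of panicking.
		return nil
	}
	fmt.Printf("deleting PDB %s/%s\n", pdb.Namespace, pdb.Name)
	return nil
}

func main() {
	// With the guard, a nil PDB (the SNO case) is tolerated; without it,
	// pdb.Namespace would reproduce the nil pointer dereference above.
	if err := deletePodDisruptionBudget(nil); err != nil {
		fmt.Println("delete failed:", err)
	}
}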

Comment 1 Junqi Zhao 2021-05-24 06:33:44 UTC
Checked with the same 4.8.0-0.nightly-2021-05-21-233425 payload; there is no issue in a HighlyAvailable cluster, so this issue only affects SNO.
# oc -n openshift-monitoring get pod | grep cluster-monitoring-operator
cluster-monitoring-operator-fdb9d949c-vkl5q   2/2     Running   2          7h14m

Comment 2 Omer Tuchfeld 2021-05-24 08:02:04 UTC
Bug 1963833 is a duplicate of this one, but a PR [1] has already been created and linked against bug 1963833, so I would prefer that this one be closed.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/1176

Comment 3 Simon Pasquier 2021-05-25 10:18:58 UTC

*** This bug has been marked as a duplicate of bug 1963833 ***

