Description of problem:
Troubleshooting steps are needed on how to clean up the monitoring project when its pods are stuck in Terminating status.

Version-Release number of selected component (if applicable):
Client Version: openshift-clients-4.2.1-201910220950
Server Version: 4.2.2
Kubernetes Version: v1.14.6+868bc38

How reproducible:
Unsure

Steps to Reproduce:
The OpenShift cluster suddenly goes down (the disk storage target goes down). When the OpenShift virtual machines are brought back up (bare metal UPI on RHV), everything seems to come back up except the monitoring project.

Actual results:

# oc get all
NAME                                               READY   STATUS        RESTARTS   AGE
pod/alertmanager-main-0                            3/3     Terminating   4          41d
pod/alertmanager-main-1                            3/3     Terminating   3          41d
pod/alertmanager-main-2                            3/3     Terminating   4          41d
pod/cluster-monitoring-operator-6bf7c89799-2pbfw   1/1     Running       2          41d
pod/grafana-69f4f95645-84m98                       2/2     Running       0          20d
pod/grafana-69f4f95645-gxkw9                       2/2     Terminating   3          41d
pod/kube-state-metrics-646c968f77-28bfp            3/3     Running       0          20d
pod/kube-state-metrics-646c968f77-hz7gf            3/3     Terminating   3          41d
pod/node-exporter-79gdv                            2/2     Running       8          41d
pod/node-exporter-8zpvc                            2/2     Running       6          41d
pod/node-exporter-d7z8q                            2/2     Running       4          41d
pod/node-exporter-krs6k                            2/2     Running       6          41d
pod/node-exporter-vb9cw                            2/2     Running       6          41d
pod/openshift-state-metrics-7f4bdfbdf9-2mp47       3/3     Terminating   3          41d
pod/openshift-state-metrics-7f4bdfbdf9-x68b6       3/3     Running       0          20d
pod/prometheus-adapter-85c478d57b-hb4kg            1/1     Running       0          4d7h
pod/prometheus-adapter-85c478d57b-vc9pl            1/1     Running       0          4d7h
pod/prometheus-adapter-fbfddf8d9-lrxdm             1/1     Terminating   1          39d
pod/prometheus-k8s-0                               6/6     Terminating   7          41d
pod/prometheus-k8s-1                               6/6     Terminating   7          41d
pod/prometheus-operator-6c7dbc485f-m5cbh           1/1     Running       4          41d
pod/telemeter-client-56fdb5c589-6tkxr              3/3     Running       0          20d

NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main             ClusterIP   172.30.34.131   <none>        9094/TCP                     54d
service/alertmanager-operated         ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   54d
service/cluster-monitoring-operator   ClusterIP   None            <none>        8080/TCP                     54d
service/grafana                       ClusterIP   172.30.112.46   <none>        3000/TCP                     54d
service/kube-state-metrics            ClusterIP   None            <none>        8443/TCP,9443/TCP            54d
service/node-exporter                 ClusterIP   None            <none>        9100/TCP                     54d
service/openshift-state-metrics       ClusterIP   None            <none>        8443/TCP,9443/TCP            54d
service/prometheus-adapter            ClusterIP   172.30.167.74   <none>        443/TCP                      54d
service/prometheus-k8s                ClusterIP   172.30.10.2     <none>        9091/TCP,9092/TCP            54d
service/prometheus-operated           ClusterIP   None            <none>        9090/TCP                     54d
service/prometheus-operator           ClusterIP   None            <none>        8080/TCP                     54d
service/telemeter-client              ClusterIP   None            <none>        8443/TCP                     20d

NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/node-exporter   5         5         4       5            4           kubernetes.io/os=linux   54d

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-monitoring-operator   1/1     1            1           54d
deployment.apps/grafana                       1/1     1            1           54d
deployment.apps/kube-state-metrics            1/1     1            1           54d
deployment.apps/openshift-state-metrics       1/1     1            1           54d
deployment.apps/prometheus-adapter            2/2     2            2           54d
deployment.apps/prometheus-operator           1/1     1            1           54d
deployment.apps/telemeter-client              1/1     1            1           20d

NAME                                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-monitoring-operator-6bf7c89799   1         1         1       41d
replicaset.apps/cluster-monitoring-operator-84cd9df668   0         0         0       54d
replicaset.apps/grafana-5db6fd97f8                       0         0         0       54d
replicaset.apps/grafana-69f4f95645                       1         1         1       41d
replicaset.apps/kube-state-metrics-646c968f77            1         1         1       41d
replicaset.apps/kube-state-metrics-895899678             0         0         0       54d
replicaset.apps/openshift-state-metrics-77d5f699d8       0         0         0       54d
replicaset.apps/openshift-state-metrics-7f4bdfbdf9       1         1         1       41d
replicaset.apps/prometheus-adapter-5dcb66b7bb            0         0         0       9d
replicaset.apps/prometheus-adapter-6679d4b44             0         0         0       54d
replicaset.apps/prometheus-adapter-6bcc8bb5c4            0         0         0       20d
replicaset.apps/prometheus-adapter-74d7c85978            0         0         0       54d
replicaset.apps/prometheus-adapter-785d77c74f            0         0         0       54d
replicaset.apps/prometheus-adapter-7bc7db849b            0         0         0       5d19h
replicaset.apps/prometheus-adapter-7c8d5f9bbd            0         0         0       41d
replicaset.apps/prometheus-adapter-7cd7464584            0         0         0       19d
replicaset.apps/prometheus-adapter-85c478d57b            2         2         2       4d7h
replicaset.apps/prometheus-adapter-d78bc79cf             0         0         0       54d
replicaset.apps/prometheus-adapter-fbfddf8d9             0         0         0       39d
replicaset.apps/prometheus-operator-6584955c55           0         0         0       54d
replicaset.apps/prometheus-operator-6c7bddddfd           0         0         0       54d
replicaset.apps/prometheus-operator-6c7dbc485f           1         1         1       41d
replicaset.apps/telemeter-client-56fdb5c589              1         1         1       20d

NAME                                 READY   AGE
statefulset.apps/alertmanager-main   0/3     54d
statefulset.apps/prometheus-k8s      0/2     54d

NAME                                         HOST/PORT                                                         PATH   SERVICES            PORT    TERMINATION          WILDCARD
route.route.openshift.io/alertmanager-main   alertmanager-main-openshift-monitoring.apps.lab.msp.redhat.com           alertmanager-main   web     reencrypt/Redirect   None
route.route.openshift.io/grafana             grafana-openshift-monitoring.apps.lab.msp.redhat.com                     grafana             https   reencrypt/Redirect   None
route.route.openshift.io/prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.lab.msp.redhat.com              prometheus-k8s      web     reencrypt/Redirect   None

[root@tatooine ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.2     True        False         False      20d
cloud-credential                           4.2.2     True        False         False      54d
cluster-autoscaler                         4.2.2     True        False         False      54d
console                                    4.2.2     True        False         False      20d
dns                                        4.2.2     True        True          True       54d
image-registry                             4.2.2     True        False         False      20d
ingress                                    4.2.2     True        False         False      20d
insights                                   4.2.2     True        False         False      54d
kube-apiserver                             4.2.2     True        False         False      54d
kube-controller-manager                    4.2.2     True        False         False      54d
kube-scheduler                             4.2.2     True        False         False      54d
machine-api                                4.2.2     True        False         False      54d
machine-config                             4.2.2     True        False         False      33d
marketplace                                4.2.2     True        False         False      20d
monitoring                                 4.2.2     False       True          True       20d
network                                    4.2.2     True        True          False      54d
node-tuning                                4.2.2     True        False         False      20d
openshift-apiserver                        4.2.2     True        False         False      34d
openshift-controller-manager               4.2.2     True        False         False      54d
openshift-samples                          4.2.2     True        False         False      41d
operator-lifecycle-manager                 4.2.2     True        False         False      54d
operator-lifecycle-manager-catalog         4.2.2     True        False         False      54d
operator-lifecycle-manager-packageserver   4.2.2     True        False         False      4d7h
service-ca                                 4.2.2     True        False         False      54d
service-catalog-apiserver                  4.2.2     True        False         False      54d
service-catalog-controller-manager         4.2.2     True        False         False      54d
storage                                    4.2.2     True        False         False      41d

Expected results:
The monitoring project should come back up, and if it does not, there should be troubleshooting instructions on where to check or how to bring the project back online.

Additional info:
When attempting to delete the pods that are stuck in Terminating state, the command hangs.

[root@tatooine ~]# oc delete pod/alertmanager-main-2
pod "alertmanager-main-2" deleted

Hitting `Control-C` brings the prompt back.
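For what it's worth, the usual workaround when manual cleanup is needed for pods stuck in Terminating is a force delete; a minimal sketch (note this only removes the pod object from the API server and does not guarantee the container has actually stopped on the node):

# oc -n openshift-monitoring delete pod alertmanager-main-2 --grace-period=0 --force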
Are you able to get the pod logs for one of the stuck prometheus pods, and also the kubelet log on the node where it is running?
(In reply to Paul Gier from comment #1)
> Are you able to get the pod logs for one of the stuck prometheus pods, and
> also the kubelet log on the node where it is running?

Is this the only one you want to see?

[root@tatooine ~]# oc logs pod/prometheus-k8s-0 -c prometheus
Error from server: Get https://10.15.108.87:10250/containerLogs/openshift-monitoring/prometheus-k8s-0/prometheus: remote error: tls: internal error

Or the rest of the containers?

[root@tatooine ~]# oc logs pod/prometheus-k8s-0
Error from server (BadRequest): a container name must be specified for pod prometheus-k8s-0, choose one of: [prometheus prometheus-config-reloader rules-configmap-reloader prometheus-proxy kube-rbac-proxy prom-label-proxy]

Also, since this is RHCOS, how would we get the kubelet log? Should we run `oc adm must-gather` for this, or is there another area to look? Thanks!
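For reference, kubelet logs on an RHCOS node can usually be pulled without interactive SSH; a minimal sketch (the node name is a placeholder, and availability of the `oc adm node-logs` subcommand in this client version is an assumption):

# oc adm node-logs <node-name> -u kubelet
# oc debug node/<node-name> -- chroot /host journalctl -u kubelet --no-pager

`oc adm must-gather` also captures a broader cluster snapshot that is useful to attach to a case, but the commands above are the quicker way to look at a single node's kubelet.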
Are you able to get the logs of cluster-monitoring-operator? Hopefully, that will give some information about what is failing.
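For example, a sketch of pulling those logs (if the operator pod runs more than one container, add `-c cluster-monitoring-operator`):

# oc -n openshift-monitoring logs deployment/cluster-monitoring-operator --tail=200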
Created attachment 1644018 [details] cluster operator logs
Cluster operator logs attached in comment #4; here is the configuration (default, no changes):

[root@tatooine ~]# oc get clusteroperator -o yaml monitoring
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-10-17T01:14:49Z"
  generation: 1
  name: monitoring
  resourceVersion: "27374073"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/monitoring
  uid: 83f440de-f07b-11e9-a275-001a4a16010d
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-11T15:47:05Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-11-22T12:41:11Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter
      failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object
      failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter
      is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 4)'
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-12-11T15:47:05Z"
    message: Rollout of the monitoring stack is in progress. Please wait until it
      finishes.
    reason: RollOutInProgress
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2019-11-19T20:19:47Z"
    status: "False"
    type: Available
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: monitoring
  - group: ""
    name: openshift-monitoring
    resource: namespaces
  versions:
  - name: operator
    version: 4.2.2
From the logs it looks like the node-exporter rollout is not completing for some reason.

I1211 15:47:05.823735       1 operator.go:321] Updating ClusterOperator status to failed. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 4)
E1211 15:47:05.884244       1 operator.go:267] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E1211 15:47:05.884294       1 operator.go:268] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 4)
W1211 15:47:05.884358       1 operator.go:349] No Cluster Monitoring ConfigMap was found. Using defaults.
I1211 15:47:05.946093       1 operator.go:313] Updating ClusterOperator status to in progress.

Reassigning to the node team to investigate why this failed.
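A sketch of narrowing down which node-exporter pod is blocking the rollout (the pod name in the last command is a placeholder for whichever pod shows up as not ready):

# oc -n openshift-monitoring get daemonset node-exporter
# oc -n openshift-monitoring get pods -o wide | grep node-exporter
# oc -n openshift-monitoring describe pod <unready-node-exporter-pod>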
I believe this was fixed by https://github.com/cri-o/cri-o/commit/98d0d9a776d781bcbbda4181e41103126a1bc02f and is in the latest 1.14 build https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1089477. Can you please try this out again with the latest cri-o 1.14 build? Thanks!
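For reference, a sketch of checking which cri-o build a node is currently running (the node name is a placeholder):

# oc debug node/<node-name> -- chroot /host crio --version
# oc debug node/<node-name> -- chroot /host rpm -q cri-o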
(In reply to Urvashi Mohnani from comment #7)
> I believe this was fixed by
> https://github.com/cri-o/cri-o/commit/98d0d9a776d781bcbbda4181e41103126a1bc02f
> and is in the latest 1.14 build
> https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1089477.
> Can you please try this out again with the latest cri-o 1.14 build.
> Thanks!

How do you install the latest upstream bits on an already running RHCOS instance? Thanks!
You can manually switch the cri-o binary on the nodes in the cluster (it is a bit tedious depending on how many nodes you have). These are the steps for doing so (see the example sketch below):

1) Copy the new cri-o binary to the node via scp.
2) ssh into the node and become root, then run `ostree admin unlock --hotfix`.
3) Run `systemctl stop crio`.
4) Run `which crio`.
5) Move the new cri-o binary to the path you got from step 4.
6) Run `systemctl start crio`.
7) `crio --version` should show the new version of the binary you copied over.

One more thing you can do is start a new cluster from the latest 4.2 nightly, which should have the patched cri-o version.
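A rough end-to-end sketch of those steps on a single node (hostnames are placeholders, and /usr/bin/crio is only an assumption; use whatever path `which crio` actually reports):

$ scp ./crio core@<node>:/tmp/crio
$ ssh core@<node>
[core@node ~]$ sudo -i
[root@node ~]# ostree admin unlock --hotfix
[root@node ~]# systemctl stop crio
[root@node ~]# which crio
/usr/bin/crio
[root@node ~]# install -m 0755 /tmp/crio /usr/bin/crio
[root@node ~]# systemctl start crio
[root@node ~]# crio --version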
Hi Sam, any update on whether this worked?
(In reply to Urvashi Mohnani from comment #10)
> Hi Sam, any update on whether this worked?

No update, sorry; I went ahead and rebuilt the cluster on 4.3.1 since I was out of time for a customer demo. Thanks!
Clearing `needinfo` for comment #11.
The patch is in the current latest 4.2 build. Closing; please reopen if this occurs again.