Bug 1781824

Summary: Ways to clean up the monitoring project with pods stuck in terminating status
Product: OpenShift Container Platform
Reporter: Sam Yangsao <syangsao>
Component: Node
Assignee: Urvashi Mohnani <umohnani>
Status: CLOSED CURRENTRELEASE
QA Contact: Sunil Choudhary <schoudha>
Severity: medium
Priority: unspecified
Version: 4.2.z
CC: alegrand, anpicker, aos-bugs, erooth, gkimetto, jokerman, kakkoyun, lcosic, mdunn, mlabonte, mloibl, pkrupa, rphillips, surbania
Target Milestone: ---
Target Release: 4.4.0
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Doc Text:
Cause: CRI-O was not cleaning up pod IPs correctly on node reboot when pods could not be restored. Consequence: This would lead to node IP exhaustion, causing pods to fail to start. Fix: CRI-O was fixed to correctly clean up IPs. Result:
Story Points: ---
Last Closed: 2020-03-05 16:44:58 UTC
Type: Bug
Bug Blocks: 1731242    
Attachments: cluster operator logs

Description Sam Yangsao 2019-12-10 16:22:12 UTC
Description of problem:

Need troubleshooting steps for cleaning up the monitoring project (openshift-monitoring) when pods are stuck in Terminating status.

Version-Release number of selected component (if applicable):

Client Version: openshift-clients-4.2.1-201910220950
Server Version: 4.2.2
Kubernetes Version: v1.14.6+868bc38

How reproducible:

Unsure

Steps to Reproduce:

The OpenShift cluster suddenly goes down (the disk storage target goes down). When the OpenShift virtual machines are brought back up (bare-metal UPI on RHV), everything appears to come back up except the monitoring project.

Actual results:

# oc get all
NAME                                               READY   STATUS        RESTARTS   AGE
pod/alertmanager-main-0                            3/3     Terminating   4          41d
pod/alertmanager-main-1                            3/3     Terminating   3          41d
pod/alertmanager-main-2                            3/3     Terminating   4          41d
pod/cluster-monitoring-operator-6bf7c89799-2pbfw   1/1     Running       2          41d
pod/grafana-69f4f95645-84m98                       2/2     Running       0          20d
pod/grafana-69f4f95645-gxkw9                       2/2     Terminating   3          41d
pod/kube-state-metrics-646c968f77-28bfp            3/3     Running       0          20d
pod/kube-state-metrics-646c968f77-hz7gf            3/3     Terminating   3          41d
pod/node-exporter-79gdv                            2/2     Running       8          41d
pod/node-exporter-8zpvc                            2/2     Running       6          41d
pod/node-exporter-d7z8q                            2/2     Running       4          41d
pod/node-exporter-krs6k                            2/2     Running       6          41d
pod/node-exporter-vb9cw                            2/2     Running       6          41d
pod/openshift-state-metrics-7f4bdfbdf9-2mp47       3/3     Terminating   3          41d
pod/openshift-state-metrics-7f4bdfbdf9-x68b6       3/3     Running       0          20d
pod/prometheus-adapter-85c478d57b-hb4kg            1/1     Running       0          4d7h
pod/prometheus-adapter-85c478d57b-vc9pl            1/1     Running       0          4d7h
pod/prometheus-adapter-fbfddf8d9-lrxdm             1/1     Terminating   1          39d
pod/prometheus-k8s-0                               6/6     Terminating   7          41d
pod/prometheus-k8s-1                               6/6     Terminating   7          41d
pod/prometheus-operator-6c7dbc485f-m5cbh           1/1     Running       4          41d
pod/telemeter-client-56fdb5c589-6tkxr              3/3     Running       0          20d

NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main             ClusterIP   172.30.34.131   <none>        9094/TCP                     54d
service/alertmanager-operated         ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   54d
service/cluster-monitoring-operator   ClusterIP   None            <none>        8080/TCP                     54d
service/grafana                       ClusterIP   172.30.112.46   <none>        3000/TCP                     54d
service/kube-state-metrics            ClusterIP   None            <none>        8443/TCP,9443/TCP            54d
service/node-exporter                 ClusterIP   None            <none>        9100/TCP                     54d
service/openshift-state-metrics       ClusterIP   None            <none>        8443/TCP,9443/TCP            54d
service/prometheus-adapter            ClusterIP   172.30.167.74   <none>        443/TCP                      54d
service/prometheus-k8s                ClusterIP   172.30.10.2     <none>        9091/TCP,9092/TCP            54d
service/prometheus-operated           ClusterIP   None            <none>        9090/TCP                     54d
service/prometheus-operator           ClusterIP   None            <none>        8080/TCP                     54d
service/telemeter-client              ClusterIP   None            <none>        8443/TCP                     20d

NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/node-exporter   5         5         4       5            4           kubernetes.io/os=linux   54d

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-monitoring-operator   1/1     1            1           54d
deployment.apps/grafana                       1/1     1            1           54d
deployment.apps/kube-state-metrics            1/1     1            1           54d
deployment.apps/openshift-state-metrics       1/1     1            1           54d
deployment.apps/prometheus-adapter            2/2     2            2           54d
deployment.apps/prometheus-operator           1/1     1            1           54d
deployment.apps/telemeter-client              1/1     1            1           20d

NAME                                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-monitoring-operator-6bf7c89799   1         1         1       41d
replicaset.apps/cluster-monitoring-operator-84cd9df668   0         0         0       54d
replicaset.apps/grafana-5db6fd97f8                       0         0         0       54d
replicaset.apps/grafana-69f4f95645                       1         1         1       41d
replicaset.apps/kube-state-metrics-646c968f77            1         1         1       41d
replicaset.apps/kube-state-metrics-895899678             0         0         0       54d
replicaset.apps/openshift-state-metrics-77d5f699d8       0         0         0       54d
replicaset.apps/openshift-state-metrics-7f4bdfbdf9       1         1         1       41d
replicaset.apps/prometheus-adapter-5dcb66b7bb            0         0         0       9d
replicaset.apps/prometheus-adapter-6679d4b44             0         0         0       54d
replicaset.apps/prometheus-adapter-6bcc8bb5c4            0         0         0       20d
replicaset.apps/prometheus-adapter-74d7c85978            0         0         0       54d
replicaset.apps/prometheus-adapter-785d77c74f            0         0         0       54d
replicaset.apps/prometheus-adapter-7bc7db849b            0         0         0       5d19h
replicaset.apps/prometheus-adapter-7c8d5f9bbd            0         0         0       41d
replicaset.apps/prometheus-adapter-7cd7464584            0         0         0       19d
replicaset.apps/prometheus-adapter-85c478d57b            2         2         2       4d7h
replicaset.apps/prometheus-adapter-d78bc79cf             0         0         0       54d
replicaset.apps/prometheus-adapter-fbfddf8d9             0         0         0       39d
replicaset.apps/prometheus-operator-6584955c55           0         0         0       54d
replicaset.apps/prometheus-operator-6c7bddddfd           0         0         0       54d
replicaset.apps/prometheus-operator-6c7dbc485f           1         1         1       41d
replicaset.apps/telemeter-client-56fdb5c589              1         1         1       20d

NAME                                 READY   AGE
statefulset.apps/alertmanager-main   0/3     54d
statefulset.apps/prometheus-k8s      0/2     54d

NAME                                         HOST/PORT                                                        PATH   SERVICES            PORT    TERMINATION          WILDCARD
route.route.openshift.io/alertmanager-main   alertmanager-main-openshift-monitoring.apps.lab.msp.redhat.com          alertmanager-main   web     reencrypt/Redirect   None
route.route.openshift.io/grafana             grafana-openshift-monitoring.apps.lab.msp.redhat.com                    grafana             https   reencrypt/Redirect   None
route.route.openshift.io/prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.lab.msp.redhat.com             prometheus-k8s      web     reencrypt/Redirect   None
[root@tatooine ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.2     True        False         False      20d
cloud-credential                           4.2.2     True        False         False      54d
cluster-autoscaler                         4.2.2     True        False         False      54d
console                                    4.2.2     True        False         False      20d
dns                                        4.2.2     True        True          True       54d
image-registry                             4.2.2     True        False         False      20d
ingress                                    4.2.2     True        False         False      20d
insights                                   4.2.2     True        False         False      54d
kube-apiserver                             4.2.2     True        False         False      54d
kube-controller-manager                    4.2.2     True        False         False      54d
kube-scheduler                             4.2.2     True        False         False      54d
machine-api                                4.2.2     True        False         False      54d
machine-config                             4.2.2     True        False         False      33d
marketplace                                4.2.2     True        False         False      20d
monitoring                                 4.2.2     False       True          True       20d
network                                    4.2.2     True        True          False      54d
node-tuning                                4.2.2     True        False         False      20d
openshift-apiserver                        4.2.2     True        False         False      34d
openshift-controller-manager               4.2.2     True        False         False      54d
openshift-samples                          4.2.2     True        False         False      41d
operator-lifecycle-manager                 4.2.2     True        False         False      54d
operator-lifecycle-manager-catalog         4.2.2     True        False         False      54d
operator-lifecycle-manager-packageserver   4.2.2     True        False         False      4d7h
service-ca                                 4.2.2     True        False         False      54d
service-catalog-apiserver                  4.2.2     True        False         False      54d
service-catalog-controller-manager         4.2.2     True        False         False      54d
storage                                    4.2.2     True        False         False      41d

Expected results:

The monitoring project should attempt to come back up on its own; if it doesn't, there should be troubleshooting instructions on where to check and how to bring the project back online.

Additional info:

When attempting to delete the pods that are in terminating state, the command hangs.

[root@tatooine ~]# oc delete pod/alertmanager-main-2
pod "alertmanager-main-2" deleted

Hitting `Control-C` brings the prompt back.
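
For reference, a pod stuck in Terminating can usually be force-removed from the API as in the example below (this only deletes the pod object; it does not guarantee the container or its IP is actually cleaned up on the node, which is the underlying concern here):

# oc -n openshift-monitoring delete pod alertmanager-main-2 --grace-period=0 --force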

Comment 1 Paul Gier 2019-12-10 22:05:10 UTC
Are you able to get the pod logs for one of the stuck prometheus pods, and also the kubelet log on the node where it is running?

Comment 2 Sam Yangsao 2019-12-10 22:24:11 UTC
(In reply to Paul Gier from comment #1)
> Are you able to get the pod logs for one of the stuck prometheus pods, and
> also the kubelet log on the node where it is running?

Is this the only one you want to see?  

[root@tatooine ~]# oc logs pod/prometheus-k8s-0 -c prometheus
Error from server: Get https://10.15.108.87:10250/containerLogs/openshift-monitoring/prometheus-k8s-0/prometheus: remote error: tls: internal error

Or the rest of the containers?

[root@tatooine ~]# oc logs pod/prometheus-k8s-0
Error from server (BadRequest): a container name must be specified for pod prometheus-k8s-0, choose one of: [prometheus prometheus-config-reloader rules-configmap-reloader prometheus-proxy kube-rbac-proxy prom-label-proxy]

Also, since this is RHCOS, how would we get the kubelet log?  Should we run `oc adm must-gather` for this, or is there another area to look?

Thanks!
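
On RHCOS nodes the kubelet journal can normally be pulled without SSH access; for example (the node name is a placeholder):

# oc adm node-logs <node-name> -u kubelet

or, interactively:

# oc debug node/<node-name>
sh-4.4# chroot /host journalctl -u kubelet --no-pager | tail -n 500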

Comment 3 Paul Gier 2019-12-11 15:25:07 UTC
Are you able to get the logs of cluster-monitoring-operator?  Hopefully, that will give some information about what is failing.

Comment 4 Sam Yangsao 2019-12-11 15:51:55 UTC
Created attachment 1644018 [details]
cluster operator logs

Comment 5 Sam Yangsao 2019-12-11 15:52:56 UTC
Cluster operator logs are attached in comment #4; here's the configuration (default, no changes):

[root@tatooine ~]# oc get clusteroperator -o yaml monitoring
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-10-17T01:14:49Z"
  generation: 1
  name: monitoring
  resourceVersion: "27374073"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/monitoring
  uid: 83f440de-f07b-11e9-a275-001a4a16010d
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-11T15:47:05Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-11-22T12:41:11Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter
      failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object
      failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter
      is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 4)'
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-12-11T15:47:05Z"
    message: Rollout of the monitoring stack is in progress. Please wait until it
      finishes.
    reason: RollOutInProgress
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2019-11-19T20:19:47Z"
    status: "False"
    type: Available
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: monitoring
  - group: ""
    name: openshift-monitoring
    resource: namespaces
  versions:
  - name: operator
    version: 4.2.2

Comment 6 Paul Gier 2019-12-11 16:29:16 UTC
From the logs it looks like the node-exporter rollout is not completing for some reason.

I1211 15:47:05.823735       1 operator.go:321] Updating ClusterOperator status to failed. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 4)
E1211 15:47:05.884244       1 operator.go:267] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E1211 15:47:05.884294       1 operator.go:268] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 4)
W1211 15:47:05.884358       1 operator.go:349] No Cluster Monitoring ConfigMap was found. Using defaults.
I1211 15:47:05.946093       1 operator.go:313] Updating ClusterOperator status to in progress.


Reassigning to the node team to investigate why this failed.
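
To narrow down which node the unready node-exporter pod is stuck on, something like the following can be used (the namespace is the default used by the monitoring stack; the node name comes from the first command's output):

# oc -n openshift-monitoring get pods -o wide | grep node-exporter
# oc -n openshift-monitoring describe daemonset node-exporter
# oc describe node <node-name>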

Comment 7 Urvashi Mohnani 2020-02-27 14:58:12 UTC
I believe this was fixed by https://github.com/cri-o/cri-o/commit/98d0d9a776d781bcbbda4181e41103126a1bc02f and is in the latest 1.14 build https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1089477. Can you please try this out again with the latest cri-o 1.14 build.
Thanks!

Comment 8 Sam Yangsao 2020-02-27 15:13:20 UTC
(In reply to Urvashi Mohnani from comment #7)
> I believe this was fixed by
> https://github.com/cri-o/cri-o/commit/
> 98d0d9a776d781bcbbda4181e41103126a1bc02f and is in the latest 1.14 build
> https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1089477. Can
> you please try this out again with the latest cri-o 1.14 build.
> Thanks!

How do you install the latest upstream bits on an already running RHCOS instance?

Thanks!

Comment 9 Urvashi Mohnani 2020-03-02 17:23:21 UTC
You can manually switch the cri-o binary on the nodes in the cluster (it is a bit tedious depending on how many nodes you have). These are the steps for doing so (a consolidated example follows the list):

1) Copy over the new cri-o binary to the node via scp
2) ssh into the node and become root, then run `ostree admin unlock --hotfix`
3) run `systemctl stop crio`
4) run `which crio`
5) move the new cri-o binary over to the path you get from step 4
6) run `systemctl start crio`
7) run `crio --version`; it should show the new version of the cri-o binary you copied over
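
Put together, and assuming the usual RHCOS layout where the binary lives at /usr/bin/crio and nodes are reachable as the core user (both assumptions, adjust as needed), the swap on one node looks roughly like:

$ scp ./crio core@<node-name>:/tmp/crio
$ ssh core@<node-name>
$ sudo -i
# ostree admin unlock --hotfix
# systemctl stop crio
# which crio          # confirm the install path, typically /usr/bin/crio
# cp /tmp/crio /usr/bin/crio
# systemctl start crio
# crio --version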

Alternatively, you can start a new cluster from the latest 4.2 nightly, which should already have the patched cri-o version.

Comment 10 Urvashi Mohnani 2020-03-05 15:15:02 UTC
Hi Sam, any update on whether this worked?

Comment 11 Sam Yangsao 2020-03-05 15:23:29 UTC
(In reply to Urvashi Mohnani from comment #10)
> Hi Sam, any update on whether this worked?

No update, sorry, I went ahead and rebuilt the cluster on 4.3.1 since I was out of time for a customer demo.  Thanks!

Comment 12 Sam Yangsao 2020-03-05 15:24:07 UTC
Clearing `needinfo` for comment#11

Comment 13 Urvashi Mohnani 2020-03-05 16:44:58 UTC
The patch is in the current latest 4.2 release. Closing; please reopen if this occurs again.