Bug 1781824
| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Summary: | Ways to clean up the monitoring project with pods stuck in terminating status | | |
| Product: | OpenShift Container Platform | Reporter: | Sam Yangsao <syangsao> |
| Component: | Node | Assignee: | Urvashi Mohnani <umohnani> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.z | CC: | alegrand, anpicker, aos-bugs, erooth, gkimetto, jokerman, kakkoyun, lcosic, mdunn, mlabonte, mloibl, pkrupa, rphillips, surbania |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-03-05 16:44:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1731242 | | |
| Attachments: | | | |

Doc Text (Bug Fix):

- Cause: CRI-O was not cleaning up pod IPs correctly on node reboot when pods could not be restored.
- Consequence: this led to IP exhaustion on the node, causing new pods to fail to start.
- Fix: CRI-O was fixed to correctly clean up the IPs.
- Result:
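The failure mode described in the Doc Text can be spot-checked on a node. A minimal, hedged sketch, assuming the host-local IPAM plugin (which records one file per allocated IP under `/var/lib/cni/networks/<network>/`); the path and the `openshift-sdn` network name are assumptions, not taken from this bug:

```console
$ ssh core@<node>
# Count IPAM allocations recorded on disk (approximate; the directory
# also holds a couple of bookkeeping files):
[core@node ~]$ ls /var/lib/cni/networks/openshift-sdn/ | wc -l
# Count live pod sandboxes known to CRI-O:
[core@node ~]$ sudo crictl pods -q | wc -l
# A gap that persists and grows across reboots suggests leaked IP
# allocations, which eventually exhausts the node's pod CIDR.
```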
---

**Description** (Sam Yangsao, 2019-12-10 16:22:12 UTC)
---

**Comment 1 (Paul Gier):**

Are you able to get the pod logs for one of the stuck prometheus pods, and also the kubelet log on the node where it is running?

---

**Comment 2 (Sam Yangsao):**

(In reply to Paul Gier from comment #1)

> Are you able to get the pod logs for one of the stuck prometheus pods, and also the kubelet log on the node where it is running?

Is this the only one you want to see?

```console
[root@tatooine ~]# oc logs pod/prometheus-k8s-0 -c prometheus
Error from server: Get https://10.15.108.87:10250/containerLogs/openshift-monitoring/prometheus-k8s-0/prometheus: remote error: tls: internal error
```

Or the rest of the containers?

```console
[root@tatooine ~]# oc logs pod/prometheus-k8s-0
Error from server (BadRequest): a container name must be specified for pod prometheus-k8s-0, choose one of: [prometheus prometheus-config-reloader rules-configmap-reloader prometheus-proxy kube-rbac-proxy prom-label-proxy]
```

Also, since this is RHCOS, how would we get the kubelet log? Should we run `oc adm gather` for this, or is there another area to look? Thanks!
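The kubelet-log question above never gets a direct answer in the thread. For reference, a hedged sketch of two standard ways to pull kubelet logs from an RHCOS node; `<node-name>` is a placeholder:

```console
# Without SSH, via the OpenShift CLI:
$ oc adm node-logs <node-name> -u kubelet

# Or over SSH as the core user, reading the systemd journal directly:
$ ssh core@<node-name> sudo journalctl -u kubelet --no-pager -n 200
```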
---

**Comment 3:**

Are you able to get the logs of cluster-monitoring-operator? Hopefully, that will give some information about what is failing.

---

**Comment 4 (Sam Yangsao):**

Created attachment 1644018 [details]: cluster operator logs
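The logs requested in comment 3 (and attached in comment 4) can also be pulled directly. A hedged one-liner; the namespace and deployment names are the OCP 4 defaults, and if the pod carries more than one container, add `-c cluster-monitoring-operator`:

```console
$ oc -n openshift-monitoring logs deployment/cluster-monitoring-operator --tail=200
```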
---

**Comment 5 (Sam Yangsao):**

Cluster operator logs are attached in comment #4; here's the configuration (default, no changes):

```console
[root@tatooine ~]# oc get clusteroperator -o yaml monitoring
```

```yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-10-17T01:14:49Z"
  generation: 1
  name: monitoring
  resourceVersion: "27374073"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/monitoring
  uid: 83f440de-f07b-11e9-a275-001a4a16010d
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-11T15:47:05Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-11-22T12:41:11Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter
      failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object
      failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter
      is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 4)'
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-12-11T15:47:05Z"
    message: Rollout of the monitoring stack is in progress. Please wait until it
      finishes.
    reason: RollOutInProgress
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2019-11-19T20:19:47Z"
    status: "False"
    type: Available
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: monitoring
  - group: ""
    name: openshift-monitoring
    resource: namespaces
  versions:
  - name: operator
    version: 4.2.2
```

---

**Comment 6:**

From the logs it looks like the node_exporter rollout is not completing for some reason.

```
I1211 15:47:05.823735       1 operator.go:321] Updating ClusterOperator status to failed. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 4)
E1211 15:47:05.884244       1 operator.go:267] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E1211 15:47:05.884294       1 operator.go:268] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 4)
W1211 15:47:05.884358       1 operator.go:349] No Cluster Monitoring ConfigMap was found. Using defaults.
I1211 15:47:05.946093       1 operator.go:313] Updating ClusterOperator status to in progress.
```

Reassigning to the node team to investigate why this failed.
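To pin down which node-exporter pod is blocking the rollout and, per this bug's summary, to clean up pods wedged in Terminating, a hedged set of commands; `<pod-name>` is a placeholder, and a forced delete only removes the API object without confirming node-side cleanup, so it should be a last resort:

```console
# Where is the unready daemonset pod scheduled?
$ oc -n openshift-monitoring get ds node-exporter
$ oc -n openshift-monitoring get pods -o wide

# List pods stuck in Terminating:
$ oc -n openshift-monitoring get pods | grep Terminating

# Last resort: force-remove a stuck pod from the API server.
$ oc -n openshift-monitoring delete pod <pod-name> --grace-period=0 --force
```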
---

**Comment 7 (Urvashi Mohnani):**

I believe this was fixed by https://github.com/cri-o/cri-o/commit/98d0d9a776d781bcbbda4181e41103126a1bc02f and is in the latest 1.14 build, https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1089477. Can you please try this out again with the latest cri-o 1.14 build? Thanks!

---

**Comment 8 (Sam Yangsao):**

(In reply to Urvashi Mohnani from comment #7)

> I believe this was fixed by https://github.com/cri-o/cri-o/commit/98d0d9a776d781bcbbda4181e41103126a1bc02f and is in the latest 1.14 build https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1089477. Can you please try this out again with the latest cri-o 1.14 build. Thanks!

How do you install the latest upstream bits on an already running RHCOS instance? Thanks!

---

**Comment 9 (Urvashi Mohnani):**

You can manually switch the cri-o binary on the nodes in the cluster (it is a bit tedious, depending on how many nodes you have). These are the steps for doing so (a consolidated sketch of these steps follows at the end of this thread):

1. Copy the new cri-o binary over to the node via scp.
2. ssh into the node and become root, then run `ostree admin unlock --hotfix`.
3. Run `systemctl stop crio`.
4. Run `which crio`.
5. Move the new cri-o binary to the path you get from step 4.
6. Run `systemctl start crio`.
7. `crio --version` should show you the new version of the binary you copied over.

One more thing you can do is start a new cluster from the latest 4.2 nightly, which should have the patched cri-o version.

---

**Comment 10 (Urvashi Mohnani):**

Hi Sam, any update on whether this worked?

---

**Comment 11 (Sam Yangsao):**

(In reply to Urvashi Mohnani from comment #10)

> Hi Sam, any update on whether this worked?

No update, sorry; I went ahead and rebuilt the cluster on 4.3.1 since I was out of time for a customer demo. Thanks!

---

**Comment 12:**

Clearing `needinfo` for comment #11.

---

**Comment 13:**

The patch is in the current latest version of 4.2. Closing; please reopen if it occurs again.
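A consolidated, hedged sketch of the binary-swap steps from comment 9. `<node>` and the staged binary path are placeholders; `ostree admin unlock --hotfix` makes `/usr` writable for the current deployment but is reverted by the next ostree deployment, and the SELinux relabel step is an assumption worth checking rather than part of the original steps:

```console
# From an admin machine: stage the new binary on the node.
$ scp ./crio core@<node>:/tmp/crio

# On the node, as root:
$ ssh core@<node>
[core@node ~]$ sudo -i
[root@node ~]# ostree admin unlock --hotfix   # make /usr writable on this deployment
[root@node ~]# systemctl stop crio
[root@node ~]# CRIO_PATH="$(which crio)"      # typically /usr/bin/crio
[root@node ~]# mv /tmp/crio "$CRIO_PATH"
[root@node ~]# chmod 0755 "$CRIO_PATH"
[root@node ~]# restorecon "$CRIO_PATH"        # assumption: restore the SELinux label
[root@node ~]# systemctl start crio
[root@node ~]# crio --version                 # should report the new build
```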