Bug 1821576
Summary: | pods stuck in Terminating status on RHEL 7.7 nodes | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao>
Component: | Node | Assignee: | Ryan Phillips <rphillips>
Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | aos-bugs, jokerman, mpatel, rphillips, wking
Version: | 4.2.z | |
Target Milestone: | --- | |
Target Release: | 4.4.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-04-07 18:49:11 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description  Junqi Zhao  2020-04-07 06:45:50 UTC
# oc -n openshift-monitoring describe pod node-exporter-rktgs
Name:                      node-exporter-rktgs
Namespace:                 openshift-monitoring
Priority:                  2000000000
Priority Class Name:       system-cluster-critical
Node:                      juzhao-41-qgxjb-rhel-1/10.0.98.151
Start Time:                Mon, 06 Apr 2020 23:27:49 -0400
Labels:                    app=node-exporter
                           controller-revision-hash=6bfcfc7c69
                           pod-template-generation=1
Annotations:               openshift.io/scc: node-exporter
Status:                    Terminating (lasts 171m)
Termination Grace Period:  30s
IP:                        10.0.98.151
IPs:                       <none>
Controlled By:             DaemonSet/node-exporter
Containers:
  node-exporter:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cd19db1ac2b99e1bea85419637198034379bcbce2cc29ec481599f2f8d621b6
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.listen-address=127.0.0.1:9100
      --path.procfs=/host/proc
      --path.sysfs=/host/sys
      --path.rootfs=/host/root
      --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
      --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
      --no-collector.wifi
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/proc from proc (rw)
      /host/root from root (ro)
      /host/sys from sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from node-exporter-token-r9rlb (ro)
  kube-rbac-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ddf66115172e35f78944913081c67ec868c671a7520e62ba83e05fa3f768361
    Image ID:
    Port:          9100/TCP
    Host Port:     9100/TCP
    Args:
      --logtostderr
      --secure-listen-address=$(IP):9100
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
      --upstream=http://127.0.0.1:9100/
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:
      IP:   (v1:status.podIP)
    Mounts:
      /etc/tls/private from node-exporter-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from node-exporter-token-r9rlb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:
  sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:
  root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  node-exporter-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-exporter-tls
    Optional:    false
  node-exporter-token-r9rlb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-exporter-token-r9rlb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age    From                             Message
  ----    ------     ----   ----                             -------
  Normal  Scheduled  3h21m  default-scheduler                Successfully assigned openshift-monitoring/node-exporter-rktgs to juzhao-41-qgxjb-rhel-1
  Normal  Pulling    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cd19db1ac2b99e1bea85419637198034379bcbce2cc29ec481599f2f8d621b6"
  Normal  Pulled     3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cd19db1ac2b99e1bea85419637198034379bcbce2cc29ec481599f2f8d621b6"
  Normal  Created    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Created container node-exporter
  Normal  Started    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Started container node-exporter
  Normal  Pulling    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ddf66115172e35f78944913081c67ec868c671a7520e62ba83e05fa3f768361"
  Normal  Pulled     3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ddf66115172e35f78944913081c67ec868c671a7520e62ba83e05fa3f768361"
  Normal  Created    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Created container kube-rbac-proxy
  Normal  Started    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Started container kube-rbac-proxy
  Normal  Killing    171m   kubelet, juzhao-41-qgxjb-rhel-1  Stopping container node-exporter
  Normal  Killing    171m   kubelet, juzhao-41-qgxjb-rhel-1  Stopping container kube-rbac-proxy

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
* Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?
* Up to 2 minute disruption in edge routing
* Up to 90 seconds of API downtime
* etcd loses quorum and you have to restore from backup

How involved is remediation?
* Issue resolves itself after five minutes
* Admin uses oc to fix things
* Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
* No, it’s always been like this, we just never noticed
* Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 500 days.
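A minimal triage sketch for a pod stuck in Terminating like the one described above; these commands are not part of the original report. It assumes cluster-admin access, SSH access to the affected RHEL 7.7 worker, and that the kubelet there runs as the "kubelet" systemd unit. The forced deletion in the last step only clears the stuck pod object from the API (the node-exporter DaemonSet recreates the pod); it is a workaround, not the fix tracked by the duplicate bug.

1. Confirm which pods and nodes are affected:
   # oc get pods -n openshift-monitoring -o wide | grep Terminating
2. On the affected node, check the kubelet journal for messages about the stuck pod:
   # journalctl -u kubelet --no-pager | grep node-exporter-rktgs
3. As a last resort, remove the stuck pod object so the DaemonSet can recreate it:
   # oc delete pod node-exporter-rktgs -n openshift-monitoring --grace-period=0 --force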