Bug 1821576 - pods stuck in Terminating status on RHEL 7.7 nodes
Summary: pods stuck in Terminating status on RHEL 7.7 nodes
Keywords:
Status: CLOSED DUPLICATE of bug 1810722
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-07 06:45 UTC by Junqi Zhao
Modified: 2023-09-15 00:30 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-07 18:49:11 UTC
Target Upstream Version:
Embargoed:



Description Junqi Zhao 2020-04-07 06:45:50 UTC
Description of problem:
Installed build 41 UPI-on-baremetal on OpenStack (4.1.0-0.nightly-2020-04-04-073122) and added two RHEL 7.7 worker nodes; all pods were healthy. After upgrading to 4.2.0-0.nightly-2020-04-04-181814, the monitoring ClusterOperator is degraded: the node-exporter pod scheduled on one of the RHEL 7.7 nodes is stuck in Terminating status, and other pods are stuck in Terminating as well. All of the stuck pods are on the RHEL 7.7 nodes.
# oc get node --show-labels
NAME                              STATUS   ROLES    AGE    VERSION                LABELS
juzhao-41-qgxjb-compute-0         Ready    worker   106m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-compute-1         Ready    worker   104m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-compute-2         Ready    worker   104m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-control-plane-0   Ready    master   116m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-control-plane-0,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-control-plane-1   Ready    master   116m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-control-plane-1,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-control-plane-2   Ready    master   116m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-control-plane-2,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-rhel-0            Ready    worker   57m    v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-rhel-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhel,node.openshift.io/os_version=7.7
juzhao-41-qgxjb-rhel-1            Ready    worker   56m    v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-rhel-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhel,node.openshift.io/os_version=7.7
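
For reference (not part of the original report), the pods scheduled on a single affected node can also be listed directly with a field selector, which is a quick way to confirm that only the RHEL 7.7 nodes are involved:

# oc get pods --all-namespaces -o wide --field-selector spec.nodeName=juzhao-41-qgxjb-rhel-1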


# oc get co/monitoring -oyaml
...
  - lastTransitionTime: "2020-04-07T04:07:09Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter
      failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object
      failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter
      is not ready. status: (desired: 8, updated: 2, ready: 7, unavailable: 1)'
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded
...
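Not in the original report: the same Degraded condition can be pulled out directly with a jsonpath query, assuming the standard ClusterOperator status layout:

# oc get co monitoring -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'
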
# oc -n openshift-monitoring get pod -o wide | grep node-exporter
node-exporter-cqj6j                            2/2     Running       2          3h23m   10.0.99.156   juzhao-41-qgxjb-compute-2         <none>           <none>
node-exporter-hpx95                            2/2     Running       0          126m    10.0.99.142   juzhao-41-qgxjb-control-plane-1   <none>           <none>
node-exporter-hxwst                            2/2     Running       2          3h23m   10.0.99.205   juzhao-41-qgxjb-compute-1         <none>           <none>
node-exporter-l9vzn                            2/2     Running       2          3h25m   10.0.99.144   juzhao-41-qgxjb-control-plane-2   <none>           <none>
node-exporter-plgqb                            2/2     Running       0          156m    10.0.99.209   juzhao-41-qgxjb-rhel-0            <none>           <none>
node-exporter-pttnk                            2/2     Running       0          126m    10.0.99.38    juzhao-41-qgxjb-control-plane-0   <none>           <none>
node-exporter-rktgs                            0/2     Terminating   0          156m    10.0.98.151   juzhao-41-qgxjb-rhel-1            <none>           <none>
node-exporter-x5cxn                            2/2     Running       2          3h25m   10.0.98.209   juzhao-41-qgxjb-compute-0         <none>           <none> 

# oc get pod --all-namespaces -o wide | grep -Ev "Running|Completed"
NAMESPACE                                               NAME                                                              READY   STATUS        RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
openshift-cluster-node-tuning-operator                  tuned-n5s78                                                       0/1     Terminating   0          174m    10.0.98.151   juzhao-41-qgxjb-rhel-1            <none>           <none>
openshift-cluster-node-tuning-operator                  tuned-zjvrq                                                       0/1     Terminating   0          174m    10.0.99.209   juzhao-41-qgxjb-rhel-0            <none>           <none>
openshift-image-registry                                node-ca-mtwnp                                                     0/1     Terminating   0          174m    10.131.2.3    juzhao-41-qgxjb-rhel-1            <none>           <none>
openshift-monitoring                                    node-exporter-rktgs                                               0/2     Terminating   0          174m    10.0.98.151   juzhao-41-qgxjb-rhel-1            <none>           <none>
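
The report itself does not describe a workaround. As a general note, a pod stuck in Terminating can usually be removed from the API server with a force delete, for example:

# oc -n openshift-monitoring delete pod node-exporter-rktgs --grace-period=0 --force

This only clears the API object; it does not address whatever is preventing the kubelet on the RHEL 7.7 node from completing the termination, and it may leave container resources behind on the node.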

Version-Release number of selected component (if applicable):
Build 41 UPI-on-baremetal on OpenStack (4.1.0-0.nightly-2020-04-04-073122) with two RHEL 7.7 worker nodes added, then upgraded to 4.2.0-0.nightly-2020-04-04-181814.

How reproducible:
Always

Steps to Reproduce:
1. Install 4.1.0-0.nightly-2020-04-04-073122 (UPI on baremetal on OpenStack) and add two RHEL 7.7 worker nodes; verify all pods are healthy.
2. Upgrade the cluster to 4.2.0-0.nightly-2020-04-04-181814 (see the command sketch after this list).
3. Check the monitoring ClusterOperator and the pod status on the RHEL 7.7 nodes.
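
The exact upgrade invocation is not included in the report. A minimal sketch of step 2, assuming the release image pullspec of the target nightly is known (the pullspec below is a placeholder, not taken from the report):

# oc adm upgrade --to-image=<release image pullspec for 4.2.0-0.nightly-2020-04-04-181814> --force

The --force flag is typically required for nightly payloads because they are not signed or present in the upgrade graph.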

Actual results:
Pods on the RHEL 7.7 nodes are stuck in Terminating status after the upgrade.

Expected results:
Pods terminate and are replaced normally during the upgrade; no pods remain stuck in Terminating status.

Additional info:

Comment 3 Junqi Zhao 2020-04-07 06:57:09 UTC
# oc -n openshift-monitoring describe pod node-exporter-rktgs 
Name:                      node-exporter-rktgs
Namespace:                 openshift-monitoring
Priority:                  2000000000
Priority Class Name:       system-cluster-critical
Node:                      juzhao-41-qgxjb-rhel-1/10.0.98.151
Start Time:                Mon, 06 Apr 2020 23:27:49 -0400
Labels:                    app=node-exporter
                           controller-revision-hash=6bfcfc7c69
                           pod-template-generation=1
Annotations:               openshift.io/scc: node-exporter
Status:                    Terminating (lasts 171m)
Termination Grace Period:  30s
IP:                        10.0.98.151
IPs:                       <none>
Controlled By:             DaemonSet/node-exporter
Containers:
  node-exporter:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cd19db1ac2b99e1bea85419637198034379bcbce2cc29ec481599f2f8d621b6
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.listen-address=127.0.0.1:9100
      --path.procfs=/host/proc
      --path.sysfs=/host/sys
      --path.rootfs=/host/root
      --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
      --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
      --no-collector.wifi
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/proc from proc (rw)
      /host/root from root (ro)
      /host/sys from sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from node-exporter-token-r9rlb (ro)
  kube-rbac-proxy:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ddf66115172e35f78944913081c67ec868c671a7520e62ba83e05fa3f768361
    Image ID:      
    Port:          9100/TCP
    Host Port:     9100/TCP
    Args:
      --logtostderr
      --secure-listen-address=$(IP):9100
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
      --upstream=http://127.0.0.1:9100/
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:
      IP:   (v1:status.podIP)
    Mounts:
      /etc/tls/private from node-exporter-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from node-exporter-token-r9rlb (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  
  sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  
  root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  node-exporter-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-exporter-tls
    Optional:    false
  node-exporter-token-r9rlb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-exporter-token-r9rlb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age    From                             Message
  ----    ------     ----   ----                             -------
  Normal  Scheduled  3h21m  default-scheduler                Successfully assigned openshift-monitoring/node-exporter-rktgs to juzhao-41-qgxjb-rhel-1
  Normal  Pulling    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cd19db1ac2b99e1bea85419637198034379bcbce2cc29ec481599f2f8d621b6"
  Normal  Pulled     3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cd19db1ac2b99e1bea85419637198034379bcbce2cc29ec481599f2f8d621b6"
  Normal  Created    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Created container node-exporter
  Normal  Started    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Started container node-exporter
  Normal  Pulling    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ddf66115172e35f78944913081c67ec868c671a7520e62ba83e05fa3f768361"
  Normal  Pulled     3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ddf66115172e35f78944913081c67ec868c671a7520e62ba83e05fa3f768361"
  Normal  Created    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Created container kube-rbac-proxy
  Normal  Started    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Started container kube-rbac-proxy
  Normal  Killing    171m   kubelet, juzhao-41-qgxjb-rhel-1  Stopping container node-exporter
  Normal  Killing    171m   kubelet, juzhao-41-qgxjb-rhel-1  Stopping container kube-rbac-proxy
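
The ContainerStatusUnknown reason and the "could not be located when the pod was terminated" message suggest the kubelet lost track of the containers during termination. A typical next step (not shown in the report) would be to inspect the node directly, assuming SSH access to the RHEL 7.7 worker and the CRI-O runtime used by OCP 4.x nodes:

# crictl ps -a | grep -E 'node-exporter|kube-rbac-proxy'
# journalctl -u crio -u kubelet --since "4 hours ago" | grep node-exporter-rktgs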

Comment 4 Scott Dodson 2020-04-07 18:19:08 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this; we just never noticed
  Yes, from 4.y.z to 4.y+1.z or from 4.y.z to 4.y.z+1

Comment 7 W. Trevor King 2021-04-05 17:46:58 UTC
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

Comment 8 Red Hat Bugzilla 2023-09-15 00:30:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

