Description of problem:
Installed a 4.1 UPI-on-baremetal cluster on OpenStack (4.1.0-0.nightly-2020-04-04-073122) and added two RHEL 7.7 worker nodes; all pods were healthy. After upgrading to 4.2.0-0.nightly-2020-04-04-181814, the monitoring operator is Degraded: the node-exporter pod scheduled on one RHEL 7.7 node is stuck in Terminating status, and several other pods are also stuck in Terminating. All of the stuck pods are on the RHEL 7.7 nodes.

# oc get node --show-labels
NAME                              STATUS   ROLES    AGE    VERSION                LABELS
juzhao-41-qgxjb-compute-0         Ready    worker   106m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-compute-1         Ready    worker   104m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-compute-2         Ready    worker   104m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-control-plane-0   Ready    master   116m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-control-plane-0,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-control-plane-1   Ready    master   116m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-control-plane-1,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-control-plane-2   Ready    master   116m   v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-control-plane-2,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
juzhao-41-qgxjb-rhel-0            Ready    worker   57m    v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-rhel-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhel,node.openshift.io/os_version=7.7
juzhao-41-qgxjb-rhel-1            Ready    worker   56m    v1.13.4-138-g41dc99c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-41-qgxjb-rhel-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhel,node.openshift.io/os_version=7.7

# oc get co/monitoring -oyaml
...
  - lastTransitionTime: "2020-04-07T04:07:09Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter
      failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object
      failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter
      is not ready. status: (desired: 8, updated: 2, ready: 7, unavailable: 1)'
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded
...
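The Degraded condition above is the cluster-monitoring-operator waiting on the node-exporter DaemonSet rollout. A diagnostic sketch for confirming the rollout state from the CLI (assumes an authenticated `oc` session against the affected cluster; the namespace and DaemonSet name are taken from the output above):

```shell
#!/bin/sh
# Diagnostic sketch: check the node-exporter DaemonSet rollout that the
# Degraded condition is waiting on. Guarded so it is a no-op without a
# logged-in `oc` session.
if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  # Shows the DESIRED/CURRENT/READY/UP-TO-DATE counts that correspond to
  # the "(desired: 8, updated: 2, ready: 7, unavailable: 1)" numbers above.
  oc -n openshift-monitoring get daemonset node-exporter

  # Waits for the rollout to complete; while this bug reproduces it is
  # expected to time out rather than finish, so a timeout here is not a
  # script error.
  oc -n openshift-monitoring rollout status daemonset/node-exporter \
    --timeout=60s || true
fi
```

With the stuck pod never leaving Terminating, the DaemonSet controller cannot replace it, which is why the rollout stays at updated: 2 and the operator reports Degraded.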
# oc -n openshift-monitoring get pod -o wide | grep node-exporter
node-exporter-cqj6j   2/2   Running       2   3h23m   10.0.99.156   juzhao-41-qgxjb-compute-2         <none>   <none>
node-exporter-hpx95   2/2   Running       0   126m    10.0.99.142   juzhao-41-qgxjb-control-plane-1   <none>   <none>
node-exporter-hxwst   2/2   Running       2   3h23m   10.0.99.205   juzhao-41-qgxjb-compute-1         <none>   <none>
node-exporter-l9vzn   2/2   Running       2   3h25m   10.0.99.144   juzhao-41-qgxjb-control-plane-2   <none>   <none>
node-exporter-plgqb   2/2   Running       0   156m    10.0.99.209   juzhao-41-qgxjb-rhel-0            <none>   <none>
node-exporter-pttnk   2/2   Running       0   126m    10.0.99.38    juzhao-41-qgxjb-control-plane-0   <none>   <none>
node-exporter-rktgs   0/2   Terminating   0   156m    10.0.98.151   juzhao-41-qgxjb-rhel-1            <none>   <none>
node-exporter-x5cxn   2/2   Running       2   3h25m   10.0.98.209   juzhao-41-qgxjb-compute-0         <none>   <none>

# oc get pod --all-namespaces -o wide | grep -Ev "Running|Completed"
NAMESPACE                                NAME                  READY   STATUS        RESTARTS   AGE    IP            NODE                     NOMINATED NODE   READINESS GATES
openshift-cluster-node-tuning-operator   tuned-n5s78           0/1     Terminating   0          174m   10.0.98.151   juzhao-41-qgxjb-rhel-1   <none>           <none>
openshift-cluster-node-tuning-operator   tuned-zjvrq           0/1     Terminating   0          174m   10.0.99.209   juzhao-41-qgxjb-rhel-0   <none>           <none>
openshift-image-registry                 node-ca-mtwnp         0/1     Terminating   0          174m   10.131.2.3    juzhao-41-qgxjb-rhel-1   <none>           <none>
openshift-monitoring                     node-exporter-rktgs   0/2     Terminating   0          174m   10.0.98.151   juzhao-41-qgxjb-rhel-1   <none>           <none>

Version-Release number of selected component (if applicable):
4.1 UPI-on-baremetal on OpenStack (4.1.0-0.nightly-2020-04-04-073122) with two RHEL 7.7 worker nodes added, upgraded to 4.2.0-0.nightly-2020-04-04-181814

How reproducible:
Always

Steps to Reproduce:
1. See the description
2.
3.

Actual results:
Pods on the RHEL 7.7 nodes are stuck in Terminating status after the upgrade, and the monitoring operator reports Degraded.

Expected results:
All pods terminate and are replaced cleanly during the upgrade; no pods stuck in Terminating.

Additional info:
# oc -n openshift-monitoring describe pod node-exporter-rktgs
Name:                      node-exporter-rktgs
Namespace:                 openshift-monitoring
Priority:                  2000000000
Priority Class Name:       system-cluster-critical
Node:                      juzhao-41-qgxjb-rhel-1/10.0.98.151
Start Time:                Mon, 06 Apr 2020 23:27:49 -0400
Labels:                    app=node-exporter
                           controller-revision-hash=6bfcfc7c69
                           pod-template-generation=1
Annotations:               openshift.io/scc: node-exporter
Status:                    Terminating (lasts 171m)
Termination Grace Period:  30s
IP:                        10.0.98.151
IPs:                       <none>
Controlled By:             DaemonSet/node-exporter
Containers:
  node-exporter:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cd19db1ac2b99e1bea85419637198034379bcbce2cc29ec481599f2f8d621b6
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.listen-address=127.0.0.1:9100
      --path.procfs=/host/proc
      --path.sysfs=/host/sys
      --path.rootfs=/host/root
      --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
      --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
      --no-collector.wifi
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/proc from proc (rw)
      /host/root from root (ro)
      /host/sys from sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from node-exporter-token-r9rlb (ro)
  kube-rbac-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ddf66115172e35f78944913081c67ec868c671a7520e62ba83e05fa3f768361
    Image ID:
    Port:          9100/TCP
    Host Port:     9100/TCP
    Args:
      --logtostderr
      --secure-listen-address=$(IP):9100
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
      --upstream=http://127.0.0.1:9100/
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:
      IP:   (v1:status.podIP)
    Mounts:
      /etc/tls/private from node-exporter-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from node-exporter-token-r9rlb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:
  sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:
  root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  node-exporter-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-exporter-tls
    Optional:    false
  node-exporter-token-r9rlb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-exporter-token-r9rlb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age    From                             Message
  ----    ------     ----   ----                             -------
  Normal  Scheduled  3h21m  default-scheduler                Successfully assigned openshift-monitoring/node-exporter-rktgs to juzhao-41-qgxjb-rhel-1
  Normal  Pulling    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cd19db1ac2b99e1bea85419637198034379bcbce2cc29ec481599f2f8d621b6"
  Normal  Pulled     3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cd19db1ac2b99e1bea85419637198034379bcbce2cc29ec481599f2f8d621b6"
  Normal  Created    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Created container node-exporter
  Normal  Started    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Started container node-exporter
  Normal  Pulling    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ddf66115172e35f78944913081c67ec868c671a7520e62ba83e05fa3f768361"
  Normal  Pulled     3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ddf66115172e35f78944913081c67ec868c671a7520e62ba83e05fa3f768361"
  Normal  Created    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Created container kube-rbac-proxy
  Normal  Started    3h21m  kubelet, juzhao-41-qgxjb-rhel-1  Started container kube-rbac-proxy
  Normal  Killing    171m   kubelet, juzhao-41-qgxjb-rhel-1  Stopping container node-exporter
  Normal  Killing    171m   kubelet, juzhao-41-qgxjb-rhel-1  Stopping container kube-rbac-proxy
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? (sample answers)
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? (sample answers)
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup

How involved is remediation? (sample answers)
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression? (sample answers)
  No, it's always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 500 days.