+++ This bug was initially created as a clone of Bug #1819954 +++

Description of problem:

oc get ns | grep Terminating
ci-op-0kfbll3t   Terminating   15h
ci-op-0s7gqtnf   Terminating   27h
ci-op-30c903rw   Terminating   18h
ci-op-fpwy7vwm   Terminating   7d13h
ci-op-i7xcfyv9   Terminating   7d21h
ci-op-jripq40x   Terminating   7d13h
ci-op-lc136j9k   Terminating   18h
ci-op-m2qdij2v   Terminating   14h
ci-op-rmyyz9vc   Terminating   26h
ci-op-sw5540vd   Terminating   26h
ci-op-t59xf5b1   Terminating   7d13h
ci-op-vz3wn11c   Terminating   7d13h
ci-op-x7zyz5c5   Terminating   42h
ci-op-y1wt39w2   Terminating   7d18h
ci-op-z4fh5nin   Terminating   26h

oc get all -n ci-op-0kfbll3t -o wide
NAME                  READY   STATUS        RESTARTS   AGE   IP       NODE                           NOMINATED NODE   READINESS GATES
pod/release-initial   0/2     Terminating   0          14h   <none>   ip-10-0-169-106.ec2.internal   <none>           <none>

# Typical description of the terminating pods:
oc describe pod/release-initial -n ci-op-0kfbll3t | grep "was terminated" -A3 -B3
      State:          Terminated
      Reason:         ContainerStatusUnknown
      Message:        The container could not be located when the pod was terminated
      Exit Code:      137
      Started:        Mon, 01 Jan 0001 00:00:00 +0000
      Finished:       Mon, 01 Jan 0001 00:00:00 +0000

As a side effect, the autoscaler failed to drain the nodes running those pods and therefore cannot scale down the cluster.

Version-Release number of selected component (if applicable):

oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        False         8d      Cluster version is 4.3.0-0.nightly-2020-03-23-130439

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Also tried:

oc describe pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 | grep "was terminated" -A3 -B3
      /tools/entrypoint
      State:          Terminated
      Reason:         ContainerStatusUnknown
      Message:        The container could not be located when the pod was terminated
      Exit Code:      137
      Started:        Mon, 01 Jan 0001 00:00:00 +0000
      Finished:       Mon, 01 Jan 0001 00:00:00 +0000

oc get pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 -o wide
NAME                                   READY   STATUS        RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
355d34f2-733a-11ea-9b47-0a58ac10a1a6   0/2     Terminating   1          35h   10.128.64.76   ip-10-0-131-192.ec2.internal   <none>           <none>

oc adm drain "ip-10-0-131-192.ec2.internal" --delete-local-data --ignore-daemonsets --force
node/ip-10-0-131-192.ec2.internal already cordoned
WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: ci/355d34f2-733a-11ea-9b47-0a58ac10a1a6, ci/70f91686-7339-11ea-9b47-0a58ac10a1a6; ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-zmpqj, openshift-dns/dns-default-tfg2f, openshift-image-registry/node-ca-c8gbg, openshift-machine-config-operator/machine-config-daemon-p2j7l, openshift-monitoring/node-exporter-xhzfd, openshift-multus/multus-67xxk, openshift-sdn/ovs-z2zsd, openshift-sdn/sdn-2sqht
evicting pod "70f91686-7339-11ea-9b47-0a58ac10a1a6"
evicting pod "kubevirt-test-build"
evicting pod "openshift-acme-exposer-build"
evicting pod "baremetal-installer-build"
evicting pod "355d34f2-733a-11ea-9b47-0a58ac10a1a6"
pod/70f91686-7339-11ea-9b47-0a58ac10a1a6 evicted
pod/355d34f2-733a-11ea-9b47-0a58ac10a1a6 evicted
There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x23425ef]

goroutine 1 [running]:
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain.(*podDeleteList).Pods(0x0, 0xc00000e010, 0x3828790, 0x3d)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain/filters.go:49 +0x4f
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).deleteOrEvictPodsSimple(0xc0010447e0, 0xc000630380, 0x0, 0x0)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:335 +0x219
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).RunDrain(0xc0010447e0, 0x0, 0x3960c88)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:293 +0x5bd
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.NewCmdDrain.func1(0xc0011aec80, 0xc0012cce00, 0x1, 0x4)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:185 +0xa0
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).execute(0xc0011aec80, 0xc0012ccdc0, 0x4, 0x4, 0xc0011aec80, 0xc0012ccdc0)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:830 +0x2ae
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00100db80, 0x2, 0xc00100db80, 0x2)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:914 +0x2fc
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:864
main.main()
	/go/src/github.com/openshift/oc/cmd/oc/oc.go:107 +0x835

--- Additional comment from Hongkai Liu on 2020-04-02 00:30:07 CEST ---

oc adm must-gather --dest-dir='./must-gather'
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221
[must-gather      ] OUT namespace/openshift-must-gather-zvdnt created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221 created
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-cluster-version...
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-config...
[must-gather-zqwrb] POD Gathering data for ns/openshift-config-managed...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-ingress...
[must-gather-zqwrb] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-machine-api...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns...
[must-gather-zqwrb] POD Gathering data for ns/openshift-image-registry...
[must-gather-zqwrb] OUT waiting for gather to complete
[must-gather-zqwrb] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 deleted
[must-gather      ] OUT namespace/openshift-must-gather-zvdnt deleted
error: gather never finished for pod must-gather-zqwrb: timed out waiting for the condition

--- Additional comment from Hongkai Liu on 2020-04-02 01:37:51 CEST ---

https://coreos.slack.com/archives/CFDM5CQMN/p1585781207023200
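For anyone triaging the same symptom, a minimal sketch of how to check what is keeping such a namespace in Terminating (standard oc usage, not commands taken from this report; the conditions field assumes a reasonably recent Kubernetes/OpenShift level):

# Show what the namespace controller reports is blocking deletion
oc get ns ci-op-0kfbll3t -o jsonpath='{.status.conditions}'
# Fall back to the full object if conditions are not populated
oc get ns ci-op-0kfbll3t -o yaml
# List the pods still charged to the namespace
oc get pods -n ci-op-0kfbll3t -o wide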
I split this part from https://bugzilla.redhat.com/show_bug.cgi?id=1819954 because there are two problems: one is oc panicking, which is handled in the other BZ, and the other is the terminating pods. The latter is still very unclear. Hongkai, can you provide more concrete information about what failed and where? I'm asking for kubelet logs, which parts are failing, and what exactly is stuck; the description above does not help with debugging the problem.
The cluster-autoscaler logs showed that it failed to scale down the cluster because draining the nodes failed. I then ran `oc adm drain <node>` manually and it failed with similar output:

There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]

After some time it panicked (see above). I also tried `oc adm must-gather` because I thought you might need the data later for debugging, and that failed too. I collected the node logs with `oc adm node-logs` (only for ip-10-0-131-192.ec2.internal) and will upload them later.

Yesterday we deleted the nodes that the autoscaler failed to scale down. The cluster is our production CI infrastructure and we had to fix it. must-gather works now (I guess it does not help much, since the broken nodes have been removed from the cluster).
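For completeness, the node-log collection was roughly of this form (a sketch only; the report confirms `oc adm node-logs` was used for that node, but the unit filter and output redirection here are illustrative):

oc adm node-logs ip-10-0-131-192.ec2.internal -u kubelet > ip-10-0-131-192-kubelet.log
oc adm node-logs ip-10-0-131-192.ec2.internal -u crio > ip-10-0-131-192-crio.log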
To clarify, the dentries are leaking on curl liveness calls. This is technically a kernel bug, but we are fixing it in crio by adding default_env support to crio (https://github.com/cri-o/cri-o/pull/3611) and setting NSS_SDB_USE_CACHE=no within containers. If a container sets a different value, that value overrides crio's injected variable. There are also other crio fixes in flight that correct the error handling in deferred functions:
https://github.com/cri-o/cri-o/pull/3600
https://github.com/cri-o/cri-o/pull/3608
https://github.com/cri-o/cri-o/pull/3597
https://github.com/cri-o/cri-o/pull/3592
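A rough way to check whether a node already carries this behavior (illustrative only; the crio.conf location and option layout are assumptions based on the linked PR, and the pod/namespace names are placeholders):

# Look for the default_env setting in the node's cri-o configuration
oc debug node/ip-10-0-131-192.ec2.internal -- chroot /host grep -r -A2 default_env /etc/crio/
# From inside any running container, confirm the variable was injected
oc exec -n <namespace> <pod> -- env | grep NSS_SDB_USE_CACHE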
Also related to this bug series are the two post-fix mitigation bugs: bug 1829664 and bug 1829999.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to bug 1822269. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
Example: Customers running 4.3 (under some conditions?) who try to drain nodes. Also customers running 4.4 before bug 1822268 landed (in some specific 4.4 RC?). We don't (usually?) pull edges into RCs or for non-regressions, so this would probably mean pulling edges from 4.2 -> 4.3 for 4.3 releases that do not carry the fix for this bug's 4.3 clone (bug 1822269).

What is the impact? Is it serious enough to warrant blocking edges?
Example: Nodes get stuck draining forever and possibly need manual intervention to unstick them.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Example: Updating to a fixed version keeps *new* nodes from getting stuck like this, but does not automatically resolve existing stuck nodes. Bug 1829999 is about alerting admins impacted by this issue and pointing them at mitigation.

Also, if anyone has ideas about how we can think through the remediation question more proactively, so that we aren't surprised weeks after diagnosing the issue and floating a PR by a "the fix avoids the lock-up for new attempts but does not resolve resources that are already locked up" situation, we'd love to improve that part of our bugfix/impact flow.
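As a point of reference for the remediation question, one commonly used manual mitigation for pods wedged in Terminating is to force-delete them so the drain can proceed (shown only as a sketch; bug 1829999 tracks the guidance actually given to admins):

# Force-remove a pod stuck in Terminating; use with care, as the kubelet is bypassed
oc delete pod release-initial -n ci-op-0kfbll3t --grace-period=0 --force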
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days