Description of problem:

oc get ns | grep Terminating
ci-op-0kfbll3t   Terminating   15h
ci-op-0s7gqtnf   Terminating   27h
ci-op-30c903rw   Terminating   18h
ci-op-fpwy7vwm   Terminating   7d13h
ci-op-i7xcfyv9   Terminating   7d21h
ci-op-jripq40x   Terminating   7d13h
ci-op-lc136j9k   Terminating   18h
ci-op-m2qdij2v   Terminating   14h
ci-op-rmyyz9vc   Terminating   26h
ci-op-sw5540vd   Terminating   26h
ci-op-t59xf5b1   Terminating   7d13h
ci-op-vz3wn11c   Terminating   7d13h
ci-op-x7zyz5c5   Terminating   42h
ci-op-y1wt39w2   Terminating   7d18h
ci-op-z4fh5nin   Terminating   26h

oc get all -n ci-op-0kfbll3t -o wide
NAME                  READY   STATUS        RESTARTS   AGE   IP       NODE                           NOMINATED NODE   READINESS GATES
pod/release-initial   0/2     Terminating   0          14h   <none>   ip-10-0-169-106.ec2.internal   <none>           <none>

# typical description of the terminating pods
oc describe pod/release-initial -n ci-op-0kfbll3t | grep "was terminated" -A3 -B3
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

As a side effect, the autoscaler fails to drain the nodes hosting those pods and thus cannot scale down the cluster.

Version-Release number of selected component (if applicable):

oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        False         8d      Cluster version is 4.3.0-0.nightly-2020-03-23-130439

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Also tried:

oc describe pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 | grep "was terminated" -A3 -B3
      /tools/entrypoint
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

oc get pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 -o wide
NAME                                   READY   STATUS        RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
355d34f2-733a-11ea-9b47-0a58ac10a1a6   0/2     Terminating   1          35h   10.128.64.76   ip-10-0-131-192.ec2.internal   <none>           <none>

oc adm drain "ip-10-0-131-192.ec2.internal" --delete-local-data --ignore-daemonsets --force
node/ip-10-0-131-192.ec2.internal already cordoned
WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: ci/355d34f2-733a-11ea-9b47-0a58ac10a1a6, ci/70f91686-7339-11ea-9b47-0a58ac10a1a6; ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-zmpqj, openshift-dns/dns-default-tfg2f, openshift-image-registry/node-ca-c8gbg, openshift-machine-config-operator/machine-config-daemon-p2j7l, openshift-monitoring/node-exporter-xhzfd, openshift-multus/multus-67xxk, openshift-sdn/ovs-z2zsd, openshift-sdn/sdn-2sqht
evicting pod "70f91686-7339-11ea-9b47-0a58ac10a1a6"
evicting pod "kubevirt-test-build"
evicting pod "openshift-acme-exposer-build"
evicting pod "baremetal-installer-build"
evicting pod "355d34f2-733a-11ea-9b47-0a58ac10a1a6"
pod/70f91686-7339-11ea-9b47-0a58ac10a1a6 evicted
pod/355d34f2-733a-11ea-9b47-0a58ac10a1a6 evicted
There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x23425ef]

goroutine 1 [running]:
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain.(*podDeleteList).Pods(0x0, 0xc00000e010, 0x3828790, 0x3d)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain/filters.go:49 +0x4f
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).deleteOrEvictPodsSimple(0xc0010447e0, 0xc000630380, 0x0, 0x0)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:335 +0x219
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).RunDrain(0xc0010447e0, 0x0, 0x3960c88)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:293 +0x5bd
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.NewCmdDrain.func1(0xc0011aec80, 0xc0012cce00, 0x1, 0x4)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:185 +0xa0
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).execute(0xc0011aec80, 0xc0012ccdc0, 0x4, 0x4, 0xc0011aec80, 0xc0012ccdc0)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:830 +0x2ae
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00100db80, 0x2, 0xc00100db80, 0x2)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:914 +0x2fc
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:864
main.main()
	/go/src/github.com/openshift/oc/cmd/oc/oc.go:107 +0x835
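For anyone triaging a similar stuck state, a common first pass is to check for finalizers on the namespace and on the pod itself. A generic sketch against the names from the output above (not something that was run on this cluster):

# Finalizers and conditions that can block namespace deletion
oc get ns ci-op-0kfbll3t -o jsonpath='{.spec.finalizers}{"\n"}'
oc get ns ci-op-0kfbll3t -o jsonpath='{.status.conditions}{"\n"}'

# Finalizers on the stuck pod itself
oc get pod release-initial -n ci-op-0kfbll3t -o jsonpath='{.metadata.finalizers}{"\n"}'

# Last resort: remove the pod record from the API server without waiting for the kubelet
oc delete pod release-initial -n ci-op-0kfbll3t --grace-period=0 --force

Note that the force delete only clears the API object; it does not fix whatever left the kubelet unable to locate the container.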
oc adm must-gather --dest-dir='./must-gather'
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221
[must-gather      ] OUT namespace/openshift-must-gather-zvdnt created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221 created
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-cluster-version...
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-config...
[must-gather-zqwrb] POD Gathering data for ns/openshift-config-managed...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-ingress...
[must-gather-zqwrb] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-machine-api...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns...
[must-gather-zqwrb] POD Gathering data for ns/openshift-image-registry...
[must-gather-zqwrb] OUT waiting for gather to complete
[must-gather-zqwrb] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 deleted
[must-gather      ] OUT namespace/openshift-must-gather-zvdnt deleted
error: gather never finished for pod must-gather-zqwrb: timed out waiting for the condition
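Since the full must-gather never finished, a narrower collection may still succeed; `oc adm inspect` (the same per-namespace collector that must-gather drives) can be pointed at just the affected namespaces. A sketch, not verified on this cluster:

# Collect only the namespaces involved in the stuck terminations
oc adm inspect ns/ci-op-0kfbll3t --dest-dir=./inspect-ci-op-0kfbll3t
oc adm inspect ns/ci --dest-dir=./inspect-ci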
https://coreos.slack.com/archives/CFDM5CQMN/p1585781207023200
There are two problems here: one is the nil pointer in drain, and that bit is going to be addressed by my team; the stuck pods themselves I've split out into https://bugzilla.redhat.com/show_bug.cgi?id=1820507. The oc part is fixed starting from 4.4, moving accordingly.
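Since the oc part of the fix lands in 4.4, a quick sanity check before retrying drain is just the client version (a trivial sketch):

oc version --client
# expect Client Version 4.4.0 or newer to pick up the drain nil-pointer fix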
FYI, there is already a PR out to address the forbidden errors here: https://github.com/kubernetes/kubernetes/pull/89314. I don't know if or how that might impact the nil pointer problem, though.
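For context on where those forbidden errors come from: eviction is a create against the pod's eviction subresource, and the apiserver rejects all creates in a terminating namespace. A rough sketch of the raw call (the eviction.json file name and payload are made up for illustration):

# eviction.json (hypothetical payload):
# {"apiVersion": "policy/v1beta1", "kind": "Eviction",
#  "metadata": {"name": "baremetal-installer-build", "namespace": "ci-op-0s7gqtnf"}}
oc create --raw /api/v1/namespaces/ci-op-0s7gqtnf/pods/baremetal-installer-build/eviction -f eviction.json
# fails with: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated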
Confirmed with latest oc client, can't reproduce the issue now:

[root@dhcp-140-138 roottest]# oc get po -A -o wide | grep Termi
openshift-multus   multus-b5gs4   0/1   Terminating   0   45h   10.0.99.139   hrw-bar12-9rfxg-rhel-0   <none>   <none>
openshift-sdn      ovs-5m82p      0/1   Terminating   0   45h   10.0.99.139   hrw-bar12-9rfxg-rhel-0   <none>   <none>
openshift-sdn      sdn-nfkd2      0/1   Terminating   1   45h   10.0.99.139   hrw-bar12-9rfxg-rhel-0   <none>   <none>

[root@dhcp-140-138 roottest]# oc adm drain "hrw-bar12-9rfxg-rhel-0" --delete-local-data --ignore-daemonsets --force
node/hrw-bar12-9rfxg-rhel-0 cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-j4qb7, openshift-dns/dns-default-h5bd8, openshift-image-registry/node-ca-kvtt8, openshift-machine-config-operator/machine-config-daemon-ddhzp, openshift-monitoring/node-exporter-r96z2, openshift-multus/multus-b5gs4, openshift-sdn/ovs-5m82p, openshift-sdn/sdn-nfkd2
evicting pod openshift-marketplace/community-operators-879b5f6ff-2pzgl
evicting pod openshift-image-registry/image-pruner-1586304000-t6skm
evicting pod openshift-marketplace/redhat-marketplace-6d46bccd87-hx2f5
evicting pod openshift-marketplace/redhat-operators-bfd786b97-d9ts5
pod/community-operators-879b5f6ff-2pzgl evicted
pod/redhat-operators-bfd786b97-d9ts5 evicted
pod/image-pruner-1586304000-t6skm evicted
pod/redhat-marketplace-6d46bccd87-hx2f5 evicted
node/hrw-bar12-9rfxg-rhel-0 evicted
[root@dhcp-140-138 roottest]# oc version -o yaml
clientVersion:
  buildDate: "2020-04-06T21:08:17Z"
  compiler: gc
  gitCommit: f2b01c4e4ae8c4ca11caabf8cb8e76b7a28b7009
  gitTreeState: clean
  gitVersion: 4.5.0-202004062101-f2b01c4
  goVersion: go1.13.4
  major: ""
  minor: ""
  platform: linux/amd64
@zhou, could you reproduce with an older version?

It is not that the drain command fails outright: we first need a pod stuck in Terminating, and then drain the node.
(In reply to Hongkai Liu from comment #10)
> @zhou, could you reproduce with an older version?
>
> It is not that the drain command fails outright: we first need a pod stuck
> in Terminating, and then drain the node.

For the verification, I just looked for an existing cluster with a pod in the Terminating state and ran the `oc adm drain` command; please see https://bugzilla.redhat.com/show_bug.cgi?id=1819954#c8. Today I tried to create a pod and make it stuck in Terminating, but failed; I can't find a way to do this. If my verification is wrong, please correct me. Thanks.
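One way to pin a pod in Terminating for a test is to give it a finalizer before deleting it; deletion then blocks until the finalizer is removed. A sketch (the pod name stuck-test, the image, and the finalizer string are made up; note this simulates the API-side Terminating state, not the kubelet-side condition from this bug):

# Create a throwaway pod
oc run stuck-test --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- sleep 3600

# Add a custom finalizer, then delete; the pod stays in Terminating
oc patch pod stuck-test -p '{"metadata":{"finalizers":["example.com/block-deletion"]}}'
oc delete pod stuck-test --wait=false
oc get pod stuck-test    # STATUS shows Terminating indefinitely

# Cleanup: drop the finalizer so the deletion can complete
oc patch pod stuck-test --type=merge -p '{"metadata":{"finalizers":null}}'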
Hey Zhou Ying, thanks for the information. The cluster had lots of ongoing builds when I hit the issue. The oc-cli panic was a result of simulating what the machine-controller did upon scaling down the cluster. My feeling is that once the server-side bug (https://bugzilla.redhat.com/show_bug.cgi?id=1820507) is fixed, oc-cli won't panic any more. I do not know how to verify this without cooperation from the server side.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409