Bug 1819954
| Summary: | oc adm drain panics | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Hongkai Liu <hongkliu> |
| Component: | oc | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | aaleman, aos-bugs, jokerman, mfojtik, skuznets, yinzhou |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1820507 (view as bug list) | Environment: | |
| Last Closed: | 2020-07-13 17:24:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Hongkai Liu 2020-04-01 22:14:25 UTC
```
oc adm must-gather --dest-dir='./must-gather'
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221
[must-gather ] OUT namespace/openshift-must-gather-zvdnt created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 created
[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221 created
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-cluster-version...
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-config...
[must-gather-zqwrb] POD Gathering data for ns/openshift-config-managed...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-ingress...
[must-gather-zqwrb] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-machine-api...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns...
[must-gather-zqwrb] POD Gathering data for ns/openshift-image-registry...
[must-gather-zqwrb] OUT waiting for gather to complete
[must-gather-zqwrb] OUT gather never finished: timed out waiting for the condition
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 deleted
[must-gather ] OUT namespace/openshift-must-gather-zvdnt deleted
error: gather never finished for pod must-gather-zqwrb: timed out waiting for the condition
```

There are two problems here: one is the nil pointer in drain, and that bit is going to be addressed by my team; the stuck pods I've additionally split out into https://bugzilla.redhat.com/show_bug.cgi?id=1820507.

The oc part is fixed starting from 4.4, moving accordingly.

FYI, there is already a PR out to address the forbidden errors here: https://github.com/kubernetes/kubernetes/pull/89314. I don't know if or how that might impact the null pointer problem, though.
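For context on how the panic is reached: as described in the comments below, the client crash was hit by draining a node that still had pods stuck in Terminating, mirroring what the machine controller does when scaling the cluster down. A minimal sketch of that sequence, where `<node>` is a placeholder rather than a name from this report, using the same drain flags as the verification below:

```
# <node> is illustrative, not taken from this report.
# 1. Check that the node still has pods stuck in Terminating:
oc get po -A -o wide | grep Terminating | grep <node>
# 2. Drain the node the way the machine controller would on scale-down;
#    with the affected client this drain step is where the reported panic occurred:
oc adm drain <node> --delete-local-data --ignore-daemonsets --force
```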
Confirmed with the latest oc client; can't reproduce the issue now:

```
[root@dhcp-140-138 roottest]# oc get po -A -o wide|grep Termi
openshift-multus   multus-b5gs4   0/1   Terminating   0   45h   10.0.99.139   hrw-bar12-9rfxg-rhel-0   <none>   <none>
openshift-sdn      ovs-5m82p      0/1   Terminating   0   45h   10.0.99.139   hrw-bar12-9rfxg-rhel-0   <none>   <none>
openshift-sdn      sdn-nfkd2      0/1   Terminating   1   45h   10.0.99.139   hrw-bar12-9rfxg-rhel-0   <none>   <none>
[root@dhcp-140-138 roottest]# oc adm drain "hrw-bar12-9rfxg-rhel-0" --delete-local-data --ignore-daemonsets --force
node/hrw-bar12-9rfxg-rhel-0 cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-j4qb7, openshift-dns/dns-default-h5bd8, openshift-image-registry/node-ca-kvtt8, openshift-machine-config-operator/machine-config-daemon-ddhzp, openshift-monitoring/node-exporter-r96z2, openshift-multus/multus-b5gs4, openshift-sdn/ovs-5m82p, openshift-sdn/sdn-nfkd2
evicting pod openshift-marketplace/community-operators-879b5f6ff-2pzgl
evicting pod openshift-image-registry/image-pruner-1586304000-t6skm
evicting pod openshift-marketplace/redhat-marketplace-6d46bccd87-hx2f5
evicting pod openshift-marketplace/redhat-operators-bfd786b97-d9ts5
pod/community-operators-879b5f6ff-2pzgl evicted
pod/redhat-operators-bfd786b97-d9ts5 evicted
pod/image-pruner-1586304000-t6skm evicted
pod/redhat-marketplace-6d46bccd87-hx2f5 evicted
node/hrw-bar12-9rfxg-rhel-0 evicted
[root@dhcp-140-138 roottest]# oc version -o yaml
clientVersion:
  buildDate: "2020-04-06T21:08:17Z"
  compiler: gc
  gitCommit: f2b01c4e4ae8c4ca11caabf8cb8e76b7a28b7009
  gitTreeState: clean
  gitVersion: 4.5.0-202004062101-f2b01c4
  goVersion: go1.13.4
  major: ""
  minor: ""
  platform: linux/amd64
```

@zhou, could you reproduce with an older version? It is not that the drain-node command does not work at all; we need to see a pod stuck in Terminating first, and then drain the node.

(In reply to Hongkai Liu from comment #10)
> @zhou, could you reproduce with an older version?
> It is not that the drain-node command does not work at all; we need to see a
> pod stuck in Terminating first, and then drain the node.

For the verification, I just used an existing cluster that had pods in the Terminating state and ran the `oc adm drain` command; please see https://bugzilla.redhat.com/show_bug.cgi?id=1819954#c8. Today I tried to create a pod and make it stay Terminating, but failed; I can't find a way to do this. If my verification is wrong, please correct me. Thanks.

Hey Zhou Ying, thanks for the information. The cluster had lots of ongoing builds when I hit the issue. The oc CLI panic was the result of simulating what the machine controller does when scaling down the cluster. My feeling is that once the server-side bug (https://bugzilla.redhat.com/show_bug.cgi?id=1820507) is fixed, the oc CLI won't panic any more. I do not know how to verify this without cooperation from the server side.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
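On the question raised during verification about how to get a pod stuck in Terminating: one common approach (a sketch with illustrative names, finalizer, and image, none of which come from this report) is to add a custom finalizer to a pod and then delete it; the pod stays in Terminating until the finalizer is removed:

```
# All names, the finalizer, and the image below are illustrative.
oc run stuck-pod --image=registry.access.redhat.com/ubi8/ubi -- sleep 3600
oc patch pod stuck-pod --type=merge -p '{"metadata":{"finalizers":["example.com/block-deletion"]}}'
oc delete pod stuck-pod --wait=false
oc get pod stuck-pod    # STATUS shows Terminating while the finalizer is present
# Clean up by clearing the finalizer so the deletion can complete:
oc patch pod stuck-pod --type=merge -p '{"metadata":{"finalizers":null}}'
```

With such a pod parked on a node, running `oc adm drain` against that node reproduces the precondition described in the comments above (a pod stuck in Terminating before the drain).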