Description of problem: both my 4.3 and 4.4 endurance clusters did an upgrade to the new nightly, and both of them now have a node stuck in "SchedulingDisabled":

4.4:

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-129-249.us-east-2.compute.internal   Ready                      master   6d17h   v1.17.1
ip-10-0-130-6.us-east-2.compute.internal     Ready,SchedulingDisabled   worker   6d17h   v1.17.1
ip-10-0-132-57.us-east-2.compute.internal    Ready                      worker   6d17h   v1.17.1
ip-10-0-142-163.us-east-2.compute.internal   Ready                      master   6d17h   v1.17.1
ip-10-0-147-128.us-east-2.compute.internal   Ready                      worker   6d17h   v1.17.1
ip-10-0-147-89.us-east-2.compute.internal    Ready                      master   6d17h   v1.17.1

$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.4.0-0.nightly-2020-03-15-215151
Kubernetes Version: v1.17.1

4.3:

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-130-231.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   6d17h   v1.16.2
ip-10-0-131-254.us-east-2.compute.internal   Ready                      master   6d17h   v1.16.2
ip-10-0-134-56.us-east-2.compute.internal    Ready                      worker   6d17h   v1.16.2
ip-10-0-137-75.us-east-2.compute.internal    Ready                      master   6d17h   v1.16.2
ip-10-0-146-93.us-east-2.compute.internal    Ready                      worker   6d17h   v1.16.2
ip-10-0-157-250.us-east-2.compute.internal   Ready                      master   6d17h   v1.16.2

$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.3.0-0.nightly-2020-03-15-112942
Kubernetes Version: v1.16.2

Ping me for access to the clusters.

Expected results: Nodes do not get stuck in SchedulingDisabled after upgrade.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
- Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time

What is the impact?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
- No, it's always been like this, we just never noticed
- Yes, from 4.2.z and 4.3.1
upgrade edges 4.3:

  desired:
    force: true
    image: registry.svc.ci.openshift.org/ocp/release:4.3
    version: 4.3.0-0.nightly-2020-03-15-112942
  history:
  - completionTime: "2020-03-16T00:42:59Z"
    image: registry.svc.ci.openshift.org/ocp/release:4.3
    startedTime: "2020-03-16T00:04:28Z"
    state: Completed
    verified: false
    version: 4.3.0-0.nightly-2020-03-15-112942
  - completionTime: "2020-03-10T19:17:13Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:e76667eb92d91d60fdc661bf88d6d15df528d417e5e11bd09244489d0aebf38d
    startedTime: "2020-03-10T19:00:10Z"
    state: Completed
    verified: false
    version: 4.3.0-0.nightly-2020-03-09-200240

upgrade edges 4.4:

  desired:
    force: true
    image: registry.svc.ci.openshift.org/ocp/release:4.4
    version: 4.4.0-0.nightly-2020-03-15-215151
  history:
  - completionTime: null
    image: registry.svc.ci.openshift.org/ocp/release:4.4
    startedTime: "2020-03-16T00:04:36Z"
    state: Partial
    verified: false
    version: 4.4.0-0.nightly-2020-03-15-215151
  - completionTime: "2020-03-10T19:38:36Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:36cd5bc706b135e4dff14064dd4c6dffb87d5c04158e3a8243d2ffe94beea64e
    startedTime: "2020-03-10T19:18:46Z"
    state: Completed
    verified: false
    version: 4.4.0-0.nightly-2020-03-10-115843

Both clusters are still up and available for debug; ping me for credentials.

Regarding the questions...

Who is affected? These were both z-stream upgrades, so in theory a 4.3.z to 4.3.z+1 customer is impacted. Since 4.4 isn't GA yet, it's unclear if there's any 4.4 impact, but I'd suspect that 4.3 to 4.4 would also be affected by whatever went wrong here.

What is the impact? Worker nodes become unschedulable, meaning some daemonsets never achieve readiness again and some cluster capacity is unavailable. Not sure if this could also affect a master node, which would have a much bigger impact.

How involved is remediation? Defer to assignee.

Is this a regression? Defer to assignee.
The 4.3 cluster has the following node annotation from MCO: machineconfiguration.openshift.io/reason: 'failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod "pvc-volume-tester-ccdlh": pods "pvc-volume-tester-ccdlh" is forbidden: unable to create new content in namespace e2e-csi-mock-volumes-7004 because it is being terminated'
The 4.3 node is in the middle of a drain and is cordoned. The MCD is in the process of draining the node and erroring with:

Draining failed with: error when evicting pod "pvc-volume-tester-ccdlh": pods "pvc-volume-tester-ccdlh" is forbidden: unable to create new content in namespace e2e-csi-mock-volumes-7004 because it is being terminated, retrying

I suspect the MCD could ignore this error on drain.
MCO Team: Thoughts on ignoring this error in the MCD?
(In reply to Ryan Phillips from comment #4)
> The 4.3 node is in the middle of a drain and is cordoned. The MCD is in the
> process of draining the node and erroring with:
>
> Draining failed with: error when evicting pod "pvc-volume-tester-ccdlh":
> pods "pvc-volume-tester-ccdlh" is forbidden: unable to create new content in
> namespace e2e-csi-mock-volumes-7004 because it is being terminated, retrying
>
> I suspect the MCD could ignore this error on drain.

Ryan, is it safe to ignore that? I think I lack knowledge of that specific error to make a call right now; if it's safe to ignore, we can ignore it.
Hrm... this is actually a bug. Drain should succeed because the pod is deleted already.
So drain can do one of two things:

1. When it gets an error, reread the pod and if the pod is already deleted, consider it evicted.
2. Be slightly smarter on this error (see https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L176 and https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L75 to check for the terminating cause (controllers do it)).

However... no matter what, if drain returns an error, MCD MUST RETRY UNTIL IT RETURNS NO ERRORS. Is MCD doing that?
Note that if you go with 2 you can't complete drain until the pod is gone (you should probably treat 1 and 2 the same, in that a refresh of the pod is required). Alternatively, drain could detect the namespace-terminating error and then execute a pod delete directly. But drain can't continue to the next step until it observes, from the apiserver, that the pod has its deletion timestamp set.
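To make the pod-refresh and terminating-cause check concrete, here is a minimal Go sketch of a drain-side eviction error handler (assuming a recent client-go plus the apimachinery HasStatusCause helper and the core/v1 NamespaceTerminatingCause constant; handleEvictionError and its wiring are hypothetical, not the actual drain library code):

```
package drainutil

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// handleEvictionError is a hypothetical helper: given the error returned by an
// eviction attempt, decide whether drain may treat the pod as handled.
func handleEvictionError(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, evictErr error) (done bool, err error) {
	// Option 1: reread the pod; if it is already gone, consider it evicted.
	current, getErr := client.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(getErr) {
		return true, nil
	}
	if getErr != nil {
		return false, getErr
	}

	// Option 2: the eviction was rejected because the namespace is terminating.
	// The pod is on its way out, but drain still has to observe a deletion
	// timestamp on it before moving on to the next step.
	if apierrors.HasStatusCause(evictErr, corev1.NamespaceTerminatingCause) && current.DeletionTimestamp != nil {
		return true, nil
	}

	// Anything else (including a terminating namespace whose pod has not been
	// marked for deletion yet): keep retrying.
	return false, evictErr
}
```

Either way, the hard requirement above still holds: the caller keeps retrying until drain returns no error.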
(In reply to Clayton Coleman from comment #8)
> So drain can do one of two things:
>
> 1. When it gets an error, reread the pod and if the pod is already deleted,
> consider it evicted
> 2. Be slightly smarter on this error (see
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L176
> and
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L75
> to check for the terminating cause (controllers do it)
>
> However... no matter what, if drain returns an error, MCD MUST RETRY UNTIL
> IT RETURNS NO ERRORS. Is MCD doing that?

To answer that question - yes, the MCD and the MAO are doing that, using the shared drain library in https://github.com/openshift/cluster-api/tree/master/pkg/drain (which has now moved, and I have to find it again).

We need to be smarter on error, yes.
So I've actually noticed the MAO switched to using the upstream kubectl/drain library directly, which is something we're going to target again for 4.5, along with ignoring the errors mentioned by Clayton. Maybe we also want to upstream this error-ignoring behavior to kubectl/drain?

The MAO switched here: https://github.com/openshift/machine-api-operator/commit/f0c52c18e72ac92743f8f86540d7715d24bedd19

(In reply to Clayton Coleman from comment #8)
> So drain can do one of two things:
>
> 1. When it gets an error, reread the pod and if the pod is already deleted,
> consider it evicted
> 2. Be slightly smarter on this error (see
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L176
> and
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L75
> to check for the terminating cause (controllers do it)
>
> However... no matter what, if drain returns an error, MCD MUST RETRY UNTIL
> IT RETURNS NO ERRORS. Is MCD doing that?
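For reference, a rough sketch of what driving the upstream helper looks like (assuming k8s.io/kubectl/pkg/drain from around the 1.17/1.18 era - field names such as DeleteLocalData were renamed in later releases - and this drainNode wrapper is illustrative, not the MAO's or MCD's actual code):

```
package drainutil

import (
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode cordons the node and evicts its pods using the upstream helper.
// Callers are expected to keep retrying on error, as discussed in this bug.
func drainNode(client kubernetes.Interface, node *corev1.Node) error {
	helper := &drain.Helper{
		Client:              client,
		Force:               true, // also evict pods not owned by a controller
		IgnoreAllDaemonSets: true, // daemonset pods would just be recreated on the node
		DeleteLocalData:     true, // allow pods using emptyDir to be evicted
		GracePeriodSeconds:  -1,   // use each pod's own termination grace period
		Timeout:             90 * time.Second,
		OnPodDeletedOrEvicted: func(pod *corev1.Pod, usingEviction bool) {
			fmt.Fprintf(os.Stdout, "pod %s/%s gone (eviction=%v)\n", pod.Namespace, pod.Name, usingEviction)
		},
		Out:    os.Stdout,
		ErrOut: os.Stderr,
	}

	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return fmt.Errorf("cordon failed: %v", err)
	}
	if err := drain.RunNodeDrain(helper, node.Name); err != nil {
		// This is where eviction errors such as "unable to create new content
		// in namespace ... because it is being terminated" surface.
		return fmt.Errorf("drain failed: %v", err)
	}
	return nil
}
```

The library surfaces eviction errors like the namespace-terminating one verbatim, so any ignore/retry policy has to live in the caller (or be upstreamed, as suggested above).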
Do we know enough now to answer the questions in comment 1? Comment 0 certainly leads me to the conclusion that we should clone this to 4.4 and treat this as a blocker.
(In reply to Clayton Coleman from comment #8)
> So drain can do one of two things:
>
> 1. When it gets an error, reread the pod and if the pod is already deleted,
> consider it evicted
> 2. Be slightly smarter on this error (see
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L176
> and
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L75
> to check for the terminating cause (controllers do it)
>
> However... no matter what, if drain returns an error, MCD MUST RETRY UNTIL
> IT RETURNS NO ERRORS. Is MCD doing that?

So, first, YES, the MCD retries over and over - if we're not doing that it's a bug, I agree - but it seems that the namespace is still terminating even after 5 tries of draining. It would be nice to grab a must-gather from those clusters to understand more. Again, the MCD keeps retrying with backoff, and that's in pkg/daemon/daemon.go. We'll investigate that further after must-gather...

Meanwhile, I've pushed a patch to ignore that terminating-namespace error in master/4.5 and we can definitely backport it to 4.4 - not sure about it being a blocker, since the error should go away as the MCD retries and the namespace is hopefully gone by then.

Scott, with regard to your questions:

(In reply to Scott Dodson from comment #1)
> We're asking the following questions to evaluate whether or not this bug
> warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The
> ultimate goal is to avoid delivering an update which introduces new risk or
> reduces cluster functionality in any way. Sample answers are provided to
> give more context. The UpgradeBlocker flag has been added to this bug, it
> will be removed if the assessment indicates that this should not block
> upgrade edges.
>
> Who is impacted?

I don't think we have numbers; at least I don't, and this is the very first time I've seen this issue.

> What is the impact?

Node cordoned and upgrade potentially blocked - we still have to assess (to Clayton's question) whether the MCD is correctly retrying, which I think it does.

> How involved is remediation?

Could be as easy as manually terminating the ns/pods, but we need a reproducer and/or a broken cluster to assess that.

> Is this a regression?

No, as far as I can tell; this is the first time we're seeing this issue.

I guess finally, I'm fine if this is an UpgradeBlocker; we have patches already to ignore the TerminatingNamespace error.
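For illustration only, a minimal sketch of that retry-with-backoff pattern (this is not the actual pkg/daemon/daemon.go code; drainNode, the backoff values, and the fixed fallback interval are made up for the example):

```
package daemon

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog"
)

// drainUntilSuccess keeps retrying drainNode until it succeeds: a bounded
// exponential phase first, then a fixed interval forever. It never gives up
// on its own.
func drainUntilSuccess(drainNode func() error) {
	backoff := wait.Backoff{
		Duration: 10 * time.Second, // initial retry delay (illustrative)
		Factor:   2,
		Steps:    5,
	}

	attempt := func() (bool, error) {
		if err := drainNode(); err != nil {
			klog.Infof("Draining failed with: %v, retrying", err)
			return false, nil // not done, no terminal error: retry
		}
		return true, nil
	}

	// Exponential backoff for the first few attempts...
	if err := wait.ExponentialBackoff(backoff, attempt); err == nil {
		return
	}
	// ...then retry forever at a fixed interval until drain finally succeeds.
	_ = wait.PollImmediateInfinite(1*time.Minute, attempt)
}
```

The point is only that the loop never gives up, which matches the behavior later confirmed from the daemon logs.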
> I guess finally, I'm fine if this is an UpgradeBlocker, we have patches already to ignore the TerminatingNamespace error If this is an UpgradeBlocker and impacts all existing 4.1+ releases, that's a lot of edges to pull. And it means we'd need to backport the TerminatingNamespace patch all the way to 4.1, and then, in the release that landed the backport for each 4.y branch, bake in all the previous releases as update sources. And then test all of those edges to make sure we weren't recommending something that would break with a huge 4.1.0-rc.0 -> 4.1.40 (or whatever) jump. That's pretty aggressive. If you feel it's warranted, could you at least spell out why you consider all of our current update edge recommendations impacted?
(In reply to W. Trevor King from comment #15)
> If this is an UpgradeBlocker and impacts all existing 4.1+ releases, that's
> a lot of edges to pull. And it means we'd need to backport the
> TerminatingNamespace patch all the way to 4.1, and then, in the release that
> landed the backport for each 4.y branch, bake in all the previous releases
> as update sources. And then test all of those edges to make sure we weren't
> recommending something that would break with a huge 4.1.0-rc.0 -> 4.1.40 (or
> whatever) jump. That's pretty aggressive. If you feel it's warranted,
> could you at least spell out why you consider all of our current update edge
> recommendations impacted?

I haven't assessed whether that's the case - the MCO should be retrying over and over, forever, so if this namespace-termination error is transient, it should resolve on its own. The fact that it didn't means I need a must-gather to debug further - the patch we have ignores the termination error, but it shouldn't be necessary.

Again, we'd need must-gather.
must-gather was run against the clusters at the time they were in this state (the stuck nodes have since been cleaned up):

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/endurance-cluster-maintenance-aws-4.4/19/artifacts/endurance-install/
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/endurance-cluster-maintenance-aws-4.3/119/artifacts/endurance-install/

However, it looks like must-gather itself timed out. Perhaps you can find what you need in the other cluster artifacts that were gathered.
So, I can confirm the MCD retries _forever_:

curl -v --silent https://storage.googleapis.com/origin-ci-test/logs/endurance-cluster-maintenance-aws-4.4/19/artifacts/endurance-install/pods/openshift-machine-config-operator_machine-config-daemon-ggpfp_machine-config-daemon.log 2>&1 | grep "Draining failed" | wc -l
6554

So the MCD here is doing what it's supposed to do. Now, I believe whatever issue we have here is because this namespace is just laying around without going away after hours and hours:

started:  I0316 00:28:02.822966  980825 update.go:1241] Update prepared; beginning drain
finished: I0318 10:02:55.493442  980825 update.go:167] Draining failed with: error when evicting pod "security-context-cf162714-689f-4e8d-8f99-3266d2dce45d": pods "security-context-cf162714-689f-4e8d-8f99-3266d2dce45d" is forbidden: unable to create new content in namespace e2e-volume-expand-2690 because it is being terminated, retrying

It literally took forever and the draining operation was still unable to complete. As you noticed, the offending pod is security-context-cf162714-689f-4e8d-8f99-3266d2dce45d in namespace e2e-volume-expand-2690.

I think the bug is in the namespace not being terminated for whatever reason (leaked mounts in containers come to mind). The MCO can definitely ignore that error, but I don't believe it's the urgent fix we need right now - rather, we should understand why that namespace is stuck, because that's not normal, and the MCO ignoring the error is just a stopgap (even if useful).

I think we should start with reproducing this again and digging into why the namespace cannot go away after hours. Not sure who's going to work on that, though. Please advise.
We now agree this is not an MCO bug and the MCO is behaving correctly; the MCO attempts to drain for 10+ hours, but that namespace error is still there, and that's _not_ on the MCO drain (as again, we keep draining).

The following statuses were on the namespace:

```
  {
    lastTransitionTime: "2020-03-14T04:06:41Z",
    message: "Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: e2e-volume-expand-2690 for gvr: /v1, Resource=pods",
    reason: "ContentDeletionFailed",
    status: "True",
    type: "NamespaceDeletionContentFailure"
  },
  {
    lastTransitionTime: "2020-03-14T04:06:17Z",
    message: "Some resources are remaining: persistentvolumeclaims. has 1 resource instances, pods. has 1 resource instances",
    reason: "SomeResourcesRemain",
    status: "True",
    type: "NamespaceContentRemaining"
  },
  {
    lastTransitionTime: "2020-03-14T04:06:17Z",
    message: "Some content in the namespace has finalizers remaining: kubernetes.io/pvc-protection in 1 resource instances",
    reason: "SomeFinalizersRemain",
    status: "True",
    type: "NamespaceFinalizersRemaining"
  }
```

That's likely the reason the namespace error isn't _temporary_ but goes on forever, and the MCO _should not_ ignore it, as that may lead to data loss when we reboot.

Moving to storage to investigate further.
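For anyone picking this up on the storage side, a small client-go sketch (illustrative only, not part of any operator; assumes a recent client-go and standard kubeconfig loading) that prints the namespace deletion conditions quoted above and the PVC finalizers still holding the namespace open:

```
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	const nsName = "e2e-volume-expand-2690"

	// Standard kubeconfig loading (in-cluster config would also work).
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{},
	).ClientConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Print the namespace deletion conditions (the block quoted above).
	ns, err := client.CoreV1().Namespaces().Get(ctx, nsName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, cond := range ns.Status.Conditions {
		fmt.Printf("%s=%s: %s\n", cond.Type, cond.Status, cond.Message)
	}

	// List PVCs whose finalizers (e.g. kubernetes.io/pvc-protection) are still
	// holding the namespace open.
	pvcs, err := client.CoreV1().PersistentVolumeClaims(nsName).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, pvc := range pvcs.Items {
		fmt.Printf("pvc %s finalizers=%v deletionTimestamp=%v\n", pvc.Name, pvc.Finalizers, pvc.DeletionTimestamp)
	}
}
```

In this case, the pvc-protection finalizer not clearing is what keeps the namespace, and hence the drain, stuck.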
Most likely a dup of bug 1814282.
Also more NoSchedule after updates in bug 1821369 and bug 1821364. Not clear to me if these are all dups of bug 1814282 or separate issues, or how to go about making that distinction.
Marking as a dup of bug 1814282 per Hemant's secret-for-no-reason (that I can see) comment 22. *** This bug has been marked as a duplicate of bug 1814282 ***
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add keyword again. [1]: https://github.com/openshift/enhancements/pull/475