Description of problem: both my 4.3 and 4.4 endurance clusters did an upgrade to the new nightly, and both of them now have a node stuck in "SchedulingDisabled":

4.4:

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-129-249.us-east-2.compute.internal   Ready                      master   6d17h   v1.17.1
ip-10-0-130-6.us-east-2.compute.internal     Ready,SchedulingDisabled   worker   6d17h   v1.17.1
ip-10-0-132-57.us-east-2.compute.internal    Ready                      worker   6d17h   v1.17.1
ip-10-0-142-163.us-east-2.compute.internal   Ready                      master   6d17h   v1.17.1
ip-10-0-147-128.us-east-2.compute.internal   Ready                      worker   6d17h   v1.17.1
ip-10-0-147-89.us-east-2.compute.internal    Ready                      master   6d17h   v1.17.1

$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.4.0-0.nightly-2020-03-15-215151
Kubernetes Version: v1.17.1

4.3:

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-130-231.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   6d17h   v1.16.2
ip-10-0-131-254.us-east-2.compute.internal   Ready                      master   6d17h   v1.16.2
ip-10-0-134-56.us-east-2.compute.internal    Ready                      worker   6d17h   v1.16.2
ip-10-0-137-75.us-east-2.compute.internal    Ready                      master   6d17h   v1.16.2
ip-10-0-146-93.us-east-2.compute.internal    Ready                      worker   6d17h   v1.16.2
ip-10-0-157-250.us-east-2.compute.internal   Ready                      master   6d17h   v1.16.2

$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.3.0-0.nightly-2020-03-15-112942
Kubernetes Version: v1.16.2

Ping me for access to the clusters.

Expected results: Nodes do not get stuck in SchedulingDisabled after upgrade.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
- Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time

What is the impact?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
- No, it's always been like this, we just never noticed
- Yes, from 4.2.z and 4.3.1
upgrade edges 4.3:

  desired:
    force: true
    image: registry.svc.ci.openshift.org/ocp/release:4.3
    version: 4.3.0-0.nightly-2020-03-15-112942
  history:
  - completionTime: "2020-03-16T00:42:59Z"
    image: registry.svc.ci.openshift.org/ocp/release:4.3
    startedTime: "2020-03-16T00:04:28Z"
    state: Completed
    verified: false
    version: 4.3.0-0.nightly-2020-03-15-112942
  - completionTime: "2020-03-10T19:17:13Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:e76667eb92d91d60fdc661bf88d6d15df528d417e5e11bd09244489d0aebf38d
    startedTime: "2020-03-10T19:00:10Z"
    state: Completed
    verified: false
    version: 4.3.0-0.nightly-2020-03-09-200240

upgrade edges 4.4:

  desired:
    force: true
    image: registry.svc.ci.openshift.org/ocp/release:4.4
    version: 4.4.0-0.nightly-2020-03-15-215151
  history:
  - completionTime: null
    image: registry.svc.ci.openshift.org/ocp/release:4.4
    startedTime: "2020-03-16T00:04:36Z"
    state: Partial
    verified: false
    version: 4.4.0-0.nightly-2020-03-15-215151
  - completionTime: "2020-03-10T19:38:36Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:36cd5bc706b135e4dff14064dd4c6dffb87d5c04158e3a8243d2ffe94beea64e
    startedTime: "2020-03-10T19:18:46Z"
    state: Completed
    verified: false
    version: 4.4.0-0.nightly-2020-03-10-115843

Both clusters are still up and available for debug; ping me for credentials.

Regarding the questions...

Who is affected? These were both z-stream upgrades, so in theory a 4.3.z to 4.3.z+1 customer is impacted. Since 4.4 isn't GA yet, it's unclear if there's any 4.4 impact, but I'd suspect that 4.3 to 4.4 would also be affected by whatever went wrong here.

What is the impact? Worker nodes become unschedulable, meaning some daemonsets never achieve readiness again and some cluster capacity is unavailable. Not sure if this could also affect a master node, which would have a much bigger impact.

How involved is remediation? Defer to assignee.

Is this a regression? Defer to assignee.
The 4.3 cluster has the following node annotation from MCO: machineconfiguration.openshift.io/reason: 'failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod "pvc-volume-tester-ccdlh": pods "pvc-volume-tester-ccdlh" is forbidden: unable to create new content in namespace e2e-csi-mock-volumes-7004 because it is being terminated'
The 4.3 node is in the middle of a drain and is cordoned. The MCD is in the process of draining the node and erroring with:

Draining failed with: error when evicting pod "pvc-volume-tester-ccdlh": pods "pvc-volume-tester-ccdlh" is forbidden: unable to create new content in namespace e2e-csi-mock-volumes-7004 because it is being terminated, retrying

I suspect the MCD could ignore this error on drain.
MCO Team: Thoughts on ignoring this error in the MCD?
(In reply to Ryan Phillips from comment #4)
> The 4.3 node is in the middle of a drain and is cordoned. The MCD is in the
> process of draining the node and erroring with:
>
> Draining failed with: error when evicting pod "pvc-volume-tester-ccdlh":
> pods "pvc-volume-tester-ccdlh" is forbidden: unable to create new content in
> namespace e2e-csi-mock-volumes-7004 because it is being terminated, retrying
>
> I suspect the MCD could ignore this error on drain.

Ryan, is it safe to ignore that? I think I lack knowledge of that specific error to make a call right now; if it's safe to ignore, we can ignore it.
Hrm... this is actually a bug. Drain should succeed because the pod is deleted already.
So drain can do one of two things:

1. When it gets an error, reread the pod and if the pod is already deleted, consider it evicted.
2. Be slightly smarter on this error (see https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L176 and https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L75 to check for the terminating cause (controllers do it)).

However... no matter what, if drain returns an error, MCD MUST RETRY UNTIL IT RETURNS NO ERRORS. Is MCD doing that?
Note that if you go with 2 you can't complete drain until the pod is gone (you should probably treat 1 and 2 the same, in that a refresh of the pod is required). Alternatively, drain could detect the namespace-terminating error and then execute a pod delete directly. But drain can't continue to the next step until it observes, from the apiserver, that the pod has its deletion timestamp set.
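To make the pod-refresh and terminating-cause check concrete, here is a minimal Go sketch of a drain-side eviction error handler (assuming a recent client-go plus the apimachinery HasStatusCause helper and the core/v1 NamespaceTerminatingCause constant; handleEvictionError and its wiring are hypothetical, not the actual drain library code):

```
package drainutil

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// handleEvictionError is a hypothetical helper: given the error returned by an
// eviction attempt, decide whether drain may treat the pod as handled.
func handleEvictionError(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, evictErr error) (done bool, err error) {
	// Option 1: reread the pod; if it is already gone, consider it evicted.
	current, getErr := client.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(getErr) {
		return true, nil
	}
	if getErr != nil {
		return false, getErr
	}

	// Option 2: the eviction was rejected because the namespace is terminating.
	// The pod is on its way out, but drain still has to observe a deletion
	// timestamp on it before moving on to the next step.
	if apierrors.HasStatusCause(evictErr, corev1.NamespaceTerminatingCause) && current.DeletionTimestamp != nil {
		return true, nil
	}

	// Anything else (including a terminating namespace whose pod has not been
	// marked for deletion yet): keep retrying.
	return false, evictErr
}
```

Either way, the hard requirement above still holds: the caller keeps retrying until drain returns no error.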
(In reply to Clayton Coleman from comment #8)
> So drain can do one of two things:
>
> 1. When it gets an error, reread the pod and if the pod is already deleted,
> consider it evicted
> 2. Be slightly smarter on this error (see
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L176
> and
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L75
> to check for the terminating cause (controllers do it)
>
> However... no matter what, if drain returns an error, MCD MUST RETRY UNTIL
> IT RETURNS NO ERRORS. Is MCD doing that?

To answer that question - yes, the MCD and the MAO are doing that, using the shared drain library in https://github.com/openshift/cluster-api/tree/master/pkg/drain (which has now moved, and I have to find it again).

We need to be smarter on error, yes.
So I've actually noticed the MAO switched to using the upstream kubectl/drain library directly, which is something we're going to target again for 4.5, along with ignoring the errors mentioned by Clayton. Maybe we also want to upstream this error-ignoring behavior to kubectl/drain?

The MAO switched here: https://github.com/openshift/machine-api-operator/commit/f0c52c18e72ac92743f8f86540d7715d24bedd19

(In reply to Clayton Coleman from comment #8)
> So drain can do one of two things:
>
> 1. When it gets an error, reread the pod and if the pod is already deleted,
> consider it evicted
> 2. Be slightly smarter on this error (see
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L176
> and
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L75
> to check for the terminating cause (controllers do it)
>
> However... no matter what, if drain returns an error, MCD MUST RETRY UNTIL
> IT RETURNS NO ERRORS. Is MCD doing that?
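For reference, a rough sketch of what driving the upstream helper looks like (assuming k8s.io/kubectl/pkg/drain from around the 1.17/1.18 era - field names such as DeleteLocalData were renamed in later releases - and this drainNode wrapper is illustrative, not the MAO's or MCD's actual code):

```
package drainutil

import (
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode cordons the node and evicts its pods using the upstream helper.
// Callers are expected to keep retrying on error, as discussed in this bug.
func drainNode(client kubernetes.Interface, node *corev1.Node) error {
	helper := &drain.Helper{
		Client:              client,
		Force:               true, // also evict pods not owned by a controller
		IgnoreAllDaemonSets: true, // daemonset pods would just be recreated on the node
		DeleteLocalData:     true, // allow pods using emptyDir to be evicted
		GracePeriodSeconds:  -1,   // use each pod's own termination grace period
		Timeout:             90 * time.Second,
		OnPodDeletedOrEvicted: func(pod *corev1.Pod, usingEviction bool) {
			fmt.Fprintf(os.Stdout, "pod %s/%s gone (eviction=%v)\n", pod.Namespace, pod.Name, usingEviction)
		},
		Out:    os.Stdout,
		ErrOut: os.Stderr,
	}

	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return fmt.Errorf("cordon failed: %v", err)
	}
	if err := drain.RunNodeDrain(helper, node.Name); err != nil {
		// This is where eviction errors such as "unable to create new content
		// in namespace ... because it is being terminated" surface.
		return fmt.Errorf("drain failed: %v", err)
	}
	return nil
}
```

The library surfaces eviction errors like the namespace-terminating one verbatim, so any ignore/retry policy has to live in the caller (or be upstreamed, as suggested above).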
Do we know enough now to answer the questions in comment 1? Comment 0 certainly leads me to the conclusion that we should clone this to 4.4 and treat this as a blocker.
(In reply to Clayton Coleman from comment #8)
> So drain can do one of two things:
>
> 1. When it gets an error, reread the pod and if the pod is already deleted,
> consider it evicted
> 2. Be slightly smarter on this error (see
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L176
> and
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L75
> to check for the terminating cause (controllers do it)
>
> However... no matter what, if drain returns an error, MCD MUST RETRY UNTIL
> IT RETURNS NO ERRORS. Is MCD doing that?

So, first, YES, the MCD retries over and over - if we're not doing that it's a bug, I agree - but it seems that the namespace is still terminating even after 5 tries of draining. It would be nice to grab a must-gather from those clusters to understand more. Again, the MCD keeps retrying with backoff, and that's in pkg/daemon/daemon.go. We'll investigate that further after must-gather...

Meanwhile, I've pushed a patch to ignore that terminating-namespace error in master/4.5 and we can definitely backport it to 4.4 - not sure about it being a blocker, since the error should go away as the MCD retries and the namespace is hopefully gone by then.

Scott, with regard to your questions:

(In reply to Scott Dodson from comment #1)
> We're asking the following questions to evaluate whether or not this bug
> warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The
> ultimate goal is to avoid delivering an update which introduces new risk or
> reduces cluster functionality in any way. Sample answers are provided to
> give more context. The UpgradeBlocker flag has been added to this bug, it
> will be removed if the assessment indicates that this should not block
> upgrade edges.
>
> Who is impacted?

I don't think we have numbers; at least I don't, and this is the very first time I've seen this issue.

> What is the impact?

Node cordoned and upgrade potentially blocked - we still have to assess (to Clayton's question) whether the MCD is correctly retrying, which I think it does.

> How involved is remediation?

Could be as easy as manually terminating the ns/pods, but we need a reproducer and/or a broken cluster to assess that.

> Is this a regression?

No, as far as I can tell; this is the first time we're seeing this issue.

I guess finally, I'm fine if this is an UpgradeBlocker; we have patches already to ignore the TerminatingNamespace error.
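For illustration only, a minimal sketch of that retry-with-backoff pattern (this is not the actual pkg/daemon/daemon.go code; drainNode, the backoff values, and the fixed fallback interval are made up for the example):

```
package daemon

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog"
)

// drainUntilSuccess keeps retrying drainNode until it succeeds: a bounded
// exponential phase first, then a fixed interval forever. It never gives up
// on its own.
func drainUntilSuccess(drainNode func() error) {
	backoff := wait.Backoff{
		Duration: 10 * time.Second, // initial retry delay (illustrative)
		Factor:   2,
		Steps:    5,
	}

	attempt := func() (bool, error) {
		if err := drainNode(); err != nil {
			klog.Infof("Draining failed with: %v, retrying", err)
			return false, nil // not done, no terminal error: retry
		}
		return true, nil
	}

	// Exponential backoff for the first few attempts...
	if err := wait.ExponentialBackoff(backoff, attempt); err == nil {
		return
	}
	// ...then retry forever at a fixed interval until drain finally succeeds.
	_ = wait.PollImmediateInfinite(1*time.Minute, attempt)
}
```

The point is only that the loop never gives up, which matches the behavior later confirmed from the daemon logs.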
> I guess finally, I'm fine if this is an UpgradeBlocker, we have patches already to ignore the TerminatingNamespace error If this is an UpgradeBlocker and impacts all existing 4.1+ releases, that's a lot of edges to pull. And it means we'd need to backport the TerminatingNamespace patch all the way to 4.1, and then, in the release that landed the backport for each 4.y branch, bake in all the previous releases as update sources. And then test all of those edges to make sure we weren't recommending something that would break with a huge 4.1.0-rc.0 -> 4.1.40 (or whatever) jump. That's pretty aggressive. If you feel it's warranted, could you at least spell out why you consider all of our current update edge recommendations impacted?
(In reply to W. Trevor King from comment #15)
> If this is an UpgradeBlocker and impacts all existing 4.1+ releases, that's
> a lot of edges to pull. And it means we'd need to backport the
> TerminatingNamespace patch all the way to 4.1, and then, in the release that
> landed the backport for each 4.y branch, bake in all the previous releases
> as update sources. And then test all of those edges to make sure we weren't
> recommending something that would break with a huge 4.1.0-rc.0 -> 4.1.40 (or
> whatever) jump. That's pretty aggressive. If you feel it's warranted,
> could you at least spell out why you consider all of our current update edge
> recommendations impacted?

I haven't assessed whether that's the case - the MCO should be retrying over and over, forever, so if this namespace-termination error is transient, it should resolve on its own. The fact that it didn't means I need a must-gather to debug further - the patch we have ignores the termination error, but it shouldn't be necessary.

Again, we'd need must-gather.
must-gather was run against the clusters at the time they were in this state (the stuck nodes have since been cleaned up):

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/endurance-cluster-maintenance-aws-4.4/19/artifacts/endurance-install/
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/endurance-cluster-maintenance-aws-4.3/119/artifacts/endurance-install/

However, it looks like must-gather itself timed out. Perhaps you can find what you need in the other cluster artifacts that were gathered.
So, I can confirm the MCD retries _forever_:

curl -v --silent https://storage.googleapis.com/origin-ci-test/logs/endurance-cluster-maintenance-aws-4.4/19/artifacts/endurance-install/pods/openshift-machine-config-operator_machine-config-daemon-ggpfp_machine-config-daemon.log 2>&1 | grep "Draining failed" | wc -l
6554

So the MCD here is doing what it's supposed to do. Now, I believe whatever issue we have here is because this namespace is just laying around without going away after hours and hours:

started:  I0316 00:28:02.822966  980825 update.go:1241] Update prepared; beginning drain
finished: I0318 10:02:55.493442  980825 update.go:167] Draining failed with: error when evicting pod "security-context-cf162714-689f-4e8d-8f99-3266d2dce45d": pods "security-context-cf162714-689f-4e8d-8f99-3266d2dce45d" is forbidden: unable to create new content in namespace e2e-volume-expand-2690 because it is being terminated, retrying

It literally took forever and the draining operation was still unable to complete. As you noticed, the offending pod is security-context-cf162714-689f-4e8d-8f99-3266d2dce45d in namespace e2e-volume-expand-2690.

I think the bug is in the namespace not being terminated for whatever reason (leaked mounts in containers come to mind). The MCO can definitely ignore that error, but I don't believe it's the urgent fix we need right now - rather, we should understand why that namespace is stuck, because that's not normal, and the MCO ignoring the error is just a stopgap (even if useful).

I think we should start with reproducing this again and digging into why the namespace cannot go away after hours. Not sure who's going to work on that, though. Please advise.
We now agree this is not an MCO bug and the MCO is behaving correctly; the MCO attempts to drain for 10+ hours, but that namespace error is still there, and that's _not_ on the MCO drain (as again, we keep draining).

The following statuses were on the namespace:

```
  {
    lastTransitionTime: "2020-03-14T04:06:41Z",
    message: "Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: e2e-volume-expand-2690 for gvr: /v1, Resource=pods",
    reason: "ContentDeletionFailed",
    status: "True",
    type: "NamespaceDeletionContentFailure"
  },
  {
    lastTransitionTime: "2020-03-14T04:06:17Z",
    message: "Some resources are remaining: persistentvolumeclaims. has 1 resource instances, pods. has 1 resource instances",
    reason: "SomeResourcesRemain",
    status: "True",
    type: "NamespaceContentRemaining"
  },
  {
    lastTransitionTime: "2020-03-14T04:06:17Z",
    message: "Some content in the namespace has finalizers remaining: kubernetes.io/pvc-protection in 1 resource instances",
    reason: "SomeFinalizersRemain",
    status: "True",
    type: "NamespaceFinalizersRemaining"
  }
```

That's likely the reason the namespace error isn't _temporary_ but goes on forever, and the MCO _should not_ ignore it, as that may lead to data loss when we reboot.

Moving to storage to investigate further.
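For anyone picking this up on the storage side, a small client-go sketch (illustrative only, not part of any operator; assumes a recent client-go and standard kubeconfig loading) that prints the namespace deletion conditions quoted above and the PVC finalizers still holding the namespace open:

```
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	const nsName = "e2e-volume-expand-2690"

	// Standard kubeconfig loading (in-cluster config would also work).
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{},
	).ClientConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Print the namespace deletion conditions (the block quoted above).
	ns, err := client.CoreV1().Namespaces().Get(ctx, nsName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, cond := range ns.Status.Conditions {
		fmt.Printf("%s=%s: %s\n", cond.Type, cond.Status, cond.Message)
	}

	// List PVCs whose finalizers (e.g. kubernetes.io/pvc-protection) are still
	// holding the namespace open.
	pvcs, err := client.CoreV1().PersistentVolumeClaims(nsName).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, pvc := range pvcs.Items {
		fmt.Printf("pvc %s finalizers=%v deletionTimestamp=%v\n", pvc.Name, pvc.Finalizers, pvc.DeletionTimestamp)
	}
}
```

In this case, the pvc-protection finalizer not clearing is what keeps the namespace, and hence the drain, stuck.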
Most likely a dup of bug 1814282.
Also more NoSchedule after updates in bug 1821369 and bug 1821364. Not clear to me if these are all dups of bug 1814282 or separate issues, or how to go about making that distinction.
Marking as a dup of bug 1814282 per Hemant's secret-for-no-reason (that I can see) comment 22. *** This bug has been marked as a duplicate of bug 1814282 ***
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add keyword again. [1]: https://github.com/openshift/enhancements/pull/475