Bug 1733708
| Summary: | 4.1.4 cluster creating 400+ eviction requests a second | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vikas Choudhary <vichoudh> | |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> | |
| Status: | CLOSED ERRATA | QA Contact: | Jianwei Hou <jhou> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.2.0 | CC: | afield, agarcial, ccoleman, jchaloup, jhou, mgugino | |
| Target Milestone: | --- | |||
| Target Release: | 4.2.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | 1732929 | |||
| : | 1733922 (view as bug list) | Environment: | ||
| Last Closed: | 2019-10-16 06:33:51 UTC | Type: | --- | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1732929, 1733922 | |||
|
Description
Vikas Choudhary
2019-07-27 19:31:40 UTC
Steps to reproduce the issue:
1. Create the pdb with minAvailable 1:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      "app": "nginx"
2. Create a replicaset with 1 replica and labels as `"app": "nginx"`
3. Find the machine that backs the node on which the pod is running.
4. Delete the machine.
The machine deletion command will not return.
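The hang follows from the budget arithmetic. Below is a minimal sketch, assuming the simplified rule that the number of allowed disruptions is the healthy pod count minus `minAvailable`, floored at zero. The helper name `allowed_disruptions` is made up for illustration and is not part of any Kubernetes client.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Simplified model of a PDB with an integer minAvailable:
    how many pods may be evicted right now without violating the budget."""
    return max(0, healthy_pods - min_available)


# Reproduction case from the steps above: 1 replica, minAvailable 1.
print(allowed_disruptions(healthy_pods=1, min_available=1))

# With a second replica, one pod at a time can be evicted.
print(allowed_disruptions(healthy_pods=2, min_available=1))
```

With one replica and `minAvailable: 1` the result is zero, so every eviction request for that pod is refused and the drain cannot finish.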
In another terminal, do a `tail -f` on machine-controller logs. You will see logs like below:
late the pod's disruption budget.
I0727 19:40:08.240725 1 info.go:20] error when evicting pod "pdb-workload-q24lb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0727 19:40:13.250043 1 info.go:20] error when evicting pod "pdb-workload-q24lb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0727 19:40:18.265153 1 info.go:20] error when evicting pod "pdb-workload-q24lb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Within a couple of minutes, you will see the interval between consecutive eviction failure logs shrinking. The number of failure logs per second keeps increasing over time.
After applying the fix, the interval stays constant at 5 seconds.
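To check the interval without eyeballing timestamps, the gaps can be computed from the log itself. This is a rough sketch, assuming the lines use the klog-style header seen above (severity letter, `mmdd`, then `hh:mm:ss.microseconds`) and that the capture does not cross midnight. The function name `eviction_gaps` is hypothetical.

```python
import re
from datetime import datetime

# Header as seen in the logs above: severity letter, mmdd, hh:mm:ss.microseconds
STAMP = re.compile(r"^[IWEF](\d{4} \d{2}:\d{2}:\d{2}\.\d{6})")


def eviction_gaps(lines):
    """Return the seconds between consecutive eviction failure lines."""
    times = []
    for line in lines:
        match = STAMP.match(line)
        if match and "error when evicting pod" in line:
            times.append(datetime.strptime(match.group(1), "%m%d %H:%M:%S.%f"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# The three complete log lines quoted in this report.
sample = """\
I0727 19:40:08.240725 1 info.go:20] error when evicting pod "pdb-workload-q24lb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0727 19:40:13.250043 1 info.go:20] error when evicting pod "pdb-workload-q24lb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0727 19:40:18.265153 1 info.go:20] error when evicting pod "pdb-workload-q24lb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
""".splitlines()

print(eviction_gaps(sample))
```

For the sample above both gaps come out at roughly 5 seconds. On an affected cluster, feeding in a longer capture should show the gaps dropping well below that.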
WORKAROUND: Delete the PDB object that is making the eviction fail.
PREVENTIVE STEP: Avoid combinations of PDB configuration and replica count that make eviction impossible. For example, if minAvailable is 1 in the PDB and the replica count is 1, eviction of this pod will always fail and the issue will occur.
Since this is in POST, can you please link all the relevant PRs?
Targeting this to 4.2 and using https://bugzilla.redhat.com/show_bug.cgi?id=1732929 for tracking the backport.
*** Bug 1733922 has been marked as a duplicate of this bug. ***
https://github.com/openshift/cluster-api/pull/54
https://github.com/openshift/cluster-api-provider-aws/pull/243
Verified on 4.2.0-0.nightly-2019-08-27-105356. Eviction retry is skipped after global timeout.
```
I0828 06:03:59.442555 1 controller.go:205] Reconciling machine "jhou1-mp2nv-worker-ap-northeast-1a-mcg2t" triggers delete
I0828 06:03:59.510084 1 info.go:16] ignoring DaemonSet-managed pods: tuned-r6cfv, dns-default-4rnkl, node-ca-mj6c2, machine-config-daemon-8nqgn, node-exporter-28x7s, multus-cm86v, ovs-mzhv8, sdn-5br78
I0828 06:03:59.524755 1 info.go:20] error when evicting pod "rc-grgbz" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0828 06:04:04.535211 1 info.go:20] error when evicting pod "rc-grgbz" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0828 06:04:09.544775 1 info.go:20] error when evicting pod "rc-grgbz" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0828 06:04:14.554539 1 info.go:20] error when evicting pod "rc-grgbz" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0828 06:04:19.513918 1 info.go:20] Closing stopCh
I0828 06:04:19.626801 1 info.go:16] ignoring DaemonSet-managed pods: tuned-r6cfv, dns-default-4rnkl, node-ca-mj6c2, machine-config-daemon-8nqgn, node-exporter-28x7s, multus-cm86v, ovs-mzhv8, sdn-5br78
I0828 06:04:19.626836 1 info.go:20] failed to evict pods from node "ip-10-0-130-176.ap-northeast-1.compute.internal" (pending pods: rc-grgbz): global timeout!! Skip eviction retries for pod "rc-grgbz"
I0828 06:04:19.626853 1 info.go:16] global timeout!! Skip eviction retries for pod "rc-grgbz"
I0828 06:04:19.626861 1 info.go:20] unable to drain node "ip-10-0-130-176.ap-northeast-1.compute.internal"
I0828 06:04:19.626871 1 info.go:20] there are pending nodes to be drained: ip-10-0-130-176.ap-northeast-1.compute.internal
W0828 06:04:19.626883 1 controller.go:298] drain failed for machine "jhou1-mp2nv-worker-ap-northeast-1a-mcg2t": global timeout!! Skip eviction retries for pod "rc-grgbz"
E0828 06:04:19.626898 1 controller.go:214] Failed to drain node for machine "jhou1-mp2nv-worker-ap-northeast-1a-mcg2t": requeue in: 20s
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:2922 |
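The verified behavior (retry every 5s, give up at the global timeout, then requeue) can be modeled in a few lines. This is only an illustrative sketch, not the code from the linked PRs. The 5s interval comes from the log message, the 20s global timeout is inferred from the timestamps in the verification log, and `evict_with_retries` is a hypothetical name. Time is simulated with a counter so the example runs instantly.

```python
def evict_with_retries(try_evict, retry_interval=5.0, global_timeout=20.0):
    """Retry an eviction at a fixed interval until it succeeds or the
    global timeout elapses. Returns (evicted, attempts)."""
    elapsed = 0.0
    attempts = 0
    while elapsed < global_timeout:
        attempts += 1
        if try_evict():
            return True, attempts
        elapsed += retry_interval  # stands in for sleeping between retries
    # Global timeout reached: stop retrying so the caller can requeue
    # instead of leaving a retry loop running in the background.
    return False, attempts


# A pod whose PDB never allows eviction, as in the reproduction steps.
print(evict_with_retries(lambda: False))  # (False, 4)
```

Four attempts before giving up matches the four "error when evicting pod" lines in the verification log. The point of the bounded loop is that no retries outlive the drain attempt that started them.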