Bug 1861189 - [sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard: non-multi-arch
Depends On:
Blocks:
Reported: 2020-07-28 02:57 UTC by W. Trevor King
Modified: 2020-10-27 16:20 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:17:21 UTC
Target Upstream Version:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 25335 | 0 | None | closed | bug 1861189: UPSTREAM: <drop>: make pdb tests pass reliably | 2020-09-08 19:42:47 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:20:51 UTC

Description W. Trevor King 2020-07-28 02:57:29 UTC
The test:
[sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it

is failing frequently in CI. See the search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-apps%5C%5D+DisruptionController+should+block+an+eviction+until+the+PDB+is+updated+to+allow+it 

For example:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it' | grep 'failures match' | sort
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 157 runs, 100% failed, 2% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 23 runs, 100% failed, 4% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match
pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws - 4 runs, 75% failed, 33% of failures match
...
pull-ci-operator-framework-operator-registry-master-e2e-aws - 8 runs, 63% failed, 80% of failures match
rehearse-10454-pull-ci-openshift-cloud-credential-operator-master-e2e-azure - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-origin-master-e2e-gcp - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6 - 8 runs, 100% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 9 runs, 89% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 17 runs, 100% failed, 24% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 4 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 8 runs, 100% failed, 25% of failures match
release-openshift-ocp-installer-e2e-openstack-ppc64le-4.6 - 2 runs, 100% failed, 50% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 9 runs, 100% failed, 44% of failures match
release-openshift-origin-installer-e2e-aws-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-shared-vpc-4.5 - 2 runs, 50% failed, 100% of failures match
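
To make the listing above easier to compare across jobs, here is a minimal Python sketch that parses each "N runs, X% failed, Y% of failures match" line and estimates how many runs actually hit this test failure. It assumes X is the share of runs that failed and Y is the share of those failed runs that match the search; the function names are hypothetical, and because the listed percentages are rounded the result is only an estimate.

```python
import re

# Matches lines such as:
#   "pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match"
LINE_RE = re.compile(
    r"^(?P<job>\S+) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match$"
)


def parse_line(line):
    """Return (job, runs, failed_pct, match_pct), or None for lines like '...'."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m["job"], int(m["runs"]), int(m["failed"]), int(m["match"])


def estimated_matching_runs(runs, failed_pct, match_pct):
    """Approximate number of runs that hit the failure (inputs are rounded percentages)."""
    return runs * failed_pct / 100 * match_pct / 100


# Two lines copied from the listing above, plus the elision marker.
sample = """\
pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match
...
release-openshift-ocp-installer-e2e-aws-4.6 - 9 runs, 89% failed, 63% of failures match
"""

for line in sample.splitlines():
    parsed = parse_line(line)
    if parsed is None:
        continue  # skip anything that is not a job summary line
    job, runs, failed, match = parsed
    estimate = estimated_matching_runs(runs, failed, match)
    print(f"{job}: roughly {estimate:.1f} of {runs} runs")
```

Jobs with a single run report 100% trivially, so the estimate is more informative for jobs with many runs.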

Anchoring on a specific release job, let's choose [1], which had:

Jul 27 21:14:39.598: INFO: unable to fetch logs for pods: rs-8c49z[e2e-disruption-2242].container[busybox].error=the server rejected our request for an unknown reason (get pods rs-8c49z)
...
fail [k8s.io/kubernetes/test/e2e/apps/disruption.go:323]: Expected an error, got nil

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1287847874599587840
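
For readers unfamiliar with what the test asserts, "Expected an error, got nil" means an eviction succeeded when the test expected the PDB to block it. The following is a toy Python model of that expectation, not the Kubernetes implementation: the class and function names are hypothetical, the values (three healthy pods, minAvailable lowered from 3 to 2) are assumptions for illustration, and the real disruption controller computes the budget status asynchronously, which this sketch leaves out.

```python
from dataclasses import dataclass


class TooManyRequests(Exception):
    """Stand-in for the HTTP 429 the eviction API returns when a PDB blocks an eviction."""


@dataclass
class ToyPDB:
    min_available: int    # mirrors spec.minAvailable (integer form only)
    current_healthy: int  # mirrors status.currentHealthy

    @property
    def disruptions_allowed(self):
        # Healthy pods beyond the required minimum, never negative.
        return max(0, self.current_healthy - self.min_available)


def evict(pdb):
    """Evict one pod, or raise if the budget has no disruptions left."""
    if pdb.disruptions_allowed <= 0:
        raise TooManyRequests("eviction would violate the disruption budget")
    pdb.current_healthy -= 1


# Assumed scenario: the budget requires all three pods, so eviction must fail.
pdb = ToyPDB(min_available=3, current_healthy=3)
try:
    evict(pdb)
    blocked = False
except TooManyRequests:
    blocked = True
print("blocked before update:", blocked)

# After the budget is relaxed, the same eviction is allowed.
pdb.min_available = 2
evict(pdb)
print("healthy pods after eviction:", pdb.current_healthy)
```

In this model the failing CI run corresponds to the first `evict` call returning without raising.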

Comment 1 W. Trevor King 2020-07-28 03:48:44 UTC
History in [1] suggests the test became flaky between [2] and [3].  It's not a 100% failure rate, so it's possible [2] squeaked by with the broken code (or whatever the trigger is), but if that's a real bracket the suspects are:

$ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286818801559539712/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]') <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286844539994116096/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]')
--- /dev/fd/63	2020-07-27 20:42:39.654779459 -0700
+++ /dev/fd/62	2020-07-27 20:42:39.654779459 -0700
@@ -4 +4 @@
-baremetal-installer https://github.com/openshift/installer/commit/d795e477966872804f20f89ca4a8477b1e23b596
+baremetal-installer https://github.com/openshift/installer/commit/7287d88d35e985b72747ed64b44907a827fb3cad
@@ -27 +27 @@
-cluster-network-operator https://github.com/openshift/cluster-network-operator/commit/c9aefce9eb7510f80f26347ccd49c91301fc75b4
+cluster-network-operator https://github.com/openshift/cluster-network-operator/commit/7303c6858c6065a4ca4c3b6d8b6ed996af7d31ee
@@ -36 +36 @@
-cluster-version-operator https://github.com/openshift/cluster-version-operator/commit/b658b4258dbecc74eb3b997806e95bb65181b274
+cluster-version-operator https://github.com/openshift/cluster-version-operator/commit/a49fef5c66c6b0707c54fd93f84d2f51d3d28aca
@@ -38 +38 @@
-console https://github.com/openshift/console/commit/f7034541b5f435371d7f8174599e6e330b1bb1ff
+console https://github.com/openshift/console/commit/60f367edb6a71ed2242784186665e7c957568c5e
@@ -50 +50 @@
-hyperkube https://github.com/openshift/kubernetes/commit/9ba0c166caed682a678f0cf56be1f7aeeb339fc1
+hyperkube https://github.com/openshift/kubernetes/commit/53f1b9d6f8de259644c05a25aa7ac8c6a67258e2
@@ -52,2 +52,2 @@
-installer https://github.com/openshift/installer/commit/d795e477966872804f20f89ca4a8477b1e23b596
-installer-artifacts https://github.com/openshift/installer/commit/d795e477966872804f20f89ca4a8477b1e23b596
+installer https://github.com/openshift/installer/commit/7287d88d35e985b72747ed64b44907a827fb3cad
+installer-artifacts https://github.com/openshift/installer/commit/7287d88d35e985b72747ed64b44907a827fb3cad
@@ -76 +76 @@
-machine-config-operator https://github.com/openshift/machine-config-operator/commit/be70bfe842d7d2a996eeef3bd4c55e12a02b5a86
+machine-config-operator https://github.com/openshift/machine-config-operator/commit/ab2673986646c62bc6599931d22f050ffd5871db

I suspect hyperkube:

$ git --no-pager log --first-parent --oneline 9ba0c166cae..53f1b9d6f8d 
53f1b9d6f8d (openshift/release-4.7, openshift/release-4.6, openshift/master) Merge pull request #166 from marun/rebase-1.19

Not clear to me if the change is an issue on the test-suite side, or the kubelet side, or what.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-ocp-installer-e2e-aws-4.6
[2]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286818801559539712
[3]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286844539994116096
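
The diff/jq pipeline in this comment can also be expressed as a short Python sketch, which may be easier to extend when bisecting between two release payloads. It reads the same fields the jq filter uses (`.spec.tags[].name` and the two `io.openshift.build.*` annotations); the function names are hypothetical, and the inline payloads are trimmed stand-ins rather than real `release-images-latest` files (the `example-unchanged` tag and its URL are invented for illustration).

```python
SOURCE = "io.openshift.build.source-location"
COMMIT = "io.openshift.build.commit.id"


def tag_commits(image_stream):
    """Map each tag name to '<source-location>/commit/<commit-id>'."""
    result = {}
    for tag in image_stream["spec"]["tags"]:
        ann = tag.get("annotations") or {}  # tolerate missing or null annotations
        result[tag["name"]] = f"{ann.get(SOURCE, '')}/commit/{ann.get(COMMIT, '')}"
    return result


def changed_tags(old, new):
    """Return {name: (old_url, new_url)} for tags that differ between two payloads."""
    a, b = tag_commits(old), tag_commits(new)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


def payload(hyperkube_commit):
    """Build a trimmed, illustrative payload containing two tags."""
    return {"spec": {"tags": [
        {"name": "hyperkube", "annotations": {
            SOURCE: "https://github.com/openshift/kubernetes",
            COMMIT: hyperkube_commit}},
        # Hypothetical tag that is identical in both payloads.
        {"name": "example-unchanged", "annotations": {
            SOURCE: "https://example.invalid/repo",
            COMMIT: "0000000"}},
    ]}}


old = payload("9ba0c166caed682a678f0cf56be1f7aeeb339fc1")
new = payload("53f1b9d6f8de259644c05a25aa7ac8c6a67258e2")

for name, (before, after) in changed_tags(old, new).items():
    print(f"-{name} {before}")
    print(f"+{name} {after}")
```

With real data, `old` and `new` would come from `json.load` on the two downloaded `release-images-latest` artifacts.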

Comment 3 Seth Jennings 2020-07-28 16:04:52 UTC
Dan Li, it does not have multi-arch implications.  ppc64le was just one of many test suites that were flaking.

Don't imagine this is a Node team bug, but I'll try to figure out where it should go.

Current theory is this flake came in on the rebase right Trevor?

Comment 4 W. Trevor King 2020-07-28 16:06:47 UTC
> Current theory is this flake came in on the rebase right Trevor?

Yup.  I just assigned to the node team because the kubelet (I think?) comes out of openshift/kubernetes now, and "PDB" seemed like a node-touching thing.  Could also be on the API-server/controller side of openshift/kubernetes output.

Comment 5 Seth Jennings 2020-07-28 16:26:28 UTC
Looks like there might be some skew between openshift/kubernetes and origin's vendored kube atm.

https://github.com/openshift/origin/pull/25314 has not yet merged to sync them.

If the e2e tests run out of the vendored kube, there have been changes upstream to PDBs both in terms of code and e2e test.

https://github.com/kubernetes/kubernetes/pull/91342 (change to code)
https://github.com/kubernetes/kubernetes/pull/92991 (e2e fix)

Sending to apiserver since it handles the eviction API and PDB enforcement.

Comment 6 W. Trevor King 2020-07-29 03:41:56 UTC
Recovering bug state after the PR got green-buttoned [1].

[1]: https://github.com/openshift/origin/pull/25335#event-3596734606

Comment 10 Ke Wang 2020-08-10 09:38:40 UTC
I ran the following command to search test runs from the past 7 days. The failure match rate shows a marked downward trend, and no job has a 100% match rate.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it&maxAge=168h' | grep 'failures match' | sort
endurance-e2e-aws-4.4 - 5 runs, 100% failed, 20% of failures match
osde2e-stage-aws-conformance-default - 7 runs, 43% failed, 67% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 520 runs, 5% failed, 4% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 70 runs, 100% failed, 1% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-single - 26 runs, 69% failed, 6% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn - 95 runs, 98% failed, 1% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-ovn-hybrid-step-registry - 99 runs, 99% failed, 2% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-ovn-step-registry - 91 runs, 99% failed, 1% of failures match
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws - 4 runs, 75% failed, 33% of failures match
pull-ci-openshift-installer-master-e2e-aws - 82 runs, 55% failed, 2% of failures match
pull-ci-openshift-installer-master-e2e-aws-fips - 107 runs, 90% failed, 1% of failures match
pull-ci-openshift-installer-release-4.5-e2e-vsphere - 3 runs, 100% failed, 33% of failures match
pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-origin - 9 runs, 100% failed, 11% of failures match
pull-ci-openshift-kubernetes-master-e2e-aws-fips - 67 runs, 94% failed, 2% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry - 102 runs, 85% failed, 3% of failures match
pull-ci-openshift-origin-master-e2e-aws-fips - 248 runs, 90% failed, 3% of failures match
pull-ci-openshift-origin-master-e2e-gcp - 89 runs, 57% failed, 10% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn - 18 runs, 100% failed, 6% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn - 20 runs, 100% failed, 5% of failures match
pull-ci-openshift-router-master-e2e - 8 runs, 63% failed, 20% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 97 runs, 60% failed, 3% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 50 runs, 58% failed, 3% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 50 runs, 84% failed, 2% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 49 runs, 96% failed, 2% of failures match
release-openshift-ocp-installer-e2e-openstack-4.4 - 22 runs, 100% failed, 5% of failures match

The fix works as expected, so I am moving the bug to VERIFIED.
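
As a rough check on the downward trend, the sketch below multiplies the two percentages for jobs that appear in both the original description and the listing in this comment, giving the approximate share of all runs that hit the flake. The numbers are copied from those listings; the function name is hypothetical, the percentages are rounded, and some of the earlier samples are as small as one run, so treat the output as indicative only.

```python
def overall_rate(failed_pct, match_pct):
    """Approximate share of all runs hitting the flake, in percent."""
    return failed_pct * match_pct / 100


# job -> (failed %, match %) from the description (before) and this comment (after)
before = {
    "promote-release-openshift-machine-os-content-e2e-aws-4.6": (100, 2),
    "release-openshift-ocp-installer-e2e-aws-4.6": (89, 63),
    "release-openshift-ocp-installer-e2e-azure-4.6": (100, 24),
    "release-openshift-ocp-installer-e2e-gcp-ovn-4.6": (100, 100),
}
after = {
    "promote-release-openshift-machine-os-content-e2e-aws-4.6": (5, 4),
    "release-openshift-ocp-installer-e2e-aws-4.6": (60, 3),
    "release-openshift-ocp-installer-e2e-azure-4.6": (58, 3),
    "release-openshift-ocp-installer-e2e-gcp-ovn-4.6": (96, 2),
}

for job in before:
    b = overall_rate(*before[job])
    a = overall_rate(*after[job])
    print(f"{job}: {b:.1f}% of runs -> {a:.1f}% of runs")
```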

Comment 12 errata-xmlrpc 2020-10-27 16:17:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

