test:
[sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-apps%5C%5D+DisruptionController+should+block+an+eviction+until+the+PDB+is+updated+to+allow+it

For example:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it' | grep 'failures match' | sort
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 157 runs, 100% failed, 2% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 23 runs, 100% failed, 4% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match
pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws - 4 runs, 75% failed, 33% of failures match
...
pull-ci-operator-framework-operator-registry-master-e2e-aws - 8 runs, 63% failed, 80% of failures match
rehearse-10454-pull-ci-openshift-cloud-credential-operator-master-e2e-azure - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-origin-master-e2e-gcp - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6 - 8 runs, 100% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 9 runs, 89% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 17 runs, 100% failed, 24% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 4 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 8 runs, 100% failed, 25% of failures match
release-openshift-ocp-installer-e2e-openstack-ppc64le-4.6 - 2 runs, 100% failed, 50% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 9 runs, 100% failed, 44% of failures match
release-openshift-origin-installer-e2e-aws-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-shared-vpc-4.5 - 2 runs, 50% failed, 100% of failures match

Anchoring on a specific release job, let's choose [1], which had:

Jul 27 21:14:39.598: INFO: unable to fetch logs for pods: rs-8c49z[e2e-disruption-2242].container[busybox].error=the server rejected our request for an unknown reason (get pods rs-8c49z)
...
fail [k8s.io/kubernetes/test/e2e/apps/disruption.go:323]: Expected an error, got nil

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1287847874599587840
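To narrow the same search to a single job before anchoring, the search endpoint's name= filter (visible in the first URL above) can be combined with maxAge; a sketch:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it&name=release-openshift-ocp-installer-e2e-aws-4.6&maxAge=168h' | grep 'failures match'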
History in [1] suggests the test became flaky between [2] and [3]. It's not a 100% failure rate, so it's possible [2] squeaked by with the broken code (or whatever the trigger is), but if that's a real bracket, the suspects are:

$ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286818801559539712/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]') <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286844539994116096/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]')
--- /dev/fd/63 2020-07-27 20:42:39.654779459 -0700
+++ /dev/fd/62 2020-07-27 20:42:39.654779459 -0700
@@ -4 +4 @@
-baremetal-installer https://github.com/openshift/installer/commit/d795e477966872804f20f89ca4a8477b1e23b596
+baremetal-installer https://github.com/openshift/installer/commit/7287d88d35e985b72747ed64b44907a827fb3cad
@@ -27 +27 @@
-cluster-network-operator https://github.com/openshift/cluster-network-operator/commit/c9aefce9eb7510f80f26347ccd49c91301fc75b4
+cluster-network-operator https://github.com/openshift/cluster-network-operator/commit/7303c6858c6065a4ca4c3b6d8b6ed996af7d31ee
@@ -36 +36 @@
-cluster-version-operator https://github.com/openshift/cluster-version-operator/commit/b658b4258dbecc74eb3b997806e95bb65181b274
+cluster-version-operator https://github.com/openshift/cluster-version-operator/commit/a49fef5c66c6b0707c54fd93f84d2f51d3d28aca
@@ -38 +38 @@
-console https://github.com/openshift/console/commit/f7034541b5f435371d7f8174599e6e330b1bb1ff
+console https://github.com/openshift/console/commit/60f367edb6a71ed2242784186665e7c957568c5e
@@ -50 +50 @@
-hyperkube https://github.com/openshift/kubernetes/commit/9ba0c166caed682a678f0cf56be1f7aeeb339fc1
+hyperkube https://github.com/openshift/kubernetes/commit/53f1b9d6f8de259644c05a25aa7ac8c6a67258e2
@@ -52,2 +52,2 @@
-installer https://github.com/openshift/installer/commit/d795e477966872804f20f89ca4a8477b1e23b596
-installer-artifacts https://github.com/openshift/installer/commit/d795e477966872804f20f89ca4a8477b1e23b596
+installer https://github.com/openshift/installer/commit/7287d88d35e985b72747ed64b44907a827fb3cad
+installer-artifacts https://github.com/openshift/installer/commit/7287d88d35e985b72747ed64b44907a827fb3cad
@@ -76 +76 @@
-machine-config-operator https://github.com/openshift/machine-config-operator/commit/be70bfe842d7d2a996eeef3bd4c55e12a02b5a86
+machine-config-operator https://github.com/openshift/machine-config-operator/commit/ab2673986646c62bc6599931d22f050ffd5871db

I suspect hyperkube:

$ git --no-pager log --first-parent --oneline 9ba0c166cae..53f1b9d6f8d
53f1b9d6f8d (openshift/release-4.7, openshift/release-4.6, openshift/master) Merge pull request #166 from marun/rebase-1.19

Not clear to me if the change is an issue on the test-suite side, or the kubelet side, or what.
[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-ocp-installer-e2e-aws-4.6
[2]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286818801559539712
[3]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286844539994116096
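For future brackets, the payload diff above generalizes to a small shell helper (a sketch; release-commits is a hypothetical name, and the GCS layout is exactly the one used in the curl commands above):

# release-commits JOB BUILD: print "tag repo/commit/sha" lines for one release payload
release-commits() {
  curl -s "https://storage.googleapis.com/origin-ci-test/logs/${1}/${2}/artifacts/release-images-latest/release-images-latest" |
    jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]'
}

$ diff -U0 <(release-commits release-openshift-ocp-installer-e2e-aws-4.6 1286818801559539712) \
           <(release-commits release-openshift-ocp-installer-e2e-aws-4.6 1286844539994116096)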
Dan Li, it does not have multi-arch implications; ppc64le was just one of many test suites that were flaking. I don't imagine this is a Node team bug, but I'll try to figure out where it should go. Current theory is that this flake came in on the rebase, right Trevor?
> Current theory is that this flake came in on the rebase, right Trevor?

Yup. I just assigned it to the node team because the kubelet (I think?) comes out of openshift/kubernetes now, and "PDB" seemed like a node-touching thing. It could also be on the API-server/controller side of the openshift/kubernetes output.
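For context on why this could be API-server-side: eviction is a pod subresource served by the apiserver, which checks PDBs before admitting the request; the kubelet isn't consulted for that decision. A minimal sketch (assuming a pod demo-pod labeled app=demo in the default namespace, and the 1.19-era policy/v1beta1 API):

$ cat <<'EOF' | kubectl create -f -
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: block-all
  namespace: default
spec:
  minAvailable: "100%"
  selector:
    matchLabels:
      app: demo
EOF
$ kubectl proxy --port=8001 &
$ curl -s -o /dev/null -w '%{http_code}\n' -X POST \
    -H 'Content-Type: application/json' \
    -d '{"apiVersion":"policy/v1beta1","kind":"Eviction","metadata":{"name":"demo-pod","namespace":"default"}}' \
    http://127.0.0.1:8001/api/v1/namespaces/default/pods/demo-pod/eviction
# expect HTTP 429 while the PDB blocks the eviction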
Looks like there might be some skew between openshift/kubernetes and origin's vendored kube at the moment; https://github.com/openshift/origin/pull/25314 has not yet merged to sync them. If the e2e tests run out of the vendored kube, note that there have been upstream changes to PDBs, both in the code and in the e2e test:

https://github.com/kubernetes/kubernetes/pull/91342 (change to code)
https://github.com/kubernetes/kubernetes/pull/92991 (e2e fix)

Sending to apiserver, since it handles the eviction API and PDB enforcement.
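Since the apiserver enforces the PDB at the eviction subresource, the "updated to allow it" half of the test amounts to loosening the PDB and retrying the same eviction (continuing the hedged sketch above):

$ kubectl patch pdb block-all --type merge -p '{"spec":{"minAvailable":"0%"}}'
$ curl -s -o /dev/null -w '%{http_code}\n' -X POST \
    -H 'Content-Type: application/json' \
    -d '{"apiVersion":"policy/v1beta1","kind":"Eviction","metadata":{"name":"demo-pod","namespace":"default"}}' \
    http://127.0.0.1:8001/api/v1/namespaces/default/pods/demo-pod/eviction
# expect HTTP 201 once the PDB permits the disruption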
Recovering bug state after the PR got green-buttoned [1].

[1]: https://github.com/openshift/origin/pull/25335#event-3596734606
Running the following command to search test runs from the past 7 days shows the failure-match rate trending downward, with no job at a 100% match rate:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it&maxAge=168h' | grep 'failures match' | sort
endurance-e2e-aws-4.4 - 5 runs, 100% failed, 20% of failures match
osde2e-stage-aws-conformance-default - 7 runs, 43% failed, 67% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 520 runs, 5% failed, 4% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 70 runs, 100% failed, 1% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-single - 26 runs, 69% failed, 6% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn - 95 runs, 98% failed, 1% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-ovn-hybrid-step-registry - 99 runs, 99% failed, 2% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-ovn-step-registry - 91 runs, 99% failed, 1% of failures match
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws - 4 runs, 75% failed, 33% of failures match
pull-ci-openshift-installer-master-e2e-aws - 82 runs, 55% failed, 2% of failures match
pull-ci-openshift-installer-master-e2e-aws-fips - 107 runs, 90% failed, 1% of failures match
pull-ci-openshift-installer-release-4.5-e2e-vsphere - 3 runs, 100% failed, 33% of failures match
pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-origin - 9 runs, 100% failed, 11% of failures match
pull-ci-openshift-kubernetes-master-e2e-aws-fips - 67 runs, 94% failed, 2% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry - 102 runs, 85% failed, 3% of failures match
pull-ci-openshift-origin-master-e2e-aws-fips - 248 runs, 90% failed, 3% of failures match
pull-ci-openshift-origin-master-e2e-gcp - 89 runs, 57% failed, 10% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn - 18 runs, 100% failed, 6% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn - 20 runs, 100% failed, 5% of failures match
pull-ci-openshift-router-master-e2e - 8 runs, 63% failed, 20% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 97 runs, 60% failed, 3% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 50 runs, 58% failed, 3% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 50 runs, 84% failed, 2% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 49 runs, 96% failed, 2% of failures match
release-openshift-ocp-installer-e2e-openstack-4.4 - 22 runs, 100% failed, 5% of failures match

The fix works as expected; moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196