[sig-apps] DisruptionController should block an eviction until the PDB is updated to allow it
is failing frequently in CI; see search results:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it' | grep 'failures match' | sort
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 157 runs, 100% failed, 2% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 23 runs, 100% failed, 4% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 54 runs, 74% failed, 45% of failures match
pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws - 4 runs, 75% failed, 33% of failures match
pull-ci-operator-framework-operator-registry-master-e2e-aws - 8 runs, 63% failed, 80% of failures match
rehearse-10454-pull-ci-openshift-cloud-credential-operator-master-e2e-azure - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 1 runs, 100% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-origin-master-e2e-gcp - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6 - 8 runs, 100% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 9 runs, 89% failed, 63% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 17 runs, 100% failed, 24% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 4 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 8 runs, 100% failed, 25% of failures match
release-openshift-ocp-installer-e2e-openstack-ppc64le-4.6 - 2 runs, 100% failed, 50% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 9 runs, 100% failed, 44% of failures match
release-openshift-origin-installer-e2e-aws-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-shared-vpc-4.5 - 2 runs, 50% failed, 100% of failures match
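For context, what the test exercises can be reproduced by hand: while evicting a pod would violate its PDB, the eviction subresource refuses with 429 until the PDB is updated to allow disruption. A minimal sketch, assuming kubectl access to the default namespace; the pod and PDB names here are hypothetical:
$ kubectl run demo --image=busybox --restart=Never -- sleep 3600
$ cat <<EOF | kubectl apply -f -
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: demo-pdb
spec:
  minAvailable: 1        # with a single pod, disruptionsAllowed stays 0
  selector:
    matchLabels:
      run: demo          # kubectl run labels the pod run=demo
EOF
$ kubectl proxy --port=8001 &
$ curl -s -o /dev/null -w '%{http_code}\n' -X POST \
    -H 'Content-Type: application/json' \
    -d '{"apiVersion":"policy/v1beta1","kind":"Eviction","metadata":{"name":"demo","namespace":"default"}}' \
    http://127.0.0.1:8001/api/v1/namespaces/default/pods/demo/eviction
# Expect 429 while eviction would violate the PDB; patch the PDB to allow
# disruption (e.g. minAvailable: 0) and the same request should return 201.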
Anchoring on a specific release job, let's choose , which had:
Jul 27 21:14:39.598: INFO: unable to fetch logs for pods: rs-8c49z[e2e-disruption-2242].container[busybox].error=the server rejected our request for an unknown reason (get pods rs-8c49z)
fail [k8s.io/kubernetes/test/e2e/apps/disruption.go:323]: Expected an error, got nil
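The failing assertion lives in the vendored e2e suite; to read the code around that line, something like this works (the release-1.19 branch here is an assumption based on the rebase level, and line numbers drift between versions):
$ curl -s https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.19/test/e2e/apps/disruption.go | sed -n '310,330p'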
History in  suggests the test became flaky between  and . It's not a 100% failure rate, so it's possible  squeaked by with the broken code (or whatever the trigger is), but if that's a real bracket the suspects are:
$ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286818801559539712/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]') <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1286844539994116096/artifacts/release-images-latest/release-images-latest | jq -r '[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]')
--- /dev/fd/63 2020-07-27 20:42:39.654779459 -0700
+++ /dev/fd/62 2020-07-27 20:42:39.654779459 -0700
@@ -4 +4 @@
@@ -27 +27 @@
@@ -36 +36 @@
@@ -38 +38 @@
@@ -50 +50 @@
@@ -52,2 +52,2 @@
@@ -76 +76 @@
I suspect hyperkube:
$ git --no-pager log --first-parent --oneline 9ba0c166cae..53f1b9d6f8d
53f1b9d6f8d (openshift/release-4.7, openshift/release-4.6, openshift/master) Merge pull request #166 from marun/rebase-1.19
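The commit range in that git-log call comes straight from the two payloads; a sketch pulling the hyperkube commit out of each release-images-latest (same JSON schema as the diff above):
$ for job in 1286818801559539712 1286844539994116096; do
    curl -s "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/${job}/artifacts/release-images-latest/release-images-latest" |
      jq -r '.spec.tags[] | select(.name == "hyperkube") | .annotations["io.openshift.build.commit.id"]'
  done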
Not clear to me if the change is an issue on the test-suite side, or the kubelet side, or what.
Dan Li, it does not have multi-arch implications. ppc64le was just one of many test suites that were flaking.
Don't imagine this is a Node team bug, but I'll try to figure out where it should go.
Current theory is this flake came in on the rebase, right Trevor?
> Current theory is this flake came in on the rebase, right Trevor?
Yup. I just assigned it to the node team because the kubelet (I think?) comes out of openshift/kubernetes now, and "PDB" seemed like a node-touching thing. Could also be on the API-server/controller side of the openshift/kubernetes output.
Looks like there might be some skew between openshift/kubernetes and origin's vendored kube atm.
https://github.com/openshift/origin/pull/25314 has not yet merged to sync them.
If the e2e tests run out of the vendored kube, note that there have been upstream changes to PDBs, both to the code and to the e2e test:
https://github.com/kubernetes/kubernetes/pull/91342 (change to code)
https://github.com/kubernetes/kubernetes/pull/92991 (e2e fix)
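One quick way to check whether origin's vendored copy already carries the upstream e2e fix, assuming a local clone of openshift/origin and the standard vendor layout:
$ git -C origin log --oneline -3 -- vendor/k8s.io/kubernetes/test/e2e/apps/disruption.go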
Sending to apiserver since it handles the eviction API and PDB enforcement.
Recovering bug state after the PR got green-buttoned.
Running the following command to search test runs from the past 7 days shows the failures-match rate trending markedly downward, with no job at a 100% match rate.
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it&maxAge=168h' | grep 'failures match' | sort
endurance-e2e-aws-4.4 - 5 runs, 100% failed, 20% of failures match
osde2e-stage-aws-conformance-default - 7 runs, 43% failed, 67% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 520 runs, 5% failed, 4% of failures match
promote-release-openshift-okd-machine-os-content-e2e-aws-4.6 - 70 runs, 100% failed, 1% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-single - 26 runs, 69% failed, 6% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn - 95 runs, 98% failed, 1% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-ovn-hybrid-step-registry - 99 runs, 99% failed, 2% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-ovn-step-registry - 91 runs, 99% failed, 1% of failures match
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws - 4 runs, 75% failed, 33% of failures match
pull-ci-openshift-installer-master-e2e-aws - 82 runs, 55% failed, 2% of failures match
pull-ci-openshift-installer-master-e2e-aws-fips - 107 runs, 90% failed, 1% of failures match
pull-ci-openshift-installer-release-4.5-e2e-vsphere - 3 runs, 100% failed, 33% of failures match
pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-origin - 9 runs, 100% failed, 11% of failures match
pull-ci-openshift-kubernetes-master-e2e-aws-fips - 67 runs, 94% failed, 2% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry - 102 runs, 85% failed, 3% of failures match
pull-ci-openshift-origin-master-e2e-aws-fips - 248 runs, 90% failed, 3% of failures match
pull-ci-openshift-origin-master-e2e-gcp - 89 runs, 57% failed, 10% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn - 18 runs, 100% failed, 6% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn - 20 runs, 100% failed, 5% of failures match
pull-ci-openshift-router-master-e2e - 8 runs, 63% failed, 20% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 97 runs, 60% failed, 3% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 50 runs, 58% failed, 3% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 50 runs, 84% failed, 2% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 49 runs, 96% failed, 2% of failures match
release-openshift-ocp-installer-e2e-openstack-4.4 - 22 runs, 100% failed, 5% of failures match
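To see the trend rather than a single window, the same query can be swept over several maxAge values (a sketch; only the window parameter changes):
$ for age in 336h 168h 72h 24h; do
    echo "== maxAge=${age} =="
    w3m -dump -cols 200 "https://search.ci.openshift.org/?search=DisruptionController%20should%20block%20an%20eviction%20until%20the%20PDB%20is%20updated%20to%20allow%20it&maxAge=${age}" | grep 'failures match' | sort
  done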
The fix works as expected; moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.