Bug 1960780

Summary: CI: failed to create PDB "service-test" the server could not find the requested resource
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Networking
Assignee: W. Trevor King <wking>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: high
CC: aos-bugs, rgudimet
Version: 4.8
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 23:08:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description W. Trevor King 2021-05-14 20:32:45 UTC
Seen in three different update CI jobs from 4.7.11 to 4.8.0-fc.4 [1,2,3]:

disruption_tests: [sig-network-edge] Application behind service load balancer with PDB is not disrupted 
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	1h13m22s
fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:115]: Unexpected error:
    <*errors.errorString | 0xc000bf70a0>: {
        s: "failed to create PDB \"service-test\" the server could not find the requested resource",
    }
    failed to create PDB "service-test" the server could not find the requested resource
occurred

Extremely common in 4.7 -> 4.8 update CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=failed+to+create+PDB.*the+server+could+not+find+the+requested+resource' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 20 runs, 100% failed, 85% of failures match = 85% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 6 runs, 100% failed, 67% of failures match = 67% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 12 runs, 100% failed, 83% of failures match = 83% impact
rehearse-18540-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
rehearse-18540-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
rehearse-18540-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
rehearse-18540-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-launch-azure (all) - 12 runs, 83% failed, 10% of failures match = 8% impact
release-openshift-origin-installer-launch-gcp (all) - 63 runs, 41% failed, 8% of failures match = 3% impact
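
For reading the summary lines above: the "impact" figure appears to be the failure rate multiplied by the match rate, i.e. the share of all runs that hit this error. A minimal Go sketch of that arithmetic follows. The names summaryRE, parseSummary, and impact are made up for illustration, and the rounding is an assumption: the search tool presumably works from raw run counts, so recomputing from the already-rounded percentages can be off by a point.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
)

// summaryRE matches the tail of a summary line such as
// "12 runs, 83% failed, 10% of failures match = 8% impact".
var summaryRE = regexp.MustCompile(`(\d+) runs, (\d+)% failed, (\d+)% of failures match = (\d+)% impact`)

// parseSummary returns [runs, failed%, match%, reported impact%].
func parseSummary(line string) (vals [4]int, err error) {
	m := summaryRE.FindStringSubmatch(line)
	if m == nil {
		return vals, fmt.Errorf("no summary in %q", line)
	}
	for i := range vals {
		if vals[i], err = strconv.Atoi(m[i+1]); err != nil {
			return vals, err
		}
	}
	return vals, nil
}

// impact recomputes the impact percentage from the two displayed
// percentages (assumption: simple product, rounded to the nearest point).
func impact(failedPct, matchPct int) int {
	return int(math.Round(float64(failedPct) * float64(matchPct) / 100))
}

func main() {
	line := "release-openshift-origin-installer-launch-azure (all) - 12 runs, 83% failed, 10% of failures match = 8% impact"
	v, err := parseSummary(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("runs=%d computed=%d%% reported=%d%%\n", v[0], impact(v[1], v[2]), v[3])
}
```

So the launch jobs at 8% and 3% impact are mostly failing for other reasons, while the 4.7 -> 4.8 update jobs are failing almost entirely on this error.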

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1393250382481723392
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1393269399615442944
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1393269438639247360

Comment 1 W. Trevor King 2021-05-14 20:33:40 UTC
Possibly the 4.8 test suite assumes something is present that is not present on 4.7?

Comment 2 W. Trevor King 2021-05-14 20:44:14 UTC
TestGrid [1] shows this transitioning to perma-fail between [2,3].  Comparing the target versions:

$ REF_A=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1391245505060671488/artifacts/release/artifacts/release-images-latest
$ REF_B=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1391607897154129920/artifacts/release/artifacts/release-images-latest
$ JQ='[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]'
$ diff -U0 <(curl -s "${REF_A}" | jq -r "${JQ}") <(curl -s "${REF_B}" | jq -r "${JQ}")
--- /dev/fd/63  2021-05-14 13:38:59.849002076 -0700
+++ /dev/fd/62  2021-05-14 13:38:59.850002076 -0700
@@ -26 +26 @@
-cluster-etcd-operator https://github.com/openshift/cluster-etcd-operator/commit/b6530d132942cd84bec9e2a76a7386d4141cca78
+cluster-etcd-operator https://github.com/openshift/cluster-etcd-operator/commit/b54aaf90c1f0468730270163e8423ca23b27056c
@@ -35 +35 @@
-cluster-network-operator https://github.com/openshift/cluster-network-operator/commit/91af127c2d693adbc357ab14cf0318de44409a14
+cluster-network-operator https://github.com/openshift/cluster-network-operator/commit/103304d59bb26fdaadb0170f76c012775d4a979f
@@ -45 +45 @@
-console https://github.com/openshift/console/commit/f0b1fe1d368e50ccbf423de6165b9f870c0f06d5
+console https://github.com/openshift/console/commit/44c4fe0ea64befc9a9ebb54894bddba9a70b57a6
@@ -122 +122 @@
-ovirt-machine-controllers https://github.com/openshift/cluster-api-provider-ovirt/commit/2ac685fd451c03564072c873ca087b06a0934aab
+ovirt-machine-controllers https://github.com/openshift/cluster-api-provider-ovirt/commit/d6f563502a708f84489d629cf0b05212cb345c55
@@ -134 +134 @@
-tests https://github.com/openshift/origin/commit/2a813e180f73b3876b42bf04f27a5f66814560c2
+tests https://github.com/openshift/origin/commit/265b6ef959b8c8183ecda5aba10f7d437b87a9a9
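
For anyone who wants the same comparison without jq and diff, here is a rough Go sketch of what the pipeline above does: decode each release-images-latest document, build "name -> source-location/commit/commit-id" from the same two annotations, and report the names whose commit URL differs. The type and function names (imageStream, commitURLs, changed) are made up for illustration. The inline JSON is a trimmed stand-in for the real files, and the "example-unchanged" tag with its URL and commit is entirely hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// imageStream models only the fields the jq expression reads.
type imageStream struct {
	Spec struct {
		Tags []struct {
			Name        string            `json:"name"`
			Annotations map[string]string `json:"annotations"`
		} `json:"tags"`
	} `json:"spec"`
}

// commitURLs maps each tag name to "<source-location>/commit/<commit-id>".
func commitURLs(data []byte) (map[string]string, error) {
	var is imageStream
	if err := json.Unmarshal(data, &is); err != nil {
		return nil, err
	}
	out := make(map[string]string, len(is.Spec.Tags))
	for _, t := range is.Spec.Tags {
		out[t.Name] = t.Annotations["io.openshift.build.source-location"] +
			"/commit/" + t.Annotations["io.openshift.build.commit.id"]
	}
	return out, nil
}

// changed returns the sorted tag names that differ between a and b,
// including tags present on only one side.
func changed(a, b map[string]string) []string {
	var names []string
	for name, url := range a {
		if b[name] != url {
			names = append(names, name)
		}
	}
	for name := range b {
		if _, ok := a[name]; !ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

// Trimmed sample inputs; "example-unchanged" is a hypothetical tag.
const refA = `{"spec":{"tags":[
 {"name":"tests","annotations":{
  "io.openshift.build.source-location":"https://github.com/openshift/origin",
  "io.openshift.build.commit.id":"2a813e180f73b3876b42bf04f27a5f66814560c2"}},
 {"name":"example-unchanged","annotations":{
  "io.openshift.build.source-location":"https://example.invalid/repo",
  "io.openshift.build.commit.id":"0000000"}}]}}`

const refB = `{"spec":{"tags":[
 {"name":"tests","annotations":{
  "io.openshift.build.source-location":"https://github.com/openshift/origin",
  "io.openshift.build.commit.id":"265b6ef959b8c8183ecda5aba10f7d437b87a9a9"}},
 {"name":"example-unchanged","annotations":{
  "io.openshift.build.source-location":"https://example.invalid/repo",
  "io.openshift.build.commit.id":"0000000"}}]}}`

func main() {
	a, err := commitURLs([]byte(refA))
	if err != nil {
		panic(err)
	}
	b, err := commitURLs([]byte(refB))
	if err != nil {
		panic(err)
	}
	for _, name := range changed(a, b) {
		fmt.Printf("-%s %s\n+%s %s\n", name, a[name], name, b[name])
	}
}
```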

Checking in origin:

$ git --no-pager log --oneline --first-parent 2a813e180f..265b6ef959b
265b6ef959 Merge pull request #26054 from soltysh/k8s-1.21

Ah, so yeah, lots of changes came in there, and we broke this test-case (or maybe the new test-case logic is more robust and is turning up implementation breakage we previously missed).

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1391245505060671488
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1391607897154129920

Comment 3 W. Trevor King 2021-05-14 21:22:46 UTC
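Clayton suspects the PDB-creating fixture moved to policy/v1 PDBs in 4.8, but 4.7 only serves policy/v1beta1 PDBs, which would explain the "could not find the requested resource" error while the cluster is still on 4.7.  The suggested fix is to try to create v1beta1 PDBs, falling back to v1 PDBs.  Or maybe the other way around.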

Comment 5 W. Trevor King 2021-05-17 22:02:08 UTC
We really want this to give us a green signal for 4.7 -> 4.8 update CI.  But if for some reason it doesn't land in time, we can look more closely at those CI jobs to decide if this is the only failure mode we're seeing, and if so, I don't see a problem going GA without this fix.

Comment 7 W. Trevor King 2021-05-18 22:15:56 UTC
Still need the origin bump to vendor the new jig.

Comment 12 errata-xmlrpc 2021-07-27 23:08:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438