Over the last 2 weeks there have been multiple failures of the conformance tests in the 4.7 e2e-metal-ipi and e2e-metal-ipi-ovn-ipv6 jobs. These are only happening for 4.7, not 4.8, 4.9, etc. They started failing on 10-21 for the IPv4 tests and 10-23 for the IPv6 tests. Prior to that there were no failures for these conformance tests, although at one point failures in these tests were addressed by introducing a delay (see https://github.com/kubernetes/kubernetes/pull/90452) to work around slow responses from the API server.
See:
e2e-metal-ipi - https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-blocking#periodic-ci-openshift-release-master-nightly-4.7-e2e-metal-ipi&include-filter-by-regex=CustomResourcePublishOpenAPI
e2e-metal-ipi-ovn-ipv6 - https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-blocking#periodic-ci-openshift-release-master-nightly-4.7-e2e-metal-ipi-ovn-ipv6&include-filter-by-regex=CustomResourcePublishOpenAPI
As noted, there have been fixes upstream [1] and downstream [2] to work around access to the API server for these conformance tests specifically. https://github.com/kubernetes/kubernetes/issues/86967 shows potential issues with the API server for these tests. It's not clear, however, whether this is an API server issue, why it only affects 4.7, and why the tests started failing consistently over the last couple of weeks.
Some patterns we've seen and notes:
- Many of the tests are failing on a string comparison, for example https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-e2e-metal-ipi-ovn-ipv6/1452055659775266816. As Andrea noted: "... the result returned by the explain command contains a capitalized name for the cr: E2e-test-crd-publish-openapi-6714-crd - while in all the previous issued commands the cr name is always lowercase."
- All of the conformance tests passed today (11/5), although it's not clear what has changed.
- There is a fix to improve the waits for the API server specific to metal-ipi: https://github.com/openshift/release/pull/23258. This will be picked up for the 4.7 tests, and we'll be able to see any effect over the next few days.
[1] https://github.com/kubernetes/kubernetes/pull/90452
[2] https://github.com/openshift/origin/pull/24920
Andrea has found that some of the tests fail because the CRD kind is being returned with the first letter capitalized when lower case was expected. For example, this test: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-e2e-metal-ipi-ovn-ipv6/1452055659775266816 fails here:
    Oct 24 00:40:06.132: INFO: stdout: "KIND: E2e-test-crd-publish-openapi-6714-crd\nVERSION: crd-publish-openapi-test-unknown-at-root.example.com/v1\n\nDESCRIPTION:\n <empty>\n"
when what was expected was:
    Nov 5 09:26:28.670: INFO: stdout: "KIND: e2e-test-crd-publish-openapi-7930-crd\nVERSION: crd-publish-openapi-test-unknown-at-root.example.com/v1\n\nDESCRIPTION:\n <empty>\n"
The code to use lower case for the CRD was added 3 months ago here: https://github.com/kubernetes/kubernetes/pull/102417/files
This fix was vendored 3 months ago, so it's not clear why the kind is not being returned in lower case: https://github.com/openshift/origin/blame/69d419c8e3f86005a13bb93c2e029e370fdaf26e/vendor/k8s.io/kubernetes/test/utils/crd/crd_util.go#L51
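To make the difference between the two outputs concrete, here is a minimal Python sketch that pulls the KIND value out of each stdout string quoted above and checks whether it is all lower case. This is only an illustration and not the upstream test code; the helper name kind_of is made up for this example.

```python
def kind_of(explain_stdout: str) -> str:
    """Return the value of the KIND line from explain-style output.

    Hypothetical helper for illustration; the real e2e test does its own checks.
    """
    for line in explain_stdout.splitlines():
        if line.startswith("KIND:"):
            return line[len("KIND:"):].strip()
    raise ValueError("no KIND line found")


# The two stdout strings as quoted in the job logs above.
failing_stdout = (
    "KIND: E2e-test-crd-publish-openapi-6714-crd\n"
    "VERSION: crd-publish-openapi-test-unknown-at-root.example.com/v1\n\n"
    "DESCRIPTION:\n <empty>\n"
)
passing_stdout = (
    "KIND: e2e-test-crd-publish-openapi-7930-crd\n"
    "VERSION: crd-publish-openapi-test-unknown-at-root.example.com/v1\n\n"
    "DESCRIPTION:\n <empty>\n"
)

for name, out in (("failing", failing_stdout), ("passing", passing_stdout)):
    kind = kind_of(out)
    print(name, kind, "lowercase" if kind == kind.lower() else "capitalized")
```

Only the first character of the kind differs in case between the two runs; the numeric suffix differs because each run generates its own CRD name.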
The last couple of runs actually succeeded.
Looking further at the test code, it seems that the capitalization did not cause the test failure. Unfortunately the log traces appear to be truncated, so the reason for the failure remains unclear. The logs do show that in many cases the test is marked as failed while it was still in the 45-second wait:
| Nov 3 23:27:20.963: INFO: sleeping 45 seconds before running the actual tests, we hope that during all API servers converge during that window, see "https://github.com/kubernetes/kubernetes/pull/90452" for more
| ...
| failed: (37.9s) 2021-11-03T23:27:42 "[sig-api-machinery] CustomResourcePublishOpenAPI [Privileged:ClusterAdmin] works for CRD preserving unknown fields at the schema root [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"
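The timing can be checked with a little arithmetic on the two timestamps in the log excerpt. The Python sketch below assumes that both timestamps are in the same time zone, that the sleep line is from 2021 like the failure line, and that the timestamp on the "failed" line marks when the failure was reported.

```python
from datetime import datetime, timedelta

# Length of the sleep announced in the log line.
SLEEP_WINDOW = timedelta(seconds=45)

# "Nov 3 23:27:20.963": the year is an assumption, taken from the failure line.
sleep_started = datetime(2021, 11, 3, 23, 27, 20, 963000)
# "2021-11-03T23:27:42": assumed to be when the failure was reported.
test_failed = datetime(2021, 11, 3, 23, 27, 42)

into_sleep = test_failed - sleep_started
remaining = SLEEP_WINDOW - into_sleep

print(f"failure reported {into_sleep.total_seconds():.3f}s into the sleep")
print(f"{remaining.total_seconds():.3f}s of the sleep were still left")
```

Under those assumptions the failure is reported roughly 21 seconds after the sleep began, so less than half of the 45-second window had elapsed.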
It looks like the apiserver gets deleted (Deleted pod: apiserver) and restarted in both the passing and failing cases; it's just that the deletion occurs during the wait when the test passes, and outside of the wait when the test fails. Andrea has added an additional wait here https://github.com/openshift/release/pull/23419 and we'll see if this helps.
Removing the blocker flag as this is an intermittent failure.
Further analysis revealed that in the cases where a CustomResourcePublishOpenAPI test failed, the kube-apiservers were not in a steady state. In some scenarios the image registry operator hostname change is detected by the kube-apiserver-operator late, after the e2e test execution has already started (and in some cases after the related clusteroperator already had its Progressing field set to false). The additional waiting condition in https://github.com/openshift/release/pull/23419 should give the cluster more time to reach a steady state, resulting in more robust test execution.
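As a generic illustration of why a single "Progressing is false" observation is not enough, the Python sketch below only reports success once a condition has held for several consecutive polls. It is not the logic from PR 23419; the function name, parameters and the simulated samples are all hypothetical.

```python
import time
from typing import Callable


def wait_until_stable(
    is_settled: Callable[[], bool],
    required_consecutive: int,
    max_polls: int,
    interval_s: float = 0.0,
) -> bool:
    """Return True once is_settled() has been True for required_consecutive
    polls in a row; return False if max_polls is exhausted first."""
    streak = 0
    for _ in range(max_polls):
        if is_settled():
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            # Any unsettled observation resets the streak.
            streak = 0
        time.sleep(interval_s)
    return False


# Simulated observations: the operator looks settled, then a late change
# flips it back before it finally settles for good.
samples = iter([True, True, False, False, True, True, True])
settled = wait_until_stable(lambda: next(samples), required_consecutive=3, max_polls=7)
print("cluster considered steady:", settled)
```

With these simulated samples a check that trusts the first True would start the tests too early, while the streak-based wait only succeeds on the three trailing True observations.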
https://github.com/openshift/release/pull/23579 has merged, could someone please confirm it's working now and move to verified?
Update: the issue was related to our CI only. After PR 23579 merged there were still some flakes, very likely due to a client cache issue. Since the problem does not affect versions > 4.7, and since the CI metal-ipi jobs in 4.7 only exercise a minimal test set (whereas in versions >= 4.8 we run the whole conformance/parallel suite), those tests have been excluded from the minimal list, so we can close this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056