job: periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade and periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade are always failing in CI.

testgrid results: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade

sample job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade/1425875339757752320

~~~~
level=error msg="Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ci-op-rmph6w2z-47a94.ci.azure.devcluster.openshift.com: []"
~~~~
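For anyone triaging a reproduced install, the degraded conditions quoted above can be read straight off the cluster; a minimal sketch, assuming you still have access to a live affected cluster (the CI clusters themselves are already torn down):

~~~~
$ oc get clusteroperator authentication \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'
~~~~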
This is filed against 4.5 since it is the 4.5 install that is failing. However, we need 4.5 installs to succeed so we can test upgrades to 4.6, so this bug cannot be closed as "4.5 is EOL".
Luigi will investigate the cause of the failure (ingress, auth, or other).
Investigating gather-extra/artifacts/deployments.json, we noticed this in the ingress-operator Deployment status:

~~~~
"status": {
  "conditions": [
    {
      "lastTransitionTime": "2021-08-12T17:59:51Z",
      "lastUpdateTime": "2021-08-12T17:59:51Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2021-08-12T17:59:51Z",
      "lastUpdateTime": "2021-08-12T17:59:51Z",
      "message": "pods \"ingress-operator-f679584dd-\" is forbidden: unable to validate against any security context constraint: []",
      "reason": "FailedCreate",
      "status": "True",
      "type": "ReplicaFailure"
    },
    {
      "lastTransitionTime": "2021-08-12T18:09:52Z",
      "lastUpdateTime": "2021-08-12T18:09:52Z",
      "message": "ReplicaSet \"ingress-operator-f679584dd\" has timed out progressing.",
      "reason": "ProgressDeadlineExceeded",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 1,
  "unavailableReplicas": 1
}
~~~~

Could this be the root cause? (The ingress operator is not deployed.) Could you please investigate?
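For reference, these conditions can be pulled out of the gather-extra artifacts directly; a small sketch, assuming deployments.json is the usual List object with an .items array:

~~~~
$ jq -r '.items[]
    | select(.metadata.name | startswith("ingress-operator"))
    | .status.conditions[]
    | "\(.type)=\(.status) (\(.reason)): \(.message)"' deployments.json
~~~~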
It seems the openshift-ingress-operator namespace was not created in the first place:

~~~~
$ ls namespaces/openshift-ingress-operator
ls: cannot access 'namespaces/openshift-ingress-operator': No such file or directory
~~~~

Based on this, the SCC validation logic is not able to validate the security context candidates. Reassigning to the installer, since the cluster appears to be in a half-installed state.
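If the failure can be reproduced outside of CI, the same checks can be run against the live cluster instead of the must-gather directory; a sketch, using the resource names referenced above:

~~~~
$ oc get namespace openshift-ingress-operator
$ oc -n openshift-ingress-operator get deployment ingress-operator -o yaml
$ oc get scc
~~~~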
Will review for a future sprint.
The metrics.yml manifest created in the ipi-install-hosted-loki step cannot be applied to the cluster. This manifest was added in https://github.com/openshift/release/pull/17604. I am assigning this BZ to Vadim, since he is the author of the PR. Please change the component to whichever team owns the CI step.
Sample of the error from bootkube.sh:

~~~~
Nov 09 09:44:35 ci-op-z6d9016h-47a94-t7h6k-bootstrap bootkube.sh[2596]: "metrics.yml": unable to get REST mapping for "metrics.yml": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
~~~~
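For context, the object bootkube is rejecting here is a ServiceMonitor. A hypothetical minimal example of that kind of object is below (the names and namespace are assumptions; the real metrics.yml contents are in the PR linked above). Nothing like it can be applied until the servicemonitors.monitoring.coreos.com CRD is registered:

~~~~
$ oc apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki-promtail          # hypothetical name, not the real metrics.yml contents
  namespace: openshift-monitoring
spec:
  selector:
    matchLabels:
      app: promtail
  endpoints:
  - port: metrics
EOF
~~~~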
That's a surprising error, because it certainly looks like 4.5 has v1 ServiceMonitors:

~~~~
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.5-e2e-aws/1435283192725639168/artifacts/e2e-aws/gather-must-gather/artifacts/must-gather.tar |
    tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-c204179f6befa1a1f021b6c14578df9ca65fb359e3aab1be51660fe3f9e53670/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/servicemonitors.monitoring.coreos.com.yaml |
    yaml2json | jq -r '.spec.versions[].name'
v1
~~~~

And the CVO has included its own v1 ServiceMonitor manifest since 2019 [1]. Maybe whatever adds the ServiceMonitor CRD is slower to come up, and cluster-bootstrap gives up on the manifest before it's available?

[1]: https://github.com/openshift/cluster-version-operator/commit/d3bb5480b9d6062aae3297fba33b6547fcf0fd99
If I understand correctly, the issue is that in 4.5 the monitoring CRDs are installed after the bootstrap phase, so when the ipi-install-hosted-loki step injects this manifest during the bootstrap phase, the installation stalls.
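If that is the case, one possible workaround (a sketch only, not necessarily what the release repo ended up doing) would be for the step to apply the ServiceMonitor after the cluster is up, waiting for the CRD to be established, rather than injecting it as a bootstrap manifest:

~~~~
$ oc wait --for=condition=Established crd/servicemonitors.monitoring.coreos.com --timeout=300s
$ oc apply -f metrics.yml
~~~~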
Couldn't find this error when searching CI now; moving this to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056