Bug 1993376

Summary:	periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade is permfailing
Product:	OpenShift Container Platform	Reporter:	Ben Parees <bparees>
Component:	Installer	Assignee:	Vadim Rutkovsky <vrutkovs>
Installer sub component:	openshift-installer	QA Contact:	Gaoyun Pei <gpei>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	medium	CC:	aos-bugs, jdelft, mfojtik, mmasters, mstaeble, sippy, surbania, wking
Version:	4.5
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:	job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade=all job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade=all job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade=all job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade=all
Last Closed:	2022-03-10 16:05:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ben Parees 2021-08-12 21:30:27 UTC

job:
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade 

and
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade 

are always failing in CI; see the testgrid results:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade


sample job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade/1425875339757752320

level=error msg="Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ci-op-rmph6w2z-47a94.ci.azure.devcluster.openshift.com: []"

Comment 1 Ben Parees 2021-08-12 21:44:16 UTC

This is filed against 4.5 since it's the 4.5 install that is failing. We need 4.5 installs to succeed so we can test upgrades to 4.6, so this bug cannot be closed as "4.5 is EOL".

Comment 2 Miciah Dashiel Butler Masters 2021-08-17 16:23:32 UTC

Luigi will investigate the cause of the failure (ingress, auth, or other).

Comment 3 Luigi Mario Zuccarelli 2021-08-20 08:48:45 UTC

While investigating gather-extra/artifacts/deployments.json, we noticed this:


    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2021-08-12T17:59:51Z",
                "lastUpdateTime": "2021-08-12T17:59:51Z",
                "message": "Deployment does not have minimum availability.",
                "reason": "MinimumReplicasUnavailable",
                "status": "False",
                "type": "Available"
            },
            {
                "lastTransitionTime": "2021-08-12T17:59:51Z",
                "lastUpdateTime": "2021-08-12T17:59:51Z",
                "message": "pods \"ingress-operator-f679584dd-\" is forbidden: unable to validate against any security context constraint: []",
                "reason": "FailedCreate",
                "status": "True",
                "type": "ReplicaFailure"
            },
            {
                "lastTransitionTime": "2021-08-12T18:09:52Z",
                "lastUpdateTime": "2021-08-12T18:09:52Z",
                "message": "ReplicaSet \"ingress-operator-f679584dd\" has timed out progressing.",
                "reason": "ProgressDeadlineExceeded",
                "status": "False",
                "type": "Progressing"
            }
        ],
        "observedGeneration": 1,
        "unavailableReplicas": 1
    }


Could this be the root cause (since the ingress operator is not deployed)?

Could you please investigate?
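
For anyone repeating this check on another run, here is a minimal sketch that lists unhealthy Deployment conditions like the ones quoted above. It assumes deployments.json is a standard Kubernetes List object with an "items" array; the function name failing_conditions and the choice of which condition types count as unhealthy are illustrative assumptions, not part of any CI tooling.

```python
import json


def failing_conditions(deployment_list):
    """Return (namespace/name, type, reason, message) for unhealthy conditions.

    Assumption: the input is a Kubernetes List of Deployments ({"items": [...]}).
    """
    # ReplicaFailure is a problem when "True"; Available and Progressing when "False".
    bad_when_true = {"ReplicaFailure"}
    bad_when_false = {"Available", "Progressing"}
    results = []
    for item in deployment_list.get("items", []):
        meta = item.get("metadata", {})
        name = f'{meta.get("namespace", "?")}/{meta.get("name", "?")}'
        for cond in item.get("status", {}).get("conditions", []):
            ctype, status = cond.get("type"), cond.get("status")
            if (ctype in bad_when_true and status == "True") or (
                ctype in bad_when_false and status == "False"
            ):
                results.append((name, ctype, cond.get("reason"), cond.get("message")))
    return results


def load_and_report(path):
    """Read a deployments.json file and return its unhealthy conditions."""
    with open(path) as handle:
        return failing_conditions(json.load(handle))
```

Calling load_and_report("gather-extra/artifacts/deployments.json") on the sample job should surface the three conditions shown above, provided the file has the assumed layout.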

Comment 4 Sergiusz Urbaniak 2021-08-20 11:53:41 UTC

It seems the openshift-ingress-operator namespace was not created in the first place:

$ ls namespaces/openshift-ingress-operator
ls: cannot access 'namespaces/openshift-ingress-operator': No such file or directory

Based on this, the SCC validation logic is not able to validate SC candidates. Reassigning to installer, since the cluster seems to be in a half-installed state.
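
The same check can be run for several namespaces at once. This is only a sketch: it assumes the gathered artifacts contain a namespaces/ directory with one subdirectory per namespace, as the ls output above suggests, and missing_namespaces is a hypothetical helper name.

```python
from pathlib import Path


def missing_namespaces(artifacts_root, expected):
    """Return the expected namespaces that have no directory under namespaces/.

    Assumption: artifacts are laid out as <artifacts_root>/namespaces/<namespace>/.
    """
    root = Path(artifacts_root) / "namespaces"
    return [ns for ns in expected if not (root / ns).is_dir()]
```

For example, missing_namespaces(".", ["openshift-ingress-operator", "openshift-authentication"]) would list whichever of the two namespaces is absent from the gathered data.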

Comment 5 Russell Teague 2021-08-24 20:24:08 UTC

Will review for a future sprint.

Comment 7 Matthew Staebler 2021-12-07 19:54:30 UTC

The metrics.yml manifest created in the ipi-install-hosted-loki step cannot be applied to the cluster. This manifest was added in https://github.com/openshift/release/pull/17604.

I am assigning this BZ to Vadim, since he is the author of the PR. Please change the component to whichever team owns the CI step.

Comment 8 Matthew Staebler 2021-12-07 19:57:10 UTC

Sample of the error from bootkube.sh
~~~~
Nov 09 09:44:35 ci-op-z6d9016h-47a94-t7h6k-bootstrap bootkube.sh[2596]: "metrics.yml": unable to get REST mapping for "metrics.yml": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
~~~~
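
To find every manifest hitting this error in a bootstrap journal, a small filter along these lines may help. It is a sketch that assumes the message format matches the sample line above exactly; the names NO_MATCH and unmapped_manifests are made up for illustration.

```python
import re

# Pattern modelled on the bootkube.sh sample line above (assumed stable format).
NO_MATCH = re.compile(
    r'"(?P<manifest>[^"]+)": unable to get REST mapping for "[^"]+": '
    r'no matches for kind "(?P<kind>[^"]+)" in version "(?P<version>[^"]+)"'
)


def unmapped_manifests(log_lines):
    """Return sorted, de-duplicated (manifest, kind, version) tuples."""
    found = set()
    for line in log_lines:
        match = NO_MATCH.search(line)
        if match:
            found.add((match["manifest"], match["kind"], match["version"]))
    return sorted(found)
```

The error tends to repeat on every retry, so the result is de-duplicated to one entry per manifest, kind, and version.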

Comment 9 W. Trevor King 2021-12-07 20:27:03 UTC

That's a surprising error, because it certainly looks like 4.5 has v1 ServiceMonitors:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.5-e2e-aws/1435283192725639168/artifacts/e2e-aws/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-c204179f6befa1a1f021b6c14578df9ca65fb359e3aab1be51660fe3f9e53670/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/servicemonitors.monitoring.coreos.com.yaml | yaml2json | jq -r '.spec.versions[].name'
  v1

And the CVO has included its own v1 ServiceMonitor manifest since 2019 [1].  Maybe whatever adds the ServiceMonitor CRD is slower to come up, and cluster-bootstrap gives up on the manifest before it's available?

[1]: https://github.com/openshift/cluster-version-operator/commit/d3bb5480b9d6062aae3297fba33b6547fcf0fd99

Comment 10 Vadim Rutkovsky 2021-12-08 09:22:11 UTC

IIUC, the issue is that in 4.5 the monitoring CRDs are installed after the bootstrap phase, so when the ipi-install-hosted-loki step injects this manifest during the bootstrap phase, the installation stalls.
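
To illustrate the ordering problem (this is not how cluster-bootstrap or the CI step is implemented), here is a sketch that splits manifests into those whose kind is already served and those that would have to wait. The function name split_by_availability and the "served" set are assumptions for the example.

```python
def split_by_availability(manifests, served):
    """Split manifests into (ready, deferred) lists.

    manifests: iterable of dicts with "apiVersion" and "kind" keys.
    served: set of (apiVersion, kind) pairs the API server currently serves
            (hypothetical input; a real check would query API discovery).
    """
    ready, deferred = [], []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        (ready if key in served else deferred).append(manifest)
    return ready, deferred
```

With a 4.5-style bootstrap where ("monitoring.coreos.com/v1", "ServiceMonitor") is not yet in the served set, a ServiceMonitor manifest lands in the deferred list, which corresponds to the manifest that can never be applied during bootstrap.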

Comment 12 Gaoyun Pei 2022-01-17 00:32:22 UTC

I could not find this error when searching CI now, so I am moving this to VERIFIED.

Comment 15 errata-xmlrpc 2022-03-10 16:05:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056