Bug 1993376 - periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade is permfailing
Summary: periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Vadim Rutkovsky
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-12 21:30 UTC by Ben Parees
Modified: 2022-03-10 16:05 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade=all job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade=all job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade=all job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade=all
Last Closed: 2022-03-10 16:05:18 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift release pull 24359 0 None open Bug 1993376: hosted-loki: skip ServiceMonitor on OCP <4.5 2021-12-09 09:39:18 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:05:37 UTC

Description Ben Parees 2021-08-12 21:30:27 UTC
jobs:
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade 

and
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade 

are always failing in CI; see the testgrid results:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade


sample job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade/1425875339757752320

level=error msg="Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ci-op-rmph6w2z-47a94.ci.azure.devcluster.openshift.com: []"

Comment 1 Ben Parees 2021-08-12 21:44:16 UTC
This is filed against 4.5 since it's the 4.5 install that is failing, but we need 4.5 installs to succeed so we can test upgrades to 4.6, so this bug cannot be closed as "4.5 is EOL".

Comment 2 Miciah Dashiel Butler Masters 2021-08-17 16:23:32 UTC
Luigi will investigate the cause of the failure (ingress, auth, or other).

Comment 3 Luigi Mario Zuccarelli 2021-08-20 08:48:45 UTC
While investigating gather-extra/artifacts/deployments.json, we noticed this:


    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2021-08-12T17:59:51Z",
                "lastUpdateTime": "2021-08-12T17:59:51Z",
                "message": "Deployment does not have minimum availability.",
                "reason": "MinimumReplicasUnavailable",
                "status": "False",
                "type": "Available"
            },
            {
                "lastTransitionTime": "2021-08-12T17:59:51Z",
                "lastUpdateTime": "2021-08-12T17:59:51Z",
                "message": "pods \"ingress-operator-f679584dd-\" is forbidden: unable to validate against any security context constraint: []",
                "reason": "FailedCreate",
                "status": "True",
                "type": "ReplicaFailure"
            },
            {
                "lastTransitionTime": "2021-08-12T18:09:52Z",
                "lastUpdateTime": "2021-08-12T18:09:52Z",
                "message": "ReplicaSet \"ingress-operator-f679584dd\" has timed out progressing.",
                "reason": "ProgressDeadlineExceeded",
                "status": "False",
                "type": "Progressing"
            }
        ],
        "observedGeneration": 1,
        "unavailableReplicas": 1
    }


Could this be the root cause? (The ingress operator is not deployed.)

Could you please investigate?
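
For reference, a sketch of how these conditions can be pulled out of the gathered artifacts (assuming the standard gather-extra layout and that jq is available):

~~~~
# List the ingress-operator Deployment's status conditions from the gathered deployments.json.
jq -r '.items[]
       | select(.metadata.name == "ingress-operator")
       | .status.conditions[]
       | "\(.type)=\(.status): \(.message)"' \
  gather-extra/artifacts/deployments.json
~~~~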

Comment 4 Sergiusz Urbaniak 2021-08-20 11:53:41 UTC
It seems the openshift-ingress-operator namespace was not created in the first place:

$ ls namespaces/openshift-ingress-operator
ls: cannot access 'namespaces/openshift-ingress-operator': No such file or directory

Based on this, the SCC validation logic is not able to validate against any SCC candidates. Reassigning to the Installer component; it seems the cluster is in a half-installed state.
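
A hedged way to confirm the half-installed state from the same gather, assuming gather-extra also captured clusteroperators.json:

~~~~
# Print each ClusterOperator alongside the status of its Available condition.
jq -r '.items[]
       | [.metadata.name,
          (.status.conditions[] | select(.type == "Available") | .status)]
       | @tsv' \
  gather-extra/artifacts/clusteroperators.json
~~~~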

Comment 5 Russell Teague 2021-08-24 20:24:08 UTC
Will review for a future sprint.

Comment 7 Matthew Staebler 2021-12-07 19:54:30 UTC
The metrics.yml manifest created in the ipi-install-hosted-loki step cannot be applied to the cluster. This manifest was added in https://github.com/openshift/release/pull/17604.

I am assigning this BZ to Vadim, since he is the author of the PR. Please change the component to whichever team owns the CI step.

Comment 8 Matthew Staebler 2021-12-07 19:57:10 UTC
Sample of the error from bootkube.sh
~~~~
Nov 09 09:44:35 ci-op-z6d9016h-47a94-t7h6k-bootstrap bootkube.sh[2596]: "metrics.yml": unable to get REST mapping for "metrics.yml": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
~~~~
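
If the installer's bootstrap log bundle was collected for the failed run, the same failure should also be greppable there (a sketch; the exact bundle layout can vary by installer version):

~~~~
# Search the captured bootstrap journals for the REST-mapping failure.
grep -R 'unable to get REST mapping' log-bundle-*/bootstrap/journals/
~~~~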

Comment 9 W. Trevor King 2021-12-07 20:27:03 UTC
That's a surprising error, because it certainly looks like 4.5 has v1 ServiceMonitors:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.5-e2e-aws/1435283192725639168/artifacts/e2e-aws/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-c204179f6befa1a1f021b6c14578df9ca65fb359e3aab1be51660fe3f9e53670/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/servicemonitors.monitoring.coreos.com.yaml | yaml2json | jq -r '.spec.versions[].name'
  v1

And the CVO has included its own v1 ServiceMonitor manifest since 2019 [1].  Maybe whatever adds the ServiceMonitor CRD is slower to come up, and cluster-bootstrap gives up on the manifest before it's available?

[1]: https://github.com/openshift/cluster-version-operator/commit/d3bb5480b9d6062aae3297fba33b6547fcf0fd99

Comment 10 Vadim Rutkovsky 2021-12-08 09:22:11 UTC
IIUC the issue is that in 4.5 the monitoring CRDs are installed after the bootstrap phase, so when the ipi-install-hosted-loki step injects this ServiceMonitor manifest during bootstrap, the installation stalls.
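
A minimal sketch of the kind of guard such a fix adds to the CI step (illustrative only; OCP_VERSION and SHARED_DIR are assumed names here, and the actual change is the openshift/release pull linked above):

~~~~
# Only inject the hosted-loki ServiceMonitor manifest when the target release is new
# enough to have the monitoring.coreos.com/v1 CRDs available during bootstrap.
# The 4.6 cutoff is an assumption for illustration; see the linked PR for the real condition.
if [[ "$(printf '%s\n' 4.6 "${OCP_VERSION}" | sort -V | head -n1)" == "4.6" ]]; then
  cp /tmp/loki-servicemonitor.yml "${SHARED_DIR}/manifest_metrics.yml"
fi
~~~~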

Comment 12 Gaoyun Pei 2022-01-17 00:32:22 UTC
Couldn't find this error when searching CI now, so moving this to VERIFIED.

Comment 15 errata-xmlrpc 2022-03-10 16:05:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

