Bug 1993376 - periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade is permfailing
Summary: periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Vadim Rutkovsky
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-12 21:30 UTC by Ben Parees
Modified: 2022-03-10 16:05 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade=all job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade=all job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade=all job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade=all
Last Closed: 2022-03-10 16:05:18 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift release pull 24359 0 None open Bug 1993376: hosted-loki: skip ServiceMonitor on OCP <4.5 2021-12-09 09:39:18 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:05:37 UTC

Description Ben Parees 2021-08-12 21:30:27 UTC
jobs:
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade 

and
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade 

are always failing in CI; see the testgrid results:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade


sample job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade/1425875339757752320

level=error msg="Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ci-op-rmph6w2z-47a94.ci.azure.devcluster.openshift.com: []"

Comment 1 Ben Parees 2021-08-12 21:44:16 UTC
This is filed against 4.5 since it's the 4.5 install that is failing, but we need 4.5 installs to succeed so we can test upgrades to 4.6, so this bug cannot be closed as "4.5 is EOL".

Comment 2 Miciah Dashiel Butler Masters 2021-08-17 16:23:32 UTC
Luigi will investigate the cause of the failure (ingress, auth, or other).

Comment 3 Luigi Mario Zuccarelli 2021-08-20 08:48:45 UTC
While investigating gather-extra/artifacts/deployments.json, we noticed this:


    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2021-08-12T17:59:51Z",
                "lastUpdateTime": "2021-08-12T17:59:51Z",
                "message": "Deployment does not have minimum availability.",
                "reason": "MinimumReplicasUnavailable",
                "status": "False",
                "type": "Available"
            },
            {
                "lastTransitionTime": "2021-08-12T17:59:51Z",
                "lastUpdateTime": "2021-08-12T17:59:51Z",
                "message": "pods \"ingress-operator-f679584dd-\" is forbidden: unable to validate against any security context constraint: []",
                "reason": "FailedCreate",
                "status": "True",
                "type": "ReplicaFailure"
            },
            {
                "lastTransitionTime": "2021-08-12T18:09:52Z",
                "lastUpdateTime": "2021-08-12T18:09:52Z",
                "message": "ReplicaSet \"ingress-operator-f679584dd\" has timed out progressing.",
                "reason": "ProgressDeadlineExceeded",
                "status": "False",
                "type": "Progressing"
            }
        ],
        "observedGeneration": 1,
        "unavailableReplicas": 1
    }


Could this be the root cause? (The ingress operator is not deployed.)

Could you please investigate?
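
For reference, a sketch of how these conditions can be pulled out of the gathered artifacts (assuming the standard gather-extra layout and that jq is available):

~~~~
# List the ingress-operator Deployment's status conditions from the gathered deployments.json.
jq -r '.items[]
       | select(.metadata.name == "ingress-operator")
       | .status.conditions[]
       | "\(.type)=\(.status): \(.message)"' \
  gather-extra/artifacts/deployments.json
~~~~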

Comment 4 Sergiusz Urbaniak 2021-08-20 11:53:41 UTC
It seems the openshift-ingress-operator namespace was not created in the first place:

$ ls namespaces/openshift-ingress-operator
ls: cannot access 'namespaces/openshift-ingress-operator': No such file or directory

Based on this, the SCC validation logic is not able to validate against any SCC candidates. Reassigning to the Installer component; it seems the cluster is in a half-installed state.
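
A hedged way to confirm the half-installed state from the same gather, assuming gather-extra also captured clusteroperators.json:

~~~~
# Print each ClusterOperator alongside the status of its Available condition.
jq -r '.items[]
       | [.metadata.name,
          (.status.conditions[] | select(.type == "Available") | .status)]
       | @tsv' \
  gather-extra/artifacts/clusteroperators.json
~~~~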

Comment 5 Russell Teague 2021-08-24 20:24:08 UTC
Will review for a future sprint.

Comment 7 Matthew Staebler 2021-12-07 19:54:30 UTC
The metrics.yml manifest created in the ipi-install-hosted-loki step cannot be applied to the cluster. This manifest was added in https://github.com/openshift/release/pull/17604.

I am assigning this BZ to Vadim, since he is the author of the PR. Please change the component to whichever team owns the CI step.

Comment 8 Matthew Staebler 2021-12-07 19:57:10 UTC
Sample of the error from bootkube.sh
~~~~
Nov 09 09:44:35 ci-op-z6d9016h-47a94-t7h6k-bootstrap bootkube.sh[2596]: "metrics.yml": unable to get REST mapping for "metrics.yml": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
~~~~
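
If the installer's bootstrap log bundle was collected for the failed run, the same failure should also be greppable there (a sketch; the exact bundle layout can vary by installer version):

~~~~
# Search the captured bootstrap journals for the REST-mapping failure.
grep -R 'unable to get REST mapping' log-bundle-*/bootstrap/journals/
~~~~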

Comment 9 W. Trevor King 2021-12-07 20:27:03 UTC
That's a surprising error, because it certainly looks like 4.5 has v1 ServiceMonitors:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.5-e2e-aws/1435283192725639168/artifacts/e2e-aws/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-c204179f6befa1a1f021b6c14578df9ca65fb359e3aab1be51660fe3f9e53670/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/servicemonitors.monitoring.coreos.com.yaml | yaml2json | jq -r '.spec.versions[].name'
  v1

And the CVO has included its own v1 ServiceMonitor manifest since 2019 [1].  Maybe whatever adds the ServiceMonitor CRD is slower to come up, and cluster-bootstrap gives up on the manifest before it's available?

[1]: https://github.com/openshift/cluster-version-operator/commit/d3bb5480b9d6062aae3297fba33b6547fcf0fd99

Comment 10 Vadim Rutkovsky 2021-12-08 09:22:11 UTC
IIUC the issue is that in 4.5 the monitoring CRDs are installed after the bootstrap phase, so when the ipi-install-hosted-loki step injects this ServiceMonitor manifest during bootstrap, the installation stalls.
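
A minimal sketch of the kind of guard such a fix adds to the CI step (illustrative only; OCP_VERSION and SHARED_DIR are assumed names here, and the actual change is the openshift/release pull linked above):

~~~~
# Only inject the hosted-loki ServiceMonitor manifest when the target release is new
# enough to have the monitoring.coreos.com/v1 CRDs available during bootstrap.
# The 4.6 cutoff is an assumption for illustration; see the linked PR for the real condition.
if [[ "$(printf '%s\n' 4.6 "${OCP_VERSION}" | sort -V | head -n1)" == "4.6" ]]; then
  cp /tmp/loki-servicemonitor.yml "${SHARED_DIR}/manifest_metrics.yml"
fi
~~~~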

Comment 12 Gaoyun Pei 2022-01-17 00:32:22 UTC
Couldn't find this error when searching CI now, so moving this to VERIFIED.

Comment 15 errata-xmlrpc 2022-03-10 16:05:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

