Description of problem:

We have observed two GCP clusters where the cloud-credential operator has gone to a Down state for sometimes hours at a time. The error causing CCO to report as Down in CVO is seemingly due to the networksecurity.googleapis.com service API not being enabled:

2021-11-10T00:47:57.944192896Z time="2021-11-10T00:47:57Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
2021-11-10T00:47:57.944192896Z time="2021-11-10T00:47:57Z" level=error msg="not all required service APIs are enabled" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
2021-11-10T00:47:57.944245352Z time="2021-11-10T00:47:57Z" level=error msg="error syncing credentials: not all required service APIs are enabled" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-gcp secret=openshift-machine-api/gcp-cloud-credentials

After some hours, the issue eventually self-resolved on both clusters with no other action taken. However, even after the issue resolved itself, we checked the GCP project for the clusters and verified that the Network Security API is still not enabled for the account at all, so I'm not sure why it recovered.

Version-Release number of selected component (if applicable): 4.7.22

How reproducible: We have observed it on two clusters at separate times.
There appears to be an issue with GCP returning inconsistent results for the permissions associated with Role(s). At least the compute.loadBalancerAdmin role is returning inconsistent results (and loadBalancerAdmin is a role being requested by the machine-api: https://github.com/openshift/machine-api-operator/blob/master/install/0000_30_machine-api-operator_00_credentials-request.yaml#L122 ):

[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | wc -l
255
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | wc -l
261
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | wc -l
255
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | wc -l
255

And the difference is in fact that sometimes 'networksecurity' is in the list, and sometimes it isn't:

[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | grep networksecurity | wc -l
6
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | grep networksecurity | wc -l
6
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | grep networksecurity | wc -l
0
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | grep networksecurity | wc -l
0

You can enable the Network Security API on the GCP project ( https://console.cloud.google.com/apis/library/networksecurity.googleapis.com ), or wait until GCP starts to return consistent results (maybe networksecurity is an upcoming permission being added to the existing role and it will in fact be required going forward: https://cloud.google.com/iam/docs/understanding-roles#compute-engine-roles ).
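The flapping above can be sampled in one shot rather than by rerunning the command manually. This is a minimal sketch (assuming an authenticated gcloud session) that describes the role repeatedly and tallies how many networksecurity mentions show up in each sample; more than one distinct tally means GCP is returning inconsistent results:

```shell
# Describe the role ten times, counting networksecurity mentions per sample,
# then tally the distinct counts seen across the samples with sort | uniq -c.
role="roles/compute.loadBalancerAdmin"
for i in $(seq 1 10); do
  gcloud iam roles describe "$role" | grep -c networksecurity
done | sort | uniq -c
```

A single output line would mean the results were stable across all ten samples; two lines (one tally for 0, one for 6) reproduces the inconsistency shown in the transcript above.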
This is a 4.10 blocker to ensure that https://github.com/openshift/machine-api-operator/pull/949 is reverted before we ship 4.10. We believe this is a GCP bug that will be fixed, but we merged a machine-api-operator PR to get payloads flowing again. The machine-api-operator change skips a validation check that ensures our stock credential requests are well-formed. It is OK to skip that check for a short duration to keep payloads moving while working around the GCP bug, but long term we want it back to prevent accidents.
Trying to pin down the current user experience by looking at [1]:

The particular missing API is not bubbled up into the CredentialsRequest:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442663035408384/artifacts/e2e-gcp-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-ci-openshift-org-ocp-4-10-2021-11-09-215013-sha256-bec152bc664c5ba6192357aabfcdd18810135c8e89800239ffee86b6a5d8730d/namespaces/openshift-cloud-credential-operator/cloudcredential.openshift.io/credentialsrequests/openshift-machine-api-gcp.yaml
...
status:
  conditions:
  - lastProbeTime: "2021-11-10T14:59:06Z"
    lastTransitionTime: "2021-11-10T14:43:58Z"
    message: 'failed to grant creds: not all required service APIs are enabled'
    reason: CredentialsProvisionFailure
    status: "True"
    type: CredentialsProvisionFailure
  lastSyncGeneration: 0
  providerStatus:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderStatus
    serviceAccountID: ci-op-tb1rcv-openshift-m-zwk45
  provisioned: false

So to definitively identify this issue, you need to drop into the cloud-credential operator's logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442663035408384/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-6fbc6c67fd-jzt7g_cloud-credential-operator.log | grep -A1 networksecurity | tail -n2
time="2021-11-10T15:28:33Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-10T15:28:33Z" level=error msg="not all required service APIs are enabled" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp

The ClusterOperator signal is Degraded=True with reason=CredentialsFailing:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442663035408384/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cloud-credential").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2021-11-10T14:43:51Z Available=True -: -
2021-11-10T14:44:28Z Degraded=True CredentialsFailing: 1 of 5 credentials requests are failing to sync.
2021-11-10T14:50:22Z Progressing=True Reconciling: 4 of 5 credentials requests provisioned, 1 reporting errors.
2021-11-10T14:43:51Z Upgradeable=True -: -

although of course you could see that when credentials fail for other reasons as well.

The warning-level CloudCredentialOperatorProvisioningFailed alert [2] seems to fire in some, but not all, clusters that are Degraded=True with reason=CredentialsFailing; I'm not sure yet why it doesn't fire in all of them. And again, it can fire because of other provisioning issues besides this networksecurity vs. roles/compute.loadBalancerAdmin issue.

The green [3] is an example where PromeCIeus shows:

cluster_operator_conditions{name="cloud-credential",condition="Degraded",reason="CredentialsFailing"}

giving around 30s (one scrape?) of a hit around 15:55 UTC:

cluster_operator_conditions{condition="Degraded", endpoint="metrics", instance="10.0.0.5:9099", job="cluster-version-operator", name="cloud-credential", namespace="openshift-cluster-version", pod="cluster-version-operator-56cf548bdb-ghzsr", reason="CredentialsFailing", service="cluster-version-operator"}

while:

cco_credentials_requests_conditions{condition="CredentialsProvisionFailure"}

remains flat at zero the whole time.
Hosted Loki for that run includes:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442661848420352"} | unpack |= "networksecurity"

giving:

time="2021-11-10T15:55:24Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp

Maybe the recovery was quick enough to avoid a rise in cco_credentials_requests_conditions?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442663035408384
[2]: https://github.com/openshift/cloud-credential-operator/blame/fb3717b67a5295b2bdace01b46b3fd39b93b19b6/manifests/0000_90_cloud-credential-operator_04_alertrules.yaml#L23-L32
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442661848420352
I don't think we'll actually block any update edges on this, because we've been using the problematic role since 4.5 [1], so all of our existing releases are vulnerable. But to get a better handle on the priority, here's a trimmed-down version of [2]'s impact statement request. Sample answers are provided to give more context, and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

  example: All GCP customers with mint-mode credentials running existing 4.5+ releases who have not enabled the networksecurity.googleapis.com API for their cluster's project.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  example: Either enable the networksecurity.googleapis.com API for the cluster project, or FIXME: some mitigations we actually test.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

  example: No, it has been like this since 4.5, and Google only started fiddling with the role permissions recently.

[1]: https://github.com/openshift/machine-api-operator/pull/513#event-3129866463
[2]: https://github.com/openshift/enhancements/blob/master/enhancements/update/update-blocker-lifecycle/README.md#impact-statement-request
Oops, and also:

What is the impact? Is it serious enough to warrant blocking edges?

  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

Anyone running OpenShift 4.5+ on GCP in CCO's Mint or Passthrough mode is exposed to this inconsistent list of permissions associated with the loadBalancerAdmin role. If the networksecurity API is not enabled in the GCP project where the cluster runs, the intermittent API-enablement checks will fail whenever networksecurity permissions are returned as associated with the loadBalancerAdmin role.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

The networksecurity API can be enabled for the GCP project; any future checks that the API is enabled will then pass.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

No, this is a new behavior exposed by the networksecurity permissions sometimes being returned when querying the GCP loadBalancerAdmin role.

What is the impact? Is it serious enough to warrant blocking edges?

There is nothing to be gained by blocking any edges. OpenShift clusters running on GCP are exposed for all currently supported OpenShift releases (unless running in CCO's Manual mode).
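The remediation above can also be applied from the CLI rather than the console. A minimal sketch, assuming the gcloud CLI is authenticated against the cluster's project and PROJECT_ID is a placeholder for that project's ID:

```shell
# Enable the Network Security API for the cluster's project, then confirm
# it now appears in the enabled-services list (PROJECT_ID is a placeholder).
gcloud services enable networksecurity.googleapis.com --project "${PROJECT_ID}"
gcloud services list --enabled --project "${PROJECT_ID}" \
  | grep networksecurity.googleapis.com
```

Once the API is enabled, CCO's required-APIs check passes regardless of whether GCP happens to include the networksecurity permissions in the role on a given query.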
We have a knowledge-base solution up now, focusing on detection and mitigation [1], while we work through a fix/backport here. [1]: https://access.redhat.com/solutions/6505391
*** Bug 2022752 has been marked as a duplicate of this bug. ***
Per comment 11, as expected, this impacts all existing releases, so no edges need blocking, and I'm clearing UpgradeBlocker. Also setting FastFix to get faster QE verification.
Verified on 4.10.0-0.nightly-2021-11-14-184249, which includes the fix.

1. Launch a basic GCP cluster.
2. Monitor the installation process.

The installation succeeds and CCO does not hit the "Detected required APIs that are disabled: [networksecurity.googleapis.com]" issue:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-14-184249   True        False         50m     Cluster version is 4.10.0-0.nightly-2021-11-14-184249

$ oc logs cloud-credential-operator-646cc64f64-5lgjc -n openshift-cloud-credential-operator -c cloud-credential-operator | grep "Detected required APIs that are disabled"

No output.

############

A payload without the fix, such as 4.9.0-0.nightly-2021-11-11-155043, fails to install, and CCO goes Degraded because [networksecurity.googleapis.com] is disabled:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          53m     Unable to apply 4.9.0-0.nightly-2021-11-11-155043: some cluster operators have not yet rolled out

$ oc logs cloud-credential-operator-5b97f67944-qp6k2 -n openshift-cloud-credential-operator -c cloud-credential-operator | grep "Detected required APIs that are disabled"
time="2021-11-15T04:33:13Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-15T04:33:17Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-15T04:33:22Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-15T04:33:31Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-15T04:33:49Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
Hi all,

A customer with multiple clusters on Google has the same issue on 4.8; it's not clear if this has been cherry-picked to 4.9.

As I understand it, the networksecurity.* API is not really required; it is tied to the roles/compute.loadBalancerAdmin role, which is a beta role for Google ( https://cloud.google.com/compute/docs/access/iam#compute.loadBalancerAdmin ), and the only networksecurity permissions in it relate to load-balancer TLS client/server policies (list, get, use — no create), as below:

networksecurity.clientTlsPolicies.get
networksecurity.clientTlsPolicies.list
networksecurity.clientTlsPolicies.use
networksecurity.serverTlsPolicies.get
networksecurity.serverTlsPolicies.list
networksecurity.serverTlsPolicies.use

Why do we mention this role? It is not a role specified in the mint-mode section: https://docs.openshift.com/container-platform/4.9/authentication/managing_cloud_provider_credentials/cco-mode-mint.html

Regards
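Those permissions can be pulled out of the role directly with gcloud. A sketch, assuming an authenticated gcloud session; gcloud's value() format joins the repeated includedPermissions field with semicolons, which tr splits back into one permission per line:

```shell
# Print only the networksecurity.* permissions included in the role
# (when GCP happens to return them at all).
gcloud iam roles describe roles/compute.loadBalancerAdmin \
  --format='value(includedPermissions)' \
  | tr ';' '\n' | grep '^networksecurity'
```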
(In reply to christian Marangoni from comment #23)
> Hi all,
> customer with multiple clusters in google have the same issue in 4.8, not
> clear if is cherry-picking in 4.9.

The fix has been backported to 4.8: BZ2022838

> why we mention this role? is not a role that is specified in the mint mode
> section
> https://docs.openshift.com/container-platform/4.9/authentication/managing_cloud_provider_credentials/cco-mode-mint.html

Mint mode is what allows the cloud-credential operator to grant credentials to the various components inside the cluster. The cloud-credential operator doesn't need this role itself; it is the machine-api-operator that requests it: https://github.com/openshift/machine-api-operator/blob/master/install/0000_30_machine-api-operator_00_credentials-request.yaml#L123 (This BZ really should have been in the machine-api component...)
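On a live cluster, you can confirm which predefined roles the machine-api CredentialsRequest asks for. A sketch, assuming an authenticated oc session against the affected cluster:

```shell
# Show the predefined GCP roles requested by the machine-api component;
# roles/compute.loadBalancerAdmin should appear in the list.
oc get credentialsrequest openshift-machine-api-gcp \
  -n openshift-cloud-credential-operator -o yaml \
  | grep -A 10 predefinedRoles
```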
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056