Description of problem:

We have observed two GCP clusters where the cloud-credential operator has gone to a Down state for sometimes hours at a time. The error causing CCO to report as Down in CVO is seemingly due to the networksecurity.googleapis.com service API not being enabled:

2021-11-10T00:47:57.944192896Z time="2021-11-10T00:47:57Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
2021-11-10T00:47:57.944192896Z time="2021-11-10T00:47:57Z" level=error msg="not all required service APIs are enabled" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
2021-11-10T00:47:57.944245352Z time="2021-11-10T00:47:57Z" level=error msg="error syncing credentials: not all required service APIs are enabled" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-gcp secret=openshift-machine-api/gcp-cloud-credentials

After some hours, the issue eventually self-resolved on both clusters with no other action taken. However, even after the issue resolved itself, we checked the GCP project for the clusters and verified that the Network Security API is still not enabled for the account at all, so I'm not sure why it recovered.

Version-Release number of selected component (if applicable): 4.7.22

How reproducible: We have observed it on two clusters at separate times.
There appears to be an issue with GCP returning inconsistent results for the permissions associated with Role(s). At least the compute.loadBalancerAdmin role is returning inconsistent results (and loadBalancerAdmin is a role being requested by the machine-api: https://github.com/openshift/machine-api-operator/blob/master/install/0000_30_machine-api-operator_00_credentials-request.yaml#L122 ):

[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | wc -l
255
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | wc -l
261
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | wc -l
255
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | wc -l
255

And the difference is in fact that sometimes 'networksecurity' is in the list, and sometimes it isn't:

[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | grep networksecurity | wc -l
6
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | grep networksecurity | wc -l
6
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | grep networksecurity | wc -l
0
[jdiaz@minigoomba google-cloud-sdk]$ gcloud iam roles describe roles/compute.loadBalancerAdmin | grep networksecurity | wc -l
0

You can enable the Network Security API on the GCP project ( https://console.cloud.google.com/apis/library/networksecurity.googleapis.com ), or wait until GCP starts to return consistent results (maybe networksecurity is an upcoming permission being added to the existing role and it will in fact be required going forward: https://cloud.google.com/iam/docs/understanding-roles#compute-engine-roles ).
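The flapping above can be sampled in one shot rather than by rerunning the command manually. This is a minimal sketch (assuming an authenticated gcloud session) that describes the role repeatedly and tallies how many networksecurity mentions show up in each sample; more than one distinct tally means GCP is returning inconsistent results:

```shell
# Describe the role ten times, counting networksecurity mentions per sample,
# then tally the distinct counts seen across the samples with sort | uniq -c.
role="roles/compute.loadBalancerAdmin"
for i in $(seq 1 10); do
  gcloud iam roles describe "$role" | grep -c networksecurity
done | sort | uniq -c
```

A single output line would mean the results were stable across all ten samples; two lines (one tally for 0, one for 6) reproduces the inconsistency shown in the transcript above.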
This is a 4.10 blocker to ensure that https://github.com/openshift/machine-api-operator/pull/949 is reverted before we ship 4.10. We believe this is a GCP bug that will be fixed, but we merged a machine-api-operator PR to get payloads flowing again. The machine-api-operator change skips a validation check that ensures our stock credential requests are well-formed. It is OK to skip that check for a short duration to keep payloads moving while working around the GCP bug, but long term we want it back to prevent accidents.
Trying to pin down the current user experience by looking at [1]:

The particular missing API is not bubbled up into the CredentialsRequest:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442663035408384/artifacts/e2e-gcp-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-ci-openshift-org-ocp-4-10-2021-11-09-215013-sha256-bec152bc664c5ba6192357aabfcdd18810135c8e89800239ffee86b6a5d8730d/namespaces/openshift-cloud-credential-operator/cloudcredential.openshift.io/credentialsrequests/openshift-machine-api-gcp.yaml
...
status:
  conditions:
  - lastProbeTime: "2021-11-10T14:59:06Z"
    lastTransitionTime: "2021-11-10T14:43:58Z"
    message: 'failed to grant creds: not all required service APIs are enabled'
    reason: CredentialsProvisionFailure
    status: "True"
    type: CredentialsProvisionFailure
  lastSyncGeneration: 0
  providerStatus:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderStatus
    serviceAccountID: ci-op-tb1rcv-openshift-m-zwk45
  provisioned: false

So to definitively identify this issue, you need to drop into the cloud-credential operator's logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442663035408384/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-6fbc6c67fd-jzt7g_cloud-credential-operator.log | grep -A1 networksecurity | tail -n2
time="2021-11-10T15:28:33Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-10T15:28:33Z" level=error msg="not all required service APIs are enabled" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp

The ClusterOperator signal is Degraded=True with reason=CredentialsFailing:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442663035408384/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cloud-credential").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2021-11-10T14:43:51Z Available=True -: -
2021-11-10T14:44:28Z Degraded=True CredentialsFailing: 1 of 5 credentials requests are failing to sync.
2021-11-10T14:50:22Z Progressing=True Reconciling: 4 of 5 credentials requests provisioned, 1 reporting errors.
2021-11-10T14:43:51Z Upgradeable=True -: -

although of course you could see that when credentials fail for other reasons as well.

The warning-level CloudCredentialOperatorProvisioningFailed alert [2] seems to fire in some, but not all, clusters that are Degraded=True with reason=CredentialsFailing; I'm not sure yet why it doesn't fire in all of them. And again, it can fire because of other provisioning issues besides this networksecurity vs. roles/compute.loadBalancerAdmin issue.

The green [3] is an example where PromeCIeus shows:

cluster_operator_conditions{name="cloud-credential",condition="Degraded",reason="CredentialsFailing"}

giving around 30s (one scrape?) of a hit around 15:55 UTC:

cluster_operator_conditions{condition="Degraded", endpoint="metrics", instance="10.0.0.5:9099", job="cluster-version-operator", name="cloud-credential", namespace="openshift-cluster-version", pod="cluster-version-operator-56cf548bdb-ghzsr", reason="CredentialsFailing", service="cluster-version-operator"}

while:

cco_credentials_requests_conditions{condition="CredentialsProvisionFailure"}

remains flat at zero the whole time.
Hosted Loki for that run includes:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442661848420352"} | unpack |= "networksecurity"

giving:

time="2021-11-10T15:55:24Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp

Maybe the recovery was quick enough to avoid a rise in cco_credentials_requests_conditions?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442663035408384
[2]: https://github.com/openshift/cloud-credential-operator/blame/fb3717b67a5295b2bdace01b46b3fd39b93b19b6/manifests/0000_90_cloud-credential-operator_04_alertrules.yaml#L23-L32
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1458442661848420352
I don't think we'll actually block any update edges on this, because we've been using the problematic role since 4.5 [1], so all of our existing releases are vulnerable. But to get a better handle on the priority, here's a trimmed-down version of [2]'s impact statement request. Sample answers are provided to give more context, and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

  example: All GCP customers with mint-mode credentials running existing 4.5+ releases who have not enabled the networksecurity.googleapis.com API for their cluster's project.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  example: Either enable the networksecurity.googleapis.com API for the cluster project, or FIXME: some mitigations we actually test.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

  example: No, it has been like this since 4.5, and Google only started fiddling with the role permissions recently.

[1]: https://github.com/openshift/machine-api-operator/pull/513#event-3129866463
[2]: https://github.com/openshift/enhancements/blob/master/enhancements/update/update-blocker-lifecycle/README.md#impact-statement-request
Oops, and also:

What is the impact? Is it serious enough to warrant blocking edges?

  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

Anyone running OpenShift 4.5+ on GCP in CCO's Mint or Passthrough mode is exposed to this inconsistent list of permissions associated with the loadBalancerAdmin role. If the networksecurity API is not enabled in the GCP project where the cluster runs, the intermittent API-enablement checks will fail whenever networksecurity permissions are returned as associated with the loadBalancerAdmin role.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

The networksecurity API can be enabled for the GCP project; any future checks that the API is enabled will then pass.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

No, this is a new behavior exposed by the networksecurity permissions sometimes being returned when querying the GCP loadBalancerAdmin role.

What is the impact? Is it serious enough to warrant blocking edges?

There is nothing to be gained by blocking any edges. OpenShift clusters running on GCP are exposed for all currently supported OpenShift releases (unless running in CCO's Manual mode).
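The remediation above can also be applied from the CLI rather than the console. A minimal sketch, assuming the gcloud CLI is authenticated against the cluster's project and PROJECT_ID is a placeholder for that project's ID:

```shell
# Enable the Network Security API for the cluster's project, then confirm
# it now appears in the enabled-services list (PROJECT_ID is a placeholder).
gcloud services enable networksecurity.googleapis.com --project "${PROJECT_ID}"
gcloud services list --enabled --project "${PROJECT_ID}" \
  | grep networksecurity.googleapis.com
```

Once the API is enabled, CCO's required-APIs check passes regardless of whether GCP happens to include the networksecurity permissions in the role on a given query.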
We have a knowledge-base solution up now, focusing on detection and mitigation [1], while we work through a fix/backport here. [1]: https://access.redhat.com/solutions/6505391
*** Bug 2022752 has been marked as a duplicate of this bug. ***
Per comment 11, as expected, this impacts all existing releases, so no edges need blocking, and I'm clearing UpgradeBlocker. Also setting FastFix to get faster QE verification.
Verified on 4.10.0-0.nightly-2021-11-14-184249, which includes the fix.

1. Launch a basic GCP cluster.
2. Monitor the installation process.

The installation succeeds and CCO does not hit the "Detected required APIs that are disabled: [networksecurity.googleapis.com]" issue:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-14-184249   True        False         50m     Cluster version is 4.10.0-0.nightly-2021-11-14-184249

$ oc logs cloud-credential-operator-646cc64f64-5lgjc -n openshift-cloud-credential-operator -c cloud-credential-operator | grep "Detected required APIs that are disabled"

No output.

############

A payload without the fix, such as 4.9.0-0.nightly-2021-11-11-155043, fails to install, and CCO goes Degraded because [networksecurity.googleapis.com] is disabled:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          53m     Unable to apply 4.9.0-0.nightly-2021-11-11-155043: some cluster operators have not yet rolled out

$ oc logs cloud-credential-operator-5b97f67944-qp6k2 -n openshift-cloud-credential-operator -c cloud-credential-operator | grep "Detected required APIs that are disabled"
time="2021-11-15T04:33:13Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-15T04:33:17Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-15T04:33:22Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-15T04:33:31Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
time="2021-11-15T04:33:49Z" level=warning msg="Detected required APIs that are disabled: [networksecurity.googleapis.com]" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp
Hi all,

A customer with multiple clusters on Google has the same issue on 4.8; it's not clear if this has been cherry-picked to 4.9.

As I understand it, the networksecurity.* API is not really required; it is tied to the roles/compute.loadBalancerAdmin role, which is a beta role for Google ( https://cloud.google.com/compute/docs/access/iam#compute.loadBalancerAdmin ), and the only networksecurity permissions in it relate to load-balancer TLS client/server policies (list, get, use — no create), as below:

networksecurity.clientTlsPolicies.get
networksecurity.clientTlsPolicies.list
networksecurity.clientTlsPolicies.use
networksecurity.serverTlsPolicies.get
networksecurity.serverTlsPolicies.list
networksecurity.serverTlsPolicies.use

Why do we mention this role? It is not a role specified in the mint-mode section: https://docs.openshift.com/container-platform/4.9/authentication/managing_cloud_provider_credentials/cco-mode-mint.html

Regards
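Those permissions can be pulled out of the role directly with gcloud. A sketch, assuming an authenticated gcloud session; gcloud's value() format joins the repeated includedPermissions field with semicolons, which tr splits back into one permission per line:

```shell
# Print only the networksecurity.* permissions included in the role
# (when GCP happens to return them at all).
gcloud iam roles describe roles/compute.loadBalancerAdmin \
  --format='value(includedPermissions)' \
  | tr ';' '\n' | grep '^networksecurity'
```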
(In reply to christian Marangoni from comment #23)
> Hi all,
> customer with multiple clusters in google have the same issue in 4.8, not
> clear if is cherry-picking in 4.9.

The fix has been backported to 4.8: BZ2022838

> why we mention this role? is not a role that is specified in the mint mode
> section
> https://docs.openshift.com/container-platform/4.9/authentication/managing_cloud_provider_credentials/cco-mode-mint.html

Mint mode is what allows the cloud-credential operator to grant credentials to the various components inside the cluster. The cloud-credential operator doesn't need this role itself; it is the machine-api-operator that requests it: https://github.com/openshift/machine-api-operator/blob/master/install/0000_30_machine-api-operator_00_credentials-request.yaml#L123 (This BZ really should have been in the machine-api component...)
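On a live cluster, you can confirm which predefined roles the machine-api CredentialsRequest asks for. A sketch, assuming an authenticated oc session against the affected cluster:

```shell
# Show the predefined GCP roles requested by the machine-api component;
# roles/compute.loadBalancerAdmin should appear in the list.
oc get credentialsrequest openshift-machine-api-gcp \
  -n openshift-cloud-credential-operator -o yaml \
  | grep -A 10 predefinedRoles
```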
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056