+++ This bug was initially created as a clone of Bug #1996916 +++

Description of problem:

Version-Release number of selected component (if applicable):
4.9.0-fc.0

How reproducible:

Steps to Reproduce:
1. Set up OCP 4.9 on GCP
2. Deploy SRO from OperatorHub and create an instance

Actual results:
No simple-kmod pod created

Expected results:
simple-kmod pod should be created normally

Additional info:

[ocpadmin@ec2-18-217-45-133 nto]$ oc logs -f special-resource-controller-manager-76758fbfd4-2fhrs -n openshift-operators
error: a container name must be specified for pod special-resource-controller-manager-76758fbfd4-2fhrs, choose one of: [kube-rbac-proxy manager]

[ocpadmin@ec2-18-217-45-133 nto]$ oc logs -f special-resource-controller-manager-76758fbfd4-2fhrs -n openshift-operators -c manager
I0823 15:00:02.375489 1 request.go:668] Waited for 1.047736909s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/nfd.openshift.io/v1?timeout=32s
2021-08-23T15:00:03.725Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2021-08-23T15:00:03.780Z INFO setup starting manager
I0823 15:00:03.780477 1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/b6ae617b.openshift.io...
2021-08-23T15:00:03.780Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
I0823 15:00:21.780840 1 leaderelection.go:253] successfully acquired lease openshift-operators/b6ae617b.openshift.io
2021-08-23T15:00:21.780Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"openshift-operators","name":"b6ae617b.openshift.io","uid":"a4b4ea4a-c8a8-44e3-96c3-b3262b2ce7ad","apiVersion":"v1","resourceVersion":"204878"}, "reason": "LeaderElection", "message": "special-resource-controller-manager-76758fbfd4-2fhrs_0214bbac-9868-4be4-b314-20427b8be7d4 became leader"}
2021-08-23T15:00:21.781Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"Lease","namespace":"openshift-operators","name":"b6ae617b.openshift.io","uid":"c6750a42-a7ed-498a-a10b-475413a715e7","apiVersion":"coordination.k8s.io/v1","resourceVersion":"204879"}, "reason": "LeaderElection", "message": "special-resource-controller-manager-76758fbfd4-2fhrs_0214bbac-9868-4be4-b314-20427b8be7d4 became leader"}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
[the "Starting EventSource" line above is repeated identically several more times]
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting Controller {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource"}
2021-08-23T15:00:21.982Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "simple-kmod"}
2021-08-23T15:00:22.380Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "special-resource-controller-manager"}
2021-08-23T15:00:24.278Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "special-resource-controller-manager-76758fbfd4-2fhrs"}
2021-08-23T15:00:25.373Z INFO controller-runtime.manager.controller.specialresource Starting workers {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "worker count": 1}
2021-08-23T15:00:25.374Z INFO preamble Controller Request {"Name": "simple-kmod", "Namespace": ""}
2021-08-23T15:00:25.577Z INFO cache Nodes cached {"name": "liqcui-ocgcp01-h25hk-worker-a-whrpr.c.openshift-qe.internal"}
2021-08-23T15:00:25.577Z INFO cache Nodes cached {"name": "liqcui-ocgcp01-h25hk-worker-b-4sjbz.c.openshift-qe.internal"}
2021-08-23T15:00:25.577Z INFO cache Nodes cached {"name": "liqcui-ocgcp01-h25hk-worker-c-lml7k.c.openshift-qe.internal"}
2021-08-23T15:00:25.577Z INFO cache Node list: {"length": 3}
2021-08-23T15:00:25.577Z INFO cache Nodes {"num": 3}
2021-08-23T15:00:25.776Z INFO upgrade History {"entry": "registry.ci.openshift.org/ocp/release@sha256:7e7fdee3d2a4ea95db5e356fed533fc80e350544633f175d438072e51ae8f0ff"}
E0823 15:00:32.427535 1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2021-08-23T15:04:41.798Z INFO filter DELETE IsSpecialResource {"GenerationChanged": "simple-kmod"}
2021-08-23T15:05:02.150Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "simple-kmod"}

--- Additional comment from Pablo Acevedo on 2021-08-27 09:51:45 UTC ---

After reproducing the issue, here is the sequence of events leading to this situation:

- After installing NFD and SRO, we get the simple-kmod created.
- While trying to reconcile, we see the following in the logs:

2021-08-23T15:00:25.776Z INFO upgrade History {"entry": "registry.ci.openshift.org/ocp/release@sha256:7e7fdee3d2a4ea95db5e356fed533fc80e350544633f175d438072e51ae8f0ff"}
E0823 15:00:32.427535 1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors

- When enabling the verbose errors, we see the following:

E0827 08:02:02.065439 1 aws_credentials.go:81] while getting AWS credentials NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
SharedCredsLoad: failed to load profile, .
EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

While this can be misleading, it does not relate to the original issue, since we are not on AWS. This error is expected, but logging it is a bit confusing.

- After creating a simple-kmod CR, we see nothing happens. We delete it and create it again, and it's not reconciled.
- Looking into the AWS log, it seems there is some infinite loop in the driver toolkit version logic.
When investigating, we can find the following:

I0827 08:01:55.732875 1 plugins.go:50] Creating DockerKeyring
I0827 08:01:55.732917 1 plugins.go:64] Trying out provider .dockercfg
I0827 08:01:55.732928 1 plugins.go:66] Registering credential provider
I0827 08:01:55.732937 1 plugins.go:64] Trying out provider amazon-ecr
E0827 08:02:02.065439 1 aws_credentials.go:81] while getting AWS credentials NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
SharedCredsLoad: failed to load profile, .
EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I0827 08:02:02.065524 1 plugins.go:64] Trying out provider azure
I0827 08:02:02.065538 1 plugins.go:64] Trying out provider google-container-registry
I0827 08:02:02.065706 1 metadata.go:195] Reading URL with backoff. http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/
I0827 08:02:03.123781 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused

So now we see some trouble with the metadata server in GCE. This is happening within an endless loop, as can be seen here: https://github.com/vdemeester/k8s-pkg-credentialprovider/blob/master/gcp/metadata.go#L193

This issue (https://github.com/vdemeester/k8s-pkg-credentialprovider/issues/14) looks similar, but the situation is not the same, because here the name is resolved to an IP. There is no DNS problem; it's the firewalling/routing rules in the network.
Since this is running in an endless backoff loop, the reconcile logic never gets past this function, so no CR is going to be handled, hence the apparent inactivity we see.

- If we exec into the pod, we can see the same situation:

❯ kubectl exec -it -n openshift-special-resource-operator special-resource-controller-manager-5f8b4449b5-wc7qw -c manager -- /bin/bash
bash-4.4$ cat /sys/class/dmi/id/product_name
Google Compute Engine
bash-4.4$ curl http://metadata.google.internal. -v
* Rebuilt URL to: http://metadata.google.internal./
* Trying 169.254.169.254...
* TCP_NODELAY set
* connect to 169.254.169.254 port 80 failed: Connection refused
* Failed to connect to metadata.google.internal port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to metadata.google.internal port 80: Connection refused

- If we do the same thing from one of the nodes, the curl succeeds, so we are only allowed to reach the metadata server from the VMs themselves. This might be due to a faulty authorization config in GCP when setting up the cluster; can you please check it?

--- Additional comment from Pablo Acevedo on 2021-08-27 09:53:40 UTC ---

Some reference: https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata

--- Additional comment from on 2021-09-01 03:42:57 UTC ---

https://groups.google.com/g/openshift/c/MnESJ0MUw1Y — from my understanding, the only way is to mount the cluster's Google Cloud network inside of the pod.

--- Additional comment from Pablo Acevedo on 2021-09-02 11:53:36 UTC ---

After some investigation, we found the cause. Starting from a fresh 4.9 install on GCP, we have the initial situation, which is the infinite loop:

I0902 07:21:04.441264 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:05.593287 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:06.809316 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:08.217253 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:10.073291 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:12.697274 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused

Then we check that the connection is allowed from the node:

[root@ci-ln-jbfhhcb-f76d1-gd5qt-worker-b-6kxcs /]# curl http://metadata.google.internal. -v -i
* Rebuilt URL to: http://metadata.google.internal./
* Trying 169.254.169.254...
* TCP_NODELAY set
* Connected to metadata.google.internal (169.254.169.254) port 80 (#0)
> GET / HTTP/1.1
> Host: metadata.google.internal
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
[...]

Then we go to the pod and perform the same operation:

bash-4.4$ curl http://metadata.google.internal. -v -i
* Rebuilt URL to: http://metadata.google.internal./
* Trying 169.254.169.254...
* TCP_NODELAY set
* connect to 169.254.169.254 port 80 failed: Connection refused
* Failed to connect to metadata.google.internal port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to metadata.google.internal port 80: Connection refused

We spin up a debug pod, install tcpdump, and find that running curl while tcpdump is capturing only shows DNS traffic to resolve the metadata server IP:

09:23:14.955898 IP debug.50005 > dns-default.openshift-dns.svc.cluster.local.53: 27786+ A? metadata.google.internal.default.svc.cluster.local. (68)
09:23:14.956165 IP debug.50005 > dns-default.openshift-dns.svc.cluster.local.53: 28181+ AAAA? metadata.google.internal.default.svc.cluster.local. (68)
09:23:14.957172 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.50005: 28181 NXDomain*- 0/1/1 (172)
09:23:14.957196 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.50005: 27786 NXDomain*- 0/1/1 (172)
09:23:14.957292 IP debug.57561 > dns-default.openshift-dns.svc.cluster.local.53: 49982+ A? metadata.google.internal.svc.cluster.local. (60)
09:23:14.957504 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.57561: 49982 NXDomain*- 0/1/1 (164)
09:23:14.957792 IP debug.57561 > dns-default.openshift-dns.svc.cluster.local.53: 50413+ AAAA? metadata.google.internal.svc.cluster.local. (60)
09:23:14.958003 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.57561: 50413 NXDomain*- 0/1/1 (164)
09:23:14.958108 IP debug.46008 > dns-default.openshift-dns.svc.cluster.local.53: 19830+ A? metadata.google.internal.cluster.local. (56)
09:23:14.958183 IP debug.46008 > dns-default.openshift-dns.svc.cluster.local.53: 20167+ AAAA? metadata.google.internal.cluster.local. (56)
09:23:14.958950 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46008: 20167 NXDomain*- 0/1/1 (160)
09:23:14.959097 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46008: 19830 NXDomain*- 0/1/1 (160)
09:23:14.959214 IP debug.55770 > dns-default.openshift-dns.svc.cluster.local.53: 12236+ A? metadata.google.internal.c.openshift-gce-devel-ci.internal. (76)
09:23:14.959633 IP debug.55770 > dns-default.openshift-dns.svc.cluster.local.53: 12651+ AAAA? metadata.google.internal.c.openshift-gce-devel-ci.internal. (76)
09:23:14.963824 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.55770: 12651 NXDomain 0/1/1 (192)
09:23:14.963880 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.55770: 12236 NXDomain 0/1/1 (192)
09:23:14.964490 IP debug.37963 > dns-default.openshift-dns.svc.cluster.local.53: 41502+ A? metadata.google.internal.google.internal. (58)
09:23:14.964778 IP debug.37963 > dns-default.openshift-dns.svc.cluster.local.53: 41842+ AAAA? metadata.google.internal.google.internal. (58)
09:23:14.967874 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.37963: 41502 NXDomain 0/1/1 (174)
09:23:14.968422 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.37963: 41842 NXDomain 0/1/1 (174)
09:23:14.969681 IP debug.46076 > dns-default.openshift-dns.svc.cluster.local.53: 55746+ A? metadata.google.internal.c.openshift-gce-devel-c. (66)
09:23:14.969762 IP debug.46076 > dns-default.openshift-dns.svc.cluster.local.53: 56121+ AAAA? metadata.google.internal.c.openshift-gce-devel-c. (66)
09:23:15.011148 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46076: 55746 NXDomain 0/1/1 (152)
09:23:15.026148 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46076: 56121 NXDomain 0/1/1 (152)
09:23:15.026728 IP debug.45646 > dns-default.openshift-dns.svc.cluster.local.53: 61897+ A? metadata.google.internal. (42)
09:23:15.026954 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.45646: 61897* 1/0/1 A 169.254.169.254 (93)
09:23:15.027205 IP debug.45646 > dns-default.openshift-dns.svc.cluster.local.53: 62240+ AAAA? metadata.google.internal. (42)
09:23:15.028764 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.45646: 62240 0/0/1 (53)
09:23:15.061121 IP debug.58929 > dns-default.openshift-dns.svc.cluster.local.53: 3387+ PTR? 10.0.30.172.in-addr.arpa. (42)
09:23:15.061538 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.58929: 3387*- 1/0/1 PTR dns-default.openshift-dns.svc.cluster.local. (134)

And nothing else. We turn to the iptables rules and find this in SDN: https://github.com/openshift/sdn/blob/e17993d1116cce1372dfc680752e792e8e51e28a/pkg/cmd/openshift-sdn-cni/openshift-sdn.go#L126-L129

After asking the SDN team, they said it is mainly there for security reasons.

--- Additional comment from Brett Thurber on 2021-09-09 03:03:47 UTC ---

Is there anything to take action on for this BZ, or is this trending as NOT A BUG?

--- Additional comment from Pablo Acevedo on 2021-09-09 07:29:57 UTC ---

We are looking into different alternatives for extracting the info we need without using the metadata server, but this is a bug when using the SDN CNI on GCP.

--- Additional comment from Pablo Acevedo on 2021-10-07 08:15:49 UTC ---

Working on a solution already.
*** This bug has been marked as a duplicate of bug 2014622 ***