+++ This bug was initially created as a clone of Bug #1996916 +++

Description of problem:

Version-Release number of selected component (if applicable):
4.9.0-fc.0

How reproducible:

Steps to Reproduce:
1. Set up OCP 4.9 on GCP
2. Deploy SRO from OperatorHub and create an instance

Actual results:
No simple-kmod pod created

Expected results:
simple-kmod pod should be created normally

Additional info:

[ocpadmin@ec2-18-217-45-133 nto]$ oc logs -f special-resource-controller-manager-76758fbfd4-2fhrs -n openshift-operators
error: a container name must be specified for pod special-resource-controller-manager-76758fbfd4-2fhrs, choose one of: [kube-rbac-proxy manager]

[ocpadmin@ec2-18-217-45-133 nto]$ oc logs -f special-resource-controller-manager-76758fbfd4-2fhrs -n openshift-operators -c manager
I0823 15:00:02.375489 1 request.go:668] Waited for 1.047736909s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/nfd.openshift.io/v1?timeout=32s
2021-08-23T15:00:03.725Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2021-08-23T15:00:03.780Z INFO setup starting manager
I0823 15:00:03.780477 1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/b6ae617b.openshift.io...
2021-08-23T15:00:03.780Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
I0823 15:00:21.780840 1 leaderelection.go:253] successfully acquired lease openshift-operators/b6ae617b.openshift.io
2021-08-23T15:00:21.780Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"openshift-operators","name":"b6ae617b.openshift.io","uid":"a4b4ea4a-c8a8-44e3-96c3-b3262b2ce7ad","apiVersion":"v1","resourceVersion":"204878"}, "reason": "LeaderElection", "message": "special-resource-controller-manager-76758fbfd4-2fhrs_0214bbac-9868-4be4-b314-20427b8be7d4 became leader"}
2021-08-23T15:00:21.781Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"Lease","namespace":"openshift-operators","name":"b6ae617b.openshift.io","uid":"c6750a42-a7ed-498a-a10b-475413a715e7","apiVersion":"coordination.k8s.io/v1","resourceVersion":"204879"}, "reason": "LeaderElection", "message": "special-resource-controller-manager-76758fbfd4-2fhrs_0214bbac-9868-4be4-b314-20427b8be7d4 became leader"}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
[the "Starting EventSource" line above is repeated identically several more times]
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting Controller {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource"}
2021-08-23T15:00:21.982Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "simple-kmod"}
2021-08-23T15:00:22.380Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "special-resource-controller-manager"}
2021-08-23T15:00:24.278Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "special-resource-controller-manager-76758fbfd4-2fhrs"}
2021-08-23T15:00:25.373Z INFO controller-runtime.manager.controller.specialresource Starting workers {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "worker count": 1}
2021-08-23T15:00:25.374Z INFO preamble Controller Request {"Name": "simple-kmod", "Namespace": ""}
2021-08-23T15:00:25.577Z INFO cache Nodes cached {"name": "liqcui-ocgcp01-h25hk-worker-a-whrpr.c.openshift-qe.internal"}
2021-08-23T15:00:25.577Z INFO cache Nodes cached {"name": "liqcui-ocgcp01-h25hk-worker-b-4sjbz.c.openshift-qe.internal"}
2021-08-23T15:00:25.577Z INFO cache Nodes cached {"name": "liqcui-ocgcp01-h25hk-worker-c-lml7k.c.openshift-qe.internal"}
2021-08-23T15:00:25.577Z INFO cache Node list: {"length": 3}
2021-08-23T15:00:25.577Z INFO cache Nodes {"num": 3}
2021-08-23T15:00:25.776Z INFO upgrade History {"entry": "registry.ci.openshift.org/ocp/release@sha256:7e7fdee3d2a4ea95db5e356fed533fc80e350544633f175d438072e51ae8f0ff"}
E0823 15:00:32.427535 1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2021-08-23T15:04:41.798Z INFO filter DELETE IsSpecialResource {"GenerationChanged": "simple-kmod"}
2021-08-23T15:05:02.150Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "simple-kmod"}

--- Additional comment from Pablo Acevedo on 2021-08-27 09:51:45 UTC ---

After reproducing the issue, here is the sequence of events leading to this situation:

- After installing NFD and SRO, we get the simple-kmod created.
- While trying to reconcile, we see the following in the logs:

2021-08-23T15:00:25.776Z INFO upgrade History {"entry": "registry.ci.openshift.org/ocp/release@sha256:7e7fdee3d2a4ea95db5e356fed533fc80e350544633f175d438072e51ae8f0ff"}
E0823 15:00:32.427535 1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors

- When enabling the verbose errors, we see the following:

E0827 08:02:02.065439 1 aws_credentials.go:81] while getting AWS credentials NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
SharedCredsLoad: failed to load profile, .
EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

While this can be misleading, it does not relate to the original issue, since we are not on AWS. This error is expected, but logging it is a bit confusing.

- After creating a simple-kmod CR, we see nothing happens. We delete it and create it again, and it's not reconciled.
- Looking into the AWS log, it seems there is some infinite loop in the driver toolkit version logic.
When investigating, we can find the following:

I0827 08:01:55.732875 1 plugins.go:50] Creating DockerKeyring
I0827 08:01:55.732917 1 plugins.go:64] Trying out provider .dockercfg
I0827 08:01:55.732928 1 plugins.go:66] Registering credential provider
I0827 08:01:55.732937 1 plugins.go:64] Trying out provider amazon-ecr
E0827 08:02:02.065439 1 aws_credentials.go:81] while getting AWS credentials NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
SharedCredsLoad: failed to load profile, .
EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I0827 08:02:02.065524 1 plugins.go:64] Trying out provider azure
I0827 08:02:02.065538 1 plugins.go:64] Trying out provider google-container-registry
I0827 08:02:02.065706 1 metadata.go:195] Reading URL with backoff. http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/
I0827 08:02:03.123781 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused

So now we see some trouble with the metadata server in GCE. This is happening within an endless loop, as can be seen here: https://github.com/vdemeester/k8s-pkg-credentialprovider/blob/master/gcp/metadata.go#L193

This issue (https://github.com/vdemeester/k8s-pkg-credentialprovider/issues/14) looks similar, but the situation is not the same, because here the name is resolved to an IP. There is no DNS problem; it's the firewalling/routing rules in the network.
Since this is running in an endless backoff loop, the reconcile logic never gets past this function, so no CR is going to be handled, hence the apparent inactivity we see.

- If we exec into the pod, we can see the same situation:

❯ kubectl exec -it -n openshift-special-resource-operator special-resource-controller-manager-5f8b4449b5-wc7qw -c manager -- /bin/bash
bash-4.4$ cat /sys/class/dmi/id/product_name
Google Compute Engine
bash-4.4$ curl http://metadata.google.internal. -v
* Rebuilt URL to: http://metadata.google.internal./
* Trying 169.254.169.254...
* TCP_NODELAY set
* connect to 169.254.169.254 port 80 failed: Connection refused
* Failed to connect to metadata.google.internal port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to metadata.google.internal port 80: Connection refused

- If we do the same thing from one of the nodes, the curl succeeds, so we are only allowed to reach the metadata server from the VMs themselves. This might be due to a faulty authorization config in GCP when setting up the cluster; can you please check it?

--- Additional comment from Pablo Acevedo on 2021-08-27 09:53:40 UTC ---

Some reference: https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata

--- Additional comment from on 2021-09-01 03:42:57 UTC ---

https://groups.google.com/g/openshift/c/MnESJ0MUw1Y — from my understanding, the only way is to mount the cluster's Google Cloud network inside of the pod.

--- Additional comment from Pablo Acevedo on 2021-09-02 11:53:36 UTC ---

After some investigation, we found the cause. Starting from a fresh 4.9 install on GCP, we have the initial situation, which is the infinite loop:

I0902 07:21:04.441264 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:05.593287 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:06.809316 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:08.217253 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:10.073291 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:12.697274 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused

Then we check that the connection is allowed from the node:

[root@ci-ln-jbfhhcb-f76d1-gd5qt-worker-b-6kxcs /]# curl http://metadata.google.internal. -v -i
* Rebuilt URL to: http://metadata.google.internal./
* Trying 169.254.169.254...
* TCP_NODELAY set
* Connected to metadata.google.internal (169.254.169.254) port 80 (#0)
> GET / HTTP/1.1
> Host: metadata.google.internal
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
[...]

Then we go to the pod and perform the same operation:

bash-4.4$ curl http://metadata.google.internal. -v -i
* Rebuilt URL to: http://metadata.google.internal./
* Trying 169.254.169.254...
* TCP_NODELAY set
* connect to 169.254.169.254 port 80 failed: Connection refused
* Failed to connect to metadata.google.internal port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to metadata.google.internal port 80: Connection refused

We spin up a debug pod, install tcpdump, and find that running curl while tcpdump is capturing only shows DNS traffic to resolve the metadata server IP:

09:23:14.955898 IP debug.50005 > dns-default.openshift-dns.svc.cluster.local.53: 27786+ A? metadata.google.internal.default.svc.cluster.local. (68)
09:23:14.956165 IP debug.50005 > dns-default.openshift-dns.svc.cluster.local.53: 28181+ AAAA? metadata.google.internal.default.svc.cluster.local. (68)
09:23:14.957172 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.50005: 28181 NXDomain*- 0/1/1 (172)
09:23:14.957196 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.50005: 27786 NXDomain*- 0/1/1 (172)
09:23:14.957292 IP debug.57561 > dns-default.openshift-dns.svc.cluster.local.53: 49982+ A? metadata.google.internal.svc.cluster.local. (60)
09:23:14.957504 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.57561: 49982 NXDomain*- 0/1/1 (164)
09:23:14.957792 IP debug.57561 > dns-default.openshift-dns.svc.cluster.local.53: 50413+ AAAA? metadata.google.internal.svc.cluster.local. (60)
09:23:14.958003 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.57561: 50413 NXDomain*- 0/1/1 (164)
09:23:14.958108 IP debug.46008 > dns-default.openshift-dns.svc.cluster.local.53: 19830+ A? metadata.google.internal.cluster.local. (56)
09:23:14.958183 IP debug.46008 > dns-default.openshift-dns.svc.cluster.local.53: 20167+ AAAA? metadata.google.internal.cluster.local. (56)
09:23:14.958950 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46008: 20167 NXDomain*- 0/1/1 (160)
09:23:14.959097 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46008: 19830 NXDomain*- 0/1/1 (160)
09:23:14.959214 IP debug.55770 > dns-default.openshift-dns.svc.cluster.local.53: 12236+ A? metadata.google.internal.c.openshift-gce-devel-ci.internal. (76)
09:23:14.959633 IP debug.55770 > dns-default.openshift-dns.svc.cluster.local.53: 12651+ AAAA? metadata.google.internal.c.openshift-gce-devel-ci.internal. (76)
09:23:14.963824 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.55770: 12651 NXDomain 0/1/1 (192)
09:23:14.963880 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.55770: 12236 NXDomain 0/1/1 (192)
09:23:14.964490 IP debug.37963 > dns-default.openshift-dns.svc.cluster.local.53: 41502+ A? metadata.google.internal.google.internal. (58)
09:23:14.964778 IP debug.37963 > dns-default.openshift-dns.svc.cluster.local.53: 41842+ AAAA? metadata.google.internal.google.internal. (58)
09:23:14.967874 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.37963: 41502 NXDomain 0/1/1 (174)
09:23:14.968422 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.37963: 41842 NXDomain 0/1/1 (174)
09:23:14.969681 IP debug.46076 > dns-default.openshift-dns.svc.cluster.local.53: 55746+ A? metadata.google.internal.c.openshift-gce-devel-c. (66)
09:23:14.969762 IP debug.46076 > dns-default.openshift-dns.svc.cluster.local.53: 56121+ AAAA? metadata.google.internal.c.openshift-gce-devel-c. (66)
09:23:15.011148 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46076: 55746 NXDomain 0/1/1 (152)
09:23:15.026148 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46076: 56121 NXDomain 0/1/1 (152)
09:23:15.026728 IP debug.45646 > dns-default.openshift-dns.svc.cluster.local.53: 61897+ A? metadata.google.internal. (42)
09:23:15.026954 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.45646: 61897* 1/0/1 A 169.254.169.254 (93)
09:23:15.027205 IP debug.45646 > dns-default.openshift-dns.svc.cluster.local.53: 62240+ AAAA? metadata.google.internal. (42)
09:23:15.028764 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.45646: 62240 0/0/1 (53)
09:23:15.061121 IP debug.58929 > dns-default.openshift-dns.svc.cluster.local.53: 3387+ PTR? 10.0.30.172.in-addr.arpa. (42)
09:23:15.061538 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.58929: 3387*- 1/0/1 PTR dns-default.openshift-dns.svc.cluster.local. (134)

And nothing else. We turn to the iptables rules and find this in SDN: https://github.com/openshift/sdn/blob/e17993d1116cce1372dfc680752e792e8e51e28a/pkg/cmd/openshift-sdn-cni/openshift-sdn.go#L126-L129

After asking the SDN team, they said it is mainly there for security reasons.

--- Additional comment from Brett Thurber on 2021-09-09 03:03:47 UTC ---

Is there anything to take action on for this BZ, or is this trending as NOT A BUG?

--- Additional comment from Pablo Acevedo on 2021-09-09 07:29:57 UTC ---

We are looking into different alternatives for extracting the info we need without using the metadata server, but this is a bug when using the SDN CNI on GCP.

--- Additional comment from Pablo Acevedo on 2021-10-07 08:15:49 UTC ---

Working on a solution already.
*** This bug has been marked as a duplicate of bug 2014622 ***