Description of problem:

Version-Release number of selected component (if applicable):
4.9.0-fc.0

How reproducible:

Steps to Reproduce:
1. Set up OCP 4.9 on GCP
2. Deploy SRO from OperatorHub and create an instance

Actual results:
No simple-kmod pod is created.

Expected results:
The simple-kmod pod should be created normally.

Additional info:

[ocpadmin@ec2-18-217-45-133 nto]$ oc logs -f special-resource-controller-manager-76758fbfd4-2fhrs -n openshift-operators
error: a container name must be specified for pod special-resource-controller-manager-76758fbfd4-2fhrs, choose one of: [kube-rbac-proxy manager]

[ocpadmin@ec2-18-217-45-133 nto]$ oc logs -f special-resource-controller-manager-76758fbfd4-2fhrs -n openshift-operators -c manager
I0823 15:00:02.375489 1 request.go:668] Waited for 1.047736909s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/nfd.openshift.io/v1?timeout=32s
2021-08-23T15:00:03.725Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2021-08-23T15:00:03.780Z INFO setup starting manager
I0823 15:00:03.780477 1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/b6ae617b.openshift.io...
2021-08-23T15:00:03.780Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
I0823 15:00:21.780840 1 leaderelection.go:253] successfully acquired lease openshift-operators/b6ae617b.openshift.io
2021-08-23T15:00:21.780Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"openshift-operators","name":"b6ae617b.openshift.io","uid":"a4b4ea4a-c8a8-44e3-96c3-b3262b2ce7ad","apiVersion":"v1","resourceVersion":"204878"}, "reason": "LeaderElection", "message": "special-resource-controller-manager-76758fbfd4-2fhrs_0214bbac-9868-4be4-b314-20427b8be7d4 became leader"}
2021-08-23T15:00:21.781Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"Lease","namespace":"openshift-operators","name":"b6ae617b.openshift.io","uid":"c6750a42-a7ed-498a-a10b-475413a715e7","apiVersion":"coordination.k8s.io/v1","resourceVersion":"204879"}, "reason": "LeaderElection", "message": "special-resource-controller-manager-76758fbfd4-2fhrs_0214bbac-9868-4be4-b314-20427b8be7d4 became leader"}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting EventSource {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "source": "kind source: /, Kind="}
2021-08-23T15:00:21.781Z INFO controller-runtime.manager.controller.specialresource Starting Controller {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource"}
2021-08-23T15:00:21.982Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "simple-kmod"}
2021-08-23T15:00:22.380Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "special-resource-controller-manager"}
2021-08-23T15:00:24.278Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "special-resource-controller-manager-76758fbfd4-2fhrs"}
2021-08-23T15:00:25.373Z INFO controller-runtime.manager.controller.specialresource Starting workers {"reconciler group": "sro.openshift.io", "reconciler kind": "SpecialResource", "worker count": 1}
2021-08-23T15:00:25.374Z INFO preamble Controller Request {"Name": "simple-kmod", "Namespace": ""}
2021-08-23T15:00:25.577Z INFO cache Nodes cached {"name": "liqcui-ocgcp01-h25hk-worker-a-whrpr.c.openshift-qe.internal"}
2021-08-23T15:00:25.577Z INFO cache Nodes cached {"name": "liqcui-ocgcp01-h25hk-worker-b-4sjbz.c.openshift-qe.internal"}
2021-08-23T15:00:25.577Z INFO cache Nodes cached {"name": "liqcui-ocgcp01-h25hk-worker-c-lml7k.c.openshift-qe.internal"}
2021-08-23T15:00:25.577Z INFO cache Node list: {"length": 3}
2021-08-23T15:00:25.577Z INFO cache Nodes {"num": 3}
2021-08-23T15:00:25.776Z INFO upgrade History {"entry": "registry.ci.openshift.org/ocp/release@sha256:7e7fdee3d2a4ea95db5e356fed533fc80e350544633f175d438072e51ae8f0ff"}
E0823 15:00:32.427535 1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2021-08-23T15:04:41.798Z INFO filter DELETE IsSpecialResource {"GenerationChanged": "simple-kmod"}
2021-08-23T15:05:02.150Z INFO filter CREATE IsSpecialResource {"GenerationChanged": "simple-kmod"}
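A side note on the "For verbose messaging see aws.Config.CredentialsChainVerboseErrors" hint in the error above: with aws-sdk-go v1 that flag lives on aws.Config. The sketch below is only a minimal, hypothetical illustration of enabling it when building a session directly; the vendored credential provider inside SRO wires the flag through its own configuration, so this is not the operator's actual code.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	// Enable verbose chained-credential errors so each provider in the chain
	// (env vars, shared credentials file, EC2 instance role) reports why it
	// failed instead of the generic "NoCredentialProviders" message.
	sess, err := session.NewSession(&aws.Config{
		CredentialsChainVerboseErrors: aws.Bool(true),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Force credential resolution; on a non-AWS host (e.g. a GCP node) this
	// is expected to fail, but now with the detailed per-provider reasons.
	if _, err := sess.Config.Credentials.Get(); err != nil {
		fmt.Println("credential lookup failed:", err)
	}
}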
After reproducing the issue, here is the sequence of events leading to this situation:

- After installing NFD and SRO, we get the simple-kmod created.

- While trying to reconcile, we see the following in the logs:

  2021-08-23T15:00:25.776Z INFO upgrade History {"entry": "registry.ci.openshift.org/ocp/release@sha256:7e7fdee3d2a4ea95db5e356fed533fc80e350544633f175d438072e51ae8f0ff"}
  E0823 15:00:32.427535 1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors

- When enabling the verbose errors, we see the following:

  E0827 08:02:02.065439 1 aws_credentials.go:81] while getting AWS credentials NoCredentialProviders: no valid providers in chain
  caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
  SharedCredsLoad: failed to load profile, .
  EC2RoleRequestError: no EC2 instance role found
  caused by: RequestError: send request failed
  caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

  While this can be misleading, it does not relate to the original issue, since we are not on AWS. We can expect this to happen, but logging it as an error is a bit confusing.

- After creating a simple-kmod CR, nothing happens. We delete it and create it again, and it is not reconciled.

- Looking at the log around the AWS error, there seems to be an infinite loop in the driver-toolkit version logic. Investigating further, we find the following:

  I0827 08:01:55.732875 1 plugins.go:50] Creating DockerKeyring
  I0827 08:01:55.732917 1 plugins.go:64] Trying out provider .dockercfg
  I0827 08:01:55.732928 1 plugins.go:66] Registering credential provider
  I0827 08:01:55.732937 1 plugins.go:64] Trying out provider amazon-ecr
  E0827 08:02:02.065439 1 aws_credentials.go:81] while getting AWS credentials NoCredentialProviders: no valid providers in chain
  caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
  SharedCredsLoad: failed to load profile, .
  EC2RoleRequestError: no EC2 instance role found
  caused by: RequestError: send request failed
  caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  I0827 08:02:02.065524 1 plugins.go:64] Trying out provider azure
  I0827 08:02:02.065538 1 plugins.go:64] Trying out provider google-container-registry
  I0827 08:02:02.065706 1 metadata.go:195] Reading URL with backoff. http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/
  I0827 08:02:03.123781 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused

  So now we see some trouble with the metadata server in GCE. This happens within an endless loop, as can be seen here:
  https://github.com/vdemeester/k8s-pkg-credentialprovider/blob/master/gcp/metadata.go#L193

  This issue (https://github.com/vdemeester/k8s-pkg-credentialprovider/issues/14) looks similar, but the situation is not the same, because here the name is resolved to an IP. There is no DNS problem; it's firewalling/routing rules in the network.
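To make the "endless loop" concrete: the linked gcp/metadata.go reads the metadata URL through a retry helper that backs off but never gives up. The snippet below is a simplified, paraphrased sketch of that pattern (names and constants are illustrative, not copied from the library; the real provider also sets the Metadata-Flavor header), showing why a caller blocks forever while the metadata server is unreachable.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// runWithBackoff keeps calling f with exponential backoff and only returns
// once f succeeds. If the metadata server is unreachable (as in this bug),
// it never returns, so everything upstream of it -- including the reconcile
// path that builds the image keyring -- stays blocked.
func runWithBackoff(f func() ([]byte, error)) []byte {
	backoff := 100 * time.Millisecond
	const maxBackoff = time.Minute
	for {
		value, err := f()
		if err == nil {
			return value
		}
		fmt.Printf("Unable to read URL, retrying in %v: %v\n", backoff, err)
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	const url = "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/"
	body := runWithBackoff(func() ([]byte, error) {
		resp, err := http.Get(url) // rejected in the pod netns -> "connection refused"
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		return io.ReadAll(resp.Body)
	})
	fmt.Println(string(body))
}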
Since this is running in an endless backoff loop, the reconcile logic never gets past this function and no CR is going to be handled, hence the apparent inactivity we see.

- If we exec into the pod we can see the same situation (see the sketch below this comment for how the provider uses the DMI product name):

  ❯ kubectl exec -it -n openshift-special-resource-operator special-resource-controller-manager-5f8b4449b5-wc7qw -c manager -- /bin/bash
  bash-4.4$ cat /sys/class/dmi/id/product_name
  Google Compute Engine
  bash-4.4$ curl http://metadata.google.internal. -v
  * Rebuilt URL to: http://metadata.google.internal./
  * Trying 169.254.169.254...
  * TCP_NODELAY set
  * connect to 169.254.169.254 port 80 failed: Connection refused
  * Failed to connect to metadata.google.internal port 80: Connection refused
  * Closing connection 0
  curl: (7) Failed to connect to metadata.google.internal port 80: Connection refused

- If we do the same thing from one of the nodes, the curl succeeds, so we are only allowed to do it from the VMs themselves. This might be due to a faulty authorization config in GCP when setting up the cluster; can you please check it?
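Regarding the /sys/class/dmi/id/product_name check above: the upstream GCP credential provider decides that it is running on a GCE VM by reading that DMI file, which is why it keeps retrying the metadata endpoint here even though the pod cannot reach it. The following is only a rough, simplified sketch of that detection under that assumption (constants and names are illustrative, not the vendored code verbatim):

package main

import (
	"fmt"
	"os"
	"strings"
)

// gceProductNameFile is the DMI file the provider inspects; on GCP VMs it
// contains "Google" or "Google Compute Engine" (as seen in the pod above).
const gceProductNameFile = "/sys/class/dmi/id/product_name"

// onGCEVM reports whether we appear to be on a GCE virtual machine. A
// positive answer is what sends the provider into the metadata-server
// backoff loop shown earlier.
func onGCEVM() bool {
	data, err := os.ReadFile(gceProductNameFile)
	if err != nil {
		return false
	}
	name := strings.TrimSpace(string(data))
	return name == "Google" || name == "Google Compute Engine"
}

func main() {
	fmt.Println("running on GCE:", onGCEVM())
}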
For reference: https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata
See also https://groups.google.com/g/openshift/c/MnESJ0MUw1Y. As far as I understand, the only way would be to mount the cluster's Google Cloud network inside the pod.
After some investigation we found the cause. Starting from a fresh 4.9 install on GCP we have the initial situation, which is the infinite loop:

I0902 07:21:04.441264 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:05.593287 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:06.809316 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:08.217253 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:10.073291 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused
I0902 07:21:12.697274 1 metadata.go:199] Unable to read URL http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/. Error=Get "http://metadata.google.internal./computeMetadata/v1/instance/service-accounts/": dial tcp 169.254.169.254:80: connect: connection refused

Then we check that the connection is allowed from the node:

[root@ci-ln-jbfhhcb-f76d1-gd5qt-worker-b-6kxcs /]# curl http://metadata.google.internal. -v -i
* Rebuilt URL to: http://metadata.google.internal./
* Trying 169.254.169.254...
* TCP_NODELAY set
* Connected to metadata.google.internal (169.254.169.254) port 80 (#0)
> GET / HTTP/1.1
> Host: metadata.google.internal
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
[...]

Then we go to the pod and perform the same operation:

bash-4.4$ curl http://metadata.google.internal. -v -i
* Rebuilt URL to: http://metadata.google.internal./
* Trying 169.254.169.254...
* TCP_NODELAY set
* connect to 169.254.169.254 port 80 failed: Connection refused
* Failed to connect to metadata.google.internal port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to metadata.google.internal port 80: Connection refused

We spin up a debug pod, install tcpdump, and find that running curl while tcpdump is on only shows DNS traffic to resolve the metadata server IP:

09:23:14.955898 IP debug.50005 > dns-default.openshift-dns.svc.cluster.local.53: 27786+ A? metadata.google.internal.default.svc.cluster.local. (68)
09:23:14.956165 IP debug.50005 > dns-default.openshift-dns.svc.cluster.local.53: 28181+ AAAA? metadata.google.internal.default.svc.cluster.local. (68)
09:23:14.957172 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.50005: 28181 NXDomain*- 0/1/1 (172)
09:23:14.957196 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.50005: 27786 NXDomain*- 0/1/1 (172)
09:23:14.957292 IP debug.57561 > dns-default.openshift-dns.svc.cluster.local.53: 49982+ A? metadata.google.internal.svc.cluster.local. (60)
09:23:14.957504 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.57561: 49982 NXDomain*- 0/1/1 (164)
09:23:14.957792 IP debug.57561 > dns-default.openshift-dns.svc.cluster.local.53: 50413+ AAAA? metadata.google.internal.svc.cluster.local. (60)
09:23:14.958003 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.57561: 50413 NXDomain*- 0/1/1 (164)
09:23:14.958108 IP debug.46008 > dns-default.openshift-dns.svc.cluster.local.53: 19830+ A? metadata.google.internal.cluster.local. (56)
09:23:14.958183 IP debug.46008 > dns-default.openshift-dns.svc.cluster.local.53: 20167+ AAAA? metadata.google.internal.cluster.local. (56)
09:23:14.958950 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46008: 20167 NXDomain*- 0/1/1 (160)
09:23:14.959097 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46008: 19830 NXDomain*- 0/1/1 (160)
09:23:14.959214 IP debug.55770 > dns-default.openshift-dns.svc.cluster.local.53: 12236+ A? metadata.google.internal.c.openshift-gce-devel-ci.internal. (76)
09:23:14.959633 IP debug.55770 > dns-default.openshift-dns.svc.cluster.local.53: 12651+ AAAA? metadata.google.internal.c.openshift-gce-devel-ci.internal. (76)
09:23:14.963824 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.55770: 12651 NXDomain 0/1/1 (192)
09:23:14.963880 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.55770: 12236 NXDomain 0/1/1 (192)
09:23:14.964490 IP debug.37963 > dns-default.openshift-dns.svc.cluster.local.53: 41502+ A? metadata.google.internal.google.internal. (58)
09:23:14.964778 IP debug.37963 > dns-default.openshift-dns.svc.cluster.local.53: 41842+ AAAA? metadata.google.internal.google.internal. (58)
09:23:14.967874 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.37963: 41502 NXDomain 0/1/1 (174)
09:23:14.968422 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.37963: 41842 NXDomain 0/1/1 (174)
09:23:14.969681 IP debug.46076 > dns-default.openshift-dns.svc.cluster.local.53: 55746+ A? metadata.google.internal.c.openshift-gce-devel-c. (66)
09:23:14.969762 IP debug.46076 > dns-default.openshift-dns.svc.cluster.local.53: 56121+ AAAA? metadata.google.internal.c.openshift-gce-devel-c. (66)
09:23:15.011148 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46076: 55746 NXDomain 0/1/1 (152)
09:23:15.026148 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.46076: 56121 NXDomain 0/1/1 (152)
09:23:15.026728 IP debug.45646 > dns-default.openshift-dns.svc.cluster.local.53: 61897+ A? metadata.google.internal. (42)
09:23:15.026954 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.45646: 61897* 1/0/1 A 169.254.169.254 (93)
09:23:15.027205 IP debug.45646 > dns-default.openshift-dns.svc.cluster.local.53: 62240+ AAAA? metadata.google.internal. (42)
09:23:15.028764 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.45646: 62240 0/0/1 (53)
09:23:15.061121 IP debug.58929 > dns-default.openshift-dns.svc.cluster.local.53: 3387+ PTR? 10.0.30.172.in-addr.arpa. (42)
09:23:15.061538 IP dns-default.openshift-dns.svc.cluster.local.53 > debug.58929: 3387*- 1/0/1 PTR dns-default.openshift-dns.svc.cluster.local. (134)

And nothing else. We turn to the iptables rules and find this in SDN:

https://github.com/openshift/sdn/blob/e17993d1116cce1372dfc680752e792e8e51e28a/pkg/cmd/openshift-sdn-cni/openshift-sdn.go#L126-L129

After asking the SDN team, they said it is mainly there for security reasons.
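For readers following the SDN link: the CNI plugin sets up iptables inside each pod's network namespace so that traffic to the link-local metadata address is rejected, which is exactly what produces the "Connection refused" seen from the pod while the node itself can still reach 169.254.169.254. The snippet below is only a hypothetical, simplified illustration of such a rule using github.com/coreos/go-iptables; the actual rule spec, chain, and error handling in openshift-sdn differ.

package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	// In the real CNI plugin this runs inside the pod's network namespace
	// during pod setup, not on the host like this standalone sketch.
	ipt, err := iptables.New()
	if err != nil {
		log.Fatalf("initializing iptables: %v", err)
	}

	// Reject anything the pod sends to the cloud metadata service on port 80.
	// Rejecting (rather than silently dropping) is consistent with curl
	// failing immediately with "Connection refused" instead of timing out.
	rule := []string{"-p", "tcp", "-m", "tcp", "-d", "169.254.169.254", "--dport", "80", "-j", "REJECT"}
	if err := ipt.AppendUnique("filter", "OUTPUT", rule...); err != nil {
		log.Fatalf("adding metadata block rule: %v", err)
	}
	log.Println("rule blocking 169.254.169.254:80 is in place")
}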
Is there anything to take action on for this BZ or is this trending as NOT A BUG?
We are looking into different alternatives for extracting the info we need without using the metadata server, but this is a bug when using the SDN CNI on GCP.
Working on a solution already.
Verified on OCP 4.10.0-0.nightly-2021-10-15-141730 on an IPI GCP cluster:

Deploy the Node Feature Discovery operator from OperatorHub and create a NodeFeatureDiscovery instance.

git clone https://github.com/openshift/special-resource-operator.git
cd special-resource-operator
git checkout master

### make local-image-build local-image-push deploy
make local-image-build local-image-push deploy
podman tag <imageid> quay.io/<my_username>/special-resource-operator:master
podman push quay.io/<my_username>/special-resource-operator:master
IMAGE=quay.io/<my_username>/special-resource-operator:master make deploy

# oc get pods -n openshift-special-resource-operator
NAME                                                   READY   STATUS    RESTARTS   AGE
special-resource-controller-manager-5bc9996b47-bkplz   2/2     Running   0          5m11s

# cat /root/simple-kmod-local.yaml
apiVersion: sro.openshift.io/v1beta1
kind: SpecialResource
metadata:
  name: simple-kmod
spec:
  namespace: simple-kmod
  chart:
    name: simple-kmod
    version: 0.0.1
    repository:
      name: example
      url: file:///charts/example
  set:
    kind: Values
    apiVersion: sro.openshift.io/v1beta1
    kmodNames: ["simple-kmod", "simple-procfs-kmod"]
    buildArgs:
    - name: "KMODVER"
      value: "SRO"
  driverContainer:
    source:
      git:
        ref: "master"
        uri: "https://github.com/openshift-psap/kvc-simple-kmod.git"

# create the simple-kmod special resource:
oc create -f /root/simple-kmod-local.yaml
specialresource.sro.openshift.io/simple-kmod created

# wait a few minutes for the simple-kmod-driver-build to complete
# oc get pods -n simple-kmod
NAME                                                  READY   STATUS      RESTARTS   AGE
simple-kmod-driver-build-bfb16b50984f16f0-1-build     0/1     Completed   0          7m19s
simple-kmod-driver-container-bfb16b50984f16f0-mzsrb   1/1     Running     0          8m11s
simple-kmod-driver-container-bfb16b50984f16f0-qdmg2   1/1     Running     0          8m11s
simple-kmod-driver-container-bfb16b50984f16f0-xbg7t   1/1     Running     0          8m11s

# oc exec -n simple-kmod -it pod/simple-kmod-driver-container-bfb16b50984f16f0-mzsrb -- lsmod | grep simple
simple_procfs_kmod     16384  0
simple_kmod            16384  0
# oc exec -n simple-kmod -it pod/simple-kmod-driver-container-bfb16b50984f16f0-qdmg2 -- lsmod | grep simple
simple_procfs_kmod     16384  0
simple_kmod            16384  0
# oc exec -n simple-kmod -it pod/simple-kmod-driver-container-bfb16b50984f16f0-xbg7t -- lsmod | grep simple
simple_procfs_kmod     16384  0
simple_kmod            16384  0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056