
Bug 1913878

Summary:	When using a cluster with IPv6, nfd-worker can't connect and restarts with "too many colons in address" error
Product:	OpenShift Container Platform	Reporter:	Bob Fournier <bfournie>
Component:	Node Feature Discovery Operator	Assignee:	Carlos Eduardo Arango Gutierrez <carangog>
Status:	CLOSED DUPLICATE	QA Contact:	Walid A. <wabouham>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.7	CC:	carangog, sejug, tsedovic
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-01-25 22:24:39 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Bob Fournier 2021-01-07 19:22:21 UTC

Description of problem:

I've set up a cluster with IPv6 and installed the cluster-nfd-operator. All of the pods come up OK, but the nfd-worker can't connect to nfd-master and continually restarts. It looks like the IPv6 address is not enclosed in '[]' brackets, so the port is treated as part of the IPv6 address:
"address fd02::77ec:12000: too many colons in address"
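This error text comes from Go's standard `net` package. When it parses a dial target, it splits the string at the last colon and rejects an unbracketed host that still contains colons. The following minimal sketch reproduces that parse with `net.SplitHostPort`. The `describe` helper is hypothetical and exists only for illustration:

```go
package main

import (
	"fmt"
	"net"
)

// describe returns a one-line summary of how net.SplitHostPort parses target.
// (Hypothetical helper for illustration.)
func describe(target string) string {
	host, port, err := net.SplitHostPort(target)
	if err != nil {
		return fmt.Sprintf("%s -> error: %v", target, err)
	}
	return fmt.Sprintf("%s -> host=%s port=%s", target, host, port)
}

func main() {
	for _, target := range []string{
		"fd02::77ec:12000",   // unbracketed IPv6 host:port, as in the nfd-worker log
		"[fd02::77ec]:12000", // same address with the IPv6 host in brackets
	} {
		fmt.Println(describe(target))
	}
}
```

For the unbracketed form, this prints the same "too many colons in address" message seen in the logs. The bracketed form splits cleanly into host `fd02::77ec` and port `12000`.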

[stack@openshift-master-0 cluster-nfd-operator]$ oc get pod -A | grep nfd
openshift-nfd                                      nfd-master-55mhq                                              1/1     Running            0          178m
openshift-nfd                                      nfd-master-96pqt                                              1/1     Running            0          178m
openshift-nfd                                      nfd-master-w8fwn                                              1/1     Running            0          178m
openshift-nfd                                      nfd-operator-58cdbbb559-9gcgf                                 1/1     Running            0          178m
openshift-nfd                                      nfd-worker-5qbg2                                              1/1     Running            33         178m
openshift-nfd                                      nfd-worker-jkjwz                                              1/1     Running            33         178m

On the worker node, the nfd-worker container has exited:
[core@worker-0 ~]$ sudo crictl ps -a | grep nfd-worker
2ef3ac57c9d8b       virthost.ostest.test.metalkube.org:5000/localimages/origin-node-feature-discovery@sha256:75929c498301af285a8dcca4b17a45d5b53062c28b3a672a07b791be371757a1   4 minutes ago       Exited              nfd-worker                       32                  8b4166c54dbae

[core@worker-0 ~]$ sudo crictl logs 2ef3ac57c9d8b
2021/01/07 18:59:08 Node Feature Discovery Worker 1.15
2021/01/07 18:59:08 NodeName: 'worker-0.ostest.test.metalkube.org'
INFO: 2021/01/07 18:59:08 parsed scheme: ""
INFO: 2021/01/07 18:59:08 scheme "" not registered, fallback to default scheme
INFO: 2021/01/07 18:59:08 ccResolverWrapper: sending update to cc: {[{fd02::77ec:12000  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2021/01/07 18:59:08 ClientConn switching balancer to "pick_first"
WARNING: 2021/01/07 18:59:08 grpc: addrConn.createTransport failed to connect to {fd02::77ec:12000  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: address fd02::77ec:12000: too many colons in address". Reconnecting...
WARNING: 2021/01/07 18:59:09 grpc: addrConn.createTransport failed to connect to {fd02::77ec:12000  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: address fd02::77ec:12000: too many colons in address". Reconnecting...
WARNING: 2021/01/07 18:59:10 grpc: addrConn.createTransport failed to connect to {fd02::77ec:12000  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: address fd02::77ec:12000: too many colons in address". Reconnecting...
WARNING: 2021/01/07 18:59:13 grpc: addrConn.createTransport failed to connect to {fd02::77ec:12000  <nil> 0 <nil>}. Err :connection error: desc = "transport: ...
2021/01/07 19:00:08 ERROR: failed to connect: context deadline exceeded


Version-Release number of selected component (if applicable):

Using these images:
quay.io/openshift-psap/cluster-nfd-operator                              BZ1906129                                 fb5bfca77c05  3 weeks ago    292 MB
quay.io/openshift/origin-node-feature-discovery                          4.7                                       211b098df529  3 weeks ago    256 MB

How reproducible:
Happens every time.

Steps to Reproduce:
1. Pull the images listed above using podman pull.
2. Push them to a local image registry so that the master can access the images locally instead of attempting to reach quay at an IPv4 address:

$ sudo podman push --tls-verify=false --authfile /opt/dev-scripts/pull_secret.json fb5bfca77c05 virthost.ostest.test.metalkube.org:5000/localimages/origin-cluster-nfd-operator:master
$ sudo podman push --tls-verify=false --authfile /opt/dev-scripts/pull_secret.json 211b098df529 virthost.ostest.test.metalkube.org:5000/localimages/origin-node-feature-discovery:4.7

3. Update the Makefile and manifests/0700_cr.yaml to use the local images.
4. In cluster-nfd-operator, run 'make deploy'.

Actual results:

nfd-worker continuously restarts with connection errors.

Expected results:
nfd-worker connects to nfd-master, and the node labels it updates are visible.


Additional info:

Comment 1 Bob Fournier 2021-01-08 01:40:47 UTC

I think the issue is here:
https://github.com/openshift/cluster-nfd-operator/blob/master/assets/worker/0700_worker_daemonset.yaml#L44

If NFD_MASTER_SERVICE_HOST is an IPv6 address, it needs to be enclosed in brackets to separate it from ":$(NFD_MASTER_SERVICE_PORT)".

Comment 2 Carlos Eduardo Arango Gutierrez 2021-01-25 22:24:39 UTC


*** This bug has been marked as a duplicate of bug 1823765 ***