Bug 1913878 - When using a cluster with IPv6, nfd-worker can't connect and restarts with "too many colons in address" error
Keywords:
Status: CLOSED DUPLICATE of bug 1823765
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Feature Discovery Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Carlos Eduardo Arango Gutierrez
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-07 19:22 UTC by Bob Fournier
Modified: 2021-01-25 22:27 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-25 22:24:39 UTC
Target Upstream Version:
Embargoed:



Description Bob Fournier 2021-01-07 19:22:21 UTC
Description of problem:

I've set up a cluster with IPv6 and installed the cluster-nfd-operator.  All of the pods come up OK, but the nfd-worker can't connect and continually restarts.  It looks like the IPv6 address is not enclosed in '[]' brackets, so the port is treated as part of the IPv6 address:
"address fd02::77ec:12000: too many colons in address"

[stack@openshift-master-0 cluster-nfd-operator]$ oc get pod -A | grep nfd
openshift-nfd                                      nfd-master-55mhq                                              1/1     Running            0          178m
openshift-nfd                                      nfd-master-96pqt                                              1/1     Running            0          178m
openshift-nfd                                      nfd-master-w8fwn                                              1/1     Running            0          178m
openshift-nfd                                      nfd-operator-58cdbbb559-9gcgf                                 1/1     Running            0          178m
openshift-nfd                                      nfd-worker-5qbg2                                              1/1     Running            33         178m
openshift-nfd                                      nfd-worker-jkjwz                                              1/1     Running            33         178m

On the worker node, the nfd-worker container has exited:
[core@worker-0 ~]$ sudo crictl ps -a | grep nfd-worker
2ef3ac57c9d8b       virthost.ostest.test.metalkube.org:5000/localimages/origin-node-feature-discovery@sha256:75929c498301af285a8dcca4b17a45d5b53062c28b3a672a07b791be371757a1   4 minutes ago       Exited              nfd-worker                       32                  8b4166c54dbae

[core@worker-0 ~]$ sudo crictl logs 2ef3ac57c9d8b
2021/01/07 18:59:08 Node Feature Discovery Worker 1.15
2021/01/07 18:59:08 NodeName: 'worker-0.ostest.test.metalkube.org'
INFO: 2021/01/07 18:59:08 parsed scheme: ""
INFO: 2021/01/07 18:59:08 scheme "" not registered, fallback to default scheme
INFO: 2021/01/07 18:59:08 ccResolverWrapper: sending update to cc: {[{fd02::77ec:12000  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2021/01/07 18:59:08 ClientConn switching balancer to "pick_first"
WARNING: 2021/01/07 18:59:08 grpc: addrConn.createTransport failed to connect to {fd02::77ec:12000  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: address fd02::77ec:12000: too many colons in address". Reconnecting...
WARNING: 2021/01/07 18:59:09 grpc: addrConn.createTransport failed to connect to {fd02::77ec:12000  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: address fd02::77ec:12000: too many colons in address". Reconnecting...
WARNING: 2021/01/07 18:59:10 grpc: addrConn.createTransport failed to connect to {fd02::77ec:12000  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: address fd02::77ec:12000: too many colons in address". Reconnecting...
WARNING: 2021/01/07 18:59:13 grpc: addrConn.createTransport failed to connect to {fd02::77ec:12000  <nil> 0 <nil>}. Err :connection error: desc = "transport: ...
2021/01/07 19:00:08 ERROR: failed to connect: context deadline exceeded


Version-Release number of selected component (if applicable):

Using these images:
quay.io/openshift-psap/cluster-nfd-operator                              BZ1906129                                 fb5bfca77c05  3 weeks ago    292 MB
quay.io/openshift/origin-node-feature-discovery                          4.7                                       211b098df529  3 weeks ago    256 MB

How reproducible:
Happens every time.

Steps to Reproduce:
1. Download images as above using podman pull
2. Use a local image store so that the master can access the images locally instead of attempting to reach quay at an IPv4 address

$ sudo podman push --tls-verify=false --authfile /opt/dev-scripts/pull_secret.json fb5bfca77c05 virthost.ostest.test.metalkube.org:5000/localimages/origin-cluster-nfd-operator:master
$ sudo podman push --tls-verify=false --authfile /opt/dev-scripts/pull_secret.json 211b098df529 virthost.ostest.test.metalkube.org:5000/localimages/origin-node-feature-discovery:4.7

3. Update Makefile and manifests/0700_cr.yaml to use the local images
4. In cluster-nfd-operator run 'make deploy'

Actual results:

nfd-worker continuously restarts with connection errors

Expected results:
Can view labels updated by nfd-worker


Additional info:

Comment 1 Bob Fournier 2021-01-08 01:40:47 UTC
I think the issue is here:
https://github.com/openshift/cluster-nfd-operator/blob/master/assets/worker/0700_worker_daemonset.yaml#L44

If NFD_MASTER_SERVICE_HOST is an IPv6 address, it needs to be enclosed in brackets to separate it from the ":$(NFD_MASTER_SERVICE_PORT)" suffix.
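
To illustrate the bracket requirement, here is a minimal Go sketch. It assumes the worker ends up passing the concatenated "host:port" string to Go's TCP dialer, which the "dial tcp" text in the log suggests but which is not confirmed here. The helper name masterAddress is hypothetical and not taken from the nfd source; net.SplitHostPort and net.JoinHostPort are standard library functions.

```go
package main

import (
	"fmt"
	"net"
)

// masterAddress is a hypothetical helper (not from the nfd code base).
// net.JoinHostPort wraps the host in square brackets when it contains
// a colon, which is what an IPv6 literal needs before ":port".
func masterAddress(host, port string) string {
	return net.JoinHostPort(host, port)
}

func main() {
	// Naive concatenation, mirroring "$(HOST):$(PORT)" with the address
	// and port seen in the worker log.
	naive := "fd02::77ec" + ":" + "12000"
	if _, _, err := net.SplitHostPort(naive); err != nil {
		// The host/port boundary is ambiguous, so parsing fails.
		fmt.Println("naive form rejected:", err)
	}

	// Bracketed form for the same IPv6 literal.
	fmt.Println(masterAddress("fd02::77ec", "12000"))

	// An IPv4 address (illustrative value) is left without brackets.
	fmt.Println(masterAddress("10.0.0.1", "12000"))
}
```

Because the DaemonSet manifest builds the address by plain string substitution, an equivalent fix there would have to add the brackets in the template itself or move the joining into the worker; which approach the duplicate bug took is not stated in this report.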

Comment 2 Carlos Eduardo Arango Gutierrez 2021-01-25 22:24:39 UTC

*** This bug has been marked as a duplicate of bug 1823765 ***

