Bug 1831006 - IPv6 bare metal deployment with calico has kube-proxy binding on 0.0.0.0 instead of the IPv6 address
Summary: IPv6 bare metal deployment with calico has kube-proxy binding on 0.0.0.0 instead of the IPv6 address
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ben Bennett
QA Contact: Boris Deschenes
URL:
Whiteboard:
Depends On:
Blocks: 1847969
 
Reported: 2020-05-04 13:34 UTC by Boris Deschenes
Modified: 2020-10-27 15:58 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1833048 1847969 (view as bug list)
Environment:
Last Closed: 2020-10-27 15:58:32 UTC
Target Upstream Version:
Embargoed:




Links
* Github openshift cluster-network-operator pull 620 (closed): bug 1831006: For third party plugins enable testing of IPv6 single stack (last updated 2020-10-06 18:48:29 UTC)
* Github openshift sdn pull 152 (closed): kube-proxy use node-ip to detect the IP family (last updated 2020-10-06 18:48:29 UTC)
* Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 15:58:56 UTC)

Description Boris Deschenes 2020-05-04 13:34:25 UTC
Description of problem:
On a bare metal, single-stack IPv6 deployment, testing any other CNI is impossible because kube-proxy gets configured to bind to "0.0.0.0", which makes the deployment fail.

Version-Release number of selected component (if applicable):
OpenShift 4.4

How reproducible:
Deploying with any networkType other than OVNKubernetes on a single-stack IPv6 cluster results in a failed deployment when kube-proxy tries to bind to 0.0.0.0.

Steps to Reproduce:
1. Perform a single-stack IPv6 bare metal deployment
2. Set networkType = Calico (see the install-config sketch after these steps)
3. Observe that kube-proxy cannot start because its configuration points to 0.0.0.0
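
As a rough sketch only (not taken from the actual cluster), the relevant install-config.yaml networking section would look something like this. The IPv6 values are illustrative, except for the clusterNetwork CIDR, which matches the clusterCIDR seen in the generated kube-proxy config in comment 1; the Calico manifests themselves still have to be supplied separately:

networking:
  networkType: Calico
  machineNetwork:
  - cidr: 2001:4958:a:3e00::/64            # illustrative machine network containing the node IPs
  clusterNetwork:
  - cidr: 2001:4958:a:3e00:0:1:100:0/104   # same value as the clusterCIDR in comment 1
    hostPrefix: 116                        # illustrative; must be >= the cluster CIDR prefix length
  serviceNetwork:
  - fd02::/112                             # illustrative IPv6 service network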

Actual results:
deployment fails

Expected results:
We should be able to TEST other CNIs with a single-stack IPv6 deployment just like we can TEST them with IPv4.

Additional info:

Comment 1 Boris Deschenes 2020-05-04 13:52:24 UTC
These are the kube-proxy logs we get when trying a CNI other than OVNKubernetes on an IPv6 single-stack, bare metal deployment:

2020-05-04T13:43:49.547255619+00:00 stderr F I0504 13:43:49.547248       1 server.go:536] Neither kubeconfig file nor master URL was specified. Falling back to in-cluster config.
2020-05-04T13:43:49.555761603+00:00 stderr F I0504 13:43:49.555662       1 node.go:135] Successfully retrieved node IP: 2001:4958:a:3e00:0:1:2:10
2020-05-04T13:43:49.555761603+00:00 stderr F I0504 13:43:49.555679       1 server_others.go:145] Using iptables Proxier.
2020-05-04T13:43:49.555761603+00:00 stderr F F0504 13:43:49.555715       1 server.go:485] unable to create proxier: clusterCIDR 2001:4958:a:3e00:0:1:100:0/104 has incorrect IP version: expect isIPv6=false

And this is the generated kube-proxy configuration that wrongly tries to bind to "0.0.0.0":

apiVersion: kubeproxy.config.k8s.io/v1alpha1
bindAddress: 0.0.0.0
clientConnection:
  acceptContentTypes: ""
  burst: 0
  contentType: ""
  kubeconfig: ""
  qps: 0
clusterCIDR: 2001:4958:a:3e00:0:1:100:0/104
configSyncPeriod: 0s
conntrack:
  maxPerCore: null
  min: null
  tcpCloseWaitTimeout: null
  tcpEstablishedTimeout: null
enableProfiling: false
healthzBindAddress: 0.0.0.0:10256
hostnameOverride: ""
iptables:
  masqueradeAll: false
  masqueradeBit: null
  minSyncPeriod: 0s
  syncPeriod: 0s
ipvs:
  excludeCIDRs: null
  minSyncPeriod: 0s
  scheduler: ""
  strictARP: false
  syncPeriod: 0s
kind: KubeProxyConfiguration
metricsBindAddress: 0.0.0.0:9101
mode: iptables
nodePortAddresses: null
oomScoreAdj: null
portRange: ""
udpIdleTimeout: 0s
winkernel:
  enableDSR: false
  networkName: ""
  sourceVip: ""
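
For comparison, a sketch of how the bind-related fields (and clusterCIDR) would need to be rendered on an IPv6 single-stack cluster; this is inferred from the discussion further down, not copied from an actual generated config:

bindAddress: "::"
clusterCIDR: 2001:4958:a:3e00:0:1:100:0/104
healthzBindAddress: "[::]:10256"
metricsBindAddress: "[::]:9101"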

Comment 5 zhaozhanqi 2020-05-09 03:51:43 UTC
Hi, Marius
Could you help verify this bug? It needs an IPv6 and Calico cluster.

Comment 6 Boris Deschenes 2020-05-11 13:40:14 UTC
Just some more information: we're in a lab at the biggest Canadian telco, trying to prove that OpenShift can be used to build an open source edge platform for them. We're running single-stack IPv6 successfully, but they need to "try" and "test" Calico to integrate the pods into their network fabric. We had the same blocker on OCS, where the config generated by the operator was simply IPv4-only, but OCS refused to fix it, so the customer is now using rook/ceph instead of OCS.

Please, please consider this bug as important, and keep in mind that it could be as simple as generating a configuration that binds to "::/0" in the case of IPv6 and "0.0.0.0" for IPv4. It was as simple as that in the similar OCS issue (BZ 1831693, same principle, operator generating IPv4-only config).

Comment 9 zhaozhanqi 2020-05-13 03:40:01 UTC
(In reply to Boris Deschenes from comment #6)
> just some more information: so basically we're in a lab in the biggest
> canadian telco, trying to prove that openshift can be used to build an open
> source edge platform for them, we're running single-stack IPv6 successfully
> but they need to "try" and "test" calico to integrate the pods into their
> network fabric.. We had the same blocker on OCS where the config generated
> by the operator was simply IPv4-only.. but OCS refused to fix so the
> customer is now using rook/ceph instead of OCS.
> 
> Please, please consider this bug as important and keep in mind that it could
> be as simple as generating a configuration that binds to "::/0" in case of
> IPv6 and "0.0.0.0" for IPv4.. it was as simple as that in the similar OCS
> issue (BZ 1831693, same principle, operator generating IPv4-only config).

Hi Boris, I'm not sure if you can help verify this issue with IPv6 and the Calico plugin, since for now QE does not have any experience with IPv6 and the Calico plugin. Thanks.

Comment 10 Douglas Smith 2020-05-13 17:46:25 UTC
I noticed that there's actually a configurable `bindAddress` for kube-proxy in the Network CRD for the cluster-network-operator.

I had a chat with Boris and sent him the kube-proxy parameters for the Network CRD from https://github.com/openshift/cluster-network-operator/#configuring-kube-proxy

Also mocked up a sample usage of it (untested) in this pastebin @ https://pastebin.com/Yf1qQbMK
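
For the record, a minimal (and untested) sketch of that kind of override on the cluster's Network operator config, along the lines of the pastebin above; field names follow the cluster-network-operator README linked above:

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  kubeProxyConfig:
    bindAddress: "::"    # IPv6 unspecified address instead of the default 0.0.0.0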

Comment 11 zhaozhanqi 2020-05-28 03:10:10 UTC
Hi Boris, have you tried comment 10? If it works for you, we can mark this bug as verified. Thanks.

Comment 12 Boris Deschenes 2020-06-02 18:52:34 UTC
Hi,

Yes, we've tried adding the kube-proxy config parameters to the Network CRD, but it looks like these parameters are simply ignored by the operator; we do not see any change in the kube-proxy configuration whether we pass these additional parameters or not. I think Doug had the same result in his lab.

So, although overriding the kube-proxy configuration through the Network CRD could really be our way out, it seems the mechanism is not currently working.

Comment 13 Antonio Ojea 2020-06-02 22:53:20 UTC
I think this is a bug in kube-proxy: it should use the node IP if the bind address is an unspecified address. In fact, for golang there is no difference when binding between using 0.0.0.0 or :: (https://github.com/kubernetes/kubernetes/issues/88458).


In this case we can see in the log that it successfully retrieved an IPv6 node IP, hence it should work in IPv6 mode.

> 2020-05-04T13:43:49.555761603+00:00 stderr F I0504 13:43:49.555662       1 node.go:135] Successfully retrieved node IP: 2001:4958:a:3e00:0:1:2:10

I'm new here; I'll fix it upstream and then follow up here for guidance. I don't know how the bugzilla assignment/workflow works, though, or if I should/can take over the ticket :-)

Comment 14 Antonio Ojea 2020-06-04 16:18:18 UTC
Submitted a PR to fix this upstream:

https://github.com/kubernetes/kubernetes/pull/91725

I have to check now why https://bugzilla.redhat.com/show_bug.cgi?id=1831006#c10 didn't work

Comment 15 Antonio Ojea 2020-06-05 10:15:18 UTC
In the meantime, as a workaround, the cluster-network-operator will configure the bind address with the same family as the ClusterCIDR.
So, omitting the bindAddress in the configuration should make it work in IPv6 mode if the ClusterCIDR is IPv6.
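
In other words, a sketch of the same Network operator override as in comment 10, but relying on the workaround (field names again per the cluster-network-operator README; the extra field is only illustrative):

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  kubeProxyConfig:
    # bindAddress intentionally omitted: the operator derives the bind address
    # family from the ClusterCIDR, so an IPv6 ClusterCIDR yields an IPv6 bind address
    iptablesSyncPeriod: 30s   # illustrative; any other supported field could be set here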

Comment 17 zhaozhanqi 2020-06-17 09:59:16 UTC
As this PR is still open (https://github.com/openshift/sdn/pull/152), moving this bug to 'POST' for now.

Comment 18 Ben Bennett 2020-06-17 13:03:32 UTC
Moving to 4.6 so we can track the backport.

Comment 19 Antonio Ojea 2020-06-23 13:31:56 UTC
*** Bug 1847969 has been marked as a duplicate of this bug. ***

Comment 21 zhaozhanqi 2020-07-16 02:17:01 UTC
Could you help verify this bug with Calico and an IPv6 cluster?

Comment 22 Boris Deschenes 2020-08-03 14:55:53 UTC
hi guys,

OK, here is the result of the same deployment with the patch https://github.com/kubernetes/kubernetes/pull/91725 in place:

As we can see below, kube-proxy no longer tries to bind to 0.0.0.0 in an IPv6 environment and correctly assumes IPv6 operation. I still see "incorrect IP version" messages, but those could be coming from the calico configuration.

The deployment stalls without many errors in the logs. I end up with masters running 4 pods:
* calico-node
* calico-typha
* kube-proxy
* kube-multus

Since it stalls at this point and my only "errors" are in the kube-proxy logs, I'll investigate those.


kube-proxy logs:

2020-08-03T14:41:16.021578841+00:00 stderr F I0803 14:41:16.021371       1 server_others.go:96] IPv6 bind address (::), assume IPv6 operation
2020-08-03T14:41:16.025356057+00:00 stderr F W0803 14:41:16.025326       1 proxier.go:625] Failed to read file /lib/modules/4.18.0-193.13.2.el8_2.x86_64/modules.builtin with error open /lib/modules/4.18.0-193.13.2.el8_2.x86_64/modules.builtin: no such file or directory. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2020-08-03T14:41:16.026602419+00:00 stderr F W0803 14:41:16.026571       1 proxier.go:635] Failed to load kernel module ip_vs with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2020-08-03T14:41:16.027429499+00:00 stderr F W0803 14:41:16.027410       1 proxier.go:635] Failed to load kernel module ip_vs_rr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2020-08-03T14:41:16.028390299+00:00 stderr F W0803 14:41:16.028368       1 proxier.go:635] Failed to load kernel module ip_vs_wrr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2020-08-03T14:41:16.029178918+00:00 stderr F W0803 14:41:16.029160       1 proxier.go:635] Failed to load kernel module ip_vs_sh with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2020-08-03T14:41:16.030092526+00:00 stderr F W0803 14:41:16.030073       1 proxier.go:635] Failed to load kernel module nf_conntrack_ipv4 with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2020-08-03T14:41:16.030165502+00:00 stderr F I0803 14:41:16.030148       1 server.go:548] Neither kubeconfig file nor master URL was specified. Falling back to in-cluster config.
2020-08-03T14:41:16.042493177+00:00 stderr F I0803 14:41:16.042464       1 node.go:136] Successfully retrieved node IP: 2001:4958:a:3e00:0:1:2:30
2020-08-03T14:41:16.042513420+00:00 stderr F I0803 14:41:16.042491       1 server_others.go:186] Using iptables Proxier.
2020-08-03T14:41:16.042765223+00:00 stderr F I0803 14:41:16.042742       1 server.go:583] Version: v0.0.0-master+$Format:%h$
2020-08-03T14:41:16.043081454+00:00 stderr F I0803 14:41:16.043063       1 conntrack.go:52] Setting nf_conntrack_max to 262144
2020-08-03T14:41:16.043172109+00:00 stderr F I0803 14:41:16.043155       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
2020-08-03T14:41:16.043224015+00:00 stderr F I0803 14:41:16.043208       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
2020-08-03T14:41:16.043442180+00:00 stderr F I0803 14:41:16.043401       1 config.go:133] Starting endpoints config controller
2020-08-03T14:41:16.043455412+00:00 stderr F I0803 14:41:16.043439       1 shared_informer.go:223] Waiting for caches to sync for endpoints config
2020-08-03T14:41:16.043495117+00:00 stderr F I0803 14:41:16.043479       1 config.go:315] Starting service config controller
2020-08-03T14:41:16.043505359+00:00 stderr F I0803 14:41:16.043494       1 shared_informer.go:223] Waiting for caches to sync for service config
2020-08-03T14:41:16.047663280+00:00 stderr F E0803 14:41:16.047635       1 utils.go:223] 192.0.2.2 in endpoints has incorrect IP version (service openshift-etcd/host-etcd).
2020-08-03T14:41:16.047695848+00:00 stderr F E0803 14:41:16.047678       1 utils.go:223] 192.0.2.200 in endpoints has incorrect IP version (service openshift-etcd/host-etcd).
2020-08-03T14:41:16.047695848+00:00 stderr F E0803 14:41:16.047691       1 utils.go:223] 192.0.2.3 in endpoints has incorrect IP version (service openshift-etcd/host-etcd).
2020-08-03T14:41:16.047706967+00:00 stderr F E0803 14:41:16.047698       1 utils.go:223] 192.0.2.4 in endpoints has incorrect IP version (service openshift-etcd/host-etcd).
2020-08-03T14:41:16.143769916+00:00 stderr F I0803 14:41:16.143660       1 shared_informer.go:230] Caches are synced for service config
2020-08-03T14:41:16.143769916+00:00 stderr F I0803 14:41:16.143672       1 shared_informer.go:230] Caches are synced for endpoints config

Comment 23 zhaozhanqi 2020-08-04 02:09:45 UTC
Thanks, Boris Deschenes.

Can we move this bug to 'verified', since the original issue has already been fixed?

Comment 24 Boris Deschenes 2020-08-04 21:03:39 UTC
agreed, verified it is

Comment 27 errata-xmlrpc 2020-10-27 15:58:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

