Bug 1852112 - Host-prefix requirement for OCP 4 installer is breaking cluster networking with Tigera Calico SDN CNI plugin
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Surya Seetharaman
QA Contact: David Johnston
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-29 18:42 UTC by David Johnston
Modified: 2023-12-15 18:20 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-28 17:18:40 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift api pull 697 | 0 | None | closed | Bug 1852112: Make hostPrefix optional for the sake of non-(sdn/ovn) plugins | 2021-01-18 14:47:23 UTC
Github | openshift cluster-network-operator pull 709 | 0 | None | closed | Bug 1852112: Ignore hostPrefix validation for non-(sdn/ovn) plugins | 2021-01-18 14:47:23 UTC
Github | openshift installer pull 3888 | 0 | None | closed | Bug 1852112: Ignore hostPrefix validation for non-(sdn/ovn) plugins | 2021-01-18 14:47:23 UTC

Comment 1 Ben Bennett 2020-06-29 19:12:12 UTC
Moving to release 4.6 so we can fix it and then we can consider the backport.

Comment 3 Arvin Amirian 2020-06-30 19:54:54 UTC
Clarification:

There are two issues here: the OCP host-prefix requirement and the multi-VLAN deployment.

1. Host Prefix requirement
Calico does not require a large CIDR for the host prefix. A /28 is an acceptable size, since Calico will just assign a new /28 block when the node runs out of IPs. Deployments on nodes with a single network work, but when the nodes are attached to 2 VLANs there are communication issues and openshift-apiserver never comes up. This gets addressed if a larger host prefix size is used. In this case the masters and workers are both on VLAN1 and VLAN2.

2. OCP certificate issue with multi-VLAN configuration

In this scenario, the master has a single NIC on VLAN1 and the workers have two NICs, one on VLAN1 and the other on VLAN2. Calico tries to communicate with the masters using both NICs, but the communication from VLAN2 is always rejected with complaints about the cert belonging only to the VLAN1 interface. We see a torrent of CSRs from the workers, but no matter how many times you approve them, they keep getting regenerated. It looks like the worker is generating a new request using each interface. The result is that the install never completes because openshift-apiserver never comes up.
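
As a rough illustration of the sizing argument in issue 1, the sketch below compares a /28 block with a /23 block (the OpenShift default hostPrefix). The pod count of 100 per node is an assumed figure for illustration only, and the helper names are hypothetical; this is not taken from Calico or the installer.

```python
import math


def addresses_in_block(prefix_len: int) -> int:
    """Number of IPv4 addresses in a block with the given prefix length."""
    return 2 ** (32 - prefix_len)


def blocks_needed(pods: int, prefix_len: int) -> int:
    """How many blocks of this size are needed to hold `pods` pod IPs."""
    return math.ceil(pods / addresses_in_block(prefix_len))


# Assumed pod count per node, for illustration only.
PODS_PER_NODE = 100

for prefix in (28, 23):
    print(
        f"/{prefix}: {addresses_in_block(prefix)} addresses per block, "
        f"{blocks_needed(PODS_PER_NODE, prefix)} block(s) for {PODS_PER_NODE} pods"
    )
```

With a /28, a node that outgrows its 16 addresses simply needs more blocks, which is the behaviour described for Calico above.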

Comment 4 Frank Hirtz 2020-07-01 14:19:01 UTC
Cutting visibility to RH since this isn't a general issue.

Thoughts:

In discussing this yesterday, 1) above (the hostPrefix item) can be worked around for the moment. Not ideal, but it shouldn't be a blocker. That the install sporadically fails with a constrained hostPrefix seems to indicate that, even though Calico doesn't make use of it, it has an effect on the install or pod spin-up (the cri-o default network or something)?

Item 2) is the blocker item. It sounds like we're round-robining between the 2 interfaces on the workers in terms of behavior. If we have a worker specific subnet (VLAN2) and one for the control plane (VLAN1), shouldn't all traffic for the API go through the route to VLAN1? Why would a worker try to talk to a master through the worker VLAN? I'm guessing this is a BGP side-effect since both subnets are advertised. That's just a guess though on my part.

If it's a Calico routing thing, I don't know what we can do about that but is there a way to fix the certificate issue(s) so that this doesn't error/fail on the (misrouted) calls?

Comment 5 Arvin Amirian 2020-07-06 15:29:20 UTC
Further analysis on issue 2.

It seems that when Calico is set as the SDN in the VLAN configuration described earlier, the kube-controller-manager and kube-apiserver end up with different signer certs.


[root@ocp5-control3 ~]# openssl crl2pkcs7 -nocrl -certfile /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/csr-signer/tls.crt | openssl pkcs7 -print_certs -text -noout |  grep -i Subject:
        Subject: CN=kube-csr-signer_@1593229954
[root@ocp5-control3 ~]# openssl crl2pkcs7 -nocrl -certfile /etc/kubernetes/static-pod-resources/kube-apiserver-pod-3/configmaps/kubelet-serving-ca/ca-bundle.crt | openssl pkcs7 -print_certs -text -noout |  grep -i Subject:
        Subject: CN=kube-csr-signer_@1593161209
        Subject: OU=openshift, CN=kubelet-signer

Comment 6 David Johnston 2020-07-06 15:48:35 UTC
Please make this public so that the partner, Tigera, can view and comment. An engineer is still being requested for a call with the partner.

Comment 8 Pooriya Aghaalitari 2020-07-07 17:23:53 UTC
This is Pooriya from Tigera. Could we please schedule a call between Tigera and RH engineering to discuss this issue? Thanks.

Comment 9 David Johnston 2020-07-07 17:33:55 UTC
Hello Pooriya,

There is a call confirmed and scheduled with Tigera and RH engineering for today at 4 pm EST.

Comment 11 Ben Bennett 2020-07-08 17:18:02 UTC
Based on a call I had with Tigera yesterday, the host-prefix problem is intermittent.  I asked for information with the exact error that they are seeing so we can see what is failing and why since that parameter is only applicable to the openshift-sdn and ovn-kubernetes SDN plugins and should be ignored when Calico is in use.  Also interesting is that it failed a few times ... and then worked.

Comment 12 Ben Bennett 2020-07-08 17:23:27 UTC
Dropping the priority since there is a workaround for this.  Just leave the Host Prefix alone.  Since Calico does not use it, it has no effect.
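
The intended behaviour (hostPrefix only matters for openshift-sdn and ovn-kubernetes, and is ignored otherwise) can be sketched as below. This is a hypothetical illustration in Python, not the installer or cluster-network-operator code; the function name and the specific range checks are assumptions, while `OpenShiftSDN` and `OVNKubernetes` are the networkType values for those two plugins.

```python
import ipaddress

# Plugins that carve a per-node subnet of size hostPrefix out of the cluster network.
PLUGINS_USING_HOST_PREFIX = {"OpenShiftSDN", "OVNKubernetes"}


def validate_host_prefix(network_type: str, cluster_cidr: str, host_prefix: int) -> list:
    """Return a list of error strings; an empty list means the value is accepted."""
    if network_type not in PLUGINS_USING_HOST_PREFIX:
        # Third-party plugins such as Calico do their own IPAM: skip validation.
        return []

    network = ipaddress.ip_network(cluster_cidr)
    errors = []
    if host_prefix < network.prefixlen:
        errors.append(
            f"hostPrefix /{host_prefix} is larger than the cluster network {network}"
        )
    if host_prefix > network.max_prefixlen:
        errors.append(f"hostPrefix /{host_prefix} is not a valid prefix length")
    return errors


print(validate_host_prefix("Calico", "10.128.0.0/14", 8))
print(validate_host_prefix("OpenShiftSDN", "10.128.0.0/14", 8))
```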

Comment 13 zhaozhanqi 2020-07-16 08:01:26 UTC
Hi, David.
Could you help verify this, or assign someone who can? OCP SDN QE does not have any experience setting up a cluster with the Tigera Calico plugin.
Thanks.

Comment 14 Arvin Amirian 2020-07-17 13:33:29 UTC
This issue occurred with the following config:

Host prefix /28
2 NICs per node
3 masters, 4 workers

Comment 17 zhaozhanqi 2020-09-28 04:07:21 UTC
(In reply to aamirian from comment #14)
> This issue occurred with the following config
> 
> Host prefix /28
> 2 Nics per node
> 3 masters, 4 workers

aamirian, do you have an environment to verify this bug? Thanks.

Comment 18 Arvin Amirian 2020-09-28 15:48:49 UTC
The customer was advised to use a larger CIDR. The customer asked what the minimum CIDR requirements are and was told by support that it depends on the cluster size. Closing the bug for now, as the customer has overcome this issue.
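
To make "depends on the cluster size" concrete, here is a minimal sketch of the arithmetic for plugins that allocate one hostPrefix-sized subnet per node. The function name is hypothetical, and the result leaves no headroom for adding nodes later; it is an illustration of the sizing relationship, not official guidance.

```python
def min_cluster_prefix(node_count: int, host_prefix: int) -> int:
    """Longest cluster-network prefix that still fits one host subnet per node."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Bits needed to number the per-node subnets, rounded up to a power of two.
    bits_for_nodes = (node_count - 1).bit_length()
    return host_prefix - bits_for_nodes


# 3 masters + 4 workers, as in the configuration reported in comment 14.
nodes = 3 + 4
for host_prefix in (28, 23):
    print(
        f"hostPrefix /{host_prefix}, {nodes} nodes: "
        f"cluster network must be /{min_cluster_prefix(nodes, host_prefix)} or shorter"
    )
```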

Comment 19 Weibin Liang 2020-09-28 17:18:40 UTC
Based on comment 18, QA is closing this bug now and will reopen it if needed in the future.

