Bug 1852112

Summary: Host-prefix requirement for OCP 4 installer is breaking cluster networking with Tigera Calico SDN CNI plugin
Product: OpenShift Container Platform
Reporter: David Johnston <djohnsto>
Component: Networking
Assignee: Surya Seetharaman <surya>
Networking sub component: openshift-sdn
QA Contact: David Johnston <djohnsto>
Status: CLOSED NOTABUG
Docs Contact:
Severity: medium
Priority: unspecified
CC: aamirian, bbennett, bleanhar, djohnsto, dmellado, fhirtz, mwhitehe, pooriya, weliang, zzhao
Version: 4.3.z
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-28 17:18:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 1 Ben Bennett 2020-06-29 19:12:12 UTC
Moving to release 4.6 so we can fix it and then we can consider the backport.

Comment 3 Arvin Amirian 2020-06-30 19:54:54 UTC
Clarification:

There are two issues here: the OCP host prefix requirement and multi-VLAN deployment.

1. Host Prefix requirement
Calico does not require a large CIDR for the host prefix. A /28 is an acceptable size, since Calico will simply assign a new /28 block when a node runs out of IPs. Deployments on nodes with a single network work, but when the nodes are attached to 2 VLANs there are communication issues and openshift-apiserver never comes up. This is addressed if a larger host prefix size is used. In this case the masters and workers are both on VLAN1 and VLAN2.
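For reference, the host prefix in question is the `hostPrefix` field under `networking.clusterNetwork` in the installer's `install-config.yaml`. A sketch of the configuration described above (CIDR values are illustrative, not taken from the customer's actual config):

```yaml
# install-config.yaml (excerpt) -- illustrative values only
networking:
  networkType: Calico        # third-party CNI; hostPrefix is normally ignored here
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 28           # the constrained prefix that triggered the failures
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 192.168.1.0/24
```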

2. OCP certificate issue with multi-VLAN configuration

In this scenario, the master has a single NIC on VLAN1 and the workers have two NICs, one on VLAN1 and the other on VLAN2. Calico tries to communicate with the masters using both NICs, but the communication from VLAN2 is always rejected with complaints that the cert only belongs to the VLAN1 interface. We see a torrent of CSRs from the workers, and no matter how many times you approve them they keep getting regenerated. It looks like each worker is generating a new request for each interface. The result is that the install never completes because openshift-apiserver never comes up.
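For anyone reproducing this, the CSR churn can be observed with standard `oc` commands against the live cluster (a sketch; `<csr-name>` is a placeholder):

```
# List pending CSRs -- in this failure mode the list keeps growing
oc get csr

# Inspect one request to see which addresses/SANs the worker put in it
oc describe csr <csr-name>

# Approve everything pending; with this bug, new CSRs reappear immediately
oc get csr -o name | xargs oc adm certificate approve
```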

Comment 4 Frank Hirtz 2020-07-01 14:19:01 UTC
Cutting visibility to RH since this isn't a general issue.

Thoughts:

In yesterday's discussion, item 1) above (the hostPrefix item) can be worked around for the moment. Not ideal, but it shouldn't be a blocker. That the install sporadically fails with a constrained hostPrefix seemingly indicates that, even though Calico doesn't make use of it, it has an effect on the install/pod spin-up (the cri-o default network or something?).

Item 2) is the blocker. Behaviorally, it sounds like we're round-robining between the two interfaces on the workers. If we have a worker-specific subnet (VLAN2) and one for the control plane (VLAN1), shouldn't all traffic for the API go through the route to VLAN1? Why would a worker try to talk to a master through the worker VLAN? I'm guessing this is a BGP side effect since both subnets are advertised, but that's just a guess on my part.

If it's a Calico routing thing, I don't know what we can do about that, but is there a way to fix the certificate issue(s) so that this doesn't error out on the (misrouted) calls?

Comment 5 Arvin Amirian 2020-07-06 15:29:20 UTC
Further analysis on issue 2.

It seems that when Calico is set as the SDN in the VLAN configuration described earlier, the kube-controller-manager and kube-apiserver end up with different signer certs:

[root@ocp5-control3 ~]# openssl crl2pkcs7 -nocrl -certfile /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/csr-signer/tls.crt | openssl pkcs7 -print_certs -text -noout |  grep -i Subject:
        Subject: CN=kube-csr-signer_@1593229954
[root@ocp5-control3 ~]# openssl crl2pkcs7 -nocrl -certfile /etc/kubernetes/static-pod-resources/kube-apiserver-pod-3/configmaps/kubelet-serving-ca/ca-bundle.crt | openssl pkcs7 -print_certs -text -noout |  grep -i Subject:
        Subject: CN=kube-csr-signer_@1593161209
        Subject: OU=openshift, CN=kubelet-signer
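The `@<number>` suffix in those signer CNs looks like a Unix timestamp (an assumption on my part about the naming convention), which would mean the two components trust signers generated roughly 19 hours apart -- consistent with one of them holding on to a stale signer across a rotation:

```python
from datetime import datetime, timezone

# CNs taken from the two openssl dumps above
controller_manager_signer = "kube-csr-signer_@1593229954"
apiserver_trusted_signer = "kube-csr-signer_@1593161209"

def signer_time(cn: str) -> datetime:
    """Parse the epoch-seconds suffix of a kube-csr-signer CN (assumed format)."""
    epoch = int(cn.rsplit("@", 1)[1])
    return datetime.fromtimestamp(epoch, tz=timezone.utc)

t_cm = signer_time(controller_manager_signer)
t_api = signer_time(apiserver_trusted_signer)
print(t_cm.isoformat())   # 2020-06-27T03:52:34+00:00
print(t_api.isoformat())  # 2020-06-26T08:46:49+00:00
print(t_cm - t_api)       # 19:05:45
```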

Comment 6 David Johnston 2020-07-06 15:48:35 UTC
Please make this public so that the partner, Tigera, can view and comment. An engineer is still being requested for a call with the partner.

Comment 8 Pooriya Aghaalitari 2020-07-07 17:23:53 UTC
This is Pooriya from Tigera. Could we please schedule a call between Tigera and RH engineering to discuss this issue? Thanks.

Comment 9 David Johnston 2020-07-07 17:33:55 UTC
Hello Pooriya,

There is a call confirmed and scheduled with Tigera and RH engineering for today 4 pm EST.

Comment 11 Ben Bennett 2020-07-08 17:18:02 UTC
Based on a call I had with Tigera yesterday, the host-prefix problem is intermittent.  I asked for the exact error they are seeing so we can determine what is failing and why, since that parameter is only applicable to the openshift-sdn and ovn-kubernetes SDN plugins and should be ignored when Calico is in use.  Also interesting is that it failed a few times ... and then worked.

Comment 12 Ben Bennett 2020-07-08 17:23:27 UTC
Dropping the priority since there is a workaround for this.  Just leave the Host Prefix alone.  Since Calico does not use it, it has no effect.

Comment 13 zhaozhanqi 2020-07-16 08:01:26 UTC
Hi, David.
Could you help verify this, or assign someone who can, since OCP SDN QE does not have experience setting up a cluster with the Tigera Calico plugin.
Thanks

Comment 14 Arvin Amirian 2020-07-17 13:33:29 UTC
This issue occurred with the following config

Host prefix /28
2 Nics per node
3 masters, 4 workers

Comment 17 zhaozhanqi 2020-09-28 04:07:21 UTC
(In reply to aamirian from comment #14)
> This issue occurred with the following config
> 
> Host prefix /28
> 2 Nics per node
> 3 masters, 4 workers

aamirian, do you have the environment to verify this bug? Thanks.

Comment 18 Arvin Amirian 2020-09-28 15:48:49 UTC
The customer was advised to use a larger CIDR. They asked what the minimum CIDR requirement is and were told by support that it depends on the cluster size. Closing the bug for now as the customer has overcome this issue.

Comment 19 Weibin Liang 2020-09-28 17:18:40 UTC
Based on comment 18, QA is closing this bug now and will reopen it if needed in the future.