Moving to release 4.6 so we can fix it and then we can consider the backport.
Clarification: There are two issues here: the OCP host prefix requirement and the multi-VLAN deployment.
1. Host prefix requirement
Calico does not require a large CIDR for the host prefix. A /28 is an acceptable size, since Calico will just assign a new /28 block when the node runs out of IPs. Deployments on nodes with a single network work, but when the nodes are attached to 2 VLANs there are communication issues and openshift-apiserver never comes up. This is addressed if a larger host prefix size is used. In this case the masters and workers are both on VLAN1 and VLAN2.
2. OCP certificate issue with multi-VLAN configuration
In this scenario, the master has a single NIC on VLAN1 and the workers have two NICs, one on VLAN1 and the other on VLAN2. Calico tries to communicate with the masters using both NICs, but the communication from VLAN2 is always rejected with complaints about the cert only belonging to the VLAN1 interface. We see a torrent of CSRs from the workers, but no matter how many times you approve them they keep getting regenerated. It looks like the worker is generating a new request using each interface. The result is that the install never completes because openshift-apiserver never comes up.
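For reference, the block arithmetic behind the "/28 is enough" argument can be sketched as follows. This is only an illustration: the helper name blocks_needed is made up here, and it assumes every address in a block can be handed to a pod, which may not match how a given IPAM setup reserves addresses.

```python
import ipaddress
import math


def blocks_needed(pod_count: int, block_prefix: int = 28) -> int:
    """Return how many blocks of size /block_prefix are needed for pod_count pod IPs.

    Assumption: every address in a block is assignable to a pod.
    """
    # The base address is arbitrary; only the prefix length matters for the size.
    block_size = ipaddress.ip_network(f"10.0.0.0/{block_prefix}").num_addresses
    return math.ceil(pod_count / block_size)


if __name__ == "__main__":
    # A /28 holds 16 addresses, so the 17th pod needs a second block.
    for pods in (16, 17, 250):
        print(pods, "pods ->", blocks_needed(pods), "block(s) of /28")
```

The point is only that a small block limits how many pod IPs come from one block, not how many pods a node can run, provided additional blocks can be allocated on demand.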
Cutting visibility to RH since this isn't a general issue.
Thoughts: As discussed yesterday, item 1) above (the hostPrefix item) can be worked around for the moment. Not ideal, but it shouldn't be a blocker. That the install sporadically fails with a constrained hostPrefix seems to indicate that, even though Calico doesn't make use of it, it has an effect on the install or pod spin-up (cri-o default network or something?).
Item 2) is the blocker. It sounds like the workers are round-robining between their 2 interfaces. If we have a worker-specific subnet (VLAN2) and one for the control plane (VLAN1), shouldn't all traffic for the API go through the route to VLAN1? Why would a worker try to talk to a master through the worker VLAN? I'm guessing this is a BGP side effect since both subnets are advertised, but that's just a guess on my part. If it's a Calico routing thing, I don't know what we can do about that, but is there a way to fix the certificate issue(s) so that the (misrouted) calls don't error out?
Further analysis on issue 2. It seems that when Calico is set as the SDN in the VLAN configuration described earlier, the kube-controller-manager and kube-apiserver end up with different signer certs.
[root@ocp5-control3 ~]# openssl crl2pkcs7 -nocrl -certfile /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/csr-signer/tls.crt | openssl pkcs7 -print_certs -text -noout | grep -i Subject:
        Subject: CN=kube-csr-signer_@1593229954
[root@ocp5-control3 ~]# openssl crl2pkcs7 -nocrl -certfile /etc/kubernetes/static-pod-resources/kube-apiserver-pod-3/configmaps/kubelet-serving-ca/ca-bundle.crt | openssl pkcs7 -print_certs -text -noout | grep -i Subject:
        Subject: CN=kube-csr-signer_@1593161209
        Subject: OU=openshift, CN=kubelet-signer
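To make the mismatch in the output above explicit, here is a rough sketch that compares the CNs from the two greps. The function name subject_cns is invented for illustration, the input strings are the Subject lines copied from this comment, and the parsing is deliberately naive (it would mis-split a CN that itself contains a comma).

```python
def subject_cns(openssl_output: str) -> set:
    """Collect the CN values from 'Subject:' lines of `openssl ... -text` output."""
    cns = set()
    for line in openssl_output.splitlines():
        line = line.strip()
        if not line.startswith("Subject:"):
            continue
        # Split 'OU=openshift, CN=kubelet-signer' into key=value fields.
        for field in line[len("Subject:"):].split(","):
            key, _, value = field.strip().partition("=")
            if key.strip() == "CN":
                cns.add(value.strip())
    return cns


# Subject lines as reported in the comment above.
controller_manager_signer = "Subject: CN=kube-csr-signer_@1593229954"
apiserver_ca_bundle = """\
Subject: CN=kube-csr-signer_@1593161209
Subject: OU=openshift, CN=kubelet-signer
"""

# Signers used by kube-controller-manager that the apiserver bundle does not list.
missing = subject_cns(controller_manager_signer) - subject_cns(apiserver_ca_bundle)

if __name__ == "__main__":
    print("signer CNs missing from the kube-apiserver bundle:", missing)
```

With the values above, the controller-manager's csr-signer CN does not appear in the kube-apiserver's kubelet-serving-ca bundle, which is the discrepancy being reported. Comparing CNs is only a quick check; it does not verify the certificate chain itself.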
Please make this public so that the partner, Tigera, can view and comment. An engineer is still being requested for a call with the partner.
This is Pooriya from Tigera. Could we please schedule a call between Tigera and RH engineering to discuss this issue? Thanks.
Hello Pooriya, a call between Tigera and RH engineering is confirmed and scheduled for today at 4 pm EST.
Based on a call I had with Tigera yesterday, the host-prefix problem is intermittent. I asked for the exact error they are seeing so we can tell what is failing and why, since that parameter is only applicable to the openshift-sdn and ovn-kubernetes SDN plugins and should be ignored when Calico is in use. Also interesting is that it failed a few times ... and then worked.
Dropping the priority since there is a workaround for this. Just leave the Host Prefix alone. Since Calico does not use it, it has no effect.
Hi David, could you help verify this, or assign someone who can? OCP SDN QE does not have any experience setting up a cluster with the Tigera Calico plugin. Thanks.
This issue occurred with the following config:
Host prefix /28
2 NICs per node
3 masters, 4 workers
(In reply to aamirian from comment #14)
> This issue occurred with the following config
>
> Host prefix /28
> 2 Nics per node
> 3 masters, 4 workers
aamirian Do you have the env to verify this bug? thanks
Customer was advised to use a larger CIDR. Customer asked what the minimum CIDR requirements are and was told by support that it depends on the cluster size. Closing the bug for now as the customer has overcome this issue.
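As a rough illustration of "it depends on the cluster size", the sizing arithmetic for plugins that honor hostPrefix can be sketched like this. The function names are made up, the /14 and /23 figures are only example inputs, and the calculation ignores any addresses a plugin reserves inside each per-node subnet, so treat the results as lower bounds rather than support guidance.

```python
def min_host_prefix(max_pods_per_node: int) -> int:
    """Largest hostPrefix value whose per-node subnet has room for max_pods_per_node IPs.

    Assumption: no addresses in the subnet are reserved.
    """
    if max_pods_per_node < 1:
        raise ValueError("max_pods_per_node must be at least 1")
    host_bits = (max_pods_per_node - 1).bit_length()  # ceil(log2(n)) without floats
    return 32 - host_bits


def max_nodes(cluster_prefix: int, host_prefix: int) -> int:
    """Number of per-node subnets of size /host_prefix inside a /cluster_prefix network."""
    if not 0 <= cluster_prefix <= host_prefix <= 32:
        raise ValueError("need 0 <= cluster_prefix <= host_prefix <= 32")
    return 2 ** (host_prefix - cluster_prefix)


if __name__ == "__main__":
    print("hostPrefix for 16 pods/node:", min_host_prefix(16))    # a /28 is just enough
    print("hostPrefix for 250 pods/node:", min_host_prefix(250))
    print("nodes in a /14 with hostPrefix 23:", max_nodes(14, 23))
    print("nodes in a /14 with hostPrefix 28:", max_nodes(14, 28))
```

A larger per-node subnet (smaller hostPrefix number) leaves more room for pods on each node but reduces how many nodes fit in the same cluster network, which is why there is no single minimum value.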
Based on comment 18, QA is closing this bug now and will reopen it if needed in the future.