Created attachment 1579773 [details]
cluster-network-03-config.yml

Description of problem:
If I follow the documentation [1] to install an OpenShift 4 cluster with network mode Multitenant, the installation fails because the API server cannot connect to etcd. My cluster-network-03-config.yml is attached.

[1] https://docs.openshift.com/container-platform/4.1/installing/installing_aws/installing-aws-network-customizations.html#modifying-nwoperator-config-startup_installing-aws-network-customizations

Version-Release number of selected component (if applicable):
$ openshift-install version
openshift-install v4.1.0-201905212232-dirty
built from commit 71d8978039726046929729ad15302973e3da18ce
release image quay.io/openshift-release-dev/ocp-release@sha256:b8307ac0f3ec4ac86c3f3b52846425205022da52c16f56ec31cbe428501001d6

How reproducible:
Install OCP 4 with cluster-network-03-config.yml, following the documentation [1].

Steps to Reproduce:
1.
2.
3.

Actual results:
Installation fails. The API server cannot connect to the etcd server:
$ oc debug apiserver-p48hk
$ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
curl: (28) Resolving timed out after 1510 milliseconds

Expected results:
Installation passes. The API server can connect to the etcd server:
$ oc rsh apiserver-nf7hx
$ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
curl: (58) NSS: client certificate not found (nickname not specified)

Additional info:
oc get netnamespaces | grep -E '(openshift-apiserver|openshift-etcd) '
openshift-apiserver 1
openshift-etcd 3025533

It looks like openshift-etcd should use netid 1.
> $ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
> curl: (28) Resolving timed out after 1510 milliseconds

That's a connecting-to-your-DNS-server problem, not a connecting-to-etcd problem.

Weibin: do we have QE tests for Multitenant? I thought we were testing that this worked...
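To make the distinction concrete: curl exit code 28 covers every kind of timeout, so it is the message text ("Resolving timed out") that points at name resolution rather than at the etcd connection itself. Below is a minimal sketch of that triage. The function name and the stage labels are hypothetical; the only facts relied on are curl's exit codes 6 (could not resolve host), 7 (failed to connect), 28 (timeout) and 58 (problem with the local client certificate), plus the two messages quoted in this bug.

```python
import re


def classify_curl_failure(stderr_line):
    """Map a 'curl: (NN) message' line to a rough failure stage.

    The returned labels are illustrative only, not curl terminology.
    """
    match = re.match(r"curl: \((\d+)\) (.*)", stderr_line.strip())
    if match is None:
        return "unknown"
    code, message = int(match.group(1)), match.group(2)
    if code == 6:
        return "dns"  # host name could not be resolved
    if code == 28:
        # Code 28 is a generic timeout; the message says which phase timed out.
        return "dns" if message.startswith("Resolving") else "timeout"
    if code == 7:
        return "connect"  # TCP connection failed
    if code == 58:
        # A client-certificate error means the server was actually reached.
        return "tls-client-cert"
    return "unknown"


actual = classify_curl_failure(
    "curl: (28) Resolving timed out after 1510 milliseconds")
expected = classify_curl_failure(
    "curl: (58) NSS: client certificate not found (nickname not specified)")
print(actual, expected)
```

Under this reading, the "expected" result in the description is a failure too, but one that only occurs after DNS and the TCP connection have both succeeded.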
And DNS won't come up until after the control-plane pivot. The control plane connects to the host IPs directly. Do you have logs from any control plane components that indicate the issue? cc'ing ricky, who added some multitenant CI.
(In reply to Dan Winship from comment #1)
> > $ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
> > curl: (28) Resolving timed out after 1510 milliseconds
>
> That's a connecting-to-your-DNS-server problem, not a connecting-to-etcd
> problem.
>
> Weibin: do we have QE tests for Multitenant? I thought we were testing that
> this worked...

Hi Dan, we have test cases for Multitenant and Subnet plugin installation. I recall being told that those two plugins were almost deprecated, since their use cases can be covered by creating NetworkPolicy objects. So the test matrix for Multitenant and Subnet was removed as well.
Confirmed with Zhanqi that QE executed SDN automation testing using Multitenant at the beginning of v4.0 testing, then dropped that testing due to some miscommunication. From now on, QE will continue to run automated regression testing for NetworkPolicy, Multitenant and Subnet.
In cluster-network-03-config.yml, I configured the network mode to be Subnet, Multitenant and NetworkPolicy in turn.

Tested in 4.1.0-0.ci-2019-07-01-170207:
Installation passed when using: Subnet or NetworkPolicy
Installation failed when using: Multitenant

The failed test log is attached.
Created attachment 1586436 [details] Testing log
Ricardo, can you take a look? The most likely fix is that we just need to add a NetNamespace for openshift-etcd in 004-multitenant.yaml.
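For illustration, an entry of that kind might look roughly like the object built below. This is a sketch, not the actual contents of 004-multitenant.yaml: it assumes the `network.openshift.io/v1` NetNamespace schema with top-level `netname` and `netid` fields, and it takes netid 1 from the openshift-apiserver value reported in the description. The helper name `build_netnamespace` is hypothetical.

```python
import json


def build_netnamespace(namespace, netid):
    """Build a NetNamespace object as a plain dict.

    Assumption: field names follow the network.openshift.io/v1 API,
    where netname and netid sit at the top level, not under 'spec'.
    """
    return {
        "apiVersion": "network.openshift.io/v1",
        "kind": "NetNamespace",
        "metadata": {"name": namespace},
        "netname": namespace,
        "netid": netid,
    }


# netid 1 matches what openshift-apiserver already uses in this report.
manifest = build_netnamespace("openshift-etcd", 1)
print(json.dumps(manifest, indent=2))
```

The JSON output is only meant for eyeballing; the shape of the real manifest in the operator should be taken from the repository.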
So, yeah, the issue is the netid. I installed a cluster with Multitenant and got the apiservers crashlooping due to their inability to connect to etcd. After I edited the netid of the etcd NetNamespace to 1 and killed the apiservers, they were redeployed and started fine. I will push a patch to add the NetNamespace for etcd with netid 1.
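The mismatch can be checked mechanically from `oc get netnamespaces` output. The sketch below assumes the usual Multitenant isolation rule: pods can talk to each other only when their namespaces share a netid, or when one side has the global netid 0. The function names are hypothetical, and the sample data is the output quoted in the description next to the same output with the netid corrected to 1.

```python
def parse_netnamespaces(text):
    """Parse whitespace-separated 'name netid' rows into a dict."""
    netids = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and a 'NAME NETID ...' header row, if present.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        netids[fields[0]] = int(fields[1])
    return netids


def can_communicate(netids, ns_a, ns_b):
    """Assumed Multitenant rule: same netid, or either side is global (0)."""
    a, b = netids[ns_a], netids[ns_b]
    return a == b or a == 0 or b == 0


BROKEN = """\
openshift-apiserver 1
openshift-etcd 3025533
"""

FIXED = """\
openshift-apiserver 1
openshift-etcd 1
"""

broken_ok = can_communicate(
    parse_netnamespaces(BROKEN), "openshift-apiserver", "openshift-etcd")
fixed_ok = can_communicate(
    parse_netnamespaces(FIXED), "openshift-apiserver", "openshift-etcd")
print(broken_ok, fixed_ok)  # False True
```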
I pushed https://github.com/openshift/cluster-network-operator/pull/224 for master. Will create a backport for 4.1.
Verified this bug on 4.2.0-0.nightly-2019-07-10-062553

[root@preserve-zzhao 0710]# oc get netnamespaces | grep -E '(openshift-apiserver|openshift-etcd) '
openshift-apiserver 1
openshift-etcd 1
[root@preserve-zzhao 0710]# oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant
[root@preserve-zzhao 0710]# oc get pod -n openshift-apiserver
NAME              READY   STATUS    RESTARTS   AGE
apiserver-8mcfm   1/1     Running   0          16h
apiserver-jqnkm   1/1     Running   0          16h
apiserver-wff76   1/1     Running   0          16h
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922