Bug 1719653

Summary: Network mode Multitenant - apiserver can not connect to etcd because of netnamespaces
Product: OpenShift Container Platform Reporter: Robert Bohne <rbohne>
Component: Networking Assignee: Ricardo Carrillo Cruz <ricarril>
Status: CLOSED ERRATA QA Contact: zhaozhanqi <zzhao>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1.0 CC: aos-bugs, danw, ricarril, toshio.oya, weliang
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The etcd namespace was not placed on netid 1.
Consequence: API server components were not able to connect to etcd.
Fix: The etcd namespace is now placed on netid 1.
Result: The API server and etcd can communicate successfully.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-10-16 06:31:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1726679    
Attachments:
Description                      Flags
cluster-network-03-config.yml    none
Testing log                      none

Description Robert Bohne 2019-06-12 09:41:13 UTC
Created attachment 1579773 [details]
cluster-network-03-config.yml

Description of problem:

If I follow the documentation [1] to install an OpenShift 4 cluster with network mode Multitenant, the installation fails because the API server cannot connect to etcd. My cluster-network-03-config.yml is attached.

[1] https://docs.openshift.com/container-platform/4.1/installing/installing_aws/installing-aws-network-customizations.html#modifying-nwoperator-config-startup_installing-aws-network-customizations

Version-Release number of selected component (if applicable):

$ openshift-install version
openshift-install v4.1.0-201905212232-dirty
built from commit 71d8978039726046929729ad15302973e3da18ce
release image quay.io/openshift-release-dev/ocp-release@sha256:b8307ac0f3ec4ac86c3f3b52846425205022da52c16f56ec31cbe428501001d6

How reproducible:

Install OCP 4 with the attached cluster-network-03-config.yml, following the documentation [1].

Steps to Reproduce:
1.
2.
3.

Actual results:

Installation fails.

The API server cannot connect to the etcd server:
$ oc debug apiserver-p48hk
$ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
curl: (28) Resolving timed out after 1510 milliseconds

Expected results:

Installation passes.

The API server can connect to the etcd server:
$ oc rsh apiserver-nf7hx
$ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
curl: (58) NSS: client certificate not found (nickname not specified)

Additional info:

oc get netnamespaces | grep -E '(openshift-apiserver|openshift-etcd) '
openshift-apiserver                                     1
openshift-etcd                                          3025533

It looks like openshift-etcd should use netid 1.
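
The netid comparison above can be scripted. This is a minimal sketch, assuming the NAME and NETID columns appear first in the `oc get netnamespaces` output as shown above. The helper name same_netid is hypothetical and not part of oc.

```shell
#!/bin/sh
# Hypothetical helper: compare the netid of two namespaces.
# Reads `oc get netnamespaces` output on stdin.
# Assumes column 1 is the namespace name and column 2 is the netid.
# Prints "match <id>", "mismatch <id-a> <id-b>", or "missing".
same_netid() {
    awk -v a="$1" -v b="$2" '
        $1 == a { ida = $2 }
        $1 == b { idb = $2 }
        END {
            if (ida == "" || idb == "") { print "missing"; exit 2 }
            if (ida == idb) { print "match " ida; exit 0 }
            print "mismatch " ida " " idb
            exit 1
        }'
}

# Usage against a live cluster (not run here):
#   oc get netnamespaces | same_netid openshift-apiserver openshift-etcd
```

With the output from this report, the helper would print a mismatch, since openshift-apiserver is on netid 1 and openshift-etcd is on 3025533.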

Comment 1 Dan Winship 2019-06-26 16:17:11 UTC
> $ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
> curl: (28) Resolving timed out after 1510 milliseconds

That's a connecting-to-your-DNS-server problem, not a connecting-to-etcd problem.

Weibin: do we have QE tests for Multitenant? I thought we were testing that this worked...

Comment 2 Casey Callendrello 2019-06-26 16:43:59 UTC
And DNS won't come up until after the control-plane pivot. The control plane connects to the host IPs directly. Do you have logs from any control plane components that indicate the issue?

cc'ing ricky, who added some multitenant CI.

Comment 3 zhaozhanqi 2019-06-28 04:19:32 UTC
(In reply to Dan Winship from comment #1)
> > $ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
> > curl: (28) Resolving timed out after 1510 milliseconds
> 
> That's a connecting-to-your-DNS-server problem, not a connecting-to-etcd
> problem.
> 
> Weibin: do we have QE tests for Multitenant? I thought we were testing that
> this worked...

Hi Dan Winship,
We have test cases for Multitenant and Subnet plugin installation. As I remember, I was told that those two plugins were nearly deprecated, since NetworkPolicy can cover both of them by creating NetworkPolicy objects. So the test matrix for Multitenant and Subnet was removed too.

Comment 6 Weibin Liang 2019-06-28 14:35:06 UTC
Confirmed with Zhanqi that QE ran SDN automation tests using Multitenant at the beginning of v4.0 testing,
then dropped those tests due to a miscommunication.

From now on, QE will run automated regression tests for NetworkPolicy, Multitenant, and Subnet.

Comment 7 Weibin Liang 2019-07-01 21:00:06 UTC
In cluster-network-03-config.yml, I configured the network mode as Subnet, Multitenant, and NetworkPolicy in turn.

Tested in 4.1.0-0.ci-2019-07-01-170207:

Installation passed when using: Subnet or NetworkPolicy
Installation failed when using: Multitenant

Failed test log is attached.

Comment 8 Weibin Liang 2019-07-01 21:01:17 UTC
Created attachment 1586436 [details]
Testing log

Comment 9 Casey Callendrello 2019-07-02 12:25:24 UTC
Ricardo, can you take a look? The most likely change is that we just need to add a NetNamespace for openshift-etcd in 004-multitenant.yaml.

Comment 10 Ricardo Carrillo Cruz 2019-07-03 09:53:30 UTC
So, yeah, the issue is the netid.

I installed a cluster with Multitenant and the apiservers were crashlooping because they could not
connect to etcd.
I edited the etcd netnamespace's netid to 1 and killed the apiservers; they were redeployed and
started fine.

Will push a patch to add the netnamespace for etcd on netid 1.
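
The manual workaround described in this comment could look roughly like the following. This is a sketch, not the exact commands that were used. It assumes that netid is a top-level field on the NetNamespace object that can be changed with a merge patch, and that deleting the pods in openshift-apiserver causes them to be recreated. The function name fix_etcd_netid is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: move openshift-etcd onto the given netid (default 1),
# then delete the apiserver pods so that they are redeployed.
# Assumption: NetNamespace exposes netid as a top-level, patchable field.
fix_etcd_netid() {
    netid="${1:-1}"
    # Put the etcd namespace on the same netid as openshift-apiserver.
    oc patch netnamespace openshift-etcd --type=merge -p "{\"netid\": ${netid}}"
    # Kill the crashlooping apiservers; they should be recreated automatically.
    oc delete pod --all -n openshift-apiserver
}

# Usage against an affected cluster (not run here):
#   fix_etcd_netid 1
```

Afterwards, `oc get netnamespaces` should show the same netid for openshift-apiserver and openshift-etcd.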

Comment 11 Ricardo Carrillo Cruz 2019-07-03 10:50:03 UTC
I pushed https://github.com/openshift/cluster-network-operator/pull/224 for master.
Will create a backport for 4.1.

Comment 13 zhaozhanqi 2019-07-11 01:36:01 UTC
Verified this bug on 4.2.0-0.nightly-2019-07-10-062553

[root@preserve-zzhao 0710]# oc get netnamespaces | grep -E '(openshift-apiserver|openshift-etcd) '
openshift-apiserver                                     1
openshift-etcd                                          1
[root@preserve-zzhao 0710]# oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant
[root@preserve-zzhao 0710]# oc get pod -n openshift-apiserver
NAME              READY   STATUS    RESTARTS   AGE
apiserver-8mcfm   1/1     Running   0          16h
apiserver-jqnkm   1/1     Running   0          16h
apiserver-wff76   1/1     Running   0          16h

Comment 15 errata-xmlrpc 2019-10-16 06:31:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922