1719653 – Network mode Multitenant - apiserver can not connect to etcd because of netnamespaces

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Ricardo Carrillo Cruz
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1726679

Reported:	2019-06-12 09:41 UTC by Robert Bohne
Modified:	2019-10-16 06:32 UTC (History)
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The etcd namespace was not placed on netid 1. Consequence: API server components could not connect to etcd. Fix: The etcd namespace is now placed on netid 1. Result: The API server and etcd can communicate successfully.
Clone Of:
Environment:
Last Closed:	2019-10-16 06:31:56 UTC
Target Upstream Version:
Embargoed:

Attachments
cluster-network-03-config.yml (309 bytes, text/plain) 2019-06-12 09:41 UTC, Robert Bohne	no flags
Testing log (9.03 KB, text/plain) 2019-07-01 21:01 UTC, Weibin Liang	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:2922	0	None	None	None	2019-10-16 06:32:10 UTC

Description Robert Bohne 2019-06-12 09:41:13 UTC

Created attachment 1579773 [details]
cluster-network-03-config.yml

Description of problem:

If I follow the documentation [1] to install an OpenShift 4 cluster with network mode Multitenant, the installation fails because the API server cannot connect to etcd. My cluster-network-03-config.yml is attached.

[1] https://docs.openshift.com/container-platform/4.1/installing/installing_aws/installing-aws-network-customizations.html#modifying-nwoperator-config-startup_installing-aws-network-customizations

Version-Release number of selected component (if applicable):

$ openshift-install version
openshift-install v4.1.0-201905212232-dirty
built from commit 71d8978039726046929729ad15302973e3da18ce
release image quay.io/openshift-release-dev/ocp-release@sha256:b8307ac0f3ec4ac86c3f3b52846425205022da52c16f56ec31cbe428501001d6

How reproducible:

Install OCP 4 with the attached cluster-network-03-config.yml, following the documentation [1].

Steps to Reproduce:
1.
2.
3.

Actual results:

Installation fails.

API Server can not connect to etcd server:
$ oc debug apiserver-p48hk
$ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
curl: (28) Resolving timed out after 1510 milliseconds

Expected results:

Installation pass.

API Server can connect to etcd server:
$ oc rsh apiserver-nf7hx
$ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
curl: (58) NSS: client certificate not found (nickname not specified)

Additional info:

oc get netnamespaces | grep -E '(openshift-apiserver|openshift-etcd) '
openshift-apiserver                                     1
openshift-etcd                                          3025533

It looks like openshift-etcd should use netid 1.
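
The comparison above can be scripted. The following is a minimal sketch, assuming the listing has the namespace name in the first column and the netid in the second, as in the output shown in this report. The helper names `netid_of` and `same_netid` are hypothetical and are not part of `oc`.

```shell
#!/bin/sh
# netid_of NAME
# Reads an `oc get netnamespaces`-style listing on stdin and prints the
# second column (the netid) of the row whose first column equals NAME.
netid_of() {
    awk -v ns="$1" '$1 == ns { print $2 }'
}

# same_netid LISTING NS_A NS_B
# Succeeds only if both namespaces appear in LISTING and share one netid.
same_netid() {
    a=$(printf '%s\n' "$1" | netid_of "$2")
    b=$(printf '%s\n' "$1" | netid_of "$3")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example call against a live cluster (not run here):
#   same_netid "$(oc get netnamespaces)" openshift-apiserver openshift-etcd \
#       || echo "openshift-apiserver and openshift-etcd are on different netids"
```

With the listing from this report (netid 1 versus 3025533) the check fails, which matches the observed symptom.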

Comment 1 Dan Winship 2019-06-26 16:17:11 UTC

> $ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
> curl: (28) Resolving timed out after 1510 milliseconds

That's a connecting-to-your-DNS-server problem, not a connecting-to-etcd problem.
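
For reading the two curl results quoted in this bug, a small lookup of the relevant curl exit codes may help. This is a sketch only: the function name `classify_curl_exit` is hypothetical, and the descriptions paraphrase the exit codes documented in curl(1). Exit code 28 is a generic timeout, so the message text ("Resolving timed out") is what identifies the name-resolution phase. Exit code 58 is a problem with the local client certificate, which suggests the request got as far as the TLS exchange.

```shell
#!/bin/sh
# classify_curl_exit CODE
# Prints a short interpretation of a curl exit code. Only the codes that
# matter for this report are listed; everything else falls through.
classify_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "dns: could not resolve host" ;;
        7)  echo "connect: failed to connect to host" ;;
        28) echo "timeout: operation timed out" ;;
        58) echo "tls: problem with the local client certificate" ;;
        *)  echo "other: see EXIT CODES in curl(1)" ;;
    esac
}

# Example call (not run here):
#   curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
#   classify_curl_exit "$?"
```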

Weibin: do we have QE tests for Multitenant? I thought we were testing that this worked...

Comment 2 Casey Callendrello 2019-06-26 16:43:59 UTC

And DNS won't come up until after the control-plane pivot. The control plane connects to the host IPs directly. Do you have logs from any control plane components that indicate the issue?

cc'ing ricky, who added some multitenant CI.

Comment 3 zhaozhanqi 2019-06-28 04:19:32 UTC

(In reply to Dan Winship from comment #1)
> > $ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
> > curl: (28) Resolving timed out after 1510 milliseconds
> 
> That's a connecting-to-your-DNS-server problem, not a connecting-to-etcd
> problem.
> 
> Weibin: do we have QE tests for Multitenant? I thought we were testing that
> this worked...

Hi Dan,
We have test cases for Multitenant and Subnet plugin installation. As I remember, I was told that those two plugins were almost deprecated, since their behavior can be covered by creating NetworkPolicy objects, so the test matrix for Multitenant and Subnet was removed too.

Comment 6 Weibin Liang 2019-06-28 14:35:06 UTC

Confirmed with Zhanqi that QE ran SDN automation testing with Multitenant at the beginning of v4.0 testing,
then dropped those tests due to some miscommunication.

From now on, QE will continue to run automation regression testing for Networkpolicy, Multitenant and Subnet.

Comment 7 Weibin Liang 2019-07-01 21:00:06 UTC

In cluster-network-03-config.yml, I configured the network mode as Subnet, Multitenant, and NetworkPolicy in turn.

Tested in 4.1.0-0.ci-2019-07-01-170207:

Installation passed when using: Subnet or NetworkPolicy
Installation failed when using: Multitenant

Failed test log is attached.

Comment 8 Weibin Liang 2019-07-01 21:01:17 UTC

Created attachment 1586436 [details]
Testing log

Comment 9 Casey Callendrello 2019-07-02 12:25:24 UTC

Ricardo, can you take a look? The most likely change is that we just need to add a NetNamespace for openshift-etcd in 004-multitenant.yaml.
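
To illustrate the kind of object meant here, the sketch below prints a NetNamespace manifest for a given namespace and netid. It assumes the `network.openshift.io/v1` NetNamespace layout with top-level `netname` and `netid` fields. It is not the content of 004-multitenant.yaml or of any patch, and the function name `netnamespace_manifest` is hypothetical.

```shell
#!/bin/sh
# netnamespace_manifest NAMESPACE NETID
# Prints a NetNamespace manifest that places NAMESPACE on NETID.
# Illustrative only: field layout assumed from network.openshift.io/v1.
netnamespace_manifest() {
    cat <<EOF
apiVersion: network.openshift.io/v1
kind: NetNamespace
metadata:
  name: $1
netname: $1
netid: $2
EOF
}

# Example call (not run here):
#   netnamespace_manifest openshift-etcd 1
```

Putting openshift-etcd on netid 1 would give it the same netid as openshift-apiserver in the listing from the description.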

Comment 10 Ricardo Carrillo Cruz 2019-07-03 09:53:30 UTC

So, yeah, the issue is the netid.

I installed a cluster with Multitenant and got the apiservers crashlooping due to their inability
to connect to etcd.
I edited the netid of the etcd netnamespace to 1 and killed the apiservers; they were redeployed and
started fine.

Will push a patch to add the netnamespace for etcd on netid 1.

Comment 11 Ricardo Carrillo Cruz 2019-07-03 10:50:03 UTC

I pushed https://github.com/openshift/cluster-network-operator/pull/224 for master.
Will create a backport for 4.1.

Comment 13 zhaozhanqi 2019-07-11 01:36:01 UTC

Verified this bug on 4.2.0-0.nightly-2019-07-10-062553

[root@preserve-zzhao 0710]# oc get netnamespaces | grep -E '(openshift-apiserver|openshift-etcd) '
openshift-apiserver                                     1
openshift-etcd                                          1
[root@preserve-zzhao 0710]# oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant
[root@preserve-zzhao 0710]# oc get pod -n openshift-apiserver
NAME              READY   STATUS    RESTARTS   AGE
apiserver-8mcfm   1/1     Running   0          16h
apiserver-jqnkm   1/1     Running   0          16h
apiserver-wff76   1/1     Running   0          16h

Comment 15 errata-xmlrpc 2019-10-16 06:31:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
