Bug 1719653 - Network mode Multitenant - apiserver can not connect to etcd because of netnamespaces
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Ricardo Carrillo Cruz
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1726679
Reported: 2019-06-12 09:41 UTC by Robert Bohne
Modified: 2019-10-16 06:32 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The etcd namespace (openshift-etcd) was not placed on netid 1 in Multitenant mode.
Consequence: API server components were not able to connect to etcd.
Fix: Place the etcd namespace on netid 1.
Result: The API server and etcd can communicate successfully.
Clone Of:
Environment:
Last Closed: 2019-10-16 06:31:56 UTC
Target Upstream Version:


Attachments
cluster-network-03-config.yml (309 bytes, text/plain)
2019-06-12 09:41 UTC, Robert Bohne
Testing log (9.03 KB, text/plain)
2019-07-01 21:01 UTC, Weibin Liang


Links
System: Red Hat Product Errata
ID: RHBA-2019:2922
Priority: None
Status: None
Summary: None
Last Updated: 2019-10-16 06:32:10 UTC

Description Robert Bohne 2019-06-12 09:41:13 UTC
Created attachment 1579773
cluster-network-03-config.yml

Description of problem:

When I follow the documentation [1] to install an OpenShift 4 cluster with network mode Multitenant, the installation fails because the API server cannot connect to etcd. My cluster-network-03-config.yml is attached.

[1] https://docs.openshift.com/container-platform/4.1/installing/installing_aws/installing-aws-network-customizations.html#modifying-nwoperator-config-startup_installing-aws-network-customizations

Version-Release number of selected component (if applicable):

$ openshift-install version
openshift-install v4.1.0-201905212232-dirty
built from commit 71d8978039726046929729ad15302973e3da18ce
release image quay.io/openshift-release-dev/ocp-release@sha256:b8307ac0f3ec4ac86c3f3b52846425205022da52c16f56ec31cbe428501001d6

How reproducible:

Install OCP 4 with the attached cluster-network-03-config.yml, following the documentation [1].

Actual results:

Installation fails.

The API server cannot connect to the etcd server:
$ oc debug apiserver-p48hk
$ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
curl: (28) Resolving timed out after 1510 milliseconds

Expected results:

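Installation passes.

The API server can connect to the etcd server (the curl error below is expected: etcd asks for a client certificate during the TLS handshake, which shows that the connection itself works):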
$ oc rsh apiserver-nf7hx
$ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
curl: (58) NSS: client certificate not found (nickname not specified)

Additional info:

$ oc get netnamespaces | grep -E '(openshift-apiserver|openshift-etcd) '
openshift-apiserver                                     1
openshift-etcd                                          3025533

It looks like openshift-etcd should use netid 1, the same as openshift-apiserver.
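The netid mismatch above can be checked mechanically. A minimal Python sketch, assuming the openshift-sdn Multitenant rule that traffic is allowed only between namespaces that share a netid, or when either side has the global netid 0. The helper names are hypothetical, the parser assumes the NAME/NETID column order of `oc get netnamespaces`, and the sample inputs are the outputs from this description and from the verification in Comment 13.

```python
from typing import Dict

# Assumption: netid 0 is the global netid in openshift-sdn Multitenant mode;
# a namespace on it can reach, and be reached by, every other namespace.
GLOBAL_NETID = 0


def parse_netnamespaces(output: str) -> Dict[str, int]:
    """Map namespace -> netid from `oc get netnamespaces` text (hypothetical helper)."""
    netids = {}
    for line in output.strip().splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "NAME":
            continue  # skip the header and blank or malformed lines
        netids[fields[0]] = int(fields[1])
    return netids


def can_communicate(netids: Dict[str, int], src: str, dst: str) -> bool:
    """Multitenant rule: same netid, or either side on the global netid."""
    a, b = netids[src], netids[dst]
    return a == b or GLOBAL_NETID in (a, b)


before = parse_netnamespaces("""
openshift-apiserver                                     1
openshift-etcd                                          3025533
""")
after = parse_netnamespaces("""
openshift-apiserver                                     1
openshift-etcd                                          1
""")
print(can_communicate(before, "openshift-apiserver", "openshift-etcd"))  # False
print(can_communicate(after, "openshift-apiserver", "openshift-etcd"))   # True
```

The rule is applied per namespace pair, so the same check works for any other control-plane namespace an operator needs to reach.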

Comment 1 Dan Winship 2019-06-26 16:17:11 UTC
> $ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
> curl: (28) Resolving timed out after 1510 milliseconds

That's a connecting-to-your-DNS-server problem, not a connecting-to-etcd problem.
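Dan's diagnosis follows from curl's exit status: the number in parentheses identifies which stage failed. The Python sketch below classifies the two probe results from the description, assuming curl's documented exit codes 6 (could not resolve host), 7 (could not connect), 28 (operation timed out) and 58 (problem with the local client certificate). The function name and the verdict strings are hypothetical.

```python
import re

# Subset of curl's documented exit codes relevant to this probe.
CURL_EXIT_CODES = {
    6: "could not resolve host",
    7: "could not connect to host",
    28: "operation timed out",
    58: "problem with the local client certificate",
}


def probe_verdict(curl_error_line: str) -> str:
    """Classify a 'curl: (N) message' line from a connectivity probe (hypothetical helper)."""
    match = re.match(r"curl: \((\d+)\) (.*)", curl_error_line.strip())
    if match is None:
        return "unparsed"
    code, message = int(match.group(1)), match.group(2)
    if code == 6 or (code == 28 and message.startswith("Resolving")):
        # Name resolution never finished, so the target was never contacted.
        return "dns"
    if code == 7 or (code == 28 and message.startswith("Connection")):
        # The name resolved, but the TCP connection did not complete.
        return "tcp"
    if code == 58:
        # With no --cert given, curl reports this during the TLS handshake,
        # after the TCP connection to the server succeeded.
        return "reachable"
    return "other: " + CURL_EXIT_CODES.get(code, "unknown")


print(probe_verdict("curl: (28) Resolving timed out after 1510 milliseconds"))  # dns
print(probe_verdict("curl: (58) NSS: client certificate not found (nickname not specified)"))  # reachable
```

Matching on the message prefix for code 28 matters because curl uses the same exit code for DNS, connect, and transfer timeouts.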

Weibin: do we have QE tests for Multitenant? I thought we were testing that this worked...

Comment 2 Casey Callendrello 2019-06-26 16:43:59 UTC
And DNS won't come up until after the control-plane pivot. The control plane connects to the host IPs directly. Do you have logs from any control plane components that indicate the issue?

cc'ing Ricky, who added some multitenant CI.

Comment 3 zhaozhanqi 2019-06-28 04:19:32 UTC
(In reply to Dan Winship from comment #1)
> > $ curl -k -I --connect-timeout 1 https://etcd.openshift-etcd.svc:2379/
> > curl: (28) Resolving timed out after 1510 milliseconds
> 
> That's a connecting-to-your-DNS-server problem, not a connecting-to-etcd
> problem.
> 
> Weibin: do we have QE tests for Multitenant? I thought we were testing that
> this worked...

Hi Dan Winship,
We have test cases for Multitenant and Subnet plugin installation. However, I remember being told that those two plugins were nearly deprecated, since NetworkPolicy can cover both of their use cases, so Multitenant and Subnet were also removed from the test matrix.

Comment 6 Weibin Liang 2019-06-28 14:35:06 UTC
Confirmed with Zhanqi that QE ran SDN automation testing using Multitenant at the beginning of v4.0 testing, then dropped that testing due to a miscommunication.

From now on, QE will continue to run automation regression testing for NetworkPolicy, Multitenant, and Subnet.

Comment 7 Weibin Liang 2019-07-01 21:00:06 UTC
Configured the network mode in cluster-network-03-config.yml as Subnet, Multitenant, and NetworkPolicy, one per installation.

Tested in 4.1.0-0.ci-2019-07-01-170207:

Installation passed when using: Subnet or NetworkPolicy
Installation failed when using: Multitenant

Failed test log is attached.

Comment 8 Weibin Liang 2019-07-01 21:01:17 UTC
Created attachment 1586436 [details]
Testing log

Comment 9 Casey Callendrello 2019-07-02 12:25:24 UTC
Ricardo, can you take a look? The most likely fix is that we just need to add a NetNamespace for openshift-etcd in 004-multitenant.yaml.

Comment 10 Ricardo Carrillo Cruz 2019-07-03 09:53:30 UTC
So, yeah, the issue is the netid.

I installed a cluster with Multitenant and the apiservers were crashlooping because they could not connect to etcd.
After I edited the netid of the openshift-etcd NetNamespace to 1 and killed the apiserver pods, they were redeployed and started fine.

Will push a patch to add the NetNamespace for etcd on netid 1.
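The actual change is in the pull request linked in Comment 11 and is not reproduced here. As a rough sketch only, the Python below builds a NetNamespace object that places openshift-etcd on netid 1, assuming the network.openshift.io/v1 NetNamespace schema with top-level `netname` and `netid` fields and a 24-bit netid range. The helper name is hypothetical; the JSON output is a form that `oc apply -f -` accepts.

```python
import json

# Assumption: netids are 24-bit VXLAN VNIDs in openshift-sdn.
MAX_NETID = (1 << 24) - 1


def netnamespace_manifest(name: str, netid: int) -> dict:
    """Build a NetNamespace object placing namespace `name` on `netid` (hypothetical helper)."""
    if not 0 <= netid <= MAX_NETID:
        raise ValueError(f"netid {netid} is outside 0..{MAX_NETID}")
    return {
        "apiVersion": "network.openshift.io/v1",
        "kind": "NetNamespace",
        "metadata": {"name": name},
        # Assumed schema: netname and netid sit at the top level, not under
        # spec, and netname is kept equal to the namespace name.
        "netname": name,
        "netid": netid,
    }


if __name__ == "__main__":
    # Emit JSON that could be piped to `oc apply -f -`.
    print(json.dumps(netnamespace_manifest("openshift-etcd", 1), indent=2))
```

Generating the object rather than hand-editing it keeps `netname` and `metadata.name` in sync and rejects out-of-range netids before they reach the cluster.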

Comment 11 Ricardo Carrillo Cruz 2019-07-03 10:50:03 UTC
I pushed https://github.com/openshift/cluster-network-operator/pull/224 for master.
Will create a backport for 4.1.

Comment 13 zhaozhanqi 2019-07-11 01:36:01 UTC
Verified this bug on 4.2.0-0.nightly-2019-07-10-062553

[root@preserve-zzhao 0710]# oc get netnamespaces | grep -E '(openshift-apiserver|openshift-etcd) '
openshift-apiserver                                     1
openshift-etcd                                          1
[root@preserve-zzhao 0710]# oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant
[root@preserve-zzhao 0710]# oc get pod -n openshift-apiserver
NAME              READY   STATUS    RESTARTS   AGE
apiserver-8mcfm   1/1     Running   0          16h
apiserver-jqnkm   1/1     Running   0          16h
apiserver-wff76   1/1     Running   0          16h
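Comment 6 says QE will run automated regression tests for Multitenant again. As a hedged illustration of the last check above, the Python below parses `oc get pod` text and reports whether every pod is Running with all containers ready. The function name is hypothetical and it assumes the default NAME/READY/STATUS column order.

```python
def all_pods_ready(output: str) -> bool:
    """True if every pod row is Running with READY n/n (hypothetical helper)."""
    pods = [
        line.split()
        for line in output.strip().splitlines()
        if line.strip() and not line.startswith("NAME")
    ]
    if not pods:
        return False  # an empty listing is not a successful verification
    for fields in pods:
        ready, status = fields[1], fields[2]
        done, total = (int(n) for n in ready.split("/"))
        if status != "Running" or done != total:
            return False
    return True


sample = """
NAME              READY   STATUS    RESTARTS   AGE
apiserver-8mcfm   1/1     Running   0          16h
apiserver-jqnkm   1/1     Running   0          16h
apiserver-wff76   1/1     Running   0          16h
"""
print(all_pods_ready(sample))  # True
```

Treating an empty listing as a failure avoids reporting success when the namespace has no apiserver pods at all.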

Comment 15 errata-xmlrpc 2019-10-16 06:31:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

