Bug 1851782

Summary:	Authentication operator degraded when cluster is built with Multitenant plugin
Product:	OpenShift Container Platform	Reporter:	Alan Chan <alchan>
Component:	Networking	Assignee:	Ben Bennett <bbennett>
Networking sub component:	openshift-sdn	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	medium
Priority:	unspecified	CC:	alchan, aos-bugs, deads, eparis, jokerman, mfojtik, mnewby, sponnaga, wlewis
Version:	4.4
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-06-30 20:50:34 UTC	Type:	Bug

Description Alan Chan 2020-06-28 23:12:14 UTC

Description of problem:
-----------------------

Cluster is built with Multitenant plugin via customized manifest:

$ cat manifests/cluster-network-03-config.yml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OpenShiftSDN
    openshiftSDNConfig:
      mode: Multitenant

After a successful build, the authentication operator appears to go into a degraded state:

[alchan-redhat.com@clientvm 0 ~]$ oc get co authentication 
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.4.0     True        False         True       55m

[alchan-redhat.com@clientvm 0 ~]$ oc get co authentication -o json | jq '.status.conditions[0]'
{
  "lastTransitionTime": "2020-06-28T20:43:19Z",
  "message": "IngressStateEndpointsDegraded: Unhealthy addresses found: 10.129.0.30:Get https://10.129.0.30:6443/healthz: dial tcp 10.129.0.30:6443: connect: connection timed out,10.130.0.29:Get https://10.130.0.29:6443/healthz: dial tcp 10.130.0.29:6443: connect: connection timed out",
  "reason": "IngressStateEndpoints_UnhealthyAddresses",
  "status": "True",
  "type": "Degraded"
}

The IPs 10.129.0.30 and 10.130.0.29 belong to the oauth-openshift pods in the openshift-authentication namespace.
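
For triage, the failing endpoint addresses can be pulled out of the condition message and matched against pod IPs. The following is only a minimal sketch: the function name unhealthy_addresses is hypothetical, and it assumes the message keeps the format shown above, where each failure contains "dial tcp <ip>:<port>".

```shell
# Hypothetical helper: reads a Degraded condition message on stdin and
# prints each unique "ip:port" that follows "dial tcp".
unhealthy_addresses() {
  grep -oE 'dial tcp [0-9]+(\.[0-9]+){3}:[0-9]+' | awk '{ print $3 }' | sort -u
}

# Possible usage against a live cluster (assumes oc and jq are installed):
#   oc get co authentication -o json \
#     | jq -r '.status.conditions[] | select(.type == "Degraded") | .message' \
#     | unhealthy_addresses
```

Selecting the condition by type rather than by index avoids depending on the order of the conditions array. The printed addresses can then be compared with the output of oc get pods -n openshift-authentication -o wide.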

[alchan-redhat.com@clientvm 0 ~]$ oc get netnamespaces | grep authentication
openshift-authentication                                9296695    
openshift-authentication-operator                       7693696

Because the two namespaces have different NetIDs, the authentication-operator pod cannot connect to the oauth-openshift pods.
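
The NetID comparison can be scripted. This is a sketch under stated assumptions, not an official tool: the function name netid_check is hypothetical, it expects the tabular output of oc get netnamespaces on stdin (name in column 1, NetID in column 2), and it assumes that in Multitenant mode two namespaces can reach each other when their NetIDs match or when either NetID is 0 (the global NetID).

```shell
# Hypothetical helper: prints "reachable", "isolated", or "unknown" for two
# namespaces, based on `oc get netnamespaces` output read from stdin.
netid_check() {
  awk -v a="$1" -v b="$2" '
    $1 == a { ida = $2 }   # NetID of the first namespace
    $1 == b { idb = $2 }   # NetID of the second namespace
    END {
      if (ida == "" || idb == "")                  print "unknown"
      else if (ida == idb || ida == 0 || idb == 0) print "reachable"
      else                                         print "isolated"
    }
  '
}

# Possible usage against a live cluster:
#   oc get netnamespaces | netid_check openshift-authentication openshift-authentication-operator
```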

The workaround appears to be joining the two projects:

$ oc adm pod-network join-projects --to=openshift-authentication openshift-authentication-operator

The authentication operator is then no longer degraded.


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

- 4.4.0 has this issue.

- The latest release, 4.4.9, appears to be fine and does NOT have this issue. In 4.4.9, both projects appear to be in netid 1:

[alchan-redhat.com@clientvm 0 ~]$ oc get co authentication
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.4.9     True        False         False      9m44s

[alchan-redhat.com@clientvm 0 ~]$ oc get netnamespaces | grep authentication
openshift-authentication                                1          
openshift-authentication-operator                       1 

- Have not tested any other version between 4.4.0 and 4.4.9.


Questions:
----------

- In which 4.4.z version is this fixed?

Comment 5 Maru Newby 2020-06-30 07:34:02 UTC

What action(s) are expected of the api/auth team that suggested assignment to me? It's not at all clear to me from the comments that appear on this bz.

Comment 7 David Eads 2020-06-30 20:43:45 UTC

It was fixed in 4.4.8 with https://github.com/openshift/cluster-network-operator/pull/657 related to https://bugzilla.redhat.com/show_bug.cgi?id=1841507.

The question about what happens for upgrades if someone worked around the problem (comment 4) is best addressed by the SDN team. Reassigning.
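
To check whether a given cluster is already at or past the 4.4.8 fix version, a version comparison along these lines may help. It is a sketch only: the function name version_at_least is hypothetical, and it assumes a sort implementation that supports -V (version sort), such as GNU coreutils.

```shell
# Hypothetical helper: succeeds if version $1 is at or above version $2.
# Relies on `sort -V`, so 4.4.10 is correctly ordered after 4.4.8.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Possible usage against a live cluster:
#   current=$(oc get clusterversion version -o jsonpath='{.status.desired.version}')
#   version_at_least "$current" 4.4.8 && echo "fix should be included"
```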

Comment 8 Maru Newby 2020-06-30 20:50:34 UTC

This bz is a duplicate of [1]. The fix is already merged for 4.5 [2] and backported to 4.4 [3].

For future reference, the list of namespaces to join when running in multitenant mode is maintained by the sdn team (openshift-sdn component). 

1: https://bugzilla.redhat.com/show_bug.cgi?id=1837575
2: https://github.com/openshift/cluster-network-operator/pull/650
3: https://github.com/openshift/cluster-network-operator/pull/657

*** This bug has been marked as a duplicate of bug 1837575 ***

Comment 9 Red Hat Bugzilla 2023-09-14 06:03:01 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days