1767047 – If openshift-control-plane is down, new pods cannot be created

Summary: If openshift-control-plane is down, new pods cannot be created

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-controller-manager
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.2.z
Assignee:	Sally
QA Contact:	zhou ying
Docs Contact:
URL:
Whiteboard:
Depends On:	1745102
Blocks:

Reported:	2019-10-30 15:07 UTC by Sally
Modified:	2019-12-11 22:36 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1745102
Environment:
Last Closed:	2019-12-11 22:36:06 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-kube-controller-manager-operator pull 300	'None'	'closed'	'Bug 1767047: move rbac for cluster-policy-controller leader lock to kcm-o'	2019-12-03 01:43:35 UTC
Github	openshift openshift-controller-manager pull 42	'None'	'closed'	'Bug 1767047: add leader lock for cluster-policy-controller in scc_namespace_allocator controller'	2019-12-03 01:43:35 UTC
Red Hat Product Errata	RHBA-2019:4093	None	None	None	2019-12-11 22:36:16 UTC

Description Sally 2019-10-30 15:07:39 UTC

+++ This bug was initially created as a clone of Bug #1745102 +++

We've found a layering violation between the openshift and kube control planes.  SCC requires annotations on namespaces to set default UIDs to create pods and clusterresourcequota (CRQ) requires reconciliation to free quota to create pods.  The controllers which do these things are in the openshift-controller-manager even though they have no logical openshift dependency.
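A rough way to check whether a namespace has received its SCC annotation is sketched below. The helper name `ns_has_uid_range` is hypothetical, and the annotation key `openshift.io/sa.scc.uid-range` is an assumption: this report does not name the key. Treat it as an illustration rather than the check used for this bug.

```shell
# Hypothetical helper: succeeds if the namespace carries a non-empty UID-range
# annotation. The annotation key is assumed; it is not quoted from this bug.
ns_has_uid_range() {
  ns="$1"
  # Dots inside the annotation key are escaped for the jsonpath expression.
  range=$(oc get namespace "$ns" \
    -o 'jsonpath={.metadata.annotations.openshift\.io/sa\.scc\.uid-range}') || return 2
  [ -n "$range" ]
}
```

Usage: `ns_has_uid_range myproject && echo "annotated"`. Return code 2 means the `oc` call itself failed, and 1 means the annotation is missing or empty.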

In 4.1, we partially fixed this by creating these resources as CRDs so they were always available, but we missed the controllers that are responsible for keeping these resources functional inside of the cluster.

We need to pull the "openshift.io/namespace-security-allocation" and "openshift.io/cluster-quota-reconciliation" controllers into a spot above the openshift-apiserver so that our platform can continue to create pods even if part of the openshift-control-plane is down.

Best known option: a new image used in a new container in the existing kube-controller-manager static pod.  This gives us resiliency during disaster recovery that a normal pod would not provide.  It doesn't require a new operator or a change to topology, and it does not complicate a rebase.

We have migrated the security and quota controllers to openshift/cluster-policy-controller, which runs in the openshift-kube-controller-manager static pod.  This 4.2 bug tracks the leader-election lock and RBAC needed for upgrades from 4.2 to 4.3 as a result of the quota and security migration.

Comment 2 Xingxing Xia 2019-10-31 01:45:27 UTC

Ying Zhou, please help verify this bug. Thanks.

Comment 3 Sally 2019-10-31 17:01:43 UTC

Moving back to 'POST' so the GH bot picks it up.

Comment 6 zhou ying 2019-12-03 05:57:23 UTC

Confirmed with the latest payload, 4.2.0-0.nightly-2019-12-02-165545: the issue has been fixed.

Steps:
1. Log in as a normal user and create a project and apps;
2. Scale the CVO to 0, then scale the OCMO and OCM to 0;
3. As the normal user, delete the running pod; it is recreated successfully:
[root@dhcp-140-138 yamlfile]# oc get po 
NAME              READY   STATUS      RESTARTS   AGE
dctest-1-deploy   0/1     Completed   0          73s
dctest-1-r8hcw    2/2     Running     0          63s
[root@dhcp-140-138 yamlfile]# oc delete po dctest-1-r8hcw 
pod "dctest-1-r8hcw" deleted
[root@dhcp-140-138 yamlfile]# oc get po 
NAME              READY   STATUS              RESTARTS   AGE
dctest-1-7rncn    0/2     ContainerCreating   0          40s
dctest-1-deploy   0/1     Completed           0          25m
[root@dhcp-140-138 yamlfile]# oc get po 
NAME              READY   STATUS      RESTARTS   AGE
dctest-1-7rncn    2/2     Running     0          2m49s
dctest-1-deploy   0/1     Completed   0          27m
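
The manual check in step 3 could be scripted roughly as follows. This is a minimal sketch: `pods_settled` is a hypothetical helper name, and it assumes the default `oc get po` column layout shown above (NAME, READY, STATUS, ...).

```shell
# Reads `oc get po` table output on stdin. Succeeds only if every listed pod is
# either Completed or Running with all containers ready (READY like 2/2).
# Empty input also succeeds, so check separately that pods exist.
pods_settled() {
  awk 'NR > 1 && NF >= 3 {
         split($2, ready, "/")
         if ($3 == "Completed") next
         if ($3 != "Running" || ready[1] != ready[2]) bad = 1
       }
       END { exit (bad ? 1 : 0) }'
}
```

Usage: `oc get po | pods_settled && echo "pods recovered"`. Run against the three listings above, it would fail for the second one (ContainerCreating) and pass for the first and third.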

Comment 8 errata-xmlrpc 2019-12-11 22:36:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4093
