Bug 1745102 - If openshift-control-plane is down, new pods cannot be created
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: David Eads
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks: 1767047
Reported: 2019-08-23 15:28 UTC by David Eads
Modified: 2023-09-14 05:42 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1767047 (view as bug list)
Environment:
Last Closed: 2020-01-23 11:05:22 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2020:0062 | 0 | None | None | None | 2020-01-23 11:05:41 UTC

Description David Eads 2019-08-23 15:28:34 UTC
We've found a layering violation between the openshift and kube control planes. SCC requires annotations on namespaces to set default UIDs before pods can be created, and clusterresourcequota (CRQ) requires reconciliation to free quota before pods can be created. The controllers that do these things live in the openshift-controller-manager even though they have no logical openshift dependency.
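
To make the SCC half of that dependency concrete, here is a minimal sketch (not the actual admission code) of the check that fails when the allocation controller has not run. It assumes the namespace annotations are the usual `openshift.io/sa.scc.*` keys, that the UID range is written as `<start>/<size>`, and that the default UID is the first one in the block. The helper names are made up for illustration.

```python
# Annotation keys assumed to be written onto each namespace by the
# namespace-security-allocation controller and read by SCC admission.
UID_RANGE = "openshift.io/sa.scc.uid-range"
SUPPLEMENTAL_GROUPS = "openshift.io/sa.scc.supplemental-groups"
MCS = "openshift.io/sa.scc.mcs"
REQUIRED = (UID_RANGE, SUPPLEMENTAL_GROUPS, MCS)


def missing_scc_annotations(namespace: dict) -> list:
    """Hypothetical helper: list the SCC annotations a namespace lacks."""
    annotations = namespace.get("metadata", {}).get("annotations") or {}
    return [key for key in REQUIRED if not annotations.get(key)]


def default_uid(namespace: dict) -> int:
    """Hypothetical helper: first UID of a '<start>/<size>' block.

    Raises LookupError when the annotation is absent, which mirrors the
    situation in this bug: nothing has allocated a range yet, so no
    default UID can be chosen for a new pod.
    """
    annotations = namespace.get("metadata", {}).get("annotations") or {}
    block = annotations.get(UID_RANGE)
    if not block:
        raise LookupError("namespace has no UID range allocated")
    start, _, _size = block.partition("/")
    return int(start)


allocated = {
    "metadata": {
        "name": "demo",
        "annotations": {
            UID_RANGE: "1000620000/10000",
            SUPPLEMENTAL_GROUPS: "1000620000/10000",
            MCS: "s0:c25,c10",
        },
    }
}
unallocated = {"metadata": {"name": "created-while-controller-down"}}

print(missing_scc_annotations(allocated))    # nothing missing
print(missing_scc_annotations(unallocated))  # all three keys missing
```

A namespace created while the controller is down looks like `unallocated` above, so pod creation in it is blocked until something writes the annotations.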

In 4.1, we partially fixed this by creating these resources as CRDs so they were always available, but we missed the controllers that are responsible for keeping these resources functional inside of the cluster.

We need to pull the "openshift.io/namespace-security-allocation" and "openshift.io/cluster-quota-reconciliation" controllers into a spot above the openshift-apiserver so that our platform can continue to create pods even if part of the openshift-control-plane is down.

Best known option: a new image used in a new container in the existing kube-controller-manager static pod. This gives us resiliency during disaster recovery that a normal pod would not provide. It doesn't require a new operator or a change to topology, and it does not complicate a rebase.
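
As a rough illustration of that option (a sketch only, not the operator's real manifest handling), the change amounts to appending one more container to the existing static pod definition. The container name, image references, and command below are placeholders invented for the example.

```python
import copy


def add_container(static_pod: dict, name: str, image: str, command: list) -> dict:
    """Return a copy of a pod manifest with one extra container appended."""
    pod = copy.deepcopy(static_pod)  # leave the caller's manifest untouched
    containers = pod.setdefault("spec", {}).setdefault("containers", [])
    if any(c.get("name") == name for c in containers):
        raise ValueError(f"container {name!r} already present")
    containers.append({"name": name, "image": image, "command": list(command)})
    return pod


# Trimmed stand-in for the kube-controller-manager static pod manifest.
kcm_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "kube-controller-manager"},
    "spec": {
        "hostNetwork": True,
        "containers": [
            {
                "name": "kube-controller-manager",
                "image": "example.invalid/kube-controller-manager:placeholder",
            }
        ],
    },
}

# Placeholder name, image, and command for the container that would host
# the namespace-security-allocation and cluster-quota-reconciliation loops.
patched = add_container(
    kcm_pod,
    name="policy-controllers",
    image="example.invalid/policy-controllers:placeholder",
    command=["policy-controllers", "start"],
)

print([c["name"] for c in patched["spec"]["containers"]])
```

Because the extra container rides in a static pod, the kubelet starts it from the on-disk manifest without needing the API server or scheduler, which is where the disaster-recovery resiliency comes from.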

Comment 1 Sally 2019-10-30 17:12:30 UTC
static pod def, rbac: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/297
temporary lock NamespaceSCCAllocationController: https://github.com/openshift/openshift-controller-manager/pull/28
temporary lock ClusterPolicyController: https://github.com/openshift/cluster-policy-controller/pull/3
CI https://github.com/openshift/release/pull/5075

remove quota,sec controllers from OCM: https://github.com/openshift/openshift-controller-manager/pull/37

These changes are merged; setting to MODIFIED to be picked up by QA.

Comment 2 David Eads 2019-10-30 17:20:08 UTC
We have strong CI on this. Marking verified to free up our bot.

Comment 3 Derek Carr 2019-11-04 01:47:49 UTC
The bug should only be moved to VERIFIED by QE.

Comment 4 Derek Carr 2019-11-04 01:50:08 UTC
Referenced PR for validation by QE https://github.com/openshift/openshift-controller-manager/pull/37

Comment 5 Derek Carr 2019-11-04 02:46:59 UTC
Bugs should never move from MODIFIED->VERIFIED.  Bugs must move from MODIFIED->ON_QA->VERIFIED via ART automation in order to not cause other confusion in a release.

Comment 10 errata-xmlrpc 2020-01-23 11:05:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 11 Red Hat Bugzilla 2023-09-14 05:42:14 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

