Bug 1525642 - immortal namespaces are not immortal (as we claim them to be)
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Release: 3.9.0
Assigned To: David Eads
QA Contact: Wang Haoran
Reported: 2017-12-13 14:20 EST by Eric Rich
Modified: 2018-09-17 12:39 EDT
CC: 6 users

Fixed In Version: oc v3.9.0-0.20.0
Last Closed: 2018-06-18 14:18:44 EDT
Type: Bug




External Trackers:
  Red Hat Product Errata RHSA-2018:2013 (normal, SHIPPED_LIVE): Important: OpenShift Container Platform 3.9 security, bug fix, and enhancement update (last updated 2018-06-27 18:01:43 EDT)

Description Eric Rich 2017-12-13 14:20:07 EST
Description of problem: kube-system is not immortal! (issue: https://github.com/openshift/origin/issues/4228) 

Our fix was to implement https://github.com/openshift/origin/issues/3274; in short, either we never completed this or we have a regression.

Version-Release number of selected component (if applicable): 3.6
How reproducible: 100% 

Steps to Reproduce:
1. oc delete ns kube-system (as system:admin)
   - This may need to be run several times, as it is possible for the namespace to be deleted and recreated on its own.
   - Creating a project and a deployment (or having a running deployment) is thought to ensure that the namespace remains in Terminating; see the sketch below.
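
A minimal sketch of a reproduction session (illustrative only, assuming a cluster where you can log in as system:admin; not from the original report):

   # oc login -u system:admin
   # oc delete ns kube-system                     ### repeat if the namespace is recreated on its own
   # oc get ns kube-system                        ### STATUS eventually shows Terminating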

--- Additionally, you can confirm how bad this is (if this happens) by:

1. oc new-project test
2. oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

Actual results: The deleted kube-system namespace will keep deployments from happening (as shown by the additional section above): a build completes but the deployment fails.

# oc logs pods/ruby-ex-1-deploy
--> Scaling ruby-ex-1 to 1
error: couldn't scale ruby-ex-1 to 1: timed out waiting for "ruby-ex-1" to be synced
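
To confirm the root cause, one option (a hedged sketch; the "test" project name comes from the steps above) is to check that kube-system is stuck terminating and that the deployer pod failed:

   # oc get ns kube-system -o jsonpath='{.metadata.deletionTimestamp}'   ### non-empty while the namespace is stuck
   # oc get pods -n test                                                 ### the ruby-ex-1-deploy pod is expected to end in Error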


Expected results: The deployments should succeed; ultimately, it should not be possible to delete the kube-system namespace at all.


Additional info:
Comment 1 Eric Rich 2017-12-13 14:24:27 EST
It should be noted that you should be able to work around/recover from the deletion by doing the following:

1: Stop all instances of the atomic-openshift-master or atomic-openshift-master-api processes:

    # systemctl stop atomic-openshift-master OR # systemctl stop atomic-openshift-master-api
    # systemctl stop atomic-openshift-master*  ### This should also work

  - Note: if you have 3 masters, run this on all 3 masters! This will cause an outage to cluster operations!

2: Find / confirm the kube-system namespace is accessible directly from etcd (this step and the next are the INVASIVE part)
  - Note: We will use [-] as the foundation for explaining how to do this
  - Note: You need to fill in ${etcd_endpoint}, ${cert_file}, ${key_file} and ${ca_file} in the commands below with files/values that match your cluster; [-] shows you where/how you can look up these values.
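
  A hedged sketch of where these values typically live on an OpenShift 3.x master (the default paths below are assumptions; adjust them to your environment):

   # grep -A4 etcdClientInfo /etc/origin/master/master-config.yaml      ### lists ca, certFile, keyFile and urls
   # export etcd_endpoint=https://<master-host>:2379                    ### substitute a real etcd client URL
   # export cert_file=/etc/origin/master/master.etcd-client.crt
   # export key_file=/etc/origin/master/master.etcd-client.key
   # export ca_file=/etc/origin/master/master.etcd-ca.crt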

   # export ETCDCTL_API=2; etcdctl --endpoints ${etcd_endpoint} --cert-file ${cert_file} --key-file ${key_file} --ca-file ${ca_file} ls /kubernetes.io/namespaces
  OR
   # export ETCDCTL_API=3; etcdctl --endpoints=${etcd_endpoint} --cert ${cert_file} --key ${key_file} --cacert ${ca_file} get /kubernetes.io/namespaces --prefix --keys-only

3: Delete the kube-system namespace from etcd directly

    # export ETCDCTL_API=2; etcdctl --endpoints ${etcd_endpoint} --cert-file ${cert_file} --key-file ${key_file} --ca-file ${ca_file} rm /kubernetes.io/namespaces/kube-system
   OR
    # export ETCDCTL_API=3; etcdctl --endpoints=${etcd_endpoint} --cert ${cert_file} --key ${key_file} --cacert ${ca_file} del /kubernetes.io/namespaces/kube-system  

4: Restart all instances of the atomic-openshift-master or atomic-openshift-master-api processes:

    # systemctl restart atomic-openshift-master OR # systemctl restart atomic-openshift-master-api
    # systemctl restart atomic-openshift-master*  ### This is known _NOT_ to work (so unlike before, do _NOT_ try this).

  - Note: if you have 3 masters, run this on all 3 masters! This will cause an outage to cluster operations!

Once completed, the kube-system namespace should get re-created by the API process and your cluster should begin functioning again. To test this, you want to run the following:

   # oc get ns                                         ### confirm that the kube-system namespace is in fact created!
   # oc get all,sa,secrets -n kube-system              ### confirm that the kube-system namespace is in fact populated with secrets and service accounts!

   # oc rollout latest dc/ruby-ex -n test          ### confirm that this deploys the latest instance of your application. 

[-] https://access.redhat.com/articles/2542841
Comment 5 David Eads 2018-01-18 08:05:32 EST
Fixed in 3.6-3.8
Comment 6 David Eads 2018-01-18 08:05:43 EST
And 3.9
Comment 7 Wang Haoran 2018-01-18 21:22:08 EST
Verified with:
oc v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.16.120.125:8443
openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
