Description of problem:

kube-system is not immortal! (issue: https://github.com/openshift/origin/issues/4228)
Our fix was to implement https://github.com/openshift/origin/issues/3274; in short, either we never completed this or we have a regression.

Version-Release number of selected component (if applicable):

3.6

How reproducible:

100%

Steps to Reproduce:

1. oc delete ns kube-system (as system:admin)
   - This may need to be run several times, as it is possible for the namespace to be deleted and then recreated on its own.
   - Creating a project with a deployment (or having a running deployment) is thought to ensure that the namespace remains in Terminating.

Additionally, you can confirm how bad this is (if it happens) by:

1. oc new-project test
2. oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

Actual results:

The missing kube-system namespace will keep deployments from happening (as shown by the additional section above): the build completes but the deployment fails.

# oc logs pods/ruby-ex-1-deploy
--> Scaling ruby-ex-1 to 1
error: couldn't scale ruby-ex-1 to 1: timed out waiting for "ruby-ex-1" to be synced

Expected results:

The deployments should succeed, or (ultimately) it should not be possible to delete the kube-system namespace.

Additional info:
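To confirm which failure state you are in after the delete, a quick check of the namespace phase can help. This is a hypothetical sketch, not part of the original reproduction steps; it assumes an `oc` session logged in with cluster-admin rights.

```shell
# Illustrative check: after "oc delete ns kube-system", the namespace either
# disappears entirely or gets stuck in the Terminating phase.
phase=$(oc get ns kube-system -o jsonpath='{.status.phase}' 2>/dev/null || echo Missing)
case "$phase" in
  Active)      echo "kube-system is healthy" ;;
  Terminating) echo "kube-system is stuck terminating -- deployments will hang" ;;
  *)           echo "kube-system is gone -- the bug has been triggered" ;;
esac
```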
It should be noted that you should be able to work around/recover from the delete by doing the following:

1: Stop all instances of the atomic-openshift-master or atomic-openshift-master-api processes:

# systemctl stop atomic-openshift-master
OR
# systemctl stop atomic-openshift-master-api
# systemctl stop atomic-openshift-master*   ### This should also work

- Note: if you have 3 masters, run this on all 3 masters! This will cause an outage to cluster operations!

2: Find/confirm that the kube-system namespace is accessible directly from etcd (this and the next step are the INVASIVE part).

- Note: We will use [-] as the foundation for explaining how to do this.
- Note: You need to fill in ${etcd_endpoint}, ${cert_file}, ${key_file} and ${ca_file} in the commands below with files/values that match your cluster; [-] shows you where/how to look up these values.

# export ETCDCTL_API=2; etcdctl --endpoints ${etcd_endpoint} --cert-file ${cert_file} --key-file ${key_file} --ca-file ${ca_file} ls /kubernetes.io/namespaces
OR
# export ETCDCTL_API=3; etcdctl --endpoints=${etcd_endpoint} --cert ${cert_file} --key ${key_file} --cacert ${ca_file} get /kubernetes.io/namespaces --prefix --keys-only

3: Delete the kube-system namespace from etcd directly (note the v2 API uses "rm" while the v3 API uses "del"):

# export ETCDCTL_API=2; etcdctl --endpoints ${etcd_endpoint} --cert-file ${cert_file} --key-file ${key_file} --ca-file ${ca_file} rm /kubernetes.io/namespaces/kube-system
OR
# export ETCDCTL_API=3; etcdctl --endpoints=${etcd_endpoint} --cert ${cert_file} --key ${key_file} --cacert ${ca_file} del /kubernetes.io/namespaces/kube-system

4: Restart all instances of the atomic-openshift-master or atomic-openshift-master-api processes:

# systemctl restart atomic-openshift-master
OR
# systemctl restart atomic-openshift-master-api
# systemctl restart atomic-openshift-master*   ### This is known _NOT_ to work (so unlike before, do _NOT_ try this).

- Note: if you have 3 masters, run this on all 3 masters! This will cause an outage to cluster operations!
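Steps 2 and 3 above can be tied together in a single sketch that builds the etcd v3 API commands and prints them for review before anything is executed. The variable defaults below are only illustrative placeholders, not real paths; substitute the actual endpoint and client-certificate paths for your cluster (see [-] for where to look them up).

```shell
# Sketch only: construct and print the v3-API etcdctl commands for steps 2
# and 3. Nothing is executed against etcd; the commands are echoed so they
# can be reviewed first. All defaults are placeholders (assumptions).
etcd_endpoint="${etcd_endpoint:-https://127.0.0.1:2379}"
cert_file="${cert_file:-/path/to/etcd-client.crt}"
key_file="${key_file:-/path/to/etcd-client.key}"
ca_file="${ca_file:-/path/to/etcd-ca.crt}"

base="ETCDCTL_API=3 etcdctl --endpoints=${etcd_endpoint} --cert ${cert_file} --key ${key_file} --cacert ${ca_file}"
list_cmd="${base} get /kubernetes.io/namespaces --prefix --keys-only"
del_cmd="${base} del /kubernetes.io/namespaces/kube-system"

echo "step 2: ${list_cmd}"
echo "step 3: ${del_cmd}"
```

Running the printed "step 2" command first confirms the key exists before the destructive "step 3" command is attempted.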
Once completed, the kube-system namespace should get re-created by the API process and your cluster should begin functioning again. To test this, run the following:

# oc get ns                              ### confirm that the kube-system namespace is in fact created!
# oc get all,sa,secrets -n kube-system   ### confirm that the kube-system namespace is in fact populated with secrets and service accounts!
# oc rollout latest dc/ruby-ex -n test   ### confirm that this deploys the latest instance of your application.

[-] https://access.redhat.com/articles/2542841
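The first of those checks can also be scripted as a small polling loop, since the API server may take a few seconds to recreate the namespace after the restart. This is a hedged sketch, not part of the documented recovery; it assumes a logged-in cluster-admin `oc` session, and the attempt count and interval are arbitrary.

```shell
# Illustrative wait loop: poll until the recreated kube-system namespace
# reports the Active phase.
phase=""
for attempt in 1 2 3; do
  phase=$(oc get ns kube-system -o jsonpath='{.status.phase}' 2>/dev/null || true)
  if [ "$phase" = "Active" ]; then break; fi
  sleep 1
done
echo "kube-system phase: ${phase:-not created yet}"
```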
Fixed in 3.6-3.8, and 3.9.
Verified with:

oc v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.16.120.125:8443
openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657