Description of problem:
Followed the doc "https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit" to do the certificate recovery. After forcing rotation, some pods hang in ContainerCreating with a create container error:

kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid

Version-Release number of selected component (if applicable):
Payload: 4.1.0-0.nightly-2019-05-24-040103

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit to do certificate recovery.
2. In the "Breaking the cluster" section, after forcing rotation, pods hang in ContainerCreating.
3. Try to follow the doc's "Recovery" section; it does not work.

Actual results:
2. The cluster does not work well; pods hang with a ContainerCreating error:

[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-24-040103   True        False         5h32m   Error while reconciling 4.1.0-0.nightly-2019-05-24-040103: an unknown error has occurred

[root@dhcp-140-138 ~]# oc get po -A | grep -v -E "Running|Completed"
NAMESPACE                           NAME                                                           READY   STATUS              RESTARTS   AGE
openshift-controller-manager        controller-manager-b6dm8                                       0/1     ContainerCreating   0          73m
openshift-kube-apiserver            installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com         0/1     ContainerCreating   0          3h3m
openshift-kube-apiserver            revision-pruner-7-dell-r730-005.dsal.lab.eng.rdu2.redhat.com   0/1     ContainerCreating   0          3h32m
openshift-kube-controller-manager   installer-9-dell-r730-005.dsal.lab.eng.rdu2.redhat.com         0/1     ContainerCreating   0          3h31m
openshift-kube-scheduler            installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com         0/1     ContainerCreating   0          3h32m
openshift-monitoring                prometheus-adapter-fd66c5659-8m7nl                             0/1     ContainerCreating   0          78m

[root@dhcp-140-138 ~]# oc describe po controller-manager-b6dm8 -n openshift-controller-manager
......
Warning  FailedCreatePodSandBox  2m27s (x40 over 14m)  kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid

3. After doing the "Recovery" steps, the cluster still hits the same error as in step 2.
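The "x509: certificate has expired or is not yet valid" failure can be confirmed by inspecting the certificate's validity window directly with openssl. A minimal sketch (the demo generates a throwaway self-signed cert; on a real node you would point openssl at the actual serving cert file, whose path comes from the recovery doc, not from this report):

```shell
# Generate a throwaway self-signed cert to stand in for the serving cert.
# (The /tmp paths are illustrative only.)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo.key -out /tmp/demo.crt \
  -days 1 -subj "/CN=demo" 2>/dev/null

# Print the notBefore/notAfter validity window.
openssl x509 -in /tmp/demo.crt -noout -dates

# Exit status 0 if the cert is still valid 60 seconds from now,
# non-zero if it will have expired by then.
openssl x509 -in /tmp/demo.crt -noout -checkend 60 && echo "cert still valid"
```

A cert that is "not yet valid" (notBefore in the future) typically points at clock skew or a stale cert being served, rather than simple expiry.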
Expected results:
2-3. The env should work well.

Additional info:
This is the apiserver serving invalid certs. There seems to be a race between the recovery procedure and the cert-sync and install processes, which can overwrite the new certs with old ones. The workaround is to run the procedure a second time, starting with `regenerate-certs` and running all the following steps. A fix for the race will follow.
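One way to confirm the overwrite race described above is to record the on-disk certificate's fingerprint right after regeneration and compare it again later: if the fingerprint reverts, the sync process restored the stale cert. A hedged sketch (the /tmp files and the `fingerprint` helper are hypothetical stand-ins; real cert paths come from the recovery doc):

```shell
# Helper: SHA-256 fingerprint of a PEM certificate.
fingerprint() {
  openssl x509 -in "$1" -noout -fingerprint -sha256
}

# Two throwaway self-signed certs standing in for the "new" and
# "stale" serving certs (paths are assumptions, not from the report).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/new.key \
  -out /tmp/new.crt -days 365 -subj "/CN=new" 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/old.key \
  -out /tmp/old.crt -days 365 -subj "/CN=old" 2>/dev/null

before=$(fingerprint /tmp/new.crt)
# ... time passes; the sync process may rewrite the file ...
after=$(fingerprint /tmp/old.crt)   # simulate the stale overwrite

if [ "$before" != "$after" ]; then
  echo "cert was replaced after regeneration"
fi
```

If the fingerprint has changed back to the pre-rotation value, re-running the procedure from `regenerate-certs` as described above is the workaround.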
https://github.com/openshift/cluster-kube-apiserver-operator/pull/483
That has raised some worries for David about merging it in a short time frame; we agreed on a smaller change that should fix the most obvious races for the cert-syncer: https://github.com/openshift/cluster-kube-apiserver-operator/pull/487 (we should have that cert-syncer fix anyway to avoid stale caches).
Already bumped with https://github.com/openshift/cluster-kube-apiserver-operator/pull/487#issuecomment-503060868
This isn't in the 4.1 branch yet.
Confirmed with the latest payload 4.1.0-0.nightly-2019-09-09-223953; can't reproduce the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2768