Bug 1714158

Summary: [DR][bare metal] Pod hang with container create error
Product: OpenShift Container Platform Reporter: zhou ying <yinzhou>
Component: kube-apiserver    Assignee: Tomáš Nožička <tnozicka>
Status: CLOSED ERRATA QA Contact: zhou ying <yinzhou>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1.0    CC: aos-bugs, jokerman, mfojtik, mmccomas, sttts, talessio, tnozicka, xxia
Target Milestone: ---   
Target Release: 4.1.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1745571 (view as bug list)    Environment:
Last Closed: 2019-09-20 12:29:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1745571, 1749271    
Bug Blocks:    

Description zhou ying 2019-05-27 09:25:49 UTC
Description of problem:
Followed the doc "https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit" to do certificate recovery. After forcing rotation, some pods hang with a container create error:
kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid
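
For reference, one way to confirm the expired-cert symptom is to inspect the serving certificate presented on the api-int endpoint named in the error above (a rough sketch; it assumes openssl is available and the endpoint is reachable from the host running the check):

# Validity window of the serving certificate presented on port 6443
echo | openssl s_client -connect api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443 2>/dev/null \
  | openssl x509 -noout -subject -dates

# Compare against the node clock; "expired or is not yet valid" means the current time
# falls outside the notBefore/notAfter window printed above
date -u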


Version-Release number of selected component (if applicable):
Payload: 4.1.0-0.nightly-2019-05-24-040103

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc: https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit to do certificate recovery;
2. In the "Breaking the cluster" step, after forcing rotation, pods hang in ContainerCreating (a quick check is sketched after this list).
3. Follow the "Recovery" steps in the doc; they do not work.
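
A quick check after step 2, sketched here for reference (the kubelet client cert path below is the standard /var/lib/kubelet/pki location; adjust if the nodes differ):

# List pods stuck outside Running/Completed after the forced rotation
oc get pods -A | grep -v -E "Running|Completed"

# On each node, print the validity window of the kubelet's current client certificate
openssl x509 -noout -subject -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem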


Actual results:
2. The cluster does not work well; pods hang with a container creating error:
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-24-040103   True        False         5h32m   Error while reconciling 4.1.0-0.nightly-2019-05-24-040103: an unknown error has occurred
[root@dhcp-140-138 ~]# oc get po -A |grep -v -E  "Running|Completed"
NAMESPACE                                               NAME                                                                  READY   STATUS              RESTARTS   AGE
openshift-controller-manager                            controller-manager-b6dm8                                              0/1     ContainerCreating   0          73m
openshift-kube-apiserver                                installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com                0/1     ContainerCreating   0          3h3m
openshift-kube-apiserver                                revision-pruner-7-dell-r730-005.dsal.lab.eng.rdu2.redhat.com          0/1     ContainerCreating   0          3h32m
openshift-kube-controller-manager                       installer-9-dell-r730-005.dsal.lab.eng.rdu2.redhat.com                0/1     ContainerCreating   0          3h31m
openshift-kube-scheduler                                installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com                0/1     ContainerCreating   0          3h32m
openshift-monitoring                                    prometheus-adapter-fd66c5659-8m7nl                                    0/1     ContainerCreating   0          78m
[root@dhcp-140-138 ~]# oc describe po controller-manager-b6dm8 -n openshift-controller-manager
......
  Warning  FailedCreatePodSandBox  2m27s (x40 over 14m)  kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid


3. After performing the "Recovery" steps, the cluster still hits the same error as in step 2.
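
To narrow down whether the stale certificates are the on-disk copies used by the static kube-apiserver pod, a rough check can be run on the affected master; the /etc/kubernetes/static-pod-resources path is an assumption about where the revisioned certs live and may need adjusting:

# Print the expiry of every cert under the static pod resources directory (path assumed)
for crt in $(find /etc/kubernetes/static-pod-resources -name '*.crt' 2>/dev/null); do
    echo "== $crt"
    openssl x509 -noout -enddate -in "$crt" 2>/dev/null
done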

Expected results:
2-3. The environment should work well.


Additional info:

Comment 1 Tomáš Nožička 2019-05-28 06:50:22 UTC
This is the apiserver serving with invalid certs. There seems to be a race between the recovery procedure and the cert-sync and install processes, which can overwrite the new certs with old ones. The workaround is to run the procedure a second time, starting with `regenerate-certs` and running all the following steps. A fix for the race will follow.
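
One way to spot that overwrite, sketched with placeholder names (the secret name and the revision directory are hypothetical; substitute the real ones), is to compare the certificate stored in the cluster with the copy the cert-syncer has written to disk:

# Serial of the cert held in the cluster (replace <serving-cert-secret> with the real secret name)
oc -n openshift-kube-apiserver get secret <serving-cert-secret> -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -serial

# Serial of the on-disk copy on the master (path and names assumed; adjust to the actual revision dir)
openssl x509 -noout -serial \
  -in /etc/kubernetes/static-pod-resources/<revision-dir>/secrets/<serving-cert-secret>/tls.crt

# If the serials differ after running regenerate-certs, the old cert has been synced back over the new one.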

Comment 3 Tomáš Nožička 2019-05-31 11:02:30 UTC
That raised some concerns for David about merging it in a short time frame; we agreed on a smaller change that should fix the most obvious races for the cert-syncer: https://github.com/openshift/cluster-kube-apiserver-operator/pull/487 (we should have that cert-syncer fix anyway to avoid stale caches).

Comment 6 Tomáš Nožička 2019-08-26 11:56:53 UTC
This isn't in the 4.1 branch yet.

Comment 13 zhou ying 2019-09-10 13:18:28 UTC
Confirmed with the latest payload, 4.1.0-0.nightly-2019-09-09-223953; cannot reproduce the issue.

Comment 15 errata-xmlrpc 2019-09-20 12:29:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2768