Bug 1714158

Summary: [DR][bare metal] Pod hang with container create error
Product: OpenShift Container Platform Reporter: zhou ying <yinzhou>
Component: kube-apiserver    Assignee: Tomáš Nožička <tnozicka>
Status: CLOSED ERRATA QA Contact: zhou ying <yinzhou>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1.0    CC: aos-bugs, jokerman, mfojtik, mmccomas, sttts, talessio, tnozicka, xxia
Target Milestone: ---   
Target Release: 4.1.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1745571 (view as bug list)    Environment:
Last Closed: 2019-09-20 12:29:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1745571, 1749271    
Bug Blocks:    

Description zhou ying 2019-05-27 09:25:49 UTC
Description of problem:
Followed the doc "https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit" to do certificate recovery. After forcing rotation, some pods hang with a container create error:
kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid
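
For reference, one way to confirm the expired-cert symptom is to inspect the serving certificate presented on the api-int endpoint named in the error above (a rough sketch; it assumes openssl is available and the endpoint is reachable from the host running the check):

# Validity window of the serving certificate presented on port 6443
echo | openssl s_client -connect api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443 2>/dev/null \
  | openssl x509 -noout -subject -dates

# Compare against the node clock; "expired or is not yet valid" means the current time
# falls outside the notBefore/notAfter window printed above
date -u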


Version-Release number of selected component (if applicable):
Payload: 4.1.0-0.nightly-2019-05-24-040103

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc: https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit to do certificate recovery;
2. In the "Breaking the cluster" step, after forcing rotation, pods hang in ContainerCreating (a quick check is sketched after this list).
3. Follow the "Recovery" steps in the doc; they do not work.
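
A quick check after step 2, sketched here for reference (the kubelet client cert path below is the standard /var/lib/kubelet/pki location; adjust if the nodes differ):

# List pods stuck outside Running/Completed after the forced rotation
oc get pods -A | grep -v -E "Running|Completed"

# On each node, print the validity window of the kubelet's current client certificate
openssl x509 -noout -subject -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem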


Actual results:
2. The cluster does not work well; pods hang with a container creating error:
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-24-040103   True        False         5h32m   Error while reconciling 4.1.0-0.nightly-2019-05-24-040103: an unknown error has occurred
[root@dhcp-140-138 ~]# oc get po -A |grep -v -E  "Running|Completed"
NAMESPACE                                               NAME                                                                  READY   STATUS              RESTARTS   AGE
openshift-controller-manager                            controller-manager-b6dm8                                              0/1     ContainerCreating   0          73m
openshift-kube-apiserver                                installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com                0/1     ContainerCreating   0          3h3m
openshift-kube-apiserver                                revision-pruner-7-dell-r730-005.dsal.lab.eng.rdu2.redhat.com          0/1     ContainerCreating   0          3h32m
openshift-kube-controller-manager                       installer-9-dell-r730-005.dsal.lab.eng.rdu2.redhat.com                0/1     ContainerCreating   0          3h31m
openshift-kube-scheduler                                installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com                0/1     ContainerCreating   0          3h32m
openshift-monitoring                                    prometheus-adapter-fd66c5659-8m7nl                                    0/1     ContainerCreating   0          78m
[root@dhcp-140-138 ~]# oc describe po controller-manager-b6dm8 -n openshift-controller-manager
......
  Warning  FailedCreatePodSandBox  2m27s (x40 over 14m)  kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid


3. After performing the "Recovery" steps, the cluster still hits the same error as in step 2.
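
To narrow down whether the stale certificates are the on-disk copies used by the static kube-apiserver pod, a rough check can be run on the affected master; the /etc/kubernetes/static-pod-resources path is an assumption about where the revisioned certs live and may need adjusting:

# Print the expiry of every cert under the static pod resources directory (path assumed)
for crt in $(find /etc/kubernetes/static-pod-resources -name '*.crt' 2>/dev/null); do
    echo "== $crt"
    openssl x509 -noout -enddate -in "$crt" 2>/dev/null
done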

Expected results:
2-3. The environment should work well.


Additional info:

Comment 1 Tomáš Nožička 2019-05-28 06:50:22 UTC
This is the apiserver serving with invalid certs. There seems to be a race between the recovery procedure and the cert-sync and install processes, which can overwrite the new certs with old ones. The workaround is to run the procedure a second time, starting with `regenerate-certs` and running all the following steps. A fix for the race will follow.
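
One way to spot that overwrite, sketched with placeholder names (the secret name and the revision directory are hypothetical; substitute the real ones), is to compare the certificate stored in the cluster with the copy the cert-syncer has written to disk:

# Serial of the cert held in the cluster (replace <serving-cert-secret> with the real secret name)
oc -n openshift-kube-apiserver get secret <serving-cert-secret> -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -serial

# Serial of the on-disk copy on the master (path and names assumed; adjust to the actual revision dir)
openssl x509 -noout -serial \
  -in /etc/kubernetes/static-pod-resources/<revision-dir>/secrets/<serving-cert-secret>/tls.crt

# If the serials differ after running regenerate-certs, the old cert has been synced back over the new one.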

Comment 3 Tomáš Nožička 2019-05-31 11:02:30 UTC
That raised some concerns for David about merging it in a short time frame; we agreed on a smaller change that should fix the most obvious races for the cert-syncer: https://github.com/openshift/cluster-kube-apiserver-operator/pull/487 (we should have that cert-syncer fix anyway to avoid stale caches).

Comment 6 Tomáš Nožička 2019-08-26 11:56:53 UTC
This isn't in the 4.1 branch yet.

Comment 13 zhou ying 2019-09-10 13:18:28 UTC
Confirmed with the latest payload, 4.1.0-0.nightly-2019-09-09-223953; cannot reproduce the issue.

Comment 15 errata-xmlrpc 2019-09-20 12:29:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2768