Description of problem:
Followed the doc "https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit" to do the certificate recovery. After forcing rotation, some pods hang in ContainerCreating with a create container error:

kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid

Version-Release number of selected component (if applicable):
Payload: 4.1.0-0.nightly-2019-05-24-040103

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit to do certificate recovery.
2. In the "Breaking the cluster" section, after forcing rotation, pods hang in ContainerCreating.
3. Try to follow the doc's "Recovery" section; it does not work.

Actual results:
2. The cluster does not work well; pods hang with a ContainerCreating error:

[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-24-040103   True        False         5h32m   Error while reconciling 4.1.0-0.nightly-2019-05-24-040103: an unknown error has occurred

[root@dhcp-140-138 ~]# oc get po -A | grep -v -E "Running|Completed"
NAMESPACE                           NAME                                                           READY   STATUS              RESTARTS   AGE
openshift-controller-manager        controller-manager-b6dm8                                       0/1     ContainerCreating   0          73m
openshift-kube-apiserver            installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com         0/1     ContainerCreating   0          3h3m
openshift-kube-apiserver            revision-pruner-7-dell-r730-005.dsal.lab.eng.rdu2.redhat.com   0/1     ContainerCreating   0          3h32m
openshift-kube-controller-manager   installer-9-dell-r730-005.dsal.lab.eng.rdu2.redhat.com         0/1     ContainerCreating   0          3h31m
openshift-kube-scheduler            installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com         0/1     ContainerCreating   0          3h32m
openshift-monitoring                prometheus-adapter-fd66c5659-8m7nl                             0/1     ContainerCreating   0          78m

[root@dhcp-140-138 ~]# oc describe po controller-manager-b6dm8 -n openshift-controller-manager
......
Warning  FailedCreatePodSandBox  2m27s (x40 over 14m)  kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid

3. After doing the "Recovery" steps, the cluster still hits the same error as in step 2.
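The "x509: certificate has expired or is not yet valid" failure can be confirmed by inspecting the certificate's validity window directly with openssl. A minimal sketch (the demo generates a throwaway self-signed cert; on a real node you would point openssl at the actual serving cert file, whose path comes from the recovery doc, not from this report):

```shell
# Generate a throwaway self-signed cert to stand in for the serving cert.
# (The /tmp paths are illustrative only.)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo.key -out /tmp/demo.crt \
  -days 1 -subj "/CN=demo" 2>/dev/null

# Print the notBefore/notAfter validity window.
openssl x509 -in /tmp/demo.crt -noout -dates

# Exit status 0 if the cert is still valid 60 seconds from now,
# non-zero if it will have expired by then.
openssl x509 -in /tmp/demo.crt -noout -checkend 60 && echo "cert still valid"
```

A cert that is "not yet valid" (notBefore in the future) typically points at clock skew or a stale cert being served, rather than simple expiry.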
Expected results:
2-3. The env should work well.

Additional info:
This is the apiserver serving invalid certs. There seems to be a race between the recovery procedure and the cert-sync and install processes, which can overwrite the new certs with old ones. The workaround is to run the procedure a second time, starting with `regenerate-certs` and running all the following steps. A fix for the race will follow.
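One way to confirm the overwrite race described above is to record the on-disk certificate's fingerprint right after regeneration and compare it again later: if the fingerprint reverts, the sync process restored the stale cert. A hedged sketch (the /tmp files and the `fingerprint` helper are hypothetical stand-ins; real cert paths come from the recovery doc):

```shell
# Helper: SHA-256 fingerprint of a PEM certificate.
fingerprint() {
  openssl x509 -in "$1" -noout -fingerprint -sha256
}

# Two throwaway self-signed certs standing in for the "new" and
# "stale" serving certs (paths are assumptions, not from the report).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/new.key \
  -out /tmp/new.crt -days 365 -subj "/CN=new" 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/old.key \
  -out /tmp/old.crt -days 365 -subj "/CN=old" 2>/dev/null

before=$(fingerprint /tmp/new.crt)
# ... time passes; the sync process may rewrite the file ...
after=$(fingerprint /tmp/old.crt)   # simulate the stale overwrite

if [ "$before" != "$after" ]; then
  echo "cert was replaced after regeneration"
fi
```

If the fingerprint has changed back to the pre-rotation value, re-running the procedure from `regenerate-certs` as described above is the workaround.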
https://github.com/openshift/cluster-kube-apiserver-operator/pull/483
That has raised some worries for David about merging it in a short time frame; we agreed on a smaller change that should fix the most obvious races for the cert-syncer: https://github.com/openshift/cluster-kube-apiserver-operator/pull/487 (we should have that cert-syncer fix anyway to avoid stale caches).
Already bumped with https://github.com/openshift/cluster-kube-apiserver-operator/pull/487#issuecomment-503060868
This isn't in the 4.1 branch yet.
Confirmed with the latest payload 4.1.0-0.nightly-2019-09-09-223953; can't reproduce the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2768