Bug 1714158 - [DR][bare metal] Pod hang with container create error
Summary: [DR][bare metal] Pod hang with container create error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.1.z
Assignee: Tomáš Nožička
QA Contact: zhou ying
URL:
Whiteboard:
Depends On: 1745571 1749271
Blocks:
 
Reported: 2019-05-27 09:25 UTC by zhou ying
Modified: 2019-09-20 12:29 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 1745571
Environment:
Last Closed: 2019-09-20 12:29:24 UTC
Target Upstream Version:
Embargoed:




Links:
  Github openshift/cluster-kube-apiserver-operator pull 557 (closed): Bug 1714158: Prevent cert-syncer to act on stale data (last updated 2020-10-20 07:01:16 UTC)
  Github openshift/library-go pull 512 (closed): Bug 1714158: Prevent cert-syncer to act on stale data (last updated 2020-10-20 07:01:17 UTC)
  Red Hat Product Errata RHBA-2019:2768 (last updated 2019-09-20 12:29:38 UTC)

Description zhou ying 2019-05-27 09:25:49 UTC
Description of problem:
Followed the doc "https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit" to do the certificate recovery. After forcing rotation, some pods hang with a container create error:
kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid
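
For reference, one way to confirm the expired certificate from outside the cluster is to inspect what the internal API endpoint actually presents. This is only a diagnostic sketch (the host name is taken from the error above, and checking the serving certificate is an assumption, since the kubelet error does not say which certificate failed):

$ echo | openssl s_client -connect api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443 \
    -servername api-int.baremetal-lab-02.qe.devcluster.openshift.com 2>/dev/null \
  | openssl x509 -noout -subject -dates
# notBefore/notAfter should bracket the current time; an expired notAfter matches the
# "x509: certificate has expired or is not yet valid" error reported by the kubelet.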


Version-Release number of selected component (if applicable):
Payload: 4.1.0-0.nightly-2019-05-24-040103

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc: https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit to do certificate recovery;
2. When doing the "Breaking the cluster" step, after forcing rotation, pods hang in ContainerCreating.
3. Follow the doc to do the "Recovery"; it does not work.


Actual results:
2. The cluster does not work well; pods hang in ContainerCreating with the following error:
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-24-040103   True        False         5h32m   Error while reconciling 4.1.0-0.nightly-2019-05-24-040103: an unknown error has occurred
[root@dhcp-140-138 ~]# oc get po -A |grep -v -E  "Running|Completed"
NAMESPACE                                               NAME                                                                  READY   STATUS              RESTARTS   AGE
openshift-controller-manager                            controller-manager-b6dm8                                              0/1     ContainerCreating   0          73m
openshift-kube-apiserver                                installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com                0/1     ContainerCreating   0          3h3m
openshift-kube-apiserver                                revision-pruner-7-dell-r730-005.dsal.lab.eng.rdu2.redhat.com          0/1     ContainerCreating   0          3h32m
openshift-kube-controller-manager                       installer-9-dell-r730-005.dsal.lab.eng.rdu2.redhat.com                0/1     ContainerCreating   0          3h31m
openshift-kube-scheduler                                installer-8-dell-r730-005.dsal.lab.eng.rdu2.redhat.com                0/1     ContainerCreating   0          3h32m
openshift-monitoring                                    prometheus-adapter-fd66c5659-8m7nl                                    0/1     ContainerCreating   0          78m
[root@dhcp-140-138 ~]# oc describe po controller-manager-b6dm8 -n openshift-controller-manager
......
  Warning  FailedCreatePodSandBox  2m27s (x40 over 14m)  kubelet, dell-r730-005.dsal.lab.eng.rdu2.redhat.com  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-b6dm8_openshift-controller-manager_b96a131c-8052-11e9-a88b-14187743ef41_0(b6444aa1a8e31dcb69def2041d5f77351b944272e0c765916515ce9f72d94223): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'Get https://api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-controller-manager/pods/controller-manager-b6dm8: x509: certificate has expired or is not yet valid


3. After doing the "Recovery", the cluster still hits the same error as in step 2.

Expected results:
2-3. The environment should work well.


Additional info:

Comment 1 Tomáš Nožička 2019-05-28 06:50:22 UTC
This is the apiserver running with invalid certs. There seems to be a race between the recovery procedure and the cert-syncer and installer processes, which can overwrite the new certs with the old ones. The workaround is to run the procedure a second time, starting with `regenerate-certs` and running all the following steps. A fix for the race will follow.
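
A quick way to tell whether this race has left stale certs behind is to compare the expiry of the cert the operator currently manages with the one the running apiserver actually serves. This is an illustrative sketch only: it assumes oc access has already been restored via the recovery kubeconfig, and the secret name internal-loadbalancer-serving-certkey is an assumption that may differ between releases:

# Expiry of the cert held in the operator-managed secret (secret name is an assumption)
$ oc -n openshift-kube-apiserver get secret internal-loadbalancer-serving-certkey \
    -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate

# Expiry of the cert the running apiserver presents on the internal endpoint
$ echo | openssl s_client -connect api-int.baremetal-lab-02.qe.devcluster.openshift.com:6443 2>/dev/null \
  | openssl x509 -noout -enddate

# If the served cert is older than the managed one, the cert-syncer/installer race likely
# overwrote the regenerated certs, and the procedure should be re-run from `regenerate-certs`.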

Comment 3 Tomáš Nožička 2019-05-31 11:02:30 UTC
That raised some worries for David about merging it in a short time frame, so we agreed on a smaller change that should fix the most obvious races for the cert-syncer: https://github.com/openshift/cluster-kube-apiserver-operator/pull/487 (we should have that cert-syncer fix anyway to avoid stale caches).

Comment 6 Tomáš Nožička 2019-08-26 11:56:53 UTC
This isn't in the 4.1 branch yet.
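
One way to check whether a given payload already carries the backport is to look at the operator commit baked into the release image; a sketch, assuming the nightly is still available from the CI release registry (the pull spec below is an example):

$ oc adm release info --commits \
    registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-09-09-223953 \
  | grep cluster-kube-apiserver-operator
# Compare the reported commit against the merge commit of the 4.1 backport PR.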

Comment 13 zhou ying 2019-09-10 13:18:28 UTC
Confirmed with the latest payload 4.1.0-0.nightly-2019-09-09-223953; can't reproduce the issue.

Comment 15 errata-xmlrpc 2019-09-20 12:29:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2768

