Bug 1835146 - [BM IPI] etcd containers not started on master after restoring to previous state
Summary: [BM IPI] etcd containers not started on master after restoring to previous state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1836270
Reported: 2020-05-13 08:53 UTC by Yurii Prokulevych
Modified: 2020-07-13 17:38 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The restore process stopped all static pods but restarted only etcd, api-server, api-scheduler and controller-manager. Bare metal has network pods, which were stopped but not restarted.
Consequence: Without the network pods restarted, the kubelets cannot communicate and the cluster does not come up.
Fix: Only stop the 4 pods that are affected by the restore.
Result: The cluster comes up fine.
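
The fix described in the Doc Text can be illustrated with a minimal sketch. It is not the code from the linked pull request. It assumes static pods are stopped by moving their manifests out of the kubelet manifest directory, and the four file names below are hypothetical stand-ins for the etcd, api-server, scheduler and controller-manager manifests.

```python
import shutil
from pathlib import Path

# Hypothetical manifest file names for the four static pods that the
# Doc Text says are affected by a restore; real names may differ.
RESTORE_AFFECTED = (
    "etcd-pod.yaml",
    "kube-apiserver-pod.yaml",
    "kube-controller-manager-pod.yaml",
    "kube-scheduler-pod.yaml",
)


def manifests_to_stop(manifest_dir, affected=RESTORE_AFFECTED):
    """Return only the affected manifests that exist in manifest_dir.

    Every other file in the directory (for example the network static
    pods on bare metal) is left alone, so those pods keep running.
    """
    directory = Path(manifest_dir)
    return sorted(p for p in directory.iterdir() if p.name in affected)


def stop_static_pods(manifest_dir, stash_dir):
    """Move the selected manifests into stash_dir and return their names."""
    stash = Path(stash_dir)
    stash.mkdir(parents=True, exist_ok=True)
    moved = []
    for manifest in manifests_to_stop(manifest_dir):
        shutil.move(str(manifest), str(stash / manifest.name))
        moved.append(manifest.name)
    return moved
```

The point of the allow-list is that unknown manifests are never touched: the original behaviour (stop everything, restart four) breaks as soon as a platform adds its own static pods.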
Clone Of:
: 1836270 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:38:07 UTC
Target Upstream Version:
Embargoed:


Links
System                   ID                                         Private   Priority   Status   Summary                                                                        Last Updated
Github                   openshift cluster-etcd-operator pull 352   0         None       closed   Bug 1835146: Cluster restore should not stop network pods on bare-metal       2020-12-29 11:06:11 UTC
Red Hat Product Errata   RHBA-2020:2409                             0         None       None     None                                                                           2020-07-13 17:38:26 UTC

Description Yurii Prokulevych 2020-05-13 08:53:52 UTC
Description of problem:
-----------------------
After restoring the cluster to a previous state, there are no etcd containers on 2 masters, and on the recovery master there is just 1 `etcd` container.


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Installed: 4.4.0-0.nightly-2020-05-01-231319
Updated to 4.4.3
Restored to: 4.4.0-0.nightly-2020-05-01-231319


Steps to Reproduce:
1. https://url.corp.redhat.com/7fc8d2a 


Actual results:
---------------
etcd containers are not running on 2 of the 3 master nodes

All nodes are in `SchedulingDisabled` state

Some operators were not restored to the previous version:
oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-05-01-231319   True        False         False      17h
cloud-credential                           4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
cluster-autoscaler                         4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
console                                    4.4.0-0.nightly-2020-05-01-231319   True        False         False      3h8m
csi-snapshot-controller                    4.4.3                               False       True          False      72m
dns                                        4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
etcd                                       4.4.0-0.nightly-2020-05-01-231319   True        False         False      3h4m
image-registry                             4.4.3                               False       True          False      72m
ingress                                    4.4.0-0.nightly-2020-05-01-231319   True        False         False      3h13m
insights                                   4.4.3                               True        False         False      18h
kube-apiserver                             4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
kube-controller-manager                    4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
kube-scheduler                             4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
kube-storage-version-migrator              4.4.0-0.nightly-2020-05-01-231319   True        False         False      3h13m
machine-api                                4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
machine-config                             4.4.0-0.nightly-2020-05-01-231319   True        False         False      17h
marketplace                                4.4.0-0.nightly-2020-05-01-231319   True        False         False      3h8m
monitoring                                 4.4.0-0.nightly-2020-05-01-231319   True        False         False      17h
network                                    4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
node-tuning                                4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
openshift-apiserver                        4.4.0-0.nightly-2020-05-01-231319   True        False         False      3h13m
openshift-controller-manager               4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
openshift-samples                          4.4.0-0.nightly-2020-05-01-231319   True        True          False      178m
operator-lifecycle-manager                 4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-05-01-231319   True        False         False      3h7m
service-ca                                 4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
service-catalog-apiserver                  4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
service-catalog-controller-manager         4.4.3                               True        False         False      18h
storage                                    4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
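
To pick out the operators that stayed on the newer version without reading the table by eye, a small helper like the one below can be used. This is only a sketch: the function name is made up for illustration, and it assumes the plain-text layout shown above, where NAME and VERSION are the first two whitespace-separated columns and VERSION is never empty.

```python
def operators_not_at(version, oc_get_co_output):
    """Return {operator: reported_version} for rows not at `version`."""
    mismatched = {}
    for line in oc_get_co_output.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/VERSION header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, reported = fields[0], fields[1]
        if reported != version:
            mismatched[name] = reported
    return mismatched


# A few rows copied from the output in this report.
SAMPLE = """\
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
console                   4.4.0-0.nightly-2020-05-01-231319   True        False         False      3h8m
csi-snapshot-controller   4.4.3                               False       True          False      72m
insights                  4.4.3                               True        False         False      18h
storage                   4.4.0-0.nightly-2020-05-01-231319   True        False         False      18h
"""

STUCK = operators_not_at("4.4.0-0.nightly-2020-05-01-231319", SAMPLE)
```

Applied to the full table above, the rows that differ from the restored version are csi-snapshot-controller, image-registry, insights and service-catalog-controller-manager, all at 4.4.3.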


Expected results:
-----------------
etcd containers started on all masters

Nodes are in Ready state

All cluster operators are rolled back to previous version


Additional info:
----------------
Virtual setup: 3 masters + 2 workers + CNV

After manually uncordoning the 3 masters:

Name:         cluster
Namespace:
Labels:       <none>
Annotations:  release.openshift.io/create-only: true
API Version:  operator.openshift.io/v1
Kind:         Etcd
Metadata:
  Creation Timestamp:  2020-05-12T13:53:56Z
  Generation:          2
  Resource Version:    560342
  Self Link:           /apis/operator.openshift.io/v1/etcds/cluster
  UID:                 9c6368e4-3c28-4ae0-a565-2c72552fb396
Spec:
  Force Redeployment Reason:  recovery-2020-05-13 07:05:01.344000145+00:00
  Management State:           Managed
Status:
  Conditions:
    Last Transition Time:            2020-05-12T14:05:45Z
    Reason:                          NoUnsupportedConfigOverrides
    Status:                          True
    Type:                            UnsupportedConfigOverridesUpgradeable
    Last Transition Time:            2020-05-12T14:59:01Z
    Status:                          False
    Type:                            InstallerControllerDegraded
    Last Transition Time:            2020-05-12T14:08:33Z
    Message:                         3 nodes are active; 3 nodes are at revision 4; 0 nodes have achieved new revision 5
    Status:                          True
    Type:                            StaticPodsAvailable
    Last Transition Time:            2020-05-13T08:39:48Z
    Message:                         3 nodes are at revision 4; 0 nodes have achieved new revision 5
    Status:                          True
    Type:                            NodeInstallerProgressing
    Last Transition Time:            2020-05-12T14:05:45Z
    Status:                          False
    Type:                            NodeInstallerDegraded
    Last Transition Time:            2020-05-12T14:05:45Z
    Reason:                          HostEndpoints2Updated
    Status:                          False
    Type:                            HostEndpoints2Degraded
    Last Transition Time:            2020-05-12T14:14:43Z
    Status:                          False
    Type:                            StaticPodsDegraded
    Last Transition Time:            2020-05-13T08:39:27Z
    Message:                         The master nodes not ready: node "master-0-0" not ready since 2020-05-13 07:03:11 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.), node "master-0-2" not ready since 2020-05-13 07:03:11 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
    Reason:                          MasterNodesReady
    Status:                          True
    Type:                            NodeControllerDegraded
    Last Transition Time:            2020-05-13T08:39:29Z
    Reason:                          AsExpected
    Status:                          False
    Type:                            ScriptControllerDegraded
    Last Transition Time:            2020-05-12T14:05:47Z
    Status:                          False
    Type:                            InstallerPodPendingDegraded
    Last Transition Time:            2020-05-12T14:05:47Z
    Status:                          False
    Type:                            InstallerPodContainerWaitingDegraded
    Last Transition Time:            2020-05-12T14:05:47Z
    Status:                          False
    Type:                            InstallerPodNetworkingDegraded
    Last Transition Time:            2020-05-12T14:05:48Z
    Reason:                          AsExpected
    Status:                          False
    Type:                            EnvVarControllerDegraded
    Last Transition Time:            2020-05-12T14:05:48Z
    Status:                          False
    Type:                            ConfigObservationDegraded
    Last Transition Time:            2020-05-13T08:39:45Z
    Status:                          False
    Type:                            RevisionControllerDegraded
    Last Transition Time:            2020-05-12T14:05:58Z
    Reason:                          AsExpected
    Status:                          False
    Type:                            BackingResourceControllerDegraded
    Last Transition Time:            2020-05-13T05:07:07Z
    Reason:                          AsExpected
    Status:                          False
    Type:                            ClusterMemberControllerDegraded
    Last Transition Time:            2020-05-12T14:06:04Z
    Status:                          False
    Type:                            TargetConfigControllerDegraded
    Last Transition Time:            2020-05-12T14:06:07Z
    Status:                          False
    Type:                            ResourceSyncControllerDegraded
    Last Transition Time:            2020-05-12T14:06:08Z
    Reason:                          AsExpected
    Status:                          False
    Type:                            EtcdCertSignerControllerDegraded
    Last Transition Time:            2020-05-12T14:06:08Z
    Last Transition Time:            2020-05-12T14:06:08Z
    Reason:                          AsExpected
    Status:                          False
    Type:                            EtcdStaticResourcesDegraded
    Last Transition Time:            2020-05-13T05:07:06Z
    Reason:                          AsExpected
    Status:                          False
    Type:                            EtcdMemberIPMigratorDegraded
    Last Transition Time:            2020-05-13T05:07:05Z
    Reason:                          MembersReported
    Status:                          False
    Type:                            EtcdMembersControllerDegraded
    Last Transition Time:            2020-05-13T05:07:05Z
    Reason:                          AsExpected
    Status:                          False
    Type:                            BootstrapTeardownDegraded
    Last Transition Time:            2020-05-13T05:20:04Z
    Message:                         No unhealthy members found
    Reason:                          AsExpected
    Status:                          False
    Type:                            EtcdMembersDegraded
    Last Transition Time:            2020-05-12T14:09:31Z
    Message:                         etcd-bootstrap member is already removed
    Reason:                          BootstrapAlreadyRemoved
    Status:                          True
    Type:                            EtcdRunningInCluster
    Last Transition Time:            2020-05-13T05:11:04Z
    Message:                         master-0-1 members are available,  have not started,  are unhealthy,  are unknown
    Reason:                          EtcdQuorate
    Status:                          True
    Type:                            EtcdMembersAvailable
    Last Transition Time:            2020-05-12T14:08:27Z
    Message:                         all members have started
    Reason:                          AsExpected
    Status:                          False
    Type:                            EtcdMembersProgressing
  Latest Available Revision:         5
  Latest Available Revision Reason:
  Node Statuses:
    Current Revision:  4
    Node Name:         master-0-1
    Current Revision:  4
    Node Name:         master-0-0
    Target Revision:   5
    Current Revision:  4
    Node Name:         master-0-2
  Ready Replicas:      0
Events:                <none>
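
The condition list above is long, and only one "Degraded" condition is actually True. A rough way to filter it is sketched below. The helper names are hypothetical, and the parser assumes it is given only the lines of the Conditions block, with each condition starting at a "Last Transition Time" line and each field on a single line.

```python
def parse_conditions(conditions_text):
    """Split the Conditions block into one dict per condition."""
    conditions = []
    current = {}
    for raw in conditions_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        # A new "Last Transition Time" marks the start of the next condition.
        if key == "Last Transition Time" and current:
            conditions.append(current)
            current = {}
        current[key] = value
    if current:
        conditions.append(current)
    return conditions


def degraded_types(conditions_text):
    """Return the Type of every *Degraded condition whose Status is True."""
    return [
        c["Type"]
        for c in parse_conditions(conditions_text)
        if c.get("Type", "").endswith("Degraded") and c.get("Status") == "True"
    ]


# Two conditions copied from the output in this report.
SAMPLE_CONDITIONS = """\
    Last Transition Time:            2020-05-12T14:14:43Z
    Status:                          False
    Type:                            StaticPodsDegraded
    Last Transition Time:            2020-05-13T08:39:27Z
    Reason:                          MasterNodesReady
    Status:                          True
    Type:                            NodeControllerDegraded
"""
```

Here `degraded_types(SAMPLE_CONDITIONS)` singles out NodeControllerDegraded, which matches the two masters whose kubelets stopped posting node status.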

Comment 9 errata-xmlrpc 2020-07-13 17:38:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

