1815539 – upgrade from 4.1->4.2->4.3->4.4 fails: unexpected on-disk state validating against rendered-master

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd Operator
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1815179
Depends On:	1815203
Blocks:	1776665

Reported:	2020-03-20 14:22 UTC by Sam Batschelet
Modified:	2021-04-05 17:45 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1815203
Environment:
Last Closed:	2020-05-04 11:47:02 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 273	0	None	closed	[release-4.4] Bug 1815539: bindata/etcd: rename DR scripts to cluster-backup and cluster-restore	2021-02-05 12:02:01 UTC
Red Hat Product Errata	RHBA-2020:0581	0	None	None	None	2020-05-04 11:47:24 UTC

Comment 1 Sam Batschelet 2020-03-20 23:29:29 UTC

*** Bug 1815179 has been marked as a duplicate of this bug. ***

Comment 4 ge liu 2020-03-23 15:32:48 UTC

Verified by upgrading from 4.3.5 to 4.4.0-0.nightly-2020-03-23-010639; some resources were added to etcd before the upgrade.

Comment 5 Yanping Zhang 2020-03-25 03:28:00 UTC

Upgraded a cluster from 4.2->4.3->4.4; the upgrade still failed at the 4.3->4.4 step.
The master MachineConfigPool (mcp) reports this error:
message: 'Node ip-10-0-172-254.us-east-2.compute.internal is reporting: "rename
      /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh:
      invalid cross-device link"'
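
For context on the message above: "invalid cross-device link" is the error text for EXDEV, which rename(2) returns when the source and destination paths are on different filesystems. The sketch below is not the machine-config-daemon's code (that is written in Go), and it is not the fix applied for this bug (the linked pull request renames the DR scripts). It only illustrates, under those assumptions, the usual way a file move copes with EXDEV: copy into the destination directory, then rename within that filesystem. The function name `move_file` is hypothetical.

```python
import errno
import os
import shutil
import tempfile


def move_file(src: str, dst: str) -> str:
    """Move src to dst; fall back to copy-then-rename on EXDEV.

    Returns "rename" or "copy" to show which path was taken.
    """
    try:
        os.rename(src, dst)
        return "rename"
    except OSError as exc:
        # Only handle the cross-filesystem case; re-raise anything else.
        if exc.errno != errno.EXDEV:
            raise

    # Copy into a temporary file next to dst so the final rename stays
    # on one filesystem and replaces dst in a single step.
    dst_dir = os.path.dirname(os.path.abspath(dst))
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".move-")
    os.close(fd)
    try:
        shutil.copy2(src, tmp)  # copies contents plus mode and timestamps
        os.replace(tmp, dst)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    os.unlink(src)
    return "copy"
```

A plain rename is tried first because it is atomic and cheap; the copy path is only a fallback and is not atomic with respect to removing the source.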

Here is some more information:
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
cloud-credential                           4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
cluster-autoscaler                         4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
console                                    4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
csi-snapshot-controller                    4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
dns                                        4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
etcd                                       4.4.0-0.nightly-2020-03-23-115620   True        False         False      16m
image-registry                             4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
ingress                                    4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
insights                                   4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
kube-apiserver                             4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
kube-controller-manager                    4.4.0-0.nightly-2020-03-23-115620   True        False         False      14h
kube-scheduler                             4.4.0-0.nightly-2020-03-23-115620   True        False         False      14h
kube-storage-version-migrator              4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
machine-api                                4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
machine-config                             4.3.0-0.nightly-2020-03-20-053743   False       True          True       13h
marketplace                                4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
monitoring                                 4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
network                                    4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
node-tuning                                4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
openshift-apiserver                        4.4.0-0.nightly-2020-03-23-115620   True        False         True       13h
openshift-controller-manager               4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
openshift-samples                          4.4.0-0.nightly-2020-03-23-115620   True        False         False      14h
operator-lifecycle-manager                 4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
service-ca                                 4.4.0-0.nightly-2020-03-23-115620   True        False         False      18h
service-catalog-apiserver                  4.4.0-0.nightly-2020-03-23-115620   True        False         False      13h
service-catalog-controller-manager         4.4.0-0.nightly-2020-03-23-115620   True        False         False      15h
storage                                    4.4.0-0.nightly-2020-03-23-115620   True        False         False      14h
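
The table above is long, and only two rows deviate from the target release. As a rough sketch (assuming the whitespace-separated column layout shown above; the helper name `unhealthy_operators` and the `SAMPLE` constant are made up for illustration), the rows of interest can be picked out like this:

```python
def unhealthy_operators(table: str, target_version: str) -> dict:
    """Map operator name -> list of reasons it deviates from a healthy state.

    Expects `oc get co` style text: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
    """
    result = {}
    for line in table.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything too short to parse.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, version, available, progressing, degraded = fields[:5]
        reasons = []
        if version != target_version:
            reasons.append("version " + version)
        if available != "True":
            reasons.append("not available")
        if progressing != "False":
            reasons.append("progressing")
        if degraded != "False":
            reasons.append("degraded")
        if reasons:
            result[name] = reasons
    return result


# Three rows copied from the output above.
SAMPLE = """\
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd                  4.4.0-0.nightly-2020-03-23-115620   True        False         False      16m
machine-config        4.3.0-0.nightly-2020-03-20-053743   False       True          True       13h
openshift-apiserver   4.4.0-0.nightly-2020-03-23-115620   True        False         True       13h
"""

if __name__ == "__main__":
    target = "4.4.0-0.nightly-2020-03-23-115620"
    for name, reasons in unhealthy_operators(SAMPLE, target).items():
        print(name + ": " + ", ".join(reasons))
```

On this sample it flags machine-config (still at the 4.3 nightly, unavailable, progressing, degraded) and openshift-apiserver (degraded), which matches the two operators examined in the rest of this comment.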

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-20-053743   True        True          14h     Unable to apply 4.4.0-0.nightly-2020-03-23-115620: the cluster operator openshift-apiserver is degraded

# oc get node
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-129-165.us-east-2.compute.internal   Ready                      master   17h   v1.16.2
ip-10-0-143-184.us-east-2.compute.internal   Ready                      worker   17h   v1.17.1
ip-10-0-148-46.us-east-2.compute.internal    Ready                      worker   17h   v1.17.1
ip-10-0-158-80.us-east-2.compute.internal    Ready                      master   17h   v1.16.2
ip-10-0-160-88.us-east-2.compute.internal    Ready                      worker   17h   v1.17.1
ip-10-0-172-254.us-east-2.compute.internal   Ready,SchedulingDisabled   master   17h   v1.16.2

[root@MiWiFi-R1CM ~]# oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-03-24T09:07:48Z"
  generation: 1
  name: machine-config
  resourceVersion: "434388"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: eebea21a-6dae-11ea-91ef-02b5a04d55cc
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-03-24T13:34:42Z"
    message: Cluster not available for 4.4.0-0.nightly-2020-03-23-115620
    status: "False"
    type: Available
  - lastTransitionTime: "2020-03-24T13:36:44Z"
    message: Working towards 4.4.0-0.nightly-2020-03-23-115620
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-03-24T13:34:41Z"
    message: 'Unable to apply 4.4.0-0.nightly-2020-03-23-115620: timed out waiting
      for the condition during syncRequiredMachineConfigPools: pool master has not
      progressed to latest configuration: controller version mismatch for rendered-master-ae1f3d111090dc50b108976cfe9743cb
      expected d5d9a488c1e0e19e1d3044bd0fac90096b0224d6 has ab4d62a3bf3774b77b6f9b04a2028faec1568aca,
      retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-03-24T09:08:46Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: {}
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: machine-config-controller
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.3.0-0.nightly-2020-03-20-053743
[root@MiWiFi-R1CM ~]# oc get co openshift-apiserver -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-03-24T09:07:59Z"
  generation: 1
  name: openshift-apiserver
  resourceVersion: "145634"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  uid: f5477f87-6dae-11ea-91ef-02b5a04d55cc
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-03-24T13:42:46Z"
    message: 'APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable'
    reason: APIServerDeployment_UnavailablePod
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-03-24T13:22:42Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-03-24T13:30:08Z"
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-03-24T09:07:59Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: openshiftapiservers
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-apiserver-operator
    resource: namespaces
  - group: ""
    name: openshift-apiserver
    resource: namespaces
  - group: ""
    name: host-etcd-2
    namespace: openshift-etcd
    resource: endpoints
  - group: apiregistration.k8s.io
    name: v1.apps.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.authorization.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.build.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.image.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.oauth.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.project.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.quota.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.route.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.security.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.template.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.user.openshift.io
    resource: apiservices
  versions:
  - name: operator
    version: 4.4.0-0.nightly-2020-03-23-115620
  - name: openshift-apiserver
    version: 4.4.0-0.nightly-2020-03-23-115620

[root@MiWiFi-R1CM ~]# oc get mcp master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-03-24T09:07:53Z"
  generation: 4
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    operator.machineconfiguration.openshift.io/required-for-upgrade: ""
  name: master
  resourceVersion: "144629"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  uid: f1ea987f-6dae-11ea-91ef-02b5a04d55cc
spec:
  configuration:
    name: rendered-master-c9dbe1108410107d4942f4fd2c14cc4f
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-f1ea987f-6dae-11ea-91ef-02b5a04d55cc-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-03-24T09:08:21Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-03-24T13:40:25Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-03-24T13:40:25Z"
    message: All nodes are updating to rendered-master-c9dbe1108410107d4942f4fd2c14cc4f
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-03-24T13:41:01Z"
    message: 'Node ip-10-0-172-254.us-east-2.compute.internal is reporting: "rename
      /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh:
      invalid cross-device link"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-03-24T13:41:01Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-master-ae1f3d111090dc50b108976cfe9743cb
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-f1ea987f-6dae-11ea-91ef-02b5a04d55cc-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 4
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0
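
The key facts in the pool object above are that `spec.configuration.name` and `status.configuration.name` differ, and that the `NodeDegraded` condition carries the rename error. The following is a minimal sketch for extracting those facts from the JSON form of the same object (for example from `oc get mcp master -o json`). The function name `summarize_pool` and the trimmed `SAMPLE` document are illustrative assumptions; the field values are copied from the YAML above.

```python
import json


def summarize_pool(pool: dict) -> dict:
    """Summarize a MachineConfigPool: desired vs. current config and active conditions."""
    desired = pool["spec"]["configuration"]["name"]
    status = pool.get("status", {})
    current = status.get("configuration", {}).get("name")
    # Keep only conditions whose status is the string "True".
    active = [c for c in status.get("conditions", []) if c.get("status") == "True"]
    return {
        "desired": desired,
        "current": current,
        "pending": desired != current,
        "true_conditions": [c["type"] for c in active],
        "messages": [c["message"] for c in active if c.get("message")],
    }


# Trimmed copy of the object above, expressed as JSON.
SAMPLE = json.loads("""
{
  "spec": {"configuration": {"name": "rendered-master-c9dbe1108410107d4942f4fd2c14cc4f"}},
  "status": {
    "configuration": {"name": "rendered-master-ae1f3d111090dc50b108976cfe9743cb"},
    "conditions": [
      {"type": "RenderDegraded", "status": "False", "message": ""},
      {"type": "Updated", "status": "False", "message": ""},
      {"type": "Updating", "status": "True",
       "message": "All nodes are updating to rendered-master-c9dbe1108410107d4942f4fd2c14cc4f"},
      {"type": "NodeDegraded", "status": "True",
       "message": "Node ip-10-0-172-254.us-east-2.compute.internal is reporting: \\"rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link\\""},
      {"type": "Degraded", "status": "True", "message": ""}
    ]
  }
}
""")

if __name__ == "__main__":
    print(json.dumps(summarize_pool(SAMPLE), indent=2))
```

Comparing the two configuration names is a quick way to see that the pool has a rollout in progress; the non-empty condition messages then show which node is blocking it.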

Comment 6 Wenjing Zheng 2020-03-25 05:57:48 UTC

I can also reproduce this bug when upgrading from 4.3.8 to 4.4.0-0.nightly-2020-03-24-225110:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.8     True        True          126m    Unable to apply 4.4.0-0.nightly-2020-03-24-225110: the cluster operator openshift-apiserver is degraded
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2020-03-25T02:29:17Z
    Message:               Done applying 4.3.8
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-03-25T04:34:12Z
    Message:               Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable
    Reason:                ClusterOperatorDegraded
    Status:                True
    Type:                  Failing
    Last Transition Time:  2020-03-25T03:50:04Z
    Message:               Unable to apply 4.4.0-0.nightly-2020-03-24-225110: the cluster operator openshift-apiserver is degraded
    Reason:                ClusterOperatorDegraded
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-03-25T02:32:29Z
    Status:                True
    Type:                  RetrievedUpdates

Comment 8 Wenjing Zheng 2020-03-25 11:45:44 UTC

must-gather stopped with the error below:
$ oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c67d14314241030909c260f2e67667df4f499997c40bb7144413e8ede5abe53
[must-gather      ] OUT namespace/openshift-must-gather-llnm6 created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-rhjnh created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c67d14314241030909c260f2e67667df4f499997c40bb7144413e8ede5abe53 created
[must-gather-7r5xp] POD Wrote inspect data to must-gather.
[must-gather-7r5xp] POD Gathering data for ns/openshift-cluster-version...
[must-gather-7r5xp] POD Wrote inspect data to must-gather.
[must-gather-7r5xp] POD Gathering data for ns/openshift-config...
[must-gather-7r5xp] POD Gathering data for ns/openshift-config-managed...
[must-gather-7r5xp] POD Gathering data for ns/openshift-authentication...
[must-gather-7r5xp] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-ingress...
[must-gather-7r5xp] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-machine-api...
[must-gather-7r5xp] POD Gathering data for ns/openshift-console-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-console...
[must-gather-7r5xp] POD Gathering data for ns/openshift-csi-snapshot-controller...
[must-gather-7r5xp] POD Gathering data for ns/openshift-csi-snapshot-controller-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-dns-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-dns...
[must-gather-7r5xp] OUT waiting for gather to complete
[must-gather-7r5xp] OUT gather never finished: pods "must-gather-7r5xp" not found
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-rhjnh deleted
[must-gather      ] OUT namespace/openshift-must-gather-llnm6 deleted
error: gather never finished for pod must-gather-7r5xp: pods "must-gather-7r5xp" not found

Comment 13 errata-xmlrpc 2020-05-04 11:47:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 14 W. Trevor King 2021-04-05 17:45:54 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
