Bug 1815539
| Summary: | upgrade from 4.1->4.2->4.3->4.4 fails: unexpected on-disk state validating against rendered-master | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sam Batschelet <sbatsche> |
| Component: | Etcd Operator | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED ERRATA | QA Contact: | ge liu <geliu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.4 | CC: | dgoodwin, geliu, jialiu, kgarriso, mifiedle, minmli, rpattath, wzheng, yanpzhan |
| Target Milestone: | --- | ||
| Target Release: | 4.4.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 1815203 | Environment: | |
| Last Closed: | 2020-05-04 11:47:02 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1815203 | ||
| Bug Blocks: | 1776665 | ||
|
Comment 1
Sam Batschelet
2020-03-20 23:29:29 UTC
Verified with an upgrade from 4.3.5 to 4.4.0-0.nightly-2020-03-23-010639, after adding some resources to etcd before the upgrade. Upgrading a cluster from 4.2->4.3->4.4 still failed at 4.3->4.4.
The error reported by the master MachineConfigPool (mcp master) shows:
message: 'Node ip-10-0-172-254.us-east-2.compute.internal is reporting: "rename
/etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh:
invalid cross-device link"'
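For context, "invalid cross-device link" is the EXDEV error that rename(2) returns when the source and destination are on different mounted filesystems. The sketch below only illustrates the usual workaround (copy, then rename within the destination filesystem); it is not the machine-config-daemon's actual code or fix, and `restore_file` and its injectable `rename` parameter are hypothetical names used for the illustration.

```python
import errno
import os
import shutil


def restore_file(src: str, dst: str, rename=os.rename) -> str:
    """Move src to dst, falling back to copy+delete when rename hits EXDEV.

    Returns "rename" or "copy" to show which path was taken. The `rename`
    parameter is hypothetical and exists only so EXDEV can be simulated.
    """
    try:
        rename(src, dst)
        return "rename"
    except OSError as err:
        # Anything other than a cross-device failure is a real error.
        if err.errno != errno.EXDEV:
            raise
    # Cross-device case: copy next to the destination first, then rename
    # inside the destination filesystem so the final swap stays atomic.
    tmp = dst + ".tmp"
    shutil.copy2(src, tmp)
    os.replace(tmp, dst)
    os.remove(src)
    return "copy"
```

A plain `os.rename` is enough when both paths share a filesystem; the fallback only matters for a layout like the one in the message above, where the backup under /etc/machine-config-daemon/orig and the target under /usr/local/bin evidently do not.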
More details:
# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
cloud-credential 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
cluster-autoscaler 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
console 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
csi-snapshot-controller 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
dns 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
etcd 4.4.0-0.nightly-2020-03-23-115620 True False False 16m
image-registry 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
ingress 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
insights 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
kube-apiserver 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
kube-controller-manager 4.4.0-0.nightly-2020-03-23-115620 True False False 14h
kube-scheduler 4.4.0-0.nightly-2020-03-23-115620 True False False 14h
kube-storage-version-migrator 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
machine-api 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
machine-config 4.3.0-0.nightly-2020-03-20-053743 False True True 13h
marketplace 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
monitoring 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
network 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
node-tuning 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
openshift-apiserver 4.4.0-0.nightly-2020-03-23-115620 True False True 13h
openshift-controller-manager 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
openshift-samples 4.4.0-0.nightly-2020-03-23-115620 True False False 14h
operator-lifecycle-manager 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
operator-lifecycle-manager-catalog 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
operator-lifecycle-manager-packageserver 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
service-ca 4.4.0-0.nightly-2020-03-23-115620 True False False 18h
service-catalog-apiserver 4.4.0-0.nightly-2020-03-23-115620 True False False 13h
service-catalog-controller-manager 4.4.0-0.nightly-2020-03-23-115620 True False False 15h
storage 4.4.0-0.nightly-2020-03-23-115620 True False False 14h
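To pick the stragglers out of a listing like the one above, something along these lines may help. It is a rough sketch that assumes the whitespace-separated column order shown here (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE); `unhealthy_operators` is a hypothetical helper, not part of `oc`.

```python
from typing import List


def unhealthy_operators(table: str, target: str) -> List[str]:
    """Return operators that are not on `target`, not available, or degraded.

    Assumes the column layout of the `oc get co` output pasted above.
    """
    bad = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 5:
            continue  # ignore blank or truncated lines
        name, version, available, _progressing, degraded = cols[:5]
        if version != target or available != "True" or degraded != "False":
            bad.append(name)
    return bad
```

Run against the table above with target `4.4.0-0.nightly-2020-03-23-115620`, this would flag machine-config (still on 4.3, unavailable, degraded) and openshift-apiserver (degraded).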
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.3.0-0.nightly-2020-03-20-053743 True True 14h Unable to apply 4.4.0-0.nightly-2020-03-23-115620: the cluster operator openshift-apiserver is degraded
# oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-129-165.us-east-2.compute.internal Ready master 17h v1.16.2
ip-10-0-143-184.us-east-2.compute.internal Ready worker 17h v1.17.1
ip-10-0-148-46.us-east-2.compute.internal Ready worker 17h v1.17.1
ip-10-0-158-80.us-east-2.compute.internal Ready master 17h v1.16.2
ip-10-0-160-88.us-east-2.compute.internal Ready worker 17h v1.17.1
ip-10-0-172-254.us-east-2.compute.internal Ready,SchedulingDisabled master 17h v1.16.2
[root@MiWiFi-R1CM ~]# oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-03-24T09:07:48Z"
  generation: 1
  name: machine-config
  resourceVersion: "434388"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: eebea21a-6dae-11ea-91ef-02b5a04d55cc
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-03-24T13:34:42Z"
    message: Cluster not available for 4.4.0-0.nightly-2020-03-23-115620
    status: "False"
    type: Available
  - lastTransitionTime: "2020-03-24T13:36:44Z"
    message: Working towards 4.4.0-0.nightly-2020-03-23-115620
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-03-24T13:34:41Z"
    message: 'Unable to apply 4.4.0-0.nightly-2020-03-23-115620: timed out waiting
      for the condition during syncRequiredMachineConfigPools: pool master has not
      progressed to latest configuration: controller version mismatch for rendered-master-ae1f3d111090dc50b108976cfe9743cb
      expected d5d9a488c1e0e19e1d3044bd0fac90096b0224d6 has ab4d62a3bf3774b77b6f9b04a2028faec1568aca,
      retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-03-24T09:08:46Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: {}
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: machine-config-controller
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.3.0-0.nightly-2020-03-20-053743
[root@MiWiFi-R1CM ~]# oc get co openshift-apiserver -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-03-24T09:07:59Z"
  generation: 1
  name: openshift-apiserver
  resourceVersion: "145634"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  uid: f5477f87-6dae-11ea-91ef-02b5a04d55cc
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-03-24T13:42:46Z"
    message: 'APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable'
    reason: APIServerDeployment_UnavailablePod
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-03-24T13:22:42Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-03-24T13:30:08Z"
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-03-24T09:07:59Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: openshiftapiservers
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-apiserver-operator
    resource: namespaces
  - group: ""
    name: openshift-apiserver
    resource: namespaces
  - group: ""
    name: host-etcd-2
    namespace: openshift-etcd
    resource: endpoints
  - group: apiregistration.k8s.io
    name: v1.apps.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.authorization.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.build.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.image.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.oauth.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.project.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.quota.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.route.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.security.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.template.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.user.openshift.io
    resource: apiservices
  versions:
  - name: operator
    version: 4.4.0-0.nightly-2020-03-23-115620
  - name: openshift-apiserver
    version: 4.4.0-0.nightly-2020-03-23-115620
[root@MiWiFi-R1CM ~]# oc get mcp master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-03-24T09:07:53Z"
  generation: 4
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    operator.machineconfiguration.openshift.io/required-for-upgrade: ""
  name: master
  resourceVersion: "144629"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  uid: f1ea987f-6dae-11ea-91ef-02b5a04d55cc
spec:
  configuration:
    name: rendered-master-c9dbe1108410107d4942f4fd2c14cc4f
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-f1ea987f-6dae-11ea-91ef-02b5a04d55cc-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-03-24T09:08:21Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-03-24T13:40:25Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-03-24T13:40:25Z"
    message: All nodes are updating to rendered-master-c9dbe1108410107d4942f4fd2c14cc4f
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-03-24T13:41:01Z"
    message: 'Node ip-10-0-172-254.us-east-2.compute.internal is reporting: "rename
      /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh:
      invalid cross-device link"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-03-24T13:41:01Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-master-ae1f3d111090dc50b108976cfe9743cb
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-f1ea987f-6dae-11ea-91ef-02b5a04d55cc-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 4
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0
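The two things worth reading out of the pool object above are the active *Degraded conditions and whether status.configuration.name has caught up with spec.configuration.name (here it has not: the pool is still on rendered-master-ae1f... while the spec asks for rendered-master-c9db...). A small sketch of that check, assuming the object has been loaded into a dict (for example from `oc get mcp master -o json`); `degraded_messages` and `on_desired_config` are hypothetical helper names.

```python
from typing import Any, Dict, List


def degraded_messages(pool: Dict[str, Any]) -> List[str]:
    """Return the non-empty messages of *Degraded conditions that are True."""
    conditions = pool.get("status", {}).get("conditions", [])
    return [
        c["message"]
        for c in conditions
        if c.get("type", "").endswith("Degraded")
        and c.get("status") == "True"
        and c.get("message")
    ]


def on_desired_config(pool: Dict[str, Any]) -> bool:
    """True when status.configuration.name equals spec.configuration.name."""
    desired = pool.get("spec", {}).get("configuration", {}).get("name")
    current = pool.get("status", {}).get("configuration", {}).get("name")
    return desired is not None and desired == current
```

For the pool shown here, the first helper would return only the NodeDegraded rename message (the plain Degraded condition has an empty message), and the second would return False.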
I can also reproduce this bug when upgrading from 4.3.8 to 4.4.0-0.nightly-2020-03-24-225110:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.3.8 True True 126m Unable to apply 4.4.0-0.nightly-2020-03-24-225110: the cluster operator openshift-apiserver is degraded
Status:
Available Updates: <nil>
Conditions:
Last Transition Time: 2020-03-25T02:29:17Z
Message: Done applying 4.3.8
Status: True
Type: Available
Last Transition Time: 2020-03-25T04:34:12Z
Message: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable
Reason: ClusterOperatorDegraded
Status: True
Type: Failing
Last Transition Time: 2020-03-25T03:50:04Z
Message: Unable to apply 4.4.0-0.nightly-2020-03-24-225110: the cluster operator openshift-apiserver is degraded
Reason: ClusterOperatorDegraded
Status: True
Type: Progressing
Last Transition Time: 2020-03-25T02:32:29Z
Status: True
Type: RetrievedUpdates
Must-gather paused with the error below:
$ oc adm must-gather
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c67d14314241030909c260f2e67667df4f499997c40bb7144413e8ede5abe53
[must-gather ] OUT namespace/openshift-must-gather-llnm6 created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-rhjnh created
[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c67d14314241030909c260f2e67667df4f499997c40bb7144413e8ede5abe53 created
[must-gather-7r5xp] POD Wrote inspect data to must-gather.
[must-gather-7r5xp] POD Gathering data for ns/openshift-cluster-version...
[must-gather-7r5xp] POD Wrote inspect data to must-gather.
[must-gather-7r5xp] POD Gathering data for ns/openshift-config...
[must-gather-7r5xp] POD Gathering data for ns/openshift-config-managed...
[must-gather-7r5xp] POD Gathering data for ns/openshift-authentication...
[must-gather-7r5xp] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-ingress...
[must-gather-7r5xp] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-machine-api...
[must-gather-7r5xp] POD Gathering data for ns/openshift-console-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-console...
[must-gather-7r5xp] POD Gathering data for ns/openshift-csi-snapshot-controller...
[must-gather-7r5xp] POD Gathering data for ns/openshift-csi-snapshot-controller-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-dns-operator...
[must-gather-7r5xp] POD Gathering data for ns/openshift-dns...
[must-gather-7r5xp] OUT waiting for gather to complete
[must-gather-7r5xp] OUT gather never finished: pods "must-gather-7r5xp" not found
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-rhjnh deleted
[must-gather ] OUT namespace/openshift-must-gather-llnm6 deleted
error: gather never finished for pod must-gather-7r5xp: pods "must-gather-7r5xp" not found

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:0581

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.
[1]: https://github.com/openshift/enhancements/pull/475 |