Bug 1859781 - 4.4.12 -> 4.4.13 -> 4.5.3 upgrade failed
Summary: 4.4.12 -> 4.4.13 -> 4.5.3 upgrade failed
Keywords:
Status: CLOSED DUPLICATE of bug 1858907
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-23 00:34 UTC by Veer Muchandi
Modified: 2020-07-28 16:18 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-28 16:18:02 UTC
Target Upstream Version:
Embargoed:



Description Veer Muchandi 2020-07-23 00:34:16 UTC
Description of problem:
Cluster upgrade from 4.4.13 to 4.5.3 failing. 


Version-Release number of selected component (if applicable):
4.5.3

How reproducible:

Not sure if it can be reproduced

Steps to Reproduce:
1. Successfully upgraded the cluster to 4.4.13
2. Started upgrade to 4.5.3 using fast-4.5
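
(For reference, a minimal sketch of the commands that would typically drive step 2; the exact invocation isn't in the report, so the channel and target version below are assumed from the steps above.)

```
# Switch the cluster to the fast-4.5 channel (channel taken from step 2)
oc patch clusterversion version --type merge -p '{"spec":{"channel":"fast-4.5"}}'

# Request the 4.5.3 update and watch progress
oc adm upgrade --to=4.5.3
oc get clusterversion -w
```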

Actual results:

```
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: '2019-07-09T19:30:44Z'
  generation: 1
  name: machine-config
  resourceVersion: '408031651'
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 0ba3b960-a280-11e9-ab36-5254006a091b
spec: {}
status:
  conditions:
    - lastTransitionTime: '2020-07-22T18:29:31Z'
      message: Cluster not available for 4.5.3
      status: 'False'
      type: Available
    - lastTransitionTime: '2020-07-22T18:29:31Z'
      message: Working towards 4.5.3
      status: 'True'
      type: Progressing
    - lastTransitionTime: '2020-07-22T18:29:31Z'
      message: >-
        Unable to apply 4.5.3: refusing to read osImageURL version "4.4.13",
        operator version "4.5.3"
      reason: RenderConfigFailed
      status: 'True'
      type: Degraded
    - lastTransitionTime: '2019-10-31T03:40:39Z'
      reason: AsExpected
      status: 'True'
      type: Upgradeable
  extension: {}
  relatedObjects:
    - group: ''
      name: openshift-machine-config-operator
      resource: namespaces
    - group: machineconfiguration.openshift.io
      name: master
      resource: machineconfigpools
    - group: machineconfiguration.openshift.io
      name: worker
      resource: machineconfigpools
    - group: machineconfiguration.openshift.io
      name: machine-config-controller
      resource: controllerconfigs
  versions:
    - name: operator
      version: 4.4.13

```
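
(The status above can be pulled straight from the cluster; a quick sketch, assuming standard oc access:)

```
# Full ClusterOperator status for the MCO
oc get clusteroperator machine-config -o yaml

# Just the Degraded condition message shown above
oc get clusteroperator machine-config \
  -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'
```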

Expected results:
Successful cluster upgrade to 4.5.3

Additional info:

Comment 3 Kirsten Garrison 2020-07-23 01:13:40 UTC
A couple of things I noted in the Slack thread:

First, there seems to be a 3-hour gap in the MCD logs for worker1.ocp4.home.ocpcloud.com:

2020-07-22T14:32:37.951289107Z rpm-ostreed.service: Consumed 48ms CPU time
2020-07-22T17:54:28.276100608Z I0722 17:54:28.266579    1819 start.go:74] Version: v4.4.0-202007120152.p0-dirty (601c2285f497bf7c73d84737b9977a0e697cb86a)

and also, for the same node:
  - lastTransitionTime: "2020-07-22T17:55:17Z"
    message: 'Node worker1.ocp4.home.ocpcloud.com is reporting: "unexpected on-disk
      state validating against rendered-worker-10e4babd0ed0c4d1accb2ef9875563e4"'
    reason: 1 nodes are reporting degraded status on sync
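
(For anyone reproducing this triage, a sketch of how the per-node MCD logs above can be pulled; the pod name will differ per cluster:)

```
# Find the machine-config-daemon pod running on the affected node
oc -n openshift-machine-config-operator get pods -o wide | grep worker1.ocp4.home.ocpcloud.com

# Pull its logs; the container of interest is machine-config-daemon
oc -n openshift-machine-config-operator logs <daemon-pod> -c machine-config-daemon
```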

Comment 4 Kirsten Garrison 2020-07-23 01:31:08 UTC
Something happened to that last worker node, because all the other nodes managed to get through the upgrade:
worker pool:
  degradedMachineCount: 1
  machineCount: 5
  observedGeneration: 36
  readyMachineCount: 2
  unavailableMachineCount: 0
  updatedMachineCount: 2

master pool:
  degradedMachineCount: 0
  machineCount: 3
  observedGeneration: 37
  readyMachineCount: 3
  unavailableMachineCount: 0
  updatedMachineCount: 3
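
(These counts come straight from the MachineConfigPool status; a quick sketch of how to view them:)

```
# Per-pool counts at a glance
oc get machineconfigpools

# Full worker pool status, including the counts above
oc get machineconfigpool worker -o yaml
```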

from daemon-xxhc5:
2020-07-22T14:31:35.657958348Z I0722 14:31:35.657403 1969796 update.go:1181] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:511ea3dbbb6e6059a52251f6291305f2a9ba5f98192cfb92989fda88960fcf82
2020-07-22T14:31:39.44500565Z I0722 14:31:35.882099 2282307 rpm-ostree.go:366] Running captured: rpm-ostree status --json
2020-07-22T14:31:39.44500565Z Starting RPM-OSTree System Management Daemon...
2020-07-22T14:31:39.44500565Z Reading config file '/etc/rpm-ostreed.conf'
2020-07-22T14:31:39.44500565Z In idle state; will auto-exit in 64 seconds
2020-07-22T14:31:39.44500565Z Started RPM-OSTree System Management Daemon.
2020-07-22T14:31:39.44500565Z client(id:cli dbus:1.1104 unit:machine-config-daemon-host.service uid:0) added; new total=1
2020-07-22T14:31:39.44500565Z client(id:cli dbus:1.1104 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
2020-07-22T14:31:39.44500565Z In idle state; will auto-exit in 61 seconds
2020-07-22T14:31:39.44500565Z I0722 14:31:35.970212 2282307 rpm-ostree.go:154] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5fb27f5dd5ccc00ef2b6832784ffe26ca5500e89e69b57e7539ec114f3769423
2020-07-22T14:31:39.44500565Z I0722 14:31:35.970341 2282307 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:511ea3dbbb6e6059a52251f6291305f2a9ba5f98192cfb92989fda88960fcf82
2020-07-22T14:31:39.44500565Z 2020-07-22 14:31:36.666095785 +0000 UTC m=+0.422869210 system refresh
2020-07-22T14:32:37.951289107Z In idle state; will auto-exit in 63 seconds
2020-07-22T14:32:37.951289107Z rpm-ostreed.service: Consumed 48ms CPU time
2020-07-22T17:54:28.276100608Z I0722 17:54:28.266579    1819 start.go:74] Version: v4.4.0-202007120152.p0-dirty (601c2285f497bf7c73d84737b9977a0e697cb86a)
2020-07-22T17:54:28.339464867Z I0722 17:54:28.339425    1819 start.go:84] Calling chroot("/rootfs")
2020-07-22T17:54:28.342989824Z I0722 17:54:28.342460    1819 rpm-ostree.go:366] Running captured: rpm-ostree status --json
2020-07-22T17:54:29.825114415Z I0722 17:54:29.825044    1819 daemon.go:209] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5fb27f5dd5ccc00ef2b6832784ffe26ca5500e89e69b57e7539ec114f3769423 (44.81.202007070223-0)

Note above that it's trying to upgrade to sha256:511xxx but ends up back on the previous osImageURL...

Eventually, that same degraded worker node (worker1.ocp4.home.ocpcloud.com), which somehow disappeared from the logs for 3.5 hours (14:31 -> 17:54??!!), also reports out of disk:
2020-07-22T18:03:40.197805821Z I0722 18:03:40.196817       1 node_controller.go:433] Pool master: node master1.ocp4.home.ocpcloud.com is now reporting unready: node master1.ocp4.home.ocpcloud.com is reporting OutOfDisk=True

Comment 5 Kirsten Garrison 2020-07-23 01:42:18 UTC
Sorry, scratch that last part: the node in question isn't reporting OOD... another master and a worker are.

So, a summary:
The upgrade rolled out to the pools, and all nodes updated except one worker. That worker attempts to upgrade, seems to have stopped logging in the MCD for 3 hours, comes back on the old osImageURL, and goes degraded. Other nodes (a master and a worker) end up OOD, which I believe the reporter had to go in and manually fix(?).

The daemon logs stop at 
2020-07-22T14:32:37.951289107Z rpm-ostreed.service: Consumed 48ms CPU time

but we never see the successful image pull, podman inspect, podman create, etc. that usually get logged... it just comes back on the same osImageURL (sha256:5fb2xxx) as before, 3.5 hours later:
2020-07-22T17:54:29.825114415Z I0722 17:54:29.825044    1819 daemon.go:209] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5fb27f5dd5ccc00ef2b6832784ffe26ca5500e89e69b57e7539ec114f3769423 (44.81.202007070223-0)

Comment 6 Kirsten Garrison 2020-07-23 02:18:21 UTC
OK, so this is actually an existing bug! Upgrades from 4.4.13 -> 4.5.3 with platform: None and no platformStatus won't work right now.

Checking the MCO logs, the panic:

2020-07-22T19:37:00.505915583Z E0722 19:37:00.505818       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
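
(A sketch of how to check whether a cluster is in the state described above; the Infrastructure object named "cluster" is the standard config.openshift.io resource, and the assumption here is that an empty status.platformStatus is what the 4.5.3 MCO trips over:)

```
# Check whether status.platformStatus is populated on the cluster Infrastructure object;
# on clusters installed on older releases it may be empty (the platform: None case above)
oc get infrastructure cluster -o jsonpath='{.status.platform}{"\n"}{.status.platformStatus}{"\n"}'
```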

*** This bug has been marked as a duplicate of bug 1858026 ***

Comment 7 Kirsten Garrison 2020-07-23 02:25:20 UTC
I'm actually going to keep this open, as this cluster seems to have a lot going on... the timestamps of the MCO panic are well after the other osImageURL upgrade attempt, so I think something happened earlier and I will look closer.

Comment 8 Kirsten Garrison 2020-07-23 02:38:46 UTC
Just to confirm a hunch, I double-checked and this is not from the 4.5.3 upgrade; it looks like it's from a previous upgrade:

The workers that successfully upgraded are on 4.4.13:

2020-07-22T14:29:25.598548034Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:511ea3dbbb6e6059a52251f6291305f2a9ba5f98192cfb92989fda88960fcf82
2020-07-22T14:29:25.598548034Z               CustomOrigin: Managed by machine-config-operator
2020-07-22T14:29:25.598548034Z                    Version: 44.82.202007141430-0 (2020-07-14T14:36:52Z)


And the stuck node is booted into 4.4.12:

2020-07-22T13:57:45.054462248Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5fb27f5dd5ccc00ef2b6832784ffe26ca5500e89e69b57e7539ec114f3769423
2020-07-22T13:57:45.054462248Z               CustomOrigin: Managed by machine-config-operator
2020-07-22T13:57:45.054462248Z                    Version: 44.81.202007070223-0 (2020-07-07T02:28:48Z)
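
(Both snippets above are just the booted deployment as reported by rpm-ostree; a sketch of how to compare a healthy worker against the stuck one:)

```
# Booted deployment / osImageURL on the stuck node
oc debug node/worker1.ocp4.home.ocpcloud.com -- chroot /host rpm-ostree status

# Run the same against any healthy worker and compare the pivot:// image and Version lines
```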

So the worker pool wasn't upgraded correctly, and then the 4.5.3 upgrade proceeded.

So:
    message: 'Node worker1.ocp4.home.ocpcloud.com is reporting: "unexpected on-disk
      state validating against rendered-worker-10e4babd0ed0c4d1accb2ef9875563e4"'

is really saying this node isn't on 4.4.13 when we expect it to be.
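
(The "unexpected on-disk state" check is the MCD validating the node against the rendered config it is annotated with; a sketch of how to see what the node currently reports, with the node name taken from the message above:)

```
# Current vs. desired rendered config and the MCD state, from the node annotations
oc get node worker1.ocp4.home.ocpcloud.com -o yaml \
  | grep -E 'machineconfiguration.openshift.io/(currentConfig|desiredConfig|state)'
```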

Comment 9 Kirsten Garrison 2020-07-23 02:45:51 UTC
Summary:

The upgrade from 4.4.12 -> 4.4.13 failed on the worker pool; the 4.5.3 upgrade then started because the master pool did successfully get to 4.4.13, and as the MCO upgraded it hit bug 1858026.

In an ideal world you want to stay on 4.4.12/4.4.13, because the fix is not yet in 4.5.x and we don't have a workaround, so even if we fix your cluster you will eventually hit the above BZ.

Comment 10 Kirsten Garrison 2020-07-27 18:42:36 UTC
Hi Veer,

To get your 4.4.12 workers to a good 4.4.13 state, let's try to manually pivot each node:

1) Manually pivot to desired osImageURL (using 4.4.13 machine-os-content below)
   - oc debug node/<nodename> -- chroot /host pivot quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:511ea3dbbb6e6059a52251f6291305f2a9ba5f98192cfb92989fda88960fcf82
2) Reboot node

Once those are upgraded, the 4.5.3 upgrade should proceed. You will still be stuck with the MCO panic from the broken 4.5.3, and you can proceed either with a nightly or wait for 4.5.4:
https://bugzilla.redhat.com/show_bug.cgi?id=1858907#c5
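
(A consolidated sketch of the two steps above for the stuck worker; the node name and image digest are the ones already referenced in this bug, and the reboot is issued through a second debug session:)

```
NODE=worker1.ocp4.home.ocpcloud.com
IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:511ea3dbbb6e6059a52251f6291305f2a9ba5f98192cfb92989fda88960fcf82

# 1) Manually pivot the host to the 4.4.13 machine-os-content
oc debug node/$NODE -- chroot /host pivot $IMAGE

# 2) Reboot so the node boots into the pivoted deployment
oc debug node/$NODE -- chroot /host systemctl reboot

# Verify the booted osImageURL and watch the worker pool recover
oc debug node/$NODE -- chroot /host rpm-ostree status
oc get machineconfigpool worker -w
```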

Comment 11 Kirsten Garrison 2020-07-27 18:54:51 UTC
Note: just do the above for the worker that is stuck. I believe it is worker1.ocp4.home.ocpcloud.com; as mentioned above, it was the worker that I noted had problems rolling out the upgrade.

Comment 12 Kirsten Garrison 2020-07-28 16:18:02 UTC
We manually pivoted Veer's workers and he was able to successfully upgrade to 4.5.4.

*** This bug has been marked as a duplicate of bug 1858907 ***

