Bug 1821364 - Upgrade from 4.3 -> 4.4 does not complete due to controller version mismatch for rendered-master
Summary: Upgrade from 4.3 -> 4.4 does not complete due to controller version mismatch ...
Keywords:
Status: CLOSED DUPLICATE of bug 1817455
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-06 16:55 UTC by Roshni
Modified: 2021-04-05 17:47 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-07 20:21:33 UTC
Target Upstream Version:
Embargoed:



Description Roshni 2020-04-06 16:55:42 UTC
Description of problem:
Upgrade from 4.3 -> 4.4 does not complete

Version-Release number of selected component (if applicable):
4.4.0-rc.6

How reproducible:


Steps to Reproduce:
1. Upgrade from 4.1 -> 4.2 -> 4.3 -> 4.4
2. I ran the following yaml with cluster-loader.py from https://github.com/openshift/svt/tree/master/openshift_scalability (see the example invocation after these steps). I ran the same yaml after each upgrade, so there are 150 such projects on that cluster now.

# cat upgrade.yaml 
projects:
  - num: 50
    basename: svt-43-
    templates:
      -
        num: 6
        file: ./content/build-template.json
      -
        num: 10
        file: ./content/image-stream-template.json
      -
        num: 2
        file: ./content/deployment-config-0rep-pause-template.json
        parameters:
          -
            ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e
8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      -
        num: 10
        file: ./content/ssh-secret-template.json
      -
        num: 5
        file: ./content/route-template.json
      -
        num: 10
        file: ./content/configmap-template.json
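For reference, the invocation looks roughly like this (assuming cluster-loader.py's usual -f flag for the config file; run from the openshift_scalability directory of the svt repo, paths are illustrative):

# cd svt/openshift_scalability
# ./cluster-loader.py -f upgrade.yaml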

Actual results:
The upgrade from 4.3 -> 4.4 hangs for more than 2 hours.

Expected results:
The upgrade should complete successfully within a reasonable time.

Additional info:
must-gather log files http://file.rdu.redhat.com/rpattath/must-gather.local.476848403869687153.tar.gz
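
For reference, a must-gather archive like this is collected roughly as follows (destination directory name is illustrative):

# oc adm must-gather --dest-dir=./must-gather.local
# tar -czf must-gather.local.tar.gz must-gather.local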

Comment 1 Mike Fiedler 2020-04-06 20:37:25 UTC
Setting target to 4.4 for triage; please move as needed.

Comment 2 W. Trevor King 2020-04-06 22:18:28 UTC
Specific update path from the must gather:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .state + " " + .version'
2020-04-03T19:35:52Z Partial 4.4.0-rc.6
2020-04-03T18:14:01Z Completed 4.3.10
2020-04-03T16:57:08Z Completed 4.2.27
2020-04-03T16:19:45Z Completed 4.1.38
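
(The equivalent check on a live cluster, for anyone following along, is roughly:)

$ oc get clusterversion version -o json | jq -r '.status.history[] | .startedTime + " " + .state + " " + .version'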

That means you were exposed to the cloud-cred crashloop caused by the born-in-4.1 infrastructure state [1], whose fix has still not been released for 4.3 (bug 1816704).  Looking at the CVO logs:

$ grep 'Running sync.*in state\|Result of work' namespaces/openshift-cluster-version/pods/cluster-version-operator-7df5777f65-wxqb5/cluster-version-operator/cluster-version-operator/logs/current.log 
2020-04-03T20:06:43.647575285Z I0403 20:06:43.647543       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.4.0-rc.6 (force=true) on generation 6 in state Updating at attempt 0
2020-04-03T20:12:28.699292741Z I0403 20:12:28.699277       1 task_graph.go:596] Result of work: [Cluster operator machine-config is still updating]
2020-04-03T20:12:51.48599661Z I0403 20:12:51.485938       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.4.0-rc.6 (force=true) on generation 6 in state Updating at attempt 1
2020-04-03T20:18:36.537564761Z I0403 20:18:36.537555       1 task_graph.go:596] Result of work: [Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable]
2020-04-03T20:19:22.270219971Z I0403 20:19:22.270165       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.4.0-rc.6 (force=true) on generation 6 in state Updating at attempt 2
2020-04-03T20:25:07.321847832Z I0403 20:25:07.321831       1 task_graph.go:596] Result of work: [Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 16:
...
2020-04-03T22:05:56.898686674Z I0403 22:05:56.898633       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.4.0-rc.6 (force=true) on generation 6 in state Updating at attempt 14
2020-04-03T22:11:41.950432412Z I0403 22:11:41.950406       1 task_graph.go:596] Result of work: [Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 16:
2020-04-03T22:15:08.95251417Z I0403 22:15:08.952464       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release:4.4.0-rc.6 (force=true) on generation 6 in state Updating at attempt 15

So the MCO got bumped, didn't complete quickly, and subsequent CVO loops through the manifest graph [2] got stuck earlier on the NodeInstallerDegraded.  On that front:

$ for X in cluster-scoped-resources/core/nodes/*.yaml; do yaml2json <"${X}" | jq -r '.metadata.name + " " + .metadata.annotations["machineconfiguration.openshift.io/currentConfig"] + " " + .spec.unschedulable'; done | grep -v worker
ip-10-0-128-10.us-west-2.compute.internal rendered-master-35008f45237ce28cb908b06b9a52c324 true
ip-10-0-153-72.us-west-2.compute.internal rendered-master-35008f45237ce28cb908b06b9a52c324 
ip-10-0-175-113.us-west-2.compute.internal rendered-master-35008f45237ce28cb908b06b9a52c324 
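
(On a live cluster the same check can be run directly with oc and jq, e.g. something like the following; the label selector for control-plane nodes is the usual node-role label:)

$ oc get nodes -l node-role.kubernetes.io/master -o json | jq -r '.items[] | .metadata.name + " " + .metadata.annotations["machineconfiguration.openshift.io/currentConfig"] + " " + (.spec.unschedulable | tostring)'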

So this is probably a dup of the... whatever the NoSchedule bug is...

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1813343#c5
[2]: https://github.com/openshift/cluster-version-operator/blob/d9e628368db44a9ea9e6c372e06e307b40f60de0/docs/user/reconciliation.md

Comment 3 W. Trevor King 2020-04-06 22:51:12 UTC
NoSchedule stuff is in bug 1814241 / bug 1814282.  Not clear to me what all should be closed as a dup.

Comment 4 Sam Batschelet 2020-04-06 22:58:50 UTC
> That means you were exposed to the cloud cred crashloop due to the born-in-4.1 infrastructure state [1] whose fix has still not been released for 4.3 (bug 1816704).  Looking at the CVO logs:

Seeing a lot of cloud-credential-operator CrashLooping:

```
288626:Apr 03 19:47:20.491108 ip-10-0-153-72 hyperkube[1189]: I0403 19:47:20.490998    1189 status_manager.go:568] Status for pod "cloud-credential-operator-d9b5745df-86jhh_openshift-cloud-credential-operator(e1e7350d-1316-4123-90b5-504a5414757e)" updated successfully: (49, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-03 18:50:10 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-03 19:43:14 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [manager]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-03 19:43:14 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [manager]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-03 18:50:10 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.153.72 PodIP:10.130.0.15 PodIPs:[{IP:10.130.0.15}] StartTime:2020-04-03 18:50:10 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:manager State:{Waiting:&ContainerStateWaiting{Reason:CrashLoopBackOff,Message:back-off 5m0s restarting failed container=manager pod=cloud-credential-operator-d9b5745df-86jhh_openshift-cloud-credential-operator(e1e7350d-1316-4123-90b5-504a5414757e),} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:2,Signal:0,Reason:Error,Message:2 +0x18b

```

Comment 7 W. Trevor King 2020-04-07 20:21:33 UTC
Probably not a dup of bug 1814282, because no e2e tests were run here, which means no leaked e2e csi-* namespaces.  Repeating [1] on this cluster's must-gather:

  $ grep 'nodeName: ip-10-0-128-10\.' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-*/*.yaml
  namespaces/openshift-machine-config-operator/pods/machine-config-daemon-m9bc9/machine-config-daemon-m9bc9.yaml:  nodeName: ip-10-0-128-10.us-west-2.compute.internal
  $ tail -n2 namespaces/openshift-machine-config-operator/pods/machine-config-daemon-m9bc9/machine-config-daemon/machine-config-daemon/logs/current.log 
  2020-04-03T22:17:47.172038081Z I0403 22:17:47.172033  129600 update.go:811] Removed stale file "/etc/kubernetes/manifests/etcd-member.yaml"
  2020-04-03T22:17:47.172089457Z E0403 22:17:47.172072  129600 writer.go:135] Marking Degraded due to: rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link
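
(On a live cluster that degraded reason would also surface in the master MachineConfigPool status; a rough way to dump its conditions:)

  $ oc get machineconfigpool master -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" "}{.message}{"\n"}{end}'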

So it's another dup of bug 1817455, which was fixed in 4.4 and will go out with whatever the next RC after rc.6 is.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1821369#c3

*** This bug has been marked as a duplicate of bug 1817455 ***

Comment 8 Mike Fiedler 2020-04-08 12:27:18 UTC
@rpattath, you can repeat this test, but make the final upgrade to the latest 4.4 nightly instead of rc.6.
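
One rough way to do that from the CLI (the pull spec below is illustrative; use whatever the latest accepted 4.4 nightly is, and note that --force skips the usual signature/precondition checks for non-GA images; flag requirements can vary a bit between oc versions):

$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-<tag> --force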

Comment 9 W. Trevor King 2020-04-08 20:34:19 UTC
rc.7 likely getting cut tomorrow-ish, if you don't want to bother with nightlies.

Comment 10 Peter Ruan 2020-04-16 20:08:10 UTC
Verified by running through the following upgrade path:

Version                            State      Started           Completed
4.4.0-0.nightly-2020-04-15-185505  Completed  Apr 16, 12:16 am  2 minutes ago
4.3.12                             Completed  Apr 16, 11:09 am  Apr 16, 11:52 am
4.2.28                             Completed  Apr 16, 8:00 am   Apr 16, 8:36 am
4.1.38                             Completed  Apr 16, 7:10 am   Apr 16, 7:22 am

Comment 11 W. Trevor King 2021-04-05 17:47:10 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

