Bug 2002120 - Disconnected upgrade from 4.7.24 -> 4.7.29 will not complete
Summary: Disconnected upgrade from 4.7.24 -> 4.7.29 will not complete
Keywords:
Status: CLOSED DUPLICATE of bug 2000746
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Yu Qi Zhang
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-08 01:13 UTC by Philip Chan
Modified: 2021-12-14 00:02 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-13 23:47:58 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
master-0 daemon log (13.60 MB, text/plain)
2021-09-22 15:36 UTC, Philip Chan
no flags Details
worker-0 daemon log (13.64 MB, text/plain)
2021-09-22 15:38 UTC, Philip Chan
no flags Details
must-gather.Sept22 partaa (19.00 MB, application/gzip)
2021-09-22 15:44 UTC, Philip Chan
no flags Details
must-gather.Sept22 partab (19.00 MB, application/octet-stream)
2021-09-22 15:45 UTC, Philip Chan
no flags Details
must-gather.Sept22 partac (3.09 MB, application/octet-stream)
2021-09-22 15:46 UTC, Philip Chan
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker MULTIARCH-1667 0 None None None 2021-09-08 01:15:21 UTC

Description Philip Chan 2021-09-08 01:13:30 UTC
Description of problem:
1. When performing a disconnected upgrade of a KVM-hosted OCP Z cluster from 4.7.24 to 4.7.29, the upgrade process does not complete.

2. All operators report that they have upgraded to the target 4.7.29 release, except for the machine-config operator:

machine-config                             4.7.24    False       True          True       137m

3. The cluster status will stay stuck at this point:

[root@bastion ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.24    True        True          51m     Working towards 4.7.29: 560 of 669 done (83% complete)

4. One of the master/control nodes and one of the worker nodes remain stuck in Ready,SchedulingDisabled:

[root@bastion ~]# oc get nodes
NAME                                                     STATUS                     ROLES    AGE    VERSION
master-0.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   4h9m   v1.20.0+558d959
master-1.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   master   4h9m   v1.20.0+558d959
master-2.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   4h9m   v1.20.0+558d959
worker-0.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   worker   4h1m   v1.20.0+558d959
worker-1.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      worker   4h1m   v1.20.0+558d959

5. Taking a closer look at one of these nodes (master-1, for example) shows this "No such file or directory" error:

[root@bastion ~]# oc describe node master-1.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com
...
Annotations:        k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_master-0.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com","mac-address":"52:54:00:da:70:41...
                    k8s.ovn.org/node-chassis-id: 63a4adbf-120d-4767-ac63-c1a27c9ebead
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.5.142"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: 16:6e:ee:3b:d0:3f
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.79.21/24"}
                    k8s.ovn.org/node-subnets: {"default":"10.130.0.0/23"}
                    machineconfiguration.openshift.io/currentConfig: rendered-master-d20c7a4b400256a687718be1d2c2e2a8
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-e3807e131147bdb9afe722b453ec015e
                    machineconfiguration.openshift.io/reason:
                      failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:285b4e56265d28a5e46bb54714933348f8517d63de4d237568e457a995e3a...
                      error: opendir(/run/mco-machine-os-content/os-content-838794730/srv/repo): No such file or directory
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true

Version-Release number of selected component (if applicable):
OCP 4.7.29 

How reproducible:
Consistently reproducible

Steps to Reproduce:
1. Start with an OCP cluster at version 4.7.24, disconnected/restricted install type.
2. Perform the operations for a disconnected/restricted upgrade to target release 4.7.29 from the mirror site (a command sketch follows below).
3. Observe that the upgrade does not complete: the machine-config operator remains at 4.7.24 and some master/worker nodes remain in the Ready,SchedulingDisabled state.
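
For reference, the disconnected upgrade in step 2 is driven roughly as follows (a sketch, not the exact commands used here; the bastion:5000/ocp4/openshift4 mirror repository is the one shown later in this report, and the release-image digest is a placeholder):

```
# Mirror the 4.7.29 release into the local registry (from a host that can reach both sides):
$ oc adm release mirror -a pull-secret.json \
    --from=quay.io/openshift-release-dev/ocp-release:4.7.29-s390x \
    --to=bastion:5000/ocp4/openshift4 \
    --to-release-image=bastion:5000/ocp4/openshift4:4.7.29-s390x

# Point the cluster at the mirrored release image by digest:
$ oc adm upgrade --allow-explicit-upgrade \
    --to-image=bastion:5000/ocp4/openshift4@sha256:<release-image-digest>
```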

Actual results:
The OCP cluster upgrade process will fail to complete.

Expected results:
The OCP cluster and all its operators should upgrade to the targeted release of OCP 4.7.29.

Additional info:
We are also experiencing upgrade problems from 4.7.24 to 4.7.29 on zVM. Additional details to follow from Kyle Moser.

Comment 2 Philip Chan 2021-09-08 14:22:06 UTC
We also performed tests with connected and disconnected upgrades for the following releases:
4.7.24 --> 4.7.30 (KVM and zVM)
4.7.29 --> 4.7.30 (KVM and zVM)

These upgrades succeeded.  We will continue to look at the 4.7.29 upgrade failure and post any new information to this bugzilla.

Comment 3 Carvel Baus 2021-09-10 14:30:40 UTC
I don't think this is a blocker, as a functional upgrade path exists to a later release version (4.7.30). I am also lowering the severity to medium for the same reason: this is not a blocker because a functional upgrade path exists.

There is insufficient information here to make any determination as to why this is happening. I am going to assign this to the MCO team to help ask the right questions and understand what is going on.

Comment 4 Prashanth Sundararaman 2021-09-10 17:51:29 UTC
Kyle/Phil,

Can we get the machine-config-daemon logs from the node with the issue? Also, can you check which temp directories are present inside /run/mco-machine-os-content inside the daemon? Does the directory /run/mco-machine-os-content/os-content-838794730 exist? Did the image extraction succeed?
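
For reference, one way to collect this (a sketch; the pod and node names are placeholders for the machine-config-daemon pod and the node in question):

```
# daemon logs from the MCD pod running on the degraded node
$ oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon > mcd.log

# inspect the extraction directories left behind on the node itself
$ oc debug node/<affected-node> -- chroot /host ls -l /run/mco-machine-os-content/
```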

Thanks
Prashanth

Comment 5 Colin Walters 2021-09-13 19:35:36 UTC
> failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:285b4e56265d28a5e46bb54714933348f8517d63de4d237568e457a995e3a...

Odd, that sha256 checksum looks truncated.

The digest for 4.7.29 is:

```
$ oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.7.29-s390x

$
```

Mounting that container, I do see /srv/repo in there.
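
For reference, one way to reproduce that check (a sketch; the digest is a placeholder, and rootless podman may need to run this inside podman unshare, depending on the podman version):

```
$ podman pull --authfile pull-secret.json \
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<machine-os-content-digest>
$ mnt=$(podman image mount quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<machine-os-content-digest>)
$ ls "$mnt"/srv/repo
$ podman image unmount quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<machine-os-content-digest>
```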

This might be related to https://bugzilla.redhat.com/show_bug.cgi?id=2000195 somehow?

May be worth retrying with 4.7.30.

Comment 6 Yu Qi Zhang 2021-09-20 21:59:01 UTC
We will probably need a bit more in the way of logs: at a minimum the machine-config-daemon logs from a failing node, though a full must-gather is best.

As Colin said, since this is a disconnected setup, it is possible that the update is actually failing on podman actions and the error bubbled up does not contain the full picture.
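
A full must-gather can be collected along the lines of this sketch (--dest-dir is optional):

```
$ oc adm must-gather --dest-dir=./must-gather
```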

Comment 7 Philip Chan 2021-09-22 15:35:32 UTC
I re-ran a new disconnected upgrade from 4.7.24 to 4.7.29 to gather additional logs in response to all of the above inquiries.

As a baseline, here are the nodes that are currently stuck during the MCO update (master-0 and worker-0):

# oc get nodes
NAME                                                    STATUS                     ROLES    AGE   VERSION
master-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   master   12h   v1.20.0+558d959
master-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   12h   v1.20.0+558d959
master-2.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   12h   v1.20.0+558d959
worker-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   worker   12h   v1.20.0+558d959
worker-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      worker   12h   v1.20.0+558d959

And these are the new machine-config-daemons:

# oc get pods -o wide -n openshift-machine-config-operator
NAME                                        READY   STATUS    RESTARTS   AGE   IP              NODE                                                    NOMINATED NODE   READINESS GATES
machine-config-controller-8fdd8d8f9-cxqrm   1/1     Running   0          10h   10.128.0.77     master-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-bz456                 2/2     Running   0          10h   192.168.79.24   worker-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-dr4bm                 2/2     Running   0          10h   192.168.79.22   master-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-hxl4d                 2/2     Running   0          10h   192.168.79.21   master-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-klbjr                 2/2     Running   0          10h   192.168.79.25   worker-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-mmddq                 2/2     Running   0          10h   192.168.79.23   master-2.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-operator-7f489c49d5-b2gp8    1/1     Running   0          10h   10.130.0.78     master-2.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-server-hb7sb                 1/1     Running   0          10h   192.168.79.23   master-2.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-server-r776l                 1/1     Running   0          10h   192.168.79.21   master-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-server-tws9p                 1/1     Running   0          10h   192.168.79.22   master-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>

@Yu - I will post the must-gather and machine-config-daemon logs for further review.  This is just to note what the must-gather summary ended with:

When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: edb5478c-8670-4722-81b1-1a8e9814ebbf
ClusterVersion: Updating to "4.7.29" from "4.7.24" for 12 hours: Unable to apply 4.7.29: wait has exceeded 40 minutes for these operators: openshift-apiserver
ClusterOperators:
	clusteroperator/authentication is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
	clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-7c76795c5d-6tmcp" cannot be scheduled: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
	clusteroperator/machine-config is not available (Cluster not available for 4.7.29) because Unable to apply 4.7.29: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)
	clusteroperator/openshift-apiserver is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

@Prashanth and @Colin - If we look at master-0 about the missing directory:

Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-61ba39de1d60c9d70ee2d1337e1a7ad4
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-240d001f2dc7cdd91891266f11a87798
                    machineconfiguration.openshift.io/reason:
                      failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:285b4e56265d28a5e46bb54714933348f8517d63de4d237568e457a995e3a...
                      error: opendir(/run/mco-machine-os-content/os-content-152081691/srv/repo): No such file or directory
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true

That specific directory does not exist.  However, the upgrade keeps trying to extract the contents into a new directory.  From what I can tell, it only gets as far as the extensions directory before it stops, deletes the directory, and retries:

[core@master-0 ~]$ ls /run/mco-machine-os-content/os-content-389481125/
bin  boot  etc  extensions

[core@master-0 ~]$ ls /run/mco-machine-os-content/os-content-170249399/
bin  boot  etc  extensions
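
The retry loop can also be followed live from the daemon pod on that node (a sketch, using the master-0 daemon pod name listed above):

```
$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-hxl4d \
    -c machine-config-daemon | grep -iE 'os-content|extract|opendir'
```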

The sha256 checksum may look truncated, but I can pull the image directly from quay.io or from our mirror registry and extract the contents without any issue:

# oc image extract -a /root/.ocp4_pull_secret quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:285b4e56265d28a5e46bb54714933348f8517d63de4d237568e457a995e3abed
# ls
bin  boot  dev  etc  extensions  home  lib  lib64  lost+found  media  mnt  opt  pkglist.txt  proc  root  run  sbin  srv  sys  tmp  usr  var

# oc image extract -a /root/disconnectedinstall/pull-secret.json bastion:5000/ocp4/openshift4:4.7.29-s390x-machine-os-content
# ls
bin  boot  dev  etc  extensions  home  lib  lib64  lost+found  media  mnt  opt  pkglist.txt  proc  root  run  sbin  srv  sys  tmp  usr  var

Also, I want to confirm that we are able to perform a disconnected upgrade from 4.7.24 -> 4.7.30 successfully, which is good: the problem is resolved with a later release build. The main concern is that 4.7.29 remains available on the public mirror site with this potential problem, so a customer may still attempt the upgrade to 4.7.29. If their cluster gets into this state, is there a solution to upgrade them to 4.7.30 or above as a resolution?
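
If retargeting is the answer, I assume it would look roughly like the following sketch for a disconnected cluster, i.e. mirroring the 4.7.30 release and pointing the stuck update at it by digest; whether that is supported from this degraded state is exactly the question above, and the digest is a placeholder:

```
$ oc adm upgrade --allow-explicit-upgrade \
    --to-image=bastion:5000/ocp4/openshift4@sha256:<4.7.30-release-image-digest>
```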

Comment 8 Philip Chan 2021-09-22 15:36:35 UTC
Created attachment 1825363 [details]
master-0 daemon log

Comment 9 Philip Chan 2021-09-22 15:38:16 UTC
Created attachment 1825364 [details]
worker-0 daemon log

Comment 10 Philip Chan 2021-09-22 15:44:58 UTC
Created attachment 1825365 [details]
must-gather.Sept22 partaa

Comment 11 Philip Chan 2021-09-22 15:45:36 UTC
Created attachment 1825366 [details]
must-gather.Sept22 partab

Comment 12 Philip Chan 2021-09-22 15:46:34 UTC
Created attachment 1825367 [details]
must-gather.Sept22 partac

Comment 13 Yu Qi Zhang 2021-09-22 16:42:03 UTC
Ah ok, I think this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2000195, with https://bugzilla.redhat.com/show_bug.cgi?id=2000746 tracking the fix for 4.7.z; as you noted, that is fixed in 4.7.30 (https://bugzilla.redhat.com/show_bug.cgi?id=2000746#c6).

As to pulling edges, we may need to evaluate the impact; it might be more than just a few 4.7 z-stream releases.

Comment 14 Yu Qi Zhang 2021-11-22 17:06:02 UTC
@Trevor, some time has passed now. Do you know whether any action has been taken on this, or whether we still need to do anything?

Essentially all disconnected upgrades in a window of ~6 (?) 4.7 releases are broken. I am not sure whether anyone else has run into this, though, and we do have much newer 4.7 edges. Would it be fine to close this?

Comment 15 W. Trevor King 2021-12-13 23:47:58 UTC
Sorry for the slow response; I'd lost track of this one.  Sounds like restricted network updates from... something into 4.7.29 and maybe some earlier 4.7.z will hang on this.  I'm not clear on mitigations or anything.  If there were any attached cases, or if the fix from bug 2000746's 4.7.30 were newer, I'd probably attach UpgradeBlocker and an impact statement request per [1] to fill in those missing pieces.  Even if we said "4.7.30 has been out for a while, so there are many alternative paths, and we aren't blocking edges on this", having clarity on affected updates, symptoms, mitigations, etc. can help folks who do happen to be bitten.

However, the lack of attached cases and the age of the 4.7.30 fix suggest that nobody's out there at the moment, or likely to be out there in the future, and in need of that clarity.  So I'm going to close as a dup, and we'll revisit mitigation notes and such if anyone actually trips over this in the wild.

[1]: https://github.com/openshift/enhancements/tree/2911c46bf7d2f22eb1ab81739b4f9c2603fd0c07/enhancements/update/update-blocker-lifecycle#impact-statement-request

*** This bug has been marked as a duplicate of bug 2000746 ***

