Bug 1940585 - Upgrading Vsphere UPI cluster from 4.6.20 to 4.7 fails with Failed to enable unit: Unit file nodeip-configuration.service does not exist.
Summary: Upgrading Vsphere UPI cluster from 4.6.20 to 4.7 fails with Failed to enable ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: MCO Team
QA Contact: Rio Liu
URL:
Whiteboard: UpdateRecommendationsBlocked
Depends On: 1910738
Blocks:
 
Reported: 2021-03-18 16:49 UTC by Brandon Anderson
Modified: 2021-11-03 08:44 UTC (History)
11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The misconfigured nodeip-configuration service in 4.6 vSphere UPI was fixed in MCO 4.7 but not in 4.6.
Consequence: When a cluster upgrades 4.6.x -> 4.6.y -> 4.7 AND the masters complete the 4.7 upgrade before the workers complete the 4.6.y upgrade, the 4.7 MCO stops the entire upgrade (due to the faulty 4.6 service).
Fix: Fix the nodeip-configuration service in 4.6.
Result: The upgrade completes.
Clone Of:
Environment:
Last Closed: 2021-04-20 19:27:22 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2493 0 None open Bug 1940585: conditionally set nodeip-service enabled for vSphere 2021-03-26 20:16:12 UTC
Red Hat Bugzilla 1910738 1 unspecified CLOSED OCP 4.7 Installation fails on VMWare due to 1 worker that is degraded 2021-08-18 22:10:59 UTC
Red Hat Product Errata RHBA-2021:1153 0 None None None 2021-04-20 19:27:39 UTC

Description Brandon Anderson 2021-03-18 16:49:03 UTC
Description of problem: Upgrading from 4.6.20 to 4.7 in a vSphere UPI environment completes, but the MCO goes Degraded; the error observed is:

I0318 14:27:44.146917    2880 update.go:1554] Writing systemd unit "update-ca.service"
E0318 14:27:44.158590    2880 writer.go:135] Marking Degraded due to: error enabling units: Failed to enable unit: Unit file nodeip-configuration.service does not exist.



How reproducible: Attempt to upgrade a vSphere UPI cluster from 4.6.20 to 4.7



Actual results: Cluster fails after upgrade due to MCO being degraded


Expected results: Completed upgrade resulting in stable 4.7 cluster


Additional info:

Appears to be the same issue as documented and resolved in https://bugzilla.redhat.com/show_bug.cgi?id=1910738

That previous BZ covered creating a new vSphere UPI environment, however, and its fix was included with the 4.7 release. It is possible that there are differences in the upgrade process that are not addressed by that fix and that need further review for resolution.

Mustgather is in case 02896214

Hydra link to mustgather: https://attachments.access.redhat.com/hydra/rest/cases/02896214/attachments/94108bd8-d567-4b15-979f-6d6ff2248803

Additional outputs:

[root@PCN0001S152 U025883]# oc get machineconfigpool,nodes -o wide
NAME                                                         CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/infra    rendered-infra-f74a515c835999bbca7344045c7d3bf5    True      False      False      3              3                   3                     0                      507d
machineconfigpool.machineconfiguration.openshift.io/master   rendered-master-965f2d09ea96ff06612b10cb3ecafa02   True      False      False      3              3                   3                     0                      512d
machineconfigpool.machineconfiguration.openshift.io/worker   rendered-worker-12c9558e39356590240463d1b4b14365   False     True       True       9              5                   5                     3                      344d

NAME                   STATUS                     ROLES    AGE    VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
node/cgeocpworkxxs01   Ready                      worker   393d   v1.20.0+5fbfd19   172.24.160.1    172.24.160.1    Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/cgeocpworkxxs02   Ready                      worker   393d   v1.19.0+8d12420   172.24.160.2    172.24.160.2    Red Hat Enterprise Linux CoreOS 46.82.202102231542-0 (Ootpa)   4.18.0-193.41.1.el8_2.x86_64   cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
node/cgeocpworkxxs03   Ready                      worker   392d   v1.20.0+5fbfd19   172.24.160.3    172.24.160.3    Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/cgeocpworkxxs04   Ready                      worker   392d   v1.20.0+5fbfd19   172.24.160.4    172.24.160.4    Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/xxdocpworkxxs01   Ready                      worker   393d   v1.20.0+5fbfd19   172.24.224.1    172.24.224.1    Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/xxdocpworkxxs02   Ready                      worker   393d   v1.20.0+5fbfd19   172.24.224.2    172.24.224.2    Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/xxxocpinfraxs01   Ready                      infra    510d   v1.20.0+5fbfd19   172.24.240.10   172.24.240.10   Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/xxxocpinfraxs02   Ready                      infra    510d   v1.20.0+5fbfd19   172.24.240.11   172.24.240.11   Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/xxxocpinfraxs03   Ready                      infra    510d   v1.20.0+5fbfd19   172.24.240.12   172.24.240.12   Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/xxxocpmasters01   Ready                      master   512d   v1.20.0+5fbfd19   172.24.240.1    172.24.240.1    Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/xxxocpmasters02   Ready                      master   512d   v1.20.0+5fbfd19   172.24.240.2    172.24.240.2    Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/xxxocpmasters03   Ready                      master   512d   v1.20.0+5fbfd19   172.24.240.3    172.24.240.3    Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8
node/xxxocpocsxxxs01   Ready,SchedulingDisabled   worker   398d   v1.19.0+8d12420   172.24.240.20   172.24.240.20   Red Hat Enterprise Linux CoreOS 46.82.202102231542-0 (Ootpa)   4.18.0-193.41.1.el8_2.x86_64   cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
node/xxxocpocsxxxs02   Ready                      worker   398d   v1.19.0+8d12420   172.24.240.21   172.24.240.21   Red Hat Enterprise Linux CoreOS 46.82.202102231542-0 (Ootpa)   4.18.0-193.41.1.el8_2.x86_64   cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
node/xxxocpocsxxxs03   Ready                      worker   398d   v1.19.0+8d12420   172.24.240.22   172.24.240.22   Red Hat Enterprise Linux CoreOS 46.82.202102231542-0 (Ootpa)   4.18.0-193.41.1.el8_2.x86_64   cri-o://1.19.1-7.rhaos4.6.git6377f68.el8

Please let me know if any additional information is required from the customer to expedite the process.

Comment 1 Yu Qi Zhang 2021-03-18 19:16:49 UTC
I can look into the must-gather at some point, but at a glance this is saying that while "nodeip-configuration.service" does not exist on the system, the machineconfig with the reference to enable it does. This means either the MC itself is misconfigured, or the service should exist but was deleted somehow.

The service is here: https://github.com/openshift/machine-config-operator/blob/master/templates/common/on-prem/units/nodeip-configuration.service.yaml. Would you be able to see if that systemd service exists on the node that is exhibiting the failure?
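
For reference, a quick way to check this directly on the node is something like the following (a minimal sketch; run it on the node showing the failure, output will differ):

systemctl list-unit-files nodeip-configuration.service
systemctl cat nodeip-configuration.service
ls -l /etc/systemd/system/nodeip-configuration.service /etc/systemd/system/multi-user.target.wants/nodeip-configuration.service

If the unit file itself is missing but the multi-user.target.wants symlink exists, the symlink is dangling.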

Comment 2 Roberto Docampo Suarez 2021-03-18 19:27:48 UTC
Hi, here are the services on the node:


[root@xxxocpocsxxxs03 etc]# find . | grep service
./services
./systemd/system/dbus-org.freedesktop.nm-dispatcher.service
./systemd/system/dbus-org.freedesktop.timedate1.service
./systemd/system/basic.target.wants/coreos-update-ca-trust.service
./systemd/system/basic.target.wants/ignition-firstboot-complete.service
./systemd/system/basic.target.wants/microcode.service
./systemd/system/getty.target.wants/getty
./systemd/system/multi-user.target.wants/NetworkManager.service
./systemd/system/multi-user.target.wants/afterburn-checkin.service
./systemd/system/multi-user.target.wants/afterburn-firstboot-checkin.service
./systemd/system/multi-user.target.wants/auditd.service
./systemd/system/multi-user.target.wants/chronyd.service
./systemd/system/multi-user.target.wants/console-login-helper-messages-gensnippet-ssh-keys.service
./systemd/system/multi-user.target.wants/console-login-helper-messages-issuegen.service
./systemd/system/multi-user.target.wants/coreos-generate-iscsi-initiatorname.service
./systemd/system/multi-user.target.wants/coreos-liveiso-success.service
./systemd/system/multi-user.target.wants/gcp-routes.service
./systemd/system/multi-user.target.wants/irqbalance.service
./systemd/system/multi-user.target.wants/mdmonitor.service
./systemd/system/multi-user.target.wants/rhcos-growpart.service
./systemd/system/multi-user.target.wants/sshd.service
./systemd/system/multi-user.target.wants/sssd.service
./systemd/system/multi-user.target.wants/vmtoolsd.service
./systemd/system/multi-user.target.wants/ovs-configuration.service
./systemd/system/multi-user.target.wants/node-valid-hostname.service
./systemd/system/multi-user.target.wants/vsphere-hostname.service
./systemd/system/multi-user.target.wants/machine-config-daemon-firstboot.service
./systemd/system/multi-user.target.wants/kubelet.service
./systemd/system/multi-user.target.wants/mcd-write-pivot-reboot.service
./systemd/system/multi-user.target.wants/update-ca.service
./systemd/system/multi-user.target.wants/machine-config-daemon-pull.service
./systemd/system/multi-user.target.wants/nodeip-configuration.service
./systemd/system/multi-user.target.wants/openvswitch.service
./systemd/system/multi-user.target.wants/ovsdb-server.service
./systemd/system/network-online.target.wants/NetworkManager-wait-online.service
./systemd/system/sysinit.target.wants/lvm2-monitor.service
./systemd/system/sysinit.target.wants/multipathd.service
./systemd/system/sysinit.target.wants/selinux-autorelabel-mark.service
./systemd/system/vmtoolsd.service.requires
./systemd/system/vmtoolsd.service.requires/vgauthd.service
./systemd/system/node-valid-hostname.service
./systemd/system/mcd-write-pivot-reboot.service
./systemd/system/update-ca.service
./systemd/system/vsphere-hostname.service
./systemd/system/ovs-configuration.service
./systemd/system/kubelet.service
./systemd/system/pivot.service.d
./systemd/system/pivot.service.d/10-mco-default-env.conf
./systemd/system/crio.service.d
./systemd/system/crio.service.d/20-stream-address.conf
./systemd/system/crio.service.d/10-mco-default-env.conf
./systemd/system/kubelet.service.d
./systemd/system/kubelet.service.d/10-mco-default-env.conf
./systemd/system/machine-config-daemon-host.service.d
./systemd/system/local-fs.target.wants/ostree-remount.service
./systemd/system/ovs-vswitchd.service.d
./systemd/system/ovs-vswitchd.service.d/10-ovs-vswitchd-restart.conf
./systemd/system/ovsdb-server.service.d
./systemd/system/ovsdb-server.service.d/10-ovsdb-restart.conf
./systemd/system/zincati.service.d
./systemd/system/zincati.service.d/mco-disabled.conf
./systemd/system/machine-config-daemon-firstboot.service
./systemd/system/machine-config-daemon-pull.service
./systemd/user/basic.target.wants/systemd-tmpfiles-setup.service
./systemd/user/default.target.wants/io.podman.service
./systemd/user/multi-user.target.wants/podman.service
./machine-config-daemon/orig/etc/systemd/system/crio.service.d
./machine-config-daemon/orig/etc/systemd/system/kubelet.service.d
./machine-config-daemon/orig/etc/systemd/system/machine-config-daemon-host.service.d
./machine-config-daemon/orig/etc/systemd/system/pivot.service.d
./machine-config-daemon/noorig/etc/systemd/system/crio.service.d
./machine-config-daemon/noorig/etc/systemd/system/crio.service.d/20-stream-address.conf.mcdnoorig


[root@xxxocpocsxxxs03 etc]# systemctl list-units --type=service --all
  UNIT                                                      LOAD      ACTIVE   SUB     DESCRIPTION
  afterburn-checkin.service                                 loaded    inactive dead    Afterburn (Check In)
  afterburn-firstboot-checkin.service                       loaded    inactive dead    Afterburn (Firstboot Check In)
  auditd.service                                            loaded    active   running Security Auditing Service
  blk-availability.service                                  loaded    inactive dead    Availability of block devices
  chronyd.service                                           loaded    active   running NTP client/server
● cloud-init-local.service                                  not-found inactive dead    cloud-init-local.service
  console-login-helper-messages-gensnippet-ssh-keys.service loaded    active   exited  Generate SSH keys snippet for display via console-login-helper-messages
  console-login-helper-messages-issuegen.service            loaded    inactive dead    Generate console-login-helper-messages issue snippet
  coreos-generate-iscsi-initiatorname.service               loaded    inactive dead    CoreOS Generate iSCSI Initiator Name
● coreos-growpart.service                                   not-found inactive dead    coreos-growpart.service
  coreos-liveiso-success.service                            loaded    inactive dead    CoreOS Live ISO virtio success
  coreos-update-ca-trust.service                            loaded    active   exited  Run update-ca-trust
  crio-wipe.service                                         loaded    inactive dead    CRI-O Auto Update Script
  crio.service                                              loaded    active   running Open Container Initiative Daemon
  dbus.service                                              loaded    active   running D-Bus System Message Bus
● display-manager.service                                   not-found inactive dead    display-manager.service
  dm-event.service                                          loaded    inactive dead    Device-mapper event daemon
  dracut-cmdline.service                                    loaded    inactive dead    dracut cmdline hook
  dracut-initqueue.service                                  loaded    inactive dead    dracut initqueue hook
  dracut-mount.service                                      loaded    inactive dead    dracut mount hook
  dracut-pre-mount.service                                  loaded    inactive dead    dracut pre-mount hook
  dracut-pre-pivot.service                                  loaded    inactive dead    dracut pre-pivot and cleanup hook
  dracut-pre-trigger.service                                loaded    inactive dead    dracut pre-trigger hook
  dracut-pre-udev.service                                   loaded    inactive dead    dracut pre-udev hook
  dracut-shutdown.service                                   loaded    active   exited  Restore /run/initramfs on shutdown
  emergency.service                                         loaded    inactive dead    Emergency Shell
● fcoe.service                                              not-found inactive dead    fcoe.service
  gcp-routes.service                                        loaded    inactive dead    Update GCP routes for forwarded IPs.
  getty                                        loaded    active   running Getty on tty1
  ignition-firstboot-complete.service                       loaded    inactive dead    Mark boot complete
  initrd-cleanup.service                                    loaded    inactive dead    Cleaning Up and Shutting Down Daemons
  initrd-parse-etc.service                                  loaded    inactive dead    Reload Configuration from the Real Root
  initrd-switch-root.service                                loaded    inactive dead    Switch Root
  initrd-udevadm-cleanup-db.service                         loaded    inactive dead    Cleanup udevd DB
  irqbalance.service                                        loaded    active   running irqbalance daemon
  iscsi-shutdown.service                                    loaded    inactive dead    Logout off all iSCSI sessions on shutdown
  iscsi.service                                             loaded    inactive dead    Login and scanning of iSCSI devices
  iscsid.service                                            loaded    inactive dead    Open-iSCSI
  iscsiuio.service                                          loaded    inactive dead    iSCSI UserSpace I/O driver
  kmod-static-nodes.service                                 loaded    active   exited  Create list of required static device nodes for the current kernel
  kubelet.service                                           loaded    active   running Kubernetes Kubelet
  ldconfig.service                                          loaded    inactive dead    Rebuild Dynamic Linker Cache
[output truncated]

Comment 3 Roberto Docampo Suarez 2021-03-18 19:30:37 UTC
More clean:

[root@xxxocpocsxxxs03 etc]# systemctl list-units --type=service --all | grep ip
  console-login-helper-messages-gensnippet-ssh-keys.service loaded    active   exited  Generate SSH keys snippet for display via console-login-helper-messages
  console-login-helper-messages-issuegen.service            loaded    inactive dead    Generate console-login-helper-messages issue snippet
  crio-wipe.service                                         loaded    inactive dead    CRI-O Auto Update Script
  multipathd.service                                        loaded    inactive dead    Device-Mapper Multipath Device Controller
● nodeip-configuration.service                              not-found inactive dead    nodeip-configuration.service

Comment 4 Yu Qi Zhang 2021-03-18 21:31:02 UTC
I don't have access to hydra, and for some reason I cannot download https://access.redhat.com/support/cases/#/case/02896214/discussion?attachmentId=a092K000020fK6YQAU (keeps on failing to download), and https://access.redhat.com/support/cases/#/case/02896214/discussion?attachmentId=a092K000020fLm9QAE seems to only be OCS items. Could you maybe upload it to google drive?

Based on the MCD logs on the customer case, and the above output, it seems that the machineconfig entry that should have contained nodeip-configuration.service was deleted
(you can see that ./systemd/system/multi-user.target.wants/nodeip-configuration.service exists, but that symlink should be dangling). I don't believe that is expected; or, if the service was supposed to be deleted, the corresponding entry that enables it should also have been removed.

You can also check this by looking at the latest rendered machineconfig that the worker node is updating to (desiredConfig in the node annotation, should also be in the MCD logs), and seeing the entry for "nodeip-configuration.service".
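
A sketch of those two checks with oc (assumes jq is available; the rendered config name is a placeholder to be filled in from the first command):

oc get node/xxxocpocsxxxs03 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'
oc get mc <rendered-worker-config-from-above> -o json | jq '.spec.config.systemd.units[]? | select(.name=="nodeip-configuration.service")'

The second command shows whether that rendered config contains the unit, and with what contents/enabled values.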

Comment 5 Roberto Docampo Suarez 2021-03-18 22:31:40 UTC
To check your suspicion, I looked at a node that has not yet been upgraded. The nodeip-configuration.service does not exist on this worker node (4.6.20). It seems this service does not exist in 4.6.20, with the MCO not yet upgraded to 4.7.

To try to make progress, I defined the nodeip-configuration.service on the xxxocpocsxs01 node. After that, the MCO ends up with the expected desiredConfig... but no changes were applied (old Kubernetes version, old CoreOS version). It seems the process finishes OK but no changes are actually made.



Regards

Comment 6 Yu Qi Zhang 2021-03-19 00:29:50 UTC
I just looked at the 4.6 service at https://github.com/openshift/machine-config-operator/blob/release-4.6/templates/common/vsphere/units/nodeip-configuration.service.yaml. It appears that the service contents only exist if {{ if .Infra.Status.PlatformStatus.VSphere.APIServerInternalIP -}}, BUT it is enabled regardless. This didn't cause issues in 4.6 but will in 4.7, because the 4.7 MCO now tries to enable the unit via systemctl, which will fail if the service doesn't exist.

In 4.7 I think we use the on-prem template for this at https://github.com/openshift/machine-config-operator/blob/release-4.7/templates/common/on-prem/units/nodeip-configuration.service.yaml which shouldn't be a problem. I'm a bit surprised that it still causes issues because the 4.7 template should not have contained any reference to this.

For now, I will loop in Joseph Callen. Joseph, do you know what the expected behaviour for that service should be on vsphere deployments? Did something change there between 4.6 and 4.7?

A way you can work around this is to find the machineconfig that contains a reference to that file (oc get mc -> look through all the **non-rendered** MCs). If it exists in the following format:

> name: nodeip-configuration.service
> enabled: true
> contents: |

(empty contents)

Then something may have gone wrong in the template rendering.
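
A sketch of how to scan the non-rendered MachineConfigs for an entry like that (assumes jq; filtering out names starting with "rendered-"):

for mc in $(oc get mc -o name | grep -v rendered-); do
  echo "== $mc"
  oc get "$mc" -o json | jq '.spec.config.systemd.units[]? | select(.name=="nodeip-configuration.service") | {name, enabled, contents}'
done

An entry that reports enabled: true with empty or missing contents matches the broken shape quoted above.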

Comment 9 Yu Qi Zhang 2021-03-19 17:11:11 UTC
I took a look at the must-gather. It seems that the old error the node is stuck on was:

2021-03-18T11:57:29.918577180Z E0318 11:57:29.918526    2880 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-12c9558e39356590240463d1b4b14365: content mismatch for file "/etc/systemd/system/crio.service.d/10-mco-default-env.conf"

This prevented the completion of a previous 4.6->4.6 upgrade. With a force applied, we skipped this check, but the MCD was unable to progress because of the nodeip file error. Let me take a step back and explain how this error manifested and how we can proceed.

1. How the error manifested:

The fundamental reason it appeared is that there was an ongoing upgrade on the worker pool from one 4.6 version to another 4.6 version that had not completed when the cluster started upgrading to 4.7. This is due to the fact that the worker pool is not checked for completion of a previous upgrade. Starting in 4.7 this behaviour will change, such that any incomplete worker pools will block updating to 4.8.

So regarding the template in question, nodeip-configuration.service:
in 4.6 we have it empty but enabled (wrong)
in 4.7 we have it written but disabled (correct)

So the timeline looks something like this: the cluster went from 4.6.a->4.6.b->4.7.

Before the worker pool completed 4.6.a->4.6.b (one node was stuck), the upgrade to 4.7 started. During that upgrade, at some point all MCO pods get updated in a rolling fashion. Keep in mind one worker node is still doing the 4.6.a->4.6.b upgrade.

The 4.7 MCD, taking over, says: hey I see there's an ongoing upgrade to workers. Let me try to continue that (since you applied a forcefile)

It does the upgrade from 4.6.a->4.6.b, and because of the faulty 4.6 template for nodeip-configuration.service, it errors there. The 4.6 MCD silently allowed this (and wrote a dangling symlink), but since 4.7 fixed it, the 4.7 MCD just went ahead and degraded.


2. How we can proceed:
Since some of the workers have already completed the 4.7 upgrade, the upgrade itself should be fine. Let's do the following to the node that is having issues, "xxxocpocsxxxs03":
oc edit node/xxxocpocsxxxs03
edit the desiredConfig annotation to rendered-worker-e7f7298f9bf6e5b1159778e4cf36c9a9 (the 4.7 config)
watch the MCD on the node progress, and apply a new forcefile (access the node and touch /etc/machine-config-daemon-force) if the first error shows up again (it may not)

Then the rest of the cluster should continue upgrading.
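
For reference, the same procedure expressed as commands (a sketch; oc annotate is equivalent to editing the annotation via oc edit, and the MCD pod name is a placeholder for the machine-config-daemon pod running on that node):

oc annotate node/xxxocpocsxxxs03 --overwrite machineconfiguration.openshift.io/desiredConfig=rendered-worker-e7f7298f9bf6e5b1159778e4cf36c9a9
oc -n openshift-machine-config-operator logs -f <mcd-pod-on-xxxocpocsxxxs03> -c machine-config-daemon
# only if the earlier content-mismatch error reappears, from a shell on the node:
touch /etc/machine-config-daemon-force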

As for the template issue I mentioned earlier, we will fix it for a later 4.6 release.

Comment 12 Patrick Dillon 2021-03-24 15:43:59 UTC
The 4.6 nodeip-configuration.service template needs to have the enabled value wrapped in a conditional so that it is disabled for vSphere UPI.

Comment 17 W. Trevor King 2021-03-30 20:57:42 UTC
Based on the blocker+ status, we're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

I'd asked for an impact statement in the masterward bug 1910738 as well, but sounds like the MCO has been refactored in between, so I thought I'd ask for an impact statement here too just to cover my bases.  Feel free to consolidate the responses if that's easier.

Comment 18 Yu Qi Zhang 2021-03-30 21:37:57 UTC
Who is impacted?

* vSphere UPI customers updating from 4.6 to 4.7 who have any non-master MachineConfigPools (worker or custom) which were currently rolling out changes (paused pools are not a problem). Of clusters in that situation, all of them will hit the bug during an update to 4.7.

* Example: an incomplete 4.5-4.6 upgrade in the worker pool, an incomplete 4.6->4.6 upgrade in the worker pool, a new machineconfig rolling out in the worker pool.

What is the impact?

* When the 4.7 MCO takes over, the newly deployed MCD on in-progress nodes will attempt to continue the previous upgrade to a 4.6 version which contains the problematic MC. It will fail and block there, degrading the corresponding pool.

* The templated MC itself is not a problem starting in 4.7 as that template has been reworked.

How involved is remediation?

* If the upgrade has not been triggered, it is best to wait and check that the worker pool has finished any previous updates before starting a 4.7 upgrade. Alternatively, wait for a new 4.7 version with a complementary fix.

* If the upgrade has been triggered, check the corresponding worker MCP to find the newest rendered config for 4.7 (the spec.configuration field, NOT status.configuration), or find the newest Ignition 3.2 rendered config in the list of MachineConfigs. Manually update the stuck node's desiredConfig annotation to the 4.7 one instead of a 4.6 one, and the MCD should do the rest. You should only have to do this once (example commands are sketched after this comment).

Is this a regression?

 * Yes. The issue exists on 4.6 only
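
A sketch of the checks described in the remediation above (field names per the MachineConfigPool/MachineConfig APIs; 4.7 rendered configs use Ignition spec 3.2.0):

# newest rendered config the worker pool is targeting (spec, not status):
oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
# or list rendered worker configs with their Ignition version:
oc get mc --sort-by=.metadata.creationTimestamp -o custom-columns=NAME:.metadata.name,IGNITION:.spec.config.ignition.version | grep rendered-worker
# point the stuck node at the 4.7 rendered config:
oc annotate node/<stuck-node> --overwrite machineconfiguration.openshift.io/desiredConfig=<rendered-worker-config-for-4.7>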

Comment 19 Yu Qi Zhang 2021-03-30 21:51:36 UTC
Correction to the previous statement: "You should only have to do this once."

You should only have to do this once per stuck-in-upgrade node; that defaults to 1 node, but depending on maxUnavailable settings there may be multiple. Nodes that have yet to start an update should not be affected, as they should immediately attempt to upgrade to the 4.7 worker config when it's their turn, if no other errors occur.
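
To see which worker nodes are mid-update (and therefore may need the manual annotation) versus not yet started, something like this can help (a sketch; these are the standard machineconfiguration.openshift.io node annotations):

for n in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo -n "$n  "
  oc get "$n" -o jsonpath='current={.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig} desired={.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig} state={.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'
done
# maxUnavailable for the worker pool (1 when unset):
oc get mcp worker -o jsonpath='{.spec.maxUnavailable}{"\n"}'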

Comment 21 Lalatendu Mohanty 2021-03-31 15:28:52 UTC
As per comment #18 this is a regression. But it is not clear which 4.6.z versions have the regression. Can you please clarify?

Comment 22 Lalatendu Mohanty 2021-03-31 16:21:24 UTC
As discussed in slack with @jerzhang the issue is applicable to all 4.6.z releases.

Comment 23 W. Trevor King 2021-03-31 16:41:14 UTC
This affects all 4.6 releases, and only impacts 4.6 -> 4.7 updates, so there's still a benefit to cutting 4.6.z fixes to help folks on older 4.6, even without this change landing. Setting blocker-. This bug is still urgent, and we still want this fixed as quickly as possible :)

Comment 31 errata-xmlrpc 2021-04-20 19:27:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.25 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1153

