Description of problem:

This change introduced a bug that is triggered when updating:
https://github.com/openshift/machine-config-operator/pull/2380

The hostnames of the RHCOS virtual machines in a vSphere installation must be the same as the guest name defined in vCenter. When upgrading to the latest release-4.7, NetworkManager sets the hostname provided by DHCP, causing the upgrade to fail and the first pivoted machines to go NotReady.

Upgraded master:

[root@ip-172-31-245-117 ~]# systemctl status vsphere-hostname.service
● vsphere-hostname.service - vSphere hostname
    Loaded: loaded (/etc/systemd/system/vsphere-hostname.service; enabled; vendor preset: disabled)
    Active: inactive (dead)
 Condition: start condition failed at Tue 2021-03-23 20:22:15 UTC; 11min ago
            └─ ConditionPathExists=/etc/ignition-machine-config-encapsulated.json was not met

Even after removing the condition:

[root@ip-172-31-245-117 ~]# systemctl daemon-reload
[root@ip-172-31-245-117 ~]# systemctl restart vsphere-hostname.service
[root@ip-172-31-245-117 ~]# ^restart^status
systemctl status vsphere-hostname.service
● vsphere-hostname.service - vSphere hostname
    Loaded: loaded (/etc/systemd/system/vsphere-hostname.service; enabled; vendor preset: disabled)
    Active: inactive (dead) since Tue 2021-03-23 20:37:07 UTC; 7s ago
   Process: 17050 ExecStart=/usr/local/bin/vsphere-hostname.sh (code=exited, status=0/SUCCESS)
  Main PID: 17050 (code=exited, status=0/SUCCESS)
       CPU: 22ms

Mar 23 20:37:06 ip-172-31-245-117.us-west-2.compute.internal systemd[1]: Started vSphere hostname.
Mar 23 20:37:07 jcallen2-vkhbn-master-1 systemd[1]: vsphere-hostname.service: Succeeded.
Mar 23 20:37:07 jcallen2-vkhbn-master-1 systemd[1]: vsphere-hostname.service: Consumed 22ms CPU time

[root@ip-172-31-245-117 ~]# hostnamectl
   Static hostname: jcallen2-vkhbn-master-1
Transient hostname: ip-172-31-245-117.us-west-2.compute.internal
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 8c9a26759d20412c9fa962dd49a3271e
           Boot ID: e61c22fbe0c64cda8bd34601bd80f3a7
    Virtualization: vmware
  Operating System: Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:8::coreos
            Kernel: Linux 4.18.0-240.15.1.el8_3.x86_64
      Architecture: x86-64

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
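For quick triage on a pivoted node, a check like the following can be used to spot the mismatch visible in the hostnamectl output above. This is only a sketch: it assumes it is run as root directly on the RHCOS node, and it only reports the static/transient divergence, it does not fix anything.

  #!/usr/bin/env bash
  # Compare the static hostname (what vsphere-hostname.sh sets from the
  # vCenter guest name) with the transient/kernel hostname (what
  # NetworkManager applied from DHCP after the pivot).
  static=$(cat /etc/hostname 2>/dev/null)
  transient=$(uname -n)
  if [ "${static}" != "${transient}" ]; then
    echo "hostname mismatch: static='${static}' transient='${transient}'"
  fi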
This was discovered while trying to work on another BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1935539

# jcallen @ magnesium in ~/go/src/github.com/openshift/machine-config-operator on git:vsphere_offload_47_test x [16:56:50]
$ git --no-pager diff release-4.7
diff --git a/templates/common/vsphere/files/vsphere-disable-vmxnet3v4-features.yaml b/templates/common/vsphere/files/vsphere-disable-vmxnet3v4-features.yaml
new file mode 100644
index 00000000..1b5daae2
--- /dev/null
+++ b/templates/common/vsphere/files/vsphere-disable-vmxnet3v4-features.yaml
@@ -0,0 +1,14 @@
+filesystem: "root"
+mode: 0744
+path: "/etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl"
+contents:
+  inline: |
+    #!/bin/bash
+    # Workaround:
+    # https://bugzilla.redhat.com/show_bug.cgi?id=1941714
+    # https://bugzilla.redhat.com/show_bug.cgi?id=1935539
+    if [ "$2" == "up" ]; then
+      logger -s "99-vsphere-disable-tx-udp-tnl triggered by ${2}."
+      ethtool -K ${DEVICE_IFACE} tx-udp_tnl-segmentation off
+      ethtool -K ${DEVICE_IFACE} tx-udp_tnl-csum-segmentation off
+    fi
diff --git a/templates/common/vsphere/files/vsphere-hostname.yaml b/templates/common/vsphere/files/vsphere-hostname.yaml
index d9096235..5b79101a 100644
--- a/templates/common/vsphere/files/vsphere-hostname.yaml
+++ b/templates/common/vsphere/files/vsphere-hostname.yaml
@@ -5,9 +5,6 @@ contents:
     #!/usr/bin/env bash
     set -e
 
-    # only run if the hostname is not set
-    test -f /etc/hostname && exit 0 || :
-
     if vm_name=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname'); then
       /usr/bin/hostnamectl set-hostname --static ${vm_name}
     fi

The release image quay.io/jcallen/origin-release@sha256:81067c5c77dec5d950abf6bcb93edb6e7aea534f45e0a4a144a9f3e39c4acbbe has the above changes.

➜ ~ oc adm upgrade --allow-explicit-upgrade --force --allow-upgrade-with-warnings --to-image quay.io/jcallen/origin-release@sha256:81067c5c77dec5d950abf6bcb93edb6e7aea534f45e0a4a144a9f3e39c4acbbe
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image quay.io/jcallen/origin-release@sha256:81067c5c77dec5d950abf6bcb93edb6e7aea534f45e0a4a144a9f3e39c4acbbe

➜ ~ oc get node
NAME                          STATUS                        ROLES    AGE    VERSION
jcallen2-vkhbn-master-0       Ready                         master   135m   v1.19.0+2f3101c
jcallen2-vkhbn-master-1       NotReady,SchedulingDisabled   master   135m   v1.19.0+2f3101c
jcallen2-vkhbn-master-2       Ready                         master   134m   v1.19.0+2f3101c
jcallen2-vkhbn-worker-5pcrg   NotReady,SchedulingDisabled   worker   125m   v1.19.0+2f3101c
jcallen2-vkhbn-worker-bcqgn   Ready                         worker   124m   v1.19.0+2f3101c
jcallen2-vkhbn-worker-krqqv   Ready                         worker   124m   v1.19.0+2f3101c
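For readability, the script body that ends up on disk with the vsphere-hostname.yaml change above reduces to roughly the following. This is reconstructed from the diff; the install path /usr/local/bin/vsphere-hostname.sh is taken from the systemd unit output in the description.

  #!/usr/bin/env bash
  set -e

  # Always ask open-vm-tools for the guest name defined in vCenter and pin it
  # as the static hostname; the "only run if the hostname is not set" guard
  # added by PR 2380 is removed.
  if vm_name=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname'); then
    /usr/bin/hostnamectl set-hostname --static ${vm_name}
  fi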
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  All vSphere customers leveraging the vSphere cloud provider who are upgrading from 4.6.z or 4.7.3.

What is the impact? Is it serious enough to warrant blocking edges?
  Nodes may lose their node names, which can have serious impacts on the stability of the control plane and workloads.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Each node must be SSH'ed into and have the node name set manually (see the sketch below).

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  Yes, this is a regression introduced in 4.7.4.
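A rough sketch of that per-node remediation, using the same vmtoolsd lookup the MCO template relies on. Assumptions: it is run as root over SSH on each affected node, open-vm-tools is running, and guestinfo.hostname returns the vCenter guest name.

  # Query the guest name from vCenter via open-vm-tools, then set it as the
  # static hostname so it takes precedence over the DHCP-provided transient
  # hostname applied by NetworkManager.
  vm_name=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname')
  /usr/bin/hostnamectl set-hostname --static "${vm_name}"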
Based on comment 3, I've filed [1] to block *->4.7.4 edges. [1]: https://github.com/openshift/cincinnati-graph-data/pull/731
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days