Bug 1942207 - [vsphere] hostnames are changed when upgrading from 4.6 to 4.7.x causing upgrades to fail
Summary: [vsphere] hostnames are changed when upgrading from 4.6 to 4.7.x causing upgrades to fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: rvanderp
QA Contact: Michael Nguyen
URL:
Whiteboard: UpdateRecommendationsBlocked
Depends On:
Blocks: 1943143
 
Reported: 2021-03-23 20:53 UTC by Joseph Callen
Modified: 2024-06-14 00:59 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The hostname set by the vsphere-hostname service is only applied on installation of the node.
Consequence: If the hostname is not statically set prior to upgrading, the hostname may be lost.
Fix: Remove the condition that allowed the vsphere-hostname service to run only when a node is installed.
Result:
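
For context, the gating described above is visible in two places later in this report: a ConditionPathExists on the systemd unit (see the service status in the description) and a guard at the top of the script it runs (see the diff in comment 1). A minimal pre-fix sketch; Description, ExecStart, and the condition are taken from the logs below, while the rest (e.g. Type=oneshot) is assumed:

# /etc/systemd/system/vsphere-hostname.service (pre-fix sketch)
[Unit]
Description=vSphere hostname
# This file only exists on first boot, so the unit is skipped on every later boot:
ConditionPathExists=/etc/ignition-machine-config-encapsulated.json

[Service]
# Assumed: a oneshot that runs the hostname script once and exits
Type=oneshot
ExecStart=/usr/local/bin/vsphere-hostname.sh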
Clone Of:
Environment:
Last Closed: 2021-07-27 22:55:09 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/machine-config-operator pull 2486 (open): Fixes: Bug 1942207: [vsphere] hostnames are changed when upgrading from 4.6 to 4.7.x causing upgrades to fail (last updated 2021-03-24 12:43:58 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:55:26 UTC)

Description Joseph Callen 2021-03-23 20:53:14 UTC
Description of problem:

This change introduced a bug that is triggered when updating:
https://github.com/openshift/machine-config-operator/pull/2380

The hostnames of the RHCOS virtual machines in a vSphere installation must be the same as the guest names defined in vCenter.

When upgrading to the latest release-4.7, NetworkManager sets the hostname defined by DHCP, causing the upgrade to fail and the first pivoted machines to go NotReady.


Upgraded master:

[root@ip-172-31-245-117 ~]# systemctl status vsphere-hostname.service
● vsphere-hostname.service - vSphere hostname
   Loaded: loaded (/etc/systemd/system/vsphere-hostname.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Tue 2021-03-23 20:22:15 UTC; 11min ago
           └─ ConditionPathExists=/etc/ignition-machine-config-encapsulated.json was not met

Even after removing the condition:

[root@ip-172-31-245-117 ~]# systemctl daemon-reload
[root@ip-172-31-245-117 ~]# systemctl restart vsphere-hostname.service
[root@ip-172-31-245-117 ~]# ^restart^status
systemctl status vsphere-hostname.service
● vsphere-hostname.service - vSphere hostname
   Loaded: loaded (/etc/systemd/system/vsphere-hostname.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2021-03-23 20:37:07 UTC; 7s ago
  Process: 17050 ExecStart=/usr/local/bin/vsphere-hostname.sh (code=exited, status=0/SUCCESS)
 Main PID: 17050 (code=exited, status=0/SUCCESS)
      CPU: 22ms

Mar 23 20:37:06 ip-172-31-245-117.us-west-2.compute.internal systemd[1]: Started vSphere hostname.
Mar 23 20:37:07 jcallen2-vkhbn-master-1 systemd[1]: vsphere-hostname.service: Succeeded.
Mar 23 20:37:07 jcallen2-vkhbn-master-1 systemd[1]: vsphere-hostname.service: Consumed 22ms CPU time
[root@ip-172-31-245-117 ~]# hostnamectl
   Static hostname: jcallen2-vkhbn-master-1
Transient hostname: ip-172-31-245-117.us-west-2.compute.internal
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 8c9a26759d20412c9fa962dd49a3271e
           Boot ID: e61c22fbe0c64cda8bd34601bd80f3a7
    Virtualization: vmware
  Operating System: Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:8::coreos
            Kernel: Linux 4.18.0-240.15.1.el8_3.x86_64
      Architecture: x86-64
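
A quick way to spot affected nodes is that the static and transient hostnames disagree, as in the hostnamectl output above. A minimal check, assuming SSH access as the core user (NODES is a placeholder list):

# Sketch: report nodes whose transient (DHCP-derived) hostname differs
# from the static one set from the vCenter guest name.
for host in $NODES; do
  ssh core@"$host" \
    'st=$(hostnamectl --static); tr=$(hostname);
     [ "$st" = "$tr" ] || echo "mismatch on $tr: static=$st"'
done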

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Joseph Callen 2021-03-23 20:58:00 UTC
This was discovered while trying to work on another BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1935539

# jcallen @ magnesium in ~/go/src/github.com/openshift/machine-config-operator on git:vsphere_offload_47_test x [16:56:50]
$ git --no-pager diff release-4.7
diff --git a/templates/common/vsphere/files/vsphere-disable-vmxnet3v4-features.yaml b/templates/common/vsphere/files/vsphere-disable-vmxnet3v4-features.yaml
new file mode 100644
index 00000000..1b5daae2
--- /dev/null
+++ b/templates/common/vsphere/files/vsphere-disable-vmxnet3v4-features.yaml
@@ -0,0 +1,14 @@
+filesystem: "root"
+mode: 0744
+path: "/etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl"
+contents:
+    inline: |
+      #!/bin/bash
+      # Workaround:
+      # https://bugzilla.redhat.com/show_bug.cgi?id=1941714
+      # https://bugzilla.redhat.com/show_bug.cgi?id=1935539
+      if [ "$2" == "up" ]; then
+        logger -s "99-vsphere-disable-tx-udp-tnl triggered by ${2}."
+        ethtool -K ${DEVICE_IFACE} tx-udp_tnl-segmentation off
+        ethtool -K ${DEVICE_IFACE} tx-udp_tnl-csum-segmentation off
+      fi
diff --git a/templates/common/vsphere/files/vsphere-hostname.yaml b/templates/common/vsphere/files/vsphere-hostname.yaml
index d9096235..5b79101a 100644
--- a/templates/common/vsphere/files/vsphere-hostname.yaml
+++ b/templates/common/vsphere/files/vsphere-hostname.yaml
@@ -5,9 +5,6 @@ contents:
     #!/usr/bin/env bash
     set -e

-    # only run if the hostname is not set
-    test -f /etc/hostname && exit 0 || :
-
     if vm_name=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname'); then
         /usr/bin/hostnamectl set-hostname --static ${vm_name}
     fi

The release image quay.io/jcallen/origin-release@sha256:81067c5c77dec5d950abf6bcb93edb6e7aea534f45e0a4a144a9f3e39c4acbbe
contains the above changes.
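
With the guard removed, the script carried by the template reduces to the following (reconstructed directly from the hunk above):

#!/usr/bin/env bash
set -e

# Ask open-vm-tools for the guest name defined in vCenter and pin it as the
# static hostname, so a DHCP-supplied transient name can no longer win.
if vm_name=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname'); then
    /usr/bin/hostnamectl set-hostname --static ${vm_name}
fi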

➜  ~ oc adm upgrade --allow-explicit-upgrade --force --allow-upgrade-with-warnings --to-image quay.io/jcallen/origin-release@sha256:81067c5c77dec5d950abf6bcb93edb6e7aea534f45e0a4a144a9f3e39c4acbbe
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image quay.io/jcallen/origin-release@sha256:81067c5c77dec5d950abf6bcb93edb6e7aea534f45e0a4a144a9f3e39c4acbbe

➜  ~ oc get node
NAME                          STATUS                        ROLES    AGE    VERSION
jcallen2-vkhbn-master-0       Ready                         master   135m   v1.19.0+2f3101c
jcallen2-vkhbn-master-1       NotReady,SchedulingDisabled   master   135m   v1.19.0+2f3101c
jcallen2-vkhbn-master-2       Ready                         master   134m   v1.19.0+2f3101c
jcallen2-vkhbn-worker-5pcrg   NotReady,SchedulingDisabled   worker   125m   v1.19.0+2f3101c
jcallen2-vkhbn-worker-bcqgn   Ready                         worker   124m   v1.19.0+2f3101c
jcallen2-vkhbn-worker-krqqv   Ready                         worker   124m   v1.19.0+2f3101c

Comment 3 rvanderp 2021-03-30 17:38:39 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
All vSphere customers leveraging the vSphere cloud provider and upgrading from 4.6.z or 4.7.3.

What is the impact?  Is it serious enough to warrant blocking edges?
Nodes may lose their node names, which can have serious impacts on the stability of the control plane and workloads.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Each node must be accessed over SSH and have its node name set manually, roughly as sketched below.
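
A hedged sketch of that per-node step, reusing the same vmtoolsd query the service performs (<node> is a placeholder; assumes SSH access as the core user with sudo):

ssh core@<node> \
  'sudo hostnamectl set-hostname --static "$(/bin/vmtoolsd --cmd "info-get guestinfo.hostname")"'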

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, this is a regression introduced in 4.7.4.

Comment 4 W. Trevor King 2021-03-30 17:43:30 UTC
Based on comment 3, I've filed [1] to block *->4.7.4 edges.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/731

Comment 9 errata-xmlrpc 2021-07-27 22:55:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 10 Red Hat Bugzilla 2023-09-15 01:03:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

