Bug 1942207

Summary: [vsphere] hostnames are changed when upgrading from 4.6 to 4.7.x, causing upgrades to fail
Product: OpenShift Container Platform Reporter: Joseph Callen <jcallen>
Component: Machine Config Operator Assignee: rvanderp
Machine Config Operator sub component: platform-vsphere QA Contact: Michael Nguyen <mnguyen>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: alexisph, aos-bugs, bjarolim, fiezzi, jhou, jima, mkrejci, openshift-bugs-escalate, rvanderp, wking
Version: 4.7 Keywords: UpgradeBlocker, Upgrades
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard: UpdateRecommendationsBlocked
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The hostname set by the vsphere-hostname service is applied only at node installation. Consequence: If the hostname is not statically set prior to upgrading, the hostname may be lost. Fix: Remove the condition that allowed the vsphere-hostname service to run only when a node is installed. Result:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:55:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1943143    

Description Joseph Callen 2021-03-23 20:53:14 UTC
Description of problem:

This change introduced a bug that is triggered when updating.
https://github.com/openshift/machine-config-operator/pull/2380

The hostnames of the RHCOS virtual machines in a vSphere installation must be the same as the guest names defined in vCenter.

When upgrading to the latest release-4.7, NetworkManager sets the hostname supplied by DHCP, causing the upgrade to fail and the first pivoted machines to go NotReady.
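For context, NetworkManager only adopts a DHCP-supplied name when no static hostname is set. A hedged sketch of a drop-in that disables that behavior entirely (the file path is an assumption, and this is not the fix that shipped; it only illustrates the NetworkManager knob involved):

```ini
# /etc/NetworkManager/conf.d/90-hostname-mode.conf (assumed path)
# Stop NetworkManager from setting a hostname obtained via DHCP/DNS.
[main]
hostname-mode=none
```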


Upgraded master:

[root@ip-172-31-245-117 ~]# systemctl status vsphere-hostname.service
● vsphere-hostname.service - vSphere hostname
   Loaded: loaded (/etc/systemd/system/vsphere-hostname.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Tue 2021-03-23 20:22:15 UTC; 11min ago
           └─ ConditionPathExists=/etc/ignition-machine-config-encapsulated.json was not met
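The failed start condition corresponds to a `ConditionPathExists=` stanza in the unit file. A minimal sketch of what such a unit looks like, reconstructed from the status output above (the `[Service]` and `[Install]` fields are assumptions):

```ini
# /etc/systemd/system/vsphere-hostname.service (sketch; some fields assumed)
[Unit]
Description=vSphere hostname
# This path exists only on first boot, so the service never re-runs on upgrade:
ConditionPathExists=/etc/ignition-machine-config-encapsulated.json

[Service]
Type=oneshot
ExecStart=/usr/local/bin/vsphere-hostname.sh

[Install]
WantedBy=multi-user.target
```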

Even after removing the condition:

[root@ip-172-31-245-117 ~]# systemctl daemon-reload
[root@ip-172-31-245-117 ~]# systemctl restart vsphere-hostname.service
[root@ip-172-31-245-117 ~]# ^restart^status
systemctl status vsphere-hostname.service
● vsphere-hostname.service - vSphere hostname
   Loaded: loaded (/etc/systemd/system/vsphere-hostname.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2021-03-23 20:37:07 UTC; 7s ago
  Process: 17050 ExecStart=/usr/local/bin/vsphere-hostname.sh (code=exited, status=0/SUCCESS)
 Main PID: 17050 (code=exited, status=0/SUCCESS)
      CPU: 22ms

Mar 23 20:37:06 ip-172-31-245-117.us-west-2.compute.internal systemd[1]: Started vSphere hostname.
Mar 23 20:37:07 jcallen2-vkhbn-master-1 systemd[1]: vsphere-hostname.service: Succeeded.
Mar 23 20:37:07 jcallen2-vkhbn-master-1 systemd[1]: vsphere-hostname.service: Consumed 22ms CPU time
[root@ip-172-31-245-117 ~]# hostnamectl
   Static hostname: jcallen2-vkhbn-master-1
Transient hostname: ip-172-31-245-117.us-west-2.compute.internal
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 8c9a26759d20412c9fa962dd49a3271e
           Boot ID: e61c22fbe0c64cda8bd34601bd80f3a7
    Virtualization: vmware
  Operating System: Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:8::coreos
            Kernel: Linux 4.18.0-240.15.1.el8_3.x86_64
      Architecture: x86-64

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Joseph Callen 2021-03-23 20:58:00 UTC
This was discovered while trying to work on another BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1935539

# jcallen @ magnesium in ~/go/src/github.com/openshift/machine-config-operator on git:vsphere_offload_47_test x [16:56:50]
$ git --no-pager diff release-4.7
diff --git a/templates/common/vsphere/files/vsphere-disable-vmxnet3v4-features.yaml b/templates/common/vsphere/files/vsphere-disable-vmxnet3v4-features.yaml
new file mode 100644
index 00000000..1b5daae2
--- /dev/null
+++ b/templates/common/vsphere/files/vsphere-disable-vmxnet3v4-features.yaml
@@ -0,0 +1,14 @@
+filesystem: "root"
+mode: 0744
+path: "/etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl"
+contents:
+    inline: |
+      #!/bin/bash
+      # Workaround:
+      # https://bugzilla.redhat.com/show_bug.cgi?id=1941714
+      # https://bugzilla.redhat.com/show_bug.cgi?id=1935539
+      if [ "$2" == "up" ]; then
+        logger -s "99-vsphere-disable-tx-udp-tnl triggered by ${2}."
+        ethtool -K ${DEVICE_IFACE} tx-udp_tnl-segmentation off
+        ethtool -K ${DEVICE_IFACE} tx-udp_tnl-csum-segmentation off
+      fi
diff --git a/templates/common/vsphere/files/vsphere-hostname.yaml b/templates/common/vsphere/files/vsphere-hostname.yaml
index d9096235..5b79101a 100644
--- a/templates/common/vsphere/files/vsphere-hostname.yaml
+++ b/templates/common/vsphere/files/vsphere-hostname.yaml
@@ -5,9 +5,6 @@ contents:
     #!/usr/bin/env bash
     set -e

-    # only run if the hostname is not set
-    test -f /etc/hostname && exit 0 || :
-
     if vm_name=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname'); then
         /usr/bin/hostnamectl set-hostname --static ${vm_name}
     fi

The release image: quay.io/jcallen/origin-release@sha256:81067c5c77dec5d950abf6bcb93edb6e7aea534f45e0a4a144a9f3e39c4acbbe
has the above changes

➜  ~ oc adm upgrade --allow-explicit-upgrade --force --allow-upgrade-with-warnings --to-image quay.io/jcallen/origin-release@sha256:81067c5c77dec5d950abf6bcb93edb6e7aea534f45e0a4a144a9f3e39c4acbbe
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image quay.io/jcallen/origin-release@sha256:81067c5c77dec5d950abf6bcb93edb6e7aea534f45e0a4a144a9f3e39c4acbbe

➜  ~ oc get node
NAME                          STATUS                        ROLES    AGE    VERSION
jcallen2-vkhbn-master-0       Ready                         master   135m   v1.19.0+2f3101c
jcallen2-vkhbn-master-1       NotReady,SchedulingDisabled   master   135m   v1.19.0+2f3101c
jcallen2-vkhbn-master-2       Ready                         master   134m   v1.19.0+2f3101c
jcallen2-vkhbn-worker-5pcrg   NotReady,SchedulingDisabled   worker   125m   v1.19.0+2f3101c
jcallen2-vkhbn-worker-bcqgn   Ready                         worker   124m   v1.19.0+2f3101c
jcallen2-vkhbn-worker-krqqv   Ready                         worker   124m   v1.19.0+2f3101c

Comment 3 rvanderp 2021-03-30 17:38:39 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
All vSphere customers leveraging the vSphere cloud provider and upgrading from 4.6.z or 4.7.3.

What is the impact?  Is it serious enough to warrant blocking edges?
Nodes may lose their node names, which can have serious impacts on the stability of the control plane and workloads.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Each node must be accessed via SSH and have its node name set manually.
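A hedged sketch of that remediation loop; the node names and the `core` user are assumptions, and `build_remediation_cmd` is a hypothetical helper used only to illustrate the per-node command:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build the SSH command that restores a node's static hostname to its
# vCenter guest name (assumed to equal the node name, as described above).
build_remediation_cmd() {
  local node="$1"
  printf 'ssh core@%s "sudo hostnamectl set-hostname --static %s"' \
    "$node" "$node"
}

# Print the command for each affected node (names from the oc output above).
for node in jcallen2-vkhbn-master-1 jcallen2-vkhbn-worker-5pcrg; do
  build_remediation_cmd "$node"
  echo
done
```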

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, this is a regression introduced in 4.7.4.

Comment 4 W. Trevor King 2021-03-30 17:43:30 UTC
Based on comment 3, I've filed [1] to block *->4.7.4 edges.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/731

Comment 9 errata-xmlrpc 2021-07-27 22:55:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 10 Red Hat Bugzilla 2023-09-15 01:03:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days