1935539 – Openshift-apiserver CO unavailable during cluster upgrade from 4.6 to 4.7

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Joseph Callen
QA Contact:	huirwang
Docs Contact:
URL:
Whiteboard:	UpdateRecommendationsBlocked
Duplicates (2):	1926345 1941322
Depends On:
Blocks:	1941246 1944165

Reported:	2021-03-05 04:09 UTC by Yash Chouksey
Modified:	2024-12-20 19:43 UTC
CC List:	53 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1944165
Environment:
Last Closed:	2021-07-27 22:51:10 UTC
Target Upstream Version:
Embargoed:

Attachments
iptables-master3 (148.66 KB, text/plain) 2021-03-19 09:00 UTC, Yash Chouksey
vmsupport (2.61 MB, application/gzip) 2021-03-22 20:55 UTC, David J. M. Karlsen
kubeapi FIN and RST on kube API on virtIO NIC Emulation (212.24 KB, image/png) 2021-04-27 09:30 UTC, Simon Foley
kubeapi Less FIN and RST on kube API on vmxnet3 NIC Emulation (216.68 KB, image/png) 2021-04-27 10:10 UTC, Simon Foley
kubeapi Examples of fewer FIN and RST on kube API on vmxnet3 NIC Emulation (231.13 KB, image/png) 2021-04-27 10:12 UTC, Simon Foley

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2472	0	None	closed	Bug 1935539: vSphere: Disable tx udp_csum segmentation	2021-03-22 09:13:36 UTC
Github	openshift machine-config-operator pull 2482	0	None	closed	Bug 1935539: vSphere: udp tnl workaround cannot use nmcli	2021-03-27 04:18:32 UTC
Red Hat Bugzilla	1941714	1	None	None	None	2024-12-20 19:47:18 UTC
Red Hat Knowledge Base (Solution)	5896081	0	None	None	None	2021-03-22 09:48:53 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:51:41 UTC

Internal Links: 1941714 1952358

Comment 2 Lukasz Szaszkiewicz 2021-03-05 12:46:38 UTC

Hi there,

I have checked the attached must-gather. 
It looks like openshift-apiserver-operator is in a degraded state because kube-apiserver is not able to reach the Aggregated API servers (openshift-apiserver).
The kube-apiserver logs suggest the server is not able to create a network connection to the downstream servers, numerous "dial tcp ... i/o timeout".

I'm assigning this to the network team to help diagnose a potential network issue.

Comment 3 Ben Bennett 2021-03-08 15:56:24 UTC

Possibly the same as https://bugzilla.redhat.com/show_bug.cgi?id=1935591

Comment 13 Lalatendu Mohanty 2021-03-18 19:25:55 UTC

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 17 Yash Chouksey 2021-03-19 09:00:20 UTC

Created attachment 1764604 [details]
iptables-master3

Comment 21 W. Trevor King 2021-03-19 20:26:39 UTC

I am not the assignee, and my hold on this is pretty tenuous, but since we just blocked all 4.6->4.7 edges on this bug [1], here's my attempt at an impact statement per comment 13's template:

Who is impacted?
  vSphere running 4.7 on HW 14 and later.  So block 4.6->4.7 to protect clusters running 4.6 who are currently not affected.
What is the impact?
  Cross-node pod-to-pod connections are unreliable.  Kube API-server degrades, or maybe just access to it?  Or something?  Eventually things like auth operator get mad and go Available=False.  Actual impact on the cluster is all the stuff that happens downstream of flaky Kube-API access...
How involved is remediation?
  Still working this angle, but it won't be pretty.  Possibly folks can move VMs to a non-vulnerable vSphere version?
Is this a regression?
  Yup, doesn't seem to impact 4.6 or earlier.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/718

Comment 22 Dan Winship 2021-03-19 21:28:07 UTC

(In reply to W. Trevor King from comment #21)
> I am not the assignee, and my hold on this is pretty tenuous, but since we
> just blocked all 4.6->4.7 edges on this bug [1], here's my attempt at an
> impact statement per comment 13's template:
> 
> Who is impacted?
>   vSphere running 4.7 on HW 14 and later.  So block 4.6->4.7 to protect
> clusters running 4.6 who are currently not affected.
> What is the impact?
>   Cross-node pod-to-pod connections are unreliable.  Kube API-server
> degrades, or maybe just access to it?  Or something?  Eventually things like
> auth operator get mad and go Available=False.  Actual impact on the cluster
> is all the stuff that happens downstream of flaky Kube-API access...

Cross-node pod-to-pod or host-to-pod connections fail because packets are being dropped by vSphere between the nodes (apparently because of checksum errors, apparently because of a kernel change between 4.6 and 4.7). The problem only affects VXLAN traffic, not any other node-to-node traffic. (Well, AFAIK it's not clear at this time if it affects Geneve traffic, as used by ovn-kubernetes.)

The visible effect in terms of clusteroperators/telemetry/etc is that kube-apiserver stops being able to talk to openshift-apiserver, and then failures cascade from there. But all cross-node pod-to-pod traffic (including customer workload traffic) is actually broken.

(The "unreliable" in the comment above is wrong; things are totally broken. I got confused before because I saw there was still some VXLAN traffic going between nodes, so I thought that for some reason some packets were getting through but others weren't. But really what was happening is that ARP and ICMP packets over VXLAN get through, but TCP and UDP packets over VXLAN don't. So basically everything is broken.)

Comment 23 Michael Gugino 2021-03-19 23:57:02 UTC

Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1936556

machine-api is periodically unable to connect to vsphere API with 'no route to host'.

Comment 28 Joseph Callen 2021-03-22 14:37:35 UTC

Should this BZ be moved to the RHEL component instead?
I have added the VMware folks if we can make appropriate comments public so they can view.

Comment 30 Aniket Bhat 2021-03-22 14:42:13 UTC

*** Bug 1941322 has been marked as a duplicate of this bug. ***

Comment 32 Aniket Bhat 2021-03-22 14:53:59 UTC

*** Bug 1926345 has been marked as a duplicate of this bug. ***

Comment 33 Joseph Callen 2021-03-22 17:27:54 UTC

VMware is requesting that every customer that has hit this issue create a corresponding support request with VMware, so that the appropriate logs and packet traces can be retrieved.

Comment 35 David J. M. Karlsen 2021-03-22 20:08:46 UTC

@jcallen I believe we are hit by this one.
VMware 6.7u3, hwlevel 15, went from 4.6.20 to 4.7.1. Case 02893982 in the RH support portal.
When one opens cases with VMware, they are typically eager to just close them, or will close them as "not related to vmware - its an openshift issue". How are they supposed to be reported to VMware, and what forensics should be added?

Comment 36 David J. M. Karlsen 2021-03-22 20:11:55 UTC

More info: when I saw this issue I started debugging a bit. Our workers are RHEL, while the masters are RHCOS. If I create a debugging pod and run dig against the openshift-dns pods, I get responses from the ones on RHEL nodes, but not from the ones on RHCOS nodes. This might help narrow down the issue.
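
The per-pod dig test described above can be sketched as a small shell function. This is only an illustration: the function name probe_dns_pods, the "<pod-ip> <node-name>" input format and the default query name are assumptions, not something taken from the reporter's environment.

```shell
# Sketch: query each openshift-dns pod directly and report which ones answer.
# Input (stdin): one "<pod-ip> <node-name>" pair per line.
# DIG may be overridden (for example DIG=true) for a dry run without dig installed.
probe_dns_pods() {
  # Name to resolve; the default is an assumption, pass another as $1 if needed.
  local name="${1:-kubernetes.default.svc.cluster.local}"
  local ip node
  while read -r ip node; do
    [ -n "$ip" ] || continue   # skip blank lines
    # dig exits non-zero when the server never replies (for example on timeout).
    if ${DIG:-dig} +short +time=2 +tries=1 "@$ip" "$name" </dev/null >/dev/null 2>&1; then
      echo "$node $ip answered"
    else
      echo "$node $ip no-reply"
    fi
  done
}
```

Run from inside a debugging pod, the input could come from something like `oc get pods -n openshift-dns -o custom-columns=IP:.status.podIP,NODE:.spec.nodeName --no-headers`. Grouping the "no-reply" lines by node would show whether only RHCOS nodes are affected.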

Comment 37 Ronak Doshi 2021-03-22 20:48:55 UTC

(In reply to David J. M. Karlsen from comment #35)
> @jcallen I believe we are hit by this one.
> VMWare 6.7u3, hwlevel 15, went from 4.6.20 to 4.7.1. case 02893982 in RH
> support portal.
> when one open cases with vmware they are typically eager to just close them,
> or will close them as "not related to vmware - its an openshift issue". How
> are they supposed to be reported to vmware, and what forensics should be
> added?

Could you upload the host support bundle (collected with the vm-support command on the host) and also mention the port ID of the VM where the VXLAN traffic is initiated?

Comment 38 David J. M. Karlsen 2021-03-22 20:55:53 UTC

Created attachment 1765394 [details]
vmsupport

Note: this is after turning off checksum on iface (in order to get the install to work)

Comment 39 Ronak Doshi 2021-03-22 20:58:33 UTC

(In reply to David J. M. Karlsen from comment #38)
> Created attachment 1765394 [details]
> vmsupport
> 
> Note: this is after turning off checksum on iface (in order to get the
> install to work)

This won't help then, as it does not show the issue. Please reproduce the issue by re-enabling checksum offload on a test setup and collect a support bundle for investigation.

Comment 40 Joseph Callen 2021-03-22 23:55:50 UTC

Hi David,

(In reply to David J. M. Karlsen from comment #36)
> more info too, when I saw this issue I started debugging a bit, and our
> workers are RHEL, while the masters are RHCOS.


Hmm, RHEL workers would be RHEL 7, so I suspect they don't have the updated vmxnet3 driver.
Here is the corresponding RHEL kernel BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1941714

> If I create a debugging pod
> where I run dig against the openshift-dns pods, I notice that I get response
> from the RHEL-ones, but not from the RHCOS ones. This might help narrow the
> issue.

Comment 41 David J. M. Karlsen 2021-03-23 00:17:51 UTC

@jcallen that one seems to be internal/closed to the public: "You are not authorized to access bug #1941714."

Comment 44 Ross Brattain 2021-03-29 20:48:55 UTC

No degraded operators on 4.8.0-0.ci-2021-03-28-220420 and 4.8.0-0.ci-2021-03-29-154349

VMC
Hypervisor:	VMware ESXi, 7.0.1, 17460241
Model:	Amazon EC2 i3en.metal-2tb
Processor Type:	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz


UDP offloads disabled

vHW 17
# for f in $(oc get nodes --no-headers -o custom-columns=N:.metadata.name ) ; do oc debug node/$f -- ethtool -k ens192 | grep udp_tnl | tee udp-$f & done

tx-udp_tnl-segmentation: off
tx-udp_tnl-csum-segmentation: off

vHW 17
# oc adm node-logs -l  kubernetes.io/os=linux -g udp.tnl | tee udp-tnl.log
Mar 29 10:50:48.692957 compute-0 root[1919]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 10:50:52.664168 compute-1 root[1663]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 10:52:30.458263 control-plane-0 root[1865]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 10:52:35.130729 control-plane-1 root[1705]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 10:52:40.255658 control-plane-2 root[1937]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
SDN

Mar 29 08:48:44.766826 compute-0 root[1543]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 08:51:22.308202 compute-1 root[1672]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 08:54:10.123662 control-plane-0 root[1683]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 08:59:20.426804 control-plane-1 root[1689]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 08:59:24.414762 control-plane-2 root[1696]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
OVN


4.8.0-0.ci-2021-03-29-154349

UDP offloads disabled
# vHW 14
udp-compute-0:tx-udp_tnl-segmentation: off
udp-compute-0:tx-udp_tnl-csum-segmentation: off
# vHW 15
udp-compute-1:tx-udp_tnl-segmentation: off
udp-compute-1:tx-udp_tnl-csum-segmentation: off
# vHW 13
udp-control-plane-0:tx-udp_tnl-segmentation: off [fixed]
udp-control-plane-0:tx-udp_tnl-csum-segmentation: off [fixed]
udp-control-plane-1:tx-udp_tnl-segmentation: off [fixed]
udp-control-plane-1:tx-udp_tnl-csum-segmentation: off [fixed]
udp-control-plane-2:tx-udp_tnl-segmentation: off [fixed]
udp-control-plane-2:tx-udp_tnl-csum-segmentation: off [fixed]


NAME                                       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.ci-2021-03-29-154349   True        False         False      57m
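
The per-node check above can be summarised with a small helper. This is a minimal sketch: the function name is hypothetical, and it assumes the input has the "udp-<node>:<feature>: <state>" shape shown in this comment (for example, the per-node files from the loop joined with `grep . udp-*`).

```shell
# Sketch: list nodes whose VXLAN (UDP tunnel) TX offloads are still enabled.
# Input (stdin): lines such as "udp-compute-0:tx-udp_tnl-segmentation: off".
nodes_with_udp_tnl_offload_on() {
  awk -F': ' '
    # $1 = "udp-<node>:<feature>", $2 = "on", "off" or "off [fixed]"
    $1 ~ /:tx-udp_tnl-(csum-)?segmentation$/ && $2 ~ /^on/ {
      node = $1
      sub(/^udp-/, "", node)   # drop the file-name prefix
      sub(/:.*/, "", node)     # drop the feature name
      print node
    }
  ' | sort -u
}
```

An empty result means both offloads are off (or fixed off) on every node in the input, which matches the state reported here.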

Comment 45 W. Trevor King 2021-03-31 03:39:03 UTC

Setting UpdateRecommendationsBlocked [1], since we have blocked 4.6->4.7 for this series since [2].

[1]: https://github.com/openshift/enhancements/pull/426
[2]: https://github.com/openshift/cincinnati-graph-data/pull/718

Comment 46 Joseph Callen 2021-04-07 13:58:49 UTC

Currently we have a single nested environment where we can reproduce this issue. Applying the same configuration to a physical environment is not reproducing the issue.

We need additional information. If those impacted could tell us:

1.) ESXi version w/build numbers
2.) Type of switch used (Standard, Distributed, NSX-T Opaque)
3.) Switch security/policy - Promiscuous mode, MAC address changes, Forged transmits
4.) CPU model (e.g. Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz)
5.) virtual hardware version (we know it must be past 14)

Once we can reproduce we can provide VMware the logs they are requesting.

Comment 50 arild.lager 2021-04-10 17:28:29 UTC

(In reply to Joseph Callen from comment #46)
> Currently we have a single nested environment where we can reproduce this
> issue. Applying the same configuration to a physical environment is not
> reproducing the issue.
> 
> We need additional information. If those impacted could tell us:
> 
> 1.) ESXi version w/build numbers
> 2.) Type of switch used (Standard, Distributed, NSX-T Opaque)
> 3.) Switch security/policy - Promiscuous mode, MAC address changes, Forged
> transmits
> 4.) CPU model (e.g. Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz)
> 5.) virtual hardware version (we know it must be past 14)
> 
> Once we can reproduce we can provide VMware the logs they are requesting.

Info from our environment related to Case 02893982:
1.) 6.7.0, 17167734
2.) Distributed
3.) 3* Rejected
4.) Intel Xeon Gold 6132 @ 2.60 GHz
5.) ESXi6.7 Upd 2 and later ( VM Version 15 )

Regards 
Arild Lager

Comment 51 Simon Foley 2021-04-26 22:23:09 UTC

@Dan Winship
         Can somebody be a bit more specific about the "apparently because of a kernel change between 4.6 and 4.7"?

The unverified workaround is to disable VXLAN offload in the vSphere network interface drivers:


https://access.redhat.com/solutions/5896081 

"nmcli con mod [connection id] ethtool.feature-tx-udp_tnl-segmentation off ethtool.feature-tx-udp_tnl-csum-segmentation off"


However I rebuilt my ESXi 7.0b Hypervisor with Proxmox 6.3-6 instead and I am still getting exactly the same problem with the VirtIO Network Interface drivers. 
The only difference in this case is that disabling VxLan offload does not seem to resolve the problem.
So there is something much more universal going on here. 

If the fix for the root cause lies in the network interface drivers, as a response to the kernel changes,
I'd like to log a bug with VirtIO as well, once I know which kernel changes triggered the issue.

Thx
Axel

Comment 52 Simon Foley 2021-04-27 09:30:55 UTC

Created attachment 1775888 [details]
kubeapi FIN and RST on kube API on virtIO NIC Emulation

Comment 53 Simon Foley 2021-04-27 10:10:56 UTC

Created attachment 1775890 [details]
kubeapi Less FIN and RST on kube API on vmxnet3 NIC Emulation

Comment 54 Simon Foley 2021-04-27 10:12:42 UTC

Created attachment 1775891 [details]
kubeapi Examples of fewer FIN and RST on kube API on vmxnet3 NIC Emulation

Comment 55 Simon Foley 2021-04-27 10:36:26 UTC

Hi guys,
        I have been doing some testing on Proxmox before I rebuild back to ESXi hypervisor using the latest openshift-installer 4.7.7 on a bare metal install.
I found that the issue was very prevalent when using virtIO, E1000 & RTL8139 NIC emulation on all fresh installs, to the extent that I could not get a bootstrap to complete.

When I sniffed the traffic from the HAProxy (bootstrap server) I could see the kubeapi reject the master nodes connection on TCP Port 6443 with FIN and RST.
This could be consistent with the packets in the bootstrap server getting dropped internally in the driver and the bootstrap server garbage collecting the TIME-WAIT sessions.

https://bugzilla.redhat.com/attachment.cgi?id=1775888

(192.168.100.123 is the HAProxy which you can infer is the bootstrap server responding to the master node 192.168.100.201)

However, I was able to get a significant improvement by using the vmxnet3 NIC driver/emulation.
The initial master -> bootstrap process worked without any issues or FIN/RST, and you can then see connections flip and new sessions inbound to the master node on TCP port 6443 work fine too.

https://bugzilla.redhat.com/attachment.cgi?id=1775890

The interesting thing is that one of the suggestions was to disable UDP checksum offload (see below). I tried that without any success on the virtIO, E1000 and RTL8139 NIC types.
As this was all TCP, I did not bother with the suggested workaround.

Now the bootstrap process did fail on the first attempt, with eventual FIN/RSTs coming from the bootstrap server to the master node:
https://bugzilla.redhat.com/attachment.cgi?id=1775891

time="2021-04-27T09:15:20+01:00" level=debug msg="Still waiting for the Kubernetes API: Get \"https://api.mycluster.openshift.lan:6443/version?timeout=32s\": net/http: TLS handshake timeout"
time="2021-04-27T09:15:40+01:00" level=debug msg="Still waiting for the Kubernetes API: an error on the server (\"\") has prevented the request from succeeding"

0427 09:49:49.168763    5618 reflector.go:127] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
I0427 09:50:37.220105    5618 trace.go:205] Trace[1916406559]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (27-Apr-2021 09:50:27.210) (total time: 10009ms):
Trace[1916406559]: [10.009902819s] [10.009902819s] END
E0427 09:50:37.220123    5618 reflector.go:127] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
ERROR Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: an error on the server ("") has prevented the request from succeeding (get clusteroperators.config.openshift.io) 


However, the bootstrap was able to complete on the second attempt, which was simply not happening even after two or three attempts with any of the other NIC emulation drivers.

I am not sure that the RH workaround is actually a workaround at all.

Instead I am going to try to disable TCP checksum offload to see if that makes any difference 

nmcli con mod [connection id]  ethtool.feature-tx-udp_tnl-segmentation off ethtool.feature-tx-udp_tnl-csum-segmentation off

If anybody (Dan I think mentioned it) has any hints as to what the kernel changes were that has been suggested as a trigger to this problem ... it would be appreciated.

Comment 56 Simon Foley 2021-04-27 11:12:38 UTC

Apologies, typo. I was going to try disabling:

nmcli con mod [connection id]  ethtool.feature-tx-tcp-segmentation off

I only have a basic Intel 82574L NIC for testing, so its support for any HW offload in a hypervisor environment is going to be slim.
There are a lot of options to disable here (see below), and I am not sure whether the reported problem depends on the NIC supporting the offload or not!
It seems pointless for me to install a Mellanox or SolarFlare NIC until I know more, as I can recreate the problem on a basic Intel 82574L gigabit NIC on both ESXi and Proxmox!

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/configuring-ethtool-offload-features_configuring-and-managing-networking

Comment 57 Dan Winship 2021-04-27 12:41:33 UTC

(In reply to Simon Foley from comment #51)
> However I rebuilt my ESXi 7.0b Hypervisor with Proxmox 6.3-6 instead and I
> am still getting exactly the same problem with the VirtIO Network Interface
> drivers. 
> 
> The only difference in this case is that disabling VxLan offload does not
> seem to resolve the problem.

You are not seeing this bug. You're seeing something completely unrelated. Please file a new bug.

Comment 58 Simon Foley 2021-05-01 20:43:34 UTC

Hi Dan,
"You are not seeing this bug. You're seeing something completely unrelated. Please file a new bug"

You may be correct, but I am not convinced that the solution suggested is actually the *real* cause of the problems people are seeing on this and other bugs.

Have you actually tried to install OpenShift 4.7 on private bare metal hardware? IT DOES NOT WORK, PERIOD! It has been broken ever since 4.7 was released, for the last 3 months. Try an install from bare metal.

I have spent 3 months and over 30 installations with packet captures running to figure out why RH has so badly broken OpenShift 4.7.

There is without question something very wrong going on within RHCOS 4.7 (possibly RHEL 8.3), and there is packet loss happening in the OS (my hunch is that it's RHCOS).
All the traces show that the app stack thinks SYN flooding is happening, and there is garbage collection on the TIME_WAIT sessions.

This is why I was interested in trying to recreate this bug, as I did not believe it was the root cause.

FYI I can see this happening on *all* the ESXi 7+ releases and the ESXi 6.7 releases. I was working my way down towards the suggestion that ESXi 6.5 with HW version 13 may work (I have just finished my testing on the 6.7 releases).

The reason I am suspicious of the claim in this ticket that vmxnet3 in ESXi 6.7+ is the root cause is that *every* supported hardware version in Proxmox and *every* combination of its NIC emulation also has NO impact on 4.7 being able to install on bare metal.

In fact, vmxnet3 in Proxmox reduces the packet loss in RHCOS/OpenShift 4.7 (which I am assuming is the root cause) compared to other NIC emulations when running OpenShiftSDN as the network layer.

Let's be very clear here: all problems disappear if you abandon networkType: OpenShiftSDN (with vmxnet3) and change it in your install YAML file to networkType: OVNKubernetes.

Do this and it will always install!

So this to me looks like an issue in how RHCOS interacts with vmxnet3 under the OpenShiftSDN network layer, and it has nothing to do with VMware hardware versions directly.

So I have gone out of my way to see if I can recreate the problem in this BZ, as it is contrary to my hypothesis.

I have tried various NICs (Mellanox, SolarFlare, Intel) with various degrees of support for SR-IOV and offload capabilities (some that support it and some that don't).

I have tried the suggested workaround (disabling UDP checksum offload); in all cases it does not prevent OpenShift failing to install on bare metal using networkType: OpenShiftSDN in a virtualised environment.

The packet loss in RHCOS seems to be sporadic and inconsistent, so I am wondering if the solution here was a false positive:
people were not trying to install from bare metal, and the disabling of checksum offloading just happened to coincide with the issue not occurring at that time on a prebuilt cluster.

Try installing from *scratch* using the latest RHCOS and OpenShift 4.7 on your same hardware from bare metal, and see if disabling UDP checksum offload actually helps. I don't believe it will.

Hence I was politely asking what "kernel" changes people believed were the root cause of this BZ, so I could investigate and see if it was a false positive, as I suspect.

Thx
Axel

Comment 59 Joseph Callen 2021-05-03 13:13:53 UTC

See:
https://github.com/torvalds/linux/commits/master/drivers/net/vmxnet3
https://github.com/torvalds/linux/tree/a31135e36eccd0d16e500d3041f23c3ece62096f/drivers/net/vmxnet3

Search: VMXNET3_REV_4
https://access.redhat.com/labs/rhcb/RHEL-8.3/kernel-4.18.0-240.el8/sources/raw/drivers/net/vmxnet3/vmxnet3_drv.c

Directly from VMware:
https://bugzilla.redhat.com/show_bug.cgi?id=1941714#c24

Comment 60 Dan Winship 2021-05-03 13:25:02 UTC

(In reply to Simon Foley from comment #58)
> Hi Dan,
>      "You are not seeing this bug. You're seeing something completely
> unrelated. Please file a new bug"
> 
> You may be correct, but I am not convinced that the solution suggested is
> actually the *real* cause of the problems people are seeing on this and
> other bugs.

This bug is not tracking all possible cases of "installs on vmware fail". It is tracking a specific failure originally reported by a specific person which was then root caused to a specific problem which has been confirmed to exist by VMware; the vmxnet3 kernel driver expects VXLAN offload to work for all VXLAN packets, but the underlying virtual hardware only supports it on specific ports, and so as a result, in certain configurations, OCP-on-VMware ends up sending un-checksummed VXLAN packets to the underlying virtual network, and those packets then get dropped because they have invalid checksums. As a result, all TCP-over-VXLAN and UDP-over-VXLAN traffic is dropped (though IIRC ICMP-over-VXLAN and ARP-over-VXLAN work). Disabling VXLAN offload fixes the bug.

No one is claiming that this bug report covers all possible cases of installation failures on VMware, or even all possible cases of VXLAN-related installation failures on VMware. However, there was a specific reproducible install failure which is now fixed.
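
The node logs in comment 44 show a hook named 99-vsphere-disable-tx-udp-tnl firing when ens192 comes up, which appears to be a NetworkManager dispatcher script. A minimal sketch of how such a hook could disable the two offloads follows. It is an illustration under assumptions, not the script shipped by the machine-config-operator pull requests linked on this bug; the function name udp_tnl_offload_cmds and the split between deciding and executing are hypothetical.

```shell
#!/bin/bash
# Sketch of a NetworkManager dispatcher hook that turns off VXLAN (UDP tunnel)
# TX offloads on vmxnet3 interfaces. NOT the shipped machine-config-operator file.
# NetworkManager calls dispatcher scripts with $1 = interface and $2 = action.

# Print the ethtool commands needed for one (interface, action, driver) event.
# Kept separate from execution so the decision logic can be checked without root.
udp_tnl_offload_cmds() {
  local iface="$1" action="$2" driver="$3"
  # Only act when a vmxnet3 interface comes up.
  if [ "$action" = "up" ] && [ "$driver" = "vmxnet3" ]; then
    echo "ethtool -K $iface tx-udp_tnl-segmentation off"
    echo "ethtool -K $iface tx-udp_tnl-csum-segmentation off"
  fi
}

# Example wiring (commented out; assumes ethtool and sysfs are available):
# driver=$(basename "$(readlink -f "/sys/class/net/$1/device/driver")")
# udp_tnl_offload_cmds "$1" "$2" "$driver" | while read -r cmd; do $cmd; done
```

Restricting the hook to the vmxnet3 driver keeps it from touching other NIC types, where, per the discussion above, this particular offload problem has not been confirmed.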

> Have you actually tried to install Openshift 4.7 on private bare metal
> hardware ? IT DOES NOT WORK PERIOD! It has been broken ever since 4.7 was
> released for the last 3 months. try an install from bare metal.

Lots of people have done this successfully... I'm not sure what's going wrong in your environment...

(In reply to Joseph Callen from comment #59)
> Directly from VMware:
> https://bugzilla.redhat.com/show_bug.cgi?id=1941714#c24

(that's a private bug, but the comment is basically saying what I just said above)

Comment 65 Immanuvel 2021-05-31 05:56:55 UTC

Hi Team,

Please let us know if anything of interest needs to be shared with the customer regarding this bug.

Thanks 
IMMANUVEL M

Comment 68 errata-xmlrpc 2021-07-27 22:51:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 69 johannes 2022-11-04 08:00:35 UTC

This is happening while upgrading from 4.10 -> 4.11 also...
And while upgrading from 4.11.0-0.okd-2022-08-20-022919 to 4.11.0-0.okd-2022-10-28-153352
ESXi 7.0 U2 with HW version 19

The workaround that is working for us:
https://access.redhat.com/solutions/5997331
