Hi there, I have checked the attached must-gather. It looks like openshift-apiserver-operator is in a degraded state because kube-apiserver is not able to reach the aggregated API servers (openshift-apiserver). The kube-apiserver logs suggest the server is not able to create a network connection to the downstream servers; there are numerous "dial tcp ... i/o timeout" errors. I'm assigning this to the network team to help diagnose a potential network issue.
Possibly the same as https://bugzilla.redhat.com/show_bug.cgi?id=1935591
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or from 4.y.z to 4.y.z+1
Created attachment 1764604 [details] iptables-master3
I am not the assignee, and my hold on this is pretty tenuous, but since we just blocked all 4.6->4.7 edges on this bug [1], here's my attempt at an impact statement per comment 13's template:

Who is impacted?
  vSphere running 4.7 on HW 14 and later. So block 4.6->4.7 to protect clusters running 4.6 who are currently not affected.

What is the impact?
  Cross-node pod-to-pod connections are unreliable. Kube API-server degrades, or maybe just access to it? Or something? Eventually things like the auth operator get mad and go Available=False. Actual impact on the cluster is all the stuff that happens downstream of flaky Kube-API access...

How involved is remediation?
  Still working this angle, but it won't be pretty. Possibly folks can move VMs to a non-vulnerable vSphere version?

Is this a regression?
  Yup, doesn't seem to impact 4.6 or earlier.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/718
(In reply to W. Trevor King from comment #21)
> I am not the assignee, and my hold on this is pretty tenuous, but since we
> just blocked all 4.6->4.7 edges on this bug [1], here's my attempt at an
> impact statement per comment 13's template:
>
> Who is impacted?
>   vSphere running 4.7 on HW 14 and later. So block 4.6->4.7 to protect
>   clusters running 4.6 who are currently not affected.
>
> What is the impact?
>   Cross-node pod-to-pod connections are unreliable. Kube API-server
>   degrades, or maybe just access to it? Or something? Eventually things like
>   the auth operator get mad and go Available=False. Actual impact on the
>   cluster is all the stuff that happens downstream of flaky Kube-API access...

Cross-node pod-to-pod or host-to-pod connections fail because packets are being dropped by vSphere between the nodes (apparently because of checksum errors, apparently because of a kernel change between 4.6 and 4.7). The problem only affects VXLAN traffic, not any other node-to-node traffic. (Well, AFAIK it's not clear at this time whether it affects Geneve traffic, as used by ovn-kubernetes.)

The visible effect in terms of clusteroperators/telemetry/etc. is that kube-apiserver stops being able to talk to openshift-apiserver, and then failures cascade from there. But all cross-node pod-to-pod traffic (including customer workload traffic) is actually broken.

(The "unreliable" in the comment above is wrong; things are totally broken. I got confused earlier because I saw there was still some VXLAN traffic going between nodes, so I thought that for some reason some packets were getting through but others weren't. But really what was happening is that ARP and ICMP packets over VXLAN get through, while TCP and UDP packets over VXLAN don't. So basically everything is broken.)
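For anyone trying to confirm they are seeing this particular failure mode, a minimal check along these lines shows the asymmetry described above (a sketch only: the pod names, IP and port are examples, the source pod's image is assumed to ship ping and curl, and the target pod must be on a different node than the source pod):

# ICMP over VXLAN still gets through even when the bug is present:
oc rsh <source-pod> ping -c3 <pod-ip-on-another-node>
# ...but TCP over VXLAN does not, so this times out:
oc rsh <source-pod> curl -m5 http://<pod-ip-on-another-node>:<port>/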
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1936556 machine-api is periodically unable to connect to vsphere API with 'no route to host'.
Should this BZ be moved to the RHEL component instead? I have added the VMware folks; if we can make the appropriate comments public, they will be able to view them.
*** Bug 1941322 has been marked as a duplicate of this bug. ***
*** Bug 1926345 has been marked as a duplicate of this bug. ***
VMware is requesting that every customer who has hit this issue create a corresponding support request with VMware and retrieve the appropriate logs and packet traces.
@jcallen I believe we are hit by this one.

VMware 6.7u3, HW level 15, went from 4.6.20 to 4.7.1; case 02893982 in the RH support portal.

When one opens cases with VMware they are typically eager to just close them, or will close them as "not related to VMware - it's an OpenShift issue". How are they supposed to be reported to VMware, and what forensics should be added?
More info: when I saw this issue I started debugging a bit. Our workers are RHEL, while the masters are RHCOS. If I create a debugging pod and run dig against the openshift-dns pods, I notice that I get a response from the RHEL ones, but not from the RHCOS ones. This might help narrow the issue.
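For reference, a check along these lines exercises the same path (a sketch: the image, the package install and the placeholders are just one way to get dig into a throwaway pod, 5353 is, as far as I can tell, the port the CoreDNS pods listen on, and the query only crosses VXLAN when the test pod lands on a different node than the DNS pod):

# DNS pod IPs and the nodes they run on:
oc -n openshift-dns get pods -o wide

# Query one DNS pod on a RHEL node and one on a RHCOS node and compare:
oc run digtest --rm -it --restart=Never --image=registry.access.redhat.com/ubi8/ubi -- \
  bash -c 'yum -y -q install bind-utils && dig +short -p 5353 @<dns-pod-ip> kubernetes.default.svc.cluster.local'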
(In reply to David J. M. Karlsen from comment #35)
> @jcallen I believe we are hit by this one.
> VMware 6.7u3, HW level 15, went from 4.6.20 to 4.7.1; case 02893982 in the RH
> support portal.
> When one opens cases with VMware they are typically eager to just close them,
> or will close them as "not related to VMware - it's an OpenShift issue". How
> are they supposed to be reported to VMware, and what forensics should be added?

Could you upload the host support bundle (using the vm-support command on the host) and also mention the port ID of the VM where the VXLAN traffic is initiated?
Created attachment 1765394 [details] vmsupport Note: this is after turning off checksum on iface (in order to get the install to work)
(In reply to David J. M. Karlsen from comment #38)
> Created attachment 1765394 [details]
> vmsupport
>
> Note: this is after turning off checksum on iface (in order to get the
> install to work)

This won't help then, as it does not have the issue. Please reproduce the issue by enabling it on a test setup and collect a support bundle for investigation.
Hi David,

(In reply to David J. M. Karlsen from comment #36)
> More info: when I saw this issue I started debugging a bit. Our workers are
> RHEL, while the masters are RHCOS.

Hmm, RHEL workers would be RHEL 7, so I suspect they may not have the updated vmxnet3 driver. Here is the corresponding RHEL kernel BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1941714

> If I create a debugging pod and run dig against the openshift-dns pods, I
> notice that I get a response from the RHEL ones, but not from the RHCOS
> ones. This might help narrow the issue.
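If it helps to confirm the driver theory, the driver actually in use on a RHEL worker can be checked with something like this (a sketch; the interface name is an example):

ethtool -i ens192              # shows driver: vmxnet3 plus its version
modinfo vmxnet3 | head -n 5    # module metadata from the installed kernel
uname -r                       # running kernel, to compare against the fixed kernel in the RHEL BZ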
@jcallen that one seems internal/closed for public "You are not authorized to access bug #1941714. "
No degraded operators on 4.8.0-0.ci-2021-03-28-220420 and 4.8.0-0.ci-2021-03-29-154349.

VMC
Hypervisor: VMware ESXi, 7.0.1, 17460241
Model: Amazon EC2 i3en.metal-2tb
Processor Type: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

UDP offloads disabled
vHW 17
# for f in $(oc get nodes --no-headers -o custom-columns=N:.metadata.name ) ; do oc debug node/$f -- ethtool -k ens192 | grep udp_tnl | tee udp-$f & done
tx-udp_tnl-segmentation: off
tx-udp_tnl-csum-segmentation: off

vHW 17
# oc adm node-logs -l kubernetes.io/os=linux -g udp.tnl | tee udp-tnl.log
Mar 29 10:50:48.692957 compute-0 root[1919]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 10:50:52.664168 compute-1 root[1663]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 10:52:30.458263 control-plane-0 root[1865]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 10:52:35.130729 control-plane-1 root[1705]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 10:52:40.255658 control-plane-2 root[1937]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.

SDN
Mar 29 08:48:44.766826 compute-0 root[1543]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 08:51:22.308202 compute-1 root[1672]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 08:54:10.123662 control-plane-0 root[1683]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 08:59:20.426804 control-plane-1 root[1689]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.
Mar 29 08:59:24.414762 control-plane-2 root[1696]: 99-vsphere-disable-tx-udp-tnl triggered by up on device ens192.

OVN 4.8.0-0.ci-2021-03-29-154349
UDP offloads disabled
# vHW 14
udp-compute-0:tx-udp_tnl-segmentation: off
udp-compute-0:tx-udp_tnl-csum-segmentation: off
# vHW 15
udp-compute-1:tx-udp_tnl-segmentation: off
udp-compute-1:tx-udp_tnl-csum-segmentation: off
# vHW 13
udp-control-plane-0:tx-udp_tnl-segmentation: off [fixed]
udp-control-plane-0:tx-udp_tnl-csum-segmentation: off [fixed]
udp-control-plane-1:tx-udp_tnl-segmentation: off [fixed]
udp-control-plane-1:tx-udp_tnl-csum-segmentation: off [fixed]
udp-control-plane-2:tx-udp_tnl-segmentation: off [fixed]
udp-control-plane-2:tx-udp_tnl-csum-segmentation: off [fixed]

NAME             VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.ci-2021-03-29-154349   True        False         False      57m
Setting UpdateRecommendationsBlocked [1], since we have blocked 4.6->4.7 for this series since [2]. [1]: https://github.com/openshift/enhancements/pull/426 [2]: https://github.com/openshift/cincinnati-graph-data/pull/718
Currently we have a single nested environment where we can reproduce this issue. Applying the same configuration to a physical environment does not reproduce it.

We need additional information. If those impacted could tell us:

1.) ESXi version w/ build numbers
2.) Type of switch used (Standard, Distributed, NSX-T Opaque)
3.) Switch security/policy - Promiscuous mode, MAC address changes, Forged transmits
4.) CPU model (e.g. Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz)
5.) Virtual hardware version (we know it must be past 14)

Once we can reproduce it we can provide VMware the logs they are requesting.
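For anyone collecting the items above, some of them can be pulled quickly from the ESXi host and the guests; the rest (switch type and security policy) are easiest to read from the vSphere client (a sketch; the interface name and .vmx path are examples):

vmware -vl                              # on the ESXi host: version and build number (item 1)
ethtool -i ens192                       # in the guest: confirms the vmxnet3 driver and version in use
grep virtualHW.version <vm-name>.vmx    # on the datastore: virtual hardware version (item 5)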
(In reply to Joseph Callen from comment #46)
> We need additional information. If those impacted could tell us:
>
> 1.) ESXi version w/ build numbers
> 2.) Type of switch used (Standard, Distributed, NSX-T Opaque)
> 3.) Switch security/policy - Promiscuous mode, MAC address changes, Forged transmits
> 4.) CPU model (e.g. Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz)
> 5.) Virtual hardware version (we know it must be past 14)

Info from our environment, related to case 02893982:

1.) 6.7.0, 17167734
2.) Distributed
3.) All three set to Reject
4.) Intel Xeon Gold 6132 @ 2.60 GHz
5.) ESXi 6.7 Update 2 and later (VM version 15)

Regards
Arild Lager
@Dan Winship: can somebody be a bit more specific about the "apparently because of a kernel change between 4.6 and 4.7"?

The unverified workaround of disabling VXLAN offload in the vSphere network interface drivers is https://access.redhat.com/solutions/5896081:

"nmcli con mod [connection id] ethtool.feature-tx-udp_tnl-segmentation off ethtool.feature-tx-udp_tnl-csum-segmentation off"

However, I rebuilt my ESXi 7.0b hypervisor with Proxmox 6.3-6 instead and I am still getting exactly the same problem with the VirtIO network interface drivers. The only difference in this case is that disabling VXLAN offload does not seem to resolve the problem. So there is something much more universal going on here.

If the solution to the root cause lies in the network interface drivers as a response to the kernel changes, I'd like to log a bug with VirtIO as well, if I know what the kernel changes were that triggered the issue.

Thx
Axel
Created attachment 1775888 [details] kubeapi FIN and RST on kube API on virtIO NIC Emulation
Created attachment 1775890 [details] kubeapi Less FIN and RST on kube API on vmxnet3 NIC Emulation
Created attachment 1775891 [details] kubeapi Examples of fewer FIN and RST on kube API on vmxnet3 NIC Emulation
Hi guys,

I have been doing some testing on Proxmox before I rebuild back to an ESXi hypervisor, using the latest openshift-installer 4.7.7 on a bare-metal install.

I found that the issue was very prevalent when using virtIO, E1000 and RTL8139 NIC emulation on all fresh installs, to the extent that I could not get a bootstrap to complete. When I sniffed the traffic from the HAProxy (bootstrap server) I could see the kube API reject the master node's connection on TCP port 6443 with FIN and RST. This could be consistent with the packets on the bootstrap server getting dropped internally in the driver and the bootstrap server garbage-collecting the TIME_WAIT sessions.

https://bugzilla.redhat.com/attachment.cgi?id=1775888
(192.168.100.123 is the HAProxy, from which you can infer that the bootstrap server is responding to the master node 192.168.100.201)

However, I was able to get a significant improvement by using the vmxnet3 NIC driver/emulation. The initial master -> bootstrap process worked without any issues or FIN/RST, and you can then see connections flip and new sessions inbound to the master node on TCP port 6443 work fine too.

https://bugzilla.redhat.com/attachment.cgi?id=1775890

The interesting thing is that one of the suggestions was to disable UDP checksum offload (see below). I tried that without any success on the virtIO/E1000/RTL8139 NIC types. As this was all TCP, I did not bother further with the suggested workaround, since all my traffic issues were TCP.

Now, the bootstrap process did fail on the first attempt, with eventual FIN/RSTs coming from the bootstrap server to the master node:

https://bugzilla.redhat.com/attachment.cgi?id=1775891

time="2021-04-27T09:15:20+01:00" level=debug msg="Still waiting for the Kubernetes API: Get \"https://api.mycluster.openshift.lan:6443/version?timeout=32s\": net/http: TLS handshake timeout"
time="2021-04-27T09:15:40+01:00" level=debug msg="Still waiting for the Kubernetes API: an error on the server (\"\") has prevented the request from succeeding"
0427 09:49:49.168763 5618 reflector.go:127] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
I0427 09:50:37.220105 5618 trace.go:205] Trace[1916406559]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (27-Apr-2021 09:50:27.210) (total time: 10009ms):
Trace[1916406559]: [10.009902819s] [10.009902819s] END
E0427 09:50:37.220123 5618 reflector.go:127] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
ERROR Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: an error on the server ("") has prevented the request from succeeding (get clusteroperators.config.openshift.io)

However, the bootstrap was able to complete on the second attempt, which was simply not happening after two or three tries with any of the other NIC emulation driver interfaces.

I am not sure that the RH workaround is actually a workaround at all. Instead I am going to try disabling TCP checksum offload to see if that makes any difference:

nmcli con mod [connection id] ethtool.feature-tx-udp_tnl-segmentation off ethtool.feature-tx-udp_tnl-csum-segmentation off

If anybody (Dan I think mentioned it) has any hints as to what the kernel changes were that have been suggested as a trigger to this problem,
it would be appreciated.
Apologies, typo: I was going to try disabling

nmcli con mod [connection id] ethtool.feature-tx-tcp-segmentation off

I only have a basic Intel 82574L NIC for testing, so its support for any HW offload in a hypervisor environment is going to be slim. There are a lot of options to disable here (see below) and I am not sure whether the reported problem depends on your NIC supporting the offload or not. It seems pointless to install a Mellanox or SolarFlare NIC until I know more, as I can recreate the problem on a basic Intel 82574L gigabit NIC on both ESXi and Proxmox.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/configuring-ethtool-offload-features_configuring-and-managing-networking
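For what it's worth, the current state of each offload can be inspected and toggled per feature before deciding what to persist (a sketch; interface and connection names are examples, and a runtime ethtool change is lost on reboot):

ethtool -k ens192       # list every offload feature and whether it is on, off, or "[fixed]" (not changeable by the driver)
ethtool -K ens192 tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off     # runtime toggle for a quick test
nmcli con mod <connection-id> ethtool.feature-tx-udp_tnl-segmentation off ethtool.feature-tx-udp_tnl-csum-segmentation off     # persist via NetworkManager, as in the KB article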
(In reply to Simon Foley from comment #51) > However I rebuilt my ESXi 7.0b Hypervisor with Proxmox 6.3-6 instead and I > am still getting exactly the same problem with the VirtIO Network Interface > drivers. > > The only difference in this case is that disabling VxLan offload does not > seem to resolve the problem. You are not seeing this bug. You're seeing something completely unrelated. Please file a new bug.
Hi Dan,

"You are not seeing this bug. You're seeing something completely unrelated. Please file a new bug"

You may be correct, but I am not convinced that the solution suggested is actually the *real* cause of the problems people are seeing on this and other bugs.

Have you actually tried to install OpenShift 4.7 on private bare-metal hardware? IT DOES NOT WORK, PERIOD! It has been broken ever since 4.7 was released, for the last 3 months. Try an install from bare metal. I have spent 3 months and over 30 installations with packet captures running to figure out why RH has so badly broken OpenShift 4.7.

There is without question something very wrong going on within RHCOS 4.7 (possibly RHEL 8.3), and there is packet loss happening in the OS (my hunch is RHCOS). All the traces show that the app stack thinks SYN flooding is happening and there is garbage collection on the TIME_WAIT sessions. This is why I was interested in trying to recreate this bug, as I did not believe it was the root cause.

FYI, I can see this happening on *all* the ESXi 7+ releases and the ESXi 6.7 releases. I was working my way down to the suggestion that ESXi 6.5 with HW version 13 may work (I have just finished my testing on the 6.7 releases).

Why I am suspicious that the proposed solution in this ticket (vmxnet3 on ESXi 6.7+ as the root cause) is that *every* supported hardware version in Proxmox and *every* combination of their NIC emulation also fails to let 4.7 install on bare metal. In fact, vmxnet3 in Proxmox reduces (what I am assuming is the root cause) packet loss in RHCOS OpenShift 4.7 compared to other NIC emulations when running OpenShiftSDN as the network layer.

Let's be very clear here: all problems disappear if you abandon the OpenShiftSDN network type (networkType: OpenShiftSDN) and in your install YAML file change it to networkType: OVNKubernetes. Do this and it will always install! So this to me looks like an issue in how RHCOS interacts with vmxnet3 under the OpenShiftSDN network type, and it has nothing to do with VMware hardware versions directly.

So I have gone out of my way to see if I can recreate the problem in this BZ, as it is contrary to my hypothesis. I have tried various NIC cards (Mellanox, SolarFlare, Intel) with various degrees of support for SR-IOV and offload capabilities (ones that support it and ones that don't). I have tried the suggested workaround (disabling UDP checksum offload); in all cases it does not prevent OpenShift failing to install on bare metal using networkType: OpenShiftSDN in a virtualised environment.

The packet loss in RHCOS seems to be sporadic and inconsistent, so I am wondering if the solution here was a false positive: people were not trying to install from bare metal, and disabling checksum offloading just happened to coincide with the issue not occurring at that time on a prebuilt cluster.

Try installing from *scratch* using the latest RHCOS and OpenShift 4.7 on your same hardware, from bare metal, and see if disabling the UDP checksum offload actually helps. I don't believe it will.

Hence I was politely asking what "kernel" changes people believed were the root cause of this BZ, so I could investigate and see if it was a false positive as I suspect.

Thx
Axel
See:
https://github.com/torvalds/linux/commits/master/drivers/net/vmxnet3
https://github.com/torvalds/linux/tree/a31135e36eccd0d16e500d3041f23c3ece62096f/drivers/net/vmxnet3

Search for VMXNET3_REV_4 in:
https://access.redhat.com/labs/rhcb/RHEL-8.3/kernel-4.18.0-240.el8/sources/raw/drivers/net/vmxnet3/vmxnet3_drv.c

Directly from VMware:
https://bugzilla.redhat.com/show_bug.cgi?id=1941714#c24
(In reply to Simon Foley from comment #58)
> Hi Dan,
>
> "You are not seeing this bug. You're seeing something completely
> unrelated. Please file a new bug"
>
> You may be correct, but I am not convinced that the solution suggested is
> actually the *real* cause of the problems people are seeing on this and
> other bugs.

This bug is not tracking all possible cases of "installs on vmware fail". It is tracking a specific failure originally reported by a specific person, which was then root-caused to a specific problem that has been confirmed to exist by VMware: the vmxnet3 kernel driver expects VXLAN offload to work for all VXLAN packets, but the underlying virtual hardware only supports it on specific ports, and so, in certain configurations, OCP-on-VMware ends up sending un-checksummed VXLAN packets to the underlying virtual network, and those packets then get dropped because they have invalid checksums. As a result, all TCP-over-VXLAN and UDP-over-VXLAN traffic is dropped (though IIRC ICMP-over-VXLAN and ARP-over-VXLAN work). Disabling VXLAN offload fixes the bug.

No one is claiming that this bug report covers all possible cases of installation failures on VMware, or even all possible cases of VXLAN-related installation failures on VMware. However, there was a specific reproducible install failure which is now fixed.

> Have you actually tried to install OpenShift 4.7 on private bare-metal
> hardware? IT DOES NOT WORK, PERIOD! It has been broken ever since 4.7 was
> released, for the last 3 months. Try an install from bare metal.

Lots of people have done this successfully... I'm not sure what's going wrong in your environment...

(In reply to Joseph Callen from comment #59)
> Directly from VMware:
> https://bugzilla.redhat.com/show_bug.cgi?id=1941714#c24

(that's a private bug, but the comment is basically saying what I just said above)
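For anyone who wants to confirm they are hitting this specific drop rather than some other installation problem, capturing the VXLAN traffic on both ends makes it fairly obvious (a sketch; the interface name is an example, and 4789 is the VXLAN UDP port used by OpenShift SDN):

# On the sending node you see the encapsulated TCP flows leaving:
tcpdump -nn -i ens192 udp port 4789
# On the receiving node the same flows never arrive, while ARP/ICMP-over-VXLAN
# packets from the same sender still show up.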
Hi Team,

Please let us know if there is anything of interest that needs to be shared with the customer regarding this bug.

Thanks
IMMANUVEL M
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
This is also happening while upgrading from 4.10 to 4.11, and while upgrading from 4.11.0-0.okd-2022-08-20-022919 to 4.11.0-0.okd-2022-10-28-153352, on ESXi 7.0 U2 with HW version 19.

The workaround that is working for us: https://access.redhat.com/solutions/5997331