** A NOTE ABOUT USING URGENT ** This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold. Please have reasonable justification ready to discuss, and ensure your own management and engineering management are aware of and agree that this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility. NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.
kube-apiserver has no connectivity to the aggregated apiservers, e.g. from master 1:

~~~
2021-07-28T13:36:45.938423139Z E0728 13:36:45.938066 20 controller.go:116] loading OpenAPI spec for "v1.route.openshift.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: error trying to reach service: dial tcp 10.130.0.6:8443: connect: no route to host
~~~

Similar messages appear in the other kube-apiserver instances, and (!) for different aggregated apiservers (metrics, oauth, ...). Looks like pod networking is down. At the same time the openshift-apiserver (the one I checked) is up and happy (it provides route.openshift.io among other APIs). Moving to networking.
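For anyone triaging a similar report, a minimal sketch of how to confirm the failure is at the pod-network layer; the service IP and port are taken from the error above, and the master node name is a placeholder to adjust for your cluster:

~~~
# Run from an affected master (e.g. via "oc debug node/<master>").
# A "no route to host" here, rather than a TLS or HTTP error, points at
# pod networking rather than the aggregated apiserver itself.
curl -kv --connect-timeout 5 https://10.130.0.6:8443/healthz
~~~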
more ./quay-io-openshift-origin-must-gather-sha256-e5e5166f37d7bd043f25276ad450f7aa57d96604e8c1a6c26ab42a9253689079/cluster-scoped-resources/config.openshift.io/infrastructures.yaml

~~~
---
apiVersion: config.openshift.io/v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2021-07-28T12:52:57Z"
    generation: 1
    name: cluster
    resourceVersion: "659"
    uid: e58515fa-5dfc-4399-ae0d-1ac422d8792e
  spec:
    cloudConfig:
      key: config
      name: cloud-provider-config
    platformSpec:
      type: VSphere
  status:
    apiServerInternalURI: https://api-int.marineprod.scotland.gov.uk:6443
    apiServerURL: https://api.marineprod.scotland.gov.uk:6443
~~~

It seems that the internal URL used to expose the apiserver, https://api-int.marineprod.scotland.gov.uk:6443, is not reachable, causing a cascade of network failures. This URL resolves to 192.168.24.116:

~~~
Get "https://[api-int.marineprod.scotland.gov.uk]:6443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-8-3master.marineprod.scotland.gov.uk?timeout=1m0s": dial tcp 192.168.24.116:6443
~~~

Can you verify that URL is working correctly?
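A minimal sketch for checking that URL from the bastion or a node; the hostname is taken from the must-gather above, and /readyz is the standard kube-apiserver health endpoint:

~~~
# Confirm DNS resolution and basic TCP/TLS reachability of the internal API endpoint.
dig +short api-int.marineprod.scotland.gov.uk
curl -k --connect-timeout 5 https://api-int.marineprod.scotland.gov.uk:6443/readyz
~~~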
Hello Team,

One of our customers is having the same issue: after upgrading the cluster from 4.7.21 to 4.8.3 (running on vSphere as a disconnected UPI install), the openshift-apiserver operator is degraded with the following errors.

oc get co openshift-apiserver -o yaml

~~~
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
  creationTimestamp: "2021-06-24T10:34:19Z"
  generation: 1
  name: openshift-apiserver
  resourceVersion: "20095132"
  uid: 8aff4f02-b91e-495a-93bb-2cc3b0f88045
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-08-03T19:10:58Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-08-06T11:35:15Z"
    message: 'APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver:
      0/3 pods have been updated to the latest generation'
    reason: APIServerDeployment_PodsUpdating
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-08-05T17:02:44Z"
    message: |-
      APIServicesAvailable: "apps.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "security.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
      APIServicesAvailable: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
    reason: APIServices_Error
    status: "False"
    type: Available
  - lastTransitionTime: "2021-06-24T10:36:24Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: openshiftapiservers
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-apiserver-operator
    resource: namespaces
  - group: ""
    name: openshift-apiserver
    resource: namespaces
  - group: ""
    name: openshift-etcd-operator
    resource: namespaces
  - group: ""
    name: host-etcd-2
    namespace: openshift-etcd
    resource: endpoints
  - group: controlplane.operator.openshift.io
    name: ""
    namespace: openshift-apiserver
    resource: podnetworkconnectivitychecks
  - group: apiregistration.k8s.io
    name: v1.apps.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.authorization.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.build.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.image.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.project.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.quota.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.route.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.security.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.template.openshift.io
    resource: apiservices
  versions:
  - name: operator
    version: 4.8.3
  - name: openshift-apiserver
    version: 4.8.3
~~~

We tried the following workaround but had no luck: https://access.redhat.com/solutions/5896081

The endpoints of the openshift-apiserver pods on port 8443 are not accessible across nodes, i.e. on master1 only the endpoint for the openshift-apiserver pod running on that node was accessible. The cluster upgrade had completed fully before this issue appeared. I will be attaching the must-gather.
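For reference, a rough sketch of the cross-node check described above, assuming the openshift-apiserver service is named `api`; the pod IP is a placeholder to be filled in from the endpoints output:

~~~
# List the openshift-apiserver pod endpoints, then probe each pod IP on
# port 8443 from a given master; in this failure mode only the pod local
# to that master responds.
oc get endpoints api -n openshift-apiserver -o yaml
oc debug node/master1 -- chroot /host curl -k --connect-timeout 5 https://<pod-ip>:8443/healthz
~~~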
Hello Antonio,

The customer disabled offloading on the primary NIC on all the nodes, but the issue still persists.

Regards,
Ayush Garg
Hi Ronak,

Do you have any idea why we are required to disable `tx-checksum-ip-generic`? This looks similar to the previous VMXNET3 issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1941714
***
*** Every customer attached to this case needs to open an immediate support case with VMware ***
***

We _need_ the following:
- vSphere version with build numbers
- Switch type
- Virtual machine hardware version
*** Bug 1997292 has been marked as a duplicate of this bug. ***
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Who is impacted?
  OpenShift 4.7.24+ and 4.8 clusters running atop vSphere HW14, both new installs and upgrades to the affected versions.

What is the impact? Is it serious enough to warrant blocking edges?
  SDN packet loss resulting in service unavailability.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  The final remediation is unknown, but a workaround of disabling tx-checksum-ip-generic has been shown to improve the situation.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  Yes, 4.7.24+ and 4.8.2+ are known to be affected.

This is an initial assessment which will be updated when we have more information.
So just to confirm: with the VM HW version set to 15 (see c#40 for details), after upgrading my cluster from 4.7.21 to 4.7.24 (the upgrade actually did not finish completely), I got the following issue:

~~~
# oc get co |grep -v "True False False"
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.24    False       True          True       3h23m
console                                    4.7.24    False       False         True       3h27m
monitoring                                 4.7.24    False       True          True       3h21m
openshift-apiserver                        4.7.24    False       False         False      3h25m
operator-lifecycle-manager-packageserver   4.7.24    False       True          False      3h22m
[root@bastion mg]#
~~~

which did not change even after waiting nearly 4 hours. Commands took ages, and creating a new project timed out with:

~~~
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projectrequests.project.openshift.io)
~~~

The setting was:

~~~
[root@bastion mg]# for NODE in master0 master1 master2 worker0 worker1 worker2 worker3 worker4 worker5; do echo ${NODE};ssh -o StrictHostKeyChecking=no core@${NODE} sudo ethtool -k ens192 |egrep 'tx-checksum-ip-generic';done
master0
tx-checksum-ip-generic: on
master1
tx-checksum-ip-generic: on
master2
tx-checksum-ip-generic: on
worker0
tx-checksum-ip-generic: on
worker1
tx-checksum-ip-generic: on
worker2
tx-checksum-ip-generic: on
worker3
tx-checksum-ip-generic: on
worker4
tx-checksum-ip-generic: on
worker5
tx-checksum-ip-generic: on
[root@bastion mg]#
~~~

(I have a must-gather and a sosreport from one master; ping me if needed.)

As soon as I change this to off:

~~~
[root@bastion mg]# for NODE in master0 master1 master2 worker0 worker1 worker2 worker3 worker4 worker5; do echo ${NODE};ssh -o StrictHostKeyChecking=no core@${NODE} sudo ethtool -K ens192 tx-checksum-ip-generic off;done
master0
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
master1
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
master2
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
worker0
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
worker1
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
worker2
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
worker3
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
worker4
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
worker5
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
[root@bastion mg]#
~~~

everything is almost instantly fine again and I can create new projects:

~~~
[root@bastion]# oc get co |grep -v "True False False"
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
[root@bastion]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.24    True        False         False      3m21s
baremetal                                  4.7.24    True        False         False      9h
cloud-credential                           4.7.24    True        False         False      10h
cluster-autoscaler                         4.7.24    True        False         False      9h
config-operator                            4.7.24    True        False         False      9h
console                                    4.7.24    True        False         False      3m33s
csi-snapshot-controller                    4.7.24    True        False         False      7h26m
dns                                        4.7.24    True        False         False      9h
etcd                                       4.7.24    True        False         False      9h
image-registry                             4.7.24    True        False         False      9h
ingress                                    4.7.24    True        False         False      9h
insights                                   4.7.24    True        False         False      9h
kube-apiserver                             4.7.24    True        False         False      9h
kube-controller-manager                    4.7.24    True        False         False      9h
kube-scheduler                             4.7.24    True        False         False      9h
kube-storage-version-migrator              4.7.24    True        False         False      7h36m
machine-api                                4.7.24    True        False         False      9h
machine-approver                           4.7.24    True        False         False      9h
machine-config                             4.7.24    True        False         False      6h49m
marketplace                                4.7.24    True        False         False      77s
monitoring                                 4.7.24    True        False         False      3m4s
network                                    4.7.24    True        False         False      9h
node-tuning                                4.7.24    True        False         False      8h
openshift-apiserver                        4.7.24    True        False         False      3m51s
openshift-controller-manager               4.7.24    True        False         False      8h
openshift-samples                          4.7.24    True        False         False      8h
operator-lifecycle-manager                 4.7.24    True        False         False      9h
operator-lifecycle-manager-catalog         4.7.24    True        False         False      9h
operator-lifecycle-manager-packageserver   4.7.24    True        False         False      3m50s
service-ca                                 4.7.24    True        False         False      9h
storage                                    4.7.24    True        False         False      7h27m
[root@bastion]#
~~~
(In reply to Joseph Callen from comment #25)
> Hi Ronak,
>
> Do you have any idea why we are required to disable `tx-checksum-ip-generic`?
> This looks similar to the previous VMXNET3 issue:
> https://bugzilla.redhat.com/show_bug.cgi?id=1941714

What exactly is the setup and what is the issue?

A couple of questions:
- Are tunnels being used? If yes, what tunneling protocol is being used?
- What is the destination port being used for the tunnel?
- Does the vmxnet3 driver have the fix from PR 1941714?

Thanks,
Ronak
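For the destination-port question, a quick way to answer it from the cluster side; the ports below are the protocol defaults, which clusters can override:

~~~
# Which network plugin is in use determines the tunnel protocol:
#   OpenShiftSDN  -> VXLAN,  UDP destination port 4789 by default
#   OVNKubernetes -> Geneve, UDP destination port 6081 by default
oc get network.config cluster -o jsonpath='{.status.networkType}{"\n"}'
~~~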
Based on the impact statement in comment 43, we have stopped recommending folks update from versions that are not impacted to versions that are impacted [1]. [1]: https://github.com/openshift/cincinnati-graph-data/pull/1008
(In reply to Ronak Doshi from comment #46)
> What exactly is the setup and what is the issue?

Similar to the last UDP issue. Standard and distributed vSwitch(es). NSX-T is not affected - still the same CI setup using VMC.

> Are tunnels being used? If yes, what tunneling protocol is being used?

Yes, either VXLAN or GENEVE.

> What is the destination port being used for the tunnel?

No idea - SDN folks on this BZ, please respond.

> Does the vmxnet3 driver have the fix from PR 1941714?

Yes, and even if it didn't, we have a workaround in place to disable the previous issues with UDP.

Based on Cathy's response about the changes in 8.4, could it be caused by:
https://github.com/torvalds/linux/commit/8a7f280f29a80f6e0798f5d6e07c5dd8726620fe#diff-db4c3dfb5fede7bacdecc2e2c486cb29369c21885ffa6ccb6cd4220c37b0fa75
or
https://github.com/torvalds/linux/commit/1dac3b1bc66dc68dbb0c9f43adac71a7d0a0331a#diff-db4c3dfb5fede7bacdecc2e2c486cb29369c21885ffa6ccb6cd4220c37b0fa75

Ronak, can you see private comments? If not, pasting from a previous comment:

~~~
[root@inf14:~] vsish -e cat /net/portsets/$(net-stats -l |grep master |awk '{print $4}')/ports/$(net-stats -l |grep master |awk '{print $1}')/vmxnet3/txSummary
stats of a vmxnet3 vNIC tx queue {
   generation:1424
   pkts tx ok:12564827
   bytes tx ok:6111084746
   TSO pkts tx ok:786793
   TSO bytes tx ok:4352028717
   unicast pkts tx ok:12564748
   unicast bytes tx ok:6111081428
   multicast pkts tx ok:0
   multicast bytes tx ok:0
   broadcast pkts tx ok:79
   broadcast bytes tx ok:3318
   pkts tx failure:0
   pkts discarded:341556 <-------------------- *******
   error when copying hdrs:0
   tso header errors:0
   pkt allocation failures:0
   # of times a tx queue is stopped:0
   failed to map some guest buffers:0
   tx completion failure due to stale enableGen:0
   giant tso pkts requiring more than 1 pkt handle:0
   failed to split a giant tso pkt:0
   giant non-tso pkts requiring more than 1 pkt handle:0
   failed to create a pkt from more than 1 pkt handle:0
   encap (outer) header errors:341556 <------------------------------******
   encap (inner) tso header errors:0
}
~~~
(In reply to Joseph Callen from comment #49)

I cannot see private comments. Based on the counters, it seems something was not as expected in the encapsulation header. In the previous UDP issue it was the destination port, so I would like to know what destination port is used here.

> > Does the vmxnet3 driver have the fix from PR 1941714?
> Yes, and even if it didn't, we have a workaround in place to disable the
> previous issues with UDP.

By the way, if I remember correctly, the fix in the previous PR was to disable the tunnel offloads. If so, then how are tunnel offloads enabled here? Shouldn't they be disabled? Also, is NSX-T installed here?

Thanks,
Ronak
Also, packet capture (with --ng option) as done in PR 1941714 would be helpful and appreciated.
Hi Daniel,

Can you answer Ronak's questions regarding your OCP and vSphere cluster specifics? Can you also provide `ethtool -k ens192` and `uname -a` output for that master?

Thanks!
In all cases the kernel is 4.18.0-305.10.2.el8_4 or 4.18.0-305.12.1.el8_4.
@sdodson isn't that given the OCP version as MCO handles the masters? In our case we'll have workers on RHEL 7.x 3.10.0-1160.36.2.el7.x86_64 while masters on 4.18.0-305.10.2.el8_4.x86_64 RHCOS.
(In reply to David J. M. Karlsen from comment #54)
> @sdodson isn't that given the OCP version as MCO handles the masters?
> In our case we'll have workers on RHEL 7.x 3.10.0-1160.36.2.el7.x86_64
> while masters on 4.18.0-305.10.2.el8_4.x86_64 RHCOS.

Sure, but nowhere else in this bug has it been mentioned that RHEL7 workers are involved. I think we'd probably want a unique bug to track that variant, as it may require a RHEL7 kernel fix in the end. We'll also want to verify that the problem exists between two RHEL7 workers and not just between the RHCOS control plane and RHEL7 workers.
Thanks Jatan, just to make my data complete:

~~~
# cat etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"

# cat uname
Linux master1.ocp4-csa.coe.muc.redhat.com 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Mon Jul 12 04:43:18 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

# cat ethtool_-k_ens192
Features for ens192:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off
tx-udp_tnl-csum-segmentation: off
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tx-gso-list: off [fixed]
rx-gro-list: off
tls-hw-rx-offload: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
~~~

I've got a must-gather and a sos-report from this cluster that I am going to attach.
(In reply to Ronak Doshi from comment #51)
> Also, packet capture (with --ng option) as done in PR 1941714 would be
> helpful and appreciated.

I have captured data as described in https://bugzilla.redhat.com/show_bug.cgi?id=1941714#c10 after I moved all masters to one ESXi host, and I am happy to provide the data; however, it is too big to attach to this BZ.

If something else should be captured, please let me know and be as specific as possible, as I am neither a VMware admin nor a network guy ;)
(In reply to daniel from comment #62)
> I have captured data as described in
> https://bugzilla.redhat.com/show_bug.cgi?id=1941714#c10
> after I moved all masters to one ESXi host

Based on comment 58:

~~~
tx-udp_tnl-segmentation: off
tx-udp_tnl-csum-segmentation: off
~~~

The overlay offloads are disabled. This means the stack will calculate the inner header checksums. So I am not able to understand how the packets are requesting offloads.

The stats shared in comment 49 are for ens192, right?

~~~
[root@inf14:~] vsish -e cat /net/portsets/$(net-stats -l |grep master |awk '{print $4}')/ports/$(net-stats -l |grep master |awk '{print $1}')/vmxnet3/txSummary
stats of a vmxnet3 vNIC tx queue {
...
pkts discarded:341556 <------------------------------******
encap (outer) header errors:341556 <------------------------------******
...
}
~~~

If so, could you capture packets on ens192 using tcpdump inside the VM when you see the issue?

Thanks,
Ronak
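A minimal capture sketch along those lines; 4789 is the VXLAN default port and 6081 the Geneve default, so adjust the interface name and filter to your environment:

~~~
# On the affected node, capture the encapsulated tunnel traffic while
# reproducing the failure, then attach the resulting pcap.
tcpdump -i ens192 -w /tmp/ens192-tunnel.pcap 'udp port 4789 or udp port 6081'
~~~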
This bug is progressing toward closure via a workaround deployed in OpenShift. We've opened bug 1998572 to track the kernel fix so that we may remove the workaround in future OpenShift versions, enabling OpenShift to make use of the default offload feature set provided by the vmxnet3 driver.
(In reply to Scott Dodson from comment #71)

Dear Red Hat,

> This bug is progressing toward closure via a workaround deployed in
> OpenShift. We've opened bug 1998572 to track the kernel fix so that we may
> remove the workaround in future OpenShift versions, enabling OpenShift to
> make use of the default offload feature set provided by the vmxnet3 driver.

We think the fix should be backported to OCP 4.7 and OCP 4.8 after bug 1998572 is fixed. Does Red Hat have a plan for that? Is there a ticket to track it?

Regards.
*** Bug 1993153 has been marked as a duplicate of this bug. ***
(In reply to weiguo fan from comment #73)
> We think the fix should be backported to OCP 4.7 and OCP 4.8 after bug
> 1998572 is fixed. Does Red Hat have a plan for that? Is there a ticket to
> track it?

The workaround has already been backported to 4.8 and 4.7. When a kernel fix becomes available that removes the need for these workarounds, we will confirm that it fixes the problem in all relevant versions of OpenShift, and the workaround will be removed after that.
> The workaround has already been backported to 4.8 and 4.7. When a kernel fix
> becomes available that removes the need for these workarounds, we will
> confirm that it fixes the problem in all relevant versions of OpenShift, and
> the workaround will be removed after that.

Thanks for the information, Scott. Could you kindly let us know the 4.8 and 4.7 versions that include the workaround?

Regards.
The 4.8 workaround is being tracked in bug 1998106. The 4.7 workaround is being tracked in bug 1998112. Both are likely to go out with the next supported release in their respective z streams, but neither has been released yet.
*** Bug 1993723 has been marked as a duplicate of this bug. ***
Workaround for those who cannot immediately upgrade is to disable tx-checksum-ip-generic on vmxnet3 interfaces, e.g.:

~~~
ethtool -K ens192 tx-checksum-ip-generic off
~~~
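Note that `ethtool -K` does not persist across reboots. One common way to reapply the setting automatically on RHEL-family hosts is a NetworkManager dispatcher script; the sketch below is illustrative only, the interface name and script path are assumptions, and this is not the official workaround that ships via bugs 1998106/1998112:

~~~
#!/bin/bash
# Illustrative sketch: save as /etc/NetworkManager/dispatcher.d/99-disable-tx-checksum
# and mark it executable (chmod +x). NetworkManager passes the interface name
# as $1 and the action as $2; the interface name below is an assumption.
if [ "$1" = "ens192" ] && [ "$2" = "up" ]; then
    /usr/sbin/ethtool -K ens192 tx-checksum-ip-generic off
fi
~~~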
*** Bug 1996577 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days