Bug 2048556
Summary: | VM with 16+ CPUs - no connectivity if networkInterfaceMultiqueue is enabled | |
---|---|---|---
Product: | Container Native Virtualization (CNV) | Reporter: | Ruth Netser <rnetser>
Component: | Networking | Assignee: | Petr Horáček <phoracek>
Status: | CLOSED MIGRATED | QA Contact: | awax
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.10.0 | CC: | dholler, fdeutsch, gkapoor, jhopper, nkoenig, nrozen, omergi, phoracek
Target Milestone: | --- | |
Target Release: | 4.14.2 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-12-14 16:07:16 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Ruth Netser
2022-01-31 14:07:03 UTC
Created attachment 1858254
tar with domxml and domcapabilities for VM with enabled and disabled multiqueue
Hi Ruth, thanks for the details and the sap-hana cluster.

I tried various combinations. If we remove sriov-net3 (and are left with the others), it works. If we leave sriov-net3 by itself or with one of the others, it doesn't. Do we have another cluster exactly like this one? We need to understand what is special about sriov-net3, or whether the hardware behind it has a problem.

Thanks

Update: checking "ip r" / "ip a" in the VM when the bug happens, we see that the interfaces are flipped, so the routing goes via the sriov interface instead of the default interface. If we use consistent network device naming by removing net.ifnames=0 from /etc/default/grub, rebuilding the grub config (sudo grub2-mkconfig -o /boot/grub2/grub.cfg) and rebooting the system, it works: the routing is now correct (the primary interface has the lower metric for the default gateway). Please see https://bugzilla.redhat.com/show_bug.cgi?id=1874096#c14 for more info.

Hi Geetika,

Can you please try to create the VM with a MAC address for each sriov interface, and a cloud-init config that uses set-name according to the macaddress match, on a sap-hana cluster with the 3 sriov interfaces, multiqueue and 17+ CPUs, to see if it also solves the above problem? Using set-name should give consistent network device naming, and therefore also solve the wrong routing. If it helps, we can update the templates instead of updating the guest grub.

Thanks

Due to capacity, I'm moving this to 4.12.

Hello,

When running a VM with multiple interfaces where the guest OS uses netX interface naming (e.g. eth0, eth1, ...), the order of the interfaces may not be consistent.
Please set the VM up with predictable naming [1].

If the issue reproduces, it would be great to have access to its environment.

[1] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/ch-consistent_network_device_naming#sec-Naming_Schemes_Hierarchy

(In reply to omergi from comment #9)
> Hello,
>
> When running a VM with multiple interfaces and the guest OS use netX
> interface naming (e.g.:eth0, eth1, ...), the order of the interface may not
> be consistent.
> Please set the VM with predictable naming [1].
>
> If the issue reproduces it will be great to have access to its environment.
>
> [1]
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/ch-consistent_network_device_naming#sec-Naming_Schemes_Hierarchy

Petr, do we know in which component the problem is?

From Or's investigation it looks like an issue with the guest config: it was not using consistent naming (something that SR-IOV VFs especially suffer from), and that led to mismatched interfaces and connectivity issues. How multiqueue (only applied on the Pod network) was related to that is a mystery to me. So to me it's a problem with QE's guest configuration until proven otherwise. If it really ends up being a multiqueue problem, then it could be in CNV network (configuring multiqueue on the TAP), libvirt, or below.

Thanks, Petr. It would be great if we could remove the uncertainty and understand whether it's a guest configuration problem (do we need to document anything here then?) or a multiqueue problem.
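
For reference, the set-name suggestion made earlier in this thread could look roughly like the sketch below. It renders a cloud-init network config (version 2) in which each interface is matched by MAC address and renamed with set-name, so the guest names no longer depend on probe order. The interface names and MAC addresses are hypothetical placeholders, not values from the affected cluster. The sketch assumes the same MAC addresses are also set on the corresponding interfaces in the VM spec.

```python
def render_network_config(interfaces):
    """Render a cloud-init network config (version 2) that pins names by MAC.

    `interfaces` maps the desired guest interface name to its MAC address.
    """
    lines = ["version: 2", "ethernets:"]
    for name, mac in interfaces.items():
        lines += [
            f"  {name}:",
            "    match:",
            f'      macaddress: "{mac.lower()}"',
            f"    set-name: {name}",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical names and MAC addresses, for illustration only.
example = {
    "sriov1": "02:00:00:00:00:01",
    "sriov2": "02:00:00:00:00:02",
    "sriov3": "02:00:00:00:00:03",
}
network_data = render_network_config(example)
print(network_data)
```

The resulting text would be passed to the guest as cloud-init network data. The primary interface could be pinned the same way if its MAC address is fixed in the VM spec.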
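
The "flipped interfaces" symptom described in the update above (default route going out via the wrong interface) can be checked mechanically. The following is a minimal sketch that assumes the guest has an iproute2 build supporting JSON output (`ip -j route show`). The sample routes are made up for illustration and are not taken from the affected VM.

```python
import json


def default_route_device(ip_route_json):
    """Return (device, metric) of the preferred default route, or None.

    The kernel prefers the default route with the lowest metric;
    a route without an explicit metric is treated as metric 0.
    """
    routes = [r for r in json.loads(ip_route_json) if r.get("dst") == "default"]
    if not routes:
        return None
    best = min(routes, key=lambda r: r.get("metric", 0))
    return best["dev"], best.get("metric", 0)


# Hypothetical output of `ip -j route show` inside the guest.
sample = json.dumps([
    {"dst": "default", "gateway": "10.0.2.1", "dev": "eth0", "metric": 101},
    {"dst": "default", "gateway": "192.168.1.1", "dev": "eth1", "metric": 100},
    {"dst": "10.0.2.0/24", "dev": "eth0", "metric": 101},
])
print(default_route_device(sample))
```

If the returned device is not the interface attached to the Pod network, the guest is routing via a secondary interface, which matches the behaviour reported here.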