The command fails with: error: Failed to query for interfaces addresses error: internal error: wrong nlmsg len It worked fine with 6.9.x kernels. Reproducible: Always Steps to Reproduce: 1. Boot a 6.10.x kernel. 2. Start a libvirt VM. 3. Run `virsh domifaddr --source=arp <domain>` to get the VM's IP address. Actual Results: error: Failed to query for interfaces addresses error: internal error: wrong nlmsg len Expected Results: Succeeds and prints the correct output. libvirt-client-10.1.0-3.fc40.x86_64 kernel-6.10.1-200.fc40.x86_64 (from https://copr.fedorainfracloud.org/coprs/jforbes/kernel-stabilization/)
Hey! This bug was cloned to a RHEL issue tracker (because whatever shows up upstream will eventually show up in RHEL :-)), and I did a bit of poking and made a comment here: https://issues.redhat.com/browse/RHEL-52449?focusedId=25257731&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25257731 Note the 2 followup comments - these may show up *above* the linked comment, depending on what order jira is using for comments on your machine. In addition to the questions I asked there, could you also try updating your kernel from the copr just in case it really is some kernel regression that has already been fixed in 6.10.1? If not, then it would be useful to get the information I requested in those comments from your system as well - hopefully we can figure out what's triggering it so I can replicate the problem here, which would make it much easier to debug & test. Thanks!
So I finally had time to try to boot the 6.10.x kernel again (I am now at 6.10.3-200.fc40.x86_64 from official Fedora repos) and I tried the commands from the RHEL Jira - both "arp -an" and "ip neigh show" showed just four entries, so it doesn't seem to be a problem with ARP table length.
I think I found the bug... In libvirt's virArpTableGet(), this condition is weird: if ((len -= NLMSG_LENGTH(sizeof(*nh))) < 0) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("wrong nlmsg len")); goto cleanup; } NLMSG_LENGTH(sizeof(*nh)) is basically NLMSG_HDRLEN * 2, so the code rejects any netlink messages that have payload smaller than NLMSG_HDRLEN. But the final NLMSG_DONE message has payload of just 4 bytes, so it fails the check. I believe the "NLMSG_LENGTH(sizeof(*nh))" should be replaced with just "NLMSG_HDRLEN". I'm not sure why it only started to happen with kernel 6.10, though...
I just built libvirt with the suggested fix, installed it locally, and I can confirm that it fixes the issue. https://src.fedoraproject.org/fork/omos/rpms/libvirt/c/22072df7d08ce138bf8a6b015afdaf138b832e5f?branch=testing https://copr.fedorainfracloud.org/coprs/omos/libvirt-testing/
I managed to find a minimal reproducer (directly calling virArpTableGet() from a C program) and bisected it to this kernel commit: commit 7e4975f7e7fb0eba3cbb69d9c467750a1c3ce131 Author: Eric Dumazet <edumazet> Date: Thu Apr 18 09:51:05 2024 +0000 neighbour: fix neigh_dump_info() return value Change neigh_dump_table() and pneigh_dump_table() to either return 0 or -EMSGSIZE if not enough space was available in the skb. Then neigh_dump_info() can do the same. This allows NLMSG_DONE to be appended to the current skb at the end of a dump, saving a couple of recvmsg() system calls. Signed-off-by: Eric Dumazet <edumazet> Signed-off-by: David S. Miller <davem> The last sentence explains it - before the commit NLMSG_DONE would be always sent in a separate message, so the virArpTableGet() bug wouldn't trigger (since libvirt only looks at the first message it gets from the kernel).
Nice work tracking it down Ondrej! Laine does Ondrej's change seem sensible to you?
Actually, looking closer at the code, I think the following would be a better fix (untested). We need to check for NLMSG_DONE first, and then make sure that the length of the payload is >= sizeof(struct ndmsg). There is no need to check for NLMSG_HDRLEN, since NLMSG_OK() already takes care of that. diff --git a/src/util/virarptable.c b/src/util/virarptable.c index 299dddd664..54ffc41779 100644 --- a/src/util/virarptable.c +++ b/src/util/virarptable.c @@ -81,10 +81,12 @@ virArpTableGet(void) for (; NLMSG_OK(nh, msglen); nh = NLMSG_NEXT(nh, msglen)) { VIR_WARNINGS_RESET struct ndmsg *r = NLMSG_DATA(nh); - int len = nh->nlmsg_len; void *addr; - if ((len -= NLMSG_LENGTH(sizeof(*nh))) < 0) { + if (nh->nlmsg_type == NLMSG_DONE) + break; + + if (nh->nlmsg_len < NLMSG_LENGTH(sizeof(*r))) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("wrong nlmsg len")); goto cleanup; @@ -98,9 +100,6 @@ virArpTableGet(void) (!(r->ndm_state == NUD_STALE || r->ndm_state == NUD_REACHABLE))) continue; - if (nh->nlmsg_type == NLMSG_DONE) - return table; - VIR_WARNINGS_NO_CAST_ALIGN parse_rtattr(tb, NDA_MAX, NDA_RTA(r), nh->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
I just got back from a week away from the computer so I'm just now seeing this. Thanks for all the investigation Ondrej! I looked over that code before I left for my vacation and didn't catch the problem as you have (I obviously didn't look very carefully though, since I misinterpreted it "just a bit") :-) I'll try out Ondrej's suggested fix from Comment 7, and if it works on my (admittedly *non*-error-reproducing) 6.10 machine, then we should get it pushed upstream. Ondrej - would you be willing/able to git send-email whatever is the final patch to devel.org? (with a Signed-off-by: included - see https://libvirt.org/hacking.html and https://libvirt.org/submitting-patches.html for our particular contribution idiosyncracies.) I want to make sure you get credit for your work. (you've certainly saved me/us a lot of time!) (or if you'd prefer, you can git send-email the patch to me, and I'll look it over before posting it to the mailing list in your name.)
Oh wait - in the time between me starting to read the updates here, and finally getting my response posted, it looks like Martin Kletzander has posted an alternative patch upstream. Can you see if that works for you too (I'm about to test it here).
Fixed upstream with commit v10.6.0-51-g4de8962a79c1: commit 4de8962a79c1ddedc93e88f0104b5bd228fbbb71 Author: Martin Kletzander <mkletzan> Date: Fri Aug 16 14:02:48 2024 +0200 virarptable: End parsing earlier in case of NLMSG_DONE
Yes, Martin's patches fix the problem for me, too 👍
FEDORA-2024-15f710d324 (libvirt-10.1.0-4.fc40) has been submitted as an update to Fedora 40. https://bodhi.fedoraproject.org/updates/FEDORA-2024-15f710d324
FEDORA-2024-15f710d324 has been pushed to the Fedora 40 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-15f710d324` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-15f710d324 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2024-15f710d324 (libvirt-10.1.0-4.fc40) has been pushed to the Fedora 40 stable repository. If problem still persists, please make note of it in this bug report.