Bug 2302245 - virsh domifaddr --source=arp <domain> fails with kernel 6.10
Summary: virsh domifaddr --source=arp <domain> fails with kernel 6.10
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: libvirt
Version: 40
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: Laine Stump
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2024-08-01 13:49 UTC by Ondrej Mosnáček
Modified: 2024-09-04 02:23 UTC (History)
10 users (show)

Fixed In Version: libvirt-10.1.0-4.fc40
Clone Of:
Environment:
Last Closed: 2024-09-04 02:23:14 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)

Description Ondrej Mosnáček 2024-08-01 13:49:55 UTC
The command fails with:
error: Failed to query for interfaces addresses
error: internal error: wrong nlmsg len

It worked fine with 6.9.x kernels.

Reproducible: Always

Steps to Reproduce:
1. Boot a 6.10.x kernel.
2. Start a libvirt VM.
3. Run `virsh domifaddr --source=arp <domain>` to get the VM's IP address.
Actual Results:  
error: Failed to query for interfaces addresses
error: internal error: wrong nlmsg len

Expected Results:  
Succeeds and prints the correct output.

libvirt-client-10.1.0-3.fc40.x86_64
kernel-6.10.1-200.fc40.x86_64 (from https://copr.fedorainfracloud.org/coprs/jforbes/kernel-stabilization/)

Comment 1 Laine Stump 2024-08-05 18:40:53 UTC
Hey! This bug was cloned to a RHEL issue tracker (because whatever shows up upstream will eventually show up in RHEL :-)), and I did a bit of poking and made a comment here:

https://issues.redhat.com/browse/RHEL-52449?focusedId=25257731&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25257731

Note the 2 followup comments - these may show up *above* the linked comment, depending on what order jira is using for comments on your machine. In addition to the questions I asked there, could you also try updating your kernel from the copr just in case it really is some kernel regression that has already been fixed in 6.10.1?

If not, then it would be useful to get the information I requested in those comments from your system as well - hopefully we can figure out what's triggering it so I can replicate the problem here, which would make it much easier to debug & test.

Thanks!

Comment 2 Ondrej Mosnáček 2024-08-12 13:58:52 UTC
So I finally had time to try to boot the 6.10.x kernel again (I am now at 6.10.3-200.fc40.x86_64 from official Fedora repos) and I tried the commands from the RHEL Jira - both "arp -an" and "ip neigh show" showed just four entries, so it doesn't seem to be a problem with ARP table length.

Comment 3 Ondrej Mosnáček 2024-08-12 14:41:18 UTC
I think I found the bug... In libvirt's virArpTableGet(), this condition is weird:

        if ((len -= NLMSG_LENGTH(sizeof(*nh))) < 0) {
            virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                           _("wrong nlmsg len"));
            goto cleanup;
        }

NLMSG_LENGTH(sizeof(*nh)) is basically NLMSG_HDRLEN * 2, so the code rejects any netlink messages that have payload smaller than NLMSG_HDRLEN. But the final NLMSG_DONE message has payload of just 4 bytes, so it fails the check. I believe the "NLMSG_LENGTH(sizeof(*nh))" should be replaced with just "NLMSG_HDRLEN".

I'm not sure why it only started to happen with kernel 6.10, though...

Comment 4 Ondrej Mosnáček 2024-08-12 16:38:43 UTC
I just built libvirt with the suggested fix, installed it locally, and I can confirm that it fixes the issue.

https://src.fedoraproject.org/fork/omos/rpms/libvirt/c/22072df7d08ce138bf8a6b015afdaf138b832e5f?branch=testing
https://copr.fedorainfracloud.org/coprs/omos/libvirt-testing/

Comment 5 Ondrej Mosnáček 2024-08-12 18:44:08 UTC
I managed to find a minimal reproducer (directly calling virArpTableGet() from a C program) and bisected it to this kernel commit:

commit 7e4975f7e7fb0eba3cbb69d9c467750a1c3ce131
Author: Eric Dumazet <edumazet>
Date:   Thu Apr 18 09:51:05 2024 +0000

    neighbour: fix neigh_dump_info() return value
    
    Change neigh_dump_table() and pneigh_dump_table()
    to either return 0 or -EMSGSIZE if not enough
    space was available in the skb.
    
    Then neigh_dump_info() can do the same.
    
    This allows NLMSG_DONE to be appended to the current
    skb at the end of a dump, saving a couple of recvmsg()
    system calls.
    
    Signed-off-by: Eric Dumazet <edumazet>
    Signed-off-by: David S. Miller <davem>

The last sentence explains it - before the commit NLMSG_DONE would be always sent in a separate message, so the virArpTableGet() bug wouldn't trigger (since libvirt only looks at the first message it gets from the kernel).

Comment 6 Cole Robinson 2024-08-13 18:45:22 UTC
Nice work tracking it down Ondrej! Laine does Ondrej's change seem sensible to you?

Comment 7 Ondrej Mosnáček 2024-08-13 22:50:30 UTC
Actually, looking closer at the code, I think the following would be a better fix (untested). We need to check for NLMSG_DONE first, and then make sure that the length of the payload is >= sizeof(struct ndmsg). There is no need to check for NLMSG_HDRLEN, since NLMSG_OK() already takes care of that.

diff --git a/src/util/virarptable.c b/src/util/virarptable.c
index 299dddd664..54ffc41779 100644
--- a/src/util/virarptable.c
+++ b/src/util/virarptable.c
@@ -81,10 +81,12 @@ virArpTableGet(void)
     for (; NLMSG_OK(nh, msglen); nh = NLMSG_NEXT(nh, msglen)) {
         VIR_WARNINGS_RESET
         struct ndmsg *r = NLMSG_DATA(nh);
-        int len = nh->nlmsg_len;
         void *addr;
 
-        if ((len -= NLMSG_LENGTH(sizeof(*nh))) < 0) {
+        if (nh->nlmsg_type == NLMSG_DONE)
+            break;
+
+        if (nh->nlmsg_len < NLMSG_LENGTH(sizeof(*r))) {
             virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                            _("wrong nlmsg len"));
             goto cleanup;
@@ -98,9 +100,6 @@ virArpTableGet(void)
             (!(r->ndm_state == NUD_STALE || r->ndm_state == NUD_REACHABLE)))
             continue;
 
-        if (nh->nlmsg_type == NLMSG_DONE)
-            return table;
-
         VIR_WARNINGS_NO_CAST_ALIGN
         parse_rtattr(tb, NDA_MAX, NDA_RTA(r),
                      nh->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));

Comment 8 Laine Stump 2024-08-17 02:08:06 UTC
I just got back from a week away from the computer so I'm just now seeing this. Thanks for all the investigation Ondrej! I looked over that code before I left for my vacation and didn't catch the problem as you have (I obviously didn't look very carefully though, since I misinterpreted it "just a bit") :-)

I'll try out Ondrej's suggested fix from Comment 7, and if it works on my (admittedly *non*-error-reproducing) 6.10 machine, then we should get it pushed upstream. Ondrej - would you be willing/able to git send-email whatever is the final patch to devel.org? (with a Signed-off-by: included - see https://libvirt.org/hacking.html and https://libvirt.org/submitting-patches.html for our particular contribution idiosyncracies.) I want to make sure you get credit for your work. (you've certainly saved me/us a lot of time!)

(or if you'd prefer, you can git send-email the patch to me, and I'll look it over before posting it to the mailing list in your name.)

Comment 9 Laine Stump 2024-08-17 02:11:20 UTC
Oh wait - in the time between me starting to read the updates here, and finally getting my response posted, it looks like Martin Kletzander has posted an alternative patch upstream. Can you see if that works for you too (I'm about to test it here).

Comment 10 Martin Kletzander 2024-08-19 11:46:04 UTC
Fixed upstream with commit v10.6.0-51-g4de8962a79c1:

commit 4de8962a79c1ddedc93e88f0104b5bd228fbbb71
Author: Martin Kletzander <mkletzan>
Date:   Fri Aug 16 14:02:48 2024 +0200

    virarptable: End parsing earlier in case of NLMSG_DONE

Comment 11 Ondrej Mosnáček 2024-08-19 14:10:55 UTC
Yes, Martin's patches fix the problem for me, too 👍

Comment 12 Fedora Update System 2024-08-27 15:07:15 UTC
FEDORA-2024-15f710d324 (libvirt-10.1.0-4.fc40) has been submitted as an update to Fedora 40.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-15f710d324

Comment 13 Fedora Update System 2024-08-28 03:54:37 UTC
FEDORA-2024-15f710d324 has been pushed to the Fedora 40 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-15f710d324`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-15f710d324

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 14 Fedora Update System 2024-09-04 02:23:14 UTC
FEDORA-2024-15f710d324 (libvirt-10.1.0-4.fc40) has been pushed to the Fedora 40 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.