Created attachment 1923814 [details]
Log of git bisect

Description of problem:
Virtual machines that boot from NFS-mounted qcow2 images fail to boot when the NFS server runs kernel 6.0.5. The errors are often:

Booting from Hard Disk...
GRUB loading.
Welcome to GRUB!

error ../../grub-core/kern/dl.c: ELF header smaller than expected.
Entering rescue mode...
grub rescue>

Version-Release number of selected component (if applicable):
qemu-6.2.0-16.fc36.x86_64
kernel-6.0.5-200.fc36.x86_64

How reproducible:
always

Steps to Reproduce:
1. Run a virtual machine whose qcow2-formatted boot disk is mounted from an NFS server running kernel-6.0.5-200.fc36.x86_64

Actual results:
Machine fails to boot

Expected results:
Clean boot

Additional info:
Git repo is linux-stable. A git bisect lands me at 310d9d5a5009a93377200b98daa2d84aa2bd8160, so commit 746de1f86fcd works, but cherry-picking f0f6b614f83d, 310d9d5a5009 and bfbfb6182ad1 on top of 746de1f86fcd is broken. bfbfb6182ad1 is picked due to the commit comment:

commit bfbfb6182ad1d7d184b16f25165faad879147f79
Author: Al Viro <viro.org.uk>
Date:   Sat Sep 10 22:14:02 2022 +0100

    nfsd_splice_actor(): handle compound pages

    pipe_buffer might refer to a compound page (and contain more than a
    PAGE_SIZE worth of data). Theoretically it had been possible since
    way back, but nfsd_splice_actor() hadn't run into that until
    copy_page_to_iter() change. Fortunately, the only thing that changes
    for compound pages is that we need to stuff each relevant subpage in
    and convert the offset into offset in the first subpage.

    Acked-by: Chuck Lever <chuck.lever>
    Tested-by: Benjamin Coddington <bcodding>
    Fixes: f0f6b614f83d "copy_page_to_iter(): don't split high-order page in case of ITER_PIPE"
    Signed-off-by: Al Viro <viro.org.uk>
Created attachment 1923830 [details] working rpcdebug dump
Created attachment 1923831 [details] failed rpcdebug dump
Hi Anders, thanks for the report. What's the backing filesystem exported by nfsd where the VM's qcow2 image resides?

It might be useful to see some trace output from nfsd (if you can get your server to only service the one VM). I'm interested in offsets/lengths when the problem occurs:

grep nfsd_read /sys/kernel/debug/tracing/available_events > /sys/kernel/debug/tracing/set_event

.. reproduce, then send the last few lines of /sys/kernel/debug/tracing/trace.

Turn that tracing back off with:

> /sys/kernel/debug/tracing/set_event
Created attachment 1924258 [details]
Full trace of nfs_read

From my bisecting and code inspection, it looks like the second part of commit 310d9d5a5009 is to blame, and it also looks suspiciously simpler than the iov_iter_bvec_advance (called from iov_iter_advance) it's meant to replace.
(In reply to Benjamin Coddington from comment #3)
> Hi Anders, thanks for the report. What's the backing filesystem exported by
> nfsd where the VM's qcow2 image resides?

xfs

Sorry for missing that question in my previous reply!
Created attachment 1924329 [details]
Trace of nfs activity for a working boot (5.19.4-200.fc36.x86_64)

Sad to say, the reads look exactly the same up to the point where the previous dump stopped in grub loading :-(
(In reply to Anders Blomdell from comment #4)
> From my bisecting and code inspection, it looks like the second part of
> commit 310d9d5a5009 is to blame, and it also looks suspiciously simpler than
> the iov_iter_bvec_advance (called from iov_iter_advance) it's meant to
> replace.

.. but how is nfsd getting in there? I don't see the call graph.. maybe from net/core/ somewhere?

What arch is this?

Can you do something like:

@@ -1458,6 +1461,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 	p = *pages;
 	for (int k = 0; k < n; k++)
 		get_page(p[k] = page + k);
+
+	if (current->comm == "nfsd")
+		trace_dump_stack(0);
+

I've tried to reproduce this, but could not. So, without Al looking at this and knowing /exactly/ what the problem is, my next step would be to narrow down the nfsd server's behavior to the corruption.

I wonder if you can reproduce the corruption by just reading that single 512 block.

Can you compare reading that last 512 block from the NFS client and NFS server and verify the server is corrupting it?

Something like this on the client:  dd skip=5659199 if=/mnt/server/qcow iflag=direct count=1
Something like this on the server:  dd skip=5659199 if=/exports/xfs/qcow iflag=direct count=1

However, I'd doubt the single 512 block is triggering this -- rather it's one of those 65536 reads earlier, so we'd be dealing with multiple pages or a compound. If you can reproduce it with a single NFS READ, you could insert a panic and we could take apart the vmcore.
(In reply to Benjamin Coddington from comment #7)
> +	if (current->comm == "nfsd")
> +		trace_dump_stack(0);

.. er no, sorry ^^ that's not going to work -- use something like:

	if (strcmp(current->comm, "nfsd") == 0)
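The correction matters because current->comm is a fixed-size char array embedded in task_struct, so comparing it to a string literal with == compares two distinct addresses and can never be true. A minimal user-space sketch of the difference (is_nfsd / is_nfsd_broken are made-up helper names, not kernel code):

```c
#include <string.h>

/* Broken: compares the address of the comm buffer with the address of
 * the literal "nfsd". Two distinct objects, so this is always false. */
static int is_nfsd_broken(const char *comm)
{
	return comm == (const char *)"nfsd";
}

/* Correct: compares the characters, as the strcmp() form above does. */
static int is_nfsd(const char *comm)
{
	return strcmp(comm, "nfsd") == 0;
}
```

In the kernel the same reasoning applies: strcmp(current->comm, "nfsd") == 0 tests the thread name's content rather than its storage address.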
(In reply to Benjamin Coddington from comment #7)
> .. but how is nfsd getting in there? I don't see the call graph.. maybe
> from net/core/ somewhere?
>
> What arch is this?

x86_64

> Can you do something like:
>
> @@ -1458,6 +1461,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  	p = *pages;
>  	for (int k = 0; k < n; k++)
>  		get_page(p[k] = page + k);
> +
> +	if (current->comm == "nfsd")
> +		trace_dump_stack(0);
> +
>
> I've tried to reproduce this, but could not. So, without Al looking at this
> and knowing /exactly/ what the problem is, my next step would be to narrow
> down the nfsd server's behavior to the corruption.
>
> I wonder if you can reproduce the corruption by just reading that single 512
> block.
>
> Can you compare reading that last 512 block from the NFS client and NFS
> server and verify the server is corrupting it?

Copying the image to another server (5.19) and mounting it from there works.
Copying the VM to another machine (5.19.4) and trying to run it from the 6.0.5 NFS server fails.

> Something like this on the client:  dd skip=5659199 if=/mnt/server/qcow iflag=direct count=1
> Something like this on the server:  dd skip=5659199 if=/exports/xfs/qcow iflag=direct count=1
>
> However, I'd doubt the single 512 block is triggering this -- rather it's one
> of those 65536 reads earlier, so we'd be dealing with multiple pages or a
> compound. If you can reproduce it with a single NFS READ, you could insert
> a panic and we could take apart the vmcore.
I have set up a reproducer on 6.0.5 on a new machine, and it seems like a fresh mount of the image is needed. These are the hooks I use:

/etc/libvirt/hooks/qemu:

#!/bin/bash
if [ -x "/etc/libvirt/hooks/qemu.d/$1.hook" ] ; then
    exec "/etc/libvirt/hooks/qemu.d/$1.hook" "$1" "$2"
fi

/etc/libvirt/hooks/qemu.d/test-03.hook:

#!/bin/bash
(
    set -x
    echo $@
    mkdir -p /tmp/image
    if [[ "$2" == "prepare" ]] ; then
        if ! mountpoint -q /tmp/image ; then
            mount [2001:470:28:988:1401:2310:4100:0]:/xfs \
                  /tmp/image
        fi
    elif [[ "$2" == "release" ]] ; then
        umount /tmp/image
    fi
) 2>&1 | logger -t debugging

And to make matters worse, it sometimes starts working after some tests, where /sys/kernel/debug/tracing/trace doesn't show any traffic, but the image is (in this case) correctly read (the curse of caching, maybe :-(); a reboot breaks it again.
(In reply to Anders Blomdell from comment #9)
> And to make matters worse, it sometimes starts working after some tests,
> where /sys/kernel/debug/tracing/trace doesn't show any traffic, but the
> image is (in this case) correctly read (the curse of caching, maybe :-();
> but a reboot breaks it again

Yes, caching is in play - and so I suspect your bisect may be incorrect. Were you dropping caches or rebooting the NFS client between kernel versions of the NFS server? The problematic block of your qcow2 image may have been cached on the client.
Created attachment 1926369 [details] Trace of failing boot (nfs server 6.0.5-200.fc36.x86_64, qemu client 5.17.13-300.fc36.x86_64)
Created attachment 1926370 [details]
Comparison of data read (wireshark trace) and actual data file

It looks like there is a problem with page-sized reads that are not page-aligned.
Created attachment 1926440 [details] Handle case where len+offset crosses a page boundary in nfsd_splice_actor
(In reply to Benjamin Coddington from comment #7)
> I've tried to reproduce this, but could not. So, without Al looking at this
> and knowing /exactly/ what the problem is, my next step would be to narrow
> down the nfsd server's behavior to the corruption.

If the patch looks right to you, could you pass it on to Al?
(In reply to Anders Blomdell from comment #14)
> If the patch looks right to you, could you pass it on to Al?

I cannot figure out why your patch should fix up your problem, because I don't understand how you're having this problem. I cannot reproduce it. There must be something else at play that's preventing my setup from having your issue, and I'd really like to figure out what that could be. Here's how I am trying to reproduce it:

[root@bcodding-fedora ~]# uname -r
6.0.5-200.fc36.x86_64

I am using your offsets and read sizes in this script, but there is no difference in data:

#!/bin/bash

nfs_dd() {
    dd iflag=skip_bytes,direct if=/mnt/localhost/foo skip=$1 bs=$2 count=1 2>/dev/null | hexdump
}

local_dd() {
    dd iflag=skip_bytes if=/exports/foo skip=$1 bs=$2 count=1 2>/dev/null | hexdump
}

diff_block() {
    diff <(local_dd $1 $2) <(nfs_dd $1 $2)
}

dd if=/dev/random of=/exports/foo bs=4096 count=128
exportfs -o rw,no_root_squash localhost:/exports
mount -t nfs -ov4.1,sec=sys localhost:/exports /mnt/localhost

diff_block 0 16384
diff_block 0 16384
diff_block 16384 32768
diff_block 49152 65536
diff_block 0 112
diff_block 196608 800
diff_block 65536 65536
diff_block 112 8
diff_block 504 8
diff_block 262144 65536
diff_block 327680 512
diff_block 327680 512
diff_block 393216 512
diff_block 393728 4096
diff_block 397824 4096
diff_block 401920 4096
diff_block 406016 4096
diff_block 410112 4096
diff_block 414208 4096
diff_block 418304 4096
diff_block 422400 4096
diff_block 426496 4096
diff_block 430592 4096
diff_block 434688 4096
diff_block 438784 4096
diff_block 442880 4096
diff_block 446976 4096
diff_block 451072 4096
diff_block 455168 3584

umount /mnt/localhost
exportfs -ua
I have not been able to reproduce it with anything other than qemu+qcow2, but looking at attachment https://bugzilla.redhat.com/attachment.cgi?id=1926370 "Comparison of data read (wireshark trace) and actual data file", the problem is clearly visible for all the failed reads. What happens is that when a read starts at a non-zero page offset and ends at a non-zero offset of a later page, the page-loading loop misses the last page.
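The miscount described above comes down to how many pages a splice buffer touches. A user-space sketch of the arithmetic (pages_spanned / pages_spanned_buggy are illustrative names I made up, not the actual nfsd_splice_actor code): a loop that derives its trip count from the length alone comes up one page short exactly when offset + len crosses a page boundary.

```c
#define PAGE_SIZE 4096UL

/* Number of pages touched by a buffer of 'len' bytes that starts
 * 'offset' bytes into its first page: round the end of the buffer
 * (offset + len) up to a page boundary. */
static unsigned long pages_spanned(unsigned long offset, unsigned long len)
{
	return (offset + len + PAGE_SIZE - 1) / PAGE_SIZE;
}

/* Buggy variant for contrast: ignores the starting offset, so a
 * 4096-byte buffer starting at offset 2048 counts as one page even
 * though it spills into a second one. */
static unsigned long pages_spanned_buggy(unsigned long offset, unsigned long len)
{
	(void)offset;
	return (len + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

For example, 4096 bytes starting at in-page offset 2048 span two pages, while the length-only count claims one - which matches the missing tail data in the wireshark comparison.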
(In reply to Benjamin Coddington from comment #15)
> I cannot figure out why your patch should fix up your problem, because I
> don't understand how you're having this problem. I cannot reproduce it.
> There must be something else at play that's preventing my setup from having
> your issue, and I'd really like to figure out what that could be.
>
> I am using your offsets and read sizes in this script, but there is no
> difference in data:

Not a total surprise, since

  dd iflag=skip_bytes if=/tmp/image/test-03.root.fay-01.img skip=393728 bs=4096 count=1 2>/dev/null | wc

results in two nfs reads, each page-aligned (output is from printk's added in nfsd_splice_actor):

[  499.430658] offset: 0, length: 4096
[  499.430672] page: 0000000099ace5bf
[  499.430677] offset: 0, length: 4096
[  499.430681] page: 000000001daf5338
(In reply to Anders Blomdell from comment #17)
> Not a total surprise, since 'dd iflag=skip_bytes
> if=/tmp/image/test-03.root.fay-01.img skip=393728 bs=4096 count=1
> 2>/dev/null | wc' results in two nfs reads, each page-aligned (output is
> from printk's added in nfsd_splice_actor)

Yes, NFS will do page-aligned reads unless you use direct IO, which is why I use iflag=skip_bytes,direct for the NFS reads. I assume libvirt is using direct IO - that should result in the client issuing non-aligned read requests.
Sorry about that, now I get:

[  466.569949] offset: 2048, length: 2048
[  466.569963] page: 000000005d7f4195
[  466.569968] offset: 0, length: 2048
[  466.569972] page: 00000000f8d0f2cd
(In reply to Benjamin Coddington from comment #18)
> Yes, NFS will do page-aligned reads unless you use direct IO, which is why I
> use iflag=skip_bytes,direct for the NFS reads. I assume libvirt is using
> direct IO - that should result in the client issuing non-aligned read
> requests.

I have submitted it upstream (you should be on the CC list). I guess that somebody with greater insight will be able to create a reproducer...
This message is a reminder that Fedora Linux 36 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 36 on 2023-05-16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '36'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 36 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 36 entered end-of-life (EOL) status on 2023-05-16.

Fedora Linux 36 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field.

If you are unable to reopen this bug, please file a new report against an active release.

Thank you for reporting this bug and we are sorry it could not be fixed.