Bug 2142132 - Kernel 6.0.5 breaks virtual machines that boot from nfs mounted qcow2 images
Summary: Kernel 6.0.5 breaks virtual machines that boot from nfs mounted qcow2 images
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 36
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-11-11 18:33 UTC by Anders Blomdell
Modified: 2023-05-25 17:16 UTC
CC List: 27 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-25 17:16:26 UTC
Type: Bug
Embargoed:


Attachments
Log of git bisect (3.16 KB, text/plain), 2022-11-11 18:33 UTC, Anders Blomdell
working rpcdebug dump (192.74 KB, text/plain), 2022-11-11 20:15 UTC, Anders Blomdell
failed rpcdebug dump (34.12 KB, text/plain), 2022-11-11 20:16 UTC, Anders Blomdell
Full trace of nfs_read (54.10 KB, text/plain), 2022-11-14 15:30 UTC, Anders Blomdell
Trace of nfs activity for a working boot (5.19.4-200.fc36.x86_64) (229.32 KB, text/plain), 2022-11-14 19:47 UTC, Anders Blomdell
Trace of failing boot (nfs server 6.0.5-200.fc36.x86_64, qemu client 5.17.13-300.fc36.x86_64) (58.65 KB, text/plain), 2022-11-22 11:12 UTC, Anders Blomdell
Comparison of data read (wireshark trace) and actual data file (7.92 KB, text/plain), 2022-11-22 11:16 UTC, Anders Blomdell
Handle case where len+offset crosses a page boundary in nfsd_splice_actor (535 bytes, patch), 2022-11-22 16:16 UTC, Anders Blomdell

Description Anders Blomdell 2022-11-11 18:33:28 UTC
Created attachment 1923814 [details]
Log of git bisect

Description of problem:

Virtual machines that boot from NFS-mounted qcow2 images fail to boot when
the NFS server runs kernel 6.0.5.

The errors typically look like:

  Booting from Hard Disk...
  GRUB loading.
  Welcome to GRUB!

  error ../../grub-core/kern/dl.c:ELF header smaller than expected.
  Entering rescue mode...
  grub rescue>
 
Version-Release number of selected component (if applicable):

  qemu-6.2.0-16.fc36.x86_64
  kernel-6.0.5-200.fc36.x86_64

How reproducible:

  always

Steps to Reproduce:
1. Run a virtual machine whose qcow2-formatted boot disk is mounted from an NFS server
   running kernel-6.0.5-200.fc36.x86_64

Actual results:

Machine fails to boot

Expected results:

Clean boot

Additional info: 

The git repo is linux-stable.

A git bisect lands me at 310d9d5a5009a93377200b98daa2d84aa2bd8160.

So commit 746de1f86fcd works, but cherry-picking f0f6b614f83d, 310d9d5a5009 and bfbfb6182ad1 on top of 746de1f86fcd is broken.
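
For reference, that check amounts to roughly the following (the branch name and the build/boot steps are illustrative, not taken from the bisect log):

  git checkout -b bisect-test 746de1f86fcd        # last known-good commit
  git cherry-pick f0f6b614f83d 310d9d5a5009 bfbfb6182ad1
  # build and install this kernel on the NFS server, reboot it,
  # then retry the VM boot -> the GRUB error above reappears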

bfbfb6182ad1 is picked because of its commit message:

commit bfbfb6182ad1d7d184b16f25165faad879147f79
Author: Al Viro <viro.org.uk>
Date:   Sat Sep 10 22:14:02 2022 +0100

    nfsd_splice_actor(): handle compound pages
    
    pipe_buffer might refer to a compound page (and contain more than a PAGE_SIZE
    worth of data).  Theoretically it had been possible since way back, but
    nfsd_splice_actor() hadn't run into that until copy_page_to_iter() change.
    Fortunately, the only thing that changes for compound pages is that we
    need to stuff each relevant subpage in and convert the offset into offset
    in the first subpage.
    
    Acked-by: Chuck Lever <chuck.lever>
    Tested-by: Benjamin Coddington <bcodding>
    Fixes: f0f6b614f83d "copy_page_to_iter(): don't split high-order page in case of ITER_PIPE"
    Signed-off-by: Al Viro <viro.org.uk>

Comment 1 Anders Blomdell 2022-11-11 20:15:29 UTC
Created attachment 1923830 [details]
working rpcdebug dump

Comment 2 Anders Blomdell 2022-11-11 20:16:19 UTC
Created attachment 1923831 [details]
failed rpcdebug dump

Comment 3 Benjamin Coddington 2022-11-14 14:12:28 UTC
Hi Anders, thanks for the report.  What's the backing filesystem exported by nfsd where the VM's qcow2 image resides?

It might be useful to see some trace output from nfsd (if you can get your server to only service the one VM).  I'm interested in offsets/lengths when the problem occurs:

grep nfsd_read /sys/kernel/debug/tracing/available_events > /sys/kernel/debug/tracing/set_event

.. reproduce, then send the last few lines of /sys/kernel/debug/tracing/trace.

Turn that tracing back off with:

> /sys/kernel/debug/tracing/set_event
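
Put together, that's roughly the following sequence (keeping 50 trace lines is just an example):

  grep nfsd_read /sys/kernel/debug/tracing/available_events > /sys/kernel/debug/tracing/set_event
  # ... reproduce the failing VM boot ...
  tail -n 50 /sys/kernel/debug/tracing/trace
  echo > /sys/kernel/debug/tracing/set_event    # disable the tracepoints again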

Comment 4 Anders Blomdell 2022-11-14 15:30:04 UTC
Created attachment 1924258 [details]
Full trace of nfs_read

From my bisecting and code inspection, it looks like the second part of commit 310d9d5a5009 is to blame, and it also looks suspiciously simpler than the iov_iter_bvec_advance (called from iov_iter_advance) it's meant to replace.

Comment 5 Anders Blomdell 2022-11-14 16:45:29 UTC
(In reply to Benjamin Coddington from comment #3)
> Hi Anders, thanks for the report.  What's the backing filesystem exported by
> nfsd where the VM's qcow2 image resides?

xfs

Sorry for missing that question in my previous reply!

Comment 6 Anders Blomdell 2022-11-14 19:47:48 UTC
Created attachment 1924329 [details]
Trace of nfs activity for a working boot (5.19.4-200.fc36.x86_64)

Sad to say, the reads look exactly the same up to the point where the previous dump stopped during GRUB loading :-(

Comment 7 Benjamin Coddington 2022-11-15 15:27:06 UTC
(In reply to Anders Blomdell from comment #4)
> From my bisecting and code inspection, it looks like the second part of
> commit 310d9d5a5009 is to blame, and it also looks suspiciously simpler than
> the iov_iter_bvec_advance (called from iov_iter_advance) it's meant to
> replace.

.. but how is nfsd getting in there?  I don't see the call graph.. maybe from net/core/ somewhere?

What arch is this?

Can you do something like:

@@ -1458,6 +1461,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
                p = *pages;
                for (int k = 0; k < n; k++)
                        get_page(p[k] = page + k);
+
+               if (current->comm == "nfsd")
+                       trace_dump_stack(0);
+

I've tried to reproduce this, but could not.  So, without Al looking at this and knowing /exactly/ what the problem is, my next step would be to narrow down the nfsd server's behavior to the corruption.

I wonder if you can reproduce the corruption by just reading that single 512 block.

Can you compare reading that last 512 block from the NFS client and NFS server and verify the server is corrupting it?

Something like this on the client: dd skip=5659199 if=/mnt/server/qcow iflag=direct count=1
Something like this on the server: dd skip=5659199 if=/exports/xfs/qcow iflag=direct count=1

However, I'd doubt the single 512-byte block is triggering this -- rather it's one of those 65536-byte reads earlier, so we'd be dealing with multiple pages or a compound page.  If you can reproduce it with a single NFS READ, you could insert a panic and we could take apart the vmcore.

Comment 8 Benjamin Coddington 2022-11-15 15:31:38 UTC
(In reply to Benjamin Coddington from comment #7)

> +
> +               if (current->comm == "nfsd")
> +                       trace_dump_stack(0);
> +

.. er no, sorry ^^ that's not going to work -- use something like:

if (strcmp(current->comm, "nfsd") == 0)

Comment 9 Anders Blomdell 2022-11-16 14:59:15 UTC
(In reply to Benjamin Coddington from comment #7)
> (In reply to Anders Blomdell from comment #4)
> > From my bisecting and code inspection, it looks like the second part of
> > commit 310d9d5a5009 is to blame, and it also looks suspiciously simpler than
> > the iov_iter_bvec_advance (called from iov_iter_advance) it's meant to
> > replace.
> 
> .. but how is nfsd getting in there?  I don't see the call graph.. maybe
> from net/core/ somewhere?
> 
> What arch is this?
x86_64

> Can you do something like:
> 
> @@ -1458,6 +1461,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct
> iov_iter *i,
>                 p = *pages;
>                 for (int k = 0; k < n; k++)
>                         get_page(p[k] = page + k);
> +
> +               if (current->comm == "nfsd")
> +                       trace_dump_stack(0);
> +
> 
> I've tried to reproduce this, but could not.  So, without Al looking at this
> and knowing /exactly/ what the problem is, my next step would be to narrow
> down the nfsd server's behavior to the corruption.
> 
> I wonder if you can reproduce the corruption by just reading that single 512
> block.
> 
> Can you compare reading that last 512 block from the NFS client and NFS
> server and verify the server is corrupting it?

  Copying the image to another NFS server (5.19) and mounting it from there works.

  Copying the VM to another machine (5.19.4) and running it against the 6.0.5
  NFS server still fails.

> Something like this on the client: dd skip=5659199 if=/mnt/server/qcow
> iflag=direct count=1
> Something like this on the server: dd skip=5659199 if=/exports/xfs/qcow
> iflag=direct count=1
> 
> However, I'd doubt the single 512 block is triggering this -- rather its one
> of those 65536 reads earlier, so we'd be dealing with multiple pages or a
> compound.  If you can reproduce it with a single NFS READ, you could insert
> a panic and we could take apart the vmcore.

I have set up a reproducer on 6.0.5 on a new machine, and it seems like a fresh mount of
the image is needed. These are the hooks I use:

/etc/libvirt/hooks/qemu:

#!/bin/bash

if [ -x "/etc/libvirt/hooks/qemu.d/$1.hook" ] ; then
    exec "/etc/libvirt/hooks/qemu.d/$1.hook" "$1" "$2"
fi


/etc/libvirt/hooks/qemu.d/test-03.hook:

#!/bin/bash
(
    set -x
    echo $@
    
    mkdir -p /tmp/image
    if [[ "$2" == "prepare" ]] ; then
	if ! mountpoint -q /tmp/image ; then
	    mount [2001:470:28:988:1401:2310:4100:0]:/xfs \
		  /tmp/image
	fi
    elif [[ "$2" == "release" ]] ; then
	umount /tmp/image
    fi
) 2>&1 | logger -t debugging

And to make matters worse, it sometimes starts working after some tests, where /sys/kernel/debug/tracing/trace doesn't show any
traffic but the image is (in this case) correctly read (the curse of caching, maybe :-(); a reboot breaks it again, though.

Comment 10 Benjamin Coddington 2022-11-16 15:31:36 UTC
(In reply to Anders Blomdell from comment #9)
 
> And to make matters worse, it sometimes starts working after some tests,
> where  /sys/kernel/debug/tracing/trace doesn't show any
> traffic, but the image is (in this case) correctly read (the curse of
> caching, maybe :-(); but a reboot breaks it again

Yes, caching is in play - and so I suspect your bisect may be incorrect.  Were you dropping caches or rebooting the NFS client between kernel versions of the NFS server?  The problematic block of your qcow2 image may have been cached on the client.
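
For reference, one generic way to take the client's cache out of the picture between runs (a standard technique, not something shown elsewhere in this bug):

  # on the NFS client, as root, before each boot attempt
  echo 3 > /proc/sys/vm/drop_caches    # drop pagecache, dentries and inodes
  # or simply umount/mount the export so the client starts cold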

Comment 11 Anders Blomdell 2022-11-22 11:12:36 UTC
Created attachment 1926369 [details]
Trace of failing boot (nfs server 6.0.5-200.fc36.x86_64, qemu client 5.17.13-300.fc36.x86_64)

Comment 12 Anders Blomdell 2022-11-22 11:16:43 UTC
Created attachment 1926370 [details]
Comparison of data read (wireshark trace) and actual data file

It looks like there is a problem with page-sized reads that are not page aligned.

Comment 13 Anders Blomdell 2022-11-22 16:16:40 UTC
Created attachment 1926440 [details]
Handle case where len+offset crosses a page boundary in nfsd_splice_actor

Comment 14 Anders Blomdell 2022-11-22 16:20:53 UTC
(In reply to Benjamin Coddington from comment #7)
> I've tried to reproduce this, but could not.  So, without Al looking at this
> and knowing /exactly/ what the problem is, my next step would be to narrow
> down the nfsd server's behavior to the corruption.
If the patch looks right to you, could you pass it on to Al?

Comment 15 Benjamin Coddington 2022-11-22 20:54:48 UTC
(In reply to Anders Blomdell from comment #14)
> (In reply to Benjamin Coddington from comment #7)
> > I've tried to reproduce this, but could not.  So, without Al looking at this
> > and knowing /exactly/ what the problem is, my next step would be to narrow
> > down the nfsd server's behavior to the corruption.
> If the patch looks right to you, could you pass it on to Al?

I cannot figure out why your patch should fix up your problem, because I don't understand how you're having this problem.  I cannot reproduce it.  There must be something else at play that's preventing my setup from having your issue, and I'd really like to figure out what that could be.  Here's how I am trying to reproduce it:

[root@bcodding-fedora ~]# uname -r
6.0.5-200.fc36.x86_64

I am using your offsets and read sizes in this script, but there is no difference in data:

#!/bin/bash

nfs_dd() {
	dd iflag=skip_bytes,direct if=/mnt/localhost/foo skip=$1 bs=$2 count=1 2>/dev/null | hexdump 
}

local_dd() {
	dd iflag=skip_bytes if=/exports/foo skip=$1 bs=$2 count=1 2>/dev/null | hexdump 
}

diff_block() {
	diff <(local_dd $1 $2) <(nfs_dd $1 $2)
}

dd if=/dev/random of=/exports/foo bs=4096 count=128

exportfs -o rw,no_root_squash localhost:/exports
mount -t nfs -ov4.1,sec=sys localhost:/exports /mnt/localhost

diff_block 0      16384
diff_block 0      16384
diff_block 16384  32768
diff_block 49152  65536
diff_block 0      112
diff_block 196608 800
diff_block 65536  65536
diff_block 112    8
diff_block 504    8
diff_block 262144 65536
diff_block 327680 512
diff_block 327680 512
diff_block 393216 512
diff_block 393728 4096 
diff_block 397824 4096 
diff_block 401920 4096 
diff_block 406016 4096 
diff_block 410112 4096 
diff_block 414208 4096 
diff_block 418304 4096 
diff_block 422400 4096 
diff_block 426496 4096 
diff_block 430592 4096 
diff_block 434688 4096 
diff_block 438784 4096 
diff_block 442880 4096 
diff_block 446976 4096 
diff_block 451072 4096 
diff_block 455168 3584 

umount /mnt/localhost
exportfs -ua

Comment 16 Anders Blomdell 2022-11-23 06:31:04 UTC
I have not been able to reproduce it with anything other than qemu+qcow2, but looking at attachment https://bugzilla.redhat.com/attachment.cgi?id=1926370 "Comparison of data read (wireshark trace) and actual data file", the problem is clearly visible for all the failed reads. What happens is that when a read starts at a non-zero page offset and ends at a non-zero offset in a later page, the page-loading loop misses the last page (see the sketch below).
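
A rough sketch of the corrected loop, following the shape of fs/nfsd/vfs.c:nfsd_splice_actor() around v6.0; it mirrors what the attached patch is described as doing, but is not the 535-byte patch verbatim:

  static int nfsd_splice_actor(struct pipe_inode_info *pipe,
                               struct pipe_buffer *buf, struct splice_desc *sd)
  {
          struct svc_rqst *rqstp = sd->u.data;
          struct page *page = buf->page;  /* may be a compound page */
          unsigned offset = buf->offset;
          /* last page this chunk actually touches; the broken variant looped
           * "for (int i = sd->len; i > 0; i -= PAGE_SIZE)" over sd->len bytes,
           * ignoring the start offset, so e.g. offset 2048 + length 4096 spans
           * two pages but only one page gets replaced */
          struct page *last_page = page + (offset + sd->len - 1) / PAGE_SIZE;

          for (page += offset / PAGE_SIZE; page <= last_page; page++)
                  svc_rqst_replace_page(rqstp, page);
          if (rqstp->rq_res.page_len == 0)        /* first call */
                  rqstp->rq_res.page_base = offset % PAGE_SIZE;
          rqstp->rq_res.page_len += sd->len;
          return sd->len;
  }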

Comment 17 Anders Blomdell 2022-11-23 08:08:31 UTC
(In reply to Benjamin Coddington from comment #15)
> (In reply to Anders Blomdell from comment #14)
> > (In reply to Benjamin Coddington from comment #7)
> > > I've tried to reproduce this, but could not.  So, without Al looking at this
> > > and knowing /exactly/ what the problem is, my next step would be to narrow
> > > down the nfsd server's behavior to the corruption.
> > If the patch looks right to you, could you pass it on to Al?
> 
> I cannot figure out why your patch should fix up your problem, because I
> don't understand how you're having this problem.  I cannot reproduce it. 
> There must be something else at play that's preventing my setup from having
> your issue, and I'd really like to figure out what that could be.  Here's
> how I am trying to reproduce it:
> 
> [root@bcodding-fedora ~]# uname -r
> 6.0.5-200.fc36.x86_64
> 
> I am using your offsets and read sizes in this script, but there is no
> difference in data:
> 
> #!/bin/bash
> 
> nfs_dd() {
> 	dd iflag=skip_bytes,direct if=/mnt/localhost/foo skip=$1 bs=$2 count=1
> 2>/dev/null | hexdump 
> }
> 
> local_dd() {
> 	dd iflag=skip_bytes if=/exports/foo skip=$1 bs=$2 count=1 2>/dev/null |
> hexdump 
> }
> 
> diff_block() {
> 	diff <(local_dd $1 $2) <(nfs_dd $1 $2)
> }
> 
Not a total surprise, since 'dd iflag=skip_bytes if=/tmp/image/test-03.root.fay-01.img skip=393728 bs=4096 count=1 2>/dev/null | wc' results in
two NFS reads, each page-aligned (output is from printks added in nfsd_splice_actor):

  [  499.430658] offset: 0, length: 4096
  [  499.430672] page: 0000000099ace5bf
  [  499.430677] offset: 0, length: 4096
  [  499.430681] page: 000000001daf5338
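
For context, output in that shape could come from debug statements along these lines (hypothetical; the actual printks aren't shown in this bug):

  /* instrumentation sketch near the top of nfsd_splice_actor() */
  pr_info("offset: %u, length: %u\n", buf->offset, sd->len);
  pr_info("page: %p\n", buf->page);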

Comment 18 Benjamin Coddington 2022-11-23 11:15:05 UTC
(In reply to Anders Blomdell from comment #17)
> Not a total surprise, since 'dd iflag=skip_bytes
> if=/tmp/image/test-03.root.fay-01.img skip=393728 bs=4096 count=1
> 2>/dev/null | wc' results in
> two nfs reads, each page-aligned (output is from printk's added in
> nfsd_splice_actor):

Yes, NFS will do page-aligned reads unless you use direct IO, which is why I use iflag=skip_bytes,direct for the NFS reads.  I assume libvirt is using direct IO - that should result in the client issuing non-aligned read requests.

Comment 19 Anders Blomdell 2022-11-23 12:09:10 UTC
Sorry about that, now I get:

[  466.569949] offset: 2048, length: 2048
[  466.569963] page: 000000005d7f4195
[  466.569968] offset: 0, length: 2048
[  466.569972] page: 00000000f8d0f2cd

Comment 20 Anders Blomdell 2022-11-23 12:12:02 UTC
(In reply to Benjamin Coddington from comment #18)
> (In reply to Anders Blomdell from comment #17)
> > Not a total surprise, since 'dd iflag=skip_bytes
> > if=/tmp/image/test-03.root.fay-01.img skip=393728 bs=4096 count=1
> > 2>/dev/null | wc' results in
> > two nfs reads, each page-aligned (output is from printk's added in
> > nfsd_splice_actor):
> 
> Yes, NFS will do page-aligned reads unless you use direct IO, which is why I
> use iflag=skip_bytes,direct for the NFS reads.  I assume libvirt is using
> direct IO - that should result in the client issuing non-aligned read
> requests.

I have submitted the patch upstream (you should be on the CC list); I guess somebody with greater insight will be able to create a reproducer...

Comment 21 Ben Cotton 2023-04-25 18:12:03 UTC
This message is a reminder that Fedora Linux 36 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 36 on 2023-05-16.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '36'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 36 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 22 Ludek Smid 2023-05-25 17:16:26 UTC
Fedora Linux 36 entered end-of-life (EOL) status on 2023-05-16.

Fedora Linux 36 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

