Bug 2028370 - [xfstests/nfs generic/476] test never finishes
Summary: [xfstests/nfs generic/476] test never finishes
Keywords:
Status: CLOSED ERRATA
Alias: None
Deadline: 2022-08-15
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: 9.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 9.1
Assignee: Jeff Layton
QA Contact: Boyang Xue
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-02 08:32 UTC by Boyang Xue
Modified: 2022-11-15 12:12 UTC
CC: 6 users

Fixed In Version: kernel-5.14.0-150.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-15 10:51:55 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/centos-stream/src/kernel centos-stream-9 merge_requests 1258 0 None opened nfs: fix hung DIO writes in -ENOSPC conditions 2022-08-11 11:40:16 UTC
Red Hat Issue Tracker RHELPLAN-104525 0 None None None 2021-12-02 08:38:31 UTC
Red Hat Product Errata RHSA-2022:8267 0 None None None 2022-11-15 10:52:09 UTC

Description Boyang Xue 2021-12-02 08:32:25 UTC
Description of problem:
xfstests generic/476 keeps running and never finishes on all RHEL-9 kernels up to the latest version (kernel-5.14.0-22.el9), with all NFS versions. The same test finishes with "pass" on RHEL-8 in less than 10 minutes.

local.config
```
[root@kvm108 ~]# cat /tmp/tmp.7XeinNduCS/repo_xfstests/local.config
FSTYP="nfs"
TEST_DIR="/test"
TEST_DEV="localhost:/test_TMP"
SCRATCH_MNT="/scratch"
SCRATCH_DEV="localhost:/scratch_TMP"
LOGWRITES_MNT="/logwrites"
LOGWRITES_DEV="/dev/vda6"
MKFS_OPTIONS="-b 4096"
MOUNT_OPTIONS="-o vers=4.2"
TEST_FS_MOUNT_OPTS="-o vers=4.2"
```

As a simplified reproducer, we can reproduce it with fsstress:

fsstress -w -d /scratch -n 100000 -p 4 -s 1638893987 # /scratch was mounted with 
"mount -t nfs -o vers=4.2 localhost:/scratch_TMP /scratch"

After running for hours, it looks like fsstress gets stuck in nfs_direct_wait():
# ps uxf
root        3620  0.0  0.1   7548  4352 pts/0    Ss   Nov30   0:00  |       \_ -bash
root        3751  0.0  0.0   2716   940 pts/0    S+   Nov30   0:00  |           \_ ltp/fsstress -w -d /scratch -n 100000 -p 4 -s 1638893987
root        3752  0.0  0.1   6716  5396 pts/0    D+   Nov30   0:41  |               \_ ltp/fsstress -w -d /scratch -n 100000 -p 4 -s 1638893987
root        3753  0.0  0.1   6748  5396 pts/0    D+   Nov30   0:40  |               \_ ltp/fsstress -w -d /scratch -n 100000 -p 4 -s 1638893987
root        3754  0.0  0.1   6792  5376 pts/0    D+   Nov30   0:40  |               \_ ltp/fsstress -w -d /scratch -n 100000 -p 4 -s 1638893987
root        3755  0.0  0.1   6708  5348 pts/0    S+   Nov30   0:40  |               \_ ltp/fsstress -w -d /scratch -n 100000 -p 4 -s 1638893987
[root@kvm108 ~]# cat /proc/3753/stack
[<0>] nfs_direct_wait+0x19/0x40 [nfs]
[<0>] nfs_file_direct_write+0x245/0x340 [nfs]
[<0>] new_sync_write+0x11c/0x1b0
[<0>] vfs_write+0x1be/0x250
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x3b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

There are no relevant kernel messages when checking "journalctl -k".

Version-Release number of selected component (if applicable):
kernel-5.14.0-22.el9

How reproducible:
95%

Steps to Reproduce:
1. Run xfstests generic/476 on nfs
2.
3.

Actual results:
Test never finishes

Expected results:
Test pass

Additional info:
N/A

Comment 1 xiagao 2022-01-07 06:53:31 UTC
I also hit this issue on virtiofs with xfstests + nfs.

nfs info:
127.0.0.1:/mnt/virtio_fs1_test_nfs on /tmp/virtio_fs1_test type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)
127.0.0.1:/mnt/virtio_fs2_test_nfs on /tmp/virtio_fs2_test type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)

pkg version:
kernel-5.14.0-39.el9.x86_64(guest & host)
qemu-kvm-6.2.0-1.el9.x86_64


Besides the generic/476 test case, generic/531 also never finished on virtiofs.

@bxue, does generic/531 work on your side?

Comment 2 Boyang Xue 2022-01-07 09:04:12 UTC
(In reply to xiagao from comment #1)
> I also hit this issue on virtiofs with xfstest + nfs.
> 
> nfs info:
> 127.0.0.1:/mnt/virtio_fs1_test_nfs on /tmp/virtio_fs1_test type nfs4
> (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,
> timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.
> 0.1)
> 127.0.0.1:/mnt/virtio_fs2_test_nfs on /tmp/virtio_fs2_test type nfs4
> (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,
> timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.
> 0.1)
> 
> pkg version:
> kernel-5.14.0-39.el9.x86_64(guest & host)
> qemu-kvm-6.2.0-1.el9.x86_64
>

Thanks for sharing this information!
 
> 
> Besides generic/476 test case, generic/531 also never finished on virtiofs.
> 
> @bxue, does generic/531 work in your side?

There are some runs where generic/531 hangs on NFS, for example:
https://beaker.engineering.redhat.com/recipes/11180377 # kernel-5.14.0-37.el9, x86_64
But it's not 100%; at least all my runs on an x86_64 VM didn't hit this problem.

Comment 4 xiagao 2022-01-14 01:29:31 UTC
(In reply to Boyang Xue from comment #2)
> (In reply to xiagao from comment #1)
> > I also hit this issue on virtiofs with xfstest + nfs.
> > 
> > nfs info:
> > 127.0.0.1:/mnt/virtio_fs1_test_nfs on /tmp/virtio_fs1_test type nfs4
> > (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,
> > timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.
> > 0.1)
> > 127.0.0.1:/mnt/virtio_fs2_test_nfs on /tmp/virtio_fs2_test type nfs4
> > (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,
> > timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.
> > 0.1)
> > 
> > pkg version:
> > kernel-5.14.0-39.el9.x86_64(guest & host)
> > qemu-kvm-6.2.0-1.el9.x86_64
> >
> 
> Thanks for sharing these information!
>  
> > 
> > Besides generic/476 test case, generic/531 also never finished on virtiofs.
> > 
> > @bxue, does generic/531 work in your side?
> 
> There's some runs that generic/531 hangs on nfs, like
> https://beaker.engineering.redhat.com/recipes/11180377 #
> kernel-5.14.0-37.el9, x86_64
> But it's not 100%, at least all my runs on x86_64 VM didn't hit this problem.

Boyang, thanks for your quick response. In my test environment, generic/650 also seems to hit the same issue.

Comment 5 Boyang Xue 2022-01-14 05:37:15 UTC
(In reply to xiagao from comment #4)
> (In reply to Boyang Xue from comment #2)
> > (In reply to xiagao from comment #1)
> > > I also hit this issue on virtiofs with xfstest + nfs.
> > > 
> > > nfs info:
> > > 127.0.0.1:/mnt/virtio_fs1_test_nfs on /tmp/virtio_fs1_test type nfs4
> > > (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,
> > > timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.
> > > 0.1)
> > > 127.0.0.1:/mnt/virtio_fs2_test_nfs on /tmp/virtio_fs2_test type nfs4
> > > (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,
> > > timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.
> > > 0.1)
> > > 
> > > pkg version:
> > > kernel-5.14.0-39.el9.x86_64(guest & host)
> > > qemu-kvm-6.2.0-1.el9.x86_64
> > >
> > 
> > Thanks for sharing these information!
> >  
> > > 
> > > Besides generic/476 test case, generic/531 also never finished on virtiofs.
> > > 
> > > @bxue, does generic/531 work in your side?
> > 
> > There's some runs that generic/531 hangs on nfs, like
> > https://beaker.engineering.redhat.com/recipes/11180377 #
> > kernel-5.14.0-37.el9, x86_64
> > But it's not 100%, at least all my runs on x86_64 VM didn't hit this problem.
> 
> Boyang, thanks for your quick response. And in my test env generic/650 seems
> also experienced the same issue.

Yes, I have a few generic/650 runs that never finished, but other runs had no problem.
Since generic/476, generic/531, and generic/650 are all stress tests, which depend heavily on the spec/performance of the system, I think we have to take a closer look to confirm they're real problems in general.
For generic/476, the test almost never finishes in my testing (close to 100% of runs), but for generic/531 and generic/650 I still have batches of jobs that finished without problems.

Comment 7 J. Bruce Fields 2022-01-27 15:13:59 UTC
That's a test I usually skip upstream, unfortunately, because it takes an hour or so on my test setup.

Are you booting both client and server to the same kernel?

If so one simple way to narrow down the problem would be to try each combination (RHEL8 client, RHEL9 server, etc.).

Comment 10 Hu Shuai (Fujitsu) 2022-03-14 06:05:57 UTC
Hi, I tested xfstests/nfs generic/476 on fx700 with RHEL-9.0.0-20220307.0 (aarch64), and it passed.
`generic/476 1795s ...  1783s`

Comment 11 xiagao 2022-03-16 02:17:36 UTC
xfstests/nfs: generic/476 also passed on x86_64 (kernel-5.14.0-70.el9.x86_64).

2022-03-15 01:28:32: generic/476
2022-03-15 01:37:01:  509s

But generic/531 never finished.

Comment 12 Boyang Xue 2022-03-17 01:26:26 UTC
generic/476 still hangs in my tests on kernel-5.14.0-72.el9 on all arches, in both multi-host and single-host configurations. The local.config for multi-host looks like:

```
FSTYP=nfs
TEST_DEV=sweetpig-17.4a2m.lab.eng.bos.redhat.com:/mnt/xfstests/test/nfs-server
TEST_DIR=/mnt/xfstests/test/nfs-client
SCRATCH_DEV=sweetpig-17.4a2m.lab.eng.bos.redhat.com:/mnt/xfstests/scratch/nfs-server
SCRATCH_MNT=/mnt/xfstests/scratch/nfs-client
MOUNT_OPTIONS="-o vers=4.2"
TEST_FS_MOUNT_OPTS="-o vers=4.2"
```

Comment 13 Jeff Layton 2022-07-12 11:14:58 UTC
generic/476 works just fine for me:

[jlayton@centos9 xfstests-dev]$ sudo ./check generic/476
FSTYP         -- nfs
PLATFORM      -- Linux/x86_64 centos9 5.14.0-125.mr1113_220707_1634.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 7 16:48:49 UTC 2022
MKFS_OPTIONS  -- ceph:/export/scratch
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 ceph:/export/scratch /mnt/scratch

generic/476        1512s
Ran: generic/476
Passed all 1 tests

My local.config:

export TEST_DIR=/mnt/test
export SCRATCH_MNT=/mnt/scratch
export FSTYP=nfs
export TEST_DEV="ceph:/export/test"
export SCRATCH_DEV="ceph:/export/scratch"

I didn't explicitly specify '-o vers=4.2' but that was what the client and server negotiated during the test. Boyang, are you still seeing problems with this test?

Comment 14 Boyang Xue 2022-07-13 11:13:30 UTC
Yes. I'm still seeing this problem.

[root@kvm103 ~]# cat repo_xfstests-dev/local.config
FSTYP=nfs
TEST_DIR=/test
TEST_DEV=localhost:/test_TMP
SCRATCH_MNT=/scratch
SCRATCH_DEV=localhost:/scratch_TMP
LOGWRITES_MNT=/logwrites
LOGWRITES_DEV=/dev/vda6
#MKFS_OPTIONS="-b 4096"
MOUNT_OPTIONS="-o vers=4.2"
TEST_FS_MOUNT_OPTS="-o vers=4.2"

[root@kvm103 repo_xfstests-dev]# ./check -nfs generic/476
FSTYP         -- nfs
PLATFORM      -- Linux/x86_64 kvm103 5.14.0-127.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Jul 9 08:20:22 EDT 2022
MKFS_OPTIONS  -- localhost:/scratch_TMP
MOUNT_OPTIONS -- -o vers=4.2 -o context=system_u:object_r:root_t:s0 localhost:/scratch_TMP /scratch

generic/476
(...stuck here...)

# ps uxf
root        3722  0.0  0.2  10064  6540 pts/0    S+   04:37   0:00  |           \_ /bin/bash ./check -nfs generic/476
root        3980  0.0  0.2   6724  5544 pts/0    S+   04:38   0:00  |               \_ /bin/bash /root/repo_xfstests-dev/tests/generic/476
root        4149  0.0  0.0   2728   832 pts/0    S    04:38   0:00  |                   \_ ./ltp/fsstress -w -d /scratch -n 100000 -p 4
root        4151  0.1  0.2   6784  5320 pts/0    D    04:38   0:44  |                       \_ ./ltp/fsstress -w -d /scratch -n 100000 -p 4
root        4152  0.1  0.2   6760  5240 pts/0    D    04:38   0:42  |                       \_ ./ltp/fsstress -w -d /scratch -n 100000 -p 4

[root@kvm103 ~]# cat /proc/4151/stack 
[<0>] nfs_direct_wait+0x19/0x40 [nfs]
[<0>] nfs_file_direct_write+0x248/0x340 [nfs]
[<0>] new_sync_write+0x11f/0x1b0
[<0>] vfs_write+0x1eb/0x280
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x5c/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

This was running with a localhost-exported NFS share. I'm now running the same test against a remotely exported NFS share to see if the result changes.

Comment 15 Jeff Layton 2022-07-13 12:26:01 UTC
That stack trace shows the client waiting on a WRITE reply from the server. What we'd probably need for you to do is track down the nfsd thread that is servicing the request and see why it's stuck.

Running tests on an NFS server exported over localhost can be problematic for anything that stresses the memory on the machine. It's quite possible to hit a deadlock where the server needs memory to satisfy a request, and tries to flush dirty NFS pages to free some up, but the server is still stuck waiting for memory and can't process the request.

In general, running simple functionality tests over localhost that don't stress the client or server is usually fine, but you want to avoid running anything too intensive in this sort of configuration to avoid that sort of deadlock.
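A quick way to track down what the nfsd threads are doing on the server is a loop like the sketch below; it assumes the in-kernel NFS server, whose worker threads appear as kernel threads named "nfsd":

```
# Dump the kernel stack of every nfsd thread to see where requests are stuck.
for pid in $(pgrep -x nfsd); do
    echo "=== nfsd pid $pid ==="
    cat /proc/$pid/stack
done
```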

Comment 16 Boyang Xue 2022-07-14 04:50:06 UTC
Hi Jeff,

Thanks for the explaination. I'm trying to provide more information here.

First, here's the localhost configuration: when the test gets stuck as described in comment #14, the stack trace of the nfsd thread keeps changing rather than staying fixed.
```
[root@kvm103 ~]# cat /proc/3716/stack 
[<0>] svc_get_next_xprt+0x10b/0x1d0 [sunrpc]
[<0>] svc_recv+0x8e/0x1b0 [sunrpc]
[<0>] nfsd+0xdb/0x150 [nfsd]
[<0>] kthread+0x149/0x170
[<0>] ret_from_fork+0x22/0x30
[root@kvm103 ~]# cat /proc/3716/stack 
[<0>] xlog_wait_on_iclog+0x117/0x180 [xfs]
[<0>] xfs_log_force_seq+0x8f/0x150 [xfs]
[<0>] xfs_file_fsync+0x149/0x210 [xfs]
[<0>] xfs_file_buffered_write+0x25d/0x2a0 [xfs]
[<0>] do_iter_readv_writev+0x152/0x1b0
[<0>] do_iter_write+0x7d/0x150
[<0>] nfsd_vfs_write+0x10b/0x680 [nfsd]
[<0>] nfsd4_write+0x143/0x1d0 [nfsd]
[<0>] nfsd4_proc_compound+0x3b2/0x6c0 [nfsd]
[<0>] nfsd_dispatch+0x149/0x270 [nfsd]
[<0>] svc_process_common+0x3b8/0x5e0 [sunrpc]
[<0>] svc_process+0xb7/0xf0 [sunrpc]
[<0>] nfsd+0xed/0x150 [nfsd]
[<0>] kthread+0x149/0x170
[<0>] ret_from_fork+0x22/0x30
[root@kvm103 ~]# cat /proc/3716/stack 
[<0>] __flush_work.isra.0+0x160/0x220
[<0>] xlog_cil_force_seq+0x6d/0x210 [xfs]
[<0>] xfs_log_force_seq+0x72/0x150 [xfs]
[<0>] xfs_file_fsync+0x149/0x210 [xfs]
[<0>] nfsd_commit+0x1a1/0x200 [nfsd]
[<0>] nfsd4_proc_compound+0x3b2/0x6c0 [nfsd]
[<0>] nfsd_dispatch+0x149/0x270 [nfsd]
[<0>] svc_process_common+0x3b8/0x5e0 [sunrpc]
[<0>] svc_process+0xb7/0xf0 [sunrpc]
[<0>] nfsd+0xed/0x150 [nfsd]
[<0>] kthread+0x149/0x170
[<0>] ret_from_fork+0x22/0x30

```
There are 50% write errors on SCRATCH_MNT:
```
[root@kvm103 ~]# nfsiostat 

localhost:/test_TMP mounted on /test:

           ops/s       rpc bklog
        2871.623           0.000

read:              ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                   0.000           0.000           0.000        0 (0.0%)           0.000           0.000           0.000        0 (0.0%)
write:             ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                   0.000           0.000           0.000        0 (0.0%)           0.000           0.000           0.000        0 (0.0%)

localhost:/scratch_TMP mounted on /scratch:

           ops/s       rpc bklog
        2871.623           0.000

read:              ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                   0.341          37.267         109.279        0 (0.0%)           1.155           1.236           0.034        0 (0.0%)
write:             ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                1884.265      752077.789         399.136        0 (0.0%)           0.714           0.729           0.00781485438 (50.0%)
```

Second, when I run the test with the multi-host configuration, the test gets stuck in some of the runs:
https://beaker.engineering.redhat.com/jobs/6809647 # stuck on x86_64 and ppc64le
https://beaker.engineering.redhat.com/jobs/6809643 # stuck on aarch64 and ppc
https://beaker.engineering.redhat.com/jobs/6809639 # stuck on x86_64 and ppc64le

The nfsd stack traces and the output of `nfsiostat` are very similar to those in the localhost configuration.

Please let me know if there is any other information I can provide.

Comment 17 Jeff Layton 2022-07-14 11:36:43 UTC
If the stack trace is changing then progress is likely being made. Perhaps this is just a situation where you have extremely slow underlying storage?

The error count is interesting. It might be worth sniffing NFS traffic in your test rig to see if you can determine what sort of errors you're seeing.

In any case, I'll try to replicate your loopback mount configuration and give this test a go to see if I can reproduce it. It's unlikely we'll be able to make much progress on this w/o more to go on.
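One way to do that traffic sniffing is a capture plus a status filter, roughly like the sketch below; the interface name and the Wireshark display-filter field names are assumptions, so adjust them for the test rig and NFS version:

```
# Capture NFS traffic while reproducing the hang, then look for replies
# carrying a non-zero status code.
tcpdump -i lo -s 0 -w /tmp/nfs.pcap port 2049 &
# ...run the reproducer, stop the capture, then inspect it:
tshark -r /tmp/nfs.pcap -Y 'nfs.nfsstat4 != 0 || nfs.status != 0'
```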

Comment 18 Boyang Xue 2022-07-15 01:37:31 UTC
I find that when it gets stuck and `nfsiostat` shows a 50% write error rate on SCRATCH_MNT, `df` shows SCRATCH_MNT is 100% full. Perhaps this is the reason for the 50% error rate? In any case, when I extended the backing storage of SCRATCH_MNT from 11G to 50G, the test run passed in 10 minutes. The resulting output of `df` shows:

```
[root@kvm103 repo_xfstests-dev]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1.1G     0  1.1G   0% /dev
tmpfs           1.2G     0  1.2G   0% /dev/shm
tmpfs           455M  6.3M  449M   2% /run
/dev/vda7        24G  4.7G   19G  20% /
/dev/vda1      1014M  332M  683M  33% /boot
tmpfs           228M  4.0K  228M   1% /run/user/0
/dev/vda2        11G  111M   11G   1% /test_TMP
/dev/vdb         50G   15G   36G  30% /scratch_TMP
```

So it seems SCRATCH_MNT requires at least 15GB.
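In case it helps others hitting the space limit, one way to grow SCRATCH_DEV in a VM without repartitioning is a loop device; this is only a sketch, and the image path, size, and filesystem type are illustrative rather than what was used here:

```
# Back SCRATCH_DEV with a larger sparse image via a loop device.
truncate -s 50G /var/tmp/scratch.img
LOOPDEV=$(losetup -f --show /var/tmp/scratch.img)
mkfs.xfs "$LOOPDEV"
mount "$LOOPDEV" /scratch_TMP
```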

Comment 19 Jeff Layton 2022-07-19 11:04:23 UTC
A full disk can certainly cause writeback errors. I'm not sure what the space requirements for xfstests actually are, and the READMEs don't really say. It might be worth sending a note to fstests.org to see what the minimum is (if it's known).

Comment 20 Jeff Layton 2022-07-21 17:10:19 UTC
To be clear, this is probably indicative of a bug in the nfs client. It shouldn't hang when there are writeback errors, no matter the size of the underlying filesystem.

Comment 21 Jeff Layton 2022-07-21 17:11:37 UTC
cc'ing Chuck since he expressed some interest in this issue.

Comment 22 Chuck Lever 2022-07-21 18:04:50 UTC
(In reply to Jeff Layton from comment #20)
> To be clear, this is probably indicative of a bug in the nfs client. It
> shouldn't hang when there are writeback errors, no matter the size of the
> underlying filesystem.

Agreed, looks like a client-side issue. Zorro's posted call traces don't have any direct server call paths.

It might help to bisect the client kernel, as the report I saw suggested the failures don't occur on kernels before v5.10.

Comment 23 Jeff Layton 2022-07-21 18:11:31 UTC
Here's what I'm seeing in the fsstress process:

[jlayton@centos9 ~]$ sudo cat /proc/1951/stack
[<0>] nfs_direct_wait+0x19/0x40 [nfs]
[<0>] nfs_file_direct_write+0x248/0x340 [nfs]
[<0>] new_sync_write+0x11f/0x1b0
[<0>] vfs_write+0x1eb/0x280
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x5c/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae


I tried to attach strace to it, but it got stuck, so I'm pretty sure this thing is stuck in the same DIO write syscall for some reason. I turned up some random nfs write tracepoints and saw this:

   kworker/u32:0-2163    [001] ....  2540.086489: nfs4_write: error=0 (OK) fileid=00:2b:8889676 fhandle=0xb00f143d offset=1048576 count=1048576 res=294912 stateid=1:0xb93c1f29 layoutstateid=0:0x00000000
   kworker/u32:0-2163    [001] ....  2540.086496: nfs_writeback_done: fileid=00:2b:8889676 fhandle=0xb00f143d offset=1048576 count=1048576 res=294912 status=294912 stable=FILE_SYNC verifier=0000000000000000
   kworker/u32:0-2163    [001] ....  2540.093373: nfs4_write: error=-28 (ENOSPC) fileid=00:2b:8889676 fhandle=0xb00f143d offset=1343488 count=753664 res=0 stateid=1:0xb93c1f29 layoutstateid=0:0x00000000
   kworker/u32:0-2163    [001] ....  2540.093378: nfs_writeback_done: fileid=00:2b:8889676 fhandle=0xb00f143d offset=1343488 count=753664 res=0 status=-28 stable=UNSTABLE verifier=0000000000000000
   kworker/u32:0-2163    [001] ....  2540.096165: nfs_initiate_write: fileid=00:2b:8889676 fhandle=0xb00f143d offset=1048576 count=1048576 stable=FILE_SYNC
   kworker/u32:0-2163    [001] ....  2540.119430: nfs4_write: error=0 (OK) fileid=00:2b:8889676 fhandle=0xb00f143d offset=1048576 count=1048576 res=294912 stateid=1:0xb93c1f29 layoutstateid=0:0x00000000
   kworker/u32:0-2163    [001] ....  2540.119439: nfs_writeback_done: fileid=00:2b:8889676 fhandle=0xb00f143d offset=1048576 count=1048576 res=294912 status=294912 stable=FILE_SYNC verifier=0000000000000000
   kworker/u32:0-2163    [001] ....  2540.126377: nfs4_write: error=-28 (ENOSPC) fileid=00:2b:8889676 fhandle=0xb00f143d offset=1343488 count=753664 res=0 stateid=1:0xb93c1f29 layoutstateid=0:0x00000000
   kworker/u32:0-2163    [001] ....  2540.126382: nfs_writeback_done: fileid=00:2b:8889676 fhandle=0xb00f143d offset=1343488 count=753664 res=0 status=-28 stable=UNSTABLE verifier=0000000000000000
   kworker/u32:0-2163    [001] ....  2540.129160: nfs_initiate_write: fileid=00:2b:8889676 fhandle=0xb00f143d offset=1048576 count=1048576 stable=FILE_SYNC
   kworker/u32:0-2163    [001] ....  2540.154001: nfs4_write: error=0 (OK) fileid=00:2b:8889676 fhandle=0xb00f143d offset=1048576 count=1048576 res=294912 stateid=1:0xb93c1f29 layoutstateid=0:0x00000000
   kworker/u32:0-2163    [001] ....  2540.154009: nfs_writeback_done: fileid=00:2b:8889676 fhandle=0xb00f143d offset=1048576 count=1048576 res=294912 status=294912 stable=FILE_SYNC verifier=0000000000000000
   kworker/u32:0-2163    [001] ....  2540.160889: nfs_writeback_done: fileid=00:2b:8889676 fhandle=0xb00f143d offset=1343488 count=753664 res=0 status=-28 stable=UNSTABLE verifier=0000000000000000
   kworker/u32:0-2163    [001] ....  2540.163633: nfs_initiate_write: fileid=00:2b:8889676 fhandle=0xb00f143d offset=1048576 count=1048576 stable=FILE_SYNC

...so it looks like the client is cycling over and over trying to write the same range. Maybe the DIO error handling is borked?
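For reference, the write tracepoints in the capture above can be enabled through tracefs roughly as in the sketch below; it assumes tracefs is mounted at the usual /sys/kernel/debug/tracing location:

```
# Enable the NFS client write tracepoints and stream the output.
cd /sys/kernel/debug/tracing
echo 1 > events/nfs/nfs_initiate_write/enable
echo 1 > events/nfs/nfs_writeback_done/enable
echo 1 > events/nfs4/nfs4_write/enable
cat trace_pipe
```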

Comment 24 Jeff Layton 2022-07-21 18:18:20 UTC
I'm going to go ahead and open this bug up to the public.

Note that the above is running against a small 10g partition on a separate physical NFS server. I think we need to disentangle the problems of running a loopback client and server for now.

Comment 25 Chuck Lever 2022-07-21 18:19:15 UTC
The ENOSPC result appears to get lost, which indeed indicates a bug in the error handling.

Also, I don't understand why the stable_how flag is flipping between FILE_SYNC and UNSTABLE. That might be related, or it could be a separate misbehavior.

Comment 26 Jeff Layton 2022-07-21 18:49:09 UTC
Once I have a full partition, I can reproduce this with a command like:

[jlayton@centos9 ~]$ sudo dd if=/dev/urandom of=/mnt/scratch/p0/d1/f31dd bs=100k count=1024 oflag=direct

...interestingly, 10k I/Os got an immediate error return:

[jlayton@centos9 ~]$ sudo dd if=/dev/urandom of=/mnt/scratch/p0/d1/f31dd bs=10k count=1024 oflag=direct
dd: error writing '/mnt/scratch/p0/d1/f31dd': No space left on device
dd: closing output file '/mnt/scratch/p0/d1/f31dd': No space left on device

Comment 27 Jeff Layton 2022-07-22 15:08:44 UTC
I added some tracepoints to the direct write codepaths, and got this:

We start by issuing a 100k write (matching the write from userland):
              dd-1324    [013] .....   138.644456: nfs_initiate_write: fileid=00:31:377858 fhandle=0x4161b7a9 offset=1024000 count=102400 stable=FILE_SYNC

It comes back short though (only 6 pages worth). Note that it came back FILE_SYNC though:

   kworker/u32:2-113     [008] .....   138.652357: nfs4_write: error=0 (OK) fileid=00:31:377858 fhandle=0x4161b7a9 offset=1024000 count=102400 res=24576 stateid=1:0x6582c8e8 layoutstateid=0:0x00000000
   kworker/u32:2-113     [008] .....   138.652376: nfs_writeback_done: fileid=00:31:377858 fhandle=0x4161b7a9 offset=1024000 count=102400 res=24576 status=24576 stable=FILE_SYNC verifier=0000000000000000
   kworker/u32:2-113     [008] .....   138.652380: nfs_writeback_result: fileid=00:31:377858 fhandle=0x4161b7a9 offset=1024000 count=102400 res=24576 status=24576 stable=FILE_SYNC verifier=0000000000000000

We fall into the short write logic in nfs_writeback_result, and the RPC gets reissued, starting at the end of the prior short write. That returns -ENOSPC:

   kworker/u32:2-113     [008] .....   138.658807: nfs4_write: error=-28 (ENOSPC) fileid=00:31:377858 fhandle=0x4161b7a9 offset=1048576 count=77824 res=0 stateid=1:0x6582c8e8 layoutstateid=0:0x00000000
   kworker/u32:2-113     [008] .....   138.658813: nfs_writeback_done: fileid=00:31:377858 fhandle=0x4161b7a9 offset=1048576 count=77824 res=0 status=-28 stable=UNSTABLE verifier=0000000000000000
   kworker/u32:2-113     [008] .....   138.658945: nfs_direct_write_completion: dreq=000000000d0ecac9 fileid=00:31:377858 fhandle=0x4161b7a9 ref=2 io_start=1024000 count=0 bytes_left=0 error=0 flags=

First problem: we decide to issue a commit here -- why?

   kworker/u32:2-113     [008] .....   138.659056: nfs_direct_write_complete: dreq=000000000d0ecac9 fileid=00:31:377858 fhandle=0x4161b7a9 ref=2 io_start=1024000 count=24576 bytes_left=0 error=-28 flags=DO_COMMIT
   kworker/u32:2-113     [008] .....   138.677659: nfs_initiate_commit: fileid=00:31:377858 fhandle=0x4161b7a9 offset=0 count=0
   kworker/u32:2-113     [008] .....   138.681016: nfs4_commit: error=0 (OK) fileid=00:31:377858 fhandle=0x4161b7a9 offset=0 count=0 layoutstateid=0:0x00000000
   kworker/u32:2-113     [008] .....   138.681022: nfs_commit_done: fileid=00:31:377858 fhandle=0x4161b7a9 offset=0 status=0 stable=FILE_SYNC verifier=3cb4da627ca64c37
   kworker/u32:2-113     [008] .....   138.681135: nfs_direct_commit_complete: dreq=000000000d0ecac9 fileid=00:31:377858 fhandle=0x4161b7a9 ref=2 io_start=1024000 count=24576 bytes_left=0 error=-28 flags=

...and then it reschedules the writes and the cycle begins anew.

   kworker/u32:2-113     [008] .....   138.681209: nfs_direct_write_complete: dreq=000000000d0ecac9 fileid=00:31:377858 fhandle=0x4161b7a9 ref=2 io_start=1024000 count=24576 bytes_left=0 error=-28 flags=RESCHED_WRITES
   kworker/u32:2-113     [008] .....   138.681384: nfs_initiate_write: fileid=00:31:377858 fhandle=0x4161b7a9 offset=1024000 count=102400 stable=FILE_SYNC

Comment 28 Jeff Layton 2022-07-22 18:21:38 UTC
Initial patchset posted here. I'm not sure it's the correct approach, but it seems to fix the issue.

https://lore.kernel.org/linux-nfs/20220722181220.81636-1-jlayton@kernel.org/T/#t

Comment 32 Boyang Xue 2022-08-14 13:05:45 UTC
TEST PASS.

Ran generic/476 3 times within 36 hours on each system. The test didn't finish on kernel-5.14.0-145.el9, and finished on the MR kernel. Links to the Beaker jobs:

kernel-5.14.0-145.el9
https://url.corp.redhat.com/bz2028370-reproduce

kernel-5.14.0-145.mr1258_220811_1150.el9
https://url.corp.redhat.com/bz2028370-verify-1

Comment 35 Boyang Xue 2022-08-22 00:59:44 UTC
TEST PASS.

Ran generic/476 3 times within 36 hours on each system. The test didn't finish on kernel-5.14.0-145.el9, and finished on kernel-5.14.0-150.el9. Links to the Beaker jobs:

kernel-5.14.0-145.el9
https://url.corp.redhat.com/bz2028370-reproduce

kernel-5.14.0-150.el9
https://url.corp.redhat.com/bz2028370-verify-final

Comment 36 Boyang Xue 2022-08-22 14:56:09 UTC
It turns out I spoke too soon. More test runs may have uncovered a (perhaps new) bug: xfstests-multihost-nfsv3-over-ext4-ppc64le appears to hang in its 3rd run. Please check the Beaker jobs:

https://beaker.engineering.redhat.com/jobs/6932794
https://beaker.engineering.redhat.com/jobs/6932798

Please note that among all the configurations this only happens with NFSv3 over ext4 in the ppc64le multi-host configuration, and the kernel version is kernel-5.14.0-150.el9.

When this problem happens, I find that the scratch directory on the server side is 100% full:

[root@ibm-p9z-25-lp8 ~]# df -h | grep xfstests
/dev/sda5       9.8G   32K  9.3G   1% /mnt/xfstests/test
/dev/sda3       9.8G  9.8G     0 100% /mnt/xfstests/scratch

And nfsiostat shows a 30-40% write error rate:
```
<server side>:/mnt/xfstests/test/nfs-server mounted on /mnt/xfstests/test/nfs-client:

           ops/s       rpc bklog
         109.515           0.000

read:              ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                   0.000           0.000           0.000        0 (0.0%)           0.000           0.000           0.000        0 (0.0%)
write:             ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                   0.000           0.000           0.000        0 (0.0%)           0.000           0.000           0.000        0 (0.0%)

<server side>:/mnt/xfstests/scratch/nfs-server mounted on /mnt/xfstests/scratch/nfs-client:

           ops/s       rpc bklog
         113.424           0.000

read:              ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                   0.780          92.249         118.227        0 (0.0%)           1.017           1.043           0.016        0 (0.0%)
write:             ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)  avg queue (ms)          errors
                   2.403         841.814         350.361        0 (0.0%)          44.552          44.576           0.019   35336 (39.2%)
```

The /proc/<fsstress PID>/stack shows the following, and it is not changing:
```
[root@ibm-p9z-25-lp4 ~]# cat /proc/118520/stack
[<0>] 0xc0000001ff990480
[<0>] __switch_to+0x12c/0x220
[<0>] rpc_wait_bit_killable+0x3c/0x110 [sunrpc]
[<0>] __rpc_execute+0x154/0x530 [sunrpc]
[<0>] rpc_execute+0x168/0x1d0 [sunrpc]
[<0>] rpc_run_task+0x1f0/0x2a0 [sunrpc]
[<0>] rpc_call_sync+0x80/0x100 [sunrpc]
[<0>] nfs3_rpc_wrapper+0x64/0x130 [nfsv3]
[<0>] nfs3_proc_access+0xb4/0x1b0 [nfsv3]
[<0>] nfs_do_access+0xd8/0x2c0 [nfs]
[<0>] nfs_permission+0xd8/0x268 [nfs]
[<0>] inode_permission+0x174/0x290
[<0>] may_open+0xa4/0x260
[<0>] do_open+0x1a4/0x500
[<0>] path_openat+0x170/0x380
[<0>] do_filp_open+0xac/0x180
[<0>] do_sys_openat2+0x100/0x220
[<0>] sys_openat+0x74/0xd0
[<0>] system_call_exception+0x160/0x300
[<0>] system_call_vectored_common+0xe8/0x278
```

Unlike in comment #26, dd with both 10k and 100k block sizes fails with ENOSPC.

Jeff,

Could you please help check it?

Thanks,
Boyang

Comment 37 Jeff Layton 2022-08-22 15:05:06 UTC
Looking at this log:

    http://lab-02.hosts.prod.upshift.rdu2.redhat.com/beaker/logs/tasks/149048+/149048770/taskout.log

...which is on the server-side, it looks like maybe something screwed up with the test infrastructure:

nfs mount will be run on client side, this's server side
nfs mount will be run on client side, this's server side
** init_dev_param PASS Score:
Uploading resultoutputfile.log .done
** pre_param PASS Score:
Uploading resultoutputfile.log done
** Test_completed PASS Score:
Uploading resultoutputfile.log done
_test_online_fsck:nfs on /mnt/xfstests/test/nfs-client
haven't implemented a way to scrub nfs
nfs unmount will be run on client side, this's server side
_test_fsck:nfs on ibm-p9z-25-lp8.virt.pnr.lab.eng.rdu2.redhat.com:/mnt/xfstests/test/nfs-server
haven't implemented a way to check nfs
/usr/bin/rhts-sync-block -s TEST__nfs_resume_device ibm-p9z-25-lp4.virt.pnr.lab.eng.rdu2.redhat.com -- Blocking state(s) =  4_TEST__nfs_resume_device
Failed to connect to remote server ibm-p9z-25-lp4.virt.pnr.lab.eng.rdu2.redhat.com: Connection refused
Retrying rstrnt-sync-block for state 4_TEST__nfs_resume_device
Failed to connect to remote server ibm-p9z-25-lp4.virt.pnr.lab.eng.rdu2.redhat.com: Connection refused
Retrying rstrnt-sync-block for state 4_TEST__nfs_resume_device
Failed to connect to remote server ibm-p9z-25-lp4.virt.pnr.lab.eng.rdu2.redhat.com: Connection refused
Retrying rstrnt-sync-block for state 4_TEST__nfs_resume_device
Failed to connect to remote server ibm-p9z-25-lp4.virt.pnr.lab.eng.rdu2.redhat.com: Connection refused
Retrying rstrnt-sync-block for state 4_TEST__nfs_resume_device
Failed to connect to remote server ibm-p9z-25-lp4.virt.pnr.lab.eng.rdu2.redhat.com: Connection refused
Retrying rstrnt-sync-block for state 4_TEST__nfs_resume_device

...are these Connection refused errors expected?

If so, then is there a way for me to log into the client(s), so I can see what they're doing?

Comment 39 Jeff Layton 2022-08-22 16:03:06 UTC
Ok, you're right that that task is the one stuck. It's waiting on an ACCESS call to the server. Looking at the server though, all of the nfsd threads are sitting idle, waiting for RPCs. The client task is stuck here in __rpc_execute waiting for the RPC_TASK_QUEUED bit to clear:

                status = out_of_line_wait_on_bit(&task->tk_runstate,                                
                                RPC_TASK_QUEUED, rpc_wait_bit_killable,                             
                                TASK_KILLABLE);

That means that it got added to an RPC wait queue and we're waiting for it to come off. Here's the rpc in the debugfs tasks file:

    25226 8880      0 0x6 0xa442bfd9        0 rpc_default_ops [sunrpc] nfsv3 ACCESS a:call_status [sunrpc] q:none

The "q:none" means that the task->tk_waitqueue pointer is NULL.

The flags here (0x8800 in field #2) are:

    RPC_TASK_CRED_NOREF|RPC_TASK_SENT

RPC_TASK_QUEUED is not set. This looks like there was a problem waking up this task -- probably a race of some sort?

Comment 40 Jeff Layton 2022-08-22 16:04:03 UTC
In any case, this is an entirely different issue from the original problem with O_DIRECT writes. Boyang, can you open a new bug to track this? I don't see a reason to hold up the other valid fixes for this.

Comment 41 Jeff Layton 2022-08-22 16:29:00 UTC
Ahh, my mistake. QUEUED and RUNNING are tk_runstate flags and we can't see those in debugfs (which we should fix). Maybe I can poke around with the debugger...

Comment 42 Jeff Layton 2022-08-22 18:16:29 UTC
Oops, RPC_TASK_DYNAMIC is also set in tk_flags (not that it matters much for this problem). Looking at the second example, there are 2 stuck GETATTR requests in a similar state:

21458 8880      0 0x6 0xc85e5da8        0 rpc_default_ops [sunrpc] nfsv3 GETATTR a:call_status [sunrpc] q:none
 2341 8880      0 0x6 0x362569a8        0 rpc_default_ops [sunrpc] nfsv3 GETATTR a:call_status [sunrpc] q:none

The last field in the output is "q:none", which means that either tk_waitqueue was NULL, or that we were no longer in the RPC_IS_QUEUED state. Probably, the latter since I don't see how we could get a NULL pointer there with rpc_set_queued having been called at least once for this. So, this may still be a problem with wakeups.

What we probably ought to do is roll a patch that adds the tk_runstate to the output there, and then we can see whether the reply was ever received. I'll see what I can spin up.

Comment 43 Jeff Layton 2022-08-22 19:46:27 UTC
I don't think I can learn much more from the currently-stuck machines, so you can kill those off if you like. For now, I spun up a draft patch that adds the tk_runstate to the debugfs output:

diff --git a/net/sunrpc/debugfs.c b/net/sunrpc/debugfs.c
index 7dc9cc929bfd..879f0c618b8c 100644
--- a/net/sunrpc/debugfs.c
+++ b/net/sunrpc/debugfs.c
@@ -30,8 +30,8 @@ tasks_show(struct seq_file *f, void *v)
        if (task->tk_rqstp)
                xid = be32_to_cpu(task->tk_rqstp->rq_xid);
 
-       seq_printf(f, "%5u %04x %6d 0x%x 0x%x %8ld %ps %sv%u %s a:%ps q:%s\n",
-               task->tk_pid, task->tk_flags, task->tk_status,
+       seq_printf(f, "%5u %04x %6d 0x%lx 0x%x 0x%x %8ld %ps %sv%u %s a:%ps q:%s\n",
+               task->tk_pid, task->tk_flags, task->tk_status, task->tk_runstate,
                clnt->cl_clid, xid, rpc_task_timeout(task), task->tk_ops,
                clnt->cl_program->name, clnt->cl_vers, rpc_proc_name(task),
                task->tk_action, rpc_waitq);


Kernel is building here:

    https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=47257987

Boyang, could you try to reproduce the problem with this kernel once it's built? After it gets stuck, please run:

    # cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks

...and collect the output. Knowing what the tk_runstate flags are set to might help us rule out some possibilities.
Opening a new bug to track this would be best though since this doesn't look related to the original bug.
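To help interpret the collected tk_runstate values, here is a rough decoder; the bit numbers are taken from the upstream include/linux/sunrpc/sched.h layout and should be treated as an assumption to verify against the exact kernel source:

```
# Decode a tk_runstate hex value into RPC_TASK_* bit names (assumed bit
# layout: RUNNING=0, QUEUED=1, ACTIVE=2, NEED_XMIT=3, NEED_RECV=4,
# MSG_PIN_WAIT=5, SIGNALLED=6).
decode_runstate() {
    local val=$(( $1 ))
    local names=(RUNNING QUEUED ACTIVE NEED_XMIT NEED_RECV MSG_PIN_WAIT SIGNALLED)
    local i
    for i in "${!names[@]}"; do
        (( val & (1 << i) )) && echo "RPC_TASK_${names[$i]}"
    done
}
decode_runstate 0x16   # would print QUEUED, ACTIVE, NEED_RECV under this layout
```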

Comment 44 Boyang Xue 2022-08-23 00:47:54 UTC
(In reply to Jeff Layton from comment #43)
> I don't think I can learn much more from the currently-stuck machines, so
> you can kill those off if you like. For now, I spun up a draft patch that
> adds the tk_runstate to the debugfs output:
> 
> diff --git a/net/sunrpc/debugfs.c b/net/sunrpc/debugfs.c
> index 7dc9cc929bfd..879f0c618b8c 100644
> --- a/net/sunrpc/debugfs.c
> +++ b/net/sunrpc/debugfs.c
> @@ -30,8 +30,8 @@ tasks_show(struct seq_file *f, void *v)
>         if (task->tk_rqstp)
>                 xid = be32_to_cpu(task->tk_rqstp->rq_xid);
>  
> -       seq_printf(f, "%5u %04x %6d 0x%x 0x%x %8ld %ps %sv%u %s a:%ps
> q:%s\n",
> -               task->tk_pid, task->tk_flags, task->tk_status,
> +       seq_printf(f, "%5u %04x %6d 0x%lx 0x%x 0x%x %8ld %ps %sv%u %s a:%ps
> q:%s\n",
> +               task->tk_pid, task->tk_flags, task->tk_status,
> task->tk_runstate,
>                 clnt->cl_clid, xid, rpc_task_timeout(task), task->tk_ops,
>                 clnt->cl_program->name, clnt->cl_vers, rpc_proc_name(task),
>                 task->tk_action, rpc_waitq);
> 
> 
> Kernel is building here:
> 
>     https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=47257987
> 
> Boyang, could you try to reproduce the problem with this kernel once it's
> built? After it gets stuck, please run:
> 
>     # cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks
> 
> ...and collect the output. Knowing what the tk_runstate flags are set to
> might help us rule out some possibilities.

No problem. I'm running the tests now, and will report the result later today.

> Opening a new bug to track this would be best though since this doesn't look
> related to the original bug.

I've opened the BZ as

Bug 2120438 - [xfstests/nfs generic/476] nfsv3 over ext4 test gets stuck on multi-host ppc64le

Comment 45 Boyang Xue 2022-08-23 04:53:01 UTC
I just got the result. When it gets stuck, running "cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks" on the client shows

ibm-p9z-16-lp5.lab.eng.bos.redhat.com (https://beaker.engineering.redhat.com/jobs/6935876) (the output is changing)
```
[root@ibm-p9z-16-lp5 ~]# cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks
 5207 8801      0 0x16 0x6 0x92f92312     5994 nfs_pgio_common_ops [nfs] nfsv3 WRITE a:call_status [sunrpc] q:xprt_pending
 5214 8880      0 0x16 0x6 0x99f92312     5996 rpc_default_ops [sunrpc] nfsv3 SYMLINK a:call_status [sunrpc] q:xprt_pending
 5223 8880      0 0x16 0x6 0xa2f92312     5999 rpc_default_ops [sunrpc] nfsv3 SYMLINK a:call_status [sunrpc] q:xprt_pending
 5226 8880      0 0x16 0x6 0xa5f92312     5999 rpc_default_ops [sunrpc] nfsv3 SYMLINK a:call_status [sunrpc] q:xprt_pending
```

ibm-p9z-20-lp24.fs.lab.eng.bos.redhat.com (https://beaker.engineering.redhat.com/jobs/6935874) (the output is changing)
```
[root@ibm-p9z-20-lp24 ~]# cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks
20505 8801      0 0x16 0x6 0x8d3fee6d     5993 nfs_pgio_common_ops [nfs] nfsv3 WRITE a:call_status [sunrpc] q:xprt_pending
20511 8880      0 0x16 0x6 0x933fee6d     5994 rpc_default_ops [sunrpc] nfsv3 CREATE a:call_status [sunrpc] q:xprt_pending
20524 8880      0 0x16 0x6 0xa03fee6d     5995 rpc_default_ops [sunrpc] nfsv3 CREATE a:call_status [sunrpc] q:xprt_pending
20601 8880      0 0x16 0x6 0xed3fee6d     6000 rpc_default_ops [sunrpc] nfsv3 CREATE a:call_status [sunrpc] q:xprt_pending
```

I've reset the root passwords of ppc64le in the above two jobs, in case you would like to log in.

Comment 46 Boyang Xue 2022-08-23 09:00:20 UTC
I just prepared another two pairs of ppc64le machines for debugging. I've extended the watchdog timer, so they should survive for as long as 10 days.

https://beaker.engineering.redhat.com/jobs/6936603
https://beaker.engineering.redhat.com/jobs/6936604

As always, their root passwords have been reset.

Comment 47 Jeff Layton 2022-08-23 10:50:39 UTC
Looking at ibm-p9z-16-lp5, I'm not seeing anything that is stuck. As you say, the output is changing, so it seems like progress is being made (even if it is slow).  ibm-p9z-20-lp24 is not up anymore, and it looks like the test completed successfully there. I don't think the same problem was reproduced in this run.

Ditto for the others that you kicked off in comment #46. They don't seem to be hung, at least not so far. I'll plan to check on them later today.

Comment 49 Jeff Layton 2022-08-23 16:23:42 UTC
We'll pursue the latest issue with this test hanging in bug #2120438 and leave this bug to cover the DIO issue that was originally fixed.

Comment 52 errata-xmlrpc 2022-11-15 10:51:55 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8267

