Description of problem:
Getting the error message "fio: io_u error on file /mnt/share-test-0/csi-helper-yy346mdx: Input/output error: write offset=525594624, buflen=4096" on a mounted share during fio runs.

Version-Release number of selected component (if applicable):
ceph version 19.2.0-124.1.hotfix.bz2357486.el9cp (1d6c39dcc35466271ef633ccfd91e51cc792656b) squid (stable)

How reproducible:
100%

Steps to Reproduce:
1. Create a Ganesha NFS export
2. Mount the export
3. Run the following fio job:

# echo $bs ; echo $njobs ; echo $iodepth
10K
16
128

# cat include.fio
[global]
#direct=1
rw=randwrite

[test0]
direct=1
rwmixread=50
bs=${bs}
iodepth=${iodepth}
numjobs=${njobs}
ioengine=libaio
lat_percentiles=1
unified_rw_reporting=1
group_reporting=1
size=1G

# while true; do fio include.fio --filename=$(pwd)/test0; done

Actual results:
io_u error on file /mnt/share-test-0/csi-helper-yy346mdx: Input/output error: write offset=525594624, buflen=4096 on the mounted share during fio runs.

Expected results:
Successful I/O test

Additional info:
Error in kern.log:
NFS: Server wrote zero bytes, expected 4096.
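For completeness, a minimal sketch of how the job file above is driven (assuming bs, njobs, and iodepth are exported shell variables, since fio only expands the ${...} placeholders from the environment):

# export bs=10K njobs=16 iodepth=128
# while true; do fio include.fio --filename="$(pwd)/test0"; done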
Hi Eric,

Could you please share the Ganesha logs to investigate the failure further? Can you confirm whether there's enough available space on the disk to complete the fio workload? It would also be helpful to see the output of "ceph -s" to look for any OSD-related issues.

For reference, I ran the same fio profile on my setup and it completed successfully without any issues.

# rpm -qa | grep nfs
libnfsidmap-2.5.4-27.el9_5.1.x86_64
nfs-utils-2.5.4-27.el9_5.1.x86_64
nfs-ganesha-selinux-6.5-13.el9cp.noarch
nfs-ganesha-6.5-13.el9cp.x86_64
nfs-ganesha-ceph-6.5-13.el9cp.x86_64
nfs-ganesha-rados-grace-6.5-13.el9cp.x86_64
nfs-ganesha-rados-urls-6.5-13.el9cp.x86_64
nfs-ganesha-rgw-6.5-13.el9cp.x86_64

# ceph --version
ceph version 19.2.1-188.el9cp (834ac46f780fbdc2ac4ba4851a36db6df3c1aa6f) squid (stable)

[ceph: root@cali013 /]# ceph nfs export ls nfsganesha
[
  "/ganesha1"
]

[ceph: root@cali013 /]# ceph nfs export info nfsganesha /ganesha1
{
  "access_type": "RW",
  "clients": [],
  "cluster_id": "nfsganesha",
  "export_id": 1,
  "fsal": {
    "cmount_path": "/",
    "fs_name": "cephfs",
    "name": "CEPH",
    "user_id": "nfs.nfsganesha.cephfs.2c1043d4"
  },
  "path": "/volumes/ganeshagroup/ganesha1/1ffcbfcd-a03e-4311-a39c-2d2d912cb273",
  "protocols": [
    3,
    4
  ],
  "pseudo": "/ganesha1",
  "security_label": true,
  "squash": "none",
  "transports": [
    "TCP"
  ]
}

Client
-----
# cat include.fio
[global]
rw=randwrite
ioengine=libaio
lat_percentiles=1
unified_rw_reporting=1
group_reporting=1
size=1G

[test0]
direct=1
rwmixread=50
bs=4k        # Default, can be overridden
iodepth=1    # Default, can be overridden
numjobs=1    # Default, can be overridden

# fio include.fio --section=test0 --bs=10k --iodepth=128 --numjobs=16 --filename=/mnt/ganesha/fio_file
test0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
test0: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [w(1)][100.0%][r=2892KiB/s][r=723 IOPS][eta 00m:00s]
test0: (groupid=0, jobs=1): err= 0: pid=8732: Sun May 11 01:54:42 2025
  mixed: IOPS=714, BW=2859KiB/s (2928kB/s)(1024MiB/366759msec)
    slat (usec): min=4, max=6680, avg=36.53, stdev=21.65
    clat (usec): min=432, max=10780, avg=1358.42, stdev=170.85
     lat (usec): min=960, max=10833, avg=1394.95, stdev=172.80
    clat percentiles (usec):
     |  1.00th=[ 1123],  5.00th=[ 1172], 10.00th=[ 1205], 20.00th=[ 1237],
     | 30.00th=[ 1270], 40.00th=[ 1303], 50.00th=[ 1336], 60.00th=[ 1369],
     | 70.00th=[ 1401], 80.00th=[ 1450], 90.00th=[ 1532], 95.00th=[ 1614],
     | 99.00th=[ 1909], 99.50th=[ 2073], 99.90th=[ 2671], 99.95th=[ 3032],
     | 99.99th=[ 5997]
    lat percentiles (usec):
     |  1.00th=[ 1156],  5.00th=[ 1205], 10.00th=[ 1237], 20.00th=[ 1287],
     | 30.00th=[ 1303], 40.00th=[ 1336], 50.00th=[ 1369], 60.00th=[ 1401],
     | 70.00th=[ 1434], 80.00th=[ 1483], 90.00th=[ 1565], 95.00th=[ 1663],
     | 99.00th=[ 1942], 99.50th=[ 2114], 99.90th=[ 2737], 99.95th=[ 3064],
     | 99.99th=[ 6128]
   bw (  KiB/s): min= 2368, max= 3160, per=100.00%, avg=2860.38, stdev=100.58, samples=733
   iops        : min=  592, max=  790, avg=715.08, stdev=25.15, samples=733
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.32%, 4=0.66%, 10=0.02%, 20=0.01%
  cpu          : usr=0.63%, sys=1.14%, ctx=505001, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  MIXED: bw=2859KiB/s (2928kB/s), 2859KiB/s-2859KiB/s (2928kB/s-2928kB/s), io=1024MiB (1074MB), run=366759-366759msec

[root@ceph-nfsclusterlive-bs2b7t-node6 home]# df -hT
Filesystem            Type      Size  Used Avail Use% Mounted on
devtmpfs              devtmpfs  4.0M     0  4.0M   0% /dev
tmpfs                 tmpfs     1.8G     0  1.8G   0% /dev/shm
tmpfs                 tmpfs     732M   28M  704M   4% /run
/dev/vda4             xfs        39G  2.5G   37G   7% /
/dev/vda3             xfs       960M  176M  785M  19% /boot
/dev/vda2             vfat      200M  7.1M  193M   4% /boot/efi
tmpfs                 tmpfs     366M     0  366M   0% /run/user/0
10.8.130.16:/ganesha1 nfs4       22T  1.0G   22T   1% /mnt/ganesha
Based on the NFS client-side log message emitted, we know that the EIO is generated by the client: it is triggered by the NFS server sending a write reply with no error on the wire but 0 bytes written. The Linux client treats that as server misbehavior and returns EIO to the application.

Logs were requested, so we enabled NFS4 and FSAL logs on the ganesha side; however, the problem did not reproduce. We'll retry with less logging enabled. Any preference as to a minimal logging config? Does anyone think the FSAL or NFS_V4 logs have a better chance of showing something useful to help root-cause this problem?
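In case a concrete config helps, a minimal sketch of a ganesha LOG block for just these two components (the <cluster_id> placeholder and the assumption that it is applied with "ceph nfs cluster config set" on a cephadm-managed cluster are mine; verify the exact level spelling against your ganesha build):

# cat /tmp/ganesha-log.conf
LOG {
    COMPONENTS {
        FSAL = FULL_DEBUG;
        NFS_V4 = FULL_DEBUG;
    }
}

# ceph nfs cluster config set <cluster_id> -i /tmp/ganesha-log.conf
# ... reproduce ...
# ceph nfs cluster config reset <cluster_id>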
I should have mentioned that we know the EIO is generated by the client because we found the code that produces the log message; that code path returns EIO for a write that had no error on the wire but reported 0 bytes written:
https://github.com/linux-nfs/nfsd/blob/a79be02bba5c31f967885c7f3bf3a756d77d11d9/fs/nfs/write.c#L1620

Here's the requested ceph -s output:

[ceph: root@dal1-qz2-sr5-rk025-s06 /]# ceph -s
  cluster:
    id:     12852256-21e6-11f0-8891-6cfe549bafb8
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dal1-qz2-sr5-rk025-s06,dal2-qz2-sr2-rk089-s06,dal3-qz2-sr3-rk279-s06 (age 5d)
    mgr: dal1-qz2-sr5-rk025-s06.vdomyr(active, since 5d), standbys: dal2-qz2-sr2-rk089-s06.qougcd, dal3-qz2-sr3-rk279-s06.shdatw
    mds: 3/3 daemons up, 3 hot standby
    osd: 360 osds: 360 up (since 5d), 360 in (since 2w)

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 53281 pgs
    objects: 2.10k objects, 5.3 GiB
    usage:   411 GiB used, 1.0 PiB / 1.0 PiB avail
    pgs:     53281 active+clean

  io:
    client:   3.4 MiB/s rd, 395 MiB/s wr, 36 op/s rd, 3.14k op/s wr
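To summarize the linked code path, an abridged paraphrase (not verbatim kernel source) of the short-write handling in nfs_writeback_result():

/* abridged paraphrase of fs/nfs/write.c:nfs_writeback_result() */
if (resp->count < argp->count) {
        /* this is a short write */
        if (resp->count == 0) {
                /* server claimed success but wrote nothing: treat as fatal */
                printk(KERN_WARNING "NFS: Server wrote zero bytes, expected %u.\n",
                       argp->count);
                nfs_set_pgio_error(hdr, -EIO, argp->offset);
                task->tk_status = -EIO;
                return;
        }
        /* otherwise the client shrinks the request and resends the remainder */
}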
Created attachment 2089595 [details]
Ganesha logs with FSAL = DEBUG_FULL turned on.
Created attachment 2089848 [details] Latest debug run with EVENTS enabled
fio-3.36
Starting 4 processes
fio: io_u error on file /denali/test0: Input/output error: write offset=411146240, buflen=10240
fio: pid=20150, err=5/file:io_u.c:1896, func=io_u error, error=Input/output error

test0: (groupid=0, jobs=4): err= 5 (file:io_u.c:1896, func=io_u error, error=Input/output error): pid=20150: Wed May 14 18:06:35 2025
  mixed: IOPS=6263, BW=61.2MiB/s (64.1MB/s)(3303MiB/53997msec)
    slat (nsec): min=1646, max=7383.3k, avg=12673.05, stdev=49816.05
    clat (usec): min=3297, max=89935, avg=10194.26, stdev=5733.47
     lat (usec): min=3306, max=89938, avg=10206.93, stdev=5733.35
    clat percentiles (usec):
     |  1.00th=[ 6063],  5.00th=[ 7046], 10.00th=[ 7504], 20.00th=[ 8094],
     | 30.00th=[ 8455], 40.00th=[ 8848], 50.00th=[ 9241], 60.00th=[ 9634],
     | 70.00th=[10028], 80.00th=[10683], 90.00th=[11994], 95.00th=[14353],
     | 99.00th=[44827], 99.50th=[55837], 99.90th=[70779], 99.95th=[73925],
     | 99.99th=[82314]
    lat percentiles (usec):
     |  1.00th=[ 6063],  5.00th=[ 7046], 10.00th=[ 7504], 20.00th=[ 8094],
     | 30.00th=[ 8455], 40.00th=[ 8848], 50.00th=[ 9241], 60.00th=[ 9634],
     | 70.00th=[10028], 80.00th=[10683], 90.00th=[11994], 95.00th=[14353],
     | 99.00th=[44827], 99.50th=[55837], 99.90th=[70779], 99.95th=[73925],
     | 99.99th=[83362]
   bw (  KiB/s): min=42280, max=76720, per=99.97%, avg=62617.78, stdev=1683.07, samples=428
   iops        : min= 4228, max= 7672, avg=6261.69, stdev=168.30, samples=428
  lat (msec)   : 4=0.01%, 10=69.65%, 20=27.96%, 50=1.62%, 100=0.77%
  cpu          : usr=0.87%, sys=2.12%, ctx=296239, majf=0, minf=63
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.1%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=338212,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  MIXED: bw=61.2MiB/s (64.1MB/s), 61.2MiB/s-61.2MiB/s (64.1MB/s-64.1MB/s), io=3303MiB (3463MB), run=53997-53997msec
root@ubuntu-bx2-2x8-03:/denali#
We're currently working on getting a tcpdump wire trace of the problem. Unfortunately, it hasn't reproduced with tcpdump running.
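For reference, the capture being attempted looks roughly like this (interface name and output path are placeholders; -s 0 keeps full frames so WRITE replies are not truncated, and the tshark filter assumes Wireshark's NFSv4 dissector, where OP_WRITE is opcode 38):

# tcpdump -i eth0 -s 0 -w /var/tmp/nfs-eio.pcap port 2049
# tshark -r /var/tmp/nfs-eio.pcap -Y 'nfs.opcode == 38'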
@msaini -- I notice in your recreate you are running numjobs=1, and it appears the bug was opened with numjobs=16. Given there's a question about concurrent writes causing the issue, can we make sure the recreate efforts match the fio job run in the original environment (i.e. more jobs)? The latest recreate attempt in the previous comment appears to be numjobs=4.
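For clarity, a purely command-line sketch that matches the originally reported parameters (the mount path is taken from the earlier comment; adjust as needed):

# fio --name=test0 --rw=randwrite --direct=1 --ioengine=libaio \
      --bs=10k --iodepth=128 --numjobs=16 --size=1G --group_reporting=1 \
      --filename=/mnt/ganesha/fio_file

This also sidesteps any question of whether the job-file overrides took effect, since the earlier run's output banner still shows bs=4096B and a single process.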
tcpdump running inside the VSI finally captured a WRITE reply whose status was NFS4_OK with count=0. This has only happened once for me in about 10 reproductions.

The interesting thing is that all other WRITE call/reply pairs that I've inspected are about 150 packets apart in the capture file. However, the count=0 WRITE reply (pkt 18) is the 3rd packet after its corresponding WRITE call (pkt 15). Packet numbers are from a pcap file I created by exporting the last few hundred packets from a 425MB pcap file.

The packet captures from my other randwrite reproductions show the EIO-offset WRITE reply with count=10240, which is correct. Also interesting: even though tcpdump reports no dropped packets, the pcap file does not always contain the WRITE offset that caused fio to fail with EIO. Since I'm using -s 300, the pcap file would miss any WRITE RPCs that started more than 300 bytes into an ethernet frame, so that's one possible explanation for the "missing writes". All this is to say that I'm not entirely trusting my tcpdump captures collected from inside a VSI, and I'll work to get a wire trace on the protocol node / on the metal. That is more tedious because of our magical networking.

I'm attaching a tarball with the abbreviated capture file, fio output, and my notes, called "jeff-repro-0count_write.tar.gz".
Created attachment 2089965 [details]
capture file, fio results, and notes showing NFS4_OK write reply with count=0

The tarball contains a few hundred packets exported from a 425MB tcpdump capture, which shows a WRITE call writing 10k and a NFS4_OK WRITE reply with count=0. I include fio output showing the file offset with EIO, and my notes.
This issue seems to occur because the ganesha server occasionally returns a zero-length write when busy, which the NFS client interprets as an error, as highlighted by Jeff in an earlier comment. Have we investigated the ganesha server node from a slow-storage, high-load, or network-congestion perspective?

While fio retries short writes, it treats zero-length writes as errors, leading to the reported issue. I believe the continue_on_error option in fio could help bypass these short-write errors, but I'm uncertain whether ignoring them is the appropriate long-term solution in this context.

We should see this issue only with non-direct fio runs, considering that nfs_writeback_result reports EIO when ganesha returns 0 bytes written. Can we try to reproduce this issue with direct=1 fio runs?
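If we do experiment with continue_on_error, a minimal sketch of the job-file change (the values are fio's documented choices; whether masking the error is acceptable is the open question above):

[global]
continue_on_error=write   # or "io"/"all": count write errors and keep running instead of aborting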
Created attachment 2090177 [details]
Logs from ganesha debug build 2

FIO error message with offset:

fio-3.36
Starting 4 processes
fio: io_u error on file /denali/test0: Input/output error: write offset=882667520, buflen=10240
fio: pid=1084, err=5/file:io_u.c:1896, func=io_u error, error=Input/output error
Hi Russell,

(In reply to Russell Cattelan from comment #15)
> Created attachment 2090177 [details]
> Logs from ganesha debug build 2
>
> FIO error message with offset:
>
> fio-3.36
> Starting 4 processes
> fio: io_u error on file /denali/test0: Input/output error: write offset=882667520, buflen=10240
> fio: pid=1084, err=5/file:io_u.c:1896, func=io_u error, error=Input/output error

Thanks for the ganesha debug logs. The ganesha logs do show the NFS write op for the write call at offset 882667520 with buffer length 10240.

% grep 882667520 ganesha_mucho_logs.out
16/05/2025 21:58:55 : epoch 6827b514 : dal1-qz2-sr2-rk044-s18 : ganesha.nfsd-2[svc_28] nfs4_op_write :NFS4 :EVENT :calling write_arg 0x7f7614569800 arg_WRITE4 0x7f75ac0303a8 res_WRITE4 0x7f75ac136528 offset = 882667520 io_request = 10240 fsal_stable = 1
16/05/2025 21:58:55 : epoch 6827b514 : dal1-qz2-sr2-rk044-s18 : ganesha.nfsd-2[svc_28] ceph_fsal_write2 :FSAL :EVENT :CDBG: Calling ceph_ll_nonblocking_readv_writev for write ceph_ll_ioinfo 0x7f76141b0b88 offset 882667520 length 10240

We would like to capture the libcephfs client logs for further investigation of this issue. Kindly collect libcephfs debug logs using the steps below:

[1] Enable libcephfs debug logging. On the ceph cluster admin node:
# ceph config set client debug_client 20
# ceph config set client log_file /var/log/ceph/libcephfs_client.log

[2] Enable NFS4 and FSAL logs on the ganesha side.

[3] Reproduce the fio issue reporting "io_u error on file /denali/test0: Input/output error: write".

[4] Disable libcephfs debug logging. On the ceph cluster admin node:
# ceph config rm client debug_client
# ceph config rm client log_file

[5] Collect ganesha debug logs.

[6] Collect libcephfs debug logs from the ganesha nodes. The /var/log/ceph/libcephfs_client.log will be generated inside the ganesha container, so collect libcephfs_client.log from every ganesha node.

- Find the ganesha containers running on a ganesha node:
# podman ps -a|grep ganesha

- Find the libcephfs_client.log generated in every ganesha container and compress it. Upload the compressed logs to the BZ:
# find /var/ -iname libcephfs_client.log|xargs du -sh
# tar -czvf /tmp/libcephfs_`hostname -s`_$(date +%d%b%y_%H-%M-%S).tgz <libcephfs_file_path_from_ganesha_container1> <libcephfs_file_path_from_ganesha_container2>
# ls -ltr /tmp

NOTE: Upload the newly generated compressed file from the /tmp directory prefixed by the 'libcephfs_' keyword.

e.g.
# podman ps -a|grep ganesha
b47b01053e12  cp.stg.icr.io/cp/ibm-ceph/ceph-8-rhel9@sha256:ca65e6bfabd1652fec495211ae72d8a4af3271bcd88ea948b623089381b982f3  -F -L STDERR -N N...  2 days ago    Up 2 days    80/tcp, 5000/tcp, 6789/tcp, 6800/tcp, 6801/tcp, 6802/tcp, 6803/tcp, 6804/tcp, 6805/tcp  ceph-f7d28fcc-310b-11f0-9028-005056bb2bda-nfs-nfsganesha-0-0-node1-xdvemh
680f0bddc833  cp.stg.icr.io/cp/ibm-ceph/ceph-8-rhel9@sha256:ca65e6bfabd1652fec495211ae72d8a4af3271bcd88ea948b623089381b982f3  -F -L STDERR -N N...  45 hours ago  Up 45 hours  80/tcp, 5000/tcp, 6789/tcp, 6800/tcp, 6801/tcp, 6802/tcp, 6803/tcp, 6804/tcp, 6805/tcp  ceph-f7d28fcc-310b-11f0-9028-005056bb2bda-nfs-ganesha1-1-0-node1-wuatin

# find /var/ -iname libcephfs_client.log|xargs du -sh
2.1G  /var/lib/containers/storage/overlay/cf94103061b9f9517ddd014d32d470f283863b3de8ef4c911b2ddb6fe6389b2b/diff/var/log/ceph/libcephfs_client.log
2.1G  /var/lib/containers/storage/overlay/cf94103061b9f9517ddd014d32d470f283863b3de8ef4c911b2ddb6fe6389b2b/merged/var/log/ceph/libcephfs_client.log
0     /var/lib/containers/storage/overlay/482dc25a8c631a6651aa2e880957f6215049624441ff2bd4351ee2e6a8c170a4/diff/var/log/ceph/libcephfs_client.log
0     /var/lib/containers/storage/overlay/482dc25a8c631a6651aa2e880957f6215049624441ff2bd4351ee2e6a8c170a4/merged/var/log/ceph/libcephfs_client.log
0     /var/log/ceph/libcephfs_client.log

# tar -czvf /tmp/libcephfs_`hostname -s`_$(date +%d%b%y_%H-%M-%S).tgz /var/lib/containers/storage/overlay/cf94103061b9f9517ddd014d32d470f283863b3de8ef4c911b2ddb6fe6389b2b/merged/var/log/ceph/libcephfs_client.log /var/lib/containers/storage/overlay/482dc25a8c631a6651aa2e880957f6215049624441ff2bd4351ee2e6a8c170a4/merged/var/log/ceph/libcephfs_client.log
tar: Removing leading `/' from member names
/var/lib/containers/storage/overlay/cf94103061b9f9517ddd014d32d470f283863b3de8ef4c911b2ddb6fe6389b2b/merged/var/log/ceph/libcephfs_client.log
tar: Removing leading `/' from hard link targets

# ls -ltr /tmp/
...
-rw-r--r--. 1 root root 66586734 May 20 22:03 libcephfs_node1_20May25_22-03-02.tgz
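For convenience, the same collection steps condensed into a rough sketch (the -size +0 filter that skips empty log copies is my addition; run the tar step on every ganesha node after reproducing the failure):

# Admin node, before the run:
ceph config set client debug_client 20
ceph config set client log_file /var/log/ceph/libcephfs_client.log

# ... reproduce the fio EIO ...

# Admin node, after the run:
ceph config rm client debug_client
ceph config rm client log_file

# Each ganesha node: bundle every non-empty copy found under the container overlays
logs=$(find /var/ -iname libcephfs_client.log -size +0 2>/dev/null)
tar -czvf /tmp/libcephfs_$(hostname -s)_$(date +%d%b%y_%H-%M-%S).tgz $logs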
Created attachment 2091162 [details]
ganesha logs from the debug image without zerocopy and FSAL async features

The VSI running fio against ganesha reproduced the EIO problem. No wire trace was collected this time.
Hi Jeff,

(In reply to jeff.a.smith from comment #22)
> Created attachment 2091162 [details]
> ganesha logs from the debug image without zerocopy and FSAL async features
>
> VSI running fio against ganesha reproduced the EIO problem. No wire trace
> was collected this time.

The ganesha logs attached in comment#22 do not contain debug logs. They only show messages from when the ganesha daemon started:

May 20 14:36:01 dal3-qz2-sr3-rk247-s18 ceph-b57c0d78-e8d2-11ef-820c-7cc2554980d4-nfs-r134-50b8f3ae-99aa-46be-beb6-07ca3db878c9-0-0-dal3-qz2-sr3-rk247-s18-khutyv[3672130]: 20/05/2025 14:36:01 : epoch 682c9350 : dal3-qz2-sr3-rk247-s18 : ganesha.nfsd-2[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
May 20 14:36:01 dal3-qz2-sr3-rk247-s18 ceph-b57c0d78-e8d2-11ef-820c-7cc2554980d4-nfs-r134-50b8f3ae-99aa-46be-beb6-07ca3db878c9-0-0-dal3-qz2-sr3-rk247-s18-khutyv[3672130]: 20/05/2025 14:36:01 : epoch 682c9350 : dal3-qz2-sr3-rk247-s18 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
May 20 14:36:01 dal3-qz2-sr3-rk247-s18 ceph-b57c0d78-e8d2-11ef-820c-7cc2554980d4-nfs-r134-50b8f3ae-99aa-46be-beb6-07ca3db878c9-0-0-dal3-qz2-sr3-rk247-s18-khutyv[3672130]: 20/05/2025 14:36:01 : epoch 682c9350 : dal3-qz2-sr3-rk247-s18 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
May 20 14:36:01 dal3-qz2-sr3-rk247-s18 ceph-b57c0d78-e8d2-11ef-820c-7cc2554980d4-nfs-r134-50b8f3ae-99aa-46be-beb6-07ca3db878c9-0-0-dal3-qz2-sr3-rk247-s18-khutyv[3672130]: 20/05/2025 14:36:01 : epoch 682c9350 : dal3-qz2-sr3-rk247-s18 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
May 20 14:36:01 dal3-qz2-sr3-rk247-s18 ceph-b57c0d78-e8d2-11ef-820c-7cc2554980d4-nfs-r134-50b8f3ae-99aa-46be-beb6-07ca3db878c9-0-0-dal3-qz2-sr3-rk247-s18-khutyv[3672130]: 20/05/2025 14:36:01 : epoch 682c9350 : dal3-qz2-sr3-rk247-s18 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(1)
May 20 14:36:11 dal3-qz2-sr3-rk247-s18 ceph-b57c0d78-e8d2-11ef-820c-7cc2554980d4-nfs-r134-50b8f3ae-99aa-46be-beb6-07ca3db878c9-0-0-dal3-qz2-sr3-rk247-s18-khutyv[3672130]: 20/05/2025 14:36:11 : epoch 682c9350 : dal3-qz2-sr3-rk247-s18 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(1) clid count(1)
May 20 14:36:11 dal3-qz2-sr3-rk247-s18 ceph-b57c0d78-e8d2-11ef-820c-7cc2554980d4-nfs-r134-50b8f3ae-99aa-46be-beb6-07ca3db878c9-0-0-dal3-qz2-sr3-rk247-s18-khutyv[3672130]: 20/05/2025 14:36:11 : epoch 682c9350 : dal3-qz2-sr3-rk247-s18 : ganesha.nfsd-2[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
Comment on attachment 2091162 [details]
ganesha logs from the debug image without zerocopy and FSAL async features

Please ignore these logs, as the container was not running the latest debug code from Frank.
Created attachment 2091331 [details]
Latest run with async turned off.

# fio include.fio --filename="$(pwd)/test0"
test0: (g=0): rw=randwrite, bs=(R) 10.0KiB-10.0KiB, (W) 10.0KiB-10.0KiB, (T) 10.0KiB-10.0KiB, ioengine=libaio, iodepth=16
...
fio-3.36
Starting 4 processes
fio: io_u error on file /denali/test0: Input/output error: write offset=942694400, buflen=10240
fio: pid=19578, err=5/file:io_u.c:1896, func=io_u error, error=Input/output error

test0: (groupid=0, jobs=4): err= 5 (file:io_u.c:1896, func=io_u error, error=Input/output error): pid=19578: Fri May 23 16:36:41 2025
This morning I verified that the QOS feature is configured and enabled in a11 (the Ceph cluster in our ibmcloud env reproducing the EIO):

[ceph: root@dal1-qz2-sr2-rk044-s18 /]# ceph nfs cluster qos get r134-0803038e-8f26-4335-a1c1-a75707fce95d
{
  "combined_rw_bw_control": true,
  "enable_bw_control": true,
  "enable_iops_control": true,
  "enable_qos": true,
  "max_export_combined_bw": "1.1GB",
  "max_export_iops": 35000,
  "qos_type": "PerShare"
}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775