Description of problem:
An NFS client mounting either a GFS file system or an ext3 file system sitting on top of an lvm2 logical volume on top of a multipath md device on an NFS server gets occasional "server not responding" errors.

Version-Release number of selected component (if applicable):
RHEL4 U2

How reproducible:
Every time

Steps to Reproduce:
1. Create a multipath md device with one LUN
2. Create an LVM2 logical volume
3. Format with either GFS or ext3
4. Export via NFS
5. Create a script to continuously copy a large file to the NFS mount on the client

Actual results:
The NFS client reports "server not responding" several times over a 10 minute period.

Expected results:
The NFS client does not report errors.

Additional info:
This has been seen on multiple clusters with GFS. The NFS server (running GFS/Cluster Manager) sometimes reports "nfsd: non-standard errno: -38" when the NFS client gets this error; sometimes it does not. I tried exactly the same test with the same NFS client and an ext3 volume on an internal drive on the same NFS server without a problem. I have tried to reproduce this on sd devices to be sure that the md driver is causing the problem, but I am having problems creating another volume group on the same cluster with sd devices; I will write a separate bug on this issue. I have, however, seen this on multiple GFS clusters with multiple NFS clients, but all are using md multipath devices under LVM2.
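The reproduction steps above can be sketched as a shell session. This is only an illustration: the device names, volume sizes, hostnames, and mount points are hypothetical, and the GFS case would substitute gfs_mkfs for mkfs.ext3.

```shell
# 1. Create a multipath md device with one LUN (device name hypothetical)
mdadm --create /dev/md0 --level=multipath --raid-devices=1 /dev/sdc

# 2. Create an LVM2 logical volume on top of it
pvcreate /dev/md0
vgcreate testvg /dev/md0
lvcreate -L 20G -n testvol testvg

# 3. Format (ext3 shown; use gfs_mkfs for the GFS case) and mount
mkfs.ext3 /dev/testvg/testvol
mount /dev/testvg/testvol /export

# 4. Export via NFS (client hostname hypothetical)
echo '/export client.example.com(rw,sync)' >> /etc/exports
exportfs -ra

# 5. On the client: continuously copy a large file to the NFS mount
mount server:/export /mnt/nfs
while true; do cp /tmp/bigfile /mnt/nfs/bigfile; done
```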
*** Bug 171521 has been marked as a duplicate of this bug. ***
From Jin Lim at Crosswalk: we have further findings below, and believe there is client sensitivity in this bug.

1. An NFS client running on a 64-bit system does NOT generate any of the errors mentioned above.
2. An NFS client running on an old 32-bit system does constantly generate an I/O error. However, the I/O failures are always from "read" with errno = 5 (I/O error). If we turn off the jdata flag on the GFS filesystem then we do NOT see any read errors, even on the old clients.

uname -a from the 64-bit client:
Linux igtest04 2.6.9-7.ELdgs_smp #1 SMP Thu May 26 23:25:51 MDT 2005 x86_64 x86_64 x86_64 GNU/Linux

uname -a from the 32-bit client:
Linux sol-load-07 2.4.21-15.EL #1 Thu Apr 22 00:27:41 EDT 2004 i686 i686 i386 GNU/Linux

uname -a from the 64-bit server we tested against:
Linux dev201 2.6.9-22.EL_1-2-1-1_dgs_smp #2 SMP Fri Oct 21 00:59:17 MDT 2005 x86_64 x86_64 x86_64 GNU/Linux
Would it be possible to get a bzip2-compressed tethereal trace of the I/O error? (i.e. tethereal -w /tmp/data.pcap host <server>; bzip2 /tmp/data.pcap) Also, is there anything in /var/log/messages on either machine that might help?
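For reference, a capture session on the client while the failing copy runs might look like the following. The interface names, server address, and output file names here are hypothetical, not taken from this bug.

```shell
# Capture NFS traffic on both client interfaces while the test runs
# (interface names and server IP are hypothetical)
tethereal -i eth0 -w /tmp/data-eth0.pcap host 10.0.0.2 &
tethereal -i eth1 -w /tmp/data-eth1.pcap host 10.0.0.2 &

# ... run the test until the read error appears, then stop the captures ...
kill %1 %2

# Compress before attaching to the bug
bzip2 /tmp/data-eth0.pcap /tmp/data-eth1.pcap
```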
Henry, can you provide the information listed in comment #8?
Kevin, I can do this for you; however, I need more elaboration on how to run the command above. Can Steve D. contact me directly? Jin, 303.635.7886, jlim. Thank you.
Steve, my peer and I are not sure exactly how you want us to run the tool to trace the I/O error. Can you please elaborate? The client load-8 sees the read error on the filesystem mounted from the server sqa-02.

1) Do you want us to run the tool while running the test case that generates the error? Or does it matter if we run it after the test?
2) Do you want us to run the tool only on the client side, or on both the client and server sides?
3) Which is the right syntax on the client side: tethereal -w /tmp/data.pcap host xx.xxx.x.xxx (remote server IP) or tethereal -w /tmp/data.pcap host xx.xxx.x.xxx (local client IP)?
4) We have configured both eth0 (mgmt port) and eth1 (data port) on the client. Do you want us to capture the traces on both? If so, how do we do that?

Thanks much, Jin
Henry, Can you also provide the md configuration files and the lvm2 volume layout information?
Hope this helps; the volume and md information is below:

[root@sqatwo01 ~]# lvdisplay
  --- Logical volume ---
  LV Name                /dev/TestPool_03/TestVol_03
  VG Name                TestPool_03
  LV UUID                1Pomvv-1qsq-HMxd-Cpjx-6qvF-a1rs-Z29OSw
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                40.00 GB
  Current LE             10240
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:0

  --- Logical volume ---
  LV Name                /dev/VTL/VTL1
  VG Name                VTL
  LV UUID                nEpQdA-fxvs-aySe-FeGW-fhim-UMZ4-GZU5Uw
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                115.00 GB
  Current LE             29440
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:1

  --- Logical volume ---
  LV Name                /dev/TestPool_02/TestVol_02
  VG Name                TestPool_02
  LV UUID                QYeNgE-MU1H-Gaeg-Luv5-ZjmK-M4t2-ywDgqc
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                70.00 GB
  Current LE             17920
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:2

  --- Logical volume ---
  LV Name                /dev/TestPool_01/TestVol_01
  VG Name                TestPool_01
  LV UUID                VjOXGi-h1e8-Oabm-Ul66-p8L1-CVVG-laYnWH
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                30.00 GB
  Current LE             3840
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:3

  --- Logical volume ---
  LV Name                /dev/vg_igrid_01/crosswalk
  VG Name                vg_igrid_01
  LV UUID                N6WQ9Y-uZNd-1Keq-u6AR-A346-w6Wq-7R5Swy
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                10.00 GB
  Current LE             2560
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:4

  --- Logical volume ---
  LV Name                /dev/vg_igrid_01/nvbackup
  VG Name                vg_igrid_01
  LV UUID                sQUei8-VaUL-lV2Q-SJqZ-AtM6-IpVu-24xisL
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                20.00 GB
  Current LE             5120
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:5

[root@sqatwo01 ~]# vgdisplay
  --- Volume group ---
  VG Name               TestPool_03
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               239.99 GB
  PE Size               4.00 MB
  Total PE              61438
  Alloc PE / Size       10240 / 40.00 GB
  Free  PE / Size       51198 / 199.99 GB
  VG UUID               JAkY67-w6ud-U71l-lBkZ-nn6X-2rFN-RHxRsb

  --- Volume group ---
  VG Name               VTL
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               120.00 GB
  PE Size               4.00 MB
  Total PE              30719
  Alloc PE / Size       29440 / 115.00 GB
  Free  PE / Size       1279 / 5.00 GB
  VG UUID               AY1Thr-0mee-uj73-057Q-KZLn-dV1S-ei6XI2

  --- Volume group ---
  VG Name               TestPool_02
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               80.00 GB
  PE Size               4.00 MB
  Total PE              20479
  Alloc PE / Size       17920 / 70.00 GB
  Free  PE / Size       2559 / 10.00 GB
  VG UUID               SwPPu1-hCQU-A0Yg-E07Y-IsIA-WASH-xAnDSt

  --- Volume group ---
  VG Name               TestPool_01
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  6
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               59.98 GB
  PE Size               8.00 MB
  Total PE              7678
  Alloc PE / Size       3840 / 30.00 GB
  Free  PE / Size       3838 / 29.98 GB
  VG UUID               3uvyVH-4Tcd-Li7z-Dgnh-k1Ck-22rh-7g8OSz

  --- Volume group ---
  VG Name               vg_igrid_01
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               59.99 GB
  PE Size               4.00 MB
  Total PE              15358
  Alloc PE / Size       7680 / 30.00 GB
  Free  PE / Size       7678 / 29.99 GB
  VG UUID               5xeXfr-GoPZ-x86H-K8ss-0d3W-t9vi-S7jQ6z

[root@sqatwo01 ~]# pvdisplay
  --- Physical volume ---
  PV Name               /dev/md8
  VG Name               TestPool_03
  PV Size               120.00 GB / not usable 0
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              30719
  Free PE               25599
  Allocated PE          5120
  PV UUID               Rr1H2E-gPYI-v8d6-hEmS-KDpB-7M0I-14tCdf

  --- Physical volume ---
  PV Name               /dev/md7
  VG Name               TestPool_03
  PV Size               120.00 GB / not usable 0
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              30719
  Free PE               25599
  Allocated PE          5120
  PV UUID               eTve6K-wPCw-TqN2-RT7g-eZkT-csKE-mAtr3u

  --- Physical volume ---
  PV Name               /dev/md6
  VG Name               VTL
  PV Size               120.00 GB / not usable 0
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              30719
  Free PE               1279
  Allocated PE          29440
  PV UUID               gs4yL2-MWE5-iTRx-G0wz-2t96-C8eL-DaENip

  --- Physical volume ---
  PV Name               /dev/md3
  VG Name               TestPool_02
  PV Size               80.00 GB / not usable 0
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              20479
  Free PE               2559
  Allocated PE          17920
  PV UUID               3vaflp-DyRL-2vaF-pLk0-C4qX-zY6n-9c4osQ

  --- Physical volume ---
  PV Name               /dev/md9
  VG Name               TestPool_01
  PV Size               29.99 GB / not usable 0
  Allocatable           yes
  PE Size (KByte)       8192
  Total PE              3839
  Free PE               1919
  Allocated PE          1920
  PV UUID               peqbOa-TKvx-Qfjq-Ub3l-XjJA-Ubav-HBPzWc

  --- Physical volume ---
  PV Name               /dev/md2
  VG Name               TestPool_01
  PV Size               29.99 GB / not usable 0
  Allocatable           yes
  PE Size (KByte)       8192
  Total PE              3839
  Free PE               1919
  Allocated PE          1920
  PV UUID               S6Vlf0-H5eU-HcXB-VVZ4-QfcM-0hMh-a5bQWn

  --- Physical volume ---
  PV Name               /dev/md0
  VG Name               vg_igrid_01
  PV Size               30.00 GB / not usable 0
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              7679
  Free PE               1279
  Allocated PE          6400
  PV UUID               H5do3H-4XnE-753T-N82i-dyji-HgIR-Mdn3x6

  --- Physical volume ---
  PV Name               /dev/md1
  VG Name               vg_igrid_01
  PV Size               30.00 GB / not usable 0
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              7679
  Free PE               6399
  Allocated PE          1280
  PV UUID               rSASBF-YvS4-iClK-CuBC-ajmP-p7YQ-e4IUdq

[root@sqatwo01 ~]# cat /proc/mdstat
Personalities : [multipath]
md9 : active multipath sdl[0]
      31457216 blocks [1/1] [U]
md8 : active multipath sdk[0]
      125829056 blocks [1/1] [U]
md7 : active multipath sdj[0]
      125829056 blocks [1/1] [U]
md6 : active multipath sdi[0]
      125829056 blocks [1/1] [U]
md5 : active multipath sdh[0]
      83886016 blocks [1/1] [U]
md4 : active multipath sdg[0]
      83886016 blocks [1/1] [U]
md3 : active multipath sdf[0]
      83886016 blocks [1/1] [U]
md2 : active multipath sde[0]
      31457216 blocks [1/1] [U]
md1 : active multipath sdd[0]
      31457216 blocks [1/1] [U]
md0 : active multipath sdc[0]
      31457216 blocks [1/1] [U]
One more note: when we see the read fail on the client, nfsd on the server side reports the following in /var/log/messages:

Oct 31 15:33:11 sqatwo01 kernel: nfsd: non-standard errno: -38
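As a side note, errno 38 on Linux is ENOSYS ("Function not implemented"), so the "-38" above is nfsd passing an ENOSYS return straight through from the filesystem. A quick way to confirm the mapping:

```shell
# Decode errno 38 (the value nfsd reports as "non-standard errno: -38")
python3 -c 'import errno, os; print(errno.ENOSYS, os.strerror(errno.ENOSYS))'
# prints: 38 Function not implemented
```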
Latest update from the Crosswalk team: we added more kernel debug messages and were able to isolate where the read failure is induced. The gfs_sendfile() function in ~/fs/gfs/ops_file.c returns the ENOSYS error for a read I/O issued from an NFS client when the gfs_jdata flag is turned on. Please see the code below for more details.

static ssize_t
gfs_sendfile(struct file *in_file, loff_t *offset, size_t count,
             read_actor_t actor, void __user *target)
{
        ...

        if (gfs_is_jdata(ip)) {
                retval = -ENOSYS;
        } else
                retval = generic_file_sendfile(in_file, offset, count,
                                               actor, target);

 out:
        gfs_holder_uninit(&gh);
        return retval;
}

Based on our finding above, the question is: how can we safely turn on jdata so that we will not encounter the error for any data transfer over NFS? Or should we NOT turn on jdata at all? Please advise. Should you need more details of our findings, please contact Jin directly at jlim or 303-635-7886. Thank you.
Based on the comments above, I am reassigning this defect to the cluster product and the GFS component.
We can hit this in house with: (link-13 is part of a 3-node DLM cluster)

[root@link-13 ~]# pvcreate /dev/sdb1
[root@link-13 ~]# vgcreate myvg /dev/sdb1
[root@link-13 ~]# lvcreate -l 69748 -n myvol myvg
[root@link-13 ~]# vgchange -ay myvg
[root@link-13 ~]# gfs_mkfs -j 3 -J 32 -p lock_dlm -t LINK_131415:gfs0 /dev/myvg/myvol
[root@link-13 ~]# mount -t gfs /dev/myvg/myvol /mnt
[root@link-13 ~]# service nfs start
Starting NFS services:  [ OK ]
Starting NFS quotas:    [ OK ]
Starting NFS daemon:    [ OK ]
Starting NFS mountd:    [ OK ]
[root@link-13 ~]# gfs_tool setflag inherit_jdata /mnt
[root@link-13 ~]# exportfs -o rw fore.lab.msp.redhat.com:/mnt

On the client:

: fore; mount link-13.lab.msp.redhat.com:/mnt /mnt
: fore; pan2 -A -x 10 -l ./nfs.out -o ./nfs.out -t -x 10 -f ./nfsload.h2

[load9] doio (19181) 11:24:34
[load9] ---------------------
[load9] Could not read 602148 bytes from /mnt/testfile9 for verification: Input/output error (5)
[load9] Request number 1
[load9] syscall: write(4, 02522525640020, 602148)
[load9] fd 4 is file /mnt/testfile9 - open flags are 010001
[load9] write done at file offset 46885851 - pattern is G:19181:fore:doio*
[load9] ERROR

In /var/log/messages:
Nov  2 06:26:54 link-13 kernel: nfsd: non-standard errno: -38

<-------------------- nfsload.h2 ----------------------->
<herd name="nfs load">
  <test tag="load9">
    <cmd>
      <![CDATA[ /usr/tests/sts/bin/iogen -s write,writev -t 1000b -T 10000b -o -m random -i 0 100000b:/mnt/testfile9 | /usr/tests/sts/bin/doio -avk ]]>
    </cmd>
  </test>
  .
  .
  .
After forcing nfsd to use the vfs_readv() path after gfs returns -ENOSYS from sendfile(): nfs deals with data in pages, which is what all normal file i/o uses. GFS does its own thing for jdata files, though, and treats them like metadata, which means it all goes through the buffer cache, not the page cache. Specifically, I get a kmap oops when nfs tries to send the "page" of jdata that's not really a page, AFAICT. It's very unlikely that nfs could be modified to deal with non-paged data, and gfs's data journaling would need to be redesigned to go through the page cache. Perhaps some translation could be done to temporarily copy jdata buffers into pages so nfs could deal with them, but someone with VM-internals expertise would be needed to judge that.
The issue here is that the current GFS jdata code puts data into the buffer cache instead of the page cache, but NFSD's sendfile path only interacts with the page cache. There is work in progress in GFS2 to address this issue, which will be part of future GFS releases. Per the team meeting today, we will not support this feature in RHEL 4 due to the scale and risks of the associated changes. Let us know if there are concerns.