Bug 171520
| Summary: | NFS client reporting server not responding errors when mounted to lvm2 logical volume on top of md multipath device | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Henry Harris <henry.harris> |
| Component: | gfs | Assignee: | Wendy Cheng <nobody+wcheng> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | GFS Bugs <gfs-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4 | CC: | kanderso, rkenna, steved, teigland |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-01-17 18:18:29 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Henry Harris
2005-10-22 01:39:51 UTC
*** Bug 171521 has been marked as a duplicate of this bug. ***

From Jin Lim at Crosswalk: we have further findings below, and believe there is a client sensitivity in this bug.

1. An NFS client running on a 64-bit system does NOT generate any of the errors mentioned above.
2. An NFS client running on an old 32-bit system constantly generates I/O errors. However, the failures are always from "read", with errno = 5 (I/O error). If we turn off the jdata flag on the GFS filesystem, we do NOT see the read error even on the old clients.

uname -a from the 64-bit client:
Linux igtest04 2.6.9-7.ELdgs_smp #1 SMP Thu May 26 23:25:51 MDT 2005 x86_64 x86_64 x86_64 GNU/Linux

uname -a from the 32-bit client:
Linux sol-load-07 2.4.21-15.EL #1 Thu Apr 22 00:27:41 EDT 2004 i686 i686 i386 GNU/Linux

uname -a from the 64-bit server we tested against:
Linux dev201 2.6.9-22.EL_1-2-1-1_dgs_smp #2 SMP Fri Oct 21 00:59:17 MDT 2005 x86_64 x86_64 x86_64 GNU/Linux

Would it be possible to get a bzip2'd tethereal trace of the I/O error? (i.e. tethereal -w /tmp/data.pcap host <server>; bzip2 /tmp/data.pcap) Also, is there anything in /var/log/messages on either side that might help?

Henry, can you provide the information listed in comment #8?

Kevin, I can do this for you; however, I need more elaboration on how to run the command above. Can Steve D. contact me directly? Jin, 303.635.7886, jlim. Thank you.

Steve, my peer and I are not sure exactly how you want us to run the tool to trace the I/O error. Can you please elaborate? The client load-8 sees the read error on the filesystem mounted from the server sqa-02.

1) Do you want us to run the tool while running the test case that generates the error, or does it not matter if we run it after the test?
2) Do you want us to run the tool only on the client side, or on both the client and server sides?
3) Which is the right syntax on the client side: tethereal -w /tmp/data.pcap host xx.xxx.x.xxx (remote server IP), or tethereal -w /tmp/data.pcap host xx.xxx.x.xxx (local client IP)?
4) We have configured both eth0 (mgmt port) and eth1 (data port) on the client; do you want us to capture traces on both? If so, how do we do that?

Thanks much, Jin

Henry, can you also provide the md configuration files and the lvm2 volume layout information?

Hope this helps; the volume and md information are below:
[root@sqatwo01 ~]# lvdisplay
--- Logical volume ---
LV Name /dev/TestPool_03/TestVol_03
VG Name TestPool_03
LV UUID 1Pomvv-1qsq-HMxd-Cpjx-6qvF-a1rs-Z29OSw
LV Write Access read/write
LV Status available
# open 1
LV Size 40.00 GB
Current LE 10240
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:0
--- Logical volume ---
LV Name /dev/VTL/VTL1
VG Name VTL
LV UUID nEpQdA-fxvs-aySe-FeGW-fhim-UMZ4-GZU5Uw
LV Write Access read/write
LV Status available
# open 1
LV Size 115.00 GB
Current LE 29440
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:1
--- Logical volume ---
LV Name /dev/TestPool_02/TestVol_02
VG Name TestPool_02
LV UUID QYeNgE-MU1H-Gaeg-Luv5-ZjmK-M4t2-ywDgqc
LV Write Access read/write
LV Status available
# open 1
LV Size 70.00 GB
Current LE 17920
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:2
--- Logical volume ---
LV Name /dev/TestPool_01/TestVol_01
VG Name TestPool_01
LV UUID VjOXGi-h1e8-Oabm-Ul66-p8L1-CVVG-laYnWH
LV Write Access read/write
LV Status available
# open 1
LV Size 30.00 GB
Current LE 3840
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:3
--- Logical volume ---
LV Name /dev/vg_igrid_01/crosswalk
VG Name vg_igrid_01
LV UUID N6WQ9Y-uZNd-1Keq-u6AR-A346-w6Wq-7R5Swy
LV Write Access read/write
LV Status available
# open 1
LV Size 10.00 GB
Current LE 2560
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:4
--- Logical volume ---
LV Name /dev/vg_igrid_01/nvbackup
VG Name vg_igrid_01
LV UUID sQUei8-VaUL-lV2Q-SJqZ-AtM6-IpVu-24xisL
LV Write Access read/write
LV Status available
# open 1
LV Size 20.00 GB
Current LE 5120
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:5
[root@sqatwo01 ~]# vgdisplay
--- Volume group ---
VG Name TestPool_03
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 4
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 2
Act PV 2
VG Size 239.99 GB
PE Size 4.00 MB
Total PE 61438
Alloc PE / Size 10240 / 40.00 GB
Free PE / Size 51198 / 199.99 GB
VG UUID JAkY67-w6ud-U71l-lBkZ-nn6X-2rFN-RHxRsb
--- Volume group ---
VG Name VTL
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 2
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 1
Act PV 1
VG Size 120.00 GB
PE Size 4.00 MB
Total PE 30719
Alloc PE / Size 29440 / 115.00 GB
Free PE / Size 1279 / 5.00 GB
VG UUID AY1Thr-0mee-uj73-057Q-KZLn-dV1S-ei6XI2
--- Volume group ---
VG Name TestPool_02
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 1
Act PV 1
VG Size 80.00 GB
PE Size 4.00 MB
Total PE 20479
Alloc PE / Size 17920 / 70.00 GB
Free PE / Size 2559 / 10.00 GB
VG UUID SwPPu1-hCQU-A0Yg-E07Y-IsIA-WASH-xAnDSt
--- Volume group ---
VG Name TestPool_01
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 6
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 2
Act PV 2
VG Size 59.98 GB
PE Size 8.00 MB
Total PE 7678
Alloc PE / Size 3840 / 30.00 GB
Free PE / Size 3838 / 29.98 GB
VG UUID 3uvyVH-4Tcd-Li7z-Dgnh-k1Ck-22rh-7g8OSz
--- Volume group ---
VG Name vg_igrid_01
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 3
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 2
Open LV 2
Max PV 0
Cur PV 2
Act PV 2
VG Size 59.99 GB
PE Size 4.00 MB
Total PE 15358
Alloc PE / Size 7680 / 30.00 GB
Free PE / Size 7678 / 29.99 GB
VG UUID 5xeXfr-GoPZ-x86H-K8ss-0d3W-t9vi-S7jQ6z
[root@sqatwo01 ~]# pvdisplay
--- Physical volume ---
PV Name /dev/md8
VG Name TestPool_03
PV Size 120.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 30719
Free PE 25599
Allocated PE 5120
PV UUID Rr1H2E-gPYI-v8d6-hEmS-KDpB-7M0I-14tCdf
--- Physical volume ---
PV Name /dev/md7
VG Name TestPool_03
PV Size 120.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 30719
Free PE 25599
Allocated PE 5120
PV UUID eTve6K-wPCw-TqN2-RT7g-eZkT-csKE-mAtr3u
--- Physical volume ---
PV Name /dev/md6
VG Name VTL
PV Size 120.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 30719
Free PE 1279
Allocated PE 29440
PV UUID gs4yL2-MWE5-iTRx-G0wz-2t96-C8eL-DaENip
--- Physical volume ---
PV Name /dev/md3
VG Name TestPool_02
PV Size 80.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 20479
Free PE 2559
Allocated PE 17920
PV UUID 3vaflp-DyRL-2vaF-pLk0-C4qX-zY6n-9c4osQ
--- Physical volume ---
PV Name /dev/md9
VG Name TestPool_01
PV Size 29.99 GB / not usable 0
Allocatable yes
PE Size (KByte) 8192
Total PE 3839
Free PE 1919
Allocated PE 1920
PV UUID peqbOa-TKvx-Qfjq-Ub3l-XjJA-Ubav-HBPzWc
--- Physical volume ---
PV Name /dev/md2
VG Name TestPool_01
PV Size 29.99 GB / not usable 0
Allocatable yes
PE Size (KByte) 8192
Total PE 3839
Free PE 1919
Allocated PE 1920
PV UUID S6Vlf0-H5eU-HcXB-VVZ4-QfcM-0hMh-a5bQWn
--- Physical volume ---
PV Name /dev/md0
VG Name vg_igrid_01
PV Size 30.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 7679
Free PE 1279
Allocated PE 6400
PV UUID H5do3H-4XnE-753T-N82i-dyji-HgIR-Mdn3x6
--- Physical volume ---
PV Name /dev/md1
VG Name vg_igrid_01
PV Size 30.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 7679
Free PE 6399
Allocated PE 1280
PV UUID rSASBF-YvS4-iClK-CuBC-ajmP-p7YQ-e4IUdq
[root@sqatwo01 ~]# cat /proc/mdstat
Personalities : [multipath]
md9 : active multipath sdl[0]
31457216 blocks [1/1] [U]
md8 : active multipath sdk[0]
125829056 blocks [1/1] [U]
md7 : active multipath sdj[0]
125829056 blocks [1/1] [U]
md6 : active multipath sdi[0]
125829056 blocks [1/1] [U]
md5 : active multipath sdh[0]
83886016 blocks [1/1] [U]
md4 : active multipath sdg[0]
83886016 blocks [1/1] [U]
md3 : active multipath sdf[0]
83886016 blocks [1/1] [U]
md2 : active multipath sde[0]
31457216 blocks [1/1] [U]
md1 : active multipath sdd[0]
31457216 blocks [1/1] [U]
md0 : active multipath sdc[0]
31457216 blocks [1/1] [U]
One more note: when we see the read fail on the client, nfsd on the server side reports the following in /var/log/messages:

Oct 31 15:33:11 sqatwo01 kernel: nfsd: non-standard errno: -38

Latest update from the Crosswalk team...
We added more kernel debug messages and were then able to isolate where the
read failure is induced. The gfs_sendfile() function in ~/fs/gfs/ops_file.c
returns the ENOSYS error for read I/O issued from an NFS client when the
gfs_jdata flag is turned on. Please see the code below for more details.
static ssize_t
gfs_sendfile(struct file *in_file, loff_t *offset, size_t count,
             read_actor_t actor, void __user *target)
{
        /* ... */

        /* Journaled-data (jdata) files are not handled by sendfile();
           the -ENOSYS returned here is what nfsd reports as errno -38. */
        if (gfs_is_jdata(ip))
                retval = -ENOSYS;
        else
                retval = generic_file_sendfile(in_file, offset, count,
                                               actor, target);

 out:
        gfs_holder_uninit(&gh);
        return retval;
}
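For context, the -38 in the nfsd log messages above is -ENOSYS. The following is a rough sketch of the relevant branch of the server-side read path, assuming the usual 2.6.9-era structure of nfsd_vfs_read() in fs/nfsd/vfs.c; it is not the verbatim RHEL 4 source. nfsd prefers the file's ->sendfile() method whenever one is provided and only uses vfs_readv() when that method pointer is NULL, so the -ENOSYS returned by gfs_sendfile() for jdata files is passed straight back rather than triggering a fallback.

        /* Sketch only: declarations and error handling elided. */
        if (file->f_op && file->f_op->sendfile) {
                /* GFS provides gfs_sendfile(), so this branch is taken.
                   For jdata files it returns -ENOSYS (-38), which nfsd
                   logs as "non-standard errno: -38" and the client sees
                   as an I/O error. */
                err = file->f_op->sendfile(file, &offset, *count,
                                           nfsd_read_actor, rqstp);
        } else {
                /* Reached only when the filesystem has no sendfile()
                   method at all, never when sendfile() fails. */
                oldfs = get_fs();
                set_fs(KERNEL_DS);
                err = vfs_readv(file, (struct iovec __user *)vec, vlen,
                                &offset);
                set_fs(oldfs);
        }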
Based on our finding above, the question is how we can safely turn on jdata
so that we do not encounter the error for any data transfer over an NFS
network. Or should we NOT turn on jdata at all? Please advise.

Should you need more details of our findings, please contact Jin directly at
jlim or 303-635-7886. Thank you.
Based on the comments above, I am moving this defect to the cluster product and the GFS component, and reassigning it.

We can hit this in house with:
(link13 is part of a 3 node DLM cluster)
[root@link-13 ~]# pvcreate /dev/sdb1
[root@link-13 ~]# vgcreate myvg /dev/sdb1
[root@link-13 ~]# lvcreate -l 69748 -n myvol myvg
[root@link-13 ~]# vgchange -ay myvg
[root@link-13 ~]# gfs_mkfs -j 3 -J 32 -p lock_dlm -t LINK_131415:gfs0
/dev/myvg/myvol
[root@link-13 ~]# mount -t gfs /dev/myvg/myvol /mnt
[root@link-13 ~]# service nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]
[root@link-13 ~]# gfs_tool setflag inherit_jdata /mnt
[root@link-13 ~]# exportfs -o rw fore.lab.msp.redhat.com:/mnt
: fore; mount link-13.lab.msp.redhat.com:/mnt /mnt
: fore; pan2 -A -x 10 -l ./nfs.out -o ./nfs.out -t -x 10 -f ./nfsload.h2
[load9] doio (19181) 11:24:34
[load9] ---------------------
[load9] Could not read 602148 bytes from /mnt/testfile9 for verification:
Input/output error (5)
[load9] Request number 1
[load9] syscall: write(4, 02522525640020, 602148)
[load9] fd 4 is file /mnt/testfile9 - open flags are 010001
[load9] write done at file offset 46885851 - pattern is G:19181:fore:doio*
[load9]
ERROR in /var/log/messages: Nov 2 06:26:54 link-13 kernel: nfsd: non-standard
errno: -38
<-------------------- nfsload.h2 ----------------------->
<herd name="nfs load">
<test tag="load9">
<cmd> <![CDATA[
/usr/tests/sts/bin/iogen -s write,writev -t 1000b -T 10000b -o -m random
-i 0 100000b:/mnt/testfile9 | /usr/tests/sts/bin/doio -avk
]]> </cmd>
</test>
.
.
.
After forcing nfsd to use the vfs_readv() path after gfs returns -ENOSYS from sendfile()... NFS deals with data in pages, which is what all normal file I/O uses. GFS does its own thing for jdata files, though, and treats them like metadata, which means it all goes through the buffer cache, not the page cache. Specifically, I get a kmap oops when NFS tries to send the "page" of jdata that's not really a page, AFAICT. It's very unlikely that NFS could be modified to deal with non-paged data, and GFS's data journaling would need to be redesigned to go through the page cache. Perhaps some translation could be done to temporarily copy jdata buffers into pages so NFS could deal with them, but someone with VM-internals expertise would be needed to judge that.

The issue here is that the current GFS jdata code puts data into the buffer cache instead of the page cache, but NFSD's sendfile only interacts with the page cache. There is work in progress in GFS2 to address this issue, which will be part of future GFS releases.

Per today's team meeting, we will not support this feature in RHEL 4 due to the scale and risks of the associated changes. Let us know if there are concerns.
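For illustration, here is a minimal sketch of the kind of change described at the start of this analysis: forcing nfsd to retry through vfs_readv() when a filesystem's sendfile() method returns -ENOSYS. This is a hypothetical fragment under the same 2.6.9 nfsd_vfs_read() assumptions as the sketch above, not the actual patch that was tested, and as noted it is still not sufficient for GFS jdata.

        /* Sketch only: declarations and error handling elided. */
        err = -ENOSYS;
        if (file->f_op && file->f_op->sendfile)
                err = file->f_op->sendfile(file, &offset, *count,
                                           nfsd_read_actor, rqstp);
        if (err == -ENOSYS) {
                /* Fall back to the buffered read path instead of returning
                   the error to the client.  For GFS jdata the data still
                   lives in the buffer cache rather than the page cache, so
                   a kmap oops follows when nfsd tries to send it as pages. */
                oldfs = get_fs();
                set_fs(KERNEL_DS);
                err = vfs_readv(file, (struct iovec __user *)vec, vlen,
                                &offset);
                set_fs(oldfs);
        }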