Bug 171520
| Summary: | NFS client reporting server not responding errors when mounted to lvm2 logical volume on top of md multipath device | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Henry Harris <henry.harris> |
| Component: | gfs | Assignee: | Wendy Cheng <nobody+wcheng> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | GFS Bugs <gfs-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4 | CC: | kanderso, rkenna, steved, teigland |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-01-17 18:18:29 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Henry Harris
2005-10-22 01:39:51 UTC
*** Bug 171521 has been marked as a duplicate of this bug. ***

From Jin Lim at Crosswalk: we have further findings below, and believe there is a client sensitivity in this bug.

1. An NFS client running on a 64-bit system does NOT generate any of the errors mentioned above.
2. An NFS client running on an old 32-bit system constantly generates I/O errors. However, the failures are always from "read", with errno = 5 (I/O error). If we turn off the jdata flag on the GFS filesystem, we do NOT see the read error even on the old clients.

uname -a from the 64-bit client:
Linux igtest04 2.6.9-7.ELdgs_smp #1 SMP Thu May 26 23:25:51 MDT 2005 x86_64 x86_64 x86_64 GNU/Linux

uname -a from the 32-bit client:
Linux sol-load-07 2.4.21-15.EL #1 Thu Apr 22 00:27:41 EDT 2004 i686 i686 i386 GNU/Linux

uname -a from the 64-bit server we tested against:
Linux dev201 2.6.9-22.EL_1-2-1-1_dgs_smp #2 SMP Fri Oct 21 00:59:17 MDT 2005 x86_64 x86_64 x86_64 GNU/Linux

Would it be possible to get a bzip2'd tethereal trace of the I/O error? (i.e. tethereal -w /tmp/data.pcap host <server>; bzip2 /tmp/data.pcap) Also, is there anything in /var/log/messages on either side that might help?

Henry, can you provide the information listed in comment #8?

Kevin, I can do this for you; however, I need more elaboration on how to run the command above. Can Steve D. contact me directly? Jin, 303.635.7886, jlim. Thank you.

Steve, my peer and I are not sure exactly how you want us to run the tool to trace the I/O error. Can you please elaborate? The client load-8 sees the read error on the filesystem mounted from the server sqa-02.

1) Do you want us to run the tool while running the test case that generates the error, or does it not matter if we run it after the test?
2) Do you want us to run the tool only on the client side, or on both the client and server sides?
3) Which is the right syntax on the client side: tethereal -w /tmp/data.pcap host xx.xxx.x.xxx (remote server IP), or tethereal -w /tmp/data.pcap host xx.xxx.x.xxx (local client IP)?
4) We have configured both eth0 (mgmt port) and eth1 (data port) on the client; do you want us to capture traces on both? If so, how do we do that?

Thanks much, Jin

Henry, can you also provide the md configuration files and the lvm2 volume layout information?

Hope this helps; the volume and md information are below:
[root@sqatwo01 ~]# lvdisplay
--- Logical volume ---
LV Name /dev/TestPool_03/TestVol_03
VG Name TestPool_03
LV UUID 1Pomvv-1qsq-HMxd-Cpjx-6qvF-a1rs-Z29OSw
LV Write Access read/write
LV Status available
# open 1
LV Size 40.00 GB
Current LE 10240
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:0
--- Logical volume ---
LV Name /dev/VTL/VTL1
VG Name VTL
LV UUID nEpQdA-fxvs-aySe-FeGW-fhim-UMZ4-GZU5Uw
LV Write Access read/write
LV Status available
# open 1
LV Size 115.00 GB
Current LE 29440
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:1
--- Logical volume ---
LV Name /dev/TestPool_02/TestVol_02
VG Name TestPool_02
LV UUID QYeNgE-MU1H-Gaeg-Luv5-ZjmK-M4t2-ywDgqc
LV Write Access read/write
LV Status available
# open 1
LV Size 70.00 GB
Current LE 17920
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:2
--- Logical volume ---
LV Name /dev/TestPool_01/TestVol_01
VG Name TestPool_01
LV UUID VjOXGi-h1e8-Oabm-Ul66-p8L1-CVVG-laYnWH
LV Write Access read/write
LV Status available
# open 1
LV Size 30.00 GB
Current LE 3840
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:3
--- Logical volume ---
LV Name /dev/vg_igrid_01/crosswalk
VG Name vg_igrid_01
LV UUID N6WQ9Y-uZNd-1Keq-u6AR-A346-w6Wq-7R5Swy
LV Write Access read/write
LV Status available
# open 1
LV Size 10.00 GB
Current LE 2560
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:4
--- Logical volume ---
LV Name /dev/vg_igrid_01/nvbackup
VG Name vg_igrid_01
LV UUID sQUei8-VaUL-lV2Q-SJqZ-AtM6-IpVu-24xisL
LV Write Access read/write
LV Status available
# open 1
LV Size 20.00 GB
Current LE 5120
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:5
[root@sqatwo01 ~]# vgdisplay
--- Volume group ---
VG Name TestPool_03
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 4
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 2
Act PV 2
VG Size 239.99 GB
PE Size 4.00 MB
Total PE 61438
Alloc PE / Size 10240 / 40.00 GB
Free PE / Size 51198 / 199.99 GB
VG UUID JAkY67-w6ud-U71l-lBkZ-nn6X-2rFN-RHxRsb
--- Volume group ---
VG Name VTL
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 2
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 1
Act PV 1
VG Size 120.00 GB
PE Size 4.00 MB
Total PE 30719
Alloc PE / Size 29440 / 115.00 GB
Free PE / Size 1279 / 5.00 GB
VG UUID AY1Thr-0mee-uj73-057Q-KZLn-dV1S-ei6XI2
--- Volume group ---
VG Name TestPool_02
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 1
Act PV 1
VG Size 80.00 GB
PE Size 4.00 MB
Total PE 20479
Alloc PE / Size 17920 / 70.00 GB
Free PE / Size 2559 / 10.00 GB
VG UUID SwPPu1-hCQU-A0Yg-E07Y-IsIA-WASH-xAnDSt
--- Volume group ---
VG Name TestPool_01
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 6
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 2
Act PV 2
VG Size 59.98 GB
PE Size 8.00 MB
Total PE 7678
Alloc PE / Size 3840 / 30.00 GB
Free PE / Size 3838 / 29.98 GB
VG UUID 3uvyVH-4Tcd-Li7z-Dgnh-k1Ck-22rh-7g8OSz
--- Volume group ---
VG Name vg_igrid_01
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 3
VG Access read/write
VG Status resizable
Clustered yes
Shared no
MAX LV 0
Cur LV 2
Open LV 2
Max PV 0
Cur PV 2
Act PV 2
VG Size 59.99 GB
PE Size 4.00 MB
Total PE 15358
Alloc PE / Size 7680 / 30.00 GB
Free PE / Size 7678 / 29.99 GB
VG UUID 5xeXfr-GoPZ-x86H-K8ss-0d3W-t9vi-S7jQ6z
[root@sqatwo01 ~]# pvdisplay
--- Physical volume ---
PV Name /dev/md8
VG Name TestPool_03
PV Size 120.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 30719
Free PE 25599
Allocated PE 5120
PV UUID Rr1H2E-gPYI-v8d6-hEmS-KDpB-7M0I-14tCdf
--- Physical volume ---
PV Name /dev/md7
VG Name TestPool_03
PV Size 120.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 30719
Free PE 25599
Allocated PE 5120
PV UUID eTve6K-wPCw-TqN2-RT7g-eZkT-csKE-mAtr3u
--- Physical volume ---
PV Name /dev/md6
VG Name VTL
PV Size 120.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 30719
Free PE 1279
Allocated PE 29440
PV UUID gs4yL2-MWE5-iTRx-G0wz-2t96-C8eL-DaENip
--- Physical volume ---
PV Name /dev/md3
VG Name TestPool_02
PV Size 80.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 20479
Free PE 2559
Allocated PE 17920
PV UUID 3vaflp-DyRL-2vaF-pLk0-C4qX-zY6n-9c4osQ
--- Physical volume ---
PV Name /dev/md9
VG Name TestPool_01
PV Size 29.99 GB / not usable 0
Allocatable yes
PE Size (KByte) 8192
Total PE 3839
Free PE 1919
Allocated PE 1920
PV UUID peqbOa-TKvx-Qfjq-Ub3l-XjJA-Ubav-HBPzWc
--- Physical volume ---
PV Name /dev/md2
VG Name TestPool_01
PV Size 29.99 GB / not usable 0
Allocatable yes
PE Size (KByte) 8192
Total PE 3839
Free PE 1919
Allocated PE 1920
PV UUID S6Vlf0-H5eU-HcXB-VVZ4-QfcM-0hMh-a5bQWn
--- Physical volume ---
PV Name /dev/md0
VG Name vg_igrid_01
PV Size 30.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 7679
Free PE 1279
Allocated PE 6400
PV UUID H5do3H-4XnE-753T-N82i-dyji-HgIR-Mdn3x6
--- Physical volume ---
PV Name /dev/md1
VG Name vg_igrid_01
PV Size 30.00 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 7679
Free PE 6399
Allocated PE 1280
PV UUID rSASBF-YvS4-iClK-CuBC-ajmP-p7YQ-e4IUdq
[root@sqatwo01 ~]# cat /proc/mdstat
Personalities : [multipath]
md9 : active multipath sdl[0]
31457216 blocks [1/1] [U]
md8 : active multipath sdk[0]
125829056 blocks [1/1] [U]
md7 : active multipath sdj[0]
125829056 blocks [1/1] [U]
md6 : active multipath sdi[0]
125829056 blocks [1/1] [U]
md5 : active multipath sdh[0]
83886016 blocks [1/1] [U]
md4 : active multipath sdg[0]
83886016 blocks [1/1] [U]
md3 : active multipath sdf[0]
83886016 blocks [1/1] [U]
md2 : active multipath sde[0]
31457216 blocks [1/1] [U]
md1 : active multipath sdd[0]
31457216 blocks [1/1] [U]
md0 : active multipath sdc[0]
31457216 blocks [1/1] [U]
One more note: when we see the read fail on the client, nfsd on the server side reports the following in /var/log/messages:

Oct 31 15:33:11 sqatwo01 kernel: nfsd: non-standard errno: -38

Latest update from the Crosswalk team...
We added more kernel debug messages and were then able to isolate where the
read failure is induced. The gfs_sendfile() function in ~/fs/gfs/ops_file.c
returns the ENOSYS error for read I/O issued from an NFS client when the
gfs_jdata flag is turned on. Please see the code below for more details.
static ssize_t
gfs_sendfile(struct file *in_file, loff_t *offset, size_t count,
             read_actor_t actor, void __user *target)
{
        /* ... */

        /* Journaled-data (jdata) files are not handled by sendfile();
           the -ENOSYS returned here is what nfsd reports as errno -38. */
        if (gfs_is_jdata(ip))
                retval = -ENOSYS;
        else
                retval = generic_file_sendfile(in_file, offset, count,
                                               actor, target);

 out:
        gfs_holder_uninit(&gh);
        return retval;
}
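For context, the -38 in the nfsd log messages above is -ENOSYS. The following is a rough sketch of the relevant branch of the server-side read path, assuming the usual 2.6.9-era structure of nfsd_vfs_read() in fs/nfsd/vfs.c; it is not the verbatim RHEL 4 source. nfsd prefers the file's ->sendfile() method whenever one is provided and only uses vfs_readv() when that method pointer is NULL, so the -ENOSYS returned by gfs_sendfile() for jdata files is passed straight back rather than triggering a fallback.

        /* Sketch only: declarations and error handling elided. */
        if (file->f_op && file->f_op->sendfile) {
                /* GFS provides gfs_sendfile(), so this branch is taken.
                   For jdata files it returns -ENOSYS (-38), which nfsd
                   logs as "non-standard errno: -38" and the client sees
                   as an I/O error. */
                err = file->f_op->sendfile(file, &offset, *count,
                                           nfsd_read_actor, rqstp);
        } else {
                /* Reached only when the filesystem has no sendfile()
                   method at all, never when sendfile() fails. */
                oldfs = get_fs();
                set_fs(KERNEL_DS);
                err = vfs_readv(file, (struct iovec __user *)vec, vlen,
                                &offset);
                set_fs(oldfs);
        }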
Based on our finding above, the question is how we can safely turn on jdata
so that we do not encounter the error for any data transfer over an NFS
network. Or should we NOT turn on jdata at all? Please advise.

Should you need more details of our findings, please contact Jin directly at
jlim or 303-635-7886. Thank you.
Based on the comments above, I am moving this defect to the cluster product and the GFS component, and reassigning it.

We can hit this in house with:
(link13 is part of a 3 node DLM cluster)
[root@link-13 ~]# pvcreate /dev/sdb1
[root@link-13 ~]# vgcreate myvg /dev/sdb1
[root@link-13 ~]# lvcreate -l 69748 -n myvol myvg
[root@link-13 ~]# vgchange -ay myvg
[root@link-13 ~]# gfs_mkfs -j 3 -J 32 -p lock_dlm -t LINK_131415:gfs0
/dev/myvg/myvol
[root@link-13 ~]# mount -t gfs /dev/myvg/myvol /mnt
[root@link-13 ~]# service nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]
[root@link-13 ~]# gfs_tool setflag inherit_jdata /mnt
[root@link-13 ~]# exportfs -o rw fore.lab.msp.redhat.com:/mnt
: fore; mount link-13.lab.msp.redhat.com:/mnt /mnt
: fore; pan2 -A -x 10 -l ./nfs.out -o ./nfs.out -t -x 10 -f ./nfsload.h2
[load9] doio (19181) 11:24:34
[load9] ---------------------
[load9] Could not read 602148 bytes from /mnt/testfile9 for verification:
Input/output error (5)
[load9] Request number 1
[load9] syscall: write(4, 02522525640020, 602148)
[load9] fd 4 is file /mnt/testfile9 - open flags are 010001
[load9] write done at file offset 46885851 - pattern is G:19181:fore:doio*
[load9]
ERROR in /var/log/messages: Nov 2 06:26:54 link-13 kernel: nfsd: non-standard
errno: -38
<-------------------- nfsload.h2 ----------------------->
<herd name="nfs load">
<test tag="load9">
<cmd> <![CDATA[
/usr/tests/sts/bin/iogen -s write,writev -t 1000b -T 10000b -o -m random
-i 0 100000b:/mnt/testfile9 | /usr/tests/sts/bin/doio -avk
]]> </cmd>
</test>
.
.
.
After forcing nfsd to use the vfs_readv() path after gfs returns -ENOSYS from sendfile()... NFS deals with data in pages, which is what all normal file I/O uses. GFS does its own thing for jdata files, though, and treats them like metadata, which means it all goes through the buffer cache, not the page cache. Specifically, I get a kmap oops when NFS tries to send the "page" of jdata that's not really a page, AFAICT. It's very unlikely that NFS could be modified to deal with non-paged data, and GFS's data journaling would need to be redesigned to go through the page cache. Perhaps some translation could be done to temporarily copy jdata buffers into pages so NFS could deal with them, but someone with VM-internals expertise would be needed to judge that.

The issue here is that the current GFS jdata code puts data into the buffer cache instead of the page cache, but NFSD's sendfile only interacts with the page cache. There is work in progress in GFS2 to address this issue, which will be part of future GFS releases.

Per today's team meeting, we will not support this feature in RHEL 4 due to the scale and risks of the associated changes. Let us know if there are concerns.
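For illustration, here is a minimal sketch of the kind of change described at the start of this analysis: forcing nfsd to retry through vfs_readv() when a filesystem's sendfile() method returns -ENOSYS. This is a hypothetical fragment under the same 2.6.9 nfsd_vfs_read() assumptions as the sketch above, not the actual patch that was tested, and as noted it is still not sufficient for GFS jdata.

        /* Sketch only: declarations and error handling elided. */
        err = -ENOSYS;
        if (file->f_op && file->f_op->sendfile)
                err = file->f_op->sendfile(file, &offset, *count,
                                           nfsd_read_actor, rqstp);
        if (err == -ENOSYS) {
                /* Fall back to the buffered read path instead of returning
                   the error to the client.  For GFS jdata the data still
                   lives in the buffer cache rather than the page cache, so
                   a kmap oops follows when nfsd tries to send it as pages. */
                oldfs = get_fs();
                set_fs(KERNEL_DS);
                err = vfs_readv(file, (struct iovec __user *)vec, vlen,
                                &offset);
                set_fs(oldfs);
        }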