Bug 1567889 - [Ganesha] Linux untar results in I/O errors on clients while in process of performing inservice upgrade
Summary: [Ganesha] Linux untar results in I/O errors on clients while in process of pe...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kaleb KEITHLEY
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-16 11:32 UTC by Manisha Saini
Modified: 2018-11-19 10:40 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-19 10:40:19 UTC
Target Upstream Version:


Attachments (Terms of Use)

Description Manisha Saini 2018-04-16 11:32:06 UTC
Description of problem:

While performing an in-service upgrade (after stopping the nfs-ganesha and glusterd services on the first node and putting the pcs cluster node in standby mode), I/O errors were observed on both clients while running a Linux kernel untar on the NFS mount point.

After stopping the I/O on one of the clients, the I/O hung on the other client.


Version-Release number of selected component (if applicable):

# rpm -qa | grep ganesha
nfs-ganesha-2.4.4-18.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.4-18.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-54.6.el7rhgs.x86_64


How reproducible:


Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Create a 4 x 3 distributed-replicate volume. Export the volume via ganesha.
3. Mount the volume on 2 clients via 2 different VIPs with vers=4.0.
4. Create 2 directories on the mount point, say dir1 and dir2. Copy a tar file into both dirs.
5. From client 1, start a Linux untar in dir1; similarly, from client 2, start an untar in dir2.
6. Perform the following steps for the in-service upgrade:
  a) # pcs cluster standby
  b) # pcs cluster disable
     # pcs status
  c) # systemctl stop nfs-ganesha
  d) # systemctl stop glusterd
     # pkill glusterfs
     # pkill glusterfsd
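The node-drain sequence in the upgrade step above can be sketched as one script. This is a dry-run sketch (the run() helper only prints each command), since the real commands assume pcs and systemctl on a live cluster node:

```shell
#!/bin/sh
# Dry-run sketch of the node drain performed before upgrading one node.
# run() only echoes the command; drop the echo to execute for real.
run() { echo "+ $*"; }

run pcs cluster standby          # move resources (VIPs) off this node
run pcs cluster disable          # keep pacemaker from starting at boot
run pcs status                   # verify the node is in standby
run systemctl stop nfs-ganesha   # stop the NFS server on this node
run systemctl stop glusterd      # stop the gluster management daemon
run pkill glusterfs              # kill remaining glusterfs processes
run pkill glusterfsd             # kill remaining brick processes
```

Clients mounted via the other three VIPs are expected to keep working throughout; that expectation is what this bug violates.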


Actual results:
After some time, the Linux untars on both clients hit I/O errors. After the I/O from the second client was stopped, the I/O hung on the first client. The mount point is still accessible from both clients.

=================
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld/gf100.c: Cannot open: No such file or directory
linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld/gk104.c
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld: Cannot mkdir: Input/output error
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld/gk104.c: Cannot open: No such file or directory
linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld/gt215.c
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld: Cannot mkdir: Input/output error
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld/gt215.c: Cannot open: No such file or directory
linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld/mcp89.c
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld: Cannot mkdir: Input/output error
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld/mcp89.c: Cannot open: No such file or directory
linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld/priv.h
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld: Cannot mkdir: Input/output error
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/msvld/priv.h: Cannot open: No such file or directory
linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/nvdec/
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/nvdec: Cannot mkdir: Input/output error
linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/nvdec/Kbuild
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/nvdec: Cannot mkdir: Input/output error
tar: linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/nvdec/Kbuild: Cannot open: No such file or directory
linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/nvenc/
linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/nvenc/Kbuild
linux-4.9.5/drivers/gpu/drm/nouveau/nvkm/engine/pm/
=========================

ganesha.log of the node on which the failover happened:

============================
16/04/2018 14:49:37 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-127] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:49:37 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-76] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:49:37 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-67] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:50:07 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-90] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error No such file or directory
16/04/2018 14:50:14 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-234] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:50:14 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-196] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:50:33 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-2] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error No such file or directory
16/04/2018 14:50:33 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-27] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:52:15 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-131] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:52:26 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-39] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:52:26 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-194] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:52:26 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-61] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:52:51 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-51] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error No such file or directory
16/04/2018 14:52:51 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-202] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:53:07 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-189] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Read-only file system
16/04/2018 14:56:50 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-165] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error No such file or directory
16/04/2018 14:57:41 : epoch 42e40000 : dhcp37-182.lab.eng.blr.redhat.com : ganesha.nfsd-14466[work-105] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error No such file or directory

==================================

ganesha-gfapi.log:

=============

[Input/output error]
[2018-04-16 09:26:40.548608] E [MSGID: 114031] [client-rpc-fops.c:301:client3_3_mkdir_cbk] 0-Ganeshavol1-client-3: remote operation failed. Path: /dir2/linux-4.9.5/drivers/gpu/drm/arc [Input/output error]
[2018-04-16 09:26:40.548990] W [MSGID: 108001] [afr-transaction.c:814:afr_handle_quorum] 0-Ganeshavol1-replicate-1: /dir2/linux-4.9.5/drivers/gpu/drm/arc: Failing MKDIR as quorum is not met [Input/output error]
[2018-04-16 09:26:55.283265] E [MSGID: 114031] [client-rpc-fops.c:301:client3_3_mkdir_cbk] 0-Ganeshavol1-client-3: remote operation failed. Path: /dir1/linux-4.9.5/drivers/crypto/qat/qat_dh895xccvf [Input/output error]
[2018-04-16 09:26:55.283650] W [MSGID: 108001] [afr-transaction.c:814:afr_handle_quorum] 0-Ganeshavol1-replicate-1: /dir1/linux-4.9.5/drivers/crypto/qat/qat_dh895xccvf: Failing MKDIR as quorum is not met [Input/output error]
[2018-04-16 09:26:55.900982] E [MSGID: 114031] [client-rpc-fops.c:301:client3_3_mkdir_cbk] 0-Ganeshavol1-client-3: remote operation failed. Path: /dir1/linux-4.9.5/drivers/crypto/qce [Input/output error]
[2018-04-16 09:26:55.902276] E [MSGID: 114031] [client-rpc-fops.c:301:client3_3_mkdir_cbk] 0-Ganeshavol1-client-4: remote operation failed. Path: /dir1/linux-4.9.5/drivers/crypto/qce [Input/output error]
[2018-04-16 09:27:08.889041] E [MSGID: 114031] [client-rpc-fops.c:301:client3_3_mkdir_cbk] 0-Ganeshavol1-client-3: remote operation failed. Path: /dir1/linux-4.9.5/drivers/dma/ioat [Input/output error]
[2018-04-16 09:27:08.889353] W [MSGID: 108001] [afr-transaction.c:814:afr_handle_quorum] 0-Ganeshavol1-replicate-1: /dir1/linux-4.9.5/drivers/dma/ioat: Failing MKDIR as quorum is not met [Input/output error]
[2018-04-16 09:27:34.435245] E [MSGID: 114031] [client-rpc-fops.c:301:client3_3_mkdir_cbk] 0-Ganeshavol1-client-3: remote operation failed. Path: /dir2/linux-4.9.5/drivers/gpu/drm/msm/hdmi [Input/output error]
[2018-04-16 09:27:34.437509] W [MSGID: 108001] [afr-transaction.c:814:afr_handle_quorum] 0-Ganeshavol1-replicate-1: /dir2/linux-4.9.5/drivers/gpu/drm/msm/hdmi: Failing MKDIR as quorum is not met [Input/output error]
[2018-04-16 09:28:05.106352] E [MSGID: 114031] [client-rpc-fops.c:301:client3_3_mkdir_cbk] 0-Ganeshavol1-client-3: remote operation failed. Path: /dir1/linux-4.9.5/drivers/gpu/drm/amd/include/asic_reg/gca [Input/output error]

===============================
Expected results:

No I/O failures should be observed.

Additional info:

Attaching sosreports shortly

Comment 3 Daniel Gryniewicz 2018-04-16 12:32:33 UTC
So, it looks like maybe the cluster isn't sufficient to take down one node and still write data?  I'm not a gluster expert, but this is suggestive:

Failing MKDIR as quorum is not met [Input/output error]

There's nothing Ganesha could do in these circumstances, as it's getting EROFS.
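For context on the "quorum is not met" messages: with AFR client-quorum in "auto" mode on a replica-3 subvolume, a write-path operation succeeds only while a majority of that replica set's bricks are reachable; otherwise it fails (which gluster surfaces as EROFS/EIO, matching the log lines above). A minimal sketch of that majority rule, with a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical sketch of the client-quorum "auto" majority rule for one
# replica set: quorum holds only if more than half the bricks are up.
quorum_met() {
    up=$1; replica=$2
    [ $(( up * 2 )) -gt "$replica" ]
}

# With replica 3: 2 bricks up keeps quorum; 1 brick up loses it,
# and MKDIR etc. start failing as seen in ganesha-gfapi.log.
quorum_met 2 3 && echo "2/3 up: quorum met"
quorum_met 1 3 || echo "1/3 up: quorum lost"
```

If only one node of the four was taken down, each replica set should have kept 2 of 3 bricks and quorum should have held, so the interesting question is why the clients saw quorum loss here.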

