Bug 1293798

Summary: While performing rm -rf getting unknown error 527 on tiered volume
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: RajeshReddy <rmekala>
Component: gluster-nfs Assignee: Niels de Vos <ndevos>
Status: CLOSED DUPLICATE QA Contact: storage-qa-internal <storage-qa-internal>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rhgs-3.1 CC: mzywusko, ndevos, rmekala
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-01-20 09:19:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
  Script to setup the environment and reproduce the problem (Flags: none)

Description RajeshReddy 2015-12-23 06:28:19 UTC
Description of problem:
=============
While performing 'rm -rf', getting unknown error 527 on a tiered volume

Version-Release number of selected component (if applicable):
============
glusterfs-server-3.7.5-12.el7rhgs.x86_64

How reproducible:
=============
Intermittent; the error is not hit consistently (see comment 4).

Steps to Reproduce:
=============
1. Create a 3x2 distributed-replicate volume, attach a 1x2 replicated hot tier, and mount the volume on a client using NFS (see the command sketch below)
2. Run a script that untars the Linux kernel tarball and then removes all the extracted files; while removing the files, 'rm' sometimes fails with unknown error 527
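
A minimal command sketch of that setup, assuming placeholder host names and brick paths; the create/attach-tier/mount syntax below is meant to match the glusterfs 3.7 CLI:

  # cold tier: 3x2 distributed-replicate volume (replica pairs are consecutive bricks)
  gluster volume create testvol replica 2 server{1..6}:/rhs/brick1/testvol
  gluster volume start testvol

  # attach a 1x2 replicated hot tier
  gluster volume attach-tier testvol replica 2 server1:/rhs/brick2/hot1 server1:/rhs/brick2/hot2

  # on the client, mount the volume over NFSv3 (gluster-nfs)
  mount -t nfs -o vers=3 server1:/testvol /mnt/testvol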


Additional info:
=============
[root@tettnang ~]# gluster vol info system_rajesh
 
Volume Name: system_rajesh
Type: Tier
Volume ID: fac2f70a-8a9e-4d09-8a0a-1b73be4cc137
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 2 = 2
Brick1: tettnang.lab.eng.blr.redhat.com:/rhs/brick2/hot2
Brick2: tettnang.lab.eng.blr.redhat.com:/rhs/brick2/hot1
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick3: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick1/system_rajesh
Brick4: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick1/system_rajesh
Brick5: vertigo.lab.eng.blr.redhat.com:/rhs/brick1/system_rajesh
Brick6: ninja.lab.eng.blr.redhat.com:/rhs/brick1/system_rajesh
Brick7: transformers.lab.eng.blr.redhat.com:/rhs/brick1/system_rajesh
Brick8: interstellar.lab.eng.blr.redhat.com:/rhs/brick1/system_rajesh
Options Reconfigured:
performance.readdir-ahead: on
features.ctr-enabled: on
cluster.tier-mode: cache


snippet of the NFS server log
================
[2015-12-21 10:04:25.018978] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-system_rajesh-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2015-12-21 10:04:25.019288] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-system_rajesh-client-1: Server lk version = 1
[2015-12-21 10:04:25.024976] I [MSGID: 108031] [afr-common.c:1872:afr_local_discovery_cbk] 0-system_rajesh-replicate-2: selecting local read_child system_rajesh-client-4
[2015-12-21 10:12:01.642135] W [graph.c:357:_log_if_unknown_option] 0-system_rajesh-tier-dht: option 'rule' is not recognized
[2015-12-21 10:12:01.727829] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.730073] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.735341] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.737728] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.765497] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.781583] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.800313] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.814846] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.825091] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.840992] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.842511] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.853348] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.858960] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.865087] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.875971] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh
[2015-12-21 10:12:01.881231] E [MSGID: 112070] [nfs3.c:2304:nfs3_write] 0-nfs-nfsv3: Volume is disabled: system_rajesh

Comment 2 RajeshReddy 2015-12-24 11:27:32 UTC
[root@tettnang ~]# mount | grep rajesh
transformers.lab.eng.blr.redhat.com:/system_rajesh/linux on /mnt/system_rajesh/linux type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.70.34.44,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=10.70.34.44)
[root@tettnang ~]# 

sosreport available @ /home/repo/sosreports/bug.1293798 on rhsqe-repo.lab.eng.blr.redhat.com

Comment 3 Niels de Vos 2016-01-18 15:40:41 UTC
Is this always reproducible, or was it a one-time occurrence?

Do you have a test-script (shell or distaf) that I can use to reproduce this?

Comment 4 RajeshReddy 2016-01-19 08:07:50 UTC
We observed this issue on two setups, but we are not hitting it consistently.

I was running the following script:

for i in {1..1000}
do
    tar -xvf linux-3.19.tar.gz
    sleep 180
    rm -rf linux-3.19
done

Comment 5 Niels de Vos 2016-01-19 19:00:20 UTC
Created attachment 1116320 [details]
Script to setup the environment and reproduce the problem

I'm using this to configure three systems and run the test. It's been running for a long time already, but there have not been any errors yet. I'm checking for the errors with:

  tail -f /var/log/glusterfs/nfs.log | grep ' E '

I'll keep this running overnight; 1000 iterations are a lot and it seems pretty time-consuming.

The snippet from your log shows that the problem happened 8 minutes after the NFS-server was started.
- Did the tarball extraction/removal start immediately after that?
- There was a message "option 'rule' is not recognized", do you know where that came from?

Could you check whether my script performs the same actions as what you did in your environment?

Comment 6 Niels de Vos 2016-01-20 08:56:05 UTC
It seems that I have reproduced the error:

  rm: cannot remove ‘linux-3.19/arch/arm/boot/dts/include/dt-bindings’: Unknown error 527

Rather awkwardly, the whole linux-3.19/ directory is gone, so the 'rm' did finish completely.

However, "Volume is disabled" does not occur in the nfs.log. Instead, I have messages like these:

  W [MSGID: 114031] [client-rpc-fops.c:664:client3_3_unlink_cbk] 0-bz1293798-client-5: remote operation failed [Bad file descriptor]

A "Bad file descriptor" should get mapped to ESTALE. It is not acceptable that error 527 is passed back to the NFS-client.

Rajesh, please check if you have the same "Bad file descriptor" messages in your nfs.log.
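
A quick way to check for those messages, assuming the default log location on the node running the gluster NFS server:

  grep 'Bad file descriptor' /var/log/glusterfs/nfs.log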

Comment 7 Niels de Vos 2016-01-20 09:19:53 UTC
No errors in the <volume>-tier.log.

In the brick logs, however:

  E [MSGID: 113005] [posix.c:1700:posix_unlink] 0-bz1293798-posix: post operation fstat failed on fd=-1 [Bad file descriptor]
  E [MSGID: 115062] [server-rpc-fops.c:1097:server_unlink_cbk] 0-bz1293798-server: 375880: UNLINK /linux-3.19/tools/testing/selftests/powerpc/copyloops/memcpy_power7.S (4eb32857-ce11-482b-a83d-2a0fd0639eb1/memcpy_power7.S) ==> (Bad file descriptor) [Bad file descriptor]
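
The same check can be run on each brick host, assuming the default brick log directory:

  grep ' E ' /var/log/glusterfs/bricks/*.log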

Checking when this was last touched, it seems that upstream has a fix for it:

  http://review.gluster.org/13034

The chain of bugs that fixes this in a newer build of RHGS:

  bug 1293256 (Gluster mainline)
   + bug 1295347 (Gluster 3.7)
      + bug 1293237 (RHGS, glusterfs-3.7.5-13)


This bug was opened while tests were done with an older release (-12). I expect that this problem cannot be reproduced with -13 or newer.

*** This bug has been marked as a duplicate of bug 1293237 ***

Comment 8 RajeshReddy 2016-01-20 14:41:01 UTC
May I know on which build you reproduced the problem?

Comment 9 Niels de Vos 2016-01-21 10:12:05 UTC
To reproduce, I used the version you reported this against in comment #0:

  glusterfs-server-3.7.5-12.el7rhgs.x86_64

With -13 I did not get the "unknown error 527" messages anymore.
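
To confirm which build is installed on a given node, a plain rpm query is enough:

  rpm -q glusterfs-server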