Description of problem:
Volume type: 6x2 (Distributed-Replicate). When "rm -rf" is executed in parallel from two different clients, each with the volume mounted over NFS from a different server of the RHS cluster, the deletion does not run to completion.

Volume info:
[root@bigbend1 ~]# gluster volume info

Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 8861323a-fe95-4772-812e-254838ec18ad
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.155:/rhs/brick1/d1r1
Brick2: 10.70.37.100:/rhs/brick1/d1r2
Brick3: 10.70.37.121:/rhs/brick1/d2r1
Brick4: 10.70.37.211:/rhs/brick1/d2r2
Brick5: 10.70.37.155:/rhs/brick1/d3r1
Brick6: 10.70.37.100:/rhs/brick1/d3r2
Brick7: 10.70.37.121:/rhs/brick1/d4r1
Brick8: 10.70.37.211:/rhs/brick1/d4r2
Brick9: 10.70.37.155:/rhs/brick1/d5r1
Brick10: 10.70.37.100:/rhs/brick1/d5r2
Brick11: 10.70.37.121:/rhs/brick1/d6r1
Brick12: 10.70.37.211:/rhs/brick1/d6r2

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.4rhs-1.el6rhs.x86_64

How reproducible:
The errors are seen in the logs many times.

Steps to Reproduce:
1. Create a volume using nodes [a, b, c, d] and start it.
2. Mount the volume over NFS from node a on client c1 and from node c on client c2.
3. Create a large amount of data through one mount point only. Python 2 helper used for creating the data (imports and the mount path are filled in here for completeness; the path is the client mount point seen in the df output below):

   import os
   import commands

   mount_path_nfs = "/mnt/nfs-test"   # NFS mount point on the client

   for i in range(10000):
       os.mkdir(mount_path_nfs + "/" + "%d" % (i))
       for j in range(100):
           os.mkdir(mount_path_nfs + "/" + "%d" % (i) + "/" + "%d" % (j))
           commands.getoutput("touch" + " " + mount_path_nfs + "/" + "%d" % (i) + "/" + "%d" % (j) + "/" + "%d" % (j) + ".file")

4. Now start "rm -rf *" on both mount points from step 2 at the same time (see the shell sketch at the end of this description).

Actual results:
"rm -rf" deletes data up to a point and then stops making progress; this has already happened to me twice. The df output on the client is unchanged over roughly two minutes:

[root@rhsauto020 ~]# date
Wed May  8 11:52:53 IST 2013
[root@rhsauto020 ~]# df
Filesystem                          1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_rhsauto020-lv_root    47098232   1120824  43584936   3% /
tmpfs                                 1961320         0   1961320   0% /dev/shm
/dev/vda1                              495844     32420    437824   7% /boot
/dev/vdg1                           103211888    192116  97776916   1% /export
10.70.34.114:/opt                    51606528   6158336  42826752  13% /opt
10.70.37.155:dist-rep               213780480  12209024 201571456   6% /mnt/nfs-test
[root@rhsauto020 ~]# df
Filesystem                          1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_rhsauto020-lv_root    47098232   1120824  43584936   3% /
tmpfs                                 1961320         0   1961320   0% /dev/shm
/dev/vda1                              495844     32420    437824   7% /boot
/dev/vdg1                           103211888    192116  97776916   1% /export
10.70.34.114:/opt                    51606528   6158336  42826752  13% /opt
10.70.37.155:dist-rep               213780480  12209024 201571456   6% /mnt/nfs-test
[root@rhsauto020 ~]# date
Wed May  8 11:54:43 IST 2013

dmesg from client c1:
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
SELinux: initialized (dev 0:13, type nfs), uses genfs_contexts
nfs: server 10.70.37.155 not responding, still trying
SELinux: initialized (dev 0:13, type nfs), uses genfs_contexts
nfs: server 10.70.37.155 not responding, still trying

dmesg from client c2:
lockd: server 10.70.37.155 not responding, still trying
lockd: unexpected unlock status: 1
lockd: server 10.70.37.155 not responding, still trying
lockd: server 10.70.37.155 OK
lockd: unexpected unlock status: 1
lockd: unexpected unlock status: 16777216
lockd: unexpected unlock status: 1
lockd: unexpected unlock status: 16777216
lockd: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
lockd: server 10.70.37.155 not responding, still trying
xs_tcp_setup_socket: connect returned unhandled error -107
nfs: server 10.70.37.155 OK
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 OK
SELinux: initialized (dev 0:16, type nfs), uses genfs_contexts
nfs: server 10.70.37.121 not responding, still trying
SELinux: initialized (dev 0:16, type nfs), uses genfs_contexts
nfs: server 10.70.37.121 not responding, still trying

Expected results:
"rm -rf *" should run to completion on both clients, without errors.

Additional info:
Two more bugs, hit while these operations were running, were filed: BZ 960834 and BZ 960835.
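For clarity, here is a minimal shell sketch of steps 2-4. The volume name, the first server address and the mount path are taken from this report; the second server address and the exact NFS mount options are assumptions, so adjust them to the actual setup:

   # on client c1: NFS-mount the volume from node a (address as in the df output above)
   mount -t nfs -o vers=3 10.70.37.155:/dist-rep /mnt/nfs-test

   # on client c2: NFS-mount the same volume from node c (address is a placeholder)
   mount -t nfs -o vers=3 <node-c-address>:/dist-rep /mnt/nfs-test

   # populate the volume from one mount only (the Python helper in step 3),
   # then start the deletions at the same time, one on each client:
   cd /mnt/nfs-test && rm -rf *      # on c1
   cd /mnt/nfs-test && rm -rf *      # on c2, started in parallel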
Can we try this with the 'eager-lock' option disabled? Also, please check the health of the NFS server.
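For reference, a sketch of those two checks (volume name taken from this report; this should be the standard gluster CLI syntax, but please verify on the build in use):

   gluster volume set dist-rep cluster.eager-lock off   # disable eager locking on the volume
   gluster volume status dist-rep nfs                   # check the gluster NFS server processes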
One possible error path:

1. dht_access_cbk sees an ENOENT error from the subvolume.
2. dht_migration_complete_check is called to check whether the file has been migrated.
3. fd is NULL, as this is a path-based op.
4. But in this case the inode is also NULL.
5. dht_migration_complete_check fails since both fd and inode are NULL.
6. dht_access2 returns with an EUCLEAN error.

The inode might be NULL because a parallel rm from the other client may already have succeeded, which would also explain the ENOENT errors in the first step. Will continue the investigation.

[2013-05-07 22:08:47.159766] W [client-rpc-fops.c:1369:client3_3_access_cbk] 0-dist-rep-client-0: remote operation failed: No such file or directory
[2013-05-07 22:08:47.160948] W [client-rpc-fops.c:1369:client3_3_access_cbk] 0-dist-rep-client-1: remote operation failed: No such file or directory
[2013-05-07 22:08:47.167239] E [dht-helper.c:1065:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.4.0.4rhs/xlator/cluster/distribute.so(dht_discover_complete+0x421) [0x7f8efb933721] (-->/usr/lib64/glusterfs/3.4.0.4rhs/xlator/cluster/distribute.so(dht_layout_set+0x4e) [0x7f8efb91603e] (-->/usr/lib64/glusterfs/3.4.0.4rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_get+0x1b) [0x7f8efb924cfb]))) 0-dist-rep-dht: invalid argument: inode
[2013-05-07 22:08:47.167919] E [dht-helper.c:1065:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.4.0.4rhs/xlator/cluster/distribute.so(dht_discover_complete+0x421) [0x7f8efb933721] (-->/usr/lib64/glusterfs/3.4.0.4rhs/xlator/cluster/distribute.so(dht_layout_set+0x63) [0x7f8efb916053] (-->/usr/lib64/glusterfs/3.4.0.4rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x34) [0x7f8efb916544]))) 0-dist-rep-dht: invalid argument: inode
[2013-05-07 22:08:47.167987] E [dht-helper.c:1084:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.4.0.4rhs/xlator/cluster/distribute.so(dht_discover_complete+0x421) [0x7f8efb933721] (-->/usr/lib64/glusterfs/3.4.0.4rhs/xlator/cluster/distribute.so(dht_layout_set+0x63) [0x7f8efb916053] (-->/usr/lib64/glusterfs/3.4.0.4rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7f8efb916562]))) 0-dist-rep-dht: invalid argument: inode
[2013-05-07 22:08:47.168052] W [nfs3.c:1522:nfs3svc_access_cbk] 0-nfs: c2a64999: /6825/54 => -1 (Structure needs cleaning)
Looks like SELinux was also enabled (enforcing) on the servers:

cat /tmp/rhsqe-repo.lab.eng.blr.redhat.com/sosreports/960834/bigbend1-2013050804431367968426/etc/selinux/config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=enforcing
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted

cat /tmp/rhsqe-repo.lab.eng.blr.redhat.com/sosreports/960834/bigbend3-2013050804431367968439/etc/selinux/config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=enforcing
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
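If we want to rule SELinux out while debugging, a quick check on the servers could be (standard RHEL commands, nothing specific to this setup):

   getenforce                            # confirm the current mode
   setenforce 0                          # switch to permissive until reboot
   grep AVC /var/log/audit/audit.log     # look for denials against gluster/NFS processes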
It is still not fixed. All bricks and NFS servers show as online, but the client's df output is unchanged between the two checks:

[root@bigbend1 ~]# gluster volume status
Status of volume: dist-rep
Gluster process                                          Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.115:/rhs/brick1/d1r1                      49152   Y       2209
Brick 10.70.37.164:/rhs/brick1/d1r2                      49152   Y       2205
Brick 10.70.37.55:/rhs/brick1/d2r1                       49152   Y       2198
Brick 10.70.37.168:/rhs/brick1/d2r2                      49152   Y       2196
Brick 10.70.37.115:/rhs/brick1/d3r1                      49153   Y       2218
Brick 10.70.37.164:/rhs/brick1/d3r2                      49153   Y       2214
Brick 10.70.37.55:/rhs/brick1/d4r1                       49153   Y       2207
Brick 10.70.37.168:/rhs/brick1/d4r2                      49153   Y       2205
Brick 10.70.37.115:/rhs/brick1/d5r1                      49154   Y       2227
Brick 10.70.37.164:/rhs/brick1/d5r2                      49154   Y       2223
Brick 10.70.37.55:/rhs/brick1/d6r1                       49154   Y       2216
Brick 10.70.37.168:/rhs/brick1/d6r2                      49154   Y       2214
NFS Server on localhost                                  2049    Y       2237
Self-heal Daemon on localhost                            N/A     Y       2243
NFS Server on c612bd05-bf73-445c-a206-45bea2b7d2bc       2049    Y       2226
Self-heal Daemon on c612bd05-bf73-445c-a206-45bea2b7d2bc N/A     Y       2233
NFS Server on 474c7f95-5c0a-4142-b075-338b2612af37       2049    Y       2233
Self-heal Daemon on 474c7f95-5c0a-4142-b075-338b2612af37 N/A     Y       2240
NFS Server on 3d039042-43c3-4ad2-93a0-3e74d75ab666       2049    Y       2224
Self-heal Daemon on 3d039042-43c3-4ad2-93a0-3e74d75ab666 N/A     Y       2231

There are no active volume tasks

[root@rhsauto020 nfs-regression]# date
Fri May 17 09:26:54 IST 2013
[root@rhsauto020 nfs-regression]# df
Filesystem                          1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_rhsauto020-lv_root    51606140   2016516  46968184   5% /
tmpfs                                 1961320         0   1961320   0% /dev/shm
/dev/vda1                              495844     37542    432702   8% /boot
/dev/mapper/vg_rhsauto020-lv_home   614742048    202088 583312876   1% /home
10.70.34.114:/opt                    51606528   6159872  42825216  13% /opt
10.70.37.115:/dist-rep              213780480   5091392 208689088   3% /mnt/nfs-regression.1368703141
[root@rhsauto020 nfs-regression]# date
Fri May 17 09:29:07 IST 2013
[root@rhsauto020 nfs-regression]# df
Filesystem                          1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_rhsauto020-lv_root    51606140   2016516  46968184   5% /
tmpfs                                 1961320         0   1961320   0% /dev/shm
/dev/vda1                              495844     37542    432702   8% /boot
/dev/mapper/vg_rhsauto020-lv_home   614742048    202088 583312876   1% /home
10.70.34.114:/opt                    51606528   6159872  42825216  13% /opt
10.70.37.115:/dist-rep              213780480   5091392 208689088   3% /mnt/nfs-regression.1368703141
Rajesh, this might be the same as the one Pranith worked on in BZ 965987. Just check with him or test his patch. :)
Assigning to Raghvendra G, based on https://bugzilla.redhat.com/show_bug.cgi?id=960843#c9
Dev ack to 3.0 RHS BZs
*** This bug has been marked as a duplicate of bug 1115367 ***