Bug 1102550 - Dist-geo-rep : geo-rep failed to sync some files, with error "(Operation not permitted)" on slaves.
Summary: Dist-geo-rep : geo-rep failed to sync some files, with error "(Operation not ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
high
urgent
Target Milestone: ---
: RHGS 3.0.0
Assignee: Aravinda VK
QA Contact: Bhaskar Bandari
URL:
Whiteboard:
Depends On: 1030309 1286208
Blocks: 1105083
TreeView+ depends on / blocked
 
Reported: 2014-05-29 07:55 UTC by Vijaykumar Koppad
Modified: 2015-11-27 12:31 UTC (History)
12 users (show)

Fixed In Version: glusterfs-3.6.0.18-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1105083 (view as bug list)
Environment:
Last Closed: 2014-09-22 19:39:41 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:1278 0 normal SHIPPED_LIVE Red Hat Storage Server 3.0 bug fix and enhancement update 2014-09-22 23:26:55 UTC

Description Vijaykumar Koppad 2014-05-29 07:55:58 UTC
Description of problem: geo-rep failed to sync some files, with error "(Operation not permitted)" on slaves.

log snippets for missed gfids.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# grep "a2b115f4-7fae-45d6-9f7b-0854099964f8" /var/log/glusterfs/geo-replication-slaves/f841cd51-e9cd-4ab3-b67e-06431cdd4e80\:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log
[2014-05-29 06:30:26.809188] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 9032: /.gfid/a2b115f4-7fae-45d6-9f7b-0854099964f8 => -1 (Operation not permitted)
[2014-05-29 06:30:58.093282] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 11985: /.gfid/a2b115f4-7fae-45d6-9f7b-0854099964f8 => -1 (Operation not permitted)
[2014-05-29 06:32:02.562229] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 19378: /.gfid/a2b115f4-7fae-45d6-9f7b-0854099964f8 => -1 (Operation not permitted)
[2014-05-29 06:33:01.295840] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 26700: /.gfid/a2b115f4-7fae-45d6-9f7b-0854099964f8 => -1 (Operation not permitted)
[2014-05-29 06:34:01.600121] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 34227: /.gfid/a2b115f4-7fae-45d6-9f7b-0854099964f8 => -1 (Operation not permitted)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>



Version-Release number of selected component (if applicable):glusterfs-3.6.0.9-1.el6rhs


How reproducible: Happening most of the times.


Steps to Reproduce:
1. create and start a geo-rep relationship between master(6x2 with nodes) and slave(6x2 with 4 nodes ). 
2. create data on master using crefi command "crefi -T 1 -n 100 --multi -b 10 -d 10 --random --min=1K --max=10K /mnt/master/"
3. check if it syncs all the files. 

Actual results: geo-rep failed to sync some files to slaves. 


Expected results:  geo-rep should sync all the files. 


Additional info:

Some more logs, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-05-29 06:30:06.957497] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 6268: /.gfid/3cbdfaae-02b8-4f8d-b38a-ba34d3e05eb1 => -1 (Operation not permitted)
[2014-05-29 06:30:06.961449] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 6599: /.gfid/31598d50-967e-418d-9f8a-ddcf2488dc97 => -1 (Operation not permitted)
[2014-05-29 06:30:06.976306] W [client-rpc-fops.c:448:client3_3_open_cbk] 0-slave-client-11: remote operation failed: Stale file handle. Path: /.gfid/68c8244e-e624-4026-ab83-985d18082d61 (68c8244e-e624-4026-ab83-985d18082d61)
[2014-05-29 06:30:06.978625] W [client-rpc-fops.c:448:client3_3_open_cbk] 0-slave-client-11: remote operation failed: Stale file handle. Path: /.gfid/68c8244e-e624-4026-ab83-985d18082d61 (68c8244e-e624-4026-ab83-985d18082d61)
[2014-05-29 06:30:06.978655] E [MSGID: 108009] [afr-open.c:269:afr_openfd_fix_open_cbk] 0-slave-replicate-5: Failed to open /.gfid/68c8244e-e624-4026-ab83-985d18082d61 on subvolume slave-client-11 [Stale file handle]
[2014-05-29 06:30:06.978731] W [client-rpc-fops.c:1564:client3_3_finodelk_cbk] 0-slave-client-11: remote operation failed: No such file or directory
[2014-05-29 06:30:06.979297] W [client-rpc-fops.c:1564:client3_3_finodelk_cbk] 0-slave-client-11: remote operation failed: No such file or directory
[2014-05-29 06:30:06.979881] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 6271: /.gfid/3d603e04-788a-4640-99d1-400f00fe448c => -1 (Operation not permitted)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-05-29 06:30:14.675177] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 7384: /.gfid/9e28a99c-2e00-4249-a684-271f6aca8e84 => -1 (Operation not permitted)
[2014-05-29 06:30:14.682584] W [fuse-bridge.c:757:fuse_attr_cbk] 0-glusterfs-fuse: 7661: STAT() /.gfid => -1 (Invalid argument)
[2014-05-29 06:30:14.701595] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 7387: /.gfid/9e5712df-2eeb-48bb-a260-d846c7b0b7a2 => -1 (Operation not permitted)
[2014-05-29 06:30:14.708378] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 7500: /.gfid/964a87ef-00fe-4ccd-8043-1716d3542ff4 => -1 (Operation not permitted)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-05-29 06:30:06.583311] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 6217: /.gfid/0cb17c31-c311-4e8a-b6f0-4ff38ff8fad0 => -1 (Operation not permitted)
[2014-05-29 06:30:06.587178] I [afr-self-heal-common.c:2147:afr_sh_post_nb_entrylk_missing_entry_sh_cbk] 0-slave-replicate-5: Non blocking entrylks failed.
[2014-05-29 06:30:06.587294] E [afr-self-heal-common.c:2869:afr_log_self_heal_completion_status] 0-slave-replicate-5:  gfid or missing entry self heal  failed,   on /.gfid/68c8244e-e624-4026-ab83-985d18082d61
[2014-05-29 06:30:06.592695] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 6547: /.gfid/2e8ab9c2-f0be-4853-be48-23c0de4fbe97 => -1 (Operation not permitted)
[2014-05-29 06:30:06.593205] W [client-rpc-fops.c:448:client3_3_open_cbk] 0-slave-client-11: remote operation failed: Stale file handle. Path: /.gfid/68c8244e-e624-4026-ab83-985d18082d61 (68c8244e-e624-4026-ab83-985d18082d61)
[2014-05-29 06:30:06.598266] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 6220: /.gfid/0d11d73b-d185-4e73-babf-f72e887f3c90 => -1 (Operation not permitted)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Some debug logs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-05-29 07:42:01.428482] D [dht-common.c:280:dht_discover_cbk] 0-slave-dht: lookup of (null) on slave-replicate-3 returned error (No such file or directory)
[2014-05-29 07:42:01.428740] D [dht-common.c:280:dht_discover_cbk] 0-slave-dht: lookup of (null) on slave-replicate-0 returned error (No such file or directory)
[2014-05-29 07:42:01.428807] D [dht-common.c:280:dht_discover_cbk] 0-slave-dht: lookup of (null) on slave-replicate-4 returned error (No such file or directory)
[2014-05-29 07:42:01.428568] D [dht-layout.c:669:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/replicate.so(afr_lookup_cbk+0x570) [0x7fd127aa4470] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_cbk+0x273) [0x7fd127828a53] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_complete+0x347) [0x7fd127821d47]))) 0-slave-dht: path=(null) err=No such file or directory on subvol=slave-replicate-0
[2014-05-29 07:42:01.428666] D [dht-layout.c:669:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/replicate.so(afr_lookup_cbk+0x570) [0x7fd127aa4470] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_cbk+0x273) [0x7fd127828a53] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_complete+0x347) [0x7fd127821d47]))) 0-slave-dht: path=(null) err=No such file or directory on subvol=slave-replicate-1
[2014-05-29 07:42:01.428755] D [dht-layout.c:669:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/replicate.so(afr_lookup_cbk+0x570) [0x7fd127aa4470] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_cbk+0x273) [0x7fd127828a53] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_complete+0x347) [0x7fd127821d47]))) 0-slave-dht: path=(null) err=No such file or directory on subvol=slave-replicate-2
[2014-05-29 07:42:01.428840] D [dht-layout.c:669:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/replicate.so(afr_lookup_cbk+0x570) [0x7fd127aa4470] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_cbk+0x273) [0x7fd127828a53] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_complete+0x347) [0x7fd127821d47]))) 0-slave-dht: path=(null) err=No such file or directory on subvol=slave-replicate-3
[2014-05-29 07:42:01.428910] D [dht-layout.c:669:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/replicate.so(afr_lookup_cbk+0x570) [0x7fd127aa4470] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_cbk+0x273) [0x7fd127828a53] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_complete+0x347) [0x7fd127821d47]))) 0-slave-dht: path=(null) err=No such file or directory on subvol=slave-replicate-4
[2014-05-29 07:42:01.429002] D [dht-layout.c:669:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/replicate.so(afr_lookup_cbk+0x570) [0x7fd127aa4470] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_cbk+0x273) [0x7fd127828a53] (-->/usr/lib64/glusterfs/3.6.0.9/xlator/cluster/distribute.so(dht_discover_complete+0x347) [0x7fd127821d47]))) 0-slave-dht: path=(null) err=No such file or directory on subvol=slave-replicate-5
[2014-05-29 07:42:01.429034] D [dht-common.c:203:dht_discover_complete] 0-slave-dht: normalizing failed on (null) (overlaps/holes present: yes, ENOENT errors: 0)
[2014-05-29 07:42:01.429117] D [dht-common.c:280:dht_discover_cbk] 0-slave-dht: lookup of (null) on slave-replicate-5 returned error (No such file or directory)
[2014-05-29 07:42:01.429224] D [dht-layout.c:649:dht_layout_normalize] 0-slave-dht: directory (null) looked up first time
[2014-05-29 07:42:01.429445] W [fuse-bridge.c:1937:fuse_create_cbk] 0-glusterfs-fuse: 75390: /.gfid/0382b218-fb7e-441d-9532-05ef399d27b4 => -1 (Operation not permitted)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 2 Venky Shankar 2014-05-30 06:36:15 UTC
"Operation not permitted" is due to the creation of file (using .gfid/<gfid>) in the normal code path. This happens when rsync triggers a file creation when the entry isn not found. But the entry should have already been there (as a part of entry_ops).

This mostly looks like the same case where DHT was failing to create files (parallel MKDIR + CREATE).

Vijaykumar,

Were there any DHT related logs in the client mount apart from what is already present in comment #1?

Comment 3 Vijaykumar Koppad 2014-05-30 06:59:28 UTC
No, there were no DHT logs apart form what I have pasted there. Those logs are effect of something, doesn't look like the cause. 
If you check afr logs, all sub-volume returned with no gfid. I feel its related to the Bug 1102747.

Comment 6 Vijaykumar Koppad 2014-06-03 12:50:18 UTC
This is reproducible even with plane distribute volume on slave. In this case, the gfid of the directories are corrupted, and all the files under that are not created. 

gfid of one directory under which all files failed to sync 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
on slave 

# getfattr -n glusterfs.gfid.string  /mnt/slave/thread2/level07/level17/level27/level37/
getfattr: Removing leading '/' from absolute path names
# file: mnt/slave/thread2/level07/level17/level27/level37/
glusterfs.gfid.string="00000000-0000-0000-5cf6-0b0100000000"


on master 

# getfattr -n glusterfs.gfid.string  /mnt/master/thread2/level07/level17/level27/level37/
getfattr: Removing leading '/' from absolute path names
# file: mnt/master/thread2/level07/level17/level27/level37/
glusterfs.gfid.string="d0575c6a-b7c4-4b28-bfc0-2ddf57d35cb4
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


and for the above gfid, the aux-mount logs  have 

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-06-03 11:45:58.738230] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 28187: SETXATTR() /.gfid/02a62fe6-6921-44a3-842a-189ad633b339 => -1 (No such file or directory)
[2014-06-03 11:45:58.739845] W [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-slave-client-1: remote operation failed: File exists. Path: <gfid:fefa8327-f116-4768-9d8e-4395a727545d>/level55
[2014-06-03 11:45:58.740344] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 22901: SETXATTR() /.gfid/fefa8327-f116-4768-9d8e-4395a727545d => -1 (File exists)
[2014-06-03 11:45:58.750753] I [dht-layout.c:653:dht_layout_normalize] 0-slave-dht: found anomalies in (null). holes=1 overlaps=0
[2014-06-03 11:45:58.750995] W [dht-layout.c:178:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 2681045703
[2014-06-03 11:45:58.751033] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 28190: SETXATTR() /.gfid/d0575c6a-b7c4-4b28-bfc0-2ddf57d35cb4 => -1 (No such file or directory)
[2014-06-03 11:45:58.751635] W [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-slave-client-4: remote operation failed: File exists. Path: <gfid:02a62fe6-6921-44a3-842a-189ad633b339>/level47
[2014-06-03 11:45:58.751688] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 28191: SETXATTR() /.gfid/02a62fe6-6921-44a3-842a-189ad633b339 => -1 (File exists)
[2014-06-03 11:45:58.752223] W [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-slave-client-4: remote operation failed: File exists. Path: <gfid:d3e980b9-16b6-4a30-90f5-026efaefd1b9>/level57
[2014-06-03 11:45:58.752263] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 28192: SETXATTR() /.gfid/d3e980b9-16b6-4a30-90f5-026efaefd1b9 => -1 (File exists)
[2014-06-03 11:45:58.752802] W [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-slave-client-5: remote operation failed: File exists. Path: <gfid:fd054c2f-d892-4a40-a8b3-fb03a5b2d4e4>/level37
[2014-06-03 11:45:58.752840] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 28193: SETXATTR() /.gfid/fd054c2f-d892-4a40-a8b3-fb03a5b2d4e4 => -1 (File exists)
[2014-06-03 11:45:58.784464] W [dht-layout.c:178:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 577205183
[2014-06-03 11:45:58.784530] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 28196: SETXATTR() /.gfid/efce422b-bc92-4615-94b9-a18a78c413fd => -1 (No such file or directory)
[2014-06-03 11:45:58.784693] W [dht-layout.c:178:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 1812423478
[2014-06-03 11:45:58.784726] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 28197: SETXATTR() /.gfid/4a354f13-e523-4c6d-ad2d-d2f562a87083 => -1 (Invalid argument)
[2014-06-03 11:45:58.785022] W [dht-layout.c:178:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 1812423478
[2014-06-03 11:45:58.785058] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 28198: SETXATTR() /.gfid/d0575c6a-b7c4-4b28-bfc0-2ddf57d35cb4 => -1 (Invalid argument)
[2014-06-03 11:45:58.786243] W [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-slave-client-4: remote operation failed: File exists. Path: <gfid:ba0c0e46-a2cc-4a3c-8647-2f8cb3c7a78c>/level55
[2014-06-03 11:45:58.786285] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 22953: SETXATTR() /.gfid/ba0c0e46-a2cc-4a3c-8647-2f8cb3c7a78c => -1 (File exists)
[2014-06-03 11:45:58.798830] W [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-slave-client-5: remote operation failed: File exists. Path: <gfid:826f2fc5-ddf1-4eb2-88c8-8ad10f4ead14>/level55
[2014-06-03 11:45:58.798871] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 22968: SETXATTR() /.gfid/826f2fc5-ddf1-4eb2-88c8-8ad10f4ead14 => -1 (File exists)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 7 Venky Shankar 2014-06-04 07:51:50 UTC
(In reply to Vijaykumar Koppad from comment #6)

Could a couple of things be checked to eliminate variables?

1. Make sure the changelog records for the directory (under which all files are missing on the slave) are sane.

2. Anything in brick logs would help debugging. Please attach (or point) them too.

Comment 8 Kotresh HR 2014-06-04 08:47:40 UTC
In my setup, For the gfid say 'gfid1' on which "Operation not permitted" hits

The file is there in master with 'gfid1' and on slave the file is present with no data with different gfid (completely random/corrupted gfid) which is not even there in master's backend changelog. So when rsync tries to sync data with original gfid, it is failing with 'Operation not Permitted'

But no clue on how the entry of the file succeeded with completely different gfid.

Comment 9 Aravinda VK 2014-06-05 11:04:56 UTC
Upstream patch sent for review. http://review.gluster.org/#/c/7972/

Comment 10 Aravinda VK 2014-06-11 11:19:58 UTC
Patch sent for review
https://code.engineering.redhat.com/gerrit/#/c/26675/

Comment 11 Vijaykumar Koppad 2014-06-20 09:19:01 UTC
These errors are not coming in the build glusterfs-3.6.0.20-1

Comment 15 errata-xmlrpc 2014-09-22 19:39:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html


Note You need to log in before you can comment on or make changes to this bug.