Bug 1109482

Summary: Dist-geo-rep: creating hardlinks on the master while taking snapshots with geo-rep results in read errors for a few hardlinks on the slave side.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WONTFIX
QA Contact: amainkar
Severity: high
Priority: high
Version: rhgs-3.0
CC: avishwan, chrisw, csaba, david.macdonald, mzywusko, nlevinki, smohan, vshankar
Target Milestone: ---
Target Release: ---
Keywords: ZStream
Hardware: x86_64
OS: Linux
Whiteboard: consistency
Doc Type: Bug Fix
Type: Bug
Last Closed: 2018-04-16 15:57:57 UTC
Attachments:
sosreport of the master and slave nodes.

Description Vijaykumar Koppad 2014-06-14 11:04:02 UTC
Description of problem: While creating hardlinks on the master and taking snapshots with geo-rep, a few hardlinks gave read errors on the slave side.

The arequal-checksum run gave this error:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Calculating  slave checksum ...

md5sum: /tmp/tmpyfYjDH/thread0/level00/level10/level20/level30/level40/level50/level60/level70/level80/level90/hardlink_to_files/539c1fb9%%S7NZ3IENGZ: No data available

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The corresponding client logs show:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-06-14 10:39:52.011137] W [client-rpc-fops.c:1155:client3_3_fgetxattr_cbk] 0-slave-client-9: remote operation failed: No data available
[2014-06-14 10:39:52.011207] E [dht-helper.c:778:dht_migration_complete_check_task] 0-slave-dht: (null): failed to get the 'linkto' xattr No data available
[2014-06-14 10:39:52.011283] W [fuse-bridge.c:2157:fuse_readv_cbk] 0-glusterfs-fuse: 390: READ => -1 (No data available)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Version-Release number of selected component (if applicable): glusterfs-3.6.0.16-1.el6rhs


How reproducible: Did not try to reproduce.
 

Steps to Reproduce:
1. Create a geo-rep session between the master and slave.
2. Create data on the master using the command "crefi -T 10 -n 5 --multi -b 10 -d 10 --random --min=1K --max=10K /mnt/master/".
3. Truncate all the data using the command "crefi -T 10 -n 5 --multi -b 10 -d 10 --random --min=1K --max=10K --fop=truncate /mnt/master/".
4. While the data is being truncated, pause geo-rep, take snapshots of the slave and master, and resume geo-rep.
5. After syncing completes, create hardlinks using the command "crefi -T 10 -n 5 --multi -b 10 -d 10 --random --min=1K --max=10K --fop=hardlink /mnt/master/".
6. While the hardlinks are being created, pause geo-rep, take snapshots of the slave and master, and resume geo-rep.
7. Check the checksums of the master and slave.
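The checksum comparison in step 7 (done above with arequal-checksum) can be sketched as a small script that walks both mount points, computes a per-file md5, and reports files whose read fails — the "No data available" failure mode seen in this bug. This is an illustrative sketch, not what arequal-checksum actually does; the directory-tree layout and function names are assumptions.

```python
import hashlib
import os

def tree_checksums(root):
    """Walk `root`, returning {relative_path: md5hex} plus a list of
    files whose read failed (e.g. ENODATA, as seen on the slave mount)."""
    sums, errors = {}, []
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            try:
                with open(path, "rb") as f:
                    sums[rel] = hashlib.md5(f.read()).hexdigest()
            except OSError as e:
                # a read error here corresponds to the md5sum failure above
                errors.append((rel, e.strerror))
    return sums, errors

def compare(master_root, slave_root):
    """Return (mismatched, missing-on-slave, slave read errors)."""
    m_sums, _ = tree_checksums(master_root)
    s_sums, s_errors = tree_checksums(slave_root)
    mismatched = [p for p in m_sums if p in s_sums and m_sums[p] != s_sums[p]]
    missing = [p for p in m_sums if p not in s_sums]
    return mismatched, missing, s_errors
```

In a healthy sync, all three result lists are empty; in this bug, the slave error list would contain the affected hardlinks.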


Actual results: Reads on a few hardlinks failed with a read error.


Expected results: Reads on hardlinks should not fail after syncing to the slave.


Additional info:

Comment 2 Vijaykumar Koppad 2014-06-14 11:15:29 UTC
Created attachment 908743 [details]
sosreport of the master and slave nodes.

Comment 3 Vijaykumar Koppad 2014-06-14 11:38:05 UTC
Stat of the file from the mount point:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
stat /tmp/tmpyfYjDH/thread0/level00/level10/level20/level30/level40/level50/level60/level70/level80/level90/hardlink_to_files/539c1fb9%%S7NZ3IENGZ 
  File: `/tmp/tmpyfYjDH/thread0/level00/level10/level20/level30/level40/level50/level60/level70/level80/level90/hardlink_to_files/539c1fb9%%S7NZ3IENGZ'
  Size: 0               Blocks: 0          IO Block: 131072 regular empty file
Device: 22h/34d Inode: 11524165653306002252  Links: 1
Access: (1000/---------T)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-06-14 15:45:12.164004818 +0530
Modify: 2014-06-14 15:45:12.164004818 +0530
Change: 2014-06-14 15:45:12.164004818 +0530

getfattr output for the file in question from the slave backend bricks:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@redmoon ~]# find /bricks/ | grep 539c1fb9%%S7NZ3IENGZ
/bricks/brick3/slave_b9/thread0/level00/level10/level20/level30/level40/level50/level60/level70/level80/level90/hardlink_to_files/539c1fb9%%S7NZ3IENGZ
[root@redmoon ~]# getfattr -d -m . -e hex /bricks/brick3/slave_b9/thread0/level00/level10/level20/level30/level40/level50/level60/level70/level80/level90/hardlink_to_files/539c1fb9%%S7NZ3IENGZ
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick3/slave_b9/thread0/level00/level10/level20/level30/level40/level50/level60/level70/level80/level90/hardlink_to_files/539c1fb9%%S7NZ3IENGZ
trusted.gfid=0xb18d20e85e734e2f9fee0f9aa20fcb4c

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@redcloud ~]# find /bricks/ | grep 539c1fb9%%S7NZ3IENGZ
/bricks/brick3/slave_b10/thread0/level00/level10/level20/level30/level40/level50/level60/level70/level80/level90/hardlink_to_files/539c1fb9%%S7NZ3IENGZ
[root@redcloud ~]# getfattr -d -m . -e hex bricks/brick3/slave_b10/thread0/level00/level10/level20/level30/level40/level50/level60/level70/level80/level90/hardlink_to_files/539c1fb9%%S7NZ3IENGZ
getfattr: bricks/brick3/slave_b10/thread0/level00/level10/level20/level30/level40/level50/level60/level70/level80/level90/hardlink_to_files/539c1fb9%%S7NZ3IENGZ: No such file or directory

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 4 Venky Shankar 2014-06-25 15:13:21 UTC
This looks like a side effect of capturing mknod() even when it is an internal fop. I see these in the logs:

------------------------------------------------------------------------------
vshankar@h3ckers-pride ~/sos/slave/redmoon-2014061416251402743305/var/log/glusterfs/geo-replication-slaves
 % grep -r '539c1fb9%%S7NZ3IENGZ' *
4d739b65-cd7b-49f3-902a-439653061bc8:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2014-06-14 10:15:30.633623] W [client-rpc-fops.c:240:client3_3_mknod_cbk] 0-slave-client-8: remote operation failed: File exists. Path: <gfid:ef718d6a-1b4e-4b3a-9000-9262500b5b23>/539c1fb9%%S7NZ3IENGZ
4d739b65-cd7b-49f3-902a-439653061bc8:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2014-06-14 10:15:30.634099] W [client-rpc-fops.c:240:client3_3_mknod_cbk] 0-slave-client-9: remote operation failed: File exists. Path: <gfid:ef718d6a-1b4e-4b3a-9000-9262500b5b23>/539c1fb9%%S7NZ3IENGZ
4d739b65-cd7b-49f3-902a-439653061bc8:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2014-06-14 10:17:02.033865] W [client-rpc-fops.c:240:client3_3_mknod_cbk] 0-slave-client-8: remote operation failed: File exists. Path: <gfid:ef718d6a-1b4e-4b3a-9000-9262500b5b23>/539c1fb9%%S7NZ3IENGZ
4d739b65-cd7b-49f3-902a-439653061bc8:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2014-06-14 10:17:02.034364] W [client-rpc-fops.c:240:client3_3_mknod_cbk] 0-slave-client-9: remote operation failed: File exists. Path: <gfid:ef718d6a-1b4e-4b3a-9000-9262500b5b23>/539c1fb9%%S7NZ3IENGZ
------------------------------------------------------------------------------

File "539c1fb9%%S7NZ3IENGZ" is a hardlink, but the slave logs show it as a mknod(). Although these are "File exists" failures, the first mknod() would have succeeded.

Kotresh's patch to capture self-heal traffic ignores mknod() when it is an internal fop, so only the rename() call gets captured in the changelog.
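The filtering described above can be illustrated with a hypothetical capture predicate. All names here are invented for illustration; the real logic lives in glusterfs's changelog translator (C code), not in anything like this. The point is only the shape of the behavior: an fop flagged as internal is not recorded, so an internal mknod() (e.g. one issued for a DHT linkfile) is skipped and only the subsequent rename() lands in the changelog.

```python
# Hypothetical sketch of "skip internal fops" changelog capture.
INTERNAL = "internal"  # invented flag name, for illustration only

def capture(changelog, fop, flags=()):
    """Record `fop` in `changelog` unless it is marked internal."""
    if INTERNAL in flags:
        return  # e.g. a mknod() issued internally for a linkfile
    changelog.append(fop)

log = []
capture(log, "MKNOD", flags=("internal",))  # skipped, not recorded
capture(log, "RENAME")                      # recorded
# log now contains only "RENAME": geo-rep never sees the mknod()
```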

Comment 5 Vijaykumar Koppad 2014-06-25 15:16:34 UTC
I tried once with the build glusterfs-3.6.0.22-1.el6rhs and was not able to hit it. I'll try some more runs with 22 and update.