Bug 1044872

Summary: Dist-geo-rep: status keeps going to faulty with rsync error "rsync> inflate returned -3"
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WORKSFORME
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: medium
Version: 2.1
CC: aavati, avishwan, csaba, david.macdonald, vshankar
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: consistency
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-08-06 14:13:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vijaykumar Koppad 2013-12-19 07:23:51 UTC
Description of problem: geo-rep status keeps going to faulty with the rsync error "rsync> inflate returned -3".

This is the backtrace:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2013-12-19 12:29:55.408256] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2013-12-19 12:29:56.605798] E [resource(/bricks/master_brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-vC_AbZ/gsycnd-ssh-%r@%h:%p --compress root.42.171:/proc/4762/cwd" returned with 12, saying:
[2013-12-19 12:29:56.607383] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/d3328146-ba07-4a5a-96a5-d3cab590a49e failed verification -- update retained (will try again).
[2013-12-19 12:29:56.607791] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/da1bd1f4-2f29-4073-9254-5d3ba6fee8af failed verification -- update retained (will try again).
[2013-12-19 12:29:56.608031] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/db74a16f-7bcb-4163-8ca9-62f65d392682 failed verification -- update retained (will try again).
[2013-12-19 12:29:56.608367] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> inflate returned -3 (0 bytes)
[2013-12-19 12:29:56.608613] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at token.c(547) [receiver=3.0.6]
[2013-12-19 12:29:56.608886] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (8086 bytes received so far) [sender]
[2013-12-19 12:29:56.609395] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-12-19 12:29:56.610018] I [syncdutils(/bricks/master_brick1):159:finalize] <top>: exiting.
[2013-12-19 12:29:56.616862] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
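For context, rsync exit code 12 means "error in rsync protocol data stream" (as the log itself states), and "inflate returned -3" corresponds to zlib's Z_DATA_ERROR, i.e. the receiving side saw a corrupted or truncated compressed stream. A minimal shell sketch of that failure class (illustration only; the /tmp paths are throwaway and not taken from this report):

# Illustration: a truncated gzip/zlib stream fails to decompress, which is the
# kind of failure behind rsync's "inflate returned -3" (zlib Z_DATA_ERROR).
echo "example payload" | gzip -c > /tmp/sample.gz
head -c 20 /tmp/sample.gz > /tmp/sample-truncated.gz   # simulate a damaged stream
gunzip -c /tmp/sample-truncated.gz || echo "decompression failed as expected"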

Version-Release number of selected component (if applicable): glusterfs-3.4.0.51geo-1

How reproducible: Didn't try to reproduce 


Steps to Reproduce:
1. Create and start a geo-rep relationship between the master and slave volumes.
2. Run the following script against the master mount (/mnt/master):

for i in {1..10}
do
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chmod /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chown /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chgrp /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=symlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=truncate /mnt/master/
    sleep 500
done

3. In parallel, on one of the master nodes, run the following to periodically kill the ssh processes:

while : ; do ps ax | grep "ssh " | awk '{print $1}' | xargs kill ; sleep 100 ; ps ax | grep "ssh " | awk '{print $1}' | xargs kill ; sleep 1000; done 
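For reference, a slightly more targeted variant of the same loop (an illustrative sketch, not part of the original reproducer): it matches processes by the gsyncd-aux-ssh control-socket path visible in the rsync command logged above, rather than every process whose ps line happens to contain "ssh ".

# Sketch: kill only processes whose command line references the geo-rep
# auxiliary ssh control socket (path pattern taken from the log above).
while : ; do pkill -f 'gsyncd-aux-ssh' ; sleep 100 ; pkill -f 'gsyncd-aux-ssh' ; sleep 1000 ; done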


Actual results: geo-rep status keeps going to the faulty state with rsync errors.


Expected results: The status shouldn't go to faulty; even if it does, it should recover.


Additional info:

Comment 2 Venky Shankar 2013-12-19 12:02:06 UTC
Did some basic debugging; it looks like it could be an issue during compression, but I'm still not sure about the exact cause.
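One way to probe the compression hypothesis would be a manual transfer between the same pair of hosts, once with and once without --compress, and comparing the outcomes. A rough sketch (the test directory, destination path, and slave host are placeholders; only the identity file and master brick paths are taken from the log above):

# Run once with --compress and once without it; if only the compressed run
# reproduces "inflate returned -3", compression is the likely culprit.
rsync -av --compress -e "ssh -i /var/lib/glusterd/geo-replication/secret.pem" \
    /bricks/master_brick1/some-test-dir root@<slave-host>:/tmp/rsync-compress-test/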

Comment 3 Vijaykumar Koppad 2014-01-24 11:57:05 UTC
This happened again in the build glusterfs-libs-3.4.0.57rhs-1. This was observed while syncing hardlinks in hybrid crawl.


Backtrace in the geo-rep log file:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-01-24 16:56:56.270279] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2014-01-24 16:56:56.609118] E [resource(/bricks/master_brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-ZLm0kb/061fc87d252b63093ab9bfb765588973.sock --compress root.43.174:/proc/2044/cwd" returned with 12, saying:
[2014-01-24 16:56:56.609867] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: open "/proc/2044/cwd/.gfid/04472653-c5fb-4226-b6e2-8067bad81234" failed: No data available (61)
[2014-01-24 16:56:56.610153] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
[2014-01-24 16:56:56.610444] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/100924af-70ad-4025-bcf1-d3bafc14dcc9 failed verification -- update retained (will try again).
[2014-01-24 16:56:56.610729] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> inflate returned -3 (0 bytes)
[2014-01-24 16:56:56.610985] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (5419 bytes received so far) [sender]
[2014-01-24 16:56:56.611249] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2014-01-24 16:56:56.611910] I [syncdutils(/bricks/master_brick1):192:finalize] <top>: exiting.
[2014-01-24 16:56:56.623743] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Log snippet from the corresponding client log for the GFID 04472653-c5fb-4226-b6e2-8067bad81234 that hit the error on the master (see the rsync "open ... failed" line in the snippet above):

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
6aae5daa288>/hardlink_to_files
[2014-01-24 11:52:55.101724] I [fuse-bridge.c:3516:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:0ed073ac-86fa-48b1-80c5-86aae5daa288>/hardlink_to_files with gfid (16316676-7dbf-4376-9c52-3382888aedd6): File exists
[2014-01-24 11:52:55.101756] W [fuse-bridge.c:1628:fuse_err_cbk] 0-glusterfs-fuse: 31812: MKDIR() <gfid:0ed073ac-86fa-48b1-80c5-86aae5daa288>/hardlink_to_files => -1 (File exists)
[2014-01-24 11:53:03.658170] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: <gfid:04472653-c5fb-4226-b6e2-8067bad81234>: failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:03.658287] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 32339: STAT() <gfid:04472653-c5fb-4226-b6e2-8067bad81234> => -1 (No data available)
[2014-01-24 11:53:03.666448] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: <gfid:04472653-c5fb-4226-b6e2-8067bad81234>: failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:08.070778] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-2: remote operation failed: No data available
[2014-01-24 11:53:08.071455] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-2: remote operation failed: No data available
[2014-01-24 11:53:08.072415] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-3: remote operation failed: No data available
[2014-01-24 11:53:08.072482] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: (null): failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:08.073145] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-3: remote operation failed: No data available
[2014-01-24 11:53:08.073282] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: (null): failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:08.087516] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: <gfid:365036ca-5992-4156-bd63-187bfbe10ad5>: failed to get the 'linkto' xattr No data available
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
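The "failed to get the 'linkto' xattr" and "No data available" (ENODATA) messages suggest the slave-side DHT was treating the entry as a file under migration / a link-to file. A sketch of how one might inspect the backend entry directly on a slave brick (the brick path is a placeholder; the GFID is from the log above and the .glusterfs/<xx>/<yy>/<gfid> layout and xattr name are standard GlusterFS conventions):

# A DHT link-to file is a zero-byte, sticky-bit (---------T) file carrying the
# trusted.glusterfs.dht.linkto xattr; check the backend entry for this GFID.
getfattr -d -m . -e hex \
    /bricks/slave_brick1/.glusterfs/04/47/04472653-c5fb-4226-b6e2-8067bad81234
ls -l /bricks/slave_brick1/.glusterfs/04/47/04472653-c5fb-4226-b6e2-8067bad81234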

Comment 5 Aravinda VK 2015-08-06 14:13:15 UTC
The rsync inflate errors were not seen during regression tests of RHGS 2.1.6, 3.0.4, and 3.1. Closing this bug. Please reopen if it is seen again.