Bug 1044872

Summary: Dist-geo-rep: status keeps going to faulty with rsync error "rsync> inflate returned -3"
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WORKSFORME
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: medium
Version: 2.1
CC: aavati, avishwan, csaba, david.macdonald, vshankar
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: consistency
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-08-06 14:13:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vijaykumar Koppad 2013-12-19 07:23:51 UTC
Description of problem: geo-rep status keeps going to faulty with the rsync error "rsync> inflate returned -3".

This is the backtrace:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2013-12-19 12:29:55.408256] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2013-12-19 12:29:56.605798] E [resource(/bricks/master_brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-vC_AbZ/gsycnd-ssh-%r@%h:%p --compress root.42.171:/proc/4762/cwd" returned with 12, saying:
[2013-12-19 12:29:56.607383] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/d3328146-ba07-4a5a-96a5-d3cab590a49e failed verification -- update retained (will try again).
[2013-12-19 12:29:56.607791] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/da1bd1f4-2f29-4073-9254-5d3ba6fee8af failed verification -- update retained (will try again).
[2013-12-19 12:29:56.608031] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/db74a16f-7bcb-4163-8ca9-62f65d392682 failed verification -- update retained (will try again).
[2013-12-19 12:29:56.608367] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> inflate returned -3 (0 bytes)
[2013-12-19 12:29:56.608613] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at token.c(547) [receiver=3.0.6]
[2013-12-19 12:29:56.608886] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (8086 bytes received so far) [sender]
[2013-12-19 12:29:56.609395] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-12-19 12:29:56.610018] I [syncdutils(/bricks/master_brick1):159:finalize] <top>: exiting.
[2013-12-19 12:29:56.616862] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
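For context, rsync exit code 12 means "error in rsync protocol data stream" (as the log itself states), and "inflate returned -3" corresponds to zlib's Z_DATA_ERROR, i.e. the receiving side saw a corrupted or truncated compressed stream. A minimal shell sketch of that failure class (illustration only; the /tmp paths are throwaway and not taken from this report):

# Illustration: a truncated gzip/zlib stream fails to decompress, which is the
# kind of failure behind rsync's "inflate returned -3" (zlib Z_DATA_ERROR).
echo "example payload" | gzip -c > /tmp/sample.gz
head -c 20 /tmp/sample.gz > /tmp/sample-truncated.gz   # simulate a damaged stream
gunzip -c /tmp/sample-truncated.gz || echo "decompression failed as expected"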

Version-Release number of selected component (if applicable): glusterfs-3.4.0.51geo-1

How reproducible: Didn't try to reproduce 


Steps to Reproduce:
1. Create and start a geo-rep relationship between the master and slave volumes.
2. Run the following script against the master mount (/mnt/master):

for i in {1..10}
do
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chmod /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chown /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chgrp /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=symlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=truncate /mnt/master/
    sleep 500
done

3. In parallel, on one of the master nodes, run the following to periodically kill the ssh processes:

while : ; do ps ax | grep "ssh " | awk '{print $1}' | xargs kill ; sleep 100 ; ps ax | grep "ssh " | awk '{print $1}' | xargs kill ; sleep 1000; done 
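For reference, a slightly more targeted variant of the same loop (an illustrative sketch, not part of the original reproducer): it matches processes by the gsyncd-aux-ssh control-socket path visible in the rsync command logged above, rather than every process whose ps line happens to contain "ssh ".

# Sketch: kill only processes whose command line references the geo-rep
# auxiliary ssh control socket (path pattern taken from the log above).
while : ; do pkill -f 'gsyncd-aux-ssh' ; sleep 100 ; pkill -f 'gsyncd-aux-ssh' ; sleep 1000 ; done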


Actual results: geo-rep status keeps going to the faulty state with rsync errors.


Expected results: The status shouldn't go to faulty; even if it does, it should recover.


Additional info:

Comment 2 Venky Shankar 2013-12-19 12:02:06 UTC
Did some basic debugging; it looks like it could be an issue during compression, but I'm still not sure about the exact cause.
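One way to probe the compression hypothesis would be a manual transfer between the same pair of hosts, once with and once without --compress, and comparing the outcomes. A rough sketch (the test directory, destination path, and slave host are placeholders; only the identity file and master brick paths are taken from the log above):

# Run once with --compress and once without it; if only the compressed run
# reproduces "inflate returned -3", compression is the likely culprit.
rsync -av --compress -e "ssh -i /var/lib/glusterd/geo-replication/secret.pem" \
    /bricks/master_brick1/some-test-dir root@<slave-host>:/tmp/rsync-compress-test/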

Comment 3 Vijaykumar Koppad 2014-01-24 11:57:05 UTC
This happened again in the build glusterfs-libs-3.4.0.57rhs-1. This was observed while syncing hardlinks in hybrid crawl.


Backtrace in the geo-rep log file:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-01-24 16:56:56.270279] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2014-01-24 16:56:56.609118] E [resource(/bricks/master_brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-ZLm0kb/061fc87d252b63093ab9bfb765588973.sock --compress root.43.174:/proc/2044/cwd" returned with 12, saying:
[2014-01-24 16:56:56.609867] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: open "/proc/2044/cwd/.gfid/04472653-c5fb-4226-b6e2-8067bad81234" failed: No data available (61)
[2014-01-24 16:56:56.610153] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
[2014-01-24 16:56:56.610444] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/100924af-70ad-4025-bcf1-d3bafc14dcc9 failed verification -- update retained (will try again).
[2014-01-24 16:56:56.610729] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> inflate returned -3 (0 bytes)
[2014-01-24 16:56:56.610985] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (5419 bytes received so far) [sender]
[2014-01-24 16:56:56.611249] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2014-01-24 16:56:56.611910] I [syncdutils(/bricks/master_brick1):192:finalize] <top>: exiting.
[2014-01-24 16:56:56.623743] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Log snippet from the corresponding client log for the GFID 04472653-c5fb-4226-b6e2-8067bad81234 that hit the error on the master (see the rsync "open ... failed" line in the snippet above):

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
6aae5daa288>/hardlink_to_files
[2014-01-24 11:52:55.101724] I [fuse-bridge.c:3516:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:0ed073ac-86fa-48b1-80c5-86aae5daa288>/hardlink_to_files with gfid (16316676-7dbf-4376-9c52-3382888aedd6): File exists
[2014-01-24 11:52:55.101756] W [fuse-bridge.c:1628:fuse_err_cbk] 0-glusterfs-fuse: 31812: MKDIR() <gfid:0ed073ac-86fa-48b1-80c5-86aae5daa288>/hardlink_to_files => -1 (File exists)
[2014-01-24 11:53:03.658170] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: <gfid:04472653-c5fb-4226-b6e2-8067bad81234>: failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:03.658287] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 32339: STAT() <gfid:04472653-c5fb-4226-b6e2-8067bad81234> => -1 (No data available)
[2014-01-24 11:53:03.666448] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: <gfid:04472653-c5fb-4226-b6e2-8067bad81234>: failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:08.070778] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-2: remote operation failed: No data available
[2014-01-24 11:53:08.071455] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-2: remote operation failed: No data available
[2014-01-24 11:53:08.072415] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-3: remote operation failed: No data available
[2014-01-24 11:53:08.072482] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: (null): failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:08.073145] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-3: remote operation failed: No data available
[2014-01-24 11:53:08.073282] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: (null): failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:08.087516] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: <gfid:365036ca-5992-4156-bd63-187bfbe10ad5>: failed to get the 'linkto' xattr No data available
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
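The "failed to get the 'linkto' xattr" and "No data available" (ENODATA) messages suggest the slave-side DHT was treating the entry as a file under migration / a link-to file. A sketch of how one might inspect the backend entry directly on a slave brick (the brick path is a placeholder; the GFID is from the log above and the .glusterfs/<xx>/<yy>/<gfid> layout and xattr name are standard GlusterFS conventions):

# A DHT link-to file is a zero-byte, sticky-bit (---------T) file carrying the
# trusted.glusterfs.dht.linkto xattr; check the backend entry for this GFID.
getfattr -d -m . -e hex \
    /bricks/slave_brick1/.glusterfs/04/47/04472653-c5fb-4226-b6e2-8067bad81234
ls -l /bricks/slave_brick1/.glusterfs/04/47/04472653-c5fb-4226-b6e2-8067bad81234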

Comment 5 Aravinda VK 2015-08-06 14:13:15 UTC
The rsync inflate errors were not seen during regression tests of RHGS 2.1.6, 3.0.4, and 3.1. Closing this bug. Please reopen if it is seen again.