Bug 1044872 - Dist-geo-rep : status keep going to faulty with rsync error "rsync> inflate returned -3"
Summary: Dist-geo-rep : status keep going to faulty with rsync error "rsync> inflate returned -3"
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard: consistency
Depends On:
Blocks:
 
Reported: 2013-12-19 07:23 UTC by Vijaykumar Koppad
Modified: 2015-08-06 14:13 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-06 14:13:15 UTC
Target Upstream Version:



Description Vijaykumar Koppad 2013-12-19 07:23:51 UTC
Description of problem: geo-rep status keeps going to faulty with the rsync error "rsync> inflate returned -3".

This is the relevant excerpt from the geo-rep log:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2013-12-19 12:29:55.408256] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2013-12-19 12:29:56.605798] E [resource(/bricks/master_brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-vC_AbZ/gsycnd-ssh-%r@%h:%p --compress root.42.171:/proc/4762/cwd" returned with 12, saying:
[2013-12-19 12:29:56.607383] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/d3328146-ba07-4a5a-96a5-d3cab590a49e failed verification -- update retained (will try again).
[2013-12-19 12:29:56.607791] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/da1bd1f4-2f29-4073-9254-5d3ba6fee8af failed verification -- update retained (will try again).
[2013-12-19 12:29:56.608031] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/db74a16f-7bcb-4163-8ca9-62f65d392682 failed verification -- update retained (will try again).
[2013-12-19 12:29:56.608367] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> inflate returned -3 (0 bytes)
[2013-12-19 12:29:56.608613] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at token.c(547) [receiver=3.0.6]
[2013-12-19 12:29:56.608886] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (8086 bytes received so far) [sender]
[2013-12-19 12:29:56.609395] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2013-12-19 12:29:56.610018] I [syncdutils(/bricks/master_brick1):159:finalize] <top>: exiting.
[2013-12-19 12:29:56.616862] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
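The command that fails is the rsync that gsyncd spawns with --compress. A quick way to check whether compression itself is the trigger is to push a small test transfer over the same ssh identity, once without and once with compression, from the affected master brick node. This is only a diagnostic sketch; <slave-host> and /tmp/rsync-test are placeholders, not values taken from this report:

# Same ssh identity geo-rep uses (path taken from the command line in the log above).
RSYNC_RSH="ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem"

# Small transfer without compression ...
rsync -av -e "$RSYNC_RSH" /etc/hosts root@<slave-host>:/tmp/rsync-test/

# ... and the same transfer with compression, as gsyncd does.
rsync -av --compress -e "$RSYNC_RSH" /etc/hosts root@<slave-host>:/tmp/rsync-test/

# If only the compressed run fails with "inflate returned -3", the corruption is in
# the compressed (zlib) data stream rather than in geo-rep's bookkeeping.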

Version-Release number of selected component (if applicable): glusterfs-3.4.0.51geo-1

How reproducible: Didn't try to reproduce 


Steps to Reproduce:
1. Create and start a geo-rep relationship between the master and the slave.
2. Run the following script on the master:

for i in {1..10}
do  
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K    /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K  --fop=chmod  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=chown  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=chgrp  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K  --fop=symlink  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=hardlink  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=truncate /mnt/master/
    sleep 500
done

3. In parallel, on one of the master nodes, run the following loop (a narrower variant is sketched just after it):

while : ; do ps ax | grep "ssh " | awk '{print $1}' | xargs kill ; sleep 100 ; ps ax | grep "ssh " | awk '{print $1}' | xargs kill ; sleep 1000; done 
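Note that grep "ssh " also matches the grep process itself and any interactive ssh sessions on the node. If the intent is to kill only the auxiliary ssh connections that gsyncd owns, a narrower pattern keyed on the geo-rep secret.pem (visible in the command line logged above) may be safer; this is a sketch, assuming pkill is available:

while : ; do
    # Kill only ssh processes whose command line carries the geo-rep identity file.
    pkill -f 'ssh .*geo-replication/secret.pem' ; sleep 100
    pkill -f 'ssh .*geo-replication/secret.pem' ; sleep 1000
done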


Actual results: geo-rep status keeps going to the faulty state with rsync errors.


Expected results: The status should not go to faulty; even if it does, it should recover.


Additional info:

Comment 2 Venky Shankar 2013-12-19 12:02:06 UTC
Did some basic debugging; it looks like it could be an issue during compression, but I am still not sure about the exact cause.
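For reference, -3 is zlib's Z_DATA_ERROR, returned by inflate() when the compressed stream it is fed is corrupt. A cheap first check is therefore that both ends run the same rsync build (and hence the same bundled zlib); a minimal sketch, with <slave-host> as a placeholder:

# rsync build on the master node ...
rsync --version | head -n 2

# ... and on the slave node, over the same ssh identity geo-rep uses.
ssh -i /var/lib/glusterd/geo-replication/secret.pem root@<slave-host> 'rsync --version | head -n 2'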

Comment 3 Vijaykumar Koppad 2014-01-24 11:57:05 UTC
This happened again with build glusterfs-libs-3.4.0.57rhs-1; it was observed while syncing hardlinks during the hybrid crawl.


Excerpt from the geo-rep log file:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-01-24 16:56:56.270279] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2014-01-24 16:56:56.609118] E [resource(/bricks/master_brick1):204:errlog] Popen: command "rsync -avR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs . -e ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-ZLm0kb/061fc87d252b63093ab9bfb765588973.sock --compress root.43.174:/proc/2044/cwd" returned with 12, saying:
[2014-01-24 16:56:56.609867] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: open "/proc/2044/cwd/.gfid/04472653-c5fb-4226-b6e2-8067bad81234" failed: No data available (61)
[2014-01-24 16:56:56.610153] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
[2014-01-24 16:56:56.610444] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> WARNING: .gfid/100924af-70ad-4025-bcf1-d3bafc14dcc9 failed verification -- update retained (will try again).
[2014-01-24 16:56:56.610729] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> inflate returned -3 (0 bytes)
[2014-01-24 16:56:56.610985] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync: connection unexpectedly closed (5419 bytes received so far) [sender]
[2014-01-24 16:56:56.611249] E [resource(/bricks/master_brick1):207:logerr] Popen: rsync> rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
[2014-01-24 16:56:56.611910] I [syncdutils(/bricks/master_brick1):192:finalize] <top>: exiting.
[2014-01-24 16:56:56.623743] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Log snippet from the corresponding (slave) client log for GFID 04472653-c5fb-4226-b6e2-8067bad81234, the file that errored in the master log snippet above:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-01-24 11:52:55.101724] I [fuse-bridge.c:3516:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:0ed073ac-86fa-48b1-80c5-86aae5daa288>/hardlink_to_files with gfid (16316676-7dbf-4376-9c52-3382888aedd6): File exists
[2014-01-24 11:52:55.101756] W [fuse-bridge.c:1628:fuse_err_cbk] 0-glusterfs-fuse: 31812: MKDIR() <gfid:0ed073ac-86fa-48b1-80c5-86aae5daa288>/hardlink_to_files => -1 (File exists)
[2014-01-24 11:53:03.658170] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: <gfid:04472653-c5fb-4226-b6e2-8067bad81234>: failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:03.658287] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 32339: STAT() <gfid:04472653-c5fb-4226-b6e2-8067bad81234> => -1 (No data available)
[2014-01-24 11:53:03.666448] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: <gfid:04472653-c5fb-4226-b6e2-8067bad81234>: failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:08.070778] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-2: remote operation failed: No data available
[2014-01-24 11:53:08.071455] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-2: remote operation failed: No data available
[2014-01-24 11:53:08.072415] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-3: remote operation failed: No data available
[2014-01-24 11:53:08.072482] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: (null): failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:08.073145] W [client-rpc-fops.c:1170:client3_3_fgetxattr_cbk] 0-slave-client-3: remote operation failed: No data available
[2014-01-24 11:53:08.073282] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: (null): failed to get the 'linkto' xattr No data available
[2014-01-24 11:53:08.087516] E [dht-helper.c:777:dht_migration_complete_check_task] 0-slave-dht: <gfid:365036ca-5992-4156-bd63-187bfbe10ad5>: failed to get the 'linkto' xattr No data available
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
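The slave-side failures come from dht_migration_complete_check_task(), which asks the brick for the trusted.glusterfs.dht.linkto xattr and gets ENODATA ("No data available") back. To see what the brick actually holds for that GFID, the backend .glusterfs path can be inspected directly on a slave brick; a sketch, with /bricks/slave_brick1 as a placeholder brick path:

# GFIDs are stored under .glusterfs/<first two>/<next two>/<gfid> on the brick.
ls -l /bricks/slave_brick1/.glusterfs/04/47/04472653-c5fb-4226-b6e2-8067bad81234

# Dump all xattrs; a DHT link-to file would carry trusted.glusterfs.dht.linkto.
# Its absence is consistent with the ENODATA errors in the client log above.
getfattr -d -m . -e hex /bricks/slave_brick1/.glusterfs/04/47/04472653-c5fb-4226-b6e2-8067bad81234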

Comment 5 Aravinda VK 2015-08-06 14:13:15 UTC
The rsync inflate errors were not seen during regression tests of RHGS 2.1.6, 3.0.4, and 3.1. Closing this bug. Please reopen if it is seen again.

