Bug 1287107

Summary: [georep][sharding] Unable to resume geo-rep session after previous errors
Product: [Community] GlusterFS
Reporter: Sahina Bose <sabose>
Component: geo-replication
Assignee: Aravinda VK <avishwan>
Status: CLOSED WORKSFORME
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: mainline
CC: avishwan, bugs, mselvaga, sabose
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-01 11:37:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1258386
Attachments:
  georep-master-log (flags: none)
  georep-slave-log (flags: none)

Description Sahina Bose 2015-12-01 14:07:53 UTC
Created attachment 1100927 [details]
georep-master-log

Description of problem:

A geo-replication session running on a sharded volume started reporting failures because the slave volume ran out of space.

The geo-rep session was stopped, the slave volume's disk space was extended (using lvextend on the underlying brick mount point), and the geo-replication session was resumed.

However, geo-rep status detail still reports failures, and files do not appear to be syncing.

Status detail and volume info are in Additional info below.
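
To confirm that the original failures were due to lack of space, something like the following on the slave node should show it (brick path taken from the slave volume info below; the log directory and grep pattern are approximate):

# df -h /brick/hc1
# grep -ri "No space left on device" /var/log/glusterfs/geo-replication-slaves/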

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Set up a geo-replication session between master and slave (slave volume has less capacity than the master)
2. Start geo-rep
3. Create more data in the master volume than the slave can hold
4. Geo-rep status will report failures (seen as the failure count in status detail)
5. Stop the geo-rep session
6. Increase the capacity of the slave volume (in my case, I extended the brick LV after adding an additional vdisk to the VM hosting the slave)
7. Start the geo-rep session again (a command sketch for steps 5-7 is below)
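
For reference, a rough sketch of steps 5-7 as run on my setup (volume and brick names are from the setup above; the lvextend/xfs_growfs lines are illustrative, assume an XFS brick, and need the actual VG/LV names filled in):

# gluster volume geo-replication data1 10.70.40.112::hc-slavevol stop
# lvextend -l +100%FREE /dev/<vg>/<brick-lv>    <-- on the slave node
# xfs_growfs /brick/hc1                          <-- on the slave node
# gluster volume geo-replication data1 10.70.40.112::hc-slavevol start
# gluster volume geo-replication data1 10.70.40.112::hc-slavevol status detail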

Actual results:

After the session is restarted, geo-rep status detail still reports failures on the active node and the previously failed files are not synced to the slave volume.

Expected results:

After the slave volume is extended, the resumed geo-rep session should retry and sync the files that previously failed.

Additional info:

# gluster vol geo-replication data1 10.70.40.112::hc-slavevol  status detail
 
MASTER NODE                              MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                        SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED            ENTRY    DATA    META    FAILURES    CHECKPOINT TIME        CHECKPOINT COMPLETED    CHECKPOINT COMPLETION TIME   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsdev-docker1.lab.eng.blr.redhat.com    data1         /rhgs/data1/b1    root          10.70.40.112::hc-slavevol    10.70.40.112    Passive    N/A              N/A                    N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
rhsdev9.lab.eng.blr.redhat.com           data1         /rhgs/data1/b1    root          10.70.40.112::hc-slavevol    10.70.40.112    Active     History Crawl    2015-11-26 15:18:56    0        7226    0       107         2015-12-01 18:09:11    No                      N/A                          
rhsdev-docker2.lab.eng.blr.redhat.com    data1         /rhgs/data1/b1    root          10.70.40.112::hc-slavevol    10.70.40.112    Passive    N/A              N/A                    N/A      N/A     N/A     N/A         N/A                    N/A

Master volume:
Volume Name: data1
Type: Replicate
Volume ID: 55bd10b0-f05a-446b-a481-6590cc400263
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhsdev9.lab.eng.blr.redhat.com:/rhgs/data1/b1
Brick2: rhsdev-docker2.lab.eng.blr.redhat.com:/rhgs/data1/b1
Brick3: rhsdev-docker1.lab.eng.blr.redhat.com:/rhgs/data1/b1
Options Reconfigured:
performance.readdir-ahead: on
performance.low-prio-threads: 32
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

Slave volume:
Volume Name: hc-slavevol
Type: Distribute
Volume ID: 56a3d4d9-51bc-4daf-9257-bd13e10511ae
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.70.40.112:/brick/hc1
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on
features.shard: on
features.shard-block-size: 512MB

Comment 1 Sahina Bose 2015-12-01 14:09:08 UTC
Created attachment 1100928 [details]
georep-slave-log

Comment 2 Aravinda VK 2015-12-02 05:38:19 UTC
Looks like the log has only partial details. Is it possible to attach the logs from before the disk expansion? (Interested in the failures related to disk space.)
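
For reference, the logs I am interested in are normally under these default locations (exact file names vary per session):

Master nodes: /var/log/glusterfs/geo-replication/
Slave node:   /var/log/glusterfs/geo-replication-slaves/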

Comment 3 Sahina Bose 2016-03-01 11:37:48 UTC
I don't have the setup anymore. Closing this for now, as I have not run into this again.
Will re-open if I hit it.