Bug 1287107 - [georep][sharding] Unable to resume geo-rep session after previous errors
Summary: [georep][sharding] Unable to resume geo-rep session after previous errors
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Aravinda VK
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-1
 
Reported: 2015-12-01 14:07 UTC by Sahina Bose
Modified: 2016-03-01 11:37 UTC (History)
4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-03-01 11:37:48 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
georep-master-log (7.12 KB, text/plain)
2015-12-01 14:07 UTC, Sahina Bose
georep-slave-log (313.54 KB, text/plain)
2015-12-01 14:09 UTC, Sahina Bose

Description Sahina Bose 2015-12-01 14:07:53 UTC
Created attachment 1100927 [details]
georep-master-log

Description of problem:

A geo-replication session running on a sharded volume began reporting failures because the slave volume ran out of disk space.

The geo-rep session was stopped, the slave volume's disk space was extended (using lvextend on the underlying brick mount point), and the geo-replication session was resumed.

However, geo-rep status detail still shows failures, and files do not appear to be syncing.

Status detail and volume info are included under Additional info below.
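For reference, extending the slave brick's backing storage as described above would look roughly like the following. This is an illustrative sketch only: the VG/LV names, device path, and mount point are assumptions, not taken from this report.

```
# Sketch: grow the slave brick's logical volume after adding a vdisk.
# Names below (/dev/vdb, slave_vg, brick_lv, /brick) are hypothetical.
pvcreate /dev/vdb                             # initialize the new vdisk
vgextend slave_vg /dev/vdb                    # add it to the volume group
lvextend -l +100%FREE /dev/slave_vg/brick_lv  # grow the brick LV
xfs_growfs /brick                             # grow the mounted XFS filesystem online
```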

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Set up a geo-replication session between master and slave (the slave volume has less capacity than the master)
2. Start geo-rep
3. Create more data in the master volume than the slave can hold
4. geo-rep status reports failures (visible in status detail as the failure count)
5. Stop the geo-rep session
6. Increase the capacity of the slave volume (in my case, I extended the brick LV by adding an additional vdisk to the VM hosting the slave)
7. Start the geo-rep session again
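The steps above map to roughly this gluster CLI sequence. The volume and slave names are taken from this report; the sequence itself is a sketch of the lifecycle, not a verified transcript.

```
# Sketch of the geo-rep lifecycle used in the reproduction steps.
gluster volume geo-replication data1 10.70.40.112::hc-slavevol start
# ... write more data to the master than the slave can hold ...
gluster volume geo-replication data1 10.70.40.112::hc-slavevol status detail
gluster volume geo-replication data1 10.70.40.112::hc-slavevol stop
# ... extend the slave brick capacity (lvextend + filesystem grow on the slave) ...
gluster volume geo-replication data1 10.70.40.112::hc-slavevol start
```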

Actual results:


Expected results:


Additional info:

# gluster vol geo-replication data1 10.70.40.112::hc-slavevol  status detail
 
MASTER NODE                              MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                        SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED            ENTRY    DATA    META    FAILURES    CHECKPOINT TIME        CHECKPOINT COMPLETED    CHECKPOINT COMPLETION TIME   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsdev-docker1.lab.eng.blr.redhat.com    data1         /rhgs/data1/b1    root          10.70.40.112::hc-slavevol    10.70.40.112    Passive    N/A              N/A                    N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
rhsdev9.lab.eng.blr.redhat.com           data1         /rhgs/data1/b1    root          10.70.40.112::hc-slavevol    10.70.40.112    Active     History Crawl    2015-11-26 15:18:56    0        7226    0       107         2015-12-01 18:09:11    No                      N/A                          
rhsdev-docker2.lab.eng.blr.redhat.com    data1         /rhgs/data1/b1    root          10.70.40.112::hc-slavevol    10.70.40.112    Passive    N/A              N/A                    N/A      N/A     N/A     N/A         N/A                    N/A                     N/A

Master volume:
Volume Name: data1
Type: Replicate
Volume ID: 55bd10b0-f05a-446b-a481-6590cc400263
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhsdev9.lab.eng.blr.redhat.com:/rhgs/data1/b1
Brick2: rhsdev-docker2.lab.eng.blr.redhat.com:/rhgs/data1/b1
Brick3: rhsdev-docker1.lab.eng.blr.redhat.com:/rhgs/data1/b1
Options Reconfigured:
performance.readdir-ahead: on
performance.low-prio-threads: 32
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

Slave volume:
Volume Name: hc-slavevol
Type: Distribute
Volume ID: 56a3d4d9-51bc-4daf-9257-bd13e10511ae
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.70.40.112:/brick/hc1
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on
features.shard: on
features.shard-block-size: 512MB

Comment 1 Sahina Bose 2015-12-01 14:09:08 UTC
Created attachment 1100928 [details]
georep-slave-log

Comment 2 Aravinda VK 2015-12-02 05:38:19 UTC
Looks like the log has only partial details. Is it possible to attach the logs from before the disk expansion? (I'm interested in the failures related to disk space.)

Comment 3 Sahina Bose 2016-03-01 11:37:48 UTC
I don't have the setup anymore. Closing this for now, since I have not run into the issue again.
I will re-open if I hit it.

