Bug 1287107

Summary: [georep][sharding] Unable to resume geo-rep session after previous errors
Product: [Community] GlusterFS
Reporter: Sahina Bose <sabose>
Component: geo-replication
Assignee: Aravinda VK <avishwan>
Status: CLOSED WORKSFORME
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: mainline
CC: avishwan, bugs, mselvaga, sabose
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-01 11:37:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1258386
Attachments:
  georep-master-log (flags: none)
  georep-slave-log (flags: none)

Description Sahina Bose 2015-12-01 14:07:53 UTC
Created attachment 1100927 [details]
georep-master-log

Description of problem:

A geo-replication session running on a sharded volume started reporting failures because the slave volume ran out of space.

The geo-rep session was stopped, the slave volume's disk space was extended (using lvextend on the underlying brick mount point), and the geo-replication session was resumed.

However, geo-rep status detail still reports failures, and files do not appear to be syncing.

Status detail and volume info are in Additional info below.
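
To confirm that the original failures were due to lack of space, something like the following on the slave node should show it (brick path taken from the slave volume info below; the log directory and grep pattern are approximate):

# df -h /brick/hc1
# grep -ri "No space left on device" /var/log/glusterfs/geo-replication-slaves/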

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Set up a geo-replication session between master and slave (slave volume has less capacity than the master)
2. Start geo-rep
3. Create more data in the master volume than the slave can hold
4. Geo-rep status will report failures (seen as the failure count in status detail)
5. Stop the geo-rep session
6. Increase the capacity of the slave volume (in my case, I extended the brick LV after adding an additional vdisk to the VM hosting the slave)
7. Start the geo-rep session again (a command sketch for steps 5-7 is below)
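
For reference, a rough sketch of steps 5-7 as run on my setup (volume and brick names are from the setup above; the lvextend/xfs_growfs lines are illustrative, assume an XFS brick, and need the actual VG/LV names filled in):

# gluster volume geo-replication data1 10.70.40.112::hc-slavevol stop
# lvextend -l +100%FREE /dev/<vg>/<brick-lv>    <-- on the slave node
# xfs_growfs /brick/hc1                          <-- on the slave node
# gluster volume geo-replication data1 10.70.40.112::hc-slavevol start
# gluster volume geo-replication data1 10.70.40.112::hc-slavevol status detail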

Actual results:

After the session is restarted, geo-rep status detail still reports failures on the active node and the previously failed files are not synced to the slave volume.

Expected results:

After the slave volume is extended, the resumed geo-rep session should retry and sync the files that previously failed.

Additional info:

# gluster vol geo-replication data1 10.70.40.112::hc-slavevol  status detail
 
MASTER NODE                              MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                        SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED            ENTRY    DATA    META    FAILURES    CHECKPOINT TIME        CHECKPOINT COMPLETED    CHECKPOINT COMPLETION TIME   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsdev-docker1.lab.eng.blr.redhat.com    data1         /rhgs/data1/b1    root          10.70.40.112::hc-slavevol    10.70.40.112    Passive    N/A              N/A                    N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
rhsdev9.lab.eng.blr.redhat.com           data1         /rhgs/data1/b1    root          10.70.40.112::hc-slavevol    10.70.40.112    Active     History Crawl    2015-11-26 15:18:56    0        7226    0       107         2015-12-01 18:09:11    No                      N/A                          
rhsdev-docker2.lab.eng.blr.redhat.com    data1         /rhgs/data1/b1    root          10.70.40.112::hc-slavevol    10.70.40.112    Passive    N/A              N/A                    N/A      N/A     N/A     N/A         N/A                    N/A

Master volume:
Volume Name: data1
Type: Replicate
Volume ID: 55bd10b0-f05a-446b-a481-6590cc400263
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhsdev9.lab.eng.blr.redhat.com:/rhgs/data1/b1
Brick2: rhsdev-docker2.lab.eng.blr.redhat.com:/rhgs/data1/b1
Brick3: rhsdev-docker1.lab.eng.blr.redhat.com:/rhgs/data1/b1
Options Reconfigured:
performance.readdir-ahead: on
performance.low-prio-threads: 32
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

Slave volume:
Volume Name: hc-slavevol
Type: Distribute
Volume ID: 56a3d4d9-51bc-4daf-9257-bd13e10511ae
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.70.40.112:/brick/hc1
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on
features.shard: on
features.shard-block-size: 512MB

Comment 1 Sahina Bose 2015-12-01 14:09:08 UTC
Created attachment 1100928 [details]
georep-slave-log

Comment 2 Aravinda VK 2015-12-02 05:38:19 UTC
Looks like the log has only partial details. Is it possible to attach the logs from before the disk expansion? (Interested in the failures related to disk space.)
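
For reference, the logs I am interested in are normally under these default locations (exact file names vary per session):

Master nodes: /var/log/glusterfs/geo-replication/
Slave node:   /var/log/glusterfs/geo-replication-slaves/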

Comment 3 Sahina Bose 2016-03-01 11:37:48 UTC
I don't have the setup anymore. Closing this for now, as I have not run into this again.
Will re-open if I hit it.