Bug 1621883

Summary:

Geo-replication acknowledging complete writes after failure

Product:

[Red Hat Storage] Red Hat Gluster Storage

Reporter:

Dave <dfitzpat>

Component:

geo-replication

Assignee:

Kotresh HR <khiremat>

Status:

CLOSED WONTFIX

QA Contact:

Rahul Hinduja <rhinduja>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

rhhi-1.1

CC:

ascerra, avishwan, bkunal, csaba, dlane, jroberts, khiremat, rhs-bugs, sabose, storage-qa-internal

Target Milestone:

---

Keywords:

ZStream

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-10-24 11:47:07 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1724792

Attachments:

Description	Flags
sosreport	none

Description Dave 2018-08-23 19:10:56 UTC

Description of problem:

The test uses two Gluster clusters of 3 nodes each: a RHHI POD and a GFS cluster acting as the DR site. I write the test/temp data, then create and start a geo-replication session to sync the data to the DR site. Once the data starts to synchronize, I reboot the DR site, interrupting the geo-replication session. When the DR site comes back online, I resume monitoring the geo-rep status command, and that is when I see the output in the pastebin below. I have watched this for about 6 hours and nothing has changed.

I have only just noticed the 'FAILURES' column, which shows 111. I wanted your opinion on this before filing a bug. This is an automated test run through Ansible, and it works in other situations, such as geo-replication during network packet loss. The one test case that seems to fail is when the DR site is taken offline mid-replication.

Additional questions: When is a write acknowledged by Gluster and considered written to the DR site? Our intuition is that it is considered written when the data can be recovered, i.e. no data loss. When is this reflected in the output of the 'gluster v geo status detail' command?

How reproducible:
always

Steps to Reproduce:
1. Write the test/temp data
2. Create and start a geo-replication session to sync the data to the DR site
3. Once the data starts to synchronize, reboot the DR site, interrupting the geo-replication session
4. Check status
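
For reference, the session commands behind the steps above can be sketched as follows. This is a minimal sketch, not the actual Ansible test: the volume names and DR host are hypothetical placeholders, and the script only builds and prints the command lines rather than running them. It assumes a session created with the push-pem option.

```python
# Hypothetical names: replace with the actual master volume, DR host and DR volume.
MASTER_VOL = "mastervol"
SLAVE_HOST = "dr-node1.example.com"
SLAVE_VOL = "drvol"


def georep_cmd(*action):
    """Build the argv for a geo-replication subcommand on this session."""
    return ["gluster", "volume", "geo-replication",
            MASTER_VOL, f"{SLAVE_HOST}::{SLAVE_VOL}", *action]


steps = [
    georep_cmd("create", "push-pem"),  # step 2: create the session
    georep_cmd("start"),               # step 2: start syncing to the DR site
    georep_cmd("status", "detail"),    # step 4: check status after the DR reboot
]

# Print the command lines only; nothing is executed here.
for argv in steps:
    print(" ".join(argv))
```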

Actual results:
Geo-rep takes a long time to reflect a write after the DR site reboots and the session re-handshakes

Expected results:
The transfer is resumed and completed in a timely manner and 'gluster v geo status detail' reflects the appropriate status.
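
An automated test could flag the "nothing has changed" condition instead of relying on someone watching the status output. The following is a minimal sketch under stated assumptions: each poll is reduced to a tuple of values taken from the status output, and the sample tuples are hypothetical illustration data (only the FAILURES count of 111 comes from this report). How the values are extracted from the command output is left out.

```python
def is_stalled(snapshots, min_repeats=3):
    """Return True if the last `min_repeats` snapshots are identical."""
    if len(snapshots) < min_repeats:
        return False
    tail = snapshots[-min_repeats:]
    return all(snapshot == tail[0] for snapshot in tail)


# Hypothetical polls, oldest first: (STATUS, LAST_SYNCED, FAILURES).
polls = [
    ("Active", "2018-08-23 19:00:00", 0),
    ("Active", "2018-08-23 19:00:00", 111),
    ("Active", "2018-08-23 19:00:00", 111),
    ("Active", "2018-08-23 19:00:00", 111),
]

print(is_stalled(polls))
```

The poll interval and the number of identical polls that count as a stall are judgment calls for the test; a session that is simply idle because everything is synced would also look unchanged, so the check is only meaningful while data is known to be pending.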

Additional info/output:
http://pastebin.test.redhat.com/636122

Comment 2 Dave 2018-08-23 19:21:32 UTC

Created attachment 1478317 [details]
sosreport

Comment 6 Dave 2018-10-22 13:47:01 UTC

@Bipin Unfortunately, CS&S doesn't have the resources to reproduce this issue at this time. Is there a place in the documentation, for debugging purposes, that defines when writes are considered acknowledged?

Comment 9 Sahina Bose 2019-10-24 11:47:07 UTC

Closing this as there's no clear reproducer and not enough detail to move forward.
Please reopen if you can provide these.