Bug 1293332 - [geo-rep+tiering]: Hot tier bricks changelogs reports rsync failure
Status: ON_QA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 3.1
Hardware: x86_64 Linux
Priority: high  Severity: high
Target Release: RHGS 3.4.0
Assigned To: Aravinda VK
QA Contact: Rochelle
Keywords: rebase, ZStream
Depends On: 1572043 1577627 1581047
Blocks: 1503134
Reported: 2015-12-21 07:51 EST by Rahul Hinduja
Modified: 2018-05-22 02:49 EDT
CC: 8 users

Fixed In Version: glusterfs-3.12.2-1
Doc Type: Bug Fix
Type: Bug
Attachments: None
Description Rahul Hinduja 2015-12-21 07:51:32 EST
Description of problem:
=======================

On a tiered volume setup, the changelogs of all hot tier bricks report "incomplete sync" and are retried.

In reality the data is completely synced to the slave, but the sync takes far longer on a tiered volume than on a normal volume. These retries can slow the sync process further, since the rsync retries happen in batches.
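Despite the repeated warnings, the slave does end up consistent. As a cross-check, the session state can be inspected with the geo-rep status CLI; a sketch below, with the master/slave endpoints taken from the mounts in the size comparison that follows (the exact columns of "status detail" vary by glusterfs version):

# Reports per-brick session status, including files/data pending sync
gluster volume geo-replication master 10.70.37.99::slave status detail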

Size comparison:
===============

[root@mia test]# df -h 
Filesystem                 Size  Used Avail Use% Mounted on
10.70.37.99:/slave         597G   26G  572G   5% /mnt/slave
10.70.37.165:/master       746G   26G  721G   4% /mnt/master
[root@mia test]#

Log Snippet:
============

[2015-12-21 11:37:07.551009] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
[2015-12-21 11:37:25.350402] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
[2015-12-21 11:37:40.441261] I [master(/rhs/brick1/ct-b4):571:crawlwrap] _GMaster: 1 crawls, 30 turns
[2015-12-21 11:37:40.561965] I [master(/rhs/brick1/ct-b4):1131:crawl] _GMaster: slave's time: (1450694460, 0)
[2015-12-21 11:37:42.528250] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
[2015-12-21 11:51:24.521485] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192


Version-Release number of selected component (if applicable):
==============================================================

glusterfs-3.7.5-12.el7rhgs.x86_64

How reproducible:
=================

Always

Steps to Reproduce:
===================

1. Create a master cluster of 6 nodes
2. Create a slave cluster of 6 nodes
3. Create and start the master volume (tiered: 3x2 cold tier and 2x2 hot tier)
4. Create and start the slave volume (4x2)
5. Enable quota on the master volume
6. Enable shared storage on the master volume
7. Set up a geo-rep session between the master and slave volumes (a CLI sketch for steps 3-7 follows this list)
8. Mount the master volume on a client
9. Create data from the master client (run from within the master mount); the following were used:

> crefi --multi -n 50 -b 5 -d 5 --max=1024k --min=5k --random -T 5 -t text -I 5 --fop=create /mnt/master
> for i in {1..10}; do dd if=/dev/zero of=rs.$i bs=10M count=100 ; done
> for i in {1..999}; do dd if=/dev/zero of=file.$i bs=2M count=10 ; done

10. Monitor the geo-rep logs
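A minimal CLI sketch of steps 3-7, assuming glusterfs 3.7-era tiering syntax; the host names (m1..m6) are placeholders, while the brick paths and slave endpoint are taken from the logs and mounts above:

# Step 3: cold tier as 3x2 distribute-replicate, then hot tier as 2x2 on top
gluster volume create master replica 2 \
    m1:/rhs/brick1/ct-b1 m2:/rhs/brick1/ct-b2 m3:/rhs/brick1/ct-b3 \
    m4:/rhs/brick1/ct-b4 m5:/rhs/brick1/ct-b5 m6:/rhs/brick1/ct-b6
gluster volume start master
gluster volume attach-tier master replica 2 \
    m1:/rhs/brick3/hot-b1 m2:/rhs/brick3/hot-b2 \
    m3:/rhs/brick3/hot-b3 m4:/rhs/brick3/hot-b4

# Steps 5 and 6: quota and shared storage on the master side
gluster volume quota master enable
gluster volume set all cluster.enable-shared-storage enable

# Step 7: create and start the geo-rep session
gluster system:: execute gsec_create
gluster volume geo-replication master 10.70.37.99::slave create push-pem
gluster volume geo-replication master 10.70.37.99::slave start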

Actual results:
===============

> Incomplete sync errors
> Changelog retries
> Rsync syncing very slowly

Additional info:
================

A similar load was run earlier on a regular volume, where the sync completed in about 15-20 minutes, whereas on the tiered volume it took a few hours. This comparison is against historical data and will be updated after retrying on a normal volume.
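Beyond the df comparison above, content equality of the two mounts can be verified with a checksum pass; a sketch below, assuming both volumes are mounted as in the size comparison (GNU coreutils only; a tool such as arequal-checksum, where available, serves the same purpose):

# Identical final hashes imply identical file names and contents on both sides
cd /mnt/master && find . -type f -print0 | sort -z | xargs -0 md5sum | md5sum
cd /mnt/slave  && find . -type f -print0 | sort -z | xargs -0 md5sum | md5sum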
Comment 6 Kotresh HR 2017-09-21 15:09:49 EDT
The patch is already merged upstream and is in 3.12, hence moving it to POST.

https://review.gluster.org/#/c/16010/
