Bug 1293332

Summary: [geo-rep+tiering]: Hot tier bricks changelogs reports rsync failure
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication
Assignee: Aravinda VK <avishwan>
Status: CLOSED ERRATA
QA Contact: Rochelle <rallan>
Severity: high
Priority: high
Version: rhgs-3.1
CC: amukherj, avishwan, chrisw, csaba, khiremat, nchilaka, nlevinki, sheggodu
Keywords: ZStream
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard: rebase
Fixed In Version: glusterfs-3.12.2-1
Doc Type: Bug Fix
Type: Bug
Last Closed: 2018-09-04 06:27:31 UTC
Bug Depends On: 1572043, 1577627, 1581047, 1597563
Bug Blocks: 1503134

Description Rahul Hinduja 2015-12-21 12:51:32 UTC
Description of problem:
=======================

On a tiered volume setup, the changelogs for all hot-tier bricks report incomplete sync and are repeatedly retried.

In reality the data does sync completely to the slave, but it takes far longer on a tiered volume than on a normal volume. These retries may themselves slow the sync process, since the rsync retries happen in batches.
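
One quick sanity check that the data has in fact reached the slave, beyond the df comparison below, is to compare file counts across the two mounts (a sketch; the mount points match the ones shown in this report):

> find /mnt/master -type f | wc -l
> find /mnt/slave -type f | wc -l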

Size comparison:
===============

[root@mia test]# df -h 
Filesystem                 Size  Used Avail Use% Mounted on
10.70.37.99:/slave         597G   26G  572G   5% /mnt/slave
10.70.37.165:/master       746G   26G  721G   4% /mnt/master
[root@mia test]#

Log Snippet:
============

[2015-12-21 11:37:07.551009] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
[2015-12-21 11:37:25.350402] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
[2015-12-21 11:37:40.441261] I [master(/rhs/brick1/ct-b4):571:crawlwrap] _GMaster: 1 crawls, 30 turns
[2015-12-21 11:37:40.561965] I [master(/rhs/brick1/ct-b4):1131:crawl] _GMaster: slave's time: (1450694460, 0)
[2015-12-21 11:37:42.528250] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
[2015-12-21 11:51:24.521485] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
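
The retries above can be quantified per brick by grepping the geo-rep master logs (a sketch; the exact log directory is session-specific, so this path is an assumption):

> grep -c "incomplete sync, retrying changelogs" /var/log/glusterfs/geo-replication/mastervol/*.log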


Version-Release number of selected component (if applicable):
==============================================================

glusterfs-3.7.5-12.el7rhgs.x86_64

How reproducible:
=================

Always

Steps to Reproduce:
===================

1. Create a master cluster from 6 nodes
2. Create a slave cluster from 6 nodes
3. Create and start the master volume (tiered: 3x2 cold tier and 2x2 hot tier; see the command sketch after this list)
4. Create and start the slave volume (4x2)
5. Enable quota on the master volume
6. Enable shared storage on the master volume
7. Set up a geo-rep session between the master and slave volumes
8. Mount the master volume on a client
9. Create data from the master client; the following was used:

> crefi --multi -n 50 -b 5 -d 5 --max=1024k --min=5k --random -T 5 -t text -I 5 --fop=create /mnt/master
> for i in {1..10}; do dd if=/dev/zero of=rs.$i bs=10M count=100 ; done
> for i in {1..999}; do dd if=/dev/zero of=file.$i bs=2M count=10 ; done

10. Monitor the geo-rep logs
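
For reference, steps 3 through 8 map roughly to the commands below. This is a sketch only: the host names (node1..node6, snode1..snode6), volume names (mastervol, slavevol), brick paths, and mount point are illustrative assumptions, not taken from this report.

> gluster volume create mastervol replica 2 node{1..6}:/bricks/cold-b1      # step 3: 3x2 cold tier
> gluster volume start mastervol
> gluster volume tier mastervol attach replica 2 node{1..4}:/bricks/hot-b1  # step 3: 2x2 hot tier
> gluster volume create slavevol replica 2 snode{1..6}:/bricks/slave-b1 snode{1,2}:/bricks/slave-b2   # step 4: 4x2
> gluster volume start slavevol
> gluster volume quota mastervol enable                                     # step 5
> gluster volume set all cluster.enable-shared-storage enable               # step 6
> gluster system:: execute gsec_create                                      # step 7 (passwordless SSH to the slave assumed)
> gluster volume geo-replication mastervol snode1::slavevol create push-pem
> gluster volume geo-replication mastervol snode1::slavevol start
> mount -t glusterfs node1:/mastervol /mnt/master                           # step 8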

Actual results:
===============

> Incomplete sync errors
> Changelog retries
> Very slow rsync

Additional info:
================

A similar load was tried earlier on a regular (non-tiered) volume, where the sync completed in about 15-20 minutes; on the tiered volume it took a few hours. This comparison is against historical data and will be updated after retrying on a normal volume.
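
One crude way to time such a run for comparison (a sketch; the file-count equality check is an illustrative assumption and is only meaningful after the client workload has stopped):

> time bash -c 'until [ "$(find /mnt/master -type f | wc -l)" -eq "$(find /mnt/slave -type f | wc -l)" ]; do sleep 60; done'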

Comment 6 Kotresh HR 2017-09-21 19:09:49 UTC
The patch is already merged upstream and is in 3.12, hence moving it to POST.

https://review.gluster.org/#/c/16010/

Comment 12 errata-xmlrpc 2018-09-04 06:27:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607