Description of problem:
=======================
On a tiered volume setup, all the hot bricks changelogs reports incomplete sync, and retrying.
In actual the data is completely sync to slave but it took very long in compare to normal volume vs tier volume. These retrial might slow the sync process as the rsync retrial happens in a batch.
Size comparison:
===============
[root@mia test]# df -h
Filesystem Size Used Avail Use% Mounted on
10.70.37.99:/slave 597G 26G 572G 5% /mnt/slave
10.70.37.165:/master 746G 26G 721G 4% /mnt/master
[root@mia test]#
Log Snippet:
============
[2015-12-21 11:37:07.551009] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
[2015-12-21 11:37:25.350402] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
[2015-12-21 11:37:40.441261] I [master(/rhs/brick1/ct-b4):571:crawlwrap] _GMaster: 1 crawls, 30 turns
[2015-12-21 11:37:40.561965] I [master(/rhs/brick1/ct-b4):1131:crawl] _GMaster: slave's time: (1450694460, 0)
[2015-12-21 11:37:42.528250] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
[2015-12-21 11:51:24.521485] W [master(/rhs/brick3/hot-b2):1077:process] _GMaster: incomplete sync, retrying changelogs: CHANGELOG.1450696895 CHANGELOG.1450696929 CHANGELOG.1450696944 CHANGELOG.1450696961 CHANGELOG.1450696976 CHANGELOG.1450696995 CHANGELOG.1450697011 CHANGELOG.1450697028 CHANGELOG.1450697047 CHANGELOG.1450697063 CHANGELOG.1450697079 CHANGELOG.1450697094 CHANGELOG.1450697143 CHANGELOG.1450697160 CHANGELOG.1450697177 CHANGELOG.1450697192
Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.5-12.el7rhgs.x86_64
How reproducible:
=================
Always
Steps to Reproduce:
===================
1. Create Master cluster from 6 nodes
2. Create Slave cluster from 6 nodes
3. Create and Start master volume (Tiered: cold-tier 3x2 and hot-tier 2x2)
4. Create and Start slave volume (4x2)
5. Enable quota on master volume
6. Enable shared storage on master volume
7. Setup geo-rep session between master and slave volume
8. Mount master volume on client
9. Create data from master client, used the following:
> crefi --multi -n 50 -b 5 -d 5 --max=1024k --min=5k --random -T 5 -t text -I 5 --fop=create /mnt/master
> for i in {1..10}; do dd if=/dev/zero of=rs.$i bs=10M count=100 ; done
> for i in {1..999}; do dd if=/dev/zero of=file.$i bs=2M count=10 ; done
10. Monitor georep logs
Actual results:
===============
> Incomplete sync errors
> Changelog retrial
> Rsync being very slow
Additional info:
================
Something similar load has been tried on regular volume earlier, causing the sync to happen in about 15-20 mins. Whereas on tier volume it took few hours. This comparison is with historical data and will update after retrying again on normal volume
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2018:2607