QE has hit this issue again. We analysed the issue and the following is the root cause.

Problem:
1. Just after upgrade, geo-rep is not syncing data.
2. After upgrade, geo-rep missed syncing a few files to the slave.

Cause:

RHEL7 uses chronyd to synchronize time. But when the node was rebooted, it came up with a wrong (future) time and was adjusted back to the correct time within an interval of around 30 secs.

Mar 9 22:48:21 dhcp37-47 dracut: *** Resolving executable dependencies ***
Mar 9 17:11:21 dhcp37-47 chronyd[987]: Selected source 96.126.105.86
Mar 9 17:11:21 dhcp37-47 chronyd[987]: System clock wrong by -20220.700769 seconds, adjustment started
Mar 9 17:11:21 dhcp37-47 systemd: Time has been changed
Mar 9 17:11:21 dhcp37-47 chronyd[987]: System clock was stepped by -20220.700769 seconds

The changelog xlator uses the time() syscall to get the current epoch time, adds 15 secs to form an abstime, and waits on 'pthread_cond_timedwait'. So every 15 secs this wait expires and a new changelog is rolled over as 'CHANGELOG.<TSTAMP>', where TSTAMP is the current time obtained using 'gettimeofday'.
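For context, the waiting pattern described above boils down to something like the sketch below. This is a minimal illustrative sketch, not the actual changelog xlator code; the point is that pthread_cond_timedwait() takes an absolute deadline which, by default, is measured against CLOCK_REALTIME, so a deadline computed from a wrong wall clock is honored literally even after the clock is stepped back.

    #include <pthread.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    /* Illustrative rollover wait: sleep ~15 secs between changelog
       rollovers using an absolute CLOCK_REALTIME deadline. */
    static void
    rollover_wait (void)
    {
            struct timespec abstime = {0, };

            /* time() reads the wall clock; if the node booted ~5 hrs
               in the future, so is this deadline. */
            abstime.tv_sec = time (NULL) + 15;
            abstime.tv_nsec = 0;

            pthread_mutex_lock (&lock);
            /* The wait expires only when CLOCK_REALTIME reaches
               abstime.  When chronyd steps the clock back ~5 hrs,
               that adds ~5 hrs of extra waiting: exactly Problem 1. */
            pthread_cond_timedwait (&cond, &lock, &abstime);
            pthread_mutex_unlock (&lock);
    }

A common code-level remedy for this class of problem is to create the condition variable with pthread_condattr_setclock(&attr, CLOCK_MONOTONIC) and compute the deadline from clock_gettime(CLOCK_MONOTONIC, ...); the fix pursued below instead takes the service-ordering route.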
Problem 1 occurs because:
When the brick came up, the changelog xlator got the future time when it called the time() syscall, so it waits on an abstime of (future time + 15 sec) in pthread_cond_timedwait(). Even though chronyd corrects the system time, pthread_cond_timedwait() wakes up only at (future time + 15 sec), in this case around 5 hrs ahead. So the changelog is not rolled over, and geo-rep is not syncing anything; it will sync only after 5 hrs, when it gets the changelog.

Problem 2 occurs because:
When the brick came up, the changelog xlator got the future time for around 30 secs, so two changelogs got rolled over with future timestamps. By the time the time() syscall was called again, chronyd had corrected the time, and the next changelog got rolled over with the correct time. In this case the changelog did not wait until the future time to roll over, but the changelogs were generated in the following order:
CHANGELOG.<t1>
CHANGELOG.<t5>
CHANGELOG.<t10>
CHANGELOG.<t100>
CHANGELOG.<t105>
CHANGELOG.<t15>
CHANGELOG.<t20>
The htime index file therefore does not list the changelogs in ascending order, and the history crawl misses a few of them. In the above case, if <t1> to <t20> is asked for, it returns t1, t5 and t10, and the rest are missed.
The above scenario happened on one of the nodes, and hence geo-rep missed syncing a few files. The htime file and geo-rep log below confirm it.

/rhs/brick2/b5/.glusterfs/changelogs/CHANGELOG.1489059602
/rhs/brick2/b5/.glusterfs/changelogs/CHANGELOG.1489059617   <<<< Before reboot --- correct time
/rhs/brick2/b5/.glusterfs/changelogs/CHANGELOG.1489079883   <<<< After reboot --- future time
/rhs/brick2/b5/.glusterfs/changelogs/CHANGELOG.1489079898   <<<< Future time
/rhs/brick2/b5/.glusterfs/changelogs/CHANGELOG.1489063441   <<<< Correct time
/rhs/brick2/b5/.glusterfs/changelogs/CHANGELOG.1489063456
/rhs/brick2/b5/.glusterfs/changelogs/CHANGELOG.1489063471
/rhs/brick2/b5/.glusterfs/changelogs/CHANGELOG.1489063486

[2017-03-09 12:46:42.291949] I [master(/rhs/brick1/b2):461:mgmt_lock] _GMaster: Didn't get lock : /rhs/brick1/b2 : Becoming PASSIVE
[2017-03-09 12:46:42.307940] I [master(/rhs/brick2/b5):468:mgmt_lock] _GMaster: Got lock : /rhs/brick2/b5 : Becoming ACTIVE
[2017-03-09 12:46:42.341409] I [master(/rhs/brick2/b5):1167:crawl] _GMaster: starting history crawl... turns: 1, stime: (1489052939, 0), etime: 1489063602
[2017-03-09 12:46:43.348171] I [master(/rhs/brick2/b5):1196:crawl] _GMaster: slave's time: (1489052939, 0)
[2017-03-09 12:47:42.480091] I [master(/rhs/brick1/b2):534:crawlwrap] _GMaster: 0 crawls, 0 turns
[2017-03-09 12:48:42.684154] I [master(/rhs/brick1/b2):534:crawlwrap] _GMaster: 0 crawls, 0 turns
[2017-03-09 12:49:42.859257] I [master(/rhs/brick1/b2):534:crawlwrap] _GMaster: 0 crawls, 0 turns
[2017-03-09 12:50:43.73237] I [master(/rhs/brick1/b2):534:crawlwrap] _GMaster: 0 crawls, 0 turns
[2017-03-09 12:51:43.398479] I [master(/rhs/brick1/b2):534:crawlwrap] _GMaster: 0 crawls, 0 turns
[2017-03-09 12:52:43.572623] I [master(/rhs/brick1/b2):534:crawlwrap] _GMaster: 0 crawls, 0 turns
[2017-03-09 12:53:43.825984] I [master(/rhs/brick1/b2):534:crawlwrap] _GMaster: 0 crawls, 0 turns
[2017-03-09 12:53:46.447371] I [master(/rhs/brick2/b5):1210:crawl] _GMaster: finished history crawl syncing, endtime: 1489079883, stime: (1489059616, 0)
[2017-03-09 12:53:46.455058] I [master(/rhs/brick2/b5):512:crawlwrap] _GMaster: primary master with volume id bdb045f2-56e8-4de8-9378-ebbec77043b7 ...

SOLUTION:
chronyd should be started before glusterd starts the bricks, and we should somehow make sure it has already fixed the time. Geo-rep is just the victim here; this could cause issues for any component that uses pthread_cond_timedwait.
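Since the proposed fix is at the service-ordering level, here is a rough sketch of what such an ordering dependency looks like in a systemd unit. This is illustrative only and is not copied from the posted patch. Note that After= only sequences unit startup; it does not by itself guarantee that chronyd has already stepped the clock, which is why the comment above says we should make sure it has fixed the time.

    # /usr/lib/systemd/system/glusterd.service (illustrative excerpt)
    [Unit]
    Description=GlusterFS, a clustered file-system server
    # Start glusterd only after the time-sync daemon has been started,
    # so bricks are less likely to see a wildly wrong wall clock.
    After=network.target chronyd.service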
REVIEW: https://review.gluster.org/16889 (glusterd.service: Ensure chronyd is started before GlusterD) posted (#1) for review on master by Atin Mukherjee (amukherj)
REVIEW: https://review.gluster.org/16889 (glusterd.service: Ensure chronyd is started before GlusterD) posted (#2) for review on master by Atin Mukherjee (amukherj)
It turned out to be a setup issue: the hypervisors used in the testing were not time-synced, which is a must for gluster storage nodes.