Description of problem:
=======================

While running the automation sanity check on build glusterfs-geo-replication-3.8.4-36.el7rhgs.x86_64, one of the workers crashed with the following traceback:

[2017-07-29 16:26:24.323477] I [master(/bricks/brick1/master_brick9):1132:crawl] _GMaster: slave's time: (1501345566, 0)
[2017-07-29 16:26:29.850406] E [syncdutils(/bricks/brick1/master_brick9):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1582, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1143, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1118, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1001, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 829, in process_change
    logging.info(lf('Ignoring rmdir. Directory present in '
NameError: global name 'lf' is not defined
[2017-07-29 16:26:29.854490] I [syncdutils(/bricks/brick1/master_brick9):237:finalize] <top>: exiting.

Looking at the trace, the crash happens while processing an rmdir. The worker crashed and became passive. Syncs still succeed via the other active worker. Looks like a race.

Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-36.el7rhgs.x86_64

How reproducible:
=================

Seen once so far; will retry and update if it occurs a second time.

Steps Carried:
==============

1. Run the automation sanity suite, which does {create,chmod,chown,chgrp,hardlink,symlink,truncate,rename,remove} on Master and Slave (both EC volumes)

Actual results:
===============

Worker crashed

Expected results:
=================

Workers should not crash

Additional info:
Structured logging support [1] was introduced and merged upstream to improve debuggability and logging. That change was not taken into downstream 3.3. However, a patch that uses the structured logging support did sneak into downstream 3.3, which is why `lf` is undefined there: the worker will always crash when it hits that code path. Hence this is a blocker candidate and should be fixed. The fix is easy.

[1] https://review.gluster.org/#/c/17551/
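To illustrate why this only crashes on one code path: Python resolves a global name when the expression runs, not when the module is loaded, so a call to an undefined helper stays hidden until that branch executes. The sketch below is a stand-alone illustration, not the gsyncd code. The function names, the `dir_present` flag and the log message are hypothetical, and the `_fixed` variant is only one plausible shape of a fix (the actual downstream patch is not shown in this report). Under Python 3 the error reads "name 'lf' is not defined" rather than the Python 2 wording in the traceback.

```python
import logging


def process_rmdir_broken(gfid, dir_present):
    """Hypothetical stand-in for the failing branch in process_change."""
    if dir_present:
        # `lf` is never defined or imported in this module, so this line
        # raises NameError, but only when this branch is actually taken.
        logging.info(lf("Ignoring rmdir (illustrative message)", gfid=gfid))  # noqa: F821
        return "ignored"
    return "processed"


def process_rmdir_fixed(gfid, dir_present):
    """Same branch, logging with a plain format string instead of `lf`."""
    if dir_present:
        logging.info("Ignoring rmdir (illustrative message): gfid=%s", gfid)
        return "ignored"
    return "processed"


if __name__ == "__main__":
    # The common path works, which is how the bug can go unnoticed.
    print(process_rmdir_broken("gfid-1", dir_present=False))
    try:
        process_rmdir_broken("gfid-1", dir_present=True)
    except NameError as err:
        print("crash on the rare path:", err)
    print(process_rmdir_fixed("gfid-1", dir_present=True))
```

A static checker such as pyflakes reports undefined names like this without needing to execute the rare branch.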
This is a downstream-only patch. Patch link: https://code.engineering.redhat.com/gerrit/113848
Based on comment 3 and a discussion with Rahul, setting the blocker flag to '?'.
How to reproduce this issue:

1. touch dir1 => This is to find which subvolume the file hashes to
2. rm dir1
3. mkdir dir1
4. Let it sync to slave
5. Stop the geo-replication
6. Attach gdb to the mount pid and set a breakpoint at dht_rmdir_lock_cbk
7. continue
8. rmdir dir1
9. Kill the complete hashed subvolume (identified in step 1)
10. continue
11. Start volume with force (bring back bricks)
12. ls /mnt/dir1
13. Wait for dht heal
14. Start the geo-replication

Was able to reproduce this issue on build 3.8.4-33.

Verified with build: glusterfs-geo-replication-3.8.4-37.el7rhgs.x86_64 => No crash is seen with the above-mentioned steps.

Moving the bug to verified state.
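Checking for "no crash" after the steps above comes down to looking for FAIL entries in the worker log. A minimal sketch of automating that check follows, assuming the log line layout seen in the description of this bug. The function name, the regular expression and the embedded sample are illustrative and not part of gsyncd.

```python
import re

# Header layout assumed from the log lines quoted in this report:
# [timestamp] LEVEL [module(brick):lineno:function] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<module>\w+)\((?P<brick>[^)]*)\):(?P<lineno>\d+):(?P<func>\w+)\] "
    r"(?P<rest>.*)$"
)


def find_worker_failures(log_text):
    """Return one dict per 'E ... FAIL:' entry, with its traceback lines."""
    failures = []
    current = None
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw)
        if match:
            # A new log header ends any traceback being collected.
            current = None
            is_fail = match.group("rest").rstrip().endswith("FAIL:")
            if match.group("level") == "E" and is_fail:
                current = {
                    "timestamp": match.group("ts"),
                    "brick": match.group("brick"),
                    "traceback": [],
                }
                failures.append(current)
        elif current is not None and raw.strip():
            # Continuation lines (the traceback) carry no log header.
            current["traceback"].append(raw.strip())
    return failures


# Abbreviated sample built from the lines quoted in the description.
SAMPLE = "\n".join([
    "[2017-07-29 16:26:24.323477] I [master(/bricks/brick1/master_brick9):1132:crawl] _GMaster: slave's time: (1501345566, 0)",
    "[2017-07-29 16:26:29.850406] E [syncdutils(/bricks/brick1/master_brick9):296:log_raise_exception] <top>: FAIL: ",
    "Traceback (most recent call last):",
    "NameError: global name 'lf' is not defined",
    "[2017-07-29 16:26:29.854490] I [syncdutils(/bricks/brick1/master_brick9):237:finalize] <top>: exiting.",
])

if __name__ == "__main__":
    for failure in find_worker_failures(SAMPLE):
        print(failure["timestamp"], failure["brick"], failure["traceback"][-1])
```

An empty result for the log written after step 14 would correspond to the "no crash" outcome reported for the fixed build.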
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774