Description of problem:
The geo-replication worker hangs and does not switch to 'Changelog Crawl'. No data from the master gets synced to the slave. Geo-replication works fine, however, with xsync as the change_detector.

Version-Release number of selected component (if applicable): mainline

How reproducible: Always

Steps to Reproduce:
1. Set up a geo-replication session between two gluster volumes.

Actual results:
Geo-rep hangs and no data is synced.

Expected results:
Geo-rep should not hang and should sync data.

Additional info:

Analysis:
The culprit turned out to be patch https://review.gluster.org/#/c/17779/, which got into mainline. This patch causes a hang in the 'libgfchangelog' library, which geo-replication uses to fetch the changelogs from the brick back end. The backtrace of the hang is as below.

Thread 1 (Thread 0x7ffff7fe5700 (LWP 11895)):
#0  pthread_spin_lock () at ../sysdeps/x86_64/nptl/pthread_spin_lock.S:32
#1  0x00007ffff7911af4 in mem_get (mem_pool=0x7ffff7bc5588 <pools+200>) at mem-pool.c:758
#2  0x00007ffff7911791 in mem_get0 (mem_pool=0x7ffff7bc5588 <pools+200>) at mem-pool.c:657
#3  0x00007ffff78dcd80 in log_buf_new () at logging.c:284
#4  0x00007ffff78e0c0a in _gf_msg_internal (domain=0x602b80 "gfchangelog", file=0x7ffff7bd50cd "gf-changelog.c", function=0x7ffff7bd52a0 <__FUNCTION__.17176> "gf_changelog_register_generic", line=552, level=GF_LOG_INFO, errnum=0, msgid=132028, appmsgstr=0x7fffffffd018, callstr=0x0, graph_id=0) at logging.c:1961
#5  0x00007ffff78e110c in _gf_msg (domain=0x602b80 "gfchangelog", file=0x7ffff7bd50cd "gf-changelog.c", function=0x7ffff7bd52a0 <__FUNCTION__.17176> "gf_changelog_register_generic", line=552, level=GF_LOG_INFO, errnum=0, trace=0, msgid=132028, fmt=0x7ffff7bd51c8 "Registering brick: %s [notify filter: %d]") at logging.c:2077
#6  0x00007ffff7bcd8dd in gf_changelog_register_generic (bricks=0x7fffffffd1c0, count=0, ordered=1, logfile=0x400c0d "/tmp/change.log", lvl=9, xl=0x0) at gf-changelog.c:549
#7  0x00007ffff7bcda84 in gf_changelog_register (brick_path=0x400c2a "/bricks/brick0/b0", scratch_dir=0x400c1d "/tmp/scratch", log_file=0x400c0d "/tmp/change.log", log_level=9, max_reconnects=5) at gf-changelog.c:623
#8  0x00000000004009ff in main (argc=1, argv=0x7fffffffe328) at get-changes.c:49

The call flow of the first mem_get is:

    mem_get --> mem_get_pool_list --> pthread_getspecific(pool_key)

pthread_getspecific should have returned NULL, because pool_key is never set: neither mem_pools_init_early nor mem_pools_init_late is called in this code path. Instead it returned some value, so the spin-lock initialization never happened, causing this hang. According to the man page of pthread_getspecific:

    "The effect of calling pthread_getspecific() or pthread_setspecific()
    with a key value not obtained from pthread_key_create() or after key
    has been deleted with pthread_key_delete() is undefined."

So should we not have the 'if' condition below?

    void
    mem_pools_init_early (void)
    {
            pthread_mutex_lock (&init_mutex);
            /* Use a pthread_key destructor to clean up when a thread exits.
             *
             * We won't increase init_count here, that is only done when the
             * pool_sweeper thread is started too. */
            if (pthread_getspecific (pool_key) == NULL) {
                    /* key has not been created yet */
                    if (pthread_key_create (&pool_key, pool_destructor) != 0) {
                            gf_log ("mem-pool", GF_LOG_CRITICAL,
                                    "failed to initialize mem-pool key");
                    }
            }
            pthread_mutex_unlock (&init_mutex);
    }

And is it now mandatory to call mem_pools_init_early in all code paths, such as libgfchangelog?
Easy reproducer:
1. Create a gluster volume and enable the changelog:
   gluster vol set <volname> changelog on
   gluster vol set <volname> changelog.rollover-time 5
2. Export the path to libgfchangelog.pc:
   export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig/"
3. Compile get-changes.c, located at "xlators/features/changelog/lib/examples/c/get-changes.c", as below:
   gcc -g -o getchanges `pkg-config --cflags libgfchangelog` get-changes.c `pkg-config --libs libgfchangelog`
4. Run getchanges.
5. In another terminal, mount the gluster volume and touch a file.

After 5 seconds, getchanges should print the path of the changelog file. Instead, it hangs and never prints the changelog path.
REVIEW: https://review.gluster.org/17899 (mem-pool: track and verify initialization state) posted (#1) for review on master by Niels de Vos (ndevos)
REVIEW: https://review.gluster.org/17900 (changelog: add mem-pool initialization) posted (#1) for review on master by Niels de Vos (ndevos)
COMMIT: https://review.gluster.org/17899 committed in master by Jeff Darcy (jeff.us)
------
commit b5fa5ae05f73e03023db37e43fb203267b719160
Author: Niels de Vos <ndevos>
Date:   Wed Jul 26 16:16:11 2017 +0200

    mem-pool: track and verify initialization state

    It is possible that pthread_getspecific() returns a non-NULL value in
    case the pthread_key_t is not initialized. The behaviour of
    pthread_getspecific() is not defined in this case. This can happen
    when applications use mem-pools from libglusterfs.so, but did not
    call mem_pools_init_early().

    By tracking the status of the mem-pools initialization, it is now
    possible to prevent calling pthread_getspecific() in case the
    pthread_key_t is not initialized. In future, we might want to extend
    this further to facilitate debugging.

    Reported-by: Kotresh HR <khiremat>
    Tested-by: Jiffin Tony Thottan <jthottan>
    Change-Id: I6255419fe05792dc78b1eaff55bc008fc5ff3933
    Fixes: 1e8e62640 ("mem-pool: initialize pthread_key_t pool_key in mem_pool_init_early()")
    BUG: 1475255
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: https://review.gluster.org/17899
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>
    Reviewed-by: Jeff Darcy <jeff.us>
COMMIT: https://review.gluster.org/17900 committed in master by Jeff Darcy (jeff.us)
------
commit 1bc3cd9bd59e3826fd14fc239322f039d7a814da
Author: Niels de Vos <ndevos>
Date:   Wed Jul 26 16:17:03 2017 +0200

    changelog: add mem-pool initialization

    With recent changes to the mem-pool initialization,
    mem_pools_init_early() and mem_pools_init_late() need to be called
    before mem_get() is usable. This change has been tested manually with
    the included test from
    xlators/features/changelog/lib/examples/c/get-changes.c.

    Change-Id: I139563c4ff78a566cef2ff7e3da2ee10306def92
    BUG: 1475255
    Reported-by: Kotresh HR <khiremat>
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: https://review.gluster.org/17900
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jeff.us>
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/