Bug 1475255 - [Geo-rep]: Geo-rep hangs in changelog mode
[Geo-rep]: Geo-rep hangs in changelog mode
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: geo-replication (Show other bugs)
mainline
Unspecified Unspecified
unspecified Severity high
: ---
: ---
Assigned To: Niels de Vos
: Triaged
Depends On:
Blocks: 1475258
  Show dependency treegraph
 
Reported: 2017-07-26 05:56 EDT by Kotresh HR
Modified: 2017-12-08 12:35 EST (History)
2 users (show)

See Also:
Fixed In Version: glusterfs-3.13.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1475258 (view as bug list)
Environment:
Last Closed: 2017-12-08 12:35:17 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Kotresh HR 2017-07-26 05:56:05 EDT
Description of problem:
Geo-replication worker hangs and doesn't switch to 'Changelog Crawl'.
No data from master gets synced to slave. But the geo-rep works fine
with xsync as change_detector.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Setup geo-replication session between two gluster volumes


Actual results:
Geo-rep hangs and no data is synced

Expected results:
Geo-rep should not hang and should sync data

Additional info:
Analysis:
   It was found out that the culprit was patch "https://review.gluster.org/#/c/17779/" which got into mainline. This patch is causing a hang in 'libgfchangelog' library which the geo-replication uses to get the changelogs from brick back end.

The back trace of the hang is as below.

Thread 1 (Thread 0x7ffff7fe5700 (LWP 11895)):
#0  pthread_spin_lock () at ../sysdeps/x86_64/nptl/pthread_spin_lock.S:32
#1  0x00007ffff7911af4 in mem_get (mem_pool=0x7ffff7bc5588 <pools+200>) at mem-pool.c:758
#2  0x00007ffff7911791 in mem_get0 (mem_pool=0x7ffff7bc5588 <pools+200>) at mem-pool.c:657
#3  0x00007ffff78dcd80 in log_buf_new () at logging.c:284
#4  0x00007ffff78e0c0a in _gf_msg_internal (domain=0x602b80 "gfchangelog", file=0x7ffff7bd50cd "gf-changelog.c", 
    function=0x7ffff7bd52a0 <__FUNCTION__.17176> "gf_changelog_register_generic", line=552, level=GF_LOG_INFO, errnum=0, msgid=132028, appmsgstr=0x7fffffffd018, callstr=0x0, graph_id=0)
    at logging.c:1961
#5  0x00007ffff78e110c in _gf_msg (domain=0x602b80 "gfchangelog", file=0x7ffff7bd50cd "gf-changelog.c", function=0x7ffff7bd52a0 <__FUNCTION__.17176> "gf_changelog_register_generic", 
    line=552, level=GF_LOG_INFO, errnum=0, trace=0, msgid=132028, fmt=0x7ffff7bd51c8 "Registering brick: %s [notify filter: %d]") at logging.c:2077
#6  0x00007ffff7bcd8dd in gf_changelog_register_generic (bricks=0x7fffffffd1c0, count=0, ordered=1, logfile=0x400c0d "/tmp/change.log", lvl=9, xl=0x0) at gf-changelog.c:549
#7  0x00007ffff7bcda84 in gf_changelog_register (brick_path=0x400c2a "/bricks/brick0/b0", scratch_dir=0x400c1d "/tmp/scratch", log_file=0x400c0d "/tmp/change.log", log_level=9, 
    max_reconnects=5) at gf-changelog.c:623
#8  0x00000000004009ff in main (argc=1, argv=0x7fffffffe328) at get-changes.c:49

The call flow of first mem_get is as below
mem_get-->mem_get_pool_list--> pthread_getspecific(pool_key)

pthread_getspecific should have returned NULL as the pool_key is not set because
mem_pools_init_early/mem_pools_init_late is not called in this code path. But it returned some value and hence spin lock initialization didn't happen causing this hang.

According to man page of pthread_getspecific

"The effect of calling pthread_getspecific()  or  pthread_setspecific()  with  a  key value  not  obtained  from  pthread_key_create()  or after key has been deleted with pthread_key_delete() is undefined."

So we should not be having this if condition below ? 
       
mem_pools_init_early (void)
{
        pthread_mutex_lock (&init_mutex);
        /* Use a pthread_key destructor to clean up when a thread exits.
         *
         * We won't increase init_count here, that is only done when the
         * pool_sweeper thread is started too.
         */
        if (pthread_getspecific (pool_key) == NULL) {
                /* key has not been created yet */
                if (pthread_key_create (&pool_key, pool_destructor) != 0) {
                        gf_log ("mem-pool", GF_LOG_CRITICAL,
                                "failed to initialize mem-pool key");
                }
        }
        pthread_mutex_unlock (&init_mutex);
}

And now is it mandatory to do mem_pool_init_early in all the code paths like libgfchangelog?
Comment 1 Kotresh HR 2017-07-26 06:22:33 EDT
Easy reproducer:

1. Create gluster volume and enable changelog
    'gluster vol set <volname> changelog on
    'gluster vol set <volname> changelog.rollover-time 5'
2. Export path to libgfchangelog.pc
    export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig/"  

3. Compile the get-changes.c located in "xlators/features/changelog/lib/examples/c/get-changes.c" as below.
gcc -g -o getchanges `pkg-config --cflags libgfchangelog` get-changes.c `pkg-config --libs libgfchangelog`

4. Run getchanges. 

5. In other terminal, mount the gluster volume and touch a file. After 5 secs,
   you should see the path of changelog file in getchanges. But it's hanging and pring the changelog path
Comment 2 Worker Ant 2017-07-27 11:10:13 EDT
REVIEW: https://review.gluster.org/17899 (mem-pool: track and verify initialization state) posted (#1) for review on master by Niels de Vos (ndevos@redhat.com)
Comment 3 Worker Ant 2017-07-27 11:10:17 EDT
REVIEW: https://review.gluster.org/17900 (changelog: add mem-pool initialization) posted (#1) for review on master by Niels de Vos (ndevos@redhat.com)
Comment 4 Worker Ant 2017-07-28 07:27:14 EDT
COMMIT: https://review.gluster.org/17899 committed in master by Jeff Darcy (jeff@pl.atyp.us) 
------
commit b5fa5ae05f73e03023db37e43fb203267b719160
Author: Niels de Vos <ndevos@redhat.com>
Date:   Wed Jul 26 16:16:11 2017 +0200

    mem-pool: track and verify initialization state
    
    It is possible that pthread_getspecific() returns a non-NULL value in
    case the pthread_key_t is not initialized. The behaviour for
    pthread_getspecific() is not defined in this case. This can happen when
    applications use mem-pools from libglusterfs.so, but did not call
    mem_pools_init_early().
    
    By tracking the status of the mem-pools initialization, it is now
    possible to prevent calling pthread_getspecific() in case the
    pthread_key_t is not initialized. In future, we might want to exend this
    more to faciliate debugging.
    
    Reported-by: Kotresh HR <khiremat@redhat.com>
    Tested-by: Jiffin Tony Thottan <jthottan@redhat.com>
    Change-Id: I6255419fe05792dc78b1eaff55bc008fc5ff3933
    Fixes: 1e8e62640 ("mem-pool: initialize pthread_key_t pool_key in mem_pool_init_early()")
    BUG: 1475255
    Signed-off-by: Niels de Vos <ndevos@redhat.com>
    Reviewed-on: https://review.gluster.org/17899
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: jiffin tony Thottan <jthottan@redhat.com>
    Reviewed-by: Jeff Darcy <jeff@pl.atyp.us>
Comment 5 Worker Ant 2017-07-29 16:55:22 EDT
COMMIT: https://review.gluster.org/17900 committed in master by Jeff Darcy (jeff@pl.atyp.us) 
------
commit 1bc3cd9bd59e3826fd14fc239322f039d7a814da
Author: Niels de Vos <ndevos@redhat.com>
Date:   Wed Jul 26 16:17:03 2017 +0200

    changelog: add mem-pool initialization
    
    With recent changes to the mem-pool initialization,
    mem_pools_init_early() and mem_pools_init_late() need to be called
    before mem_get() is usable.
    
    This change has been tested manually with the included test from
    xlators/features/changelog/lib/examples/c/get-changes.c.
    
    Change-Id: I139563c4ff78a566cef2ff7e3da2ee10306def92
    BUG: 1475255
    Reported-by: Kotresh HR <khiremat@redhat.com>
    Signed-off-by: Niels de Vos <ndevos@redhat.com>
    Reviewed-on: https://review.gluster.org/17900
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat@redhat.com>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Jeff Darcy <jeff@pl.atyp.us>
Comment 6 Shyamsundar 2017-12-08 12:35:17 EST
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.