Bug 1353529 - Multiple bricks could crash after invoking status command
Summary: Multiple bricks could crash after invoking status command
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 3.7.12
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-07 11:41 UTC by Oleksandr Natalenko
Modified: 2017-03-08 10:52 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-08 10:52:33 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
"thread apply all backtrace" output for 100% CPU usage (3.15 KB, application/x-xz)
2016-07-07 11:41 UTC, Oleksandr Natalenko
core file 1 (886.45 KB, application/x-xz)
2016-07-07 11:42 UTC, Oleksandr Natalenko
core file 2 (985.12 KB, application/x-xz)
2016-07-07 11:42 UTC, Oleksandr Natalenko

Description Oleksandr Natalenko 2016-07-07 11:41:46 UTC
Created attachment 1177253 [details]
"thread apply all backtrace" output for 100% CPU usage

Description of problem:

On a distributed-replicated volume (we did not test other layouts), multiple brick processes can crash under load while the "volume status clients" command is being run and the brick ports are being probed.

Version-Release number of selected component (if applicable):

CentOS 7.2, GlusterFS 3.7.12 with the following patches applied:

===
Jiffin Tony Thottan (1):
      gfapi : check the value "iovec" in glfs_io_async_cbk only for read

Kaleb S KEITHLEY (1):
      build: RHEL7 unpackaged files .../hooks/S57glusterfind-delete-post.{pyc,pyo}

Kotresh HR (1):
      changelog/rpc: Fix rpc_clnt_t mem leaks

Pranith Kumar K (1):
      features/index: Exclude gfid-type for '.', '..'

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Raghavendra Talur (1):
      gfapi: update count when glfs_buf_copy is used

Ravishankar N (1):
      afr:Don't wind reads for files in metadata split-brain

Soumya Koduri (1):
      gfapi/handleops: Avoid using glfd during create
===
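
(The list above is in "git shortlog" format; assuming the patches sit on a local branch on top of the v3.7.12 tag, it can be regenerated with something like the command below. The branch layout is an assumption, not something stated in this report.)

===
# Regenerate the patch summary from a local GlusterFS checkout
# (assumes the patched branch is currently checked out on top of v3.7.12)
git shortlog v3.7.12..HEAD
===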

How reproducible:

Reliably (see below).

Steps to Reproduce:

All the actions below were performed on one node. The other node of the replica was not used (except for maintaining the replica itself), and the bricks there did not crash.

1. create a distributed-replicated (or, we suspect, any other) volume and start it;
2. mount the volume on some client via FUSE;
3. find out which TCP ports are used by the volume's bricks on the host where the crash will be triggered;
4. start nmap'ing those ports in a loop: "while true; do nmap -Pn -p49163-49167 127.0.0.1; done";
5. start invoking status command in a loop: "while true; do sudo gluster volume status test; sudo gluster volume status test clients; done";
6. start generating some workload on the volume (we wrote lots of zero-length files and stat'ed them in parallel);
7. ...wait...
8. observe one or more bricks crash on the node where the status command is run (a consolidated sketch of these steps is shown after this list).
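
For convenience, here is a rough consolidation of the steps above into shell. The volume name "test" and the port range 49163-49167 come from steps 4 and 5; the hostnames, brick paths, mount point and file count are placeholders, not values taken from this report.

===
# Step 1: create and start a 2x2 distributed-replicated volume
# (hostnames and brick paths are placeholders)
gluster volume create test replica 2 \
    node1:/bricks/b1 node2:/bricks/b1 \
    node1:/bricks/b2 node2:/bricks/b2
gluster volume start test

# Step 2: mount the volume on a client via FUSE (mount point is a placeholder)
mount -t glusterfs node1:/test /mnt/test

# Step 3: find the TCP ports used by the bricks on the target node
gluster volume status test | grep '^Brick'

# Steps 4 and 5: probe the brick ports and poll status in parallel loops
while true; do nmap -Pn -p49163-49167 127.0.0.1; done &
while true; do gluster volume status test; gluster volume status test clients; done &

# Step 6: generate some workload (zero-length files written and stat'ed in parallel)
for i in $(seq 1 100000); do
    touch /mnt/test/f$i
    stat /mnt/test/f$i &
done
===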

Actual results:

Two variants:

1. the brick crashes and generates a core file;
2. the brick hangs, consuming 100% of CPU time.

Expected results:

The bricks should not crash, of course :).

Additional info:

If a brick crashes and generates a core file, gdb gives us the following stack trace:

===
#0  0x00007fefa9f1cda1 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007fefab7d8465 in str_to_data (value=value@entry=0x66726574737562eb <Address 0x66726574737562eb out of bounds>) at dict.c:904
#2  0x00007fefab7d9e16 in dict_set_str (this=this@entry=0x7fefababa048, key=key@entry=0x7fef8c225280 "client2896.hostname", 
    str=str@entry=0x66726574737562eb <Address 0x66726574737562eb out of bounds>) at dict.c:2224
#3  0x00007fef96d2e244 in server_priv_to_dict (this=<optimized out>, dict=0x7fefababa048) at server.c:262
#4  0x00007fefabcc311a in glusterfs_handle_brick_status (req=0x7fefad9942dc) at glusterfsd-mgmt.c:890
#5  0x00007fefab82d4a2 in synctask_wrap (old_task=<optimized out>) at syncop.c:380
#6  0x00007fefa9edd110 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
===

Additionally, we attach two compressed cores corresponding to the stack trace above.

If a brick hangs consuming 100% of CPU time, we attached to the brick process with gdb and captured the stack traces of all threads (see the attached "all_threads_stacktrace.log.xz" file).
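
For reference, this is roughly how the attached artifacts can be inspected; the brick binary path and the PID are assumptions (typical for a CentOS 7 brick process), and the frame and variable names come from the stack trace above.

===
# Load one of the attached cores (brick binary path is an assumption)
gdb /usr/sbin/glusterfsd /path/to/core
(gdb) bt            # should match the stack trace above
(gdb) frame 2       # dict_set_str (this, key, str)
(gdb) print key     # "client2896.hostname"
(gdb) print str     # the bogus 0x66726574737562eb pointer

# For the 100% CPU hang: attach to the live brick process and dump every thread
gdb -p <brick-pid> -batch -ex 'thread apply all backtrace' > all_threads_stacktrace.log
===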

Comment 1 Oleksandr Natalenko 2016-07-07 11:42:18 UTC
Created attachment 1177254 [details]
core file 1

Comment 2 Oleksandr Natalenko 2016-07-07 11:42:39 UTC
Created attachment 1177255 [details]
core file 2

Comment 3 Oleksandr Natalenko 2016-07-07 11:58:12 UTC
While still investigating the issue, we got the same cores even without any file workload.

Comment 4 Oleksandr Natalenko 2016-07-08 09:08:19 UTC
Reverting http://review.gluster.org/#/c/13658 and http://review.gluster.org/#/c/14739 as suggested by Soumya Koduri did not help; I got exactly the same crash with the same stack trace.
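
(For the record, one way to perform such reverts locally is sketched below; the commit SHAs of the two merged changes are not given in this report, so they are left as placeholders.)

===
# Revert the two merged changes locally before rebuilding
# (<sha-13658> and <sha-14739> are placeholders for the commit SHAs of the
#  two review.gluster.org changes)
cd glusterfs
git revert <sha-13658> <sha-14739>
# rebuild and reinstall the brick packages, then restart the bricks
===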

Comment 5 Kaushal 2017-03-08 10:52:33 UTC
This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.

