Bug 1559794 - weighted-rebalance
Summary: weighted-rebalance
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-23 09:48 UTC by Salamani
Modified: 2018-06-20 18:25 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-20 18:25:03 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
logs of volume rebalance (13.06 KB, text/plain) - 2018-03-23 09:48 UTC, Salamani
glusterd Logs for Weighted-rebalance.t failure on 3.12.3 (16.47 KB, text/plain) - 2018-03-27 07:37 UTC, Cnaik
Pathcy 1 Logs for Weighted-rebalance.t failure on 3.12.3 (12.26 KB, text/plain) - 2018-03-27 07:38 UTC, Cnaik
Patchy2 Logs for Weighted-rebalance.t failure on 3.12.3 (11.44 KB, text/plain) - 2018-03-27 07:39 UTC, Cnaik
Logs for brick d-backends-patchy1 (10.30 KB, text/plain) - 2018-04-02 08:18 UTC, Salamani
Logs for brick d-backends-patchy2 (9.83 KB, text/plain) - 2018-04-02 08:23 UTC, Salamani
Core Dump for Glusterd - Weighted-rebalance (409.27 KB, application/x-7z-compressed) - 2018-04-03 03:31 UTC, Cnaik
State Dump - Bricks (3.67 KB, application/x-7z-compressed) - 2018-04-09 09:53 UTC, Cnaik
Back Trace -Weighted-rebal_4.0.1 (201.70 KB, text/plain) - 2018-04-23 11:09 UTC, Cnaik
Complete Backtrace For core dump of weighted-rebalance (263.18 KB, text/plain) - 2018-04-24 04:32 UTC, Cnaik
Attaching patchy1/pathcy2 bricks log with TRACE enabled in the test. (1.69 MB, application/x-7z-compressed) - 2018-04-25 06:30 UTC, Cnaik
glusterd, patchy-rebalance.log with TRACE enabled (1.57 MB, application/x-7z-compressed) - 2018-04-25 06:31 UTC, Cnaik

Description Salamani 2018-03-23 09:48:25 UTC
Created attachment 1412032 [details]
logs of volume rebalance

Description of problem:
In tests/features/weighted-rebalance.t, the volume rebalance is failing.


How reproducible:
 Ran the test case.

Steps to Reproduce:
1. Check the status of the volume patchy.
2. Start the rebalance.
3. Observe that a brick becomes inactive and the rebalance fails.

Detailed commands and their output are as follows:



$ gluster --mode=script --wignore volume status

    Status of volume: patchy
    Gluster process                             TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick myhost:/d/backends/patchy1          49152     0          Y       64813
    Brick myhost:/d/backends/patchy2          49153     0          Y       64834
    
    Task Status of Volume patchy
    ------------------------------------------------------------------------------
    There are no active volume tasks



$ gluster --mode=script --wignore volume set patchy cluster.weighted-rebalance off
$ gluster --mode=script --wignore volume rebalance patchy start force

    volume rebalance: patchy: success: Rebalance on patchy has been started successfully. Use rebalance status command to check status of the rebalance process.
    ID: 5c573761-314d-4294-99ba-c6a518675e26

$ gluster --mode=script --wignore volume rebalance patchy status

                                        Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                                   ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                                   localhost                0        0Bytes             0             3             0               failed        0:00:00
    volume rebalance: patchy: success


$ gluster --mode=script --wignore volume status

    Status of volume: patchy
    Gluster process                             TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick myhost:/d/backends/patchy1          49152     0          Y       64813
    Brick myhost:/d/backends/patchy2          N/A       N/A        N       N/A
    
    Task Status of Volume patchy

------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 5c573761-314d-4294-99ba-c6a518675e26
Status               : failed



See the attached rebalance logs.

Comment 1 Cnaik 2018-03-27 07:37:15 UTC
Created attachment 1413595 [details]
glusterd Logs for Weighted-rebalance.t failure on 3.12.3

Comment 2 Cnaik 2018-03-27 07:38:42 UTC
Created attachment 1413597 [details]
Pathcy 1 Logs for Weighted-rebalance.t failure on 3.12.3

Comment 3 Cnaik 2018-03-27 07:39:08 UTC
Created attachment 1413598 [details]
Patchy2 Logs for Weighted-rebalance.t failure on 3.12.3

Comment 4 Salamani 2018-03-30 08:28:36 UTC
The test case passes if we reduce NFILES from 1000 to 750. GlusterFS should be able to handle more than this. We also tried running on a system with 8 cores; the issue is still observed.

Comment 5 Nithya Balachandran 2018-04-02 03:18:17 UTC
Rebalance will terminate itself if any subvol is down (in this case, a brick).
Why is the brick down?  Do you have the brick logs for this?

Comment 6 Susant Kumar Palai 2018-04-02 07:15:01 UTC
From rebalance log:
  [2018-03-23 08:06:39.315302] I [dht-rebalance.c:4513:gf_defrag_start_crawl] 0-patchy-dht: gf_defrag_start_crawl using commit hash 3584404562
    [2018-03-23 08:06:39.315689] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /
    [2018-03-23 08:06:39.317123] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
    [2018-03-23 08:06:39.317201] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
    [2018-03-23 08:06:39.317434] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-1
    [2018-03-23 08:06:39.317455] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
    [2018-03-23 08:06:39.317462] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-0
    [2018-03-23 08:06:39.317469] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
    [2018-03-23 08:06:39.317601] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-1,cnt = 6119424
    [2018-03-23 08:06:39.317730] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-0,cnt = 3149824
    [2018-03-23 08:06:39.317739] I [MSGID: 0] [dht-rebalance.c:4275:gf_defrag_total_file_size] 0-patchy-dht: Total size files = 9269248
    [2018-03-23 08:06:39.317866] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-1,cnt = 1570
    [2018-03-23 08:06:39.318020] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-0,cnt = 897
    [2018-03-23 08:06:39.318029] I [MSGID: 0] [dht-rebalance.c:4311:gf_defrag_total_file_cnt] 0-patchy-dht: Total number of files = 1233
    [2018-03-23 08:06:39.318148] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[0] creation successful
    [2018-03-23 08:06:39.318323] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[1] creation successful
    [2018-03-23 08:06:39.318360] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[2] creation successful
    [2018-03-23 08:06:39.318436] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[3] creation successful
    [2018-03-23 08:06:39.377769] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /dir
    [2018-03-23 08:06:39.378756] I [dht-rebalance.c:3274:gf_defrag_process_dir] 0-patchy-dht: migrate data called on /dir
    [2018-03-23 08:06:39.689322] W [socket.c:592:__socket_rwv] 0-patchy-client-1: readv on 0.0.0.0:49153 failed (No data available)
    [2018-03-23 08:06:39.689360] I [MSGID: 114018] [client.c:2227:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available   <---------------- The brick went down at this point. 
    [2018-03-23 08:06:39.689381] W [MSGID: 109073] [dht-common.c:10557:dht_notify] 0-patchy-dht: Received CHILD_DOWN. Exiting


As Nithya pointed out, rebalance won't continue if the brick is down and will terminate. If there is a crash on the brick side, can you attach the core or, preferably, a backtrace from gdb?

Comment 7 Salamani 2018-04-02 08:18:07 UTC
Created attachment 1416113 [details]
Logs for brick d-backends-patchy1

Comment 8 Salamani 2018-04-02 08:23:00 UTC
Created attachment 1416114 [details]
Logs for brick d-backends-patchy2

Comment 9 Cnaik 2018-04-03 03:31:57 UTC
Created attachment 1416544 [details]
Core Dump for Glusterd - Weighted-rebalance

Core Dump for Glusterd - Weighted-rebalance failure

Comment 10 Salamani 2018-04-03 09:17:20 UTC
Thanks @Cnaik for adding the core dump, which we got while executing the test case.

The following steps were followed to generate the core dump:
$ ulimit -c unlimited
$ sysctl -w kernel.core_pattern=/home/root/core_%e_%p_%s_%c_%d_%P
$ cat /proc/sys/kernel/core_pattern
/home/root/core_%e_%p_%s_%c_%d_%P

core_glusteriotwr0_29968_11_18446744073709551615_1_29968 was generated while executing the `rebalance_completed` step.

$gdb -ex 'set sysroot ./' -ex 'core-file core_glusteriotwr0_29968_11_18446744073709551615_1_29968' /usr/local/sbin/glusterfsd
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "s390x-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/glusterfsd...done.
[New LWP 29977]
[New LWP 29978]
[New LWP 29979]
[New LWP 29980]
[New LWP 29981]
[New LWP 29982]
[New LWP 29983]
[New LWP 29985]
[New LWP 29972]
[New LWP 29971]
[New LWP 29970]
[New LWP 29969]
[New LWP 29987]
[New LWP 29986]
[New LWP 29973]
[New LWP 29984]
[New LWP 29976]
[New LWP 29975]
[New LWP 29974]
[New LWP 30566]
[New LWP 29968]
warning: Could not load shared library symbols for 41 libraries, e.g. /usr/local/lib/libglusterfs.so.0.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?
Core was generated by `/usr/local/sbin/glusterfsd -s ecos0032 --volfile-id patchy.ecos0032.d-backends-'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000003ff80cffcf8 in ?? ()
[Current thread is 1 (LWP 29977)]
(gdb) bt
#0  0x000003ff80cffcf8 in ?? ()
#1  0x000003ff80da352c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

@here, any information on this?

Comment 11 Susant Kumar Palai 2018-04-05 06:48:40 UTC
Salamani,
   Can you give some more details about the reproducer? In one of the updates it is mentioned that NFILES was reduced from 1000 to 750 in the test case. Is that the reproducer? If not, can you elaborate more?


   Also, the gdb output does not show the backtrace, as it was not able to load the symbols. Are you analyzing the core on a different setup? If so, can you open the core file on the same machine on which the core was produced (or an identical one) and pass along the backtrace? (You might need to install the glusterfs-debuginfo package.)

Comment 12 Salamani 2018-04-05 08:29:25 UTC
The issue is reproduced simply by running the weighted-rebalance.t test, which has NFILES=1000. The volume rebalance fails.

But if we set NFILES<=750, the test passes.

We are installing glusterfs-debuginfo and rebuilding glusterfs.

Comment 13 Susant Kumar Palai 2018-04-05 09:51:07 UTC
(In reply to Salamani from comment #12)
> The issue is reproduced simply by running the weighted-rebalance.t test
> which has NFILES=1000. Volume Rebalance fails. 
> 
> But if we set NFILES<=750 the test passes
> 
> Installing glusterfs-debuginfo and rebuilding the glusterfs.

The test case by default has a value of 1000, and I could run it successfully on Fedora. Hence, the backtrace would be helpful.

Comment 14 Salamani 2018-04-05 11:52:07 UTC
Tried by installing glusterfs-dbg package, built glusterfs with --enable-debug flag during configure. Still not able to get the backtrace.

Comment 15 Cnaik 2018-04-06 07:39:55 UTC
(In reply to Susant Kumar Palai from comment #13)
> (In reply to Salamani from comment #12)
> > The issue is reproduced simply by running the weighted-rebalance.t test
> > which has NFILES=1000. Volume Rebalance fails. 
> > 
> > But if we set NFILES<=750 the test passes
> > 
> > Installing glusterfs-debuginfo and rebuilding the glusterfs.
> 
> The test case by default has value of 1000. And I could run it successfully
> in fedora. Hence, the backtrace would be helpful.

Is there anything we are missing in getting the core dump, because of which we are not able to get the correct backtrace?

Comment 16 Cnaik 2018-04-06 07:50:25 UTC
(In reply to Susant Kumar Palai from comment #13)
> (In reply to Salamani from comment #12)
> > The issue is reproduced simply by running the weighted-rebalance.t test
> > which has NFILES=1000. Volume Rebalance fails. 
> > 
> > But if we set NFILES<=750 the test passes
> > 
> > Installing glusterfs-debuginfo and rebuilding the glusterfs.
> 
> The test case by default has value of 1000. And I could run it successfully
> in fedora. Hence, the backtrace would be helpful.

Steps used to get the core dump:
•ulimit -c unlimited 
•sysctl -w kernel.core_pattern=/path_to_store_dump/core_%e_%p
•Executed the weighted-rebalance.t 
•gdb -ex 'set sysroot ./' -ex 'core-file /path_to_store_dump/core_glusteriotwr1_11231' /usr/local/sbin/glusterfsd

Please let us know if there is an alternative way to do this.

Comment 17 Cnaik 2018-04-09 08:44:59 UTC
Adding statedumps for the bricks and glusterd.

Comment 18 Cnaik 2018-04-09 09:53:26 UTC
Created attachment 1419171 [details]
State Dump - Bricks

Attached state dumps for the bricks. The state dumps were taken at two points: one before the rebalance starts and one after the rebalance.

Post rebalance, no statedump is created for the patchy2 brick.

The test was run twice, hence there are two sets of statedumps.

Comment 19 Nithya Balachandran 2018-04-11 12:05:47 UTC
(In reply to Cnaik from comment #18)
> Created attachment 1419171 [details]
> State Dump - Bricks
> 
> Attached State Dump - for Bricks. The state dumps are taken at 2 times: one
> before the rebalance starts and one after rebalance. 
> 
> Post rebalance no statedump is created for patchy2 brick.
> 
> The test has run twice hence there are 2 sets of statedumps.

The statedumps will not help here. We require the coredump with the symbols to figure out where this crashed.



Can you try 't a a bt' at the gdb prompt after opening the coredump and let us know what you see?

Comment 20 Salamani 2018-04-11 13:00:38 UTC
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/glusterfsd...done.
[New LWP 46659]
[New LWP 46559]
[New LWP 46560]
[New LWP 46570]
[New LWP 46571]
[New LWP 46611]
[New LWP 46558]
[New LWP 46613]
[New LWP 46612]
[New LWP 46552]
[New LWP 46564]
[New LWP 46563]
[New LWP 46562]
[New LWP 46561]
[New LWP 46567]
[New LWP 46566]
[New LWP 46565]
[New LWP 46614]
[New LWP 46557]
[New LWP 46556]
[New LWP 46555]
[New LWP 46554]
[New LWP 46553]
warning: Could not load shared library symbols for 42 libraries, e.g. /usr/local/lib/libglusterfs.so.0.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?
Core was generated by `/usr/local/sbin/glusterfsd -s ecos0032 --volfile-id patchy.ecos0032.d-backends-'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000003ffb137fcf8 in ?? ()
[Current thread is 1 (LWP 46659)]
(gdb) t a a bt

Thread 23 (LWP 46553):
#0  0x000003ffb1511b84 in ?? ()
#1  0x000003ffb1511b78 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 22 (LWP 46554):
#0  0x000003ffb15122f2 in ?? ()
#1  0x000003ffb151238a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 21 (LWP 46555):
#0  0x000003ffb13b93d0 in ?? ()
#1  0x000003ffb13b93c4 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 20 (LWP 46556):
#0  0x000003ffb150db98 in ?? ()
#1  0x000003ffb150dc30 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 19 (LWP 46557):
#0  0x000003ffb150db98 in ?? ()
#1  0x000003ffb150dc30 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 18 (LWP 46614):
#0  0x000003ffb13b93d0 in ?? ()
#1  0x000003ffb13b93c4 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 17 (LWP 46565):
#0  0x000003ffb13e5d1c in ?? ()
#1  0x000003ffb13e5d10 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 16 (LWP 46566):
#0  0x000003ffb13e5d1c in ?? ()
---Type <return> to continue, or q <return> to quit---

Comment 21 Nithya Balachandran 2018-04-11 13:47:50 UTC
Are you running this with a source install? In that case the debuginfo may not match. Have you uninstalled the gluster RPMs, in case you installed those earlier?

If this is a source install, can you retry building with the following CFLAGS and see if the symbols show up?

make CFLAGS="-ggdb3 -O0" install

Comment 22 Cnaik 2018-04-13 07:47:36 UTC
We have built glusterfs from the GitHub source on a Linux system.
We are not clear on the "private: false" mentioned above. Could you please tell us where to set this?

We tried make CFLAGS="-ggdb3 -O0" install and got the dump; following is the output of t a a bt:


(gdb) t a a bt

Thread 21 (Thread 0x3ff8a77f910 (LWP 15676)):
#0  0x000003ff8b911b84 in ?? () at ../sysdeps/unix/syscall-template.S:84 from /lib/s390x-linux-gnu/libpthread-2.23.so
#1  0x000003ff8bbbda64 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 20 (Thread 0x3ff88f7f910 (LWP 15679)):
#0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x12a457748, mutex=0x12a457720, abstime=0x3ff88f7ef08) at pthread_cond_timedwait.c:198
#1  0x000003ff8bbec3aa in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 19 (Thread 0x3ff6defe910 (LWP 15692)):
#0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x3ff8008b3e0, mutex=0x3ff8008b410, abstime=0x3ff6defdfa0)
    at pthread_cond_timedwait.c:198
#1  0x000003ff875a9fde in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 18 (Thread 0x3ff5ffff910 (LWP 15720)):
#0  0x000003ff8b7b93d0 in ?? () at ../sysdeps/unix/syscall-template.S:84 from /lib/s390x-linux-gnu/libc-2.23.so
#1  0x000003ff8b7b92d8 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2  0x000003ff875ada78 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 17 (Thread 0x3ff6cefe910 (LWP 15719)):
#0  0x000003ff8b7ee3a4 in ?? () at ../sysdeps/unix/syscall-template.S:84 from /lib/s390x-linux-gnu/libc-2.23.so
#1  0x000003ff8bc0d3a6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 16 (Thread 0x3ff6efff910 (LWP 15690)):
#0  0x000003ff8b7e5d1c in ?? () at ../sysdeps/unix/syscall-template.S:84 from /lib/s390x-linux-gnu/libc-2.23.so
#1  0x000003ff8711c20a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 15 (Thread 0x3ff6ffff910 (LWP 15688)):
#0  0x000003ff8b7e5d1c in ?? () at ../sysdeps/unix/syscall-template.S:84 from /lib/s390x-linux-gnu/libc-2.23.so
#1  0x000003ff8711c20a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 14 (Thread 0x3ff6f7ff910 (LWP 15689)):
#0  0x000003ff8b7e5d1c in ?? () at ../sysdeps/unix/syscall-template.S:84 from /lib/s390x-linux-gnu/libc-2.23.so
#1  0x000003ff8711c20a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 13 (Thread 0x3ff8567f910 (LWP 15685)):
#0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff80077108, mutex=0x3ff800770e0) at pthread_cond_wait.c:186
---Type <return> to continue, or q <return> to quit---
#1  0x000003ff87085e10 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 12 (Thread 0x3ff84e7f910 (LWP 15686)):
#0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff80077180, mutex=0x3ff80077158) at pthread_cond_wait.c:186
#1  0x000003ff870846de in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 11 (Thread 0x3ff8ba7f910 (LWP 15962)):
#0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x3ff8005ed88, mutex=0x3ff8005ed60, abstime=0x3ff8ba7efa8)
    at pthread_cond_timedwait.c:198
#1  0x000003ff86d85682 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 10 (Thread 0x3ff8497b910 (LWP 15687)):
#0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff8007bea8, mutex=0x3ff8007be80) at pthread_cond_wait.c:186
#1  0x000003ff8711bfb2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 9 (Thread 0x3ff87dff910 (LWP 15681)):
#0  0x000003ff8b7ee3a4 in ?? () at ../sysdeps/unix/syscall-template.S:84 from /lib/s390x-linux-gnu/libc-2.23.so
#1  0x000003ff8bc0d3a6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 8 (Thread 0x3ff6d6fe910 (LWP 15693)):
#0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff8008b588, mutex=0x3ff8008b560) at pthread_cond_wait.c:186
#1  0x000003ff875adfbe in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 7 (Thread 0x3ff8687f910 (LWP 15682)):
#0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff8003fdb8, mutex=0x3ff8003fd90) at pthread_cond_wait.c:186
#1  0x000003ff8bb0b248 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 6 (Thread 0x3ff8877f910 (LWP 15680)):
#0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x12a457748, mutex=0x12a457720, abstime=0x3ff8877ef08) at pthread_cond_timedwait.c:198
#1  0x000003ff8bbec3aa in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 5 (Thread 0x3ff89f7f910 (LWP 15677)):
#0  do_sigwait (set=<optimized out>, set@entry=0x3ff89f7ef38, sig=sig@entry=0x3ff89f7ef34) at ../sysdeps/unix/sysv/linux/sigwait.c:64
#1  0x000003ff8b91238a in __sigwait (set=0x3ff89f7ef38, sig=0x3ff89f7ef34) at ../sysdeps/unix/sysv/linux/sigwait.c:96
#2  0x000000011240b4f6 in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2157
#3  0x000003ff8b907934 in start_thread (arg=0x3ff89f7f910) at pthread_create.c:335
#4  0x000003ff8b7edce2 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:74
---Type <return> to continue, or q <return> to quit---

Thread 4 (Thread 0x3ff8bd77750 (LWP 15675)):
#0  0x000003ff8b908bf2 in pthread_join (threadid=4396031146256, thread_return=0x0) at pthread_join.c:90
#1  0x000003ff8bc0db7c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 3 (Thread 0x3ff8577f910 (LWP 15683)):
#0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff80051760, mutex=0x3ff80051738) at pthread_cond_wait.c:186
#1  0x000003ff86b86e24 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 0x3ff8977f910 (LWP 15678)):
#0  0x000003ff8b7b93d0 in ?? () at ../sysdeps/unix/syscall-template.S:84 from /lib/s390x-linux-gnu/libc-2.23.so
#1  0x000003ff8b7b92d8 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2  0x000003ff8bbd965a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0x3ff8bb7f910 (LWP 15684)):
#0  0x000003ff8b78147c in _int_malloc (av=av@entry=0x3ff7c000020, bytes=bytes@entry=24) at malloc.c:3349
#1  0x000003ff8b783d00 in __GI___libc_malloc (bytes=24) at malloc.c:2913
#2  0x000003ff8b823536 in x_inline (xdrs=0x3ff8bb7d428, len=24) at xdr_sizeof.c:89
#3  0x000003ff8ba8ef60 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Comment 23 Nithya Balachandran 2018-04-13 12:09:30 UTC
(In reply to Cnaik from comment #22)
> We have build glusterfs from github source on a linux system.

Can you tell me why you are working with a source install instead of the release packages? Also, why are you using an older version instead of the latest sources?

> We are not clear on is "private: false" mentioned above. Could you please
> tell where do we set this?

That was for the bugzilla comment and can be ignored.

> 
> We tried make CFLAGS="-ggdb3 -O0" install and got the dump, following is the
> output after doing t a a bt:

I still do not see any gluster symbols in the stack.

> 
> 
> (gdb) t a a bt
> 
> Thread 21 (Thread 0x3ff8a77f910 (LWP 15676)):
> #0  0x000003ff8b911b84 in ?? () at ../sysdeps/unix/syscall-template.S:84
> from /lib/s390x-linux-gnu/libpthread-2.23.so
> #1  0x000003ff8bbbda64 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 20 (Thread 0x3ff88f7f910 (LWP 15679)):
> #0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x12a457748,
> mutex=0x12a457720, abstime=0x3ff88f7ef08) at pthread_cond_timedwait.c:198
> #1  0x000003ff8bbec3aa in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 19 (Thread 0x3ff6defe910 (LWP 15692)):
> #0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x3ff8008b3e0,
> mutex=0x3ff8008b410, abstime=0x3ff6defdfa0)
>     at pthread_cond_timedwait.c:198
> #1  0x000003ff875a9fde in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 18 (Thread 0x3ff5ffff910 (LWP 15720)):
> #0  0x000003ff8b7b93d0 in ?? () at ../sysdeps/unix/syscall-template.S:84
> from /lib/s390x-linux-gnu/libc-2.23.so
> #1  0x000003ff8b7b92d8 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
> #2  0x000003ff875ada78 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 17 (Thread 0x3ff6cefe910 (LWP 15719)):
> #0  0x000003ff8b7ee3a4 in ?? () at ../sysdeps/unix/syscall-template.S:84
> from /lib/s390x-linux-gnu/libc-2.23.so
> #1  0x000003ff8bc0d3a6 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 16 (Thread 0x3ff6efff910 (LWP 15690)):
> #0  0x000003ff8b7e5d1c in ?? () at ../sysdeps/unix/syscall-template.S:84
> from /lib/s390x-linux-gnu/libc-2.23.so
> #1  0x000003ff8711c20a in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 15 (Thread 0x3ff6ffff910 (LWP 15688)):
> #0  0x000003ff8b7e5d1c in ?? () at ../sysdeps/unix/syscall-template.S:84
> from /lib/s390x-linux-gnu/libc-2.23.so
> #1  0x000003ff8711c20a in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 14 (Thread 0x3ff6f7ff910 (LWP 15689)):
> #0  0x000003ff8b7e5d1c in ?? () at ../sysdeps/unix/syscall-template.S:84
> from /lib/s390x-linux-gnu/libc-2.23.so
> #1  0x000003ff8711c20a in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 13 (Thread 0x3ff8567f910 (LWP 15685)):
> #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff80077108,
> mutex=0x3ff800770e0) at pthread_cond_wait.c:186
> ---Type <return> to continue, or q <return> to quit---
> #1  0x000003ff87085e10 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 12 (Thread 0x3ff84e7f910 (LWP 15686)):
> #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff80077180,
> mutex=0x3ff80077158) at pthread_cond_wait.c:186
> #1  0x000003ff870846de in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 11 (Thread 0x3ff8ba7f910 (LWP 15962)):
> #0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x3ff8005ed88,
> mutex=0x3ff8005ed60, abstime=0x3ff8ba7efa8)
>     at pthread_cond_timedwait.c:198
> #1  0x000003ff86d85682 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 10 (Thread 0x3ff8497b910 (LWP 15687)):
> #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff8007bea8,
> mutex=0x3ff8007be80) at pthread_cond_wait.c:186
> #1  0x000003ff8711bfb2 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 9 (Thread 0x3ff87dff910 (LWP 15681)):
> #0  0x000003ff8b7ee3a4 in ?? () at ../sysdeps/unix/syscall-template.S:84
> from /lib/s390x-linux-gnu/libc-2.23.so
> #1  0x000003ff8bc0d3a6 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 8 (Thread 0x3ff6d6fe910 (LWP 15693)):
> #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff8008b588,
> mutex=0x3ff8008b560) at pthread_cond_wait.c:186
> #1  0x000003ff875adfbe in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 7 (Thread 0x3ff8687f910 (LWP 15682)):
> #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff8003fdb8,
> mutex=0x3ff8003fd90) at pthread_cond_wait.c:186
> #1  0x000003ff8bb0b248 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 6 (Thread 0x3ff8877f910 (LWP 15680)):
> #0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x12a457748,
> mutex=0x12a457720, abstime=0x3ff8877ef08) at pthread_cond_timedwait.c:198
> #1  0x000003ff8bbec3aa in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 5 (Thread 0x3ff89f7f910 (LWP 15677)):
> #0  do_sigwait (set=<optimized out>, set@entry=0x3ff89f7ef38,
> sig=sig@entry=0x3ff89f7ef34) at ../sysdeps/unix/sysv/linux/sigwait.c:64
> #1  0x000003ff8b91238a in __sigwait (set=0x3ff89f7ef38, sig=0x3ff89f7ef34)
> at ../sysdeps/unix/sysv/linux/sigwait.c:96
> #2  0x000000011240b4f6 in glusterfs_sigwaiter (arg=<optimized out>) at
> glusterfsd.c:2157
> #3  0x000003ff8b907934 in start_thread (arg=0x3ff89f7f910) at
> pthread_create.c:335
> #4  0x000003ff8b7edce2 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:74
> ---Type <return> to continue, or q <return> to quit---
> 
> Thread 4 (Thread 0x3ff8bd77750 (LWP 15675)):
> #0  0x000003ff8b908bf2 in pthread_join (threadid=4396031146256,
> thread_return=0x0) at pthread_join.c:90
> #1  0x000003ff8bc0db7c in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 3 (Thread 0x3ff8577f910 (LWP 15683)):
> #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff80051760,
> mutex=0x3ff80051738) at pthread_cond_wait.c:186
> #1  0x000003ff86b86e24 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 2 (Thread 0x3ff8977f910 (LWP 15678)):
> #0  0x000003ff8b7b93d0 in ?? () at ../sysdeps/unix/syscall-template.S:84
> from /lib/s390x-linux-gnu/libc-2.23.so
> #1  0x000003ff8b7b92d8 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
> #2  0x000003ff8bbd965a in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Thread 1 (Thread 0x3ff8bb7f910 (LWP 15684)):
> #0  0x000003ff8b78147c in _int_malloc (av=av@entry=0x3ff7c000020,
> bytes=bytes@entry=24) at malloc.c:3349
> #1  0x000003ff8b783d00 in __GI___libc_malloc (bytes=24) at malloc.c:2913
> #2  0x000003ff8b823536 in x_inline (xdrs=0x3ff8bb7d428, len=24) at
> xdr_sizeof.c:89
> #3  0x000003ff8ba8ef60 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)


Can you attach the /var/log/messages and /var/log/dmesg files to the BZ?

Comment 24 Salamani 2018-04-17 13:16:13 UTC
> (In reply to Cnaik from comment #22)
> > We have build glusterfs from github source on a linux system.
> 
> Can you tell me why are you working with a source install instead of the
> releases packages? Also why are you using the older version instead of the
> latest sources?
We are building GlusterFS v4.0.1 (latest) from the GitHub repo (https://github.com/gluster/glusterfs/tree/v4.0.1) to check whether it works on a Linux s390x system.


> 
> > We are not clear on is "private: false" mentioned above. Could you please
> > tell where do we set this?
> 
> That was for the bugzilla comment and can be ignored.
> 
> > 
> > We tried make CFLAGS="-ggdb3 -O0" install and got the dump, following is the
> > output after doing t a a bt:
> 
> I still do not see any gluster symbols in the stack.

Please find below the steps to build and get the debugging symbols for glusterfs:
•ulimit -c unlimited 
•sysctl -w kernel.core_pattern=/path_to_store_dump/core_%e_%p
•Installed glusterfs-dbg
•Built and installed the glusterfs
  git clone https://github.com/gluster/glusterfs
  cd glusterfs
  git checkout v4.0.1
  ./autogen.sh
  ./configure --enable-gnfs --enable-debug
  make
  make CFLAGS="-ggdb3 -O0" install
•Executed the weighted-rebalance.t 
•gdb -ex 'set sysroot ./' -ex 'core-file /path_to_store_dump/core_glusteriotwr1_11231' /usr/local/sbin/glusterfsd

Do let us know if anything else is needed.


> 
> > 
> > 
> > (gdb) t a a bt
> > 
> > Thread 21 (Thread 0x3ff8a77f910 (LWP 15676)):
> > #0  0x000003ff8b911b84 in ?? () at ../sysdeps/unix/syscall-template.S:84
> > from /lib/s390x-linux-gnu/libpthread-2.23.so
> > #1  0x000003ff8bbbda64 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 20 (Thread 0x3ff88f7f910 (LWP 15679)):
> > #0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x12a457748,
> > mutex=0x12a457720, abstime=0x3ff88f7ef08) at pthread_cond_timedwait.c:198
> > #1  0x000003ff8bbec3aa in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 19 (Thread 0x3ff6defe910 (LWP 15692)):
> > #0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x3ff8008b3e0,
> > mutex=0x3ff8008b410, abstime=0x3ff6defdfa0)
> >     at pthread_cond_timedwait.c:198
> > #1  0x000003ff875a9fde in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 18 (Thread 0x3ff5ffff910 (LWP 15720)):
> > #0  0x000003ff8b7b93d0 in ?? () at ../sysdeps/unix/syscall-template.S:84
> > from /lib/s390x-linux-gnu/libc-2.23.so
> > #1  0x000003ff8b7b92d8 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
> > #2  0x000003ff875ada78 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 17 (Thread 0x3ff6cefe910 (LWP 15719)):
> > #0  0x000003ff8b7ee3a4 in ?? () at ../sysdeps/unix/syscall-template.S:84
> > from /lib/s390x-linux-gnu/libc-2.23.so
> > #1  0x000003ff8bc0d3a6 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 16 (Thread 0x3ff6efff910 (LWP 15690)):
> > #0  0x000003ff8b7e5d1c in ?? () at ../sysdeps/unix/syscall-template.S:84
> > from /lib/s390x-linux-gnu/libc-2.23.so
> > #1  0x000003ff8711c20a in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 15 (Thread 0x3ff6ffff910 (LWP 15688)):
> > #0  0x000003ff8b7e5d1c in ?? () at ../sysdeps/unix/syscall-template.S:84
> > from /lib/s390x-linux-gnu/libc-2.23.so
> > #1  0x000003ff8711c20a in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 14 (Thread 0x3ff6f7ff910 (LWP 15689)):
> > #0  0x000003ff8b7e5d1c in ?? () at ../sysdeps/unix/syscall-template.S:84
> > from /lib/s390x-linux-gnu/libc-2.23.so
> > #1  0x000003ff8711c20a in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 13 (Thread 0x3ff8567f910 (LWP 15685)):
> > #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff80077108,
> > mutex=0x3ff800770e0) at pthread_cond_wait.c:186
> > ---Type <return> to continue, or q <return> to quit---
> > #1  0x000003ff87085e10 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 12 (Thread 0x3ff84e7f910 (LWP 15686)):
> > #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff80077180,
> > mutex=0x3ff80077158) at pthread_cond_wait.c:186
> > #1  0x000003ff870846de in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 11 (Thread 0x3ff8ba7f910 (LWP 15962)):
> > #0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x3ff8005ed88,
> > mutex=0x3ff8005ed60, abstime=0x3ff8ba7efa8)
> >     at pthread_cond_timedwait.c:198
> > #1  0x000003ff86d85682 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 10 (Thread 0x3ff8497b910 (LWP 15687)):
> > #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff8007bea8,
> > mutex=0x3ff8007be80) at pthread_cond_wait.c:186
> > #1  0x000003ff8711bfb2 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 9 (Thread 0x3ff87dff910 (LWP 15681)):
> > #0  0x000003ff8b7ee3a4 in ?? () at ../sysdeps/unix/syscall-template.S:84
> > from /lib/s390x-linux-gnu/libc-2.23.so
> > #1  0x000003ff8bc0d3a6 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 8 (Thread 0x3ff6d6fe910 (LWP 15693)):
> > #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff8008b588,
> > mutex=0x3ff8008b560) at pthread_cond_wait.c:186
> > #1  0x000003ff875adfbe in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 7 (Thread 0x3ff8687f910 (LWP 15682)):
> > #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff8003fdb8,
> > mutex=0x3ff8003fd90) at pthread_cond_wait.c:186
> > #1  0x000003ff8bb0b248 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 6 (Thread 0x3ff8877f910 (LWP 15680)):
> > #0  0x000003ff8b90db98 in __pthread_cond_timedwait (cond=0x12a457748,
> > mutex=0x12a457720, abstime=0x3ff8877ef08) at pthread_cond_timedwait.c:198
> > #1  0x000003ff8bbec3aa in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 5 (Thread 0x3ff89f7f910 (LWP 15677)):
> > #0  do_sigwait (set=<optimized out>, set@entry=0x3ff89f7ef38,
> > sig=sig@entry=0x3ff89f7ef34) at ../sysdeps/unix/sysv/linux/sigwait.c:64
> > #1  0x000003ff8b91238a in __sigwait (set=0x3ff89f7ef38, sig=0x3ff89f7ef34)
> > at ../sysdeps/unix/sysv/linux/sigwait.c:96
> > #2  0x000000011240b4f6 in glusterfs_sigwaiter (arg=<optimized out>) at
> > glusterfsd.c:2157
> > #3  0x000003ff8b907934 in start_thread (arg=0x3ff89f7f910) at
> > pthread_create.c:335
> > #4  0x000003ff8b7edce2 in thread_start () at
> > ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:74
> > ---Type <return> to continue, or q <return> to quit---
> > 
> > Thread 4 (Thread 0x3ff8bd77750 (LWP 15675)):
> > #0  0x000003ff8b908bf2 in pthread_join (threadid=4396031146256,
> > thread_return=0x0) at pthread_join.c:90
> > #1  0x000003ff8bc0db7c in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 3 (Thread 0x3ff8577f910 (LWP 15683)):
> > #0  0x000003ff8b90d7d4 in __pthread_cond_wait (cond=0x3ff80051760,
> > mutex=0x3ff80051738) at pthread_cond_wait.c:186
> > #1  0x000003ff86b86e24 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 2 (Thread 0x3ff8977f910 (LWP 15678)):
> > #0  0x000003ff8b7b93d0 in ?? () at ../sysdeps/unix/syscall-template.S:84
> > from /lib/s390x-linux-gnu/libc-2.23.so
> > #1  0x000003ff8b7b92d8 in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
> > #2  0x000003ff8bbd965a in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> > 
> > Thread 1 (Thread 0x3ff8bb7f910 (LWP 15684)):
> > #0  0x000003ff8b78147c in _int_malloc (av=av@entry=0x3ff7c000020,
> > bytes=bytes@entry=24) at malloc.c:3349
> > #1  0x000003ff8b783d00 in __GI___libc_malloc (bytes=24) at malloc.c:2913
> > #2  0x000003ff8b823536 in x_inline (xdrs=0x3ff8bb7d428, len=24) at
> > xdr_sizeof.c:89
> > #3  0x000003ff8ba8ef60 in ?? ()
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> 
> Can you attach the /var/log/messages and /var/log/dmesg files to the BZ?

$cat /var/log/dmesg
(Nothing has been logged yet.)

$cat /var/log/messages
No such file

(In reply to Nithya Balachandran from comment #23)

Comment 25 Nithya Balachandran 2018-04-20 09:22:07 UTC
(In reply to Salamani from comment #24)
> > (In reply to Cnaik from comment #22)
> > > We have build glusterfs from github source on a linux system.
> > 
> > Can you tell me why are you working with a source install instead of the
> > releases packages? Also why are you using the older version instead of the
> > latest sources?
> We are building the Glusterfs v4.0.1(latest) from github repo
> (https://github.com/gluster/glusterfs/tree/v4.0.1) to check its working on
> linux s390x system. 
> 
> 
> > 
> > > We are not clear on is "private: false" mentioned above. Could you please
> > > tell where do we set this?
> > 
> > That was for the bugzilla comment and can be ignored.
> > 
> > > 
> > > We tried make CFLAGS="-ggdb3 -O0" install and got the dump, following is the
> > > output after doing t a a bt:
> > 
> > I still do not see any gluster symbols in the stack.
> 
> PFB steps to build and get the debugging symbols for glusterfs
> •ulimit -c unlimited 
> •sysctl -w kernel.core_pattern=/path_to_store_dump/core_%e_%p
> •Installed glusterfs-dbg

Please do not do this. 

> •Built and installed the glusterfs
>   git clone https://github.com/gluster/glusterfs
>   cd glusterfs
>   git checkout v4.0.1
>   ./autogen.sh
>   ./configure --enable-gnfs --enable-debug

Please try this without --enable-debug

>   make
>   make CFLAGS="-ggdb3 -O0" install


> •Executed the weighted-rebalance.t 
> •gdb -ex 'set sysroot ./' -ex 'core-file
> /path_to_store_dump/core_glusteriotwr1_11231' /usr/local/sbin/glusterfsd
> 
> Do let us know if anything else needed?
> 
> 

And see if you get a backtrace

Comment 26 Cnaik 2018-04-20 12:26:48 UTC
We built glusterfs without --enable-debug and executed make and make install as mentioned above (make CFLAGS="-ggdb3 -O0" install).

Still not able to get the backtrace.

Comment 28 Nithya Balachandran 2018-04-20 13:46:55 UTC
Did you uninstall gluster-debug?

Comment 29 Cnaik 2018-04-23 04:25:12 UTC
(In reply to Nithya Balachandran from comment #28)
> Did you uninstall gluster-debug?

Yes, we uninstalled glusterfs-dbg (3.7.6-1ubuntu1 500) and tried the above steps.

Comment 30 Nithya Balachandran 2018-04-23 05:31:52 UTC
(In reply to Cnaik from comment #29)
> (In reply to Nithya Balachandran from comment #28)
> > Did you uninstall gluster-debug?
> 
> Yes uninstalled  glusterfs-dbg (3.7.6-1ubuntu1 500) adn tried above steps.

There is not much we can do without a stack trace. Please try the following:

1. In gdb, use 'info sharedlib' to determine the addresses and try to map the ones on the stack using 'info symbol <address>'.

2. Enable TRACE logs in the test case and see if there is anything useful in the logs. In weighted-rebalance.t, after the line
EXPECT 'Started' volinfo_field $V0 'Status'

add
TEST $CLI volume set $V0 client-log-level TRACE
TEST $CLI volume set $V0 brick-log-level TRACE


and rerun the test.

Comment 31 Cnaik 2018-04-23 11:07:15 UTC
We managed to get the stack trace with symbols by setting the search path for libraries: set solib-search-path. Attaching the backtrace.
Please let us know if this will help.

Comment 33 Cnaik 2018-04-23 11:09:06 UTC
Created attachment 1425645 [details]
Back Trace -Weighted-rebal_4.0.1

Adding the backtrace with correct symbols. The log is too huge; adding it part by part.

Comment 34 Cnaik 2018-04-24 04:32:57 UTC
Created attachment 1425782 [details]
Complete Backtrace For core dump of weighted-rebalance

Attaching the complete stack trace of the core dump obtained after running weighted-rebalance.t. Please check and let us know your comments on this.

Comment 35 Cnaik 2018-04-25 06:30:21 UTC
Created attachment 1426431 [details]
Attaching patchy1/pathcy2 bricks log with TRACE enabled in the test.

Enabled TRACE logs in the test; attaching the patchy1/patchy2 brick logs.

Comment 36 Cnaik 2018-04-25 06:31:31 UTC
Created attachment 1426432 [details]
glusterd, patchy-rebalance.log with TRACE enabled

Comment 37 Nithya Balachandran 2018-04-27 05:54:39 UTC
This could be a stack overflow. What is your default stack size (ulimit -s)?

Comment 38 Nithya Balachandran 2018-04-27 06:45:17 UTC
Try increasing the default stack size and see if you still hit the problem.

Comment 39 Nithya Balachandran 2018-05-02 11:23:38 UTC
Hi,

Did you try increasing the stack size?

Comment 40 agautam 2018-05-03 05:46:26 UTC
(In reply to Nithya Balachandran from comment #39)
> Hi,
> 
> Did you try increasing the stack size?
Hi,
Our default stack size (ulimit -s) is 8192. We also ran the test after setting it to unlimited, but the test still failed.

Comment 41 Raghavendra G 2018-05-03 12:09:40 UTC
Encoding of the readdir response is done by a recursive function. The depth of the stack is directly proportional to the number of dentries in a readdir response, so it can result in a stack overflow if a readdir response happens to have too many dentries. A safer alternative would be to make this function iterative.

Comment 42 Cnaik 2018-05-04 09:52:20 UTC
(In reply to Raghavendra G from comment #41)
> Encoding of readdir response is a recursive function. The depth of the stack
> is directly proportional to the number of dentries in a readdir response.
> So, it can result in stack overflow if readdir happen to have too many
> dentries. A safer alternative would be to make this function iterative.

Could you please point out the exact function you are referring to?

Comment 43 Cnaik 2018-05-15 03:28:31 UTC
Any update on this?

Comment 44 Cnaik 2018-05-17 10:22:35 UTC
(In reply to Raghavendra G from comment #41)
> Encoding of readdir response is a recursive function. The depth of the stack
> is directly proportional to the number of dentries in a readdir response.
> So, it can result in stack overflow if readdir happen to have too many
> dentries. A safer alternative would be to make this function iterative.

Could you please let us know which function should be made iterative?

We tried converting the recursive function gfx_defrag_fix_layout() to an iterative one; the test behavior is still the same.
The brick goes down before syncop_readdirp (called in the while loop) returns in the second call to gfx_defrag_fix_layout.

Comment 45 Raghavendra G 2018-05-17 12:14:22 UTC
(In reply to Cnaik from comment #44)
> (In reply to Raghavendra G from comment #41)
> > Encoding of readdir response is a recursive function. The depth of the stack
> > is directly proportional to the number of dentries in a readdir response.
> > So, it can result in stack overflow if readdir happen to have too many
> > dentries. A safer alternative would be to make this function iterative.
> 
> Could you please let us know the function to be made iterative?
> 
> We tried making gfx_defrag_fix_layout() recursive function to iterative -
> still the test behavior is same.
> Brick goes down before syncop_readdirp(called in while loop) returns in the
> second call to gfx_defrag_fix_layout

Sorry, the code is autogenerated. The autogeneration logic refers to the template of the readdirp response from ./rpc/xdr/src/glusterfs[3][4]xdr.x:

struct gfs3_readdirp_rsp {
       int op_ret;
       int op_errno;
       gfs3_dirplist *reply;
        opaque   xdata<>; /* Extra data */
};

Note that the definition of gfs3_dirplist has a pointer to another gfs3_dirplist object, nextentry:


struct gfs3_dirplist {
       u_quad_t d_ino;
       u_quad_t d_off;
       unsigned int d_len;
       unsigned int d_type;
       string name<>;
       gf_iatt stat;
       opaque dict<>;
       gfs3_dirplist *nextentry;
};

Autogenerated code from ./rpc/xdr/src/glusterfs[3][4]xdr.c:
bool_t
xdr_gfs3_readdirp_rsp (XDR *xdrs, gfs3_readdirp_rsp *objp)
{
        register int32_t *buf;

         if (!xdr_int (xdrs, &objp->op_ret))
                 return FALSE;
         if (!xdr_int (xdrs, &objp->op_errno))
                 return FALSE;
         if (!xdr_pointer (xdrs, (char **)&objp->reply, sizeof (gfs3_dirplist), (xdrproc_t) xdr_gfs3_dirplist))
                 return FALSE;
         if (!xdr_bytes (xdrs, (char **)&objp->xdata.xdata_val, (u_int *) &objp->xdata.xdata_len, ~0))
                 return FALSE;
        return TRUE;
}


The definition of xdr_gfs3_dirplist invokes itself to encode/decode the next dentry in the list:
bool_t
xdr_gfs3_dirplist (XDR *xdrs, gfs3_dirplist *objp)
{
        register int32_t *buf;

         if (!xdr_u_quad_t (xdrs, &objp->d_ino))
                 return FALSE;
         if (!xdr_u_quad_t (xdrs, &objp->d_off))
                 return FALSE;
         if (!xdr_u_int (xdrs, &objp->d_len))
                 return FALSE;
         if (!xdr_u_int (xdrs, &objp->d_type))
                 return FALSE;
         if (!xdr_string (xdrs, &objp->name, ~0))
                 return FALSE;
         if (!xdr_gf_iatt (xdrs, &objp->stat))
                 return FALSE;
         if (!xdr_bytes (xdrs, (char **)&objp->dict.dict_val, (u_int *) &objp->dict.dict_len, ~0))
                 return FALSE;
         if (!xdr_pointer (xdrs, (char **)&objp->nextentry, sizeof (gfs3_dirplist), (xdrproc_t) xdr_gfs3_dirplist))
                 return FALSE;
        return TRUE;
}

if (!xdr_pointer (xdrs, (char **)&objp->nextentry, sizeof (gfs3_dirplist), (xdrproc_t) xdr_gfs3_dirplist)) is the recursive invocation.

Since this code is autogenerated, I am still thinking about how to make this iterative.
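
For illustration only, here is one rough shape a hand-written replacement could take. This is an untested sketch, not a proposed patch: it covers only the XDR_ENCODE direction (the decode direction would additionally have to allocate each entry inside the loop), and it assumes the on-wire format must stay identical to what xdr_pointer produces, i.e. a presence boolean before every chained entry and a terminating FALSE after the last one:

bool_t
xdr_gfs3_dirplist (XDR *xdrs, gfs3_dirplist *objp)
{
        gfs3_dirplist *entry = objp;
        bool_t more;

        /* Walk the dentry chain in a loop instead of recursing once per
         * entry, so the stack depth no longer grows with the number of
         * dentries in the readdirp response. */
        while (entry) {
                if (!xdr_u_quad_t (xdrs, &entry->d_ino))
                        return FALSE;
                if (!xdr_u_quad_t (xdrs, &entry->d_off))
                        return FALSE;
                if (!xdr_u_int (xdrs, &entry->d_len))
                        return FALSE;
                if (!xdr_u_int (xdrs, &entry->d_type))
                        return FALSE;
                if (!xdr_string (xdrs, &entry->name, ~0))
                        return FALSE;
                if (!xdr_gf_iatt (xdrs, &entry->stat))
                        return FALSE;
                if (!xdr_bytes (xdrs, (char **)&entry->dict.dict_val,
                                (u_int *) &entry->dict.dict_len, ~0))
                        return FALSE;

                /* xdr_pointer would emit a presence flag and then recurse;
                 * emit the flag ourselves and continue the loop instead. */
                more = (entry->nextentry != NULL);
                if (!xdr_bool (xdrs, &more))
                        return FALSE;
                entry = entry->nextentry;
        }
        return TRUE;
}

Whether something like this can coexist cleanly with the generated xdr_gfs3_readdirp_rsp (which still calls xdr_pointer for the first entry) would need to be checked.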

Comment 46 Nithya Balachandran 2018-05-18 03:33:36 UTC
> Since, this code is autogenerated, I am still thinking how to make this
> iterative.

Would removing the readdir rsp definition from the .x file and writing a separate .c file for the iterative method work?

The other thing to keep in mind is that s390x may not be officially supported as a server platform. We could run into endianness issues (DHT-IATT-IN_DICT).
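
To illustrate the kind of issue I mean, here is a generic sketch (not GlusterFS code; the two-field layout is made up): if a big-endian server such as s390x packs raw integer fields into a byte blob carried in a dict, and a little-endian client unpacks that blob, both sides have to agree on a fixed byte order, for example:

#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Made-up two-field layout standing in for an iatt packed into a dict. */
struct wire_stat {
        uint64_t ia_size;
        uint32_t ia_uid;
};

/* Always serialize in big-endian (network) byte order, regardless of the
 * host's native endianness. */
static void
pack_wire_stat (const struct wire_stat *in, unsigned char buf[12])
{
        uint64_t size_be = htobe64 (in->ia_size);
        uint32_t uid_be  = htobe32 (in->ia_uid);

        memcpy (buf, &size_be, 8);
        memcpy (buf + 8, &uid_be, 4);
}

/* Convert back to host order on the receiving side. */
static void
unpack_wire_stat (const unsigned char buf[12], struct wire_stat *out)
{
        uint64_t size_be;
        uint32_t uid_be;

        memcpy (&size_be, buf, 8);
        memcpy (&uid_be, buf + 8, 4);

        out->ia_size = be64toh (size_be);
        out->ia_uid  = be32toh (uid_be);
}

int
main (void)
{
        struct wire_stat in = { 4096, 1000 }, out;
        unsigned char buf[12];

        pack_wire_stat (&in, buf);
        unpack_wire_stat (buf, &out);
        printf ("size=%llu uid=%u\n",
                (unsigned long long) out.ia_size, out.ia_uid);
        return 0;
}

If the sender instead memcpy()s the struct as-is, a mixed-endian client/server pair will misinterpret every multi-byte field.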

Comment 47 Cnaik 2018-05-18 05:01:43 UTC
(In reply to Nithya Balachandran from comment #46)
> > Since, this code is autogenerated, I am still thinking how to make this
> > iterative.
> 
> Would removing the readdir rsp definition from the .x file and writing a
> separate .c file for the iterative method  work?
> 
> The other thing to keep in mind it s390x may not be officially supported as
> a server platform. We could run into endianess issues (DHT-IATT-IN_DICT)

With respect to the comment "We could run into endianness issues (DHT-IATT-IN_DICT)": the test passes on RHEL distributions, hence it looks like the issue might not be related to endianness.

Comment 48 Nithya Balachandran 2018-05-18 05:24:15 UTC
(In reply to Cnaik from comment #47)
> (In reply to Nithya Balachandran from comment #46)
> > > Since, this code is autogenerated, I am still thinking how to make this
> > > iterative.
> > 
> > Would removing the readdir rsp definition from the .x file and writing a
> > separate .c file for the iterative method  work?
> > 
> > The other thing to keep in mind it s390x may not be officially supported as
> > a server platform. We could run into endianess issues (DHT-IATT-IN_DICT)
> 
> With respect to the comment "We could run into endianess issues
> (DHT-IATT-IN_DICT)" - The test passes on RHEL distributions. Hence looks
> like the issue might not be related to endianess.


The endianness I was referring to was the endianness of s390x vs. the client machines. What I meant was that even if this issue you are seeing with weighted-rebalance is fixed, you could still run into other issues later. It is unrelated to the issue you are facing.

Comment 49 Nithya Balachandran 2018-05-18 13:01:48 UTC
On the gluster IRC channel, "ajax" confirmed that doubling IOT_THREAD_STACK_SIZE fixed the problem. This confirms that it is a stack overflow.

It is not clear why the test passed on RHEL.
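
For anyone who wants to verify this on a similar platform: the mechanism behind that change is simply a larger per-thread stack for the brick's worker threads. The snippet below is an illustration only, not the actual io-threads code; the 256 KiB default and the helper function are assumptions:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Assumed default; doubling it is the change reported to fix the test. */
#ifndef IOT_THREAD_STACK_SIZE
#define IOT_THREAD_STACK_SIZE (256 * 1024)
#endif

static void *
worker (void *arg)
{
        /* Deep call chains (e.g. XDR-encoding a long dentry list) run here. */
        (void) arg;
        return NULL;
}

/* Create a worker thread with an explicit, enlarged stack. */
static int
spawn_worker_with_stack (pthread_t *tid, size_t stack_size)
{
        pthread_attr_t attr;
        int ret;

        ret = pthread_attr_init (&attr);
        if (ret != 0)
                return ret;

        ret = pthread_attr_setstacksize (&attr, stack_size);
        if (ret == 0)
                ret = pthread_create (tid, &attr, worker, NULL);

        pthread_attr_destroy (&attr);
        return ret;
}

int
main (void)
{
        pthread_t tid;
        int ret = spawn_worker_with_stack (&tid, 2 * IOT_THREAD_STACK_SIZE);

        if (ret != 0) {
                fprintf (stderr, "failed to spawn worker: %s\n", strerror (ret));
                return 1;
        }
        pthread_join (tid, NULL);
        return 0;
}

Build with -pthread. A stack overflow that disappears with a bigger per-thread stack is consistent with the recursive XDR encoding discussed above.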

Comment 50 Cnaik 2018-05-29 10:03:19 UTC
(In reply to Nithya Balachandran from comment #49)
> On the gluster IRC channel, "ajax" confirmed that doubling
> IOT_THREAD_STACK_SIZE fixed the problem. This confirms that it is a stack
> overflow.
> 
> It is not clear as to why the test passed on RHEL.

Thank you for providing your valuable inputs in debugging this issue.

Comment 51 Shyamsundar 2018-06-20 18:25:03 UTC
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of gluster or against the mainline gluster repository, request that it be reopened and the Version field be marked appropriately.

