Bug 1083668 - BVT: profile tests hang because of Gluster issues
Summary: BVT: profile tests hang because of Gluster issues
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Avra Sengupta
QA Contact: Lalatendu Mohanty
URL:
Whiteboard:
Depends On:
Blocks: 1095097
 
Reported: 2014-04-02 16:50 UTC by Lalatendu Mohanty
Modified: 2015-05-15 17:50 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.6.0.1-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1095097 (view as bug list)
Environment:
Last Closed: 2014-09-22 19:33:30 UTC
Embargoed:


Attachments
Brick log (53.08 KB, text/x-log)
2014-04-10 09:19 UTC, Lalatendu Mohanty
Glusterd Logs (66.64 KB, text/x-log)
2014-04-10 09:20 UTC, Lalatendu Mohanty
/var/log/messages (448.71 KB, text/plain)
2014-04-10 09:21 UTC, Lalatendu Mohanty
gluster and valgrind logs (59.85 KB, application/x-gzip)
2014-04-24 07:47 UTC, Lalatendu Mohanty


Links
Red Hat Product Errata RHEA-2014:1278 (Priority: normal, Status: SHIPPED_LIVE): Red Hat Storage Server 3.0 bug fix and enhancement update, last updated 2014-09-22 23:26:55 UTC

Description Lalatendu Mohanty 2014-04-02 16:50:33 UTC
Description of problem:

The automation run for the profile tests caused a segmentation fault in glusterfsd.


Version-Release number of selected component (if applicable):
glusterfs-3.5qa2-0.294.git00802b3.el6rhs.x86_64

How reproducible:
This issue occurred during a BVT run. I will try to reproduce it and report the frequency of occurrence in this bug.

Steps to Reproduce:

I am not sure whether these are the exact steps to reproduce, but this is the last test that ran before the segmentation fault. A consolidated command sketch follows step 5.

1. Create gluster volume (replica = 2)
2. Start profiling on the volume, i.e. gluster volume profile $volname start
3. Mount the volume on a client and create files on it, for example:

for i in `seq 1 100`; do
    dd if=/dev/zero of=$MOUNT_POINT/$i bs=128K count=100 2>&1 1>/dev/null
    dd if=$MOUNT_POINT/$i of=/dev/zero bs=128K count=100 2>&1 1>/dev/null
done

4. Add two bricks and do rebalance.

5. Remove both bricks, e.g. gluster volume remove-brick $volname $remove_brick_list --mode=script
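
A consolidated sketch of the sequence above, using two nodes (node1, node2) and placeholder volume, brick, and mount-point names rather than the exact BVT configuration:

VOLNAME=testvol
MOUNT_POINT=/mnt/$VOLNAME

# 1. Create and start a 2-way replicated volume (hosts and brick paths are placeholders)
gluster volume create $VOLNAME replica 2 \
    node1:/bricks/${VOLNAME}_brick1 node2:/bricks/${VOLNAME}_brick2
gluster volume start $VOLNAME

# 2. Enable profiling on the volume
gluster volume profile $VOLNAME start

# 3. Mount the volume on a client and run the dd loop above against $MOUNT_POINT
mkdir -p $MOUNT_POINT
mount -t glusterfs node1:/$VOLNAME $MOUNT_POINT

# 4. Add two bricks and start a rebalance
gluster volume add-brick $VOLNAME \
    node1:/bricks/${VOLNAME}_brick3 node2:/bricks/${VOLNAME}_brick4
gluster volume rebalance $VOLNAME start

# 5. Remove the newly added bricks non-interactively, as in step 5 above
gluster volume remove-brick $VOLNAME \
    node1:/bricks/${VOLNAME}_brick3 node2:/bricks/${VOLNAME}_brick4 --mode=script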

Actual results:

A segmentation fault occurred and a core was generated for glusterfsd.

Expected results:

No segmentation fault; the test sequence should complete successfully without glusterfsd generating a core.

Additional info:

The backtrace from the generated core is below; symbol information was not available, so the frames are unresolved.

[New Thread 17305]
[New Thread 17299]
[New Thread 17300]
[New Thread 17302]
[New Thread 17301]
[New Thread 17304]
[New Thread 17298]
Core was generated by `/usr/sbin/glusterfsd -s rhsauto022.lab.eng.blr.redhat.com --volfile-id hosdu.rh'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f3af975d867 in ?? ()

Thread 7 (Thread 17298):
#0  0x00007f3affacc264 in ?? ()
No symbol table info available.
#1  0x00007f3af9963b20 in ?? ()
No symbol table info available.
#2  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 6 (Thread 17304):
#0  0x00007f3affacc264 in ?? ()
No symbol table info available.
#1  0x00007f3af9963b20 in ?? ()
No symbol table info available.
#2  0x00007f3af38fa9b8 in ?? ()
No symbol table info available.
#3  0x00007f3affac7523 in ?? ()
No symbol table info available.
#4  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 5 (Thread 17301):
#0  0x00007f3affac998e in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 4 (Thread 17302):
#0  0x00007f3affaccf3d in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 3 (Thread 17300):
#0  0x00007f3affac998e in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 2 (Thread 17299):
#0  0x00007f3affacd4b5 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 1 (Thread 17305):
#0  0x00007f3af975d867 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Comment 2 Lalatendu Mohanty 2014-04-10 09:18:08 UTC
This is again reproducible in BVT with the latest build, glusterfs-3.5qa2-0.323.git6567d14.el6rhs, and it reproduces in every BVT run.

Because of this bug, glusterfsd and glusterd stop working, so the test automation hangs as well. Hence it is also a blocker for BVT.

Comment 3 Lalatendu Mohanty 2014-04-10 09:19:43 UTC
Created attachment 884818 [details]
Brick log

Comment 4 Lalatendu Mohanty 2014-04-10 09:20:29 UTC
Created attachment 884819 [details]
Glusterd Logs

Comment 5 Lalatendu Mohanty 2014-04-10 09:21:21 UTC
Created attachment 884820 [details]
/var/log/messages

Comment 6 Lalatendu Mohanty 2014-04-10 09:23:21 UTC
These are the errors seen in the brick log. For detailed logs, please check the attached brick log.

[2014-04-09 10:46:17.617163] E [common-utils.c:222:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2014-04-09 10:46:17.617208] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-glusterfs: DNS resolution failed on host rhsauto057.lab.eng.blr.redhat.com
[2014-04-09 10:52:56.915794] E [posix.c:5317:init] 0-hosdu-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2014-04-09 10:52:56.915859] E [xlator.c:406:xlator_init] 0-hosdu-posix: Initialization of volume 'hosdu-posix' failed, review your volfile again
[2014-04-09 10:52:56.916712] E [graph.c:307:glusterfs_graph_init] 0-hosdu-posix: initializing translator failed
[2014-04-09 10:52:56.916738] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed
[2014-04-09 10:53:06.697366] E [posix.c:5317:init] 0-hosdu-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2014-04-09 10:53:06.697403] E [xlator.c:406:xlator_init] 0-hosdu-posix: Initialization of volume 'hosdu-posix' failed, review your volfile again
[2014-04-09 10:53:06.697453] E [graph.c:307:glusterfs_graph_init] 0-hosdu-posix: initializing translator failed
[2014-04-09 10:53:06.697472] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed
[2014-04-09 11:00:09.470458] E [posix.c:5317:init] 0-hosdu-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2014-04-09 11:00:09.470499] E [xlator.c:406:xlator_init] 0-hosdu-posix: Initialization of volume 'hosdu-posix' failed, review your volfile again
[2014-04-09 11:00:09.470517] E [graph.c:307:glusterfs_graph_init] 0-hosdu-posix: initializing translator failed
[2014-04-09 11:00:09.470531] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed

Comment 7 Raghavendra G 2014-04-15 07:25:33 UTC
The backtrace provided does not contain any useful information, and the log files are not of much help either.

1. Is it possible to run these tests under valgrind? Starting glusterd with --xlator-option=*.run-with-valgrind=yes will start each brick process under valgrind. The valgrind logs can be found in <install-directory>/var/log/glusterfs/*valgrind*.

2. Are debuginfo rpms installed on these testbeds? The backtrace has no symbol information, which suggests they are not. If not, can we install debuginfo and rerun the tests? (A sketch follows below.)
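
A minimal sketch of that setup, assuming a yum-based RHS node with yum-utils available (the debuginfo package name and the log location follow the pattern above and may differ on the testbeds):

# Install symbol tables so the core backtraces resolve
debuginfo-install glusterfs

# Restart glusterd so that the brick processes come up under valgrind
service glusterd stop
glusterd --xlator-option='*.run-with-valgrind=yes'   # quoted so the shell does not expand the *

# After reproducing the crash, collect the per-brick valgrind logs
ls /var/log/glusterfs/*valgrind*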

regards,
Raghavendra.

Comment 8 Raghavendra G 2014-04-15 07:28:30 UTC
I am assuming it is the brick process that crashed. If not, we would need valgrind logs for whichever process crashed.

Comment 9 Lalatendu Mohanty 2014-04-24 07:41:42 UTC
With the latest build, i.e. glusterfs-3.5qa2-0.340.gitc193996.el6rhs.x86_64, a core file is no longer generated, but glusterfsd is dying for two bricks (one on each node).

From the brick log bricks-hosdu_brick1.log:

[2014-04-24 05:41:46.592134] I [client_t.c:294:gf_client_put] 0-hosdu-server: Shutting down connection rhsauto067.lab.eng.blr.redhat.com-15309-2014/04/24-05:40:28:953055-hosdu-client-1-0-0
[2014-04-24 05:41:47.641653] W [MSGID: 100032] [glusterfsd.c:1130:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2014-04-24 05:41:47.642465] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed

From the valgrind log for the same glusterfsd process (valgrind-bricks-hosdu_brick1.log):

==10010== 5,328 (4,752 direct, 576 indirect) bytes in 18 blocks are definitely lost in loss record 248 of 275
==10010==    at 0x4C2677B: calloc (vg_replace_malloc.c:593)
==10010==    by 0x4E79822: __gf_calloc (mem-pool.h:88)
==10010==    by 0x4E68D2A: __inode_create (inode.c:531)
==10010==    by 0x4E69112: inode_new (inode.c:562)
==10010==    by 0x1337CC30: server_lookup_resume (server-rpc-fops.c:3010)
==10010==    by 0x1336171D: server_resolve_done (server-resolve.c:541)
==10010==    by 0x13361F1C: server_resolve_all (server-resolve.c:576)
==10010==    by 0x13361E2C: server_resolve (server-resolve.c:525)
==10010==    by 0x13361F5D: server_resolve_all (server-resolve.c:572)
==10010==    by 0x13361F94: server_resolve_entry (server-resolve.c:325)
==10010==    by 0x13361E37: server_resolve (server-resolve.c:510)
==10010==    by 0x13361F3D: server_resolve_all (server-resolve.c:565)
==10010== 
==10010== LEAK SUMMARY:
==10010==    definitely lost: 6,034 bytes in 86 blocks
==10010==    indirectly lost: 875 bytes in 22 blocks
==10010==      possibly lost: 4,136 bytes in 18 blocks
==10010==    still reachable: 16,575,371 bytes in 624 blocks
==10010==         suppressed: 0 bytes in 0 blocks
==10010== Reachable blocks (those to which a pointer was found) are not shown.
==10010== To see them, rerun with: --leak-check=full --show-reachable=yes
==10010== 
==10010== For counts of detected and suppressed errors, rerun with: -v
==10010== ERROR SUMMARY: 37 errors from 37 contexts (suppressed: 69 from 9)

I will also upload all the logs to the bug.

Comment 10 Lalatendu Mohanty 2014-04-24 07:47:45 UTC
Created attachment 889196 [details]
gluster and valgrind logs

Comment 11 Lalatendu Mohanty 2014-04-24 08:02:37 UTC
Volume information

[root@rhsauto008 ~]# gluster v info
 
Volume Name: hosdu
Type: Distributed-Replicate
Volume ID: 26d5d531-1f03-4f86-926c-2a17835d721e
Status: Started
Snap Volume: no
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick2
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick3
Brick3: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick4
Brick4: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick5
Brick5: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick6
Brick6: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick7
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

Comment 12 Lalatendu Mohanty 2014-04-24 08:07:20 UTC
During the automation run I observed the following issues on the test machines.

Once the warning below is seen in the brick logs, some gluster commands start returning "Another transaction is in progress for hosdu. Please try again after sometime.", while other commands such as gluster v info continue to work fine. See the details below.


[2014-04-24 02:25:49.150889] W [MSGID: 100032] [glusterfsd.c:1130:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2014-04-24 02:25:49.151853] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed

[root@rhsauto022 ~]# gluster v status
Another transaction is in progress for hosdu. Please try again after sometime.

[root@rhsauto022 ~]# gluster volume profile hosdu info
Another transaction is in progress for hosdu. Please try again after sometime.
 
[root@rhsauto022 ~]# gluster v info
 
Volume Name: hosdu
Type: Distributed-Replicate
Volume ID: 26d5d531-1f03-4f86-926c-2a17835d721e
Status: Started
Snap Volume: no
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick2
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick3
Brick3: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick4
Brick4: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick5
Brick5: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick6
Brick6: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick7
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

[root@rhsauto022 ~]# gluster v status
Another transaction is in progress for hosdu. Please try again after sometime.

Comment 13 Lalatendu Mohanty 2014-04-29 13:48:44 UTC
With build glusterfs-3.5qa2-0.369.git500a656, I can reproduce the issue described in my previous comments (comment #9). Because of this issue BVT gets stuck and eventually Beaker times out.

Comment 14 Avra Sengupta 2014-05-06 11:26:55 UTC
Since glusterd_do_replace_brick() is spawned through gf_timer_call_after(), the event has already been freed by the time it is called, and the txn_id is lost. The fix therefore uses a calloc-ed copy of the txn_id, which is freed as part of the rb_ctx dict.

Comment 15 Lalatendu Mohanty 2014-05-06 11:51:30 UTC
Patch: http://review.gluster.org/#/c/7686/

Comment 16 Nagaprasad Sathyanarayana 2014-05-19 10:56:33 UTC
Setting flags required to add BZs to RHS 3.0 Errata

Comment 17 Lalatendu Mohanty 2014-05-22 13:52:52 UTC
Tested with glusterfs-3.6.0.5-1.el6rhs; I am still seeing the issue. Working on getting more information on this bug.

Comment 18 Lalatendu Mohanty 2014-05-23 12:19:28 UTC
This time (refer to the previous comment) the test failed because of an automation issue: in RHS 3.0, replace-brick commands will not work without "--mode=script", since replace-brick is deprecated and the command asks for confirmation, as below.
    
    All replace-brick commands except commit force are deprecated.
    Do you want to continue? (y/n)
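
For illustration, a non-interactive invocation along the lines the automation needed (the destination brick path here is hypothetical; per the prompt above, only commit force is still supported):

gluster volume replace-brick hosdu \
    rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick2 \
    rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick8 \
    commit force --mode=script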

After fixing the automation, the run completed fine (with build glusterfs-3.6.0.5-1.el6rhs). Hence I am marking the bug as verified.

Comment 20 errata-xmlrpc 2014-09-22 19:33:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

