Bug 1083668
| Summary: | BVT: profile tests hang because of Gluster issues | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Lalatendu Mohanty <lmohanty> |
| Component: | glusterd | Assignee: | Avra Sengupta <asengupt> |
| Status: | CLOSED ERRATA | QA Contact: | Lalatendu Mohanty <lmohanty> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | rhgs-3.0 | CC: | lmohanty, nlevinki, nsathyan, rcyriac, ssamanta, vagarwal, vbellur |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.0.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.6.0.1-1.el6rhs | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1095097 (view as bug list) | Environment: | |
| Last Closed: | 2014-09-22 19:33:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1095097 | | |
| Attachments: | Brick log, Glusterd Logs, /var/log/messages, gluster and valgrind logs | | |
Description
Lalatendu Mohanty
2014-04-02 16:50:33 UTC
This is again reproducible in BVT with the latest build, glusterfs-3.5qa2-0.323.git6567d14.el6rhs, and it is reproducible in all runs of BVT. Because of this bug, glusterfsd and glusterd stop working and hence the test automation also hangs, so it is a blocker for BVT as well.

Created attachment 884818 [details]: Brick log
Created attachment 884819 [details]: Glusterd Logs
Created attachment 884820 [details]: /var/log/messages
These are the errors seen in the brick log. For detailed logs, please check the attached brick log.

[2014-04-09 10:46:17.617163] E [common-utils.c:222:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2014-04-09 10:46:17.617208] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-glusterfs: DNS resolution failed on host rhsauto057.lab.eng.blr.redhat.com
[2014-04-09 10:52:56.915794] E [posix.c:5317:init] 0-hosdu-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2014-04-09 10:52:56.915859] E [xlator.c:406:xlator_init] 0-hosdu-posix: Initialization of volume 'hosdu-posix' failed, review your volfile again
[2014-04-09 10:52:56.916712] E [graph.c:307:glusterfs_graph_init] 0-hosdu-posix: initializing translator failed
[2014-04-09 10:52:56.916738] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed
[2014-04-09 10:53:06.697366] E [posix.c:5317:init] 0-hosdu-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2014-04-09 10:53:06.697403] E [xlator.c:406:xlator_init] 0-hosdu-posix: Initialization of volume 'hosdu-posix' failed, review your volfile again
[2014-04-09 10:53:06.697453] E [graph.c:307:glusterfs_graph_init] 0-hosdu-posix: initializing translator failed
[2014-04-09 10:53:06.697472] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed
[2014-04-09 11:00:09.470458] E [posix.c:5317:init] 0-hosdu-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2014-04-09 11:00:09.470499] E [xlator.c:406:xlator_init] 0-hosdu-posix: Initialization of volume 'hosdu-posix' failed, review your volfile again
[2014-04-09 11:00:09.470517] E [graph.c:307:glusterfs_graph_init] 0-hosdu-posix: initializing translator failed
[2014-04-09 11:00:09.470531] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed

The backtrace provided doesn't contain any useful information, and the log files too are of no help.

1. Is it possible to run these tests under valgrind? Starting glusterd with --xlator-option=*.run-with-valgrind=yes will start each brick process under valgrind. The valgrind logs can be found in <install-directory>/var/log/glusterfs/*valgrind*.
2. Are debuginfo rpms installed on these testbeds? The missing symbol tables suggest they are not. If not, can we install debuginfo and rerun the tests?

regards,
Raghavendra.

I am assuming the brick process crashed. If not, we would need valgrind logs for the process which crashed.
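For reference, a minimal sketch of how the two suggestions above could be applied on one of the test nodes; the service command, log path, and debuginfo package name are assumptions for a standard el6 RHS install, not details taken from this bug:

```sh
# Sketch only, assuming a standard el6 RHS node (adjust paths and names as needed).

# 1. Restart glusterd so that brick processes are started under valgrind,
#    as suggested in the comment above.
service glusterd stop
glusterd --xlator-option='*.run-with-valgrind=yes'

# Per-brick valgrind output should then appear under the glusterfs log directory.
ls /var/log/glusterfs/*valgrind*

# 2. Install debuginfo packages so that backtraces carry symbols, then rerun the tests.
debuginfo-install glusterfs
```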
With the latest build, i.e. glusterfs-3.5qa2-0.340.gitc193996.el6rhs.x86_64, a core file is not getting generated but glusterfsd is dying for two bricks (one in each node).

From the brick log, bricks-hosdu_brick1.log:

[2014-04-24 05:41:46.592134] I [client_t.c:294:gf_client_put] 0-hosdu-server: Shutting down connection rhsauto067.lab.eng.blr.redhat.com-15309-2014/04/24-05:40:28:953055-hosdu-client-1-0-0
[2014-04-24 05:41:47.641653] W [MSGID: 100032] [glusterfsd.c:1130:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2014-04-24 05:41:47.642465] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed

From the valgrind log for the same glusterfsd process (valgrind-bricks-hosdu_brick1.log):

==10010== 5,328 (4,752 direct, 576 indirect) bytes in 18 blocks are definitely lost in loss record 248 of 275
==10010==    at 0x4C2677B: calloc (vg_replace_malloc.c:593)
==10010==    by 0x4E79822: __gf_calloc (mem-pool.h:88)
==10010==    by 0x4E68D2A: __inode_create (inode.c:531)
==10010==    by 0x4E69112: inode_new (inode.c:562)
==10010==    by 0x1337CC30: server_lookup_resume (server-rpc-fops.c:3010)
==10010==    by 0x1336171D: server_resolve_done (server-resolve.c:541)
==10010==    by 0x13361F1C: server_resolve_all (server-resolve.c:576)
==10010==    by 0x13361E2C: server_resolve (server-resolve.c:525)
==10010==    by 0x13361F5D: server_resolve_all (server-resolve.c:572)
==10010==    by 0x13361F94: server_resolve_entry (server-resolve.c:325)
==10010==    by 0x13361E37: server_resolve (server-resolve.c:510)
==10010==    by 0x13361F3D: server_resolve_all (server-resolve.c:565)
==10010==
==10010== LEAK SUMMARY:
==10010==    definitely lost: 6,034 bytes in 86 blocks
==10010==    indirectly lost: 875 bytes in 22 blocks
==10010==    possibly lost: 4,136 bytes in 18 blocks
==10010==    still reachable: 16,575,371 bytes in 624 blocks
==10010==    suppressed: 0 bytes in 0 blocks
==10010== Reachable blocks (those to which a pointer was found) are not shown.
==10010== To see them, rerun with: --leak-check=full --show-reachable=yes
==10010==
==10010== For counts of detected and suppressed errors, rerun with: -v
==10010== ERROR SUMMARY: 37 errors from 37 contexts (suppressed: 69 from 9)

I will also upload all the logs to the bug.

Created attachment 889196 [details]: gluster and valgrind logs
Volume information:

[root@rhsauto008 ~]# gluster v info
Volume Name: hosdu
Type: Distributed-Replicate
Volume ID: 26d5d531-1f03-4f86-926c-2a17835d721e
Status: Started
Snap Volume: no
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick2
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick3
Brick3: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick4
Brick4: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick5
Brick5: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick6
Brick6: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick7
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

During the automation run I observed the following on the test machines. Once the warning below appeared in the brick logs, some gluster commands started returning "Another transaction is in progress for hosdu. Please try again after sometime.", while other commands such as gluster v info kept working fine. Details below.

[2014-04-24 02:25:49.150889] W [MSGID: 100032] [glusterfsd.c:1130:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2014-04-24 02:25:49.151853] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed

[root@rhsauto022 ~]# gluster v status
Another transaction is in progress for hosdu. Please try again after sometime.

[root@rhsauto022 ~]# gluster volume profile hosdu info
Another transaction is in progress for hosdu. Please try again after sometime.

[root@rhsauto022 ~]# gluster v info
Volume Name: hosdu
Type: Distributed-Replicate
Volume ID: 26d5d531-1f03-4f86-926c-2a17835d721e
Status: Started
Snap Volume: no
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick2
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick3
Brick3: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick4
Brick4: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick5
Brick5: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick6
Brick6: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick7
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

[root@rhsauto022 ~]# gluster v status
Another transaction is in progress for hosdu. Please try again after sometime.

With build glusterfs-3.5qa2-0.369.git500a656 I can reproduce the issue as described in my previous comments (comment #9). Because of this issue BVT gets stuck and eventually Beaker times out.

As glusterd_do_replace_brick() is spawned through gf_timer_call_after(), by the time it is called the event is freed and the txn_id is lost. Hence the fix uses a calloc-ed copy of the txn_id, which is freed as part of the rb_ctx dict.

Setting flags required to add BZs to RHS 3.0 Errata.

Tested with glusterfs-3.6.0.5-1.el6rhs. I am still seeing the issue. Working on getting some more info on this bug.

This time (refer to the previous comment) the test failed because of an automation issue: in RHS 3.0, replace-brick commands do not work without "--mode=script", since replace-brick is deprecated and the CLI asks for confirmation as below.

All replace-brick commands except commit force are deprecated. Do you want to continue? (y/n)

After fixing the automation, the run completed fine (with build glusterfs-3.6.0.5-1.el6rhs), hence marking the bug as verified.
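For anyone updating similar test automation, here is a minimal sketch of the non-interactive form described above; the volume name and brick paths are placeholders, not the values used in this test run:

```sh
# Sketch only: VOLNAME and the brick paths are placeholders.
# Interactively, the RHS 3.0 CLI asks for confirmation because replace-brick is
# deprecated, and an unattended run hangs on that prompt; --mode=script makes
# the CLI non-interactive so automation does not block.
gluster volume replace-brick VOLNAME \
    server1:/bricks/old_brick server2:/bricks/new_brick \
    commit force --mode=script
```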
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html