Bug 1083668
| Summary: | BVT: profile tests hang because of Gluster issues | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Lalatendu Mohanty <lmohanty> |
| Component: | glusterd | Assignee: | Avra Sengupta <asengupt> |
| Status: | CLOSED ERRATA | QA Contact: | Lalatendu Mohanty <lmohanty> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | rhgs-3.0 | CC: | lmohanty, nlevinki, nsathyan, rcyriac, ssamanta, vagarwal, vbellur |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.0.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.6.0.1-1.el6rhs | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1095097 (view as bug list) | Environment: | |
| Last Closed: | 2014-09-22 19:33:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1095097 | | |
| Attachments: | Brick log, Glusterd Logs, /var/log/messages, gluster and valgrind logs | | |
Description
Lalatendu Mohanty
2014-04-02 16:50:33 UTC
This is again reproducible in BVT with the latest build, glusterfs-3.5qa2-0.323.git6567d14.el6rhs, and it is reproducible in all runs of BVT. Because of this bug, glusterfsd and glusterd stop working and hence the test automation also hangs, so it is a blocker for BVT as well.

Created attachment 884818 [details]: Brick log
Created attachment 884819 [details]: Glusterd Logs
Created attachment 884820 [details]: /var/log/messages
These are the errors seen in the brick log. For detailed logs, please check the attached brick log.

[2014-04-09 10:46:17.617163] E [common-utils.c:222:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2014-04-09 10:46:17.617208] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-glusterfs: DNS resolution failed on host rhsauto057.lab.eng.blr.redhat.com
[2014-04-09 10:52:56.915794] E [posix.c:5317:init] 0-hosdu-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2014-04-09 10:52:56.915859] E [xlator.c:406:xlator_init] 0-hosdu-posix: Initialization of volume 'hosdu-posix' failed, review your volfile again
[2014-04-09 10:52:56.916712] E [graph.c:307:glusterfs_graph_init] 0-hosdu-posix: initializing translator failed
[2014-04-09 10:52:56.916738] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed
[2014-04-09 10:53:06.697366] E [posix.c:5317:init] 0-hosdu-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2014-04-09 10:53:06.697403] E [xlator.c:406:xlator_init] 0-hosdu-posix: Initialization of volume 'hosdu-posix' failed, review your volfile again
[2014-04-09 10:53:06.697453] E [graph.c:307:glusterfs_graph_init] 0-hosdu-posix: initializing translator failed
[2014-04-09 10:53:06.697472] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed
[2014-04-09 11:00:09.470458] E [posix.c:5317:init] 0-hosdu-posix: Extended attribute trusted.glusterfs.volume-id is absent
[2014-04-09 11:00:09.470499] E [xlator.c:406:xlator_init] 0-hosdu-posix: Initialization of volume 'hosdu-posix' failed, review your volfile again
[2014-04-09 11:00:09.470517] E [graph.c:307:glusterfs_graph_init] 0-hosdu-posix: initializing translator failed
[2014-04-09 11:00:09.470531] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed

The backtrace provided doesn't contain any useful information, and the log files too are of no help.

1. Is it possible to run these tests under valgrind? Starting glusterd with --xlator-option=*.run-with-valgrind=yes will start each brick process under valgrind. The valgrind logs can be found in <install-directory>/var/log/glusterfs/*valgrind*.
2. Are debuginfo rpms installed on these testbeds? The missing symbol tables suggest they are not. If not, can we install debuginfo and rerun the tests?

regards,
Raghavendra.

I am assuming the brick process crashed. If not, we would need valgrind logs for the process which crashed.
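For reference, a minimal sketch of how the two suggestions above could be applied on one of the test nodes; the service command, log path, and debuginfo package name are assumptions for a standard el6 RHS install, not details taken from this bug:

```sh
# Sketch only, assuming a standard el6 RHS node (adjust paths and names as needed).

# 1. Restart glusterd so that brick processes are started under valgrind,
#    as suggested in the comment above.
service glusterd stop
glusterd --xlator-option='*.run-with-valgrind=yes'

# Per-brick valgrind output should then appear under the glusterfs log directory.
ls /var/log/glusterfs/*valgrind*

# 2. Install debuginfo packages so that backtraces carry symbols, then rerun the tests.
debuginfo-install glusterfs
```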
With the latest build, i.e. glusterfs-3.5qa2-0.340.gitc193996.el6rhs.x86_64, a core file is not getting generated but glusterfsd is dying for two bricks (one in each node).

From the brick log, bricks-hosdu_brick1.log:

[2014-04-24 05:41:46.592134] I [client_t.c:294:gf_client_put] 0-hosdu-server: Shutting down connection rhsauto067.lab.eng.blr.redhat.com-15309-2014/04/24-05:40:28:953055-hosdu-client-1-0-0
[2014-04-24 05:41:47.641653] W [MSGID: 100032] [glusterfsd.c:1130:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2014-04-24 05:41:47.642465] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed

From the valgrind log for the same glusterfsd process (valgrind-bricks-hosdu_brick1.log):

==10010== 5,328 (4,752 direct, 576 indirect) bytes in 18 blocks are definitely lost in loss record 248 of 275
==10010==    at 0x4C2677B: calloc (vg_replace_malloc.c:593)
==10010==    by 0x4E79822: __gf_calloc (mem-pool.h:88)
==10010==    by 0x4E68D2A: __inode_create (inode.c:531)
==10010==    by 0x4E69112: inode_new (inode.c:562)
==10010==    by 0x1337CC30: server_lookup_resume (server-rpc-fops.c:3010)
==10010==    by 0x1336171D: server_resolve_done (server-resolve.c:541)
==10010==    by 0x13361F1C: server_resolve_all (server-resolve.c:576)
==10010==    by 0x13361E2C: server_resolve (server-resolve.c:525)
==10010==    by 0x13361F5D: server_resolve_all (server-resolve.c:572)
==10010==    by 0x13361F94: server_resolve_entry (server-resolve.c:325)
==10010==    by 0x13361E37: server_resolve (server-resolve.c:510)
==10010==    by 0x13361F3D: server_resolve_all (server-resolve.c:565)
==10010==
==10010== LEAK SUMMARY:
==10010==    definitely lost: 6,034 bytes in 86 blocks
==10010==    indirectly lost: 875 bytes in 22 blocks
==10010==    possibly lost: 4,136 bytes in 18 blocks
==10010==    still reachable: 16,575,371 bytes in 624 blocks
==10010==    suppressed: 0 bytes in 0 blocks
==10010== Reachable blocks (those to which a pointer was found) are not shown.
==10010== To see them, rerun with: --leak-check=full --show-reachable=yes
==10010==
==10010== For counts of detected and suppressed errors, rerun with: -v
==10010== ERROR SUMMARY: 37 errors from 37 contexts (suppressed: 69 from 9)

I will also upload all the logs to the bug.

Created attachment 889196 [details]: gluster and valgrind logs
Volume information:

[root@rhsauto008 ~]# gluster v info
Volume Name: hosdu
Type: Distributed-Replicate
Volume ID: 26d5d531-1f03-4f86-926c-2a17835d721e
Status: Started
Snap Volume: no
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick2
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick3
Brick3: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick4
Brick4: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick5
Brick5: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick6
Brick6: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick7
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

During the automation run I observed the following on the test machines. Once the warning below appeared in the brick logs, some gluster commands started returning "Another transaction is in progress for hosdu. Please try again after sometime.", while other commands such as gluster v info kept working fine. Details below.

[2014-04-24 02:25:49.150889] W [MSGID: 100032] [glusterfsd.c:1130:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2014-04-24 02:25:49.151853] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed

[root@rhsauto022 ~]# gluster v status
Another transaction is in progress for hosdu. Please try again after sometime.

[root@rhsauto022 ~]# gluster volume profile hosdu info
Another transaction is in progress for hosdu. Please try again after sometime.

[root@rhsauto022 ~]# gluster v info
Volume Name: hosdu
Type: Distributed-Replicate
Volume ID: 26d5d531-1f03-4f86-926c-2a17835d721e
Status: Started
Snap Volume: no
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick2
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick3
Brick3: rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick4
Brick4: rhsauto008.lab.eng.blr.redhat.com:/bricks/hosdu_brick5
Brick5: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick6
Brick6: rhsauto067.lab.eng.blr.redhat.com:/bricks/hosdu_brick7
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

[root@rhsauto022 ~]# gluster v status
Another transaction is in progress for hosdu. Please try again after sometime.

With build glusterfs-3.5qa2-0.369.git500a656 I can reproduce the issue as described in my previous comments (comment #9). Because of this issue BVT gets stuck and eventually Beaker times out.

As glusterd_do_replace_brick() is spawned through gf_timer_call_after(), by the time it is called the event is freed and the txn_id is lost. Hence the fix uses a calloc-ed copy of the txn_id, which is freed as part of the rb_ctx dict.

Setting flags required to add BZs to RHS 3.0 Errata.

Tested with glusterfs-3.6.0.5-1.el6rhs. I am still seeing the issue. Working on getting some more info on this bug.

This time (refer to the previous comment) the test failed because of an automation issue: in RHS 3.0, replace-brick commands do not work without "--mode=script", since replace-brick is deprecated and the CLI asks for confirmation as below.

All replace-brick commands except commit force are deprecated. Do you want to continue? (y/n)

After fixing the automation, the run completed fine (with build glusterfs-3.6.0.5-1.el6rhs), hence marking the bug as verified.
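For anyone updating similar test automation, here is a minimal sketch of the non-interactive form described above; the volume name and brick paths are placeholders, not the values used in this test run:

```sh
# Sketch only: VOLNAME and the brick paths are placeholders.
# Interactively, the RHS 3.0 CLI asks for confirmation because replace-brick is
# deprecated, and an unattended run hangs on that prompt; --mode=script makes
# the CLI non-interactive so automation does not block.
gluster volume replace-brick VOLNAME \
    server1:/bricks/old_brick server2:/bricks/new_brick \
    commit force --mode=script
```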
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html