1608564 – line coverage tests failing consistently over a week

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	tests
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1608566 1608568 1613310
Blocks:

Reported:	2018-07-25 20:07 UTC by Shyamsundar
Modified:	2018-10-23 15:15 UTC
CC List:	1 user
Fixed In Version:	glusterfs-5.0
Clone Of:
Clones:	1608566 1608568 1613310
Environment:
Last Closed:	2018-10-23 15:15:30 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Shyamsundar 2018-07-25 20:07:25 UTC

The nightly line coverage tests have been failing consistently for the past few weeks. The failures are as follows:

2 test(s) failed 
./tests/basic/sdfs-sanity.t
./tests/bugs/core/bug-1432542-mpx-restart-crash.t

1 test(s) generated core 
./tests/basic/sdfs-sanity.t

a) ./tests/bugs/core/bug-1432542-mpx-restart-crash.t

This test is timing out. My proposal is to increase the timeout for this test, since the line coverage runs take longer (presumably because lcov instrumentation slows things down).

For example, the times taken by the following tests in centos7 regression builds are:
./tests/bugs/index/bug-1559004-EMLINK-handling.t  -  896 seconds
./tests/bugs/core/bug-1432542-mpx-restart-crash.t  -  309 seconds
./tests/basic/afr/lk-quorum.t  -  225 seconds

In lcov runs, the same tests take:
./tests/bugs/index/bug-1559004-EMLINK-handling.t  -  1063 seconds
./tests/bugs/core/bug-1432542-mpx-restart-crash.t  -  400 seconds (timeout)
./tests/basic/afr/lk-quorum.t  -  267 seconds

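As can be seen, the lcov runs add roughly 19 seconds for every 100 seconds of a normal run for the two tests that complete, and at least 29 seconds per 100 for the mpx test, which hits the timeout.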

We need to reproduce this locally and check whether increasing the timeout for the mpx test resolves (a).
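
As a quick check of the figures above, the slowdown can be computed directly from the reported timings. This is only a sketch: the function name `overhead_per_100s` is made up for illustration, and the 400-second value for the mpx test is the point at which it timed out, so its overhead is a lower bound.

```python
# Timings in seconds copied from the report:
# (normal centos7 regression run, lcov run).
timings = {
    "bug-1559004-EMLINK-handling.t": (896, 1063),
    # The lcov run hit the timeout, so 400 is only a lower bound.
    "bug-1432542-mpx-restart-crash.t": (309, 400),
    "lk-quorum.t": (225, 267),
}


def overhead_per_100s(normal, lcov):
    """Extra seconds the lcov run takes per 100 seconds of normal run time."""
    return (lcov - normal) * 100.0 / normal


overheads = {
    name: overhead_per_100s(normal, lcov)
    for name, (normal, lcov) in timings.items()
}

for name, extra in overheads.items():
    print(f"{name}: +{extra:.1f}s per 100s of normal run")
```

The two tests that run to completion show nearly the same relative slowdown, which supports the idea that the overhead scales with run time rather than being a fixed cost.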

b) ./tests/basic/sdfs-sanity.t

This test results in a core for glusterd, and as a result the test fails. The core is common across runs and looks as follows:

See: https://build.gluster.org/job/line-coverage/391/consoleFull
(gdb) t 1
#0  0x00007f241d1f0c11 in __strnlen_sse2 () from ./lib64/libc.so.6

(gdb) f 2
#2  0x00007f241ec82d66 in xlator_volume_option_get_list (vol_list=0x7f2404203570, key=0x7f241366fee0 "features") at options.c:933

(gdb) p opt[0]
$7 = {key = {0x7f240de27c6d "pass-through", 0x0, 0x0, 0x0}, type = GF_OPTION_TYPE_BOOL, min = 0, max = 0, value = {0x0 <repeats 64 times>}, default_value = 0x7f240de27c7a "false", 
  description = 0x7f240de27c80 "Enable/Disable dentry serialize functionality", validate = GF_OPT_VALIDATE_BOTH, op_version = {40100, 0, 0, 0}, deprecated = {0, 0, 0, 0}, flags = 35, tags = {
    0x7f240de27cae "sdfs", 0x0 <repeats 63 times>}, setkey = 0x0, level = OPT_STATUS_ADVANCED}

(gdb) p opt[1]
$8 = {key = {0x7f240e02a600 "", 0xc0a2c6690000007b <error: Cannot access memory at address 0xc0a2c6690000007b>, 0x7cb34af9 <error: Cannot access memory at address 0x7cb34af9>, 
    0x2 <error: Cannot access memory at address 0x2>}, type = 235060032, min = 0, max = 0, value = {0x0, 0x7f240e02a600 "", 0x392413ac0000007a <error: Cannot access memory at address 0x392413ac0000007a>, ...

(gdb) p index
$11 = 1

(gdb) p cmp_key 
$9 = 0xc0a2c6690000007b <error: Cannot access memory at address 0xc0a2c6690000007b>

The above needs further debugging to get to the root cause of the failure for (b).
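
One hypothesis consistent with the gdb output (not confirmed in this report) is that the option table is walked past its last valid entry: opt[0] is intact, opt[1] is garbage, and index is 1. The Python model below sketches how a table walk that relies on a terminator entry behaves with and without that terminator. The names `find_option`, `TERMINATOR`, and the dictionary layout are illustrative assumptions, not the code in options.c.

```python
# Illustrative model only; this is not the GlusterFS implementation.
# Each entry holds a list of key aliases; an entry whose first key is None
# plays the role of the terminator that ends the table.
TERMINATOR = {"key": [None]}

pass_through = {"key": ["pass-through", None], "default_value": "false"}

terminated = [pass_through, TERMINATOR]
unterminated = [pass_through]  # hypothetical table missing its terminator


def find_option(table, name):
    """Walk the table until the terminator; return the matching entry or None."""
    index = 0
    while table[index]["key"][0] is not None:
        for cmp_key in table[index]["key"]:
            if cmp_key is None:  # no more aliases for this entry
                break
            if cmp_key == name:
                return table[index]
        index += 1
    return None


print(find_option(terminated, "features"))  # stops cleanly at the terminator

try:
    find_option(unterminated, "features")
except IndexError:
    # Python bounds-checks the access; C would instead read whatever memory
    # follows the array, as opt[1] appears to show in the core.
    print("walked past the end of the table")
```

In C there is no bounds check, so a walk like this would keep reading adjacent memory until it happened to find a null pointer or crashed, which would match the invalid cmp_key seen above.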

Comment 1 Worker Ant 2018-08-06 17:58:50 UTC

REVIEW: https://review.gluster.org/20648 (tests: Add timeout option to run-tests.sh) posted (#1) for review on master by Shyamsundar Ranganathan

Comment 2 Worker Ant 2018-08-07 05:37:30 UTC

COMMIT: https://review.gluster.org/20648 committed in master by "Shyamsundar Ranganathan" <srangana> with a commit message- tests: Add timeout option to run-tests.sh

Added a '-t' timeout option to run-tests.sh, to be able to
set this to higher than the default 200 in case of lcov
based tests, as those take more time due to instrumentations
added by lcov.

Change-Id: Ibaf70e881bfa94f35e822124bcf9849b309e7cc1
Updates: bz#1608564
Signed-off-by: ShyamsundarR <srangana>
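
To illustrate what a configurable per-test timeout does, here is a minimal Python sketch of running a command under a time limit. It is not how run-tests.sh is implemented; `run_with_timeout` and `DEFAULT_TIMEOUT` are hypothetical names, and only the default of 200 is taken from the commit message above.

```python
import subprocess
import sys

# Default taken from the commit message; assumed here to be in seconds.
DEFAULT_TIMEOUT = 200


def run_with_timeout(cmd, timeout=DEFAULT_TIMEOUT):
    """Run cmd; return its exit code, or None if it exceeded the timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return None


# A command that finishes quickly, and one that outlives a short limit.
quick = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]

quick_result = run_with_timeout(quick)
slow_result = run_with_timeout(slow, timeout=0.5)
print("quick:", quick_result, "slow:", slow_result)
```

Making the limit a parameter rather than a constant is what lets slower environments, such as lcov-instrumented builds, raise it without changing the tests themselves.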

Comment 3 Shyamsundar 2018-10-23 15:15:30 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/
