Bug 1369401 - NetBSD hangs at /tests/features/lock_revocation.t [NEEDINFO]
Summary: NetBSD hangs at /tests/features/lock_revocation.t
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: GlusterFS
Classification: Community
Component: locks
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-08-23 10:16 UTC by Nigel Babu
Modified: 2018-08-29 03:53 UTC (History)
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-29 03:53:15 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
rtalur: needinfo? (pkarampu)


Attachments
ps-axl (4.80 KB, text/plain)
2016-08-23 10:38 UTC, Nigel Babu

Description Nigel Babu 2016-08-23 10:16:22 UTC
We've had this happen fairly consistently. Right now, there are 4 jobs hung at that test. Could we please debug what's going on here and fix it?

https://build.gluster.org/job/netbsd7-regression/165/consoleFull
https://build.gluster.org/job/netbsd7-regression/164/consoleFull
https://build.gluster.org/job/netbsd7-regression/150/consoleFull
https://build.gluster.org/job/netbsd7-regression/142/consoleFull

If you need information from the machines when they're hung, please let me know and I can fetch it.

Comment 1 Nigel Babu 2016-08-23 10:38:53 UTC
Created attachment 1193298
ps-axl

Output of ps-axl

Comment 2 Nigel Babu 2016-08-23 10:56:00 UTC
From the log:

[2016-08-23 09:53:20.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 34 glusterd ++++++++++
[2016-08-23 09:53:25.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 35 pidof glusterd ++++++++++
[2016-08-23 09:53:25.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 36 gluster --mode=script --wignore volume create patchy replica 2 nbslave7g.cloud.gluster.org:/d/backends/brick0 nbslave7g.cloud.gluster.org:/d/backends/brick1 ++++++++++
[2016-08-23 09:53:25.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 37 gluster --mode=script --wignore volume set patchy self-heal-daemon off ++++++++++
[2016-08-23 09:53:26.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 38 gluster --mode=script --wignore volume set patchy features.locks-monkey-unlocking on ++++++++++
[2016-08-23 09:53:26.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 39 gluster --mode=script --wignore volume set patchy features.locks-revocation-secs 2 ++++++++++
[2016-08-23 09:53:27.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 40 gluster --mode=script --wignore volume start patchy ++++++++++
[2016-08-23 09:53:29.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 41 glusterfs --attribute-timeout=0 --entry-timeout=0 --volfile-id=patchy -s nbslave7g.cloud.gluster.org /mnt/glusterfs/0 ++++++++++
[2016-08-23 09:53:29.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 42 glusterfs --attribute-timeout=0 --entry-timeout=0 --volfile-id=patchy -s nbslave7g.cloud.gluster.org /mnt/glusterfs/1 ++++++++++
[2016-08-23 09:53:29.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 46 SUCCESS monkey_unlock ++++++++++
[2016-08-23 09:53:43.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 50 append_to_file /mnt/glusterfs/1/testfile ++++++++++

Comment 3 Raghavendra Talur 2016-08-23 14:38:48 UTC
I took a look at one of the NetBSD machines.

There was a hung umount process, and I saw a dd hung at the same time.
The test, however, is not doing anything obviously wrong. For now, I suspect a bug in the Gluster code rather than in the infra or the test.

I will update again tomorrow after looking into the stacks of the brick and mount processes.

Comment 4 Niels de Vos 2016-08-30 12:15:11 UTC
Status update? Please triage this when you have more details.

Comment 5 Worker Ant 2016-08-31 10:40:02 UTC
REVIEW: http://review.gluster.org/15374 (tests: disable lock_revocation.t on NetBSD) posted (#1) for review on master by Raghavendra Talur (rtalur@redhat.com)

Comment 6 Raghavendra Talur 2016-08-31 11:18:32 UTC
I think I know why the hang is happening. I will try a fix in the next patch.

Comment 7 Worker Ant 2016-08-31 11:34:25 UTC
REVIEW: http://review.gluster.org/15375 (test fix for lock_revocation hang) posted (#1) for review on master by Raghavendra Talur (rtalur@redhat.com)

Comment 8 Worker Ant 2016-08-31 11:51:14 UTC
REVIEW: http://review.gluster.org/15375 (test fix for lock_revocation hang) posted (#2) for review on master by Raghavendra Talur (rtalur@redhat.com)

Comment 9 Worker Ant 2016-08-31 12:16:03 UTC
COMMIT: http://review.gluster.org/15374 committed in master by Jeff Darcy (jdarcy@redhat.com) 
------
commit add85dda0127262164123c6373d55ff2cf9bb14b
Author: Raghavendra Talur <rtalur@redhat.com>
Date:   Wed Aug 31 16:07:09 2016 +0530

    tests: disable lock_revocation.t on NetBSD
    
    This has been consistently causing hangs in NetBSD machines. I have not
    been able to debug the issue and we have merge deadline for 3.9. It
    would be better to disable this for now.
    
    Change-Id: I8c63940aa26f78dd9994bb63293a5757835ec52b
    BUG: 1369401
    Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
    Reviewed-on: http://review.gluster.org/15374
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Jeff Darcy <jdarcy@redhat.com>

Comment 10 Raghavendra Talur 2016-09-01 07:42:41 UTC
Update:

I can consistently reproduce the problem on Fedora 24 as well: the dd command sometimes does not complete. This confirms that it is not a test framework bug.

Try running the test on Fedora 24 multiple times. You should be able to hit it within 5 runs.

I think it is closely related to the monkey unlocking feature. Could Pranith or Kruthika debug this, please?
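
For anyone repeating the "within 5 runs" reproduction, a small wrapper that reruns the test under a timeout can flag the hung run without tying up a terminal. This is only a sketch: the function name find_first_hang, the 600-second timeout, and the prove invocation in the trailing comment are assumptions, not something taken from the regression jobs.

```python
import subprocess


def find_first_hang(cmd, max_runs=5, timeout_secs=600):
    """Run cmd up to max_runs times.

    Returns the 1-based number of the first run that exceeded
    timeout_secs, or None if every run finished in time.
    """
    for run in range(1, max_runs + 1):
        try:
            # check=False: a failing test is not a hang, so keep going.
            subprocess.run(cmd, timeout=timeout_secs, check=False)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child on timeout; any
            # grandchildren (e.g. a stuck dd) may still need manual cleanup.
            return run
    return None


# Hypothetical usage (the exact test invocation is an assumption):
# hung_run = find_first_hang(
#     ["prove", "-vf", "./tests/features/lock_revocation.t"])
```

A return value of None after 5 runs would not prove the bug is gone, only that it did not show up in that sample.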

Comment 11 Worker Ant 2016-09-09 08:50:42 UTC
REVIEW: http://review.gluster.org/15375 (test fix for lock_revocation hang) posted (#3) for review on master by Raghavendra Talur (rtalur@redhat.com)

Comment 12 Worker Ant 2016-09-09 10:36:44 UTC
REVIEW: http://review.gluster.org/15375 (test fix for lock_revocation hang) posted (#4) for review on master by Raghavendra Talur (rtalur@redhat.com)

Comment 13 Raghavendra Talur 2016-09-09 11:52:51 UTC
This is a bug in the code. My analysis so far:

Run the test through the setup and exit before deadlock_fop.
Now perform dd on M0 until dd hangs.
If the expectation is right, lock revocation should happen, dd should proceed, and other writes to the file from other mounts should succeed.

Here, dd hangs indefinitely. Performing an echo to the same file from another mount succeeds, and it causes the dd to proceed too.

Needs more investigation.
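
The observation above (dd stays stuck until a write arrives from the other mount) can be checked mechanically. The sketch below is a generic probe under stated assumptions: probe_stuck_writer, its return strings, and the example commands in the trailing comment are hypothetical, and wait_secs should be comfortably larger than the 2-second features.locks-revocation-secs set by the test, so that a revocation would have had time to fire before the nudge.

```python
import subprocess


def probe_stuck_writer(writer_cmd, nudge_cmd, wait_secs=10):
    """Start writer_cmd; if it is still running after wait_secs,
    run nudge_cmd and wait once more.

    Returns "completed", "unblocked_by_nudge", or "still_hung".
    """
    writer = subprocess.Popen(writer_cmd)
    try:
        writer.wait(timeout=wait_secs)
        return "completed"
    except subprocess.TimeoutExpired:
        pass  # writer looks stuck; try the write from the other mount

    try:
        subprocess.run(nudge_cmd, timeout=wait_secs, check=False)
    except subprocess.TimeoutExpired:
        pass  # the nudge itself hung; still re-check the writer below

    try:
        writer.wait(timeout=wait_secs)
        return "unblocked_by_nudge"
    except subprocess.TimeoutExpired:
        writer.kill()
        try:
            # A process blocked in the kernel may not die promptly,
            # so do not wait for it forever.
            writer.wait(timeout=wait_secs)
        except subprocess.TimeoutExpired:
            pass
        return "still_hung"


# Hypothetical usage; the dd arguments and file names are assumptions:
# probe_stuck_writer(
#     ["dd", "if=/dev/zero", "of=/mnt/glusterfs/0/testfile", "bs=1k", "count=1"],
#     ["sh", "-c", "echo nudge >> /mnt/glusterfs/1/testfile"],
# )
```

If revocation worked as expected, the probe would report "completed"; the behaviour described in this comment corresponds to "unblocked_by_nudge".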

Comment 14 Amar Tumballi 2018-08-29 03:53:15 UTC
It has been a long time since there was any activity on this bug. We have either fixed it already, or it is most likely not critical anymore!

Please re-open the bug if the issue is burning for you, or you want to take the bug to closure with fixes.

