This has been happening fairly consistently. Right now, 4 jobs are hung at that test. Could we please debug what is going on here and fix it?
https://build.gluster.org/job/netbsd7-regression/165/consoleFull
https://build.gluster.org/job/netbsd7-regression/164/consoleFull
https://build.gluster.org/job/netbsd7-regression/150/consoleFull
https://build.gluster.org/job/netbsd7-regression/142/consoleFull
If you need information from the machines while they are hung, please let me know and I can fetch it.
Created attachment 1193298 [details]
Output of `ps -axl`
From the log:
[2016-08-23 09:53:20.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 34 glusterd ++++++++++
[2016-08-23 09:53:25.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 35 pidof glusterd ++++++++++
[2016-08-23 09:53:25.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 36 gluster --mode=script --wignore volume create patchy replica 2 nbslave7g.cloud.gluster.org:/d/backends/brick0 nbslave7g.cloud.gluster.org:/d/backends/brick1 ++++++++++
[2016-08-23 09:53:25.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 37 gluster --mode=script --wignore volume set patchy self-heal-daemon off ++++++++++
[2016-08-23 09:53:26.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 38 gluster --mode=script --wignore volume set patchy features.locks-monkey-unlocking on ++++++++++
[2016-08-23 09:53:26.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 39 gluster --mode=script --wignore volume set patchy features.locks-revocation-secs 2 ++++++++++
[2016-08-23 09:53:27.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 40 gluster --mode=script --wignore volume start patchy ++++++++++
[2016-08-23 09:53:29.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 41 glusterfs --attribute-timeout=0 --entry-timeout=0 --volfile-id=patchy -s nbslave7g.cloud.gluster.org /mnt/glusterfs/0 ++++++++++
[2016-08-23 09:53:29.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 42 glusterfs --attribute-timeout=0 --entry-timeout=0 --volfile-id=patchy -s nbslave7g.cloud.gluster.org /mnt/glusterfs/1 ++++++++++
[2016-08-23 09:53:29.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 46 SUCCESS monkey_unlock ++++++++++
[2016-08-23 09:53:43.6N]:++++++++++ G_LOG:./tests/features/lock_revocation.t: TEST: 50 append_to_file /mnt/glusterfs/1/testfile ++++++++++
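For anyone who wants to reproduce this by hand, the setup phase the test walks through (up to the point where it hangs) can be sketched from the commands visible in the log. This is a sketch, not the test itself: the brick paths and mount points mirror the log, the hostname is taken from the local machine, and the script only acts if the gluster CLI is actually installed.

```shell
#!/bin/sh
# Sketch of the lock_revocation.t setup steps seen in the log above.
# Brick paths and mount points mirror the log; HOST is this machine.
HOST=$(hostname)
if command -v gluster >/dev/null 2>&1; then
    # 2-brick replica volume, as in TEST 36 of the log
    gluster --mode=script --wignore volume create patchy replica 2 \
        "$HOST:/d/backends/brick0" "$HOST:/d/backends/brick1"
    gluster --mode=script --wignore volume set patchy self-heal-daemon off
    gluster --mode=script --wignore volume set patchy features.locks-monkey-unlocking on
    gluster --mode=script --wignore volume set patchy features.locks-revocation-secs 2
    gluster --mode=script --wignore volume start patchy
    # two independent FUSE mounts of the same volume, with caching disabled
    glusterfs --attribute-timeout=0 --entry-timeout=0 --volfile-id=patchy -s "$HOST" /mnt/glusterfs/0
    glusterfs --attribute-timeout=0 --entry-timeout=0 --volfile-id=patchy -s "$HOST" /mnt/glusterfs/1
else
    echo "gluster not installed; nothing to do"
fi
```

The two mounts matter: the test exercises lock revocation by taking locks on the file through one mount while writing to it through the other.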
I took a look at one of the NetBSD machines. There was a hung umount process, and a dd that was also hung at the same time. The test itself, however, is not doing anything obviously wrong. As of now, I suspect a bug in the Gluster code rather than in the infra or the test. I will update again tomorrow after looking into the stacks of the brick and mount processes.
Status update? Please triage this when you have more details.
REVIEW: http://review.gluster.org/15374 (tests: disable lock_revocation.t on NetBSD) posted (#1) for review on master by Raghavendra Talur (rtalur)
I think I know why the hang is happening. Will try in next patch.
REVIEW: http://review.gluster.org/15375 (test fix for lock_revocation hang) posted (#1) for review on master by Raghavendra Talur (rtalur)
REVIEW: http://review.gluster.org/15375 (test fix for lock_revocation hang) posted (#2) for review on master by Raghavendra Talur (rtalur)
COMMIT: http://review.gluster.org/15374 committed in master by Jeff Darcy (jdarcy)
------
commit add85dda0127262164123c6373d55ff2cf9bb14b
Author: Raghavendra Talur <rtalur>
Date:   Wed Aug 31 16:07:09 2016 +0530

    tests: disable lock_revocation.t on NetBSD

    This has been consistently causing hangs in NetBSD machines. I have
    not been able to debug the issue and we have merge deadline for 3.9.
    It would be better to disable this for now.

    Change-Id: I8c63940aa26f78dd9994bb63293a5757835ec52b
    BUG: 1369401
    Signed-off-by: Raghavendra Talur <rtalur>
    Reviewed-on: http://review.gluster.org/15374
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>
Update: I can consistently reproduce the problem on Fedora 24 as well; the dd command sometimes does not complete. This confirms it is not a test-framework bug. Try running the test on Fedora 24 multiple times; you should be able to hit it within 5 runs. I think it is closely related to the monkey-unlocking feature. Could Pranith or Kruthika debug this, please?
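A loop like the following is one way to drive the repeated runs described above. It is a sketch under a few assumptions: the working directory is a glusterfs source checkout (so `tests/features/lock_revocation.t` exists), `prove` is installed, and GNU `timeout` is available to keep a hung run from blocking the loop forever. The 600-second timeout is an arbitrary guess, not a value from the test framework.

```shell
#!/bin/sh
# Run lock_revocation.t repeatedly; per the comment above, the hang tends
# to show up within about 5 runs on Fedora 24.
TESTFILE=tests/features/lock_revocation.t
RUNS=5
if [ -f "$TESTFILE" ] && command -v prove >/dev/null 2>&1; then
    i=1
    while [ "$i" -le "$RUNS" ]; do
        echo "=== run $i of $RUNS ==="
        # GNU timeout kills a hung run instead of letting it block forever
        timeout 600 prove -v "$TESTFILE" || { echo "run $i failed or timed out"; break; }
        i=$((i + 1))
    done
else
    echo "not in a glusterfs checkout (or prove missing); nothing to do"
fi
```

If a run trips the timeout, that run's glusterfs/brick processes are the interesting ones to grab stacks from before cleaning up.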
REVIEW: http://review.gluster.org/15375 (test fix for lock_revocation hang) posted (#3) for review on master by Raghavendra Talur (rtalur)
REVIEW: http://review.gluster.org/15375 (test fix for lock_revocation hang) posted (#4) for review on master by Raghavendra Talur (rtalur)
This is a bug in the code. My analysis so far: run the test up to the setup steps and exit before deadlock_fop. Now run dd on M0 until it hangs. If the feature worked as expected, lock revocation would kick in and the dd would proceed, and writes to the file from other mounts would also succeed. Instead, the dd hangs indefinitely. However, echoing to the same file from another mount succeeds, and that write also causes the hung dd to proceed. Needs more investigation.
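The manual reproduction described above can be sketched as follows. Assumptions: the volume is already set up as in the test, the two FUSE mounts are at the test's usual `/mnt/glusterfs/0` and `/mnt/glusterfs/1`, and the dd invocation (append with no truncate, as the test's append_to_file helper does) is my approximation of what the test runs. The 10-second wait before declaring a hang is an arbitrary choice.

```shell
#!/bin/sh
# Manual reproduction of the analysis above: dd on mount 0 hangs even though
# lock revocation should have freed the lock; a write to the same file from
# mount 1 then unblocks it. Run only against a throwaway volume.
M0=/mnt/glusterfs/0
M1=/mnt/glusterfs/1
if [ -d "$M0" ] && [ -d "$M1" ]; then
    # expected: revocation frees the lock and dd completes; observed: it hangs
    dd if=/dev/zero of="$M0/testfile" bs=1k count=1 conv=notrunc oflag=append &
    dd_pid=$!
    sleep 10
    if kill -0 "$dd_pid" 2>/dev/null; then
        echo "dd still running after 10s: likely hung"
        # a write from the other mount lets the hung dd proceed
        echo unblock >> "$M1/testfile"
        wait "$dd_pid" && echo "dd completed after the second-mount write"
    fi
else
    echo "mounts not present; nothing to do"
fi
```

That the second-mount write unblocks the dd suggests the revocation path is not being triggered on its own, only when a competing operation arrives.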
A lot of time has passed with no activity on this bug. Either we have already fixed it, or it is mostly not critical anymore. Please re-open the bug if the issue is still burning for you, or if you want to take the bug to closure with fixes.
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 1000 days.