Description of problem: While running the connectathon test suite the following test failed. /kernel/filesystems/nfs/connectathon/rhel3-nfs/nfsvers=3_udp/lock Version-Release number of selected component (if applicable): 2.6.20-14.el5rt How reproducible: Intermittant 10% of the time it fails. Steps to Reproduce: 1. Run the RHTS test /kernel/filesystems/nfs/connectathon while running the RT kernel Actual results: Start tests on path /mnt/rhel3-nfs/dhcp78-129.test [y/n]? sh ./runtests -l -t /mnt/rhel3-nfs/dhcp78-129.test Starting LOCKING tests: test directory /mnt/rhel3-nfs/dhcp78-129.test (arg: -t) Testing native post-LFS locking Creating parent/child synchronization pipes. Test #1 - Test regions of an unlocked file. Parent: 1.1 - F_TEST [ 0, 1] PASSED. Parent: 1.2 - F_TEST [ 0, ENDING] PASSED. Parent: 1.3 - F_TEST [ 0,7fffffffffffffff] PASSED. Parent: 1.4 - F_TEST [ 1, 1] PASSED. Parent: 1.5 - F_TEST [ 1, ENDING] PASSED. Parent: 1.6 - F_TEST [ 1,7fffffffffffffff] PASSED. Parent: 1.7 - F_TEST [7fffffffffffffff, 1] PASSED. Parent: 1.8 - F_TEST [7fffffffffffffff, ENDING] PASSED. Parent: 1.9 - F_TEST [7fffffffffffffff,7fffffffffffffff] PASSED. Test #2 - Try to lock the whole file. Parent: 2.0 - F_TLOCK [ 0, ENDING] PASSED. Child: 2.1 - F_TEST [ 0, 1] PASSED. Child: 2.2 - F_TEST [ 0, ENDING] PASSED. Child: 2.3 - F_TEST [ 0,7fffffffffffffff] PASSED. Child: 2.4 - F_TEST [ 1, 1] PASSED. Child: 2.5 - F_TEST [ 1, ENDING] PASSED. Child: 2.6 - F_TEST [ 1,7fffffffffffffff] PASSED. Child: 2.7 - F_TEST [7fffffffffffffff, 1] PASSED. Child: 2.8 - F_TEST [7fffffffffffffff, ENDING] PASSED. Child: 2.9 - F_TEST [7fffffffffffffff,7fffffffffffffff] PASSED. Parent: 2.10 - F_ULOCK [ 0, ENDING] PASSED. Test #3 - Try to lock just the 1st byte. Parent: 3.0 - F_TLOCK [ 0, 1] PASSED. Child: 3.1 - F_TEST [ 0, 1] PASSED. Child: 3.2 - F_TEST [ 0, ENDING] PASSED. Child: 3.3 - F_TEST [ 1, 1] PASSED. Child: 3.4 - F_TEST [ 1, ENDING] PASSED. Parent: 3.5 - F_ULOCK [ 0, 1] PASSED. Test #4 - Try to lock the 2nd byte, test around it. Parent: 4.0 - F_TLOCK [ 1, 1] PASSED. Child: 4.1 - F_TEST [ 0, 1] PASSED. Child: 4.2 - F_TEST [ 0, 2] PASSED. Child: 4.3 - F_TEST [ 0, ENDING] PASSED. Child: 4.4 - F_TEST [ 1, 1] PASSED. Child: 4.5 - F_TEST [ 1, 2] PASSED. Child: 4.6 - F_TEST [ 1, ENDING] PASSED. Child: 4.7 - F_TEST [ 2, 1] PASSED. Child: 4.8 - F_TEST [ 2, 2] PASSED. Child: 4.9 - F_TEST [ 2, ENDING] PASSED. Parent: 4.10 - F_ULOCK [ 1, 1] PASSED. Test #5 - Try to lock 1st and 2nd bytes, test around them. Parent: 5.0 - F_TLOCK [ 0, 1] PASSED. Parent: 5.1 - F_TLOCK [ 2, 1] PASSED. Child: 5.2 - F_TEST [ 0, 1] PASSED. Child: 5.3 - F_TEST [ 0, 2] PASSED. Child: 5.4 - F_TEST [ 0, ENDING] PASSED. Child: 5.5 - F_TEST [ 1, 1] PASSED. Child: 5.6 - F_TEST [ 1, 2] PASSED. Child: 5.7 - F_TEST [ 1, ENDING] PASSED. Child: 5.8 - F_TEST [ 2, 1] PASSED. Child: 5.9 - F_TEST [ 2, 2] PASSED. Child: 5.10 - F_TEST [ 2, ENDING] PASSED. Child: 5.11 - F_TEST [ 3, 1] PASSED. Child: 5.12 - F_TEST [ 3, 2] PASSED. Child: 5.13 - F_TEST [ 3, ENDING] PASSED. Parent: 5.14 - F_ULOCK [ 0, 1] PASSED. Parent: 5.15 - F_ULOCK [ 2, 1] PASSED. Test #6 - Try to lock the MAXEOF byte. Parent: 6.0 - F_TLOCK [7fffffffffffffff, 1] PASSED. Child: 6.1 - F_TEST [7ffffffffffffffe, 1] PASSED. Child: 6.2 - F_TEST [7ffffffffffffffe, 2] PASSED. Child: 6.3 - F_TEST [7ffffffffffffffe, ENDING] PASSED. Child: 6.4 - F_TEST [7fffffffffffffff, 1] PASSED. Child: 6.5 - F_TEST [7fffffffffffffff, 2] PASSED. Child: 6.6 - F_TEST [7fffffffffffffff, ENDING] PASSED. Child: 6.7 - F_TEST [8000000000000000, ENDING] PASSED. Child: 6.8 - F_TEST [8000000000000000, 1] PASSED. Child: 6.9 - F_TEST [8000000000000000,7fffffffffffffff] PASSED. Child: 6.10 - F_TEST [8000000000000000,8000000000000000] PASSED. Parent: 6.11 - F_ULOCK [7fffffffffffffff, 1] PASSED. Test #7 - Test parent/child mutual exclusion. Parent: 7.0 - F_TLOCK [ ffc, 9] PASSED. Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ]. Parent: Now free child to run, should block on lock. Parent: Check data in file to insure child blocked. Parent: Read 'aaaa eh' from testfile [ 4092, 7 ]. Parent: 7.1 - COMPARE [ ffc, 7] PASSED. Parent: Now unlock region so child will unblock. Parent: 7.2 - F_ULOCK [ ffc, 9] PASSED. Child: 7.3 - F_LOCK [ ffc, 9] PASSED. Child: Write child's version of the data and release lock. Parent: Now try to regain lock, parent should block. Child: Wrote 'bebebebeb' to testfile [ 4092, 9 ]. Child: 7.4 - F_ULOCK [ ffc, 9] PASSED. Parent: 7.5 - F_LOCK [ ffc, 9] PASSED. Parent: Check data in file to insure child unblocked. Parent: Read 'bebebebeb' from testfile [ 4092, 9 ]. Parent: 7.6 - COMPARE [ ffc, 9] PASSED. Parent: 7.7 - F_ULOCK [ ffc, 9] PASSED. Test #8 - Rate test performing lock/unlock cycles. Parent: Performed 1000 lock/unlock cycles in 560 msecs. [214285 lpm]. Test #10 - Make sure a locked region is split properly. Parent: 10.0 - F_TLOCK [ 0, 3] PASSED. Parent: 10.1 - F_ULOCK [ 1, 1] PASSED. Child: 10.2 - F_TEST [ 0, 1] PASSED. Child: 10.3 - F_TEST [ 2, 1] PASSED. Child: 10.4 - F_TEST [ 3, ENDING] PASSED. Child: 10.5 - F_TEST [ 1, 1] PASSED. Parent: 10.6 - F_ULOCK [ 0, 1] PASSED. Parent: 10.7 - F_ULOCK [ 2, 1] PASSED. Child: 10.8 - F_TEST [ 0, 3] PASSED. Parent: 10.9 - F_ULOCK [ 0, 1] PASSED. Parent: 10.10 - F_TLOCK [ 1, 3] PASSED. Parent: 10.11 - F_ULOCK [ 2, 1] PASSED. Child: 10.12 - F_TEST [ 1, 1] PASSED. Child: 10.13 - F_TEST [ 3, 1] PASSED. Child: 10.14 - F_TEST [ 4, ENDING] PASSED. Child: 10.15 - F_TEST [ 2, 1] PASSED. Child: 10.16 - F_TEST [ 0, 1] PASSED. Test #11 - Make sure close() releases the process's locks. Parent: 11.0 - F_TLOCK [ 0, ENDING] PASSED. Parent: Closed testfile. Child: 11.1 - F_TLOCK [ 0, ENDING] PASSED. Child: 11.2 - F_ULOCK [ 0, ENDING] PASSED. Parent: 11.3 - F_TLOCK [ 1d, 5b7] PASSED. Parent: 11.4 - F_TLOCK [ 2000, 57] PASSED. Parent: Closed testfile. Child: 11.5 - F_TLOCK [ 0, ENDING] PASSED. Child: 11.6 - F_ULOCK [ 0, ENDING] PASSED. Parent: Wrote '123456789abcdef' to testfile [ 0, 16 ]. Parent: 11.7 - F_TLOCK [ 0, ENDING] PASSED. Parent: Wrote '123456789abcdef' to testfile [ 13, 16 ]. Parent: Closed testfile. Child: 11.8 - F_TLOCK [ 0, ENDING] PASSED. Child: 11.9 - F_ULOCK [ 0, ENDING] PASSED. Parent: Wrote '123456789abcdef' to testfile [ 0, 16 ]. Parent: 11.10 - F_TLOCK [ 0, ENDING] PASSED. Parent: Truncated testfile. Parent: Closed testfile. Child: 11.11 - F_TLOCK [ 0, ENDING] PASSED. Child: 11.12 - F_ULOCK [ 0, ENDING] PASSED. Test #12 - Signalled process should release locks. Child: 12.0 - F_TLOCK [ 0, ENDING] PASSED. Parent: Killed child process. Parent: 12.1 - F_TLOCK [ 0, ENDING] FAILED! Parent: **** Expected success, returned EAGAIN... Parent: **** Probably implementation error. ** PARENT pass 1 results: 39/39 pass, 0/0 warn, 1/1 fail (pass/total). ** CHILD pass 1 results: 64/64 pass, 0/0 warn, 0/0 fail (pass/total). lock tests failed Expected results: This test should pass 100% of the time. Additional info: Here is a link to the packet capture while the test was running. http://rhts.lab.boston.redhat.com/testlogs/18994/73670/344426/rhel3-nfs_nfsvers=3_udp_lock.cap.bz2
Have you every seen this on TCP mounts?
Nope, if i did I would have open another BZ
Well putting the server under load, I am able to get the lock tests to hang, but in test 8 (the rate test) instead of test 7 (the blocking test). One other notable, is using a RHEL5 xen guest I do see the same hang, but it appears the xen kernel does recover..
A couple of things... 1) It appears this is a UDP only thing since I am not able to reproduce the hang using TCP mounts... 2) the ethereal trace is no longer avaiable...
Steve, RHTS was upgraded yesterday, The link was broken. You can find the information here. http://armstrong.rhts.boston.redhat.com/testlogs/18994/73670/344426/rhel3-nfs_nfsvers=3_udp_lock.cap.bz2 Thanks Jeff
Ok... after further review... it appears this problem only appears on RT kernels... Using a non-rt kernel, I was not able to get the cthon test to hang...
Steve - have you differentiated whether this issue is RT specific vs a more recent upstream issue? Clark does a generic kernel build which is just the upstream stuff without the RT patches. Does the flaw occur there?
I generally use Rawhide kernels for my "generic" kernels. when I ran using those I was not able to reproduce the problem. But its been a while since I tried... let me set up another test to see how things have changed... if they have...
This issue has been seen on vanilla RHEL-5. It was also observed upstream and should be fixed there. There is also a testsuite timing dependency which can lead to this behavior. The testsuite in RHTS has been patched.
I haven't seen this on recent RHTS runs. closing for now
I see this, or at least, I did before I constructed the testsuite changes and the kernel changes to address it.
do we still see this on recent (2.6.24.7-101.el5rt) RT kernels?
Clark, Yes we still see this with the current version 2.6.24.7-101.el5rt Here are the links to the results: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=6270210 http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=6278249 Thanks, Jeff
Isn't this bug related to BZ486645? Jon Masters added a patch to kernel -104 that may help fixing this issue (though I haven't been seeing these failures recently - at least not the ones that used to freeze the machine).
I do not believe4 so. This is a different type of failure. It is a race condition. The latest RHEL5.4 kernel fixed the issue. https://bugzilla.redhat.com/show_bug.cgi?id=448929
Do we still see this on recent (2.6.24.7-132.el5rt) RT kernels?
Looking at the latest results. I do not see the failure. But it is a timing bug.
Haven't seen this issue on our 2.6.33 or 3.0.9 kernels