Description of problem:
The system hung after I ran the latest LTP (http://ltp.sourceforge.net/) as part of durability testing on our GFS environment. The setup was two DELL PE1950 servers running CentOS 4.3 (IA32), with a DELL EMC AX150 configured as GFS shared storage connected to both servers. We ran "./runltp -d /gfs3" on one server while the other server remained idle.

Here is the extract from /var/log/messages taken when the system stopped:

Oct 4 13:15:47 centos1 kernel: lock_dlm: Assertion failed on line 500 of file /home/buildcentos/rpmbuild/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/plock.c
Oct 4 13:15:47 centos1 kernel: lock_dlm: assertion: "!error"
Oct 4 13:15:47 centos1 kernel: lock_dlm: time = 71704458
Oct 4 13:15:47 centos1 kernel: error=-11
Oct 4 13:15:47 centos1 kernel:
Oct 4 13:15:47 centos1 kernel: ------------[ cut here ]------------
Oct 4 13:15:47 centos1 kernel: kernel BUG at /home/buildcentos/rpmbuild/BUILD/fs-kernel-2.6.9-49/smp/src/dlm/plock.c:500!
Oct 4 13:15:47 centos1 kernel: invalid operand: 0000 [#1]
Oct 4 13:15:47 centos1 kernel: SMP
Oct 4 13:15:47 centos1 kernel: Modules linked in: parport_pc lp parport autofs i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) sunrpc dm_mirror dm_multipath dm_mod button battery ac md5 ipv6 joydev uhci_hcd ehci_hcd hw random shpchp bnx2 ext3 jbd qla6312 qla2xxx scsi_transport_fc megaraid_sas sd_md scsi_mod
Oct 4 13:15:47 centos1 kernel: CPU: 0

Version-Release number of selected component (if applicable):
CentOS 4.3 (i386): kernel 2.6.9-34.ELsmp
dlm-1.0.0-5.i686.rpm, dlm-kernel-smp-2.6.9-41.7.i686.rpm

How reproducible:
It happens every time.

Steps to Reproduce:
See the description.

Actual results:
See the description.

Expected results:
No system hang occurs while running LTP.

Additional info:
We may be able to capture a diskdump of this problem if you need one.
Can you tell us which test case caused the assertion? There should be more output after the kernel BUG message that states the name of the process and a backtrace.
Created attachment 140560: Output of 'log' command on crash utility
Created attachment 140561: LTP runlog taken after the panic happened
Sorry for the delayed response. We were able to reproduce this problem in the same environment with the same tool. Attached are the output of the 'log' command in the crash utility and the LTP run log taken after the panic happened.
The logs clearly show that the test case running was fcntl11. http://ltp.cvs.sourceforge.net/ltp/ltp/testcases/kernel/syscalls/fcntl/fcntl11.c?view=log
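For anyone unfamiliar with what this test exercises: fcntl11 tests POSIX record locks (plocks) taken with fcntl() by cooperating processes. Below is a minimal sketch of that kind of locking, not the LTP test itself and not a confirmed reproducer for this bug. The helper names try_plock and child_try_wrlock are made up for illustration. Note that on Linux EAGAIN is errno 11, the same numeric value as the error=-11 line in the log, and it is one of the errors F_SETLK returns when another process holds a conflicting lock. Whether the assertion in plock.c is reached through that path is not established in this report.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: try a non-blocking POSIX record lock on the byte
 * range [start, start + len). type is F_RDLCK, F_WRLCK or F_UNLCK.
 * Returns 0 on success or -errno on failure. */
int try_plock(int fd, short type, off_t start, off_t len)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;

    if (fcntl(fd, F_SETLK, &fl) == -1)
        return -errno;   /* EAGAIN or EACCES means a conflicting lock exists */
    return 0;
}

/* Hypothetical helper: fork a child that attempts a write lock on the
 * range. Record locks are per process and are not inherited across
 * fork(), so the child conflicts with any lock the parent holds.
 * Returns 0 if the child got the lock, a positive errno value if it did
 * not, or -1 if the fork/wait itself failed. */
int child_try_wrlock(int fd, off_t start, off_t len)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        int rc = try_plock(fd, F_WRLCK, start, len);
        _exit(rc == 0 ? 0 : -rc);   /* pass the errno back as exit status */
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

To try this against the GFS mount, open a file under /gfs3, take a write lock with try_plock(fd, F_WRLCK, 0, 10), then call child_try_wrlock(fd, 5, 10); on a local filesystem the overlapping request is refused with EAGAIN or EACCES, while a non-overlapping range succeeds.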
This shouldn't block a release since it's not been an issue outside of this specific test.
I doubt we'll want to fiddle much with plocks on rhel4 at this late stage unless it's a really crucial issue people are facing. It should work in the new rhel5 code, though.