Description of problem:
Consistency error

Version-Release number of selected component (if applicable):
kernel 2.6.18-92.1.22.el5
5.2 stack

How reproducible:
Not reproducible

Steps to Reproduce:
1.
2.
3.

Actual results:
file system consistency error

Expected results:


Additional info:
GFS: fsid=blade:home.0: fatal: filesystem consistency error
GFS: fsid=blade:home.0: inode = 55504460/55504460
GFS: fsid=blade:home.0: function = dir_e_del
GFS: fsid=blade:home.0: file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/dir.c, line = 1545
GFS: fsid=blade:home.0: time = 1241525364
GFS: fsid=blade:home.0: about to withdraw from the cluster
GFS: fsid=blade:home.0: telling LM to withdraw
dlm: home: group leave failed -512 0
GFS: fsid=blade:home.0: withdrawn

Call Trace:
 [<ffffffff8888b094>] :gfs:gfs_lm_withdraw+0xc4/0xd3
 [<ffffffff8001355e>] find_lock_page+0x26/0xa1
 [<ffffffff8887770d>] :gfs:getbuf+0x170/0x17f
 [<ffffffff88877afd>] :gfs:gfs_dreread+0x72/0xc7
 [<ffffffff88877b7a>] :gfs:gfs_dread+0x28/0x43
 [<ffffffff888a09ab>] :gfs:gfs_consist_inode_i+0x3d/0x42
 [<ffffffff8887acfe>] :gfs:gfs_dir_del+0x123/0x277
 [<ffffffff8888681f>] :gfs:gfs_unlinki+0x13/0x54
 [<ffffffff88894ef3>] :gfs:gfs_unlink+0xda/0x145
 [<ffffffff80049c1c>] vfs_unlink+0xc2/0x108
 [<ffffffff8003c367>] do_unlinkat+0xaa/0x141
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
General notes:

This is parallel computing using GFS on a 13-node cluster. Mr. Oliva helped me set up my roth cluster to run one scenario that fails intermittently. It involved setting up an environment and installing espresso. See: http://www.quantum-espresso.org/

Command I'm using to try to recreate the problem:

mpirun -np 6 -machinefile /home/bob/espresso-4.0.5/machinefile.roth \
    /home/bob/espresso-4.0.5/bin/pw.x \
    < /home/bob/espresso-4.0.5/g5864_prelim.in \
    > /home/bob/espresso-4.0.5/g5864_prelim.out

Misc setup notes:

yum -y install openmpi-devel

[root@roth-02 /mnt/gfs]# cat /home/bob/espresso-4.0.5/machinefile.roth
roth-01
roth-02
roth-03
roth-01
roth-02
roth-03

mpi-selector-menu
(choose 2, then u, then get a new bash shell)

cd /home/bob/espresso-4.0.5
./configure
make all

And make sure it uses mpif90 to compile rather than gfortran.

My roth cluster might not have enough horsepower to get a failure. I've only got 3 nodes with two x86_64 processors each, versus their 13 dual quad-cores with something like 16GB of memory.
Created attachment 342514 [details] Pseudo file to go in /mnt/gfs/pseudo/
Created attachment 342515 [details] Input file for recreating the problem
My work on this problem was hampered by the fact that the memory sticks in one node went bad. I ordered new memory and swapped sticks with another machine, then added a fourth node to the cluster, which seems to be acting flaky, probably due to hardware issues. Yesterday I tried to recreate the failure by running the failing scenario on six processors (three nodes out of four) and it didn't fail. The fact that it didn't fail may indicate it was fixed by my recent code changes for bug #491369, so I need to go back to an older level of GFS and try again. If that doesn't recreate the failure, I may need to add more nodes; I now have three more nodes I can add (I'd just need to scratch-build them and reconfigure the cluster). The good news is that after running for many hours, the primary node was still living within its memory constraints (not swapping to disk).
I ran the user scenario on a 6-node cluster (roth-0{1,2,3,6,7,8}) with 12 CPUs yesterday. The scenario ran for many hours, but unfortunately the problem did not recreate. I may need to re-run this every night for a few nights, or perhaps run it throughout the weekend. Either that or I need a bigger cluster.
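For the repeated overnight runs, something like the following wrapper could drive the scenario and stop on the first failure. This is only a sketch: the paths match the reproduction command above, but the loop count, the per-run log suffix, and the ESPRESSO/MPIRUN environment-variable parameterization are my own assumptions, not part of the original setup.

```shell
#!/bin/sh
# Repeat the failing espresso scenario and stop at the first non-zero
# exit code (e.g. the job dying when GFS withdraws from the cluster).
# ESPRESSO and MPIRUN are overridable so the loop can be pointed at a
# different install tree; both defaults are hypothetical.
ESPRESSO=${ESPRESSO:-/home/bob/espresso-4.0.5}
MPIRUN=${MPIRUN:-mpirun}

# Run the scenario $1 times; each run gets its own numbered output file.
run_scenario_loop() {
    runs=$1
    i=1
    while [ "$i" -le "$runs" ]; do
        "$MPIRUN" -np 6 -machinefile "$ESPRESSO/machinefile.roth" \
            "$ESPRESSO/bin/pw.x" \
            < "$ESPRESSO/g5864_prelim.in" \
            > "$ESPRESSO/g5864_prelim.out.$i" 2>&1 || {
            echo "run $i failed with exit code $?"
            return 1
        }
        i=$((i + 1))
    done
    echo "completed $runs runs without failure"
}
```

Invoked from cron or a detached shell, e.g. `run_scenario_loop 20`, it preserves the output of every run, so if the consistency error does hit again the surviving logs show which iteration triggered it.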
I was never able to recreate this problem. However, a large number of changes went into 4.8 for bug #455696, and some of those changes may have fixed this problem. I recommend they update their software to 4.8 and see whether the problem still exists. I'll set the NEEDINFO flag until I hear back.
I didn't have the problem again.
For now I'm closing this as WORKSFORME. If this problem occurs again, please re-open the bug record and, if possible, give instructions on how to recreate it.