Bug 499237 - GFS: Inconsistency error generated by openmpi and quantum espresso
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: GFS-kernel
Hardware: x86_64  OS: Linux
Priority: low  Severity: medium
Assigned To: Robert Peterson
Cluster QE
Reported: 2009-05-05 14:03 EDT by Gennaro Oliva
Modified: 2010-03-10 14:10 EST (History)
CC: 2 users

Doc Type: Bug Fix
Last Closed: 2010-03-10 14:10:10 EST

Attachments
Pseudo file to go in /mnt/gfs/pseudo/ (6.88 MB, application/octet-stream)
attached 2009-05-05 14:33 EDT by Robert Peterson
Input file for recreating the problem (5.81 KB, text/plain)
attached 2009-05-05 14:37 EDT by Robert Peterson
Description Gennaro Oliva 2009-05-05 14:03:49 EDT
Description of problem:
GFS file system consistency error (in dir_e_del) during an OpenMPI run of Quantum ESPRESSO.

Version-Release number of selected component (if applicable):
kernel 2.6.18-92.1.22.el5
RHEL 5.2 cluster stack

How reproducible:
Not reproducible

Steps to Reproduce:
Actual results:
file system consistency error

Expected results:

Additional info:
GFS: fsid=blade:home.0: fatal: filesystem consistency error
GFS: fsid=blade:home.0:   inode = 55504460/55504460
GFS: fsid=blade:home.0:   function = dir_e_del
GFS: fsid=blade:home.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/dir.c, line = 1545
GFS: fsid=blade:home.0:   time = 1241525364
GFS: fsid=blade:home.0: about to withdraw from the cluster
GFS: fsid=blade:home.0: telling LM to withdraw
dlm: home: group leave failed -512 0
GFS: fsid=blade:home.0: withdrawn

Call Trace:
 [<ffffffff8888b094>] :gfs:gfs_lm_withdraw+0xc4/0xd3
 [<ffffffff8001355e>] find_lock_page+0x26/0xa1
 [<ffffffff8887770d>] :gfs:getbuf+0x170/0x17f
 [<ffffffff88877afd>] :gfs:gfs_dreread+0x72/0xc7
 [<ffffffff88877b7a>] :gfs:gfs_dread+0x28/0x43
 [<ffffffff888a09ab>] :gfs:gfs_consist_inode_i+0x3d/0x42
 [<ffffffff8887acfe>] :gfs:gfs_dir_del+0x123/0x277
 [<ffffffff8888681f>] :gfs:gfs_unlinki+0x13/0x54
 [<ffffffff88894ef3>] :gfs:gfs_unlink+0xda/0x145
 [<ffffffff80049c1c>] vfs_unlink+0xc2/0x108
 [<ffffffff8003c367>] do_unlinkat+0xaa/0x141
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Comment 1 Robert Peterson 2009-05-05 14:30:56 EDT
General notes:

This is a parallel-computing workload using GFS on a 13-node cluster.

Mr. Oliva helped me set up my roth cluster to run one scenario that
fails intermittently.  It involved setting up an environment and installing
espresso.  See: http://www.quantum-espresso.org/

Command I'm using to try to recreate the problem:

mpirun -np 6 -machinefile /home/bob/espresso-4.0.5/machinefile.roth /home/bob/espresso-4.0.5/bin/pw.x < /home/bob/espresso-4.0.5/g5864_prelim.in > /home/bob/espresso-4.0.5/g5864_prelim.out

Misc setup notes:
yum -y install openmpi-devel

[root@roth-02 /mnt/gfs]# cat /home/bob/espresso-4.0.5/machinefile.roth
(get a new bash shell)
cd /home/bob/espresso-4.0.5
make all
And make sure it uses mpif90 to compile rather than gfortran.
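For reference (the actual contents of machinefile.roth weren't captured above), an OpenMPI machinefile simply lists one host per line, optionally with a slots count giving the number of processes to place on that host. A hypothetical version for the roth nodes, assuming the three two-processor x86_64 machines mentioned below, might look like:

```
roth-01 slots=2
roth-02 slots=2
roth-03 slots=2
```

With slots=2 on each of three hosts, `mpirun -np 6` would place two pw.x ranks per node.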

My roth cluster might not have enough horsepower to reproduce the
failure.  I've only got 3 nodes with two x86_64 processors each, versus
their 13 dual quad-core nodes with about 16 GB of memory each.
Comment 2 Robert Peterson 2009-05-05 14:33:08 EDT
Created attachment 342514 [details]
Pseudo file to go in /mnt/gfs/pseudo/
Comment 3 Robert Peterson 2009-05-05 14:37:12 EDT
Created attachment 342515 [details]
Input file for recreating the problem
Comment 4 Robert Peterson 2009-05-12 09:17:34 EDT
My work on this problem was hampered by the fact that my memory sticks
in one node went bad.  I ordered new memory and swapped sticks with
another machine, then I added a fourth node into the cluster, which
seems to be acting flaky, probably due to hardware issues.  Yesterday
I tried to recreate the failure by running the failing scenario on
six processors--three nodes out of four--and it didn't fail.
The fact that it didn't fail may indicate it was fixed by my recent
code changes for bug #491369, so I need to go back to an older level
of GFS and try again.  If that doesn't recreate the failure, I may
need to add more nodes, but now I have three more nodes I can add
(I'd just need to scratch-build them and reconfigure the cluster).
The good news is that after running for many hours, the primary node
was still living within its memory constraints (not swapping to disk).
Comment 5 Robert Peterson 2009-05-14 09:45:01 EDT
I ran the user scenario on a 6-node cluster roth-0{1,2,3,6,7,8}
and 12 cpus yesterday.  The scenario ran for many hours but
unfortunately, the problem did not recreate.  I guess maybe I need
to re-run this every night for a few nights and maybe I can run it
throughout the weekend.  Either that or I need a bigger cluster.
Comment 7 Robert Peterson 2010-02-18 15:53:46 EST
I was never able to recreate this problem.  However, a large
number of changes went into 4.8 for bug #455696 and some of
those changes may have fixed this problem.  I recommend updating
their software to 4.8 and seeing whether the problem still exists.
I'll set the NEEDINFO flag until I hear back.
Comment 8 Gennaro Oliva 2010-02-20 15:07:25 EST
I didn't have the problem again.
Comment 9 Robert Peterson 2010-03-10 14:10:10 EST
For now I'm closing this as WORKSFORME.
If this problem occurs again, please re-open the bug record
and if possible give instructions on how to recreate it.
