Bug 499237 - GFS: Inconsistency error generated by openmpi and quantum espresso
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: GFS-kernel
Version: 4
Hardware: x86_64 Linux
Priority: low, Severity: medium
Assigned To: Robert Peterson
QA Contact: Cluster QE
Reported: 2009-05-05 14:03 EDT by Gennaro Oliva
Modified: 2010-03-10 14:10 EST
Doc Type: Bug Fix
Last Closed: 2010-03-10 14:10:10 EST

Attachments
Pseudo file to go in /mnt/gfs/pseudo/ (6.88 MB, application/octet-stream)
2009-05-05 14:33 EDT, Robert Peterson
Input file for recreating the problem (5.81 KB, text/plain)
2009-05-05 14:37 EDT, Robert Peterson

Description Gennaro Oliva 2009-05-05 14:03:49 EDT
Description of problem:
GFS reports a fatal filesystem consistency error and withdraws from the cluster.

Version-Release number of selected component (if applicable):
kernel 2.6.18-92.1.22.el5
5.2 stack

How reproducible:
Not reproducible

Actual results:
file system consistency error

Expected results:


Additional info:
GFS: fsid=blade:home.0: fatal: filesystem consistency error
GFS: fsid=blade:home.0:   inode = 55504460/55504460
GFS: fsid=blade:home.0:   function = dir_e_del
GFS: fsid=blade:home.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/dir.c, line = 1545
GFS: fsid=blade:home.0:   time = 1241525364
GFS: fsid=blade:home.0: about to withdraw from the cluster
GFS: fsid=blade:home.0: telling LM to withdraw
dlm: home: group leave failed -512 0
GFS: fsid=blade:home.0: withdrawn

Call Trace:
 [<ffffffff8888b094>] :gfs:gfs_lm_withdraw+0xc4/0xd3
 [<ffffffff8001355e>] find_lock_page+0x26/0xa1
 [<ffffffff8887770d>] :gfs:getbuf+0x170/0x17f
 [<ffffffff88877afd>] :gfs:gfs_dreread+0x72/0xc7
 [<ffffffff88877b7a>] :gfs:gfs_dread+0x28/0x43
 [<ffffffff888a09ab>] :gfs:gfs_consist_inode_i+0x3d/0x42
 [<ffffffff8887acfe>] :gfs:gfs_dir_del+0x123/0x277
 [<ffffffff8888681f>] :gfs:gfs_unlinki+0x13/0x54
 [<ffffffff88894ef3>] :gfs:gfs_unlink+0xda/0x145
 [<ffffffff80049c1c>] vfs_unlink+0xc2/0x108
 [<ffffffff8003c367>] do_unlinkat+0xaa/0x141
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
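For triage, the key fields of a withdraw message like the one above can be pulled out with standard shell tools. The following is only a minimal sketch, not something from the original report: the function name gfs_withdraw_summary is hypothetical, it assumes log lines in exactly the "key = value" layout shown above, and it assumes GNU date for the epoch conversion.

```shell
# Hypothetical helper (not from the report): summarize a GFS
# consistency-error block read from standard input.
gfs_withdraw_summary() {
    local line func="" src="" lineno="" epoch=""
    while IFS= read -r line; do
        case "$line" in
            *"function = "*) func="${line##*function = }" ;;
            *"file = "*)
                src="${line##*file = }"        # "<path>, line = <n>"
                lineno="${src##*, line = }"
                src="${src%%, line = *}"
                ;;
            *"time = "*) epoch="${line##*time = }" ;;
        esac
    done
    echo "function: $func"
    echo "source:   $src:$lineno"
    if [ -n "$epoch" ]; then
        # Assumes GNU date: "-d @N" interprets N as seconds since the epoch.
        echo "time:     $(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC')"
    fi
}
```

It could be used as "dmesg | gfs_withdraw_summary" or fed a saved log file on standard input. If the input holds several error blocks, only the last one is reported. For the time value in this report, 1241525364, the conversion gives 2009-05-05 12:09:24 UTC.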
Comment 1 Robert Peterson 2009-05-05 14:30:56 EDT
General notes:

The workload is parallel computing on GFS in a 13-node cluster.

Mr. Oliva helped me set up my roth cluster to run one scenario that
fails intermittently.  It involved setting up an environment and installing
Quantum ESPRESSO.  See: http://www.quantum-espresso.org/

Command I'm using to try to recreate the problem:

mpirun -np 6 -machinefile /home/bob/espresso-4.0.5/machinefile.roth /home/bob/espresso-4.0.5/bin/pw.x < /home/bob/espresso-4.0.5/g5864_prelim.in > /home/bob/espresso-4.0.5/g5864_prelim.out

Misc setup notes:
yum -y install openmpi-devel

[root@roth-02 /mnt/gfs]# cat /home/bob/espresso-4.0.5/machinefile.roth
roth-01
roth-02
roth-03
roth-01
roth-02
roth-03

mpi-selector-menu
  (at the prompts, enter: 2, then u)
(get a new bash shell)
cd /home/bob/espresso-4.0.5
./configure
make all
Make sure the build uses mpif90 rather than gfortran to compile.

My roth cluster may not have enough horsepower to trigger a failure.
It has only 3 nodes with two x86_64 processors each, versus 13 dual
quad-core nodes with about 16GB of memory.
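The machinefile listed above names each of the three nodes twice, which matches the -np 6 in the mpirun command. When nodes are added or removed, a file of that shape can be regenerated rather than edited by hand. This is a minimal sketch under that assumption; the function name make_machinefile is hypothetical and not part of the original setup.

```shell
# Hypothetical helper (not from the report): print a machinefile that
# lists every node once per round, so each node appears "rounds" times.
make_machinefile() {
    local rounds="$1" i=0 node
    shift
    while [ "$i" -lt "$rounds" ]; do
        for node in "$@"; do
            echo "$node"
        done
        i=$((i + 1))
    done
}
```

For example, "make_machinefile 2 roth-01 roth-02 roth-03 > machinefile.roth" reproduces the six-line file shown above, and counting its lines with "wc -l" gives the matching value for mpirun's -np option.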
Comment 2 Robert Peterson 2009-05-05 14:33:08 EDT
Created attachment 342514
Pseudo file to go in /mnt/gfs/pseudo/
Comment 3 Robert Peterson 2009-05-05 14:37:12 EDT
Created attachment 342515
Input file for recreating the problem
Comment 4 Robert Peterson 2009-05-12 09:17:34 EDT
My work on this problem was hampered when the memory sticks in one
node went bad.  I ordered new memory and swapped sticks with another
machine, then added a fourth node to the cluster, which seems to be
acting flaky, probably due to hardware issues.  Yesterday I tried to
recreate the failure by running the failing scenario on six
processors (three nodes out of four), and it didn't fail.
The fact that it didn't fail may indicate it was fixed by my recent
code changes for bug #491369, so I need to go back to an older level
of GFS and try again.  If that doesn't recreate the failure, I may
need to add more nodes, but now I have three more nodes I can add
(I'd just need to scratch-build them and reconfigure the cluster).
The good news is that after running for many hours, the primary node
was still living within its memory constraints (not swapping to disk).
Comment 5 Robert Peterson 2009-05-14 09:45:01 EDT
Yesterday I ran the user scenario on a 6-node cluster,
roth-0{1,2,3,6,7,8}, with 12 CPUs.  The scenario ran for many hours
but, unfortunately, the problem did not recreate.  I may need to
re-run this every night for a few nights, and perhaps throughout the
weekend.  Either that or I need a bigger cluster.
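Repeated overnight or weekend runs like those described above can be left unattended with a small wrapper that stops as soon as something goes wrong. The following is only a sketch of one way to do it, not what was actually run: the function name rerun_until_failure and the LOG_CMD variable are hypothetical, and reading the kernel log with dmesg is an assumption.

```shell
# Assumption: LOG_CMD prints the kernel log; dmesg is the default guess.
LOG_CMD="${LOG_CMD:-dmesg}"

# Hypothetical wrapper (not from the report): run a command up to
# max_runs times, stopping early if it fails or if the kernel log
# contains a GFS consistency error.
rerun_until_failure() {
    local max_runs="$1" run=1
    shift
    while [ "$run" -le "$max_runs" ]; do
        echo "run $run started: $(date)"
        if ! "$@"; then
            echo "run $run: command exited with an error"
            return 1
        fi
        # Plain grep (not -q) reads the whole log, so the producer never
        # receives SIGPIPE.
        if $LOG_CMD 2>/dev/null | grep 'fatal: filesystem consistency error' >/dev/null; then
            echo "run $run: GFS consistency error found in the kernel log"
            return 2
        fi
        run=$((run + 1))
    done
    echo "no failure in $max_runs runs"
    return 0
}
```

Because the mpirun command uses input and output redirection, it would need to be wrapped, for example as: rerun_until_failure 10 sh -c 'mpirun -np 6 -machinefile machinefile.roth bin/pw.x < g5864_prelim.in > g5864_prelim.out'. The kernel log is cumulative, so a consistency error left over from an earlier session would stop the loop after the first run.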
Comment 7 Robert Peterson 2010-02-18 15:53:46 EST
I was never able to recreate this problem.  However, a large
number of changes went into 4.8 for bug #455696, and some of
those changes may have fixed it.  I recommend updating the
software to 4.8 and checking whether the problem still occurs.
I'll set the NEEDINFO flag until I hear back.
Comment 8 Gennaro Oliva 2010-02-20 15:07:25 EST
I didn't have the problem again.
Comment 9 Robert Peterson 2010-03-10 14:10:10 EST
For now I'm closing this as WORKSFORME.
If this problem occurs again, please re-open the bug record
and if possible give instructions on how to recreate it.
