Bug 499237
Summary: GFS: Inconsistency error generated by openmpi and quantum espresso

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Gennaro Oliva <gennaro.oliva> |
| Component: | GFS-kernel | Assignee: | Robert Peterson <rpeterso> |
| Status: | CLOSED WORKSFORME | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Priority: | low |
| Version: | 4 | CC: | edamato, swhiteho |
| Hardware: | x86_64 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2010-03-10 19:10:10 UTC |

Attachments: 342514 (pseudo file), 342515 (input file); see the description below.
Description
Gennaro Oliva 2009-05-05 18:03:49 UTC
General notes: This is parallel computing using GFS on a 13-node cluster. Mr. Oliva helped me set up my roth cluster to run one scenario that fails intermittently. It involved setting up an environment and installing Quantum ESPRESSO; see: http://www.quantum-espresso.org/

Command I'm using to try to recreate the problem:

```
mpirun -np 6 -machinefile /home/bob/espresso-4.0.5/machinefile.roth \
    /home/bob/espresso-4.0.5/bin/pw.x \
    < /home/bob/espresso-4.0.5/g5864_prelim.in \
    > /home/bob/espresso-4.0.5/g5864_prelim.out
```

Misc setup notes (collected into a single script after the attachment notes below):

```
yum -y install openmpi-devel

[root@roth-02 /mnt/gfs]# cat /home/bob/espresso-4.0.5/machinefile.roth
roth-01
roth-02
roth-03
roth-01
roth-02
roth-03

mpi-selector-menu      # answer "2", then "u", then get a new bash shell

cd /home/bob/espresso-4.0.5
./configure
make all
```

Make sure it uses mpif90 to compile rather than gfortran.

My roth cluster might not have enough horsepower to get a failure: I've only got three nodes with two x86_64 processors each, versus their 13 dual quad-cores with something like 16 GB of memory.

Created attachment 342514 [details]
Pseudo file to go in /mnt/gfs/pseudo/
Created attachment 342515 [details]
Input file for recreating the problem
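
For reference, the setup notes above can be pulled together into one script. This is a minimal sketch, assuming the paths, hostnames, and input file named in this report, with both attachments already staged; the mpi-selector step is reduced to a comment because its menu answers depend on which MPI installations are present on the node.

```bash
#!/bin/bash
# Sketch of the reproduction recipe above -- not part of the original
# report. Assumes the roth-0{1,2,3} nodes and the espresso-4.0.5 tree.
set -e
ESPRESSO=/home/bob/espresso-4.0.5

# One-time setup: Open MPI development packages.
yum -y install openmpi-devel

# Machinefile naming each node twice (two CPUs per node = 6 MPI slots).
cat > "$ESPRESSO/machinefile.roth" <<EOF
roth-01
roth-02
roth-03
roth-01
roth-02
roth-03
EOF

# Select Open MPI via mpi-selector-menu ("2", then "u") and start a new
# shell first, so ./configure picks up mpif90 rather than plain gfortran.
cd "$ESPRESSO"
./configure
make all

# The failing scenario: pw.x on 6 processes across the three nodes,
# reading the attached input file.
mpirun -np 6 -machinefile "$ESPRESSO/machinefile.roth" \
    "$ESPRESSO/bin/pw.x" \
    < "$ESPRESSO/g5864_prelim.in" \
    > "$ESPRESSO/g5864_prelim.out"
```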
My work on this problem was hampered by the fact that the memory sticks in one node went bad. I ordered new memory and swapped sticks with another machine, then added a fourth node to the cluster, which seems to be acting flaky, probably due to hardware issues.

Yesterday I tried to recreate the failure by running the failing scenario on six processors (three nodes out of four) and it didn't fail. The fact that it didn't fail may indicate it was fixed by my recent code changes for bug #491369, so I need to go back to an older level of GFS and try again. If that doesn't recreate the failure, I may need to add more nodes; I now have three more nodes I can add (I'd just need to scratch-build them and reconfigure the cluster). The good news is that after running for many hours, the primary node was still living within its memory constraints (not swapping to disk).

I ran the user scenario on a 6-node cluster, roth-0{1,2,3,6,7,8}, with 12 CPUs yesterday. The scenario ran for many hours but unfortunately the problem did not recreate. I guess I need to re-run this every night for a few nights, and maybe run it throughout the weekend (a loop like the sketch at the end of this report would do). Either that or I need a bigger cluster.

I was never able to recreate this problem. However, a large number of changes went into 4.8 for bug #455696, and some of those changes may have fixed this problem. I recommend updating the software to 4.8 and seeing whether the problem still exists. I'll set the NEEDINFO flag until I hear back.

I didn't have the problem again.

For now I'm closing this as WORKSFORME. If this problem occurs again, please re-open the bug record and, if possible, give instructions on how to recreate it.
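
Since the failure is intermittent, the "re-run every night" plan above can be automated by looping the scenario and checking the kernel log after each pass. A minimal sketch, assuming the same paths as the script earlier in this report; the grep pattern is an assumption about how a GFS consistency error would show up in dmesg, not a string quoted anywhere in this bug.

```bash
#!/bin/bash
# Repeat the failing scenario overnight and stop at the first sign of a
# GFS consistency error in the kernel log. The error pattern below is an
# assumed match, not one taken from this report.
ESPRESSO=/home/bob/espresso-4.0.5
pass=0
while true; do
    pass=$((pass + 1))
    echo "pass $pass started: $(date)"
    mpirun -np 6 -machinefile "$ESPRESSO/machinefile.roth" \
        "$ESPRESSO/bin/pw.x" \
        < "$ESPRESSO/g5864_prelim.in" \
        > "$ESPRESSO/g5864_prelim_pass$pass.out"
    if dmesg | grep -i "consistency error"; then
        echo "possible GFS inconsistency after pass $pass"
        break
    fi
done
```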