Description of problem:
Any calculation in NWChem crashes with mpich-x86_64 or openmpi-x86_64.

Version-Release number of selected component (if applicable):
nwchem-common-6.6.27746-24.fc24.noarch
nwchem-openmpi-6.6.27746-24.fc24.x86_64
nwchem-mpich-6.6.27746-24.fc24.x86_64
nwchem-6.6.27746-24.fc24.x86_64

How reproducible:
Create any simple input file and run it.

Steps to Reproduce:
1. module load mpi/openmpi-x86_64
2. nwchem_openmpi water.nw
3. Crash

Actual results:
With Open MPI, NWChem generates a segmentation fault. With MPICH the error is more descriptive, but in both cases the process dies in the DFT stage.

Expected results:
The magic of computational chemistry should appear as a result.

Additional info:
It looks like NWChem is unable to read any basis sets, and that is causing the crash. I can see this error:

Attempting to read a basis set from a non-existing file:
/builddir/build/BUILD/nwchem-6.6/src/basis/libraries/cc-pvdz

It looks like the RPM macros haven't converted the build-time path to an installed path under the / directory structure.
----------------------------------------------------------------------------
NWChem DFT Module
-----------------
Título
Attempting to read a basis set from a non-existing file:
/builddir/build/BUILD/nwchem-6.6/src/basis/libraries/cc-pvdz
------------------------------------------------------------------------
bas_tag_lib: failed opening basis file 0
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
25: task dft optimize
------------------------------------------------------------------------
------------------------------------------------------------------------
There is an error in the specified basis set
------------------------------------------------------------------------
For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation
For further details see manual section:
0:0:bas_tag_lib: failed opening basis file:: -1
(rank:0 hostname:localhost.localdomain pid:15672):ARMCI DASSERT fail. src/common/armci.c:ARMCI_Error():208 cond:0
Last System Error Message from Task 0:: No such file or directory
application called MPI_Abort(comm=0x84000001, -1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=-1
: system msg for write_line failure : Bad file descriptor
----------------------------------------------------------------------------
I've tried a rebuild using mock and the process is stuck at this point:

+ MPIRUN_PATH=/usr/lib64/openmpi/bin/mpiexec
+ export NWCHEM_EXECUTABLE=/builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem_openmpi
+ NWCHEM_EXECUTABLE=/builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem_openmpi
+ ./doafewqmtests.mpi 2
+ tee ../doafewqmtests.mpi.2_openmpi.log
=======================================================
QM: Running a very small subset of the available tests
=======================================================
Running tests/h2o_opt/h2o_opt
cleaning scratch
copying input and verified output files
running nwchem (/builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem_openmpi)
NWChem execution failed
Failed
Running tests/dft_he2+/dft_he2+
cleaning scratch
copying input and verified output files
running nwchem (/builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem_openmpi)
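One way to keep a hung test run like the one above from blocking a build indefinitely is to bound it with a time limit. The following is only a minimal sketch, assuming GNU coreutils `timeout` is available in the build root; the wrapper name `run_with_limit` is hypothetical, and this is not necessarily how the package spec handles the hang.

```shell
# run_with_limit LIMIT COMMAND [ARGS...]
# Runs COMMAND under coreutils `timeout`. LIMIT is anything `timeout` accepts
# (for example 30m). The function name is hypothetical, for illustration only.
run_with_limit() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # GNU timeout exits with status 124 when the time limit is reached.
        echo "timed out after $limit: $*" >&2
    fi
    return "$rc"
}
```

In the test stage shown above this could be invoked as `run_with_limit 30m ./doafewqmtests.mpi 2`, so that a hang ends with a non-zero exit status instead of stalling the build.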
This may be related to bug #1347788.
Compilation of ccsd code fails on Rawhide:

...
gfortran -Wl,--export-dynamic -L/builddir/build/BUILD/nwchem-6.6/lib/LINUX64 -L/builddir/build/BUILD/nwchem-6.6/src/tools/install/lib -o /builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lsolvation -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -lnwpython -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lmm -lcons -lperfm -ldntmc -lccca -lnwcutil -L /usr/lib64/mpich/lib -lga -larmci -lpeigs -lperfm -lcons -lbq -lnwcutil /usr/lib64/python2.7/config/libpython2.7.so -l64to32 -L/usr/lib64 -lopenblas -L/usr/lib64/mpich/lib -lmpich -lmpifort -lnwcutil -lpython2.7 -lpthread -ldl -lutil -lm
/usr/bin/ld: /builddir/build/BUILD/nwchem-6.6/lib/LINUX64/libtce.a(ccsd_t_singles_l.o): invalid string offset 7089 >= 385 for section `.strtab'
...
/builddir/build/BUILD/nwchem-6.6/lib/LINUX64/libnwctask.a(input_parse.o): In function `input_parse_':
/builddir/build/BUILD/nwchem-6.6/src/input/input_parse.F:219: undefined reference to `tce_input_'
...

Let me try to build nwchem after applying all the patches from http://www.nwchem-sw.org/index.php/Download

I remember that in the past there were usually problems with nwchem when changing gfortran major versions, and Fedora 24 moved to gfortran 6.
I'm getting the tests hanging similar to yours on Fedora 24 and Rawhide (though at different jobs):
http://koji.fedoraproject.org/koji/taskinfo?taskID=14907720
http://koji.fedoraproject.org/koji/taskinfo?taskID=14907781

Building Nwchem-6.6.revision27746-src.2015-10-20.tar.gz locally on Fedora 24 (no rpmbuild) produces a working executable. One difference I can think of between this build and an rpmbuild --rebuild from the nwchem SRPM is the version of global arrays. In the former case nwchem uses its bundled ga (an unknown version of ga-5-4); in the latter case nwchem uses the ga-5-3b version packaged by Fedora.

I'm checking whether rebuilding nwchem against Fedora ga RPMs updated to the latest official ga-5-4 helps. The official ga-5-4 release is newer than nwchem itself, so this may cause other problems.
Can I help in some way?
(In reply to Henrique "LonelySpooky" Junior from comment #5)
> Can I help in some way?

Yes. Concerning your initial steps:

> Steps to Reproduce:
> 1. module load mpi/openmpi-x86_64
> 2. nwchem_openmpi water.nw
> 3. Crash

0. After a fresh nwchem install you need to source an /etc/profile.d/nwchem* script in order to set the environment variables NWCHEM_BASIS_LIBRARY and NWCHEM_NWPW_LIBRARY.

Other than this, I confirm that the nwchem-openmpi tests hang under koji (tried Fedora 24 and Rawhide x86_64), and fail "on my machine" (do not hang). I'm submitting an updated spec which kills the %check stage after 30 minutes so the build can terminate.

Can you work with the nwchem community forum in order to figure out what's wrong with the Fedora builds? Note that on epel6 the nwchem-openmpi tests mostly succeed.
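The environment setup described above can be verified before running a calculation. This is a minimal sketch: the two variable names come from this comment, while the helper name `check_nwchem_env` is hypothetical, and it assumes both variables are expected to point at existing directories.

```shell
# Hypothetical helper: report whether the NWChem library variables are set
# and point at existing directories. Returns non-zero if either check fails.
check_nwchem_env() {
    missing=0
    for name in NWCHEM_BASIS_LIBRARY NWCHEM_NWPW_LIBRARY; do
        # Indirect lookup of the variable named in $name (empty if unset).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            missing=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            missing=1
        else
            echo "$name=$value"
        fi
    done
    return "$missing"
}
```

After sourcing the /etc/profile.d/nwchem* script that matches your shell (or opening a new login shell), `check_nwchem_env` should list both paths. If it reports an unset variable, the "non-existing file" basis set error from the original report is the likely outcome.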
In order to discuss the tests failing on Fedora on the nwchem forum, you would need to build nwchem with its bundled dependencies (global arrays) included. When I do that, the tests seem to fail with the same errors as when using the Fedora nwchem-openmpi RPM. I used https://atlas.hashicorp.com/elastic/boxes/fedora-24-x86_64 as a Fedora base (Vagrantfile attached).

$ sudo su - -c "dnf -y install wget gcc-gfortran libsysfs-devel ncurses-devel openblas-devel openmpi-devel python2-devel readline-devel tcsh zlib-devel"
$ cd /tmp
$ wget 'http://www.nwchem-sw.org/download.php?f=Nwchem-6.6.revision27746-src.2015-10-20.tar.gz' -O Nwchem-6.6.revision27746-src.2015-10-20.tar.gz
$ tar xvf Nwchem-6.6.revision27746-src.2015-10-20.tar.gz
$ cd nwchem-6.6/src

and build using the following:

# see http://www.nwchem-sw.org/index.php/Compiling_NWChem
export NWCHEM_TOP=/tmp/nwchem-6.6
export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export USE_ARUR=TRUE
export USE_NOFSCHECK=TRUE
export NWCHEM_FSCHECK=N
export LARGE_FILES=TRUE
export MRCC_THEORY=Y
export EACCSD=Y
export IPCCSD=Y
export CCSDTQ=Y
export CCSDTLR=Y
export NWCHEM_LONG_PATHS=Y
export PYTHONHOME=/usr
export PYTHONVERSION=2.7
export PYTHONLIBTYPE=so
export USE_PYTHON64=y
export HAS_BLAS=yes
export BLASOPT='-L/usr/lib64 -lopenblas'
export BLAS_SIZE='4'
export MAKE=/usr/bin/make
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPIEXEC=/usr/lib64/openmpi/bin/mpiexec
export MPI_LIB=/usr/lib64/openmpi/lib
export MPI_INCLUDE=/usr/include/openmpi-x86_64
export LIBMPI='-lmpi -lmpi_usempif08 -lmpi_mpifh'
$MAKE nwchem_config NWCHEM_MODULES="all python" 2>&1 | tee ../make_nwchem_config_openmpi.log
$MAKE 64_to_32 2>&1 | tee ../make_64_to_32_openmpi.log
export MAKEOPTS="USE_64TO32=y"
$MAKE ${MAKEOPTS} 2>&1 | tee ../make.log
Created attachment 1180418 [details] Vagrantfile which compiles Nwchem on Fedora 24
Right. I'll get the discussion started today (ASAP).
nwchem-6.6.27746-26.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-f10b67dbd7
Compilation "by hand" seems to finish OK on my Fedora 24 box and the binary is correctly generated (I haven't tried the vagrant box because I don't know the tool and I'm a little short on time to get into it).
Trying a simple calculation gives me one error, but that is probably because I messed up at some point:

Sum of atomic energies: -1716.76316504

Renormalizing density from 86.00 to 84
------------------------------------------------------------------------
spcart_bra2etran: nbf_xj.ne.nbf_sj (xj-sj) = 5
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
43: task DFT optimize
------------------------------------------------------------------------
------------------------------------------------------------------------
This error has not yet been assigned to a category
------------------------------------------------------------------------
For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation

For further details see manual section: No section for this category

0:spcart_bra2etran: nbf_xj.ne.nbf_sj (xj-sj) =:Received an Error in Communication
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 5.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Created attachment 1180761 [details] Make.log
Created attachment 1180762 [details] make_64_to_32
Created attachment 1180764 [details] make_nwchem_openmpi
(In reply to Henrique "LonelySpooky" Junior from comment #11)
> Compilation "by hand" seems to be finishing OK on my Fedora 24 box and the
> binary is correctly generated (haven't tried the vagrant box because I don't
> know the tool and I'm a little short in time to get into it).
> Trying a simple calculation is giving me one error, but that is probably
> because I messed up on some point:
>
> Sum of atomic energies: -1716.76316504
>
> Renormalizing density from 86.00 to 84
> ------------------------------------------------------------------------
> spcart_bra2etran: nbf_xj.ne.nbf_sj (xj-sj) = 5
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> current input line :
> 43: task DFT optimize
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> This error has not yet been assigned to a category
> ------------------------------------------------------------------------
> For more information see the NWChem manual at
> http://www.nwchem-sw.org/index.php/NWChem_Documentation
>
> For further details see manual section: No section for this category
>
> 0:spcart_bra2etran: nbf_xj.ne.nbf_sj (xj-sj) =:Received an Error in
> Communication
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM
> 0
> with errorcode 5.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------

Yes, I get this error when running https://bodhi.fedoraproject.org/updates/FEDORA-2016-f10b67dbd7
That's why I asked whether you could try to sort it out on the nwchem forum. We need the nwchem developers involved in this.
(In reply to marcindulak from comment #15)
> Yes, I get this error when running
> https://bodhi.fedoraproject.org/updates/FEDORA-2016-f10b67dbd7
> That's why I suggested if you can try to sort it out with on nwchem forum.
> We need nwchem developers involved in this.

Thread is open at: http://www.nwchem-sw.org/index.php/Special:AWCforum/st/id2109/#post_7399
nwchem-6.6.27746-26.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-f10b67dbd7
Created attachment 1181380 [details]
This patch turns off the -ftree-dominator-opts optimization

This patch turns off the -ftree-dominator-opts optimization by using the -fno-tree-dominator-opts flag. This seems to fix the failures observed with gfortran 6.1.
nwchem-6.6.27746-27.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-96b3062018
nwchem-6.6.27746-27.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-96b3062018
Marcin,
The current QA failure for h2o-response can be fixed with the following patch:
http://www.nwchem-sw.org/download.php?f=Zgesvd.patch.gz
I suggest you apply all the patches listed at http://www.nwchem-sw.org/index.php/Download#Patches_for_the_27746_revision_of_NWChem_6.6
I'll wait for the next official release of nwchem. It looks to me like applying Txs_gcc6.patch.gz results in a SIGSEGV:
http://koji.fedoraproject.org/koji/taskinfo?taskID=14947501
https://retrace.fedoraproject.org/faf/reports/1216766/
The SIGSEGV occurs early on in the run. It is quite unlikely that Txs_gcc6.patch.gz is the root cause of it, since it modifies a part of the code that is used at a later stage.
Does the SIGSEGV occur just for openmpi, or does it occur for mpich, too?
By the way, the build.log file shows that ga-5-3 is used. Please move to ga-5-4 as soon as possible
(In reply to Edoardo Apra from comment #23)
> The SIGSEGV occurs early on in the run. It is quite unlikely that the
> Txs_gcc6.patch.gz is the root cause for it, since it modifies a part of the
> code that used at a later stage.
> Does the SIGSEGV occur just for openmpi and does it occur for openmpich, too?

I can't reproduce the SIGSEGV locally; it only happened under koji.
(In reply to Edoardo Apra from comment #24)
> By the way, the build.log file shows that ga-5-3 is used. Please move to
> ga-5-4 as soon as possible

I've already requested a ga-5-4 build from the maintainer: bug #1357022
Hi, for me nwchem (6.6.27746-27) is crashing with Open MPI, but working as expected with MPICH. The error with Open MPI is as follows:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7f70c56cbe3a
#1 0x7f70c56cb02d
#2 0x7f70c499876f
#3 0x7f70c49ec1d6
#4 0x7f70c28b8a58
#5 0x7f70c64cfa51
#6 0x7f70c64f288c
#7 0x7f70c933265a
#8 0x7f70c93326ce
#9 0x7f70c933052c
#10 0x52e486
#11 0x52ef73
#12 0x7f70c4984730
#13 0x52cff8
#14 0xffffffffffffffff
Henrique,
Please try to use mpirun to start your Open MPI run.
The SIGSEGV above (which can be avoided by using mpirun or by giving no arguments to nwchem) is fixed in ga-5-4.
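The mpirun workaround can be sketched as a small wrapper. The binary name `nwchem_openmpi` and the input file `water.nw` come from the original report. The function name `run_nwchem` and the override variables `MPIRUN` and `NWCHEM_BIN` are hypothetical, introduced only for this illustration.

```shell
# Launch NWChem through the MPI launcher instead of calling the binary directly.
# MPIRUN and NWCHEM_BIN are hypothetical override variables for this sketch;
# they default to the launcher and binary names mentioned in this report.
run_nwchem() {
    nprocs=$1
    input=$2
    "${MPIRUN:-mpirun}" -np "$nprocs" "${NWCHEM_BIN:-nwchem_openmpi}" "$input"
}
```

After `module load mpi/openmpi-x86_64`, calling `run_nwchem 2 water.nw` would start the calculation on two processes via mpirun. The process count of 2 is an arbitrary choice here.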
nwchem-6.6.27746-27.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.