Bug 1356735 - NWChem can not read a basis set and crashes
Summary: NWChem can not read a basis set and crashes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: nwchem
Version: 24
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: marcindulak
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-07-14 21:44 UTC by Henrique C. S. Junior
Modified: 2016-08-08 20:29 UTC (History)

Fixed In Version: nwchem-6.6.27746-27.fc24
Clone Of:
Environment:
Last Closed: 2016-08-08 20:29:48 UTC
Type: Bug
Embargoed:


Attachments
Vagrantfile which compiles Nwchem on Fedora 24 (2.35 KB, text/plain)
2016-07-16 13:01 UTC, marcindulak
Make.log (176.79 KB, application/x-bzip)
2016-07-17 13:35 UTC, Henrique C. S. Junior
make_64_to_32 (71.08 KB, application/x-bzip)
2016-07-17 13:37 UTC, Henrique C. S. Junior
make_nwchem_openmpi (9.82 KB, text/plain)
2016-07-17 13:40 UTC, Henrique C. S. Junior
This patch turns off the -ftree-dominator-opts optimization (412 bytes, patch)
2016-07-19 02:29 UTC, Edoardo Apra

Description Henrique C. S. Junior 2016-07-14 21:44:29 UTC
Description of problem:
Any calculations on NWChem are causing a crash with mpich-x86_64 or openmpi-x86_64 


Version-Release number of selected component (if applicable):
nwchem-common-6.6.27746-24.fc24.noarch
nwchem-openmpi-6.6.27746-24.fc24.x86_64
nwchem-mpich-6.6.27746-24.fc24.x86_64
nwchem-6.6.27746-24.fc24.x86_64


How reproducible:
Create any simple input file and run it.


Steps to Reproduce:
1. module load mpi/openmpi-x86_64
2. nwchem_openmpi water.nw
3. Crash
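
The report does not include water.nw itself. A minimal input of that kind might look like the sketch below; the geometry, the title and the "start" name are illustrative assumptions, while the cc-pvdz basis and the "task dft optimize" line match the log further down.

```shell
# Write a small NWChem input file for a DFT geometry optimization of water.
# Coordinates are an approximate water geometry chosen for illustration only.
cat > water.nw <<'EOF'
start water
title "Water DFT optimization"
geometry units angstrom
  O  0.000  0.000  0.000
  H  0.000  0.757  0.587
  H  0.000 -0.757  0.587
end
basis
  * library cc-pvdz
end
task dft optimize
EOF
```

The "* library cc-pvdz" line is what makes NWChem look up the basis set file in its basis library directory, which is the step that fails in this report.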

Actual results:
With openmpi, NWChem generates a segmentation fault; with mpich the error is more descriptive, but in both cases the process dies in the DFT stage.

Expected results:
The magic of computational chemistry should appear as a result.


Additional info:
It looks like NWChem is unable to read any basis set, and that is causing the crash. I can see this error:  Attempting to read a basis set from a  non-existing file:
 /builddir/build/BUILD/nwchem-6.6/src/basis/libraries/cc-pvdz   


It looks like the RPM macros haven't converted the path to a / directory structure.

----------------------------------------------------------------------------

                                 NWChem DFT Module
                                 -----------------


                                      Título


 Attempting to read a basis set from a  non-existing file:
 /builddir/build/BUILD/nwchem-6.6/src/basis/libraries/cc-pvdz                                                                                                                                                                                                   
 ------------------------------------------------------------------------
 bas_tag_lib: failed opening basis file                   0
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
    25: task dft optimize
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 There is an error in the specified basis set
 ------------------------------------------------------------------------
 For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation


 For further details see manual section:                                                                                                                                                                                                                                                                
0:0:bas_tag_lib: failed opening basis file:: -1
(rank:0 hostname:localhost.localdomain pid:15672):ARMCI DASSERT fail. src/common/armci.c:ARMCI_Error():208 cond:0
Last System Error Message from Task 0:: No such file or directory
application called MPI_Abort(comm=0x84000001, -1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=-1
:
system msg for write_line failure : Bad file descriptor

----------------------------------------------------------------------------

Comment 1 Henrique C. S. Junior 2016-07-15 10:45:11 UTC
I've tried a rebuild using mock and the process is stuck at this point:

+ MPIRUN_PATH=/usr/lib64/openmpi/bin/mpiexec
+ export NWCHEM_EXECUTABLE=/builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem_openmpi
+ NWCHEM_EXECUTABLE=/builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem_openmpi
+ ./doafewqmtests.mpi 2
+ tee ../doafewqmtests.mpi.2_openmpi.log

 =======================================================
 QM: Running a very small subset of the available tests 
 =======================================================


 
 Running tests/h2o_opt/h2o_opt 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem_openmpi)
 
     NWChem execution failed
 
Failed

 
 Running tests/dft_he2+/dft_he2+ 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem_openmpi)

Comment 2 marcindulak 2016-07-15 14:23:57 UTC
maybe related to bug #1347788

Comment 3 marcindulak 2016-07-15 14:42:54 UTC
Compilation of ccsd code fails on Rawhide

...
gfortran -Wl,--export-dynamic -L/builddir/build/BUILD/nwchem-6.6/lib/LINUX64 -L/builddir/build/BUILD/nwchem-6.6/src/tools/install/lib -o /builddir/build/BUILD/nwchem-6.6/bin/LINUX64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lsolvation -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -lnwpython -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lmm -lcons -lperfm -ldntmc -lccca -lnwcutil -L/usr/lib64/mpich/lib -lga -larmci -lpeigs -lperfm -lcons -lbq -lnwcutil /usr/lib64/python2.7/config/libpython2.7.so -l64to32 -L/usr/lib64 -lopenblas -L/usr/lib64/mpich/lib -lmpich -lmpifort -lnwcutil -lpython2.7 -lpthread -ldl -lutil -lm
/usr/bin/ld: /builddir/build/BUILD/nwchem-6.6/lib/LINUX64/libtce.a(ccsd_t_singles_l.o): invalid string offset 7089 >= 385 for section `.strtab'
...

/builddir/build/BUILD/nwchem-6.6/lib/LINUX64/libnwctask.a(input_parse.o): In function `input_parse_':
/builddir/build/BUILD/nwchem-6.6/src/input/input_parse.F:219: undefined reference to `tce_input_'
...

Let me try to build nwchem after applying all the patches from http://www.nwchem-sw.org/index.php/Download

I remember in the past there were usually problems with nwchem when changing gfortran major versions - Fedora 24 moved to gfortran 6.

Comment 4 marcindulak 2016-07-15 20:00:20 UTC
I'm seeing the tests hang, similar to yours, on Fedora 24 and Rawhide (though at different jobs):
http://koji.fedoraproject.org/koji/taskinfo?taskID=14907720
http://koji.fedoraproject.org/koji/taskinfo?taskID=14907781

Building Nwchem-6.6.revision27746-src.2015-10-20.tar.gz locally on Fedora 24 (no rpmbuild) produces a working executable. One difference I can think of between this build and an rpmbuild --rebuild from nwchem SRPM is a different version of global arrays.
In the former case nwchem uses its bundled ga (an unknown version of ga-5-4); in the latter case nwchem uses the ga-5-3b version packaged by Fedora.

I'm checking whether rebuilding nwchem against Fedora ga RPMs updated to the latest official ga-5-4 helps. The official ga-5-4 release is newer than nwchem itself, so this may cause other problems.

Comment 5 Henrique C. S. Junior 2016-07-15 22:26:54 UTC
Can I help in some way?

Comment 6 marcindulak 2016-07-16 11:55:18 UTC
(In reply to Henrique "LonelySpooky" Junior from comment #5)
> Can I help in some way?

yes.

Concerning your initial steps:

> Steps to Reproduce:
> 1. module load mpi/openmpi-x86_64
> 2. nwchem_openmpi water.nw
> 3. Crash
> 

0. After a fresh nwchem install you need to source the /etc/profile.d/nwchem* script in order to set these environment variables:
NWCHEM_BASIS_LIBRARY
NWCHEM_NWPW_LIBRARY
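
A minimal sketch of step 0, assuming the profile script is a POSIX shell file matching /etc/profile.d/nwchem*.sh (the exact file name is not given here). The helper check_library_var is hypothetical and only verifies that a variable points at an existing directory.

```shell
# Source any nwchem profile scripts (the .sh suffix is an assumption).
for f in /etc/profile.d/nwchem*.sh; do
    if [ -r "$f" ]; then
        . "$f"
    fi
done

# check_library_var NAME: succeed only if the variable NAME is set
# and its value is an existing directory. Hypothetical helper.
check_library_var() {
    name=$1
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        return 1
    elif [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value" >&2
        return 1
    fi
    echo "$name=$value"
}

# Typical use before running nwchem:
# check_library_var NWCHEM_BASIS_LIBRARY && check_library_var NWCHEM_NWPW_LIBRARY
```

If NWCHEM_BASIS_LIBRARY is unset, the binary falls back to the path compiled in at build time, which would explain the /builddir/build/BUILD/... path in the original error.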

Other than this, I confirm that the nwchem-openmpi tests hang under koji (tried Fedora 24 and Rawhide x86_64) and fail "on my machine" (they do not hang there). I'm submitting an updated spec which kills the %check stage after 30 minutes so the build can terminate. Can you work with the nwchem community forum to figure out what's wrong with the Fedora builds? Note that on epel6 the nwchem-openmpi tests mostly succeed.
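
The spec change itself is not shown in this report. One way to put a time limit on a hanging test run is the coreutils timeout command; the sketch below assumes that approach, and run_with_limit is a hypothetical helper name.

```shell
# run_with_limit LIMIT CMD...: run CMD under coreutils `timeout`.
# timeout exits with status 124 when the limit is reached.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "command exceeded $limit and was terminated" >&2
    fi
    return "$status"
}

# Possible use in a %check stage (script name and argument as in comment 1):
# run_with_limit 30m ./doafewqmtests.mpi 2
```

A build script can then decide whether status 124 should fail the build or only be logged.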

Comment 7 marcindulak 2016-07-16 12:59:45 UTC
In order to discuss the failing Fedora tests on the nwchem forum, you would need to build nwchem with its bundled global arrays dependency included.
When I do that, the tests seem to fail with the same errors as with the Fedora nwchem-openmpi RPM. I used https://atlas.hashicorp.com/elastic/boxes/fedora-24-x86_64 as the Fedora base (Vagrantfile attached).

$ sudo su - -c "dnf -y install wget gcc-gfortran libsysfs-devel ncurses-devel openblas-devel openmpi-devel python2-devel readline-devel tcsh zlib-devel"
$ cd /tmp
$ wget 'http://www.nwchem-sw.org/download.php?f=Nwchem-6.6.revision27746-src.2015-10-20.tar.gz' -O Nwchem-6.6.revision27746-src.2015-10-20.tar.gz
$ tar xvf Nwchem-6.6.revision27746-src.2015-10-20.tar.gz
$ cd nwchem-6.6/src

and build using the following:

# see http://www.nwchem-sw.org/index.php/Compiling_NWChem
export NWCHEM_TOP=/tmp/nwchem-6.6
export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export USE_ARUR=TRUE
export USE_NOFSCHECK=TRUE
export NWCHEM_FSCHECK=N
export LARGE_FILES=TRUE
export MRCC_THEORY=Y
export EACCSD=Y
export IPCCSD=Y
export CCSDTQ=Y
export CCSDTLR=Y
export NWCHEM_LONG_PATHS=Y
export PYTHONHOME=/usr
export PYTHONVERSION=2.7
export PYTHONLIBTYPE=so
export USE_PYTHON64=y
export HAS_BLAS=yes
export BLASOPT='-L/usr/lib64 -lopenblas'
export BLAS_SIZE='4'
export MAKE=/usr/bin/make
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPIEXEC=/usr/lib64/openmpi/bin/mpiexec
export MPI_LIB=/usr/lib64/openmpi/lib
export MPI_INCLUDE=/usr/include/openmpi-x86_64
export LIBMPI='-lmpi -lmpi_usempif08 -lmpi_mpifh'
$MAKE nwchem_config NWCHEM_MODULES="all python" 2>&1 | tee ../make_nwchem_config_openmpi.log
$MAKE 64_to_32 2>&1 | tee ../make_64_to_32_openmpi.log
export MAKEOPTS="USE_64TO32=y"
$MAKE ${MAKEOPTS} 2>&1 | tee ../make.log
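
After the build finishes, a quick check that the executable was produced can save a confusing test run. This is a sketch that assumes the binary lands at $NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem, as the linker output path in comment 3 suggests; check_nwchem_binary is a hypothetical helper.

```shell
# check_nwchem_binary TOP TARGET: succeed only if TOP/bin/TARGET/nwchem
# exists as a regular, executable file.
check_nwchem_binary() {
    bin="$1/bin/$2/nwchem"
    if [ -f "$bin" ] && [ -x "$bin" ]; then
        echo "found: $bin"
    else
        echo "missing or not executable: $bin" >&2
        return 1
    fi
}

# Typical use with the variables exported above:
# check_nwchem_binary "$NWCHEM_TOP" "$NWCHEM_TARGET"
```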

Comment 8 marcindulak 2016-07-16 13:01:16 UTC
Created attachment 1180418 [details]
Vagrantfile which compiles Nwchem on Fedora 24

Comment 9 Henrique C. S. Junior 2016-07-16 14:04:04 UTC
Right. I'll get the discussion started today (ASAP).

Comment 10 Fedora Update System 2016-07-16 17:00:37 UTC
nwchem-6.6.27746-26.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-f10b67dbd7

Comment 11 Henrique C. S. Junior 2016-07-17 13:33:11 UTC
Compilation "by hand" seems to finish OK on my Fedora 24 box and the binary is correctly generated (I haven't tried the vagrant box because I don't know the tool and I'm a little short on time to get into it).
Trying a simple calculation gives me one error, but that is probably because I messed up at some point:

Sum of atomic energies:       -1716.76316504

 Renormalizing density from      86.00 to     84
 ------------------------------------------------------------------------
 spcart_bra2etran: nbf_xj.ne.nbf_sj  (xj-sj) =                   5
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
    43: task DFT  optimize
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
 This error has not yet been assigned to a category
 ------------------------------------------------------------------------
 For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation


 For further details see manual section: No section for this category                                                                                                                                                                                                                                   
0:spcart_bra2etran: nbf_xj.ne.nbf_sj  (xj-sj) =:Received an Error in Communication
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0 
with errorcode 5.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Comment 12 Henrique C. S. Junior 2016-07-17 13:35:54 UTC
Created attachment 1180761 [details]
Make.log

Comment 13 Henrique C. S. Junior 2016-07-17 13:37:38 UTC
Created attachment 1180762 [details]
make_64_to_32

Comment 14 Henrique C. S. Junior 2016-07-17 13:40:03 UTC
Created attachment 1180764 [details]
make_nwchem_openmpi

Comment 15 marcindulak 2016-07-17 15:13:44 UTC
(In reply to Henrique "LonelySpooky" Junior from comment #11)
> Compilation "by hand" seems to be finishing OK on my Fedora 24 box and the
> binary is correctly generated (haven't tried the vagrant box because I don't
> know the tool and I'm a little short in time to get into it).
> Trying a simple calculation is giving me one error, but that is probably
> because I messed up on some point:
> 
> Sum of atomic energies:       -1716.76316504
> 
>  Renormalizing density from      86.00 to     84
>  ------------------------------------------------------------------------
>  spcart_bra2etran: nbf_xj.ne.nbf_sj  (xj-sj) =                   5
>  ------------------------------------------------------------------------
>  ------------------------------------------------------------------------
>   current input line : 
>     43: task DFT  optimize
>  ------------------------------------------------------------------------
>  ------------------------------------------------------------------------
>  This error has not yet been assigned to a category
>  ------------------------------------------------------------------------
>  For more information see the NWChem manual at
> http://www.nwchem-sw.org/index.php/NWChem_Documentation
> 
> 
>  For further details see manual section: No section for this category       
> 
> 0:spcart_bra2etran: nbf_xj.ne.nbf_sj  (xj-sj) =:Received an Error in
> Communication
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM
> 0 
> with errorcode 5.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------

Yes, I get this error when running https://bodhi.fedoraproject.org/updates/FEDORA-2016-f10b67dbd7
That's why I asked whether you could try to sort it out on the nwchem forum.
We need nwchem developers involved in this.

Comment 16 Henrique C. S. Junior 2016-07-17 19:39:13 UTC
(In reply to marcindulak from comment #15)
> Yes, I get this error when running
> https://bodhi.fedoraproject.org/updates/FEDORA-2016-f10b67dbd7
> That's why I suggested if you can try to sort it out with on nwchem forum.
> We need nwchem developers involved in this.

Thread is open at: http://www.nwchem-sw.org/index.php/Special:AWCforum/st/id2109/#post_7399

Comment 17 Fedora Update System 2016-07-18 22:26:10 UTC
nwchem-6.6.27746-26.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-f10b67dbd7

Comment 18 Edoardo Apra 2016-07-19 02:29:32 UTC
Created attachment 1181380 [details]
This patch turns off the -ftree-dominator-opts optimization

This patch turns off the -ftree-dominator-opts optimization by using the -fno-tree-dominator-opts flag. This seems to fix the failures observed with gfortran 6.1
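
The attached patch is not reproduced here. As a rough illustration of the idea, a build script could add the flag only for the affected compiler versions. The sketch below assumes that every gfortran major version from 6 upward should get it, which goes beyond what this comment states (only 6.1 is mentioned); extra_fflags is a hypothetical helper.

```shell
# extra_fflags MAJOR: print the extra flag for a given gfortran major version.
# The ">= 6" threshold is an assumption; the report only mentions gfortran 6.1.
extra_fflags() {
    if [ "$1" -ge 6 ]; then
        echo "-fno-tree-dominator-opts"
    fi
}

# Possible use:
# major=$(gfortran -dumpversion | cut -d. -f1)
# FFLAGS="-O2 $(extra_fflags "$major")"
```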

Comment 19 Fedora Update System 2016-07-20 07:21:55 UTC
nwchem-6.6.27746-27.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-96b3062018

Comment 20 Fedora Update System 2016-07-21 04:21:39 UTC
nwchem-6.6.27746-27.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-96b3062018

Comment 21 Edoardo Apra 2016-07-22 02:06:33 UTC
Marcin,

The current QA failure for h2o-response can be fixed with the following patch

http://www.nwchem-sw.org/download.php?f=Zgesvd.patch.gz 

I suggest you apply all the patches listed at

http://www.nwchem-sw.org/index.php/Download#Patches_for_the_27746_revision_of_NWChem_6.6

Comment 22 marcindulak 2016-07-22 08:37:09 UTC
I'll wait for the next official release of nwchem.

It looks to me that applying Txs_gcc6.patch.gz results in SIGSEGV
http://koji.fedoraproject.org/koji/taskinfo?taskID=14947501
https://retrace.fedoraproject.org/faf/reports/1216766/

Comment 23 Edoardo Apra 2016-07-22 15:33:53 UTC
The SIGSEGV occurs early on in the run. It is quite unlikely that Txs_gcc6.patch.gz is the root cause, since it modifies a part of the code that is used at a later stage.
Does the SIGSEGV occur only for openmpi, or does it occur for mpich, too?

Comment 24 Edoardo Apra 2016-07-22 15:35:43 UTC
By the way, the build.log file shows that ga-5-3 is used. Please move to ga-5-4 as soon as possible

Comment 25 marcindulak 2016-07-22 22:41:11 UTC
(In reply to Edoardo Apra from comment #23)
> The SIGSEGV occurs early on in the run. It is quite unlikely that the
> Txs_gcc6.patch.gz is the root cause for it, since it modifies a part of the
> code that used at a later stage.
> Does the SIGSEGV occur just for openmpi and does it occur for openmpich, too?

I can't reproduce the SIGSEGV locally; it only happened under koji.

Comment 26 marcindulak 2016-07-22 22:42:18 UTC
(In reply to Edoardo Apra from comment #24)
> By the way, the build.log file shows that ga-5-3 is used. Please move to
> ga-5-4 as soon as possible

I've already requested ga-5-4 build from the maintainer: bug #1357022

Comment 27 Henrique C. S. Junior 2016-08-01 12:04:50 UTC
Hi, for me nwchem (6.6.27746-27) is crashing with openMPI, but working as expected with mpich.
The error with openMPI is as follows:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f70c56cbe3a
#1  0x7f70c56cb02d
#2  0x7f70c499876f
#3  0x7f70c49ec1d6
#4  0x7f70c28b8a58
#5  0x7f70c64cfa51
#6  0x7f70c64f288c
#7  0x7f70c933265a
#8  0x7f70c93326ce
#9  0x7f70c933052c
#10  0x52e486
#11  0x52ef73
#12  0x7f70c4984730
#13  0x52cff8
#14  0xffffffffffffffff

Comment 28 Edoardo Apra 2016-08-01 18:57:14 UTC
Henrique
Please try to use mpirun to start your openMPI run

Comment 29 Edoardo Apra 2016-08-01 19:07:25 UTC
The SIGSEGV above (which can be avoided by using mpirun or by giving no arguments to nwchem) is fixed in ga-5-4
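
A sketch of the mpirun workaround, assuming the Fedora binary name nwchem_openmpi and the module name from the original description. The run_nwchem wrapper and its DRY_RUN switch are hypothetical and exist only to make the command line easy to inspect.

```shell
# run_nwchem NPROCS INPUT: start the Open MPI build of nwchem through mpirun.
# With DRY_RUN=1 the command line is only printed, not executed.
run_nwchem() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "mpirun -np $1 nwchem_openmpi $2"
    else
        mpirun -np "$1" nwchem_openmpi "$2"
    fi
}

# Typical use, after `module load mpi/openmpi-x86_64`:
# run_nwchem 2 water.nw
```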

Comment 30 Fedora Update System 2016-08-08 20:29:44 UTC
nwchem-6.6.27746-27.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.