Created attachment 1817529
Problematic netcdf file

Description of problem:
netcdf (or hdf5?) has an issue with the attached file.

Version-Release number of selected component (if applicable):
netcdf-4.8.0-1.fc36.x86_64

How reproducible:
Not sure. I noticed this while building BOUT++, which sometimes works, but in local testing it failed very often. With the attached file I got 1000 failures in a row.

Steps to Reproduce:
1. ncdump $file

Actual results:
ncdump: /tmp/fileIRh3lF: NetCDF: HDF error

Expected results:
No error; the netcdf file is read, as on F34 and below.

Additional info:
Some netcdf files work without issue. This one was generated under rawhide as part of BOUT++'s unit tests with `./serial_tests --gtest_filter=OptionsNetCDFTest.FieldPerpWriteCellCentre`
Upon further testing, things are more complicated:
`h5dump /tmp/fileIRh3lF` fails
`cp /tmp/fileIRh3lF . && h5dump fileIRh3lF` works
/tmp is an xfs filesystem, . is a gpfs filesystem.
Please let me know if there is anything I can help with testing ...
As h5dump shows the same behaviour, I assume this is actually hdf5 related.
Version: hdf5-1.10.7-1.fc36.x86_64
I can't reproduce this so I think something must be strange on your end. Does strace reveal any interesting differences between the two runs?
The error is:
`flock(3, LOCK_SH|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)`
`export HDF5_USE_FILE_LOCKING=FALSE` avoids the issue.
I also tried this for a build [1] and the errors in the unit tests are gone, so it does not seem to happen only on this specific system.
Did you try on xfs?
hdf5 could improve hugely on the error reporting - that would help a lot ...
[1] https://koji.fedoraproject.org/koji/taskinfo?taskID=74521222
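For anyone following along, the `EAGAIN` from `flock(..., LOCK_SH|LOCK_NB)` can be reproduced without hdf5 at all. The sketch below is only an illustration of the system call behaviour on Linux, not of what hdf5 does internally; the helper names `try_lock` and `demo_contention` are made up for this example, and it assumes the other holder has an exclusive lock.

```python
import errno
import fcntl
import os
import tempfile


def try_lock(path, flags):
    """Open `path` separately and try a non-blocking flock.

    Returns (fd, None) on success or (fd, errno) if the lock was refused.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, flags | fcntl.LOCK_NB)
        return fd, None
    except OSError as exc:
        return fd, exc.errno


def demo_contention():
    """One descriptor takes LOCK_EX, a second open of the same file asks for LOCK_SH."""
    with tempfile.NamedTemporaryFile() as tmp:
        holder_fd, holder_err = try_lock(tmp.name, fcntl.LOCK_EX)
        reader_fd, reader_err = try_lock(tmp.name, fcntl.LOCK_SH)
        os.close(reader_fd)
        os.close(holder_fd)
        return holder_err, reader_err


if __name__ == "__main__":
    holder_err, reader_err = demo_contention()
    print("holder:", holder_err)
    print("reader:", errno.errorcode.get(reader_err, reader_err))
```

The second call fails for as long as the first lock is held, which matches the strace line above.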
It looks like something already holds a lock on that file. I would suggest trying to figure out what. Perhaps there are some race conditions in the BOUT++ tests? If they are run in parallel it may make sense to run them serially.
I just noticed that I still had the unit test running in gdb, so that explains why the file was locked.
Unfortunately, this doesn't explain why the unit test is failing.
I have a backtrace, but I am not sure that helps:
```
Thread 1 "serial_tests" hit Catchpoint 1 (exception thrown), 0x00001555549c6572 in __cxxabiv1::__cxa_throw (obj=0xd0d140, tinfo=0x1555552408a0 <typeinfo for netCDF::exceptions::NcHdfErr>, dest=0x155555211610 <netCDF::exceptions::NcHdfErr::~NcHdfErr()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:78
78	  PROBE2 (throw, obj, tinfo);
(gdb) bt
#0  0x00001555549c6572 in __cxxabiv1::__cxa_throw (obj=0xd0d140, tinfo=0x1555552408a0 <typeinfo for netCDF::exceptions::NcHdfErr>, dest=0x155555211610 <netCDF::exceptions::NcHdfErr::~NcHdfErr()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:78
#1  0x000015555520e55e in netCDF::ncCheck (retCode=<optimized out>, file=0x1555552310f8 "../../cxx4/ncFile.cpp", line=88) at ../../cxx4/ncCheck.cpp:58
#2  0x00001555552144ea in netCDF::NcFile::open (this=this@entry=0x7fffffffc700, filePath="/tmp/file4VW8WQ", fMode=fMode@entry=netCDF::NcFile::read) at /usr/include/c++/11/bits/basic_string.h:194
#3  0x00001555552147f4 in netCDF::NcFile::NcFile (this=<optimized out>, filePath=..., fMode=<optimized out>, this=<optimized out>, filePath=..., fMode=<optimized out>) at ../../cxx4/ncFile.cpp:48
#4  0x00000000008769c7 in bout::experimental::OptionsNetCDF::read (this=this@entry=0x7fffffffc830) at options_netcdf.cxx:125
#5  0x000000000077261b in OptionsNetCDFTest_FieldPerpWriteCellCentre_Test::TestBody (this=0xcae3d0) at sys/test_options_netcdf.cxx:232
#6  0x00000000007fad8f in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0xa3b312 "the test body", method=<optimized out>, object=0xcae3d0)
```
Command is:
gdb -args ./serial_tests --gtest_filter=OptionsNetCDFTest.FieldPerpWriteCellCentre
The lock is held by ./serial_tests. This is with BOUT++ v4.4.0, after running `./configure` and `make -j 32 build-check-unit-tests`.
Disabling locking avoids the issue:
HDF5_USE_FILE_LOCKING=FALSE ./serial_tests --gtest_filter=OptionsNetCDFTest.FieldPerpWriteCellCentre
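If the test binary is launched from a script rather than a shell, the same workaround can be applied per process. This is a rough sketch only: `run_without_hdf5_locking` is a hypothetical helper, and the one thing taken from this report is the variable name and value `HDF5_USE_FILE_LOCKING=FALSE`.

```python
import os
import subprocess
import sys


def run_without_hdf5_locking(cmd):
    """Run `cmd` with HDF5_USE_FILE_LOCKING=FALSE set in a copy of the environment."""
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["HDF5_USE_FILE_LOCKING"] = "FALSE"
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in command that only echoes the variable; replace it with the
    # real test invocation, e.g. ["./serial_tests", "--gtest_filter=..."].
    probe = [sys.executable, "-c",
             "import os; print(os.environ['HDF5_USE_FILE_LOCKING'])"]
    print(run_without_hdf5_locking(probe).stdout.strip())
```

This only hides the symptom for the child process; it does not explain why the lock is still held.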
Created attachment 1818054
This resolves the issue for me

The issue seems to be that hdf5 opens the file and then locks it. Later, it closes the fd, but in the meantime additional threads have been spawned, and thus the lock is not cleared, as the file is still open.
Explicitly unlocking avoids the issue, and subsequent open+flock calls work.
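The general flock behaviour behind this can be shown in isolation. A flock lock belongs to the open file description, so closing one descriptor does not release it while another reference to the same description exists, whereas an explicit `LOCK_UN` releases it immediately. The sketch below uses `os.dup` as a stand-in for "the file is still open"; that is an assumption made for illustration, not a claim about how the extra reference arises inside hdf5 or BOUT++. The function names are hypothetical.

```python
import fcntl
import os
import tempfile


def can_lock(path):
    """Return True if a fresh open of `path` can take a non-blocking shared lock."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False
    finally:
        os.close(fd)  # closing the only descriptor also drops the shared lock


def demo_lingering_lock():
    with tempfile.NamedTemporaryFile() as tmp:
        fd = os.open(tmp.name, os.O_RDWR)
        fcntl.flock(fd, fcntl.LOCK_EX)
        extra = os.dup(fd)  # second reference to the same open file description
        os.close(fd)        # close without unlocking

        after_close = can_lock(tmp.name)   # lock is still held via `extra`

        fcntl.flock(extra, fcntl.LOCK_UN)  # explicit unlock
        after_unlock = can_lock(tmp.name)  # lock is gone now

        os.close(extra)
        return after_close, after_unlock


if __name__ == "__main__":
    print(demo_lingering_lock())
```

Unlocking before the close, as the attached patch does, makes the outcome independent of whatever else still refers to the file.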
So, you are going to need to submit an issue upstream here:
https://portal.hdfgroup.org/display/support/The+HDF+Help+Desk
I don't think something like this is appropriate to apply downstream.
I have opened:
https://github.com/HDFGroup/hdf5/pull/967
However, it might make sense to also apply this downstream. The code path in BOUT++ is the same as on F34: the file gets opened and locked, then pthread_create is called a few times, and then the file is closed. Requesting a new lock fails on F35+ while it works on F34, so this is a regression in F35+.
For this specific case I have found a workaround in BOUT++, but I think the patch is still useful; hopefully hdf5 will merge it ...
This bug appears to have been reported against 'rawhide' during the Fedora 36 development cycle. Changing version to 36.
This message is a reminder that Fedora Linux 36 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 36 on 2023-05-16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '36'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 36 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 36 entered end-of-life (EOL) status on 2023-05-16. Fedora Linux 36 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field. If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.