Bug 1997717 - ncdump doesn't read file
Summary: ncdump doesn't read file
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: hdf5
Version: 36
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Orion Poplawski
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-25 16:54 UTC by david08741
Modified: 2023-05-25 19:33 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-25 19:33:29 UTC
Type: Bug
Embargoed:


Attachments
Problematic netcdf file (6.16 KB, application/x-hdf), 2021-08-25 16:54 UTC, david08741
This resolves the issue for me (492 bytes, patch), 2021-08-26 17:54 UTC, david08741

Description david08741 2021-08-25 16:54:43 UTC
Created attachment 1817529 [details]
Problematic netcdf file

Description of problem:
netcdf (or hdf5?) has an issue with the attached file.

Version-Release number of selected component (if applicable):
netcdf-4.8.0-1.fc36.x86_64

How reproducible:
Not sure. I noticed this while building BOUT++; the build sometimes works, but in local testing it failed very often. With the attached file I got 1000 failures in a row.

Steps to Reproduce:
1. ncdump $file

Actual results:
ncdump: /tmp/fileIRh3lF: NetCDF: HDF error

Expected results:
No error, netcdf file is read, as on F34 and below

Additional info:
Some netcdf files work without issue. This one was generated under rawhide as part of BOUT++'s unit tests with `./serial_tests --gtest_filter=OptionsNetCDFTest.FieldPerpWriteCellCentre`

Comment 1 david08741 2021-08-25 17:05:07 UTC
Upon further testing, things are more complicated:

`h5dump /tmp/fileIRh3lF` fails
`cp /tmp/fileIRh3lF . && h5dump fileIRh3lF` works

/tmp is an xfs filesystem, . is a gpfs filesystem.

Please let me know if there is anything I can help with testing ...

Comment 2 david08741 2021-08-25 17:06:27 UTC
As h5dump shows the same behaviour, I assume it is actually hdf5 related.

Version:
hdf5-1.10.7-1.fc36.x86_64

Comment 3 Orion Poplawski 2021-08-26 03:00:36 UTC
I can't reproduce this so I think something must be strange on your end.  Does strace reveal any interesting differences between the two runs?

Comment 4 david08741 2021-08-26 07:10:56 UTC
The error is:
`flock(3, LOCK_SH|LOCK_NB)               = -1 EAGAIN (Resource temporarily unavailable)`

export HDF5_USE_FILE_LOCKING=FALSE avoids the issue.
I also tried this for a build [1], and the errors in the unit tests are gone, so it does not seem to happen only on this specific system.

Did you try on xfs?

hdf5 could really improve hugely on the error reporting - that would help a lot ...

[1] https://koji.fedoraproject.org/koji/taskinfo?taskID=74521222
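
The EAGAIN in the strace line above is what flock(2) reports when a non-blocking request conflicts with a lock held through a different open file description. Below is a minimal Python sketch of that behaviour, assuming a Linux system with a local filesystem and using a throwaway temporary file rather than an HDF5 file. The helper name `try_shared_lock` and the "writer" descriptor are made up for illustration and are not part of hdf5 or netcdf.

```python
import errno
import fcntl
import os
import tempfile


def try_shared_lock(path):
    """Open path read-only and request a non-blocking shared flock.

    Returns 0 on success, or the errno value if the lock is refused.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)  # closing this descriptor also drops any lock taken here


# Create an empty scratch file to lock.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

# Hypothetical stand-in for the process that still has the file open and locked.
writer_fd = os.open(path, os.O_RDWR)
fcntl.flock(writer_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)

while_locked = try_shared_lock(path)  # refused: exclusive lock is still held

os.close(writer_fd)                   # only descriptor closed, lock released
after_close = try_shared_lock(path)   # succeeds now

os.unlink(path)
print(while_locked, after_close)
```

The second open is refused even though it happens in the same process, because flock locks belong to the open file description and not to the process.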

Comment 5 Orion Poplawski 2021-08-26 14:01:05 UTC
It looks like something already holds a lock on that file.  I would suggest trying to figure out what.  Perhaps there are some race conditions in the BOUT++ tests?  If they are run in parallel it may make sense to run them serially.

Comment 6 david08741 2021-08-26 14:07:05 UTC
I just noticed that I still had the unit test running in gdb, which explains why the file was locked.
Unfortunately, this doesn't explain why the unit test is failing.

I have a backtrace, but I am not sure that helps:

```
Thread 1 "serial_tests" hit Catchpoint 1 (exception thrown), 0x00001555549c6572 in __cxxabiv1::__cxa_throw (obj=0xd0d140, tinfo=0x1555552408a0 <typeinfo for netCDF::exceptions::NcHdfErr>, 
    dest=0x155555211610 <netCDF::exceptions::NcHdfErr::~NcHdfErr()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:78
78        PROBE2 (throw, obj, tinfo);
(gdb) bt
#0  0x00001555549c6572 in __cxxabiv1::__cxa_throw (obj=0xd0d140, tinfo=0x1555552408a0 <typeinfo for netCDF::exceptions::NcHdfErr>, dest=0x155555211610 <netCDF::exceptions::NcHdfErr::~NcHdfErr()>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:78
#1  0x000015555520e55e in netCDF::ncCheck (retCode=<optimized out>, file=0x1555552310f8 "../../cxx4/ncFile.cpp", line=88) at ../../cxx4/ncCheck.cpp:58
#2  0x00001555552144ea in netCDF::NcFile::open (this=this@entry=0x7fffffffc700, filePath="/tmp/file4VW8WQ", fMode=fMode@entry=netCDF::NcFile::read) at /usr/include/c++/11/bits/basic_string.h:194
#3  0x00001555552147f4 in netCDF::NcFile::NcFile (this=<optimized out>, filePath=..., fMode=<optimized out>, this=<optimized out>, filePath=..., fMode=<optimized out>) at ../../cxx4/ncFile.cpp:48
#4  0x00000000008769c7 in bout::experimental::OptionsNetCDF::read (this=this@entry=0x7fffffffc830) at options_netcdf.cxx:125
#5  0x000000000077261b in OptionsNetCDFTest_FieldPerpWriteCellCentre_Test::TestBody (this=0xcae3d0) at sys/test_options_netcdf.cxx:232
#6  0x00000000007fad8f in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0xa3b312 "the test body", method=<optimized out>, object=0xcae3d0)
```

Command is:
gdb -args ./serial_tests --gtest_filter=OptionsNetCDFTest.FieldPerpWriteCellCentre

The lock is held by ./serial_tests

This is with BOUT++ v4.4.0, built by running `./configure` and `make -j 32 build-check-unit-tests`.

Disabling locking avoids the issue:
HDF5_USE_FILE_LOCKING=FALSE  ./serial_tests --gtest_filter=OptionsNetCDFTest.FieldPerpWriteCellCentre
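
The same workaround can be applied from a test driver by setting the variable only in the child's environment. This is a sketch that assumes a Python script launches the test binary. `run_without_hdf5_locking` is a hypothetical helper, and the demonstration child just echoes the variable instead of running `./serial_tests`.

```python
import os
import subprocess
import sys


def run_without_hdf5_locking(cmd):
    """Run cmd with HDF5_USE_FILE_LOCKING=FALSE set only for the child."""
    env = dict(os.environ, HDF5_USE_FILE_LOCKING="FALSE")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Demonstration: the child prints the value it sees for the variable.
probe = run_without_hdf5_locking(
    [sys.executable, "-c",
     "import os; print(os.environ['HDF5_USE_FILE_LOCKING'])"]
)
child_value = probe.stdout.strip()
print(child_value)
```

For the test in this report the call would be `run_without_hdf5_locking(["./serial_tests", "--gtest_filter=OptionsNetCDFTest.FieldPerpWriteCellCentre"])`. This only hides the locking failure; it does not explain it.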

Comment 7 david08741 2021-08-26 17:54:10 UTC
Created attachment 1818054 [details]
This resolves the issue for me

The issue seems to be that hdf5 opens the file and then locks it.

Later it closes the fd, but in the meantime additional threads have been spawned, so the lock is not released because the file is still open.

Explicitly unlocking avoids the issue, and subsequent open+flock calls work.
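
The difference between relying on close() and unlocking explicitly can be sketched in Python. This is an illustration under loud assumptions: `os.dup()` stands in for whatever keeps a second reference to the open file description alive in the real program, and it is not a claim about what hdf5 or pthread_create actually does. The helper names are hypothetical, and a Linux system with a local filesystem is assumed.

```python
import fcntl
import os
import tempfile


def can_lock(path):
    """Return True if a fresh open of path gets a non-blocking shared lock."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


def still_locked_after_close(path, unlock_first):
    """Lock path, keep a duplicate descriptor around, then close the original.

    Returns True if the file is still locked after the close.
    """
    fd = os.open(path, os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    # Assumption: the duplicate models a lingering reference to the same
    # open file description; it is not the mechanism used by hdf5.
    lingering_fd = os.dup(fd)
    try:
        if unlock_first:
            fcntl.flock(fd, fcntl.LOCK_UN)  # explicit unlock before close
        os.close(fd)
        return not can_lock(path)
    finally:
        os.close(lingering_fd)  # last reference gone, any lock is released


# Create an empty scratch file to lock.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

without_unlock = still_locked_after_close(path, unlock_first=False)
with_unlock = still_locked_after_close(path, unlock_first=True)
os.unlink(path)
print(without_unlock, with_unlock)
```

Per flock(2), a lock is released by an explicit LOCK_UN on any descriptor that refers to the open file description, or once all such descriptors are closed. That is why close() alone is not enough while a duplicate exists.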

Comment 8 Orion Poplawski 2021-08-27 03:04:21 UTC
So, you are going to need to submit an issue upstream here: https://portal.hdfgroup.org/display/support/The+HDF+Help+Desk. I don't think something like this is appropriate to apply downstream.

Comment 9 david08741 2021-08-27 12:50:46 UTC
I have opened:
https://github.com/HDFGroup/hdf5/pull/967

However, it might make sense to also apply this downstream. The code path in BOUT++ is the same as on F34: the file gets opened and locked, then pthread_create is called a few times, and then the file is closed. Requesting a new lock fails on F35+ but works on F34, so this is a regression in F35+.

For this specific case I have found a workaround in BOUT++, but I think the patch is still useful; hopefully hdf5 will merge it.

Comment 10 Ben Cotton 2022-02-08 21:43:48 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 36 development cycle.
Changing version to 36.

Comment 11 Ben Cotton 2023-04-25 18:26:12 UTC
This message is a reminder that Fedora Linux 36 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 36 on 2023-05-16.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '36'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 36 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 12 Ludek Smid 2023-05-25 19:33:29 UTC
Fedora Linux 36 entered end-of-life (EOL) status on 2023-05-16.

Fedora Linux 36 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

