2262344 – Failed to read package header from file "/var/cache/libdnf5/..."

Bug 2262344 - Failed to read package header from file "/var/cache/libdnf5/..."

Summary: Failed to read package header from file "/var/cache/libdnf5/..."

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	dnf5
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Evan Goode
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2024-02-01 23:39 UTC by Kevin Fenzi
Modified:	2024-10-14 12:59 UTC (History)
CC List:	13 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2024-10-14 12:59:05 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Kevin Fenzi 2024-02-01 23:39:57 UTC

Greetings. 

For a while now we have seen builds sometimes fail due to a 'Failed to read package header' error. 

For example: https://koji.fedoraproject.org/koji/taskinfo?taskID=112684858

fromm the root.log: 


DEBUG util.py:463:  --------------------------------------------------------------------------------
DEBUG util.py:463:  [105/105] Total                         100% |   1.8 MiB/s |  37.7 MiB |  00m21s
DEBUG util.py:463:  Running transaction
DEBUG util.py:461:  Failed to read package header from file "/var/cache/libdnf5/build-c4bb181def2452ef/packages/publicsuffix-list-dafsa-20240107-3.fc40.noarch.rpm"
DEBUG util.py:610:  Child return code was: 1
DEBUG util.py:185:  kill orphans in chroot /var/lib/mock/f40-build-side-82617-48632678-5781450-bootstrap/root

So, dnf5 has downloaded everything, and run the rpm transaction and it fails. 
The indicated packages show correct checksum and that they downloaded correctly. 

This problem seems to have started a while back, but been very sporadic before, but the dnf5 for builds and infrastructure issues with power9 builders seems to have made it much worse. ;( 

I am not sure if this is a rpm issue or something else, but it seems like dnf5 has correctly downloaded things and passed the transaction to rpm, so it seems like it's something in rpm. 

Sadly, I don't have a simple reproducer, but there's lots of builds with this issue. Happy to help gather more info or help trying to track this down. 
Seperately, I am trying to fix the slowness issues with the power9 builders, but this needs sorting aside from that. 

Thanks!

Reproducible: Always

Comment 1 Panu Matilainen 2024-02-02 09:07:09 UTC

Pointers to other builds with this failure can't hurt to gather info. Like, is it always power9 where it fails?
Also, even a rough ballpark idea of when it started could be helpful (last month, half a year, year ago is far better than nothing)

At the point where it fails, the file itself has been successfully opened so we know it exists, but it's failing in rpmReadPackageFile(). As you apparently have verified the package(s) are intact and legit in these situations, my primary suspect would be some unexpected error code in the header/package file reading loop inside rpm, which will retry any failed read on EINTR, EAGAIN and EWOULDBLOCK and give up otherwise. If you end up with some other errno codein there (some kernel/glibc/container glitch, who knows?), it'd behave exactly like this on what is a perfectly legit package. *Except* that it would emit an error message of some sort, there's no path through rpmReadPackageFile() that I know of where rpm would silently fail. The problem is dnf likes to eat up any errors from rpm (this isn't the first time that comes up). I don't know what dnf5 does with them, maybe /var/lib/dnf5.log?

This needs to start with dnf at least logging the failure code from rpmReadPackageFile() and strerror(errno) at that time if errno is set. And, let those rpm error messages through, please. They may not be pretty but they contain diagnostics that are simply not available outside of rpm guts, and when an error occurs you'll want all the info you can get.

Comment 2 Mamoru TASAKA 2024-02-04 00:10:49 UTC

I was the original reporter to fedora-infrastructure : https://pagure.io/fedora-infrastructure/issue/11753

Recently (within a month?) I see frequent build failure with koschei due to failing to read package header:
e.g.

https://koji.fedoraproject.org/koji/taskinfo?taskID=112475224
https://koji.fedoraproject.org/koji/taskinfo?taskID=112636082
https://koji.fedoraproject.org/koji/taskinfo?taskID=112820587
https://koji.fedoraproject.org/koji/taskinfo?taskID=112839582

For now the failure seems happening on ppc64le, but I don't know if this affects other arches.

Comment 3 Panu Matilainen 2024-02-05 12:15:42 UTC

If this is "within a month" thing, the dnf5 swap in mock is in that window. Rpm itself hasn't been updated since early December, except for getting rebuilt in January, but of course one can never count out older hidden bugs suddenly surfacing due to other changes.

Comment 4 Ben Beasley 2024-02-05 23:51:16 UTC

I caught this error on s390x during the mass rebuild:

https://koji.fedoraproject.org/koji/taskinfo?taskID=111995838

(Unfortunately, it looks like the logs were already cleaned up.)

Fabio Valentini caught it on aarch64, too:

https://koji.fedoraproject.org/koji/taskinfo?taskID=112438783

Comment 5 Panu Matilainen 2024-02-06 07:40:07 UTC

Ack, so not arch specific. Thanks, that's useful info. It's always these "secondary" architectures though, which probably is a clue in itself though. PPC where it happens most is having slowness issues. Looking at the logs where available, the failing package is always one of the last 1-8 (or so) to be downloaded, never an early one. 

Based on these bits and the error itself, my primary suspect would be some kind of race condition in the download completion handling. Something that tends to get masked on faster computers / light load.

Comment 6 Evan Goode 2024-02-08 23:44:50 UTC

> This needs to start with dnf at least logging the failure code from rpmReadPackageFile() and strerror(errno) at that time if errno is set. And, let those rpm error messages through, please. They may not be pretty but they contain diagnostics that are simply not available outside of rpm guts, and when an error occurs you'll want all the info you can get.

Related: https://github.com/rpm-software-management/dnf5/issues/522. I agree that we need to be logging those RPM errors.

Comment 7 Evan Goode 2024-02-19 17:38:15 UTC

We'll be able to see DNF 5 logs from Koji builds once https://pagure.io/koji/issue/4009 is addressed. But I'm guessing Panu is right:

> Based on these bits and the error itself, my primary suspect would be some kind of race condition in the download completion handling.

In which case the error message from RPM might not be that helpful. In the meantime, I'll try to investigate this locally outside of Koji.

Comment 8 Evan Goode 2024-09-26 16:09:07 UTC

Is anyone still seeing these errors? I was never able to reproduce this outside of Koji, and it has been a while without any activity here.

Comment 9 Ben Beasley 2024-09-26 17:12:36 UTC

I have not seen this in a very long time.

Comment 10 Kevin Fenzi 2024-10-11 17:54:45 UTC

I haven't seen this in a long time either. 

It seems to have been fixed by something somewhere along the way. 

+1 to just closing this one...

Comment 11 Evan Goode 2024-10-14 12:59:05 UTC

Alright, I'm closing as WORKSFORME then. koji.fedoraproject.org is now running Koji 1.35, so it will correctly log RPM errors from DNF 5 in the build logs, should we come across similar errors in the future.

Note You need to log in before you can comment on or make changes to this bug.