Greetings. For a while now we have seen builds sometimes fail due to a 'Failed to read package header' error. For example: https://koji.fedoraproject.org/koji/taskinfo?taskID=112684858 fromm the root.log: DEBUG util.py:463: -------------------------------------------------------------------------------- DEBUG util.py:463: [105/105] Total 100% | 1.8 MiB/s | 37.7 MiB | 00m21s DEBUG util.py:463: Running transaction DEBUG util.py:461: Failed to read package header from file "/var/cache/libdnf5/build-c4bb181def2452ef/packages/publicsuffix-list-dafsa-20240107-3.fc40.noarch.rpm" DEBUG util.py:610: Child return code was: 1 DEBUG util.py:185: kill orphans in chroot /var/lib/mock/f40-build-side-82617-48632678-5781450-bootstrap/root So, dnf5 has downloaded everything, and run the rpm transaction and it fails. The indicated packages show correct checksum and that they downloaded correctly. This problem seems to have started a while back, but been very sporadic before, but the dnf5 for builds and infrastructure issues with power9 builders seems to have made it much worse. ;( I am not sure if this is a rpm issue or something else, but it seems like dnf5 has correctly downloaded things and passed the transaction to rpm, so it seems like it's something in rpm. Sadly, I don't have a simple reproducer, but there's lots of builds with this issue. Happy to help gather more info or help trying to track this down. Seperately, I am trying to fix the slowness issues with the power9 builders, but this needs sorting aside from that. Thanks! Reproducible: Always
Pointers to other builds with this failure can't hurt to gather info. Like, is it always power9 where it fails? Also, even a rough ballpark idea of when it started could be helpful (last month, half a year, year ago is far better than nothing) At the point where it fails, the file itself has been successfully opened so we know it exists, but it's failing in rpmReadPackageFile(). As you apparently have verified the package(s) are intact and legit in these situations, my primary suspect would be some unexpected error code in the header/package file reading loop inside rpm, which will retry any failed read on EINTR, EAGAIN and EWOULDBLOCK and give up otherwise. If you end up with some other errno codein there (some kernel/glibc/container glitch, who knows?), it'd behave exactly like this on what is a perfectly legit package. *Except* that it would emit an error message of some sort, there's no path through rpmReadPackageFile() that I know of where rpm would silently fail. The problem is dnf likes to eat up any errors from rpm (this isn't the first time that comes up). I don't know what dnf5 does with them, maybe /var/lib/dnf5.log? This needs to start with dnf at least logging the failure code from rpmReadPackageFile() and strerror(errno) at that time if errno is set. And, let those rpm error messages through, please. They may not be pretty but they contain diagnostics that are simply not available outside of rpm guts, and when an error occurs you'll want all the info you can get.
I was the original reporter to fedora-infrastructure : https://pagure.io/fedora-infrastructure/issue/11753 Recently (within a month?) I see frequent build failure with koschei due to failing to read package header: e.g. https://koji.fedoraproject.org/koji/taskinfo?taskID=112475224 https://koji.fedoraproject.org/koji/taskinfo?taskID=112636082 https://koji.fedoraproject.org/koji/taskinfo?taskID=112820587 https://koji.fedoraproject.org/koji/taskinfo?taskID=112839582 For now the failure seems happening on ppc64le, but I don't know if this affects other arches.
If this is "within a month" thing, the dnf5 swap in mock is in that window. Rpm itself hasn't been updated since early December, except for getting rebuilt in January, but of course one can never count out older hidden bugs suddenly surfacing due to other changes.
I caught this error on s390x during the mass rebuild: https://koji.fedoraproject.org/koji/taskinfo?taskID=111995838 (Unfortunately, it looks like the logs were already cleaned up.) Fabio Valentini caught it on aarch64, too: https://koji.fedoraproject.org/koji/taskinfo?taskID=112438783
Ack, so not arch specific. Thanks, that's useful info. It's always these "secondary" architectures though, which probably is a clue in itself though. PPC where it happens most is having slowness issues. Looking at the logs where available, the failing package is always one of the last 1-8 (or so) to be downloaded, never an early one. Based on these bits and the error itself, my primary suspect would be some kind of race condition in the download completion handling. Something that tends to get masked on faster computers / light load.
> This needs to start with dnf at least logging the failure code from rpmReadPackageFile() and strerror(errno) at that time if errno is set. And, let those rpm error messages through, please. They may not be pretty but they contain diagnostics that are simply not available outside of rpm guts, and when an error occurs you'll want all the info you can get. Related: https://github.com/rpm-software-management/dnf5/issues/522. I agree that we need to be logging those RPM errors.
We'll be able to see DNF 5 logs from Koji builds once https://pagure.io/koji/issue/4009 is addressed. But I'm guessing Panu is right: > Based on these bits and the error itself, my primary suspect would be some kind of race condition in the download completion handling. In which case the error message from RPM might not be that helpful. In the meantime, I'll try to investigate this locally outside of Koji.
Is anyone still seeing these errors? I was never able to reproduce this outside of Koji, and it has been a while without any activity here.
I have not seen this in a very long time.
I haven't seen this in a long time either. It seems to have been fixed by something somewhere along the way. +1 to just closing this one...
Alright, I'm closing as WORKSFORME then. koji.fedoraproject.org is now running Koji 1.35, so it will correctly log RPM errors from DNF 5 in the build logs, should we come across similar errors in the future.