On some i686 rawhide builds, we are seeing builds fail with:

Error unpacking rpm package shadow-utils-2:4.13-7.fc39.i686
error: unpacking of archive failed on file /usr/bin/newgidmap;64a19767: cpio: cap_set_file failed - Value too large for defined data type
error: shadow-utils-2:4.13-7.fc39.i686: install failed

This could be rpm, cpio, libcap (which is currently well behind upstream), glibc or something else, but I am filing here for initial triage.

This thread on the devel list has examples: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/CYKVFTMSOY2MB76UTOC7WMTZ76OFEGID/

It was also brought up on the infrastructure list: https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedoraproject.org/thread/HX7RPWZGULSWEHVQ4GBCZQNIZUCNLA7A/

It seemed like it might only be happening on buildvm builders (kvm instances) rather than hardware, but I am not 100% sure that's true. It also doesn't seem to happen on every i686 build, just some of them. It also always seems to be shadow-utils and that file, but that might be because it is the only file with capabilities in the default buildroot. Happy to try and gather more info...

Reproducible: Sometimes

Steps to Reproduce:
1. Do an archfull rawhide build
2. See the i686 build fail

Actual Results: Build fails when unpacking shadow-utils
Expected Results: Build works.
Saw the discussion on devel, thanks for filing the bug.

EOVERFLOW is not something cap_set_fd() is supposed to return on its own; it is bubbling up from something else. Looking at what cap_set_fd() itself does, of the things it calls only lstat() is documented to return EOVERFLOW. That could happen on e.g. a _FILE_OFFSET_BITS mismatch on large files. We had such an issue in rpm from the cmake transition, but that's supposedly fixed and -D_FILE_OFFSET_BITS=64 is clearly visible in the rpm build log. Also, newgidmap is not a large file, but I guess there could be some other value like ino_t overflowing.

The first reports seem to be from the end of June, the 28th to be precise. libcap hasn't been touched since January, so it should be off the hook. There was a minor rpm update on June 28th but I don't see how that could affect this at all (but okay, anything is possible). There also was a largish glibc update on the same date, and a huge kernel update on the 27th. CC'ing the maintainers: @jforbes, @codonell - does any of this ring any bells wrt those updates?

As for next steps, rebuilding libcap with debugging enabled and trying to reproduce the error with that should help pinpoint the issue a bit. I'm off on a long vacation after tomorrow but the rest of the rpm-team will be around to look at this.
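To illustrate the kind of overflow suspected above: a 32-bit stat interface has to fit 64-bit values (file size, inode number) into narrower fields, and it reports EOVERFLOW when one of them does not fit. The following is only a minimal sketch of that check, not the actual glibc or kernel code; `narrow_to_u32` is a hypothetical helper name introduced for illustration.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a glibc or kernel function): copy a 64-bit
 * value into a 32-bit field, the way a stat interface without large-file
 * support has to. Returns 0 on success, or -1 with errno set to
 * EOVERFLOW if the value cannot be represented in 32 bits.
 */
int narrow_to_u32(uint64_t value, uint32_t *out)
{
    if (value > UINT32_MAX) {
        errno = EOVERFLOW;
        return -1;
    }
    *out = (uint32_t)value;
    return 0;
}
```

On glibc, strerror(EOVERFLOW) is "Value too large for defined data type", which is the text seen in the failing builds.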
Oh and to be clear, despite the error message suggesting it, cpio is not related, it's not even in the building. We need to fix those misleading error messages; rpm never uses the cpio program for anything at all.
I did some digging through Rust package build failures in koschei in an attempt to pinpoint this issue. The earliest occurrence of this failure I could find happened on 2023-07-01. The only commonality between all failures is that they happened only *after* these three updates:

annobin 12.12-1.fc39 -> 12.17-1.fc39
debugedit 5.0-7.fc39 -> 5.0-9.fc39
glibc 2.37.9000-14.fc39 -> 2.37.9000-15.fc39

but never *before* these updates. They were all pushed within hours of each other late on June 30:

https://bodhi.fedoraproject.org/updates/FEDORA-2023-9a0b560d10
https://bodhi.fedoraproject.org/updates/FEDORA-2023-0992e99dd9
https://bodhi.fedoraproject.org/updates/FEDORA-2023-fd52ab13db

It looks like the time granularity provided by koschei isn't enough to differentiate between them, especially because the issue only happens intermittently. However, given the nature of this issue, I doubt that annobin or debugedit are to blame, so this glibc commit looks like the culprit: https://src.fedoraproject.org/rpms/glibc/c/eb302655a8320a79729fcc252cd72877fd58503f?branch=rawhide

I've looked at the diff of the tarballs between the -14 and -15 versions of glibc, but couldn't find anything obviously suspicious. The downstream patches also didn't change between these versions. One difference between glibc -14 and -15 is that they were built against different kernel headers (6.4.0-0.rc5.git0.1.fc39 and 6.4.0-1.fc39, respectively), but the difference between these two is only a few changed lines which look unrelated (only linux/ethtool_netlink.h changed). Also, the kernel version on koji builders does not seem to have changed (kernel-6.3.8-200.fc38).
Another data point (which I only noticed now because koschei builds for older releases much more infrequently), pointing away from glibc: This is also happening on older releases, not only on rawhide. This is from the logs of an f37 build launched by koschei today:

DEBUG util.py:442: Error unpacking rpm package shadow-utils-2:4.12.3-6.fc37.i686
DEBUG util.py:444: error: unpacking of archive failed on file /usr/bin/newgidmap;64a8599a: cpio: cap_set_file failed - Value too large for defined data type
DEBUG util.py:444: error: shadow-utils-2:4.12.3-6.fc37.i686: install failed

And now that I've done a bit more digging, I can find this same issue on Fedora 37 as well, happening as early as July 2. Is it possible that the koji builders were rebooted for new kernels around June 30 / July 1st? Earlier builds I've looked at had kernel 6.2 on the host, but the failures have all been happening on kernel 6.3.
I updated/rebooted all the builders on June 25th. I then downgraded systemd and rebooted them again on June 27th. They are all running 6.3.8-200.fc38 currently. I can reboot them into kernel-6.3.11-200.fc38 if we think there might be some kernel bug in there? The dates don't seem to match up 100%, but I suppose it's possible...
I haven't checked every failed build, but those I did check all happened on the host buildvm-x86-11.iad2.fedoraproject.org.
Yes, I could have sworn there were builds on other builders, but yeah, I only see that one builder now. ;( That builder is out and I am looking at it, but I don't see anything obviously weird about it offhand. If anyone gets a failure now, please let me know...
I just saw it happen on a different builder: https://koji.fedoraproject.org/koji/taskinfo?taskID=103187326 (on buildvm-x86-15.iad2.fedoraproject.org)
And another one on buildvm-x86-15.iad2.fedoraproject.org: https://koji.fedoraproject.org/koji/taskinfo?taskID=103189460
Just saw it in a scratch build on buildvm-x86-04.iad2.fedoraproject.org: https://koji.fedoraproject.org/koji/taskinfo?taskID=103202698
Happened to me as well: buildvm-x86-04.iad2.fedoraproject.org https://koji.fedoraproject.org/koji/taskinfo?taskID=103213708
Note that this is not breaking on the Fedora version the build is for, but on the one running on the builders: Fedora 38. Mock just does a /usr/bin/dnf-3 --installroot there, which is what is failing. So this is unrelated to the RPM alpha in rawhide.

The RPM code is really just calling cap_from_text() and cap_set_fd() (not cap_set_file as the error message claims): https://github.com/rpm-software-management/rpm/blob/rpm-4.18.x/lib/fsm.c#L98

So we don't even pass any off_t data at this point (see https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Value-too-large-for-defined-data-type).

Note that this is run while chroot'ed. Chances are this is just the first file getting a capability set during the install, so the capability subsystem could be broken in the chroot. I wonder if this is an issue with different settings inside and outside of the chroot. Do we use anything other than 64 bit for -D_FILE_OFFSET_BITS anywhere?

The fact that this happens randomly makes it unlikely that we are simply setting the chroot up the wrong way. If we use 32 bit FILE_OFFSET_BITS inside the 32 bit chroot, this could be an issue with the setting being read at some point. Maybe glibc tries to read the settings from the chroot at some random point and does not find them set to 64 bit because the chroot is still empty?

I am re-assigning this to glibc as they seem to be the more likely culprit, and returning an unexpected errno is clearly on them. It could still be a kernel issue, though.
cap_set_file does this:

“
int cap_set_file(const char *filename, cap_t cap_d)
{
    struct vfs_ns_cap_data rawvfscap;
    int sizeofcaps;
    struct stat buf;

    if (lstat(filename, &buf) != 0) {
        _cap_debug("unable to stat file [%s]", filename);
        return -1;
    }
    if (S_ISLNK(buf.st_mode) || !S_ISREG(buf.st_mode)) {
        _cap_debug("file [%s] is not a regular file", filename);
        errno = EINVAL;
        return -1;
    }
”

Presumably that lstat results in EOVERFLOW. According to <https://kojipkgs.fedoraproject.org//packages/libcap/2.48/6.fc38/data/logs/i686/build.log> only one source file is built with -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64, and this is not it.

We cannot build the whole distribution with -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 because it is an ABI break. libcap isn't the only package with this problem.

This needs to be fixed via builder configuration, by switching to a file system with 32-bit inode numbers. Since the chroot is recreated from scratch every time, merely mounting with “-o inode32” is probably sufficient for XFS, see xfs(5).

It affects builders only because package installation on end user systems uses the 64-bit rpm package.
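As a rough way to check whether a given file would trip this, the sketch below stats a path with 64-bit fields and reports whether its inode number is outside the 32-bit range. It assumes Linux with glibc; `path_ino_exceeds_32bit` is a hypothetical name, and the `_FILE_OFFSET_BITS` define is there only so that the checker itself does not hit the same EOVERFLOW when built as a 32-bit program.

```c
#define _POSIX_C_SOURCE 200809L /* make lstat() visible under strict -std modes */
#define _FILE_OFFSET_BITS 64    /* keep the checker itself safe on 32-bit builds */

#include <stdint.h>
#include <sys/stat.h>

/*
 * Hypothetical diagnostic helper: returns 1 if the inode number of `path`
 * does not fit in 32 bits, 0 if it does, and -1 if lstat() fails
 * (errno is left as set by lstat()).
 */
int path_ino_exceeds_32bit(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return -1;

    return (uint64_t)st.st_ino > UINT32_MAX ? 1 : 0;
}
```

Running a check like this over a freshly populated buildroot should show whether the file system is handing out inode numbers above 4294967295, which is the situation that mounting XFS with inode32 is meant to avoid.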
Filed a releng ticket: https://pagure.io/releng/issue/11531
*** Bug 2222365 has been marked as a duplicate of this bug. ***