Since Fedora-Rawhide-20200710.n.1, if you boot a network install image - e.g. https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20200713.n.0/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200713.n.0.iso - without any special options and click through the language selection screens, anaconda crashes soon after. The traceback ends in this error:

dnf.exceptions.RepoError: Loading repository 'rawhide' has failed

and packaging.log shows this:

18:09:27,798 DBG dnf: repo: downloading from remote: rawhide
18:09:27,799 DBG dnf: countme: no event for rawhide: budget to spend: 1
18:09:32,398 DBG packaging: repo rawhide: _sync_metadata success from https://mirrors.fedoraproject.org/metalink?repo=rawhide&arch=x86_64
18:09:32,399 DBG dnf: repo: downloading from remote: updates
18:09:32,399 DBG dnf: countme: no event for updates: budget to spend: 1
18:09:34,697 DBG packaging: repo updates: _sync_metadata success from https://mirrors.fedoraproject.org/metalink?repo=updates-released-frawhide&arch=x86_64
18:09:34,710 DBG dnf: repo: using cache for: rawhide
14:10:01,060 DBG dnf: loading repo 'rawhide' failure: write_ext(0) has failed: -1

Looking into it a bit further, that error happens in `dnf.base.Base._add_repo_to_sack()` when it does this:

        try:
            self._sack.load_repo(repo._repo, build_cache=True, **mdload_flags)
        except hawkey.Exception as e:
            logger.debug(_("loading repo '{}' failure: {}").format(repo.id, e))

As the exception type suggests, that's in libdnf (which provides hawkey, these days), and it's failing in `write_ext()` in `libdnf/dnf-sack.cpp` - https://github.com/rpm-software-management/libdnf/blob/master/libdnf/dnf-sack.cpp#L621 - in this section:

    FILE *fp = fdopen(tmp_fd, "w+");

    g_debug("%s: storing %s to: %s", __func__, repo->name, tmp_fn_templ);
    if (which_repodata != _HY_REPODATA_UPDATEINFO)
        ret |= repodata_write(data, fp);
    else
        ret |= write_ext_updateinfo(hrepo, data, fp);
    ret |= checksum_write(repoImpl->checksum, fp);
    ret |= fclose(fp);
    if (ret) {
        success = FALSE;
        g_set_error (error,
                     DNF_ERROR,
                     DNF_ERROR_FAILED,
                     _("write_ext(%1$d) has failed: %2$d"),
                     which_repodata, ret);
        goto done;
    }

We're getting ret -1 there. We see the crash in anaconda, but I actually think the thing that changed is most likely to be glibc. In the compose where this first showed up, neither anaconda nor dnf nor libdnf changed, but glibc did (and kernel and systemd among others, but they seem like less likely suspects), to this build: https://koji.fedoraproject.org/koji/buildinfo?buildID=1538815 which has quite a lot of changes (the previous build in Rawhide was -17, so all the changes from -18 landed in the affected compose). I'm guessing the problem is in there somewhere. But CCing dnf/libdnf developers also for input.

This nearly meets the threshold to be an automatic blocker - "Complete failure of any release-blocking TC/RC image to boot at all under any circumstance - 'DOA' image (conditional failure is not an automatic blocker)" - but I think not quite, so just proposing it as one. It prevents a default install of any netinst image, though, so I think it's pretty much a slam dunk.
We sync glibc weekly as we freeze the ABI/API for use in the August 1st glibc 2.32 release. We do the sync specifically to catch any niggling issues that might crop up as we install the final ABI pieces for the release.

My Rawhide VM (which I was using for firefox sandbox testing with glibc 2.32):

glibc-all-langpacks-2.31.9000-19.fc33.x86_64
glibc-common-2.31.9000-19.fc33.x86_64
glibc-langpack-en-2.31.9000-19.fc33.x86_64
glibc-2.31.9000-19.fc33.x86_64
glibc-headers-x86-2.31.9000-19.fc33.noarch
glibc-devel-2.31.9000-19.fc33.x86_64
glibc-static-2.31.9000-19.fc33.x86_64
glibc-minimal-langpack-2.31.9000-19.fc33.x86_64
libreport-plugin-systemd-journal-2.13.1-2.fc33.x86_64
python3-systemd-234-13.fc33.x86_64
python-systemd-doc-234-13.fc33.x86_64
rpm-plugin-systemd-inhibit-4.16.0-0.beta3.2.fc33.x86_64
systemd-bootchart-233-6.fc32.x86_64
systemd-libs-246~rc1-1.fc33.x86_64
systemd-rpm-macros-246~rc1-1.fc33.noarch
systemd-246~rc1-1.fc33.x86_64
systemd-pam-246~rc1-1.fc33.x86_64
systemd-udev-246~rc1-1.fc33.x86_64
systemd-container-246~rc1-1.fc33.x86_64
kernel-5.8.0-0.rc4.1.fc33.x86_64
kernel-core-5.8.0-0.rc4.1.fc33.x86_64
kernel-modules-extra-5.8.0-0.rc4.1.fc33.x86_64
kernel-modules-5.8.0-0.rc4.1.fc33.x86_64
libreport-plugin-kerneloops-2.13.1-2.fc33.x86_64
kernel-headers-5.8.0-0.rc4.git0.1.fc33.x86_64
abrt-addon-kerneloops-2.14.2-3.fc33.x86_64
kernel-core-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64
kernel-modules-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64
kernel-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64
kernel-modules-extra-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64

Double checked I was booted into 5.8.0.
Tested once with rc4.1:

uname -a
Linux localhost.localdomain 5.8.0-0.rc4.1.fc33.x86_64 #1 SMP Tue Jul 7 19:24:41 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Tested again with rc4.*:

uname -a
Linux localhost.localdomain 5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64 #1 SMP Thu Jul 9 22:36:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Checked anaconda is installed and up to date:

kdump-anaconda-addon-005-8.20200220git80aab11.fc33.noarch
libreport-anaconda-2.13.1-2.fc33.x86_64
anaconda-user-help-26.1-11.fc32.noarch
anaconda-tui-33.21-1.fc33.x86_64
anaconda-core-33.21-1.fc33.x86_64
anaconda-widgets-33.21-1.fc33.x86_64
anaconda-gui-33.21-1.fc33.x86_64
anaconda-install-env-deps-33.21-1.fc33.x86_64
anaconda-33.21-1.fc33.x86_64

I was able to set:
- Keyboard
- Language Support
- Time & Date
- Root password
- User creation
- Software Install source
- Software selection - Fedora Workstation
- Network & Host name

So far nothing has crashed.

The ISO has:
glibc-2.31.9000-19.fc33.x86_64.rpm (matches my VM)
kernel-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64.rpm (same)
systemd-246~rc1-1.fc33.x86_64.rpm (matches my VM)

I went and got the rc5 kernel: https://koji.fedoraproject.org/koji/buildinfo?buildID=1541262

uname -a
Linux localhost.localdomain 5.8.0-0.rc5.1.fc33.x86_64 #1 SMP Mon Jul 13 15:27:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Testing with rc5 also works fine, with no crashes. I can't reproduce this by running anaconda in the VM.

Adam, do I need to debug the ISO? Can we confirm it's not a transient failure? I'm out of time for today.
Just running anaconda in a VM may not hit the problematic codepath. openQA isn't hitting this crash, for instance, on most of the tests that pass a repo config on the command line with inst.repo, or on tests that install from the Server DVD (where there's a repo on the ISO itself).

It is definitely not a transient failure. It failed on every "default install with a netinst image" test in Fedora-Rawhide-20200710.n.1 and Fedora-Rawhide-20200713.n.0 - that's 8 tests - and the 'install_updates_nfs' test (which passes `inst.stage2=nfs:nfsvers=4:10.0.2.110:/repo`) failed the same way in both composes. I re-ran a couple of the tests manually and they failed the same way, and I reproduced it in a local VM. Just rebooted that VM and reproduced it again. All you have to do is boot the ISO, click through the language selection screen and the 'this is an unstable release' warning, then wait a bit.
(In reply to Adam Williamson from comment #2)
> Just running anaconda in a VM may not hit the problematic codepath. openQA
> isn't hitting this crash, for instance, on most of the tests that pass a
> repo config on the command line with inst.repo, or tests that install from
> the Server DVD (where there's a repo on the ISO itself).
>
> It is definitely not a transient failure. It failed on every "default
> install with a netinst image" test in Fedora-Rawhide-20200710.n.1 and
> Rawhide-20200713.n.0, which is 8 tests, plus the 'install_updates_nfs' test
> (which passes `inst.stage2=nfs:nfsvers=4:10.0.2.110:/repo`) failed the same
> way in both composes, I re-ran a couple of the tests manually and they
> failed the same way, and I reproduced it in a local VM. Just rebooted that
> VM and reproduced it again. All you have to do is boot the ISO and click
> through the language selection screen and the 'this is an unstable release'
> warning, then wait a bit.

I set up an x86_64 VM and booted the following ISO:
https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20200713.n.0/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200713.n.0.iso

1. Clicked English / English (US)
2. Clicked to agree to using an unstable release.
3. Filled in all the details.
4. Clicked "Begin Install"

After that the kernel dies with a deadlock in btrfs, with an anaconda task holding 3 kernel locks. Could you be seeing a btrfs problem?

I don't doubt you see a problem, but I can't help if I can't reproduce it. Someone who can reproduce the issue at will would need to build alternative ISOs with swapped-out rpms. The "fast" way I usually test this is with a Fedora Rawhide VM, running anaconda by hand (which doesn't use btrfs). If you could find out how to trigger this in such a configuration, it would help me track it down.

My suggestion at this point is to ask the dnf team for help to instrument libdnf to give you some more information about what is failing.
Any thoughts on a next step?
LVM2 based install of the same image proceeds without a problem and I get a fully functional system.
"Could you be seeing a btrfs problem?" I don't *think* so, because the crash happens before the installer creates any target partitions, and nothing on the installer itself is btrfs-formatted. My logs show no btrfs-related errors. It did start happening in the first compose where the btrfs-by-default change landed, but I can't see any way it could have caused the problem, at least no obvious way. I can still reproduce this every time in my local VM.
(In reply to Adam Williamson from comment #5)
> "Could you be seeing a btrfs problem?"
>
> I don't *think* so, because the crash happens before the installer creates
> any target partitions, and nothing on the installer itself is
> btrfs-formatted. My logs show no btrfs-related errors. It did start
> happening in the first compose where the btrfs-by-default change landed, but
> I can't see any way it could have caused the problem, at least no obvious
> way.
>
> I can still reproduce this every time in my local VM.

Are you able to rebuild the ISO or the Rawhide VM with a different glibc, systemd, or kernel? I don't know how to do that myself. A running ISO can be used to scp files from other sources, or to nfsv4-mount a filesystem, but the existing anaconda process has already mapped the -19 glibc, so you can't change it in situ - you'd have to restart the anaconda process without rebooting. How is this kind of issue normally debugged?
So I tried to drill down at least a little by logging the ret value at each stage of `write_ext`, and got:

repodata_write ret is -1!

which means we're hitting the `repodata_write` branch of the conditional (not the `write_ext_updateinfo` one) and failing there. So the problem appears to be in `repodata_write`, which is part of libsolv, is quite long and complex, and does low-level glibc-y stuff I don't understand: https://github.com/openSUSE/libsolv/blob/master/src/repo_write.c#L1118

If you can see any potential issues there, or suggest potential changes to test, please let me know. I can test changes quite fast; rebuilding the ISO isn't necessary, I can just build an updates image with the contents of a modified RPM and boot with that. I'm on both Freenode and internal IRC (as adamw) if you want to chat about this synchronously.
I can try an updates image with an older glibc also. I think that should work, as these days we do updates images in initramfs phase before the switchroot, so you can replace nearly anything. Replacing kernel I think won't work, though; replacing systemd *may* work, I'm not sure. I'll try glibc first.
I can build modified ISOs for test if necessary too, it's just a slightly longer and more annoying process.
(In reply to Adam Williamson from comment #5)
> I can still reproduce this every time in my local VM.

Does the problem go away if you increase the RAM allocated to the VM?
aha! Florian wins the prize. Yes it does. My test VM and the VMs openQA uses both have 2GB of RAM. If I bump to 4GB the crash goes away. I can change this in openQA (though I'd have to check the worker host has enough RAM for it). But seems like it might be good to figure out why we suddenly need more, and if that's fixable.
So I'm still not sure which is the culprit yet. An updates.img with an older glibc still hit the bug, but that's not a super conclusive result as it's possible the switch didn't work or the sheer process of applying a fairly large updates.img muddies the waters, or something. So I'm going to test with rebuilt ISOs. I have to go out for a couple hours right now, but when I get back I'll work on that, hope to have the results by this evening (Pacific time).
(In reply to Adam Williamson from comment #12)
> So I'm still not sure which is the culprit yet. An updates.img with an older
> glibc still hit the bug, but that's not a super conclusive result as it's
> possible the switch didn't work or the sheer process of applying a fairly
> large updates.img muddies the waters, or something.
>
> So I'm going to test with rebuilt ISOs. I have to go out for a couple hours
> right now, but when I get back I'll work on that, hope to have the results
> by this evening (Pacific time).

I am confirming my VM has 4GiB of RAM. I am confirming that if I set my VM to 2GiB of RAM then anaconda fails.

anaconda 33.21 exception report
Traceback (most recent call first):
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 147, in _add_repo_to_sack
    raise dnf.exceptions.RepoError(
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 392, in fill_sack
    self._add_repo_to_sack(r)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/payload/dnf/payload.py", line 1226, in gather_repo_metadata
    self._base.fill_sack(load_system_repo=False)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/payload/manager.py", line 220, in _run_thread
    payload.gather_repo_metadata()
  File "/usr/lib64/python3.9/threading.py", line 888, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/threading.py", line 280, in run
    threading.Thread.run(self)
dnf.exceptions.RepoError: Loading repository 'rawhide' has failed

Now that I know the memory limit is the trigger I can probably try a few things in a Rawhide VM with rlimit.
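The rlimit idea above can be sketched like this. A minimal sketch, assuming a Linux host; note RLIMIT_AS caps virtual address space, which is not the same thing as a cgroup memory limit, but it is handy for quick experiments:

```python
import resource
import subprocess
import sys

def run_with_as_limit(cmd, limit_mib):
    """Run cmd in a child whose address space is capped at limit_mib MiB."""
    def cap():
        limit = limit_mib * 1024 * 1024
        # Lowering RLIMIT_AS is always permitted, even unprivileged.
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    return subprocess.run(cmd, preexec_fn=cap).returncode

# A 2 GiB allocation fails under a 1 GiB cap; a 64 MiB one succeeds.
big = run_with_as_limit([sys.executable, "-c", "bytearray(2 * 1024**3)"], 1024)
small = run_with_as_limit([sys.executable, "-c", "bytearray(64 * 1024**2)"], 1024)
print(big != 0, small == 0)  # → True True
```

For a whole-cgroup limit rather than a per-process one, the systemd-run reproducer in the next comment is the closer match.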
Reproduces with `systemd-run --scope -p MemoryLimit=2048M anaconda` if you want to run in a tightly controlled cgroup with less RAM and in a Rawhide VM.
To be clear, with the systemd-run cgroup in place anaconda is killed, so you don't get a nice modal dialog box saying that the rawhide repo failed to load. It doesn't reproduce if I run gdb in the cgroup and *then* start python3 running anaconda as the inferior process. I didn't expect gdb to impact the reproducibility.
Adam, I'm out of time for today, I need to get some upstream stuff done. When you come back hopefully you'll find my notes useful. Should be much easier to reproduce in a non-ISO context using a cgroup memory limit.
OK, so I tracked down the cause now (using ISO builds just because I had the production line set up already), and it's...systemd!

I built ISOs based on Fedora-Rawhide-20200709.n.0 with each of the "suspects" from 20200710.n.1 added alone in turn. With the newer glibc or kernel added, the image still works fine; with the newer systemd added, we hit the crash. So, re-assigning to systemd.

Zbigniew, the story here is that recent Rawhide netinst ISOs - since systemd-246~rc1-1.fc33 - crash a few minutes after startup if run with 2GB of RAM (I found you don't actually have to click through the language select screen, even; if you just boot the image and wait, the crash happens). If you run with 4GB of RAM they work fine. Something about the newer systemd is causing the anaconda environment to use more memory during all its early repository setup than it was using before, I guess.

To reproduce, just grab any netinst since 20200710.n.1 - e.g. https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20200713.n.0/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200713.n.0.iso - boot it in a VM or system with 2GB of RAM, and wait for the crash.
Aren't the downloaded RPMs supposed to be saved on the newly created target? I wonder if there are situations where that doesn't always happen and that's what we're running into? It's falling back to using the memory based overlay for these RPMs?
OK I see part of the problem. We do not have swaponzram enabled on the netinstall images still, I think due to this PR not being pulled in yet: https://github.com/weldr/lorax/pull/1048 So what happened in part is, anaconda's zram implementation went away in favor of swaponzram, but due to unexpected assumption about weak dependencies zram-generator(-defaults) doesn't get pulled in. That's just a happy accident, so it'll get fixed.
There are no "downloaded RPMs". It hasn't downloaded any RPMs by the time it crashes. You can hit the crash by just booting to the language select screen and waiting, you don't even have to make it to the main hub.
I'm not sure swap-on-ZRAM has much to do with this, see #c17. I can reproduce it by building an ISO from the Fedora-Rawhide-20200709.n.0 but including the systemd build from 20200710.n.1. Nothing else changes in the experiment besides the systemd package pulled in.
Fedora-Rawhide-20200709.n.0 almost certainly still has anaconda's own swaponzram implementation in it. I didn't see that change land in composes until 20200712.
OK, it's definitely not swaponzram. It turns out that having no network up gives me time to manually set up swaponzram in this environment. Enable networking, and it still crashes, and hasn't used any swap. So this is not an "out of memory" situation. I'll see if debug logs help reveal more info.
Discussed during the 2020-07-20 blocker review meeting: [0] The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following Beta criterion: "The installer must run when launched normally from the release-blocking images"* *When booting release-blocking netinst images with 2GB of RAM or less. We note that 2GB is a fairly common config for VMs, and the official docs state 1GB is the minimum requirement. [0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-07-20/f33-blocker-review.2020-07-20-16.18.txt
I see this happening only with the Everything repo; Server works OK. No idea how this can be related to the repo, but obviously it is.
I have machines with 4096 MB, bare metal:

Lenovo ThinkCentre e73
Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
4096 MB memory, 500 GB disk space

Server variant works, Everything variant does not.
Everything repo is a lot bigger than Server, which sorta lines up with the whole "running out of space" element :)
I tried to execute the reproducer from comment#17, and I don't see anything using memory except anaconda. But even anaconda doesn't use that much:

 8 min: virt/res 596 193 kb
10 min: 501 197
15 min: 606 202
20 min: 615 211
39 min: 907 479

Somewhere around that point anaconda displayed the error pasted below. But even at this point there's 2GB of zram swap (set up by zram.service, not the new zram-generator thingy) and none of it is used. So there's *plenty* of "free" space.

15:32:03,070 CRT exception: Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 144, in _add_repo_to_sack
    self._sack.load_repo(repo._repo, build_cache=True, **mdload_flags)
_hawkey.Exception: write_ext(0) się nie powiodło: -1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/site-packages/pyanaconda/threading.py", line 280, in run
    threading.Thread.run(self)
  File "/usr/lib64/python3.9/threading.py", line 888, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/payload/manager.py", line 220, in _run_thread
    payload.gather_repo_metadata()
  File "/usr/lib64/python3.9/site-packages/pyanaconda/payload/dnf/payload.py", line 1226, in gather_repo_metadata
    self._base.fill_sack(load_system_repo=False)
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 392, in fill_sack
    self._add_repo_to_sack(r)
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 147, in _add_repo_to_sack
    raise dnf.exceptions.RepoError(
dnf.exceptions.RepoError: Loading repository 'rawhide' has failed

("się nie powiodło" means "failed".) Seems to be the same thing that AdamW got.

I see three things relevant for memory use:

1. systemd-resolved is running, even though nss-resolve is not enabled in /etc/nsswitch.conf. This is because the fedora-release presets PR was merged already. It's using 19MB of RSS, which is more than I'd expect. Probably some optimizations are possible. Nevertheless, I don't think this amount can make a difference.

2. /tmp/syslog is created. This probably duplicates the journal log that is already there. But it's only 0.5 MB. The problem is that it is probably not rotated, so if we wait long enough, /tmp (which is on tmpfs, i.e. in-memory) will get filled.

3. Both dbus-daemon and dbus-broker are running :(

It is likely that systemd-246 is using more memory (through more loaded libraries and such); I don't have any reason to think it's allocating more, but I don't think this matters. The only thing I'm seeing is anaconda as the biggest user. Dunno, maybe systemd-246 is using just a bit more, but things were already almost exactly at the limit and this extra use brings us over it. But I don't think systemd can be faulted for that; also, it's just a hypothesis, and it doesn't explain why the swap space is not used.

Can we get some *useful* report from anaconda/dnf/hawkey?
Created attachment 1702360 [details] top screen after anaconda tb, 2GB ram, Fedora-Everything-netinst-x86_64-Rawhide-20200713.n.0.iso
Zbigniew: I dug down to where the error is actually happening in https://bugzilla.redhat.com/show_bug.cgi?id=1856514#c7 . I don't know enough to add useful logging to that code, though. If you can suggest a patch, I can test it.
BTW, anaconda actually has a feature for tracking memory usage and we run a test that uses that: https://openqa.fedoraproject.org/tests/628803

In recent composes that logs max memory usage, as calculated by https://github.com/rhinstaller/anaconda/blob/master/scripts/instperf , as around 1.1G, during a notable spike early in the install process (logging start was 10:52:41):

10:53:15 783172 0 anaconda:234028,python3:93300,Xorg:74392,python3:72948,python3:72628
10:53:16 792348 0 anaconda:245572,python3:93300,python3:72948,python3:72628,Xorg:71580
10:53:17 808232 0 anaconda:261708,python3:93300,python3:72948,python3:72628,Xorg:71580
10:53:18 830360 0 anaconda:283676,python3:93300,python3:72948,python3:72628,Xorg:71580
10:53:19 855776 0 anaconda:308700,python3:93300,python3:72948,Xorg:72640,python3:72628
10:53:20 834812 0 anaconda:288236,python3:93300,python3:72948,python3:72628,Xorg:71592
10:53:21 859024 0 anaconda:317652,python3:93300,python3:72948,python3:72628,Xorg:71592
10:53:22 883584 0 anaconda:345096,python3:93300,Xorg:79628,python3:72948,python3:72628
10:53:23 902876 0 anaconda:365384,python3:93300,Xorg:79628,python3:72948,python3:72628
10:53:24 923412 0 anaconda:382444,python3:93300,Xorg:78348,python3:72948,python3:72628
10:53:25 935468 0 anaconda:394568,python3:93300,Xorg:78348,python3:72948,python3:72628
10:53:26 970892 0 anaconda:430104,python3:93300,Xorg:78348,python3:72948,python3:72628
10:53:27 991132 0 anaconda:449868,python3:93300,Xorg:78348,python3:72948,python3:72628
10:53:29 1011276 0 anaconda:465208,python3:93300,Xorg:81432,python3:72948,python3:72628
10:53:29 1017492 0 anaconda:476684,python3:93300,Xorg:77144,python3:72948,python3:72628
10:53:30 1029080 0 anaconda:487112,python3:93300,Xorg:77144,python3:72948,python3:72628
10:53:31 1041704 0 anaconda:500524,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:32 1058980 0 anaconda:516692,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:33 1080480 0 anaconda:549296,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:34 1081236 0 anaconda:539256,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:35 1087032 0 anaconda:544660,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:36 1098372 0 anaconda:556108,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:37 1104168 0 anaconda:558288,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:39 1071096 0 anaconda:526740,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:40 1068012 0 anaconda:526740,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:41 1069180 0 anaconda:528064,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:42 1069092 0 anaconda:530176,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:43 865356 0 anaconda:326716,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:44 868000 0 anaconda:326944,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:45 877876 0 anaconda:336220,python3:93460,Xorg:77148,python3:72948,python3:72628
10:53:46 907080 0 anaconda:365352,python3:93460,Xorg:77148,python3:72948,python3:72628
10:53:47 887460 0 anaconda:343368,python3:93460,Xorg:77148,python3:72948,python3:72628

I grabbed the log from the same test in the 20200709.n.0 compose, and interestingly, the early spike looks very similar there:

15:39:12 789352 0 anaconda:237696,python3:92904,python3:72652,python3:72332,Xorg:71884
15:39:13 811120 0 anaconda:261820,python3:92904,python3:72652,python3:72332,Xorg:70660
15:39:14 821988 0 anaconda:273216,python3:92904,python3:72756,python3:72432,Xorg:70660
15:39:15 833992 0 anaconda:283324,python3:93044,Xorg:72784,python3:72756,python3:72432
15:39:16 856420 0 anaconda:305288,python3:93044,python3:72756,python3:72432,Xorg:70696
15:39:17 836076 0 anaconda:286160,python3:93044,python3:72756,python3:72432,Xorg:70696
15:39:18 862796 0 anaconda:312308,python3:93044,python3:72756,python3:72432,Xorg:70696
15:39:20 884520 0 anaconda:337164,python3:93044,python3:72756,python3:72432,Xorg:70696
15:39:21 911988 0 anaconda:367556,python3:93044,Xorg:72824,python3:72756,python3:72432
15:39:22 917656 0 anaconda:384732,python3:93044,Xorg:78716,python3:72756,python3:72432
15:39:23 938048 0 anaconda:402380,python3:93044,Xorg:78716,python3:72756,python3:72432
15:39:24 953124 1280 anaconda:418172,python3:92740,Xorg:77372,python3:72512,python3:72184
15:39:25 959764 2560 anaconda:421876,python3:92728,Xorg:77372,python3:72500,python3:72172
15:39:26 991068 4352 anaconda:455180,python3:92728,Xorg:77372,python3:72500,python3:72172
15:39:27 1003088 5120 anaconda:464636,python3:92268,Xorg:77372,python3:72344,python3:72020
15:39:28 1024216 6144 anaconda:482028,python3:92164,Xorg:80564,python3:72252,python3:71960
15:39:28 1027968 8448 anaconda:491576,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:29 1042652 8448 anaconda:505152,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:30 1057916 8448 anaconda:520580,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:31 1085420 8448 anaconda:548536,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:32 1077968 9216 anaconda:540636,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:33 1085388 9216 anaconda:547244,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:34 1091940 9216 anaconda:553580,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:35 1106216 9216 anaconda:568600,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:36 1095524 9216 anaconda:542456,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:37 1067800 9216 anaconda:530504,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:38 1068116 9216 anaconda:531296,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:39 1070556 9216 anaconda:533408,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:40 1068116 9216 anaconda:531560,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:41 867516 9216 anaconda:331204,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:42 867588 9216 anaconda:331312,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:43 877068 9216 anaconda:341492,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:45 899844 9216 anaconda:366416,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:46 884556 9216 anaconda:348664,python3:92164,Xorg:76244,python3:72252,python3:71960

Obviously none of those numbers is near 2G, which is interesting. A possibility is that libsolv tries to reserve a ton of space it never actually ends up fully *using*, based on some calculation or other, I guess?

I have grabbed the last image before the bug showed up (20200709.n.0) so I can play with this a bit more - you can grab it also from https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20200709.n.0/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200709.n.0.iso .

An obvious thing to do to confirm or disprove Zbigniew's theory would be to see if *reducing* memory below 2GB can reproduce the issue with that compose, and if so, how much we have to reduce by to hit it.
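The peak in the newer log can be pulled out mechanically. A minimal sketch; the sample lines are copied from the instperf output above, and a real check would read the whole log file rather than a pasted excerpt:

```python
# Each instperf line is: HH:MM:SS total_kib swap_kib top-consumers
sample = """\
10:53:36 1098372 0 anaconda:556108,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:37 1104168 0 anaconda:558288,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:39 1071096 0 anaconda:526740,python3:93300,Xorg:77148,python3:72948,python3:72628
"""

# Take the maximum of the second (total KiB) column.
peak_kib = max(int(line.split()[1]) for line in sample.splitlines())
print(f"peak: {peak_kib} KiB (~{peak_kib / 1024**2:.2f} GiB)")
# → peak: 1104168 KiB (~1.05 GiB)
```

That ~1.05 GiB peak matches the "around 1.1G" figure quoted above, well short of the 2G in the failing VMs.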
hum, another interesting thing: the second number is 'swap used' and it's noticeable that it's 0 throughout the log from the recent compose, but gets well above 0 in the earlier log. I might try booting a recent image with 2GB and inst.debug and see if the number goes above 0, if not, we may indeed be looking at 'failure to use swap' as the issue...
Maybe boot with inst.loglevel=debug. The anaconda process doesn't look like it's running with --debug. (Silverblue 32)
Update: mboddu moved the 20200709.n.0 compose to save it from the garbage collector, so that download link is now: https://kojipkgs.fedoraproject.org/compose/rawhide/foo/Fedora-Rawhide-20200709.n.0-adamw/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200709.n.0.iso and we'll have the compose available to tinker with this bug for a while.
I've noticed /tmp on tmpfs is smaller since systemd commit cc1c85fbc3e3c09c73421e370dfbfbe4ac3d9449. I seemingly have enough at the time of the failure, but maybe something asked for more than 27M at one time?

tmpfs 196M 169M 27M 87% /tmp

Fedora-Everything-netinst-x86_64-Rawhide-20200724.n.0.iso has zram-generator(-defaults); only 2M of swap is used at the time of the crash.
If I disable the network link, then switch to tty2 and `mount -o remount,size=400m,noatime /tmp`, then re-enable the network link and proceed:

(a) the crash does not happen
(b) /tmp used is 286M

I think this is the problem.
(In reply to Adam Williamson from comment #30)
> zbigniew: i dug down to where the error is actually happening in
> https://bugzilla.redhat.com/show_bug.cgi?id=1856514#c7.

Yeah, I read that. That's what I mean by [lack of] a useful message. -1 is meaningless. It's a 970-line function that does a bunch of things, and without adequate messages it's really hard to say what is going wrong, unless you really know the code intimately. If we even had an errno without any message, it would be something. If the error is really lack of disk space, with -ENOSPC we wouldn't need to speculate so much.

> hum, another interesting thing: the second number is 'swap used' and it's
> noticeable that it's 0 throughout the log from the recent compose

I noticed that too, but the simplest interpretation is that the kernel doesn't see the need to swap out. When I put some files in /tmp, swap starts being used.

> I've noticed /tmp on tmpfs is smaller since systemd commit
> cc1c85fbc3e3c09c73421e370dfbfbe4ac3d9449

What do you mean by "smaller"? cc1c85fbc3e3c09c was supposed to be about inode count. I see 69/400k inodes used on /tmp, so that's probably not it. But maybe that commit changes the disk size too. That'd be a bug, but it seems to be a plausible explanation.
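On the "if we even had an errno" point, an illustration of what a proper error report would have shown: ENOSPC is the errno a write to a full tmpfs-backed /tmp would produce, and it is self-explanatory in a way a bare -1 is not.

```python
import errno
import os

# On Linux, ENOSPC is errno 28; strerror turns it into a readable message.
print(errno.ENOSPC, os.strerror(errno.ENOSPC))
# → 28 No space left on device
```

Had write_ext propagated this instead of an accumulated -1, the /tmp-size theory could have been confirmed immediately.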
So I ran the 20200713.n.0 image with 2GB again and, before it crashed, it did register use of swap. So I don't think that's the difference :/
I think Chris is onto something. If I spam 'du -h /tmp' I can see dnf is indeed using a lot of space there to cache repodata. And from my local mirror I can see that the size of Rawhide's 'Everything' repo metadata is 230M:

[adamw@www os]$ pwd
/share/mirror/fedora/linux/development/rawhide/Everything/x86_64/os
[adamw@www os]$ du -h repodata/
230M	repodata/

so that kinda lines up with Chris' numbers. Good spot.
For the record, it's anaconda that tells dnf to use /tmp/dnf.cache as its cache dir: https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/payload/dnf/utils.py#L34 presumably because it can't guarantee any disk-backed storage early in install.
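A quick way to check whether the filesystem backing a cache directory has room for the metadata. A sketch only; the 230M figure is the repodata size measured from the mirror above, and on an installed system the cache dir would be disk-backed rather than /tmp:

```python
import os

def free_mib(path):
    """Free space on the filesystem backing path, in MiB."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize // (1024 * 1024)

REPODATA_MIB = 230  # Rawhide 'Everything' repodata, as measured above
free = free_mib("/tmp")
print(f"/tmp free: {free} MiB, need ~{REPODATA_MIB} MiB, fits: {free >= REPODATA_MIB}")
```

On the affected images, with /tmp capped at 196M total, this check could never come out true, which is exactly the failure mode seen.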
Confirming Chris' finding that /tmp reports its size as 196M when booting a recent image with 2048MB of RAM. When booting a recent image with 3072MB of RAM, /tmp size is reported as 297M, which is just big enough for the repodata to fit I guess. Booting a pre-systemd-246 image with 2048MB of RAM, /tmp size is reported as 981M.
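The observed sizes line up with a ~10% cap in the newer systemd versus the old 50% tmpfs default. A quick arithmetic check using the df figures quoted above; the ratios come out slightly under the nominal fraction, presumably because the cap is applied to the RAM the kernel actually sees rather than the amount given to the VM:

```python
# (RAM given to the VM in MiB, /tmp size reported by df in MiB), from above
observations = [(2048, 196), (3072, 297), (2048, 981)]

for ram, tmp in observations:
    print(f"{tmp}M /tmp with {ram}M RAM -> {tmp / ram:.0%}")
```

The first two come out at ~10% (post-systemd-246), the last at ~48% (pre-246, i.e. the long-standing 50% default).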
Wrong commit. This one 7d85383edbab73274dc81cc888d884bb01070bc2.
Yep, that's it. But the question is what the right limit would be. 10% is maybe too little, but we want *some* limit.
This specific case is fairly unique to the installer environment - it wouldn't come up on an installed system, where dnf will be using a disk-backed location for its cache dir - so perhaps just allowing it to be configured in a way that we can apply to the installer environment would be sufficient to solve this problem without too much argument? https://github.com/systemd/systemd/pull/15424 is the pull request that added the limits, I've added a comment there, so Lennart should see it...
Pretty sure /tmp on tmpfs has been sized 50% since 2012. https://fedoraproject.org/wiki/Features/tmp-on-tmpfs It almost seems like the change is a solution in search of a problem.
https://github.com/systemd/systemd/pull/16576
I am changing the component to systemd based on comment 46.
Thanks for all the debugging. Should be fixed now.