Bug 1856514 - In default boot with 2GB of RAM or less, network install images crash with "loading repo 'rawhide' failure: write_ext(0) has failed: -1" since systemd-246~rc1-1.fc33
Summary: In default boot with 2GB of RAM or less, network install images crash with "l...
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: rawhide
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa AcceptedBlocker
Depends On:
Blocks: BetaBlocker, F33BetaBlocker
TreeView+ depends on / blocked
 
Reported: 2020-07-13 19:47 UTC by Adam Williamson
Modified: 2020-07-27 15:24 UTC (History)
33 users (show)

Fixed In Version: systemd-246~rc2-2.fc33
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-27 15:24:26 UTC
Type: Bug


Attachments (Terms of Use)
top screen after anaconda tb, 2GB ram, Fedora-Everything-netinst-x86_64-Rawhide-20200713.n.0.iso (36.64 KB, image/png)
2020-07-24 16:16 UTC, Zbigniew Jędrzejewski-Szmek
no flags Details

Description Adam Williamson 2020-07-13 19:47:53 UTC
Since Fedora-Rawhide-20200710.n.1, if you boot a network install image - e.g. https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20200713.n.0/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200713.n.0.iso - without any special options and click through the language selection screens, anaconda crashes soon after.

The traceback ends in this error:

dnf.exceptions.RepoError: Loading repository 'rawhide' has failed

and packaging.log shows this:

18:09:27,798 DBG dnf: repo: downloading from remote: rawhide
18:09:27,799 DBG dnf: countme: no event for rawhide: budget to spend: 1
18:09:32,398 DBG packaging: repo rawhide: _sync_metadata success from https://mirrors.fedoraproject.org/metalink?repo=rawhide&arch=x86_64
18:09:32,399 DBG dnf: repo: downloading from remote: updates
18:09:32,399 DBG dnf: countme: no event for updates: budget to spend: 1
18:09:34,697 DBG packaging: repo updates: _sync_metadata success from https://mirrors.fedoraproject.org/metalink?repo=updates-released-frawhide&arch=x86_64
18:09:34,710 DBG dnf: repo: using cache for: rawhide
14:10:01,060 DBG dnf: loading repo 'rawhide' failure: write_ext(0) has failed: -1

Looking into it a bit further, that error happens in `dnf.base.Base._add_repo_to_sack()` when it does this:

        try:
            self._sack.load_repo(repo._repo, build_cache=True, **mdload_flags)
        except hawkey.Exception as e:
            logger.debug(_("loading repo '{}' failure: {}").format(repo.id, e))

as the exception type suggests, that's in libdnf (which provides hawkey, these days), and it's failing in `libdnf/dnf-sack.cpp` `write_ext()` - https://github.com/rpm-software-management/libdnf/blob/master/libdnf/dnf-sack.cpp#L621 - in this section:

        FILE *fp = fdopen(tmp_fd, "w+");

        g_debug("%s: storing %s to: %s", __func__, repo->name, tmp_fn_templ);
        if (which_repodata != _HY_REPODATA_UPDATEINFO)
            ret |= repodata_write(data, fp);
        else
            ret |= write_ext_updateinfo(hrepo, data, fp);
        ret |= checksum_write(repoImpl->checksum, fp);
        ret |= fclose(fp);
        if (ret) {
            success = FALSE;
            g_set_error (error,
                        DNF_ERROR,
                        DNF_ERROR_FAILED,
                        _("write_ext(%1$d) has failed: %2$d"),
                        which_repodata, ret);
            goto done;
        }

we're getting ret -1 there.

We see the crash in anaconda, but I actually think the thing that changed is most likely to be glibc. In the compose where this first showed up, neither anaconda nor dnf nor libdnf changed, but glibc (and kernel and systemd among others, but they seem like less likely suspects) did change, to this build:

https://koji.fedoraproject.org/koji/buildinfo?buildID=1538815

which has quite a lot of changes (the previous build in Rawhide was -17, so all the changes from -18 landed in the affected compose). I'm guessing the problem is in there somewhere. But CCing dnf/libdnf developers also for input.

This nearly meets the threshold to be an automatic blocker - "Complete failure of any release-blocking TC/RC image to boot at all under any circumstance - "DOA" image (conditional failure is not an automatic blocker)" - but I think not quite, so just proposing it as one. It prevents a default install of any netinst image, though, so I think it's pretty much a slam dunk.

Comment 1 Carlos O'Donell 2020-07-13 22:00:06 UTC
We sync glibc weekly as we freeze the ABI/API for use in the August 1st glibc 2.32 release.

We do the sync specifically to catch any niggling issues that might crop up as we install the final ABI pieces for the release.

My Rawhide VM (which I was using for firefox sanbox testing with glibc 2.32):

glibc-all-langpacks-2.31.9000-19.fc33.x86_64
glibc-common-2.31.9000-19.fc33.x86_64
glibc-langpack-en-2.31.9000-19.fc33.x86_64
glibc-2.31.9000-19.fc33.x86_64
glibc-headers-x86-2.31.9000-19.fc33.noarch
glibc-devel-2.31.9000-19.fc33.x86_64
glibc-static-2.31.9000-19.fc33.x86_64
glibc-minimal-langpack-2.31.9000-19.fc33.x86_64

libreport-plugin-systemd-journal-2.13.1-2.fc33.x86_64
python3-systemd-234-13.fc33.x86_64
python-systemd-doc-234-13.fc33.x86_64
rpm-plugin-systemd-inhibit-4.16.0-0.beta3.2.fc33.x86_64
systemd-bootchart-233-6.fc32.x86_64
systemd-libs-246~rc1-1.fc33.x86_64
systemd-rpm-macros-246~rc1-1.fc33.noarch
systemd-246~rc1-1.fc33.x86_64
systemd-pam-246~rc1-1.fc33.x86_64
systemd-udev-246~rc1-1.fc33.x86_64
systemd-container-246~rc1-1.fc33.x86_64

kernel-5.8.0-0.rc4.1.fc33.x86_64
kernel-core-5.8.0-0.rc4.1.fc33.x86_64
kernel-modules-extra-5.8.0-0.rc4.1.fc33.x86_64
kernel-modules-5.8.0-0.rc4.1.fc33.x86_64
libreport-plugin-kerneloops-2.13.1-2.fc33.x86_64
kernel-headers-5.8.0-0.rc4.git0.1.fc33.x86_64
abrt-addon-kerneloops-2.14.2-3.fc33.x86_64
kernel-core-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64
kernel-modules-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64
kernel-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64
kernel-modules-extra-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64

Double checked I was booted into 5.8.0.

Tested once with rc4.1:
uname -a
Linux localhost.localdomain 5.8.0-0.rc4.1.fc33.x86_64 #1 SMP Tue Jul 7 19:24:41 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Tested again with rc4.*:
uname -a
Linux localhost.localdomain 5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64 #1 SMP Thu Jul 9 22:36:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Checked Anaconda is installed and up to date:

kdump-anaconda-addon-005-8.20200220git80aab11.fc33.noarch
libreport-anaconda-2.13.1-2.fc33.x86_64
anaconda-user-help-26.1-11.fc32.noarch
anaconda-tui-33.21-1.fc33.x86_64
anaconda-core-33.21-1.fc33.x86_64
anaconda-widgets-33.21-1.fc33.x86_64
anaconda-gui-33.21-1.fc33.x86_64
anaconda-install-env-deps-33.21-1.fc33.x86_64
anaconda-33.21-1.fc33.x86_64

I was able to set:
- Keyboard
- Language Support
- Time & Date
- Root password
- User creation
- Software Install source
- Software selection - Fedora Workstation
- Network & Host name

So far nothing has crashed.

The ISO has:
glibc-2.31.9000-19.fc33.x86_64.rpm (matches my VM)
kernel-5.8.0-0.rc4.20200709git0bddd227f3dc.1.fc33.x86_64.rpm (same)
systemd-246~rc1-1.fc33.x86_64.rpm (matches my VM)

I went and got the rc5 kernel:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1541262

uname -a
Linux localhost.localdomain 5.8.0-0.rc5.1.fc33.x86_64 #1 SMP Mon Jul 13 15:27:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Testing with rc5 works fine also with no crashes.

I can't reproduce this by running anaconda in the VM.

Adam, Do I need to debug the ISO? Can we confirm it's not a transient failure? I'm out of time for today.

Comment 2 Adam Williamson 2020-07-13 22:09:12 UTC
Just running anaconda in a VM may not hit the problematic codepath. openQA isn't hitting this crash, for instance, on most of the tests that pass a repo config on the command line with inst.repo , or tests that install from the Server DVD (where there's a repo on the ISO itself).

It is definitely not a transient failure. It failed on every "default install with a netinst image" test in Fedora-Rawhide-20200710.n.1 and Rawhide-20200713.n.0 , which is 8 tests, plus the 'install_updates_nfs' test (which passes ` inst.stage2=nfs:nfsvers=4:10.0.2.110:/repo`) failed the same way in both composes, I re-ran a couple of the tests manually and they failed the same way, and I reproduced it in a local VM. Just rebooted that VM and reproduced it again. All you have to do is boot the ISO and click through the language selection screen and the 'this is an unstable release' warning, then wait a bit.

Comment 3 Carlos O'Donell 2020-07-14 03:10:23 UTC
(In reply to Adam Williamson from comment #2)
> Just running anaconda in a VM may not hit the problematic codepath. openQA
> isn't hitting this crash, for instance, on most of the tests that pass a
> repo config on the command line with inst.repo , or tests that install from
> the Server DVD (where there's a repo on the ISO itself).
> 
> It is definitely not a transient failure. It failed on every "default
> install with a netinst image" test in Fedora-Rawhide-20200710.n.1 and
> Rawhide-20200713.n.0 , which is 8 tests, plus the 'install_updates_nfs' test
> (which passes ` inst.stage2=nfs:nfsvers=4:10.0.2.110:/repo`) failed the same
> way in both composes, I re-ran a couple of the tests manually and they
> failed the same way, and I reproduced it in a local VM. Just rebooted that
> VM and reproduced it again. All you have to do is boot the ISO and click
> through the language selection screen and the 'this is an unstable release'
> warning, then wait a bit.

I setup an x86_64 VM and booted the following ISO:
https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20200713.n.0/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200713.n.0.iso

1. Clicked English / English (US)
2. Clicked to agree to using an unstalbe release.
3. Filled in all the details.
4. Clicked "Begin Install"

After that the kernel dies with deadlock in btrfs with an anaconda task holding 3 kernel locks.

Could you be seeing a btrfs problem?

I don't doubt you see a problem, but I can't help if I can't reproduce it.

Someone who can reproduce the issue at-will would need to build alternative ISOs with swapped out rpms.

The "fast" way I usually test this is with a Fedora Rawhide VM and run anaconda by hand (which doesn't use btrfs).

If you could find out how to trigger this in such a configuration then it would help me track it down.

My suggestion at this point is to ask the dnf team for help to instrument libdnf to give you some more information about what is failing.

Any thoughts on a next step?

Comment 4 Carlos O'Donell 2020-07-14 03:58:44 UTC
LVM2 based install of the same image proceeds without a problem and I get a fully functional system.

Comment 5 Adam Williamson 2020-07-14 17:11:27 UTC
"Could you be seeing a btrfs problem?"

I don't *think* so, because the crash happens before the installer creates any target partitions, and nothing on the installer itself is btrfs-formatted. My logs show no btrfs-related errors. It did start happening in the first compose where the btrfs-by-default change landed, but I can't see any way it could have caused the problem, at least no obvious way.

I can still reproduce this every time in my local VM.

Comment 6 Carlos O'Donell 2020-07-14 18:30:37 UTC
(In reply to Adam Williamson from comment #5)
> "Could you be seeing a btrfs problem?"
> 
> I don't *think* so, because the crash happens before the installer creates
> any target partitions, and nothing on the installer itself is
> btrfs-formatted. My logs show no btrfs-related errors. It did start
> happening in the first compose where the btrfs-by-default change landed, but
> I can't see any way it could have caused the problem, at least no obvious
> way.
> 
> I can still reproduce this every time in my local VM.

Are you able to rebuild the ISO or Rawhide VM with a different glibc, systemd, or kernel?

I don't know how to do that myself. A running ISO can be used to scp files from other sources, or nfsv4 mount a filesystem, but the existing anaconda process has already mapped the -19 glibc, so you can't change it in-situ, you'd have to restart the anaconda process without rebooting.

Can you rebuild the ISO with a different glibc, systemd, kernel?

How is this kind of issue normally debugged?

Comment 7 Adam Williamson 2020-07-14 18:42:24 UTC
So I tried to drill down at least a little by logging the ret value at each stage of `write_ext`, and got:

repodata_write ret is -1!

which means we're hitting the `repodata_write` branch of the conditional (not the `write_ext_updateinfo` one) and failing there. So the problem appears to be in `repodata_write`, which is part of libsolv and is quite long and complex and does do low-level glibc-y stuff I don't understand:

https://github.com/openSUSE/libsolv/blob/master/src/repo_write.c#L1118

if you can see any potential issues there, or suggest potential changes to test, please let me know.

I can test changes quite fast; rebuilding the ISO isn't necessary, I can just build an updates image with the contents of a modified RPM and boot with that. I'm on both Freenode and internal IRC (as adamw) if you want to chat about this synchronously.

Comment 8 Adam Williamson 2020-07-14 18:44:13 UTC
I can try an updates image with an older glibc also. I think that should work, as these days we do updates images in initramfs phase before the switchroot, so you can replace nearly anything. Replacing kernel I think won't work, though; replacing systemd *may* work, I'm not sure. I'll try glibc first.

Comment 9 Adam Williamson 2020-07-14 18:44:35 UTC
I can build modified ISOs for test if necessary too, it's just a slightly longer and more annoying process.

Comment 10 Florian Weimer 2020-07-14 18:51:01 UTC
(In reply to Adam Williamson from comment #5)
> I can still reproduce this every time in my local VM.

Does the problem go away if you increase the RAM allocated to the VM?

Comment 11 Adam Williamson 2020-07-14 18:58:00 UTC
aha! Florian wins the prize. Yes it does. My test VM and the VMs openQA uses both have 2GB of RAM. If I bump to 4GB the crash goes away.

I can change this in openQA (though I'd have to check the worker host has enough RAM for it). But seems like it might be good to figure out why we suddenly need more, and if that's fixable.

Comment 12 Adam Williamson 2020-07-14 19:25:24 UTC
So I'm still not sure which is the culprit yet. An updates.img with an older glibc still hit the bug, but that's not a super conclusive result as it's possible the switch didn't work or the sheer process of applying a fairly large updates.img muddies the waters, or something.

So I'm going to test with rebuilt ISOs. I have to go out for a couple hours right now, but when I get back I'll work on that, hope to have the results by this evening (Pacific time).

Comment 13 Carlos O'Donell 2020-07-14 19:52:11 UTC
(In reply to Adam Williamson from comment #12)
> So I'm still not sure which is the culprit yet. An updates.img with an older
> glibc still hit the bug, but that's not a super conclusive result as it's
> possible the switch didn't work or the sheer process of applying a fairly
> large updates.img muddies the waters, or something.
> 
> So I'm going to test with rebuilt ISOs. I have to go out for a couple hours
> right now, but when I get back I'll work on that, hope to have the results
> by this evening (Pacific time).

I am confirming my VM has 4GiB of RAM.

I am confirming that if I set my VM to 2GiB of RAM then anaconda fails.

anaconda 33.21 exception report
Traceback (most recent call first):
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 147, in _add_repo_to_sack
    raise dnf.exceptions.RepoError(
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 392, in fill_sack
    self._add_repo_to_sack(r)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/payload/dnf/payload.py", line 1226, in gather_repo_metadata
    self._base.fill_sack(load_system_repo=False)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/payload/manager.py", line 220, in _run_thread
    payload.gather_repo_metadata()
  File "/usr/lib64/python3.9/threading.py", line 888, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/threading.py", line 280, in run
    threading.Thread.run(self)
dnf.exceptions.RepoError: Loading repository 'rawhide' has failed

Now that I know the memory limit is the trigger I can probably try a few things in a Rawhide VM with rlimit.

Comment 14 Carlos O'Donell 2020-07-14 19:58:03 UTC
Reproduces with `systemd-run --scope -p MemoryLimit=2048M anaconda` if you want to run in a tightly controlled cgroup with less RAM and in a Rawhide VM.

Comment 15 Carlos O'Donell 2020-07-14 20:11:38 UTC
To be clear with the systemd-run cgroup in place anaconda is killed, so you don't get a nice modal dialog box that the rawhide repo failed to load.

It doesn't reproduce if I run gdb in the cgroup and *then* start python3 running anaconda as the inferior process. I didn't expect gdb to impact the reproducibility.

Comment 16 Carlos O'Donell 2020-07-14 20:12:56 UTC
Adam, I'm out of time for today, I need to get some upstream stuff done. When you come back hopefully you'll find my notes useful. Should be much easier to reproduce in a non-ISO context using a cgroup memory limit.

Comment 17 Adam Williamson 2020-07-15 01:06:11 UTC
OK, so I tracked down the cause now (using ISO builds just because I had the production line set up already), and it's...systemd! I built ISOs based on Fedora-Rawhide-20200709.n.0 with each of the "suspects" from 20200710.n.1 added alone in turn, and with the newer glibc or kernel added the image still works fine, but with the newer systemd added, we hit the crash.

So, re-assigning to systemd. Zbigniew, the story here is that recent Rawhide netinst ISOs - since systemd-246~rc1-1.fc33 - crash a few minutes after startup (I found you don't actually have to click through the language select screen, even, if you just boot the image and wait, the crash happens) if run with 2GB of RAM. If you run with 4GB of RAM they work fine. Something about the newer systemd is causing the anaconda environment to use more memory while it's doing all its early repository setup and stuff than it was using before, I guess. To reproduce, just grab any netinst since 20200710.n.1 - e.g. https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20200713.n.0/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200713.n.0.iso - boot in a VM or system with 2GB of RAM, and wait for the crash.

Comment 18 Chris Murphy 2020-07-20 17:51:58 UTC
Aren't the downloaded RPMs supposed to be saved on the newly created target? I wonder if there are situations where that doesn't always happen and that's what we're running into? It's falling back to using the memory based overlay for these RPMs?

Comment 19 Chris Murphy 2020-07-20 18:03:50 UTC
OK I see part of the problem. We do not have swaponzram enabled on the netinstall images still, I think due to this PR not being pulled in yet:

https://github.com/weldr/lorax/pull/1048

So what happened in part is, anaconda's zram implementation went away in favor of swaponzram, but due to unexpected assumption about weak dependencies zram-generator(-defaults) doesn't get pulled in. That's just a happy accident, so it'll get fixed.

Comment 20 Adam Williamson 2020-07-20 18:04:04 UTC
There are no "downloaded RPMs". It hasn't downloaded any RPMs by the time it crashes. You can hit the crash by just booting to the language select screen and waiting, you don't even have to make it to the main hub.

Comment 21 Adam Williamson 2020-07-20 18:05:10 UTC
I'm not sure swap-on-ZRAM has much to do with this, see #c17. I can reproduce it by building an ISO from the Fedora-Rawhide-20200709.n.0 but including the systemd build from 20200710.n.1. Nothing else changes in the experiment besides the systemd package pulled in.

Comment 22 Chris Murphy 2020-07-20 18:16:37 UTC
Fedora-Rawhide-20200709.n.0 almost certainly still has anaconda's swaponzram implementation still in it. I didn't see that change land in composes until 20200712.

Comment 23 Chris Murphy 2020-07-20 18:52:42 UTC
OK it's definitely not swaponzram. Most of the time it turns out, no network allows me to manually setup swaponzram in this environment. Enable networking, and it still crashes, and hasn't used any swap. So this is not an "out of memory" situation. I'll see if debug logs help reveal more info.

Comment 24 Geoffrey Marr 2020-07-20 19:00:24 UTC
Discussed during the 2020-07-20 blocker review meeting: [0]

The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following Beta criterion:

"The installer must run when launched normally from the release-blocking images"*

*When booting release-blocking netinst images with 2GB of RAM or less. We note that 2GB is a fairly common config for VMs, and the official docs state 1GB is the minimum requirement.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-07-20/f33-blocker-review.2020-07-20-16.18.txt

Comment 25 Vladimir Benes 2020-07-21 08:57:12 UTC
I see this only happening with repo Everything, Server works OK. No idea how this can be related to repo but obviously it is.

Comment 26 Vladimir Benes 2020-07-21 08:59:30 UTC
I have machines with 4096 MB, bare metal:

Lenovo ThinkCentre e73
Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
4096 MB memory, 500 GB disk space

Server variant works, Everything variant does not.

Comment 27 Adam Williamson 2020-07-21 17:48:46 UTC
Everything repo is a lot bigger than Server, which sorta lines up with the whole "running out of space" element :)

Comment 28 Zbigniew Jędrzejewski-Szmek 2020-07-24 16:14:22 UTC
I tried to execute the reproducer from comment#17, and I don't see anything using memory except Anaconda.
But even Anaconda doesn't use that much.

8 min: virt/res 596 193 kb
10 min:         501 197
15 min:         606 202
20 min:         615 211
39 min:         907 479

Somewhere at that point anaconda displayed an error message pasted below. But even at this point
there's 2GB of zram swap (set up by zram.service, not the new zram-generator thingy) and none of
it used. So there's *plenty* of "free" space.

15:32:03,070 CRT exception: Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 144, in _add_repo_to_sack
    self._sack.load_repo(repo._repo, build_cache=True, **mdload_flags)
_hawkey.Exception: write_ext(0) się nie powiodło: -1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/site-packages/pyanaconda/threading.py", line 280, in run
    threading.Thread.run(self)
  File "/usr/lib64/python3.9/threading.py", line 888, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/payload/manager.py", line 220, in _run_thread
    payload.gather_repo_metadata()
  File "/usr/lib64/python3.9/site-packages/pyanaconda/payload/dnf/payload.py", line 1226, in gather_repo_metadata
    self._base.fill_sack(load_system_repo=False)
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 392, in fill_sack
    self._add_repo_to_sack(r)
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 147, in _add_repo_to_sack
    raise dnf.exceptions.RepoError(
dnf.exceptions.RepoError: Loading repository 'rawhide' has failed

"się nie powiodło" means "failed". Seems to be the same that AdamW got.

The see two things relevant for memory use:
1. systemd-resolved is running, even though nss-resolve is not enabled in /etc/nsswitch.conf.
   This is because the the fedora-release presets PR was merged already. It's using 19MB of RSS.
   This is more than I'd expect. Probably some optimizations are possible. Nevertheless, I don't
   this amount can make a difference.
2. /tmp/syslog is created. This probably duplicates the journal log that is already there. But
   it's only 0.5 MB. The problem is that it is probably not rotated, so if we wait long enough,
   /tmp (which is on tmpfs, i.e. in-memory) will get filled.
3. Both dbus-daemon and dbus-broker are running :(

It is likely that systmed-246 is using more memory (through more loaded libraries and such), I don't
have any reason to think it's allocating more, but don't think this matters. The only thing I'm
seeing is anaconda as the biggest user. Dunno, maybe systemd-246 is using just a bit more but things
were already almost exactly at the limit, and this extra use brings us over the limit. But I don't
think systemd can be faulted for that, and also it's just a hypothesis, and it doesn't explain why
the swap space is not used. Can we get some *useful* report from anaconda/dnf/hawkey?

Comment 29 Zbigniew Jędrzejewski-Szmek 2020-07-24 16:16:51 UTC
Created attachment 1702360 [details]
top screen after anaconda tb, 2GB ram, Fedora-Everything-netinst-x86_64-Rawhide-20200713.n.0.iso

Comment 30 Adam Williamson 2020-07-24 17:35:06 UTC
zbigniew: i dug down to where the error is actually happening in https://bugzilla.redhat.com/show_bug.cgi?id=1856514#c7 . I don't know enough to add useful logging to that code, though. if you can suggest a patch I can test it.

Comment 31 Adam Williamson 2020-07-24 17:45:27 UTC
BTW, anaconda actually has a feature for tracking memory usage and we run a test that uses that:

https://openqa.fedoraproject.org/tests/628803

in recent composes that logs max memory usage as calculated by https://github.com/rhinstaller/anaconda/blob/master/scripts/instperf as around 1.1G, during a notable spike early in the install process (logging start was 10:52:41):

10:53:15 783172 0 anaconda:234028,python3:93300,Xorg:74392,python3:72948,python3:72628
10:53:16 792348 0 anaconda:245572,python3:93300,python3:72948,python3:72628,Xorg:71580
10:53:17 808232 0 anaconda:261708,python3:93300,python3:72948,python3:72628,Xorg:71580
10:53:18 830360 0 anaconda:283676,python3:93300,python3:72948,python3:72628,Xorg:71580
10:53:19 855776 0 anaconda:308700,python3:93300,python3:72948,Xorg:72640,python3:72628
10:53:20 834812 0 anaconda:288236,python3:93300,python3:72948,python3:72628,Xorg:71592
10:53:21 859024 0 anaconda:317652,python3:93300,python3:72948,python3:72628,Xorg:71592
10:53:22 883584 0 anaconda:345096,python3:93300,Xorg:79628,python3:72948,python3:72628
10:53:23 902876 0 anaconda:365384,python3:93300,Xorg:79628,python3:72948,python3:72628
10:53:24 923412 0 anaconda:382444,python3:93300,Xorg:78348,python3:72948,python3:72628
10:53:25 935468 0 anaconda:394568,python3:93300,Xorg:78348,python3:72948,python3:72628
10:53:26 970892 0 anaconda:430104,python3:93300,Xorg:78348,python3:72948,python3:72628
10:53:27 991132 0 anaconda:449868,python3:93300,Xorg:78348,python3:72948,python3:72628
10:53:29 1011276 0 anaconda:465208,python3:93300,Xorg:81432,python3:72948,python3:72628
10:53:29 1017492 0 anaconda:476684,python3:93300,Xorg:77144,python3:72948,python3:72628
10:53:30 1029080 0 anaconda:487112,python3:93300,Xorg:77144,python3:72948,python3:72628
10:53:31 1041704 0 anaconda:500524,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:32 1058980 0 anaconda:516692,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:33 1080480 0 anaconda:549296,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:34 1081236 0 anaconda:539256,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:35 1087032 0 anaconda:544660,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:36 1098372 0 anaconda:556108,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:37 1104168 0 anaconda:558288,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:39 1071096 0 anaconda:526740,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:40 1068012 0 anaconda:526740,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:41 1069180 0 anaconda:528064,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:42 1069092 0 anaconda:530176,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:43 865356 0 anaconda:326716,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:44 868000 0 anaconda:326944,python3:93300,Xorg:77148,python3:72948,python3:72628
10:53:45 877876 0 anaconda:336220,python3:93460,Xorg:77148,python3:72948,python3:72628
10:53:46 907080 0 anaconda:365352,python3:93460,Xorg:77148,python3:72948,python3:72628
10:53:47 887460 0 anaconda:343368,python3:93460,Xorg:77148,python3:72948,python3:72628

I grabbed the log from the same test in the 20200709.n.0 compose, and interestingly, the early spike looks very similar there:

15:39:12 789352 0 anaconda:237696,python3:92904,python3:72652,python3:72332,Xorg:71884
15:39:13 811120 0 anaconda:261820,python3:92904,python3:72652,python3:72332,Xorg:70660
15:39:14 821988 0 anaconda:273216,python3:92904,python3:72756,python3:72432,Xorg:70660
15:39:15 833992 0 anaconda:283324,python3:93044,Xorg:72784,python3:72756,python3:72432
15:39:16 856420 0 anaconda:305288,python3:93044,python3:72756,python3:72432,Xorg:70696
15:39:17 836076 0 anaconda:286160,python3:93044,python3:72756,python3:72432,Xorg:70696
15:39:18 862796 0 anaconda:312308,python3:93044,python3:72756,python3:72432,Xorg:70696
15:39:20 884520 0 anaconda:337164,python3:93044,python3:72756,python3:72432,Xorg:70696
15:39:21 911988 0 anaconda:367556,python3:93044,Xorg:72824,python3:72756,python3:72432
15:39:22 917656 0 anaconda:384732,python3:93044,Xorg:78716,python3:72756,python3:72432
15:39:23 938048 0 anaconda:402380,python3:93044,Xorg:78716,python3:72756,python3:72432
15:39:24 953124 1280 anaconda:418172,python3:92740,Xorg:77372,python3:72512,python3:72184
15:39:25 959764 2560 anaconda:421876,python3:92728,Xorg:77372,python3:72500,python3:72172
15:39:26 991068 4352 anaconda:455180,python3:92728,Xorg:77372,python3:72500,python3:72172
15:39:27 1003088 5120 anaconda:464636,python3:92268,Xorg:77372,python3:72344,python3:72020
15:39:28 1024216 6144 anaconda:482028,python3:92164,Xorg:80564,python3:72252,python3:71960
15:39:28 1027968 8448 anaconda:491576,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:29 1042652 8448 anaconda:505152,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:30 1057916 8448 anaconda:520580,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:31 1085420 8448 anaconda:548536,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:32 1077968 9216 anaconda:540636,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:33 1085388 9216 anaconda:547244,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:34 1091940 9216 anaconda:553580,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:35 1106216 9216 anaconda:568600,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:36 1095524 9216 anaconda:542456,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:37 1067800 9216 anaconda:530504,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:38 1068116 9216 anaconda:531296,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:39 1070556 9216 anaconda:533408,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:40 1068116 9216 anaconda:531560,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:41 867516 9216 anaconda:331204,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:42 867588 9216 anaconda:331312,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:43 877068 9216 anaconda:341492,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:45 899844 9216 anaconda:366416,python3:92164,Xorg:76244,python3:72252,python3:71960
15:39:46 884556 9216 anaconda:348664,python3:92164,Xorg:76244,python3:72252,python3:71960

Obviously none of those numbers is near 2G, which is interesting. A possibility is that libsolv tries to reserve a ton of space it never actually ends up fully *using* based on some calculation or other, I guess?

I have grabbed the last image before the bug showed up (20200709.n.0) so I can play with this a bit more - you can grab it also from https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20200709.n.0/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200709.n.0.iso . An obvious thing to do to confirm or disprove Zbigniew's theory would be to see if *reducing* memory below 2GB can reproduce the issue with that compose, and if so, how much we have to reduce by to hit it.

Comment 32 Adam Williamson 2020-07-24 17:47:56 UTC
hum, another interesting thing: the second number is 'swap used' and it's noticeable that it's 0 throughout the log from the recent compose, but gets well above 0 in the earlier log. I might try booting a recent image with 2GB and inst.debug and see if the number goes above 0, if not, we may indeed be looking at 'failure to use swap' as the issue...

Comment 33 Chris Murphy 2020-07-24 17:50:07 UTC
Maybe boot with inst.loglevel=debug
The anaconda process doesn't look like it's running with --debug (Silverblue 32)

Comment 34 Adam Williamson 2020-07-24 18:02:58 UTC
Update: mboddu moved the 20200709.n.0 compose to save it from the garbage collector, so that download link is now:

https://kojipkgs.fedoraproject.org/compose/rawhide/foo/Fedora-Rawhide-20200709.n.0-adamw/compose/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-Rawhide-20200709.n.0.iso

and we'll have the compose available to tinker with this bug for a while.

Comment 35 Chris Murphy 2020-07-24 18:15:36 UTC
I've noticed /tmp on tmpfs is smaller since systemd commit cc1c85fbc3e3c09c73421e370dfbfbe4ac3d9449

I seemingly have enough at the time of the failure but maybe something asked for more than 27M at one time?
tmpfs                196M  169M   27M  87% /tmp

Fedora-Everything-netinst-x86_64-Rawhide-20200724.n.0.iso has zram-generator(-defaults); only 2M of swap is used at the time of the crash.

Comment 36 Chris Murphy 2020-07-24 18:23:48 UTC
If I disable the network link, and then switch to tty2, and `mount -o remount,size=400m,noatime /tmp`, then renable the network link, and proceed:

(a) crash does not happen
(b) /tmp used is 286M

I think this is the problem.

Comment 37 Zbigniew Jędrzejewski-Szmek 2020-07-24 18:38:50 UTC
(In reply to Adam Williamson from comment #30)
> zbigniew: i dug down to where the error is actually happening in
> https://bugzilla.redhat.com/show_bug.cgi?id=1856514#c7.

Yeah, I read that. That's why I mean by [lack of] a useful message. -1 is meaningless.
It's a 970 line function that does a bunch of things, and without adequate messages it's
really hard to say what is going wrong, unless you really know the code intimately.
If we even had an errno without any message it would be something. If the error is
really lack of disk space, with -ENOSPC we wouldn't need to speculate so much.

> hum, another interesting thing: the second number is 'swap used' and it's noticeable that it's 0 throughout the log from the recent compose

I noticed that too, but the simplest interpretation is that the kernel doesn't see
the need to swap out. When I put some files in /tmp, swap starts being used.

> I've noticed /tmp on tmpfs is smaller since systemd commit cc1c85fbc3e3c09c73421e370dfbfbe4ac3d9449

What do you mean by "smaller"? cc1c85fbc3e3c09c was supposed to be about inode count.
I see 69/400k inodes used on /tmp, so that's probably not it. But maybe that commit
changes the disk size too. That'd be a bug, but it seems to be a plausible explanation.

Comment 38 Adam Williamson 2020-07-24 18:41:00 UTC
So I ran the 20200713.n.0 image with 2GB again and, before it crashed, it did register use of swap. So I don't think that's the difference :/

Comment 39 Adam Williamson 2020-07-24 18:50:04 UTC
I think Chris is onto something. If I spam 'du -h /tmp' I can see indeed dnf is using a lot of space there to cache repodata. And from my local mirror I can see that the size of Rawhide's 'everything' repo metadata is 230M:

[adamw@www os]$ pwd
/share/mirror/fedora/linux/development/rawhide/Everything/x86_64/os
[adamw@www os]$ du -h repodata/
230M	repodata/

so that kinda lines up with Chris' numbers. Good spot.

Comment 40 Adam Williamson 2020-07-24 18:54:43 UTC
For the record, it's anaconda that tells dnf to use /tmp/dnf.cache as its cache dir:

https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/payload/dnf/utils.py#L34

presumably because it can't guarantee any disk-backed storage early in install.

Comment 41 Adam Williamson 2020-07-24 19:00:39 UTC
Confirming Chris' finding that /tmp reports its size as 196M when booting a recent image with 2048MB of RAM. When booting a recent image with 3072MB of RAM, /tmp size is reported as 297M, which is just big enough for the repodata to fit I guess.

Booting a pre-systemd-246 image with 2048MB of RAM, /tmp size is reported as 981M.

Comment 42 Chris Murphy 2020-07-24 19:03:57 UTC
Wrong commit. This one 7d85383edbab73274dc81cc888d884bb01070bc2.

Comment 43 Zbigniew Jędrzejewski-Szmek 2020-07-24 19:27:15 UTC
Yep, that's it. But the question is what the right limit would be. 10% is maybe too little, but
we want *some* limit.

Comment 44 Adam Williamson 2020-07-24 19:40:02 UTC
This specific case is fairly unique to the installer environment - it wouldn't come up on an installed system, where dnf will be using a disk-backed location for its cache dir - so perhaps just allowing it to be configured in a way that we can apply to the installer environment would be sufficient to solve this problem without too much argument?

https://github.com/systemd/systemd/pull/15424 is the pull request that added the limits, I've added a comment there, so Lennart should see it...

Comment 45 Chris Murphy 2020-07-24 20:12:32 UTC
Pretty sure /tmp on tmpfs has been sized 50% since 2012.
https://fedoraproject.org/wiki/Features/tmp-on-tmpfs

It almost seems like the change is a solution in search of a problem.

Comment 46 Zbigniew Jędrzejewski-Szmek 2020-07-24 20:33:26 UTC
https://github.com/systemd/systemd/pull/16576

Comment 47 Vendula Poncova 2020-07-27 09:47:03 UTC
I am changing the component to systemd based on the comment 46.

Comment 48 Zbigniew Jędrzejewski-Szmek 2020-07-27 15:24:26 UTC
Thanks for all the debugging. Should be fixed now.


Note You need to log in before you can comment on or make changes to this bug.