Bug 1731978

Summary:	Default to no disk swap for Workstation installations
Product:	[Fedora] Fedora	Reporter:	Bastien Nocera <bnocera>
Component:	anaconda	Assignee:	Anaconda Maintenance Team <anaconda-maint-list>
Status:	CLOSED RAWHIDE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	31	CC:	anaconda-maint-list, bugzilla, ego.cordatus, hdegoede, jan.public, jkonecny, jonathan, kellin, mcatanzaro, mkolman, pbrobinson, renault, rharwood, vanmeeuwen+fedora, vponcova, wwoods
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-07-21 18:32:18 UTC	Type:	Bug

Description Bastien Nocera 2019-07-22 13:12:26 UTC

By default, anaconda creates disk-backed swap partitions dependent on the RAM size available, as per suggest_swap_size()

Unfortunately, especially on interactive systems such as the Workstation variants, hitting the disk-based swap under low-memory conditions renders the machine completely unusable. Disk-based swap is not fast enough at freeing up physical memory to keep the machine interactive. zram-based swap performs much better, even if it only postpones the inevitable, i.e. the machine seizing up for hours.

Furthermore, disk-based swap causes wear and tear on devices with limited lifespans, such as eMMC, NVMe drives, SSDs, SD cards, etc.

This bug was filed to enable zram-backed swap by default on Workstation installations:
https://bugzilla.redhat.com/show_bug.cgi?id=1731598

I've tested the reactivity of a system with 16GB of RAM under low-memory conditions (the coredump created by https://bugzilla.redhat.com/show_bug.cgi?id=1731371 seems to eat gigs of memory) and the system stayed interactive for 15 minutes longer than one with disk-backed swap.

(Note: I think this might just need a single line change in the fedora-kickstarts to say "noswap" but I don't know whether that will change the default for all the different manners in which a live disk installer would work)

Comment 1 Bastien Nocera 2019-07-22 13:19:11 UTC

(Note that switching to the bfq I/O scheduler, as discussed on the fedora kernel list, didn't have any impact on the interactivity of the machine when it reached really low-memory situations).

Comment 2 Jiri Konecny 2019-07-23 11:59:40 UTC

Hello,

Thanks for the suggestion but I'm not sure that this is a good idea.

When we remove the disk swap, you will get a different set of problems.

* Instead of making the system harder to interact with, it may cause a memory-hungry operation to fail, or even a system crash.
* What about hibernation to disk? This change effectively breaks that feature.

Using zram won't solve the problems above; it will only postpone them.

If you really want to get this in, you can create a system-wide change proposal for Fedora, but we will not make this change without community approval. This is a really big change in how the system will behave.

Comment 3 Bastien Nocera 2019-07-23 12:47:17 UTC

(In reply to Jiri Konecny from comment #2)
> Hello,
> 
> Thanks for the suggestion but I'm not sure that this is a good idea.
> 
> When we remove the disk swap, you will get a different set of problems.
>
> * Instead of making the system harder to interact with, it may cause a
> memory-hungry operation to fail, or even a system crash.

Impossible to interact, not harder to interact. For hours.

> * What about hibernation to disk? This change effectively breaks that
> feature.

Hibernation is not supported by the Fedora kernel developers. Users that want
to use hibernation can still do that by creating a swap themselves.

One of the discussions from 2018:
https://lists.fedoraproject.org/archives/list/desktop@lists.fedoraproject.org/thread/TLTA6HAYJWQYHV3ZHFXUIXM4IJVWBEJJ/

> Using zram won't solve the problems above; it will only postpone them.

Which is exactly the point. Postponing it instead of making the system completely
unusable for hours on end.

> If you really want to get this in, you can create a system-wide change
> proposal for Fedora, but we will not make this change without community
> approval. This is a really big change in how the system will behave.

I'm not sure why I would create a system-wide change proposal to *discuss* a change. Can't we do that here?

Can you answer this question in my original comment:
> (Note: I think this might just need a single line change in the fedora-kickstarts to say "noswap" but I don't know whether that will change the default for all the different manners in which a live disk installer would work)

Comment 4 Michael Catanzaro 2019-07-23 13:01:07 UTC

The Workstation Working Group is tracking this issue at https://pagure.io/fedora-workstation/issue/98.

I assume that on a workstation system, killing memory-hog processes is almost always preferable to a total system hang. My personal experience has been that as soon as Fedora starts swapping, it becomes completely unusable, and the best course of action is to pull the plug. So I'm firmly convinced that swap is entirely harmful on Workstation, and I fear that by retaining disk-based swap by default we continue to ensure a terrible user experience in out-of-memory situations. To convince me that we should keep swap, I'd like to see compelling arguments specifically regarding system interactivity in out-of-memory situations.

Comment 5 Jiri Konecny 2019-07-25 11:35:32 UTC

I see your point. From the Anaconda point of view, we will implement this if a change is approved in Fedora. From my personal point of view as a user, I think we are creating something that will make Fedora an unstable system, and we will lose users because of this change.

I would simplify this issue to two possibilities:

1) Unresponsive system for a while
2) Responsive system with a possibility of losing your data

I'm not an expert in this field but AFAICT the OOM killer will kill a process, and it will not always choose the one you would expect. From my understanding it may also kill your GNOME Shell session. One scenario I can imagine is that this happens during a DNF update, and then you won't be able to boot the system because of it.

Feel free to correct my assumption if I'm wrong.

I would rather see us tweak the kernel settings that control when it swaps, and experiment with that a little. Basically, avoid swapping unless really necessary, instead of losing data.

For example:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-tunables


Back to your question:

> I think this might just need a single line change in the fedora-kickstarts to say "noswap" but I don't know whether that will change the default for all the different manners in which a live disk installer would work

No, a change in fedora-kickstarts won't help you. The kickstarts in fedora-kickstarts define how the installation environment is created and what lands in the installed system from a Live media installation. However, none of that changes what the default partitioning looks like. To make this possible, we have to implement an option to change Anaconda's default partitioning for Fedora Workstation.

Comment 6 Bastien Nocera 2019-07-25 13:29:43 UTC

(In reply to Jiri Konecny from comment #5)
> I see your point. From the Anaconda point of view, we will implement this if
> a change is approved in Fedora. From my personal point of view as a user, I
> think we are creating something that will make Fedora an unstable system,
> and we will lose users because of this change.
> 
> I would simplify this issue to two possibilities:
> 
> 1) Unresponsive system for a while
> 2) Responsive system with a possibility of losing your data
> 
> I'm not an expert in this field but AFAICT the OOM killer will kill a
> process, and it will not always choose the one you would expect. From my
> understanding it may also kill your GNOME Shell session.

The incorrect assumption is that in 1), the system will recover. It will not.
The machine churning for 4 hours is unusable, and you need to lose _all_ your
data because you will need to switch the machine off forcefully.

The comments about how the OOM killer chooses its target are somewhat correct, even
if there are ways to nominate yourself for early killing, which is what Chrome does
for its content tabs for example:
https://github.com/endlessm/chromium-browser/blob/45f610422da26b6bc60204b0eef2360ca52684cf/base/process/memory_linux.cc#L90
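As a rough illustration of "nominating yourself for early killing": on Linux a process can raise its own OOM badness by writing to /proc/self/oom_score_adj, which accepts values from -1000 to 1000. The sketch below is not Chrome's code. The helper names and the default value of 300 are assumptions made for illustration, and `proc_root` is a hypothetical parameter that exists only so the function can be exercised against a scratch directory.

```python
from pathlib import Path

# Range accepted by the kernel for /proc/<pid>/oom_score_adj.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def clamp_oom_score_adj(value: int) -> int:
    """Clamp a requested adjustment into the range the kernel accepts."""
    return max(OOM_SCORE_ADJ_MIN, min(OOM_SCORE_ADJ_MAX, value))


def volunteer_for_oom_kill(value: int = 300, proc_root: Path = Path("/proc")) -> int:
    """Raise this process's OOM score adjustment so it is picked earlier.

    The default of 300 is an arbitrary illustrative value (assumption).
    Raising the value needs no privileges; lowering it generally does.
    """
    adj = clamp_oom_score_adj(value)
    (proc_root / "self" / "oom_score_adj").write_text(f"{adj}\n")
    return adj
```

A disposable worker process would call `volunteer_for_oom_kill()` once at startup, while the shell and other essential services would leave their value untouched.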

The problem being that if you have a large amount of swap, your machine will
freeze from the swapping and memory reclaim, but won't trigger the OOM killer
until the memory's exhausted, which could be many hours after the system has
been rendered unusable. So we can't really make the OOM killer better if we
never actually see it in action.

Similar setups, with zram-backed swap and no disk-based swap are already deployed
on desktop systems by Endless.

> One scenario I can imagine is that this happens during a DNF update, and
> then you won't be able to boot the system because of it.

No, that wouldn't happen on Workstation because we only support offline updates,
which run in a cut-down environment, and thus don't compete against the user's
applications.

> Feel free to correct my assumption if I'm wrong.
> 
> I would rather see us tweak the kernel settings that control when it swaps,
> and experiment with that a little. Basically, avoid swapping unless really
> necessary, instead of losing data.
> 
> For example:
> 
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/
> html/performance_tuning_guide/s-memory-tunables

This guide is for RHEL6 released 10 years ago, and for server workloads, not
for Workstation workloads.

> Back to your question:
> 
> > I think this might just need a single line change in the fedora-kickstarts to say "noswap" but I don't know whether that will change the default for all the different manners in which a live disk installer would work
> 
> No, a change in fedora-kickstarts won't help you. The kickstarts in
> fedora-kickstarts define how the installation environment is created and
> what lands in the installed system from a Live media installation. However,
> none of that changes what the default partitioning looks like. To make this
> possible, we have to implement an option to change Anaconda's default
> partitioning for Fedora Workstation.

OK.

Comment 7 Bastien Nocera 2019-07-25 13:32:40 UTC

I just want to make it clear that I expect anaconda to offer this as an
option that the various spins can opt into, like the Workstation SIG probably
would. It's up to those separate spins to figure out whether having a
disk-backed swap is a benefit to them and their workloads.

Comment 8 Hans de Goede 2019-07-26 16:30:54 UTC

+1 for this. I've actually been writing a patch to make anaconda not create swap by default on eMMC devices, since that is a very bad idea (very low performance, and too many writes brick them).

Ideally anaconda should not create swap on /dev/mmcblk# devices *ever*. But not having it do that for Workstation installs would be a great start.

Somewhat unrelated: as for enabling zswap support by default for Workstation installs, I think that that is a great idea too.

Comment 9 Jiri Konecny 2019-07-29 08:03:01 UTC

(In reply to Bastien Nocera from comment #7)
> I just want to make it clear that I expect anaconda to offer this as an
> option that the various spins can opt into, like the Workstation SIG probably
> would. It's up to those separate spins to figure out whether having a
> disk-backed swap is a benefit to them and their workloads.

You don't have to worry about that.
We definitely won't implement this as the default for everyone. First, that is not how Anaconda usually adopts things. Second, we need to think about RHEL too and keep a fallback to the existing solution. Third, we always keep in mind spins with specific needs and requirements.

Comment 10 Jiri Konecny 2019-07-29 08:06:01 UTC

Please move the rest of this discussion to the change proposal once it is filed. Let's keep this bug only about the implementation in Anaconda, not about whether or not this is a good idea for the distribution.

Comment 11 Chris Murphy 2019-07-30 19:29:00 UTC

I'm a +1 to this change in Anaconda. Apologies in advance that I'm not exactly adhering to Jiri's request in comment 10, but this is not a rehash of the prior comments.

It's been clear for a while that as CPU+RAM have scaled in performance, drives have not scaled equivalently. The now comparatively huge swap partitions almost immediately crater system responsiveness when substantively used. Swap partitions weren't ever intended to be this big for swap purposes. However, conflating swap with hibernation images in the same location made it inevitable. The two other common operating systems do not conflate these things; they're totally separated.

And while I'm sensitive to the differences Anaconda folks need to apply to desktop and server, this problem can happen on a server if any appreciable swap starts being used persistently. It's that bad. And I suspect Anaconda folks aren't as sensitive to this problem on the RHEL side of things because RHEL support folks are good at making sure customers have plenty sufficient RAM for their workloads - the problem is obviated by resources.

I further suspect that kernel swap code has stagnated, and simply isn't compatible with today's resource restricted environments unless we put constraints on swap (faster backing, smaller pool), which is what this RFE is about. And the same for hibernation, it's a mess of firmware bugs stepping all over memory in different ways that the kernel cannot hope to keep up with when restoring the hibernation image, except for the most popular hardware. As an indicator, Microsoft has given up on it in Windows 10.

As for zswap, it's still marked experimental in kernel documentation. I've used it and I like it, but for the problem under discussion, since the backing is ultimately a conventional swap partition, it gives the kernel and thus applications, the wrong view of available resources, such that the worst cases can end up soaking that swap partition all the same, and imploding system performance.

I agree with Anaconda folks there's likely more than one thing going on. But I also think the proposal results in a net better experience in memory stressed situations. Maybe cgroup2 provides an option to memory constrain applications likely to instigate the problem? I'm not sure how difficult it is to implement that.

Merely saying things doesn't make them true, so I'll try to get some better evidence to make the case for this change.

Comment 12 Ben Cotton 2019-08-13 16:51:58 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 13 Ben Cotton 2019-08-13 18:52:02 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 14 Chris Murphy 2019-08-13 20:32:32 UTC

Two problems with no swap:
- old pages aren't freed, they must stay in memory, this is inefficient use of RAM
- when out of memory, oom-killer immediately starts killing off processes based on oom score

Currently there's no effort in Fedora to optimize the oom score of processes to ensure the most disposable processes are subject to being killed, while important ones are preserved: web browser, the shell, basic required system services. Further, it's an open question whether the effort is better focused elsewhere anyway, preventing the oom scenario in the first place.

Workstation needs to present a generic solution, and yet there are use cases that need specific solutions. Perhaps we need to look at ways of customizing the usage of swap post-install, based on use cases.

Comment 15 Chris Murphy 2019-08-13 20:38:32 UTC

Third problem with no swap:
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/BJHZRNK2IW6CIR4N4KSPM6G77GUAZ7JT/

Also, that thread contains on-going discussion about lack of system responsiveness in memory constrained situations, with and without swap, on various devices.

Comment 16 Jiri Konecny 2019-08-16 10:21:33 UTC

Thanks Chris for keeping us updated!

Comment 17 Bastien Nocera 2019-08-19 10:55:47 UTC

(In reply to Chris Murphy from comment #15)
> Third problem with no swap:
> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/
> message/BJHZRNK2IW6CIR4N4KSPM6G77GUAZ7JT/
> 
> Also, that thread contains on-going discussion about lack of system
> responsiveness in memory constrained situations, with and without swap, on
> various devices.

2GB of slow disk-backed swap is still a huge amount that's likely to get swamped.

Note that we should really be focusing on "with disk-backed and zram-backed swap" and
"with zram-backed swap" comparisons.

(zram is enabled by default on F31, correct?
https://bugzilla.redhat.com/show_bug.cgi?id=1731598)
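
To check which of the two setups a given test machine is actually running, the active swap devices can be read from /proc/swaps. The following is a minimal sketch: the function name is hypothetical, and it assumes that zram swap devices show up under /dev/zram* and that the first line of /proc/swaps is a column header.

```python
from typing import Dict, List


def classify_swaps(proc_swaps_text: str) -> Dict[str, List[str]]:
    """Split active swap devices into zram-backed and everything else.

    Assumes the /proc/swaps layout: one header line, then one line per
    device whose first whitespace-separated field is the device path.
    """
    result: Dict[str, List[str]] = {"zram": [], "other": []}
    for line in proc_swaps_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if not fields:
            continue
        device = fields[0]
        kind = "zram" if device.startswith("/dev/zram") else "other"
        result[kind].append(device)
    return result
```

On a live system this would be called as `classify_swaps(open("/proc/swaps").read())`; a non-empty "other" list means disk-backed (or file-backed) swap is still in play.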

Comment 18 Chris Murphy 2019-08-19 16:58:17 UTC

There are three "swap on ZRAM" implementations: Anaconda's, Fedora zram package, and upstream systemd zram-generator.

I would like to get Anaconda team's feedback on the two other zram implementations, if they would consider them viable replacements for their needs. If not, why not, and get those modifications fed upstream so we can have a single implementation.

Anaconda's swap on ZRAM is started for low RAM (2GiB and lower) devices when anaconda is launched. This applies to Fedora netinstall and LiveOS boot only, for Fedora 31. The as-installed system doesn't have any swap on ZRAM enabled due to various conflicts, sufficient to warrant it being a system-wide change so everyone's aware of it and can discuss it.
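
The low-RAM condition described above can be sketched as follows. This is not Anaconda's actual code: the function names are hypothetical, and the exact comparison Anaconda performs is an assumption. It relies only on /proc/meminfo reporting MemTotal in kB.

```python
# 2 GiB expressed in the kB units used by /proc/meminfo (assumed threshold).
LOW_RAM_LIMIT_KIB = 2 * 1024 * 1024


def mem_total_kib(meminfo_text: str) -> int:
    """Extract the MemTotal value from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found in meminfo text")


def wants_installer_zram(meminfo_text: str) -> bool:
    """Return True when the machine counts as low-RAM (2 GiB or less)."""
    return mem_total_kib(meminfo_text) <= LOW_RAM_LIMIT_KIB
```

Dropping the RAM check entirely would amount to replacing `wants_installer_zram` with a function that always returns True.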

Near as I can tell, if a system is using swap on ZRAM, it will conflict with hibernation. While Fedora kernel team doesn't support blocking release due to hibernation related bugs, there is slightly more than tacit support for it: quite a lot of users convinced anaconda team to include "resume=<swapdevice>" to boot parameter during the installation, and just killing off that fairly recent prior work without discussion and planning I think will make a decent amount of users angry.

We need a plan, some consensus, coordination, and a feature proposal to make it the default.

As for comparisons: I've done the 'build webkitgtk' example test with various combinations of swap on SSD of various sizes, swap on ZRAM of various sizes, and two swaps SSD and ZRAM of various sizes, and also zswap. The GUI falls over in every case. The CLI is marginally usable in some cases but I still consider those systems lost. So the swap stuff I think is an important optimization, but so far it causes conflicts with other valid use cases, and to force the decision on users at installation time is problematic, hence the idea floated in comment 14.

Comment 19 Martin Kolman 2019-08-27 10:25:52 UTC

(In reply to Chris Murphy from comment #18)
> There are three "swap on ZRAM" implementations: Anaconda's, Fedora zram
> package, and upstream systemd zram-generator.
> 
> I would like to get Anaconda team's feedback on the two other zram
> implementations, if they would consider them viable replacements for their
> needs. If not, why not, and get those modifications fed upstream so we can
> have a single implementation.
The reason why we currently have a custom mechanism for setting up zram in Anaconda
is simple - when we added support for it, none of the alternatives were yet available
and packaged for Fedora. And it has simply been in place since.

I think we should definitely check the alternatives and ideally switch to one of
them and dump our custom zram code if possible.

> 
> Anaconda's swap on ZRAM is started for low RAM (2GiB and lower) devices when
> anaconda is launched.
I wonder - what about just always enabling zram during installation? We could just
dump the RAM size checks and simply always activate zram (preferably using one of
the common solutions mentioned above) regardless of RAM size. If a system has a lot of
RAM it will very likely also have enough CPU resources, so the extra zram devices
should not cause measurable slowdowns.

Do you think there would be any obvious downsides?

> This applies to Fedora netinstall and LiveOS boot
> only, for Fedora 31. The as-installed system doesn't have any swap on ZRAM
> enabled due to various conflicts, sufficient to warrant it being a
> system-wide change so everyone's aware of it and can discuss it.
> 
> Near as I can tell, if a system is using swap on ZRAM, it will conflict with
> hibernation. While Fedora kernel team doesn't support blocking release due
> to hibernation related bugs, there is slightly more than tacit support for
> it: quite a lot of users convinced anaconda team to include
> "resume=<swapdevice>" to boot parameter during the installation, and just
> killing off that fairly recent prior work without discussion and planning I
> think will make a decent amount of users angry.
> 
> We need a plan, some consensus, coordination, and a feature proposal to make
> it the default.
> 
> As for comparisons: I've done the 'build webkitgtk' example test with
> various combinations of swap on SSD of various sizes, swap on ZRAM of
> various sizes, and two swaps SSD and ZRAM of various sizes, and also zswap.
> The GUI falls over in every case. The CLI is marginally usable in some cases
> but I still consider those systems lost. So the swap stuff I think is an
> important optimization, but so far it causes conflicts with other valid use
> cases, and to force the decision on users at installation time is
> problematic, hence the idea floated in comment 14.

Comment 20 Chris Murphy 2019-08-27 17:35:35 UTC

(In reply to Martin Kolman from comment #19)
Yeah I'm aware of the relative histories, no one's done anything wrong. I like the idea of this being a systemd generator, making it foundational, reliable, upstream. The gotcha being, it's in Rust and right now the present code doesn't work. So I just need to recruit some Rust folks interested in the topic, see about summarizing what it can and can't do compared to the other options, and then decide if it's worth fixing and maintaining.

I have no problem with defaulting to enabled, and setting up the first available /dev/zram device at 1:1 RAM or even 1/2 RAM. The kernel docs say there's no point making the zram device bigger than 2x RAM, because about a 2:1 compression ratio is expected. I suspect we only start running into a need for memory cap tuning if the zram device goes beyond 1.5x RAM.

The high memory system is probably designed that way in order to avoid swapping, so swap usage would be only incidental or inadvertent. Since memory is dynamically allocated to the zram device only as needed, the main effect is postponing what would otherwise have been an oom-killer event (if the alternative is no swap at all). Does anyone design a machine expecting oom-killer to trigger at a specific point, and would be annoyed if it were delayed? *shrug* I expect near realtime compression with lz4 on such systems. e.g. 64GiB RAM would mean a 64GiB zram device, with ~32GiB memory used should the entire swap get filled. Perhaps it turns out the "middle of the road" machines should get a 1:1 ratio, and the very high end and low end systems should set zram device to a fraction of RAM - like a bell curve. At least for starters. It's a conservative approach that should produce zero complaints and almost zero accolades as no one notices anything!
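
The sizing arithmetic in this comment can be written out as a small sketch. The function name and parameters are hypothetical; the 2:1 compression ratio is the expectation quoted above, not a measured value.

```python
from typing import Tuple


def zram_estimate(
    ram_gib: float,
    size_ratio: float = 1.0,
    compression_ratio: float = 2.0,
) -> Tuple[float, float]:
    """Return (zram device size, RAM consumed if the device fills up), in GiB.

    size_ratio is the zram device size relative to RAM (1.0 means 1:1).
    compression_ratio is the assumed ratio of uncompressed to compressed data.
    """
    device_gib = ram_gib * size_ratio
    ram_used_when_full_gib = device_gib / compression_ratio
    return device_gib, ram_used_when_full_gib
```

With these assumptions, `zram_estimate(64)` gives a 64 GiB device that costs about 32 GiB of RAM when completely full, matching the example above, and `zram_estimate(64, size_ratio=0.5)` gives a 32 GiB device costing about 16 GiB.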

Comment 21 Chris Murphy 2020-01-08 03:46:12 UTC

The Workstation working group continues to discuss this and is ready to start making decisions.

In particular, on workstation class machines, swap to RAM ratio of 1:1 is excessive:
a) heavy swap use leads to really poor performance and UX, even frozen GUI for long periods of time; 
b) 16G let alone 64G to 128G swap is a pointlessly huge amount of space to waste by default;
c) writing out the image and reading it back in to resume, takes time even on fast SSD, so its use for hibernation is of decreasing efficacy compared to just booting.

We're discussing two issues and are soliciting feedback, in particular from Anaconda folks, to get some momentum and planning toward getting these changes done, hopefully for Fedora 33.

swap is too big: don't make a swap partition; instead use systemd zram-generator to set up a smallish ~2-4GiB swap on ZRAM device
https://pagure.io/fedora-workstation/issue/120
https://github.com/systemd/zram-generator

hibernation: significant gotchas and (dev) resource barriers to seriously supporting it (beyond best effort)
https://pagure.io/fedora-workstation/issue/121

It's possible that a systemd unit could create and activate a swapfile prior to entering hibernation.target, and disable it after resume. That way it's never used for swapping; doesn't take up space unless the user opts in; and they don't have to do custom partitioning to get it.
https://pagure.io/fedora-workstation/issue/120#comment-618549

As it relates to other editions: server folks have told me it's about as common to have no swap as to have swap. And cloud folks tell me they tend to not use swap at all. And IoT folks are using their own swap-on-zram implementation, enabled by default, and are agreeable to converging on systemd zram-generator. It may be that all the editions and spins will want to go this way.

Comment 22 Jiri Konecny 2020-01-09 09:56:19 UTC

Thanks for the heads-up Chris. I've posted some ideas/questions on those tickets.

Comment 23 Chris Murphy 2020-07-21 18:32:18 UTC

This has now happened with this change.
https://fedoraproject.org/wiki/Changes/SwapOnZRAM
https://bugzilla.redhat.com/show_bug.cgi?id=1850218