Bug 919485 - pstore: please disable extremely dangerous EFI variable fiddling
Summary: pstore: please disable extremely dangerous EFI variable fiddling
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 927321
Reported: 2013-03-08 15:52 UTC by Kay Sievers
Modified: 2013-09-09 18:05 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 927321 (view as bug list)
Environment:
Last Closed: 2013-04-29 12:54:24 UTC
Type: Bug
Embargoed:


Description Kay Sievers 2013-03-08 15:52:31 UTC
The current version of pstore seems to write to EFI variables until the storage
is exhausted. Every single kernel oops, like the one that happens with a wrong
root= parameter, risks rendering machines unbootable.

I just trashed the second! laptop just by causing pstore to fiddle with
EFI variables.

Every further attempt to update an EFI variable results in -ENOSPC and
other unpredictable errors of efivarfs and its interaction with the firmware.

One of the laptops (Compaq CQ58) I was only able to recover from that
by re-flashing the firmware, all resetting of the BIOS or trying to
manually delete variables did not fix anything. Even the EFI shell froze
when trying to access the variables.

The Thinkpad firmware is a bit more forgiving here, and a BIOS reset to
defaults clears the problem.

At this moment, the benefit of pstore is almost nothing, but we risk to
potentially destroy hardware with it. It is kind of ironic, that the crash
dump facility causes the machines to become unbootable. :)

Please just disable this facility until this is properly sorted out in
the pstore code.

Thanks!

Comment 1 Josh Boyer 2013-03-08 18:14:58 UTC
This commit went into today's rawhide:

commit 68d929862e29a8b52a7f2f2f86a0600423b093cd
Author: Matthew Garrett <matthew.garrett>
Date:   Sat Mar 2 19:40:17 2013 -0500

    efi: be more paranoid about available space when creating variables
    
    UEFI variables are typically stored in flash. For various reasons, available
    space is typically not reclaimed immediately upon the deletion of a
    variable - instead, the system will garbage collect during initialisation
    after a reboot.
    
    Some systems appear to handle this garbage collection extremely poorly,
    failing if more than 50% of the system flash is in use. This can result in
    the machine refusing to boot. The safest thing to do for the moment is to
    forbid writes if they'd end up using more than half of the storage space.
    We can make this more finegrained later if we come up with a method for
    identifying the broken machines.

it was also committed as part of a patch set a few days ago.  What specific kernel version were you using when your two laptops hit issues?
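
The check described in the commit message above can be sketched roughly as follows. This is only an illustration in Python, not the kernel code: the function name and parameter names are hypothetical, and the byte counts are assumed to come from whatever the firmware reports about its variable store.

```python
import errno


def check_variable_write(storage_size, remaining_size, max_variable_size, new_size):
    """Return 0 if a variable write may proceed, or -errno.ENOSPC if refused.

    All arguments are byte counts. The names are hypothetical; the values
    are assumed to come from the firmware's report of its variable storage.
    """
    # No usable information about the store: refuse rather than guess.
    if storage_size == 0:
        return -errno.ENOSPC
    # The variable does not fit at all.
    if new_size > remaining_size or new_size > max_variable_size:
        return -errno.ENOSPC
    # Refuse writes that would leave less than half of the store free,
    # i.e. that would end up using more than 50% of the storage space.
    if remaining_size - new_size < storage_size // 2:
        return -errno.ENOSPC
    return 0
```

With a 64 KiB store, a 1 KiB write is accepted while 60000 bytes remain, and refused once only 33000 bytes remain, because the write would push usage past the halfway mark.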

Comment 2 Kay Sievers 2013-03-08 18:18:13 UTC
It was:
  3.9.0-0.rc1.git0.4.fc19.x86_64

Comment 3 Josh Boyer 2013-03-08 18:19:41 UTC
(In reply to comment #2)
> It was:
>   3.9.0-0.rc1.git0.4.fc19.x86_64

Hm, ok.  That should have had the above commit carried in efi-fixes.patch.

Matthew, any thoughts here?

Comment 4 Kay Sievers 2013-03-08 21:20:47 UTC
Hmm, while thinking about it:

Maybe it's sensible to require explicit enabling of the pstore dump, by making
a sysfs writable variable, or a kernel command line. The cheap flash in the
EFI firmware of the low end boxes is really nothing we want to stress too much
I guess.

Also, there is not much point in writing stuff to the firmware flash if
nothing consumes it. And at the moment, almost all dumps get unnoticed,
I guess.

Making the maximum size userspace tweakable might be nice too.

Unrelated to that bug, we really need to be able to pass a string to
the dumper which gets included in all of the dumped pages, there is
currently no way to find out where and when these dumps occurred, we
just find them with the next reboot, without any sensible context. They
could even be from a different disk booted on the same hardware.

Userspace could enable the dump and write the boot and machine id into the facility, so we can make some sense out of it with the next reboot.
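
A minimal sketch of the context string userspace could assemble for this, assuming the usual Linux locations for the boot id and machine id. As noted above, there is currently no interface for handing such a string to the dumper, so this only shows how the string might be built; the function names and the output format are hypothetical.

```python
from pathlib import Path

# Usual locations on a Linux system; treated as assumptions here.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")
MACHINE_ID_PATH = Path("/etc/machine-id")


def read_id(path):
    """Return the stripped contents of an id file, or 'unknown' if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return "unknown"


def build_dump_context(boot_id, machine_id):
    """Build a short tag that could be included in every dumped page."""
    return f"boot_id={boot_id} machine_id={machine_id}"


if __name__ == "__main__":
    print(build_dump_context(read_id(BOOT_ID_PATH), read_id(MACHINE_ID_PATH)))
```

A tag like this would let a later boot tell whether a dump came from the same installation or from a different disk booted on the same hardware.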

Comment 5 Matthew Garrett 2013-03-08 21:30:30 UTC
If we're crashing often enough that we're going through significant flash wear cycles, we have other problems. It also seems unlikely that userspace is going to start consuming these things unless they're made available to userspace.

Regarding your original problem, it's unclear what actually happened. Were both machines running kernels with that patch? If not, what was the actual failure mode of the machine that was running that patch?

Comment 6 Lukas "krteknet" Novy 2013-03-09 11:28:45 UTC
(In reply to comment #0)
> The Thinkpad firmware is a bit more forgiving here, and a BIOS reset to
> defaults clears the problem.
Not in my case... I've already managed to brick three T430s Thinkpads in two months:) I found no way to reset BIOS settings, disconnecting the CMOS battery doesn't help either. (kernel 3.8.0-1.el6.elrepo.x86_64)

Also, see this comment[1]: there is a chance that even disabling EFI boot in the Setup Utility may not be enough.

[1]https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557/comments/163

Comment 7 Kay Sievers 2013-03-09 16:23:53 UTC
The pstore facility is not useful enough to be enabled by default, it's just too
scary to take the risk of breaking machines that way for no real benefit. That
stuff needs to be enabled by the admin from userspace and surely not by the
kernel's defaults.

The current data without any context and bootup information, without
proper date, time, boot-id information is pretty pointless anyway.

Yes, both machines ran 3.9.0-0.rc1.git0.4.fc19.x86_64. The Thinkpad returned
-ENOSPC, and no EFI variable setting was possible from then on.

The Compaq machine returned wild garbage, wrong data in EFI variables,
and could only be rescued by a firmware re-flash, which luckily
fixed it. A reset in the BIOS menu did not help at all.

The current deal of risk versus benefit for the pstore clearly suggests
to disable it. The hardware just can't deal with it, and we cannot brick
people's machines that way.

Comment 8 Matthew Garrett 2013-03-09 16:30:18 UTC
Yes. Once you fill the EFI variable space, you'll get -ENOSPC. You could do that without pstore. If things break in that scenario, things need fixing.

The Compaq case is more interesting, but again, if you can trigger this via pstore it can be triggered from userspace. Preventing Linux from using more than 50% of the storage space should prevent this from being an issue, and if it's not then we need to understand what's going wrong there. Please open a separate bug for that issue.

Comment 9 Kay Sievers 2013-03-09 16:41:37 UTC
What part of: "the patch you hope solves the problem was included, but it
still breaks the box" don't you understand?

The EFI variable facility is just too fragile on low-end hardware to be
filled by default by a crash dump facility, no matter what half-thought-through
percentage limit we apply here. We need to make that an opt-in and not a
default.

Comment 10 Matthew Garrett 2013-03-09 16:49:29 UTC
What part of "If pstore can do this, so can userspace" don't you understand? The Thinkpad case appears to be working exactly as intended. The Compaq needs fixing entirely independently of pstore, because if variable setting is breaking things even without the system running out of storage, it can probably be triggered with efibootmgr. Let's just concentrate on fixing the actual problem?

Comment 11 Kay Sievers 2013-03-09 17:04:20 UTC
There is a huge difference between a default facility that runs unconditionally
and is meant to fill up the flash storage, and random tools that in theory
could do the same if run in a loop by the user.

"The Thinkpad case appears to be working exactly as intended." The need
to reset the BIOS settings is intended? No, this is not how we do things!
It's just luck that it worked, maybe next time it will not.

The whole idea to use large parts of the too-cheap flash space in cheap
hardware is flawed at the core.

You might like the feature, but as we see here, it doesn't work, and it
probably will never work properly on these sorts of hardware. Fedora must
not risk breaking people's boxes with this totally questionable and 99%
useless feature.

The pstore needs to be enabled by the user, not by the distribution. We cannot
risk breaking hardware with default settings, it's as simple as that.

Comment 12 Matthew Garrett 2013-03-09 17:23:29 UTC
You ran out of variable space. Further attempts to create variables return -ENOSPC. Could you please describe what you expect to happen here? Did deleting variables not work? The -ENOSPC you're getting is synthesised by the kernel rather than coming from the firmware, so it really shouldn't correspond to any other unexpected behaviour.

The Compaq case is concerning, but we need to fix the underlying bug instead of just disabling a feature that makes it easier to trigger. You're right that we need to prevent Fedora from damaging machines, but we need to do it properly rather than papering over it. You have a machine that exhibits problematic behaviour and you have the ability to recover it, so let's concentrate on working out what's actually going wrong? The time you've taken arguing on this bug would conceivably have been enough to figure it out.
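
For reference, this is roughly how a userspace tool could tell the -ENOSPC case apart from other failures. It is a sketch only: `try_write` and its `write_fn` argument are hypothetical names, and `write_fn` stands in for whatever call actually writes the variable.

```python
import errno


def try_write(write_fn, data):
    """Call write_fn(data) and classify the outcome.

    Returns 'ok' on success and 'no-space' when the write fails with
    ENOSPC; any other OSError is re-raised unchanged.
    """
    try:
        write_fn(data)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return "no-space"
        raise
    return "ok"
```

Treating ENOSPC as a distinct, expected outcome makes it easier to report "the variable store is full" separately from firmware misbehaviour such as corrupted variable contents.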

Comment 13 Lingzhu Xiang 2013-03-11 14:48:27 UTC
The price of figuring out what's wrong here is often a bit too high. I bricked a Thinkpad T520 with some random efibootmgr stress test months ago (kernel 3.5). It could still boot, but only after I tried to reinstall did I find that the firmware was crippled and couldn't create any variables. Then it was really physically bricked and it took a month to replace the motherboard. I wanted to report a bug, but I couldn't even tell what exactly caused the problem without risking another motherboard by reproducing it. It might be the out-of-space firmware bug. I can't verify that.

> Making the maximum size userspace tweakable might be nice too.

kmsg_bytes=2048? Installation has to write certain amount of data as boot option variables anyway. Writing that amount should be well tested and safe.
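
A rough illustration of what such a cap means in practice: keep only the most recent `kmsg_bytes` of the log and split that into fixed-size records. This is a sketch under assumptions, not the pstore implementation; the function name and the 1024-byte record size are hypothetical.

```python
def cap_and_chunk(log, kmsg_bytes=2048, record_size=1024):
    """Keep the last kmsg_bytes of log and split it into records.

    log is a bytes object holding the kernel log. record_size is an
    assumed per-record limit, chosen only for illustration.
    """
    # The newest messages are at the end, so keep the tail of the log.
    tail = log[-kmsg_bytes:] if kmsg_bytes > 0 else b""
    return [tail[i:i + record_size] for i in range(0, len(tail), record_size)]
```

With these defaults a crash would write at most two 1024-byte records, regardless of how long the kernel log is.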

Comment 14 Oleg Shirochenkov 2013-03-25 13:17:18 UTC
(In reply to comment #0)
> One of the laptops (Compaq CQ58) I was only able to recover from that
> by re-flashing the firmware, all resetting of the BIOS or trying to
> manually delete variables did not fix anything. Even the EFI shell froze
> when trying to access the variables.
Hi, Kay Sievers.
Have you tested this fix (68d929862e29a8b52a7f2f2f86a0600423b093cd)?
Are you able to edit the EFI variables and load the kernel after applying the patch?
I am not, on the same HP Compaq Presario CQ58, using the 3.8.3 kernel.
I've found similar reports for other hardware:
Bug 462705 – sys-boot/efibootmgr with kernel >3.8.2 - `efibootmgr -o x,y,z' does not set boot order - https://bugs.gentoo.org/show_bug.cgi?id=462705
Bug 55471 – efivars.c: fail for write boot entry - https://bugzilla.kernel.org/show_bug.cgi?id=55471
Affected users that can't boot using EFI Stub with this fix / Arch Linux Forums - https://bbs.archlinux.org/viewtopic.php?pid=1249131#p1249131

Comment 15 Kay Sievers 2013-03-25 13:46:39 UTC
That all has happened with the above commit included, and still happens.

The firmware, even in the "corrupted" state, seems to be able to boot Windows,
and from Windows I can re-flash the firmware. Sometimes, after a couple
of reboots or a second flash, a full reset seems to happen and all works
from there again. I did not find any other way so far to recover from that
state.

Pstore is just not to be enabled by default on such low-end hardware,
no heuristic can work around that issue. Pstore provides almost zero value for
ordinary users, but risks to "destroy" their hardware. I really don't
understand what kind of game Fedora tries to play here and what kind of deal
it is that we think we are doing? It all makes absolutely zero sense to me.

But hey, I'm running a kernel without "pstore laptop bricking technology"
on that box now, I will not support that kind of fiddling with my hardware
anymore, and I'm not interested in fine-tuning how fast or how bad we can
break things -- seems they will in any case. I really have more interesting things to work on.

Unconditionally writing large data to the firmware seems not a valid
approach for low-end hardware, and again, no heuristics with assumptions
about free space will solve that issue.

Comment 16 Matthew Garrett 2013-03-25 14:32:18 UTC
Kay,

You can either actually help us work on this bug or we can close it. You have supplied basically none of the requested information. pstore does nothing magic, and if it can cause hardware to misbehave then so can other tools that we ship. Turning it off reduces our ability to debug crashes and doesn't protect our users. So, could you please actually describe the precise problem you're having with each machine?

Comment 17 Kay Sievers 2013-03-25 15:48:38 UTC
There is nothing missing here. It's all described in the text above.
What you hope for just does not seem to be true on real, cheap hardware. No
matter how often you repeat that all will be fine, it will not.

But hey, just wait for others to run into the same issues, and you will find
out. They will not open bugs, but just tell that something went wrong with
the machine after Fedora crashed.

The difference between "other tools" and pstore is the overly extensive use
of the variable store, nothing else on a Linux system, or any other operating
system would ever do that. Even "the half of the available storage" is so much
more than any other common use case. So pstore is surely the "magic"
to trigger the "bricking", and there is nothing that would prevent that
from happening on these cheap and broken firmwares.

You can repeat "but it's according to the spec" as many times as you like,
everybody knows how trustworthy that is when it comes to firmware, it will
not fix the cheap hardware and broken firmwares out there, and heuristics
will only make it more or less likely to happen but never solve the
underlying problem, which is a very serious one: it breaks normal operation
of machines.

And no, I'm not filling up the EFI variable store now to see when the
boxes crash again, they will, and that is all described above. I will
no longer take part in that experiment, I don't see any value in it.

I disabled pstore and will not use it again on these boxes. And Fedora
should just disable it by default too, instead of continuing this
dangerous fiddling.

Comment 18 Matthew Garrett 2013-03-25 15:58:57 UTC
Kay,

You're being wilfully unhelpful. I'm clearly not arguing that the spec has anything to do with this. I'm arguing that it's the role of the kernel to prevent the system from ending up in an unusable state, and simply disabling pstore doesn't do that. If you want to help us fix the problem, great. But since you don't, I'm closing the bug.

Comment 19 Kay Sievers 2013-03-25 16:05:07 UTC
Stop closing Fedora bug please. This is not YOUR decision, and it will
never be.

The current kernel renders boxes unbootable, and this needs to be fixed.

Comment 20 Matthew Garrett 2013-03-25 16:06:57 UTC
? I'm one of the kernel maintainers. It's absolutely my decision.

Comment 21 Kay Sievers 2013-03-25 16:32:52 UTC
Stop closing this bug, which is known to create problems, for no other reason than your personal agenda.

This feature is very risky and almost 100% useless and needless for
ordinary users. There is no balance between gain and risk, that feature
needs to be disabled by default.

Sticking the head in the sand and praying all will work out is not how we
work, safety trumps features, especially such dangerous and not-really-useful ones.

And no, you are not in the position to make that decision, because "your"
feature is the reason for the issues, and you don't get to decide to ignore
the problems you are creating for our users.

In case you still don't understand that this stuff is dangerous and nothing
that ordinary Fedora users should ever run into, I will escalate the decision
about the granted privileges you currently own and which you clearly misuse
here.

Comment 22 Matthew Garrett 2013-03-25 16:35:06 UTC
Kay,

You have provided no information that lets us resolve the underlying bug and you've indicated that you don't intend to in future. If there are specific issues then open bugs about those specific issues and we'll resolve them, but there is absolutely nothing actionable in this bug.

Comment 23 Oleg Shirochenkov 2013-03-25 16:44:41 UTC
Dear Matthew, I'm sorry, I don't want to start a war, and I understand that open source is free, but I want to highlight again: commit 68d929862e29a8b52a7f2f2f86a0600423b093cd looks like a regression.
It prevents booting and it prevents writes to efivars. A lot of hardware is affected, but since kernel 3.8.3 is still fresh, I can't point to hundreds of bugs.
Also, if Kay is sure that it affects at least the HP Compaq Presario CQ58, I could try to fill the EFI table if that can help to solve the issue, but I don't know how many entries should be written.

Comment 24 Matthew Garrett 2013-03-25 16:48:01 UTC
Oleg,

I'm completely happy to work on that (and am actually doing so right now). Do you already have an open bug for this?

Comment 25 Oleg Shirochenkov 2013-03-25 16:53:26 UTC
Matthew, I'm not a Red Hat / Fedora user or employee :)
The bug is listed in upstream Bug 55471 – efivars.c: fail for write boot entry - https://bugzilla.kernel.org/show_bug.cgi?id=55471

Comment 26 Matthew Garrett 2013-03-25 16:58:13 UTC
Kay,

You're not listed as a member of the kernel ACL. Please stop modifying the state of this bug.

Comment 27 Harald Hoyer 2013-04-29 09:31:35 UTC
I was bitten by this bug on a Thinkpad T420S.
I could not even boot into my BIOS. 
Installation of a new F19 Alpha screwed up the EFI.

Only via tricks I was able to recover. Please don't mess with the BIOS!!!

Comment 28 Harald Hoyer 2013-04-29 09:33:22 UTC
(In reply to comment #27)
> I was bitten by this bug on a Thinkpad T420S.
> I could not even boot into my BIOS. 
> Installation of a new F19 Alpha screwed up the EFI.
> 
> Only via tricks I was able to recover. Please don't mess with the BIOS!!!

Some kernel oops in the btrfs made the kernel write to pstore and several attempts to recover the files apparently made it fill up the space.

Comment 29 Kay Sievers 2013-04-29 11:52:48 UTC
This is getting ridiculous.

You guys can obviously not fix it, so stop NOW letting Fedora actively
destroy setups/laptops with this completely idiotic feature!

"Cheap" hardware's firmware is obviously not meant to be used that way,
and if people want to take that risk, they need to enable this feature,
which no normal user needs anyway.

I have the same issues still on the HP laptop with the rawhide kernel.

Comment 30 Josh Boyer 2013-04-29 12:54:24 UTC
(In reply to comment #27)
> I was bitten by this bug on a Thinkpad T420S.
> I could not even boot into my BIOS. 
> Installation of a new F19 Alpha screwed up the EFI.
> 
> Only via tricks I was able to recover. Please don't mess with the BIOS!!!

That doesn't make any sense.  See below.

(In reply to comment #29)
> This is getting ridiculous.
> 
> You guys can obviously not fix it, so stop NOW letting Fedora actively
> destroy setups/laptops with this completely idiotic feature!

Sigh.  Could you look at the configs before blindly yelling?

> "Cheap" hardware's firmware is obviously not meant to be used that way,
> and if people want to take that risk, they need to enable this feature,
> which no normal user needs anyway.
> 
> I have the same issues still on the HP laptop with the rawhide kernel.

Then it's unrelated to EFI pstore.  That option has been disabled by default since Fedora commit:

commit 4c578540af6ca542bab88582c9cf6dd02b4cfb10
Author: Justin M. Forbes <jforbes>
Date:   Tue Apr 2 07:34:22 2013 -0500

    Linux v3.9-rc5


on F19 and 

commit 6e429107c2fa5b88eabcfda9db729eea93095534
Author: Justin M. Forbes <jforbes>
Date:   Mon Apr 1 11:15:57 2013 -0500

    Linux v3.9-rc5

on rawhide.

Comment 31 Harald Hoyer 2013-04-29 13:04:05 UTC
(In reply to comment #30)
> (In reply to comment #27)
> > I was bitten by this bug on a Thinkpad T420S.
> > I could not even boot into my BIOS. 
> > Installation of a new F19 Alpha screwed up the EFI.
> > 
> > Only via tricks I was able to recover. Please don't mess with the BIOS!!!
> 
> That doesn't make any sense.  See below.

Thanks! In the process of finding a kernel that does not oops with my btrfs, I might have booted into older rawhide kernels.

Comment 32 Kay Sievers 2013-04-29 14:58:47 UTC
(In reply to comment #30)

> > You guys can obviously not fix it, so stop NOW letting Fedora actively
> > destroy setups/laptops with this completely idiotic feature!
> 
> Sigh.  Could you look at the configs before blindly yelling?

Oh, sorry, you are right. I did not check the most recent kernel, I just did
not expect any change here to have happened. :)

Talking to Harald today brought back the mood of the time I reported the
bug. The reactions and non-reactions and how we seem to argue about theories
but ignore that we actively kill boxes with this facility, made me very
angry.

> > "Cheap" hardware's firmware is obviously not meant to be used that way,
> > and if people want to take that risk, they need to enable this feature,
> > which no normal user needs anyway.
> > 
> > I have the same issues still on the HP laptop with the rawhide kernel.
> 
> Then it's unrelated to EFI pstore.  That option has been disabled by default
> since Fedora commit:

It was. On the HP laptop there is still a kernel slightly older than that.
As said, I did not expect it to be gone. Great, and happy that this is
solved now.

Thanks and sorry again, I'll try to do better next time.

