Bug 516909 - KSM breaks encryption 157 > kernel > 139 - KSM support now disabled
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Priority: high
Severity: high
Assigned To: Justin M. Forbes
QA Contact: Fedora Extras Quality Assurance
Duplicates: 517545
Blocks: F12VirtBlocker
Reported: 2009-08-11 16:52 EDT by Warren Togami
Modified: 2009-08-27 09:24 EDT
CC: 10 users

Doc Type: Bug Fix
Last Closed: 2009-08-27 09:24:48 EDT


Attachments
attempted fix (last one was wrong diff) (3.05 KB, patch)
2009-08-25 11:27 EDT, Andrea Arcangeli
new proposed patch (5.09 KB, patch)
2009-08-25 15:52 EDT, Andrea Arcangeli

Description Warren Togami 2009-08-11 16:52:01 EDT
1) Upgrade to grubby-7.0.2-1.fc12.x86_64
2) Install grubby-7.0.2-1.fc12.x86_64

title Fedora (2.6.31-0.145.rc5.git3.fc12.x86_64)
	root (hd0,0)
	kernel /vmlinuz-2.6.31-0.145.rc5.git3.fc12.x86_64 ro root=/dev/mapper/vg0-rootfs rhgb quiet usbcore.autosuspend=1 SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us rd_plytheme=charge
	initrd /initrd-generic-2.6.31-0.145.rc5.git3.fc12.x86_64.img

This stanza was written by new-kernel-pkg.  Note that it is booting the initrd-generic shipped within the kernel RPM.

3) Attempt boot.  My LVM vg is encrypted, with rootfs on a lv within that vg.  I type my passphrase, it successfully unlocks it, but then sits there forever.

key slot 0 unlocked.
                    Command successful.
Comment 1 Warren Togami 2009-08-11 17:13:50 EDT
oops, step #2 is install kernel-2.6.31-0.145.rc5.git3.fc12.x86_64

This is with dracut-0.8-1.fc12.noarch.  I tried to create a new initrd image with dracut, but that image exhibits the same problem as initrd-generic.
Comment 2 Warren Togami 2009-08-11 18:07:15 EDT
dracut on kernel-2.6.31-0.125.rc5.git2.fc12.x86_64 works.  This seems to be a problem with kernel-2.6.31-0.145.rc5.git3.fc12.x86_64.  Reassigning.
Comment 3 Warren Togami 2009-08-11 19:31:41 EDT
kernel-2.6.31-0.145.2.1.rc5.git3.fc12.x86_64 is broken in the same manner.
Comment 4 Sven Lankes 2009-08-12 06:58:46 EDT
Same here. The last kernel that is working for me is kernel-2.6.31-0.139.rc5.git3.fc12.x86_64
Comment 5 Warren Togami 2009-08-12 12:42:59 EDT
kernel-2.6.31-0.149.rc5.git3.fc12.x86_64 mkinitrd FAIL
kernel-2.6.31-0.149.rc5.git3.fc12.x86_64 dracut   FAIL
Comment 6 Warren Togami 2009-08-12 18:31:46 EDT
I built a LiveCD with kernel-2.6.31-0.149.rc5.git3.fc12.x86_64 + dracut.  It gets stuck forever without any error messages and just fails to boot.  It seems this has nothing to do with encrypted root.
Comment 7 Warren Togami 2009-08-12 18:58:06 EDT
GOOD kernel-2.6.31-0.139.rc5.git3.fc12.x86_64 mkinitrd
GOOD kernel-2.6.31-0.139.rc5.git3.fc12.x86_64 dracut
FAIL kernel-2.6.31-0.142.rc5.git3.fc12.x86_64 mkinitrd
FAIL kernel-2.6.31-0.142.rc5.git3.fc12.x86_64 dracut

Confirmed, it broke somewhere between 139 and 142.
Comment 8 Dennis Gilmore 2009-08-12 19:21:20 EDT
kernel-2.6.31-0.149.rc5.git3.fc12.sparc64 works just fine here.  I'm using unencrypted LVM.

dracut-0.7-4.fc12  was used in the kernel build
Comment 9 Warren Togami 2009-08-13 15:39:17 EDT
I'm confused.  The very same livecd of Comment #6 works today, but the kernel installed on my laptop silently gets stuck after unlocking the encrypted disk.

Sven, are you using encryption?  Encrypted LVM VG specifically?
Comment 10 Warren Togami 2009-08-13 17:00:37 EDT
http://people.redhat.com/wtogami/temp/post139loop.jpg
SysRQ-p after it gets stuck.  It appears to be stuck in a loop.
Comment 11 Sven Lankes 2009-08-14 03:16:35 EDT
I am using encryption. 

I can reproduce those hangs on two machines - both using full vg encryption.
Comment 12 Sven Lankes 2009-08-15 16:44:33 EDT
If I'm not mistaken, F12Alpha is going to ship with a kernel >139, which would mean full disk encryption is broken for the Alpha. As this is a rather important feature, I think blocking on F12Alpha is warranted.
Comment 13 Sven Lankes 2009-08-15 16:45:51 EDT
This seems to be the "non-boot-side" of the same bug:

https://bugzilla.redhat.com/show_bug.cgi?id=517545
Comment 14 Tom London 2009-08-15 17:10:03 EDT
Yeah, I believe I can reproduce a similar issue by plugging in a USB hard drive with a LUKS-encrypted file system:
https://bugzilla.redhat.com/show_bug.cgi?id=517545

As reported there, works for 0.139, fails for later kernels up to and including
kernel-2.6.31-0.156.rc6.fc12.x86_64.

All these kernels boot fine on my unencrypted LVM, but exhibit "cryptsetup
won't die and consumes available cpu cycles".

I've posted SysRQ-p traces there....

Same issue?
Comment 15 Milan Broz 2009-08-15 19:40:32 EDT
It works with Linus' kernel; the patches that introduced the problem in Fedora are:

Kernel Samepage Merging (KSM).
 linux-2.6-ksm.patch
 linux-2.6-ksm-updates.patch

Quite a serious bug: probably all encrypted systems are unbootable now.
Comment 16 Milan Broz 2009-08-15 19:42:08 EDT
*** Bug 517545 has been marked as a duplicate of this bug. ***
Comment 17 Sven Lankes 2009-08-16 07:17:59 EDT
Both my rawhide machines are back to working state with kernel -157 (which disables the ksm-patches).
Comment 18 Milan Broz 2009-08-16 09:19:36 EDT
From the included KSM series, the problematic patch is this one:
Subject: [PATCH 9/12] ksm: fix oom deadlock

(fixes one deadlock...and introduces another one:-)
Comment 19 Tom London 2009-08-16 12:22:22 EDT
-157 fixes my "plugging in a USB hard drive with encrypted FS" issue.

FS now mounts and cryptsetup has properly exited.
Comment 20 Mark McLoughlin 2009-08-17 11:11:29 EDT
The F12 Alpha kernel is kernel-2.6.31-0.125.4.2.rc5.git2.fc12, so I'm removing this from the Alpha blocker list.
Comment 21 Mark McLoughlin 2009-08-18 06:36:00 EDT
Summary:

  - '[PATCH 9/12] ksm: fix oom deadlock' appears to cause deadlock with an
    encrypted root volume

  - This was added in 2.6.31-0.141.rc5.git3 by the addition of this set
    of KSM patches:

  http://cvs.fedoraproject.org/viewvc/rpms/kernel/devel/linux-2.6-ksm-updates.patch?revision=1.1

  - the KSM patches have been disabled since 2.6.31-0.157.rc6, pending
    a fix for this
Comment 22 Milan Broz 2009-08-18 06:44:49 EDT
>   - '[PATCH 9/12] ksm: fix oom deadlock' appears to cause deadlock with an
>     encrypted root volume

FYI: no need to have encrypted root volume, any "cryptsetup luksOpen" on x86_64 will cause deadlock, for process backtrace see bug 517545.
Comment 23 Mark McLoughlin 2009-08-18 12:03:17 EDT
Andrea suggests checking whether these programs are calling madvise() with bogus flags
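For context on the madvise() question: a process opts a memory region into KSM scanning by calling madvise() on it with MADV_MERGEABLE. The sketch below shows that call on an anonymous mapping. It assumes Linux with glibc headers that define MADV_MERGEABLE (the #ifdef falls back to EINVAL otherwise); map_mergeable is a hypothetical helper, not something cryptsetup or the kernel provides. A kernel built without KSM rejects the advice with EINVAL.

```c
#define _DEFAULT_SOURCE            /* exposes MADV_MERGEABLE in glibc */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `pages` anonymous pages and register them
 * with KSM. Returns 0 and stores the address in *out on success,
 * otherwise the errno value from mmap() or madvise(). */
static int map_mergeable(size_t pages, void **out)
{
    size_t len = pages * (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;
#ifdef MADV_MERGEABLE
    if (madvise(p, len, MADV_MERGEABLE) != 0) {
        int err = errno;          /* EINVAL: kernel built without KSM */
        munmap(p, len);
        return err;
    }
    *out = p;
    return 0;
#else
    munmap(p, len);               /* headers predate KSM */
    return EINVAL;
#endif
}
```

Registration can later be undone with MADV_UNMERGEABLE, or simply by unmapping the region.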
Comment 24 Milan Broz 2009-08-18 12:44:27 EDT
(In reply to comment #23)
> Andrea suggests checking whether these programs are calling madvise() with
> bogus flags

Not explicitly, but it probably forgets to unlock memory; try this code:

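#include <sys/mman.h>

int main(void)
{
        /* Lock all current and future pages of the process into RAM. */
        mlockall(MCL_CURRENT | MCL_FUTURE);

        /* munlockall() is intentionally left commented out: the process
         * exits with its memory still locked, which is what hangs in
         * exit_mmap() on the affected KSM kernels. */
//      munlockall();

        return 0;
}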
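Because the hang happens in the kernel's exit path, one way to check a kernel against the reproducer above is to run it in a forked child and see whether the child is reaped within a timeout. A minimal sketch, assuming a POSIX system; child_exits_within and mlock_and_exit are hypothetical helpers and the 100 ms polling interval is arbitrary. If mlockall() fails (for example because RLIMIT_MEMLOCK is too small), nothing is locked and the test cannot show the hang, so run it with a sufficient limit or as root.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Child body: lock all memory and exit without munlockall(),
 * mirroring the reproducer above. */
static void mlock_and_exit(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");       /* e.g. RLIMIT_MEMLOCK too small */
    _exit(0);
}

/* Returns 1 if the child exited within timeout_sec, 0 if it was still
 * running (hang suspected), -1 on fork/wait error. */
static int child_exits_within(void (*body)(void), unsigned timeout_sec)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        body();
        _exit(0);
    }

    struct timespec tick = { 0, 100 * 1000 * 1000 };   /* 100 ms */
    for (unsigned i = 0; i < timeout_sec * 10; i++) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return 1;
        if (r < 0 && errno != EINTR)
            return -1;
        nanosleep(&tick, NULL);
    }
    /* A child stuck inside exit_mmap() cannot be reaped, so send
     * SIGKILL but do not block waiting for it. */
    kill(pid, SIGKILL);
    return 0;
}
```

A call such as child_exits_within(mlock_and_exit, 10) returning 0 would point at the exit-path hang; on a fixed kernel it returns 1.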
Comment 25 Andrea Arcangeli 2009-08-25 11:02:29 EDT
Investigating why the troublesome checks that deadlock mlocked programs were added to the page fault path. At first glance they look unnecessary, so I'm asking just in case:

Date: Tue, 25 Aug 2009 16:58:32 +0200
From: Andrea Arcangeli <aarcange@redhat.com>                                                               
To: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>, Rik van Riel <riel@redhat.com>,
        Chris Wright <chrisw@redhat.com>,
        Nick Piggin <nickpiggin@yahoo.com.au>,
        Andrew Morton <akpm@linux-foundation.org>,
        linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 9/12] ksm: fix oom deadlock                                                            

On Mon, Aug 03, 2009 at 01:18:16PM +0100, Hugh Dickins wrote:
> tables which have been freed for reuse; and even do_anonymous_page
> and __do_fault need to check they're not being called by break_ksm
> to reinstate a pte after zap_pte_range has zapped that page table.

This deadlocks exit_mmap in an infinite loop when there's some region
locked. mlock calls gup and pretends to page fault successfully if
there's a vma existing on the region, but it doesn't page fault
anymore because of the mm_count being 0 already, so follow_page fails
and gup retries the page fault forever. And generally I don't like to
add those checks to page fault fast path.

Given we check mm_users == 0 (ksm_test_exit) after taking mmap_sem in
unmerge_and_remove_all_rmap_items, why do we actually need to care
that a page fault happens? We hold mmap_sem so we're guaranteed to see
mm_users == 0 and we won't ever break COW on that mm with mm_users ==
0 so I think those troublesome checks from page fault can be simply
removed.
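The loop Andrea describes can be reduced to a toy model: the fault handler reports success but installs no page once the mm is exiting, follow_page keeps failing, and the gup-style loop retries. In the sketch below, struct model_mm, model_handle_fault, model_follow_page and model_gup are hypothetical and only echo the kernel names; the model keys on mm_users (the field ksm_test_exit checks, per the mail), and max_retries exists only so the demo terminates, whereas the real loop had no bound.

```c
#include <stdbool.h>

/* Toy stand-in for an mm_struct: only the two facts the loop needs. */
struct model_mm {
    int mm_users;       /* drops to 0 when the owning process exits */
    bool page_present;  /* whether a pte maps the page */
};

/* Fault path. With exit_check set (modelling the checks added by
 * "ksm: fix oom deadlock"), an exiting mm gets no pte installed, yet the
 * fault is still treated as handled. */
static void model_handle_fault(struct model_mm *mm, bool exit_check)
{
    if (exit_check && mm->mm_users == 0)
        return;
    mm->page_present = true;
}

/* follow_page analogue: succeeds only if the page is mapped. */
static bool model_follow_page(const struct model_mm *mm)
{
    return mm->page_present;
}

/* gup-style loop: look the page up, fault it in if missing, retry.
 * Returns the number of faults taken, or -1 if max_retries ran out,
 * which is where the real loop would spin forever. */
static int model_gup(struct model_mm *mm, bool exit_check, int max_retries)
{
    for (int faults = 0; faults < max_retries; faults++) {
        if (model_follow_page(mm))
            return faults;
        model_handle_fault(mm, exit_check);
    }
    return -1;
}
```

With exit_check set and mm_users at 0 the loop never makes progress; without the check, a single fault maps the page and the lookup succeeds.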
Comment 28 Andrea Arcangeli 2009-08-25 11:27:22 EDT
Created attachment 358588
attempted fix (last one was wrong diff)
Comment 30 Andrea Arcangeli 2009-08-25 15:52:15 EDT
Created attachment 358624
new proposed patch

this is actually making ksm_exit simpler and it already contains down_write(mmap_sem)

(also this time I checked which workstation I'm running firefox on, before picking a random file from /tmp ;)

discussion is going live on linux-mm with Hugh
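The approach in Andrea's mail relies on mmap_sem rather than on checks in the fault path: the KSM side re-checks mm_users while holding mmap_sem for read, and the exit side takes mmap_sem for write (ksm_exit's down_write(mmap_sem)) after mm_users has dropped to zero, so any KSM reader that saw the old value has finished before teardown. The sketch below models that handshake in userspace with a pthread rwlock standing in for mmap_sem. struct model_mm, model_break_cow and model_exit are hypothetical names; taking and immediately releasing the write lock is an assumption of the sketch, not a transcription of attachment 358624.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for an mm_struct; mmap_sem is modelled by an rwlock. */
struct model_mm {
    atomic_int mm_users;         /* 0 once the owner is exiting */
    atomic_int cow_breaks;       /* COW breaks done by the ksmd side */
    pthread_rwlock_t mmap_sem;
};

static int model_mm_init(struct model_mm *mm)
{
    atomic_init(&mm->mm_users, 1);
    atomic_init(&mm->cow_breaks, 0);
    return pthread_rwlock_init(&mm->mmap_sem, NULL);
}

/* ksmd side: work on the mm only with mmap_sem held for read, and
 * re-check mm_users (the ksm_test_exit idea) before touching it. */
static bool model_break_cow(struct model_mm *mm)
{
    bool done = false;

    pthread_rwlock_rdlock(&mm->mmap_sem);
    if (atomic_load(&mm->mm_users) != 0) {
        atomic_fetch_add(&mm->cow_breaks, 1);
        done = true;
    }
    pthread_rwlock_unlock(&mm->mmap_sem);
    return done;
}

/* Exit side: once mm_users is 0, taking mmap_sem for write waits until
 * any ksmd reader that saw the old value has left; after that, no COW
 * break can run against this mm, so page tables can be torn down. */
static void model_exit(struct model_mm *mm)
{
    atomic_store(&mm->mm_users, 0);
    pthread_rwlock_wrlock(&mm->mmap_sem);
    pthread_rwlock_unlock(&mm->mmap_sem);
}
```

The design point is that the exiting side pays for one write-lock round trip instead of every page fault paying for an extra check.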
Comment 31 Andrea Arcangeli 2009-08-27 05:34:22 EDT
Hugh acked my attachment 358624, so please apply it and then we can close this bug. We still have some issues to discuss on linux-mm regarding OOM handling with KSM, but those aren't critical, and once we solve them, patches will flow into rawhide.

thanks!
Comment 32 Justin M. Forbes 2009-08-27 09:24:48 EDT
Already applied and should be in kernel-2.6.31-0.180.rc7.git4.fc12 today.  KSM has been re-enabled.
