Bug 516909
Field | Value
---|---
Summary | KSM breaks encryption 157 > kernel > 139 - KSM support now disabled
Product | [Fedora] Fedora
Component | kernel
Version | rawhide
Hardware | All
OS | Linux
Status | CLOSED RAWHIDE
Severity | high
Priority | high
Reporter | Warren Togami <wtogami>
Assignee | Justin M. Forbes <jforbes>
QA Contact | Fedora Extras Quality Assurance <extras-qa>
CC | aarcange, bruno, harald, itamar, kernel-maint, markmc, mbroz, selinux, sven, virt-maint
Doc Type | Bug Fix
Last Closed | 2009-08-27 13:24:48 UTC
Bug Blocks | 498968

Description
Warren Togami
2009-08-11 20:52:01 UTC
oops, step #2 is install kernel-2.6.31-0.145.rc5.git3.fc12.x86_64

This is with dracut-0.8-1.fc12.noarch.

I tried to create a new initrd image with dracut, but that image exhibits the same problem as initrd-generic. dracut on kernel-2.6.31-0.125.rc5.git2.fc12.x86_64 works. This seems to be a problem with kernel-2.6.31-0.145.rc5.git3.fc12.x86_64. Reassigning.

kernel-2.6.31-0.145.2.1.rc5.git3.fc12.x86_64 is broken in the same manner.

Same here. The last kernel that is working for me is kernel-2.6.31-0.139.rc5.git3.fc12.x86_64

kernel-2.6.31-0.149.rc5.git3.fc12.x86_64 mkinitrd FAIL
kernel-2.6.31-0.149.rc5.git3.fc12.x86_64 dracut FAIL

I built a LiveCD with kernel-2.6.31-0.149.rc5.git3.fc12.x86_64 + dracut. It gets stuck forever without any error messages and just fails to boot. It seems this has nothing to do with encrypted root.

GOOD kernel-2.6.31-0.139.rc5.git3.fc12.x86_64 mkinitrd
GOOD kernel-2.6.31-0.139.rc5.git3.fc12.x86_64 dracut
FAIL kernel-2.6.31-0.142.rc5.git3.fc12.x86_64 mkinitrd
FAIL kernel-2.6.31-0.142.rc5.git3.fc12.x86_64 dracut

Confirmed, it broke somewhere between 139 and 142.

kernel-2.6.31-0.149.rc5.git3.fc12.sparc64 works just fine here. I'm using unencrypted LVM.

dracut-0.7-4.fc12 was used in the kernel build

I'm confused. The very same livecd of Comment #6 works today, but the kernel installed on my laptop silently gets stuck after unlocking the encrypted disk.

Sven, are you using encryption? Encrypted LVM vg specifically?

http://people.redhat.com/wtogami/temp/post139loop.jpg
SysRQ-p after it gets stuck. It appears to be stuck in a loop.

I am using encryption. I can reproduce those hangs on two machines - both using full vg encryption.

If I'm not mistaken F12Alpha is going to ship with a kernel >139 and that would mean full disk encryption is broken for alpha. As this is a rather important feature I think blocking on F12Alpha is warranted.
This seems to be the "non-boot-side" of the same bug: https://bugzilla.redhat.com/show_bug.cgi?id=517545

Yeah, believe I can reproduce a similar issue by plugging in a USB hard drive with a LUKS encrypted file system: https://bugzilla.redhat.com/show_bug.cgi?id=517545 As reported there, works for 0.139, fails for later kernels up to and including kernel-2.6.31-0.156.rc6.fc12.x86_64. All these kernels boot fine on my unencrypted LVM, but exhibit "cryptsetup won't die and consumes available cpu cycles". I've posted SysRQ-p traces there.... Same issue?

It works with Linus' kernel. The patches which introduced the problem in Fedora are the Kernel Samepage Merging (KSM) ones:
linux-2.6-ksm.patch
linux-2.6-ksm-updates.patch
Quite a serious bug, probably all encrypted systems are not bootable now.

*** Bug 517545 has been marked as a duplicate of this bug. ***

Both my rawhide machines are back to working state with kernel -157 (which disables the ksm-patches).

From the included KSM series, the problematic patch is this one:
Subject: [PATCH 9/12] ksm: fix oom deadlock
(fixes one deadlock...and introduces another one:-)

-157 fixes my "plugging in a USB hard drive with encrypted FS" issue. FS now mounts and cryptsetup has properly exited.

The F12 Alpha kernel is kernel-2.6.31-0.125.4.2.rc5.git2.fc12, so removing this from the alpha blocker

Summary:
- '[PATCH 9/12] ksm: fix oom deadlock' appears to cause deadlock with an encrypted root volume
- This was added in 2.6.31-0.141.rc5.git3 by the addition of this set of KSM patches: http://cvs.fedoraproject.org/viewvc/rpms/kernel/devel/linux-2.6-ksm-updates.patch?revision=1.1
- the KSM patches have been disabled since 2.6.31-0.157.rc6 pending a fix for this

> - '[PATCH 9/12] ksm: fix oom deadlock' appears to cause deadlock with an
> encrypted root volume

FYI: no need to have encrypted root volume, any "cryptsetup luksOpen" on x86_64 will cause deadlock, for process backtrace see bug 517545.
Andrea suggests checking whether these programs are calling madvise() with bogus flags

(In reply to comment #23)
> Andrea suggests checking whether these programs are calling madvise() with
> bogus flags

Not explicitly, but probably forgot to unlock memory - try this code:

    #include <sys/mman.h>

    int main (int argc, char *argv[])
    {
        mlockall(MCL_CURRENT | MCL_FUTURE);
        // munlockall();
        return 0;
    }

Investigating why those troublesome checks that deadlock mlocked programs were added to the page fault path... at first glance they look unnecessary, so asking just in case...
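For anyone experimenting with the mlockall() reproducer in the comment above, the following is a minimal sketch (not taken from cryptsetup or from any patch in this bug) of how a program could inspect its locked memory and release it before exiting. The helper names `locked_kb` and `lock_then_unlock` are hypothetical, and the sketch assumes a Linux /proc filesystem that exposes a `VmLck` line in /proc/self/status. It does not fix the kernel-side deadlock discussed here; it only makes the locked state visible.

```c
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper: return the VmLck value (in kB) from
 * /proc/self/status, or -1 if the file or the field is unavailable. */
long locked_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "VmLck: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}

/* Hypothetical helper: lock all current and future mappings, report the
 * locked size, then unlock again. Returns 0 on success and -1 if either
 * call fails (for example when RLIMIT_MEMLOCK is too low). */
int lock_then_unlock(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return -1;
    }
    printf("locked: %ld kB\n", locked_kb());

    if (munlockall() != 0) {
        perror("munlockall");
        return -1;
    }
    printf("after munlockall: %ld kB\n", locked_kb());
    return 0;
}
```

Calling `lock_then_unlock()` from `main` and then returning mirrors the reproducer with the `munlockall()` line enabled. Whether that avoids the hang on an affected kernel is an assumption based on the commented-out line, not something confirmed in this report.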
Date: Tue, 25 Aug 2009 16:58:32 +0200
From: Andrea Arcangeli <aarcange>
To: Hugh Dickins <hugh.dickins.uk>
Cc: Izik Eidus <ieidus>, Rik van Riel <riel>,
Chris Wright <chrisw>,
Nick Piggin <nickpiggin.au>,
Andrew Morton <akpm>,
linux-kernel.org, linux-mm
Subject: Re: [PATCH 9/12] ksm: fix oom deadlock
On Mon, Aug 03, 2009 at 01:18:16PM +0100, Hugh Dickins wrote:
> tables which have been freed for reuse; and even do_anonymous_page
> and __do_fault need to check they're not being called by break_ksm
> to reinstate a pte after zap_pte_range has zapped that page table.
This deadlocks exit_mmap in an infinite loop when there's some region
locked. mlock calls gup and pretends to page fault successfully if
there's a vma existing on the region, but it doesn't page fault
anymore because of the mm_count being 0 already, so follow_page fails
and gup retries the page fault forever. And generally I don't like to
add those checks to page fault fast path.
Given we check mm_users == 0 (ksm_test_exit) after taking mmap_sem in
unmerge_and_remove_all_rmap_items, why do we actually need to care
that a page fault happens? We hold mmap_sem so we're guaranteed to see
mm_users == 0 and we won't ever break COW on that mm with mm_users ==
0 so I think those troublesome checks from page fault can be simply
removed.
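
The retry loop described in the mail above can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions: `struct toy_mm`, `toy_handle_fault` and `toy_gup` are hypothetical names invented for illustration, none of this is kernel code, and the retry cap exists only so the example terminates.

```c
#include <stdbool.h>

/* Hypothetical toy state; these are not kernel structures. */
struct toy_mm {
    int  users;          /* stands in for the user count of the mm */
    bool page_present;   /* whether the locked page is mapped */
    bool check_in_fault; /* models the extra check in the fault path */
};

/* Toy fault handler: always reports success, but with the extra check
 * enabled it maps nothing once the mm has no users left. */
static bool toy_handle_fault(struct toy_mm *mm)
{
    if (mm->check_in_fault && mm->users == 0)
        return true;          /* "succeeded", yet the page is still absent */
    mm->page_present = true;
    return true;
}

/* Toy lookup loop: look the page up, fault on a miss, retry.
 * Returns the number of faults that were needed, or -1 when max_tries
 * is exhausted (the loop described in the mail has no such cap). */
int toy_gup(struct toy_mm *mm, int max_tries)
{
    for (int tries = 0; tries < max_tries; tries++) {
        if (mm->page_present)
            return tries;
        toy_handle_fault(mm);
    }
    return -1;
}
```

With `check_in_fault` set and `users == 0`, `toy_gup` never sees the page and runs out of tries, which corresponds to the endless retry at exit. With the check removed, one fault is enough.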
Created attachment 358588
attempted fix (last one was wrong diff)
Created attachment 358624
new proposed patch
this is actually making ksm_exit simpler and it already contains down_write(mmap_sem)
(also this time I checked which workstation I'm running firefox on, before picking a random file from /tmp ;)
discussion is going live on linux-mm with Hugh
Hugh acked my attachment 358624, so please apply it and then we can close this bug. We still have some issues to discuss on linux-mm about oom handling with ksm, but those aren't critical, and once we solve them the patches will flow into rawhide.
thanks!
Already applied and should be in kernel-2.6.31-0.180.rc7.git4.fc12 today. KSM has been re-enabled.
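
To check whether KSM is actually running on a given kernel, a small sketch like the one below can read the KSM run knob. The path /sys/kernel/mm/ksm/run is the one used by mainline kernels with KSM; whether the Fedora 2.6.31 backport discussed here exposes the same path is an assumption. The function name `ksm_run_state` is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: read the integer stored in the KSM "run" file.
 * In mainline, 0 means stopped, 1 means running and 2 means unmerge.
 * Returns -1 if the file is missing or cannot be parsed. */
int ksm_run_state(const char *path)
{
    FILE *f = fopen(path, "r");
    int state = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &state) != 1)
        state = -1;
    fclose(f);
    return state;
}
```

For example, `ksm_run_state("/sys/kernel/mm/ksm/run")` returning -1 would suggest a kernel built without KSM or with the patches disabled, as in the -157 builds mentioned earlier.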