Created attachment 1008541 [details]
Kernel messages when the system freezes

Description of problem:
I am using Fedora 22 (aarch64) on an APM X-Gene board (A2 stepping). On it I am running multiple virtual machines with KVM that are used to compile lots of things. After a while the system becomes more and more unresponsive until it freezes completely. This usually happens after an hour or two under heavy CPU and I/O load, and once it starts to get noticeably slower it usually takes less than two or three minutes until the freeze. The guest kernel dumps the attached message. After the guest system is "forced off" and rebooted, everything is fine again. Sometimes I had to reboot the entire host, although I am not sure whether this was related; the guest system just would not boot any more.

Version-Release number of selected component (if applicable):
qemu-2.3.0-0.1.rc0.fc22
kernel-4.0.0-0.rc5.git4.1.fc23.kvm1 (also happens with kernel-4.0.0-0.rc3.git0.1.fc22)
https://hrw.fedorapeople.org/aarch64/kernel-4.0.0-0.rc5.git4.1.fc23.kvm1.src.rpm/

How reproducible:
Every time. It happens sooner or later, probably depending on the load of the system.

Steps to Reproduce:
1. Run a virtual machine.
2. Compile the Linux kernel with -j8 or higher.
3. Repeat and wait until the system freezes.

Actual results:
The guest system freezes and needs to be "forced off".

Expected results:
Should not freeze at all.
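For anyone trying to reproduce this, the steps above could be scripted roughly as follows. This is only a sketch, not what the reporter actually ran: the function name `build_loop`, the source directory and the iteration count are assumptions, and it expects an already configured kernel tree inside the guest. Printing the time per iteration may help to spot the slowdown that precedes the freeze.

```python
import subprocess
import time


def build_loop(src_dir, jobs=8, iterations=3, run=subprocess.run):
    """Run `make clean` and `make -jN` repeatedly in src_dir.

    Returns the wall-clock duration of each iteration in seconds.
    `run` is injectable so the loop can be exercised without a real build.
    """
    durations = []
    for i in range(iterations):
        start = time.monotonic()
        # Start every round from a clean tree so the load stays comparable.
        run(["make", "clean"], cwd=src_dir, check=True)
        run(["make", f"-j{jobs}"], cwd=src_dir, check=True)
        elapsed = time.monotonic() - start
        durations.append(elapsed)
        # Flush so the last timing is visible even if the guest hangs.
        print(f"iteration {i + 1}: {elapsed:.1f}s", flush=True)
    return durations
```

Inside the guest this might be called as `build_loop("/path/to/linux", jobs=8, iterations=100)`, where the path is a placeholder.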
You'll need to work with the person that provided that kernel. *** This bug has been marked as a duplicate of bug 126342 ***
Marcin Juszkiewicz <mjuszkiewicz> asked me to open this bug report here. The bug is just the same with the "official" kernel 4.0.0-0.rc3.git0.1.fc22.
Does the host & guest have this upstream patch?

commit 285994a62c80f1d72c6924282bcb59608098d5ec
Author: Catalin Marinas <catalin.marinas>
Date:   Wed Mar 11 12:20:39 2015 +0000

    arm64: Invalidate the TLB corresponding to intermediate page table levels

    The ARM architecture allows the caching of intermediate page table
    levels and page table freeing requires a sequence like:

        pmd_clear()
        TLB invalidation
        pte page freeing

    With commit 5e5f6dc10546 (arm64: mm: enable HAVE_RCU_TABLE_FREE logic),
    the page table freeing batching was moved from tlb_remove_page() to
    tlb_remove_table(). The former takes care of TLB invalidation as this is
    also shared with pte clearing and page cache page freeing. The latter,
    however, does not invalidate the TLBs for intermediate page table levels
    as it probably relies on the architecture code to do it if required.
    When the mm->mm_users < 2, tlb_remove_table() does not do any batching
    and page table pages are freed before tlb_finish_mmu() which performs
    the actual TLB invalidation.

    This patch introduces __tlb_flush_pgtable() for arm64 and calls it from
    the {pte,pmd,pud}_free_tlb() directly without relying on deferred page
    table freeing.

    Fixes: 5e5f6dc10546 arm64: mm: enable HAVE_RCU_TABLE_FREE logic
    Reported-by: Jon Masters <jcm>
    Tested-by: Jon Masters <jcm>
    Tested-by: Steve Capper <steve.capper>
    Signed-off-by: Catalin Marinas <catalin.marinas>

btw: how is this bz a dupe of 126342 ? the latter bz is about kernel config options and dates back to 2004 ... maybe another bz number?
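To illustrate why the ordering in that commit message matters, here is a toy model in Python. It is not kernel code: `ToyMMU`, its methods and `walk_after_teardown` are made up for illustration, and the "walk cache" only stands in for the hardware's caching of intermediate page table levels. Only the three-step sequence (pmd_clear, TLB invalidation, pte page freeing) is taken from the commit text.

```python
class ToyMMU:
    """Toy model of a two-level page table with a cache of intermediate entries."""

    def __init__(self):
        self.memory = {}      # page id -> contents of that page
        self.pmd = {}         # region -> page id of its pte table
        self.walk_cache = {}  # cached pmd entries (intermediate level)

    def map_region(self, region, page_id, ptes):
        self.memory[page_id] = dict(ptes)
        self.pmd[region] = page_id

    def translate(self, region, index):
        # The walker may use a cached intermediate entry instead of
        # re-reading the pmd from memory.
        page_id = self.walk_cache.get(region, self.pmd.get(region))
        if page_id is None:
            return None  # no mapping: this would be a fault
        self.walk_cache[region] = page_id
        return self.memory[page_id].get(index)

    def pmd_clear(self, region):
        return self.pmd.pop(region)

    def invalidate(self, region):
        self.walk_cache.pop(region, None)

    def free_and_reuse(self, page_id, new_contents):
        # The freed pte page is immediately reused for something else.
        self.memory[page_id] = dict(new_contents)


def walk_after_teardown(invalidate_before_free):
    mmu = ToyMMU()
    mmu.map_region("r0", page_id=7, ptes={0: "frame 42"})
    mmu.translate("r0", 0)  # warms the walk cache
    page_id = mmu.pmd_clear("r0")
    if invalidate_before_free:
        mmu.invalidate("r0")
    mmu.free_and_reuse(page_id, {0: "unrelated data"})
    return mmu.translate("r0", 0)


print("free without invalidation:", walk_after_teardown(False))
print("invalidate, then free:    ", walk_after_teardown(True))
```

In this model, skipping the invalidation lets a later walk read the reused page through the stale cached entry, while invalidating first makes the same walk find no mapping.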
(In reply to Don Dutile from comment #3)
> btw: how is this bz a dupe of 126342 ? the latter bz is about kernel config
> options and dates back to 2004 ... maybe another bz number?

It's a custom kernel, not something provided by the fedora project. Before the reporter made things slightly more clear, the standard process is to dup the bug to the customkernel alias since we don't support any random kernel that someone might build.

You can undup it if you'd like, but there's nothing stock about the fedora kernel in regard to aarch64. The rc3 kernel mentioned in comment #2 is lacking the TLB patch. The rc4 and rc5 kernels in fedora have the TLB patch, but the rc5 kernels lack the kernel-arm64 megapatch and it's my understanding that KVM support was carried in that and not something that comes from upstream.

The situation is essentially a mess.
(In reply to Josh Boyer from comment #4)
> (In reply to Don Dutile from comment #3)
> > btw: how is this bz a dupe of 126342 ? the latter bz is about kernel config
> > options and dates back to 2004 ... maybe another bz number?
>
> It's a custom kernel, not something provided by the fedora project. Before
> the reporter made things slightly more clear, the standard process is to dup
> the bug to the customkernel alias since we don't support any random kernel
> that someone might build.
>
> The situation is essentially a mess.

+1
Reopening this because the issue still exists.

Host system:
[root@mustang ~]# uname -a
Linux mustang.ipfire.org 4.0.0-0.rc6.git0.1.fc23.aarch64 #1 SMP Tue Mar 31 13:34:47 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

Virtual machine:
[root@wainscott ~]# uname -a
Linux wainscott.ipfire.org 4.0.0-0.rc6.git0.1.fc23.aarch64 #1 SMP Tue Mar 31 13:34:47 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

[root@mustang ~]# dmesg | grep -i kvm
[ 0.295900] kvm [1]: interrupt-controller@780c0000 IRQ17
[ 0.296023] kvm [1]: timer IRQ3
[ 0.296040] kvm [1]: Hyp mode initialized successfully
Created attachment 1012744 [details] dmesg output of the VM
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 22 kernel bugs.

Fedora 22 has now been rebased to 4.2.3-200.fc22. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 23, and are still experiencing this issue, please change the version to Fedora 23.

If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** This bug is being closed with INSUFFICIENT_DATA as there has not been a response in over 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.