1207282 – kernel/KVM rcu_sched detected stalls on CPUs/tasks

Summary: kernel/KVM rcu_sched detected stalls on CPUs/tasks

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	22
Hardware:	aarch64
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Mark Salter
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1207320

Reported:	2015-03-30 15:30 UTC by Michael Tremer
Modified:	2015-11-23 17:18 UTC
CC List:	9 users
Fixed In Version:
Clone Of:
Clones:	1207320
Environment:
Last Closed:	2015-11-23 17:18:50 UTC
Type:	Bug
Embargoed:
Dependent Products:
Flags:	jforbes: needinfo?

Attachments
Kernel messages when the system freezes (1.66 KB, text/plain) 2015-03-30 15:30 UTC, Michael Tremer	no flags
dmesg output of the VM (8.60 KB, text/plain) 2015-04-09 16:10 UTC, Michael Tremer	no flags

Description Michael Tremer 2015-03-30 15:30:33 UTC

Created attachment 1008541
Kernel messages when the system freezes

Description of problem:
I am using Fedora 22 (aarch64) on an APM X-Gene board (A2 stepping). On it I run multiple KVM virtual machines that are used to compile lots of things. After a while the system becomes more and more unresponsive until it freezes completely. This usually happens after an hour or two under heavy CPU and I/O load; once it gets noticeably slower, the freeze usually follows within two or three minutes.

The guest kernel dumps the attached message.

After the guest system is "forced off" and rebooted, everything is fine again. Sometimes I had to reboot the entire host because the guest system would not boot any more, although I am not sure whether this was related.

Version-Release number of selected component (if applicable):
qemu-2.3.0-0.1.rc0.fc22
kernel-4.0.0-0.rc5.git4.1.fc23.kvm1 (also happens with kernel-4.0.0-0.rc3.git0.1.fc22)
https://hrw.fedorapeople.org/aarch64/kernel-4.0.0-0.rc5.git4.1.fc23.kvm1.src.rpm/


How reproducible:
Every time. It happens sooner or later, probably depending on the load of the system.

Steps to Reproduce:
1. Run a virtual machine.
2. Compile the Linux kernel with -j8 or higher.
3. Repeat and wait until the system freezes.

Actual results:
The guest system freezes and needs to be "forced off".

Expected results:
Should not freeze at all.
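
While repeating the steps above, a guest log can be scanned for the stall messages with something like the following. This is only a minimal sketch and not part of the original report: the function name `find_rcu_stalls` and the sample log lines are hypothetical, and the pattern assumes the usual "detected stalls on CPUs/tasks" and "self-detected stall on CPU" wording, which may differ between kernel versions.

```python
import re

# Assumed wording of the RCU stall headlines; adjust if the kernel in use
# prints something different.
STALL_RE = re.compile(r"rcu_\w+ (?:self-)?detected stalls? on CPUs?")


def find_rcu_stalls(log_text):
    """Return the lines of log_text that look like RCU stall reports."""
    return [line for line in log_text.splitlines() if STALL_RE.search(line)]


# Hypothetical sample; in practice the text would come from the guest's dmesg.
SAMPLE_LOG = """\
[    0.296040] kvm [1]: Hyp mode initialized successfully
[ 3600.000000] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=2102 jiffies)
[ 3660.000000] INFO: rcu_sched self-detected stall on CPU { 2}
"""

for hit in find_rcu_stalls(SAMPLE_LOG):
    print(hit)
```

Feeding the function the saved output of `dmesg` from the guest would show whether, and at which timestamps, the stalls were logged before the freeze.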

Comment 1 Josh Boyer 2015-03-30 15:37:26 UTC

You'll need to work with the person that provided that kernel.

*** This bug has been marked as a duplicate of bug 126342 ***

Comment 2 Michael Tremer 2015-03-30 15:43:31 UTC

Marcin Juszkiewicz <mjuszkiewicz> asked me to open this bug report here.

The bug is just the same with the "official" kernel 4.0.0-0.rc3.git0.1.fc22.

Comment 3 Don Dutile (Red Hat) 2015-03-30 17:44:28 UTC

Does the host & guest have this upstream patch?
commit 285994a62c80f1d72c6924282bcb59608098d5ec
Author: Catalin Marinas <catalin.marinas>
Date:   Wed Mar 11 12:20:39 2015 +0000

    arm64: Invalidate the TLB corresponding to intermediate page table levels
    
    The ARM architecture allows the caching of intermediate page table
    levels and page table freeing requires a sequence like:
    
        pmd_clear()
        TLB invalidation
        pte page freeing
    
    With commit 5e5f6dc10546 (arm64: mm: enable HAVE_RCU_TABLE_FREE logic),
    the page table freeing batching was moved from tlb_remove_page() to
    tlb_remove_table(). The former takes care of TLB invalidation as this is
    also shared with pte clearing and page cache page freeing. The latter,
    however, does not invalidate the TLBs for intermediate page table levels
    as it probably relies on the architecture code to do it if required.
    When the mm->mm_users < 2, tlb_remove_table() does not do any batching
    and page table pages are freed before tlb_finish_mmu() which performs
    the actual TLB invalidation.
    
    This patch introduces __tlb_flush_pgtable() for arm64 and calls it from
    the {pte,pmd,pud}_free_tlb() directly without relying on deferred page
    table freeing.
    
    Fixes: 5e5f6dc10546 arm64: mm: enable HAVE_RCU_TABLE_FREE logic
    Reported-by: Jon Masters <jcm>
    Tested-by: Jon Masters <jcm>
    Tested-by: Steve Capper <steve.capper>
    Signed-off-by: Catalin Marinas <catalin.marinas>
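
The ordering requirement described in the commit message (clear the pmd, invalidate the TLB, then free the pte page) can be illustrated with a toy model. This is a sketch in Python, not kernel code: `ToyMMU`, `demo`, and the `walk_cache` dictionary are hypothetical names, and the dictionary merely stands in for the hardware's caching of intermediate page table levels.

```python
class ToyMMU:
    """Toy model: a pmd entry points at a pte page, and a walk cache may
    keep a copy of that pmd entry until it is explicitly invalidated."""

    def __init__(self):
        self.pmd = {}            # region -> pte page id
        self.walk_cache = {}     # cached copies of pmd entries
        self.live_pages = set()  # pte pages that have not been freed

    def map_region(self, region, pte_page):
        self.live_pages.add(pte_page)
        self.pmd[region] = pte_page

    def walk(self, region):
        # Fill the cache from the pmd on a miss, then use the cached entry.
        if region not in self.walk_cache and region in self.pmd:
            self.walk_cache[region] = self.pmd[region]
        page = self.walk_cache.get(region)
        if page is None:
            return "fault"
        return "ok" if page in self.live_pages else "use-after-free"

    def unmap(self, region, invalidate_before_free):
        page = self.pmd.pop(region)            # pmd_clear()
        if invalidate_before_free:
            self.walk_cache.pop(region, None)  # TLB invalidation
        self.live_pages.discard(page)          # pte page freeing


def demo(invalidate_before_free):
    mmu = ToyMMU()
    mmu.map_region(0, "pte0")
    mmu.walk(0)  # populates the walk cache
    mmu.unmap(0, invalidate_before_free)
    return mmu.walk(0)


print(demo(True))   # invalidation happens before the page is freed
print(demo(False))  # page freed while a cached entry still points at it
```

In the model, skipping the invalidation leaves a cached intermediate entry pointing at a freed pte page, which is the situation the patch is meant to rule out; whether that is what triggers the stalls reported here is not established by this sketch.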

btw: how is this bz a dupe of 126342 ?  the latter bz is about kernel config options and dates back to 2004 ... maybe another bz number?

Comment 4 Josh Boyer 2015-03-30 17:50:26 UTC

(In reply to Don Dutile from comment #3) 
> btw: how is this bz a dupe of 126342 ?  the latter bz is about kernel config
> options and dates back to 2004 ... maybe another bz number?

It's a custom kernel, not something provided by the fedora project.  Before the reporter made things slightly more clear, the standard process is to dup the bug to the customkernel alias since we don't support any random kernel that someone might build.

You can undup it if you'd like, but there's nothing stock about the fedora kernel in regard to aarch64.  The rc3 kernel mentioned in comment #2 is lacking the TLB patch.  The rc4 and rc5 kernels in fedora have the TLB patch, but the rc5 kernels lack the kernel-arm64 megapatch and it's my understanding that KVM support was carried in that and not something that comes from upstream.

The situation is essentially a mess.

Comment 5 Don Dutile (Red Hat) 2015-03-30 19:23:31 UTC

(In reply to Josh Boyer from comment #4)
> (In reply to Don Dutile from comment #3) 
> > btw: how is this bz a dupe of 126342 ?  the latter bz is about kernel config
> > options and dates back to 2004 ... maybe another bz number?
> 
> It's a custom kernel, not something provided by the fedora project.  Before
> the reporter made things slightly more clear, the standard process is to dup
> the bug to the customkernel alias since we don't support any random kernel
> that someone might build.
> 
> The situation is essentially a mess.

+1

Comment 6 Michael Tremer 2015-04-09 16:09:16 UTC

Reopening this because the issue still exists.

Host system:

[root@mustang ~]# uname -a
Linux mustang.ipfire.org 4.0.0-0.rc6.git0.1.fc23.aarch64 #1 SMP Tue Mar 31 13:34:47 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

Virtual machine:

[root@wainscott ~]# uname -a
Linux wainscott.ipfire.org 4.0.0-0.rc6.git0.1.fc23.aarch64 #1 SMP Tue Mar 31 13:34:47 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

[root@mustang ~]# dmesg | grep -i kvm
[    0.295900] kvm [1]: interrupt-controller@780c0000 IRQ17
[    0.296023] kvm [1]: timer IRQ3
[    0.296040] kvm [1]: Hyp mode initialized successfully

Comment 7 Michael Tremer 2015-04-09 16:10:15 UTC

Created attachment 1012744
dmesg output of the VM

Comment 11 Justin M. Forbes 2015-10-20 19:32:54 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 22 kernel bugs.

Fedora 22 has now been rebased to 4.2.3-200.fc22.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 23, and are still experiencing this issue, please change the version to Fedora 23.

If you experience different issues, please open a new bug report for those.

Comment 12 Fedora Kernel Team 2015-11-23 17:18:50 UTC

*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in over 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
