Bug 1286000

Summary: KVM guest VM random hangs - perf Interrupt taking too long
Product: [Fedora] Fedora
Reporter: Saso Tavcar <fast>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 23
CC: fast, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, mchehab
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-03 13:56:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Saso Tavcar 2015-11-27 08:12:46 UTC
Description of problem:

The KVM guest hangs randomly a few times per day. The hangs started with Fedora 22 kernel-4.1.x and later.
The problem still persists for this particular VM after upgrading to Fedora 23 with kernel-4.2.6.

If I boot this machine back into kernel-4.0.4, the VM runs fine and is stable.


Version-Release number of selected component (if applicable):

Fedora release 22 with kernels 4.1.x and later
Fedora release 23 (Twenty Three) 

KVM guest services running (4 vCPUs, 12 GB RAM):

openvswitch-2.4.0-1.fc23.x86_64
httpd-2.4.17-3.fc23.x86_64
firewalld-0.3.14.2-4.fc23.noarch
keepalived-1.2.19-2.fc23.x86_64

KVM host services running:

openvswitch-2.4.0-1.fc23.x86_64
firewalld-0.3.14.2-4.fc23.noarch
qemu-2.4.1-1.fc23.x86_64
libvirt-1.2.18.1-2.fc23.x86_64

How reproducible:

Occurs every time the VM is booted into and run on a new kernel, starting with Fedora 22 kernels 4.1.x.


Steps to Reproduce:
1. dnf update
2. Restart the VM and run it with a new kernel (every new kernel from 4.1.x through 4.2.6)


Actual results:

The KVM guest VM is stuck at 100% CPU; not even the VM console in virt-manager responds.

[root@horizon1 ~]# cat /var/log/messages*|grep "perf interrupt"
#
# 88 days of uptime running kernel 4.0.4-301.fc22.x86_64,
# then after installing and running new kernel 4.2.6:
#
Nov 24 19:16:06 horizon1 kernel: [ 4606.027436] perf interrupt took too long (2546 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Nov 24 21:13:05 horizon1 kernel: [ 2033.602214] perf interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Nov 24 22:18:13 horizon1 kernel: [ 5941.758623] perf interrupt took too long (5080 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Nov 25 06:40:38 horizon1 kernel: [36086.226498] perf interrupt took too long (10021 > 9615), lowering kernel.perf_event_max_sample_rate to 13000
Nov 25 12:18:44 horizon1 kernel: [ 3906.467819] perf interrupt took too long (2532 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Nov 25 13:38:57 horizon1 kernel: [ 8718.624366] perf interrupt took too long (5069 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Nov 25 17:33:41 horizon1 kernel: [ 4817.835230] perf interrupt took too long (2575 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Nov 25 20:12:24 horizon1 kernel: [ 4689.057228] perf interrupt took too long (2587 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Nov 25 21:49:20 horizon1 kernel: [10505.444042] perf interrupt took too long (5147 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Nov 25 23:54:51 horizon1 kernel: [ 5794.339573] perf interrupt took too long (2560 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Nov 26 03:08:17 horizon1 kernel: [17400.684225] perf interrupt took too long (5151 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Nov 26 09:11:23 horizon1 kernel: [39186.016291] perf interrupt took too long (10076 > 9615), lowering kernel.perf_event_max_sample_rate to 13000
Nov 26 14:58:41 horizon1 kernel: [ 3625.981674] perf interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Nov 26 16:26:59 horizon1 kernel: [ 8923.976410] perf interrupt took too long (5033 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
#
# After booting back to kernel 4.0.4: !!!No more VM hangs or perf messages!!!
#
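
The bracketed kernel uptime in the guest messages above drops several times, which suggests the guest was rebooted between those messages. The following is a minimal Python sketch for pulling the numbers out of such lines and counting boot sessions. The function names parse_perf_events and count_boot_sessions are hypothetical helpers written for this illustration, and treating every uptime decrease as a reboot is an assumption; the log does not say whether each reboot followed a hang.

```python
import re

# Matches the guest kernel message format quoted in this report.
PERF_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] perf interrupt took too long "
    r"\((?P<took>\d+) > (?P<limit>\d+)\), lowering "
    r"kernel\.perf_event_max_sample_rate to (?P<rate>\d+)"
)


def parse_perf_events(lines):
    """Return one dict per 'perf interrupt took too long' line (hypothetical helper)."""
    events = []
    for line in lines:
        m = PERF_RE.search(line)
        if m:
            events.append({
                "uptime": float(m.group("uptime")),  # seconds since guest boot
                "took": int(m.group("took")),        # measured interrupt time
                "limit": int(m.group("limit")),      # threshold that was exceeded
                "rate": int(m.group("rate")),        # new max sample rate
            })
    return events


def count_boot_sessions(events):
    """Count boot sessions, assuming an uptime decrease means a reboot happened."""
    sessions = 0
    prev = None
    for ev in events:
        if prev is None or ev["uptime"] < prev:
            sessions += 1
        prev = ev["uptime"]
    return sessions


if __name__ == "__main__":
    import sys
    # Usage: python3 this_script.py < /var/log/messages
    evs = parse_perf_events(sys.stdin)
    print(len(evs), "perf messages in", count_boot_sessions(evs), "boot sessions")
```

Fed the 14 guest lines quoted above, this sketch reports seven boot sessions, since the uptime value decreases six times between Nov 24 and Nov 26.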



KVM host:

[root@solaris1 ~]# uname -a
Linux solaris1.domain.com 4.2.6-300.fc23.x86_64 #1 SMP Tue Nov 10 19:32:21 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

[root@solaris1 ~]# cat /var/log/messages|grep perf
Nov 23 19:50:01 solaris1 kernel: kvm [20832]: vcpu0 unimplemented perfctr wrmsr: 0xc0010007 data 0xffff
Nov 23 20:06:25 solaris1 kernel: kvm [3394]: vcpu0 unimplemented perfctr wrmsr: 0xc0010007 data 0xffff
Nov 23 21:43:21 solaris1 kernel: kvm [20832]: vcpu0 unimplemented perfctr wrmsr: 0xc0010007 data 0xffff
Nov 24 08:16:39 solaris1 kernel: kvm [20832]: vcpu0 unimplemented perfctr wrmsr: 0xc0010007 data 0xffff
Nov 24 08:55:37 solaris1 kernel: kvm [20832]: vcpu0 unimplemented perfctr wrmsr: 0xc0010007 data 0xffff
Nov 24 11:20:08 solaris1 kernel: kvm [20832]: vcpu0 unimplemented perfctr wrmsr: 0xc0010007 data 0xffff
Nov 24 12:37:57 solaris1 kernel: kvm [3394]: vcpu0 unimplemented perfctr wrmsr: 0xc0010007 data 0xffff
Nov 24 17:51:52 solaris1 hp-snmp-agents: Already stopped Performance agent (cmaperfd): [  OK  ]
Nov 24 17:58:10 solaris1 kernel: Initializing cgroup subsys perf_event
Nov 24 17:58:10 solaris1 kernel: perf: AMD IBS detected (0x0000001f)
Nov 24 17:59:07 solaris1 hp-snmp-agents: Starting Performance agent (cmaperfd): [  OK  ]
Nov 24 22:19:59 solaris1 kernel: perf interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
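
To see which guests trigger the host-side "unimplemented perfctr wrmsr" messages, the lines can be tallied per QEMU/KVM process ID. This is a rough Python sketch; count_wrmsr_by_pid is a hypothetical helper name, and the pattern only assumes the message format shown in the host log above.

```python
import re
from collections import Counter

# Matches the host kernel message format quoted in this report.
WRMSR_RE = re.compile(
    r"kvm \[(?P<pid>\d+)\]: vcpu(?P<vcpu>\d+) unimplemented perfctr wrmsr: "
    r"(?P<msr>0x[0-9a-fA-F]+) data (?P<data>0x[0-9a-fA-F]+)"
)


def count_wrmsr_by_pid(lines):
    """Count 'unimplemented perfctr wrmsr' messages per (pid, msr) pair."""
    counts = Counter()
    for line in lines:
        m = WRMSR_RE.search(line)
        if m:
            counts[(int(m.group("pid")), m.group("msr"))] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python3 this_script.py < /var/log/messages
    for (pid, msr), n in sorted(count_wrmsr_by_pid(sys.stdin).items()):
        print(f"kvm pid {pid}: {n} write(s) to MSR {msr}")
```

For the host lines quoted above, this gives five messages for PID 20832 and two for PID 3394, all for MSR 0xc0010007.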


Expected results:

The KVM guest should run as stably with the newer 4.2.x kernels as it does with 4.0.x kernels.

Comment 1 Laura Abbott 2016-09-23 19:37:50 UTC
*********** MASS BUG UPDATE **************
 
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs.
 
Fedora 23 has now been rebased to 4.7.4-100.fc23.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 24 or 25, and are still experiencing this issue, please change the version to Fedora 24 or 25.
 
If you experience different issues, please open a new bug report for those.

Comment 2 Saso Tavcar 2016-10-03 13:56:35 UTC
This issue was somehow fixed in kernels 4.3.x and later.