Bug 1013708

Summary:	perf samples too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Product:	[Fedora] Fedora	Reporter:	Jan Welker <jan>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED NOTABUG	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	19	CC:	alexey.brodkin, aschorr, bugzilla-redhat, collura, davidbranquinho, gansalmon, horst, igeorgex, itamar, jonathan, jones.peter.busi, js, kernel-maint, lampe, madhu.chinakonda, marcelo.barbosa, michele, michele, spetreolle
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2014-01-04 14:36:35 UTC	Type:	Bug

Description Jan Welker 2013-09-30 15:57:08 UTC

Messages log says:
perf samples too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

3.11.1-200.fc19.x86_64

I expect kernel.perf_event_max_sample_rate to be 100 000

Comment 1 Jan Welker 2013-09-30 15:58:53 UTC

I do not know if this is related, but I did a BTRFS scrub (heavy disk I/O) while the error occurred.

Comment 2 Jan Welker 2013-10-02 14:18:14 UTC

Same thing with the latest kernel 3.11.2-201.fc19.x86_64

Comment 3 Jan Welker 2013-10-28 17:11:20 UTC

And with 3.12.0-0.rc6.git4.2.fc21.x86_64 ...

Comment 4 Andrew J. Schorr 2013-10-30 15:46:28 UTC

I got this message on 3.11.3-201.fc19.x86_64:

Oct 30 03:22:32 ti15 kernel: [ID kern.warning] [56333.143955] perf samples too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

Comment 5 Julian Stecklina 2013-11-11 21:44:26 UTC

I got it on 3.11.7-300.fc20.x86_64

Comment 6 Steve 2013-11-18 09:16:49 UTC

And I got it on 3.11.8-200.fc19.x86_64.

Comment 7 Michele Baldessari 2013-11-26 21:33:10 UTC

This change came via 3.11:
commit 14c63f17b1fde5a575a28e96547a22b451c71fb5
Author: Dave Hansen <dave.hansen.com>
Date:   Fri Jun 21 08:51:36 2013 -0700

    perf: Drop sample rate when sampling is too slow
    
    This patch keeps track of how long perf's NMI handler is taking,
    and also calculates how many samples perf can take a second.  If
    the sample length times the expected max number of samples
    exceeds a configurable threshold, it drops the sample rate.
    
    This way, we don't have a runaway sampling process eating up the
    CPU.
    
    This patch can tend to drop the sample rate down to level where
    perf doesn't work very well.  *BUT* the alternative is that my
    system hangs because it spends all of its time handling NMIs.
    
    I'll take a busted performance tool over an entire system that's
    busted and undebuggable any day.
    
    BTW, my suspicion is that there's still an underlying bug here.
    Using the HPET instead of the TSC is definitely a contributing
    factor, but I suspect there are some other things going on.
    But, I can't go dig down on a bug like that with my machine
    hanging all the time.

The warnings are harmless. From Documentation/sysctl/kernel.txt:
perf_cpu_time_max_percent:

Hints to the kernel how much CPU time it should be allowed to
use to handle perf sampling events.  If the perf subsystem
is informed that its samples are exceeding this limit, it
will drop its sampling frequency to attempt to reduce its CPU
usage.

Some perf sampling happens in NMIs.  If these samples
unexpectedly take too long to execute, the NMIs can become
stacked up next to each other so much that nothing else is
allowed to execute.

0: disable the mechanism.  Do not monitor or correct perf's
   sampling rate no matter how much CPU time it takes.

1-100: attempt to throttle perf's sample rate to this
   percentage of CPU.  Note: the kernel calculates an
   "expected" length of each sample event.  100 here means
   100% of that expected length.  Even if this is set to
   100, you may still see sample throttling if this
   length is exceeded.  Set to 0 if you truly do not care
   how much CPU is consumed.
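
The numbers in the warning line up with this description. Below is a rough sketch of the arithmetic, not the kernel's actual code. It assumes the per-sample budget is one sample period (1 s divided by the max sample rate) scaled by perf_cpu_time_max_percent, that the default percentage is 25 (inferred from the 2500 in the log, not stated in this report), and that one throttling step halves the rate (inferred from 100000 dropping to 50000). The function names are made up for illustration.

```sh
#!/bin/sh
# Sketch only: reproduces the numbers from the log message under the
# assumptions stated above (25 percent default, rate halved per step).

NSEC_PER_SEC=1000000000

# allowed_sample_ns RATE PERCENT
# Prints the assumed time budget per sample, in nanoseconds.
allowed_sample_ns() {
    echo $(( $NSEC_PER_SEC / $1 * $2 / 100 ))
}

# next_sample_rate RATE
# Prints the rate after one throttling step (assumed: halved).
next_sample_rate() {
    echo $(( $1 / 2 ))
}

rate=100000        # sample rate the reporter expected
percent=25         # assumed default of perf_cpu_time_max_percent
measured_ns=2506   # sample length reported in the log message

limit_ns=$(allowed_sample_ns "$rate" "$percent")
if [ "$measured_ns" -gt "$limit_ns" ]; then
    echo "perf samples too long ($measured_ns > $limit_ns), lowering rate to $(next_sample_rate "$rate")"
fi
```

With these inputs the budget works out to 2500 ns, so a 2506 ns sample is just over the limit and the rate drops to 50000, matching the message in the description.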

Comment 8 Michele Baldessari 2013-12-25 19:18:50 UTC

I'd say this one can be closed, unless anyone thinks otherwise. 

Josh, what's your take on this one? Ok to close?

Comment 9 Justin M. Forbes 2014-01-03 22:11:17 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.

If you experience different issues, please open a new bug report for those.

Comment 10 Michele Baldessari 2014-01-04 14:36:35 UTC

I am going ahead and closing this one as it is pretty much by design and can
be simply disabled.

Comment 11 Peter H. Jones 2014-01-05 21:00:23 UTC

Please reopen. I am seeing this in connection with bug 1048605. I tried adding a fixed-address ethernet port at 10.0.0.1, starting "ncat -u -l -v -p 6666", and opening the firewall to UDP port 6666 on another ethernet-connected machine,
and adding netconsole=6665.0.2/p4p1,6666.1 to the first line on the Fedora-Jam version mentioned in that bug. How can the bug be "simply disabled"?

Comment 12 Horst Schirmeier 2014-04-14 11:03:45 UTC

(In reply to Peter H. Jones from comment #11)
> How can the bug be "simply disabled"?

Just have a look at the patch: https://lkml.org/lkml/2013/5/29/640

> perf_cpu_time_max_percent:
> [...]
> 0: disable the mechanism.  Do not monitor or correct perf's
>    sampling rate no matter how much CPU time it takes.

-->  echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
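
Before changing anything it can help to look at the current values. This is a minimal sketch, assuming both sysctls are exposed under /proc/sys/kernel as on the kernels in this report. The function name show_perf_limits and its optional root-directory argument are hypothetical, added only so the sketch can be tried against a scratch directory.

```sh
#!/bin/sh
# Sketch only: print the two perf-related sysctls discussed in this bug.

# show_perf_limits [ROOT]
# ROOT defaults to /proc/sys; passing another directory is a
# hypothetical convenience for trying the function without a live kernel.
show_perf_limits() {
    root="${1:-/proc/sys}"
    for name in perf_event_max_sample_rate perf_cpu_time_max_percent; do
        file="$root/kernel/$name"
        if [ -r "$file" ]; then
            printf 'kernel.%s = %s\n' "$name" "$(cat "$file")"
        else
            # Kernels before the 3.11 change lack perf_cpu_time_max_percent.
            printf 'kernel.%s = (not available)\n' "$name"
        fi
    done
}

show_perf_limits
```

Writing 100000 back to kernel.perf_event_max_sample_rate as root should restore the rate the reporter expected, though the kernel may lower it again if samples still run long.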

Comment 13 Michael Lampe 2014-10-19 02:16:59 UTC

This just means you opted out of monitoring and correcting. Now you really get it. Disabling is something else.