The following patch solves the problems introduced by Robert's commit 41bf498 and reported by Arun Sharma. That commit got rid of the base + index notation for reading and writing PMU MSRs. The problem is that for fixed counters, the new calculation of the base did not take the fixed counter index into account, so all fixed counters were read from and written to fixed counter 0. Although all fixed counters share the same config MSR, each one has its own counter register.

Without the patch:

$ task -e unhalted_core_cycles -e instructions_retired -e baclears noploop 1
noploop for 1 seconds
 242202299 unhalted_core_cycles (0.00% scaling, ena=1000790892, run=1000790892)
2389685946 instructions_retired (0.00% scaling, ena=1000790892, run=1000790892)
     49473 baclears (0.00% scaling, ena=1000790892, run=1000790892)

With the patch:

$ task -e unhalted_core_cycles -e instructions_retired -e baclears noploop 1
noploop for 1 seconds
2392703238 unhalted_core_cycles (0.00% scaling, ena=1000840809, run=1000840809)
2389793744 instructions_retired (0.00% scaling, ena=1000840809, run=1000840809)
     47863 baclears (0.00% scaling, ena=1000840809, run=1000840809)

Acknowledgements: Red Hat would like to thank Li Yu for reporting this issue.
Upstream commit: http://git.kernel.org/linus/fc66c5210ec2539e800e87d7b3a985323c7be96e
Introduced in: http://git.kernel.org/linus/41bf498949a263fa0b2d32524b89d696ac330e94
Statement: This issue did not affect the versions of the Linux kernel shipped with Red Hat Enterprise Linux 4 and 5 or with Red Hat Enterprise MRG, as they did not backport the upstream commit 41bf498 that introduced the issue. It has been addressed in Red Hat Enterprise Linux 6 via https://rhn.redhat.com/errata/RHSA-2011-1350.html.
*** Bug 717049 has been marked as a duplicate of this bug. ***
*** Bug 721283 has been marked as a duplicate of this bug. ***
------- Comment From ranittal.ibm.com 2011-10-03 08:44 EDT -------
Hi Don,

Do we have any update on this? Can you please confirm which build will include this fix? Thanks.
This issue has been addressed in the following products:

  Red Hat Enterprise Linux 6

Via RHSA-2011:1350 https://rhn.redhat.com/errata/RHSA-2011-1350.html
Created kernel tracking bugs for this issue:

Affects: fedora-all [bug 748669]
------- Comment From nabharay.com 2011-11-25 06:47 EDT -------
While running the pounder test on an HS22 with RHEL 6.2 RC1 64-bit, the machine crashed after 44 hours of test run and a vmcore was generated. The backtrace in the vmcore is the same as the one in the bug description.

---- uname output ----
Linux hs22.in.ibm.com 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
Machine Type = HS22

Attaching the crash report and sos report. The backtrace is pasted below:

This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.el6.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2011-11-25-07:20:47/vmcore  [PARTIAL DUMP]
        CPUS: 16
        DATE: Fri Nov 25 07:18:47 2011
      UPTIME: 1 days, 20:33:37
LOAD AVERAGE: 190.73, 316.30, 344.43
       TASKS: 776
    NODENAME: hs22.in.ibm.com
     RELEASE: 2.6.32-220.el6.x86_64
     VERSION: #1 SMP Wed Nov 9 08:03:13 EST 2011
     MACHINE: x86_64 (2666 Mhz)
      MEMORY: 70 GB
       PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 13"
         PID: 2016
     COMMAND: "timed_loop"
        TASK: ffff880a9eb20b00  [THREAD_INFO: ffff8800623ee000]
         CPU: 13
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 2016  TASK: ffff880a9eb20b00  CPU: 13  COMMAND: "timed_loop"
 #0 [ffff8808234a7b00] machine_kexec at ffffffff81031fcb
 #1 [ffff8808234a7b60] crash_kexec at ffffffff810b8f72
 #2 [ffff8808234a7c30] panic at ffffffff814ec348
 #3 [ffff8808234a7cb0] watchdog_overflow_callback at ffffffff810d8fad
 #4 [ffff8808234a7cd0] __perf_event_overflow at ffffffff8110a89d
 #5 [ffff8808234a7d70] perf_event_overflow at ffffffff8110ae54
 #6 [ffff8808234a7d80] intel_pmu_handle_irq at ffffffff8101e096
 #7 [ffff8808234a7e90] perf_event_nmi_handler at ffffffff814f09f9
 #8 [ffff8808234a7ea0] notifier_call_chain at ffffffff814f2545
 #9 [ffff8808234a7ee0] atomic_notifier_call_chain at ffffffff814f25aa
#10 [ffff8808234a7ef0] notify_die at ffffffff81096bce
#11 [ffff8808234a7f20] do_nmi at ffffffff814f01c3
#12 [ffff8808234a7f50] nmi at ffffffff814efad0
    [exception RIP: _spin_lock_irqsave+47]
    RIP: ffffffff814ef22f  RSP: ffff8800623ef928  RFLAGS: 00000083
    RAX: 0000000000008c60  RBX: ffff880800033db8  RCX: 0000000000008c54
    RDX: 0000000000000282  RSI: 0000000000000001  RDI: ffff880800033db8
    RBP: ffff8800623ef928   R8: 0000000000000002   R9: 00000000000030e6
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff880800000040
    R13: 0000000000000001  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
#13 [ffff8800623ef928] _spin_lock_irqsave at ffffffff814ef22f
#14 [ffff8800623ef930] __wake_up at ffffffff810517f2
#15 [ffff8800623ef970] wakeup_kswapd at ffffffff811295ae
#16 [ffff8800623ef9b0] __alloc_pages_nodemask at ffffffff81123e5b
#17 [ffff8800623efad0] alloc_pages_current at ffffffff81158b7a
#18 [ffff8800623efb00] __page_cache_alloc at ffffffff81110e57
#19 [ffff8800623efb30] __do_page_cache_readahead at ffffffff81126a7b
#20 [ffff8800623efbc0] ra_submit at ffffffff81126bd1
#21 [ffff8800623efbd0] filemap_fault at ffffffff81112123
#22 [ffff8800623efc40] __do_fault at ffffffff8113b2c4
#23 [ffff8800623efcd0] handle_pte_fault at ffffffff8113b877
#24 [ffff8800623efdb0] handle_mm_fault at ffffffff8113c4b4
#25 [ffff8800623efe00] __do_page_fault at ffffffff81042b39
#26 [ffff8800623eff20] do_page_fault at ffffffff814f248e
#27 [ffff8800623eff50] page_fault at ffffffff814ef845
    RIP: 00000036c824e950  RSP: 00007fff4c167988  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 00000000020d9010  RCX: 00007fff4c16928a
    RDX: 0000000000401389  RSI: 0000000000000100  RDI: 00007fff4c167990
    RBP: 00007fff4c1680ec   R8: 000000000000ffff   R9: 000000000000000f
    R10: fffffffffffff38f  R11: 0000000000000000  R12: 00007fff4c168208
    R13: 00000000004012e0  R14: 00007fff4c168218  R15: 0000000000004093
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

This is a regression from RHEL 6.2 SNAP3 or SNAP4.

Thanks
Created attachment 536207 [details]
Sosreport for comment #17

------- Comment on attachment From nabharay.com 2011-11-25 06:49 EDT -------
Sosreport for comment #17
Created attachment 536208 [details]
Crash report for comment #17

------- Comment on attachment From nabharay.com 2011-11-25 06:51 EDT -------
Crash report as mentioned in comment #17
- [kernel] perf: Optimize event scheduling locking (Steve Best) [744986]

is the only patch since snap3/4 that may have caused this regression.
Linda,

I'm not sure why bzs are getting dupped to this bz. Checking out kernel 220, it has:

Upstream commit: http://git.kernel.org/linus/fc66c5210ec2539e800e87d7b3a985323c7be96e

Does anyone have any idea why this is still open? I assume it is for RHEL 6.2; maybe it is for another RHEL 6.x release?

-Steve
(In reply to comment #19)
> Linda,
>
> I'm not sure why bzs are getting dupped to this bz. checking out kernel 220 it
> has
> Upstream commit:
> http://git.kernel.org/linus/fc66c5210ec2539e800e87d7b3a985323c7be96e
>
> anyone have any idea why this is still open? I assume it is for RHEL 6.2, maybe
> it is for another RHEL 6.x release?
>
> -Steve

This is a top-level security bug. It is meant to keep track of the trackers (see "Depends on"). It will remain open until all the trackers have been addressed, including rhel-6.2. Thanks.
------- Comment From tpnoonan.com 2012-04-25 11:32 EDT -------
(In reply to comment #21)
> ------- Comment From eteo 2011-12-01 00:17:36 EDT -------
> > I'm not sure why bzs are getting dupped to this bz. checking out
> > kernel 220 it has
> >
> > anyone have any idea why this is still open? I assume it is for RHEL 6.2,
> > maybe
> This is a top-level security bug. It is meant to keep track of the trackers
> (see Depends on). This will remain opened until all the trackers have been
> addressed, including rhel-6.2. Thanks.

Hi Red Hat, have all the trackers been addressed / can this one now be closed? Thanks.
(In reply to comment #21)
> ------- Comment From tpnoonan.com 2012-04-25 11:32 EDT -------
> hi red hat, have all the trackers been addressed/can this one now be closed?

Hi IBM, yes, closing.