Red Hat Bugzilla – Full Text Bug Listing
|Summary:||Performance regression between kernels 2.6.32-131.0.15 and 2.6.32-220|
|Product:||Red Hat Enterprise Linux 6||Reporter:||Joseph Kachuck <jkachuck>|
|Component:||kernel||Assignee:||John Feeney <jfeeney>|
|Status:||CLOSED ERRATA||QA Contact:||Desktop QE <desktop-qa-list>|
|Version:||6.2||CC:||acontant, airlied, andre.gagnon, czhang, ddadap, denis.leclerc, francois.paquin, gwu, jkachuck, jvillalo, kem, kzhang, len.brown, lwoodman, mlichvar, prarit, rcherukuri, tpelka|
|Target Milestone:||rc||Keywords:||OtherQA, Regression|
|Fixed In Version:||kernel-2.6.32-315.el6||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2013-02-21 01:24:30 EST||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
|Bug Blocks:||782183, 840683, 834101|
Description Joseph Kachuck 2012-06-20 15:40:42 EDT
An ISV reported an OpenGL performance regression to NVIDIA. The report was investigated, and the regression was isolated to an unknown change in the kernel shipped with RHEL 6.2. 2.6.32-220 exhibits the poor performance; 2.6.32-131.0.15 does not. All tests were performed on the same system, with the kernel being the only component changed between tests. The regression appears to be related to kernel functionality used by the NVIDIA proprietary graphics driver to update page table attributes. A test case and profiling data can be provided upon request. (We would need to get permission to share the test case / data from the original reporter; if permission is not granted, we can try to develop a separate test case to exhibit the behavior..)
Comment 1 Joseph Kachuck 2012-06-20 15:46:15 EDT
To reproduce the bug, run the attached gfxProfile program with the following options: $ gfxProfile -x 1920 -y 1080 -e GL_RGB -n 32 This will produce output like this: Profiler glMapBuffer: Nb of cycles: 32 Total time: 9.47134 Average time: 0.295979 Minimum time: 1.20782e-06 Maximum time: 1.88427 Standard dev: 0.580849 Ext. Type: GL_UNSIGNED_BYTE Ext. Format: GL_RGB Int. Format: GL_RGBA8 Tex. Target: GL_TEXTURE_2D RT Clock Time: 0.0797068s System Clock Tick Time: 0.08s CPU Time Used: 0.08 % Kernel: 0.00 % User: 100.00 Avg: 0.00248204 Std Dev: 0.000355385 MBytes/Sec: 2506.32 MPix/Sec: 835.44 All tests were performed with Fermi-based Quadro cards (e.g. Quadro 4000, Quadro 5000) with the NVIDIA proprietary driver. With the 2.6.32-131.0.15 kernel, the average "Total time" is consistently about 0.75 seconds. With the 2.6.32-220 kernel, "Total time" ranges between about 2s and 12s. Most of the time (~85%) in the bad case seems to be spent in change_page_attr_set_clr(), which is called by default_send_IPI_mask_allbutself_phys(), which is called by the NVIDIA kernel driver in the path for allocating and freeing kernel memory.
Comment 3 Joseph Kachuck 2012-06-20 15:47:57 EDT
Created attachment 593302 [details] gfxProfile-nvidia-bug-report.log.gz
Comment 4 Joseph Kachuck 2012-06-20 15:52:29 EDT
FWIW, the NVIDIA engineer who was investigating this regression reports that it is reproducible with older generation Quadro boards as well, and does not appear to be system specific.
Comment 6 Joseph Kachuck 2012-06-21 10:21:15 EDT
=== In Red Hat Customer Portal Case 00653942 === --- Comment by Dadap, Daniel on 6/20/2012 8:06 PM --- Another data point: the bad performance is only reproducible with the Intel IOMMU. If the IOMMU is disabled in the kernel, the performance is fine.
Comment 13 Dave Airlie 2012-08-07 16:43:59 EDT
one question is if this is being seen with an upstream kernel kernel as well? if so we should report it to Intel and get their input on it, since its IOMMU related and we fixed a lot of bugs with flushing the iommu with the Intel GPU and it looks like one of those bug fixes is affecting this.
Comment 14 Dave Airlie 2012-08-08 06:27:06 EDT
can we get the logs from the 6.1 kernel as well? I don't think the iommu changes pointed out so far are anything to do with it.
Comment 15 Jean Blouin 2012-08-09 16:25:03 EDT
Created attachment 603345 [details] redhat 5 logs As requested here is the log for Redhat 5
Comment 16 Jean Blouin 2012-08-10 10:08:13 EDT
Created attachment 603558 [details] z800-rhel6.1-no-repro-nvidia-bug-report.log.gz This log was generated by our partner at NVIDIA: The attached log file was generated with an R295_00 nightly builds with the last "good" kernel (2.6.32-131.0.15.el6.x86_64) and contains the output of dmesg.
Comment 17 Dave Airlie 2012-08-12 23:03:03 EDT
okay I've tried to reproduce this on a machine with a Fermi in it, GF108 Quadro 1000M card with no luck. So we probably need to narrow down what range of cards are affected. Does it still happen if you boot 220 with intel_iommu=on? Since RHEL6 doesn't enable iommu by default, swiotlb is probably being used and that may mean reproducing it is a bit more difficult.
Comment 19 Dave Airlie 2012-08-13 02:26:36 EDT
Okay I've just tested on a HP workstation with 32GB RAM with a Quadro 2000 in it, and with an original G80 card, neither of these show the problem here either. I've done perf record and see the functions mentioned taking 10% of the time, with both kernels.
Comment 20 Dave Airlie 2012-08-13 02:31:16 EDT
can you also try removing the vga=795 line from the 220 kernel test, and see if it makes any difference.
Comment 21 Dave Airlie 2012-08-13 03:03:18 EDT
So I think this might actually be CPU specific, can you guys confirm that? I've tested across 3-4 machines here today and cannot reproduce this at all, I see 10% time in those functions with both kernels.
Comment 22 Ramana 2012-08-13 19:47:37 EDT
This happens only on HPZ800 only. We also tested on HPZ600/200/420 but could not reproduce the error. Card we tested with is NVIDIA Quadro FX 5800. Here is command line from HPZ800 system: Command line: ro root=/dev/mapper/VolGroup-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=VolGroup/lv_swap SYSFONT=latarcyrheb-sun16 rhgb crashkernel=128M quiet rd_LVM_LV=VolGroup/lv_root KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM vga=0 vga=0ff
Comment 26 Prarit Bhargava 2012-08-17 14:29:20 EDT
Can someone send me the pci ID for the NVIDIA Quadro FX 5800? Thanks, P.
Comment 27 Garrison Wu 2012-08-20 11:32:18 EDT
Comment 28 Prarit Bhargava 2012-08-20 11:41:10 EDT
We've started a bisect kernel email thread off-bug. I'm providing the customer with kernels to narrow this down. P.
Comment 29 Prarit Bhargava 2012-08-21 09:10:30 EDT
From comment #1: >With the 2.6.32-131.0.15 kernel, the average "Total time" is consistently about 0.75 seconds. With the 2.6.32-220 kernel, "Total time" ranges between about 2s and 12s. Results from the test kernels: Linux curacao 2.6.32-131.el6bz834097_UNSUPPORTED.x86_64 #1 SMP Mon Aug 20 10:03:25 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux ========================================================================================================= 1: Total time: 0.612426 2: Total time: 0.593709 3: Total time: 0.593888 4: Total time: 0.604433 5: Total time: 0.605384 Seems to be better than the reported value in 2.6.32-131.0.15 . Linux curacao 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux ========================================================================================================= 1: Total time: 0.659442 2: Total time: 0.66436 3: Total time: 0.669277 4: Total time: 0.664821 5: Total time: 0.662532 Seems to be slightly higher than the 2.6.32-131-el6bz834097_UNSUPPORTED kernel, but still better than 2.6.32-131.0.15. Am I looking at the correct results? Or is there some other value I should be examining? Thanks, P.
Comment 30 Daniel Dadap 2012-08-27 12:34:56 EDT
That looks like the correct part of the result output to be looking at. Those numbers suggest that the issue has not been reproduced.
Comment 31 Prarit Bhargava 2012-09-05 10:02:47 EDT
We've received the NVIDIA cards and I had one of them installed in the z800 in my office. My numbers are sort of strange from the gfxProfile testing. I seem get a wide range of numbers on both 131.0.15 and 220, and 303 (latest internal). 131.0.15: Total time: 0.570904 Total time: 0.358336 Total time: 0.569248 Total time: 0.569218 Total time: 0.359859 220: Total time: 0.600308 Total time: 0.587827 Total time: 0.365252 Total time: 0.568623 Total time: 0.372128 303: Total time: 0.580631 Total time: 0.438698 Total time: 0.584163 Total time: 0.59195 Total time: 0.576675 There isn't really any consistency here, but I'm definitely not seeing a huge increase like you've reported. Let's look at this from a different angle. It's obvious that you're seeing something strange and I'm not; I think it was mentioned that the kernel seems stuck issuing IPIs but I didn't see anything that indicated why or how you came to that conclusion. Did you just use perf/oprofile? Could you show me exactly what you did to get that info so I can try to replicate it here?
Comment 33 Prarit Bhargava 2012-09-05 13:28:24 EDT
NVIDIA, You've indicated that the problem may lie in the page attribute code, change_page_attr_set_clr() which is called by your IPI call. Can you provide some simple debug code in the form of a kernel module that shows the problem? I'd like to see a real scenario in which the above problem can be triggered that is independent of the NVIDIA driver. P.
Comment 34 Larry Woodman 2012-09-05 14:16:15 EDT
In comment #1 this was stated: -------------------------------------------------------------------------------- With the 2.6.32-131.0.15 kernel, the average "Total time" is consistently about 0.75 seconds. With the 2.6.32-220 kernel, "Total time" ranges between about 2s and 12s. Most of the time (~85%) in the bad case seems to be spent in change_page_attr_set_clr(), which is called by default_send_IPI_mask_allbutself_phys(), which is called by the NVIDIA kernel driver in the path for allocating and freeing kernel memory. -------------------------------------------------------------------------------- So far we have not been able to reproduce this problem. Can someone sent the perftop output that shows ~85% of the time in change_page_attr_set_clr()? Also, can someone send the results "time gfxProfile -x 1920 -y 1080 -e GL_RGB -n 32" so I can see if the system is idle, system on user mode when this is happening? Thanks, Larry Woodman
Comment 35 Brian Nguyen 2012-09-05 21:23:10 EDT
Created attachment 610140 [details] perf/time logs of gfxProfile on -131 kernel vs. -214 kernel
Comment 36 Brian Nguyen 2012-09-05 21:25:00 EDT
Hi Prarit and Larry, Have you enabled hyperthreading in the Z800's BIOS? We have not been able to reproduce this regression without this setting. In response to: > Did you just use perf/oprofile? Yes, we hypothesized that IPIs might be involved here based on the results of perf. I have attached a tarball containing output from "perf report" and "time gfxProfile -x 1920 -y 1080 -e GL_RGB -n 32" for -131 and -214 kernels, using our 304.37 release driver. This tarball also contains an NVIDIA bug report with system configuration information on the -131 kernel.
Comment 37 Prarit Bhargava 2012-09-07 16:38:22 EDT
Steps 1) build ipi-time-test.ko 2) insmod ipi-time-test.ko ipi_time_runs=5000 3) ./analysis.sh > /tmp/result 4) ./stdev.sh /tmp/result Results from z800: 2.6.32-220.el6 5000 events Number of data points in "/tmp/result" = 137 Arithmetic mean (average) = 1386.040391921 Standard Deviation = 1756.661764137 Median number (middle) = 1308 2.6.32-131.0.15.el6 5000 events Number of data points in "/tmp/result" = 137 Arithmetic mean (average) = 1397.663867226 Standard Deviation = 1752.080106742 Median number (middle) = 1317 So it doesn't look like the problem is within the IPI. I'll try a few other systems to confirm. I will attach the test module and scripts after this post. P.
Comment 38 Prarit Bhargava 2012-09-07 16:39:29 EDT
Created attachment 610842 [details] Module to test IPIs on system
Comment 39 Prarit Bhargava 2012-09-07 16:40:29 EDT
Created attachment 610843 [details] Compiled ipi-time-test.ko module
Comment 40 Prarit Bhargava 2012-09-07 16:41:08 EDT
Created attachment 610844 [details] Script to analyze output from load of module
Comment 41 Prarit Bhargava 2012-09-07 16:42:02 EDT
Created attachment 610845 [details] Script to calculate median and standard deviation
Comment 42 Daniel Dadap 2012-09-07 16:44:35 EDT
Do you have the actual recorded data? That looks like a huge standard deviation in both cases...
Comment 43 Prarit Bhargava 2012-09-10 09:31:28 EDT
Created attachment 611430 [details] /var/log/messages from 131 kernel
Comment 44 Prarit Bhargava 2012-09-10 09:34:29 EDT
Created attachment 611432 [details] /var/log/messages from 220 kernel Please note: I did not keep the -220 messages which were used to compile the previously reported data. I repeated the test to get this output. Number of data points in "/tmp/result" = 139 Arithmetic mean (average) = 1437.406156306 Standard Deviation = 1773.686294053 Median number (middle) = 1361
Comment 45 Prarit Bhargava 2012-09-10 09:35:23 EDT
(In reply to comment #42) > Do you have the actual recorded data? That looks like a huge standard > deviation in both cases... Daniel, I've attached the raw /var/log/messages from runs on -131.0.15 and -220 to this BZ. See comment #43 and comment #44. P.
Comment 46 Larry Woodman 2012-09-10 13:55:17 EDT
Created attachment 611529 [details] NVIDIA, please try this patch ans see if ite related. The only change I made to arch/x86/mm/pageattr.c between 6.1 & 6.2 was here. I doubt its related but itw worth a kwik try:
Comment 47 Daniel Dadap 2012-09-11 16:21:30 EDT
Created attachment 611908 [details] Updated test module and tools We were able to reproduce the issue with a modified version of Prarit's kernel module and analysis tools. Most of the modifications were to reduce the amount of noise in the data, specifically: 1) Increased the number of IPI calls per trial 2) Reduce amount of spew from the module (I/O overhead of printk() was adding noise) 3) Omit result of first trial from analysis, as it takes longer, and skews the other results. Attaching a tarball of the modified test tools, including results from runs with the -131 and -220 kernels. The tarball also includes some graphs of the min, mean, and max times from the various runs. Note that the max time is relatively flat in the non-repro case, and fluctuates wildly in the repro case. Min and mean are pretty flat in both repro and non-repro cases. As a result, stddev in the repro case is much larger than the non-repro case. We also tested with Larry's patch applied to -220, and it didn't seem to make a difference.
Comment 48 Daniel Dadap 2012-09-11 16:23:31 EDT
Created attachment 611909 [details] max time graph with patched kernel
Comment 49 Prarit Bhargava 2012-09-19 10:37:58 EDT
(In reply to comment #47) > Created attachment 611908 [details] > Updated test module and tools > > We were able to reproduce the issue with a modified version of Prarit's > kernel module and analysis tools. Most of the modifications were to reduce > the amount of noise in the data, specifically: > > 1) Increased the number of IPI calls per trial > 2) Reduce amount of spew from the module (I/O overhead of printk() was > adding noise) > 3) Omit result of first trial from analysis, as it takes longer, and skews > the other results. > > Attaching a tarball of the modified test tools, including results from runs > with the -131 and -220 kernels. The tarball also includes some graphs of the > min, mean, and max times from the various runs. Note that the max time is > relatively flat in the non-repro case, and fluctuates wildly in the repro > case. Min and mean are pretty flat in both repro and non-repro cases. As a > result, stddev in the repro case is much larger than the non-repro case. > > We also tested with Larry's patch applied to -220, and it didn't seem to > make a difference. I'm not seeing the wild fluctuations that you are. A few questions: 1. Is the NVIDIA driver loaded when running your test? In my case I have not loaded the NVIDIA driver. 2. Could this be a difference in BIOS versions? P.
Comment 50 Prarit Bhargava 2012-09-19 10:41:30 EDT
Created attachment 614408 [details] logs from ipi test for 131.0.15 and 220 kernels Here are my test runs for -131.0.15.el6 and -220.el6. P.
Comment 51 Prarit Bhargava 2012-09-19 12:33:24 EDT
I got rather frustrated with my inability to reproduce this in anyway on the z800. I grabbed another intel-based 2-socket Nehalem-EP system and to my surprise I see the ipi affect as indicated by NVIDIA in comment #47 and comment #48. I'm trying to see if I can narrow this down to a particular change in our kernel. I will update when I find something. P.
Comment 52 Prarit Bhargava 2012-09-19 15:44:46 EDT
On the 2-socket Nehalem-EP system I tested on, adding "intel_idle.max_cstate=0" as a kernel parameter seems to resolve the strange delays in the ipi calls. I strongly suggest that the customer employ this workaround for the time being while we investigate a proper fix. P.
Comment 54 Prarit Bhargava 2012-09-19 15:49:31 EDT
Reassigning to mjg for now. mjg -- the test to reproduce this problem is in comment #47 and comment #48. You do NOT have to go all the way to plotting graphs. On "good" runs, you'll see that the mean works out to ~.04, and the standard deviation is ~.004. On bad runs you'll see the mean fluctuate wildly from anywhere between .04 - .18. I'll see that intel-knightsferry-01.lab.bos.redhat.com is loaned to you for testing. P.
Comment 56 Matthew Garrett 2012-09-20 11:24:25 EDT
Is there any objection to opening this bug up to Intel?
Comment 57 Daniel Dadap 2012-09-20 11:37:37 EDT
Prarit, it sounds like you've reproduced the bug now, but to answer your earlier question, yes, we ran all of the tests with the NVIDIA module unloaded, to be sure that the performance regression was isolated from our driver. Also, thanks for suggesting the workaround. We will forward that suggestion to the reporting customer. Matthew, NVIDIA does not have any objections to opening the bug up to Intel. I don't believe the original reporters of the issue would object, either (after all, this will ultimately help them), but I'd like to confirm with them before answering definitively on their behalf. Actually, I'll just go ahead and ask if anybody has any objections to making this bug completely public, to make it easier to share with any other parties, if necessary.
Comment 58 Matthew Garrett 2012-09-20 14:29:52 EDT
That sounds good to me, Daniel. Let me know.
Comment 59 Prarit Bhargava 2012-09-20 14:51:28 EDT
I received private email from Autodesk saying that it was okay to add Intel to this BZ. Adding email@example.com (onsite Intel engineer) and Len Brown. P.
Comment 60 Daniel Dadap 2012-09-20 17:00:06 EDT
Hi everyone, NVIDIA and Autodesk both agreed that it's fine to open up this bug report to the public. Autodesk reported that setting Prarit's recommended kernel parameter does not fully resolve the originally reported regression for them, so it's likely that the IPI performance regression is not the only factor in the regression they observed in their application's performance. We will attempt to isolate the remaining factors, but in the meantime, this issue still needs to be resolved.
Comment 61 Joseph Kachuck 2012-09-21 12:08:41 EDT
Hello, This bug is now opened to the public. Thank You Joe Kachuck
Comment 62 Matthew Garrett 2012-09-25 09:21:07 EDT
Ok, so just to clarify - th
Comment 64 Prarit Bhargava 2012-09-25 13:05:39 EDT
*** Bug 856647 has been marked as a duplicate of this bug. ***
Comment 66 Jarod Wilson 2012-10-01 13:14:23 EDT
Patch(es) available on kernel-2.6.32-315.el6
Comment 69 denis.leclerc 2012-10-01 14:38:40 EDT
Can you tell us what was fixed in kernel-2.6.32-315 that addresses this bug?
Comment 70 Prarit Bhargava 2012-10-04 08:20:55 EDT
Created attachment 621607 [details] RHEL6 patch 1/1 This patch can be applied to RHEL6.3. P.
Comment 72 denis.leclerc 2012-10-04 15:34:36 EDT
Prarit, I've tried the patch and I'm not seeing any difference, at least, not on an HP z820. Is the patch you attached complete? Are there other patches that are required as well? Denis Leclerc Autodesk
Comment 73 Joseph Kachuck 2012-10-05 08:57:24 EDT
Hello, The patch re-enables C-state auto promotion/demotion on certain classes of CPUs. Please confirm the kernel version the patch was applied to. Thank You Joe Kachuck
Comment 74 denis.leclerc 2012-10-05 09:02:11 EDT
(In reply to comment #73) > Hello, > The patch re-enables C-state auto promotion/demotion on certain classes > of CPUs. > > Please confirm the kernel version the patch was applied to. > > Thank You > Joe Kachuck I applied the patch to kernel-2.6.32-279.9.1.
Comment 75 Joseph Kachuck 2012-10-05 13:35:06 EDT
Hello, If you boot with intel_idle.max_cstate=0, does it work around the issue you are seeing? Thank You Joe Kachuck
Comment 76 denis.leclerc 2012-10-05 14:11:58 EDT
Joe, Using intel_idle.max_cstate=0 works around the issue and it doesn't. Although it reduces the time by 1/3 to 1/2 (when used with acpi=off), the glMapBuffer calls in the application still remain prohibitively long. As first reported by nVidia above, the calls to change_page_attr_set_clr() are taking way longer than they used to and there are anywhere from 1,500 to 3,000 of these calls for each glMapBuffer() call in the application. I suspect that change_page_attr_set_clr() is flushing the TLB or calling some other function which must be executed on all CPU's, so turning cstates and ACPI off will help speed things up, but it doesn't solve the problem. The real question is what has changed in the implementation of change_page_attr_set_clr() between RHEL 5.3 and 6.2. I'm currently checking what the nVidia driver does on 5.3 to which incarnation of change_page_attr_set_clr() it calls to see what the differences are. Denis Leclerc
Comment 77 denis.leclerc 2012-10-05 15:26:59 EDT
Joe, Looking at what the nVidia driver does on 5.3, it calls change_page_attr_addr() for each page. Above the change_page_attr_addr() code in the 5.3 kernel, there's a comment saying the caller should call global_flush_tlb() after calling change_page_attr_addr(), which the driver does, but only once after allocating and modifying all the pages in the buffer it's allocating. I will try to instrument the code in change_page_attr_set_clr() on 6.2 to prove that it is indeed flushing the TLB for each call and that it's the TLB flush that's taking so much time. Denis Leclerc
Comment 80 Joseph Kachuck 2013-01-24 10:39:27 EST
Hello, From looking at this bug. It should be closed when RHEL 6.4 is released. Please let me know if you do not agree? Thank You Joe Kachuck
Comment 82 errata-xmlrpc 2013-02-21 01:24:30 EST
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0496.html