Bug 581933 - pci_mmcfg_init() making some of main memory uncacheable
Summary: pci_mmcfg_init() making some of main memory uncacheable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Shyam Iyer
QA Contact: Eryu Guan
URL:
Whiteboard:
Duplicates: 646623 713326
Depends On:
Blocks: 563345 646623
 
Reported: 2010-04-13 16:20 UTC by Stuart Hayes
Modified: 2018-11-30 20:42 UTC
22 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 666305
Environment:
Last Closed: 2011-01-13 21:26:17 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Patch that looks for mmcfg bus numbers (1.33 KB, patch)
2010-07-01 18:04 UTC, Shyam Iyer
no flags
Consider start and ending bus numbers to make MMCONFIG memory uncacheable (1.99 KB, patch)
2010-07-02 17:24 UTC, Shyam Iyer
no flags
Preserves default behaviour but spits KERN_WARN message (4.41 KB, patch)
2010-08-20 19:14 UTC, Shyam Iyer
no flags
Preserves default behaviour but spits KERN_WARN message (2.49 KB, patch)
2010-08-20 19:21 UTC, Shyam Iyer
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Stuart Hayes 2010-04-13 16:20:28 UTC
Description of problem:

A customer complained of seeing lower disk performance using bonnie++ on Dell's Precision T3500 than on the older T3400.  Looking into this problem, I found that it didn't happen with 2GB of memory, but it did happen with 4GB.  I narrowed down the problem to a big chunk of memory just above the 4GB boundary being marked as uncacheable.

This is happening because the MCFG ACPI table on this system has one segment with a base address at 0xF8000000, start bus of 0, and an end bus of 63.  The window needed for PCI mm config is 1MB per bus, but the RHEL5.5 kernel (2.6.18-194.el5), in pci_mmcfg_init(), is ignoring the start and end busses and calling ioremap_nocache() at the base address of 0xF8000000 and a size of 256MB.  The system has main memory at 4G (physical address 0x100000000), so 128MB of memory from 0x100000000 to 0x108000000 is being made uncacheable, which really hurts performance, especially since the mem_map is put just above the 4G boundary by sparse_early_mem_map_alloc().
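To illustrate the arithmetic (a minimal userspace sketch, not the kernel code; the bus fields mirror the ACPI MCFG per-segment entry):

/* Sketch only: shows the mapping-size arithmetic described above. */
#include <stdint.h>
#include <stdio.h>

#define MMCFG_BYTES_PER_BUS (1ULL << 20)  /* 1MB of config space per bus */

int main(void)
{
    uint64_t base = 0xF8000000ULL;        /* MCFG base on this system */
    unsigned start_bus = 0, end_bus = 63; /* from the MCFG segment */

    uint64_t needed = (uint64_t)(end_bus - start_bus + 1) * MMCFG_BYTES_PER_BUS;
    uint64_t mapped = 256 * MMCFG_BYTES_PER_BUS; /* what pci_mmcfg_init() maps */

    /* needed ends at 0xFC000000; mapped ends at 0x108000000, i.e.
       128MB past the 4GB boundary where main memory lives. */
    printf("needed %lluMB, ends at 0x%llx\n",
           (unsigned long long)(needed >> 20),
           (unsigned long long)(base + needed));
    printf("mapped %lluMB, ends at 0x%llx\n",
           (unsigned long long)(mapped >> 20),
           (unsigned long long)(base + mapped));
    return 0;
}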

Using "pci=nommconf" works around this issue.


Version-Release number of selected component (if applicable):

2.6.18-194.el5


How reproducible:

Every time.


Steps to Reproduce:
1. Install RHEL5.5 on a system with 4GB of memory, where MCFG has a segment with base address at 0xF8000000 (such as the Dell Precision T3500)
2. Run some sort of benchmark (such as bonnie++).  (Or, even better, just try copying the first 128MB of memory above the 4GB boundary and see how long it takes.)
3. Observe high CPU utilization and longer run times with bonnie++, unless you use the "pci=nommconf" kernel parameter.

  
Actual results:
High CPU utilization and longer test runs with bonnie++ without "pci=nommconf".


Expected results:
Results from bonnie++ that are the same as the results when "pci=nommconf" is used (or "mem=2G"), and results that are similar to other systems using the same hard drive.



Additional info:

It appears that the 2.6.18-194 kernel does look at the bus numbers in MCFG, but only when CONFIG_XEN is defined; later upstream kernels fix the non-Xen path as well.
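To paraphrase the shape of that logic as a compilable sketch (not the literal -194 source):

/* Paraphrase of the behaviour described above, not the literal kernel
 * source: the bus-range-aware size is only computed under CONFIG_XEN. */
#include <stdio.h>

#define MB (1UL << 20)

static unsigned long mmcfg_map_size(unsigned start_bus, unsigned end_bus)
{
#ifdef CONFIG_XEN
    return (end_bus - start_bus + 1) * MB; /* honours the MCFG bus range */
#else
    return 256 * MB;                       /* unconditional 256MB map */
#endif
}

int main(void)
{
    printf("%luMB\n", mmcfg_map_size(0, 63) >> 20);
    return 0;
}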

Comment 1 Gary Smith 2010-05-26 08:52:12 UTC
Hi Stuart


During the initial investigations, all testing undertaken by Dell Support EMEA showed no performance issues with the A02 BIOS.

The tests were done by writing data to /dev/shm rather than to disk; performance dropped from 1-3GB/s to 70-100MB/s with A03 onwards.

Can you give some additional information as to why this is only occurring with BIOS vA03 onwards?

What changes went into A03 that could cause this problem to present?


Regards, Gary

Comment 2 Stuart Hayes 2010-05-26 15:41:37 UTC
I assume that the MCFG table (or memory map) was different prior to A03.

The problem is occurring because the kernel makes 256MB of memory uncacheable, starting at the base address given in the MCFG table, regardless of whether that 256MB region overlaps main system memory.

Comment 5 Shyam Iyer 2010-07-01 18:04:51 UTC
Created attachment 428515 [details]
Patch that looks for mmcfg bus numbers

Hi Stuart,

Does the attached patch fix the problem?

Thanks,
Shyam

Comment 6 Shyam Iyer 2010-07-01 18:52:54 UTC
Comment on attachment 428515 [details]
Patch that looks for mmcfg bus numbers

I should have waited for my build to complete.
Disregard this; I will send a fresh one.

Comment 7 Shyam Iyer 2010-07-02 17:24:47 UTC
Created attachment 429120 [details]
Consider start and ending bus numbers to make MMCONFIG memory uncacheable

Stuart,

This patch should fix the problem. Please test.

Thanks,
Shyam

Comment 8 Stuart Hayes 2010-07-02 20:57:20 UTC
Shyam,

That patch fixes the problem.

It only made 0x04000000 bytes (64MB) uncacheable on my system (from 0xf8000000 to 0xfbffffff), and, just to be sure, I ran the bonnie++ test that originally exposed the problem, which confirmed the problem is gone.

Thanks!

Comment 9 Shyam Iyer 2010-07-02 21:06:18 UTC
Thanks for the quick feedback. The earlier analysis really helped!

Comment 10 RHEL Program Management 2010-07-08 15:48:23 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 11 Shyam Iyer 2010-07-09 15:33:13 UTC
Test kernels available here for testing

http://people.redhat.com/jfeeney/.bz581933

Comment 12 Issue Tracker 2010-07-13 12:33:14 UTC
Event posted on 13-07-2010 01:33pm BST by gasmith


Dell's customer has tested the patched kernel(s) from
http://people.redhat.com/jfeeney/.bz581933/ and confirms the performance
has improved.


Regards, Gary


This event sent from IssueTracker by gasmith 
 issue 667243

Comment 13 Ray Frush 2010-08-10 23:04:27 UTC
Original customer who identified the problem (me) also confirms that patched kernel improves performance.

Comment 14 Shyam Iyer 2010-08-13 18:24:22 UTC
I have gotten pushback on fixing this via the ACPI defaults, out of concern for legacy systems.


So the new fix I have enables the corrected behaviour only when the kernel is booted with the parameter

"use_acpi_mcfg_max_pci_bus_num=1"

Please test using the test kernel found here: http://people.redhat.com/jfeeney/.bz581933v2/

Comment 15 Ray Frush 2010-08-13 20:50:37 UTC
This latest fix is not working.


Test results:

Using a Dell T3500 workstation with 4GB RAM.
Simple test:
    bonnie++ -f  (run in /tmp)

Stock RHEL5.4 kernel:
kernel /vmlinuz-2.6.18-164.el5 swiotlb=65536 ro root=LABEL=/1 rhgb quiet

Time: 12m 30s

Stock RHEL5.4 kernel with PCI workaround:
kernel /vmlinuz-2.6.18-164.el5 swiotlb=65536 ro root=LABEL=/1 rhgb quiet pci=nommconf

Time: 6m 27s

First proposed fix:
kernel /vmlinuz-2.6.18-204.el5BZ581933 swiotlb=65536 ro root=LABEL=/1 rhgb quiet

Time: 6m 24s


Second proposed fix:
kernel /vmlinuz-2.6.18-211.el5BZ581933v2 swiotlb=65536 ro root=LABEL=/1 rhgb quiet use_acpi_mcfg_max_pci_bus_num=1

Time: 12m 40s   Second test: 13m 12s


This kernel is not good.

Comment 16 Shyam Iyer 2010-08-13 21:57:52 UTC
Thanks for testing.

Can you upload the dmesg logs from the 1st kernel fix and the 2nd kernel fix?

Thanks,
Shyam

Comment 17 Shyam Iyer 2010-08-14 00:00:02 UTC
LOL ;) Never mind comment #16.

Boot with "acpi_mcfg_max_pci_bus_num=1" and test.

use_acpi_mcfg_max_pci_bus_num is the corresponding variable in the kernel.

Thanks,
Shyam

Comment 18 Shyam Iyer 2010-08-15 00:14:01 UTC
Argh... 

The kernel parameter to test with is "acpi_mcfg_max_pci_bus_num=on".

I confused the variable name in the code with the kernel parameter.
Sorry for the confusion.
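For reference, the usual plumbing that ties a boot string like this to its kernel variable looks roughly like the sketch below; the actual patch may differ in detail.

/* Sketch of the typical __setup() plumbing; the actual patch may
 * differ. "acpi_mcfg_max_pci_bus_num=on" sets the kernel variable
 * use_acpi_mcfg_max_pci_bus_num mentioned in comment #17. */
#include <linux/init.h>
#include <linux/string.h>

static int use_acpi_mcfg_max_pci_bus_num;

static int __init set_acpi_mcfg_max_pci_bus_num(char *str)
{
	if (str && !strcmp(str, "on"))
		use_acpi_mcfg_max_pci_bus_num = 1;
	return 1;
}
__setup("acpi_mcfg_max_pci_bus_num=", set_acpi_mcfg_max_pci_bus_num);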

Comment 19 Ray Frush 2010-08-16 16:14:13 UTC
OK, that worked.
kernel /vmlinuz-2.6.18-211.el5BZ581933v2 swiotlb=65536 ro root=LABEL=/1 rhgb quiet acpi_mcfg_max_pci_bus_num=on

Test time was 6m 26s.

For inclusion into the "next release" this option would need to be ON by default in the Red Hat install kernel to avoid performance issues during the install, which is where I first spotted the problem.

Comment 20 Shyam Iyer 2010-08-18 14:57:23 UTC
(In reply to comment #19)
> OK, that worked.
> kernel /vmlinuz-2.6.18-211.el5BZ581933v2 swiotlb=65536 ro root=LABEL=/1 rhgb
> quiet acpi_mcfg_max_pci_bus_num=on
> 
> Test time was 6m 26s.
> 
Thanks for testing.

> For inclusion into the "next release" this option would need to be ON by
> default in the Red Hat install kernel to avoid performance issues during the
> install, which is where I first spotted the problem.

I am afraid I had to make the fix a kernel parameter to preserve the default behaviour on quite a few systems. I think the parameter can be added to the install kernel or scripted in kickstarts, as sketched below.

Thoughts?
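A hypothetical kickstart fragment along these lines would persist the workaround into the installed boot loader (syntax per the RHEL5 kickstart "bootloader" directive):

# Hypothetical kickstart fragment: carries the workaround onto the
# installed system's boot loader configuration.
bootloader --location=mbr --append="acpi_mcfg_max_pci_bus_num=on"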

Comment 21 Ray Frush 2010-08-18 15:32:29 UTC
When booting directly off Red Hat's published CD/DVD install media, it's not straightforward to include kernel boot parameters. There is a significant install-time penalty if the kernel doesn't handle the PCI/BIOS memory mapping correctly.

What I noticed is that the default kernels in Knoppix 6.0.1 and Ubuntu 9.10 both solve this problem without additional boot-time kernel parameters. It is my opinion that the Red Hat kernel should behave the same way.

Requiring a boot time parameter makes this no better than our current workaround.

Comment 22 Jarod Wilson 2010-08-18 19:17:02 UTC
Adding an additional boot param when installing from media is quite trivial: you simply hit Tab at the isolinux screen and type it in at the boot: prompt. And if install speed is really that important, you shouldn't be installing from CD/DVD anyway; you should set up a PXE network install server (which can automatically apply the parameter for you as well).

What you're essentially saying is that you think it's a good idea to risk breaking other people's machines so that yours installs faster without having to add a single boot param. Not even "installs" w/o a param, but "installs faster". It would be one thing if your system panicked on install, but if it still installs just fine, and there's a boot arg to increase performance, there's not a whole lot of justification for changing the default behavior, given that it's entirely possible changing it will result in other people having systems that no longer function properly.

We're rather against knowingly introducing possible regressions, particularly this late in the product life cycle. If this patch is going in, it's going in with the required additional boot parameter to enable it.

Comment 23 Ray Frush 2010-08-19 15:41:11 UTC
My hope was that the Red Hat kernel would have the same behavior as kernels found in other distributions.   That they would perform well when booted using the default kernel boot options.  Nothing more.   I wasn't implying that it was a good idea to risk breaking anything.

Dell may have stronger opinions about how the kernel should support the system in question.  I'm just the end user.

Comment 24 Stuart Hayes 2010-08-19 16:08:14 UTC
I agree, it would be a lot nicer if it worked without a kernel parameter. I worked on another issue just two days ago: another customer evaluating Dell systems complained about disk performance, and it turned out to be this issue.

I would be surprised if the no-kernel-parameter patch caused regressions on other systems... barring a bug in the patch, the only way it could cause regressions is if a system didn't specify the right number of PCI busses in its ACPI MCFG table. Of course, I guess that's possible.

Comment 25 Jarod Wilson 2010-08-20 15:35:06 UTC
Basically, what it boils down to is this: RHEL5 was released nearly 4 years ago, and is running on plenty of hardware at least that old, as well as any number of systems released since then. We've definitely seen plenty of busted bios, acpi tables, etc., and while those might not be our fault, we certainly get the blame if a system that has been working just fine for 4 years suddenly stops functioning properly.

With RHEL5 being in its 6th update release now, we're just very very sensitive about the possibility of regressions. The other distributions mentioned aren't quite as sensitive to this as RHEL5 is, and are based on much newer upstream kernel versions, not based on a 4 year old kernel with a stable guarantee. :)

On the bright side, we should be performant out of the box with RHEL6...

Comment 26 Stuart Hayes 2010-08-20 17:41:21 UTC
What if we add a warning message when the kernel detects that it is marking more memory uncacheable than the MCFG table says it needs (based on the number of PCI busses)? Or only print the warning when the extra uncacheable memory extends above 4GB (or when the memory maps show that real memory is being marked uncacheable)?

Something like...

Warning:  pci_mmcfg_init marking 256MB memory space uncacheable, but MCFG table only requires 128MB, which may result in lower performance.  Please use kernel parameter "acpi_mcfg_max_pci_bus_num=on".
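A sketch of the proposed check (the MCFG field names here are assumed, not taken from the actual patch; wording per the draft message above):

/* Sketch of the proposed warning; bus-number field names assumed. */
#include <linux/kernel.h>

static void warn_if_overmapped(unsigned int start_bus, unsigned int end_bus)
{
	unsigned long needed_mb = end_bus - start_bus + 1; /* 1MB per bus */
	unsigned long mapped_mb = 256;                     /* current default */

	if (mapped_mb > needed_mb)
		printk(KERN_WARNING
		       "pci_mmcfg_init marking %luMB memory space uncacheable, "
		       "but MCFG table only requires %luMB; this may result in "
		       "lower performance. Try \"acpi_mcfg_max_pci_bus_num=on\"\n",
		       mapped_mb, needed_mb);
}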

Comment 27 Jarod Wilson 2010-08-20 18:05:10 UTC
I'm amenable to that. In fact, it would be a rather nice addition to round out this patch, as otherwise, folks may never stumble onto that kernel parameter on their own.

Comment 28 Shyam Iyer 2010-08-20 18:12:22 UTC
(In reply to comment #27)
> I'm amenable to that. In fact, it would be a rather nice addition to round out
> this patch, as otherwise, folks may never stumble onto that kernel parameter on
> their own.

Great. I have a patch brewing right now to do just that.

Thanks, folks; this is good.

Comment 29 Shyam Iyer 2010-08-20 19:14:52 UTC
Created attachment 440018 [details]
Preserves default behaviour but spits KERN_WARN message

Comment 30 Shyam Iyer 2010-08-20 19:21:53 UTC
Created attachment 440020 [details]
Preserves default behaviour but spits KERN_WARN message

This one should compile (:

Comment 31 Shyam Iyer 2010-08-23 17:55:57 UTC
Test kernels available for testing here.

http://people.redhat.com/jfeeney/.bz581933

Comment 32 Ray Frush 2010-08-23 19:31:03 UTC
Formatting of the warning message is a little sketchy, and has some typos... 

Aug 23 13:22:16 linux3 kernel: Freeing initrd memory: 2575k freed
Aug 23 13:22:16 linux3 kernel: NET: Registered protocol family 16
Aug 23 13:22:16 linux3 kernel: ACPI: bus type pci registered
Aug 23 13:22:16 linux3 kernel: pci_mmcfg_init marking 256MB memory space                                uncacheable, bu  MCFG table only requires                               64MB. This may result in lower performance.                             Try using kernel parameter                              "acpi_mcfg_max_pci_bus_num=on"<6>PCI: Using MMCONFIG at f8000000
Aug 23 13:22:16 linux3 kernel: ACPI: Interpreter enabled
Aug 23 13:22:16 linux3 kernel: ACPI: Using IOAPIC for interrupt routing
Aug 23 13:22:16 linux3 kernel: ACPI: No dock devices found.
Aug 23 13:22:16 linux3 kernel: ACPI: PCI Root Bridge [PCI0] (0000:00)


It would be nice if the message text actually included the word "WARNING".

I'm not sure it's intuitive to look in /var/log/messages for hints on how to correct performance issues.  It's not the first place I look...

Comment 33 Shyam Iyer 2010-08-23 19:52:58 UTC
Thanks for the feedback on the formatting and the "WARNING" keyword. I will have those corrected.

I think it is common practice to look at /var/log/messages for hints on any failures/errors, etc.

At best, the message's log level could be elevated so that it is displayed on the console, but that would need a discussion first.

Comment 34 Ray Frush 2010-08-23 20:01:52 UTC
"I think it is common practice to look at /var/log/messages for hints on any
failures/errors etc."

I won't dispute the above comment when looking for failures or errors. However, this problem doesn't show up as an error or outright failure; it's just a "slow" machine.

I would vote for setting the error level so that the message hits the console, even in /quiet mode. This makes it very easy for the end user to notice and correct.

Comment 35 Shyam Iyer 2010-08-25 16:26:47 UTC
Test kernels available for testing here.

http://people.redhat.com/jfeeney/.bz581933

Comment 36 Ray Frush 2010-08-25 21:37:11 UTC
Nice warning messages at boot, even in /quiet mode.  Tells me _exactly_ what I need to do.

I'm satisfied with this compromise solution.   It will save many Sys Admins hours of work trying to nail down why the performance sucks.

Thanks.

Comment 38 Jarod Wilson 2010-09-21 21:00:07 UTC
in kernel-2.6.18-223.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 40 Ray Frush 2010-09-21 22:34:51 UTC
I note that this message showed up on the console at boot:

Warning: pci_mmcfg_init marking 256MB space uncacheable.


A search of /var/log/messages reveals the suggested fix:

Sep 21 16:16:07 linux3 kernel: Warning: pci_mmcfg_init marking 256MB space uncacheable.
Sep 21 16:16:07 linux3 kernel: MCFG table requires 64MB uncacheable only. Try booting with acpi_mcfg_max_pci_bus_num=on
Sep 21 16:16:07 linux3 kernel: PCI: Using MMCONFIG at f8000000


Disk I/O benchmark continues to run at improved speed when kernel is installed with the acpi_mcfg_max_pci_bus_num=on option on the command line.

I got a time of 6m 09s, which is the best time recorded so far.


Nice work.

Comment 41 Jane Lv 2010-11-11 21:34:14 UTC
Shyam,

May I suggest that the kernel boot parameter acpi_mcfg_max_pci_bus_num be turned on automatically when the kernel detects during boot that it is marking extra memory uncacheable? I believe customers won't be aware of /var/log/messages until a problem becomes visible to them.

Thanks.

Comment 42 Shyam Iyer 2010-11-11 21:54:18 UTC
Jane, 

Please see Comment #19 to Comment #27 for the decision history on this.

-Shyam

Comment 43 Suresh Siddha 2010-11-11 22:16:01 UTC
BTW, the 32-bit kernel also has the same issue. I don't see the 32-bit kernel fixes in the proposed patch. Thanks.

Comment 46 Eryu Guan 2010-12-02 05:40:42 UTC
No hardware available (4GB memory and Dell Precision T3500/T5500/T7500 systems). The fix has been verified by Dell and the customer as well.
I confirmed that patch linux-2.6-pci-fix-pci_mmcfg_init-making-some-memory-uncacheable.patch is applied correctly in kernel 2.6.18-233.el5.

Comment 47 Ronald Pacheco 2010-12-03 01:03:15 UTC
*** Bug 646623 has been marked as a duplicate of this bug. ***

Comment 49 John Feeney 2010-12-08 20:25:16 UTC
With regard to the 32bit fix, since the x86_64 patch has been committed to RHEL-5.6 (see comment #38), I would imagine the 32bit fix would require its own bz at this point. Perhaps this title should be modified to reflect x86_64 only. Just my $.02.

Comment 50 Jarod Wilson 2010-12-08 21:02:04 UTC
(In reply to comment #49)
> With regard to the 32bit fix, since the x86_64 patch has been committed to
> RHEL-5.6 (see comment #38), I would imagine the 32bit fix would require its own
> bz at this point. Perhaps this title should be modified to reflect x86_64 only.
> Just my $.02.

With where we're at in the 5.6 devel cycle, there's little chance 32-bit support would make it into the release, so it should definitely be cloned to a new bug to be tackled in the next release.

Comment 51 Jane Lv 2010-12-30 03:22:41 UTC
Cloned bug #666305 for 32bit fix tracking.

Comment 53 errata-xmlrpc 2011-01-13 21:26:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

Comment 54 Johnny Hughes 2011-04-25 14:33:09 UTC
We are still seeing the same issues on the CentOS mailing list for at least HP Proliant DL380 servers (models G6 AND G7) running x86_64 and using the 2.6.18-238.9.1.el5 kernel.

So either there has been a regression or the fix did not work for DL380 models G6 and G7.

http://lists.centos.org/pipermail/centos/2011-April/110718.html

Comment 55 Shyam Iyer 2011-04-25 14:43:35 UTC
Did booting with the kernel parameter acpi_mcfg_max_pci_bus_num=on not fix the issue?

-Shyam
Dell Onsite Engineer

Comment 56 Jarod Wilson 2011-04-25 16:56:36 UTC
(In reply to comment #54)
> We are still seeing the same issues on the CentOS mailing list for at least HP
> Proliant DL380 servers (models G6 AND G7) running x86_64 and using the
> 2.6.18-238.9.1.el5 kernel.
> 
> So either there has been a regression or the fix did not work for DL380 models
> G6 and G7.
> 
> http://lists.centos.org/pipermail/centos/2011-April/110718.html

Yeah, um, what Shyam said. The thread seems to suggest that the work-around parameter is doing exactly what's expected here.

Comment 57 Johnny Hughes 2011-04-25 21:23:51 UTC
Oh, if that is the fix then it is working.  I mistakenly thought that the fix did away with the need to manually add the kernel parameter.

Comment 58 Jarod Wilson 2011-04-25 21:50:08 UTC
(In reply to comment #57)
> Oh, if that is the fix then it is working.  I mistakenly thought that the fix
> did away with the need to manually add the kernel parameter.

Yeah, see comment #14: there were concerns about altering behavior on legacy systems already deployed in the field at this stage in RHEL5's life, so the work-around parameter is the way to go.

Comment 59 Mark Goodwin 2011-07-01 01:25:32 UTC
*** Bug 713326 has been marked as a duplicate of this bug. ***

Comment 60 Christian Horn 2014-07-25 04:25:57 UTC
Sorry for the noise in this old bz, but one of our partners just hit this, and we noticed that:
a) the performance impact can show up in very different areas (since it is not clear which processes or data end up uncached)
b) we should do more to warn users about the situation

Technical question from our partner to improve this, any input welcome:
-------------
Why can this region be used as generic RAM if the kernel thinks it should be reserved for MMCONFIG? 

With "acpi_mcfg_max_pci_bus_num=on" the kernel reserves exactly the space specified by the BIOS. Without it, it always "reserves" 256MB, even if the BIOS specifies less. (256MB is the max value for a system with a single PCIe root - max 256 buses on one root, 1MB space per bus).
But it doesn't actually "reserve" it in the sense of preventing the memory from being used as generic RAM; it seems to just disable caching. This leads to the situation that ordinary RAM becomes uncacheable.
And again, that makes no sense: if the mmconfig area actually extended over the whole range, using it as ordinary RAM would be extremely dangerous, to the point of damaging hardware by writing arbitrary values to PCI config space.

Note that the KB article only talks about page cache. But this RAM can also be used for other things, such as kernel data structures. It seems that exactly this happened in our case: not only page cache access for certain files but almost everything was slowed down drastically (it seems the kernel had placed some often-accessed data structures in this area by chance).
-------------
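In kernel-API terms, the distinction the partner is drawing looks roughly like this (a conceptual fragment, not the RHEL5 source):

/* Conceptual fragment, not the RHEL5 source. Claiming the window in
 * the resource tree and mapping it uncached are separate operations;
 * ioremap_nocache() only changes cache attributes, and by itself
 * neither call pulls overlapping RAM out of normal use. */
struct resource *res = request_mem_region(base, size, "PCI MMCONFIG");
void __iomem *win = ioremap_nocache(base, size);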

Comment 61 Christian Horn 2014-07-25 04:27:42 UTC
When "acpi_mcfg_max_pci_bus_num=on" is used: wouldn't it be reasonable to at least prevent the reserved region to reach out past the 4GB boundary?
To our best knowledge here (talked to our BIOS guys), PCI config space can't be mapped above 4GB at all.

