Red Hat Bugzilla – Bug 581933
pci_mmcfg_init() making some of main memory uncacheable
Last modified: 2014-07-31 09:18:51 EDT
Description of problem:
A customer complained of seeing lower disk performance using bonnie++ on Dell's Precision T3500 than on the older T3400. Looking into this problem, I found that it didn't happen with 2GB of memory, but it did happen with 4GB. I narrowed down the problem to a big chunk of memory just above the 4GB boundary being marked as uncacheable.
This is happening because the MCFG ACPI table on this system has one segment with a base address at 0xF8000000, a start bus of 0, and an end bus of 63. The window needed for PCI mm config is 1MB per bus, but the RHEL5.5 kernel (2.6.18-194.el5), in pci_mmcfg_init(), is ignoring the start and end buses and calling ioremap_nocache() at the base address of 0xF8000000 with a size of 256MB. The system has main memory at 4G (physical address 0x100000000), so 128MB of memory from 0x100000000 to 0x108000000 is being made uncacheable, which really hurts performance, especially since the mem_map is put just above the 4G boundary by sparse_early_mem_map_alloc().
Using "pci=nommconf" works around this issue.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install RHEL5.5 on a system with 4GB of memory, where MCFG has a segment with base address at 0xF8000000 (such as the Dell Precision T3500)
2. Run some sort of benchmark (such as bonnie++). (Or, even better, just try copying the first 128MB of memory above the 4GB boundary and see how long it takes.)
3. Observe high CPU utilization and longer run times with bonnie++, unless you use the "pci=nommconf" kernel parameter.
Actual results:
High CPU utilization and longer test runs with bonnie++ without "pci=nommconf".

Expected results:
Results from bonnie++ that are the same as the results when "pci=nommconf" is used (or "mem=2G"), and results that are similar to other systems using the same hard drive.
It appears that the 2.6.18-194 kernel does look at the bus numbers in MCFG, but only when CONFIG_XEN is defined. This is also fixed in later upstream kernels.
During the initial investigations, all testing undertaken by Dell Support EMEA has shown no performance issues with the A02 BIOS.
The tests undertaken were made by writing data to /dev/shm rather than disk, and the performance dropped from 1-3GB/s to 70-100MB/s with A03 onwards.
Can you give some additional information as to why this is only occurring with BIOS vA03 onwards?
What changes went into A03 that could cause this problem to present?
I assume that the MCFG table (or memory map) was different previous to A03.
The problem is occurring because the kernel makes 256MB of memory uncachable, starting at the base address given in the MCFG table, regardless of whether this 256MB region overlaps main system memory or not.
Created attachment 428515 [details]
Patch that looks for mmcfg bus numbers
Does the attached patch fix the problem?
Comment on attachment 428515 [details]
Patch that looks for mmcfg bus numbers
I should have waited for my build to complete
Disregard this. I will send a fresh one..
Created attachment 429120 [details]
Consider start and end bus numbers when making MMCONFIG memory uncacheable
This patch should be able to fix the problem. Please test.
That patch fixes the problem.
It only made 0x04000000 bytes (64MB) non-cacheable on my system (from 0xf8000000 to 0xfbffffff), and, just to be sure, I ran the "bonnie++" test that found the problem originally, which showed that the problem was no longer there.
Thanks for the quick feedback. The earlier analysis really helped!
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Test kernels available here for testing
Event posted on 13-07-2010 01:33pm BST by gasmith
Dell's customer has tested the patched kernel(s) from
http://people.redhat.com/jfeeney/.bz581933/ and confirms the performance improvement.
This event sent from IssueTracker by gasmith
Original customer who identified the problem (me) also confirms that patched kernel improves performance.
I have gotten pushback on fixing this by changing the ACPI defaults on legacy systems, so the new fix instead addresses the issue via a kernel boot parameter.
Please test using test kernel found here http://people.redhat.com/jfeeney/.bz581933v2/
This latest fix is not working.
Using a Dell T3500 workstation with 4GB RAM.
bonnie++ -f (run in /tmp)
Stock RHEL5.4 kernel:
kernel /vmlinuz-2.6.18-164.el5 swiotlb=65536 ro root=LABEL=/1 rhgb quiet
Time: 12m 30s
Stock RHEL5.4 kernel with PCI workaround:
kernel /vmlinuz-2.6.18-164.el5 swiotlb=65536 ro root=LABEL=/1 rhgb quiet pci=nommconf
Time: 6m 27s
First proposed fix
kernel /vmlinuz-2.6.18-204.el5BZ581933 swiotlb=65536 ro root=LABEL=/1 rhgb quiet
Time: 6m 24s
Second proposed fix:
kernel /vmlinuz-2.6.18-211.el5BZ581933v2 swiotlb=65536 ro root=LABEL=/1 rhgb quiet use_acpi_mcfg_max_pci_bus_num=1
Time: 12m 40s Second test: 13m 12s
This kernel is not good.
Thanks for testing.
Can you upload the dmesg logs from the 1st kernel fix and the 2nd kernel fix?
LOL ;) Never mind comment#16..
Boot with "acpi_mcfg_max_pci_bus_num=1" and test.
use_acpi_mcfg_max_pci_bus_num is the corresponding variable in the kernel
The kernel parameter to test with is "acpi_mcfg_max_pci_bus_num=on"
I confused the value in the code to the kernel parameter..
Sorry for the confusion..
OK, that worked.
kernel /vmlinuz-2.6.18-211.el5BZ581933v2 swiotlb=65536 ro root=LABEL=/1 rhgb quiet acpi_mcfg_max_pci_bus_num=on
Test time was 6m 26s.
For inclusion into the "next release" this option would need to be ON by default in the Red Hat install kernel to avoid performance issues during the install, which is where I first spotted the problem.
(In reply to comment #19)
> OK, that worked.
> kernel /vmlinuz-2.6.18-211.el5BZ581933v2 swiotlb=65536 ro root=LABEL=/1 rhgb
> quiet acpi_mcfg_max_pci_bus_num=on
> Test time was 6m 26s.
Thanks for testing.
> For inclusion into the "next release" this option would need to be ON by
> default in the Red Hat install kernel to avoid performance issues during the
> install, which is where I first spotted the problem.
I am afraid I had to make the fix a kernel parameter to preserve default behaviour on quite a few systems. I think this parameter can be added to the install kernel or scripted in kickstarts.
When booting directly off of Red Hat published CD/DVD install media, it's not straightforward to include kernel boot parameters. There is a significant install time penalty if the kernel doesn't handle the PCI/BIOS memory mapping correctly.
What I noticed is that the default kernel in Knoppix 6.0.1, and Ubuntu 9.10 both solve this problem without additional boot time kernel parameters. It is my opinion that the Red Hat kernel should also have this behavior.
Requiring a boot time parameter makes this no better than our current workaround.
Adding an additional boot param when installing from media is quite trivial, you simply hit tab when you see isolinux, and put it in at the boot: prompt. And if speed of install time is really that important, you shouldn't be installing from CD/DVD anyway, you should set up a pxe network install server (which can automatically apply the parameter for you as well).
What you're essentially saying is that you think it's a good idea to risk breaking other people's machines so that yours installs faster without having to add a single boot param. Not even "installs" w/o a param, but "installs faster". It would be one thing if your system panicked on install, but if it still installs just fine, and there's a boot arg to increase performance, there's not a whole lot of justification for changing the default behavior, given that it's entirely possible changing it will result in other people having systems that no longer function properly.
We're rather against knowingly introducing possible regressions, particularly this late in the product life cycle. If this patch is going in, it's going in with the required additional boot parameter to enable it.
My hope was that the Red Hat kernel would have the same behavior as kernels found in other distributions. That they would perform well when booted using the default kernel boot options. Nothing more. I wasn't implying that it was a good idea to risk breaking anything.
Dell may have stronger opinions about how the kernel should support the system in question. I'm just the end user.
I agree, it would be a lot nicer if it worked without a kernel parameter. I just worked on another issue two days ago--another customer evaluating Dell systems complained about disk performance, and it turned out to be this issue.
I would be surprised if the no-kernel-parameter patch would cause regressions on other systems... barring a bug in the patch, it seems like the only way this could cause regressions is if a system didn't specify the right number of PCI buses in its ACPI MCFG table. Of course, I guess that's possible.
Basically, what it boils down to is this: RHEL5 was released nearly 4 years ago, and is running on plenty of hardware at least that old, as well as any number of systems released since then. We've definitely seen plenty of busted BIOSes, ACPI tables, etc., and while those might not be our fault, we certainly get the blame if a system that has been working just fine for 4 years suddenly stops functioning properly.
With RHEL5 being in its 6th update release now, we're just very very sensitive about the possibility of regressions. The other distributions mentioned aren't quite as sensitive to this as RHEL5 is, and are based on much newer upstream kernel versions, not based on a 4 year old kernel with a stable guarantee. :)
On the bright side, we should be performant out of the box with RHEL6...
What if we add a warning message if the kernel detects that it is marking more memory uncacheable than the MCFG table says it needs (based on the number of PCI busses)? Or only put the warning message when that extra uncacheable memory goes above 4GB (or if it sees that it is marking real memory uncacheable based on the memory maps)?
Warning: pci_mmcfg_init marking 256MB memory space uncacheable, but MCFG table only requires 128MB, which may result in lower performance. Please use kernel parameter "acpi_mcfg_max_pci_bus_num=on".
I'm amenable to that. In fact, it would be a rather nice addition to round out this patch, as otherwise, folks may never stumble onto that kernel parameter on their own.
(In reply to comment #27)
> I'm amenable to that. In fact, it would be a rather nice addition to round out
> this patch, as otherwise, folks may never stumble onto that kernel parameter on
> their own.
Great.. I have a patch brewing right now to do just that.
Thanks folks.. this is good.
Created attachment 440018 [details]
Preserves default behaviour but spits KERN_WARN message
Created attachment 440020 [details]
Preserves default behaviour but spits KERN_WARN message
This one should compile (:
Test kernels available for testing here.
Formatting of the warning message is a little sketchy, and has some typos...
Aug 23 13:22:16 linux3 kernel: Freeing initrd memory: 2575k freed
Aug 23 13:22:16 linux3 kernel: NET: Registered protocol family 16
Aug 23 13:22:16 linux3 kernel: ACPI: bus type pci registered
Aug 23 13:22:16 linux3 kernel: pci_mmcfg_init marking 256MB memory space uncacheable, bu MCFG table only requires 64MB. This may result in lower performance. Try using kernel parameter "acpi_mcfg_max_pci_bus_num=on"<6>PCI: Using MMCONFIG at f8000000
Aug 23 13:22:16 linux3 kernel: ACPI: Interpreter enabled
Aug 23 13:22:16 linux3 kernel: ACPI: Using IOAPIC for interrupt routing
Aug 23 13:22:16 linux3 kernel: ACPI: No dock devices found.
Aug 23 13:22:16 linux3 kernel: ACPI: PCI Root Bridge [PCI0] (0000:00)
It would be nice if the message text actually included the word "WARNING".
I'm not sure it's intuitive to look in /var/log/messages for hints on how to correct performance issues. It's not the first place I look...
Thanks for the feedback on the formatting and the "WARNING" keyword. I will have those corrected.
I think it is common practice to look at /var/log/messages for hints on any failures/errors etc.
At best the error log level for the message could be elevated to be displayed on the console. But that would need a discussion first..
"I think it is common practice to look at /var/log/messages for hints on any failures/errors etc."
I won't dispute the above comment, when looking for failures or errors. However, this problem doesn't show up as an error or outright failure, it's just a "slow" machine.
I would vote for setting the error level so that the message hits the console, even in /quiet mode. This makes it very easy for the end user to notice, and correct.
Nice warning messages at boot, even in /quiet mode. Tells me _exactly_ what I need to do.
I'm satisfied with this compromise solution. It will save many Sys Admins hours of work trying to nail down why the performance sucks.
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
I note that this message showed up on the console at boot:
Warning: pci_mmcfg_init marking 256MB space uncacheable.
A search of /var/log/messages reveals the suggested fix:
Sep 21 16:16:07 linux3 kernel: Warning: pci_mmcfg_init marking 256MB space uncacheable.
Sep 21 16:16:07 linux3 kernel: MCFG table requires 64MB uncacheable only. Try booting with acpi_mcfg_max_pci_bus_num=on
Sep 21 16:16:07 linux3 kernel: PCI: Using MMCONFIG at f8000000
Disk I/O benchmark continues to run at improved speed when kernel is installed with the acpi_mcfg_max_pci_bus_num=on option on the command line.
I got a time of 6m 09s, which is the best time recorded so far.
May I suggest that the kernel boot parameter acpi_mcfg_max_pci_bus_num be turned on automatically when the kernel detects during boot that it is marking extra memory uncacheable? I believe customers won't be aware of /var/log/messages before the problem appears to them.
Please see Comment #19 to Comment #27 for the decision history on this.
BTW, 32bit kernel also has the same issue. I don't see the 32bit kernel fixes in the proposed patch. Thanks.
No hardware available (4G memory and Dell Precision T3500/T5500/T7500 systems). The fix has been verified by Dell and the customer as well.
I confirmed patch linux-2.6-pci-fix-pci_mmcfg_init-making-some-memory-uncacheable.patch is applied in kernel 2.6.18-233.el5 correctly.
*** Bug 646623 has been marked as a duplicate of this bug. ***
With regard to the 32bit fix, since the x86_64 patch has been committed to RHEL-5.6 (see comment #38), I would imagine the 32bit fix would require its own bz at this point. Perhaps this title should be modified to reflect x86_64 only. Just my $.02.
(In reply to comment #49)
> With regard to the 32bit fix, since the x86_64 patch has been committed to
> RHEL-5.6 (see comment #38), I would imagine the 32bit fix would require its own
> bz at this point. Perhaps this title should be modified to reflect x86_64 only.
> Just my $.02.
With where we're at in the 5.6 devel cycle, there's little chance 32-bit support would make it into the release, so it should definitely be cloned to a new bug to be tackled next release.
Cloned bug #666305 for 32bit fix tracking.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
We are still seeing the same issues on the CentOS mailing list for at least HP Proliant DL380 servers (models G6 AND G7) running x86_64 and using the 2.6.18-238.9.1.el5 kernel.
So either there has been a regression or the fix did not work for DL380 models G6 and G7.
Did booting with kernel parameter acpi_mcfg_max_pci_bus_num=on not fix the issue ?
Dell Onsite Engineer
(In reply to comment #54)
> We are still seeing the same issues on the CentOS mailing list for at least HP
> Proliant DL380 servers (models G6 AND G7) running x86_64 and using the
> 2.6.18-238.9.1.el5 kernel.
> So either there has been a regression or the fix did not work for DL380 models
> G6 and G7.
Yeah, um, what Shyam said. The thread seems to suggest that the work-around parameter is doing exactly what's expected here.
Oh, if that is the fix then it is working. I mistakenly thought that the fix did away with the need to manually add the kernel parameter.
(In reply to comment #57)
> Oh, if that is the fix then it is working. I mistakenly thought that the fix
> did away with the need to manually add the kernel parameter.
Yeah, see comment #14, there were concerns about altering behavior on legacy systems already deployed in the field at this stage in RHEL5's life, so the work-around parameter is the way to go.
*** Bug 713326 has been marked as a duplicate of this bug. ***
Sorry for noise in this old bz, yet one of our partners just hit this and we noticed that
a) the performance influence can be in very different areas (since it is not clear which processes or data are uncached)
b) we should do more to warn users about the situation
Technical question from our partner to improve this, any input welcome:
Why can this region be used as generic RAM if the kernel thinks it should be reserved for MMCONFIG?
With "acpi_mcfg_max_pci_bus_num=on" the kernel reserves exactly the space specified by the BIOS. Without it, it always "reserves" 256MB, even if the BIOS specifies less. (256MB is the max value for a system with a single PCIe root - max 256 buses on one root, 1MB space per bus).
But it doesn't actually "reserve" it (meaning that the memory can't be used as generic RAM), it seems to just disable caching. This leads to the situation that ordinary RAM becomes uncacheable.
And again, that makes no sense - if the mmconfig area actually extended over the whole range, using it as ordinary RAM would be extremely dangerous, to the point of damaging hardware by writing arbitrary values to PCI config space.
Note that the KB article only talks about page cache. But this RAM can also be used for other things, such as kernel data structures. It seems that exactly this happened in our case. Therefore not only page cache access for certain files but almost everything was slowed down drastically (it seems that the kernel had placed some often-accessed data structures in this area by chance).
When "acpi_mcfg_max_pci_bus_num=on" is used: wouldn't it be reasonable to at least prevent the reserved region from extending past the 4GB boundary?
To our best knowledge here (talked to our BIOS guys), PCI config space can't be mapped above 4GB at all.