Bug 536675 - Resume fails on HP 2530p with DMAR enabled
Status: CLOSED WONTFIX
Product: Fedora
Classification: Fedora
Component: kernel
Version: 12
Hardware: All Linux
Priority: low, Severity: medium
Assigned To: David Woodhouse
QA Contact: Fedora Extras Quality Assurance
Duplicates: 539861
 
Reported: 2009-11-10 16:21 EST by Matthew Garrett
Modified: 2013-01-10 02:48 EST (History)
Last Closed: 2010-12-03 22:24:31 EST

Attachments
dmesg (75.59 KB, text/plain)
2009-11-10 17:02 EST, Matthew Garrett
Actual dmesg (60.53 KB, text/plain)
2009-11-10 17:32 EST, Matthew Garrett
dmesg with test patches (59.58 KB, text/plain)
2009-11-25 10:56 EST, David Woodhouse

Description Matthew Garrett 2009-11-10 16:21:41 EST
With DMAR enabled, the kernel crashes in resume very early (before control is passed back to C code). Passing intel_iommu=off results in everything working fine. What information is needed?
Comment 1 David Woodhouse 2009-11-10 16:49:43 EST
Argh, HP. It'll be crashing in the BIOS -- and before we turn the IOMMU back on after resume. That's kind of confusing. We do turn the IOMMU off before we suspend; I'm not sure how it can even _tell_ that it was enabled.

Can you show dmesg as it boots up?
Comment 2 Matthew Garrett 2009-11-10 17:02:47 EST
Created attachment 368954 [details]
dmesg

dmesg from boot on an HP 2530p
Comment 3 Matthew Garrett 2009-11-10 17:03:04 EST
This is an EFI boot, but the same happens using BIOS
Comment 4 David Woodhouse 2009-11-10 17:11:41 EST
dmesg with iommu enabled, please?
Comment 5 David Woodhouse 2009-11-10 17:12:26 EST
Also try booting with 'iommu=pt'.
Comment 6 David Woodhouse 2009-11-10 17:15:49 EST
Er, wait. That one _did_ have IOMMU enabled. And it did get back to C code.... and all the way back up.

What was the problem, again?
Comment 7 Matthew Garrett 2009-11-10 17:17:16 EST
Bizarre. I'm sure I didn't suspend that. Hang on, let me do this again - I've clearly screwed something up.
Comment 8 Matthew Garrett 2009-11-10 17:32:18 EST
Created attachment 368962 [details]
Actual dmesg

Attached the wrong file before
Comment 9 Matthew Garrett 2009-11-10 17:40:39 EST
Fails with iommu=pt
Comment 10 David Woodhouse 2009-11-13 19:04:25 EST
Others have seen this bug and allege that it's not actually crashing in the BIOS -- it's crashing in the kernel when it gets back to iommu_enable_translation() on resume. Can you confirm or deny that?

A register dump of the entire DMAR would be a useful thing to see, both before and immediately after suspend.

What happens if you boot with iommu=pt, and then comment out the call to init_iommu_hw() in iommu_resume(), so it seems to fail? I expect things will go south fairly shortly thereafter, but perhaps it'll come back up a little further?
Comment 11 David Woodhouse 2009-11-24 05:38:24 EST
*** Bug 539861 has been marked as a duplicate of this bug. ***
Comment 12 David Woodhouse 2009-11-24 05:42:27 EST
Would be useful to see the output with this patch applied:
http://david.woodhou.se/suspend-hack

Both with intel_iommu=igfx_off, and without. For the latter you'll probably need a serial console or USB debug cable, or maybe netconsole.
Comment 13 David Woodhouse 2009-11-25 10:56:55 EST
Created attachment 373772 [details]
dmesg with test patches

OK, I now have an HP 6930p and booted it with a variant of my test patch.

Because we have the GFX_WA config option enabled, the graphics device gets given a 1:1 mapping to all of memory. A small hack prevents the kernel from re-enabling the IOMMU which is dedicated to the graphics device on resume. And then everything works -- including the other IOMMUs.

So the problem is that the dedicated GFX IOMMU is not properly set up on resume, somehow. The register dump looks sane, but something is wrong. When we set the 'translation enable' bit in the command register and wait for that to be reflected in the status register, the status register never changes.

One for the chipset folks...
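
The handshake described above (set the translation enable bit, then wait for the status register to reflect it) can be modelled in userspace. This is only a sketch: the register offsets and bit positions follow the VT-d specification, but `struct fake_iommu`, `fake_writel()`, `fake_readl()` and `enable_translation()` are hypothetical stand-ins, not kernel code. A bounded poll is used here so that a stuck unit shows up as a timeout instead of a hang.

```c
#include <stdbool.h>
#include <stdint.h>

/* Register offsets and bit positions as defined by the VT-d specification. */
#define DMAR_GCMD_REG 0x18u       /* Global Command register */
#define DMAR_GSTS_REG 0x1cu       /* Global Status register */
#define DMA_GCMD_TE   (1u << 31)  /* Translation Enable command bit */
#define DMA_GSTS_TES  (1u << 31)  /* Translation Enable Status bit */

/* Hypothetical stand-in for one IOMMU's register window (not a kernel type). */
struct fake_iommu {
	uint32_t regs[0x40];  /* register file, indexed by byte offset / 4 */
	bool hw_responds;     /* false models the stuck graphics IOMMU */
};

static void fake_writel(struct fake_iommu *iommu, uint32_t off, uint32_t val)
{
	iommu->regs[off / 4] = val;
	/* Healthy hardware mirrors the TE command bit into the status register. */
	if (off == DMAR_GCMD_REG && iommu->hw_responds) {
		uint32_t sts = iommu->regs[DMAR_GSTS_REG / 4] & ~DMA_GSTS_TES;
		iommu->regs[DMAR_GSTS_REG / 4] = sts | (val & DMA_GCMD_TE);
	}
}

static uint32_t fake_readl(const struct fake_iommu *iommu, uint32_t off)
{
	return iommu->regs[off / 4];
}

/*
 * Set TE and poll for TES, giving up after max_polls status reads.
 * Returns true if the hardware acknowledged, false on timeout.
 */
bool enable_translation(struct fake_iommu *iommu, unsigned int max_polls)
{
	fake_writel(iommu, DMAR_GCMD_REG, DMA_GCMD_TE);
	for (unsigned int i = 0; i < max_polls; i++) {
		if (fake_readl(iommu, DMAR_GSTS_REG) & DMA_GSTS_TES)
			return true;
	}
	return false;
}
```

With `hw_responds` set to false the function times out, which corresponds to the symptom seen on the dedicated graphics IOMMU: the status register never changes.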
Comment 14 David Woodhouse 2009-11-25 13:45:26 EST
Hm. If I kexec back into the same kernel after a suspend/resume cycle, it works -- it comes back up OK, and re-initialises the hardware correctly.

Which means that it _isn't_ just that the IOMMU hardware is set up wrong. Investigating further...
Comment 15 David Woodhouse 2009-11-25 18:43:36 EST
I can even re-enable translation with a userspace hack on /dev/mem, right after the suspend/resume cycle. I note that when it fails, the screen is still turned off from the suspend. It works after the video has been initialised again.

Trying to narrow this down further right now, to prove that this observation is something other than just coincidence.
Comment 16 David Woodhouse 2009-11-25 19:51:46 EST
This excerpt from the above boot log shows the problem:

pci 0000:00:02.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
pci 0000:00:02.0: restoring config space at offset 0x8 (was 0x1, writing 0x7111)
pci 0000:00:02.0: restoring config space at offset 0x6 (was 0xc, writing 0x4000000c)
pci 0000:00:02.0: restoring config space at offset 0x4 (was 0x4, writing 0x58000004)
pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900000, writing 0x900403)

Before the contents of PCI config space for the graphics device (specifically, the BARs) are restored to their correct state, the IOMMU doesn't seem to function.
But as soon as they are restored, everything is fine.

If I put a hack into the IOMMU resume code to restore just the BARs (words #4 and #6), it all works fine.
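
To make the log excerpt easier to read, here is a small userspace sketch that decodes the values shown. It assumes the offsets in the log are dword indices (as "words #4 and #6" suggests), so the byte offset is the index times four. The bit layout is the standard PCI BAR encoding; `struct bar_info`, `decode_bar()` and `dword_to_byte_offset()` are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* The log prints a dword index; the config-space byte offset is index * 4. */
unsigned int dword_to_byte_offset(unsigned int dword_index)
{
	return dword_index * 4;
}

/* Decoded view of one 32-bit BAR register (hypothetical helper type). */
struct bar_info {
	bool is_io;         /* bit 0 set: I/O space BAR */
	bool is_64bit;      /* memory BAR with type field == 2 */
	bool prefetchable;  /* memory BAR with bit 3 set */
	uint32_t base;      /* address bits with the flag bits masked off */
};

struct bar_info decode_bar(uint32_t raw)
{
	struct bar_info info = { false, false, false, 0 };

	if (raw & 0x1u) {
		/* I/O BAR: bits 1:0 are flags, the rest is the port base. */
		info.is_io = true;
		info.base = raw & ~0x3u;
	} else {
		/* Memory BAR: bits 2:1 give the type, bit 3 is prefetchable. */
		info.is_64bit = ((raw >> 1) & 0x3u) == 0x2u;
		info.prefetchable = (raw & 0x8u) != 0;
		info.base = raw & ~0xfu;
	}
	return info;
}
```

Read this way, word #4 is the BAR at byte offset 0x10 and word #6 the BAR at 0x18. The "was" values 0x4 and 0xc decode to a base of zero with only the type flags set, while the restored values 0x58000004 and 0x4000000c carry the bases 0x58000000 and 0x40000000.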
Comment 17 David Woodhouse 2009-11-25 20:13:06 EST
Adding a call to this function into the loop in init_iommu_hw(), which happens on resume, seems to catch and fix the problem.

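/*
 * Work around a BIOS that fails to restore the BARs of the integrated
 * graphics device on resume (Cantiga chipset, seen on HP laptops).
 */
void cantiga_hp_hack(struct dmar_drhd_unit *drhd)
{
	int i;
	uint32_t mmiobar;

	for (i = 0; i < drhd->devices_cnt; i++) {
		/* Only look at the Intel (0x8086) graphics device 0x2a42. */
		if (!drhd->devices[i] ||
		    drhd->devices[i]->vendor != 0x8086 ||
		    drhd->devices[i]->device != 0x2a42)
			continue;

		pci_read_config_dword(drhd->devices[i], PCI_BASE_ADDRESS_0,
				      &mmiobar);
		/*
		 * BAR 0 reads back with no address although the kernel has
		 * one recorded: write it back, keeping the BAR's flag bits.
		 */
		if (!(mmiobar & PCI_BASE_ADDRESS_MEM_MASK) &&
		    pci_resource_start(drhd->devices[i], 0)) {
			WARN(1, "BIOS failed to restore BARs for integrated graphics device\n");
			pci_write_config_dword(drhd->devices[i],
					       PCI_BASE_ADDRESS_0,
					       pci_resource_start(drhd->devices[i], 0) | mmiobar);
		}
	}
}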
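
For anyone who wants to follow the check without a kernel tree, the same logic can be exercised in userspace. This is a simplified model, not the patch itself: `struct fake_pci_dev` and `restore_bar0_if_lost()` are hypothetical, `resource_start` stands in for what `pci_resource_start()` would return, and only the low 32 bits of the BAR are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_BASE_ADDRESS_0        0x10     /* byte offset of BAR 0 */
#define PCI_BASE_ADDRESS_MEM_MASK (~0x0fu) /* address bits of a memory BAR */

/*
 * Hypothetical model of a PCI device: its live config space plus the
 * BAR 0 address the OS recorded before suspend.
 */
struct fake_pci_dev {
	uint32_t config[16];      /* first 64 bytes of config space, as dwords */
	uint32_t resource_start;  /* address the OS expects in BAR 0 */
};

/*
 * If BAR 0 has lost its address but the OS still has one recorded,
 * write it back while preserving the flag bits that are still present.
 * Returns true if a repair was made.
 */
bool restore_bar0_if_lost(struct fake_pci_dev *dev)
{
	uint32_t bar = dev->config[PCI_BASE_ADDRESS_0 / 4];

	if ((bar & PCI_BASE_ADDRESS_MEM_MASK) || !dev->resource_start)
		return false;  /* BAR still programmed, or nothing to restore */

	dev->config[PCI_BASE_ADDRESS_0 / 4] = dev->resource_start | bar;
	return true;
}
```

Starting from a BAR value of 0x4 and a recorded address of 0x58000000, one call leaves 0x58000004 in the register; a second call finds the BAR already programmed and does nothing.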
Comment 18 David Woodhouse 2009-11-26 06:11:42 EST
A 2.6.31.6-151.fc12 kernel with this fix is building at http://koji.fedoraproject.org/koji/taskinfo?taskID=1831944

There may be some refinement to come, but this should do the job for now. Please confirm that it fixes the problem for you.

Please don't be distracted by bug #540218, which also affects some of the same machines.
Comment 19 Robinson Maureira 2009-11-26 16:48:40 EST
I can confirm that it works OK on my HP 6730b: resume works after suspending with the suspend button or by closing the lid.
Comment 20 Brad Scalio 2009-11-29 10:57:19 EST
Confirmed: works for both suspend to RAM and suspend to disk on an HP EliteBook 6930p.

Thanks
Comment 21 drago01 2009-11-29 11:12:47 EST
Works fine on my 6930p too.
Comment 22 David Woodhouse 2009-11-29 12:34:04 EST
Thanks for testing. Out of interest, does the F12 kernel take ages to initialise for you on this hardware and run very slowly, repeatedly saying:
'[drm] TV-20: set mode NTSC 480i 0'?
Comment 23 drago01 2009-11-29 14:16:32 EST
(In reply to comment #22)
> Thanks for testing. Out of interest, does the F12 kernel take ages to
> initialise for you on this hardware and run very slowly, repeatedly saying:
> '[drm] TV-20: set mode NTSC 480i 0'?  

No, I get a couple of these messages, but it is far from slow: about 12-14 seconds from grub to gdm, and once X is up compiz works just fine (yes, this is the GM45 version, not the ATI one).

With the F13 kernel I noticed something similar (X was very slow), but I just blamed the drm patches not being in sync with the X driver.
Comment 24 David Woodhouse 2009-11-29 14:36:32 EST
Sorry, I meant F13 (rawhide). Sounds like you're seeing what I'm seeing.
Comment 25 drago01 2009-11-29 14:48:13 EST
(In reply to comment #24)
> Sorry, I meant F13 (rawhide). Sounds like you're seeing what I'm seeing.  

OK, yeah, it seems to be the same issue; I just confirmed it. Is there a bug open for that?
Comment 26 Brad Scalio 2009-11-29 17:27:08 EST
F12 is just fine, not slow at all. X seems stable, and suspend and resume are stable: I tested them a few times in a row, to both RAM and disk. There are only about eight 480i messages in the messages log, and all seems well right now.

In fact, this update also fixed bug 540218 for me.

Let us know if you need anything else
Comment 27 drago01 2009-11-29 17:39:10 EST
(In reply to comment #26)
> F12 is just fine, not slow at all ... X seems stable, resume and suspend stable
> tested it a few times in a row to ram and disk, only about 8 480i messages in
> messages, all seems well right now
> 
> In fact, this update also fixed 540218 bug for me
> 
> Let us know if you need anything else  

Yeah F12 is fine, the bug is in F13 (rawhide).
Comment 28 Tristan 2009-12-01 16:44:53 EST
With the new 2.6.31.6-151 kernel I can confirm that the display is turned on again after resume. This works for me, but maybe this kernel warning could help you improve your excellent work.
Feel free to contact me if you need more information.
 
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Back to C!
------------[ cut here ]------------
WARNING: at drivers/pci/intel-iommu.c:3098 cantiga_gfx_bar_enable+0x6f/0xa4() (Not tainted)
Hardware name: HP EliteBook 2530p
BIOS failed to restore BARs for integrated graphics device; fixing...
Modules linked in: aes_i586 aes_generic vfat fat rfcomm sco bridge stp llc bnep l2cap sunrpc coretemp ipv6 cpufreq_ondemand acpi_cpufreq fuse dm_multipath uinput snd_hda_codec_analog snd_hda_intel snd_hda_codec mmc_block snd_hwdep firewire_ohci snd_seq firewire_core snd_seq_device sdhci_pci snd_pcm sdhci btusb crc_itu_t mmc_core bluetooth ricoh_mmc snd_timer snd soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support arc4 ecb joydev uvcvideo videodev e1000e tpm_infineon v4l1_compat wmi iwlagn iwlcore hp_accel lis3lv02d serio_raw input_polldev mac80211 cfg80211 rfkill pata_acpi ata_generic i915 drm_kms_helper drm i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
Pid: 3024, comm: pm-suspend Not tainted 2.6.31.6-151.fc12.i686 #1
Call Trace:
 [<c0436d8b>] warn_slowpath_common+0x70/0x87
 [<c05b8a08>] ? cantiga_gfx_bar_enable+0x6f/0xa4
 [<c0436de0>] warn_slowpath_fmt+0x29/0x2c
 [<c05b8a08>] cantiga_gfx_bar_enable+0x6f/0xa4
 [<c05b8abd>] iommu_resume+0x80/0x126
 [<c06282e7>] __sysdev_resume+0x19/0xb0
 [<c062842e>] sysdev_resume+0xb0/0x11b
 [<c045cec6>] suspend_devices_and_enter+0x10e/0x184
 [<c045d009>] enter_state+0xcd/0x119
 [<c045c7e4>] state_store+0x98/0xac
 [<c045c74c>] ? state_store+0x0/0xac
 [<c059505d>] kobj_attr_store+0x16/0x22
 [<c0501625>] sysfs_write_file+0xc1/0xec
 [<c0501564>] ? sysfs_write_file+0x0/0xec
 [<c04c07d4>] vfs_write+0x85/0xe4
 [<c04c08d1>] sys_write+0x40/0x62
 [<c040363c>] syscall_call+0x7/0xb
---[ end trace 5a79984d5796fe95 ]---
CPU0: Thermal monitoring handled by SMI
Extended CMOS year: 2000
Enabling non-boot CPUs ...
SMP alternatives: switching to SMP code
Booting processor 1 APIC 0x1 ip 0x6000
Comment 29 David Woodhouse 2009-12-02 04:05:15 EST
(In reply to comment #28)
> WARNING: at drivers/pci/intel-iommu.c:3098 cantiga_gfx_bar_enable+0x6f/0xa4()
> (Not tainted)
> Hardware name: HP EliteBook 2530p
> BIOS failed to restore BARs for integrated graphics device; fixing...

That's just confirming that it's noticed and fixed the problem. Your BIOS is broken, but we cope.

It's reminding you to report the fault to your vendor and demand a fixed BIOS, and ensuring that these problems end up in kerneloops.org where we can see how prevalent they are (and which companies are responsible for them).

Thanks.
Comment 30 Bug Zapper 2010-11-04 02:35:10 EDT
This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 31 Bug Zapper 2010-12-03 22:24:31 EST
Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
