Using the Beta-1.5 DVD Server ISO. Occurs on ppc64le and ppc64. After a successful installation on a Power8 bare-metal machine, the first boot never reaches the login prompt. Instead, some error messages and information (regarding PCI, atombios, radeon) are displayed, and then we get a :/# prompt with a limited shell. Note that Fedora 26 installs and reboots successfully on this machine. The dmesg logs (for ppc64le and ppc64) are provided as attachments.
Created attachment 1333249 [details] dmesg log for ppc64le
Created attachment 1333250 [details] dmesg log for ppc64
Eric, what machine type is it?
Dan, machine type: Power8 Server-8247-21L-SN212907A FW840.00 (TV840_056) Machine type-model: 8247-21L Serial number: 212907A
Note that Beta-1.5 (like 1.3 and 1.2) was tested successfully on an LPAR (ppc64le and ppc64 modes): install + reboot OK.
What I don't get is why the installation passed but the first system boot fails, since the same kernel should be used in both cases. Eric, I suppose your machine has a discrete Radeon card plugged in, right?
Created attachment 1333572 [details] Fedora 26 ppc64le logs (anaconda, dmesg, lshw, lspci)
Dan, I do not have physical access to this machine (it is in the Toulouse lab). I re-installed Fedora 26 on it to capture hardware information. I am providing the F26-ppc64le.tgz archive containing the anaconda logs and the dmesg, lshw and lspci outputs (with some references to Radeon). I have always used this machine headless with VNC, and the Radeon card never caused problems before F27.
Adding modprobe.blacklist=radeon to the kernel boot arguments prevents the radeon driver from loading and allows the system to boot to the login prompt. This can be used as a workaround to verify the Anaconda installation.
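To confirm after boot that the workaround is really in effect, the kernel command line can be checked. The following is only a sketch: the helper name module_blacklisted is made up for illustration, and it assumes the usual comma-separated form of the modprobe.blacklist= argument.

```python
def module_blacklisted(cmdline: str, module: str) -> bool:
    """Return True if `module` is listed in a modprobe.blacklist= argument.

    `cmdline` is the kernel command line as a single string, e.g. the
    contents of /proc/cmdline. Assumes a comma-separated module list.
    """
    prefix = "modprobe.blacklist="
    for arg in cmdline.split():
        if arg.startswith(prefix):
            names = arg[len(prefix):].split(",")
            if module in names:
                return True
    return False
```

On the installed system this could be called as module_blacklisted(open("/proc/cmdline").read(), "radeon").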
Adding Ben to CC for his opinion. Ben, doesn't this problem sound familiar to you? It looks to me as if newer kernels (4.13?) don't like a Radeon card in a Power8 system. Thanks, Dan
I have a similar configuration, except that it has more CPUs and more RAM, but it DOES have the same CEDAR (FirePro 2270) graphics card. I am running F26 on this system. See the next comment for comments on the dmesg logs you sent for F26 and F27, respectively.
Created attachment 1339865 [details] Comparing the F26 and F27 dmesg logs See the attachment for the promised comparison between the F26 and F27 dmesg logs.
Just as a data point, this line: pci 0001:03 : [PE# 00] Using 64-bit DMA iommu bypass (through TVE#0) is suspicious and likely one of the root issues, but I am not sure what DMA iommu bypass means.
Created attachment 1344328 [details] dmesg log for ppc64le (Minimal Server install, no kworker error) Interestingly, when the Minimal Server install is used, the reboot after installation proceeds to the login prompt. The kworker process error does not occur.
I built an absolute-latest (4.14.0-rc4) kernel and it appeared to boot OK. However, the dmesg log shows a lot of troublesome messages concerning the graphics card, e.g.:

[nnnnn.nnnnnn] [drm] initializing kernel modesetting (CEDAR 0x1002:0x68F2 0x1002:0x0126 0x00).
[nnnnn.nnnnnn] pci 0001:09 : [PE# 02] Using 64-bit DMA iommu bypass (through TVE#0)
...
[nnnnn.nnnnnn] EEH: Frozen PHB#1-PE#2 detected
[nnnnn.nnnnnn] EEH: PE location: U78CB.001.WZS00U9-P1-C12, PHB location: N/A
[nnnnn.nnnnnn] [c000000ff438f7f0] [d0000000169de870] r600_irq_init+0x4b8/0x4e0 [radeon]
[nnnnn.nnnnnn] [c000000ff438f830] [d000000016a09570] evergreen_startup+0x1548/0x2e10 [radeon]
[nnnnn.nnnnnn] [c000000ff438f8e0] [d000000016a0b1a8] evergreen_init+0x240/0x480 [radeon]
[nnnnn.nnnnnn] [c000000ff438f950] [d000000016963100] radeon_device_init+0x638/0xd10 [radeon]
[nnnnn.nnnnnn] [c000000ff438f9e0] [d000000016966534] radeon_driver_load_kms+0xec/0x2d0 [radeon]
[nnnnn.nnnnnn] [c000000ff438fa60] [d000000012d2b19c] drm_dev_register+0x1d4/0x290 [drm]
[nnnnn.nnnnnn] [c000000ff438fb00] [d000000012d2c1fc] drm_get_pci_dev+0xc4/0x210 [drm]
[nnnnn.nnnnnn] [c000000ff438fb90] [d000000016960820] radeon_pci_probe+0xc8/0x110 [radeon]
[nnnnn.nnnnnn] EEH: Detected PCI bus error on PHB#1-PE#2
[nnnnn.nnnnnn] EEH: This PCI device has failed 1 times in the last hour
[nnnnn.nnnnnn] EEH: Notify device drivers to shutdown
[nnnnn.nnnnnn] EEH: Collect temporary log
[nnnnn.nnnnnn] EEH: of node=0001:09:00.1
[nnnnn.nnnnnn] EEH: PCI device/vendor: aa681002
[nnnnn.nnnnnn] EEH: PCI cmd/status register: 00100140
[nnnnn.nnnnnn] EEH: PCI-E capabilities and status follow:
[nnnnn.nnnnnn] EEH: PCI-E 00: 0012a010 00648fa1 0000293e 09000d02
[nnnnn.nnnnnn] EEH: PCI-E 10: 10120000 00000000 00000000 00000000
[nnnnn.nnnnnn] EEH: PCI-E 20: 00000000
[nnnnn.nnnnnn] EEH: PCI-E AER capability register set follows:
[nnnnn.nnnnnn] EEH: PCI-E AER 00: 00010001 00000000 00000000 00062030
[nnnnn.nnnnnn] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[nnnnn.nnnnnn] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[nnnnn.nnnnnn] EEH: PCI-E AER 30: 00000000 00000000
[nnnnn.nnnnnn] EEH: of node=0001:09:00.0
[nnnnn.nnnnnn] EEH: PCI device/vendor: 68f21002
[nnnnn.nnnnnn] EEH: PCI cmd/status register: 00100546
[nnnnn.nnnnnn] EEH: PCI-E capabilities and status follow:
[nnnnn.nnnnnn] EEH: PCI-E 00: 0012a010 00648fa1 0000293e 09000d02
[nnnnn.nnnnnn] EEH: PCI-E 10: 10120000 00000000 00000000 00000000
[nnnnn.nnnnnn] EEH: PCI-E 20: 00000000
[nnnnn.nnnnnn] EEH: PCI-E AER capability register set follows:
[nnnnn.nnnnnn] EEH: PCI-E AER 00: 00010001 00000000 00000000 00062030
[nnnnn.nnnnnn] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[nnnnn.nnnnnn] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[nnnnn.nnnnnn] EEH: PCI-E AER 30: 00000000 00000000
[nnnnn.nnnnnn] EEH: Reset with hotplug activity
[nnnnn.nnnnnn] iommu: Removing device 0001:09:00.1 from group 1
[nnnnn.nnnnnn] [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xFFFFFFFF)
[nnnnn.nnnnnn] radeon 0001:09:00.0: disabling GPU acceleration
...
[nnnnn.nnnnnn] EEH: 2100000 reads ignored for recovering device at location=U78CB.001.WZS00U9-P1-C12 driver=radeon pci addr=0001:09:00.0
[nnnnn.nnnnnn] EEH: Might be infinite loop in radeon driver
<followed by stack trace>
etc.
EVENTUALLY the FirePro card gets properly initialized, apparently; I am able to start the X server and run accelerated GL applications, e.g. glxgears.
Particularly troubling are these lines:

[nnnnn.nnnnnn] pci 0001:09 : [PE# 02] Using 64-bit DMA iommu bypass (through TVE#0)
[nnnnn.nnnnnn] EEH: Frozen PHB#1-PE#2 detected
<followed by stack trace>

Contrast with this line from older kernels:

[nnnnn.nnnnnn] radeon 0001:09:00.0: Using 32-bit DMA via iommu
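For anyone comparing their own logs, the signature above can be checked mechanically. This is a rough sketch, not a supported tool: the function name summarize_dmesg and the shape of the returned dictionary are invented here, and the patterns are taken only from the message texts quoted in this report.

```python
import re

# Patterns copied from the messages quoted in this bug report.
BYPASS_RE = re.compile(r"Using 64-bit DMA iommu bypass")
IOMMU32_RE = re.compile(r"Using 32-bit DMA via iommu")
FROZEN_RE = re.compile(r"EEH: Frozen (PHB#\d+-PE#\d+) detected")


def summarize_dmesg(lines):
    """Count DMA-mode messages and collect the distinct frozen PEs."""
    summary = {"bypass_64bit": 0, "iommu_32bit": 0, "frozen_pes": []}
    for line in lines:
        if BYPASS_RE.search(line):
            summary["bypass_64bit"] += 1
        elif IOMMU32_RE.search(line):
            summary["iommu_32bit"] += 1
        match = FROZEN_RE.search(line)
        if match and match.group(1) not in summary["frozen_pes"]:
            summary["frozen_pes"].append(match.group(1))
    return summary
```

A log showing a non-zero bypass_64bit count together with a non-empty frozen_pes list matches the failing pattern; a log with only iommu_32bit entries matches the older, working kernels.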
I bisected the kernel to find out where the "bypass" and "Frozen" lines cropped up, and tracked the behavior down to the following commit:

commit 07d306c838c5c30196619baae36107d0615e459b
Merge: a3ddacbae5ab c013b65ad8a1
Author: Linus Torvalds <torvalds>
Date: Tue Jul 11 09:59:37 2017 -0700

    Merge git://www.linux-watchdog.org/linux-watchdog

    Pull watchdog updates from Wim Van Sebroeck:

     - Add Renesas RZ/A WDT Watchdog driver
     - STM32 Independent WatchDoG (IWDG) support
     - UniPhier watchdog support
     - Add F71868 support
     - Add support for NCT6793D and NCT6795D
     - dw_wdt: add reset lines support
     - core: add option to avoid early handling of watchdog
     - core: introduce watchdog_worker_should_ping helper
     - Cleanups and improvements for sama5d4, intel-mid_wdt, s3c2410_wdt, orion_wdt, gpio_wdt, it87_wdt, meson_wdt, davinci_wdt, bcm47xx_wdt, zx2967_wdt, cadence_wdt

    * git://www.linux-watchdog.org/linux-watchdog: (32 commits)
      watchdog: introduce watchdog_worker_should_ping helper
      watchdog: uniphier: add UniPhier watchdog driver
      dt-bindings: watchdog: add description for UniPhier WDT controller
      watchdog: cadence_wdt: make of_device_ids const.
      watchdog: zx2967: constify zx2967_wdt_ops.
      watchdog: bcm47xx_wdt: constify bcm47xx_wdt_hard_ops and bcm47xx_wdt_soft_ops
      watchdog: davinci: Add missing clk_disable_unprepare().
      watchdog: davinci: Handle return value of clk_prepare_enable
      watchdog: meson: Handle return value of clk_prepare_enable
      watchdog: it87: Add support for various Super-IO chips
      watchdog: it87: Use infrastructure to stop watchdog on reboot
      watchdog: it87: Drop support for resetting watchdog though CIR and Game port
      watchdog: it87: Convert to use watchdog core infrastructure
      watchdog: it87: Drop FSF mailing address
      watchdog: dw_wdt: get reset lines from dt
      watchdog: bindings: dw_wdt: add reset lines
      watchdog: w83627hf: Add support for NCT6793D and NCT6795D
      watchdog: core: add option to avoid early handling of watchdog
      watchdog: f71808e_wdt: Add F71868 support
      watchdog: Add STM32 IWDG driver
      ...

This is a huge commit. Analysis continues.
I bisected the problem with different endpoints, and tracked it down to this commit:

commit 8e3f1b1d8255105f31556aacf8aeb6071b00d469
Author: Russell Currey <ruscur>
Date: Wed Jun 21 17:18:04 2017 +1000

    powerpc/powernv/pci: Enable 64-bit devices to access >4GB DMA space

    On PHB3/POWER8 systems, devices can select between two different sections of address space, TVE#0 and TVE#1. TVE#0 is intended for 32bit devices that aren't capable of addressing more than 4GB. Selecting TVE#1 instead, with the capability of addressing over 4GB, is performed by setting bit 59 of a PCI address.

    However, some devices aren't capable of addressing at least 59 bits, but still want more than 4GB of DMA space. In order to enable this, reconfigure TVE#0 to be suitable for 64-bit devices by allocating memory past the initial 4GB that is inaccessible by 64-bit DMAs.

    This bypass mode is only enabled if a device requests 4GB or more of DMA address space, if the system has PHB3 (POWER8 systems), and if the device does not share a PE with any devices from different vendors.

    Signed-off-by: Russell Currey <ruscur>
    Signed-off-by: Michael Ellerman <mpe.au>

A later commit fixed at least some of the problems introduced by 8e3f1b1d8255105f31556aacf8aeb6071b00d469:

commit 253fd51e2f533552ae35a0c661705da6c4842c1b
Author: Alistair Popple <alistair.au>
Date: Wed Jul 26 15:26:40 2017 +1000

    powerpc/powernv/pci: Return failure for some uses of dma_set_mask()

    Commit 8e3f1b1d8255 ("powerpc/powernv/pci: Enable 64-bit devices to access >4GB DMA space") introduced the ability for PCI device drivers to request a DMA mask between 64 and 32 bits and actually get a mask greater than 32-bits. However currently if certain machine configuration dependent conditions are not meet the code silently falls back to a 32-bit mask. This makes it hard for device drivers to detect which mask they actually got.

    Instead we should return an error when the request could not be fulfilled which allows drivers to either fallback or implement other workarounds as documented in DMA-API-HOWTO.txt.

    Signed-off-by: Alistair Popple <alistair.au>
    Acked-by: Russell Currey <ruscur>
    Signed-off-by: Michael Ellerman <mpe.au>

However, that later commit appears not to be the answer, at least not for the AMD FirePro 2270 on which the problem was reported.

The signature of the problem is that as soon as we start using the 64-bit DMA iommu bypass (through TVE#0) for the Radeon card, we get "EEH: Frozen PHB#1-PE#2 detected" messages and a TON of EEH messages with stack traces from the Radeon driver, messages about atombios [being] stuck in a loop, etc. I have not yet had a chance to test this with a different card.
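To make the TVE selection described in the first commit message concrete, here is a small model of it. It is a sketch under stated assumptions: it takes "bit 59" to be counted from the least significant bit (bit 0), and the helper names selected_tve and can_select_tve1 are hypothetical. It says nothing about how the PHB3 hardware is actually programmed.

```python
# Per the commit message: setting bit 59 of a PCI address selects TVE#1.
# Assumption: bits are numbered from the least significant bit (bit 0).
TVE1_SELECT_BIT = 59


def selected_tve(pci_addr: int) -> int:
    """Return 1 if the address selects TVE#1, otherwise 0 (TVE#0)."""
    return (pci_addr >> TVE1_SELECT_BIT) & 1


def can_select_tve1(dma_bits: int) -> bool:
    """A device that drives only `dma_bits` address bits (bits 0 to
    dma_bits - 1) can set bit 59 only if dma_bits is greater than 59."""
    return dma_bits > TVE1_SELECT_BIT
```

Under this model a device limited to a 40-bit DMA mask (the non-32-bit value in the radeon_device_init code quoted in the patches in this report) cannot reach TVE#1, which is the class of device the reconfigured TVE#0 bypass was meant to serve.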
I have now had a chance to swap out the FirePro 2270 for an Embedded Radeon E6465 (with a newer Caicos GPU and more VRAM), and this problem does not occur.
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. As kernel maintainers, we try to keep up with bugzilla but due to the rate at which the upstream kernel project moves, bugs may be fixed without any indication to us. Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs. Fedora 27 has now been rebased to 4.15.3-300.fc27. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you experience different issues, please open a new bug report for those.
This problem may be fixed, or at least worked around, by this patch I submitted upstream 02/21/2018:

[PATCH] drm/radeon: insist on 32-bit DMA for Cedar

In radeon_device_init, set the need_dma32 flag for Cedar chips (e.g. FirePro 2270). This fixes, or at least works around, a bug on PowerPC exposed by last year's commits 8e3f1b1d8255105f31556aacf8aeb6071b00d469 (Russell Currey) and 253fd51e2f533552ae35a0c661705da6c4842c1b (Alistair Popple) which enabled the 64-bit DMA iommu bypass. This caused the device to freeze, in some cases unrecoverably, and is the subject of several bug reports internal to Red Hat.

Signed-off-by: Ben Crocker <bcrocker>
---
 drivers/gpu/drm/radeon/radeon_device.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index ffc10cadcf34..02538903830d 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1395,7 +1395,10 @@ int radeon_device_init(struct radeon_device *rdev,
 	if (rdev->flags & RADEON_IS_AGP)
 		rdev->need_dma32 = true;
 	if ((rdev->flags & RADEON_IS_PCI) &&
-	    (rdev->family <= CHIP_RS740))
+	    (rdev->family <= CHIP_RS740 || rdev->family == CHIP_CEDAR))
+		rdev->need_dma32 = true;
+	if ((rdev->flags & RADEON_IS_PCIE) &&
+	    (rdev->family == CHIP_CEDAR))
 		rdev->need_dma32 = true;

 	dma_bits = rdev->need_dma32 ? 32 : 40;
Here is a revised version of the patch; I think it's in final form now:

Author: Ben Crocker <bcrocker>
Date: Thu Feb 22 17:50:45 2018 -0500

    drm/radeon: insist on 32-bit DMA for Cedar on PPC64/PPC64LE

    In radeon_device_init, set the need_dma32 flag for Cedar chips (e.g. FirePro 2270). This fixes, or at least works around, a bug on PowerPC exposed by last year's commits 8e3f1b1d8255105f31556aacf8aeb6071b00d469 (Russell Currey) and 253fd51e2f533552ae35a0c661705da6c4842c1b (Alistair Popple) which enabled the 64-bit DMA iommu bypass. This caused the device to freeze, in some cases unrecoverably, and is the subject of several bug reports internal to Red Hat.

    Signed-off-by: Ben Crocker <bcrocker>

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index ffc10cadcf34..32b577c776b9 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1397,6 +1397,10 @@ int radeon_device_init(struct radeon_device *rdev,
 	if ((rdev->flags & RADEON_IS_PCI) &&
 	    (rdev->family <= CHIP_RS740))
 		rdev->need_dma32 = true;
+#ifdef CONFIG_PPC64
+	if (rdev->family == CHIP_CEDAR)
+		rdev->need_dma32 = true;
+#endif

 	dma_bits = rdev->need_dma32 ? 32 : 40;
 	r = pci_set_dma_mask(rdev->pdev, DMA_BIT_MASK(dma_bits));
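For readers who want to see the effect of the revised patch without reading the driver, the decision can be modelled in a few lines. This is an illustrative model only, not driver code: the Chip ranks below are made up (the real radeon family enum has many more entries; only the relative order of RS740 and CEDAR matters here), the ppc64 argument stands in for CONFIG_PPC64, and the flag is assumed to start out false.

```python
from enum import IntEnum


class Chip(IntEnum):
    # Illustrative ranks only; not the real enum values.
    R100 = 0
    RS740 = 1
    CEDAR = 2
    CAICOS = 3


def dma_bit_mask(n: int) -> int:
    """Equivalent of the kernel's DMA_BIT_MASK(n) for n < 64."""
    return (1 << n) - 1


def radeon_dma_bits(family, is_agp=False, is_pci=False, ppc64=False):
    """Model of the need_dma32 decision after the revised patch."""
    need_dma32 = False  # assumed initial value
    if is_agp:
        need_dma32 = True
    if is_pci and family <= Chip.RS740:
        need_dma32 = True
    if ppc64 and family == Chip.CEDAR:  # the lines added by the patch
        need_dma32 = True
    return 32 if need_dma32 else 40
```

With this model, radeon_dma_bits(Chip.CEDAR, ppc64=True) gives 32, so the driver requests a 32-bit mask, while a Caicos part such as the E6465 mentioned earlier still gets 40.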
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs. Fedora 27 has now been rebased to 4.18.10-100.fc27. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 28 or Fedora 29, and are still experiencing this issue, please change the version to Fedora 28 or 29. If you experience different issues, please open a new bug report for those.
This message is a reminder that Fedora 27 is nearing its end of life. On 2018-Nov-30 Fedora will stop maintaining and issuing updates for Fedora 27. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '27'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 27 reaches end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Checked with the official Fedora 27 Server ISO images (for ppc64 and ppc64le). The problem no longer occurs, so I am changing the status to CLOSED.