Bug 1426796

Summary: Regression: kernel doesn't boot on qemu on either ppc64 or ppc64le
Product: [Fedora] Fedora
Reporter: Richard W.M. Jones <rjones>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rawhide
CC: awilliam, bugproxy, cz172638, dan, dgibson, efintzel, gansalmon, gmarr, hannsj_uhl, ichavero, itamar, jonathan, kernel-maint, labbott, madhu.chinakonda, mchehab, menantea, normand, pbrobinson, rjones, robatino
Target Milestone: ---   
Target Release: ---   
Hardware: ppc64   
OS: Linux   
Whiteboard: AcceptedFreezeException
Last Closed: 2017-03-24 16:57:30 UTC
Type: Bug
Bug Blocks: 910269, 1071880, 1349185, 1430827    
Attachments:
Description            Flags
build.log for ppc64    none
build.log for ppc64le  none

Description Richard W.M. Jones 2017-02-24 22:38:34 UTC
Description of problem:

As described in the summary, the current Rawhide kernel does not
boot on qemu-system-ppc64 on either ppc64 or ppc64le architectures.

This worked only a few days ago, so it's a recent regression in
something or other.

Version-Release number of selected component (if applicable):

Last tested versions which *worked*:

qemu 2:2.8.0-2.fc26
kernel 4.11.0-0.rc0.git2.1.fc26
libguestfs-1.35.27-1.fc26

Failing versions:

qemu 2:2.8.0-2.fc26
kernel 4.11.0-0.rc0.git3.1.fc26
libguestfs-1.35.28-1.fc26

How reproducible:

At least once.

Steps to Reproduce:
1. Boot Linux on qemu.

Please see the full command lines used when I attach the build logs.

Comment 1 Richard W.M. Jones 2017-02-24 22:39:50 UTC
Created attachment 1257483
build.log for ppc64

Comment 2 Richard W.M. Jones 2017-02-24 22:40:48 UTC
Created attachment 1257484
build.log for ppc64le

Comment 3 Fedora Blocker Bugs Application 2017-03-03 13:15:04 UTC
Proposed as a Blocker for 26-alpha by Fedora user michelmno using the blocker tracking app because:

 The current Linux kernel in the latest compose (20170302) makes the PowerPC boot hang in the qemu environment.

Comment 4 Adam Williamson 2017-03-03 22:39:48 UTC
Neither ppc64 nor ppc64le is a release-blocking arch, so this cannot possibly be a release blocker. I'm gonna take the initiative to change this to a proposed freeze exception.

Comment 5 Geoffrey Marr 2017-03-06 18:33:32 UTC
Discussed during the 2017-03-06 blocker review meeting: [1]

The decision was made to accept this bug as an Alpha Freeze Exception, as it would be a blocker on a blocking arch and affects usage of ppc64 and ppc64le significantly enough to warrant such action.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2017-03-06/f26-blocker-review.2017-03-06-17.02.txt

Comment 6 Richard W.M. Jones 2017-03-08 10:37:54 UTC
Still failing in the same way with:

kernel 4.11.0-0.rc1.git0.1.fc27
qemu 2:2.8.0-2.fc26

Comment 7 Dan Horák 2017-03-09 15:16:28 UTC
should be something KVM or qemu related, as the latest kernel boots on bare metal

[root@ibm-p8-generic-01 ~]# uname -a
Linux ibm-p8-generic-01.lab.eng.brq.redhat.com 4.11.0-0.rc1.git1.2.fc27.ppc64le #1 SMP Thu Mar 9 03:59:12 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

Comment 8 Michel Normand 2017-03-09 17:18:14 UTC
Note that we have the same problem if the host is F25 with the following kernel and qemu versions (1), when trying to create a guest from the F26 compose at (2), which ships kernel 4.11.

(1) 
===
$qemu-ppc64le --version
qemu-ppc64lef version 2.7.1(qemu-2.7.1-2.fc25), Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
$uname -a
Linux fenix.test.toulouse-stg.fr.ibm.com 4.9.9-200.fc25.ppc64le #1 SMP Thu Feb 16 16:10:02 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
===
(2) https://kojipkgs.fedoraproject.org/compose/branched/latest-Fedora-26/compose/Server/ppc64le/iso/
===

Comment 9 David Gibson 2017-03-10 03:02:35 UTC
Do you know if the same kernel boots under PowerVM or on bare metal?

In other words, I'm trying to isolate if the problem is in the guest kernel or in qemu.

Comment 10 Michel Normand 2017-03-10 07:09:36 UTC
(In reply to David Gibson from comment #9)
> Do you know if the same kernel boots under PowerVM or on bare metal?
> 
> In other words, I'm trying to isolate if the problem is in the guest kernel
> or in qemu.

above comment #7 reported boot success on bare metal

Comment 11 Éric Fintzel 2017-03-10 07:53:34 UTC
I tried to boot the Fedora-Server-dvd-ppc64le-26-20170309.n.0.iso image on a LPAR and got:

OF stdout device is: /vdevice/vty@30000000
Preparing to boot Linux version 4.11.0-0.rc1.git0.1.fc26.ppc64le (mockbuild.fedoraproject.org) (gcc version 7.0.1 20170225 (Red Hat 7.0.1-0.10) (GCC) ) #1 SMP Mon Mar 6 18:25:13 UTC 2017
Detected machine type: 0000000000000101
Max number of cores passed to firmware: 128 (NR_CPUS = 1024)
Calling ibm,client-architecture-support... done
command line: BOOT_IMAGE=/ppc/ppc64/vmlinuz ro
memory layout at init:
  memory_limit : 0000000000000000 (16 MB aligned)
  alloc_bottom : 000000000ec80000
  alloc_top    : 0000000010000000
  alloc_top_hi : 0000000010000000
  rmo_top      : 0000000010000000
  ram_top      : 0000000010000000
instantiating rtas at 0x000000000eca0000... done
prom_hold_cpus: skipped
copying OF device tree...
Building dt strings...
Building dt structure...
No memory for flatten_device_tree (no room)
EXIT called ok
0 >

Comment 12 Dan Horák 2017-03-10 08:08:02 UTC
(In reply to David Gibson from comment #9)
> Do you know if the same kernel boots under PowerVM or on bare metal?
> 
> In other words, I'm trying to isolate if the problem is in the guest kernel
> or in qemu.

comment #7 reports successful boot using OPAL firmware

platform	: PowerNV
model		: 8247-21L
machine		: PowerNV 8247-21L
firmware	: OPAL

So isn't the kernel image too big (again)?
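
As a rough sanity check on the "too big" question, the gap between alloc_bottom and alloc_top in the LPAR log from comment 11 can be worked out directly. This is only arithmetic on the logged values; reading the gap as the space still available for the flattened device tree is an assumption, and free_window_mib is a name made up for this sketch.

```python
def free_window_mib(alloc_bottom: int, alloc_top: int) -> float:
    """Size of the [alloc_bottom, alloc_top) window in MiB."""
    return (alloc_top - alloc_bottom) / (1024 * 1024)


# Values copied from the LPAR boot log in comment 11.
lpar_alloc_bottom = 0x000000000EC80000
lpar_alloc_top = 0x0000000010000000

lpar_window = free_window_mib(lpar_alloc_bottom, lpar_alloc_top)
top_mib = lpar_alloc_top // (1024 * 1024)
print(f"LPAR window: {lpar_window} MiB below a {top_mib} MiB alloc_top")
```

With those values the window comes out at 19.5 MiB under a 256 MiB alloc_top, so most of the region below alloc_top is already in use by the time the device tree is copied.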

Comment 13 Laurent Vivier 2017-03-10 08:08:39 UTC
(In reply to Richard W.M. Jones from comment #0)
> As described in the summary, the current Rawhide kernel does not
> boot on qemu-system-ppc64 on either ppc64 or ppc64le architectures.

Do you use TCG, KVM PR or KVM HV?

Comment 14 Richard W.M. Jones 2017-03-10 08:58:19 UTC
Please see the log file attached which shows exactly how the kernel
is being booted, including the full qemu command line.  In brief - it's TCG.

You can try to reproduce the problem trivially by doing something like:

qemu-system-ppc64 -machine pseries-2.8,accel=tcg,usb=off,dump-guest-core=off -m 768 -kernel /boot/<name-of-vmlinuz> -append 'panic=1 console=hvc0 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=/dev/sdb selinux=0 guestfs_verbose=1 TERM=vt100'

As can also be seen if you look at the log file, it hung after printing:

Booting Linux via __start() @ 0x0000000000400000 ...
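
For anyone scripting the reproduction, a minimal sketch of how a captured console log could be classified as "hung at the __start() handoff". The function name and the sample strings are made up for illustration (the sample lines are assembled from console output quoted in this bug); it only looks at the last non-empty line and says nothing about why the boot stopped.

```python
HANG_MARKER = "Booting Linux via __start()"


def hung_after_start(console_log: str) -> bool:
    """True if the last non-empty console line is the __start() handoff message."""
    lines = [line.strip() for line in console_log.splitlines() if line.strip()]
    return bool(lines) and HANG_MARKER in lines[-1]


# Illustrative logs: one that stops at the handoff, one that continues past it.
hung_log = (
    "Quiescing Open Firmware ...\n"
    "Booting Linux via __start() @ 0x0000000000400000 ...\n"
)
booted_log = hung_log + " -> smp_release_cpus()\nLinux ppc64le\n"

print(hung_after_start(hung_log), hung_after_start(booted_log))
```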

Comment 15 Michel Normand 2017-03-12 05:19:40 UTC
For information:
With the latest Rawhide compose (20170310), which has kernel 4.11.0-0.rc1.git3.1.fc27.ppc64le, the boot completed without hanging.

It still fails on the F26 compose 20170311 with kernel 4.11.0-0.rc1.git0.1.fc26.ppc64le.

So I assume we need to update the kernel for the f26 branch in the git tree http://pkgs.fedoraproject.org/cgit/rpms/kernel.git/

Comment 16 Dan Horák 2017-03-13 15:36:34 UTC
Is there any commit between 4.11.0-0.rc1.git0 and 4.11.0-0.rc1.git3 that would explain the change? Or any other idea from anyone?

Comment 17 Peter Robinson 2017-03-13 15:39:14 UTC
(In reply to Dan Horák from comment #16)
> Is there any commit between 4.11.0-0.rc1.git0 and 4.11.0-0.rc1.git3 that
> would explain the change? Or any other idea from anyone?

This is the rc1 -> rc2 changelog:
https://lwn.net/Articles/716899/

The one that stands out is:
powerpc/64: Avoid panic during boot due to divide by zero in
init_cache_info()

Comment 18 Peter Robinson 2017-03-13 15:43:38 UTC
But looking through that changelog, there's also some mm work, some bugs around OPAL, and some other bits around page tables and other early boot stuff.

Comment 19 Fedora Update System 2017-03-14 15:11:28 UTC
bcm283x-firmware-20170314-2.509beaa.fc26 kernel-4.11.0-0.rc2.git0.1.fc26 linux-firmware-20170313-72.git695f2d6d.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-51a9cd79a0

Comment 20 Fedora Update System 2017-03-15 04:24:16 UTC
bcm283x-firmware-20170314-2.509beaa.fc26, kernel-4.11.0-0.rc2.git0.1.fc26, linux-firmware-20170313-72.git695f2d6d.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-51a9cd79a0

Comment 21 Peter Robinson 2017-03-15 13:02:34 UTC
so testing this on a F-25 userspace (just updating the kernel) I get the following panic (only ppc64le tested):

OF stdout device is: /vdevice/vty@30000000
Preparing to boot Linux version 4.11.0-0.rc2.git0.1.fc26.ppc64le (mockbuild.fedoraproject.org) (gcc version 7.0.1 20170225 (Red Hat 7.0.1-0.10) (GCC) ) #1 SMP Mon Mar 13 16:51:13 UTC 2017
Detected machine type: 0000000000000101
command line: BOOT_IMAGE=/vmlinuz-4.11.0-0.rc2.git0.1.fc26.ppc64le root=UUID=9ffac9f6-7a7c-4034-8431-09439b815c3d ro net.ifnames=0 rhgb quiet console=ttyS0 LANG=en_US.UTF-8
Max number of cores passed to firmware: 1024 (NR_CPUS = 1024)
Calling ibm,client-architecture-support... done
memory layout at init:
  memory_limit : 0000000000000000 (16 MB aligned)
  alloc_bottom : 0000000004270000
  alloc_top    : 0000000030000000
  alloc_top_hi : 0000000280000000
  rmo_top      : 0000000030000000
  ram_top      : 0000000280000000
instantiating rtas at 0x000000002fff0000... done
prom_hold_cpus: skipped
copying OF device tree...
Building dt strings...
Building dt structure...
Device tree strings 0x0000000004280000 -> 0x0000000004280a9e
Device tree struct  0x0000000004290000 -> 0x00000000042a0000
Quiescing Open Firmware ...
Booting Linux via __start() @ 0x0000000002000000 ...
 -> smp_release_cpus()
spinning_secondaries = 3
 <- smp_release_cpus()
Linux ppc64le
#1 SMP Mon Mar 1[    0.465500] Warning: unable to open an initial console.
[    0.967975] Unable to handle kernel paging request for data at address 0x2d326f6974726976
[    0.969153] Faulting instruction address: 0xc00000000033e98c
[    0.969638] Oops: Kernel access of bad area, sig: 11 [#1]
[    0.970076] SMP NR_CPUS=1024 
[    0.970077] NUMA 
[    0.970322] pSeries
[    0.970636] Modules linked in: virtio_console virtio_blk virtio_pci virtio_ring virtio
[    0.971308] CPU: 1 PID: 227 Comm: systemd Not tainted 4.11.0-0.rc2.git0.1.fc26.ppc64le #1
[    0.971937] task: c0000000034fb500 task.stack: c000000003460000
[    0.972388] NIP: c00000000033e98c LR: c00000000033ea04 CTR: c00000000051c510
[    0.972933] REGS: c0000000034639f0 TRAP: 0380   Not tainted  (4.11.0-0.rc2.git0.1.fc26.ppc64le)
[    0.973596] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>
[    0.973601]   CR: 22022884  XER: 20000000
[    0.974312] CFAR: c00000000033e908 SOFTE: 1 
[    0.974312] GPR00: c00000000033ea04 c000000003463c70 c00000000134fa00 0000000000000000 
[    0.974312] GPR04: 00000000017000c0 000000000000001a 0000000000000001 c00000027fa63d00 
[    0.974312] GPR08: 000000027eb40000 0000000000000000 0000000000000000 0000000000000c00 
[    0.974312] GPR12: 0000000000002200 c00000000fdc0900 00003fffb3e97360 c0000000034dda80 
[    0.974312] GPR16: 00003fffb3e8d560 00003fffb3e973c8 00003fffb3e8cf60 00003fffb3e96f98 
[    0.974312] GPR20: 00003fffb3e96b98 c0000000034fb500 0000000000000000 0000000000000000 
[    0.974312] GPR24: ffffffffffffffff 0000000000000000 c0000000000e9bbc c00000027e01e880 
[    0.974312] GPR28: ffffffffffffffff 00000000017000c0 2d326f6974726976 c00000027e01e880 
[    0.979553] NIP [c00000000033e98c] kmem_cache_alloc_node+0x13c/0x330
[    0.980037] LR [c00000000033ea04] kmem_cache_alloc_node+0x1b4/0x330
[    0.980514] Call Trace:
[    0.980701] [c000000003463c70] [c00000000033ea04] kmem_cache_alloc_node+0x1b4/0x330 (unreliable)
[    0.981371] [c000000003463cd0] [c0000000000e9bbc] copy_process.isra.5.part.6+0x18c/0x19f0
[    0.981993] [c000000003463db0] [c0000000000eb63c] _do_fork+0xec/0x490
[    0.982486] [c000000003463e30] [c00000000000bb88] ppc_clone+0x8/0xc
[    0.982972] Instruction dump:
[    0.983199] 7c0803a6 eb61ffd8 eb81ffe0 7d908120 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 
[    0.983793] 60420000 e95f0022 e93f0000 79290720 <7f1e502a> 0b090000 0b190000 39200000 
[    0.984400] ---[ end trace 89b319055f8dd539 ]---
[    0.984752] 
[    1.000829] Unable to handle kernel paging request for data at address 0x2d326f6974726976
[    1.001516] Faulting instruction address: 0xc00000000033e98c
[    1.001972] Oops: Kernel access of bad area, sig: 11 [#2]
[    1.002407] SMP NR_CPUS=1024 
[    1.002408] NUMA 
[    1.002638] pSeries
[    1.002975] Modules linked in: virtio_console virtio_blk virtio_pci virtio_ring virtio
[    1.003637] CPU: 1 PID: 1 Comm: systemd Tainted: G      D         4.11.0-0.rc2.git0.1.fc26.ppc64le #1
[    1.004354] task: c00000027bceda00 task.stack: c00000027e108000
[    1.004814] NIP: c00000000033e98c LR: c00000000033ea04 CTR: c00000000051c510
[    1.005363] REGS: c00000027e10b9f0 TRAP: 0380   Tainted: G      D          (4.11.0-0.rc2.git0.1.fc26.ppc64le)
[    1.006132] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>
[    1.006135]   CR: 22022884  XER: 20000000
[    1.006860] CFAR: c00000000033e908 SOFTE: 1 
[    1.006860] GPR00: c00000000033ea04 c00000027e10bc70 c00000000134fa00 0000000000000000 
[    1.006860] GPR04: 00000000017000c0 000000000000001c 0000000000000001 c00000027fa63d00 
[    1.006860] GPR08: 000000027eb40000 0000000000000000 0000000000000000 0000000000000c01 
[    1.006860] GPR12: 0000000000002200 c00000000fdc0900 0000000000000269 c0000000034f9080 
[    1.006860] GPR16: 000000003ea38d20 000000003ea189d0 00000100381a0cd0 0000000000000000 
[    1.006860] GPR20: 000000003ea34e38 c00000027bceda00 0000000000000000 0000000000000000 
[    1.006860] GPR24: ffffffffffffffff 0000000000000000 c0000000000e9bbc c00000027e01e880 
[    1.006860] GPR28: ffffffffffffffff 00000000017000c0 2d326f6974726976 c00000027e01e880 
[    1.012187] NIP [c00000000033e98c] kmem_cache_alloc_node+0x13c/0x330
[    1.012683] LR [c00000000033ea04] kmem_cache_alloc_node+0x1b4/0x330
[    1.013175] Call Trace:
[    1.013366] [c00000027e10bc70] [c00000000033ea04] kmem_cache_alloc_node+0x1b4/0x330 (unreliable)
[    1.014049] [c00000027e10bcd0] [c0000000000e9bbc] copy_process.isra.5.part.6+0x18c/0x19f0
[    1.014684] [c00000027e10bdb0] [c0000000000eb63c] _do_fork+0xec/0x490
[    1.015186] [c00000027e10be30] [c00000000000bb88] ppc_clone+0x8/0xc
[    1.015672] Instruction dump:
[    1.015903] 7c0803a6 eb61ffd8 eb81ffe0 7d908120 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 
[    1.016507] 60420000 e95f0022 e93f0000 79290720 <7f1e502a> 0b090000 0b190000 39200000 
[    1.017124] ---[ end trace 89b319055f8dd53a ]---
[    1.017482] 
[    3.019479] systemd: 18 output lines suppressed due to ratelimiting
[    3.019988] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.019988] 
[    3.020842] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.020842]
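
One detail in the oops that can be checked mechanically: the faulting data address, which is also the value in GPR30, consists entirely of printable ASCII bytes. The sketch below decodes it assuming little-endian byte order (the kernel is ppc64le). That this is string data sitting where a pointer was expected is an inference from the decoding, not something established in this bug.

```python
# "data at address" from the oops above; the same value appears in GPR30.
fault_addr = 0x2D326F6974726976

# Lay the 64-bit value out as it would sit in little-endian memory.
raw = fault_addr.to_bytes(8, byteorder="little")
text = raw.decode("ascii")

print(raw.hex(), repr(text))
```

The eight bytes decode to 'virtio2-', which looks like the start of a virtio device name rather than a kernel address.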

Comment 22 Menanteau Guy 2017-03-15 14:08:15 UTC
Note that Fedora ppc64le Rawhide of 20170314 also fails to boot.
It uses kernel-4.11.0-0.rc2.git0.2.
And I confirm comment 15: Fedora ppc64le Rawhide of 20170313 boots OK with kernel-4.11.0-0.rc1.git3.1.

Comment 23 Laura Abbott 2017-03-15 14:27:03 UTC
Please try kernel-4.11.0-0.rc2.git1.1 or greater. There was a fix https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce70df089143c49385b4f32f39d41fb50fbf6a7c that came in right after -rc2 which would affect powerpc
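
To check whether a given build is "rc2.git1.1 or greater", the rcN.gitM.B part of the version strings used in this bug can be compared as a tuple. This is a simplified sketch, not RPM's own version comparison: it assumes the same 4.11.0 base version, ignores the .fcNN dist tag, and snapshot_key and at_least are names invented for the example.

```python
import re

# Matches the snapshot component, e.g. "rc2.git1.1" in "4.11.0-0.rc2.git1.1.fc27".
SNAPSHOT_RE = re.compile(r"rc(\d+)\.git(\d+)\.(\d+)")


def snapshot_key(version: str) -> tuple:
    """Return (rc, git, build) parsed from a kernel version string."""
    match = SNAPSHOT_RE.search(version)
    if match is None:
        raise ValueError(f"no rcN.gitM.B component in {version!r}")
    return tuple(int(part) for part in match.groups())


def at_least(version: str, minimum: str) -> bool:
    """True if version's snapshot is the same as or newer than minimum's."""
    return snapshot_key(version) >= snapshot_key(minimum)


minimum = "kernel-4.11.0-0.rc2.git1.1"
for candidate in (
    "4.11.0-0.rc2.git0.1.fc26",
    "4.11.0-0.rc2.git1.1.fc27",
    "4.11.0-0.rc2.git2.2.fc26",
):
    print(candidate, at_least(candidate, minimum))
```

Of the three builds mentioned in this bug, only rc2.git0.1 falls below the suggested minimum.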

Comment 24 Peter Robinson 2017-03-15 14:45:05 UTC
(In reply to Laura Abbott from comment #23)
> Please try kernel-4.11.0-0.rc2.git1.1 or greater. There was a fix
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce70df089143c49385b4f32f39d41fb50fbf6a7c
> that came in right after -rc2 which would affect powerpc

4.11.0-0.rc2.git1.1.fc27.ppc64le boots for me on the above mentioned machine :)

Comment 25 Fedora Update System 2017-03-16 02:50:54 UTC
bcm283x-firmware-20170314-2.509beaa.fc26 kernel-4.11.0-0.rc2.git2.2.fc26 linux-firmware-20170313-72.git695f2d6d.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-51a9cd79a0

Comment 26 Fedora Update System 2017-03-16 16:22:16 UTC
bcm283x-firmware-20170314-2.509beaa.fc26, kernel-4.11.0-0.rc2.git2.2.fc26, linux-firmware-20170313-72.git695f2d6d.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-51a9cd79a0

Comment 27 Adam Williamson 2017-03-16 21:24:20 UTC
pbrobinson: does rc2.git2.2 boot for you? that's scheduled to be in Alpha currently.

Comment 28 Menanteau Guy 2017-03-17 09:13:15 UTC
I updated an F25 system with F26 packages, installed kernel rc2.git2.2, and the reboot was OK on ppc64le.

Comment 29 Fedora Update System 2017-03-20 22:19:19 UTC
bcm283x-firmware-20170314-2.509beaa.fc26, kernel-4.11.0-0.rc2.git2.2.fc26, linux-firmware-20170313-72.git695f2d6d.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Comment 30 Adam Williamson 2017-03-24 16:57:30 UTC
Think we can close this.