Bug 1490895

Summary: kernel crash when trying to upgrade VM to F27
Product: [Fedora] Fedora
Reporter: Kamil Páral <kparal>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 26
CC: airlied, ajax, awilliam, bskeggs, chmelarz, eparis, esandeen, hdegoede, ichavero, itamar, jarodwilson, jforbes, jglisse, jonathan, josef, jwboyer, kernel-maint, labbott, linville, mchehab, mjg59, nhorman, quintela, robatino, steved
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: RejectedBlocker AcceptedFreezeException
Last Closed: 2017-09-15 13:12:05 UTC
Type: Bug
Bug Blocks: 1396703    
Attachments:
Description | Flags
journal during upgrade | none
rpm -qa | none
vm.xml | none

Description Kamil Páral 2017-09-12 12:40:44 UTC
Description of problem:
When I try to upgrade my VM from F26 to F27, the upgrade process gets stuck on the "starting upgrade, please wait" screen and nothing happens. The system doesn't react to VT switches, and a hard reboot is required. The journal contains this crash:

Sep 12 14:26:13 f26 kernel: ------------[ cut here ]------------
Sep 12 14:26:13 f26 kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo_util.c:589!
Sep 12 14:26:13 f26 kernel: invalid opcode: 0000 [#1] SMP
Sep 12 14:26:13 f26 kernel: Modules linked in: snd_hda_codec_generic crct10dif_pclmul crc32_pclmul ghash_clmulni
Sep 12 14:26:13 f26 kernel: CPU: 1 PID: 315 Comm: plymouthd Not tainted 4.12.11-300.fc26.x86_64 #1
Sep 12 14:26:13 f26 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
Sep 12 14:26:13 f26 kernel: task: ffffa033b6890000 task.stack: ffffbba8005d8000
Sep 12 14:26:13 f26 kernel: RIP: 0010:ttm_bo_kmap+0x1b5/0x260 [ttm]
Sep 12 14:26:13 f26 kernel: RSP: 0018:ffffbba8005dbb90 EFLAGS: 00010202
Sep 12 14:26:13 f26 kernel: RAX: ffffa033b6b66290 RBX: ffffa033fb9ff400 RCX: ffffa033fb9ff690
Sep 12 14:26:13 f26 kernel: RDX: 0000000000000300 RSI: 0000000000000000 RDI: ffffa033fb9ff458
Sep 12 14:26:13 f26 kernel: RBP: ffffbba8005dbbd0 R08: ffffa033fb9ff528 R09: 0000000000000400
Sep 12 14:26:13 f26 kernel: R10: 0000000000000008 R11: 0000000000001e86 R12: ffffa033fc8666b0
Sep 12 14:26:13 f26 kernel: R13: 0000000000000000 R14: ffffa033b604aa70 R15: 0000000000000000
Sep 12 14:26:13 f26 kernel: FS:  00007f15a0236d00(0000) GS:ffffa033ffd00000(0000) knlGS:0000000000000000
Sep 12 14:26:13 f26 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 12 14:26:13 f26 kernel: CR2: 00005633c67c40c8 CR3: 0000000036438000 CR4: 00000000003406e0
Sep 12 14:26:13 f26 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 12 14:26:13 f26 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 12 14:26:13 f26 kernel: Call Trace:
Sep 12 14:26:13 f26 kernel:  ? qxl_bo_kunmap_atomic_page+0x85/0x90 [qxl]
Sep 12 14:26:13 f26 kernel:  qxl_bo_kmap+0x42/0x70 [qxl]
Sep 12 14:26:13 f26 kernel:  qxl_draw_dirty_fb+0x1f5/0x420 [qxl]
Sep 12 14:26:13 f26 kernel:  qxl_framebuffer_surface_dirty+0xa0/0xf0 [qxl]
Sep 12 14:26:13 f26 kernel:  ? __kmalloc+0x1d1/0x210
Sep 12 14:26:13 f26 kernel:  drm_mode_dirtyfb_ioctl+0x17e/0x1c0 [drm]
Sep 12 14:26:13 f26 kernel:  drm_ioctl+0x213/0x4d0 [drm]
Sep 12 14:26:13 f26 kernel:  ? drm_mode_getfb+0x110/0x110 [drm]
Sep 12 14:26:13 f26 kernel:  do_vfs_ioctl+0xa5/0x600
Sep 12 14:26:13 f26 kernel:  ? security_file_ioctl+0x43/0x60
Sep 12 14:26:13 f26 kernel:  SyS_ioctl+0x79/0x90
Sep 12 14:26:13 f26 kernel:  do_syscall_64+0x67/0x140
Sep 12 14:26:13 f26 kernel:  entry_SYSCALL64_slow_path+0x25/0x25
Sep 12 14:26:13 f26 kernel: RIP: 0033:0x7f159f2215e7
Sep 12 14:26:13 f26 kernel: RSP: 002b:00007ffc3e140e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 12 14:26:13 f26 kernel: RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f159f2215e7
Sep 12 14:26:13 f26 kernel: RDX: 00007ffc3e140e50 RSI: 00000000c01864b1 RDI: 0000000000000009
Sep 12 14:26:13 f26 kernel: RBP: 00007ffc3e140e50 R08: 00007f159d14377c R09: 0000000000000010
Sep 12 14:26:13 f26 kernel: R10: 000000000000000a R11: 0000000000000246 R12: 00000000c01864b1
Sep 12 14:26:13 f26 kernel: R13: 0000000000000009 R14: 00005633c63e0330 R15: 00007f159dca078c
Sep 12 14:26:13 f26 kernel: Code: d0 49 8b be 80 00 00 00 48 c1 e6 0c 41 f6 46 62 04 74 4a 49 03 7e 70 4c 01 e7 
Sep 12 14:26:13 f26 kernel: RIP: ttm_bo_kmap+0x1b5/0x260 [ttm] RSP: ffffbba8005dbb90
Sep 12 14:26:13 f26 kernel: ---[ end trace 05d2301963119f2f ]---
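
For anyone triaging a similar journal, here is a minimal sketch of pulling the two identifying facts out of a trace like the one above: the `kernel BUG at` source location and the faulting `RIP` symbol. The function name `summarize_oops` and the regular expressions are my own assumptions for illustration, not part of any Fedora or kernel tool, and they only cover the line shapes seen in this report.

```python
import re

# Patterns for two lines of a kernel oops as they appear in the journal above.
# Both are illustrative assumptions, not an official oops format parser.
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")
RIP_RE = re.compile(
    r"RIP: (?:[0-9a-f]{4}:)?(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]"
)


def summarize_oops(lines):
    """Return the BUG location and the first faulting kernel symbol found."""
    summary = {}
    for text in lines:
        bug = BUG_RE.search(text)
        if bug:
            summary["file"] = bug.group("file")
            summary["line"] = int(bug.group("line"))
        rip = RIP_RE.search(text)
        # Keep only the first kernel-side RIP; the userspace RIP line
        # (a bare address) does not match the pattern at all.
        if rip and "symbol" not in summary:
            summary["symbol"] = rip.group("symbol")
            summary["module"] = rip.group("module")
    return summary


# Two lines copied from the journal in this report.
sample = [
    "Sep 12 14:26:13 f26 kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo_util.c:589!",
    "Sep 12 14:26:13 f26 kernel: RIP: 0010:ttm_bo_kmap+0x1b5/0x260 [ttm]",
]
print(summarize_oops(sample))
```

The resulting file, line, symbol, and module are the kind of key one would search for when looking for an existing report of the same crash.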


This is reproducible every time for me. Once I remove "rhgb quiet" from the kernel command line, the system upgrade proceeds as expected.


Version-Release number of selected component (if applicable):
kernel-4.12.11-300.fc26.x86_64

How reproducible:
possibly always

Steps to Reproduce:
1. dnf system-upgrade download --releasever=27 --enablerepo=updates-testing
2. dnf system-upgrade reboot
3. see stuck screen on "starting upgrade"
4. repeat without "rhgb quiet"
5. see working upgrade
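
Step 4 above amounts to dropping two bare arguments from the kernel command line. A rough sketch of that string manipulation follows; the helper name `strip_boot_splash_args` and the example command line are hypothetical, and the sketch only computes the new string rather than changing any boot entry.

```python
def strip_boot_splash_args(cmdline, drop=("rhgb", "quiet")):
    """Return the kernel command line without the given bare arguments."""
    # Split on whitespace and keep every argument that is not in the drop list.
    kept = [arg for arg in cmdline.split() if arg not in drop]
    return " ".join(kept)


# Hypothetical command line; the root= value and "ro" are illustrative only
# and are not taken from the affected VM.
example = "root=/dev/mapper/fedora-root ro rhgb quiet"
print(strip_boot_splash_args(example))
```

The result would still have to be applied to the boot entry by hand, for example by editing the entry in the boot menu for a single boot.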

Additional info:
This might be connected to bug 1490832. Perhaps the stuck screen is caused by dnf and not by the kernel? Either way, the crash occurs every time.

Comment 1 Kamil Páral 2017-09-12 12:41:41 UTC
Created attachment 1324864: journal during upgrade

Comment 2 Kamil Páral 2017-09-12 12:41:53 UTC
Created attachment 1324865: rpm -qa

Comment 3 Kamil Páral 2017-09-12 12:42:09 UTC
Created attachment 1324866: vm.xml

Comment 4 Kamil Páral 2017-09-12 12:43:10 UTC
Proposing as a beta blocker because upgrades must work. Please note this might not affect bare-metal machines, only (certain) VMs.

Comment 5 Kamil Páral 2017-09-12 12:45:56 UTC
This might also have been detected by OpenQA, but there are no logs to confirm it:
https://openqa.fedoraproject.org/tests/140446

Comment 6 Zdenek Chmelar 2017-09-12 14:19:29 UTC
I have exactly the same problem with F27 in Gnome-Boxes.
Each time I boot the system with kernel 4.13.0-1.fc27.x86_64 (or the prior release candidates), the boot process hangs and I have to kill the system.
Removing "rhgb quiet" from the kernel command line in the boot menu lets the system boot to the end, but the boot process does not finish with a command-line login prompt or a desktop session. The screen just shows the last boot messages.
If I switch to another TTY and log in on the command line, the system works until I try to start a desktop session (Wayland). Then it freezes again.
To start F27 properly, I use the working kernel 4.11.8-300.fc26.x86_64.

Logs from journal

Sep 12 15:05:41 localhost.localdomain kernel: ------------[ cut here ]------------
Sep 12 15:05:41 localhost.localdomain kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo_util.c:589!
Sep 12 15:05:41 localhost.localdomain kernel: invalid opcode: 0000 [#1] SMP
Sep 12 15:05:41 localhost.localdomain kernel: Modules linked in: snd_intel8x0 snd_ac97_codec ac97_bus crct10dif_pclmul crc32_pclmul snd_seq ppdev snd_seq_device ghash_clmulni_intel snd_pcm parport_pc parport snd
Sep 12 15:05:41 localhost.localdomain kernel: CPU: 3 PID: 336 Comm: plymouthd Not tainted 4.13.0-1.fc27.x86_64 #1
Sep 12 15:05:41 localhost.localdomain kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
Sep 12 15:05:41 localhost.localdomain kernel: task: ffff97d7b5754c80 task.stack: ffffbcf3c0a28000
Sep 12 15:05:41 localhost.localdomain kernel: RIP: 0010:ttm_bo_kmap+0x1b5/0x260 [ttm]
Sep 12 15:05:41 localhost.localdomain kernel: RSP: 0018:ffffbcf3c0a2bb58 EFLAGS: 00010283
Sep 12 15:05:41 localhost.localdomain kernel: RAX: ffff97d7b5787190 RBX: ffff97d7b5720800 RCX: ffff97d7b5720a90
Sep 12 15:05:41 localhost.localdomain kernel: RDX: 0000000000000300 RSI: 0000000000000000 RDI: ffff97d7b5720858
Sep 12 15:05:41 localhost.localdomain kernel: RBP: ffffbcf3c0a2bb98 R08: ffff97d7b5720928 R09: 0000000000000400
Sep 12 15:05:41 localhost.localdomain kernel: R10: 0000000000000008 R11: 0000000000000fe4 R12: ffff97d7b5c626a8
Sep 12 15:05:41 localhost.localdomain kernel: R13: 0000000000000000 R14: ffff97d7b95025c8 R15: 0000000000000000
Sep 12 15:05:41 localhost.localdomain kernel: FS:  00007f84a5b81240(0000) GS:ffff97d7be980000(0000) knlGS:0000000000000000
Sep 12 15:05:41 localhost.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 12 15:05:41 localhost.localdomain kernel: CR2: 000055adf6732870 CR3: 00000001355ad000 CR4: 00000000003406e0
Sep 12 15:05:41 localhost.localdomain kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 12 15:05:41 localhost.localdomain kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 12 15:05:41 localhost.localdomain kernel: Call Trace:
Sep 12 15:05:41 localhost.localdomain kernel:  ? qxl_bo_kunmap_atomic_page+0x85/0x90 [qxl]
Sep 12 15:05:41 localhost.localdomain kernel:  qxl_bo_kmap+0x42/0x70 [qxl]
Sep 12 15:05:41 localhost.localdomain kernel:  qxl_draw_dirty_fb+0x1f5/0x420 [qxl]
Sep 12 15:05:41 localhost.localdomain kernel:  qxl_framebuffer_surface_dirty+0xa0/0xf0 [qxl]
Sep 12 15:05:41 localhost.localdomain kernel:  ? __kmalloc+0x1d1/0x210
Sep 12 15:05:41 localhost.localdomain kernel:  drm_mode_dirtyfb_ioctl+0x17e/0x1c0 [drm]
Sep 12 15:05:41 localhost.localdomain kernel:  ? drm_mode_getfb+0x110/0x110 [drm]
Sep 12 15:05:41 localhost.localdomain kernel:  drm_ioctl_kernel+0x5d/0xb0 [drm]
Sep 12 15:05:41 localhost.localdomain kernel:  drm_ioctl+0x31b/0x3d0 [drm]
Sep 12 15:05:41 localhost.localdomain kernel:  ? drm_mode_getfb+0x110/0x110 [drm]
Sep 12 15:05:41 localhost.localdomain kernel:  do_vfs_ioctl+0xa5/0x600
Sep 12 15:05:41 localhost.localdomain kernel:  ? security_file_ioctl+0x43/0x60
Sep 12 15:05:41 localhost.localdomain kernel:  SyS_ioctl+0x79/0x90
Sep 12 15:05:41 localhost.localdomain kernel:  do_syscall_64+0x67/0x140
Sep 12 15:05:41 localhost.localdomain kernel:  entry_SYSCALL64_slow_path+0x25/0x25
Sep 12 15:05:41 localhost.localdomain kernel: RIP: 0033:0x7f84a48dd0d7
Sep 12 15:05:41 localhost.localdomain kernel: RSP: 002b:00007ffdf3d048a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 12 15:05:41 localhost.localdomain kernel: RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f84a48dd0d7
Sep 12 15:05:41 localhost.localdomain kernel: RDX: 00007ffdf3d048e0 RSI: 00000000c01864b1 RDI: 0000000000000009
Sep 12 15:05:41 localhost.localdomain kernel: RBP: 00007ffdf3d048e0 R08: 00007f84a30af77c R09: 000055ad7524cc20
Sep 12 15:05:41 localhost.localdomain kernel: R10: 0000000000000007 R11: 0000000000000246 R12: 00000000c01864b1
Sep 12 15:05:41 localhost.localdomain kernel: R13: 0000000000000009 R14: 000055ad74ef7e90 R15: 00007f84a3c0c78c
Sep 12 15:05:41 localhost.localdomain kernel: Code: d0 49 8b be 80 00 00 00 48 c1 e6 0c 41 f6 46 62 04 74 4a 49 03 7e 70 4c 01 e7 e8 c7 72 e0 d9 48 89 03 44 8b 45 d0 e9 18 ff ff ff <0f> 0b 4b 8d 7c 2c 58 44 89 4
Sep 12 15:05:41 localhost.localdomain kernel: RIP: ttm_bo_kmap+0x1b5/0x260 [ttm] RSP: ffffbcf3c0a2bb58
Sep 12 15:05:41 localhost.localdomain kernel: ---[ end trace a8e66fc5b2d12371 ]---

Comment 7 Adam Williamson 2017-09-12 19:15:17 UTC
kparal: openQA tests don't use qxl; they use the 'std' driver in qemu instead, so that's very likely not quite the same failure.

openQA uploads the logs much the same way it does everything else - from within the SUT, using a tty in this case - so of course if it can't actually get to a working console on tty6, it won't be able to upload logs. There are a few things we could try to work around intermittent boot failures, but we haven't got around to trying any of them yet.

I suspect this is actually just the same as https://bugzilla.redhat.com/show_bug.cgi?id=1462381 - that's a known bug where qxl + graphical boot has problems on kernel 4.12 (and, apparently, early 4.13 too). The traceback in https://bugzilla.redhat.com/show_bug.cgi?id=1462381#c8 looks about the same as yours and Zdenek's. One reporter says that 4.12.12 (for F25 and F26) fixes this; it looks like jforbes backported a patch that's only just been submitted upstream:

https://www.spinics.net/lists/dri-devel/msg151958.html

but didn't backport it to f27 / rawhide kernels (yet). So current status is, I think, that the kernels in updates-testing for f25 and f26 fix this, but current f25 and f26 stable still have the bug, and so do f27 and rawhide.
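
As a rough illustration of what "looks about the same" means for two tracebacks, the sketch below strips offsets and the unreliable `?` frames from call-trace lines and reports the frames two traces share from the top. The names `reliable_frames` and `common_prefix` and the regular expression are assumptions made for this example; the sample lines are taken from the two traces in this report.

```python
import re

# Matches a call-trace line as logged in this report; group 1 is set when the
# frame is marked unreliable with a leading "?". Illustrative pattern only.
FRAME_RE = re.compile(r"kernel:\s+(\?\s+)?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(lines):
    """List call-trace symbols, skipping '?' frames and dropping offsets."""
    frames = []
    for text in lines:
        match = FRAME_RE.search(text)
        if match and not match.group(1):
            frames.append(match.group("symbol"))
    return frames


def common_prefix(first, second):
    """Return the leading frames that both traces have in common."""
    shared = []
    for a, b in zip(first, second):
        if a != b:
            break
        shared.append(a)
    return shared


# Excerpt of the 4.12.11 trace from the description.
trace_f26 = [
    "Sep 12 14:26:13 f26 kernel:  ? qxl_bo_kunmap_atomic_page+0x85/0x90 [qxl]",
    "Sep 12 14:26:13 f26 kernel:  qxl_bo_kmap+0x42/0x70 [qxl]",
    "Sep 12 14:26:13 f26 kernel:  qxl_draw_dirty_fb+0x1f5/0x420 [qxl]",
    "Sep 12 14:26:13 f26 kernel:  drm_mode_dirtyfb_ioctl+0x17e/0x1c0 [drm]",
    "Sep 12 14:26:13 f26 kernel:  drm_ioctl+0x213/0x4d0 [drm]",
]
# Excerpt of the 4.13.0 trace from comment 6.
trace_f27 = [
    "Sep 12 15:05:41 localhost.localdomain kernel:  qxl_bo_kmap+0x42/0x70 [qxl]",
    "Sep 12 15:05:41 localhost.localdomain kernel:  qxl_draw_dirty_fb+0x1f5/0x420 [qxl]",
    "Sep 12 15:05:41 localhost.localdomain kernel:  drm_mode_dirtyfb_ioctl+0x17e/0x1c0 [drm]",
    "Sep 12 15:05:41 localhost.localdomain kernel:  drm_ioctl_kernel+0x5d/0xb0 [drm]",
]
print(common_prefix(reliable_frames(trace_f26), reliable_frames(trace_f27)))
```

A shared run of qxl and drm frames at the top of both traces suggests the same code path, although it does not by itself prove the same root cause.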

Comment 8 Adam Williamson 2017-09-12 23:25:53 UTC
I'm at least +1 FE on this, for the record, probably -1 Beta blocker as it's pretty easy to work around (just take out rhgb).

Comment 9 Kamil Páral 2017-09-13 09:04:32 UTC
I tried kernel-4.12.12-300.fc26, and not only does it not improve the situation (the traceback is still there, and the system still freezes), but it also causes massive graphical corruption in the running system:
https://bodhi.fedoraproject.org/updates/kernel-4.12.12-300.fc26#comment-658704

Comment 10 Dennis Gilmore 2017-09-14 17:23:18 UTC
+1 FE -1 Blocker

Comment 11 Adam Williamson 2017-09-15 02:12:57 UTC
Discussed at 2017-09-14 Beta Go/No-Go meeting, acting as a blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-meeting-2/2017-09-14/f27-beta-go-no-go-meeting.2017-09-14-17.00.html . Rejected as a blocker as it's specific to qxl VMs and easy to work around (by removing 'rhgb'), but accepted as a freeze exception as it *would* be nice to fix this.

Kamil, any objection to just closing this as a dupe?

Comment 12 Kamil Páral 2017-09-15 13:12:05 UTC

*** This bug has been marked as a duplicate of bug 1462381 ***