Bug 1490895
| Field | Value |
|---|---|
| Summary | kernel crash when trying upgrade VM to F27 |
| Product | [Fedora] Fedora |
| Component | kernel |
| Status | CLOSED DUPLICATE |
| Severity | unspecified |
| Priority | unspecified |
| Version | 26 |
| Reporter | Kamil Páral <kparal> |
| Assignee | Kernel Maintainer List <kernel-maint> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | airlied, ajax, awilliam, bskeggs, chmelarz, eparis, esandeen, hdegoede, ichavero, itamar, jarodwilson, jforbes, jglisse, jonathan, josef, jwboyer, kernel-maint, labbott, linville, mchehab, mjg59, nhorman, quintela, robatino, steved |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | RejectedBlocker AcceptedFreezeException |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Last Closed | 2017-09-15 13:12:05 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 1396703 |
Description
Kamil Páral
2017-09-12 12:40:44 UTC
Created attachment 1324864: journal during upgrade
Created attachment 1324865: rpm -qa
Created attachment 1324866: vm.xml
Proposing as a beta blocker because upgrades must work. Please note that this might not affect bare metal, only (certain) VMs. This might also have been detected by OpenQA, but there are no logs to confirm this: https://openqa.fedoraproject.org/tests/140446

I have exactly the same problem with F27 in Gnome-Boxes. Each time I boot the system with kernel 4.13.0-1.fc27.x86_64 (or the prior release candidates), the boot process hangs and I have to kill the system. Removing the "rhgb quiet" arguments from the kernel boot menu entry lets the boot run to the end, but it does not finish with a console login prompt or a desktop session; the screen just shows the last boot messages. If I switch to another TTY and log in on the command line, the system works until I try to start a desktop session (Wayland). Then it freezes again. To start F27 properly, I use the working kernel 4.11.8-300.fc26.x86_64.

Logs from journal:

Sep 12 15:05:41 localhost.localdomain kernel: ------------[ cut here ]------------
Sep 12 15:05:41 localhost.localdomain kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo_util.c:589!
Sep 12 15:05:41 localhost.localdomain kernel: invalid opcode: 0000 [#1] SMP
Sep 12 15:05:41 localhost.localdomain kernel: Modules linked in: snd_intel8x0 snd_ac97_codec ac97_bus crct10dif_pclmul crc32_pclmul snd_seq ppdev snd_seq_device ghash_clmulni_intel snd_pcm parport_pc parport snd
Sep 12 15:05:41 localhost.localdomain kernel: CPU: 3 PID: 336 Comm: plymouthd Not tainted 4.13.0-1.fc27.x86_64 #1
Sep 12 15:05:41 localhost.localdomain kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
Sep 12 15:05:41 localhost.localdomain kernel: task: ffff97d7b5754c80 task.stack: ffffbcf3c0a28000
Sep 12 15:05:41 localhost.localdomain kernel: RIP: 0010:ttm_bo_kmap+0x1b5/0x260 [ttm]
Sep 12 15:05:41 localhost.localdomain kernel: RSP: 0018:ffffbcf3c0a2bb58 EFLAGS: 00010283
Sep 12 15:05:41 localhost.localdomain kernel: RAX: ffff97d7b5787190 RBX: ffff97d7b5720800 RCX: ffff97d7b5720a90
Sep 12 15:05:41 localhost.localdomain kernel: RDX: 0000000000000300 RSI: 0000000000000000 RDI: ffff97d7b5720858
Sep 12 15:05:41 localhost.localdomain kernel: RBP: ffffbcf3c0a2bb98 R08: ffff97d7b5720928 R09: 0000000000000400
Sep 12 15:05:41 localhost.localdomain kernel: R10: 0000000000000008 R11: 0000000000000fe4 R12: ffff97d7b5c626a8
Sep 12 15:05:41 localhost.localdomain kernel: R13: 0000000000000000 R14: ffff97d7b95025c8 R15: 0000000000000000
Sep 12 15:05:41 localhost.localdomain kernel: FS: 00007f84a5b81240(0000) GS:ffff97d7be980000(0000) knlGS:0000000000000000
Sep 12 15:05:41 localhost.localdomain kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 12 15:05:41 localhost.localdomain kernel: CR2: 000055adf6732870 CR3: 00000001355ad000 CR4: 00000000003406e0
Sep 12 15:05:41 localhost.localdomain kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 12 15:05:41 localhost.localdomain kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 12 15:05:41 localhost.localdomain kernel: Call Trace:
Sep 12 15:05:41 localhost.localdomain kernel: ? qxl_bo_kunmap_atomic_page+0x85/0x90 [qxl]
Sep 12 15:05:41 localhost.localdomain kernel: qxl_bo_kmap+0x42/0x70 [qxl]
Sep 12 15:05:41 localhost.localdomain kernel: qxl_draw_dirty_fb+0x1f5/0x420 [qxl]
Sep 12 15:05:41 localhost.localdomain kernel: qxl_framebuffer_surface_dirty+0xa0/0xf0 [qxl]
Sep 12 15:05:41 localhost.localdomain kernel: ? __kmalloc+0x1d1/0x210
Sep 12 15:05:41 localhost.localdomain kernel: drm_mode_dirtyfb_ioctl+0x17e/0x1c0 [drm]
Sep 12 15:05:41 localhost.localdomain kernel: ? drm_mode_getfb+0x110/0x110 [drm]
Sep 12 15:05:41 localhost.localdomain kernel: drm_ioctl_kernel+0x5d/0xb0 [drm]
Sep 12 15:05:41 localhost.localdomain kernel: drm_ioctl+0x31b/0x3d0 [drm]
Sep 12 15:05:41 localhost.localdomain kernel: ? drm_mode_getfb+0x110/0x110 [drm]
Sep 12 15:05:41 localhost.localdomain kernel: do_vfs_ioctl+0xa5/0x600
Sep 12 15:05:41 localhost.localdomain kernel: ? security_file_ioctl+0x43/0x60
Sep 12 15:05:41 localhost.localdomain kernel: SyS_ioctl+0x79/0x90
Sep 12 15:05:41 localhost.localdomain kernel: do_syscall_64+0x67/0x140
Sep 12 15:05:41 localhost.localdomain kernel: entry_SYSCALL64_slow_path+0x25/0x25
Sep 12 15:05:41 localhost.localdomain kernel: RIP: 0033:0x7f84a48dd0d7
Sep 12 15:05:41 localhost.localdomain kernel: RSP: 002b:00007ffdf3d048a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 12 15:05:41 localhost.localdomain kernel: RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f84a48dd0d7
Sep 12 15:05:41 localhost.localdomain kernel: RDX: 00007ffdf3d048e0 RSI: 00000000c01864b1 RDI: 0000000000000009
Sep 12 15:05:41 localhost.localdomain kernel: RBP: 00007ffdf3d048e0 R08: 00007f84a30af77c R09: 000055ad7524cc20
Sep 12 15:05:41 localhost.localdomain kernel: R10: 0000000000000007 R11: 0000000000000246 R12: 00000000c01864b1
Sep 12 15:05:41 localhost.localdomain kernel: R13: 0000000000000009 R14: 000055ad74ef7e90 R15: 00007f84a3c0c78c
Sep 12 15:05:41 localhost.localdomain kernel: Code: d0 49 8b be 80 00 00 00 48 c1 e6 0c 41 f6 46 62 04 74 4a 49 03 7e 70 4c 01 e7 e8 c7 72 e0 d9 48 89 03 44 8b 45 d0 e9 18 ff ff ff <0f> 0b 4b 8d 7c 2c 58 44 89 4
Sep 12 15:05:41 localhost.localdomain kernel: RIP: ttm_bo_kmap+0x1b5/0x260 [ttm] RSP: ffffbcf3c0a2bb58
Sep 12 15:05:41 localhost.localdomain kernel: ---[ end trace a8e66fc5b2d12371 ]---

kparal: openQA tests don't use qxl, they use the 'std' driver in qemu instead, so that's very likely not quite the same failure. openQA uploads the logs much the same way it does everything else - from within the SUT, using a tty in this case - so of course if it can't actually get to a working console on tty6, it won't be able to upload logs. There are a few things we could try to work around intermittent boot failures, but we haven't got around to trying any of them yet.

I suspect this is actually just the same as https://bugzilla.redhat.com/show_bug.cgi?id=1462381 - that's a known bug where qxl + graphical boot has problems on kernel 4.12 (and, apparently, early 4.13 too). The traceback in https://bugzilla.redhat.com/show_bug.cgi?id=1462381#c8 looks about the same as yours and Zdenek's. One reporter says that 4.12.12 (for F25 and F26) fixes this; it looks like jforbes backported a patch that's only just been submitted upstream: https://www.spinics.net/lists/dri-devel/msg151958.html but didn't backport it to f27 / rawhide kernels (yet). So the current status is, I think, that the kernels in updates-testing for f25 and f26 fix this, but current f25 and f26 stable still have the bug, and so do f27 and rawhide.

I'm at least +1 FE on this, for the record, and probably -1 Beta blocker, as it's pretty easy to work around (just take out rhgb).
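
When comparing tracebacks between reports, as is done above, it can help to reduce each journal excerpt to the fields that identify the crash. The following is only a sketch and not part of the original report: the function name `oops_signature` and both regular expressions are illustrative assumptions, written against the log format quoted in this bug.

```python
import re

# Assumed patterns, based on the journal lines quoted in this report.
BUG_AT = re.compile(r"kernel BUG at (\S+?):(\d+)!")
RIP_SYMBOL = re.compile(
    r"RIP: (?:[0-9a-f]{4}:)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def oops_signature(journal_text):
    """Return (source_file, line, function, module) for the first kernel BUG, or None."""
    bug = BUG_AT.search(journal_text)
    if bug is None:
        return None
    # Look for the faulting symbol only after the BUG line itself.
    rip = RIP_SYMBOL.search(journal_text, bug.end())
    function, module = (rip.group(1), rip.group(2)) if rip else (None, None)
    return (bug.group(1), int(bug.group(2)), function, module)


# Three lines copied from the journal in this report.
sample = """\
Sep 12 15:05:41 localhost.localdomain kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo_util.c:589!
Sep 12 15:05:41 localhost.localdomain kernel: invalid opcode: 0000 [#1] SMP
Sep 12 15:05:41 localhost.localdomain kernel: RIP: 0010:ttm_bo_kmap+0x1b5/0x260 [ttm]
"""
print(oops_signature(sample))
```

Two excerpts that yield the same tuple probably hit the same BUG check, though matching signatures alone do not prove the same root cause.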
I tried kernel-4.12.12-300.fc26, and not only does it not improve the situation (the traceback is still there, and the system is still frozen), it also causes massive graphical corruption in the running system: https://bodhi.fedoraproject.org/updates/kernel-4.12.12-300.fc26#comment-658704

+1 FE -1 Blocker

Discussed at the 2017-09-14 Beta Go/No-Go meeting, acting as a blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-meeting-2/2017-09-14/f27-beta-go-no-go-meeting.2017-09-14-17.00.html . Rejected as a blocker as it's specific to qxl VMs and easy to work around (by removing 'rhgb'), but accepted as a freeze exception as it *would* be nice to fix this.

Kamil, any objection to just closing this as a dupe?

*** This bug has been marked as a duplicate of bug 1462381 ***
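
The workaround mentioned in this report (taking `rhgb` out of the kernel arguments) amounts to a small edit of the kernel command line. A minimal sketch, assuming a hypothetical command line string; the helper `drop_kernel_args` and the example value are illustrative and not taken from the affected VM:

```python
def drop_kernel_args(cmdline, unwanted=("rhgb",)):
    """Return the kernel command line with the unwanted bare arguments removed."""
    kept = [arg for arg in cmdline.split() if arg not in unwanted]
    return " ".join(kept)


# Hypothetical command line, for illustration only.
example = "root=/dev/mapper/fedora-root ro rhgb quiet"
print(drop_kernel_args(example))                     # rhgb removed
print(drop_kernel_args(example, ("rhgb", "quiet")))  # both removed
```

This only shows the string transformation. On an installed Fedora system a persistent change is normally made through the bootloader tooling, for example `grubby --update-kernel=ALL --remove-args="rhgb"`; that command is not from this report, so check it against the grubby documentation for your release before relying on it.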