1445312 – Kernel 4.10.10-200 panics in nouveau module

Summary: Kernel 4.10.10-200 panics in nouveau module

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	xorg-x11-drv-nouveau
Sub Component:
Version:	25
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Ben Skeggs
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1497548

Reported:	2017-04-25 13:18 UTC by Colin.Simpson
Modified:	2017-12-12 10:12 UTC
CC List:	13 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-12-12 10:12:13 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
Backtrace output (3.61 KB, text/plain) 2017-04-25 13:18 UTC, Colin.Simpson
vmcore-dmesg-4.11.6-201 (75.65 KB, text/plain) 2017-06-27 16:56 UTC, Colin.Simpson

Description Colin.Simpson 2017-04-25 13:18:23 UTC

Created attachment 1273911
Backtrace output

Description of problem:
All the released 4.10 kernels have hung my machine hard under Wayland. This seems to happen a few times a day.

I added kdump and have a backtrace from 4.10.10-200.

Version-Release number of selected component (if applicable):
4.10.10-200

How reproducible:
Wait long enough during the day

Steps to Reproduce:
1. Run Wayland desktop
2. Desktop hangs (often seems to be in Window overview mode)


Additional info:
Managed to get a BT from kdump.

PANIC: "divide error: 0000 [#1] SMP"
 [exception RIP: drm_calc_vbltimestamp_from_scanoutpos+358]

Complete BT attached.

Comment 1 Colin.Simpson 2017-05-09 10:51:23 UTC

Still seeing this with 4.10.13-200.fc25.x86_64

[  502.862336] RIP: 0010:drm_calc_vbltimestamp_from_scanoutpos+0x166/0x310 [drm]

Looks like this may be the upstream report:

https://bugs.freedesktop.org/show_bug.cgi?id=100691

But I would have thought this would be more commonly seen.

Comment 2 Colin.Simpson 2017-05-15 09:29:39 UTC

Still hosed in 4.10.14-200.fc25.x86_64

RIP: drm_calc_vbltimestamp_from_scanoutpos+0x166/0x310 [drm] RSP: ffff9cb97dc03aa8

Usually goes when you move the mouse to the corner and get Window overview mode.

Comment 3 Laura Abbott 2017-05-15 16:10:46 UTC

Let's have the nouveau team look at this

Comment 4 Colin.Simpson 2017-05-17 13:06:13 UTC

Just keeping this ticket up to date. Kernel 4.10.15-200.fc25.x86_64 still has this issue. Thanks for passing this to the nouveau team.

Comment 5 Colin.Simpson 2017-05-29 10:00:43 UTC

And still in 4.10.17-200.fc25.x86_64

 RIP: drm_calc_vbltimestamp_from_scanoutpos+0x166/0x310 [drm] RSP: ffff895abdc03aa8

Are the nouveau team thinking anything about this bug? Concerned it may make it into F26.

Thanks

Comment 6 Colin.Simpson 2017-06-02 10:36:24 UTC

And still in 4.11.3-200.fc25.x86_64

RIP: drm_calc_vbltimestamp_from_scanoutpos+0x14f/0x2d0 [drm] RSP: ffff89ec3dd83af8

Nouveau team, is there any information? Can you say something about this?

Comment 7 Colin.Simpson 2017-06-16 17:24:14 UTC

Still in 4.11.4-200.fc25.x86_64

RIP: drm_calc_vbltimestamp_from_scanoutpos+0x14f/0x2d0 [drm] RSP: ffffa0897dc43af8

You aren't exactly giving me much faith in Nouveau. For example, I have a problem in RHEL open with RH support and one of their suggestions was to switch from NVIDIA proprietary to Nouveau. I have no faith in doing this, given that there seems to be little interest in fixing bleeding edge bugs in Nouveau.

Also what's the point in a RHEL customer trying Fedora for future RHEL technology previews, if there is no interest (not even a reply) on the issues we find.

Comment 8 Laura Abbott 2017-06-16 17:52:02 UTC

Please test 4.11.5 which should have a fix.

*** This bug has been marked as a duplicate of bug 1460456 ***

Comment 9 Laura Abbott 2017-06-16 17:56:21 UTC

Sorry about that, I misread the function that was panicking so I don't think that particular fix will apply but it's still worth testing 4.11.5 which may have other nouveau fixes.

Comment 10 Ben Skeggs 2017-06-16 22:05:59 UTC

Can you post a full kernel log please?

I'm aware of this issue from the upstream report, but have never seen it myself, despite testing on a LOT of different hardware in the same timeframe.  I also cannot pinpoint why this would be happening from a scan over the relevant changes.

In this case, since it's random, doing a bisect between 4.9 and 4.10 to determine the exact commit causing it would be problematic.  It'd be useful if it could be managed, but there'd be some doubt in the result due to the randomness.
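To put rough numbers on that doubt, here is a minimal sketch. It is not from the report: it assumes the crashes arrive as a Poisson process at a constant rate, and the rate of 3 per day is only an illustrative reading of "a few times a day". The function names and the commit count are hypothetical.

```python
import math

def false_good_probability(crashes_per_day, hours_tested):
    """P(no crash in hours_tested) on a kernel that DOES contain the bug.

    Assumption: crashes follow a Poisson process with a constant rate.
    """
    rate_per_hour = crashes_per_day / 24.0
    return math.exp(-rate_per_hour * hours_tested)

def hours_for_confidence(crashes_per_day, max_false_good):
    """Crash-free hours needed before a kernel may be marked 'good'
    with at most max_false_good probability of being wrong."""
    rate_per_hour = crashes_per_day / 24.0
    return -math.log(max_false_good) / rate_per_hour

def bisect_steps(commit_count):
    """Number of kernels a bisect over commit_count commits has to test."""
    return math.ceil(math.log2(commit_count))

if __name__ == "__main__":
    crashes_per_day = 3      # assumed; the report only says "a few times a day"
    commit_count = 10000     # illustrative placeholder, not the real 4.9..4.10 count
    hours = hours_for_confidence(crashes_per_day, 0.01)
    steps = bisect_steps(commit_count)
    print(f"{hours:.1f} crash-free hours per 'good' verdict, up to {steps} steps")
```

Under these assumptions every "good" verdict costs more than a day of uptime, and a single wrong verdict sends the bisect to the wrong commit, which is why the result would carry some doubt.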

Comment 11 Colin.Simpson 2017-06-19 09:29:11 UTC

Not sure what you mean by the full kernel log.

I have a complete crashdump from 4.11.5-200, if I can send that to you somehow?

Comment 12 Colin.Simpson 2017-06-27 10:42:20 UTC

I now have a complete crashdump for 4.11.6-201.fc25.x86_64.

How can I get this to you to move this forward?

Comment 13 Colin.Simpson 2017-06-27 16:56:34 UTC

Created attachment 1292407
vmcore-dmesg-4.11.6-201

Comment 14 Colin.Simpson 2017-06-27 16:57:10 UTC

Attached what I thought you may have needed.

Comment 15 Colin.Simpson 2017-07-17 09:20:31 UTC

This is now happening on F26, not hugely surprising as it hasn't been addressed. But I no longer have an old kernel to back out to. 

 Linux version 4.11.9-300.fc26.x86_64
RIP: drm_calc_vbltimestamp_from_scanoutpos+0x14f/0x2d0 [drm] RSP: ffff8ce7fdc03b08


Ben, I asked how to send you a crash dump but have never heard back from you.

Can you let me know how we can address this bug?

Comment 16 Colin.Simpson 2017-07-19 09:40:13 UTC

I'd hoped to stop this happening by running Xorg instead of Wayland for my Gnome session. But arghh it still happens with Xorg.

RIP: drm_calc_vbltimestamp_from_scanoutpos+0x14f/0x2d0 [drm] RSP: ffff8d2ebddc3b08
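The RIP lines posted so far can be tabulated to check that each kernel series crashes at one consistent site. This is a minimal sketch, not part of the original report: the helper name `parse_rip` and the regular expression are illustrative, and the input lines are copied from the comments above.

```python
import re

# (kernel version, RIP line) pairs copied from earlier comments in this report.
REPORTS = [
    ("4.10.13-200.fc25", "RIP: 0010:drm_calc_vbltimestamp_from_scanoutpos+0x166/0x310 [drm]"),
    ("4.10.14-200.fc25", "RIP: drm_calc_vbltimestamp_from_scanoutpos+0x166/0x310 [drm] RSP: ffff9cb97dc03aa8"),
    ("4.11.3-200.fc25", "RIP: drm_calc_vbltimestamp_from_scanoutpos+0x14f/0x2d0 [drm] RSP: ffff89ec3dd83af8"),
    ("4.11.9-300.fc26", "RIP: drm_calc_vbltimestamp_from_scanoutpos+0x14f/0x2d0 [drm] RSP: ffff8ce7fdc03b08"),
]

# symbol+0xOFFSET/0xSIZE [module], with an optional "0010:" segment prefix.
RIP_RE = re.compile(r"RIP: (?:[0-9a-f]{4}:)?(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+) \[(\w+)\]")

def parse_rip(line):
    """Return (symbol, offset, size, module), or None if the line does not match."""
    match = RIP_RE.search(line)
    if match is None:
        return None
    symbol, offset, size, module = match.groups()
    return symbol, int(offset, 16), int(size, 16), module

# Group kernel versions by crash site.
sites = {}
for version, line in REPORTS:
    symbol, offset, size, _module = parse_rip(line)
    sites.setdefault((symbol, offset, size), []).append(version)

for (symbol, offset, size), versions in sites.items():
    print(f"{symbol}+{offset:#x}/{size:#x}: {', '.join(versions)}")
```

The offsets differ between the 4.10 and 4.11 builds because the function was compiled to a different size, but the symbol is the same in every report, under Wayland and under Xorg alike.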

Comment 17 Carl Henrik Lunde 2017-09-18 12:48:17 UTC

This has started happening in RHEL after upgrading from 7.3 to 7.4.

3.10.0-693.2.1.el7.x86_64

[1029086.283151] divide error: 0000 [#1] SMP
[1029086.283167] Modules linked in: sctp_diag sctp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc xt_nat veth nls_utf8 cifs dns_resolver tcp_lp fuse xt_addrtype br_netfilter xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter btrfs raid6_pq xor dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio snd_hda_codec_hdmi intel_powerclamp
[1029086.283360]  coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel mei_wdt hp_wmi ppdev sparse_keymap aesni_intel rfkill snd_hda_codec_realtek lrw gf128mul snd_hda_codec_generic glue_helper ablk_helper cryptd snd_hda_intel snd_hda_codec snd_hda_core i2c_i801 pcspkr snd_hwdep snd_seq joydev snd_seq_device snd_pcm parport_pc snd_timer parport snd sg soundcore mei_me mei tpm_infineon shpchp acpi_pad nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic nouveau mxm_wmi i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci e1000e libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw ptp pps_core i2c_core wmi video dm_mirror dm_region_hash dm_log dm_mod
[1029086.283558] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-693.1.1.el7.x86_64 #1
[1029086.283575] Hardware name: HP HP EliteDesk 800 G2 SFF/8054, BIOS N01 Ver. 02.16 08/08/2016
[1029086.283594] task: ffffffff819f9480 ti: ffffffff819e4000 task.ti: ffffffff819e4000
[1029086.283612] RIP: 0010:[<ffffffffc0172ad9>]  [<ffffffffc0172ad9>] drm_calc_vbltimestamp_from_scanoutpos+0x169/0x320 [drm]
[1029086.283646] RSP: 0018:ffff88082dc03a58  EFLAGS: 00010002
[1029086.283659] RAX: 0001377b41ca6940 RBX: 0000000000000000 RCX: 0000000000000000
[1029086.283674] RDX: 0000000000000000 RSI: 00000000ffffffe0 RDI: 0003a7f26059f405
[1029086.283690] RBP: ffff88082dc03af8 R08: 0000000000000001 R09: ffff8808073b9000
[1029086.283707] R10: 00000000001f006f R11: ffff8808087fe910 R12: 0000000000000007
[1029086.283723] R13: ffff8807a37fd420 R14: 0000000000000002 R15: 0000000000000003
[1029086.283739] FS:  0000000000000000(0000) GS:ffff88082dc00000(0000) knlGS:0000000000000000
[1029086.283757] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1029086.283770] CR2: 00007f3fb303af00 CR3: 00000000019f2000 CR4: 00000000003407f0
[1029086.283786] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1029086.283801] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1029086.283817] Stack:
[1029086.283822]  ffff88082dc03ac0 ffff8807a37fd420 0000000000059399 00000000000fb3dc
[1029086.283840]  000000000005952b ffff880800000002 ffff880000000000 00000000000fb3dc
[1029086.283858]  ffff88082dc03b68 ffff88082dc03b1c 00000001814c669d 000002bdffffffe2
[1029086.283876] Call Trace:
[1029086.283882]  <IRQ>
[1029086.283889]
[1029086.283914]  [<ffffffffc02a62de>] nouveau_display_vblstamp+0x6e/0x80 [nouveau]
[1029086.283934]  [<ffffffffc01727a3>] drm_get_last_vbltimestamp+0x53/0x90 [drm]
[1029086.283956]  [<ffffffffc0172f79>] drm_update_vblank_count+0x79/0x2c0 [drm]
[1029086.283978]  [<ffffffffc0173c66>] drm_handle_vblank+0x96/0x280 [drm]
[1029086.283998]  [<ffffffffc0173e67>] drm_crtc_handle_vblank+0x17/0x20 [drm]
[1029086.284031]  [<ffffffffc02a5f55>] nouveau_display_vblank_handler+0x15/0x20 [nouveau]
[1029086.284058]  [<ffffffffc01f0cdc>] nvif_notify+0xac/0x1a0 [nouveau]
[1029086.284089]  [<ffffffffc0254514>] ? nv50_disp_vblank_fini_+0x14/0x20 [nouveau]
[1029086.284122]  [<ffffffffc029c27a>] nvkm_client_ntfy+0x6a/0x70 [nouveau]
[1029086.284146]  [<ffffffffc01f0fe2>] nvkm_client_notify+0x22/0x30 [nouveau]
[1029086.284171]  [<ffffffffc01f41e6>] nvkm_notify_send+0x86/0x160 [nouveau]
[1029086.284195]  [<ffffffffc01f1f1f>] nvkm_event_send+0xdf/0x100 [nouveau]
[1029086.284226]  [<ffffffffc0253c21>] nvkm_disp_vblank+0x41/0x60 [nouveau]
[1029086.284257]  [<ffffffffc0256f5b>] gf119_disp_intr+0x10b/0x270 [nouveau]
[1029086.284288]  [<ffffffffc0254553>] nv50_disp_intr_+0x13/0x20 [nouveau]
[1029086.284317]  [<ffffffffc02538b4>] nvkm_disp_intr+0x14/0x20 [nouveau]
[1029086.284341]  [<ffffffffc01f177f>] nvkm_engine_intr+0x1f/0x30 [nouveau]
[1029086.284366]  [<ffffffffc01f5a37>] nvkm_subdev_intr+0x17/0x20 [nouveau]
[1029086.284396]  [<ffffffffc023bf47>] nvkm_mc_intr+0xf7/0x1a0 [nouveau]
[1029086.284425]  [<ffffffffc0240e23>] nvkm_pci_intr+0x53/0x80 [nouveau]
[1029086.284441]  [<ffffffff81130a2e>] __handle_irq_event_percpu+0x3e/0x1c0
[1029086.284457]  [<ffffffff81130be2>] handle_irq_event_percpu+0x32/0x80
[1029086.284471]  [<ffffffff81130c6c>] handle_irq_event+0x3c/0x60
[1029086.284484]  [<ffffffff811338f7>] handle_edge_irq+0x77/0x130
[1029086.284499]  [<ffffffff8102d2c8>] handle_irq+0xb8/0x150
[1029086.284512]  [<ffffffff810f3ddc>] ? tick_check_idle+0x8c/0xd0
[1029086.284526]  [<ffffffff816b053a>] ? atomic_notifier_call_chain+0x1a/0x20
[1029086.284543]  [<ffffffff816b75ed>] do_IRQ+0x4d/0xe0
[1029086.284554]  [<ffffffff816ac1ed>] common_interrupt+0x6d/0x6d
[1029086.284567]  <EOI>
[1029086.284572]
[1029086.285527]  [<ffffffff81527c22>] ? cpuidle_enter_state+0x52/0xc0
[1029086.286484]  [<ffffffff81527c18>] ? cpuidle_enter_state+0x48/0xc0
[1029086.287430]  [<ffffffff81527d68>] cpuidle_idle_call+0xd8/0x210
[1029086.288372]  [<ffffffff81034fee>] arch_cpu_idle+0xe/0x30
[1029086.289310]  [<ffffffff810e7bca>] cpu_startup_entry+0x14a/0x1c0
[1029086.290251]  [<ffffffff81692c57>] rest_init+0x77/0x80
[1029086.291186]  [<ffffffff81b45060>] start_kernel+0x439/0x45a
[1029086.292121]  [<ffffffff81b44a30>] ? repair_env_string+0x5c/0x5c
[1029086.293052]  [<ffffffff81b44120>] ? early_idt_handler_array+0x120/0x120
[1029086.293979]  [<ffffffff81b445ef>] x86_64_start_reservations+0x24/0x26
[1029086.294908]  [<ffffffff81b44740>] x86_64_start_kernel+0x14f/0x172
[1029086.295814] Code: e0 02 83 f8 01 41 8b 85 a8 00 00 00 45 19 ff 0f af 44 24 58 41 83 e7 fe 41 83 c7 03 03 44 24 5c 48 98 48 69 c0 40 42 0f 00 48 99 <48> f7 f9 49 89 c5 8b 05 47 a5 03 00 85 c0 0f 84 b3 00 00 00 e8
[1029086.296827] RIP  [<ffffffffc0172ad9>] drm_calc_vbltimestamp_from_scanoutpos+0x169/0x320 [drm]
[1029086.297826]  RSP <ffff88082dc03a58>

Comment 18 Colin.Simpson 2017-09-19 09:56:19 UTC

Sadly, the way I have resolved this is by moving to the 4.13.0-1.fc27 kernel on Fedora 26, which is not really an option with RHEL 7.4.

When I had discussions with a RH engineer on this they said they couldn't reproduce with the hardware they had. Also it was a hard issue to debug.

I guess a support call if you have that option?

Comment 19 Fedora End Of Life 2017-11-16 19:25:51 UTC

This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 20 Fedora End Of Life 2017-12-12 10:12:13 UTC

Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
