Bug 1937129 - page fault with nouveau on jetson-tk1
Summary: page fault with nouveau on jetson-tk1
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ARMTracker
 
Reported: 2021-03-09 22:17 UTC by Nicolas Chauvet (kwizart)
Modified: 2021-08-17 14:36 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-17 14:10:02 UTC
Type: Bug
Embargoed:


Attachments
dmesg with fedora kernel. (68.75 KB, text/plain)
2021-03-10 14:30 UTC, Nicolas Chauvet (kwizart)

Description Nicolas Chauvet (kwizart) 2021-03-09 22:17:45 UTC
Description of problem:

I'm hitting a kernel BUG in the nouveau page fault path when using the Fedora kernel with gnome-shell on jetson-tk1 (armhfp).



Version-Release number of selected component (if applicable):
kernel-5.11.5-300.fc34.armv7hl

How reproducible:
always

Steps to Reproduce:
1. On a jetson-tk1 with GNOME installed, run: systemctl isolate graphical

Actual results:
page:1706ccc7 refcount:0 mapcount:0 mapping:29d7e10e index:0x10039 pfn:0xf0481
aops:anon_aops.1 ino:48d7
flags: 0xf800000()
raw: 0f800000 eec8a24c efbe1678 c2686110 00010039 00000000 ffffffff 00000000
raw: 00000000
page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u))
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:1179!
Internal error: Oops - BUG: 0 [#1] SMP ARM
Modules linked in: rfkill ofpart spi_nor mtd snd_soc_tegra30_i2s snd_soc_tegra_pcm tegra_drm snd_soc_tegra_rt5640 snd_soc_tegra_utils snd_soc_rt5640 snd_hda_codec_hdmi snd_soc_rl6231 snd_hd>
CPU: 2 PID: 859 Comm: gnome-shell Not tainted 5.11.5-300.fc34.armv7hl #1
Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
PC is at get_page+0x20/0x38
LR is at __dump_page+0x110/0x464
pc : [<c04caeec>]    lr : [<c04c69ec>]    psr: 60000113
sp : c73ebdf0  ip : 2eb7a000  fp : a747e000
r10: a747f000  r9 : 0000071f  r8 : c44b5600
r7 : a747f000  r6 : c75d21fc  r5 : 00000000  r4 : eec8a224
r3 : 00000027  r2 : 00000027  r1 : 00000000  r0 : 00000059
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 873e406a  DAC: 00000051
Process gnome-shell (pid: 859, stack limit = 0xdd661172)
Stack: (0xc73ebdf0 to 0xc73ec000)
bde0:                                     eec8a224 c04cd81c c2b399c0 eefc919c
be00: eec8a224 0000071f c2b399c0 a747f000 000f0481 00000001 00000000 c04cdbb4
be20: 00000000 00000000 00000000 00000000 00000000 c5089000 00000001 c7208d00
be40: c2b399c0 00000001 0000071f c04cdcec 0000071f 00000000 000f0481 bf29a5c0
be60: 0000071f 00000001 00000080 00000010 00000000 00000001 00000000 00000000
be80: 00000000 00000000 c352ea30 00000000 c73ebef4 c5089000 c2b399c0 c73ebfb0
bea0: 00000040 c44b5648 00000800 bf3bda84 c73ebef4 c2b399c0 00000255 a747e000
bec0: c73ebfb0 c04cb7f4 00000001 c2b399c0 00000255 c04ceff0 c1b9a894 bebbf974
bee0: c5ec4200 c099f9f4 fffffff3 c099f9f4 00000000 c2b399c0 00000255 00100cca
bf00: 00010038 a747e000 c73e69d0 c73e69d0 00000000 00000000 00000000 00000000
bf20: 00000000 eefc9164 c03002a4 c73ebfb0 a747e000 c2b399c0 c44b5600 00000805
bf40: 00000255 c44b5648 00000800 c0d37788 00000000 c04023b8 c5bdfa28 c5bdf800
bf60: c5bdfa1c 00000805 a747e000 ffffffff c73ebfb0 c1510e20 aef3db00 00001000
bf80: 00000000 c031400c a747e000 00000805 c73ebfb0 0000906e ae9757f0 40000010
bfa0: ffffffff 10c5387d 10c5387d c0300e80 0000906e 2001e000 a747e000 a747e000
bfc0: 0152f7f8 0152eee8 0152f7f8 0152ed38 0152ee58 aef3db00 00001000 00000000
bfe0: 00000000 bebbf9c0 afd22fec ae9757f0 40000010 ffffffff 00000000 00000000
[<c04caeec>] (get_page) from [<c04cd81c>] (insert_page+0xa8/0x114)
[<c04cd81c>] (insert_page) from [<c04cdbb4>] (__vm_insert_mixed+0x94/0x1ac)
[<c04cdbb4>] (__vm_insert_mixed) from [<c04cdcec>] (vmf_insert_mixed_prot+0x20/0x28)
[<c04cdcec>] (vmf_insert_mixed_prot) from [<bf29a5c0>] (ttm_bo_vm_fault_reserved+0x280/0x318 [ttm])
[<bf29a5c0>] (ttm_bo_vm_fault_reserved [ttm]) from [<bf3bda84>] (nouveau_ttm_fault+0x60/0x90 [nouveau])
[<bf3bda84>] (nouveau_ttm_fault [nouveau]) from [<c04cb7f4>] (__do_fault+0x58/0xb0)
[<c04cb7f4>] (__do_fault) from [<c04ceff0>] (handle_mm_fault+0x7c0/0x97c)
[<c04ceff0>] (handle_mm_fault) from [<c0d37788>] (do_page_fault+0x2c0/0x348)
[<c0d37788>] (do_page_fault) from [<c031400c>] (do_DataAbort+0x3c/0xbc)
[<c031400c>] (do_DataAbort) from [<c0300e80>] (__dabt_usr+0x40/0x60)
Exception stack(0xc73ebfb0 to 0xc73ebff8)
bfa0:                                     0000906e 2001e000 a747e000 a747e000
bfc0: 0152f7f8 0152eee8 0152f7f8 0152ed38 0152ee58 aef3db00 00001000 00000000
bfe0: 00000000 bebbf9c0 afd22fec ae9757f0 40000010 ffffffff
Code: e353007f 8a000002 e59f1014 ebffef94 (e7f001f2) 
---[ end trace 38b95f8878f32175 ]---

Expected results:
no page fault.

Additional info:
I cannot reproduce this with the grate downstream kernel based on linux-next 20210302.
I will try to reproduce it with vanilla linux-next in the coming days.
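
Editorial note on the assertion quoted in the oops: the condition in VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u)) relies on unsigned wraparound. The sketch below is a userspace model of that arithmetic only, assuming a 32-bit unsigned int; refcount_check_trips() is a hypothetical helper name and not a kernel function.

```c
#include <stdbool.h>

/*
 * Userspace model of the condition printed in the oops:
 *   (unsigned int) page_ref_count(page) + 127u <= 127u
 * refcount_check_trips() is a hypothetical name used for illustration.
 */
bool refcount_check_trips(int refcount)
{
    /*
     * Unsigned addition wraps modulo 2^32, so the comparison is true
     * exactly when refcount lies in [-127, 0]: a zero reference count,
     * or a small negative one produced by an underflow.
     */
    return (unsigned int)refcount + 127u <= 127u;
}
```

Under this model a page with refcount 0, as in the dump above (refcount:0), trips the check, while any page with a positive reference count passes. That is consistent with get_page() being called from insert_page() on a page that nobody holds a reference to.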

Comment 1 Nicolas Chauvet (kwizart) 2021-03-10 12:42:24 UTC
FYI, I cannot reproduce this with linux-next 20210302.

Will try with 5.12-rc1...

Comment 2 Nicolas Chauvet (kwizart) 2021-03-10 13:10:57 UTC
5.12-rc1 also (still) has the page fault bug, but the triggered fault is a different one (related to polkit), and there I can get a graphical display (though it is too unstable to verify GPU acceleration).


[   58.003759] BUG: Bad page state in process polkitd  pfn:ee9b1
[   58.009509] page:8a64ce78 refcount:2 mapcount:129 mapping:473e54ab index:0x0 pfn:0xee9b1
[   58.017597] aops:0xc0b0ea14 ino:1749
[   58.021177] flags: 0x40000000()
[   58.024339] raw: 40000000 00000100 00000122 c43d81f8 00000000 00000000 00000080 00000002
[   58.032422] page dumped because: nonzero _refcount
[   58.037204] Modules linked in: nouveau tegra_drm host1x drm_ttm_helper tegra_soctherm ttm iova zram zsmalloc xhci_tegra ci_hdrc_tegra phy_tegra_xusb ahci_tegra libahci_platform tegra124_e
[   58.061017] CPU: 2 PID: 689 Comm: polkitd Not tainted 5.12.0-rc2-tegra+ #198
[   58.068051] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[   58.074305] [<c010ec40>] (unwind_backtrace) from [<c010a1ec>] (show_stack+0x10/0x14)
[   58.082039] [<c010a1ec>] (show_stack) from [<c0a86b20>] (dump_stack+0xc0/0xd4)
[   58.089250] [<c0a86b20>] (dump_stack) from [<c02341ec>] (bad_page+0xdc/0x10c)
[   58.096373] [<c02341ec>] (bad_page) from [<c02383d4>] (get_page_from_freelist+0xde8/0x116c)
[   58.104709] [<c02383d4>] (get_page_from_freelist) from [<c0238cd8>] (__alloc_pages_nodemask+0x17c/0x1014)
[   58.114258] [<c0238cd8>] (__alloc_pages_nodemask) from [<c021e478>] (__pte_alloc+0x24/0x178)
[   58.122679] [<c021e478>] (__pte_alloc) from [<c021fb40>] (copy_page_range+0x6e4/0xa18)
[   58.130580] [<c021fb40>] (copy_page_range) from [<c011f154>] (dup_mm+0x328/0x458)
[   58.138050] [<c011f154>] (dup_mm) from [<c011fee4>] (copy_process+0x980/0x16c4)
[   58.145344] [<c011fee4>] (copy_process) from [<c0120e9c>] (kernel_clone+0xa4/0x3e4)
[   58.152986] [<c0120e9c>] (kernel_clone) from [<c01214a0>] (sys_clone+0x74/0x90)
[   58.160281] [<c01214a0>] (sys_clone) from [<c01000c0>] (ret_fast_syscall+0x0/0x58)
[   58.167835] Exception stack(0xc56fffa8 to 0xc56ffff0)
[   58.172873] ffa0:                   b491e078 00000001 01200011 00000000 00000000 00000000
[   58.181032] ffc0: b491e078 00000001 b4face1c 00000078 bea4a000 b491e550 00000001 bea4a264
[   58.189188] ffe0: b491e010 bea49e38 b4f018ec b4f017fc
[   58.194225] Disabling lock debugging due to kernel taint
[   58.199523] BUG: Bad page state in process polkitd  pfn:ee9b2
[   58.205253] page:8be0376d refcount:2 mapcount:129 mapping:473e54ab index:0x0 pfn:0xee9b2
[   58.213328] aops:0xc0b0ea14 ino:1749
[   58.216892] flags: 0x40000000()
[   58.220025] raw: 40000000 00000100 00000122 c43d81f8 00000000 00000000 00000080 00000002
[   58.228096] page dumped because: nonzero _refcount
[   58.232872] Modules linked in: nouveau tegra_drm host1x drm_ttm_helper tegra_soctherm ttm iova zram zsmalloc xhci_tegra ci_hdrc_tegra phy_tegra_xusb ahci_tegra libahci_platform tegra124_e
[   58.256679] CPU: 2 PID: 689 Comm: polkitd Tainted: G    B             5.12.0-rc2-tegra+ #198
[   58.265097] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[   58.271348] [<c010ec40>] (unwind_backtrace) from [<c010a1ec>] (show_stack+0x10/0x14)
[   58.279077] [<c010a1ec>] (show_stack) from [<c0a86b20>] (dump_stack+0xc0/0xd4)
[   58.286284] [<c0a86b20>] (dump_stack) from [<c02341ec>] (bad_page+0xdc/0x10c)
[   58.293405] [<c02341ec>] (bad_page) from [<c02383d4>] (get_page_from_freelist+0xde8/0x116c)
[   58.301739] [<c02383d4>] (get_page_from_freelist) from [<c0238cd8>] (__alloc_pages_nodemask+0x17c/0x1014)
[   58.311288] [<c0238cd8>] (__alloc_pages_nodemask) from [<c021e478>] (__pte_alloc+0x24/0x178)
[   58.319709] [<c021e478>] (__pte_alloc) from [<c021fb40>] (copy_page_range+0x6e4/0xa18)
[   58.327609] [<c021fb40>] (copy_page_range) from [<c011f154>] (dup_mm+0x328/0x458)
[   58.335077] [<c011f154>] (dup_mm) from [<c011fee4>] (copy_process+0x980/0x16c4)
[   58.342371] [<c011fee4>] (copy_process) from [<c0120e9c>] (kernel_clone+0xa4/0x3e4)
[   58.350013] [<c0120e9c>] (kernel_clone) from [<c01214a0>] (sys_clone+0x74/0x90)
[   58.357308] [<c01214a0>] (sys_clone) from [<c01000c0>] (ret_fast_syscall+0x0/0x58)
[   58.364861] Exception stack(0xc56fffa8 to 0xc56ffff0)
[   58.369900] ffa0:                   b491e078 00000001 01200011 00000000 00000000 00000000
[   58.378057] ffc0: b491e078 00000001 b4face1c 00000078 bea4a000 b491e550 00000001 bea4a264
[   58.386214] ffe0: b491e010 bea49e38 b4f018ec b4f017fc
[   58.391250] BUG: Bad page state in process polkitd  pfn:ee9b3
[   58.396981] page:32413595 refcount:2 mapcount:129 mapping:473e54ab index:0x0 pfn:0xee9b3
[   58.405054] aops:0xc0b0ea14 ino:1749
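
Editorial note on the "Bad page state" reports: the page allocator refuses to hand out a page that still looks in use. The sketch below is a simplified userspace model of that sanity check, loosely following the order of tests in the kernel's page_bad_reason() in mm/page_alloc.c as of roughly this kernel version (an assumption; the flag and memcg tests are omitted). struct page_model and bad_page_reason() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct page. */
struct page_model {
    int refcount;        /* value printed as "refcount:" */
    int raw_mapcount;    /* stored _mapcount; -1 means "not mapped" */
    const void *mapping; /* NULL for a page that is really free */
};

/*
 * Return the reason a page is considered bad, or NULL if it looks free.
 * Later tests overwrite earlier ones, so the last failing test wins.
 */
const char *bad_page_reason(const struct page_model *p)
{
    const char *reason = NULL;

    if (p->raw_mapcount != -1)
        reason = "nonzero mapcount";
    if (p->mapping != NULL)
        reason = "non-NULL mapping";
    if (p->refcount != 0)
        reason = "nonzero _refcount";
    return reason;
}
```

For the pages dumped above (refcount:2, mapcount:129, i.e. a stored _mapcount of 0x80 in the raw dump, and a non-NULL mapping), all three tests fail in this model and the reported reason is the last one, "nonzero _refcount", matching the log.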

Comment 3 Nicolas Chauvet (kwizart) 2021-03-10 14:30:17 UTC
Created attachment 1762323
dmesg with fedora kernel.

Comment 4 Nicolas Chauvet (kwizart) 2021-03-10 16:37:21 UTC
As far as this bug is concerned:
5.10.16-200.fc33.armv7hl is known good (doesn't exhibit the page fault).
5.11.0-rc6-next-20210201-tegra+ is known bad (already exhibits the issue).

Comment 5 Nicolas Chauvet (kwizart) 2021-03-10 16:55:22 UTC
5.11.0-rc4-next-20210119-tegra+ is known bad.

Comment 6 Nicolas Chauvet (kwizart) 2021-03-10 19:59:58 UTC
461619f5c3242aaee9ec3f0b7072719bd86ea207 is the first bad commit
drm/nouveau: switch to new allocator

(Will try to revert on top of 5.11.5)

git bisect start
# bad: [5c8fe583cce542aa0b84adc939ce85293de36e5e] Linux 5.11-rc1
git bisect bad 5c8fe583cce542aa0b84adc939ce85293de36e5e
# good: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect good 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# bad: [2911ed9f47b47cb5ab87d03314b3b9fe008e607f] Merge tag 'char-misc-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad 2911ed9f47b47cb5ab87d03314b3b9fe008e607f
# bad: [ac73e3dc8acd0a3be292755db30388c3580f5674] Merge branch 'akpm' (patches from Andrew)
git bisect bad ac73e3dc8acd0a3be292755db30388c3580f5674
# bad: [b10733527bfd864605c33ab2e9a886eec317ec39] Merge tag 'amd-drm-next-5.11-2020-12-09' of git://people.freedesktop.org/~agd5f/linux into drm-next
git bisect bad b10733527bfd864605c33ab2e9a886eec317ec39
# bad: [9713158cb2a918c3f6f5522eed23cdeb61f22e75] drm/amdgpu: Add and use seperate reg headers for dcn302
git bisect bad 9713158cb2a918c3f6f5522eed23cdeb61f22e75
# bad: [c0f98d2f8b076bf3e3183aa547395f919c943a14] Merge tag 'drm-misc-next-2020-11-05' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect bad c0f98d2f8b076bf3e3183aa547395f919c943a14
# good: [6a6e5988a2657cd0c91f6f1a3e7d194599248b6d] drm/ttm: replace last move_notify with delete_mem_notify
git bisect good 6a6e5988a2657cd0c91f6f1a3e7d194599248b6d
# good: [f566fdcd6cc49a9d5b5d782f56e3e7cb243f01b8] drm/i915: Force VT'd workarounds when running as a guest OS
git bisect good f566fdcd6cc49a9d5b5d782f56e3e7cb243f01b8
# good: [e76ab2cf21c38331155ea613cdf18582f011c30f] drm/i915: Remove per-platform IIR HPD masking
git bisect good e76ab2cf21c38331155ea613cdf18582f011c30f
# bad: [268af50f38b1f2199a2e85e38073d7a25c20190c] drm/panfrost: Support cache-coherent integrations
git bisect bad 268af50f38b1f2199a2e85e38073d7a25c20190c
# good: [e000650375b65ff77c5ee852b5086f58c741179e] fbdev/atafb: Remove unused extern variables
git bisect good e000650375b65ff77c5ee852b5086f58c741179e
# bad: [461619f5c3242aaee9ec3f0b7072719bd86ea207] drm/nouveau: switch to new allocator
git bisect bad 461619f5c3242aaee9ec3f0b7072719bd86ea207
# good: [d099fc8f540add80f725014fdd4f7f49f3c58911] drm/ttm: new TT backend allocation pool v3
git bisect good d099fc8f540add80f725014fdd4f7f49f3c58911
# good: [e93b2da9799e5cb97760969f3e1f02a5bdac29fe] drm/amdgpu: switch to new allocator v2
git bisect good e93b2da9799e5cb97760969f3e1f02a5bdac29fe
# good: [0fe3cf3a53b5c1205ec7d321be1185b075dff205] drm/radeon: switch to new allocator v2
git bisect good 0fe3cf3a53b5c1205ec7d321be1185b075dff205
# first bad commit: [461619f5c3242aaee9ec3f0b7072719bd86ea207] drm/nouveau: switch to new allocator

Comment 7 Nicolas Chauvet (kwizart) 2021-08-17 14:10:02 UTC
With 5.14-rc5 as a base + tegra-next + tegra-drm-next + tegra-drm-fixes (scheduled for next) + PM patches (scheduled for 5.16, but optional), and using the libdrm scheduled for the new tegra uABI, I no longer have any issue getting a graphical display using Wayland on the Workstation spin (jetson-tk1).

Comment 8 Nicolas Chauvet (kwizart) 2021-08-17 14:36:05 UTC
Actually, it doesn't seem that reliable on a second boot, so we might need to wait for 5.16 to see more improvements (especially regarding iommu/memory/dGPU support).

