Bug 2220671 - [ThinkStation P3 TWR RHEL9.2GA] Nvidia T1000 Graphics card cert fail
Summary: [ThinkStation P3 TWR RHEL9.2GA] Nvidia T1000 Graphics card cert fail
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: sos
Version: 9.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
: 9.2
Assignee: Pavel Moravec
QA Contact: Supportability QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-06 03:15 UTC by Kean
Modified: 2023-08-14 14:31 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
screenshot (1.69 MB, image/jpeg)
2023-07-06 03:15 UTC, Kean
sosreport files (16.65 MB, application/x-xz)
2023-07-06 03:24 UTC, Kean
pick out from PCP plugin testing log (20.86 KB, image/png)
2023-07-11 01:20 UTC, Kean
under /var/tmp/sos* (6.09 MB, application/zip)
2023-07-31 06:39 UTC, Kean
sosreport files (14.46 MB, application/x-xz)
2023-07-31 06:40 UTC, Kean


Links:
Red Hat Issue Tracker RHELPLAN-161592 (last updated 2023-07-18 09:37:15 UTC)

Description Kean 2023-07-06 03:15:45 UTC
Created attachment 1974233 [details]
screenshot

From the result we can see the certification result is PASS, but we are unable to output the test report.

[steps to reproduce]

1. Install the RHEL 9.2 GA system
2. Enter the OS
3. Install dpRHEL9 and ts8.61
4. Execute rhcert-cli run --test=video_drm_3d
 
Failure rate: 100%
 
[expected result]
Certification can complete successfully
 
[actual result]
The test seems to get a PASS result, but we are unable to output the test report

[additional information]
CPU: 13th Gen Intel Core i9-13900K

Video: Nvidia T1000 8GB
Memory: 64GB
BIOS: S0IKT20A

Comment 1 Kean 2023-07-06 03:24:38 UTC
Created attachment 1974246 [details]
sosreport files

Comment 2 Kean 2023-07-07 02:54:13 UTC
Hi Jiri Hnidek,

Please help review it; it is blocking certification.

Thanks.

Comment 3 Kean 2023-07-10 02:44:58 UTC
Hi @mknutson 

Please help review it; it is blocking certification.

Thanks.

Comment 4 Kean 2023-07-11 01:20:11 UTC
Created attachment 1975046 [details]
pick out from PCP plugin testing log

Comment 5 Jose Castillo 2023-07-21 09:54:20 UTC
Commenting from the sos perspective, since the bz has been moved to this component. 

All services related to Performance Co-Pilot appear to have been running OK at the time the sos was obtained:

  UNIT                                                                                             LOAD   ACTIVE SUB       DESCRIPTION
  pmcd.service                                                                                     loaded active running   Performance Metrics Collector Daemon
  pmie.service                                                                                     loaded active running   Performance Metrics Inference Engine
  pmie_farm.service                                                                                loaded active running   pmie farm service
  pmlogger.service                                                                                 loaded active running   Performance Metrics Archive Logger
  pmlogger_farm.service                                                                            loaded active running   pmlogger farm service

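If useful, local connectivity to pmcd can be spot-checked with the standard PCP client tools, for example (a rough sketch, assuming the pcp client utilities are installed; the metric name is only an illustration):

$ systemctl is-active pmcd pmlogger      # both should report "active"
$ pminfo -f kernel.all.load              # successfully fetching any metric confirms a working local pmcd connection
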
But the error in the latest screenshot on June 26th at 23:26 was due to an issue connecting locally to the pmcd daemon. Looking at the logs, we can see that there were several soft lockups logged, with stack traces similar to this one:

Jun 26 23:26:07 localhost kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 26s! [kworker/14:2:519]
Jun 26 23:26:07 localhost kernel: Modules linked in: bridge stp llc uinput rfcomm snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bn
ep sunrpc vfat fat iTCO_wdt iTCO_vendor_support mei_wdt intel_rapl_msr x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore pcspkr think_lmi firmware_attributes_class wmi_bmof iwlmvm snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocatio
n soundwire_cadence mac80211 snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_hda_codec_realtek libarc4 snd_hda_codec_generic snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core ledtrig_audio snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_hda_codec_hdmi snd_soc_core snd_compress iwlwifi snd_hda_inte
l btusb snd_intel_dspcfg snd_intel_sdw_acpi btrtl snd_hda_codec btbcm
Jun 26 23:26:07 localhost kernel: btintel snd_hda_core btmtk snd_hwdep cfg80211 bluetooth i2c_i801 snd_seq i2c_smbus snd_seq_device snd_pcm mei_me rtsx_usb_ms memstick mei snd_timer processor_thermal_device_pci processor_thermal_device snd processor_thermal_rfim rfkill processor_thermal_mbox processor_thermal_rapl i
ntel_rapl_common soundcore int340x_thermal_zone int3400_thermal acpi_thermal_rel intel_pmc_core acpi_pad acpi_tad joydev xfs libcrc32c rtsx_usb_sdmmc mmc_core rtsx_usb i915 nouveau mxm_wmi drm_buddy drm_ttm_helper intel_gtt i2c_algo_bit drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
cec ttm ahci libahci drm nvme e1000e libata crct10dif_pclmul crc32_pclmul nvme_core crc32c_intel ghash_clmulni_intel nvme_common t10_pi wmi video dm_mirror dm_region_hash dm_log dm_mod fuse
Jun 26 23:26:07 localhost kernel: CPU: 14 PID: 519 Comm: kworker/14:2 Kdump: loaded Tainted: G        W        --------  ---  5.14.0-284.11.1.el9_2.x86_64 #1
Jun 26 23:26:07 localhost kernel: Hardware name: LENOVO ThinkStion P3 TWR/1064, BIOS S0IKT20A 05/05/2023
Jun 26 23:26:07 localhost kernel: Workqueue: events ttm_device_delayed_workqueue [ttm]
Jun 26 23:26:07 localhost kernel: RIP: 0010:dma_resv_iter_walk_unlocked.part.0+0x27/0x160
Jun 26 23:26:07 localhost kernel: Code: 00 00 00 0f 1f 44 00 00 41 54 41 bc ff ff ff ff 55 53 48 89 fb 48 8b 43 10 48 85 c0 74 1d 48 8d 78 38 44 89 e2 f0 0f c1 50 38 <83> fa 01 0f 84 dc 00 00 00 85 d2 0f 8e 09 01 00 00 8b 43 1c 3b 43
Jun 26 23:26:07 localhost kernel: RSP: 0018:ffffb7b540effd50 EFLAGS: 00000213
Jun 26 23:26:07 localhost kernel: RAX: ffff9388d06ba8a0 RBX: ffffb7b540effd80 RCX: 0000000000000002
Jun 26 23:26:07 localhost kernel: RDX: 0000000000000002 RSI: 0000000000000003 RDI: ffff9388d06ba8d8
Jun 26 23:26:07 localhost kernel: RBP: ffff9388d06ba8a0 R08: 0000000000000001 R09: 0000000000000000
Jun 26 23:26:07 localhost kernel: R10: ffff9388c9b86300 R11: 02132f0000000000 R12: 00000000ffffffff
Jun 26 23:26:07 localhost kernel: R13: ffff93888466a400 R14: ffff93888466a550 R15: ffff93888466a578
Jun 26 23:26:07 localhost kernel: FS:  0000000000000000(0000) GS:ffff9397bf380000(0000) knlGS:0000000000000000
Jun 26 23:26:07 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 26 23:26:07 localhost kernel: CR2: 00007fca99740100 CR3: 00000007d9810006 CR4: 0000000000770ee0
Jun 26 23:26:07 localhost kernel: PKRU: 55555554
Jun 26 23:26:07 localhost kernel: Call Trace:
Jun 26 23:26:07 localhost kernel: <TASK>
Jun 26 23:26:07 localhost kernel: dma_resv_iter_first_unlocked+0x25/0x70
Jun 26 23:26:07 localhost kernel: dma_resv_test_signaled+0x32/0xd0
Jun 26 23:26:07 localhost kernel: ttm_bo_release+0x61/0x340 [ttm]
Jun 26 23:26:07 localhost kernel: ? ttm_resource_free+0x64/0x80 [ttm]
Jun 26 23:26:07 localhost kernel: ttm_bo_delayed_delete+0x1dd/0x240 [ttm]
Jun 26 23:26:07 localhost kernel: ttm_device_delayed_workqueue+0x18/0x40 [ttm]
Jun 26 23:26:07 localhost kernel: process_one_work+0x1e5/0x3c0
Jun 26 23:26:07 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 26 23:26:07 localhost kernel: worker_thread+0x50/0x3b0
Jun 26 23:26:07 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 26 23:26:07 localhost kernel: kthread+0xd6/0x100
Jun 26 23:26:07 localhost kernel: ? kthread_complete_and_exit+0x20/0x20
Jun 26 23:26:07 localhost kernel: ret_from_fork+0x1f/0x30
Jun 26 23:26:07 localhost kernel: </TASK>
Jun 26 23:26:19 localhost kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 14-... } 21471 jiffies s: 957 root: 0x1/.
Jun 26 23:26:19 localhost kernel: rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x4000/.
Jun 26 23:26:19 localhost kernel: Task dump for CPU 14:
Jun 26 23:26:19 localhost kernel: task:kworker/14:2    state:R  running task     stack:    0 pid:  519 ppid:     2 flags:0x00004008
Jun 26 23:26:19 localhost kernel: Workqueue: events ttm_device_delayed_workqueue [ttm]
Jun 26 23:26:19 localhost kernel: Call Trace:
Jun 26 23:26:19 localhost kernel: <TASK>
Jun 26 23:26:19 localhost kernel: ? ttm_device_delayed_workqueue+0x18/0x40 [ttm]
Jun 26 23:26:19 localhost kernel: ? process_one_work+0x1e5/0x3c0
Jun 26 23:26:19 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 26 23:26:19 localhost kernel: ? worker_thread+0x50/0x3b0
Jun 26 23:26:19 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 26 23:26:19 localhost kernel: ? kthread+0xd6/0x100
Jun 26 23:26:19 localhost kernel: ? kthread_complete_and_exit+0x20/0x20
Jun 26 23:26:19 localhost kernel: ? ret_from_fork+0x1f/0x30
Jun 26 23:26:19 localhost kernel: </TASK>

We also had pmcd stuck for two minutes, and the timestamps seem to match roughly the connection issue:

Jun 26 23:30:05 localhost kernel: </TASK>
Jun 26 23:30:05 localhost kernel: INFO: task pmcd:2332 blocked for more than 122 seconds.
Jun 26 23:30:05 localhost kernel:      Tainted: G        W    L   --------  ---  5.14.0-284.11.1.el9_2.x86_64 #1
Jun 26 23:30:05 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 26 23:30:05 localhost kernel: task:pmcd            state:D stack:    0 pid: 2332 ppid:     1 flags:0x00000002
Jun 26 23:30:05 localhost kernel: Call Trace:
Jun 26 23:30:05 localhost kernel: <TASK>
Jun 26 23:30:05 localhost kernel: __schedule+0x248/0x620
Jun 26 23:30:05 localhost kernel: schedule+0x5a/0xc0
Jun 26 23:30:05 localhost kernel: schedule_preempt_disabled+0x11/0x20
Jun 26 23:30:05 localhost kernel: __mutex_lock.constprop.0+0x2a1/0x430
Jun 26 23:30:05 localhost kernel: __netlink_dump_start+0xc2/0x2e0
Jun 26 23:30:05 localhost kernel: ? validate_linkmsg+0x110/0x110
Jun 26 23:30:05 localhost kernel: rtnetlink_rcv_msg+0x287/0x390
Jun 26 23:30:05 localhost kernel: ? validate_linkmsg+0x110/0x110
Jun 26 23:30:05 localhost kernel: ? rtnl_calcit.isra.0+0x140/0x140
Jun 26 23:30:05 localhost kernel: netlink_rcv_skb+0x4e/0x100
Jun 26 23:30:05 localhost kernel: netlink_unicast+0x23b/0x360
Jun 26 23:30:05 localhost kernel: netlink_sendmsg+0x238/0x480
Jun 26 23:30:05 localhost kernel: sock_sendmsg+0x5f/0x70
Jun 26 23:30:05 localhost kernel: __sys_sendto+0xf0/0x160
Jun 26 23:30:05 localhost kernel: ? __sys_getsockname+0x7e/0xc0
Jun 26 23:30:05 localhost kernel: ? syscall_exit_work+0x11a/0x150
Jun 26 23:30:05 localhost kernel: __x64_sys_sendto+0x20/0x30
Jun 26 23:30:05 localhost kernel: do_syscall_64+0x59/0x90
Jun 26 23:30:05 localhost kernel: ? syscall_exit_to_user_mode+0x12/0x30
Jun 26 23:30:05 localhost kernel: ? do_syscall_64+0x69/0x90
Jun 26 23:30:05 localhost kernel: ? syscall_exit_to_user_mode+0x12/0x30
Jun 26 23:30:05 localhost kernel: ? do_syscall_64+0x69/0x90
Jun 26 23:30:05 localhost kernel: ? exc_page_fault+0x62/0x150


- Did you try running the tests after the lockups disappeared? 
- Did you see any issue generating the sos report itself? 
- Could you elaborate on what you mean by "but unable to output test report"? I can see the report was created fine, and no errors were logged in the sos_logs directory.

Comment 6 Kean 2023-07-24 01:27:38 UTC
Hi Jose,

Thanks for your feedback.

You mentioned that 'We also had pmcd stuck for two minutes, and the timestamps seem to match roughly the connection issue'. In my testing, it gets stuck and does not proceed even if I remove the PCP plugin.

- Did you try running the tests after the lockups disappeared? 
>> Looking at the screenshot image: when I run the test, it always stays stuck there and cannot be force-exited with Ctrl+C, so I could not re-run the sos report.

- Did you see any issue generating the sos report itself? 
>> No; as you can see in the log, it just stays stuck there until I reboot the system.

- Could you elaborate on what you mean by "but unable to output test report"? I can see the report was created fine, and no errors were logged in the sos_logs directory.
>> I don't think you can get a complete report. Unless you restart the system, you will get stuck in the sos report after running the test anyway; that is my issue here.

When I execute 'rhcert-cli run --test=video_drm_3d', the test gets a PASS result but then gets stuck at the sos report step shown in the attachment.

Thanks.

Comment 7 Jianwei Weng 2023-07-25 04:23:40 UTC
Created attachment 1977408 [details]
sosreport

Comment 8 Jianwei Weng 2023-07-25 04:28:57 UTC
Hello,

I suggested that Lenovo upgrade sos to the latest version and rerun, but the sosreport was still stuck after running the video_drm_3D test.
This is specific to the video_drm_3D test only; Lenovo ran the other video tests, and the sosreport could be generated successfully.
Regarding video_drm_3D, compared with the other video tests, the difference is that it runs GPU benchmark tests with the following commands:

DRI_PRIME=1 DISPLAY=:0  glmark2
DRI_PRIME=1 DISPLAY=:0 env LIBGL_ALWAYS_SOFTWARE=1 glmark2
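If it helps isolate the problem, the same benchmark could also be run by hand under a timeout, outside of rhcert, to see whether the GPU stack itself hangs. A rough sketch, assuming glmark2 is installed and an X session is running on display :0:

$ timeout 600 env DRI_PRIME=1 DISPLAY=:0 glmark2    # hard-stop after 10 minutes if it hangs
$ echo $?                                           # exit status 124 means the timeout fired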

I uploaded the latest sosreport package, please take a look and help investigate the root cause.

Thank you
Jianwei

Comment 9 Jianwei Weng 2023-07-25 04:38:31 UTC
Another point to highlight: after running the "video_drm_3D" test, the partner terminated it because sosreport was consistently hanging. Even when attempting to run sosreport manually later, it remained stuck, and the system had to be rebooted before the report could be generated.

Comment 10 Pavel Moravec 2023-07-27 15:17:11 UTC
(In reply to Jianwei Weng from comment #9)
> Another point to highlight, after running the "video_drm_3D" test, the
> partner terminated it because sosreport was consistently hanging. Even when
> attempting to manually run sosreport later, it remained stuck until the
> system was rebooted to generate the report.

You say sosreport gets hung, while we see sosreport tarballs being generated, which implies sosreport terminated (or got stuck at its final stage?). How do you know sosreport was stuck? Did you see the process running (or rather "running")? And had it really generated a sosreport tarball already?
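For instance, the state of a seemingly stuck sos/rhcert process could be checked roughly like this (the PID below is a placeholder):

$ ps -eo pid,stat,wchan:32,etime,cmd | grep -E '[s]os|[r]hcert'    # "D" in STAT means uninterruptible sleep
$ sudo cat /proc/<PID>/stack                                       # kernel stack of a stuck process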

Could you enable verbose logs and re-try, so we can see where sos report got stuck? In /etc/sos/sos.conf, have:

[global]
verbose = 3

Then sosreport will generate sos_logs/sos.log in verbose mode (either in the generated tarball, if one is really generated, or in the /var/tmp/sos*/sos_logs/sos.log file if sosreport got stuck "in the middle"). Please provide that file (or the tarball, or the /var/tmp/sos* directory with the file).
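Equivalently, if it is easier, the verbosity can be passed on the command line; a sketch assuming a recent sos 4.x:

$ sos report -vvv --batch      # -v can be repeated to raise verbosity; --batch skips the interactive prompts
$ ls -d /var/tmp/sos*          # working directory containing sos_logs/sos.log if the run gets stuck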

Comment 11 Jianwei Weng 2023-07-28 09:15:51 UTC
Hello Lenovo Team,

Could you please run a video_drm_3D test first, and then follow Pavel's instructions above to run and upload the file here?

Thanks
Jianwei

Comment 12 Kean 2023-07-28 09:28:38 UTC
Hi Jianwei,

I will, but the device is being used for other certification testing; I will do it next Monday.

Thanks all.

Comment 13 Kean 2023-07-31 06:39:03 UTC
Created attachment 1980827 [details]
under /var/tmp/sos*

files requested

test cmd stuck as before.

Comment 14 Kean 2023-07-31 06:40:04 UTC
Created attachment 1980828 [details]
sosreport files

After restarting, we got the sosreport here.

Comment 15 Jose Castillo 2023-07-31 12:16:49 UTC
Thank you for running the sos report with Pavel's suggestions. Looking at its logs, we can see that we have several plugins timing out:

2023-07-30 23:28:53,408 WARNING: [plugin:podman] command 'podman network ls' timed out after 300s
2023-07-30 23:38:55,881 INFO: Plugin autofs timed out
2023-07-30 23:38:56,537 INFO: Plugin buildah timed out
2023-07-30 23:38:56,544 WARNING: [plugin:buildah] command 'buildah containers' timed out after 300s
2023-07-30 23:39:02,620 INFO: Plugin dracut timed out
2023-07-30 23:39:09,991 INFO: Plugin filesys timed out
2023-07-30 23:43:56,545 INFO: Plugin flatpak timed out
2023-07-30 23:44:02,622 INFO: Plugin grub2 timed out
2023-07-30 23:44:04,264 INFO: Plugin host timed out
2023-07-30 23:44:12,056 INFO: Plugin kernel timed out
2023-07-30 23:49:02,632 INFO: Plugin keyutils timed out

And at least one traceback:

$ cat ./sos_logs/virsh-plugin-errors.txt                                                                                                       
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/sos/report/init.py", line 1224, in setup
	plug.setup()
  File "/usr/lib/python3.9/site-packages/sos/report/plugins/virsh.py", line 53, in setup
	k_list = self.collect_cmd_output('%s %s-list' % (cmd, k),
  File "/usr/lib/python3.9/site-packages/sos/report/plugins/init.py", line 2518, in collect_cmd_output
	return self._collect_cmd_output(
  File "/usr/lib/python3.9/site-packages/sos/report/plugins/init.py", line 2354, in _collect_cmd_output
	result = sos_get_command_output(
  File "/usr/lib/python3.9/site-packages/sos/utilities.py", line 267, in sos_get_command_output
	raise e
  File "/usr/lib/python3.9/site-packages/sos/utilities.py", line 236, in sos_get_command_output
	_check_poller(p)
  File "/usr/lib/python3.9/site-packages/sos/utilities.py", line 186, in _check_poller
	raise SoSTimeoutError
sos.utilities.SoSTimeoutError

But these logs don't tell us exactly why they were stuck. Looking at the host logs in the second sos, we can see there were some issues again with soft lockups. They start around here:

Jul 30 22:53:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/2:1:199]
Jul 30 22:53:53 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [kworker/2:1:199]
Jul 30 22:54:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 82s! [kworker/2:1:199]
Jul 30 22:54:53 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 108s! [kworker/2:1:199]
[...]

And by the time the first sos was running, the soft lockups were still happening:

Jul 30 23:28:49 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2004s! [kworker/2:1:199]
Jul 30 23:29:17 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2031s! [kworker/2:1:199]
Jul 30 23:29:45 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2057s! [kworker/2:1:199]
Jul 30 23:30:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2094s! [kworker/2:1:199]
Jul 30 23:41:17 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2701s! [kworker/2:1:199]
Jul 30 23:41:45 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2727s! [kworker/2:1:199]
Jul 30 23:42:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2764s! [kworker/2:1:199]
[...]

The last soft lockup happened around 1:53 a.m.:

Jul 31 01:50:45 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 9936s! [kworker/2:1:199]
Jul 31 01:51:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 9973s! [kworker/2:1:199]
Jul 31 01:51:53 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 9999s! [kworker/2:1:199]
Jul 31 01:52:21 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 10025s! [kworker/2:1:199]
Jul 31 01:52:49 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 10051s! [kworker/2:1:199]
Jul 31 01:53:17 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 10077s! [kworker/2:1:199]

And the newest sos was created at 2 a.m., when no more soft lockups were present:

$ head sos_logs/sos.log
2023-07-31 02:03:14,766 DEBUG: set sysroot to '/' (default)

So, as per my previous note, the sos report here is a victim of something else going on in this machine. You need to look into the soft lockups generated by the ttm kernel module, e.g.:

Jul 30 22:53:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/2:1:199]
[...]
Jul 30 22:53:25 localhost kernel: CPU: 2 PID: 199 Comm: kworker/2:1 Kdump: loaded Not tainted 5.14.0-284.11.1.el9_2.x86_64 #1
Jul 30 22:53:25 localhost kernel: Hardware name: LENOVO ThinkStion P3 TWR/1064, BIOS S0IKT20A 05/05/2023
Jul 30 22:53:25 localhost kernel: Workqueue: events ttm_device_delayed_workqueue [ttm]
Jul 30 22:53:25 localhost kernel: RIP: 0010:dma_resv_iter_walk_unlocked.part.0+0x82/0x160
Jul 30 22:53:25 localhost kernel: Code: 89 c5 48 83 e5 fc 48 89 6b 10 48 83 fb e8 74 06 83 e0 03 89 43 18 8b 55 38 48 8d 7d 38 85 d2 74 5e 8d 4a 01 89 d0 f0 0f b1 0f <0f> 85 b2 00 00 00 09 ca 0f 88 9e 00 00 00 48 89 6b 10 48 85 ed 74
Jul 30 22:53:25 localhost kernel: RSP: 0018:ffffbc35808efd50 EFLAGS: 00000246
Jul 30 22:53:25 localhost kernel: RAX: 0000000000000001 RBX: ffffbc35808efd80 RCX: 0000000000000002
Jul 30 22:53:25 localhost kernel: RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff9bcf859d1cf8
Jul 30 22:53:25 localhost kernel: RBP: ffff9bcf859d1cc0 R08: 0000000000000001 R09: 0000000000000000
Jul 30 22:53:25 localhost kernel: R10: 0000000000000000 R11: 0000000000000217 R12: 00000000ffffffff
Jul 30 22:53:25 localhost kernel: R13: ffff9bcfb73dc800 R14: ffff9bcfb73dc950 R15: ffff9bcfb73dc978
Jul 30 22:53:25 localhost kernel: FS:  0000000000000000(0000) GS:ffff9bd0f7080000(0000) knlGS:0000000000000000
Jul 30 22:53:25 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 30 22:53:25 localhost kernel: CR2: 00007fe77538b000 CR3: 0000000136bde002 CR4: 0000000000770ee0
Jul 30 22:53:25 localhost kernel: PKRU: 55555554
Jul 30 22:53:25 localhost kernel: Call Trace:
Jul 30 22:53:25 localhost kernel: <TASK>
Jul 30 22:53:25 localhost kernel: dma_resv_iter_first_unlocked+0x25/0x70
Jul 30 22:53:25 localhost kernel: dma_resv_test_signaled+0x32/0xd0
Jul 30 22:53:25 localhost kernel: ttm_bo_release+0x61/0x340 [ttm]
Jul 30 22:53:25 localhost kernel: ? ttm_resource_free+0x64/0x80 [ttm]
Jul 30 22:53:25 localhost kernel: ttm_bo_delayed_delete+0x1dd/0x240 [ttm]
Jul 30 22:53:25 localhost kernel: ttm_device_delayed_workqueue+0x18/0x40 [ttm]
Jul 30 22:53:25 localhost kernel: process_one_work+0x1e5/0x3c0
Jul 30 22:53:25 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jul 30 22:53:25 localhost kernel: worker_thread+0x50/0x3b0
Jul 30 22:53:25 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jul 30 22:53:25 localhost kernel: kthread+0xd6/0x100
Jul 30 22:53:25 localhost kernel: ? kthread_complete_and_exit+0x20/0x20
Jul 30 22:53:25 localhost kernel: ret_from_fork+0x1f/0x30

May I suggest that:
- You check the logs for soft lockups before running the tests and/or capturing sos (see the sketch after this list), and
- Perhaps open a new bug report with the soft lockup traces?
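A quick pre-check might look roughly like this (adjust to taste):

$ journalctl -k -b | grep -i 'soft lockup'    # any soft lockups logged since the last boot?
$ dmesg | grep -c -i 'soft lockup'            # count of lockup messages still in the kernel ring buffer

If individual sos plugins keep timing out while the machine is in this state, something like 'sos report --skip-plugins podman,buildah,virsh --plugin-timeout 600' may help, but that only works around the symptom.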

One last question - have you checked whether there are any issues at the memory level?
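For example, a basic userspace check, assuming the memtester package is available, might be:

$ sudo memtester 2048M 1    # test 2 GiB of RAM for one pass; larger sizes and more loops give better coverage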

Comment 16 Kean 2023-08-01 08:09:20 UTC
Hi Jose,

And the newest sos was created at 2 a.m. when no more soft lockups were present:
>> The OS had been restarted, so there are fewer soft lockups after 2 a.m.

When it comes to "Workqueue: events ttm_device_delayed_workqueue [ttm]", the ttm module is the memory manager used for graphics memory, so:
1) I will try raising kernel.watchdog_thresh to 30 and test it (see the sketch below).
2) Why does this soft lockup always happen on CPU#2, even after restarting the system?
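A minimal way to apply that, assuming the standard sysctl layout (the drop-in file name is just an example):

$ sudo sysctl -w kernel.watchdog_thresh=30                                        # apply immediately
$ echo 'kernel.watchdog_thresh = 30' | sudo tee /etc/sysctl.d/99-watchdog.conf    # persist across reboots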

Maybe after testing, I will open a new ticket for soft lockup.

PA said that the memory testing is fine so far.

Thanks.

