Created attachment 1974233 [details] screenshot

From the result we can see the certification test reports PASS, but it is unable to output the test report.

[Steps to reproduce]
1. Install a RHEL 9.2 GA system
2. Enter the OS
3. Install dpRHEL9 and ts8.61
4. Execute rhcert-cli run --test=video_drm_3d

Failure rate: 100%

[Expected result]
Certification completes successfully

[Actual result]
The test itself gets a PASS result, but the test report cannot be generated

[Additional information]
CPU: 13th Gen Intel Core i9-13900K
Video: NVIDIA T1000 8GB
Memory: 64GB
BIOS: S0IKT20A
Created attachment 1974246 [details] sosreport files
Hi Jiri Hnidek, please help review this; it is blocking the certification. Thanks.
Hi @mknutson, please help review this; it is blocking the certification. Thanks.
Created attachment 1975046 [details] excerpt from the PCP plugin testing log
Commenting from the sos perspective, since the bz has been moved to this component.

All services related to Performance Co-Pilot seem to be running OK at the time the sos was obtained:

UNIT                   LOAD   ACTIVE SUB     DESCRIPTION
pmcd.service           loaded active running Performance Metrics Collector Daemon
pmie.service           loaded active running Performance Metrics Inference Engine
pmie_farm.service      loaded active running pmie farm service
pmlogger.service       loaded active running Performance Metrics Archive Logger
pmlogger_farm.service  loaded active running pmlogger farm service

But the error in the latest screenshot, on June 26th at 23:26, was due to an issue connecting locally to the pmcd daemon.

Looking at the logs, we can see that there were several soft lockups logged, with stack traces similar to this one:

Jun 26 23:26:07 localhost kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 26s! [kworker/14:2:519]
Jun 26 23:26:07 localhost kernel: Modules linked in: bridge stp llc uinput rfcomm snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc vfat fat iTCO_wdt iTCO_vendor_support mei_wdt intel_rapl_msr x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore pcspkr think_lmi firmware_attributes_class wmi_bmof iwlmvm snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence mac80211 snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_hda_codec_realtek libarc4 snd_hda_codec_generic snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core ledtrig_audio snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_hda_codec_hdmi snd_soc_core snd_compress iwlwifi snd_hda_intel btusb snd_intel_dspcfg snd_intel_sdw_acpi btrtl snd_hda_codec btbcm
Jun 26 23:26:07 localhost kernel: btintel snd_hda_core btmtk snd_hwdep cfg80211 bluetooth i2c_i801 snd_seq i2c_smbus snd_seq_device snd_pcm mei_me rtsx_usb_ms memstick mei snd_timer processor_thermal_device_pci processor_thermal_device snd processor_thermal_rfim rfkill processor_thermal_mbox processor_thermal_rapl intel_rapl_common soundcore int340x_thermal_zone int3400_thermal acpi_thermal_rel intel_pmc_core acpi_pad acpi_tad joydev xfs libcrc32c rtsx_usb_sdmmc mmc_core rtsx_usb i915 nouveau mxm_wmi drm_buddy drm_ttm_helper intel_gtt i2c_algo_bit drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec ttm ahci libahci drm nvme e1000e libata crct10dif_pclmul crc32_pclmul nvme_core crc32c_intel ghash_clmulni_intel nvme_common t10_pi wmi video dm_mirror dm_region_hash dm_log dm_mod fuse
Jun 26 23:26:07 localhost kernel: CPU: 14 PID: 519 Comm: kworker/14:2 Kdump: loaded Tainted: G W -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1
Jun 26 23:26:07 localhost kernel: Hardware name: LENOVO ThinkStion P3 TWR/1064, BIOS S0IKT20A 05/05/2023
Jun 26 23:26:07 localhost kernel: Workqueue: events ttm_device_delayed_workqueue [ttm]
Jun 26 23:26:07 localhost kernel: RIP: 0010:dma_resv_iter_walk_unlocked.part.0+0x27/0x160
Jun 26 23:26:07 localhost kernel: Code: 00 00 00 0f 1f 44 00 00 41 54 41 bc ff ff ff ff 55 53 48 89 fb 48 8b 43 10 48 85 c0 74 1d 48 8d 78 38 44 89 e2 f0 0f c1 50 38 <83> fa 01 0f 84 dc 00 00 00 85 d2 0f 8e 09 01 00 00 8b 43 1c 3b 43
Jun 26 23:26:07 localhost kernel: RSP: 0018:ffffb7b540effd50 EFLAGS: 00000213
Jun 26 23:26:07 localhost kernel: RAX: ffff9388d06ba8a0 RBX: ffffb7b540effd80 RCX: 0000000000000002
Jun 26 23:26:07 localhost kernel: RDX: 0000000000000002 RSI: 0000000000000003 RDI: ffff9388d06ba8d8
Jun 26 23:26:07 localhost kernel: RBP: ffff9388d06ba8a0 R08: 0000000000000001 R09: 0000000000000000
Jun 26 23:26:07 localhost kernel: R10: ffff9388c9b86300 R11: 02132f0000000000 R12: 00000000ffffffff
Jun 26 23:26:07 localhost kernel: R13: ffff93888466a400 R14: ffff93888466a550 R15: ffff93888466a578
Jun 26 23:26:07 localhost kernel: FS: 0000000000000000(0000) GS:ffff9397bf380000(0000) knlGS:0000000000000000
Jun 26 23:26:07 localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 26 23:26:07 localhost kernel: CR2: 00007fca99740100 CR3: 00000007d9810006 CR4: 0000000000770ee0
Jun 26 23:26:07 localhost kernel: PKRU: 55555554
Jun 26 23:26:07 localhost kernel: Call Trace:
Jun 26 23:26:07 localhost kernel: <TASK>
Jun 26 23:26:07 localhost kernel: dma_resv_iter_first_unlocked+0x25/0x70
Jun 26 23:26:07 localhost kernel: dma_resv_test_signaled+0x32/0xd0
Jun 26 23:26:07 localhost kernel: ttm_bo_release+0x61/0x340 [ttm]
Jun 26 23:26:07 localhost kernel: ? ttm_resource_free+0x64/0x80 [ttm]
Jun 26 23:26:07 localhost kernel: ttm_bo_delayed_delete+0x1dd/0x240 [ttm]
Jun 26 23:26:07 localhost kernel: ttm_device_delayed_workqueue+0x18/0x40 [ttm]
Jun 26 23:26:07 localhost kernel: process_one_work+0x1e5/0x3c0
Jun 26 23:26:07 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 26 23:26:07 localhost kernel: worker_thread+0x50/0x3b0
Jun 26 23:26:07 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 26 23:26:07 localhost kernel: kthread+0xd6/0x100
Jun 26 23:26:07 localhost kernel: ? kthread_complete_and_exit+0x20/0x20
Jun 26 23:26:07 localhost kernel: ret_from_fork+0x1f/0x30
Jun 26 23:26:07 localhost kernel: </TASK>
Jun 26 23:26:19 localhost kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 14-... } 21471 jiffies s: 957 root: 0x1/.
Jun 26 23:26:19 localhost kernel: rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x4000/.
Jun 26 23:26:19 localhost kernel: Task dump for CPU 14:
Jun 26 23:26:19 localhost kernel: task:kworker/14:2 state:R running task stack: 0 pid: 519 ppid: 2 flags:0x00004008
Jun 26 23:26:19 localhost kernel: Workqueue: events ttm_device_delayed_workqueue [ttm]
Jun 26 23:26:19 localhost kernel: Call Trace:
Jun 26 23:26:19 localhost kernel: <TASK>
Jun 26 23:26:19 localhost kernel: ? ttm_device_delayed_workqueue+0x18/0x40 [ttm]
Jun 26 23:26:19 localhost kernel: ? process_one_work+0x1e5/0x3c0
Jun 26 23:26:19 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 26 23:26:19 localhost kernel: ? worker_thread+0x50/0x3b0
Jun 26 23:26:19 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 26 23:26:19 localhost kernel: ? kthread+0xd6/0x100
Jun 26 23:26:19 localhost kernel: ? kthread_complete_and_exit+0x20/0x20
Jun 26 23:26:19 localhost kernel: ? ret_from_fork+0x1f/0x30
Jun 26 23:26:19 localhost kernel: </TASK>

We also had pmcd stuck for two minutes, and the timestamps seem to roughly match the connection issue:

Jun 26 23:30:05 localhost kernel: </TASK>
Jun 26 23:30:05 localhost kernel: INFO: task pmcd:2332 blocked for more than 122 seconds.
Jun 26 23:30:05 localhost kernel: Tainted: G W L -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1
Jun 26 23:30:05 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 26 23:30:05 localhost kernel: task:pmcd state:D stack: 0 pid: 2332 ppid: 1 flags:0x00000002
Jun 26 23:30:05 localhost kernel: Call Trace:
Jun 26 23:30:05 localhost kernel: <TASK>
Jun 26 23:30:05 localhost kernel: __schedule+0x248/0x620
Jun 26 23:30:05 localhost kernel: schedule+0x5a/0xc0
Jun 26 23:30:05 localhost kernel: schedule_preempt_disabled+0x11/0x20
Jun 26 23:30:05 localhost kernel: __mutex_lock.constprop.0+0x2a1/0x430
Jun 26 23:30:05 localhost kernel: __netlink_dump_start+0xc2/0x2e0
Jun 26 23:30:05 localhost kernel: ? validate_linkmsg+0x110/0x110
Jun 26 23:30:05 localhost kernel: rtnetlink_rcv_msg+0x287/0x390
Jun 26 23:30:05 localhost kernel: ? validate_linkmsg+0x110/0x110
Jun 26 23:30:05 localhost kernel: ? rtnl_calcit.isra.0+0x140/0x140
Jun 26 23:30:05 localhost kernel: netlink_rcv_skb+0x4e/0x100
Jun 26 23:30:05 localhost kernel: netlink_unicast+0x23b/0x360
Jun 26 23:30:05 localhost kernel: netlink_sendmsg+0x238/0x480
Jun 26 23:30:05 localhost kernel: sock_sendmsg+0x5f/0x70
Jun 26 23:30:05 localhost kernel: __sys_sendto+0xf0/0x160
Jun 26 23:30:05 localhost kernel: ? __sys_getsockname+0x7e/0xc0
Jun 26 23:30:05 localhost kernel: ? syscall_exit_work+0x11a/0x150
Jun 26 23:30:05 localhost kernel: __x64_sys_sendto+0x20/0x30
Jun 26 23:30:05 localhost kernel: do_syscall_64+0x59/0x90
Jun 26 23:30:05 localhost kernel: ? syscall_exit_to_user_mode+0x12/0x30
Jun 26 23:30:05 localhost kernel: ? do_syscall_64+0x69/0x90
Jun 26 23:30:05 localhost kernel: ? syscall_exit_to_user_mode+0x12/0x30
Jun 26 23:30:05 localhost kernel: ? do_syscall_64+0x69/0x90
Jun 26 23:30:05 localhost kernel: ? exc_page_fault+0x62/0x150

- Did you try running the tests after the lockups disappeared?
- Did you see any issue generating the sos report itself?
- Could you elaborate on what you mean by "but unable to output test report"? I can see the report was created fine, and no errors were logged in the sos_logs directory.
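For reference, a quick way to verify the local pmcd connection outside of the test harness (hypothetical commands, not part of the certification tooling) would be something like:

systemctl status pmcd
pminfo -f pmcd.agent.status      # fetch the status of every pmcd agent over the local connection
pmprobe -v kernel.all.load       # quick end-to-end fetch of a metric through pmcd

If these hang or fail while the soft lockups are occurring, that would suggest the connection error seen in the screenshot is a symptom of the stalled kernel rather than a PCP configuration problem.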
Hi Jose,

Thanks for your feedback. You mentioned that "we also had pmcd stuck for two minutes, and the timestamps seem to roughly match the connection issue". In my testing, it stays stuck and makes no progress even if I remove the pcp plugin.

- Did you try running the tests after the lockups disappeared?
>> See the screenshot: when I run the test, it always hangs at that point and cannot be force-exited with Ctrl+C, so I could not re-run sos report.

- Did you see any issue generating the sos report itself?
>> No. As you can see from the log, it just stays stuck there until I reboot the system.

- Could you elaborate on what you mean by "but unable to output test report"? I can see the report was created fine, and no errors were logged in the sos_logs directory.
>> I don't think you can get a complete report. Unless you restart the system after running the test, you will get stuck in the sos report anyway; that is my issue here. When I execute 'rhcert-cli run --test=video_drm_3d', the test gets a PASS result but then hangs at the sos report stage, as shown in the attachment.

Thanks.
Created attachment 1977408 [details] sosreport
Hello,

I suggested Lenovo upgrade sos to the latest version and rerun, but the sosreport was still stuck after running the video_drm_3D test.

This is specific to the video_drm_3D test only; Lenovo ran the other video tests and the sosreport could be generated successfully. Compared with the other video tests, the difference is that video_drm_3D runs GPU benchmark tests with the following commands:

DRI_PRIME=1 DISPLAY=:0 glmark2
DRI_PRIME=1 DISPLAY=:0 env LIBGL_ALWAYS_SOFTWARE=1 glmark2

I uploaded the latest sosreport package; please take a look and help investigate the root cause.

Thank you
Jianwei
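To help narrow this down outside of rhcert, a minimal manual reproduction sketch (assuming an active graphical session on DISPLAY=:0 and glmark2 installed, i.e. the same environment the test itself uses) would be:

# Terminal 1: run the same benchmarks the video_drm_3D test runs
DRI_PRIME=1 DISPLAY=:0 glmark2
DRI_PRIME=1 DISPLAY=:0 env LIBGL_ALWAYS_SOFTWARE=1 glmark2

# Terminal 2: watch the kernel log for ttm soft lockups while the benchmarks run
journalctl -k -f | grep -i --line-buffered "soft lockup"

If the soft lockups appear here as well, the hang is triggered by the benchmark itself and is independent of sos.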
Another point to highlight: after running the "video_drm_3D" test, the partner terminated it because sosreport was consistently hanging. Even when attempting to run sosreport manually later, it remained stuck, and only after the system was rebooted could the report be generated.
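One way to confirm whether a manually started sosreport is truly blocked (hypothetical commands; the PID placeholder needs to be filled in) is to check its process state and kernel stack while it appears hung:

ps -eo pid,stat,wchan:30,cmd | grep -i "[s]os"   # STAT "D" means uninterruptible sleep
cat /proc/<PID>/stack                            # as root: where in the kernel the process is blocked

A D-state process stuck in a graphics/ttm code path would point at the same soft lockup issue rather than at sos itself.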
(In reply to Jianwei Weng from comment #9)
> Another point to highlight: after running the "video_drm_3D" test, the
> partner terminated it because sosreport was consistently hanging. Even when
> attempting to run sosreport manually later, it remained stuck, and only
> after the system was rebooted could the report be generated.

You say sosreport gets hung, while we see sosreport tarballs being generated, which implies sosreport terminated (or got stuck at its final stage?). How do you know sosreport was stuck? Did you see the process still running (or rather "running") even though it had already generated a sosreport tarball?

Could you enable verbose logs and retry, so we can see where sos report got stuck? In /etc/sos/sos.conf, have:

[global]
verbose = 3

Then sosreport will generate sos_logs/sos.log in verbose mode (either in the generated tarball, if one is really generated, or in /var/tmp/sos*/sos_logs/sos.log if sosreport got stuck "in the middle"). Please provide that file (or the tarball, or the /var/tmp/sos* directory with the file).
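Putting the instructions above together, the full sequence would look roughly like this (a sketch assuming default RHEL 9 paths and that the rhcert test run is what triggers sos):

# /etc/sos/sos.conf -- make sure the [global] section contains:
[global]
verbose = 3

# then re-run the failing test so that sos is invoked again:
rhcert-cli run --test=video_drm_3d

# if sos hangs, collect the partial verbose log from its working directory:
ls -l /var/tmp/sos*/sos_logs/sos.log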
Hello Lenovo Team,

Could you please run a video_drm_3D test first, then follow Pavel's instructions above and upload the resulting file here?

Thanks
Jianwei
Hi Jianwei, I will, but the device is currently being used for other certification testing; I will do it next Monday. Thanks all.
Created attachment 1980827 [details] requested files from under /var/tmp/sos*

The test command got stuck as before.
Created attachment 1980828 [details] sosreport files

After restarting the system, we were able to generate the sosreport; it is attached here.
Thank you for running the sos report with Pavel's suggestions. Looking at its logs, we can see that several plugins timed out:

2023-07-30 23:28:53,408 WARNING: [plugin:podman] command 'podman network ls' timed out after 300s
2023-07-30 23:38:55,881 INFO: Plugin autofs timed out
2023-07-30 23:38:56,537 INFO: Plugin buildah timed out
2023-07-30 23:38:56,544 WARNING: [plugin:buildah] command 'buildah containers' timed out after 300s
2023-07-30 23:39:02,620 INFO: Plugin dracut timed out
2023-07-30 23:39:09,991 INFO: Plugin filesys timed out
2023-07-30 23:43:56,545 INFO: Plugin flatpak timed out
2023-07-30 23:44:02,622 INFO: Plugin grub2 timed out
2023-07-30 23:44:04,264 INFO: Plugin host timed out
2023-07-30 23:44:12,056 INFO: Plugin kernel timed out
2023-07-30 23:49:02,632 INFO: Plugin keyutils timed out

And there was at least one traceback:

$ cat ./sos_logs/virsh-plugin-errors.txt
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/sos/report/__init__.py", line 1224, in setup
    plug.setup()
  File "/usr/lib/python3.9/site-packages/sos/report/plugins/virsh.py", line 53, in setup
    k_list = self.collect_cmd_output('%s %s-list' % (cmd, k),
  File "/usr/lib/python3.9/site-packages/sos/report/plugins/__init__.py", line 2518, in collect_cmd_output
    return self._collect_cmd_output(
  File "/usr/lib/python3.9/site-packages/sos/report/plugins/__init__.py", line 2354, in _collect_cmd_output
    result = sos_get_command_output(
  File "/usr/lib/python3.9/site-packages/sos/utilities.py", line 267, in sos_get_command_output
    raise e
  File "/usr/lib/python3.9/site-packages/sos/utilities.py", line 236, in sos_get_command_output
    _check_poller(p)
  File "/usr/lib/python3.9/site-packages/sos/utilities.py", line 186, in _check_poller
    raise SoSTimeoutError
sos.utilities.SoSTimeoutError

But these logs don't tell us exactly why they were stuck. Looking at the host logs in the second sos, we can again see issues with soft lockups. They start around here:

Jul 30 22:53:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/2:1:199]
Jul 30 22:53:53 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [kworker/2:1:199]
Jul 30 22:54:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 82s! [kworker/2:1:199]
Jul 30 22:54:53 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 108s! [kworker/2:1:199]
[...]

And by the time the first sos was running, the soft lockups were still happening:

Jul 30 23:28:49 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2004s! [kworker/2:1:199]
Jul 30 23:29:17 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2031s! [kworker/2:1:199]
Jul 30 23:29:45 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2057s! [kworker/2:1:199]
Jul 30 23:30:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2094s! [kworker/2:1:199]
Jul 30 23:41:17 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2701s! [kworker/2:1:199]
Jul 30 23:41:45 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2727s! [kworker/2:1:199]
Jul 30 23:42:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 2764s! [kworker/2:1:199]
[...]

The last soft lockup happened around 1:53 a.m.:

Jul 31 01:50:45 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 9936s! [kworker/2:1:199]
Jul 31 01:51:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 9973s! [kworker/2:1:199]
Jul 31 01:51:53 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 9999s! [kworker/2:1:199]
Jul 31 01:52:21 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 10025s! [kworker/2:1:199]
Jul 31 01:52:49 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 10051s! [kworker/2:1:199]
Jul 31 01:53:17 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 10077s! [kworker/2:1:199]

And the newest sos was created at 2 a.m., when no more soft lockups were present:

$ head sos_logs/sos.log
2023-07-31 02:03:14,766 DEBUG: set sysroot to '/' (default)

So, as per my previous note, the sos report here is a victim of something else going on in this machine. You need to look into the soft lockups generated by the ttm kernel module, i.e.:

Jul 30 22:53:25 localhost kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/2:1:199]
[...]
Jul 30 22:53:25 localhost kernel: CPU: 2 PID: 199 Comm: kworker/2:1 Kdump: loaded Not tainted 5.14.0-284.11.1.el9_2.x86_64 #1
Jul 30 22:53:25 localhost kernel: Hardware name: LENOVO ThinkStion P3 TWR/1064, BIOS S0IKT20A 05/05/2023
Jul 30 22:53:25 localhost kernel: Workqueue: events ttm_device_delayed_workqueue [ttm]
Jul 30 22:53:25 localhost kernel: RIP: 0010:dma_resv_iter_walk_unlocked.part.0+0x82/0x160
Jul 30 22:53:25 localhost kernel: Code: 89 c5 48 83 e5 fc 48 89 6b 10 48 83 fb e8 74 06 83 e0 03 89 43 18 8b 55 38 48 8d 7d 38 85 d2 74 5e 8d 4a 01 89 d0 f0 0f b1 0f <0f> 85 b2 00 00 00 09 ca 0f 88 9e 00 00 00 48 89 6b 10 48 85 ed 74
Jul 30 22:53:25 localhost kernel: RSP: 0018:ffffbc35808efd50 EFLAGS: 00000246
Jul 30 22:53:25 localhost kernel: RAX: 0000000000000001 RBX: ffffbc35808efd80 RCX: 0000000000000002
Jul 30 22:53:25 localhost kernel: RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff9bcf859d1cf8
Jul 30 22:53:25 localhost kernel: RBP: ffff9bcf859d1cc0 R08: 0000000000000001 R09: 0000000000000000
Jul 30 22:53:25 localhost kernel: R10: 0000000000000000 R11: 0000000000000217 R12: 00000000ffffffff
Jul 30 22:53:25 localhost kernel: R13: ffff9bcfb73dc800 R14: ffff9bcfb73dc950 R15: ffff9bcfb73dc978
Jul 30 22:53:25 localhost kernel: FS: 0000000000000000(0000) GS:ffff9bd0f7080000(0000) knlGS:0000000000000000
Jul 30 22:53:25 localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 30 22:53:25 localhost kernel: CR2: 00007fe77538b000 CR3: 0000000136bde002 CR4: 0000000000770ee0
Jul 30 22:53:25 localhost kernel: PKRU: 55555554
Jul 30 22:53:25 localhost kernel: Call Trace:
Jul 30 22:53:25 localhost kernel: <TASK>
Jul 30 22:53:25 localhost kernel: dma_resv_iter_first_unlocked+0x25/0x70
Jul 30 22:53:25 localhost kernel: dma_resv_test_signaled+0x32/0xd0
Jul 30 22:53:25 localhost kernel: ttm_bo_release+0x61/0x340 [ttm]
Jul 30 22:53:25 localhost kernel: ? ttm_resource_free+0x64/0x80 [ttm]
Jul 30 22:53:25 localhost kernel: ttm_bo_delayed_delete+0x1dd/0x240 [ttm]
Jul 30 22:53:25 localhost kernel: ttm_device_delayed_workqueue+0x18/0x40 [ttm]
Jul 30 22:53:25 localhost kernel: process_one_work+0x1e5/0x3c0
Jul 30 22:53:25 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jul 30 22:53:25 localhost kernel: worker_thread+0x50/0x3b0
Jul 30 22:53:25 localhost kernel: ? rescuer_thread+0x3a0/0x3a0
Jul 30 22:53:25 localhost kernel: kthread+0xd6/0x100
Jul 30 22:53:25 localhost kernel: ? kthread_complete_and_exit+0x20/0x20
Jul 30 22:53:25 localhost kernel: ret_from_fork+0x1f/0x30

May I suggest that:
- You check the logs for soft lockups before running the tests and/or capturing sos (a quick way to do this is sketched below), and
- Perhaps open a new bug report with the soft lockup traces?

One last question: have you checked whether there are any issues at the memory level?
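A concrete way to do the pre-check suggested above (plain kernel-log greps, nothing specific to the certification tooling) is:

# before starting the test, check the current boot for soft lockups
journalctl -k -b 0 | grep -i "soft lockup"

# or equivalently, from the kernel ring buffer
dmesg | grep -i "soft lockup"

If either command already shows ttm soft lockups before sos is started, capturing the report is unlikely to finish until the machine is rebooted.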
Hi Jose,

> And the newest sos was created at 2 a.m., when no more soft lockups were present:
>> The OS had been restarted, which is why there were fewer soft lockups after 2 a.m.

Regarding "Workqueue: events ttm_device_delayed_workqueue [ttm]": the ttm module is the memory manager for graphics memory, so:
1) I will try setting kernel.watchdog_thresh to a longer value (30) and test again (see the sketch below).
2) Why does this soft lockup always happen on CPU#2, even after restarting the system?

Maybe after testing I will open a new ticket for the soft lockup. PA said that the memory testing is fine so far.

Thanks.
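For reference, a sketch of how the watchdog threshold change in point 1) could be applied (standard sysctl usage; note that raising the threshold only makes the warning less sensitive, it does not fix the underlying ttm stall; the file name below is just an example):

sysctl -w kernel.watchdog_thresh=30                                  # default is 10s; the soft lockup warning fires at 2x this value
echo "kernel.watchdog_thresh = 30" > /etc/sysctl.d/99-watchdog.conf  # persist across reboots
sysctl --system                                                      # reload sysctl configuration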