Bug 1994590
Summary: | [ESXi][RHEL7.9][open-vm-tools] Snapshot of the RHEL7 guest on the VMWare ESXi hypervisor failed vm hangs | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Frank DeLorey <fdelorey> |
Component: | open-vm-tools | Assignee: | Ani Sinha <anisinha> |
Status: | CLOSED ERRATA | QA Contact: | ldu <ldu> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.9 | CC: | anisinha, boyang, cavery, dwysocha, eterrell, jaeshin, jen, jsaks, jsavanyo, jwolfe, ldu, leiwang, mrezanin, ossantos, ovasik, rknipp, schandle, shaselde, yacao, ymankad |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | open-vm-tools-11.0.5-3.el7_9.5 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2023-06-29 15:09:14 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Frank DeLorey
2021-08-17 13:17:12 UTC
(In reply to Frank DeLorey from comment #0) The patch you refer to https://gitlab.com/cavery/open-vm-tools/-/commit/5cb75e0a93b9531875e9e29515a0ea912ca3aca7 Is definitely in open-vm-tools-11.0.5-3.el7_9.1.x86_64. So if this patch was not enough we will have to kick it back to Vmware for further analysis. Cathy This might be duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1880404 Hi Vmware team, There is same issue occuring from same customer with another VM. Customer is using later version of open-vm-tools as below. Is there any update on this issue? open-vm-tools-11.0.5-3.el7_9.3.x86_64 Sun Sep 26 02:48:10 2021 KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-1160.36.2.el7.x86_64/vmlinux [TAINTED] DUMPFILE: /cores/retrace/tasks/792323210/crash/vmcore [PARTIAL DUMP] CPUS: 16 DATE: Fri Jan 28 01:15:22 GMT 2022 UPTIME: 123 days, 06:45:54 LOAD AVERAGE: 202.07, 201.02, 198.65 TASKS: 2589 NODENAME: __removed__ RELEASE: 3.10.0-1160.36.2.el7.x86_64 VERSION: #1 SMP Thu Jul 8 02:53:40 UTC 2021 MACHINE: x86_64 (2699 Mhz) MEMORY: 64 GB PANIC: "Kernel panic - not syncing: NMI: Not continuing" After dm-14 is being unmounted the task pool, pid 85586 is blocked for more than 120 seconds on sync filesystem operation. [10610970.978590] XFS (dm-14): Unmounting Filesystem [10611182.411074] INFO: task pool:85586 blocked for more than 120 seconds. [10611182.411156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10611182.411185] pool D ffff9d4e091205e0 0 85586 1 0x00000080 [10611182.411190] Call Trace: [10611182.411200] [<ffffffffb87874a0>] ? bit_wait+0x50/0x50 [10611182.411203] [<ffffffffb8789199>] schedule+0x29/0x70 [10611182.411206] [<ffffffffb8786e61>] schedule_timeout+0x221/0x2d0 [10611182.411209] [<ffffffffb87874a0>] ? bit_wait+0x50/0x50 [10611182.411212] [<ffffffffb8788a4d>] io_schedule_timeout+0xad/0x130 [10611182.411214] [<ffffffffb8788ae8>] io_schedule+0x18/0x20 [10611182.411217] [<ffffffffb87874b1>] bit_wait_io+0x11/0x50 [10611182.411219] [<ffffffffb8786fd7>] __wait_on_bit+0x67/0x90 [10611182.411224] [<ffffffffb81bd3a1>] wait_on_page_bit+0x81/0xa0 [10611182.411230] [<ffffffffb80c6ff0>] ? wake_bit_function+0x40/0x40 [10611182.411232] [<ffffffffb81bd4d1>] __filemap_fdatawait_range+0x111/0x190 [10611182.411237] [<ffffffffb80bc16a>] ? __queue_delayed_work+0xaa/0x1a0 [10611182.411243] [<ffffffffb83a67cd>] ? list_del+0xd/0x30 [10611182.411246] [<ffffffffb878956d>] ? wait_for_completion+0x11d/0x140 [10611182.411249] [<ffffffffb81c0ba7>] filemap_fdatawait_keep_errors+0x27/0x30 [10611182.411254] [<ffffffffb827f62a>] sync_inodes_sb+0x37a/0x3d0 [10611182.411257] [<ffffffffb82842b6>] sync_filesystem+0x66/0xb0 [10611182.411260] [<ffffffffb825186f>] freeze_super+0x8f/0x130 [10611182.411263] [<ffffffffb8263af4>] do_vfs_ioctl+0x414/0x5b0 [10611182.411266] [<ffffffffb8254494>] ? SYSC_newfstat+0x34/0x60 [10611182.411268] [<ffffffffb8263d31>] SyS_ioctl+0xa1/0xc0 [10611182.411273] [<ffffffffb8795f92>] system_call_fastpath+0x25/0x2a crash> pd 10611182-10610970 $2 = 212 There are 202 Uninterruptible tasks in the queues. crash> ps -S RU: 17 IN: 2367 UN: 202 ZO: 3 crash> ps -m | grep UN | tail [ 0 11:11:18.744] [UN] PID: 123137 TASK: ffff9d4c8cb2d280 CPU: 14 COMMAND: "kworker/u256:8" [ 0 11:11:19.652] [UN] PID: 70349 TASK: ffff9d4c8d200000 CPU: 14 COMMAND: "fce7aa1a16a855e" [ 0 11:11:20.113] [UN] PID: 80754 TASK: ffff9d4ba3689080 CPU: 3 COMMAND: "kworker/3:3" [ 0 11:11:20.083] [UN] PID: 85014 TASK: ffff9d4dea2f5280 CPU: 4 COMMAND: "TaskSchedulerBa" [ 0 11:11:20.122] [UN] PID: 2225 TASK: ffff9d599b9a2100 CPU: 3 COMMAND: "dockerd" [ 0 11:11:20.187] [UN] PID: 85586 TASK: ffff9d4e09120000 CPU: 13 COMMAND: "pool" [ 0 11:11:20.059] [UN] PID: 2271 TASK: ffff9d59addc6300 CPU: 7 COMMAND: "loop0" [ 0 11:11:20.159] [UN] PID: 85514 TASK: ffff9d4df9259080 CPU: 12 COMMAND: "openbox" [ 0 11:11:20.185] [UN] PID: 85522 TASK: ffff9d4dfede4200 CPU: 14 COMMAND: "node" [ 0 11:11:20.280] [UN] PID: 84921 TASK: ffff9d4b54834200 CPU: 3 COMMAND: "chrome" The oldest UN task is the task chome, pid 84921 which is scheduled for filesystem operation. crash> ps -p 84921 PID: 0 TASK: ffffffffb8c18480 CPU: 0 COMMAND: "swapper/0" PID: 1 TASK: ffff9d4b195d8000 CPU: 12 COMMAND: "systemd" PID: 102478 TASK: ffff9d4a7a6a2100 CPU: 10 COMMAND: "containerd" PID: 84612 TASK: ffff9d4e0151d280 CPU: 2 COMMAND: "containerd-shim" PID: 84625 TASK: ffff9d59bb19e300 CPU: 5 COMMAND: "bash" PID: 84899 TASK: ffff9d49f57cd280 CPU: 1 COMMAND: "node" PID: 84920 TASK: ffff9d4b1928c200 CPU: 9 COMMAND: "751c6f808f53b35" PID: 84921 TASK: ffff9d4b54834200 CPU: 3 COMMAND: "chrome" crash> set 84921 PID: 84921 COMMAND: "chrome" TASK: ffff9d4b54834200 [THREAD_INFO: ffff9d4c66510000] CPU: 3 STATE: TASK_UNINTERRUPTIBLE crash> bt PID: 84921 TASK: ffff9d4b54834200 CPU: 3 COMMAND: "chrome" #0 [ffff9d4c66513b70] __schedule at ffffffffb8788dc8 #1 [ffff9d4c66513bd8] schedule at ffffffffb8789199 #2 [ffff9d4c66513be8] __sb_start_write at ffffffffb8251521 #3 [ffff9d4c66513c50] __xfs_filemap_fault at ffffffffc082d9af [xfs] #4 [ffff9d4c66513cb0] xfs_filemap_page_mkwrite at ffffffffc082dae5 [xfs] #5 [ffff9d4c66513cc0] do_page_mkwrite at ffffffffb81ee88a #6 [ffff9d4c66513d40] do_wp_page at ffffffffb81f2347 #7 [ffff9d4c66513de8] handle_mm_fault at ffffffffb81f6682 #8 [ffff9d4c66513eb0] __do_page_fault at ffffffffb8790653 #9 [ffff9d4c66513f20] do_page_fault at ffffffffb8790975 #10 [ffff9d4c66513f50] page_fault at ffffffffb878c778 Below docker mount is starting to write. crash> mount|grep ffff9d5375fda000 ffff9d59ad89a700 ffff9d5375fda000 xfs /dev/mapper/docker-253:3-134-e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b /apps/docker/devicemapper/mnt/e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b But the filesystem for docker-253:3 is in FREEZE state and some more. SB: 0xffff9d5375fda000, frozen=SB_FREEZE_COMPLETE, /apps/docker/devicemapper/mnt/e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b/ (dm-13) [xfs], (locked,umount) SB: 0xffff9d59ba195000, frozen=SB_FREEZE_COMPLETE, /apps/docker/ (dm-3) [xfs], (locked,umount) SB: 0xffff9d4b4be5a800, frozen=SB_FREEZE_PAGEFAULT, /apps/docker/devicemapper/mnt/b36a4f0070f1f6c670ce97c5445a519e515707c639de28a56c83469e1aaac93d/ (dm-12) [xfs], (locked,umount) SB: 0xffff9d4b6b6a3000, frozen=SB_FREEZE_COMPLETE, /apps/docker/devicemapper/mnt/5ee9f7a410f0bb85f7b3e48fdded33655f69c15ca8b4d5ccad34913f8cb97300/ (dm-15) [xfs], (locked,umount) vmtoolsd did not provok it: crash> ps -g 1587 PID: 1587 TASK: ffff9d49f5469080 CPU: 8 COMMAND: "vmtoolsd" PID: 1614 TASK: ffff9d59b9a84200 CPU: 12 COMMAND: "HangDetector" PID: 85586 TASK: ffff9d4e09120000 CPU: 13 COMMAND: "pool" PID: 91690 TASK: ffff9d4df1b9e300 CPU: 10 COMMAND: "pool" PID: 95792 TASK: ffff9d4e9d6eb180 CPU: 12 COMMAND: "pool" PID: 108196 TASK: ffff9d4c5b375280 CPU: 4 COMMAND: "pool" PID: 123459 TASK: ffff9d4ddca9d280 CPU: 5 COMMAND: "pool" The loop devices are also blocked. [10611182.411729] INFO: task loop0:2271 blocked for more than 120 seconds. [10611182.411756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10611182.411801] loop0 D ffff9d59addc68e0 0 2271 2 0x00000080 [10611182.411803] Call Trace: [10611182.411807] [<ffffffffb8789199>] schedule+0x29/0x70 [10611182.411810] [<ffffffffb8251521>] __sb_start_write+0xe1/0x120 [10611182.411813] [<ffffffffb80c6f30>] ? wake_up_atomic_t+0x30/0x30 [10611182.411817] [<ffffffffc06a5e77>] __do_lo_send_write+0xc7/0x120 [loop] [10611182.411820] [<ffffffffc06a5f28>] do_lo_send_direct_write+0x58/0x70 [loop] [10611182.411824] [<ffffffffc06a6d7a>] do_bio_filebacked+0x23a/0x320 [loop] [10611182.411827] [<ffffffffc06a5ed0>] ? __do_lo_send_write+0x120/0x120 [loop] [10611182.411830] [<ffffffffb80d31d1>] ? __wake_up_common_lock+0x91/0xc0 [10611182.411832] [<ffffffffb80d2c40>] ? task_rq_unlock+0x20/0x20 [10611182.411836] [<ffffffffc06a75a1>] loop_thread+0x101/0x2e0 [loop] [10611182.411838] [<ffffffffb80c6f30>] ? wake_up_atomic_t+0x30/0x30 [10611182.411841] [<ffffffffc06a74a0>] ? loop_attr_do_show_backing_file+0xb0/0xb0 [loop] [10611182.411844] [<ffffffffb80c5e41>] kthread+0xd1/0xe0 [10611182.411847] [<ffffffffb80c5d70>] ? insert_kthread_work+0x40/0x40 [10611182.411850] [<ffffffffb8795df7>] ret_from_fork_nospec_begin+0x21/0x21 [10611182.411853] [<ffffffffb80c5d70>] ? insert_kthread_work+0x40/0x40 crash> ps|grep loop 2271 2 7 ffff9d59addc6300 UN 0.0 0 0 [loop0] 2272 2 10 ffff9d59bbb58000 UN 0.0 0 0 [loop1] crash> bt 2271 2272 PID: 2271 TASK: ffff9d59addc6300 CPU: 7 COMMAND: "loop0" #0 [ffff9d59a242fc50] __schedule at ffffffffb8788dc8 #1 [ffff9d59a242fcb8] schedule at ffffffffb8789199 #2 [ffff9d59a242fcc8] __sb_start_write at ffffffffb8251521 #3 [ffff9d59a242fd30] __do_lo_send_write at ffffffffc06a5e77 [loop] #4 [ffff9d59a242fd78] do_lo_send_direct_write at ffffffffc06a5f28 [loop] #5 [ffff9d59a242fdb0] do_bio_filebacked at ffffffffc06a6d7a [loop] #6 [ffff9d59a242fe58] loop_thread at ffffffffc06a75a1 [loop] #7 [ffff9d59a242fec8] kthread at ffffffffb80c5e41 PID: 2272 TASK: ffff9d59bbb58000 CPU: 10 COMMAND: "loop1" #0 [ffff9d59a24e7c50] __schedule at ffffffffb8788dc8 #1 [ffff9d59a24e7cb8] schedule at ffffffffb8789199 #2 [ffff9d59a24e7cc8] __sb_start_write at ffffffffb8251521 #3 [ffff9d59a24e7d30] __do_lo_send_write at ffffffffc06a5e77 [loop] #4 [ffff9d59a24e7d78] do_lo_send_direct_write at ffffffffc06a5f28 [loop] #5 [ffff9d59a24e7db0] do_bio_filebacked at ffffffffc06a6d7a [loop] #6 [ffff9d59a24e7e58] loop_thread at ffffffffc06a75a1 [loop] #7 [ffff9d59a24e7ec8] kthread at ffffffffb80c5e41 From the description comment #0 - Frank DeLorey: > Actual results: > > VM hangs host need to be rebooted to unlock VM Are you stating that the ESXi host needed to be rebooted, or simply that the VM needed to be powered of and rebooted? I am assuming the later, but asking for clarity. > Some superblocks are already frozen, but one of them is still in progress: > > crash> list -H super_blocks -s super_block.s_writers | grep -E '^ffff| frozen = ' | paste - - | grep -v ' = 0x0,' > ffff8dbe84f98000 frozen = 0x4, > ffff8dbeb9bce000 frozen = 0x3, > ffff8dc0535da000 frozen = 0x4, If the dump is still available, is it possible to determine: 1. What process is currently doing the fsfreeze on superblock ffff8dbeb9bce000 - /dev/mapper/docker-253:4-134-d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae /apps/docker/devicemapper? 2. Is vmtoolsd blocked and if so, what is the file to which I/O is being blocked? Given that the filesystem freeze(s) are currently still happening, and assuming that it is vmtoosd vmsvc doing the quiesing, this cannot be the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=1880404. BZ 1880404 occurred after the snapshot was completed and the quiesce_manifest.xml is being replaced. (In reply to Jay Shin from comment #12) If we are in the midst of taking a quiesed snapshot, other processes will be blocked if they attempt to write to a frozen filesystem. The question is what is the natire of the vlockage and do we have another process freezing any filesystem? Jay has made the statement that > > vmtoolsd did not provok it: > Wondering what the basis is for this statement? Again the same questions as above. - Is vmtoolsd tangled up in this? Does vmtoolsd show up in the output of "ps -m | grep UN"? If so. what superblock (filesystem) is it blocked by and what process has frozen that superblock? Cathy, Can you redirect these questions as appropriate? Hopefully the dump is still available. Thanks. Hi Jay and Frank, Could you provide any information needed in comment 14? Also John W do you want them to ask the customer to try the workaround from bz 2066140 Created file /etc/vmware-tools/tools.conf with following entry ( entries are case sensitive ) [vmbackup] enableXmlManifest = false Thanks, Cathy Sorry the dump is no longer available. I believe they needed to reboot the VM not the ESX host itself. We can ask them to try the workaround if it is believed to be applicable to this problem. Frank (In reply to Cathy Avery from comment #15) > Also John W do you want them to ask the customer to try the workaround from > bz 2066140 > > Created file /etc/vmware-tools/tools.conf with following entry ( entries are > case sensitive ) > > [vmbackup] > enableXmlManifest = false > > Thanks, > > Cathy If they have not already done so, yes that would be advisable. (In reply to Frank DeLorey from comment #16) > Sorry the dump is no longer available. I believe they needed to reboot the > VM not the ESX host itself. We can ask them to try the workaround if it is > believed to be applicable to this problem. > > Frank I have retraced it again. retrace-server-interact 333400596 crash vmtoolsd was not in UN state. crash> ps 1586 PID PPID CPU TASK ST %MEM VSZ RSS COMM 1586 1 8 ffff8dbdb65d0000 IN 0.0 580840 5548 vmtoolsd crash> bt 1586 PID: 1586 TASK: ffff8dbdb65d0000 CPU: 8 COMMAND: "vmtoolsd" #0 [ffff8dbe8d3f7978] __schedule at ffffffff8db868a7 #1 [ffff8dbe8d3f7a08] schedule at ffffffff8db86dc9 #2 [ffff8dbe8d3f7a18] schedule_hrtimeout_range_clock at ffffffff8db85fb2 #3 [ffff8dbe8d3f7ab0] schedule_hrtimeout_range at ffffffff8db86063 #4 [ffff8dbe8d3f7ac0] poll_schedule_timeout at ffffffff8d664115 #5 [ffff8dbe8d3f7af0] do_sys_poll at ffffffff8d6658ad #6 [ffff8dbe8d3f7f10] sys_poll at ffffffff8d665ab4 #7 [ffff8dbe8d3f7f50] system_call_fastpath at ffffffff8db93f92 RIP: 00007fa8238aec3d RSP: 00007ffd566908f0 RFLAGS: 00000246 RAX: 0000000000000007 RBX: 000055690db5bb50 RCX: 000055690db55d00 RDX: 0000000000000064 RSI: 0000000000000002 RDI: 000055690db50cf0 RBP: 0000000000000002 R8: 0000000000000007 R9: 0000000000000000 R10: 00007fa824ff4f88 R11: 0000000000000293 R12: 000055690db50cf0 R13: 0000000000000064 R14: 00007fa823be5420 R15: 0000000000000007 ORIG_RAX: 0000000000000007 CS: 0033 SS: 002b > 1. What process is currently doing the fsfreeze on superblock ffff8dbeb9bce000 - /dev/mapper/docker-253:4-134-d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae /apps/docker/devicemapper? crash> search -t ffff8dbeb9bce000 | grep PID PID: 7235 TASK: ffff8dc269241080 CPU: 10 COMMAND: "pool" PID: 15098 TASK: ffff8dc9672d4200 CPU: 2 COMMAND: "pool" PID: 15578 TASK: ffff8dc0b4a75280 CPU: 9 COMMAND: "pool" PID: 23282 TASK: ffff8dbfcb818000 CPU: 10 COMMAND: "pool" PID: 41513 TASK: ffff8dbec7bf6300 CPU: 7 COMMAND: "minion-lightwei" PID: 45454 TASK: ffff8dc0b9858000 CPU: 0 COMMAND: "pool" crash> ps -m 7235 15098 15578 23282 41513 45454 [ 0 16:40:13.106] [IN] PID: 41513 TASK: ffff8dbec7bf6300 CPU: 7 COMMAND: "minion-lightwei" [ 6 22:18:05.152] [UN] PID: 45454 TASK: ffff8dc0b9858000 CPU: 0 COMMAND: "pool" [ 7 21:52:46.476] [UN] PID: 15098 TASK: ffff8dc9672d4200 CPU: 2 COMMAND: "pool" [ 7 22:07:52.783] [UN] PID: 7235 TASK: ffff8dc269241080 CPU: 10 COMMAND: "pool" [ 8 22:02:38.551] [UN] PID: 23282 TASK: ffff8dbfcb818000 CPU: 10 COMMAND: "pool" [ 8 22:17:44.586] [UN] PID: 15578 TASK: ffff8dc0b4a75280 CPU: 9 COMMAND: "pool" crash> bt 45454 PID: 45454 TASK: ffff8dc0b9858000 CPU: 0 COMMAND: "pool" #0 [ffff8dc27b72bcc8] __schedule at ffffffff8db868a7 #1 [ffff8dc27b72bd50] schedule at ffffffff8db86dc9 #2 [ffff8dc27b72bd60] rwsem_down_write_failed at ffffffff8db88505 #3 [ffff8dc27b72bdf0] call_rwsem_down_write_failed at ffffffff8d797757 #4 [ffff8dc27b72be38] down_write at ffffffff8db860fd #5 [ffff8dc27b72be50] freeze_super at ffffffff8d650ff6 #6 [ffff8dc27b72be80] do_vfs_ioctl at ffffffff8d663534 #7 [ffff8dc27b72bf00] sys_ioctl at ffffffff8d663771 #8 [ffff8dc27b72bf50] system_call_fastpath at ffffffff8db93f92 The file ffff8dbeeab4a600 is being accessed. DENTRY INODE SUPERBLK TYPE PATH ffff8dcd59cf4e40 ffff8dbdb600b208 ffff8dbeb9bce000 DIR /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae/ file operations = ffffffffc0724380 (r) xfs_dir_file_operations [xfs] file open mode = Read-Only (0x8000) file size = 28 bytes, ino = 256, link count = 3 uid = 0, gid = 0 ffff8dbee8fced00 ffff8dbeb9bce000 xfs /dev/mapper/docker-253:4-134-d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae crash> rw_semaphore 0xffff8dbeb9bce068 -x struct rw_semaphore { { count = { counter = 0xfffffffe00000001 }, __UNIQUE_ID_rh_kabi_hide4 = { count = 0xfffffffe00000001 }, {<No data fields>} }, wait_lock = { raw_lock = { val = { counter = 0x0 } } }, osq = { tail = { counter = 0x0 } }, wait_list = { next = 0xffff8dc0804c3d90 }, owner = 0xffff8dc0b4a75280 } semaphore owner. crash> ps -m 15578 [ 8 22:17:44.586] [UN] PID: 15578 TASK: ffff8dc0b4a75280 CPU: 9 COMMAND: "pool" crash> bt 0xffff8dc0b4a75280 PID: 15578 TASK: ffff8dc0b4a75280 CPU: 9 COMMAND: "pool" #0 [ffff8dc0774a3c78] __schedule at ffffffff8db868a7 #1 [ffff8dc0774a3d08] schedule at ffffffff8db86dc9 #2 [ffff8dc0774a3d18] xfs_log_force_lsn at ffffffffc0707d3c [xfs] #3 [ffff8dc0774a3d98] __xfs_trans_commit at ffffffffc0702fb3 [xfs] #4 [ffff8dc0774a3dd8] xfs_trans_commit at ffffffffc07032d0 [xfs] #5 [ffff8dc0774a3de8] xfs_sync_sb at ffffffffc06da2d2 [xfs] #6 [ffff8dc0774a3e10] xfs_log_sbcount at ffffffffc06fcb93 [xfs] #7 [ffff8dc0774a3e20] xfs_quiesce_attr at ffffffffc07001a0 [xfs] #8 [ffff8dc0774a3e38] xfs_fs_freeze at ffffffffc0700471 [xfs] #9 [ffff8dc0774a3e50] freeze_super at ffffffff8d65108b #10 [ffff8dc0774a3e80] do_vfs_ioctl at ffffffff8d663534 #11 [ffff8dc0774a3f00] sys_ioctl at ffffffff8d663771 #12 [ffff8dc0774a3f50] system_call_fastpath at ffffffff8db93f92 struct file ffff8dbeb0b8df00 DENTRY INODE SUPERBLK TYPE PATH ffff8dcd59cf4e40 ffff8dbdb600b208 ffff8dbeb9bce000 DIR /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae/ file operations = ffffffffc0724380 (r) xfs_dir_file_operations [xfs] file open mode = Read-Only (0x8000) file size = 28 bytes, ino = 256, link count = 3 uid = 0, gid = 0 == Mount Info == ffff8dbee8fced00 ffff8dbeb9bce000 xfs /dev/mapper/docker-253:4-134-d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae > Again the same questions as above. > - Is vmtoolsd tangled up in this? Does vmtoolsd show up in the output of "ps -m | grep UN"? > If so. what superblock (filesystem) is it blocked by and what process has frozen that superblock? crash> ps -m 1587 [ 0 00:00:00.000] [IN] PID: 1587 TASK: ffff9d49f5469080 CPU: 8 COMMAND: "vmtoolsd" crash> bt 1587 PID: 1587 TASK: ffff9d49f5469080 CPU: 8 COMMAND: "vmtoolsd" #0 [ffff9d59b948f9a0] __schedule at ffffffffb8788dc8 #1 [ffff9d59b948fa08] schedule at ffffffffb8789199 #2 [ffff9d59b948fa18] schedule_hrtimeout_range_clock at ffffffffb8788562 #3 [ffff9d59b948fab0] schedule_hrtimeout_range at ffffffffb8788613 #4 [ffff9d59b948fac0] poll_schedule_timeout at ffffffffb82646d5 #5 [ffff9d59b948faf0] do_sys_poll at ffffffffb8265e6d #6 [ffff9d59b948ff10] sys_poll at ffffffffb8266074 #7 [ffff9d59b948ff50] system_call_fastpath at ffffffffb8795f92 crash> mount|grep ffff9d5375fda000 ffff9d59ad89a700 ffff9d5375fda000 xfs /dev/mapper/docker-253:3-134-e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b /apps/docker/devicemapper/mnt/e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b crash> search -t ffff9d5375fda000 | grep PID PID: 80930 TASK: ffff9d59aa181080 CPU: 2 COMMAND: "kworker/2:2" PID: 84910 TASK: ffff9d59bb749080 CPU: 8 COMMAND: "node" PID: 84911 TASK: ffff9d4b4ae85280 CPU: 10 COMMAND: "node" PID: 84921 TASK: ffff9d4b54834200 CPU: 3 COMMAND: "chrome" PID: 84989 TASK: ffff9d4b19333180 CPU: 11 COMMAND: "Chrome_DBThread" PID: 84994 TASK: ffff9d59a6da2100 CPU: 4 COMMAND: "Chrome_IOThread" PID: 85013 TASK: ffff9d4dea2f2100 CPU: 5 COMMAND: "Chrome_HistoryT" PID: 120708 TASK: ffff9d59a879a100 CPU: 1 COMMAND: "minion-lightwei" crash> ps -m 80930 84910 84911 84921 84989 84994 85013 120708 [ 0 00:04:19.729] [IN] PID: 120708 TASK: ffff9d59a879a100 CPU: 1 COMMAND: "minion-lightwei" [ 0 11:06:24.847] [UN] PID: 80930 TASK: ffff9d59aa181080 CPU: 2 COMMAND: "kworker/2:2" [ 0 11:10:53.318] [UN] PID: 84989 TASK: ffff9d4b19333180 CPU: 11 COMMAND: "Chrome_DBThread" [ 0 11:11:07.502] [UN] PID: 84994 TASK: ffff9d59a6da2100 CPU: 4 COMMAND: "Chrome_IOThread" [ 0 11:11:12.644] [UN] PID: 85013 TASK: ffff9d4dea2f2100 CPU: 5 COMMAND: "Chrome_HistoryT" [ 0 11:11:20.280] [UN] PID: 84921 TASK: ffff9d4b54834200 CPU: 3 COMMAND: "chrome" [ 0 11:11:23.760] [IN] PID: 84910 TASK: ffff9d59bb749080 CPU: 8 COMMAND: "node" [ 0 11:11:23.766] [IN] PID: 84911 TASK: ffff9d4b4ae85280 CPU: 10 COMMAND: "node" crash> set 84989 PID: 84989 COMMAND: "Chrome_DBThread" TASK: ffff9d4b19333180 [THREAD_INFO: ffff9d4c3314c000] CPU: 11 STATE: TASK_UNINTERRUPTIBLE crash> bt PID: 84989 TASK: ffff9d4b19333180 CPU: 11 COMMAND: "Chrome_DBThread" #0 [ffff9d4c3314fb70] __schedule at ffffffffb8788dc8 #1 [ffff9d4c3314fbd8] schedule at ffffffffb8789199 #2 [ffff9d4c3314fbe8] __sb_start_write at ffffffffb8251521 #3 [ffff9d4c3314fc50] __xfs_filemap_fault at ffffffffc082d9af [xfs] #4 [ffff9d4c3314fcb0] xfs_filemap_page_mkwrite at ffffffffc082dae5 [xfs] #5 [ffff9d4c3314fcc0] do_page_mkwrite at ffffffffb81ee88a #6 [ffff9d4c3314fd40] do_wp_page at ffffffffb81f2347 #7 [ffff9d4c3314fde8] handle_mm_fault at ffffffffb81f6682 #8 [ffff9d4c3314feb0] __do_page_fault at ffffffffb8790653 #9 [ffff9d4c3314ff20] do_page_fault at ffffffffb8790975 #10 [ffff9d4c3314ff50] page_fault at ffffffffb878c778 crash> mount|grep ffff9d5375fda000 ffff9d59ad89a700 ffff9d5375fda000 xfs /dev/mapper/docker-253:3-134-e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b /apps/docker/devicemapper/mnt/e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b crash> ps -p 84989 PID: 0 TASK: ffffffffb8c18480 CPU: 0 COMMAND: "swapper/0" PID: 1 TASK: ffff9d4b195d8000 CPU: 12 COMMAND: "systemd" PID: 102478 TASK: ffff9d4a7a6a2100 CPU: 10 COMMAND: "containerd" PID: 84612 TASK: ffff9d4e0151d280 CPU: 2 COMMAND: "containerd-shim" PID: 84625 TASK: ffff9d59bb19e300 CPU: 5 COMMAND: "bash" PID: 84899 TASK: ffff9d49f57cd280 CPU: 1 COMMAND: "node" PID: 84920 TASK: ffff9d4b1928c200 CPU: 9 COMMAND: "751c6f808f53b35" PID: 84989 TASK: ffff9d4b19333180 CPU: 11 COMMAND: "Chrome_DBThread" From Jay's response in comment 18, vmtoolsd is not blocked by a frozen file and that another process has frozen filesystem(s). Can it be determined if vmtoolsd currently has any filesystem(s) frozen which would indicate that a quiesed snapshot is under way? It is conceivable that vmtoolsd could get blocked here and is another reason for the quiesed snapshot to fail gracefully if it cannot get access to all applicable mounted filesystems. Hello, o We have had our customer test the below fix Create file /etc/vmware-tools/tools.conf with following entry ( entries are case sensitive ) [vmbackup] enableXmlManifest = false o And they are reporting that there have not been any hangs since. Customer in case 03011147 has reported they have had another hang and we have requested they collect a vmcore. Details from custoMer below ~~~ we had another Docker device issue today. But looks to have started March. I've forced a Core Dump using NMI at the VM Console. Here's the email I sent to my team. Now Docker is unable to delete a device and XFS inode error. 2022-04-14T12:20:27.281993-05:00 nclpvnpapp10400 kernel: [2274616.594457] loop: Write error at byte offset 483581952, length 4096. 2022-04-14T12:20:27.285066-05:00 nclpvnpapp10400 kernel: [2274616.597477] loop: Write error at byte offset 482996224, length 4096. 2022-04-14T12:20:27.285587-05:00 nclpvnpapp10400 kernel: [2274616.597949] XFS (dm-13): metadata I/O error in "xfs_buf_iodone_callback_error" at daddr 0x1eaef8 len 8 error 28 The “dm-13” is this device…. [root@nclpvnpapp10400 systemdump]# find /dev/ -name "dm-*" -exec readlink -n {} \; -exec echo " -->" {} \; ../../dm-13 --> /dev/disk/by-id/dm-name-docker-253:9-134-e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874 Is found on both “loop” devices…. loop0 7:0 0 100G 0 loop `-docker-253:9-134-pool 253:10 0 100G 0 dm |-docker-253:9-134-b14647e3c4e045e56343bc4f5dc0dc91a5b1b40b069840ce1531b7d50f0b6572 253:11 0 10G 0 dm /apps/docker/devicemapper/mnt/b14647e3c4e045e56343bc4f5dc0dc91a5b1b40b069840ce1531b7d50f0b6572 |-docker-253:9-134-b1d4eeb84d3d15931ec52f03eab558fea101c20ba5e6dc383469ea51d3fc8cd9 253:12 0 10G 0 dm `-docker-253:9-134-e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874 253:13 0 10G 0 dm /apps/docker/devicemapper/mnt/e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874 loop1 7:1 0 2G 0 loop `-docker-253:9-134-pool 253:10 0 100G 0 dm |-docker-253:9-134-b14647e3c4e045e56343bc4f5dc0dc91a5b1b40b069840ce1531b7d50f0b6572 253:11 0 10G 0 dm /apps/docker/devicemapper/mnt/b14647e3c4e045e56343bc4f5dc0dc91a5b1b40b069840ce1531b7d50f0b6572 |-docker-253:9-134-b1d4eeb84d3d15931ec52f03eab558fea101c20ba5e6dc383469ea51d3fc8cd9 253:12 0 10G 0 dm `-docker-253:9-134-e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874 253:13 0 10G 0 dm /apps/docker/devicemapper/mnt/e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874 This looks to have started back in March…. [ah02870@nclpvnpapp10400 log]$ grep 994269 messages-20220327 | more 2022-03-20T17:23:50.353896-05:00 nclpvnpapp10400 NetworkManager[1616]: <info> [1647815030.3537] manager: (veth9942692): new Veth device (/org/freedesktop/NetworkManager/Devices/65723) 2022-03-20T17:25:59.233817-05:00 nclpvnpapp10400 NetworkManager[1616]: <info> [1647815159.2337] manager: (veth9942692): new Veth device (/org/freedesktop/NetworkManager/Devices/65793) 2022-03-21T02:30:11.985956-05:00 nclpvnpapp10400 kernel: [165768.325316] device-mapper: thin: Deletion of thin device 994269 failed. 2022-03-21T02:30:21.487968-05:00 nclpvnpapp10400 kernel: [165777.826863] device-mapper: thin: Deletion of thin device 994269 failed. We hit the “no space” on the April 11th… [ah02870@nclpvnpapp10400 log]$ grep "no space" /var/log/messages | more 2022-04-11T00:00:23.557086-05:00 nclpvnpapp10400 dockerd[2052]: time="2022-04-11T00:00:23.556744401-05:00" level=warning msg="Deletion of device b1d4eeb84d3d15931ec52f03eab558fea101c20ba5e6dc383469ea51d3fc8cd9, device_id=994269 failed:devmapper: Error saving transaction metadata: devmapper: Error creating metadata file: open /apps/docker/ devicemapper/metadata/.tmp534691609: no space left on device" storage-driver=devicemapper The device that cannot be deleted is DM-12. ../../dm-12 --> /dev/disk/by-id/dm-name-docker-253:9-134-b1d4eeb84d3d15931ec52f03eab558fea101c20ba5e6dc383469ea51d3fc8cd9 ~~~ Will respond with a link to the core when the customer has completed and uploaded the core dump. I also let them know that the workaround has helped with other customers. I have been reminded that it is the updated (new) quiese_manifest.xml file sent to the host that tells the vSphere host that the snapshot was a quiesed snapshot. With the tools.conf setting "enableXmlManifest = false", no manifest file is created to forward to the host. When viewing the snapshot information in the vSphere client's Manage Snapshots window, the snapshot will be displayed as a non-quiesed snapshot even though it was quiesed. The snapshot should still be fine and quiesced, it just won't be displayed that way. https://drive.google.com/file/d/1wg9k-gMDqqA_mPPzu7hjEVnZkbwwmBSb/view?usp=sharing Core and sosreport provided by 03011147 (In reply to Robert McSwain from comment #28) > https://drive.google.com/file/d/1wg9k-gMDqqA_mPPzu7hjEVnZkbwwmBSb/ > view?usp=sharing > > Core and sosreport provided by 03011147 Hi, I've been talking to John Wolfe. He asks if this information can be collected as soon as possible ( the logs will rollover at some point ): "If at all possible, collect the vm_support bundle from the ESXi host running the affected RHEL 7 VM at the time the NMI was triggered. Hopefully the hostd and vmkernel logs have not rotated past that point in time. Follow the KB article (and links) https://kb.vmware.com/s/article/1010705. It is only necessary to include the affected VM, not every VM on the host. If you provide the name of that VM and the time that the NMI was taken, I can scan the logs to any indication of vmtoolsd activity including snapshot activities and any problems." Thanks, Cathy @ (In reply to Robert McSwain from comment #33) > (In reply to Robert McSwain from comment #31) > > > > Current vmcore is here: https://galvatron.cee.redhat.com/manager/725430923 > > @cavery Setting needinfo for this. Thanks! What I am seeing is pretty much reflected in the other dumps. About a 100 uninterruptible tasks mostly ps and pybin. [ 0 00:00:00.000] [UN] PID: 2275 TASK: ffffa129bb4e4200 CPU: 3 COMMAND: "loop0" [ 0 00:00:00.001] [UN] PID: 91008 TASK: ffffa12a59361080 CPU: 0 COMMAND: "xfsaild/dm-13" [ 0 20:43:45.773] [UN] PID: 8799 TASK: ffffa1354aec6300 CPU: 10 COMMAND: "pool" [ 0 20:50:07.641] [UN] PID: 6605 TASK: ffffa135e71e4200 CPU: 15 COMMAND: "ps" [ 0 20:53:50.367] [UN] PID: 4848 TASK: ffffa138f728a100 CPU: 2 COMMAND: "ps" [ 0 20:54:56.914] [UN] PID: 4600 TASK: ffffa137e1535280 CPU: 2 COMMAND: "pybin" [ 0 20:56:24.384] [UN] PID: 1888 TASK: ffffa138f5af3180 CPU: 11 COMMAND: "zabbix_agentd" [ 0 20:57:11.162] [UN] PID: 92813 TASK: ffffa12a59333180 CPU: 7 COMMAND: "Chrome_IOThread" [ 0 22:01:25.845] [UN] PID: 92276 TASK: ffffa1384dbee300 CPU: 0 COMMAND: "1a1f09110779600" [ 0 22:04:48.137] [UN] PID: 92079 TASK: ffffa1382b4aa100 CPU: 8 COMMAND: "sisidsdaemon" [ 0 22:10:39.028] [UN] PID: 92865 TASK: ffffa137e9f2b180 CPU: 5 COMMAND: "Chrome_HistoryT" [ 0 22:11:00.116] [UN] PID: 92812 TASK: ffffa138fc799080 CPU: 4 COMMAND: "Chrome_CacheThr" [ 0 22:11:00.175] [UN] PID: 92278 TASK: ffffa129b9d1b180 CPU: 9 COMMAND: "chrome" crash> ps -g ffffa138f67a6300 PID: 1593 TASK: ffffa138f67a6300 CPU: 8 COMMAND: "vmtoolsd" PID: 1609 TASK: ffffa138c5268000 CPU: 2 COMMAND: "HangDetector" PID: 8799 TASK: ffffa1354aec6300 CPU: 10 COMMAND: "pool" PID: 20693 TASK: ffffa137ec905280 CPU: 5 COMMAND: "pool" PID: 32614 TASK: ffffa131583e9080 CPU: 14 COMMAND: "pool" PID: 45316 TASK: ffffa13118f29080 CPU: 8 COMMAND: "pool" PID: 47490 TASK: ffffa137dd6b3180 CPU: 15 COMMAND: "pool" crash> list -H super_blocks -s super_block.s_writers | grep -E '^ffff| frozen = ' | paste - - | grep -v ' = 0x0,'^C crash> crash> mount | grep -E ffffa133f8e57800 <----- frozen = 1 ffffa134fcd6b780 ffffa133f8e57800 xfs /dev/mapper/docker-253:9-134-e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874 /apps/docker/devicemapper/mnt/e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874 It looks like this bug has been idle for 6 months. Is anyone working on it or is the investigation stalled? If it's stalled, can someone identify next steps? ALso it looks like a confirmed workaround exists (https://bugzilla.redhat.com/show_bug.cgi?id=1994590#c24), and this looks like it's RHEL7 which requires special business case for any fix to be delivered (please see lifecycle page https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase and https://access.redhat.com/support/policy/updates/errata#Life_Cycle_Dates), is there a business case for a possible fix to be investigated and delivered in RHEL7? 2. Is someone actively working on this bug anymore or is it blocked? 2. Do we know if this problem exists in RHEL8, RHEL9 or upstream? There are open cases but I"m not sure how critical this is for any of the customers or if the workaround is adequate. If there's no clear business case, the workaround is usable for customers, and there's no one working on it or next steps identified, maybe this should be closed WONT_FIX in RHEL7? (In reply to Dave Wysochanski from comment #36) > It looks like this bug has been idle for 6 months. Is anyone working on it > or is the investigation stalled? If it's stalled, can someone identify next > steps? > > ALso it looks like a confirmed workaround exists > (https://bugzilla.redhat.com/show_bug.cgi?id=1994590#c24), and this looks > like it's RHEL7 which requires special business case for any fix to be > delivered (please see lifecycle page > https://access.redhat.com/support/policy/updates/ > errata#Maintenance_Support_2_Phase and > https://access.redhat.com/support/policy/updates/errata#Life_Cycle_Dates), > is there a business case for a possible fix to be investigated and delivered > in RHEL7? > 2. Is someone actively working on this bug anymore or is it blocked? > 2. Do we know if this problem exists in RHEL8, RHEL9 or upstream? > > There are open cases but I"m not sure how critical this is for any of the > customers or if the workaround is adequate. If there's no clear business > case, the workaround is usable for customers, and there's no one working on > it or next steps identified, maybe this should be closed WONT_FIX in RHEL7? @jwolfe Hi John is there any update on this BZ? Thanks, Cathy (In reply to Cathy Avery from comment #37) > (In reply to Dave Wysochanski from comment #36) > > It looks like this bug has been idle for 6 months. Is anyone working on it > > or is the investigation stalled? If it's stalled, can someone identify next > > steps? > > > > ALso it looks like a confirmed workaround exists > > (https://bugzilla.redhat.com/show_bug.cgi?id=1994590#c24), and this looks > > like it's RHEL7 which requires special business case for any fix to be > > delivered (please see lifecycle page > > https://access.redhat.com/support/policy/updates/ > > errata#Maintenance_Support_2_Phase and > > https://access.redhat.com/support/policy/updates/errata#Life_Cycle_Dates), > > is there a business case for a possible fix to be investigated and delivered > > in RHEL7? > > 2. Is someone actively working on this bug anymore or is it blocked? > > 2. Do we know if this problem exists in RHEL8, RHEL9 or upstream? > > > > There are open cases but I"m not sure how critical this is for any of the > > customers or if the workaround is adequate. If there's no clear business > > case, the workaround is usable for customers, and there's no one working on > > it or next steps identified, maybe this should be closed WONT_FIX in RHEL7? > > @jwolfe > > Hi John is there any update on this BZ? > > Thanks, > > Cathy Hi Cathy, I am still working on a fix for the hang(s) seen when doing quiesed snapshots. The quiesed snapshot code is basically the same in the versions of open-vm-tools on RHEL 7 through RHEL 9. When the fix is ready, Red Hat can provide updated open-vm-tools if needed. A patch is available for lib/syncDriver/syncDriverLinux.c to address this issue and is applicable to all earlier releases of this file. https://github.com/vmware/open-vm-tools/commit/9d458c53a7a656d4d1ba3a28d090cce82ac4af0e Track Linux filesystem id (FSID) for quiesced (frozen) filesystems Tracking the filesystem FSID along with each file descriptor (FD) as the ioctl FIFREEZE is done. An EBUSY could be seen because of an attempt to freeze the same superblock more than once depending on the OS configuration (e.g. usage of bind mounts). An EBUSY could also mean another process has locked or frozen that filesystem. When an EBUSY is received, the filesyste FSID is checked against the list of filesystems that have already be quiesced. If not previously seen, a warning that the filesystem is controlled by another process is logged and the quiesced snapshot request will be rejected. Linux quiesced snapshot requests will now be rejected if another process has frozen/locked a filesystem of interest, thus avoiding a potential block when deleting (unlink) the file '/etc/vmware-tools/quiesce_manifest.xml' as reported in this PR. The patch is in both the github "devel" and "stable-12.2.x" branches. It is in the upcoming open-vm-tools 12.2.0 release and applicable to the supported open-vm-tools releases on RHEL 7, RHEL 8 and RHEL 9. Assigning bz to Ani Verified on RHEL 7.9 with open-vm-tools-11.0.5-3.el7_9.5, and run regression testing,no new issue found, so change status to verified. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Low: open-vm-tools security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:3944 The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days |