Bug 1994590

Summary:	[ESXi][RHEL7.9][open-vm-tools] Snapshot of the RHEL7 guest on the VMWare ESXi hypervisor failed vm hangs
Product:	Red Hat Enterprise Linux 7	Reporter:	Frank DeLorey <fdelorey>
Component:	open-vm-tools	Assignee:	Ani Sinha <anisinha>
Status:	CLOSED ERRATA	QA Contact:	ldu <ldu>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	7.9	CC:	anisinha, boyang, cavery, dwysocha, eterrell, jaeshin, jen, jsaks, jsavanyo, jwolfe, ldu, leiwang, mrezanin, ossantos, ovasik, rknipp, schandle, shaselde, yacao, ymankad
Target Milestone:	rc	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	open-vm-tools-11.0.5-3.el7_9.5	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-06-29 15:09:14 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Frank DeLorey 2021-08-17 13:17:12 UTC

Description of problem:

Similar to BZ that should be fixed. Snapshot with loopback device hangs


Version-Release number of selected component (if applicable):

RHEL7.9
open-vm-tools-11.0.5-3.el7_9.1.x86_64 

How reproducible:

Random

Steps to Reproduce:
1. Create a quiesce snapshot of RHEL VM

Actual results:

VM hangs host need to be rebooted to unlock VM

Expected results:

Should not hang VM

Additional info:

The dump is here: https://galvatron.cee.redhat.com/manager/743119411

Some superblocks are already frozen, but one of them is still in progress:

crash> list -H super_blocks -s super_block.s_writers | grep -E '^ffff| frozen = ' | paste - - | grep -v ' = 0x0,'
ffff8dbe84f98000            frozen = 0x4,
ffff8dbeb9bce000            frozen = 0x3,
ffff8dc0535da000            frozen = 0x4,

crash> mount | grep -E 'ffff8dbe84f98000|ffff8dbeb9bce000|ffff8dc0535da000'
ffff8dbea23ca880 ffff8dbe84f98000 xfs    /dev/mapper/data-docker /apps/docker/
ffff8dbee8fced00 ffff8dbeb9bce000 xfs    /dev/mapper/docker-253:4-134-d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae
ffff8dc103715080 ffff8dc0535da000 xfs    /dev/mapper/docker-253:4-134-68271adddb330a361b05eaf7296c39cca24efa9491f91d8537e01686cca7750e /apps/docker/devicemapper/mnt/68271adddb330a361b05eaf7296c39cca24efa9491f91d8537e01686cca7750e

So, /apps/docker/ is frozen, but /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae is not yet. Is it BZ [1] again?..

The beginning of the issue:

[10068326.501279] device-mapper: thin: Deletion of thin device 2451459 failed.
[10069223.427935] device-mapper: thin: Deletion of thin device 2451595 failed.
[10070474.572672] device-mapper: thin: Deletion of thin device 2451787 failed.
[10070961.544940] device-mapper: thin: Deletion of thin device 2451841 failed.
[10073588.039242] device-mapper: thin: Deletion of thin device 2452232 failed.
[10074004.710749] INFO: task pool:15578 blocked for more than 120 seconds.
[10074004.710804] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10074004.710832] pool            D ffff8dcd7d85acc0     0 15578      1 0x00000080
[10074004.710836] Call Trace:
[10074004.710846]  [<ffffffff8db86dc9>] schedule+0x29/0x70
[10074004.710908]  [<ffffffffc0707d3c>] xfs_log_force_lsn+0x32c/0x370 [xfs]
[10074004.710926]  [<ffffffffc0702fb3>] __xfs_trans_commit+0x203/0x260 [xfs]
[10074004.710937]  [<ffffffffc07032d0>] xfs_trans_commit+0x10/0x20 [xfs]
[10074004.710949]  [<ffffffffc06da2d2>] xfs_sync_sb+0x72/0x80 [xfs]
[10074004.710961]  [<ffffffffc06fcb93>] xfs_log_sbcount+0x53/0x60 [xfs]
[10074004.710973]  [<ffffffffc07001a0>] xfs_quiesce_attr+0x60/0xb0 [xfs]
[10074004.710982]  [<ffffffffc0700471>] xfs_fs_freeze+0x21/0x40 [xfs]
[10074004.710986]  [<ffffffff8d65108b>] freeze_super+0xbb/0x130
[10074004.710989]  [<ffffffff8d663534>] do_vfs_ioctl+0x414/0x5b0
[10074004.710993]  [<ffffffff8d663771>] SyS_ioctl+0xa1/0xc0
[10074004.710997]  [<ffffffff8db93f92>] system_call_fastpath+0x25/0x2a

vmtoolsd provoked it:

crash> ps -g 15578
PID: 1586   TASK: ffff8dbdb65d0000  CPU: 8   COMMAND: "vmtoolsd"
  PID: 1620   TASK: ffff8dbea8bb3180  CPU: 15  COMMAND: "HangDetector"
  PID: 7235   TASK: ffff8dc269241080  CPU: 10  COMMAND: "pool"
  PID: 15098  TASK: ffff8dc9672d4200  CPU: 2   COMMAND: "pool"
  PID: 15578  TASK: ffff8dc0b4a75280  CPU: 9   COMMAND: "pool"
  PID: 23282  TASK: ffff8dbfcb818000  CPU: 10  COMMAND: "pool"
  PID: 45454  TASK: ffff8dc0b9858000  CPU: 0   COMMAND: "pool"

So, it's about taking a VM snapshot, likely.

loop devices are indeed in use:

crash> ps -m 'loop*'
[  8 14:37:27.214] [UN]  PID: 2588   TASK: ffff8dbec53a0000  CPU: 6   COMMAND: "loop1"
[  8 22:17:44.562] [UN]  PID: 2587   TASK: ffff8dbe927e8000  CPU: 10  COMMAND: "loop0"

crash> bt 2587 2588
PID: 2587   TASK: ffff8dbe927e8000  CPU: 10  COMMAND: "loop0"
 #0 [ffff8dbec4fefc28] __schedule at ffffffff8db868a7
 #1 [ffff8dbec4fefcb8] schedule at ffffffff8db86dc9
 #2 [ffff8dbec4fefcc8] __sb_start_write at ffffffff8d650d11
 #3 [ffff8dbec4fefd30] __do_lo_send_write at ffffffffc0537e77 [loop]
 #4 [ffff8dbec4fefd78] do_lo_send_direct_write at ffffffffc0537f28 [loop]
 #5 [ffff8dbec4fefdb0] do_bio_filebacked at ffffffffc0538d7a [loop]
 #6 [ffff8dbec4fefe58] loop_thread at ffffffffc05395a1 [loop]
 #7 [ffff8dbec4fefec8] kthread at ffffffff8d4c5e71

PID: 2588   TASK: ffff8dbec53a0000  CPU: 6   COMMAND: "loop1"
 #0 [ffff8dbec5387c28] __schedule at ffffffff8db868a7
 #1 [ffff8dbec5387cb8] schedule at ffffffff8db86dc9
 #2 [ffff8dbec5387cc8] __sb_start_write at ffffffff8d650d11
 #3 [ffff8dbec5387d30] __do_lo_send_write at ffffffffc0537e77 [loop]
 #4 [ffff8dbec5387d78] do_lo_send_direct_write at ffffffffc0537f28 [loop]
 #5 [ffff8dbec5387db0] do_bio_filebacked at ffffffffc0538d7a [loop]
 #6 [ffff8dbec5387e58] loop_thread at ffffffffc05395a1 [loop]
 #7 [ffff8dbec5387ec8] kthread at ffffffff8d4c5e71

But the customer should already use the open-vm-tools package version with required fix as promised in the BZ [1]:

open-vm-tools-11.0.5-3.el7_9.1.x86_64                       Sun Apr 11 01:47:54 2021

https://github.com/vmware/open-vm-tools/commit/d58847b497e212737007958c945af1df22a8ab58.patch

From d58847b497e212737007958c945af1df22a8ab58 Mon Sep 17 00:00:00 2001
From: Oliver Kurth <okurth>
Date: Fri, 2 Aug 2019 11:07:21 -0700
Subject: [PATCH] Avoid freezing mount points for same device.

Loopback device setup could cause a cyclic dependency
between 2 mount points. In order to break the cycle,
avoid freezing the mount points to the same device.

This change also skips some system mount points for 'tmpfs'
and 'cgroup' etc as those share the same device/FS name.
This is fine because we can't quiese those mount points
anyway (system mount points don't support quiescing).

It does look like the ordering of the freezing was incorrect, hence the BZ might not have addressed the issue fully.

[1] https://bugzilla.redhat.com/1718861

This should have been picked up in rebase here: https://bugzilla.redhat.com/show_bug.cgi?id=1806675

Comment 3 Cathy Avery 2021-08-17 16:16:11 UTC

(In reply to Frank DeLorey from comment #0)

The patch you refer to https://gitlab.com/cavery/open-vm-tools/-/commit/5cb75e0a93b9531875e9e29515a0ea912ca3aca7

Is definitely in open-vm-tools-11.0.5-3.el7_9.1.x86_64. 

So if this patch was not enough we will have to kick it back to Vmware for further analysis.

Cathy

Comment 8 John Savanyo 2021-08-19 18:20:45 UTC

This might be duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1880404

Comment 12 Jay Shin 2022-02-01 12:22:46 UTC

Hi Vmware team,

There is same issue occuring from same customer with another VM.
Customer is using later version of open-vm-tools as below.

Is there any update on this issue?

open-vm-tools-11.0.5-3.el7_9.3.x86_64                       Sun Sep 26 02:48:10 2021

      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-1160.36.2.el7.x86_64/vmlinux  [TAINTED]
    DUMPFILE: /cores/retrace/tasks/792323210/crash/vmcore  [PARTIAL DUMP]
        CPUS: 16
        DATE: Fri Jan 28 01:15:22 GMT 2022
      UPTIME: 123 days, 06:45:54
LOAD AVERAGE: 202.07, 201.02, 198.65
       TASKS: 2589
    NODENAME: __removed__
     RELEASE: 3.10.0-1160.36.2.el7.x86_64
     VERSION: #1 SMP Thu Jul 8 02:53:40 UTC 2021
     MACHINE: x86_64  (2699 Mhz)
      MEMORY: 64 GB
       PANIC: "Kernel panic - not syncing: NMI: Not continuing"

After dm-14 is being unmounted the task pool, pid 85586 is blocked for more than 120 seconds on sync filesystem operation.

[10610970.978590] XFS (dm-14): Unmounting Filesystem
[10611182.411074] INFO: task pool:85586 blocked for more than 120 seconds.
[10611182.411156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10611182.411185] pool            D ffff9d4e091205e0     0 85586      1 0x00000080
[10611182.411190] Call Trace:
[10611182.411200]  [<ffffffffb87874a0>] ? bit_wait+0x50/0x50
[10611182.411203]  [<ffffffffb8789199>] schedule+0x29/0x70
[10611182.411206]  [<ffffffffb8786e61>] schedule_timeout+0x221/0x2d0
[10611182.411209]  [<ffffffffb87874a0>] ? bit_wait+0x50/0x50
[10611182.411212]  [<ffffffffb8788a4d>] io_schedule_timeout+0xad/0x130
[10611182.411214]  [<ffffffffb8788ae8>] io_schedule+0x18/0x20
[10611182.411217]  [<ffffffffb87874b1>] bit_wait_io+0x11/0x50
[10611182.411219]  [<ffffffffb8786fd7>] __wait_on_bit+0x67/0x90
[10611182.411224]  [<ffffffffb81bd3a1>] wait_on_page_bit+0x81/0xa0
[10611182.411230]  [<ffffffffb80c6ff0>] ? wake_bit_function+0x40/0x40
[10611182.411232]  [<ffffffffb81bd4d1>] __filemap_fdatawait_range+0x111/0x190
[10611182.411237]  [<ffffffffb80bc16a>] ? __queue_delayed_work+0xaa/0x1a0
[10611182.411243]  [<ffffffffb83a67cd>] ? list_del+0xd/0x30
[10611182.411246]  [<ffffffffb878956d>] ? wait_for_completion+0x11d/0x140
[10611182.411249]  [<ffffffffb81c0ba7>] filemap_fdatawait_keep_errors+0x27/0x30
[10611182.411254]  [<ffffffffb827f62a>] sync_inodes_sb+0x37a/0x3d0
[10611182.411257]  [<ffffffffb82842b6>] sync_filesystem+0x66/0xb0
[10611182.411260]  [<ffffffffb825186f>] freeze_super+0x8f/0x130
[10611182.411263]  [<ffffffffb8263af4>] do_vfs_ioctl+0x414/0x5b0
[10611182.411266]  [<ffffffffb8254494>] ? SYSC_newfstat+0x34/0x60
[10611182.411268]  [<ffffffffb8263d31>] SyS_ioctl+0xa1/0xc0
[10611182.411273]  [<ffffffffb8795f92>] system_call_fastpath+0x25/0x2a

crash> pd 10611182-10610970
$2 = 212

There are 202 Uninterruptible tasks in the queues.

crash> ps -S
  RU: 17
  IN: 2367
  UN: 202
  ZO: 3

crash> ps -m | grep UN | tail
[  0 11:11:18.744] [UN]  PID: 123137  TASK: ffff9d4c8cb2d280  CPU: 14  COMMAND: "kworker/u256:8"
[  0 11:11:19.652] [UN]  PID: 70349  TASK: ffff9d4c8d200000  CPU: 14  COMMAND: "fce7aa1a16a855e"
[  0 11:11:20.113] [UN]  PID: 80754  TASK: ffff9d4ba3689080  CPU: 3   COMMAND: "kworker/3:3"
[  0 11:11:20.083] [UN]  PID: 85014  TASK: ffff9d4dea2f5280  CPU: 4   COMMAND: "TaskSchedulerBa"
[  0 11:11:20.122] [UN]  PID: 2225   TASK: ffff9d599b9a2100  CPU: 3   COMMAND: "dockerd"
[  0 11:11:20.187] [UN]  PID: 85586  TASK: ffff9d4e09120000  CPU: 13  COMMAND: "pool"
[  0 11:11:20.059] [UN]  PID: 2271   TASK: ffff9d59addc6300  CPU: 7   COMMAND: "loop0"
[  0 11:11:20.159] [UN]  PID: 85514  TASK: ffff9d4df9259080  CPU: 12  COMMAND: "openbox"
[  0 11:11:20.185] [UN]  PID: 85522  TASK: ffff9d4dfede4200  CPU: 14  COMMAND: "node"
[  0 11:11:20.280] [UN]  PID: 84921  TASK: ffff9d4b54834200  CPU: 3   COMMAND: "chrome"

The oldest UN task is the task chome, pid 84921 which is scheduled for filesystem operation.

crash> ps -p 84921
PID: 0      TASK: ffffffffb8c18480  CPU: 0   COMMAND: "swapper/0"
 PID: 1      TASK: ffff9d4b195d8000  CPU: 12  COMMAND: "systemd"
  PID: 102478  TASK: ffff9d4a7a6a2100  CPU: 10  COMMAND: "containerd"
   PID: 84612  TASK: ffff9d4e0151d280  CPU: 2   COMMAND: "containerd-shim"
    PID: 84625  TASK: ffff9d59bb19e300  CPU: 5   COMMAND: "bash"
     PID: 84899  TASK: ffff9d49f57cd280  CPU: 1   COMMAND: "node"
      PID: 84920  TASK: ffff9d4b1928c200  CPU: 9   COMMAND: "751c6f808f53b35"
       PID: 84921  TASK: ffff9d4b54834200  CPU: 3   COMMAND: "chrome"

crash> set 84921
    PID: 84921
COMMAND: "chrome"
   TASK: ffff9d4b54834200  [THREAD_INFO: ffff9d4c66510000]
    CPU: 3
  STATE: TASK_UNINTERRUPTIBLE 

crash> bt
PID: 84921  TASK: ffff9d4b54834200  CPU: 3   COMMAND: "chrome"
 #0 [ffff9d4c66513b70] __schedule at ffffffffb8788dc8
 #1 [ffff9d4c66513bd8] schedule at ffffffffb8789199
 #2 [ffff9d4c66513be8] __sb_start_write at ffffffffb8251521
 #3 [ffff9d4c66513c50] __xfs_filemap_fault at ffffffffc082d9af [xfs]
 #4 [ffff9d4c66513cb0] xfs_filemap_page_mkwrite at ffffffffc082dae5 [xfs]
 #5 [ffff9d4c66513cc0] do_page_mkwrite at ffffffffb81ee88a
 #6 [ffff9d4c66513d40] do_wp_page at ffffffffb81f2347
 #7 [ffff9d4c66513de8] handle_mm_fault at ffffffffb81f6682
 #8 [ffff9d4c66513eb0] __do_page_fault at ffffffffb8790653
 #9 [ffff9d4c66513f20] do_page_fault at ffffffffb8790975
#10 [ffff9d4c66513f50] page_fault at ffffffffb878c778

Below docker mount is starting to write.

crash> mount|grep ffff9d5375fda000
ffff9d59ad89a700 ffff9d5375fda000 xfs    /dev/mapper/docker-253:3-134-e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b /apps/docker/devicemapper/mnt/e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b

But the filesystem for docker-253:3 is in FREEZE state and some more.

SB: 0xffff9d5375fda000, frozen=SB_FREEZE_COMPLETE, /apps/docker/devicemapper/mnt/e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b/ (dm-13) [xfs], (locked,umount)

SB: 0xffff9d59ba195000, frozen=SB_FREEZE_COMPLETE, /apps/docker/ (dm-3) [xfs], (locked,umount)
SB: 0xffff9d4b4be5a800, frozen=SB_FREEZE_PAGEFAULT, /apps/docker/devicemapper/mnt/b36a4f0070f1f6c670ce97c5445a519e515707c639de28a56c83469e1aaac93d/ (dm-12) [xfs], (locked,umount)
SB: 0xffff9d4b6b6a3000, frozen=SB_FREEZE_COMPLETE, /apps/docker/devicemapper/mnt/5ee9f7a410f0bb85f7b3e48fdded33655f69c15ca8b4d5ccad34913f8cb97300/ (dm-15) [xfs], (locked,umount)

vmtoolsd did not provok it:

crash> ps -g 1587
PID: 1587   TASK: ffff9d49f5469080  CPU: 8   COMMAND: "vmtoolsd"
  PID: 1614   TASK: ffff9d59b9a84200  CPU: 12  COMMAND: "HangDetector"
  PID: 85586  TASK: ffff9d4e09120000  CPU: 13  COMMAND: "pool"
  PID: 91690  TASK: ffff9d4df1b9e300  CPU: 10  COMMAND: "pool"
  PID: 95792  TASK: ffff9d4e9d6eb180  CPU: 12  COMMAND: "pool"
  PID: 108196  TASK: ffff9d4c5b375280  CPU: 4   COMMAND: "pool"
  PID: 123459  TASK: ffff9d4ddca9d280  CPU: 5   COMMAND: "pool"

The loop devices are also blocked.

[10611182.411729] INFO: task loop0:2271 blocked for more than 120 seconds.
[10611182.411756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10611182.411801] loop0           D ffff9d59addc68e0     0  2271      2 0x00000080
[10611182.411803] Call Trace:
[10611182.411807]  [<ffffffffb8789199>] schedule+0x29/0x70
[10611182.411810]  [<ffffffffb8251521>] __sb_start_write+0xe1/0x120
[10611182.411813]  [<ffffffffb80c6f30>] ? wake_up_atomic_t+0x30/0x30
[10611182.411817]  [<ffffffffc06a5e77>] __do_lo_send_write+0xc7/0x120 [loop]
[10611182.411820]  [<ffffffffc06a5f28>] do_lo_send_direct_write+0x58/0x70 [loop]
[10611182.411824]  [<ffffffffc06a6d7a>] do_bio_filebacked+0x23a/0x320 [loop]
[10611182.411827]  [<ffffffffc06a5ed0>] ? __do_lo_send_write+0x120/0x120 [loop]
[10611182.411830]  [<ffffffffb80d31d1>] ? __wake_up_common_lock+0x91/0xc0
[10611182.411832]  [<ffffffffb80d2c40>] ? task_rq_unlock+0x20/0x20
[10611182.411836]  [<ffffffffc06a75a1>] loop_thread+0x101/0x2e0 [loop]
[10611182.411838]  [<ffffffffb80c6f30>] ? wake_up_atomic_t+0x30/0x30
[10611182.411841]  [<ffffffffc06a74a0>] ? loop_attr_do_show_backing_file+0xb0/0xb0 [loop]
[10611182.411844]  [<ffffffffb80c5e41>] kthread+0xd1/0xe0
[10611182.411847]  [<ffffffffb80c5d70>] ? insert_kthread_work+0x40/0x40
[10611182.411850]  [<ffffffffb8795df7>] ret_from_fork_nospec_begin+0x21/0x21
[10611182.411853]  [<ffffffffb80c5d70>] ? insert_kthread_work+0x40/0x40

crash> ps|grep loop
   2271      2   7  ffff9d59addc6300  UN   0.0       0      0  [loop0]
   2272      2  10  ffff9d59bbb58000  UN   0.0       0      0  [loop1]

crash> bt 2271 2272
PID: 2271   TASK: ffff9d59addc6300  CPU: 7   COMMAND: "loop0"
 #0 [ffff9d59a242fc50] __schedule at ffffffffb8788dc8
 #1 [ffff9d59a242fcb8] schedule at ffffffffb8789199
 #2 [ffff9d59a242fcc8] __sb_start_write at ffffffffb8251521
 #3 [ffff9d59a242fd30] __do_lo_send_write at ffffffffc06a5e77 [loop]
 #4 [ffff9d59a242fd78] do_lo_send_direct_write at ffffffffc06a5f28 [loop]
 #5 [ffff9d59a242fdb0] do_bio_filebacked at ffffffffc06a6d7a [loop]
 #6 [ffff9d59a242fe58] loop_thread at ffffffffc06a75a1 [loop]
 #7 [ffff9d59a242fec8] kthread at ffffffffb80c5e41

PID: 2272   TASK: ffff9d59bbb58000  CPU: 10  COMMAND: "loop1"
 #0 [ffff9d59a24e7c50] __schedule at ffffffffb8788dc8
 #1 [ffff9d59a24e7cb8] schedule at ffffffffb8789199
 #2 [ffff9d59a24e7cc8] __sb_start_write at ffffffffb8251521
 #3 [ffff9d59a24e7d30] __do_lo_send_write at ffffffffc06a5e77 [loop]
 #4 [ffff9d59a24e7d78] do_lo_send_direct_write at ffffffffc06a5f28 [loop]
 #5 [ffff9d59a24e7db0] do_bio_filebacked at ffffffffc06a6d7a [loop]
 #6 [ffff9d59a24e7e58] loop_thread at ffffffffc06a75a1 [loop]
 #7 [ffff9d59a24e7ec8] kthread at ffffffffb80c5e41

Comment 14 John Wolfe 2022-03-26 17:09:08 UTC

From the description comment #0 - Frank DeLorey:

> Actual results:
>
> VM hangs host need to be rebooted to unlock VM

Are you stating that the ESXi host needed to be rebooted, or simply that the VM needed to be powered of and rebooted?

I am assuming the later, but asking for clarity.

> Some superblocks are already frozen, but one of them is still in progress:
>
> crash> list -H super_blocks -s super_block.s_writers | grep -E '^ffff| frozen = ' | paste - - | grep -v ' = 0x0,'
> ffff8dbe84f98000            frozen = 0x4,
> ffff8dbeb9bce000            frozen = 0x3,
> ffff8dc0535da000            frozen = 0x4,

If the dump is still available, is it possible to determine:

1. What process is currently doing the fsfreeze on superblock ffff8dbeb9bce000 - /dev/mapper/docker-253:4-134-d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae /apps/docker/devicemapper?

2. Is vmtoolsd blocked and if so, what is the file to which I/O is being blocked?

Given that the filesystem freeze(s) are currently still happening, and assuming that it is vmtoosd vmsvc doing the quiesing, this cannot be the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=1880404.  BZ 1880404 occurred after the snapshot was completed and the quiesce_manifest.xml is being replaced.


(In reply to Jay Shin from comment #12)

If we are in the midst of taking a quiesed snapshot, other processes will be blocked if they attempt to write to a frozen filesystem.  The question is what is the natire of the vlockage and do we have another process freezing any filesystem?

Jay has made the statement that

>
> vmtoolsd did not provok it:
>

Wondering what the basis is for this statement?

Again the same questions as above.

- Is vmtoolsd tangled up in this?   Does vmtoolsd show up in the output of "ps -m | grep UN"?

If so. what superblock (filesystem) is it blocked by and what process has frozen that superblock?

Cathy,
Can you redirect these questions as appropriate?   Hopefully the dump is still available.

Thanks.

Comment 15 Cathy Avery 2022-03-28 12:37:00 UTC

Hi Jay and Frank,

Could you provide any information needed in comment 14?

Also John W do you want them to ask the customer to try the workaround from bz 2066140 

Created file /etc/vmware-tools/tools.conf with following entry ( entries are case sensitive )

[vmbackup]
enableXmlManifest = false 

Thanks,

Cathy

Comment 16 Frank DeLorey 2022-03-28 13:17:51 UTC

Sorry the dump is no longer available. I believe they needed to reboot the VM not the ESX host itself. We can ask them to try the workaround if it is believed to be applicable to this problem.

Frank

Comment 17 John Wolfe 2022-03-28 15:21:58 UTC

(In reply to Cathy Avery from comment #15)

> Also John W do you want them to ask the customer to try the workaround from
> bz 2066140 
> 
> Created file /etc/vmware-tools/tools.conf with following entry ( entries are
> case sensitive )
> 
> [vmbackup]
> enableXmlManifest = false 
> 
> Thanks,
> 
> Cathy

If they have not already done so, yes that would be advisable.

Comment 18 Jay Shin 2022-03-29 06:41:36 UTC

(In reply to Frank DeLorey from comment #16)
> Sorry the dump is no longer available. I believe they needed to reboot the
> VM not the ESX host itself. We can ask them to try the workaround if it is
> believed to be applicable to this problem.
> 
> Frank

I have retraced it again.

retrace-server-interact 333400596 crash

vmtoolsd was not in UN state.

crash> ps 1586
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
   1586      1   8  ffff8dbdb65d0000  IN   0.0  580840   5548  vmtoolsd

crash> bt 1586
PID: 1586   TASK: ffff8dbdb65d0000  CPU: 8   COMMAND: "vmtoolsd"
 #0 [ffff8dbe8d3f7978] __schedule at ffffffff8db868a7
 #1 [ffff8dbe8d3f7a08] schedule at ffffffff8db86dc9
 #2 [ffff8dbe8d3f7a18] schedule_hrtimeout_range_clock at ffffffff8db85fb2
 #3 [ffff8dbe8d3f7ab0] schedule_hrtimeout_range at ffffffff8db86063
 #4 [ffff8dbe8d3f7ac0] poll_schedule_timeout at ffffffff8d664115
 #5 [ffff8dbe8d3f7af0] do_sys_poll at ffffffff8d6658ad
 #6 [ffff8dbe8d3f7f10] sys_poll at ffffffff8d665ab4
 #7 [ffff8dbe8d3f7f50] system_call_fastpath at ffffffff8db93f92
    RIP: 00007fa8238aec3d  RSP: 00007ffd566908f0  RFLAGS: 00000246
    RAX: 0000000000000007  RBX: 000055690db5bb50  RCX: 000055690db55d00
    RDX: 0000000000000064  RSI: 0000000000000002  RDI: 000055690db50cf0
    RBP: 0000000000000002   R8: 0000000000000007   R9: 0000000000000000
    R10: 00007fa824ff4f88  R11: 0000000000000293  R12: 000055690db50cf0
    R13: 0000000000000064  R14: 00007fa823be5420  R15: 0000000000000007
    ORIG_RAX: 0000000000000007  CS: 0033  SS: 002b

> 1. What process is currently doing the fsfreeze on superblock ffff8dbeb9bce000 - /dev/mapper/docker-253:4-134-d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae /apps/docker/devicemapper?

crash> search -t ffff8dbeb9bce000 | grep PID
PID: 7235   TASK: ffff8dc269241080  CPU: 10  COMMAND: "pool"
PID: 15098  TASK: ffff8dc9672d4200  CPU: 2   COMMAND: "pool"
PID: 15578  TASK: ffff8dc0b4a75280  CPU: 9   COMMAND: "pool"
PID: 23282  TASK: ffff8dbfcb818000  CPU: 10  COMMAND: "pool"
PID: 41513  TASK: ffff8dbec7bf6300  CPU: 7   COMMAND: "minion-lightwei"
PID: 45454  TASK: ffff8dc0b9858000  CPU: 0   COMMAND: "pool"

crash> ps -m 7235 15098 15578 23282 41513 45454
[  0 16:40:13.106] [IN]  PID: 41513  TASK: ffff8dbec7bf6300  CPU: 7   COMMAND: "minion-lightwei"
[  6 22:18:05.152] [UN]  PID: 45454  TASK: ffff8dc0b9858000  CPU: 0   COMMAND: "pool"
[  7 21:52:46.476] [UN]  PID: 15098  TASK: ffff8dc9672d4200  CPU: 2   COMMAND: "pool"
[  7 22:07:52.783] [UN]  PID: 7235   TASK: ffff8dc269241080  CPU: 10  COMMAND: "pool"
[  8 22:02:38.551] [UN]  PID: 23282  TASK: ffff8dbfcb818000  CPU: 10  COMMAND: "pool"
[  8 22:17:44.586] [UN]  PID: 15578  TASK: ffff8dc0b4a75280  CPU: 9   COMMAND: "pool"

crash> bt 45454
PID: 45454  TASK: ffff8dc0b9858000  CPU: 0   COMMAND: "pool"
 #0 [ffff8dc27b72bcc8] __schedule at ffffffff8db868a7
 #1 [ffff8dc27b72bd50] schedule at ffffffff8db86dc9
 #2 [ffff8dc27b72bd60] rwsem_down_write_failed at ffffffff8db88505
 #3 [ffff8dc27b72bdf0] call_rwsem_down_write_failed at ffffffff8d797757
 #4 [ffff8dc27b72be38] down_write at ffffffff8db860fd
 #5 [ffff8dc27b72be50] freeze_super at ffffffff8d650ff6
 #6 [ffff8dc27b72be80] do_vfs_ioctl at ffffffff8d663534
 #7 [ffff8dc27b72bf00] sys_ioctl at ffffffff8d663771
 #8 [ffff8dc27b72bf50] system_call_fastpath at ffffffff8db93f92

The file ffff8dbeeab4a600 is being accessed.

     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff8dcd59cf4e40 ffff8dbdb600b208 ffff8dbeb9bce000 DIR  /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae/

file operations = ffffffffc0724380 (r) xfs_dir_file_operations [xfs]
file open mode = Read-Only (0x8000)
file size = 28 bytes, ino = 256, link count = 3
	uid = 0, gid = 0

ffff8dbee8fced00 ffff8dbeb9bce000 xfs    /dev/mapper/docker-253:4-134-d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae

crash> rw_semaphore 0xffff8dbeb9bce068 -x
struct rw_semaphore {
  {
    count = {
      counter = 0xfffffffe00000001
    },
    __UNIQUE_ID_rh_kabi_hide4 = {
      count = 0xfffffffe00000001
    },
    {<No data fields>}
  },
  wait_lock = {
    raw_lock = {
      val = {
        counter = 0x0
      }
    }
  },
  osq = {
    tail = {
      counter = 0x0
    }
  },
  wait_list = {
    next = 0xffff8dc0804c3d90
  },
  owner = 0xffff8dc0b4a75280
}

semaphore owner.

crash> ps -m 15578
[  8 22:17:44.586] [UN]  PID: 15578  TASK: ffff8dc0b4a75280  CPU: 9   COMMAND: "pool"

crash> bt 0xffff8dc0b4a75280
PID: 15578  TASK: ffff8dc0b4a75280  CPU: 9   COMMAND: "pool"
 #0 [ffff8dc0774a3c78] __schedule at ffffffff8db868a7
 #1 [ffff8dc0774a3d08] schedule at ffffffff8db86dc9
 #2 [ffff8dc0774a3d18] xfs_log_force_lsn at ffffffffc0707d3c [xfs]
 #3 [ffff8dc0774a3d98] __xfs_trans_commit at ffffffffc0702fb3 [xfs]
 #4 [ffff8dc0774a3dd8] xfs_trans_commit at ffffffffc07032d0 [xfs]
 #5 [ffff8dc0774a3de8] xfs_sync_sb at ffffffffc06da2d2 [xfs]
 #6 [ffff8dc0774a3e10] xfs_log_sbcount at ffffffffc06fcb93 [xfs]
 #7 [ffff8dc0774a3e20] xfs_quiesce_attr at ffffffffc07001a0 [xfs]
 #8 [ffff8dc0774a3e38] xfs_fs_freeze at ffffffffc0700471 [xfs]
 #9 [ffff8dc0774a3e50] freeze_super at ffffffff8d65108b
#10 [ffff8dc0774a3e80] do_vfs_ioctl at ffffffff8d663534
#11 [ffff8dc0774a3f00] sys_ioctl at ffffffff8d663771
#12 [ffff8dc0774a3f50] system_call_fastpath at ffffffff8db93f92

struct file ffff8dbeb0b8df00

     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff8dcd59cf4e40 ffff8dbdb600b208 ffff8dbeb9bce000 DIR  /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae/

file operations = ffffffffc0724380 (r) xfs_dir_file_operations [xfs]
file open mode = Read-Only (0x8000)
file size = 28 bytes, ino = 256, link count = 3
	uid = 0, gid = 0

== Mount Info ==
ffff8dbee8fced00 ffff8dbeb9bce000 xfs    /dev/mapper/docker-253:4-134-d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae /apps/docker/devicemapper/mnt/d0af606bb651797f0b2e0e9da97318dd10cf5da7910ce8c7d79c8333f1853fae




> Again the same questions as above.

> - Is vmtoolsd tangled up in this?   Does vmtoolsd show up in the output of "ps -m | grep UN"?

> If so. what superblock (filesystem) is it blocked by and what process has frozen that superblock?

crash> ps -m 1587
[  0 00:00:00.000] [IN]  PID: 1587   TASK: ffff9d49f5469080  CPU: 8   COMMAND: "vmtoolsd"

crash> bt 1587
PID: 1587   TASK: ffff9d49f5469080  CPU: 8   COMMAND: "vmtoolsd"
 #0 [ffff9d59b948f9a0] __schedule at ffffffffb8788dc8
 #1 [ffff9d59b948fa08] schedule at ffffffffb8789199
 #2 [ffff9d59b948fa18] schedule_hrtimeout_range_clock at ffffffffb8788562
 #3 [ffff9d59b948fab0] schedule_hrtimeout_range at ffffffffb8788613
 #4 [ffff9d59b948fac0] poll_schedule_timeout at ffffffffb82646d5
 #5 [ffff9d59b948faf0] do_sys_poll at ffffffffb8265e6d
 #6 [ffff9d59b948ff10] sys_poll at ffffffffb8266074
 #7 [ffff9d59b948ff50] system_call_fastpath at ffffffffb8795f92

crash> mount|grep ffff9d5375fda000
ffff9d59ad89a700 ffff9d5375fda000 xfs    /dev/mapper/docker-253:3-134-e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b /apps/docker/devicemapper/mnt/e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b

crash> search -t ffff9d5375fda000 | grep PID
PID: 80930  TASK: ffff9d59aa181080  CPU: 2   COMMAND: "kworker/2:2"
PID: 84910  TASK: ffff9d59bb749080  CPU: 8   COMMAND: "node"
PID: 84911  TASK: ffff9d4b4ae85280  CPU: 10  COMMAND: "node"
PID: 84921  TASK: ffff9d4b54834200  CPU: 3   COMMAND: "chrome"
PID: 84989  TASK: ffff9d4b19333180  CPU: 11  COMMAND: "Chrome_DBThread"
PID: 84994  TASK: ffff9d59a6da2100  CPU: 4   COMMAND: "Chrome_IOThread"
PID: 85013  TASK: ffff9d4dea2f2100  CPU: 5   COMMAND: "Chrome_HistoryT"
PID: 120708  TASK: ffff9d59a879a100  CPU: 1   COMMAND: "minion-lightwei"

crash> ps -m 80930 84910 84911 84921 84989 84994 85013 120708
[  0 00:04:19.729] [IN]  PID: 120708  TASK: ffff9d59a879a100  CPU: 1   COMMAND: "minion-lightwei"
[  0 11:06:24.847] [UN]  PID: 80930  TASK: ffff9d59aa181080  CPU: 2   COMMAND: "kworker/2:2"
[  0 11:10:53.318] [UN]  PID: 84989  TASK: ffff9d4b19333180  CPU: 11  COMMAND: "Chrome_DBThread"
[  0 11:11:07.502] [UN]  PID: 84994  TASK: ffff9d59a6da2100  CPU: 4   COMMAND: "Chrome_IOThread"
[  0 11:11:12.644] [UN]  PID: 85013  TASK: ffff9d4dea2f2100  CPU: 5   COMMAND: "Chrome_HistoryT"
[  0 11:11:20.280] [UN]  PID: 84921  TASK: ffff9d4b54834200  CPU: 3   COMMAND: "chrome"
[  0 11:11:23.760] [IN]  PID: 84910  TASK: ffff9d59bb749080  CPU: 8   COMMAND: "node"
[  0 11:11:23.766] [IN]  PID: 84911  TASK: ffff9d4b4ae85280  CPU: 10  COMMAND: "node"

crash> set 84989
    PID: 84989
COMMAND: "Chrome_DBThread"
   TASK: ffff9d4b19333180  [THREAD_INFO: ffff9d4c3314c000]
    CPU: 11
  STATE: TASK_UNINTERRUPTIBLE 

crash> bt
PID: 84989  TASK: ffff9d4b19333180  CPU: 11  COMMAND: "Chrome_DBThread"
 #0 [ffff9d4c3314fb70] __schedule at ffffffffb8788dc8
 #1 [ffff9d4c3314fbd8] schedule at ffffffffb8789199
 #2 [ffff9d4c3314fbe8] __sb_start_write at ffffffffb8251521
 #3 [ffff9d4c3314fc50] __xfs_filemap_fault at ffffffffc082d9af [xfs]
 #4 [ffff9d4c3314fcb0] xfs_filemap_page_mkwrite at ffffffffc082dae5 [xfs]
 #5 [ffff9d4c3314fcc0] do_page_mkwrite at ffffffffb81ee88a
 #6 [ffff9d4c3314fd40] do_wp_page at ffffffffb81f2347
 #7 [ffff9d4c3314fde8] handle_mm_fault at ffffffffb81f6682
 #8 [ffff9d4c3314feb0] __do_page_fault at ffffffffb8790653
 #9 [ffff9d4c3314ff20] do_page_fault at ffffffffb8790975
#10 [ffff9d4c3314ff50] page_fault at ffffffffb878c778

crash> mount|grep ffff9d5375fda000
ffff9d59ad89a700 ffff9d5375fda000 xfs    /dev/mapper/docker-253:3-134-e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b /apps/docker/devicemapper/mnt/e9c571f885f4068dc99e2a14308a6b3c3e4fe5b8cddec5ad2194c791dd47317b

crash> ps -p 84989
PID: 0      TASK: ffffffffb8c18480  CPU: 0   COMMAND: "swapper/0"
 PID: 1      TASK: ffff9d4b195d8000  CPU: 12  COMMAND: "systemd"
  PID: 102478  TASK: ffff9d4a7a6a2100  CPU: 10  COMMAND: "containerd"
   PID: 84612  TASK: ffff9d4e0151d280  CPU: 2   COMMAND: "containerd-shim"
    PID: 84625  TASK: ffff9d59bb19e300  CPU: 5   COMMAND: "bash"
     PID: 84899  TASK: ffff9d49f57cd280  CPU: 1   COMMAND: "node"
      PID: 84920  TASK: ffff9d4b1928c200  CPU: 9   COMMAND: "751c6f808f53b35"
       PID: 84989  TASK: ffff9d4b19333180  CPU: 11  COMMAND: "Chrome_DBThread"

Comment 20 John Wolfe 2022-03-30 14:15:13 UTC

From Jay's response in comment 18, vmtoolsd is not blocked by a frozen file and that another process has frozen filesystem(s).

Can it be determined if vmtoolsd currently has any filesystem(s) frozen which would indicate that a quiesed snapshot is under way?

It is conceivable that vmtoolsd could get blocked here and is another reason for the quiesed snapshot to fail gracefully if it cannot get access to all applicable mounted filesystems.

Comment 24 Oscar Santos 2022-04-08 16:52:32 UTC

Hello,

o We have had our customer test the below fix

Create file /etc/vmware-tools/tools.conf with following entry ( entries are case sensitive )

[vmbackup]
enableXmlManifest = false 


o And they are reporting that there have not been any hangs since.

Comment 25 Robert McSwain 2022-04-15 15:22:51 UTC

Customer in case 03011147 has reported they have had another hang and we have requested they collect a vmcore. Details from custoMer below

~~~
we had another Docker device issue today.   But looks to have started March.   I've forced a Core Dump using NMI at the VM Console. 

Here's the email I sent to my team. 

Now Docker is unable to delete a device and XFS inode error.

2022-04-14T12:20:27.281993-05:00 nclpvnpapp10400 kernel: [2274616.594457] loop: Write error at byte offset 483581952, length 4096.
2022-04-14T12:20:27.285066-05:00 nclpvnpapp10400 kernel: [2274616.597477] loop: Write error at byte offset 482996224, length 4096.
2022-04-14T12:20:27.285587-05:00 nclpvnpapp10400 kernel: [2274616.597949] XFS (dm-13): metadata I/O error in "xfs_buf_iodone_callback_error" at daddr 0x1eaef8 len 8 error 28

The “dm-13” is this device….

[root@nclpvnpapp10400 systemdump]# find /dev/ -name "dm-*" -exec readlink -n {} \; -exec echo " -->" {} \;
../../dm-13 --> /dev/disk/by-id/dm-name-docker-253:9-134-e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874


Is found on both “loop” devices….

loop0                                                7:0    0  100G  0 loop 
`-docker-253:9-134-pool                            253:10   0  100G  0 dm   
  |-docker-253:9-134-b14647e3c4e045e56343bc4f5dc0dc91a5b1b40b069840ce1531b7d50f0b6572
                                                   253:11   0   10G  0 dm   /apps/docker/devicemapper/mnt/b14647e3c4e045e56343bc4f5dc0dc91a5b1b40b069840ce1531b7d50f0b6572
  |-docker-253:9-134-b1d4eeb84d3d15931ec52f03eab558fea101c20ba5e6dc383469ea51d3fc8cd9
                                                   253:12   0   10G  0 dm   
  `-docker-253:9-134-e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874
                                                   253:13   0   10G  0 dm   /apps/docker/devicemapper/mnt/e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874
loop1                                                7:1    0    2G  0 loop 
`-docker-253:9-134-pool                            253:10   0  100G  0 dm   
  |-docker-253:9-134-b14647e3c4e045e56343bc4f5dc0dc91a5b1b40b069840ce1531b7d50f0b6572
                                                   253:11   0   10G  0 dm   /apps/docker/devicemapper/mnt/b14647e3c4e045e56343bc4f5dc0dc91a5b1b40b069840ce1531b7d50f0b6572
  |-docker-253:9-134-b1d4eeb84d3d15931ec52f03eab558fea101c20ba5e6dc383469ea51d3fc8cd9
                                                   253:12   0   10G  0 dm   
  `-docker-253:9-134-e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874
                                                   253:13   0   10G  0 dm   /apps/docker/devicemapper/mnt/e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874


This looks to have started back in March….

[ah02870@nclpvnpapp10400 log]$ grep 994269 messages-20220327 | more
2022-03-20T17:23:50.353896-05:00 nclpvnpapp10400 NetworkManager[1616]: <info>  [1647815030.3537] manager: (veth9942692): new Veth device (/org/freedesktop/NetworkManager/Devices/65723)
2022-03-20T17:25:59.233817-05:00 nclpvnpapp10400 NetworkManager[1616]: <info>  [1647815159.2337] manager: (veth9942692): new Veth device (/org/freedesktop/NetworkManager/Devices/65793)
2022-03-21T02:30:11.985956-05:00 nclpvnpapp10400 kernel: [165768.325316] device-mapper: thin: Deletion of thin device 994269 failed.
2022-03-21T02:30:21.487968-05:00 nclpvnpapp10400 kernel: [165777.826863] device-mapper: thin: Deletion of thin device 994269 failed.

We hit the “no space” on the April 11th…

[ah02870@nclpvnpapp10400 log]$ grep "no space" /var/log/messages | more
2022-04-11T00:00:23.557086-05:00 nclpvnpapp10400 dockerd[2052]: time="2022-04-11T00:00:23.556744401-05:00" level=warning msg="Deletion of device b1d4eeb84d3d15931ec52f03eab558fea101c20ba5e6dc383469ea51d3fc8cd9, device_id=994269 failed:devmapper: Error saving transaction metadata: devmapper: Error creating metadata file: open /apps/docker/
devicemapper/metadata/.tmp534691609: no space left on device" storage-driver=devicemapper


The device that cannot be deleted is DM-12.

../../dm-12 --> /dev/disk/by-id/dm-name-docker-253:9-134-b1d4eeb84d3d15931ec52f03eab558fea101c20ba5e6dc383469ea51d3fc8cd9
~~~

Will respond with a link to the core when the customer has completed and uploaded the core dump. I also let them know that the workaround has helped with other customers.

Comment 26 John Wolfe 2022-04-15 16:00:12 UTC

I have been reminded that it is the updated (new) quiese_manifest.xml file sent to the host that tells the vSphere host that the snapshot was a quiesed snapshot.

With the tools.conf setting "enableXmlManifest = false", no manifest file is created to forward to the host.   When viewing the snapshot information in the vSphere client's Manage Snapshots window, the snapshot will be displayed as a non-quiesed snapshot even though it was quiesed.

The snapshot should still be fine and quiesced, it just won't be displayed that way.

Comment 28 Robert McSwain 2022-04-18 18:40:28 UTC

https://drive.google.com/file/d/1wg9k-gMDqqA_mPPzu7hjEVnZkbwwmBSb/view?usp=sharing

Core and sosreport provided by 03011147

Comment 29 Cathy Avery 2022-04-19 14:22:53 UTC

(In reply to Robert McSwain from comment #28)
> https://drive.google.com/file/d/1wg9k-gMDqqA_mPPzu7hjEVnZkbwwmBSb/
> view?usp=sharing
> 
> Core and sosreport provided by 03011147

Hi,

I've been talking to John Wolfe. He asks if this information can be collected as soon as possible ( the logs will rollover at some point ):

"If at all possible, collect the vm_support bundle from the ESXi host running the affected RHEL 7 VM at the time the NMI was triggered.  Hopefully the hostd and vmkernel logs have not rotated past that point in time.  Follow the KB article (and links) https://kb.vmware.com/s/article/1010705.   It is only necessary to include the affected VM, not every VM on the host.

If you provide the name of that VM and the time that the NMI was taken, I can scan the logs to any indication of vmtoolsd activity including snapshot activities and any problems."

Thanks,

Cathy

Comment 32 Robert McSwain 2022-04-28 16:41:10 UTC

Comment 35 Cathy Avery 2022-05-25 18:56:47 UTC

(In reply to Robert McSwain from comment #33)
> (In reply to Robert McSwain from comment #31)
> > 
> > Current vmcore is here:  https://galvatron.cee.redhat.com/manager/725430923
> 
> @cavery Setting needinfo for this. Thanks!

What I am seeing is pretty much reflected in the other dumps. 
About a 100 uninterruptible tasks mostly ps and pybin.


[ 0 00:00:00.000] [UN]  PID: 2275   TASK: ffffa129bb4e4200  CPU: 3   COMMAND: "loop0"
[ 0 00:00:00.001] [UN]  PID: 91008  TASK: ffffa12a59361080  CPU: 0   COMMAND: "xfsaild/dm-13"
[ 0 20:43:45.773] [UN]  PID: 8799   TASK: ffffa1354aec6300  CPU: 10  COMMAND: "pool"
[ 0 20:50:07.641] [UN]  PID: 6605   TASK: ffffa135e71e4200  CPU: 15  COMMAND: "ps"
[ 0 20:53:50.367] [UN]  PID: 4848   TASK: ffffa138f728a100  CPU: 2   COMMAND: "ps"
[ 0 20:54:56.914] [UN]  PID: 4600   TASK: ffffa137e1535280  CPU: 2   COMMAND: "pybin"
[ 0 20:56:24.384] [UN]  PID: 1888   TASK: ffffa138f5af3180  CPU: 11  COMMAND: "zabbix_agentd"
[ 0 20:57:11.162] [UN]  PID: 92813  TASK: ffffa12a59333180  CPU: 7   COMMAND: "Chrome_IOThread"
[ 0 22:01:25.845] [UN]  PID: 92276  TASK: ffffa1384dbee300  CPU: 0   COMMAND: "1a1f09110779600"
[ 0 22:04:48.137] [UN]  PID: 92079  TASK: ffffa1382b4aa100  CPU: 8   COMMAND: "sisidsdaemon"
[ 0 22:10:39.028] [UN]  PID: 92865  TASK: ffffa137e9f2b180  CPU: 5   COMMAND: "Chrome_HistoryT"
[ 0 22:11:00.116] [UN]  PID: 92812  TASK: ffffa138fc799080  CPU: 4   COMMAND: "Chrome_CacheThr"
[ 0 22:11:00.175] [UN]  PID: 92278  TASK: ffffa129b9d1b180  CPU: 9   COMMAND: "chrome"


crash> ps -g ffffa138f67a6300
PID: 1593   TASK: ffffa138f67a6300  CPU: 8   COMMAND: "vmtoolsd"
  PID: 1609   TASK: ffffa138c5268000  CPU: 2   COMMAND: "HangDetector"
  PID: 8799   TASK: ffffa1354aec6300  CPU: 10  COMMAND: "pool"
  PID: 20693  TASK: ffffa137ec905280  CPU: 5   COMMAND: "pool"
  PID: 32614  TASK: ffffa131583e9080  CPU: 14  COMMAND: "pool"
  PID: 45316  TASK: ffffa13118f29080  CPU: 8   COMMAND: "pool"
  PID: 47490  TASK: ffffa137dd6b3180  CPU: 15  COMMAND: "pool"

crash> list -H super_blocks -s super_block.s_writers | grep -E '^ffff| frozen = ' | paste - - | grep -v ' = 0x0,'^C
crash> 
crash> mount | grep -E ffffa133f8e57800   <----- frozen = 1
ffffa134fcd6b780 ffffa133f8e57800 xfs    /dev/mapper/docker-253:9-134-e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874 /apps/docker/devicemapper/mnt/e1806fa3db98e87d0660c571475312351acdc01505c7e82b9bdeb48d755df874

Comment 36 Dave Wysochanski 2022-11-15 10:05:23 UTC

It looks like this bug has been idle for 6 months.  Is anyone working on it or is the investigation stalled?  If it's stalled, can someone identify next steps?

ALso it looks like a confirmed workaround exists (https://bugzilla.redhat.com/show_bug.cgi?id=1994590#c24), and this looks like it's RHEL7 which requires special business case for any fix to be delivered (please see lifecycle page https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase and https://access.redhat.com/support/policy/updates/errata#Life_Cycle_Dates), is there a business case for a possible fix to be investigated and delivered in RHEL7?
2. Is someone actively working on this bug anymore or is it blocked?
2. Do we know if this problem exists in RHEL8, RHEL9 or upstream?

There are open cases but I"m not sure how critical this is for any of the customers or if the workaround is adequate.  If there's no clear business case, the workaround is usable for customers, and there's no one working on it or next steps identified, maybe this should be closed WONT_FIX in RHEL7?

Comment 37 Cathy Avery 2022-11-15 14:09:28 UTC

(In reply to Dave Wysochanski from comment #36)
> It looks like this bug has been idle for 6 months.  Is anyone working on it
> or is the investigation stalled?  If it's stalled, can someone identify next
> steps?
> 
> ALso it looks like a confirmed workaround exists
> (https://bugzilla.redhat.com/show_bug.cgi?id=1994590#c24), and this looks
> like it's RHEL7 which requires special business case for any fix to be
> delivered (please see lifecycle page
> https://access.redhat.com/support/policy/updates/
> errata#Maintenance_Support_2_Phase and
> https://access.redhat.com/support/policy/updates/errata#Life_Cycle_Dates),
> is there a business case for a possible fix to be investigated and delivered
> in RHEL7?
> 2. Is someone actively working on this bug anymore or is it blocked?
> 2. Do we know if this problem exists in RHEL8, RHEL9 or upstream?
> 
> There are open cases but I"m not sure how critical this is for any of the
> customers or if the workaround is adequate.  If there's no clear business
> case, the workaround is usable for customers, and there's no one working on
> it or next steps identified, maybe this should be closed WONT_FIX in RHEL7?

@jwolfe 

Hi John is there any update on this BZ?

Thanks,

Cathy

Comment 38 John Wolfe 2022-11-15 14:52:10 UTC

(In reply to Cathy Avery from comment #37)
> (In reply to Dave Wysochanski from comment #36)
> > It looks like this bug has been idle for 6 months.  Is anyone working on it
> > or is the investigation stalled?  If it's stalled, can someone identify next
> > steps?
> > 
> > ALso it looks like a confirmed workaround exists
> > (https://bugzilla.redhat.com/show_bug.cgi?id=1994590#c24), and this looks
> > like it's RHEL7 which requires special business case for any fix to be
> > delivered (please see lifecycle page
> > https://access.redhat.com/support/policy/updates/
> > errata#Maintenance_Support_2_Phase and
> > https://access.redhat.com/support/policy/updates/errata#Life_Cycle_Dates),
> > is there a business case for a possible fix to be investigated and delivered
> > in RHEL7?
> > 2. Is someone actively working on this bug anymore or is it blocked?
> > 2. Do we know if this problem exists in RHEL8, RHEL9 or upstream?
> > 
> > There are open cases but I"m not sure how critical this is for any of the
> > customers or if the workaround is adequate.  If there's no clear business
> > case, the workaround is usable for customers, and there's no one working on
> > it or next steps identified, maybe this should be closed WONT_FIX in RHEL7?
> 
> @jwolfe 
> 
> Hi John is there any update on this BZ?
> 
> Thanks,
> 
> Cathy

Hi Cathy,

I am still working on a fix for the hang(s) seen when doing quiesed snapshots. 

The quiesed snapshot code is basically the same in the versions of open-vm-tools
on RHEL 7 through RHEL 9.  When the fix is ready, Red Hat can provide updated 
open-vm-tools if needed.

Comment 39 John Wolfe 2023-01-22 23:57:24 UTC

A patch is available for lib/syncDriver/syncDriverLinux.c to address this issue and is applicable to all earlier releases of this file.

https://github.com/vmware/open-vm-tools/commit/9d458c53a7a656d4d1ba3a28d090cce82ac4af0e

Track Linux filesystem id (FSID) for quiesced (frozen) filesystems

Tracking the filesystem FSID along with each file descriptor (FD)
as the ioctl FIFREEZE is done.  An EBUSY could be seen because of
an attempt to freeze the same superblock more than once depending
on the OS configuration (e.g. usage of bind mounts).  An EBUSY could
also mean another process has locked or frozen that filesystem.

When an EBUSY is received, the filesyste FSID is checked against the
list of filesystems that have already be quiesced.  If not previously
seen, a warning that the filesystem is controlled by another process
is logged and the quiesced snapshot request will be rejected.

Linux quiesced snapshot requests will now be rejected if another process has frozen/locked a filesystem of interest, thus avoiding a potential block when deleting (unlink) the file '/etc/vmware-tools/quiesce_manifest.xml' as reported in this PR.

The patch is in both the github "devel" and "stable-12.2.x" branches.   It is in the upcoming open-vm-tools 12.2.0 release and applicable to the supported open-vm-tools releases on RHEL 7, RHEL 8 and RHEL 9.

Comment 40 Cathy Avery 2023-04-27 13:30:38 UTC

Assigning bz to Ani

Comment 48 ldu 2023-05-26 01:27:44 UTC

Verified on RHEL 7.9 with open-vm-tools-11.0.5-3.el7_9.5, and run regression testing,no new issue found, so change status to verified.

Comment 60 errata-xmlrpc 2023-06-29 15:09:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: open-vm-tools security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3944

Comment 61 Red Hat Bugzilla 2023-10-28 04:25:04 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days