
Bug 2228893

Summary: "rpm -e" in libguestfs appliance (guest) triggers xfs_iunlink_lookup() WARNING [kernel-5.14.0-347.el9]
Product: Red Hat Enterprise Linux 9
Component: kernel
kernel sub component: XFS
Version: 9.3
Status: CLOSED MIGRATED
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Keywords: MigratedToJIRA, Regression
Reporter: Xiaodai Wang <xiaodwan>
Assignee: Bill O'Donnell <billodo>
QA Contact: Zorro Lang <zlang>
Docs Contact:
CC: billodo, dchinner, esandeen, lersek, mxie, rjones, tyan, tzheng, virt-maint, vwu, xiaodwan, xzhou
Flags: pm-rhel: mirror+
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-09-23 12:08:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
  v2v debug log (flags: none)

Description Xiaodai Wang 2023-08-03 14:31:01 UTC
Created attachment 1981479 [details]
v2v debug log

Description of problem:
A kernel warning (followed by an XFS filesystem shutdown) is triggered in the libguestfs appliance when converting a RHEL 7.4 guest that has VMware Tools installed.

Version-Release number of selected component (if applicable):
virt-v2v-2.3.4-5.el9.x86_64
libvirt-9.3.0-2.el9.x86_64
qemu-kvm-8.0.0-9.el9.x86_64
libguestfs-1.50.1-6.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a RHEL7.4 guest, then install vmtools on it.
2. Run virt-v2v to convert the VM.
3. Check the log.
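
An invocation for step 2 would look roughly like the following; the connection details below are placeholders, not the exact command used for this report:

  # hypothetical vCenter/ESXi details and output target; adjust to the real environment
  virt-v2v -ic 'vpx://administrator%40vsphere.local@vcenter.example.com/Datacenter/esxi.example.com?no_verify=1' \
           -ip /root/vcenter-password \
           rhel7.4-guest \
           -o libvirt -os default \
           -v -x 2>&1 | tee v2v-debug.log

Running with "-v -x" and capturing the output is what produces the attached debug log.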

commandrvf: rpm -e --allmatches open-vm-tools-desktop open-vm-tools
[   37.653365] ------------[ cut here ]------------
[   37.653370] WARNING: CPU: 5 PID: 615 at fs/xfs/xfs_inode.c:1840 xfs_iunlink+0x154/0x1e0 [xfs]
[   37.659150] Modules linked in: xfs dm_mod sg virtio_snd snd_pcm snd_timer snd soundcore libcrc32c crc8 crc7 crc_itu_t virtiofs fuse ext4 mbcache jbd2 virtio_vdpa vdpa virtio_mem virtio_input virtio_dma_buf virtio_balloon virtio_scsi sd_mod t10_pi nd_pmem nd_btt virtio_net net_failover failover virtio_console virtio_blk ata_piix libata nfit libnvdimm crc32_generic crct10dif_pclmul crc32c_intel crc32_pclmul
[   37.672362] CPU: 5 PID: 615 Comm: rpm Not tainted 5.14.0-347.el9.x86_64 #1
[   37.675023] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.1-1.el9 04/01/2014
[   37.677598] RIP: 0010:xfs_iunlink+0x154/0x1e0 [xfs]
[   37.688127] Code: 77 3a 4c 89 4c 24 08 e8 ca f0 5c da 44 89 f6 48 8d bd c0 00 00 00 e8 3b 0e a8 da 49 89 c4 48 85 c0 74 07 48 83 78 20 00 75 2c <0f> 0b e8 05 32 5d da 41 bc 8b ff ff ff e9 fb fe ff ff 48 c7 c6 c6
[   37.694974] RSP: 0018:ffffbddf80ddbd38 EFLAGS: 00010246
[   37.696953] RAX: 0000000000000000 RBX: 000000000009c686 RCX: 0000000000000006
[   37.699629] RDX: ffff9a4243bf7d98 RSI: ffff9a4243bf7de8 RDI: 00000000000d8146
[   37.702344] RBP: ffff9a42dfa1a600 R08: 0000000000000000 R09: ffff9a42dfa1a6c0
[   37.705073] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   37.707755] R13: ffff9a42dfd6c740 R14: 00000000000d8146 R15: ffff9a425055f400
[   37.710450] FS:  00007f0a20cf4800(0000) GS:ffff9a42dcd40000(0000) knlGS:0000000000000000
[   37.713454] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   37.715629] CR2: 0000000001ab3558 CR3: 0000000004ad8000 CR4: 00000000000406e0
[   37.718767] Call Trace:
[   37.719747]  <TASK>
[   37.723503]  ? show_trace_log_lvl+0x1c4/0x2df
[   37.726701]  ? show_trace_log_lvl+0x1c4/0x2df
[   37.728410]  ? xfs_remove+0x269/0x390 [xfs]
[   37.730482]  ? xfs_iunlink+0x154/0x1e0 [xfs]
[   37.732541]  ? __warn+0x81/0x110
[   37.733830]  ? xfs_iunlink+0x154/0x1e0 [xfs]
[   37.735890]  ? report_bug+0x10a/0x140
[   37.737313]  ? handle_bug+0x3c/0x70
[   37.738685]  ? exc_invalid_op+0x14/0x70
[   37.740187]  ? asm_exc_invalid_op+0x16/0x20
[   37.741798]  ? xfs_iunlink+0x154/0x1e0 [xfs]
[   37.743914]  ? xfs_iunlink+0x145/0x1e0 [xfs]
[   37.745976]  xfs_remove+0x269/0x390 [xfs]
[   37.747969]  xfs_vn_unlink+0x53/0xa0 [xfs]
[   37.749970]  vfs_unlink+0x117/0x290
[   37.751383]  do_unlinkat+0x1af/0x2e0
[   37.752776]  __x64_sys_unlink+0x3e/0x60
[   37.754268]  ? __pfx___x64_sys_unlink+0x10/0x10
[   37.756015]  do_syscall_64+0x5c/0x90
[   37.757405]  ? syscall_exit_to_user_mode+0x12/0x30
[   37.759238]  ? do_syscall_64+0x69/0x90
[   37.760699]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   37.762597] RIP: 0033:0x7f0a1c6eaf77
[   37.767040] Code: 48 3d 00 f0 ff ff 77 03 48 98 c3 48 8b 15 09 0f 2d 00 f7 d8 64 89 02 48 83 c8 ff eb eb 66 0f 1f 44 00 00 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 0e 2d 00 f7 d8 64 89 01 48
[   37.773900] RSP: 002b:00007ffe88d79138 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[   37.776804] RAX: ffffffffffffffda RBX: 0000000001ab33a0 RCX: 00007f0a1c6eaf77
[   37.779430] RDX: 00007ffe88d790a0 RSI: 0000000000000000 RDI: 0000000001ab3550
[   37.782080] RBP: 0000000001ab3550 R08: 0000000000000001 R09: 0000000000000000
[   37.784761] R10: 2d6e65706f2f6572 R11: 0000000000000206 R12: 0000000000000000
[   37.787414] R13: 0000000001aa25d0 R14: 0000000000000000 R15: 0000000000000006
[   37.790082]  </TASK>
[   37.790922] ---[ end trace 88eb53a6eaeed878 ]---
[   37.792694] XFS (dm-1): Internal error xfs_trans_cancel at line 1097 of file fs/xfs/xfs_trans.c.  Caller xfs_remove+0x168/0x390 [xfs]
[   37.793192] CPU: 5 PID: 615 Comm: rpm Tainted: G        W         -------  ---  5.14.0-347.el9.x86_64 #1
[   37.793199] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.1-1.el9 04/01/2014
[   37.793201] Call Trace:
[   37.793205]  <TASK>
[   37.793209]  dump_stack_lvl+0x34/0x48
[   37.793219]  xfs_trans_cancel+0x123/0x150 [xfs]
[   37.793626]  xfs_remove+0x168/0x390 [xfs]
[   37.794074]  xfs_vn_unlink+0x53/0xa0 [xfs]
[   37.794471]  vfs_unlink+0x117/0x290
[   37.794480]  do_unlinkat+0x1af/0x2e0
[   37.794487]  __x64_sys_unlink+0x3e/0x60
[   37.794491]  ? __pfx___x64_sys_unlink+0x10/0x10
[   37.794495]  do_syscall_64+0x5c/0x90
[   37.794504]  ? syscall_exit_to_user_mode+0x12/0x30
[   37.794510]  ? do_syscall_64+0x69/0x90
[   37.794516]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   37.794523] RIP: 0033:0x7f0a1c6eaf77
[   37.794537] Code: 48 3d 00 f0 ff ff 77 03 48 98 c3 48 8b 15 09 0f 2d 00 f7 d8 64 89 02 48 83 c8 ff eb eb 66 0f 1f 44 00 00 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 0e 2d 00 f7 d8 64 89 01 48
[   37.794541] RSP: 002b:00007ffe88d79138 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[   37.794547] RAX: ffffffffffffffda RBX: 0000000001ab33a0 RCX: 00007f0a1c6eaf77
[   37.794550] RDX: 00007ffe88d790a0 RSI: 0000000000000000 RDI: 0000000001ab3550
[   37.794553] RBP: 0000000001ab3550 R08: 0000000000000001 R09: 0000000000000000
[   37.794556] R10: 2d6e65706f2f6572 R11: 0000000000000206 R12: 0000000000000000
[   37.794559] R13: 0000000001aa25d0 R14: 0000000000000000 R15: 0000000000000006
[   37.794565]  </TASK>
[   37.796201] XFS (dm-1): xfs_difree: xfs_ialloc_read_agi() returned error -5.
[   37.796233] XFS (dm-1): xfs_difree: xfs_ialloc_read_agi() returned error -5.
[   37.796256] XFS (dm-1): xfs_difree: xfs_ialloc_read_agi() returned error -5.
[   37.797071] XFS (dm-1): Corruption of in-memory data (0x8) detected at xfs_trans_cancel+0x13c/0x150 [xfs] (fs/xfs/xfs_trans.c:1098).  Shutting down filesystem.
[   37.797597] XFS (dm-1): Please unmount the filesystem and rectify the problem(s)
warning: file /usr/share/open-vm-tools/messages/ja/vmware-vgauth-cmd.vmsg: remove failed: Structure needs cleaning
warning: file /usr/share/open-vm-tools/messages/ja/vmtoolsd.vmsg: remove failed: Input/output error
warning: file /usr/share/open-vm-tools/messages/ja/toolboxcmd.vmsg: remove failed: Input/output error
warning: file /usr/share/open-vm-tools/messages/ja/libvgauth.la.vmsg: remove failed: Input/output error

Actual results:
As described above.

Expected results:
The virt-v2v conversion should succeed.

Additional info:
1. I can't reproduce it with a RHEL 7.4 guest that does not have VMware Tools installed.
2. I also tried a RHEL 7.9 guest, but the issue is not reproducible there.

Comment 2 Laszlo Ersek 2023-08-03 14:59:27 UTC
This seems like an XFS filesystem corruption (or XFS kernel driver bug). The command

  rpm -e --allmatches open-vm-tools-desktop open-vm-tools

is bound to delete files, and the function that splats is xfs_iunlink() -- in the (guest) kernel. So the rpm userspace process invokes the unlink() syscall, it lands in xfs_iunlink(), and the latter blows up.

We even get an error report from the XFS driver:

> XFS (dm-1): Internal error xfs_trans_cancel at line 1097 of file fs/xfs/xfs_trans.c.  Caller xfs_remove+0x168/0x390 [xfs]

And further reports:

> XFS (dm-1): xfs_difree: xfs_ialloc_read_agi() returned error -5.
> XFS (dm-1): xfs_difree: xfs_ialloc_read_agi() returned error -5.
> XFS (dm-1): xfs_difree: xfs_ialloc_read_agi() returned error -5.
> XFS (dm-1): Corruption of in-memory data (0x8) detected at xfs_trans_cancel+0x13c/0x150 [xfs] (fs/xfs/xfs_trans.c:1098).  Shutting down filesystem.
> XFS (dm-1): Please unmount the filesystem and rectify the problem(s)

The guest (= appliance) kernel is 5.14.0-347.el9.x86_64 -- that's a very recent RHEL9 development kernel. So I would blame this symptom on a recent RHEL9 kernel change.

Here's what the appliance kernel source ("fs/xfs/xfs_inode.c") looks like, at 5.14.0-347.el9.x86_64:

   1820 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1820) /*
   1821 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1821)  * Find an inode on the unlinked list. This does not take references to the
   1822 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1822)  * inode as we have existence guarantees by holding the AGI buffer lock and that
   1823 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1823)  * only unlinked, referenced inodes can be on the unlinked inode list.  If we
   1824 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1824)  * don't find the inode in cache, then let the caller handle the situation.
   1825 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1825)  */
   1826 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1826) static struct xfs_inode *
   1827 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1827) xfs_iunlink_lookup(
   1828 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1828)      struct xfs_perag        *pag,
   1829 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1829)      xfs_agino_t             agino)
   1830 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1830) {
   1831 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1831)      struct xfs_inode        *ip;
   1832 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1832) 
   1833 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1833)      rcu_read_lock();
   1834 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1834)      ip = radix_tree_lookup(&pag->pag_ici_root, agino);
   1835 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1835) 
   1836 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1836)      /*
   1837 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1837)       * Inode not in memory or in RCU freeing limbo should not happen.
   1838 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1838)       * Warn about this and let the caller handle the failure.
   1839 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1839)       */
   1840 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1840)      if (WARN_ON_ONCE(!ip || !ip->i_ino)) {
   1841 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1841)              rcu_read_unlock();
   1842 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1842)              return NULL;
   1843 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1843)      }
   1844 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1844)      ASSERT(!xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM));
   1845 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1845)      rcu_read_unlock();
   1846 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1846)      return ip;
   1847 955a0d9d7e13e (Bill O'Donnell      2023-05-18 11:11:44 -0500 1847) }

The WARN_ON_ONCE on line 1840 is the one that fires.

That line comes from downstream commit 955a0d9d7e13 ("xfs: introduce xfs_iunlink_lookup", 2023-05-18), for bug 2167832 [XFS: sync to upstream v6.0].

At this time, the latest development kernel is kernel-5.14.0-349.el9, and "fs/xfs/xfs_inode.c" is identical between -347.el9 (for which this issue is being reported) and -349.el9 (= the latest build). So it looks like an acute downstream kernel bug.

I'm moving this to the kernel, as a regression.

Comment 3 Laszlo Ersek 2023-08-03 15:08:21 UTC
Xiaodai, you mention that this is 100% reproducible, and that the XFS filesystem in question was originally created by the RHEL-7.4 installer (AIUI).

If that's the case (and the filesystem per se is not corrupt), then please hang on to the disk image -- I suspect that there may be something special in the filesystem that is not invalid, but breaks the latest RHEL-9.3 XFS driver. So the XFS developers might want to investigate the disk image. Thanks.

Comment 4 Richard W.M. Jones 2023-08-03 15:25:29 UTC
Make sure that Eric Sandeen is CC'd, if this is really XFS corruption.

Comment 5 Eric Sandeen 2023-08-03 15:42:48 UTC
FWIW this is probably not exactly a corruption, but rather an unexpected state that causes a failure:

> WARNING: CPU: 5 PID: 615 at fs/xfs/xfs_inode.c:1840 xfs_iunlink+0x154/0x1e0 [xfs]

and a subsequent filesystem shutdown when the dirty transaction has to be canceled as a result of the error:

> XFS (dm-1): Internal error xfs_trans_cancel at line 1097 of file fs/xfs/xfs_trans.c.

Comment 6 Bill O'Donnell 2023-08-03 16:06:33 UTC
Can you obtain a coredump when the warning occurs?

Comment 7 Xiaodai Wang 2023-08-03 16:26:49 UTC
(In reply to Laszlo Ersek from comment #3)
> Xiaodai, you mention that this is 100% reproducible, and that the XFS
> filesystem in question was originally created by the RHEL-7.4 installer
> (AIUI).

Yes, the RHEL 7.4 guest is a fresh install from the release tree.

> 
> If that's the case (and the filesystem per se is not corrupt), then please
> hang on to the disk image -- I suspect that there may be something special
> in the filesystem that is not invalid, but breaks the latest RHEL-9.3 XFS
> driver. So the XFS developers might want to investigate the disk image.
> Thanks.

The issue can't be reproduced with the fresh RHEL 7.4 guest. It only happens
after the VMware Tools packages are installed. Maybe that is what breaks the
RHEL-9.3 XFS driver.

Comment 8 Richard W.M. Jones 2023-08-03 16:35:36 UTC
> Can you obtain a coredump when the warning occurs?

Do you know how to do this?  It might be easier for us to provide you
with the actual disk image instead.

> after installing the vmware tools packages

Is that open-vm-tools specifically, or the VMware Tools program?  (I think
the second one includes some non open-source components).  I don't think
either installs a kernel module.

Comment 9 Laszlo Ersek 2023-08-03 16:36:32 UTC
Xiaodai, for getting a vmcore, I recommend the following:

(1) before invoking virt-v2v, set the following environment variable:

  export LIBGUESTFS_APPEND="panic_on_warn=1"

(2) Make sure you use the libvirt backend, not the QEMU backend

(3) Monitor the "-v -x" log while virt-v2v is working.

(4) Once you hit the warning in the guest kernel, it will crash (hang), and I *hope* virt-v2v will not immediately kill the appliance -- virt-v2v should just sit there, waiting

(5) with "virsh list", learn about the transient (temporary) domain that is the libguestfs appliance

(6) Issue the following command:

# virsh dump GUESTFS_DOMAIN vmcore --memory-only --format kdump-snappy

(Customize GUESTFS_DOMAIN per the output of step (5).) This will write a file called "vmcore" that Bill can analyze with "crash".
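
Putting those steps together, a capture session might look roughly like this (the domain name is a placeholder; use whatever "virsh list" actually reports for the transient appliance):

  export LIBGUESTFS_APPEND="panic_on_warn=1"
  export LIBGUESTFS_BACKEND=libvirt        # ensure the libvirt backend is used
  virt-v2v ... -v -x 2>&1 | tee v2v-debug.log &
  # when the guest-kernel WARNING appears in the log, find the transient appliance domain
  virsh list
  # and dump its memory for analysis with "crash"
  virsh dump guestfs-xxxxxxxxxx vmcore --memory-only --format kdump-snappy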

Comment 10 Eric Sandeen 2023-08-03 18:11:30 UTC
> The issue can't be reproduced by the fresh RHEL7.4 guest. It only happens
> after installing the vmware tools packages. Maybe that breaks the RHEL-9.3
> XFS driver.

Is this simply because having the vmware tools package installed prior to conversion is what causes the subsequent removal of these tools via "rpm -e" when the guest is converted to KVM?

IOWS it is unlikely to be anything about the package itself; it appears to me that the warning and filesystem shutdown happen during the file unlinks performed as part of the package removal.

Comment 11 Eric Sandeen 2023-08-03 18:16:31 UTC
(In reply to Richard W.M. Jones from comment #8)
> > Can you obtain a coredump when the warning occurs?
> 
> Do you know how to do this?  It might be easier for us to provide you
> with the actual disk image instead.

If you can provide us with the pre-conversion RHEL7 image, with instructions on how to run the conversion to trigger the problem, that should suffice as well.

Thanks,
-Eric

Comment 12 Richard W.M. Jones 2023-08-03 18:18:43 UTC
I'll grab the image and send you a link to it.

Comment 15 Eric Sandeen 2023-08-04 01:04:47 UTC
Richard sent out a link to the RHEL7 image that is causing the issues, with a reproducer. I would probably start by examining the filesystems within that image to see if there is anything unique or problematic about them.

In the end, the reproducer is simply removing a few rpm packages and unlinking the files they contain, albeit in an interesting guest environment. Is unlinking those files in the image enough to trigger the bug?
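
That could be tested directly, without running a full conversion, by deleting one of the affected files through the libguestfs appliance; a sketch (the image name is a placeholder, the path is one of the files from the rpm output above):

  guestfish -a rhel7.4-test.img -i rm /usr/share/open-vm-tools/messages/ja/vmtoolsd.vmsg

If the same WARNING fires in the appliance kernel, the package itself is out of the picture.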

Comment 16 Richard W.M. Jones 2023-08-04 08:10:53 UTC
The thing that worries me about this bug is whether the disk image is simply corrupt,
in which case we should be rejecting it (it's NOTABUG).  I believe this only counts
as a bug if there's a difference or incompatibility in the way RHEL 7 (guest) reads/writes
the image versus RHEL 9 (new host kernel).

However because the disk image we have is for VMware I cannot test boot it on KVM
in order to run commands like 'rpm -e' or 'xfs_repair'.

Xiaodai, is it possible to snapshot this disk on the source (so as not to
disturb the disk image), then boot it in VMware, and run the 'rpm -e' command,
to see if it behaves the same way when run on VMware with a RHEL 7 kernel?
Then after the test, restore back to the snapshot.

Comment 17 Xiaodai Wang 2023-08-04 09:22:28 UTC
There is no issue when running the 'rpm -e' command directly on the original guest in VMware. 

[root@vm-212-125 ~]# rpm -e --allmatches open-vm-tools-desktop open-vm-tools
[root@vm-212-125 ~]# echo $?
0
[root@vm-212-125 ~]# systemctl status vmtoolsd
Unit vmtoolsd.service could not be found.

Comment 18 Richard W.M. Jones 2023-08-04 10:24:11 UTC
That seems to indicate that there really is a kernel bug here.

Comment 20 Richard W.M. Jones 2023-08-05 11:33:57 UTC
CANNOT reproduce with this RHEL 7 kernel: kernel-3.10.0-1160.el7.x86_64
CAN reproduce with RHEL 9 kernel:         kernel-5.14.0-349.el9.x86_64

Comment 22 Richard W.M. Jones 2023-08-07 08:14:18 UTC
Dave, the driver in this case is VMware PVSCSI (vmw_pvscsi).  The kernel part is
open source but the storage part is VMware's closed source ESXi so who knows what
happens there.

Comment 23 Richard W.M. Jones 2023-08-07 08:15:36 UTC
Xiaodai, is it possible to reproduce this situation from scratch in a new guest?
I think you mentioned something about installing RHEL 7 as a guest on VMware and
then installing VMware Tools.  If that (done fresh) recreates the situation then
we might have something that we can investigate.

Comment 24 Laszlo Ersek 2023-08-07 10:04:35 UTC
Right, two ideas come up:

- Install RHEL7 afresh on VMWare (like now), but take a snapshot of the disk before installing the VMware Tools RPM. Try to see if removing a *different* RPM on the pre-tools-install image, under RHEL9, complains. If not, then the invalid clean-declaration of the log may occur in response to the VMWare tools RPM installation.

- Repeat the original reproducer, but with emulated IDE rather than VMware PVSCSI. (I don't remember, but I assume at least that VMWare supports slow, emulated (=non-paravirt) storage, such as IDE.)

Comment 25 Eric Sandeen 2023-08-07 13:52:59 UTC
So at this point, we seem to have a "regression" but only for corrupted filesystems. That probably lowers the urgency a bit.

Dave, FWIW, we do occasionally see this situation in the wild (clean fs, dirty unlinked list) - no idea how it arises.

Comment 27 Richard W.M. Jones 2023-08-07 14:22:14 UTC
I think VirtualBox is related to qemu, and we're fairly sure that qemu
doesn't corrupt XFS filesystems, so my speculation would be either it's
something to do with VMware or it's something that happens on every RHEL 7.
Let's see if Xiaodai can reproduce this with VMware and see where that leads us.

Comment 28 Eric Sandeen 2023-08-07 17:22:20 UTC
This looks like an upstream report of the same thing, though let's see what they find with xfs_repair:

https://bugzilla.kernel.org/show_bug.cgi?id=217769

Comment 29 Dave Chinner 2023-08-07 22:06:42 UTC
FWIW, this has nothing to do with the vmware tools rpm install or removal. It simply has to do with removing files, so the focus on "installing/removing vmware tools rpm" is misguided. If you remove any package that has inodes in AG 1 (there are about 25,000 active allocated inodes in AG 1) and they hash to a dirty unlinked list (there were several in AG 1), then that remove operation will also trip over the same unlinked list problem.

Hence the focus needs to be on finding out how the filesystem ended up being cleanly unmounted whilst there were still inodes in the unlinked lists on disk....

-Dave.

Comment 30 Richard W.M. Jones 2023-08-08 07:26:46 UTC
Understood.  However the only "lead" we have for this bug is that it happened when
installing RHEL 7 and then installing VMware tools, so I'd like Xiaodai to try
that to see if it is reproducible.  If that is the case then we will have some
steps that can be done to reproduce the fs inconsistency.

Comment 33 Richard W.M. Jones 2023-08-08 12:17:59 UTC
> Here is a guest(rhel7.4-test) I created last week, IIRC, I didn't do any change and installed nothing manually. 
> it has two snapshots. The first one is good. The second one is bad.
> From the good, you just need to enter the OS, then power off, it will turn into bad. May be this would help you to
> debug the issue.

Have you got the name of the good snapshot?  I will download that as an
image and try to reproduce this behaviour with KVM.

Comment 43 Richard W.M. Jones 2023-08-09 09:23:55 UTC
> It's very strange that v2v reports different error when convert the guest 'esx8.0-rhel7.4-x86_64-mxie-debug' although installing vmware-tools for the guest using same method

You can't really tell from the virt-v2v error; you need to check the full
debug logs for the actual kernel message to see whether it is the same or not.
(Since it fails when deleting a file, I suspect it's going to be the same thing).

Comment 48 Laszlo Ersek 2023-08-09 15:24:15 UTC
In all three of those logs, the underlying issue is the same:

> WARNING: CPU: ... PID: ... at fs/xfs/xfs_inode.c:1840 xfs_iunlink+0x154/0x1e0 [xfs]

- In case 1 ("Structure needs cleaning"), the process affected is "guestfsd".

- In case 2 ("is_file: stat: /etc/init.d/kudzu: Input/output error"), it is "chkconfig".

- In case 3 ("/usr/bin/vmware-uninstall-tools.pl: Input/output error"), it is "systemctl".

The steps performed before the conversion, that is, while the disk is managed by vmware, differ in each case -- but they all result in the same (problematic) filesystem condition.

And the same filesystem condition triggers the same warning during each conversion, only the *manifestation* of the same issue differs every time.

See Dave's comment 29.


Here are some ideas:

- does the problem persist if, in any of the three scenarios, you run "sync" manually before you power down the VM (on vmware)?

- (repeating from comment 24:) can you try using IDE rather than PVSCSI?

Comment 49 Laszlo Ersek 2023-08-09 15:38:54 UTC
Bill, Eric, Dave,

- would it help if, while the guest existed on vmware, "blktrace" recorded the i/o traffic on the block device underlying the XFS filesystem? (The events should be recorded to file living on a *different* virtual disk, presumably.)

- Also, the problematic filesystem state is produced by the RHEL-7.4 kernel. In comment 21, Dave wrote that the culprit may be "a kernel bug that allowed a log to marked clean whilst there were still outstanding unlinked inodes needing to be processed prior to marking the log clean". What if there really was such a bug in the RHEL-7.4 kernel, it just has been fixed since -- and RHEL-7.9 is not affected?

Can you identify (git-blame) the XFS commit (backport) in RHEL-7.9 that deals with the unlinked inodes *before* unmounting / clean-marking? If we know the particular commit, we can check whether it was part of RHEL-7.4 already. If it wasn't, then this is simply a RHEL-7.4 guest kernel bug, and the user should be advised to upgrade RHEL7 to RHEL7.9 specifically *on the source* (vmware) before attempting conversion.
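
One way to narrow that down, assuming a checkout of the RHEL-7 kernel git tree (the tag names below are illustrative):

  # XFS changes that landed between the 7.4 and 7.9 kernels
  git log --oneline kernel-3.10.0-693.el7..kernel-3.10.0-1160.el7 -- fs/xfs/
  # for a candidate commit, check whether 7.4 already contains it
  git tag --contains <commit-hash> | grep 693.el7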

Comment 51 Richard W.M. Jones 2023-08-09 16:04:27 UTC
(In reply to Eric Sandeen from comment #50)
> Has a reproducer been identified where you can start with a known-good XFS
> filesystem (i.e. you can show that xfs_repair finds no issues), you then
> operate on on that same filesystem with (???), and end up with that same
> filesystem with unlinked inode problems as found by a subsequent xfs_repair
> (or manifestation of this bug?)

In theory we have a disk image (comment 31) which would reproduce this with some
simple steps.  Unfortunately because it is a snapshot of a VMware disk and there
is some permission problem I've not been able to download and verify this myself.

I am instead trying to install a RHEL 7.4 guest from scratch on my local KVM instance
and then seeing if I can reproduce the xfs_repair -n output from the filesystem,
on the theory that this is some kind of RHEL 7 kernel bug that leaves a corrupted
filesystem which is only picked up by the RHEL 9 kernel later.

Comment 54 Laszlo Ersek 2023-08-09 16:26:56 UTC
(In reply to Eric Sandeen from comment #50)

> Upgrading the kernel to i.e. 7.9 isn't likely to help. By the time we have a
> clean filesystem with an unexpected unlinked inode list lying around, the
> only things that will fix it are an xfs_repair or a log replay (caused by a
> crash or power loss or filesystem shutdown).

I agree that an upgrade is unlikely to help; the RHEL-7.4 guest kernel could do its damage before the RHEL-7.9 kernel is first booted.

Instead, my proposal is to start the entire procedure with RHEL-7.9 install media, on vmware -- the filesystem should be formatted with mkfs.xfs from RHEL-7.9 (invoked effectively by Anaconda), and only touched (after installation / reboot) by the RHEL-7.9 kernel. I suspect that this way the filesystem inconsistency will never come to be, more or less proving that the problem was fixed between 7.4 and 7.9.

If we can show this reliably, then we should only advise virt-v2v users to (a) upgrade their RHEL-7.4 virtual machines to RHEL-7.9 on the *source* hypervisor (which is something we tend to advise anyway!), and (b) run "xfs_repair" still on the source hypervisor, after upgrading to RHEL-7.9. After both steps, the XFS volume on vmware should be in a pristine state (with the guest shut down), and virt-v2v conversion (using a RHEL-9 guest kernel) should succeed.
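
For step (b), that amounts to something like the following on the source side, with the guest shut down and its root filesystem not mounted anywhere (the device path is an example):

  xfs_repair -n /dev/mapper/rhel-root   # dry run: only report problems
  xfs_repair /dev/mapper/rhel-root      # clear any leftover unlinked inodes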

Comment 56 Laszlo Ersek 2023-08-09 16:44:37 UTC
I think none of us have such a reproducer (I certainly don't have one).

... Which, of course, implies that not seeing the symptom when starting with RHEL-7.9, could only be "stochastic" evidence. :)

Comment 57 Richard W.M. Jones 2023-08-09 16:51:48 UTC
There's a reproducer described in comment 31, if we were able to access it.

I just tried now to create a fresh RHEL 7.4 VM on KVM, tried removing packages, rebooting
etc, and in all cases (upstream) xfs_repair on the filesystem was clean.  So I do not
have any local reproducer.

Comment 59 Eric Sandeen 2023-08-10 03:41:49 UTC
(In reply to Richard W.M. Jones from comment #57)
> There's a reproducer described in comment 31, if we were able to access it.

This should be a solvable problem I hope?

Comment 60 Richard W.M. Jones 2023-08-10 08:08:27 UTC
> This should be a solvable problem I hope?

I hope so.

Xiaodai, we really need a way to get the "good" snapshot that you described
in comment 31 above.

In comment 45 I suggested a command that can be used to extract this, but
it fails with a permission error.  For some reason the username or password
are wrong, but I don't know what they should be.

Maybe there is some other way using VMware to extract a disk image from
a snapshot?

Comment 61 Laszlo Ersek 2023-08-10 09:38:39 UTC
(In reply to mxie from comment #58)

> The problem is still existed even if executing sync before powering down and
> the disk type of the guest is IDE

Thank you, this is a useful development! It excludes the vmw_pvscsi driver as a potential culprit.

Comment 63 Richard W.M. Jones 2023-08-10 10:55:49 UTC
Thanks for handling this.  Yes, qemu-img convert should be able to decode it, and
if not then nbdkit-vddk-plugin can open local files and will definitely be able
to handle it.

Comment 66 Richard W.M. Jones 2023-08-10 17:17:43 UTC
I've reproduced steps (1) and (2) locally (using Fedora host).
Will try for (3) &c ... after dinner.

Comment 69 Richard W.M. Jones 2023-08-10 19:33:15 UTC
I was able to reproduce step (3), with reservations.

I added the qemu -m 2048 flag.  This is Fedora so I'm not sure if the ancient default
of 128M RAM still applies, or perhaps is changed downstream in RHEL, but better to be
on the safe side.

I also used -serial stdio instead of using -monitor stdio so I could capture the serial output.

Fedora's qemu has Gtk drivers built in so I didn't need vncviewer, -S flag, or gdb.

I checked there are no VMware kernel modules loaded inside the guest.

Most seriously, and quite obviously when capturing serial output, the shutdown
gives worrying errors, which I think may indicate a problem with the "state-1.qcow2"
file (in addition to any that Laszlo has already pointed out in his
comment 65, steps (4) & (5)).

[   84.663916] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[   84.665996] systemd-journald[452]: Received SIGTERM from PID 1 (systemd-shutdow).
[   94.668344] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[   94.670187] systemd-shutdown[1]: Sending SIGKILL to PID 475 (lvmetad).
[   94.671369] systemd-shutdown[1]: Unmounting file systems.
[   94.705997] systemd-shutdown[1]: All filesystems unmounted.
[   94.706470] systemd-shutdown[1]: Deactivating swaps.
[   94.706919] systemd-shutdown[1]: All swaps deactivated.
[   94.707369] systemd-shutdown[1]: Detaching loop devices.
[   94.708627] systemd-shutdown[1]: All loop devices detached.
[   94.709120] systemd-shutdown[1]: Detaching DM devices.
[   94.709638] systemd-shutdown[1]: Detaching DM 253:1.
[   94.713002] systemd-shutdown[1]: Not all DM devices detached, 1 left.
[   94.713585] systemd-shutdown[1]: Detaching DM devices.
[   94.714092] systemd-shutdown[1]: Not all DM devices detached, 1 left.
[   94.714637] systemd-shutdown[1]: Cannot finalize remaining DM devices, continuing.
[   94.733147] systemd-shutdown[1]: Successfully changed into root pivot.
[   94.733704] systemd-shutdown[1]: Returning to initrd...
[   94.753025] dracut Warning: Killing all remaining processes
dracut Warning: Killing all remaining processes
[   94.771171] XFS (dm-0): Unmounting Filesystem
[   94.773454] dracut Warning: Unmounted /oldroot.
[   94.781661] dracut: Disassembling device-mapper devices
Powering off.
[   94.786302] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[   94.786853] sd 1:0:0:0: [sda] Stopping disk
[   94.833944] ACPI: Preparing to enter system sleep state S5
[   94.834253] Power down.
[   94.835085] acpi_power_off called

Comment 70 Richard W.M. Jones 2023-08-10 19:34:41 UTC
> Which RHEL-7.4 kernel was running when you observed this (uname -r?)

The kernel inside state-1.qcow2 is:

$ virt-get-kernel -a state-1.qcow2 
download: /boot/vmlinuz-3.10.0-693.el7.x86_64 -> ./vmlinuz-3.10.0-693.el7.x86_64
download: /boot/initramfs-3.10.0-693.el7.x86_64.img -> ./initramfs-3.10.0-693.el7.x86_64.img

Comment 73 Richard W.M. Jones 2023-08-11 07:42:46 UTC
This is with Fedora's qemu, I didn't try it on RHEL's qemu ...

To do everything through the guest's serial console:

$ qemu-img create -f qcow2 -F qcow2 -b state-1.qcow2 transition.qcow2
$ qemu-system-x86_64 -nodefaults -nographic -m 2048 -enable-kvm -drive id=sys,file=transition.qcow2,if=none -device ide-hd,drive=sys,bootindex=1 -netdev user,id=usernet -device e1000,netdev=usernet -serial stdio

Interrupt grub and edit the Linux command line:

 - Remove "rhgb quiet"  (the delete key in the serial console is ^H)
 - Add:
     ignore_loglevel console=ttyS0 4

^X to boot
Log in as root/redhat
poweroff

Then the "transition.qcow2" overlay will be corrupted.

Comment 74 Laszlo Ersek 2023-08-11 13:00:38 UTC
(In reply to Eric Sandeen from comment #67)
> That's a lot to digest, but I think the assessment here is that
> state-1.qcow2 is a "pre-corruption" image?

Yes, it's already corrupted (unexpectedly so), in a sense -- it has a dirty log, when it shouldn't.

> If I mount / unmount that filesystem under a RHEL9 kernel, the log replays
> and all unlinked inodes are handled as expected, and the filesystem then is
> clean according to xfs_repair.

Yes, this matches my results.

> I get the same result doing the mount / umount under a RHEL7.4 
> 3.10.0-693.el7.x86_64 kernel

That, on the other hand, I can neither confirm nor refute, from my end.

What I can say is that the following *sequence* results in corruption (= clean log with inodes still present in the unlinked list):

- replay the preexistent dirty log inside "state-1.qcow2" with the RHEL-7.4 kernel
- continue booting to the login prompt
- log in as root
- run "poweroff"
- wait until the VM shuts down

It's unclear to me precisely what step in the above sequence is responsible for the corruption -- I can't tell whether it is the log replay by the RHEL-7.4 kernel, or one of the later steps.

> So I guess I need to look more at the intermediate steps in the reproducer
> above to see what might have gotten us into this state.

From the latest comments, we apparently need to dig into "poweroff"?

Comment 75 Laszlo Ersek 2023-08-11 14:43:16 UTC
To me as well, the RHEL-7.4 log replay itself seems fine.

I created a separate, dedicated RHEL-7.4 virtual machine (using the same kernel), and attached "state-1.qcow2" to it as a *data disk*. I then mounted and immediately unmounted the problematic filesystem, and I *also* deactivated the LVs (vgchange -a n). Then I shut off this dedicated VM. After this, the reproducer didn't run into any errors.

Comment 76 Laszlo Ersek 2023-08-11 14:48:52 UTC
Repeating the test in comment 75, but without the explicit "vgchange -a n" before the "poweroff", the file system (now exposed as a *data disk*) remains *clean*. So the problem seems somehow specific to the filesystem in question being the *root* fs.

Comment 77 Laszlo Ersek 2023-08-11 15:14:16 UTC
- run the reproducer as under bullet (3) in comment 65, but
- replace the "poweroff" command with "systemctl isolate rescue.target".

This will kill all processes keeping the rootfs open with r/w references, so as next step, we can do:

mount -o remount,ro /
xfs_repair -n /dev/mapper/rhel_bootp--73--213--142-root

and this *will* find unlinked inodes!

At this point I can even do

xfs_repair -d /dev/mapper/rhel_bootp--73--213--142-root

("dangerous" mode, to be followed by a reboot immediately), and that command *salvages* the lost inodes under /lost+found. They are small files, approx. 4KB in size, and they look like GVariant objects from glib, serialized to disk.

So to summarize, the RHEL-7.4 kernel is assumed to perform a *valid* log replay on the rootfs (!) during boot -- judged from separate log replays with the same kernel, see above --, but by the time I log in, switch to runlevel 1, and remount the rootfs read-only, there are already lost inodes. *AND* the read-only remount does not clean them up.

I wish I could figure out what those files are that I can save under /lost+found. But based on just their contents, it seems difficult.
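
Collected in one place, the sequence above is roughly the following (LV path as in this particular guest; a sketch, not a recommended general procedure):

  systemctl isolate rescue.target
  mount -o remount,ro /
  xfs_repair -n /dev/mapper/rhel_bootp--73--213--142-root   # report-only: lists the lost inodes
  xfs_repair -d /dev/mapper/rhel_bootp--73--213--142-root   # "dangerous" mode on the ro rootfs
  reboot -f                                                 # reboot immediately afterwards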

Comment 78 Laszlo Ersek 2023-08-11 15:18:11 UTC
Update: repeated cold boots of this VM keep producing *freshly lost* inodes. I repeated the same procedure from comment 77, and while I see the previously saved inodes under /lost+found, "xfs_repair -n" is now reporting *NEW* (and different) lost inodes, once I switch to runlevel 1 again, and remount the rootfs read-only. Something seems to "reliably leak" inodes during every boot.

Comment 79 Richard W.M. Jones 2023-08-11 15:44:55 UTC
I'd really like to make this bug public as fully as possible.  Can you mark any
comments that don't have private links, passwords etc as non-private, then
I'll make the bug public.

Comment 82 Eric Sandeen 2023-08-11 16:43:43 UTC
We might be able to get clues from the lost+found files by looking at their selinux contexts.

In my case I see:

unconfined_u:object_r:config_home_t:s0
system_u:object_r:tuned_tmp_t:s0
system_u:object_r:firewalld_tmp_t:s0
unconfined_u:object_r:user_tmp_t

Oddly, I find no active inodes on the system with either tuned_tmp_t, firewalld_tmp_t, or config_home_t selinux contexts, only the files that got dropped in lost+found when xfs_repair ran.

Comment 83 Eric Sandeen 2023-08-11 17:00:01 UTC
Since we seem to get these new unlinked inodes at every new boot, would it be worth installing newer kernels from later point releases to see if there's a point at which the behavior goes away? (I'm willing to do that but need assistance to know how to get networking going after Richard's help above, sorry for being a virt neophyte.)

Comment 84 Richard W.M. Jones 2023-08-11 17:15:02 UTC
Networking ought to work because of the "-netdev user,id=usernet -device e1000,netdev=usernet"
parameters already suggested.  However the NetworkManager inside the guest is misconfigured
somehow.  It's easy to get around this though, at the guest console type:

[root@localhost ~]# systemctl stop NetworkManager
[root@localhost ~]# dhclient ens2

It will acquire a 10.0.2.15 address, see:
https://wiki.qemu.org/Documentation/Networking#User_Networking_(SLIRP)

Note this is SLIRP networking so some things like ping don't work.

Comment 85 Eric Sandeen 2023-08-11 22:06:35 UTC
So I don't think I can reproduce this with the RHEL7.5 kernel, 3.10.0-862.el7.x86_64

Not sure if it's relevant or not, but that kernel contains 7e5b2eff655 "[fs] xfs: add O_TMPFILE support", and the unlinked inodes that were problematic under the RHEL7.4 kernel and dropped into lost+found (mostly) had *_tmp_t contexts.

Comment 86 Eric Sandeen 2023-08-11 22:25:42 UTC
(without the xfs O_TMPFILE support added later, O_TMPFILE inodes cannot be created under the prior kernel, so that seems unlikely to be directly related.)

Comment 87 Eric Sandeen 2023-08-11 22:46:09 UTC
There are a couple other interesting commits between the 2 kernels (which I wrote ...)

5fdca1ac490 [fs] xfs: toggle readonly state around xfs_log_mount_finish
eab8a4afb3c [fs] xfs: write unmount record for ro mounts

Those seem tantalizingly related, but I can't really connect the dots here. The first makes sure that we process the unlinked inode list after log recovery even for RO mounts, and the 2nd makes sure that when we umount an RO mount, we write the unmount record. But in our "poweroff" we're not doing log replay at all, as far as I can tell.

Comment 88 Laszlo Ersek 2023-08-14 07:10:41 UTC
(In reply to Eric Sandeen from comment #86)
> (without the xfs O_TMPFILE support added later, O_TMPFILE inodes cannot
> cannot be created under the prior kernel, so that seems unlikely to be
> directly related.)

It could be, in my opinion. open(2) explains the two uses of O_TMPFILE, and I suspect glibc's tmpfile(3) might attempt O_TMPFILE, and fall back to manual (less atomic) steps when the fs does not support O_TMPFILE.

... Yes, that's what "stdio-common/tmpfile.c" does; it first calls __gen_tempfd(), and if that one fails, it goes manual. And __gen_tempfd() uses O_TMPFILE in "sysdeps/unix/sysv/linux/gentempfd.c".

Related glibc commit (and bugzilla): e0d2eb5a7986 ("linux: Implement tmpfile with O_TMPFILE (BZ#21530)", 2017-09-01) / <https://sourceware.org/bugzilla/show_bug.cgi?id=21530>.
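
A quick way to observe which path a tmpfile(3)-style caller takes on a given kernel/glibc/filesystem combination is to strace it; a sketch (any small program that creates an anonymous temp file will do):

  # with O_TMPFILE support you see openat(..., O_TMPFILE|...); with the fallback you see
  # a named file created in /tmp and immediately unlink()ed
  strace -f -e trace=open,openat,unlink,unlinkat \
      python -c 'import tempfile; tempfile.TemporaryFile(dir="/tmp")'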

Comment 89 Laszlo Ersek 2023-08-14 07:19:50 UTC
(In reply to Eric Sandeen from comment #87)
> There are a couple other interesting commits between the 2 kernels (which I
> wrote ...)
> 
> 5fdca1ac490 [fs] xfs: toggle readonly state around xfs_log_mount_finish
> eab8a4afb3c [fs] xfs: write unmount record for ro mounts
> 
> Those seem tantalizingly related, but I can't really connect the dots here.
> The first makes sure that we process the unlinked inode list after log
> recovery even for RO mounts, and the 2nd makes sure that when we umount an
> RO mount, we write the unmount record. But in our "poweroff" we're not doing
> log replay at all, as far as I can tell.

Since my comment 75 and comment 76, I've had this thought: what if the difference lies between *unmounting* and *remounting read-only*?

When the problematic fs is used as a data disk, no corruption ensues upon poweroff, regardless of whether we manually offline the LV containing the fs or not -- in either case, the fs is unmounted.

But when the fs is the rootfs, systemd can never unmount it. We can only remount it read-only. That covers both (a) powering off with systemd, and (b) my manual switching to runlevel 1 (rescue.target) and then manually remounting the filesystem read-only. I thought it was "virtually impossible" for "any" fs driver to have separate code paths for unmount vs. ro-remount, but I could be wrong. If ro-remount in the RHEL-7.4 xfs driver misses a cleanup step that unmount contains, that would explain these symptoms.

Comment 90 Laszlo Ersek 2023-08-14 07:32:13 UTC
BTW, if I understand those two downstream commits correctly (5fdca1ac490 / eab8a4afb3c), something seems off to me about XFS's "read-only-ness".

I would expect a read-only filesystem mount (a) to *succeed* when the underlying block device too were read-only, and (b) *not* to change any data on the underlying block device even if the latter were read-write. A corollary would be that it should be safe to just cut power once remount-ro returned successfully (assuming, anyway, that no caching were in place for the underlying block device).

So, apparently, this isn't the case? The xfs driver (in RHEL-7.5) modifies block device contents even when mounting the filesystem read-only right off the bat?

And I suspect this is somehow connected to our symptoms: if we delay writes/cleanups from a read-only-mount to a final unmount, then that will cause problems whenever that last unmount never happens; for example on a rootfs (which, IIUC, we can never unmount).

... Now, I could be entirely wrong about systemd *not* unmounting the rootfs. Perhaps it does switch root again during poweroff, like it pivots during boot (from the initrd to the actual rootfs).

Comment 91 Eric Sandeen 2023-08-14 18:00:48 UTC
Regarding readonly mounts - the generally accepted behavior is that a "readonly mount" prevents /userspace/ from writing to the filesystem, but does not prevent the filesystem from writing to the block device (i.e. replaying the log).  Filesystems do gracefully handle a readonly block device as needed.

So yes, a "mount -o ro" might still cause the filesystem to do (internal) writes such as log recovery to the block device. ext4 has this same behavior.

Regarding poweroff and systemd, I would really like to understand more about the processes the root fs goes through, specifically whether it powers off while mounted RO, or if it does a pivot.

Comment 92 Eric Sandeen 2023-08-15 20:26:37 UTC
I'm baffled - last week I was able to follow the steps in comment #77 and with fresh poweroff cycles create more problematic inodes, but it's inexplicably not working for me today despite multiple tries.

Laszlo, if you can still generate new unlinked inodes as above, then in the rescue environment where you might run xfs_repair -n or xfs_repair -d , can you:

xfs_repair -n to confirm that "new" inodes showed up, make a note of those inode numbers, and then:

xfs_metadump -o /dev/mapper/rhel_bootp--73--213--142-root /boot/root.meta

to create a metadump image ("root.meta") of the filesystem? It should be small-ish, but you can probably compress it further. Then perhaps we can look into the log in that image to see what operations happened to create those new unlinked inodes, how they relate to unmount records, etc.

Comment 93 Eric Sandeen 2023-08-16 16:18:54 UTC
So when I'm able to generate new files in lost+found, they always have *_tmp_t contexts - for tuned or firewalld. Those contexts are set when one of those applications creates a file in a tmp_t directory (like /tmp)

tuned, for example, does create an open & unlinked file when it runs:

[root@localhost ~]# ps aux | grep tuned
root       656  0.0  0.0 115256   904 ?        S    00:02   0:00 /bin/bash /usr/sbin/ksmtuned
root      1708  0.7  0.9 562412 18532 ?        Ssl  00:14   0:00 /usr/bin/python -Es /usr/sbin/tuned -l -P
root      1729  0.0  0.0 112660   944 ttyS0    R+   00:14   0:00 grep --color=auto tuned
[root@localhost ~]# ls -l /proc/1708/fd
total 0
lr-x------. 1 root root 64 Aug 17 00:14 0 -> /dev/null
lrwx------. 1 root root 64 Aug 17 00:14 1 -> socket:[22224]
l-wx------. 1 root root 64 Aug 17 00:14 10 -> pipe:[22238]
l-wx------. 1 root root 64 Aug 17 00:14 11 -> /dev/cpu_dma_latency
lr-x------. 1 root root 64 Aug 17 00:14 13 -> /dev/urandom
lrwx------. 1 root root 64 Aug 17 00:14 18 -> anon_inode:[eventpoll]
lrwx------. 1 root root 64 Aug 17 00:14 2 -> socket:[22224]
l-wx------. 1 root root 64 Aug 17 00:14 3 -> /var/log/tuned/tuned.log
lrwx------. 1 root root 64 Aug 17 00:14 4 -> socket:[22231]
lrwx------. 1 root root 64 Aug 17 00:14 5 -> socket:[22232]
lrwx------. 1 root root 64 Aug 17 00:14 6 -> anon_inode:[eventfd]
lrwx------. 1 root root 64 Aug 17 00:14 7 -> /tmp/ffiFLyqWm (deleted)

In this instance the open/unlinked file looks like this:

[root@localhost ~]# hexdump -C /proc/1708/fd/7
00000000  00 00 00 00 00 00 00 00  43 00 00 00 00 00 00 00  |........C.......|
00000010  49 bb ee be 42 60 2e 7f  00 00 49 ba 10 80 8f 6d  |I...B`....I....m|
00000020  2e 7f 00 00 f8 49 ff e3  68 08 7b 02 00 00 00 00  |.....I..h.{.....|
00000030  30 72 97 5f 2e 7f 00 00  50 08 7b 02 00 00 00 00  |0r._....P.{.....|
00000040  10 70 8f 6d 2e 7f 00 00  79 0f 00 00 00 00 00 00  |.p.m....y.......|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fc0  48 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |H...............|
00000fd0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000

which is remarkably similar to the files in lost+found with the tuned_tmp_t context, so I assume that this is where it came from:

[root@localhost ~]# hexdump -C /lost+found/9695420
00000000  00 00 00 00 00 00 00 00  43 00 00 00 00 00 00 00  |........C.......|
00000010  49 bb ee 8e e1 26 d0 7f  00 00 49 ba 10 50 2e 34  |I....&....I..P.4|
00000020  d0 7f 00 00 f8 49 ff e3  68 68 dd 00 00 00 00 00  |.....I..hh......|
00000030  30 42 36 26 d0 7f 00 00  50 68 dd 00 00 00 00 00  |0B6&....Ph......|
00000040  10 40 2e 34 d0 7f 00 00  79 0f 00 00 00 00 00 00  |.@.4....y.......|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fc0  48 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |H...............|
00000fd0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000
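
A general way to spot such open-but-unlinked files on a running system, not specific to tuned, is to scan /proc; a sketch:

  # list file descriptors whose target has already been unlinked (run as root)
  find /proc/[0-9]*/fd -type l -printf '%p -> %l\n' 2>/dev/null | grep '(deleted)'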

Comment 94 Eric Sandeen 2023-08-22 21:05:35 UTC
I think we have a decent understanding of at least one path that can lead to this problem.

During boot on this RHEL7 release, the root fs is first mounted readonly, and then transitions to read-write.

Because of the (missing) first commit mentioned in comment #87, the readonly mount does not process the unlinked inode list. The filesystem then transitions to read-write, and if it is subsequently cleanly unmounted, we are left with a clean filesystem that requires no log replay but still has the unlinked inodes on disk, which leads to the problem we're now seeing. Nothing will clean up these inodes other than a log replay post-crash, or an xfs_repair.

I'm not yet certain how the poweroff and/or systemctl isolate scenario is leading to similar problems.
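
In shell terms, the first path described above boils down to the following sequence (a conceptual sketch against a scratch device that has a dirty log with inodes on the unlinked list; on the affected 7.4 kernel the unlinked inodes survive the clean unmount):

  mount -o ro /dev/sdb1 /mnt        # log is replayed, but the unlinked list is not processed (missing fix)
  mount -o remount,rw /mnt          # no further log replay happens here
  umount /mnt                       # clean unmount: clean log, yet unlinked inodes remain in the on-disk AGIs
  xfs_repair -n /dev/sdb1           # would now report those leftover unlinked inodes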

Comment 95 Laszlo Ersek 2023-08-25 15:29:39 UTC
sorry about the late response; I've been on PTO, and am returning to
work today:

(In reply to Eric Sandeen from comment #92)
> I'm baffled - last week I was able to follow the steps in comment #77
> and with fresh poweroff cycles create more problematic inodes, but
> it's inexplicably not working for me today despite multiple tries.
>
> Laszlo, if you can still generate new unlinked inodes as above,

That's the thing -- my own repro attempts have been non-deterministic!
*some* poweroff/boot cycles produced new lost inodes, but some others
wouldn't!

Today I seem unable to produce new lost inodes. :(

(In reply to Eric Sandeen from comment #94)
> During boot on this RHEL7 release, the root fs is first mounted
> readonly, and then transitions to read-write.
>
> Because of the (missing) first commit mentioned in comment #87, the
> readonly mount does not process the unlinked inode list, the
> filesystem then transitions to read-write,

Does the transition to read-write (incorrectly) assume that the initial
read-only mount cleans up the unlinked inode list?

Because, I guess, if the transition to read-write cleaned up that list
itself, then there would be no problem. Of course I cannot say whether
that would match the design of XFS. (I do understand now that "no
writes" in a read-only mount is meant for userspace, not for the kernel
and/or underlying block device.)

Comment 96 Eric Sandeen 2023-08-31 17:14:31 UTC
Darrick has posted a patch to the upstream list to try to mitigate this. We'll need to decide whether a bleeding-edge patch is a wise choice for RHEL9 at this point, though ...

[PATCH v2] xfs: load uncached unlinked inodes into memory on demand
https://lore.kernel.org/linux-xfs/87pm338jyz.fsf@doe.com/T/#m1a17ff371b0e8823141d8a80ba0aafc9991834a3

Comment 97 Eric Sandeen 2023-08-31 17:17:33 UTC
(In reply to Laszlo Ersek from comment #95)

> (In reply to Eric Sandeen from comment #94)
> > During boot on this RHEL7 release, the root fs is first mounted
> > readonly, and then transitions to read-write.
> >
> > Because of the (missing) first commit mentioned in comment #87, the
> > readonly mount does not process the unlinked inode list, the
> > filesystem then transitions to read-write,
> 
> Does the transition to read-write (incorrectly) assume that the initial
> read-only mount cleans up the unlinked inode list?

More or less. I don't know what the assumptions were, but the net effect was that crashing w/ unlinked inodes, mounting that fs RO, and then remounting RW, then unmounting would lead to a clean filesystem with unlinked inodes in the on-disk AGIs which will never be resolved until the next log replay or xfs_repair.
 
> Because, I guess, if the transition to read-write cleaned up that list
> itself, then there would be no problem.

Right; but it didn't, because that processing is part of log replay, which doesn't (need to) happen on a RO->RW transition, as it already happened on the RO mount. At the end of the day this was just a bug ...

> Of course I cannot say whether
> that would match the design of XFS. (I do understand now that "no
> writes" in a read-only mount is meant for userspace, not for the kernel
> and/or underlying block device.)

... a bug possibly stemming from that confusion around what an RO mount should and should not do.

Comment 98 RHEL Program Management 2023-09-23 12:03:37 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 99 RHEL Program Management 2023-09-23 12:08:53 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.