Bug 2227180
| Summary: | [Azure] Encryption at host breaks mkfs.xfs | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Klaas Demter <klaas> |
| Component: | xfsprogs | Assignee: | Eric Sandeen <esandeen> |
| Status: | CLOSED MIGRATED | QA Contact: | Filesystem QE <fs-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.8 | CC: | dwysocha, huzhao, klaas, litian, schandle, xuli, xxiong, xzhou, yacao, yuxisun |
| Target Milestone: | rc | Keywords: | MigratedToJIRA |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-09-23 12:03:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | xfs_metadump /dev/sdb1 xfs_metadata.dump (attachment 1980463) | | |
Actual mount error:

[root@hostname ~]# mount /mnt/resource/
mount: /mnt/resource: mount(2) system call failed: Structure needs cleaning.

The command called by cloud-init during boot is:

2023-07-28 08:11:50,351 - subp.py[DEBUG]: Running command ['/usr/sbin/mkfs.xfs', '/dev/sdc1', '-f'] with allowed return codes [0] (shell=False, capture=True)

The mkfs.xfs command itself succeeds in both cases (encryption at host enabled and disabled). I can get the filesystem to mount by forcing log zeroing with xfs_repair -L; a plain xfs_repair run is not enough, as shown below:
[root@hostname ~]# xfs_repair -v /dev/sdc1
Phase 1 - find and verify superblock...
- block cache size set to 753200 entries
Phase 2 - using internal log
- zero log...
totally zeroed log
zero_log: head block 0 tail block 0
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 3
- agno = 2
Phase 5 - rebuild AG headers and trees...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
XFS_REPAIR Summary Fri Jul 28 11:36:25 2023
Phase Start End Duration
Phase 1: 07/28 11:36:22 07/28 11:36:22
Phase 2: 07/28 11:36:22 07/28 11:36:25 3 seconds
Phase 3: 07/28 11:36:25 07/28 11:36:25
Phase 4: 07/28 11:36:25 07/28 11:36:25
Phase 5: 07/28 11:36:25 07/28 11:36:25
Phase 6: 07/28 11:36:25 07/28 11:36:25
Phase 7: 07/28 11:36:25 07/28 11:36:25
Total run time: 3 seconds
done
[root@hostname ~]# mount /mnt/resource/
mount: /mnt/resource: mount(2) system call failed: Structure needs cleaning.
[root@hostname ~]# xfs_repair -v -L /dev/sdc1
Phase 1 - find and verify superblock...
- block cache size set to 753200 entries
Phase 2 - using internal log
- zero log...
totally zeroed log
zero_log: head block 0 tail block 0
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
Phase 5 - rebuild AG headers and trees...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
XFS_REPAIR Summary Fri Jul 28 11:42:31 2023
Phase Start End Duration
Phase 1: 07/28 11:42:28 07/28 11:42:28
Phase 2: 07/28 11:42:28 07/28 11:42:31 3 seconds
Phase 3: 07/28 11:42:31 07/28 11:42:31
Phase 4: 07/28 11:42:31 07/28 11:42:31
Phase 5: 07/28 11:42:31 07/28 11:42:31
Phase 6: 07/28 11:42:31 07/28 11:42:31
Phase 7: 07/28 11:42:31 07/28 11:42:31
Total run time: 3 seconds
done
[root@hostname ~]# mount /mnt/resource/
[root@hostname ~]#
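For context: "Structure needs cleaning" is the userspace rendering of EUCLEAN (error 117) from mount(2); the actual XFS complaint lands in the kernel log. A minimal way to capture it after a failed mount, sketched here with a generic match since device letters can shift between boots (sdc1 vs. sdb1 later in this report):

# try the mount, then pull the XFS messages from the kernel ring buffer
mount /mnt/resource/
dmesg | grep -F 'XFS (' | tail -n 40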
Can you please provide the dmesg output from when the mount fails, as well as an xfs_metadump of the problematic device created immediately after mkfs.xfs? (You can compress the metadump file so that it is hopefully small enough to attach.)

My first thought was that perhaps the encrypted device is not honoring the FALLOC_FL_ZERO_RANGE call that we use for efficient zeroing, but we use that same mechanism when zeroing the log from xfs_repair. The "totally zeroed log" message from repair also indicates that the log was in fact already completely zeroed out before xfs_repair tried to do it again. So we need to know what was actually wrong with the filesystem that caused the mount to fail; the dmesg output and the metadump will hopefully give us what we need.

Created attachment 1980463 [details]
xfs_metadump /dev/sdb1 xfs_metadata.dump
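Eric's FALLOC_FL_ZERO_RANGE hypothesis can also be exercised directly from userspace with fallocate(1), whose --zero-range option issues that same flag. A minimal sketch, assuming a kernel that supports zero-range on block devices (RHEL 8 does); this is destructive, and the device path is the one from this report, used purely for illustration:

# zero the first 1 MiB of the scratch device via FALLOC_FL_ZERO_RANGE
fallocate --zero-range --offset 0 --length 1048576 /dev/sdc1
# read it back; anything other than pure zeros would mean the
# zero-range request was not honored by the encrypted device
dd if=/dev/sdc1 bs=1M count=1 status=none | hexdump -C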
This has to be specific to cloud-init calling it, or to something happening during boot. If I run mkfs.xfs myself in the running system, it works and I can mount the result.

dmesg of the failed mount (my first paste here was truncated; this is the complete output):

[Jul28 14:23] XFS (sdb1): Mounting V5 Filesystem
[ +0.003285] XFS (sdb1): totally zeroed log
[ +6.971766] XFS (sdb1): Internal error head_block >= tail_block || head_cycle != tail_cycle + 1 at line 1656 of file fs/xfs/xfs_log_recover.c. Caller xlog_clear_stale_blocks+0x177/0x1c0 [xfs]
[ +0.009456] CPU: 2 PID: 8569 Comm: mount Kdump: loaded Not tainted 4.18.0-477.15.1.el8_8.x86_64 #1
[ +0.000004] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[ +0.000001] Call Trace:
[ +0.000004] dump_stack+0x41/0x60
[ +0.000007] xfs_corruption_error+0x8b/0x90 [xfs]
[ +0.000068] ? xlog_clear_stale_blocks+0x177/0x1c0 [xfs]
[ +0.000062] ? xlog_verify_head+0xd4/0x190 [xfs]
[ +0.000060] xlog_clear_stale_blocks+0x1a1/0x1c0 [xfs]
[ +0.000061] ? xlog_clear_stale_blocks+0x177/0x1c0 [xfs]
[ +0.000060] xlog_find_tail+0x20f/0x350 [xfs]
[ +0.000060] xlog_recover+0x2b/0x160 [xfs]
[ +0.000060] xfs_log_mount+0x28c/0x2b0 [xfs]
[ +0.000060] xfs_mountfs+0x45e/0x8e0 [xfs]
[ +0.000063] xfs_fs_fill_super+0x36c/0x6a0 [xfs]
[ +0.000060] ? xfs_mount_free+0x30/0x30 [xfs]
[ +0.000060] get_tree_bdev+0x18f/0x270
[ +0.000006] vfs_get_tree+0x25/0xc0
[ +0.000003] do_mount+0x2e9/0x950
[ +0.000005] ? memdup_user+0x4b/0x80
[ +0.000002] ksys_mount+0xbe/0xe0
[ +0.000003] __x64_sys_mount+0x21/0x30
[ +0.000003] do_syscall_64+0x5b/0x1b0
[ +0.000004] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ +0.000003] RIP: 0033:0x7f7985dc435e
[ +0.000003] Code: 48 8b 0d 2d 4b 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fa 4a 38 00 f7 d8 64 89 01 48
[ +0.000002] RSP: 002b:00007ffc841c8358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ +0.000003] RAX: ffffffffffffffda RBX: 00005645dd8de5d0 RCX: 00007f7985dc435e
[ +0.000001] RDX: 00005645dd8e0970 RSI: 00005645dd8e1510 RDI: 00005645dd8e1530
[ +0.000001] RBP: 00007f7986c35184 R08: 0000000000000000 R09: 00005645dd8d9016
[ +0.000002] R10: 00000000c0ed0000 R11: 0000000000000246 R12: 0000000000000000
[ +0.000001] R13: 00000000c0ed0000 R14: 00005645dd8e1530 R15: 00005645dd8e0970
[ +0.000002] XFS (sdb1): Corruption detected. Unmount and run xfs_repair
[ +0.003330] XFS (sdb1): failed to locate log tail
[ +0.000001] XFS (sdb1): log mount/recovery failed: error -117
[ +0.000149] XFS (sdb1): log mount failed

The xfs_metadata dump is attached (xz compressed); I also attached a full dd image of the complete filesystem to case 03572027.

And just to make sure I get this across properly: this is not a one-off example. It happens 100% of the time, on hundreds of RHEL 8.8 VMs; more precisely, 100% of the cloud-init-initiated mkfs.xfs commands during boot produce an unmountable filesystem.

Klaas, I'm going to let support work with you from here on this; they scale better than I do for initial triage of customer problems. My gut feeling is that something is misbehaving with the block device, not mkfs.xfs, but I'll let them see if they can work it out. Thanks, -Eric

I will note that after restoring the provided metadump to a filesystem image with xfs_mdrestore, it mounts (via loopback) without problem for me on the same RHEL 8 kernel version. That likely indicates a problem with the block device or environment, not with mkfs.xfs or the filesystem itself.

Microsoft got back to me with some more information; it seems Red Hat is not the only vendor affected:

Ubuntu 23:       6.2.0-1009-azure              --> fails
Ubuntu 22:       5.15.0-1037-azure             --> fails
Ubuntu 20:       5.15.0-1042-azure             --> ok
Ubuntu 18:       5.4.0-1109-azure              --> ok
Red Hat 9.2:     5.14.0-284.18.1.el9_2.x86_64  --> fails
Red Hat 8.8:     4.18.0-477.10.1.el8_8.x86_64  --> fails
Red Hat 7.7 SAP: 3.10.0-1062.52.2.el7.x86_64   --> ok
Red Hat 7.6 raw: 3.10.0-957.72.1.el7.x86_64    --> ok
SUSE 15 SP4:     5.14.21-150400.14.46-azure    --> fails
SUSE 12 SP5:     4.12.14-16.139-azure          --> ok

I am guessing this needs a very in-depth analysis that includes the Microsoft team that owns the encryption at host feature.

I have convinced Red Hat support to open a collaboration request with Microsoft via TSAnet. Let's see if we can finally get some results.
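For reference, the metadump verification Eric describes a few comments up corresponds roughly to the following sketch (the target image path and mount point are illustrative; a metadump restores metadata only, not file contents):

# restore the attached metadump to a sparse image file, then loop-mount it
xfs_mdrestore xfs_metadata.dump /tmp/restored.img
mount -o loop /tmp/restored.img /mnt/test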
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of two footprints next to it and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.
Description of problem:
After enabling encryption at host, the mkfs.xfs called by cloud-init no longer produces a filesystem that is mountable.

Version-Release number of selected component (if applicable):
xfsprogs-5.0.0-11.el8_8.x86_64
kernel 4.18.0-477.15.1.el8_8.x86_64

How reproducible:
After booting a current RHEL 8.8 system that was deallocated, cloud-init initializes the ephemeral disk: it calls mkfs.xfs on a partition it created according to my cloud-init user data:

#cloud-config
disk_setup:
  ephemeral0:
    table_type: gpt
    layout: [66, [33, 82]]
    overwrite: true
fs_setup:
  - device: ephemeral0.1
    filesystem: xfs
    overwrite: true
  - device: ephemeral0.2
    filesystem: swap
mounts:
  - ["ephemeral0.1", "/mnt/resource"]
  - ["ephemeral0.2", "none", "swap", "sw", "0", "0"]

Steps to Reproduce:
1. Create a VM with that user data in Azure.

Actual results:
Depending on the encryption at host setting, the filesystem either mounts or does not.

Expected results:
Works with encryption at host both enabled and disabled.

Additional info:
I use the pay-as-you-go image, fully updated to all released errata as of today:

"imageReference": {
    "id": "",
    "offer": "RHEL",
    "publisher": "RedHat",
    "sku": "8-lvm-gen2",
    "version": "latest"
}

Attached Red Hat Support case that includes sos reports of both states: 03572027
Microsoft Support Case: 2307280050000960
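For anyone reproducing this: whether encryption at host is enabled on a given VM can be checked with the Azure CLI, roughly as follows (a sketch; resource group and VM names are placeholders, and the property path assumes a current az release):

# query the VM's encryption-at-host setting (returns true/false/null)
az vm show --resource-group <rg> --name <vm> --query "securityProfile.encryptionAtHost"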