Bug 2155253
Summary: Cannot recover a host since disk layout recreation script fails

| Field | Value | Field | Value |
|---|---|---|---|
| Product | Red Hat Enterprise Linux 9 | Reporter | Roman Safronov <rsafrono> |
| Component | rear | Assignee | Pavel Cahyna <pcahyna> |
| Status | CLOSED MIGRATED | QA Contact | CS System Management SST QE <rhel-cs-system-management-subsystem-qe> |
| Severity | high | Docs Contact | |
| Priority | unspecified | | |
| Version | 9.1 | CC | dchinner, dranck, ekuris, esandeen, pcahyna |
| Target Milestone | rc | Keywords | MigratedToJIRA |
| Target Release | --- | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | | |
| | 2210773 (view as bug list) | Environment | |
| Last Closed | 2023-09-22 03:23:51 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | 1823324 | | |
| Bug Blocks | 2210773 | | |
Description: Roman Safronov, 2022-12-20 14:15:02 UTC
Hello, can you please provide the ReaR layout (the files under /var/lib/rear/layout, especially disklayout.conf)? Can you please also attach the .cfg file found under /var/lib/rear/layout/lvm/? Is the problem reproducible on other systems? If so, can you please attach the full debug log from "rear savelayout" or "rear mkrescue" on a system where the problem occurs? (The debug log is produced by the -D flag, so the complete command is "rear -D savelayout", and the log is then found under /var/log/rear.) If you still have the system where the backup was produced, can you please provide the output of the following command?

```
lvm lvs --separator=: --noheadings --units b --nosuffix -o origin,lv_name,vg_name,lv_size,lv_layout,pool_lv,chunk_size,stripes,stripe_size,seg_size
```

According to my preliminary investigation, the problem is here:

```
Volume group "vg" has insufficient free space (16219 extents): 16226 required.
```

and the errors that follow are not relevant. The problem may have something to do with lv_thinpool being almost as big as the whole VG, but there should still be some space available.

Please also try, before running "rear recover", to manually change the size of lv_thinpool to 68027416576b, i.e. change the line

```
lvmvol /dev/vg lv_thinpool 68056776704b thin,pool chunksize:65536b
```

in /var/lib/rear/layout/disklayout.conf to

```
lvmvol /dev/vg lv_thinpool 68027416576b thin,pool chunksize:65536b
```

I don't know whether it is related, but I see one more suspicious message at the very beginning: the configuration of one of the logical volumes is not supported by vgcfgrestore, see below:

```
...
Comparing disks
Device vda has expected (same) size 68719476736 bytes (will be used for 'recover')
Disk configuration looks identical
Proceed with 'recover' (yes) otherwise manual disk layout configuration is enforced (default 'yes' timeout 30 seconds)
yes
User confirmed to proceed with 'recover'
Layout 'thin,sparse' of LV 'lv_audit' in VG '/dev/vg' not supported by vgcfgrestore   <--- THIS LINE
Start system layout restoration.
Disk '/dev/vda': creating 'gpt' partition table
Disk '/dev/vda': creating partition number 1 with name ''ESP''
Disk '/dev/vda': creating partition number 2 with name ''BSP''
Disk '/dev/vda': creating partition number 3 with name ''boot''
Disk '/dev/vda': creating partition number 4 with name ''root''
Disk '/dev/vda': creating partition number 5 with name ''vda5''
Disk '/dev/vda': creating partition number 6 with name ''growvols''
Creating LVM PV /dev/vda4
Creating LVM PV /dev/vda6
Creating LVM VG 'vg'; Warning: some properties may not be preserved...
Creating LVM volume 'vg/lv_thinpool'; Warning: some properties may not be preserved...
```
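For illustration only, a minimal sketch of how the disklayout.conf edit suggested above could be applied from the rescue shell before running "rear recover"; the sed expression is hypothetical and simply swaps the two sizes quoted above:

```
# Hypothetical sketch: shrink lv_thinpool in ReaR's saved layout so the VG has
# room left for the thin pool's metadata volume. Sizes are the ones quoted above.
sed -i 's|lvmvol /dev/vg lv_thinpool 68056776704b|lvmvol /dev/vg lv_thinpool 68027416576b|' \
    /var/lib/rear/layout/disklayout.conf
# Verify the edit before starting the recovery:
grep '^lvmvol /dev/vg lv_thinpool' /var/lib/rear/layout/disklayout.conf
```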
The disk layout recreation script failed. See below some of the requested outputs:

```
[cloud-admin@controller-0 ~]$ sudo lvm lvs --separator=: --noheadings --units b --nosuffix -o origin,lv_name,vg_name,lv_size,lv_layout,pool_lv,chunk_size,stripes,stripe_size,seg_size
:lv_audit:vg:1199570944:thin,sparse:lv_thinpool:0:0:0:1199570944
:lv_home:vg:1249902592:thin,sparse:lv_thinpool:0:0:0:1249902592
:lv_log:vg:3250585600:thin,sparse:lv_thinpool:0:0:0:3250585600
:lv_root:vg:11299454976:thin,sparse:lv_thinpool:0:0:0:11299454976
:lv_srv:vg:10049552384:thin,sparse:lv_thinpool:0:0:0:10049552384
:lv_thinpool:vg:68056776704:thin,pool::65536:1:0:68056776704
:lv_tmp:vg:1249902592:thin,sparse:lv_thinpool:0:0:0:1249902592
:lv_var:vg:39774584832:thin,sparse:lv_thinpool:0:0:0:39774584832
```

Changing the size of lv_thinpool to 68027416576b did not help; the error is similar, just with different numbers:

```
+++ lvm lvcreate -y --chunksize 65536b --type thin-pool -L 68027416576b --thinpool lv_thinpool vg
  Thin pool volume with chunk size 64.00 KiB can address at most <15.88 TiB of data.
  Insufficient free space: 16235 extents needed, but only 16219 available
```

I think I know what's wrong. The command

```
lvm lvcreate -y --chunksize 65536b --type thin-pool -L 68056776704b --thinpool lv_thinpool vg
```

is trying to create a thin pool of exactly the same size as in the original system. The size of the metadata volume is chosen automatically, but on the original system it was only 8 MB, and the automatically chosen new size is larger. As there was very little space left in the VG of the old system, the larger metadata volume no longer leaves enough room in the VG for the data volume.

Can you please repeat the recovery process, and when it asks you to confirm or edit the /var/lib/rear/layout/diskrestore.sh script, choose to edit it and change the line

```
lvm lvcreate -y --chunksize 65536b --type thin-pool -L 68056776704b --thinpool lv_thinpool vg
```

to

```
lvm lvcreate -y --chunksize 65536b --poolmetadatasize 8M --type thin-pool -L 68056776704b --thinpool lv_thinpool vg
```

(This has to be done before the script fails for the first time, so you will need to do a manual recovery where ReaR asks you for confirmation; currently you seem to be launching it in an unattended way. And by "repeat the recovery process" I mean from the start, i.e. from the boot of the rescue medium.)

I tried what was specified in comment 19. The command passed, but the diskrestore.sh script still fails, see the attached rear-controller-0.log_20221222.

This seems to be an unrelated problem. I tried the problematic command

```
mkfs.xfs -f -m uuid=23ce7347-fce3-48b4-9854-60a6db155b16 -i size=512 -d agcount=400 -s size=512 -i attr=2 -i projid32bit=1 -m crc=1 -m finobt=1 -b size=4096 -i maxpct=25 -d sunit=128 -d swidth=128 -l version=2 -l sunit=128 -l lazy-count=1 -n size=4096 -n version=2 -r extsize=4096 /dev/mapper/rhel_kvm--08--guest09-lv_srv
```

and it dumps core for me as well, so it is easily reproducible. According to the assertion, it does not like "-d agcount=400". Indeed, when I change "-d agcount=400" to "-d agcount=40", the command passes. Now the question is how agcount=400 got there. Can you please provide the content of /var/lib/rear/layout/xfs/vg-lv_srv.xfs? I suppose it will also have "agcount=400". Assuming this is the case, the question is how it got there. The file seems to be merely the output of "xfs_info /srv". If that's the case, how could /srv have been created with agcount=400 if mkfs.xfs rejects this value?
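Returning to the thin-pool workaround above, a hedged sketch of how one might confirm the original pool's small metadata volume and of the edited lvcreate line; lv_metadata_size is a standard LVM report field, and the lvcreate invocation is the edited line quoted above:

```
# On the original system: confirm the thin pool's metadata LV is only ~8 MB,
# which the automatically chosen default later fails to reproduce.
lvm lvs --units b --nosuffix -o lv_name,lv_size,lv_metadata_size vg

# In the edited diskrestore.sh: pin the metadata size explicitly so lvcreate
# does not pick a larger default that no longer fits in the nearly full VG.
lvm lvcreate -y --chunksize 65536b --poolmetadatasize 8M --type thin-pool \
    -L 68056776704b --thinpool lv_thinpool vg
```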
Has the VM in question been upgraded from an earlier version of RHEL? I am thinking that maybe the filesystem was created with an older version of mkfs.xfs that allowed this, and the assertion was added to the code later. This could also explain the thin pool problem, because a similar question arises: how could the thin pool have been created with such a small metadata volume, when the default is a larger metadata volume? Maybe it was created when the default was different, and new LVM with the same parameters now creates a different layout? Or did you provide the (small) metadata volume size manually when creating the volume for the first time?

Contents of /var/lib/rear/layout/xfs/vg-lv_srv.xfs:

```
meta-data=/dev/mapper/vg-lv_srv  isize=512    agcount=400, agsize=6144 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1
data     =                       bsize=4096   blocks=2453504, imaxpct=25
         =                       sunit=16     swidth=16 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1872, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
```

Regarding the source of agcount=400, I am not sure; I am just using an environment deployed by CI. IIUC, the OpenStack nodes are provisioned using the overcloud-hardened-uefi-full.raw image, which has a pre-defined disk layout.

The manual page states that agcount and agsize are mutually exclusive. I tried to use your value of agsize and let the command deduce agcount:

```
mkfs.xfs -f -m uuid=23ce7347-fce3-48b4-9854-60a6db155b16 -i size=512 -d agsize=6144b -s size=512 -i attr=2 -i projid32bit=1 -m crc=1 -m finobt=1 -b size=4096 -i maxpct=25 -d sunit=128 -d swidth=128 -l version=2 -l sunit=128 -l lazy-count=1 -n size=4096 -n version=2 -r extsize=4096 /dev/mapper/rhel_kvm--08--guest09-lv_srv
```

This passes. The resulting filesystem has these parameters:

```
meta-data=/dev/mapper/rhel_kvm--08--guest09-lv_srv isize=512 agcount=399, agsize=6144 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1
data     =                       bsize=4096   blocks=2451456, imaxpct=25
         =                       sunit=16     swidth=16 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
```

Note that it is using agcount=399. I also noticed that the size of the log section is different, so I tried to match it by adding "-l size=1872b":

```
mkfs.xfs -f -m uuid=23ce7347-fce3-48b4-9854-60a6db155b16 -i size=512 -d agsize=6144b -s size=512 -i attr=2 -i projid32bit=1 -m crc=1 -m finobt=1 -b size=4096 -i maxpct=25 -d sunit=128 -d swidth=128 -l version=2 -l sunit=128 -l lazy-count=1 -l size=1872b -n size=4096 -n version=2 -r extsize=4096 /dev/mapper/rhel_kvm--08--guest09-lv_srv
```

The result has:

```
meta-data=/dev/mapper/rhel_kvm--08--guest09-lv_srv isize=512 agcount=399, agsize=6144 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1
data     =                       bsize=4096   blocks=2451456, imaxpct=25
         =                       sunit=16     swidth=16 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1872, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
```

so it is still a bit different, and agcount=399. My conclusion is that it is not feasible to match 100% of the parameters of the original filesystem in the recreated filesystem.
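As a hedged illustration of the comparison above (assuming a recent xfsprogs where xfs_info accepts an unmounted block device), the saved geometry and the recreated one can be diffed directly; exact equality is not expected:

```
# Compare ReaR's saved XFS geometry with the geometry mkfs.xfs actually produced.
# Paths are the ones used elsewhere in this thread; differences in agcount and
# log size are expected, per the discussion above.
xfs_info /dev/mapper/rhel_kvm--08--guest09-lv_srv > /tmp/recreated-lv_srv.xfs
diff /var/lib/rear/layout/xfs/vg-lv_srv.xfs /tmp/recreated-lv_srv.xfs || true
```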
I am not sure why; maybe your image was created using a different version of mkfs.xfs, and for some reason forcing it to match agcount triggers an assertion while matching agsize works better. This is in some sense analogous to the LVM problem that we discussed first: the combination of parameters deduced from the original layout does not work 100% when creating the new layout. Are your images going to be used by customers, or are they produced only for internal use? BTW, I found an existing bug for handling a very small thin pool metadata volume size: https://bugzilla.redhat.com/show_bug.cgi?id=2149586

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug. This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information. To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of two footprints next to it and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567. In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.