Bug 1170803
Summary: calls e2fsck on all ext volumes, provides no status indicator, and hangs indefinitely if e2fsck doesn't exit
Product: [Fedora] Fedora
Component: python-blivet
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Version: 26
Hardware: All
OS: Linux
Reporter: Leslie Satenstein <lsatenstein>
Assignee: Vratislav Podzimek <vpodzime>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: amulhern, awilliam, blivet-maint-list, bugzilla, bugzilla, cra, cristian.ciupitu, djuran, dlehman, esandeen, fabrice, g.kaviyarasu, jan.public, jansen, jcapik, jeder, jonathan, josef, j, kparal, kzak, lsatenstein, marmalodak, marmarek, me, m_kretzschmar, mrmazda, oliver, pschindl, robatino, samuel-rhbugs, vanmeeuwen+fedora, vpodzime
Keywords: CommonBugs, Reopened
Whiteboard: https://fedoraproject.org/wiki/Common_F25_bugs#anaconda-fsck-slow AcceptedBlocker
Doc Type: Bug Fix
Type: Bug
Last Closed: 2017-09-26 01:41:19 UTC
Bug Blocks: 1396702
Description
Leslie Satenstein
2014-12-04 22:11:17 UTC
Same lockup problem with the Workstation Anaconda version (probably the same version). Cannot reset hostname. Cannot set timezone or select installation disk. With command line prompt, system was waiting for anaconda. Was able to kill anaconda and do one restart. The second attempt with anaconda also locked up. Quite prepared to test again with newer RC6.

Anaconda RC5 locks up on an attempt to change the hostname (wired access). I hope this version is not the one for general release. Shasum checking shows no download error. Self test indicates no error. Was hoping to test a version of Anaconda with no lockups anywhere.

Reminder: same problem with Workstation and net-install versions.

I am reporting this problem because it occurs using a wired connection. When I retest using a computer with wi-fi, Anaconda does not lock up. With wired connection, one cannot correct the timezone or exit from the network hub after setting a hostname. With the wireless connection, the default hostname is random, with faulty characters, and requires a manual correction.

Please test with wired connection: did not work for me.
Please test with wireless connection: yes, worked for me.
Tested with network-server version and with live-Workstation version. Tested with two versions of RC5.

I tested a wired connection and can't reproduce any of the things you've reported. Logs from /tmp would help, so would journalctl -b -l -o short-monotonic. If you can't get to a shell for these things, then there are problems deeper than just the installer that implicate kernel and/or hardware.

Created attachment 965493 [details]
RC5 /tmp as tar file 1170803.tar
This tar file is created post anaconda lockup within the Wired spoke
some info.
CPU = 100%.
the tarfile contains all the contents of the /tmp (all logs).
I was able to return from the shell to GNOME, and somehow was able to verify that I had copied the files to a secondary flash drive.
Anaconda appeared to have restarted. I had changed the keyboard to ca (French Canada). That was shown as the default, but on return it showed as US (switched back). Is Anaconda in there twice?
One feedback. The help button on the wireless setup page eventually opened up, perhaps after 90 seconds. (Note the 100% cpu)
I will follow up with individual log files in case the tar file does not arrive cleanly
Created attachment 965494 [details]
anaconda.log
anaconda.log
Created attachment 965495 [details]
ifcfg.log
Created attachment 965496 [details]
program.log
Created attachment 965497 [details]
sensitive-info.log
Created attachment 965498 [details]
storage.log
other logs = 0 bytes
The last entry in program.log is e2fsck so it seems like anaconda is waiting for it to complete.

1. Please reboot from install media.
2. Use blkid to confirm the device for fs uuid 0594544c-dc98-4ab7-af14-0840131a2ca1, volume name SeagateExt3. Previously it was sdg2 but this isn't guaranteed between reboots.
3. Run e2fsck -f -p -C 0 /dev/sdXY

What results do you get? If this comes up clean and completes, then you'll need to reproduce the freeze, find the PID for anaconda, and run 'pstack PID' and attach the results as a file. I'm not sure if pstack comes on all install media, or if you'll need to use live media and yum install it.

Hi Chris,

Sorry for the delay in responding. I was looking through my emails for a Bugzilla email showing activity on this bug number. When you indicated SeagateExt3, that was my 2 terabyte USB external backup. I unplugged it, and retested. There is no problem any longer. Since I will do the install without the external USB backup plugged in, I know it will succeed. So, I apologize for causing concern at the last minute.

BTW, while that e2fsck check was taking place CPU was 101% (on dual core). When I saw that, I wrongly assumed a tight loop in code. For my own curiosity, I will try the RC5 beta again with the drive plugged in and give it a half hour to run through the check.

I believe that the release notes should be updated to indicate potential delays as I experienced them with Anaconda. Please redirect to the release note person responsible (Pete Travis).

Anaconda is running resize2fs -P and then e2fsck -f -p -C 0 on every ext volume, before I've even gotten past the language selection screen. It's fine for it to collect minimum size info with resize2fs -P for all ext volumes, but I think it's inappropriate to run e2fsck on everything. My case:

INFO program: Running... e2fsck -f -p -C 0 /dev/sda1
INFO program: Running... e2fsck -f -p -C 0 /dev/loop3
INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/live-base
INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/live-osimg-min

That's recorded while I'm still at the Welcome+Language menu, so sda isn't even a chosen target for installation, yet fsck is running on it? And I have no opt out? And I get no status report if it hangs? The fsck on the other three volumes fails with exit code other than 0 because they're all busy at the time the e2fsck was issued. So that's also pointless. I think this needs to behave better.

If there is a way to get the revised anaconda out before tomorrow, or if there is going to be an emergency patch and you would like me to test with my system, I would like to know by email. To do so, update this bug report and as I am on the CC list, I will know to fetch the download to test. Please put the retrieval URL for wget download in this response.

Leslie

This is a sufficiently pernicious bug, I'm proposing as an F22 alpha blocker.

- It's completely reasonable for users to have many, or a large, ext volume(s).
- It's unreasonable for the installer to issue e2fsck on all volumes when they aren't related to the installation at all. The whole point of journaled file systems is to avoid potentially hours or days long fsck.

e2fsck -f -p should only be run on volumes that are explicitly chosen for resize, or reuse (like /home), and then there needs to be a status UI for that during the installation phase so we know what's going on.

"When using a dedicated installer image, the installer must be able to complete an installation using the text, graphical and VNC installation interfaces. This criterion covers showstopper bugs in the installer for which there isn't any other specific criterion: obviously, it can't 'complete an installation' if there's a showstopper."

I believe this comes from ce92b550a687547683bd36fe9419fb08894207fa , which was added to fix bug #1162215 . Note the explanation there:

Run fsck before obtaining minimum filesystem size. (#1162215)

resize2fs -P now requires an e2fsck -f first. The lack of a real minimum size left us with no useful minInstanceSize and allowed users to attempt a shrink down to any size, which led to failures. If we have a currentSize, set filesystem.

Leslie: there are not going to be any emergency fixes. When go/no-go finishes, that is it, the release is done. Nothing will change in the release images after the Go decision.

*** Bug 1169542 has been marked as a duplicate of this bug. ***

Adam, I thought so. I sent a note to Pete Travis to update the F21 release note. I am not certain whether the note, too, is blocked from updates. (For example, other distributions have corrected release notes after go-live.) If Pete is not the right person, please follow up. It is important to update the wiki before tomorrow. I am not the only one to keep backup drives plugged in 24/7.

(In reply to Adam Williamson (Red Hat) from comment #17)
> I believe this comes from ce92b550a687547683bd36fe9419fb08894207fa , which
> was added to fix bug #1162215.

Yep, too bad as bcl predicted this bug in 1162215c21 if the check were applied to all volumes, not just user selected target devices.

Draft commonbugs text:

If the system has any combination of slow, many, or large, ext[234] volumes, the installer might hang. The hang could begin at any point from the Welcome & Language selection menu onward. This is due to the installer running e2fsck on all available ext volumes. Once the installer is launched, users are advised to wait for e2fsck instances to complete. As a workaround prior to launching the installer, users can physically detach devices unrelated to the installation.

I think commonbugs is the best place for this info, thanks for the report and wiki entry.

*** Bug 1172324 has been marked as a duplicate of this bug. ***

I think this actually is a bug in e2fsprogs. blivet just tries to get information about existing storage devices and formats (disks, partitions, file systems,...) and anaconda needs this information before the user does any disk/device selection. And with the follow-up patch for bug #1162215 (c0cccb8ee "Try to get FS info first before doing an FS check") no fsck should be run unless needed. That means that:

1) if you have a damaged or uncleanly unmounted file system on your machine before the installation, fsck is run, but it's probably not a good idea to run an installation on such a machine
2) if your filesystems are clean and *everything works as expected*, no fsck should be run

And the thing that is not working as expected here is resize2fs requiring fsck to tell blivet the minimum size of the filesystem. Or does anybody disagree with this? Please let me know because I plan to reassign this bug to e2fsprogs.

Well, in comment 14 I show log entries when the UI is not further than the Welcome/Language menu, and it's running e2fsck -f -p on every ext volume available including the ext4 live system images. They can't all need fsck or we have other problems. When I do an e2fsck -f -p on an 8TB ext4 volume that's freshly created and has no files in it, it takes ~45 seconds. So 30-60 minutes for a large fs with a bunch of files in it, even if clean, doesn't seem outlandish.

I'll add Eric Sandeen to the bug, see what he thinks. Eric, maybe start at comment 21's draft text for a shorter summary of this bug rather than wading through the whole thing.

<-- SNIP -->
> And the thing that is not working as expected here is resize2fs requiring
> fsck to tell blivet the minimum size of the filesystem. Or does anybody
> disagree with this? Please let me know because I plan to reassign this bug
> to e2fsprogs.
I agree that this requirement seems wrong. Also, I know that it has not always been the case, or perhaps not for all situations. For resize2fs version 1.42.8 (the version on my desktop), on the tests that I run on filesystems on loop devices, e2fsck is not required to be run to get resize2fs -P to return an apparently meaningful value.
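
The experiment described in that comment, asking `resize2fs -P` for a minimum size without a prior e2fsck, can be repeated with a small script. This is only a sketch and not blivet code: both helper names are hypothetical, and it assumes resize2fs prints a line of the form `Estimated minimum size of the filesystem: N` (N in filesystem blocks), which may differ between e2fsprogs versions.

```python
import re
import subprocess

# Assumed output format of `resize2fs -P`; N is a count of filesystem blocks.
_MIN_SIZE_RE = re.compile(r"Estimated minimum size of the filesystem:\s*(\d+)")


def parse_min_blocks(output):
    """Return the minimum size in blocks from resize2fs -P output, or None."""
    match = _MIN_SIZE_RE.search(output)
    return int(match.group(1)) if match else None


def query_min_blocks(device):
    """Run `resize2fs -P` on device; None means no estimate was produced.

    Hypothetical helper: it needs root and should only be pointed at a
    filesystem you are allowed to inspect, such as one on a loop device.
    """
    result = subprocess.run(
        ["resize2fs", "-P", device],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return None
    return parse_min_blocks(result.stdout)
```

A `None` result is the case where a caller would have to decide whether a check is worth running at all, rather than running one up front on every volume.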
If you are unconditionally running "e2fsck -f -p" you are forcing a full check, and you get to wait for it. If you're running it on every disk on every machine in the Fedora universe, you get to wait for every single possible attached drive. So, not an e2fsprogs bug.

resize2fs does now require a check prior to printing the minimum size, if the filesystem is in error or has a last-check-time set which has expired, because on corrupted filesystems the calculation could hang:

commit 7d7a8fe4ea4d9162977a1a6b32c4737d9ca9dd1f
Author: Eric Sandeen <sandeen>
Date: Mon Jun 9 09:52:19 2014 -0400

    resize2fs: don't attempt to calculate minimum size on fs with errors

...
+       if (!(mount_flags & EXT2_MF_MOUNTED)) {
+               if (!force && ((fs->super->s_lastcheck < fs->super->s_mtime) ||
+                              (fs->super->s_state & EXT2_ERROR_FS) ||
+                              ((fs->super->s_state & EXT2_VALID_FS) == 0))) {
+                       fprintf(stderr,
+                               _("Please run 'e2fsck -f %s' first.\n\n"),
+                               device_name);
+                       exit(1);
+               }
+       }

so there are filesystems out there which will require a check prior to resize2fs -P, but certainly not *all* of them. You could attempt resize2fs -P, and if that fails w/ the above message, run e2fsck if you still really wanted to, alert the user, etc.

As always, if Anaconda or associated bits has filesystem questions, we're happy to help - wish I'd known about this earlier.

-Eric

Oh, damn. Just reread the above check.

(fs->super->s_lastcheck < fs->super->s_mtime)

does indeed require fsck if it's mounted after the last check. hohum.

-Eric

Still, if we'd known about this requirement/problem in the installer, I think we probably could have relaxed that check. :( We didn't anticipate a workflow which ran resize2fs -P on many filesystems in a row, I guess.

-Eric

I've sent a patch upstream to drop the fsck requirement if we're only printing the minimum size.

(In reply to Eric Sandeen from comment #30)
> I've sent a patch upstream to drop the fsck requirement if we're only
> printing the minimum size.

Thanks!
Are you okay with taking this bug? python-blivet (now) does exactly what you suggested in comment #27 -- attempt resize2fs -P, if that fails, e2fsck.

Sure; the patch is merged upstream now, according to Ted (though not pushed yet, apparently).

Sadly my suggestion in #27 won't really work too well, it'll almost always fail. So F20 is just kind of doomed for this, I guess. Wish I'd known about the problem earlier, but oh well!

Part of this bug seems to be that the user gets no feedback on a long fsck action; I don't know if it should be cloned to deal with that, if it's possible.

-Eric

(In reply to Eric Sandeen from comment #32)
> Sure; the patch is merged upstream now, according to Ted (though not pushed
> yet, apparently).

Thanks!

> Sadly my suggestion in #27 won't really work too well, it'll almost always
> fail. So F20 is just kind of doomed for this, I guess. Wish I'd known
> about the problem earlier, but oh well!

Yeah, what I meant by my comment is that blivet does what's right here.

> Part of this bug seems to be that the user gets no feedback on a long fsck
> action; I don't know if it should be cloned to deal with that, if it's
> possible.

That's a good point. I'm going to create a separate bug for it, though.

Discussed at today's blocker review meeting [1]. Rejected as a blocker. This bug doesn't clearly violate any criteria and looks to be getting worked on either way. Please repropose if it's found to violate another criterion.

[1] http://meetbot.fedoraproject.org/fedora-blocker-review/2015-01-07/

When can I test an F22 beta, so I can close this bug(let)?

e2fsprogs-1.42.12-3.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/e2fsprogs-1.42.12-3.fc21

e2fsprogs-1.42.12-3.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/e2fsprogs-1.42.12-3.fc20

Package e2fsprogs-1.42.12-3.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing e2fsprogs-1.42.12-3.fc21'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-2511/e2fsprogs-1.42.12-3.fc21
then log in and leave karma (feedback).

e2fsprogs-1.42.12-3.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.

e2fsprogs-1.42.12-3.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.

Still an issue installing F22. Anaconda hangs while it is running e2fsck on each filesystem, with no indication what it's doing.

With what version of e2fsprogs?

(In reply to Andy Campbell from comment #41)
> Still an issue installing F22. Anaconda hangs while it is running e2fsck on
> each filesystem, with no indication what it's doing.

Please attach /tmp/program.log

e2fsprogs, as shipped with F22 ....

[liveuser@localhost ~]$ rpm -qa | grep e2fsprogs
e2fsprogs-1.42.12-4.fc22.x86_64
e2fsprogs-libs-1.42.12-4.fc22.x86_64

Uploading requested logs. All I did was boot the F22 Workstation live image from a USB stick and select install. I waited for 30 mins or so for the fscks to complete; the PC was cleanly shut down before booting off of the USB stick.

[liveuser@localhost ~]$ grep e2fsck /tmp/program.log
13:17:37,177 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/vg_neotrantor-entertainment
13:17:41,118 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/vg_neotrantor-software
13:17:44,939 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/vg_neotrantor-photos
13:17:46,912 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/vg_neotrantor-stuff
18:17:48,752 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/vg_neotrantor-VirtMch2
18:23:50,320 INFO program: Running... e2fsck -f -p -C 0 /dev/sda1
18:23:50,399 INFO program: Running... e2fsck -f -p -C 0 /dev/sda2
18:23:50,746 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/fedora-mnt_VirtMch
18:23:58,089 INFO program: Running... e2fsck -f -p -C 0 /dev/sdc5
18:25:40,583 INFO program: Running... e2fsck -f -p -C 0 /dev/sdd1
18:34:31,307 INFO program: Running... e2fsck -f -p -C 0 /dev/loop3
18:34:31,313 INFO program: e2fsck: Cannot continue, aborting.
18:34:31,379 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/live-rw
18:34:31,387 INFO program: e2fsck: Cannot continue, aborting.
18:34:31,435 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/live-base
18:34:31,441 INFO program: e2fsck: Operation not permitted while trying to open /dev/mapper/live-base
18:34:31,492 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/live-osimg-min
18:34:31,497 INFO program: e2fsck: Operation not permitted while trying to open /dev/mapper/live-osimg-min

Created attachment 1036924 [details]
Request install log with e2fsck commands
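
For anyone reading a program.log like the one attached here, the following sketch estimates how long each check took by measuring the gap between consecutive `Running... e2fsck` entries. The function name is hypothetical and the approach is rough: the gap also includes whatever else the installer did in between, the timestamps carry no date, and clock adjustments will distort it (the excerpt jumps from 13:17 to 18:17 although the reporter mentions waiting about 30 minutes).

```python
from datetime import datetime

RUNNING_MARK = "INFO program: Running... e2fsck"


def e2fsck_gaps(log_lines):
    """Return (device, seconds until the next e2fsck start) pairs."""
    starts = []
    for line in log_lines:
        if RUNNING_MARK not in line:
            continue  # skip e2fsck error output and unrelated programs
        fields = line.split()
        stamp = datetime.strptime(fields[0], "%H:%M:%S,%f")
        starts.append((fields[-1], stamp))  # the device is the last argument
    return [
        (device, (next_stamp - stamp).total_seconds())
        for (device, stamp), (_, next_stamp) in zip(starts, starts[1:])
    ]


sample = [
    "13:17:37,177 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/vg_neotrantor-entertainment",
    "13:17:41,118 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/vg_neotrantor-software",
    "13:17:44,939 INFO program: Running... e2fsck -f -p -C 0 /dev/mapper/vg_neotrantor-photos",
]
for device, seconds in e2fsck_gaps(sample):
    print(f"{device}: {seconds:.3f} s")
```

The last invocation in a log has no following entry, so it gets no estimate.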
(In reply to Andy Campbell from comment #44)
> 13:17:37,177 INFO program: Running... e2fsck -f -p -C 0
> /dev/mapper/vg_neotrantor-entertainment

Thanks, this is a different problem. It looks like blivet is running e2fsck unconditionally. Please open a new bug against python-blivet with the logs from /tmp/*log attached to it as individual text/plain attachments.

*** Bug 1390027 has been marked as a duplicate of this bug. ***

So this bug was never really fixed, and this is not really a 'different problem'.

In #1162215 we noticed that resize2fs was requiring us to run e2fsck before it would tell us a minimum size for the filesystem, so we started running it on all filesystems to make sure we could get the minimum size info.

This bug says 'wait a minute, instead of having anaconda fsck everything, we should just make resize2fs tell us the minimum size without requiring an fsck if the fs has been mounted since last check'. And so Eric changed resize2fs:

https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/commit/?id=0462fd6db55de28d7e087d8d06ab20339acd8f67

and then he submitted an update, and marked it as fixing this bug, and this bug got closed.

But that was wrong, because a crucial step was missed: we never actually changed anaconda/blivet back to not running fsck on everything again. It no longer *needed to*, but it still actually *was*. So this bug has been the same bug all along, and never has been fixed.

We actually need to change this code:

https://github.com/rhinstaller/blivet/blob/2.1-devel/blivet/formats/fs.py#L122-L125
https://github.com/rhinstaller/blivet/blob/2.1-devel/blivet/formats/fs.py#L277-L279

basically, `FS.__init__()` calls `self.update_size_info()` (which means that will run for *every* filesystem anaconda sees), and `update_size_info()` unconditionally runs `do_check()`, which if the filesystem in question is ext2/3/4, results in a run of `e2fsck -f -p -C 0` on that filesystem.
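
The call chain described above can be illustrated with a toy model. This is not blivet's code: the classes are hypothetical stand-ins that only reuse the method names from the comment, and the "check" merely records the device instead of running e2fsck. It contrasts checking at construction time with deferring the check until a resize is requested.

```python
class EagerFS:
    """Toy model of the described flow: the check runs at construction."""

    checks_run = []  # devices that were "checked", shared across instances

    def __init__(self, device):
        self.device = device
        self.update_size_info()

    def do_check(self):
        # Stand-in for running `e2fsck -f -p -C 0` on self.device.
        EagerFS.checks_run.append(self.device)

    def update_size_info(self):
        self.do_check()  # unconditional: runs for every filesystem seen


class LazyFS:
    """Alternative: defer the check until a resize is actually requested."""

    checks_run = []

    def __init__(self, device):
        self.device = device

    def do_check(self):
        LazyFS.checks_run.append(self.device)

    def resize(self):
        self.do_check()  # only filesystems chosen for resize get checked
        # the actual resize would follow here


devices = ["/dev/sda1", "/dev/sdc5", "/dev/sdd1"]
eager = [EagerFS(d) for d in devices]
lazy = [LazyFS(d) for d in devices]
lazy[0].resize()
print(len(EagerFS.checks_run), len(LazyFS.checks_run))  # prints "3 1"
```

With the eager variant the cost scales with every filesystem the installer can see; with the lazy one it scales with what the user actually selected.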
For the record, cmurf correctly notes that this actually results in anaconda making changes to filesystems not involved in the install, because `e2fsck -p` means:

-p Automatically repair ("preen") the file system. This option will cause e2fsck to automatically fix any filesystem problems that can be safely fixed without human intervention.

even given that we trust it only makes 'safe' changes, that's pretty unexpected behaviour.

Also for the record, we re-considered the dupe (#1390027) as a blocker, but concluded that we'd stand by the original assessment of this bug as rejected blocker, even in light of the 'preen' thing.

(In reply to Adam Williamson from comment #49)
> Also for the record, we re-considered the dupe (#1390027) as a blocker, but
> concluded that we'd stand by the original assessment of this bug as rejected
> blocker, even in light of the 'preen' thing.

Too bad. So now we have to wait for Fedora 26 to get this bug fixed? What do those of us with large filesystems on the same disk that we want to install to do until then? (Other disks could be unplugged during install, but not the install destination disk.)

"Too bad. So now we have to wait for Fedora 26 to get this bug fixed?"

I'm afraid so, yes.

"Too bad. So now we have to wait for Fedora 26 to get this bug fixed? What do those of us with large filesystems on the same disk that we want to install to do until then?"

I'll make an updates.img today and link it here and from common bugs. It's quite easy to hack out the fsck from the code in a slightly dirty way which isn't really appropriate to throw into the official release right now, but will work around the problem for those with giant filesystems.

Will respins be able to have the corrected code? In the past, respins have usually been generated a few days after the official Fedora release. Ditto for remixes. The latter two types of outputs (respin/remix) can prove the fix.

Can't really say. Depends when we get around to fixing it and how complicated the fix is.

https://www.happyassassin.net/updates/1170803.0.img should work around this for Fedora 25 users. Boot the installer with `inst.updates=https://www.happyassassin.net/updates/1170803.0.img` as a kernel parameter to use it. (for dlehman - I basically just ripped the `do_check()` out of `update_size_info()` and put the finally: block back in line).

Thanks Adam, that worked for me. And then after booting Fedora I had to boot with fsck.mode=skip. The reason is that the partition it wants to fsck is 1.5TB of git repos, and is only 7200rpm...probably would take days.

My take on this https://github.com/rhinstaller/blivet/pull/526

(In reply to Vratislav Podzimek from comment #56)
> My take on this https://github.com/rhinstaller/blivet/pull/526

No, I just tested this and -n still takes a very long time to run. I had to wait 7.5 hours for the installer to get past this step, which is unacceptable. The only solution is to remove the e2fsck call completely (or ask the user if they want to run it). e2fsck is no longer required to solve the original reason that it was added since resize2fs was fixed.

(In reply to Charles R. Anderson from comment #57)
> (In reply to Vratislav Podzimek from comment #56)
> > My take on this https://github.com/rhinstaller/blivet/pull/526
>
> No, I just tested this and -n still takes a very long time to run. I had to
> wait 7.5 hours for the installer to get past this step, which is
> unacceptable. The only solution is to remove the e2fsck call completely (or
> ask the user if they want to run it). e2fsck is no longer required to solve
> the original reason that it was added since resize2fs was fixed.

The problem is that blivet/anaconda need to know if the file system is clean in order to decide whether it can be resized or not. The plan is to report that information from blivet to anaconda which could then ask user if they want to do a check or not.
However, there's no fast, safe and noninvasive way to tell if an ext2/3/4 file system is clean or (probably) not. Maybe there should be some way to e.g. just check the journal? Without it, blivet/anaconda would consider all ext2/3/4 file systems unresizable and only after user explicitly runs checks on them they would be considered resizable (if the check succeeds, of course).

(In reply to Vratislav Podzimek from comment #58)
> The problem is that blivet/anaconda need to know if the file system is clean
> in order to decide whether it can be resized or not.

Please re-read Eric Sandeen's comment #27, especially this part:

"resize2fs: don't attempt to calculate minimum size on fs with errors ... so there are filesystems out there which will require a check prior to resize2fs -P, but certainly not *all* of them. You could attempt resize2fs -P, and if that fails w/ the above message, run e2fsck if you still really wanted to, alert the user, etc."

> However, there's no fast, safe and noninvasive
> way to tell if an ext2/3/4 file system is clean or (probably) not. Maybe
> there should be some way to e.g. just check the journal? Without it,
> blivet/anaconda would consider all ext2/3/4 file systems unresizable and
> only after user explicitly runs checks on them they would be considered
> resizable (if the check succeeds, of course).

tune2fs shows

Filesystem features: needs_recovery

if the filesystem is dirty. You could replay the journal by mounting/unmounting the filesystem. You could try resize2fs -P to get the minimum size without running e2fsck as Eric Sandeen suggests. When it finally comes to doing the actual resize, you could try running the resize2fs and only run e2fsck if that fails.

I believe the goals for any solution should be:

1. e2fsck should only be run when necessary, and only on disks and filesystems that were specifically selected for resize or installation. e.g. if you don't select a disk, don't check filesystems on that disk. If you select a filesystem for reformat, it shouldn't be checked. If you don't select a filesystem for any operations at all, it shouldn't be checked. [The current method of checking all disks, all filesystems fails to check encrypted partitions, so delaying checks until the last possible moment has the additional benefit that the checks could be applied to encrypted filesystems as well]

2. On filesystems that have been selected for operations to be performed on them, avoid running e2fsck wherever possible. Check for dirty flag with tune2fs, try mounting filesystem to cause the kernel to replay the journal if necessary, try resizing the filesystem without running e2fsck first. Only if all those steps fail, then fall back to running e2fsck.

3. As a last resort when e2fsck is determined to be necessary, ask the user to confirm this, warning them that it may take many hours or days to complete the check. (It might even be possible to estimate the time based on the speed of the disk and the "used" size of the filesystem.)

4. When finally running e2fsck, provide visual feedback of the progress and allow the user to cancel the operation.

Thanks.

Perhaps one should select the drives to be part of the new installation before the integrity scan takes place. Post installation, one could start a background "re-check" on "first boot" or later. Implementing this change would be an anaconda design change F26/F27.

(In reply to Leslie Satenstein from comment #60)
> Perhaps one should select the drives to be part of the new installation
> before the integrity scan takes place. Post installation, one could start a
> background "re-check" on "first boot" or later. Implementing this change
> would be an anaconda design change F26/F27.

Yes, but partition (not just disk) selection also needs to take place, because large/full partitions on a single disk cause this issue also, and one cannot exactly remove a partition just to do the installation like one can for a separate disk.
I have a 2.4TB data partition which is 1.7T full which took 12 hours to fsck when I tested this again a couple days ago, even though the partition was 1) unmounted cleanly, 2) had no journal to replay, and 3) had no fsck auto-checking options set:

Maximum mount count: -1
Check interval: 0 (<none>)

If Anaconda wants to present a minimum size for any particular extN partition, then as of recent e2fsprogs, that partition will simply need to be without errors for "resize2fs -P" to calculate a minimum size.

But being "without errors" does not mean that a full e2fsck must be run. "With errors" is a flag which gets set on the superblock if the filesystem encountered a runtime error which has not yet been fixed. That flag will not be present the vast majority of the time, and resize2fs -P will Just Work, and will present a minimum size without needing fsck on most filesystems.

(Honestly, any filesystem which is presenting errors at install time should probably just be excluded (and noted) from the install targets - Anaconda should not be in the business of resolving such issues; if an existing filesystem requires repair that's something for the admin/owner to make a careful decision about before proceeding with a new install.)

If you actually want to shrink an extN filesystem, then it almost certainly will need an e2fsck first - it will only proceed if there have been no filesystem modifications since the last full fsck. Shrinking is a very metadata intensive operation, and we don't want to run into inconsistencies and errors while performing that brain surgery, so a preceding e2fsck is required - but only for filesystems which /will/ actually undergo shrink.

The code is already written to check the filesystem prior to doing the actual shrink. The additional run of e2fsck on all detected ext filesystems was very specifically added to deal with the 'resize2fs -P requires it' issue. I really think it would be pretty fine to just take it back out again.
I might try and come up with a PR for this, but I'm not totally sure as I've got a lot of other stuff to do.

I'm not sure if this issue applies to me, but I shared a disk with a virtual machine and I tried to install Fedora 25 Beta on it. The disk had several partitions on it, and some were part of the BTRFS filesystem (volume). Since I wasn't planning to install Fedora on that filesystem, I didn't feel the need to unmount it, but somehow it got corrupted pretty badly about that time. Could the design of blivet have anything to do with it or was it just a coincidence?

It seems very unlikely to have anything at all to do with this bug. For a start, this is specific to ext2/3/4.

I was under the impression that fsck is run for all possible filesystems, but it's more problematic for ext2/3/4 because it takes longer.

No, that's not really the case. ext, FAT, NTFS and HFS+ partitions are checked. All others are not. I don't know how fast or slow FAT, NTFS and HFS+ partition checking is compared to ext checking, but in any case, btrfs partitions are not checked (because the BTRFS class does not override the FS class's definition of self._fsck_class as fsck.UnimplementedFSCK).

Updated https://github.com/rhinstaller/blivet/pull/526 to avoid the e2fsck call. However, I still don't think this is right. Blivet needs to know if the file system is in a good shape and can be resized. The "if the tools tell us the minimum size, the file system is okay and resizable" sounds twisted to me. Blivet should be able to get the information about the file system's shape in some quick way not based on any assumptions.

Eric, is there a way to run 'e2fsck' somehow for it to just check the 'clean' flag and return 0/1 based on the value of that flag? Would 'e2fsck -n' (without '-f') do that? If not, could such option be added?

"Blivet needs to know if the file system is in a good shape and can be resized."

But...at least AIUI, it really can't.
Knowing whether the error flag has been set is not entirely the same thing. This is a 'the map is the territory' problem. As Eric said, if we're really going to *do* a resize, we have to do a full fsck before we do so. There is no way around that. And I believe there's no practical way to predict whether an fsck run will succeed any faster than simply *doing the fsck run*. To put it another way:

* We can quickly tell whether the error flag has been set.
* If the error flag has been set, we know we must run an fsck before doing the resize.
* If the error flag has not been set, we know we must run an fsck before doing the resize.

As for the error flag, I'd say I think we all agree on this:

* If the error flag is set, we should just consider the partition fundamentally non-resizable (and maybe provide sufficient info for a front-end like blivet or blivet-gui to display a warning/info box to the user telling them to check the filesystem).

The question, I guess, is whether it's OK to rely on resize2fs -M's implicit check of the error flag or not. So the question is really, does e2fsprogs upstream consider the error flag check an implementation detail of the -M feature, or a fundamental part of its job? i.e. if it somehow became the case that -M could be made to print a minimum size without checking the error flag, would they do that? If so, then I agree with you that anaconda should ideally do an independent check of the error flag before bothering with the `resize2fs -M` call, and if the error flag is set, just mark the partition as not resizeable.

This also seems unnecessary, it's run on the rootfs.img as part of the compose process, so there's no reason thousands of installations need to repeat that particular check.

Running... e2fsck -f -p -C 0 /dev/loop1
Running... e2fsck -f -p -C 0 /dev/mapper/live-rw
Running...
e2fsck -f -p -C 0 /dev/mapper/live-base

The least amount of code change might be to just drop all of the flags being used, which should get a simple and fast pass/fail for whether e2fsck thinks the fs is clean. And then only run resize2fs to get a minimum size on ext234 file systems that are located on the user selected destination device. Once installation begins, either the real e2fsck prior to resize, or the resize operation could fail, so there needs to be error handling there anyway. I'm not really sure what's gained by checking everything in advance, whether the user has any intention to modify those file systems.

(In reply to Vratislav Podzimek from comment #68)
> Eric, is there a way to run 'e2fsck' somehow for it to just check the
> 'clean' flag and return 0/1 based on the value of that flag? Would 'e2fsck
> -n' (without '-f') do that? If not, could such option be added?

"clean" simply means "no log replay needed" - I don't think there is any tool that allows you to query exactly that with a specific return value for the result. You could also parse dumpe2fs -h output, it will contain one of the following:

Filesystem state: clean
Filesystem state: not clean
Filesystem state: clean, with errors
Filesystem state: not clean, with errors

You can quickly replay the log in userspace with e2fsck -E journal_only if "not valid" is the problem; if "error" is the problem, you must do a full e2fsck just to get the minimum size. But again, if a filesystem is in this much distress I would /not/ try to have the installer deal with it. Just ignore it and let the admin figure out what to do.

This bug appears to have been reported against 'rawhide' during the Fedora 26 development cycle. Changing version to '26'.

This bug still exists in the 20160521 nightly of Fedora 26. Can it please be fixed before F26 release? Thanks.

Correction, 20170521 nightly of Fedora 26.

*** Bug 1189905 has been marked as a duplicate of this bug.
***

*** Bug 1375894 has been marked as a duplicate of this bug. ***

Proposed as a Blocker for 26-beta by Fedora user cra using the blocker tracking app because:

Violates: Other disks not touched
Disks not selected as installation targets must not be affected by the installation process in any way.

This bug has already been rejected as a blocker twice before. Do you have any particular rationale for re-considering it at this time?

(In reply to Adam Williamson from comment #79)
> This bug has already been rejected as a blocker twice before. Do you have
> any particular rationale for re-considering it at this time?

Because this violates the criterion given (don't touch disks/filesystems not selected for install). Preen /may/ be safe, but are we sure it is safe and always will be and will never have a bug that might accidentally destroy data? The whole point of disk selection is as an extra safeguard to avoid touching stuff you know will not be necessary for the install.

Because the purported reasons to keep doing e2fsck before disk/filesystem selection are bogus. For example, there may be other filesystems that become available later that are never subjected to the early e2fsck (encrypted ones for example).

Because it is absurd to accept that installing Fedora takes 12+ hours where there are preexisting large filesystems unrelated to the installation due to the refusal to remove the broken workaround that was originally put in place as a /temporary/ way to solve another problem that has long since been fixed the correct way.

So, nothing new, then?

To be quite honest, I've lost the thread here. Is there more work required in e2fsprogs to resolve this? I /think/ I fixed the underlying problem in 2014, but if there's more to it, please remind me. Thanks, -Eric

The relevant criterion is actually alpha: The user must be able to select which of the disks connected to the system will be affected by the installation process.
Disks not selected as installation targets must not be affected by the installation process in any way.

If the file system is clean, the disk is not affected, so the criterion is not violated. If the file system on the non-selected disk is fixed, then the criterion is violated. That's how the criterion reads. Since the installer still indiscriminately runs 'e2fsck -f -p -C 0' on all ext file systems, even on devices not selected as installation targets, it's very obviously a criterion violation.

So the rationale for reconsidering is, the previous explanations for rejecting it flat out ignore the criterion: non-selected disks must not be affected in any way; with just a handwave. So make it a blocker or revise the criterion.

Per 62 and 72 I think anaconda needs to be out of the fsck business entirely. If dumpe2fs -h indicates the fs is clean, it can be included as an installation target. If it's anything other than clean, it's excluded.

Since an fs volume other than "clean" is rare, it's arguably a local configuration issue in both cases: where the file system is modified, and where the file system is huge and fsck takes a long time. Since it doesn't always happen, it probably can be argued this is a conditional blocker. But I still think the installer is being ornery, running e2fsck on everything in sight.

(In reply to Chris Murphy from comment #83)
> So the rationale for reconsidering is, the previous explanations for
> rejecting it flat out ignore the criterion: non-selected disks must not be
> affected in any way; with just a handwave. So make it a blocker or revise
> the criterion.

I was just going to say the same thing. I was busy looking into the history:

The original criterion for blocking Fedora 21 was: "When using a dedicated installer image, the installer must be able to complete an installation using the text, graphical and VNC installation interfaces.
This criterion covers showstopper bugs in the installer for which there isn't any other specific criterion: obviously, it can't 'complete an installation' if there's a showstopper."

The decision was: "1170803 - RejectedBlocker - This bug doesn't clearly violate any criteria and looks to be getting worked on either way. Please repropose if it's found to violate another criterion."

Then the underlying issue with e2fsprogs was fixed and updated packages released. Everyone had to wait until Fedora 22 for a possible installer fix, but blivet was never changed to remove the now-no-longer-needed call to "e2fsck -f -p" on all EXT filesystems on all attached devices.

I re-discovered this issue and filed a dupe (#1390027) which officially asked for this to be blocked on the other criterion: "Disk selection: The user must be able to select which of the disks connected to the system will be affected by the installation process. Other disks not touched"

The decision was: "The decision to classify this bug as a RejectedBlocker and a RejectedFreezeException was made as this is an issue that has been around since Fedora 21 and has not blocked since. Though we note it causes e2fsck to perform 'safe' fixes on non-selected filesystems, as well as taking a long time if large ext2/3/4 filesystems are present, we don't see sufficient reason to change that decision or change this as an FE at this time."

Looking into the meetbot logs, some of the reasons given for the rejections are:

- "looks to be getting worked on either way"
- "because this issue was been around since Fedora 21"
- "because we are too close to the final release"
- "so obviously there's no momentum to make it a blocker even though it meets the alteration requirement people were saying needed to be true to make it a blocker"
- "it can survive one more release. But yes, let's ask for that to be prioritized"

Here we are 5 releases later. I think these lines of reasoning deserve to be reconsidered.
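
For reference, the `dumpe2fs -h` check suggested earlier in this thread could be sketched roughly as follows. This is only an illustration under stated assumptions, not blivet's actual code: the names `ExtState`, `parse_filesystem_state` and `may_offer_for_resize` are hypothetical, the input is assumed to be the text that `dumpe2fs -h <device>` prints, and the policy (exclude anything that is not plain "clean") simply follows the suggestion made in the comments above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtState:
    """Hypothetical container for the two superblock state bits."""
    clean: bool       # False: the journal still needs to be replayed
    has_errors: bool  # True: the superblock error flag is set


def parse_filesystem_state(dumpe2fs_header: str) -> Optional[ExtState]:
    """Parse the 'Filesystem state:' line from `dumpe2fs -h` output.

    Returns None when no such line is present (e.g. not an ext filesystem).
    """
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem state":
            # The value is e.g. "clean", "not clean" or "clean, with errors".
            words = [w.strip() for w in value.split(",")]
            return ExtState(clean="clean" in words,
                            has_errors="with errors" in words)
    return None


def may_offer_for_resize(state: Optional[ExtState]) -> bool:
    """Assumed policy: only plain 'clean' filesystems are offered."""
    return state is not None and state.clean and not state.has_errors
```

A caller would feed it the captured output of `dumpe2fs -h` for each ext device and skip (and report) any device for which `may_offer_for_resize` returns False, leaving the full e2fsck to the point where a shrink is actually performed.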
So in future, if you're going to do this, could you please re-propose the bug as a blocker, say, *six weeks* before the relevant release? Rather than the day we do the go/no-go meeting?

Because now we've got another contentious issue thrown right back in at the death of a milestone that's already enough of a mess thanks to the libdb issue. It's a lot easier to make sensible decisions and ensure things are fixed properly when we're not all rushing around like a bunch of headless chickens trying to do things at the last minute.

(In reply to Adam Williamson from comment #86)
> So in future, if you're going to do this, could you please re-propose the
> bug as a blocker, say, *six weeks* before the relevant release? Rather than
> the day we do the go/no-go meeting?

Sorry, I didn't re-discover this issue until this week when I tried an install on one of my systems that has large/filled filesystems. I filed it on Monday, but it didn't go into the blocker tracker because of the RejectedBlocker keyword which I discovered/fixed on Friday.

> Because now we've got another contentious issue thrown right back in at the
> death of a milestone that's already enough of a mess thanks to the libdb
> issue. It's a lot easier to make sensible decisions and ensure things are
> fixed properly when we're not all rushing around like a bunch of headless
> chickens trying to do things at the last minute.

I'd be fine with moving this to a Final Blocker rather than Beta.

(In reply to Adam Williamson from comment #86)
> So in future, if you're going to do this, could you please re-propose the
> bug as a blocker, say, *six weeks* before the relevant release? Rather than
> the day we do the go/no-go meeting?

That's rather blame the messenger. We need a better procedure. I've long argued for a better kick the can down the road process. What we have now is kick the can down the road and hide it under the carpet.
What I've suggested in the past, and ho hum rejected, is making certain bugs proposed as blockers as blocking the next release rather than the current release. A consistent procedure in writing sounds nice and ideal, but a subjective process where we just do that is already better than what we have right now. At least stop sweeping certain bugs under the carpet and hope they get fixed on their own (or that no one nominates it as a blocker again).

> Because now we've got another contentious issue thrown right back in at the
> death of a milestone that's already enough of a mess thanks to the libdb
> issue. It's a lot easier to make sensible decisions and ensure things are
> fixed properly when we're not all rushing around like a bunch of headless
> chickens trying to do things at the last minute.

The procedure we have is what enables last minute running around. Maybe at freeze we have a hard cutoff for new blocker bugs that are not regressions? If it's a regression, then it's current release current milestone blocker worthy. And if not then it gets kicked down the road to one of the next two milestones: in this case that would be either Fedora 26 Final, or Fedora 27 Alpha.

So my proposal is to kick the can down the road, but do not sweep it under the carpet, make it a Fedora 26 Final blocker subject to the input from the team tasked with fixing it, and if they have a compelling argument for pushing it to Fedora 27 Alpha instead, then do that.

Discussed at 2017-05-30 blocker review meeting: [1]. This bug was accepted as F27 Alpha blocker and rejected as F26 Beta blocker: This bug has been in existence for a while now, and it's too sweeping of a change to accept for F26 Beta. However, per the logic in Comment 88, we've accepted this as a violation of the Alpha "No disks touched" criterion for F27.
[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2017-05-30/

Besides the performance impact, this bug also has other aspects:

- may cause data loss, if one uses a signed block device (dm-verity, like in Chromium OS), or has a hibernated system on that disk
- may lead to a security issue - for example if you (re-)install a system on a previously compromised machine - then the compromised system may try to leave specially crafted filesystem metadata to exploit fsck or other filesystem parsing code involved (see below) of the new system

Both also apply to mounting and parsing selected files (/etc/fstab, /etc/os-release) from there. I hope this happens only to disks selected for installation - not all of them. Anyway it would be nice to have an option to opt-out from this feature - at the cost of not having detailed information in the partitioning wizard.

Cross referencing downstream issue: https://github.com/QubesOS/qubes-issues/issues/2835

I installed F26 beta yesterday. After I hit the "next" button on the first Anaconda screen, the installer froze for about 5 minutes. In the "ps" output I saw that Anaconda ran e2fsck processes on each partition of my 3 disks (2x 1 TB HDD, 1x 500 GB NVMe). The same freeze happened if I clicked the "refresh" button in the Anaconda partitioning tool to re-read the partition layout. It would be nice if the installer could tell the user that it runs file system checks in the background and that it can take several minutes. A frozen installer without any information makes users nervous. :-)

An updates.img that should hopefully fix the issues: http://vpodzime.fedorapeople.org/1170803_updates.img Please try!

Hey Vratislav, can you please describe the changes you made? Thanks a lot.

It's this PR -- https://github.com/rhinstaller/blivet/pull/526 -- which basically does two things:

1. changes how blivet runs e2fsck to not make any changes to the FS (because there's right now no way to ask user for confirmation)
2.
removes the FS checks when determining whether an FS instance is resizable and what its minimum size is because we no longer need to do it

Is the updates.img against current F26?

(In reply to Adam Williamson from comment #95)
> Is the updates.img against current F26?

"current" at the time it was created. Any problems with it?

Well, just trying to clarify what people should test with. I usually generate updates images against the tag matching the package build in current nightly composes, so people can properly test against a nightly compose...

Seems like this has now been merged to blivet upstream, but not released yet. Vratislav, can we get it built for Rawhide and then we can finally close this one out? :)

So this went into python-blivet-2.1.10-1, which is now built for Rawhide, F27 and F26. I'll mark it ON_QA for now, so we can verify the fix with recent Rawhide / F27 composes.

Moving to Beta blocker, as we're not doing an Alpha for F27.

Just as a note, the fix for this seems to have broken resizing - see https://bugzilla.redhat.com/show_bug.cgi?id=1484575 .

Can anyone who has a system where the impact of this is very obvious please test with any recent F27 compose and confirm whether it's fixed? Thanks. See https://www.happyassassin.net/nightlies.html for image download links (that's my personal domain, if you don't trust me, just verify the links are https links to Fedora domains before downloading).

It seems no one with an affected system is willing to test the fix. Closing the bug, hopefully everything works. If it doesn't, please reopen.

Did not receive notice to do a test. If it is part of kogj, I would have tested it. The only request I received was today, Sept 25. My email address has not changed. It works better than before, I have been using Fedora 27 pre-beta for versions dated 02, 11, 18, 19, and 23 (on 5 disks with 6 systems). Before the fix, the scan took 7 minutes elapsed. I will post a response today (EST on improvements).
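
As an illustration of item 1 in the PR description above (running e2fsck without modifying the filesystem), here is a minimal sketch. It assumes the `-n` option, which e2fsck(8) documents as opening the filesystem read-only and answering "no" to all questions; the thread does not say which flags the PR actually uses, and the helper names below are hypothetical rather than blivet API. The exit status bits are the ones listed in the e2fsck man page.

```python
from typing import List

# Exit status bits as documented in e2fsck(8); the status is a bitwise OR.
E2FSCK_STATUS_BITS = {
    1: "errors corrected",
    2: "errors corrected, system should be rebooted",
    4: "errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "canceled by user request",
    128: "shared library error",
}


def readonly_check_command(device: str) -> List[str]:
    """Build a non-modifying e2fsck command line (assumes -n, no -f/-p)."""
    return ["e2fsck", "-n", device]


def describe_exit_status(status: int) -> List[str]:
    """Decode an e2fsck exit status into human-readable conditions."""
    return [text for bit, text in E2FSCK_STATUS_BITS.items() if status & bit]


def is_consistent(status: int) -> bool:
    """Only a status of 0 means e2fsck found nothing to complain about."""
    return status == 0
```

The command list can be passed to `subprocess.run(...)` and the resulting `returncode` handed to `describe_exit_status`, so that a front-end can tell the user why a filesystem is not being offered for resize instead of silently repairing it.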
I have tested the fixes using Fedora-Workstation-netinst-x86_64-27-20170925.n.0.iso

From the clicking of Continue (and responding yes to the beta "risk message"), elapsed time dropped from seven minutes to about 2¼ minutes. A substantial improvement. A positive change. Thank you. Leslie

PS. I never received any email notifications about readiness, except for Comment 103. I do check spam filters.

7 minutes --> 2.25 minutes to scan 7 Fedora distributions file systems. lvm, btrfs, btrfs, ext4. F25, F26 3x(F27 test)

Here is what I have that I used for timings. 4 disks are 1 terabyte sata2 units. 1 is an SSD. I typically scratch the DISKC to test nightlies on true hardware.

DISKA F27 lvm GNOME test
DISKB F26 ext4 XFCE
DISKC F27 btrfs Gnome test
DISKC F27 lvmthin GNOME test
DISKD F25 ext4 GNOME test
DISKE F26 BTRFS W /Home on xfs GNOME

System Ram 8 gigs. CPU Q9650 (Dual core 64bit Intel, circa 2012)

Addendum

All spinning disks are 7200rpm, each with 64megs cache, and 2 have NCQ

A second test ran in 2 minutes, about 15 seconds shorter time.

Seven minutes to 2 minutes is great.

(In reply to Leslie Satenstein from comment #106)
> Addendum
>
> All spinning disks are 7200rpm, each with 64megs cache, and 2 have NCQ
>
> A second test ran in 2 minutes, about 15 seconds shorter time.
>
> Seven minutes to 2 minutes is great.

Cool!

Also confirmed that this works for me now. No e2fsck is being run before disk selection or custom partitioning. No more 12+ hour wait at the Language Selection screen from trying to fsck 1.7TB of data. Thanks!

I ask here since the Fedora 25 common bugs has a link to this issue: the installer update for F25 mentioned there and here seems to be gone or off-line. Is this installer image mirrored somewhere? (I know I could switch to F26 but keeping all systems in a computer classroom at the same version would be preferable)

David, we'll try to get the web server back up.
Adam, happyassassin.net seems to be down and therefore https://www.happyassassin.net/updates/1170803.0.img can't be downloaded. Can you please fix?

The site seems to be back, thanks for fixing! As an additional note: wouldn't it be advisable, if there are installer updates like this that are advertised in the release notes / common bugs, to host these at Fedora, preferably on the mirrors like rpm updates?

Frankly it's just rather *easier* for me to do it on my personal domain. These are not being provided as Official Fedora Updates, it's just something I'm doing as a courtesy. It wouldn't actually be appropriate to publish this as if it were an Official Fedora Update as no-one at all besides me has verified the contents :) (you can of course expand the image file to check what's in it). The site usually stays up pretty well, but I went on vacation and my cable modem somehow lost its signal a week before I got back, so it was down till I could get back and power cycle it. Sorry about that.