Bug 1159292
Summary: | Machine automatically shutdown during upgrade in less than 15 minutes | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Robin Lee <robinlee.sysu> |
Component: | fedup-dracut | Assignee: | Will Woods <wwoods> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 20 | CC: | awilliam, bastiaan, error, jreznik, kalevlember, kparal, michael.monreal, milan.kerslager, mruckman, robatino, satellitgo, tflink, tomspur, wwoods, zbyszek |
Target Milestone: | --- | Keywords: | CommonBugs, Reopened |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | https://fedoraproject.org/wiki/Common_F21_bugs#no-fedup-beta | ||
Fixed In Version: | systemd-216-8.fc21 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-11-05 17:32:52 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1043129 | ||
Attachments: |
Description
Robin Lee
2014-10-31 12:00:43 UTC
I can confirm this issue, and it is not rare. Yesterday, we have seen it on a bare metal machine when upgrading F20 desktop to F21 workstation. Today, we have seen it in a VM and on a bare metal machine, when upgrading F19 desktop to F21 workstation. Unfortunately we didn't catch this before F21 Beta release, because yesterday we considered it to be a hardware issue. Another information worth noting is that we saw it both with fedup 0.8.1 and fedup 0.9.0. The only thing which was common for all these cases is RC4 upgrade.img (also the OP used this image). We have not seen this issue in the past with older upgrade.img files. Quite interestingly, we haven't hit this issue quite a few times even with RC4, otherwise it would be considered a blocker bug. So maybe this issue is only triggered with RC4 upgrade.img *and* if some particular package is present during the upgrade? Not sure.
Right now I have a VM with F19 (default desktop installation) snapshot and I have hit this issue 2 times out of 2 attempts. I have recorded the VM output and found out what are the last lines before the machine shuts down (btw, it seems to shut down hard, like poweroff -f). See attached screenshot. The log line is:
> upgrade[1004]: [707/1287] (89%) cleaning at-spi2-atk-2.8.1-1.fc19...
> cgroup: option or name mismatch, new: 0x0 "", old: 0x4 "systemd"
In a second after printing this, the machine powers off. So the issue might be related to fedup-dracut, systemd or at-spi2-atk. Or all of them.
After starting and booting into the machine again, the results are non-deterministic. Because lots of packages were not cleaned properly, yum complains a lot (in my case, something about 600 duplicate packages). My F19->F21 machine at least booted, but lbrabec's F20->F21 machine didn't even boot to graphics and network didn't work (maybe his machine turned off during actual update phase, not clean phase). Also, there is no /var/log/upgrade.log in my case, so I can't provide it. I can only provide fedup.log. Maybe the upgrade logging wasn't really solved for F19, just for F20?
Created attachment 952550 [details]
fedup output directly before reboot
Created attachment 952552 [details]
fedup.log from F19->F21
Created attachment 952553 [details]
"yum check" output from F19->F21 after unsuccessful upgrade
Since 4 people hit this already (OP and 3 of us in our office), it seems it isn't a random occurrence. Proposing as a blocker bug. Upgrade criterion. davidshea and I both had the thought that this may possibly be related to https://bugzilla.redhat.com/show_bug.cgi?id=1154768 somehow. zdzichu told me about a new systemd feature: http://cgit.freedesktop.org/systemd/systemd/tree/NEWS#n25 And bingo, the reboot is time-based. It's after exactly 15 minutes after start. See attached video. To confirm this even further, this time it rebooted after a different package (PackageKit), so it's not a problem with at-spi2-atk. Either this is a bug in systemd, or there is some service in the background which uses JobTimeoutAction and JobTimeoutRebootArgument, and causes the reboot. The question is what service, and whether fedup-dracut should prevent it. Created attachment 952567 [details]
video showing poweroff right after 15th minute mark
Kamil: that's pretty much exactly what #1154768 is about, I believe. It's a problem with that 'feature' when TimeoutSec is used in a unit file - that feature shouldn't trigger a shutdown if the unit which hasn't completed or whatever specifies TimeoutSec, but it does. http://cgit.freedesktop.org/systemd/systemd/commit/?id=2928b0a863091f8f291fddb168988711afd389ef is the commit that added the timeout. It rather looks like settings were added to /etc/systemd/system.conf to control this behaviour: +#StartTimeoutSec=15min +#StartTimeoutAction=reboot-force +#StartTimeoutRebootArgument= so might we be able to fix this for Beta by having fedup insert those settings into the /etc/systemd/system.conf in the upgrade environment? So, I wanted to try something to see if it would fix this: diff --git a/fedup/sysprep.py b/fedup/sysprep.py index f7fbe25..bda50e7 100644 --- a/fedup/sysprep.py +++ b/fedup/sysprep.py @@ -150,6 +150,10 @@ def customize_initrd(initrd): for f in wantfiles: log.debug("%s",f) with TemporaryDirectory(prefix="fedup.initramfs.") as tmpdir: curimg.extract(wantfiles, root=tmpdir) + sysdconf = os.path.join(tmpdir, 'etc', 'systemd', 'system.conf') + log.debug("Editing %s", sysdconf) + sysdout = open(sysdconf, 'a') + sysdout.write('\nStartTimeoutAction=none\n') newimg.append(wantfiles, root=tmpdir) # look for updates, and add them to initrd if found (obviously not a clean fix, but enough to test the idea). However, I ran into a bit of a problem: it seems /etc/systemd/system.conf never actually winds up in the fedup initramfs. It *should* do. If you print out the 'list of files to be appended' both in customize_initrd() and in newimg.append(), all the way through, etc/systemd/system.conf shows up. I can even print out the contents of the file after my patch edits it, showing that it gets properly extracted from curimg (and it wouldn't be in newimg.append's 'filelist' unless it was there in the temp directory, but it *is* in newimg.append's 'filelist'). But somehow, it doesn't make it in when newimg.append() actually writes the files into the new initramfs. lsinitrd on fedup-initramfs.img never shows it. This is the case even with an unmodified fedup 0.9 (without my patch above). It occurs to me that I'm not sure whether systemd when running the fedup upgrade step will read from the copy of /etc/systemd/system.conf in the initramfs, or the one in sysroot. Anyway, I *think* what we need to do is get that setting set in the fedup environment one way or another. And we really need this fix ASAP, because it's causing broken systems on upgrade. oh, there's actually a sysdout.close() in there too. I wasn't able to reproduce this with either fedup versions when using fedup on a i386 F19 box. https://bugzilla.redhat.com/show_bug.cgi?id=1154416#c10 claims this systemd feature going to be shut off in F21 anyway, so it's probably not something fedup actually needs to worry about permanently. I also haven't been able to reproduce this in my testing, so it's being triggered by either a very recent change in F21 *or* something special about your system(s). The question is: what's the cause? Can you start an upgrade with 'rd.break' and use the dracut emergency shell to check the contents of: /etc/systemd/system.conf (this is the initramfs one) /sysroot/etc/systemd/system.conf (this is your system's version) (In reply to Adam Williamson (Red Hat) from comment #11) > [...] it doesn't make it in when > newimg.append() actually writes the files into the new initramfs. lsinitrd > on fedup-initramfs.img never shows it. This is the case even with an > unmodified fedup 0.9 (without my patch above). lsinitrd doesn't handle concatenated initramfs images the way the kernel does, so the files that were appended by fedup may not show up in lsinitrd. To confirm whether the files are actually in the initramfs you need to boot with 'rd.break' and check the contents of the live initramfs, as described above. Hum, it looks like the edit actually worked fine, if I reboot and drop to a dracut shell I see it. looks like just lsinitrd can't read the initramfs properly after fedup edits it. Unfortunately roshi can run an upgrade that takes more than 15 mins without hitting the bug, so it looks like reproduction isn't 100% so I'm not totally sure I can test my 'fix'. If anyone wants to try it, though, try patching sysprep.py as above (plus the sysdout.close() after the write) and see if that gets you past the shutdown... Will: it won't be a permanent problem no, but the upgrade.img in the Beta tree is now built with the systemd that contains this timeout, and we can't change it. Anyone running a fedup against that tree may hit the problem. I suspect the difference between cases where the bug happens and cases where it doesn't is whether the boot reached a far enough point to consider the boot 'complete' and turn off the timer, but I can't quite grok the systemd code well enough to understand exactly when it does that. wwood: I believe the bug is triggered by a very recent change in systemd. It landed upstream after 216 was released, but systemd-216-2 in fedora has in its changelog "- Update to latest git, but without the readahead removal patch" , so I'm guessing we got it there. We pulled 216-4 into Beta TC4. So I believe only tests using an instroot from Beta TC4, Beta RC1, Beta RC4, or development/21 since 10-28 would show this bug. roshi I think misunderstood, and was expect the timeout to happen during fedup *package download*, so I think his failure to reproduce is bogus. I currently have a fedup upgrade stage that's been running for 19 minutes and hasn't shut down, with my patch. I actually see the timeout kicking in, in the journal: Oct 31 22:52:11 test1.happyassassin.net systemd[1]: Startup timed out. (but it doesn't *do* anything because my patch sets the action on timeout to 'none'). That seems like pretty good confirmation both of the cause, and that my approach should work around the problem. However, just to be sure, I'll test that the same procedure but without my patch causes the bug, next. There's an f20 scratch build of fedup with my test fix here: http://koji.fedoraproject.org/koji/taskinfo?taskID=7995328 and an f19 one here: http://koji.fedoraproject.org/koji/taskinfo?taskID=7995397 can folks who can reproduce this bug reliably please test with that fedup? You should see that the upgrade installation stage survives past 15 minutes, and around that time you should see the 'Startup timed out' message appear in the journal. For the record, looks like I'm right about when we got this change: [adamw@adam systemd (f21 %)]$ git log 0038-core-add-support-for-a-configurable-system-wide-star.patch commit 708deb95da792cb4515da4ceaf79458bdcc09f7a Author: Zbigniew Jędrzejewski-Szmek <zbyszek.pl> Date: Fri Oct 3 22:43:52 2014 -0400 Update to latest git [adamw@adam systemd (f21 %)]$ That commit is 216-2. The update, https://admin.fedoraproject.org/updates/FEDORA-2014-12688 , was submitted as 216-2 then edited to -3, -4 and finally -5. It was pulled into TC4 as -4, and into RC1 and RC4 as -5. -5 was pushed stable on 10-28. So as I wrote above, only TC4, RC1, RC2 and RC4 trees, and the development/21 tree from 10-29 on, would have the issue. You also only hit the issue if the 'update installation' stage of fedup takes more than 15 minutes, which it doesn't on a fast disk with a standard package set. I'm installing KDE with a bunch of optional sub-groups in order to make the upgrade take longer, for testing. The timeout was only reverted upstream four days ago, the reversion has not yet reached the Fedora package. I can reproduce the problem when running an upgrade of a big installation without my patched fedup. If you boot into the broken system (might need 1 or rescue) and run 'journalctl -b-1' you can see the problem quite clearly, the log ends with: Startup timed out. Forcibly powering off as a result of failure. Shutting down. Journal stopped I snapshotted before rebooting to fedup 'stage2'. I've just run the snapshot, updated to my scratch fedup, and re-run fedup to re-generate the initramfs, and kicked off 'stage2' again to confirm that it completes this time. I've gotta run out for dinner now, will post the results when I return. As expected, running the same upgrade but using fedup with my hack results in a successful upgrade. 'journalctl -b-1' shows the 'Startup timed out.' message in the middle of package installation. Seems pretty clear what the cause is, at this point. It'd be good to decide on a proper fix for this and get it out there on Monday if possible, we want to be ahead of release on Tuesday. Okay so, if I understand right: we never saw this in upgrade testing before now 'cuz the offending/buggy code was: * never in an upstream systemd release * not in F21 updates until Oct 28 (so my test builds didn't have it), and * not in TC1/2/3 or RC2/3 But it *did* get pulled into RC4, which is causing this and bug 1154768. If we had another day or two the obvious fix would just be to add "StartTimeoutAction=none" to /etc/systemd/systemd.conf, in the systemd package. But since it's technically feasible for fedup to fix *anything* during an upgrade, and we don't have a week, we're now fixing this in fedup. Okay, fine. Now: since the 'StartTimeoutAction' configuration item has been removed from systemd upstream, just adding that line to systemd.conf will cause a "Unknown lvalue 'StartTimeoutAction' ..." warning for every version of systemd *other* than 216-{2,3,4,5}. Ugh. On the other hand, this can probably also be fixed by adjusting the systemd units in the initramfs so that systemd properly considers startup to be finished before the upgrade starts.. which would probably be the intended behavior for dealing with this feature anyway. (In reply to Adam Williamson (Red Hat) from comment #11) > So, I wanted to try something to see if it would fix this: > > diff --git a/fedup/sysprep.py b/fedup/sysprep.py > index f7fbe25..bda50e7 100644 > --- a/fedup/sysprep.py > +++ b/fedup/sysprep.py > @@ -150,6 +150,10 @@ def customize_initrd(initrd): > for f in wantfiles: log.debug("%s",f) > with TemporaryDirectory(prefix="fedup.initramfs.") as tmpdir: > curimg.extract(wantfiles, root=tmpdir) > + sysdconf = os.path.join(tmpdir, 'etc', 'systemd', 'system.conf') > + log.debug("Editing %s", sysdconf) > + sysdout = open(sysdconf, 'a') > + sysdout.write('\nStartTimeoutAction=none\n') > newimg.append(wantfiles, root=tmpdir) > > # look for updates, and add them to initrd if found Yes, this seems to be the most reasonable way without changing the systemd package. > Author: Zbigniew Jędrzejewski-Szmek <zbyszek.pl> > Date: Fri Oct 3 22:43:52 2014 -0400 > > Update to latest git > [adamw@adam systemd (f21 %)]$ > > That commit is 216-2. My fault. I'm sorry for that. This update *was* supposed to go to stable before the freeze started, but was delayed by bugs (and the -3, -4, ... updates). (In reply to Will Woods from comment #22) > Now: since the 'StartTimeoutAction' configuration item has been removed from > systemd upstream, just adding that line to systemd.conf will cause a > "Unknown lvalue 'StartTimeoutAction' ..." warning for every version of > systemd *other* than 216-{2,3,4,5}. Ugh. I'll add a compatibility patch to ignore StartTimeoutAction=none in subsequent versions. That should be fine, since StartTimeoutAction=none is equivalent to the new behaviour. > On the other hand, this can probably also be fixed by adjusting the systemd > units in the initramfs so that systemd properly considers startup to be > finished before the upgrade starts.. which would probably be the intended > behavior for dealing with this feature anyway. That would be hard, changes to the the boot order are inherently fragile. Adam's fix seems to be much safer. (In reply to Adam Williamson (Red Hat) from comment #11) > So, I wanted to try something to see if it would fix this: > > diff --git a/fedup/sysprep.py b/fedup/sysprep.py > index f7fbe25..bda50e7 100644 > --- a/fedup/sysprep.py > +++ b/fedup/sysprep.py > @@ -150,6 +150,10 @@ def customize_initrd(initrd): > for f in wantfiles: log.debug("%s",f) > with TemporaryDirectory(prefix="fedup.initramfs.") as tmpdir: > curimg.extract(wantfiles, root=tmpdir) > + sysdconf = os.path.join(tmpdir, 'etc', 'systemd', 'system.conf') > + log.debug("Editing %s", sysdconf) > + sysdout = open(sysdconf, 'a') > + sysdout.write('\nStartTimeoutAction=none\n') > newimg.append(wantfiles, root=tmpdir) This will cause an unnecessary message in the logs. Something like this would be nicer: sysdconf = os.path.join(tmpdir, 'etc/systemd/system.conf') log.debug("Editing %s", sysdconf) with open(sysdconf, 'a') as sysdout: sysdout.write('\nStartTimeoutSec=0\n') Will: sorry, I guess I didn't quite explain the timeline quite right, let me give it one more shot. The change landed in F21 in systemd-216-2.fc21 and is in all subsequent F21 systemd builds so far. That build or a later one: * Was in F21 updates-testing from 2014-10-08 * Was in F21 stable from 2014-10-28 (plus whatever time it took to sync mirrors) * Was in F21 Beta TC4 onwards (RC2 and RC3 were kind of 'DOA' builds which is why I don't always mention them). This means the upgrade.img in TC4, RC1, RC2 and RC4 trees on stage/ (there is no RC3 tree) have the bug. The development/21 tree is built from the F21 'stable' package set, AIUI. That means the upgrade.img in that tree would not have had this bug until, probably, 2014-10-29. Obviously to hit the bug you still need the fedup package installation stage to take more than 15 minutes. If you have an SSD and a typical GNOME install, for instance, it probably won't, which is why we didn't see this bug in RC4 validation (I ran two upgrade tests on RC4 and didn't hit this bug either time, because the package update stage was fast enough to not hit it). In order to reproduce it I had to install a VM with 1900+ packages (I installed KDE plus a bunch of optional package groups). It's not that unusual to have a lot of packages installed, though, and it's much much easier to run into the bug if you have an HDD not an SSD, I'd expect most 'normal sized' installs to an HDD would hit it. Zbsyzek: thanks for the suggestion. I actually did it precisely because I *wanted* the message in the logs for testing purposes (makes it much easier to check/prove the bug is what we think it is). For a production fix I agree it'd make sense to change the timeout rather than the action. I don't know what the current plan is upstream to replace the timeout - I know you're planning to re-implement it in some different way. To me it would make sense for the config parameters to stay valid, not just add a compatibility patch to silently ignore them. I would expect that StartTimeoutSec=0 should just always disable any kind of system-wide 'boot timeout' of this kind. Having thought about it a bit I think the hack I came up with on the fly actually kind of makes sense in the long term: so long as it has any kind of 'system-wide' boot timeout mechanism, systemd should always treat the setting StartTimeoutSec=0 as disabling it, and things like fedup should always set it. That should avoid any kind of recurrence of this problem in future. Of course, for the long term we'd want to bake the setting into fedup's initramfs at generation time, not have the fedup binary stuff it in on the fly like my patch does. If we do go with my patch I'd expect we'd just leave it in until Final, when we could fix it properly in fedup-dracut. "If we had another day or two the obvious fix would just be to add" It's not about time exactly - it's about the fact that we've signed off on Beta, so the Beta upgrade.img is now set in stone. We cannot change anything in https://dl.fedoraproject.org/pub/alt/stage/21_Beta_RC4/Server/x86_64/os/images/pxeboot/upgrade.img , it is done, irrevocably. On Tuesday it will become https://dl.fedoraproject.org/pub/fedora/linux/releases/test/21-Beta/Server/x86_64/os/images/pxeboot/upgrade.img . It has this bug, and we cannot change that. So we're stuck working around it somewhere else, and fedup is about the only viable place. Obviously we can fix it for the upgrade.img that lives at https://dl.fedoraproject.org/pub/fedora/linux/development/21/x86_64/os/images/pxeboot/upgrade.img - again AIUI, as soon as a systemd without this 'feature' goes into 21 stable, that image will no longer have this bug. But we have instructions in various places telling people to use the image from the milestone tree (not the development/21 tree), and I think that's probably what mirrormanager will wind up pointing to, so we can't really leave things in the state where you hit this bug if you run fedup against the upgrade.img in our official frozen Beta tree. The system wide timeout was changed to a unit-specific timeouts (see http://cgit.freedesktop.org/systemd/systemd/commit/?id=3898b80d40). This is more flexible, because different values make sense for different targets, and also it ties more naturally into existing configuration scheme. One or two options (JobTimeoutSec, JobTimeoutAction) are added to specific units for which the timeouts are wanted. In the new scheme, there's a 15 min timeout on basic.target (not default.target like before), and 30 min timeout for reboot/poweroff. The first one shouldn't matter for fedup, but the second one *might* have to be adjusted for jobs which e.g. download lots of packages during shutdown. Hm, now that I think about it, fedora-autorelabel.service might trigger the boot timeout. So it seems that this requires more thinking before being enabled. Will: I realized after posting that last comment that you're suggesting we can change systemd in f19 and f20 so the setting to disable the timeout is in the /etc/systemd/system.conf that will get copied into the initramfs. That's possibly viable, but I see a couple of problems aside from the tight timeline to try and get an update into stable before Beta goes out: i) system.conf really is a *config* file, i.e. people might have modified it. If they have, the change will come in a .rpmnew, and people often don't bother checking for and merging rpmnew files, especially if the system appears to be working. (I've just checked, it definitely is marked as %config(noreplace) in the spec.) ii) people who just don't update the system before running fedup. We say you ought to, but there's nothing to enforce it, and I know people sometimes don't. (In reply to Adam Williamson (Red Hat) from comment #18) > There's an f20 scratch build of fedup with my test fix here: > > http://koji.fedoraproject.org/koji/taskinfo?taskID=7995328 ... > can folks who can reproduce this bug reliably please test with that fedup? > You should see that the upgrade installation stage survives past 15 minutes, > and around that time you should see the 'Startup timed out' message appear > in the journal. I can reproduce this reliably, and I can confirm that this build of fedup allows for installation to complete successfully, though for some reason it seemed to run much slower. Might have just been an overly abused qcow2 snapshot though... After running into this issue on three different systems, and spending most of the day figuring out how to make them usable again, I can say that recovering from it ranges from straightforward to nightmarish, depending on what packages were installed and where exactly the upgrade stopped. Bet on nightmarish, though, especially for a multilib system. I would recommend adding this to Common F21 bugs on the wiki, as this has the potential to leave systems in a semi-usable or even unusable state. (For instance, NetworkManager on one affected system could not connect to any networks. Xorg failed to start on another one.) I tested patched fedup (fc19) from comment 18 and it fixes the problem for me. I see this message in upgrade.log: > Nov 03 09:41:36 localhost.localdomain systemd[1]: Startup timed out. but the upgrade process continued and the machine was correctly upgraded. (In reply to Adam Williamson (Red Hat) from comment #27) > It's not about time exactly - it's about the fact that we've signed off on > Beta, so the Beta upgrade.img is now set in stone. We cannot change anything > in > https://dl.fedoraproject.org/pub/alt/stage/21_Beta_RC4/Server/x86_64/os/ > images/pxeboot/upgrade.img , it is done, irrevocably. On Tuesday it will > become > https://dl.fedoraproject.org/pub/fedora/linux/releases/test/21-Beta/Server/ > x86_64/os/images/pxeboot/upgrade.img . It has this bug, and we cannot change > that. So we're stuck working around it somewhere else, and fedup is about > the only viable place. Well, is it really that unreasonable rebuilding the upgrade.img with a newer systemd a replacing it in test/21-Beta location? Before it is published, of course. Alternatively, we can also create a fixed compose (might not even be a full compose) and redirecting mirrormanager to it, instead of test/21-Beta. IIRC, we never asked end users to use --instrepo, even with Alpha/Beta upgrades. We use mirrormanager for that, the url is http://mirrors.fedoraproject.org/mirrorlist?repo=fedora-install-21&arch=x86_64 . (That doesn't work yet, which I believe is an omission, because I think it used to work even for F20 Alpha. But you can try the link with an older release number). Fedora Infra used to update the link from Alpha location, to Beta location, to Final location. So, if it is easier than to hack fedup, we can create something like test/21-Beta-fedup-fix with just the necessary files included, and point the mirrormanager to it. (In reply to Adam Williamson (Red Hat) from comment #29) > Will: I realized after posting that last comment that you're suggesting we > can change systemd in f19 and f20 so the setting to disable the timeout is > in the /etc/systemd/system.conf that will get copied into the initramfs. No, sorry - to clarify a bit, I'm suggesting a new systemd package for *F21* that has "StartTimeoutAction=none" in systemd.conf. That'd be a one-line change that adds no new code and disables something that's already disabled/removed upstream. Then we'd rebuild upgrade.img and everything's done. So, roughly what Kamil's saying in comment 31. Also, a note: fedup doesn't add files from the host's /etc, it adds files from the /etc in the host's *initramfs*. [This is necessary because dracut cherry-picks files (and modifies a few of them) when it builds an initramfs, and there's no way to reproduce that behavior outside of dracut.] So: updating systemd in F19/F20 has the problems you mentioned, *plus* it still wouldn't reliably fix the problem unless we also forced the system to rebuild its initramfs (to get the updated systemd.conf) before running fedup (which then pulls the updated file into upgrade.img). That'd be viable, yeah, but it'd still worry me that people *could* use the Beta tree and hit the bug. It's a bit hard to remember what instructions we've written down where for doing pre-release upgrades, at this point, we may well have pointed people at the frozen milestone tree as a source of stable, 'known-OK' upgrade images. The only other viable place we have to point mirrormanager, BTW, at least up until Final TC1, is into development/21 , where the image might conceivably be broken or missing some days (missing isn't that likely, but possible, I guess). It's something we'd have to worry about every time it got rebuilt. I knew about the initramfs wrinkle of course (bit hard not to when writing that patch :>) but forgot about the impact it'd have on the F20 solution. I'm still kind of liking something like my approach, tbh, however we want to try and 'bound' it so it's not around forever (only put it in fedup builds till Final TC1, conditionalize it in the code somehow, whatever). (In reply to Zbigniew Jędrzejewski-Szmek from comment #28) > The system wide timeout was changed to a unit-specific timeouts (see > http://cgit.freedesktop.org/systemd/systemd/commit/?id=3898b80d40). This is > more flexible, because different values make sense for different targets, > and also it ties more naturally into existing configuration scheme. One or > two options (JobTimeoutSec, JobTimeoutAction) are added to specific units > for which the timeouts are wanted. > > In the new scheme, there's a 15 min timeout on basic.target (not > default.target like before), and 30 min timeout for reboot/poweroff. The > first one shouldn't matter for fedup Right. We have two sets of targets: Host: * system-upgrade-generator links "system-upgrade.target" to default.target * system-upgrade.target Requires sysinit.target, Wants upgrade-prep.service * upgrade-prep.service (on ExecStopPost) isolates upgrade-switch-root.target [Here we switch-root back to upgrade.img to run the upgrade..] Upgrade.img: * initrd-system-upgrade-generator links "upgrade.target" to default.target * upgrade.target Requires sysinit.target and sockets.target [And now the upgrade actually starts.] So, yes: we don't use basic.target in either case, which means we shouldn't need to worry about this after Beta. (In reply to Zbigniew Jędrzejewski-Szmek from comment #24) > (In reply to Will Woods from comment #22) > > On the other hand, this can probably also be fixed by adjusting the systemd > > units in the initramfs so that systemd properly considers startup to be > > finished before the upgrade starts.. which would probably be the intended > > behavior for dealing with this feature anyway. > That would be hard, changes to the the boot order are inherently fragile. > Adam's fix seems to be much safer. I don't think I need to change the existing boot ordering, I think I just need to tweak the ordering for 'system-upgrade.target' in the "Host" section above. If I understand the StartTimeoutAction implementation correctly, the timeout happens if systemd's initial job queue doesn't empty after 15 minutes; I just need to fix the units that fedup ships so that the job queue correctly empties (i.e. systemd considers the system started) before we do the switch-root. Testing to confirm now, but if I've read the systemd code right, this will solve the problem: diff --git a/systemd/upgrade-prep.service.in b/systemd/upgrade-prep.service.in index af39946..a28837b 100644 --- a/systemd/upgrade-prep.service.in +++ b/systemd/upgrade-prep.service.in @@ -7,7 +7,7 @@ OnFailureIsolate=true DefaultDependencies=no [Service] -Type=oneshot +Type=idle ExecStart=@LIBEXECDIR@/upgrade-prep.sh ExecStopPost=-/usr/bin/systemctl --no-block isolate upgrade-switch-root.target StandardOutput=journal+console Well, my test upgrade is 40% complete after 23 minutes and still going, so I think the above patch works. Further confirmation would be appreciated, of course; to try this workaround, just edit /lib/systemd/system/upgrade-prep.service and change the Type to 'idle' before you reboot. Or just: sudo sed -i 's/Type=oneshot/Type=idle/' /lib/systemd/system/upgrade-prep.service Also, just to be clear: the systemd bug is still in the RC4 upgrade.img. We can push this workaround as an update, but if we don't fix upgrade.img anyone who upgrades to F21 Beta without updating fedup first will have their systems mangled. The problem is that systemd will consider upgrade-prep.service failed after it exits. Might now matter though. (In reply to Zbigniew Jędrzejewski-Szmek from comment #37) > The problem is that systemd will consider upgrade-prep.service failed after > it exits. Might now matter though. Nevermind, it exits with 0 so everything is fine. fedup-0.9.0-2.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/fedup-0.9.0-2.fc20 fedup-0.9.0-2.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/fedup-0.9.0-2.fc19 fedup-0.9.0-2.fc21 has been submitted as an update for Fedora 21. https://admin.fedoraproject.org/updates/fedup-0.9.0-2.fc21 0.9.0-2 looks good in a test from my reproducer, I got a complete upgrade from my reproducer snapshot. Note 0.9.0-2 uses the fix wwoods sugested in c#35, not the one I suggested, so tests of my scratch build do not validate 0.9.0-2. It'd be great if others who can reproduce can test with 0.9.0-2, confirm the fix, and provide karma ASAP - we want to get this pushed stable by tomorrow. systemd-216-8.fc21 has been submitted as an update for Fedora 21. https://admin.fedoraproject.org/updates/systemd-216-8.fc21 (In reply to Adam Williamson (Red Hat) from comment #42) > 0.9.0-2 looks good in a test from my reproducer, I got a complete upgrade > from my reproducer snapshot. Note 0.9.0-2 uses the fix wwoods sugested in > c#35, not the one I suggested, so tests of my scratch build do not validate > 0.9.0-2. It'd be great if others who can reproduce can test with 0.9.0-2, > confirm the fix, and provide karma ASAP - we want to get this pushed stable > by tomorrow. I've put in my good karma. 0.9.0-2 not only works, it is not slow for me like the scratch build was. fedup-0.9.0-2.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report. Let's keep this open while we track all the changes and close it manually when we're sure everything's good (expect a few more close/re-open bounces like this). systemd-216-8.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report. Reopening per comment #46. Is this fixed now when upgrading a F20 desktop to F21 beta? (In reply to Michael Monreal from comment #49) > Is this fixed now when upgrading a F20 desktop to F21 beta? Yes, provided the updates have reached your local mirror (which they should have by now). It's a slightly complex question. See https://fedoraproject.org/wiki/Common_F21_bugs#no-fedup-beta . To upgrade you need a copy of fedup on your F20 install, and that fedup needs to find a copy of 'upgrade.img', which is the initramfs containing the upgrade environment. As of this moment, we've made it so that when fedup goes and looks for an upgrade.img , it can't find one. Right now you can't do a fedup to F21 at all without explicitly providing an upgrade.img location with the '--instrepo' parameter. We've also implemented a fedup-side workaround and a systemd-side fix for this bug. The fedup-side workaround should mean that if you upgrade with fedup 0.9.0-2, regardless what upgrade.img you use, you should not be able to hit this bug. upgrade.img files that could still be affected by this bug are in the Beta TC4 through Beta RC4 trees on the staging server. It *should* be safe to upgrade using one of those trees with fedup 0.9.0-2. The systemd-side fix requires upgrade.img to be rebuilt. The systemd fix was pushed stable yesterday, so the upgrade.img files in the development/21 tree on the mirrors will contain the systemd-side fix from today onwards - roughly. It depends on mirror sync times. I don't believe anyone's explicitly tested with that upgrade.img yet. The Final TC1 tree should also contain an upgrade.img with the systemd-side fix. I am currently running a test of that. Our plan is to make the pointer fedup uses to find upgrade.img point to the Final TC1 tree, assuming it tests out OK, as a 'stable' source of an upgrade.img which is known not to contain the bug. Right now I'm not willing to say with absolute certainty 'you should upgrade this way', but if I had to upgrade a system RIGHT NOW with a gun to my head I'd say use fedup 0.9.0-2 and pass --instrepo https://dl.fedoraproject.org/pub/alt/stage/21_TC1/Server/x86_64/os/ (or i386 or armhpf, as appropriate). It'd probably be best to wait until we've tested it a bit more and updated the redirects so fedup 'just works', though. Shouldn't be too long (later today). (In reply to Adam Williamson (Red Hat) from comment #51) > It's a slightly complex question. See > https://fedoraproject.org/wiki/Common_F21_bugs#no-fedup-beta . > ... Thanks for explaining in detail. I tested an upgrade using fedup 0.9.0-1 (i.e. intentionally avoiding the fedup workaround) against the upgrade.img from Final TC1 tree and it doesn't hit the bug. If a few other folks can confirm, I think we can use that as the mirrormanager target. I took fedup-0.9.0-1.fc21.noarch from http://arm.koji.fedoraproject.org/koji/buildinfo?buildID=247899 because I didn't know where to take 0.9.0-1 for FC20. Then (after rpm -Uvh --oldpackage fedup*): fedup --product=workstation --network 21 --instrepo https://dl.fedoraproject.org/pub/alt/stage/21_TC1/Server/x86_64/os/ The upgrade went correcly, it took 18 mins 50 secs (2123 packages, I have SSD) on my primary machine (ie. on baremetal, no virtualization). You good guys. I feel happy now :-) Thank you for fixing bugs. BTW: I saw nice ASCII art at the end of the transaction, it told me that the transaction finished succesfully and the machine will be rebooted then. So no prelimitary reboot occurred. Yup, if you see Beefy it means it worked right. =) Thanks for testing! I saw it also f20 > f21 http://wiki.sugarlabs.org/go/Fedora_21#fedup_Updating_f20_desktop_to_f21_workstation yum update yum install fedup fedup --nogpgcheck --network 21 --product workstation --instrepo https://dl.fedoraproject.org/pub/alt/stage/21_TC1/Server/x86_64/os/ Download 1415 files reboot system "System Upgrade Fedup" on boot menu fedup-dracut-0.9.0 Update 1415 files NOTE: Terminal display stops updating at about 68% FIX: switch to {alt-f2} then {alt-f1} Successful Upgrade. (In reply to satellitgo from comment #57) > I saw it also f20 > f21 > http://wiki.sugarlabs.org/go/ > Fedora_21#fedup_Updating_f20_desktop_to_f21_workstation > > yum update > yum install fedup > fedup --nogpgcheck --network 21 --product workstation --instrepo > https://dl.fedoraproject.org/pub/alt/stage/21_TC1/Server/x86_64/os/ > > Download 1415 files > > reboot system > "System Upgrade Fedup" on boot menu > > fedup-dracut-0.9.0 > > Update 1415 files > > NOTE: Terminal display stops updating at about 68% > > FIX: switch to {alt-f2} then {alt-f1} > > Successful Upgrade. fedup of KDE f20 also worked fedup-0.9.0-2.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report. The updates have been pushed and multiple people verified the fix, closing. I was actually tracking multiple 'fixes' in multiple places, but I think the mirrormanager redirects to TC1 are up now which is the last thing we needed, so closed is probably fine. It'd be good to test a straight 'fedup --network 21 --product (foo)' from 19 and 20, with both 0.9.0-1 and 0.9.0-2, to verify everything is in place as expected. (In reply to Adam Williamson (Red Hat) from comment #61) > I was actually tracking multiple 'fixes' in multiple places, but I think the > mirrormanager redirects to TC1 are up now They are still missing. https://mirrors.fedoraproject.org/mirrorlist?repo=fedora-install-21&arch=x86_64 returns > # repo = fedora-install-21 arch = x86_64 error: invalid repo or arch Yeah - sorry, I followed that up later but didn't update this bug. See https://fedoraproject.org/wiki/Common_F21_bugs#no-fedup-beta and bug #1160931. fedup-0.9.0-2.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report. |