Bug 1159292

Summary:

Machine automatically shutdown during upgrade in less than 15 minutes

Product:

[Fedora] Fedora

Reporter:

Robin Lee <robinlee.sysu>

Component:

fedup-dracut

Assignee:

Will Woods <wwoods>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CC:

awilliam, bastiaan, error, jreznik, kalevlember, kparal, michael.monreal, milan.kerslager, mruckman, robatino, satellitgo, tflink, tomspur, wwoods, zbyszek

Target Milestone:

---

Keywords:

CommonBugs, Reopened

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

https://fedoraproject.org/wiki/Common_F21_bugs#no-fedup-beta

Fixed In Version:

systemd-216-8.fc21

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-11-05 17:32:52 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1043129

Attachments:

Description	Flags
fedup output directly before reboot	none
fedup.log from F19->F21	none
"yum check" output from F19->F21 after unsuccessful upgrade	none
video showing poweroff right after 15th minute mark	none

Description Robin Lee 2014-10-31 12:00:43 UTC

Description of problem:
After fedup was setup, machine rebooted and entered upgrade process. After less than 15 minutes, the machine suddenly halted. The grub entry added by fedup had been removed.

Version-Release number of selected component (if applicable):
fedup-0.8.1-1.fc20

How reproducible:

Steps to Reproduce:
1. fedup --network 21 --instrepo http://dl.fedoraproject.org/pub/alt/stage/21_Beta_RC4/Server/x86_64/os/
2. reboot

Actual results:
Upgrade succeeded.

Comment 1 Kamil Páral 2014-10-31 17:03:51 UTC

I can confirm this issue, and it is not rare. Yesterday, we have seen it on a bare metal machine when upgrading F20 desktop to F21 workstation. Today, we have seen it in a VM and on a bare metal machine, when upgrading F19 desktop to F21 workstation. Unfortunately we didn't catch this before F21 Beta release, because yesterday we considered it to be a hardware issue. Another information worth noting is that we saw it both with fedup 0.8.1 and fedup 0.9.0. The only thing which was common for all these cases is RC4 upgrade.img (also the OP used this image). We have not seen this issue in the past with older upgrade.img files. Quite interestingly, we haven't hit this issue quite a few times even with RC4, otherwise it would be considered a blocker bug. So maybe this issue is only triggered with RC4 upgrade.img *and* if some particular package is present during the upgrade? Not sure.

Right now I have a VM with F19 (default desktop installation) snapshot and I have hit this issue 2 times out of 2 attempts. I have recorded the VM output and found out what are the last lines before the machine shuts down (btw, it seems to shut down hard, like poweroff -f). See attached screenshot. The log line is:

> upgrade[1004]: [707/1287] (89%) cleaning at-spi2-atk-2.8.1-1.fc19...
> cgroup: option or name mismatch, new: 0x0 "", old: 0x4 "systemd"

In a second after printing this, the machine powers off. So the issue might be related to fedup-dracut, systemd or at-spi2-atk. Or all of them.

After starting and booting into the machine again, the results are non-deterministic. Because lots of packages were not cleaned properly, yum complains a lot (in my case, something about 600 duplicate packages). My F19->F21 machine at least booted, but lbrabec's F20->F21 machine didn't even boot to graphics and network didn't work (maybe his machine turned off during actual update phase, not clean phase). Also, there is no /var/log/upgrade.log in my case, so I can't provide it. I can only provide fedup.log. Maybe the upgrade logging wasn't really solved for F19, just for F20?

Comment 2 Kamil Páral 2014-10-31 17:04:45 UTC

Created attachment 952550 [details]
fedup output directly before reboot

Comment 3 Kamil Páral 2014-10-31 17:08:15 UTC

Created attachment 952552 [details]
fedup.log from F19->F21

Comment 4 Kamil Páral 2014-10-31 17:08:46 UTC

Created attachment 952553 [details]
"yum check" output from F19->F21 after unsuccessful upgrade

Comment 5 Kamil Páral 2014-10-31 17:12:05 UTC

Since 4 people hit this already (OP and 3 of us in our office), it seems it isn't a random occurrence. Proposing as a blocker bug. Upgrade criterion.

Comment 6 Adam Williamson 2014-10-31 17:35:33 UTC

davidshea and I both had the thought that this may possibly be related to https://bugzilla.redhat.com/show_bug.cgi?id=1154768 somehow.

Comment 7 Kamil Páral 2014-10-31 18:07:32 UTC

zdzichu told me about a new systemd feature: http://cgit.freedesktop.org/systemd/systemd/tree/NEWS#n25

And bingo, the reboot is time-based. It's after exactly 15 minutes after start. See attached video. To confirm this even further, this time it rebooted after a different package (PackageKit), so it's not a problem with at-spi2-atk.

Either this is a bug in systemd, or there is some service in the background which uses JobTimeoutAction and JobTimeoutRebootArgument, and causes the reboot. The question is what service, and whether fedup-dracut should prevent it.

Comment 8 Kamil Páral 2014-10-31 18:08:43 UTC

Created attachment 952567 [details]
video showing poweroff right after 15th minute mark

Comment 9 Adam Williamson 2014-10-31 19:18:44 UTC

Kamil: that's pretty much exactly what #1154768 is about, I believe. It's a problem with that 'feature' when TimeoutSec is used in a unit file - that feature shouldn't trigger a shutdown if the unit which hasn't completed or whatever specifies TimeoutSec, but it does.

Comment 10 Adam Williamson 2014-10-31 19:33:55 UTC

http://cgit.freedesktop.org/systemd/systemd/commit/?id=2928b0a863091f8f291fddb168988711afd389ef is the commit that added the timeout.

It rather looks like settings were added to /etc/systemd/system.conf to control this behaviour:

+#StartTimeoutSec=15min
+#StartTimeoutAction=reboot-force
+#StartTimeoutRebootArgument=

so might we be able to fix this for Beta by having fedup insert those settings into the /etc/systemd/system.conf in the upgrade environment?

Comment 11 Adam Williamson 2014-10-31 21:16:00 UTC

So, I wanted to try something to see if it would fix this:

diff --git a/fedup/sysprep.py b/fedup/sysprep.py
index f7fbe25..bda50e7 100644
--- a/fedup/sysprep.py
+++ b/fedup/sysprep.py
@@ -150,6 +150,10 @@ def customize_initrd(initrd):
         for f in wantfiles: log.debug("%s",f)
         with TemporaryDirectory(prefix="fedup.initramfs.") as tmpdir:
             curimg.extract(wantfiles, root=tmpdir)
+            sysdconf = os.path.join(tmpdir, 'etc', 'systemd', 'system.conf')
+            log.debug("Editing %s", sysdconf)
+            sysdout = open(sysdconf, 'a')
+            sysdout.write('\nStartTimeoutAction=none\n')
             newimg.append(wantfiles, root=tmpdir)
 
     # look for updates, and add them to initrd if found

(obviously not a clean fix, but enough to test the idea). However, I ran into a bit of a problem: it seems /etc/systemd/system.conf never actually winds up in the fedup initramfs. It *should* do. If you print out the 'list of files to be appended' both in customize_initrd() and in newimg.append(), all the way through, etc/systemd/system.conf shows up. I can even print out the contents of the file after my patch edits it, showing that it gets properly extracted from curimg (and it wouldn't be in newimg.append's 'filelist' unless it was there in the temp directory, but it *is* in newimg.append's 'filelist'). But somehow, it doesn't make it in when newimg.append() actually writes the files into the new initramfs. lsinitrd on fedup-initramfs.img never shows it. This is the case even with an unmodified fedup 0.9 (without my patch above).

It occurs to me that I'm not sure whether systemd when running the fedup upgrade step will read from the copy of /etc/systemd/system.conf in the initramfs, or the one in sysroot.

Anyway, I *think* what we need to do is get that setting set in the fedup environment one way or another. And we really need this fix ASAP, because it's causing broken systems on upgrade.

Comment 12 Adam Williamson 2014-10-31 21:16:36 UTC

oh, there's actually a sysdout.close() in there too.

Comment 13 Mike Ruckman 2014-10-31 21:19:49 UTC

I wasn't able to reproduce this with either fedup versions when using fedup on a i386 F19 box.

Comment 14 Will Woods 2014-10-31 22:28:23 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1154416#c10 claims this systemd feature going to be shut off in F21 anyway, so it's probably not something fedup actually needs to worry about permanently.

I also haven't been able to reproduce this in my testing, so it's being triggered by either a very recent change in F21 *or* something special about your system(s). The question is: what's the cause?

Can you start an upgrade with 'rd.break' and use the dracut emergency shell to check the contents of:

  /etc/systemd/system.conf         (this is the initramfs one)
  /sysroot/etc/systemd/system.conf (this is your system's version)

(In reply to Adam Williamson (Red Hat) from comment #11)
> [...] it doesn't make it in when
> newimg.append() actually writes the files into the new initramfs. lsinitrd
> on fedup-initramfs.img never shows it. This is the case even with an
> unmodified fedup 0.9 (without my patch above).

lsinitrd doesn't handle concatenated initramfs images the way the kernel does, so the files that were appended by fedup may not show up in lsinitrd.

To confirm whether the files are actually in the initramfs you need to boot with 'rd.break' and check the contents of the live initramfs, as described above.

Comment 15 Adam Williamson 2014-10-31 22:37:28 UTC

Hum, it looks like the edit actually worked fine, if I reboot and drop to a dracut shell I see it. looks like just lsinitrd can't read the initramfs properly after fedup edits it.

Unfortunately roshi can run an upgrade that takes more than 15 mins without hitting the bug, so it looks like reproduction isn't 100% so I'm not totally sure I can test my 'fix'. If anyone wants to try it, though, try patching sysprep.py as above (plus the sysdout.close() after the write) and see if that gets you past the shutdown...

Comment 16 Adam Williamson 2014-10-31 22:40:41 UTC

Will: it won't be a permanent problem no, but the upgrade.img in the Beta tree is now built with the systemd that contains this timeout, and we can't change it. Anyone running a fedup against that tree may hit the problem.

I suspect the difference between cases where the bug happens and cases where it doesn't is whether the boot reached a far enough point to consider the boot 'complete' and turn off the timer, but I can't quite grok the systemd code well enough to understand exactly when it does that.

Comment 17 Adam Williamson 2014-10-31 22:57:20 UTC

wwood: I believe the bug is triggered by a very recent change in systemd. It landed upstream after 216 was released, but systemd-216-2 in fedora has in its changelog "- Update to latest git, but without the readahead removal patch" , so I'm guessing we got it there. We pulled 216-4 into Beta TC4. So I believe only tests using an instroot from Beta TC4, Beta RC1, Beta RC4, or development/21 since 10-28 would show this bug.

roshi I think misunderstood, and was expect the timeout to happen during fedup *package download*, so I think his failure to reproduce is bogus.

I currently have a fedup upgrade stage that's been running for 19 minutes and hasn't shut down, with my patch. I actually see the timeout kicking in, in the journal:

Oct 31 22:52:11 test1.happyassassin.net systemd[1]: Startup timed out.

(but it doesn't *do* anything because my patch sets the action on timeout to 'none'). That seems like pretty good confirmation both of the cause, and that my approach should work around the problem.

However, just to be sure, I'll test that the same procedure but without my patch causes the bug, next.

Comment 18 Adam Williamson 2014-10-31 23:33:15 UTC

There's an f20 scratch build of fedup with my test fix here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=7995328

and an f19 one here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=7995397

can folks who can reproduce this bug reliably please test with that fedup? You should see that the upgrade installation stage survives past 15 minutes, and around that time you should see the 'Startup timed out' message appear in the journal.

Comment 19 Adam Williamson 2014-11-01 00:05:20 UTC

For the record, looks like I'm right about when we got this change:

[adamw@adam systemd (f21 %)]$ git log 0038-core-add-support-for-a-configurable-system-wide-star.patch 
commit 708deb95da792cb4515da4ceaf79458bdcc09f7a
Author: Zbigniew Jędrzejewski-Szmek <zbyszek.pl>
Date:   Fri Oct 3 22:43:52 2014 -0400

    Update to latest git
[adamw@adam systemd (f21 %)]$ 

That commit is 216-2. The update, https://admin.fedoraproject.org/updates/FEDORA-2014-12688 , was submitted as 216-2 then edited to -3, -4 and finally -5. It was pulled into TC4 as -4, and into RC1 and RC4 as -5. -5 was pushed stable on 10-28. So as I wrote above, only TC4, RC1, RC2 and RC4 trees, and the development/21 tree from 10-29 on, would have the issue. You also only hit the issue if the 'update installation' stage of fedup takes more than 15 minutes, which it doesn't on a fast disk with a standard package set. I'm installing KDE with a bunch of optional sub-groups in order to make the upgrade take longer, for testing.

The timeout was only reverted upstream four days ago, the reversion has not yet reached the Fedora package.

Comment 20 Adam Williamson 2014-11-01 00:55:19 UTC

I can reproduce the problem when running an upgrade of a big installation without my patched fedup. If you boot into the broken system (might need 1 or rescue) and run 'journalctl -b-1' you can see the problem quite clearly, the log ends with:

Startup timed out.
Forcibly powering off as a result of failure.
Shutting down.
Journal stopped

I snapshotted before rebooting to fedup 'stage2'. I've just run the snapshot, updated to my scratch fedup, and re-run fedup to re-generate the initramfs, and kicked off 'stage2' again to confirm that it completes this time. I've gotta run out for dinner now, will post the results when I return.

Comment 21 Adam Williamson 2014-11-01 03:35:50 UTC

As expected, running the same upgrade but using fedup with my hack results in a successful upgrade. 'journalctl -b-1' shows the 'Startup timed out.' message in the middle of package installation. Seems pretty clear what the cause is, at this point.

It'd be good to decide on a proper fix for this and get it out there on Monday if possible, we want to be ahead of release on Tuesday.

Comment 22 Will Woods 2014-11-01 20:19:34 UTC

Okay so, if I understand right: we never saw this in upgrade testing before now 'cuz the offending/buggy code was:

  * never in an upstream systemd release
  * not in F21 updates until Oct 28 (so my test builds didn't have it), and
  * not in TC1/2/3 or RC2/3

But it *did* get pulled into RC4, which is causing this and bug 1154768.

If we had another day or two the obvious fix would just be to add

  "StartTimeoutAction=none"

to /etc/systemd/systemd.conf, in the systemd package.

But since it's technically feasible for fedup to fix *anything* during an upgrade, and we don't have a week, we're now fixing this in fedup. Okay, fine.

Now: since the 'StartTimeoutAction' configuration item has been removed from systemd upstream, just adding that line to systemd.conf will cause a "Unknown lvalue 'StartTimeoutAction' ..." warning for every version of systemd *other* than 216-{2,3,4,5}. Ugh.

On the other hand, this can probably also be fixed by adjusting the systemd units in the initramfs so that systemd properly considers startup to be finished before the upgrade starts.. which would probably be the intended behavior for dealing with this feature anyway.

Comment 23 Zbigniew Jędrzejewski-Szmek 2014-11-01 20:27:56 UTC

(In reply to Adam Williamson (Red Hat) from comment #11)
> So, I wanted to try something to see if it would fix this:
> 
> diff --git a/fedup/sysprep.py b/fedup/sysprep.py
> index f7fbe25..bda50e7 100644
> --- a/fedup/sysprep.py
> +++ b/fedup/sysprep.py
> @@ -150,6 +150,10 @@ def customize_initrd(initrd):
>          for f in wantfiles: log.debug("%s",f)
>          with TemporaryDirectory(prefix="fedup.initramfs.") as tmpdir:
>              curimg.extract(wantfiles, root=tmpdir)
> +            sysdconf = os.path.join(tmpdir, 'etc', 'systemd', 'system.conf')
> +            log.debug("Editing %s", sysdconf)
> +            sysdout = open(sysdconf, 'a')
> +            sysdout.write('\nStartTimeoutAction=none\n')
>              newimg.append(wantfiles, root=tmpdir)
>  
>      # look for updates, and add them to initrd if found

Yes, this seems to be the most reasonable way without changing the systemd package.

> Author: Zbigniew Jędrzejewski-Szmek <zbyszek.pl>
> Date:   Fri Oct 3 22:43:52 2014 -0400
> 
>     Update to latest git
> [adamw@adam systemd (f21 %)]$ 
> 
> That commit is 216-2.
My fault. I'm sorry for that. This update *was* supposed to go to stable before the
freeze started, but was delayed by bugs (and the -3, -4, ... updates).

Comment 24 Zbigniew Jędrzejewski-Szmek 2014-11-01 20:30:30 UTC

(In reply to Will Woods from comment #22)
> Now: since the 'StartTimeoutAction' configuration item has been removed from
> systemd upstream, just adding that line to systemd.conf will cause a
> "Unknown lvalue 'StartTimeoutAction' ..." warning for every version of
> systemd *other* than 216-{2,3,4,5}. Ugh.
I'll add a compatibility patch to ignore StartTimeoutAction=none in subsequent versions.
That should be fine, since StartTimeoutAction=none is equivalent to the new behaviour.

> On the other hand, this can probably also be fixed by adjusting the systemd
> units in the initramfs so that systemd properly considers startup to be
> finished before the upgrade starts.. which would probably be the intended
> behavior for dealing with this feature anyway.
That would be hard, changes to the the boot order are inherently fragile. Adam's fix seems to be much safer.

Comment 25 Zbigniew Jędrzejewski-Szmek 2014-11-01 20:36:09 UTC

(In reply to Adam Williamson (Red Hat) from comment #11)
> So, I wanted to try something to see if it would fix this:
> 
> diff --git a/fedup/sysprep.py b/fedup/sysprep.py
> index f7fbe25..bda50e7 100644
> --- a/fedup/sysprep.py
> +++ b/fedup/sysprep.py
> @@ -150,6 +150,10 @@ def customize_initrd(initrd):
>          for f in wantfiles: log.debug("%s",f)
>          with TemporaryDirectory(prefix="fedup.initramfs.") as tmpdir:
>              curimg.extract(wantfiles, root=tmpdir)
> +            sysdconf = os.path.join(tmpdir, 'etc', 'systemd', 'system.conf')
> +            log.debug("Editing %s", sysdconf)
> +            sysdout = open(sysdconf, 'a')
> +            sysdout.write('\nStartTimeoutAction=none\n')
>              newimg.append(wantfiles, root=tmpdir)

This will cause an unnecessary message in the logs. Something like this would be nicer:

  sysdconf = os.path.join(tmpdir, 'etc/systemd/system.conf')
  log.debug("Editing %s", sysdconf)
  with open(sysdconf, 'a') as sysdout:
      sysdout.write('\nStartTimeoutSec=0\n')

Comment 26 Adam Williamson 2014-11-01 20:51:19 UTC

Will: sorry, I guess I didn't quite explain the timeline quite right, let me give it one more shot. The change landed in F21 in systemd-216-2.fc21 and is in all subsequent F21 systemd builds so far. That build or a later one:

* Was in F21 updates-testing from 2014-10-08
* Was in F21 stable from 2014-10-28 (plus whatever time it took to sync mirrors)
* Was in F21 Beta TC4 onwards (RC2 and RC3 were kind of 'DOA' builds which is why I don't always mention them). This means the upgrade.img in TC4, RC1, RC2 and RC4 trees on stage/ (there is no RC3 tree) have the bug.

The development/21 tree is built from the F21 'stable' package set, AIUI. That means the upgrade.img in that tree would not have had this bug until, probably, 2014-10-29.

Obviously to hit the bug you still need the fedup package installation stage to take more than 15 minutes. If you have an SSD and a typical GNOME install, for instance, it probably won't, which is why we didn't see this bug in RC4 validation (I ran two upgrade tests on RC4 and didn't hit this bug either time, because the package update stage was fast enough to not hit it). In order to reproduce it I had to install a VM with 1900+ packages (I installed KDE plus a bunch of optional package groups). It's not that unusual to have a lot of packages installed, though, and it's much much easier to run into the bug if you have an HDD not an SSD, I'd expect most 'normal sized' installs to an HDD would hit it.

Zbsyzek: thanks for the suggestion. I actually did it precisely because I *wanted* the message in the logs for testing purposes (makes it much easier to check/prove the bug is what we think it is). For a production fix I agree it'd make sense to change the timeout rather than the action.

I don't know what the current plan is upstream to replace the timeout - I know you're planning to re-implement it in some different way. To me it would make sense for the config parameters to stay valid, not just add a compatibility patch to silently ignore them. I would expect that StartTimeoutSec=0 should just always disable any kind of system-wide 'boot timeout' of this kind. Having thought about it a bit I think the hack I came up with on the fly actually kind of makes sense in the long term: so long as it has any kind of 'system-wide' boot timeout mechanism, systemd should always treat the setting StartTimeoutSec=0 as disabling it, and things like fedup should always set it. That should avoid any kind of recurrence of this problem in future. Of course, for the long term we'd want to bake the setting into fedup's initramfs at generation time, not have the fedup binary stuff it in on the fly like my patch does. If we do go with my patch I'd expect we'd just leave it in until Final, when we could fix it properly in fedup-dracut.

Comment 27 Adam Williamson 2014-11-01 21:01:31 UTC

"If we had another day or two the obvious fix would just be to add"

It's not about time exactly - it's about the fact that we've signed off on Beta, so the Beta upgrade.img is now set in stone. We cannot change anything in https://dl.fedoraproject.org/pub/alt/stage/21_Beta_RC4/Server/x86_64/os/images/pxeboot/upgrade.img , it is done, irrevocably. On Tuesday it will become https://dl.fedoraproject.org/pub/fedora/linux/releases/test/21-Beta/Server/x86_64/os/images/pxeboot/upgrade.img . It has this bug, and we cannot change that. So we're stuck working around it somewhere else, and fedup is about the only viable place.

Obviously we can fix it for the upgrade.img that lives at https://dl.fedoraproject.org/pub/fedora/linux/development/21/x86_64/os/images/pxeboot/upgrade.img - again AIUI, as soon as a systemd without this 'feature' goes into 21 stable, that image will no longer have this bug. But we have instructions in various places telling people to use the image from the milestone tree (not the development/21 tree), and I think that's probably what mirrormanager will wind up pointing to, so we can't really leave things in the state where you hit this bug if you run fedup against the upgrade.img in our official frozen Beta tree.

Comment 28 Zbigniew Jędrzejewski-Szmek 2014-11-01 21:10:39 UTC

The system wide timeout was changed to a unit-specific timeouts (see http://cgit.freedesktop.org/systemd/systemd/commit/?id=3898b80d40). This is more flexible, because different values make sense for different targets, and also it ties more naturally into existing configuration scheme. One or two options (JobTimeoutSec, JobTimeoutAction) are added to specific units for which the timeouts are wanted.

In the new scheme, there's a 15 min timeout on basic.target (not default.target like before), and 30 min timeout for reboot/poweroff. The first one shouldn't matter for fedup, but the second one *might* have to be adjusted for jobs which e.g. download lots of packages during shutdown.

Hm, now that I think about it, fedora-autorelabel.service might trigger the boot timeout. So it seems that this requires more thinking before being enabled.

Comment 29 Adam Williamson 2014-11-01 23:54:47 UTC

Will: I realized after posting that last comment that you're suggesting we can change systemd in f19 and f20 so the setting to disable the timeout is in the /etc/systemd/system.conf that will get copied into the initramfs. That's possibly viable, but I see a couple of problems aside from the tight timeline to try and get an update into stable before Beta goes out:

i) system.conf really is a *config* file, i.e. people might have modified it. If they have, the change will come in a .rpmnew, and people often don't bother checking for and merging rpmnew files, especially if the system appears to be working. (I've just checked, it definitely is marked as %config(noreplace) in the spec.)

ii) people who just don't update the system before running fedup. We say you ought to, but there's nothing to enforce it, and I know people sometimes don't.

Comment 30 Michael Hampton 2014-11-03 05:19:28 UTC

(In reply to Adam Williamson (Red Hat) from comment #18)
> There's an f20 scratch build of fedup with my test fix here:
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=7995328
...
> can folks who can reproduce this bug reliably please test with that fedup?
> You should see that the upgrade installation stage survives past 15 minutes,
> and around that time you should see the 'Startup timed out' message appear
> in the journal.

I can reproduce this reliably, and I can confirm that this build of fedup allows for installation to complete successfully, though for some reason it seemed to run much slower. Might have just been an overly abused qcow2 snapshot though...

After running into this issue on three different systems, and spending most of the day figuring out how to make them usable again, I can say that recovering from it ranges from straightforward to nightmarish, depending on what packages were installed and where exactly the upgrade stopped. Bet on nightmarish, though, especially for a multilib system.

I would recommend adding this to Common F21 bugs on the wiki, as this has the potential to leave systems in a semi-usable or even unusable state. (For instance, NetworkManager on one affected system could not connect to any networks. Xorg failed to start on another one.)

Comment 31 Kamil Páral 2014-11-03 11:33:44 UTC

I tested patched fedup (fc19) from comment 18 and it fixes the problem for me. I see this message in upgrade.log:
> Nov 03 09:41:36 localhost.localdomain systemd[1]: Startup timed out.
but the upgrade process continued and the machine was correctly upgraded.

(In reply to Adam Williamson (Red Hat) from comment #27)
> It's not about time exactly - it's about the fact that we've signed off on
> Beta, so the Beta upgrade.img is now set in stone. We cannot change anything
> in
> https://dl.fedoraproject.org/pub/alt/stage/21_Beta_RC4/Server/x86_64/os/
> images/pxeboot/upgrade.img , it is done, irrevocably. On Tuesday it will
> become
> https://dl.fedoraproject.org/pub/fedora/linux/releases/test/21-Beta/Server/
> x86_64/os/images/pxeboot/upgrade.img . It has this bug, and we cannot change
> that. So we're stuck working around it somewhere else, and fedup is about
> the only viable place.

Well, is it really that unreasonable rebuilding the upgrade.img with a newer systemd a replacing it in test/21-Beta location? Before it is published, of course.

Alternatively, we can also create a fixed compose (might not even be a full compose) and redirecting mirrormanager to it, instead of test/21-Beta. IIRC, we never asked end users to use --instrepo, even with Alpha/Beta upgrades. We use mirrormanager for that, the url is http://mirrors.fedoraproject.org/mirrorlist?repo=fedora-install-21&arch=x86_64 . (That doesn't work yet, which I believe is an omission, because I think it used to work even for F20 Alpha. But you can try the link with an older release number). Fedora Infra used to update the link from Alpha location, to Beta location, to Final location.

So, if it is easier than to hack fedup, we can create something like test/21-Beta-fedup-fix with just the necessary files included, and point the mirrormanager to it.

Comment 32 Will Woods 2014-11-03 14:47:33 UTC

(In reply to Adam Williamson (Red Hat) from comment #29)
> Will: I realized after posting that last comment that you're suggesting we
> can change systemd in f19 and f20 so the setting to disable the timeout is
> in the /etc/systemd/system.conf that will get copied into the initramfs.

No, sorry - to clarify a bit, I'm suggesting a new systemd package for *F21* that has "StartTimeoutAction=none" in systemd.conf. That'd be a one-line change that adds no new code and disables something that's already disabled/removed upstream.

Then we'd rebuild upgrade.img and everything's done. So, roughly what Kamil's saying in comment 31.

Also, a note: fedup doesn't add files from the host's /etc, it adds files from the /etc in the host's *initramfs*. 

[This is necessary because dracut cherry-picks files (and modifies a few of them) when it builds an initramfs, and there's no way to reproduce that behavior outside of dracut.]

So: updating systemd in F19/F20 has the problems you mentioned, *plus* it still wouldn't reliably fix the problem unless we also forced the system to rebuild its initramfs (to get the updated systemd.conf) before running fedup (which then pulls the updated file into upgrade.img).

Comment 33 Adam Williamson 2014-11-03 15:29:13 UTC

That'd be viable, yeah, but it'd still worry me that people *could* use the Beta tree and hit the bug. It's a bit hard to remember what instructions we've written down where for doing pre-release upgrades, at this point, we may well have pointed people at the frozen milestone tree as a source of stable, 'known-OK' upgrade images.

The only other viable place we have to point mirrormanager, BTW, at least up until Final TC1, is into development/21 , where the image might conceivably be broken or missing some days (missing isn't that likely, but possible, I guess). It's something we'd have to worry about every time it got rebuilt.

I knew about the initramfs wrinkle of course (bit hard not to when writing that patch :>) but forgot about the impact it'd have on the F20 solution.

I'm still kind of liking something like my approach, tbh, however we want to try and 'bound' it so it's not around forever (only put it in fedup builds till Final TC1, conditionalize it in the code somehow, whatever).

Comment 34 Will Woods 2014-11-03 15:29:37 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #28)
> The system wide timeout was changed to a unit-specific timeouts (see
> http://cgit.freedesktop.org/systemd/systemd/commit/?id=3898b80d40). This is
> more flexible, because different values make sense for different targets,
> and also it ties more naturally into existing configuration scheme. One or
> two options (JobTimeoutSec, JobTimeoutAction) are added to specific units
> for which the timeouts are wanted.
>
> In the new scheme, there's a 15 min timeout on basic.target (not
> default.target like before), and 30 min timeout for reboot/poweroff. The
> first one shouldn't matter for fedup

Right. We have two sets of targets:

Host: 
* system-upgrade-generator links "system-upgrade.target" to default.target
* system-upgrade.target Requires sysinit.target, Wants upgrade-prep.service
* upgrade-prep.service (on ExecStopPost) isolates upgrade-switch-root.target

[Here we switch-root back to upgrade.img to run the upgrade..]

Upgrade.img:
* initrd-system-upgrade-generator links "upgrade.target" to default.target
* upgrade.target Requires sysinit.target and sockets.target

[And now the upgrade actually starts.]

So, yes: we don't use basic.target in either case, which means we shouldn't need to worry about this after Beta.

(In reply to Zbigniew Jędrzejewski-Szmek from comment #24)
> (In reply to Will Woods from comment #22)
> > On the other hand, this can probably also be fixed by adjusting the systemd
> > units in the initramfs so that systemd properly considers startup to be
> > finished before the upgrade starts.. which would probably be the intended
> > behavior for dealing with this feature anyway.
> That would be hard, changes to the the boot order are inherently fragile.
> Adam's fix seems to be much safer.

I don't think I need to change the existing boot ordering, I think I just need to tweak the ordering for 'system-upgrade.target' in the "Host" section above.

If I understand the StartTimeoutAction implementation correctly, the timeout happens if systemd's initial job queue doesn't empty after 15 minutes; I just need to fix the units that fedup ships so that the job queue correctly empties (i.e. systemd considers the system started) before we do the switch-root.

Comment 35 Will Woods 2014-11-03 15:46:48 UTC

Testing to confirm now, but if I've read the systemd code right, this will solve the problem:

diff --git a/systemd/upgrade-prep.service.in b/systemd/upgrade-prep.service.in
index af39946..a28837b 100644
--- a/systemd/upgrade-prep.service.in
+++ b/systemd/upgrade-prep.service.in
@@ -7,7 +7,7 @@ OnFailureIsolate=true
 DefaultDependencies=no
 
 [Service]
-Type=oneshot
+Type=idle
 ExecStart=@LIBEXECDIR@/upgrade-prep.sh
 ExecStopPost=-/usr/bin/systemctl --no-block isolate upgrade-switch-root.target
 StandardOutput=journal+console

Comment 36 Will Woods 2014-11-03 16:23:14 UTC

Well, my test upgrade is 40% complete after 23 minutes and still going, so I think the above patch works.

Further confirmation would be appreciated, of course; to try this workaround, just edit /lib/systemd/system/upgrade-prep.service and change the Type to 'idle' before you reboot. Or just:

 sudo sed -i 's/Type=oneshot/Type=idle/' /lib/systemd/system/upgrade-prep.service

Also, just to be clear: the systemd bug is still in the RC4 upgrade.img. We can push this workaround as an update, but if we don't fix upgrade.img anyone who upgrades to F21 Beta without updating fedup first will have their systems mangled.

Comment 37 Zbigniew Jędrzejewski-Szmek 2014-11-03 16:31:44 UTC

The problem is that systemd will consider upgrade-prep.service failed after it exits. Might now matter though.

Comment 38 Zbigniew Jędrzejewski-Szmek 2014-11-03 16:34:52 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #37)
> The problem is that systemd will consider upgrade-prep.service failed after
> it exits. Might now matter though.
Nevermind, it exits with 0 so everything is fine.

Comment 39 Fedora Update System 2014-11-03 18:01:42 UTC

fedup-0.9.0-2.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/fedup-0.9.0-2.fc20

Comment 40 Fedora Update System 2014-11-03 18:05:30 UTC

fedup-0.9.0-2.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/fedup-0.9.0-2.fc19

Comment 41 Fedora Update System 2014-11-03 18:05:47 UTC

fedup-0.9.0-2.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/fedup-0.9.0-2.fc21

Comment 42 Adam Williamson 2014-11-03 18:43:54 UTC

0.9.0-2 looks good in a test from my reproducer, I got a complete upgrade from my reproducer snapshot. Note 0.9.0-2 uses the fix wwoods sugested in c#35, not the one I suggested, so tests of my scratch build do not validate 0.9.0-2. It'd be great if others who can reproduce can test with 0.9.0-2, confirm the fix, and provide karma ASAP - we want to get this pushed stable by tomorrow.

Comment 43 Fedora Update System 2014-11-03 19:16:33 UTC

systemd-216-8.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/systemd-216-8.fc21

Comment 44 Michael Hampton 2014-11-03 19:27:16 UTC

(In reply to Adam Williamson (Red Hat) from comment #42)
> 0.9.0-2 looks good in a test from my reproducer, I got a complete upgrade
> from my reproducer snapshot. Note 0.9.0-2 uses the fix wwoods sugested in
> c#35, not the one I suggested, so tests of my scratch build do not validate
> 0.9.0-2. It'd be great if others who can reproduce can test with 0.9.0-2,
> confirm the fix, and provide karma ASAP - we want to get this pushed stable
> by tomorrow.

I've put in my good karma. 0.9.0-2 not only works, it is not slow for me like the scratch build was.

Comment 45 Fedora Update System 2014-11-04 02:29:16 UTC

fedup-0.9.0-2.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 46 Adam Williamson 2014-11-04 02:35:13 UTC

Let's keep this open while we track all the changes and close it manually when we're sure everything's good (expect a few more close/re-open bounces like this).

Comment 47 Fedora Update System 2014-11-04 02:48:06 UTC

systemd-216-8.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 48 Mike Ruckman 2014-11-04 05:55:42 UTC

Reopening per comment #46.

Comment 49 Michael Monreal 2014-11-04 17:33:32 UTC

Is this fixed now when upgrading a F20 desktop to F21 beta?

Comment 50 Michael Hampton 2014-11-04 17:36:08 UTC

(In reply to Michael Monreal from comment #49)
> Is this fixed now when upgrading a F20 desktop to F21 beta?

Yes, provided the updates have reached your local mirror (which they should have by now).

Comment 51 Adam Williamson 2014-11-04 17:46:48 UTC

It's a slightly complex question. See https://fedoraproject.org/wiki/Common_F21_bugs#no-fedup-beta .

To upgrade you need a copy of fedup on your F20 install, and that fedup needs to find a copy of 'upgrade.img', which is the initramfs containing the upgrade environment.

As of this moment, we've made it so that when fedup goes and looks for an upgrade.img , it can't find one. Right now you can't do a fedup to F21 at all without explicitly providing an upgrade.img location with the '--instrepo' parameter.

We've also implemented a fedup-side workaround and a systemd-side fix for this bug. The fedup-side workaround should mean that if you upgrade with fedup 0.9.0-2, regardless what upgrade.img you use, you should not be able to hit this bug. upgrade.img files that could still be affected by this bug are in the Beta TC4 through Beta RC4 trees on the staging server. It *should* be safe to upgrade using one of those trees with fedup 0.9.0-2.

The systemd-side fix requires upgrade.img to be rebuilt. The systemd fix was pushed stable yesterday, so the upgrade.img files in the development/21 tree on the mirrors will contain the systemd-side fix from today onwards - roughly. It depends on mirror sync times. I don't believe anyone's explicitly tested with that upgrade.img yet.

The Final TC1 tree should also contain an upgrade.img with the systemd-side fix. I am currently running a test of that. Our plan is to make the pointer fedup uses to find upgrade.img point to the Final TC1 tree, assuming it tests out OK, as a 'stable' source of an upgrade.img which is known not to contain the bug.

Right now I'm not willing to say with absolute certainty 'you should upgrade this way', but if I had to upgrade a system RIGHT NOW with a gun to my head I'd say use fedup 0.9.0-2 and pass --instrepo https://dl.fedoraproject.org/pub/alt/stage/21_TC1/Server/x86_64/os/ (or i386 or armhpf, as appropriate). It'd probably be best to wait until we've tested it a bit more and updated the redirects so fedup 'just works', though. Shouldn't be too long (later today).

Comment 52 Michael Monreal 2014-11-04 18:04:59 UTC

(In reply to Adam Williamson (Red Hat) from comment #51)
> It's a slightly complex question. See
> https://fedoraproject.org/wiki/Common_F21_bugs#no-fedup-beta .
> ...

Thanks for explaining in detail.

Comment 53 Adam Williamson 2014-11-04 18:14:51 UTC

I tested an upgrade using fedup 0.9.0-1 (i.e. intentionally avoiding the fedup workaround) against the upgrade.img from Final TC1 tree and it doesn't hit the bug. If a few other folks can confirm, I think we can use that as the mirrormanager target.

Comment 54 Milan Kerslager 2014-11-04 21:22:54 UTC

I took fedup-0.9.0-1.fc21.noarch from http://arm.koji.fedoraproject.org/koji/buildinfo?buildID=247899 because I didn't know where to take 0.9.0-1 for FC20. Then (after rpm -Uvh --oldpackage fedup*):

fedup --product=workstation --network 21 --instrepo https://dl.fedoraproject.org/pub/alt/stage/21_TC1/Server/x86_64/os/

The upgrade went correcly, it took 18 mins 50 secs (2123 packages, I have SSD) on my primary machine (ie. on baremetal, no virtualization).

You good guys. I feel happy now :-) Thank you for fixing bugs.

Comment 55 Milan Kerslager 2014-11-04 21:31:30 UTC

BTW: I saw nice ASCII art at the end of the transaction, it told me that the transaction finished succesfully and the machine will be rebooted then. So no prelimitary reboot occurred.

Comment 56 Adam Williamson 2014-11-04 21:34:28 UTC

Yup, if you see Beefy it means it worked right. =) Thanks for testing!

Comment 57 satellitgo 2014-11-05 00:58:44 UTC

I saw it also f20 > f21
http://wiki.sugarlabs.org/go/Fedora_21#fedup_Updating_f20_desktop_to_f21_workstation

yum update
yum install fedup
fedup --nogpgcheck --network 21 --product workstation --instrepo https://dl.fedoraproject.org/pub/alt/stage/21_TC1/Server/x86_64/os/

    Download 1415 files 

    reboot system
    "System Upgrade Fedup" on boot menu

    fedup-dracut-0.9.0

        Update 1415 files 

    NOTE: Terminal display stops updating at about 68%

    FIX: switch to {alt-f2} then {alt-f1} 

    Successful Upgrade.

Comment 58 satellitgo 2014-11-05 02:46:38 UTC

(In reply to satellitgo from comment #57)
> I saw it also f20 > f21
> http://wiki.sugarlabs.org/go/
> Fedora_21#fedup_Updating_f20_desktop_to_f21_workstation
> 
> yum update
> yum install fedup
> fedup --nogpgcheck --network 21 --product workstation --instrepo
> https://dl.fedoraproject.org/pub/alt/stage/21_TC1/Server/x86_64/os/
> 
>     Download 1415 files 
> 
>     reboot system
>     "System Upgrade Fedup" on boot menu
> 
>     fedup-dracut-0.9.0
> 
>         Update 1415 files 
> 
>     NOTE: Terminal display stops updating at about 68%
> 
>     FIX: switch to {alt-f2} then {alt-f1} 
> 
>     Successful Upgrade.

fedup of KDE f20 also worked

Comment 59 Fedora Update System 2014-11-05 03:57:23 UTC

fedup-0.9.0-2.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 60 Kamil Páral 2014-11-05 17:32:52 UTC

The updates have been pushed and multiple people verified the fix, closing.

Comment 61 Adam Williamson 2014-11-05 17:44:37 UTC

I was actually tracking multiple 'fixes' in multiple places, but I think the mirrormanager redirects to TC1 are up now which is the last thing we needed, so closed is probably fine. It'd be good to test a straight 'fedup --network 21 --product (foo)' from 19 and 20, with both 0.9.0-1 and 0.9.0-2, to verify everything is in place as expected.

Comment 62 Kamil Páral 2014-11-06 13:25:52 UTC

(In reply to Adam Williamson (Red Hat) from comment #61)
> I was actually tracking multiple 'fixes' in multiple places, but I think the
> mirrormanager redirects to TC1 are up now

They are still missing.
https://mirrors.fedoraproject.org/mirrorlist?repo=fedora-install-21&arch=x86_64
returns
> # repo = fedora-install-21 arch = x86_64 error: invalid repo or arch

Comment 63 DO NOT USE account not monitored (old adamwill) 2014-11-06 16:40:26 UTC

Yeah - sorry, I followed that up later but didn't update this bug. See https://fedoraproject.org/wiki/Common_F21_bugs#no-fedup-beta and bug #1160931.

Comment 64 Fedora Update System 2014-11-10 06:02:45 UTC

fedup-0.9.0-2.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.