Bug 1350120

Summary: fwupd causes crash after sudo dnf system-upgrade reboot
Product: [Fedora] Fedora Reporter: Ricky <zhangchaowang>
Component: fwupdAssignee: Richard Hughes <rhughes>
Status: CLOSED NOTABUG QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 24CC: rhughes, wwoods, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-06-28 07:38:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
boot log
none
/var/lib/dnf/system-upgrade.json none

Description Ricky 2016-06-25 18:22:43 UTC
Description of problem:
Upgrade to fc24 from 4.5.7-200.fc23.x86_64
After download package, execute 'sudo dnf system-upgrade reboot'
I don't see new option in grub as fedup did befoe, which is expected.
System boots but immediately crash and reboot.
How can I grab the log of boot? It is too fast to see error message.
thanks.

Version-Release number of selected component (if applicable):
dnf-plugin-system-upgrade-0.7.1-1.fc23.noarch

Comment 1 Ricky 2016-06-26 04:29:14 UTC
I uses journalctl -b -0. It doesn't show last boot -- the boot after executing 'sudo dnf system-upgrade reboot'

I manually browse boot log in journalctl with the help from time stamp. But I don't find anything fishy. I did see it trigger reboot process in the boot -- umounting /var, /, /home and etc.

Could you suggest me how to dig deeper? It is nice to reference python script in https://github.com/rpm-software-management/dnf-plugin-system-upgrade.

But I have no clues why it reboots. Could you please suggest me any way to debug it?

Comment 2 Ricky 2016-06-26 12:41:13 UTC
Created attachment 1172540 [details]
boot log

This is boot log after executing 'sudo dnf system-upgrade reboot'

Comment 3 Ricky 2016-06-26 12:42:19 UTC
Created attachment 1172541 [details]
/var/lib/dnf/system-upgrade.json

Comment 4 Ricky 2016-06-26 17:04:52 UTC
The issue is resolved. It turns out that I have installed different version of libusb in /usr/local/lib for kinect, which caused fwupd service failed. The failure of fwupd caused system upgrade service terminates. Therefore, dnf-system-upgrade plugin didn't finish.

I spent time to read dnf-system-upgrade plugin. I thought run_upgrade function has started because symlink folder got wiped out. But it is actually wiped out by systmed in dnf-system-upgrade.service post exec.

I'd suggest that add a new internal state in dnf-system-upgrade plugin to indicate that run_upgrade didn't get launch because of interruption by other system update service. 

Also, suggest users to dig into boot log, eg, 'journalctl -b -1'. Somehow, running 'dnf system-upgrade log -1' didn't show last boot log.

Comment 5 Will Woods 2016-06-27 17:39:06 UTC
This problem occurs because of other services that use /system-updates but don't implement the behavior correctly. Especially this part: 

   As the first step, the update script should check if the /system-update
   symlink points to the the location used by that update script. In case it
   does not exists or points to a different location, the script must exit
   without error. It is possible for multiple update services to be
   installed, and for multiple update scripts to be launched in parallel, and
   only the one that corresponds to the tool that created the symlink before
   reboot should perform any actions. It is unsafe to run multiple updates in
   parallel.

(from https://www.freedesktop.org/software/systemd/man/systemd.offline-updates.html, item 5 under "Implementing Offline System Updates")

So either fwupd isn't implementing this correctly, or the libusb problem you mentioned caused it to crash before it could check the /system-update symlink.

I'd recommend that the fwupdmgr binary should (somehow) be called upon to write out a flag file somewhere when fwupd prepares an update, and then fwupd-offline-update.service should have ConditionPathExists=<flagfile>.

This way, the service will not be started unless a) fwupd itself prepared the update, and b) fwupdmgr is at least *minimally* functional - i.e. it's not just gonna crash on startup because of mismatched libraries or similar problems.

Comment 6 Ricky 2016-06-27 19:18:23 UTC
@Will
Thanks for your quick response.

fwupd does item 5 properly:

https://github.com/hughsie/fwupd/blob/0e3aae27f86956843da889d88c8bc1048e010771/src/fu-util.c#L398

But the program just crashed early due to library dependencies issues. It caused reboot without telling me who/what/why caused reboot.

The logic from item 5 should be implemented by systemd.offline-updates service, instead of individual update script. The systemd.offline-updates service should determine who can run and make sure it runs one update script at a time under its control.

I'd prefer the old fedup way to create a new entry in grub and do explicitly upgrade without any interferences.

Comment 7 Will Woods 2016-06-27 21:04:30 UTC
(In reply to Ricky from comment #6)
> @Will
> Thanks for your quick response.
> 
> fwupd does item 5 properly:
> 
> https://github.com/hughsie/fwupd/blob/
> 0e3aae27f86956843da889d88c8bc1048e010771/src/fu-util.c#L398

Except it doesn't protect against the case you encountered, which a flag file would.

> But the program just crashed early due to library dependencies issues. It
> caused reboot without telling me who/what/why caused reboot.

If there was a way for system upgrade tools to notify the user about failures, I'm sure everyone would be happy to use it. But nobody has designed a good facility for that yet. (Note that would need to handle a lot of interesting cases - remote upgrades over ssh, systems without plymouth, systems without graphical displays, etc.)

> The logic from item 5 should be implemented by systemd.offline-updates
> service, instead of individual update script. The systemd.offline-updates
> service should determine who can run and make sure it runs one update script
> at a time under its control.

The systemd team seems to disagree, but feel free to discuss better solutions with them.

> I'd prefer the old fedup way to create a new entry in grub and do explicitly
> upgrade without any interferences.

Editing the grub configuration is much less reliable, and the failure mode is an infinite reboot loop or a totally unbootable system.

With the current method your system is completely unharmed in the event of an unexpected upgrade failure, which is much safer. The downside is (as you note) that it's a lot harder to tell something went wrong when we don't break your system in the process.

Comment 8 Richard Hughes 2016-06-28 07:38:35 UTC
I think not including a new incompatible libusb is the real error here. We;ll use something better for offline updates error reporting if it becomes available.

Comment 9 Zbigniew Jędrzejewski-Szmek 2016-06-30 13:28:05 UTC
I guess we should have a reporting mechanism. I'd could be useful also for other things which need to perform a display a message just before rebooting. I'll try to propose something upstream in systemd.

On the other hand, the fact that a broken fwupd binary causes also other upgrade methods to fail is still an issue. It *would* be nice to implement an additional safety like proposed in second half of comment #5.

Comment 10 Will Woods 2016-06-30 22:06:47 UTC
(In reply to Richard Hughes from comment #8)
> I think not including a new incompatible libusb is the real error here.

In *this* case, yes, but we have seen other instances of fwupd trying to start during a system upgrade, then failing and rebooting the system before the upgrade can start. See Radek's report in bug 1317215:

  https://bugzilla.redhat.com/show_bug.cgi?id=1317215#c9

I'm not certain why fwupd is failing there - there's no clear error messages, other than this line:

Jun 27 12:39:35 dhcp-24-124.brq.redhat.com org.freedesktop.fwupd[836]: (fwupd:846): Fu-WARNING **: FuMain: failed to load AppStream data: AppStream metadata name /var/cache/app-info/xmls/.goutputstream-T2UMHY not valid, expected .xml[.*] or .yml[.*]

But nonetheless, we end up here:

Jun 27 12:39:59 dhcp-24-124.brq.redhat.com dbus[836]: [system] Failed to activate service 'org.freedesktop.fwupd': timed out
Jun 27 12:39:59 dhcp-24-124.brq.redhat.com fwupdmgr[829]: Failed to connect to fwupd: Error calling StartServiceByName for org.freedesktop.fwupd: GDBus.Error:org.freedesktop.DBus.Error.TimedOut: Failed to activate service 'org.freedesktop.fwupd': timed out
Jun 27 12:39:59 dhcp-24-124.brq.redhat.com systemd[1]: fwupd-offline-update.service: Main process exited, code=exited, status=1/FAILURE
Jun 27 12:39:59 dhcp-24-124.brq.redhat.com systemd[1]: fwupd-offline-update.service: Unit entered failed state.
Jun 27 12:39:59 dhcp-24-124.brq.redhat.com systemd[1]: fwupd-offline-update.service: Triggering OnFailure= dependencies.
Jun 27 12:39:59 dhcp-24-124.brq.redhat.com systemd[1]: fwupd-offline-update.service: Failed with result 'exit-code'.

...and then the system reboots. (His workaround was "uninstall fwupd", which is probably not what you want users to do..)

So, yeah. What to do about this is up to you, of course, but having an extra Condition to keep the service from starting when it's not needed would have prevented the problem in this case as well.