Bug 1518401 - Automatic regeneration of nVidia akmod fails on "offline" kernel update
Summary: Automatic regeneration of nVidia akmod fails on "offline" kernel update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: akmods
Version: 27
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nicolas Chauvet (kwizart)
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-11-28 20:26 UTC by Stephen Gallagher
Modified: 2018-02-06 15:29 UTC

Fixed In Version: akmods-0.5.6-12.fc26 akmods-0.5.6-12.fc27
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-06 10:49:15 UTC
Type: Bug
Embargoed:


Attachments
akmod failure log (17.64 KB, text/plain)
2017-11-29 01:47 UTC, Stephen Gallagher


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1474969 0 unspecified CLOSED VirtualBox keeps breaking after updates because akmods sometimes does not do anything 2021-02-22 00:41:40 UTC

Internal Links: 1474969

Description Stephen Gallagher 2017-11-28 20:26:33 UTC
Description of problem:
First off, I know that problems with the nVidia kernel driver are not Fedora's problem; however, I suspect that the issue I'm reporting here involves an interaction between akmods and the offline updates process, which uses PackageKit and systemd's system-update.target.

The issue is that after an offline update involving a new kernel, the `/usr/bin/akmods` call in /etc/kernel/postinst.d/akmodsposttrans apparently executes and fails. After the system reboots at the end of the offline update process, the akmod is marked as failed, and I need to manually run `sudo akmods --force` and then reboot again to restore the driver. (Due to a bug in the nouveau driver, I have to use the nVidia driver to get output on an external monitor from my laptop, so it's obvious when the module hasn't been rebuilt properly.)

Version-Release number of selected component (if applicable):


How reproducible:
Consistently every time a kernel update occurs during offline updates.

Steps to Reproduce:
1. Install Fedora 27 on a system with an nVidia graphics adapter (in my case I have 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M1000M] (rev a2) )
2. Install the nVidia kmod from the negativo17 repository appropriately. Verify that it works properly.
3. Use `pkcon update --only-download` to queue a kernel update for offline updates
4. `pkcon offline-trigger && reboot`

Actual results:
After rebooting, the nVidia driver has failed to build its kmod. The nouveau driver is activated as a fallback, which can be seen with `lsmod`
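
For anyone checking the same symptom, the `lsmod` check can be scripted. This is only a minimal sketch: `loaded_gpu_driver` is a hypothetical helper name, and it assumes the proprietary module shows up in `lsmod` as `nvidia`.

```shell
# Hypothetical helper: read `lsmod` output on stdin and report which
# GPU kernel module is loaded (nvidia wins if both are present).
loaded_gpu_driver() {
    awk '
        $1 == "nvidia"  { seen_nvidia = 1 }
        $1 == "nouveau" { seen_nouveau = 1 }
        END {
            if (seen_nvidia)       print "nvidia"
            else if (seen_nouveau) print "nouveau"
            else                   print "none"
        }
    '
}
```

Usage would be `lsmod | loaded_gpu_driver`; after a failed rebuild, the expectation is that it reports `nouveau`.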

Expected results:
The nVidia driver should be regenerated from the akmods and be activated on boot.

Additional info:
I can't see any sign that SELinux is getting in the way in audit.log or `audit2why`, but I have not expressly tried setting permissive mode before doing this.

Comment 1 Sergio Basto 2017-11-28 21:53:20 UTC
Please check bug #1474969, and please attach /var/cache/akmods/nvidiasometing/5.1.30-1-for-4.13.12-300.fc27.x86_64.failed.log

Comment 2 Stephen Gallagher 2017-11-29 01:47:05 UTC
Created attachment 1360126
akmod failure log

Attaching the log as requested.

Comment 3 Stephen Gallagher 2017-11-29 01:58:40 UTC
In particular, the error "Session terminated, killing shell..." suggests to me that my hypothesis might be correct: systemd may be terminating the system-update.target before the posttrans script has completed.

I notice that the posttrans script actually executes with '&', causing it to go into the background and return success immediately. Perhaps the solution is to drop that and force the process to complete before returning control to RPM?

I'm requesting info from Thorsten Leemhuis, as he appears to have been the person who contributed this to the original rpmfusion version of the package and I'm hoping he knows whether there was a reason to use job control here.

My guess is that it was some sort of protection against the build hanging indefinitely and blocking the upgrade process. But if that's the case, maybe it would be best to solve this with a timeout of some kind.
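
To illustrate what I mean by a timeout, here is a rough sketch, not a proposed patch. It assumes `timeout` from GNU coreutils is available; `run_with_limit` is a hypothetical helper name, and any limit value would need discussion.

```shell
# Hypothetical helper: run a command, but give up after LIMIT seconds.
# Usage: run_with_limit LIMIT COMMAND [ARGS...]
run_with_limit() {
    limit=$1
    shift
    # timeout(1) sends SIGTERM to the command when the limit expires
    # and then exits with status 124; otherwise it passes through the
    # command's own exit status.
    timeout "$limit" "$@"
}
```

The posttrans script could then call something like `run_with_limit 600 /usr/bin/akmods` in the foreground (the 600 seconds is an arbitrary example), so a hung build cannot block the transaction forever.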

Comment 4 Nicolas Chauvet (kwizart) 2017-11-29 07:31:16 UTC
The problem with not using & is that there would be a circular dependency.
We would not be allowed to install another rpm produced by akmods if we don't leave the previous transaction.

There is probably another way to handle that with the kernel posttrans script, such as writing to a socket or pipe that triggers akmods.service.
Once a service is triggered, I expect that systemd would wait for it to end before any reboot.

Another option would be for akmods to register with system-update.target.
But then its tasks might differ in this context from when akmods is started at boot.

There is another context where akmods might be ended prematurely: when end users reboot right after an RPM transaction with a kernel (kernel-devel) update.
Another approach I've envisioned is to take an RPM lock early when running akmods.

Comment 5 Thorsten Leemhuis 2017-11-29 08:01:34 UTC
(In reply to Nicolas Chauvet (kwizart) from comment #4)
> The problem with not using & is that there would be a circular dependency.
> We would not be allowed to install another rpm produced by akmods if we
> don't leave the previous transaction.

FWIW & IIRC: Back in the old days akmods circumvented this by actually building the module in the posttrans and letting the daemon install it immediately after the RPM transaction that brought the new kernel (and thus triggered building the new module/akmod package). That leaves only a very small time window where things can go wrong (not sure if that works for offline updates; those came way later). And yes, I think there even was a "timeout of some kind" to prevent the "build hanging indefinitely and blocking the upgrade process", like Stephen mentioned (but maybe that was still on the todo list back then; not sure).

Comment 6 Hans de Goede 2017-11-29 13:14:36 UTC
Is there a way to detect that we're doing an offline update? Then we could decide to drop the "&" from the posttrans script only in the offline update case ...

Also, aren't we running an akmods service on boot? Shouldn't it catch this and build the module at boot, since it was not built before?

Comment 7 Hans de Goede 2017-11-29 13:16:48 UTC
Ah I see now that part of the problem is the build being marked as failed. We should really not mark builds as failed when the build got interrupted / killed.

Does anyone have suggestions for how to avoid marking builds as failed when they get killed, rather than when they exit due to compiler or linker errors?
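
One possible direction, as a sketch only: a shell reports a child terminated by a signal with an exit status of 128 plus the signal number, so the build wrapper could treat those statuses differently. `classify_exit` is a hypothetical name, and this assumes the status of the killed build actually reaches the wrapper unchanged, which intermediate tools may not guarantee.

```shell
# Hypothetical helper: classify an exit status reported by the shell.
#   0      -> ok
#   > 128  -> interrupted (child was terminated by a signal,
#             e.g. 143 = 128 + 15 for SIGTERM)
#   other  -> failed (ordinary non-zero exit, e.g. a compiler error)
classify_exit() {
    if [ "$1" -eq 0 ]; then
        echo ok
    elif [ "$1" -gt 128 ]; then
        echo interrupted
    else
        echo failed
    fi
}
```

Only the `failed` case would then mark the build as failed; an `interrupted` build would be left for the next boot to retry.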

Comment 8 Stephen Gallagher 2017-11-29 13:21:45 UTC
@Hans, the service that runs at boot calls akmods without --force. It detects that the previous attempt to build the module failed and skips it, assuming it to be broken.

As my analysis shows, I suspect it's failing because systemd ends up rebooting the computer mid-build, which marks it broken even though it might have succeeded.
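
To see which builds are currently marked as failed, something like the following might help. It is a sketch that assumes the failure marker is the `*.failed.log` file under /var/cache/akmods, as the path requested in comment 1 suggests; `list_failed_logs` is a hypothetical helper name.

```shell
# Hypothetical helper: list akmods failure logs under a cache directory.
# Defaults to /var/cache/akmods; pass another directory to override.
list_failed_logs() {
    find "${1:-/var/cache/akmods}" -type f -name '*.failed.log'
}
```

If that assumption holds, every path it prints corresponds to a module/kernel pair that the boot-time service would skip until `akmods --force` is run.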

As for detecting the state, we can probably interrogate systemd. Zbigniew might know more...

Comment 9 Stephen Gallagher 2017-11-29 13:27:06 UTC
Also, I don't think I'm comfortable with just dropping the &. We need to have a timeout, or else we might end up with a situation where some build is hanging (e.g., waiting for user input because a new option was missed in an answerfile or something like that) and thus preventing the update process from ever completing.

If this happens in the offline-updates mode, I'm not sure the system will be recoverable without a rescue disk. (I don't know when the magic file that causes systemd to boot to that mode gets removed.)

Comment 10 nicolas.vieville 2017-11-29 14:31:31 UTC
Hello,

Sorry for this "maybe naive" suggestion, but wouldn't it be possible to try something like the dnf system-upgrade plugin: a one-shot service triggered by the existence of a file in the root tree? See:

https://github.com/rpm-software-management/dnf-plugin-system-upgrade/blob/master/dnf-system-upgrade.service

The file triggering the akmod build would be removed by the service itself (ExecStopPost directive of the service).

Cordially,


-- 
NVieville

Comment 11 Nicolas Chauvet (kwizart) 2017-12-14 21:17:35 UTC
@stephen
Is there any option for verbosity with pkcon ?

It seems like I cannot use pkcon to trigger updates. The system reboots into system-update.target, Plymouth displays "updates in progress", but then it reboots immediately.
This is on f27 with dnf showing that updates are available (including kernel updates).

Comment 12 Stephen Gallagher 2017-12-14 21:20:20 UTC
(In reply to Nicolas Chauvet (kwizart) from comment #11)
> @stephen
> Is there any option for verbosity with pkcon ?
> 
> It seems like I cannot use pkcon to trigger updates. The system reboots
> into system-update.target, Plymouth displays "updates in progress", but
> then it reboots immediately.
> This is on f27 with dnf showing that updates are available (including
> kernel updates).

While in the offline updates mode, pkcon logs everything to the journal. You can increase the verbosity by following the instructions at https://www.freedesktop.org/software/PackageKit/pk-bugs.html

Comment 13 Nicolas Chauvet (kwizart) 2017-12-14 21:45:09 UTC
There is progress from pk-offline-updates up to 6%
Then there is an error from dnf, quoting :
---
dnf[1110] Use "dnf system-upgrade reboot" to start upgrade
systemd[1] Failed to start System Upgrade using DNF.
systemd[1]: dnf-system-upgrade.service: Unit entered failed state.
systemd[1]: dnf-system-upgrade.service: Failed with result 'exit-code'.
systemd[1]: Rebooting: service failed
---
I'm removing the python3-dnf-plugin-system-upgrade...

Comment 14 Nicolas Chauvet (kwizart) 2018-01-08 22:15:24 UTC
So here is my current status.
I don't see any way to hook into the offline update target.
Even running akmods from a dedicated service, or using systemd-run instead of nohup, doesn't prevent the system from rebooting after the offline update has occurred.

What I need is to hook into the following process:
- User selects updates (including a kernel/kernel-devel update) and reboots into offline updates mode.
- systemd offline updates service succeeds (rpm transaction ends and starts akmods)
(- akmods builds and installs any appropriate kmod*)
- systemd offline updates target succeeds
- reboot with watchdog timeout

It seems that the watchdog timeout for reboot is triggered even if the akmods service is still working.


The current workaround is probably to detect when offline updates are in use and exit early (so that akmods can succeed on the next reboot).

Something like (in /etc/kernel/postinst.d/akmodsposttrans):
---
# Exit early during an offline update; akmods can then build on the next boot.
if systemctl is-active system-update.target &>/dev/null; then
    exit 0
fi
---

Comment 15 Fedora Update System 2018-01-26 18:15:34 UTC
akmods-0.5.6-12.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-97ee67cf09

Comment 16 Fedora Update System 2018-01-26 18:15:50 UTC
akmods-0.5.6-12.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2018-c5b163a57d

Comment 17 Fedora Update System 2018-01-28 22:34:32 UTC
akmods-0.5.6-12.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-c5b163a57d

Comment 18 Fedora Update System 2018-01-28 23:04:17 UTC
akmods-0.5.6-12.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-97ee67cf09

Comment 19 Fedora Update System 2018-02-06 10:49:15 UTC
akmods-0.5.6-12.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Comment 20 Fedora Update System 2018-02-06 15:29:59 UTC
akmods-0.5.6-12.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.

