Bug 1518401

Summary:

Automatic regeneration of nVidia akmod fails on "offline" kernel update

Product:

[Fedora] Fedora

Reporter:

Stephen Gallagher <sgallagh>

Component:

akmods

Assignee:

Nicolas Chauvet (kwizart) <kwizart>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CC:

fedora, gholms, hdegoede, hobbes1069, kwizart, leigh123linux, negativo17, nicolas.vieville, sergio, sgallagh, zbyszek

Target Milestone:

---

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

akmods-0.5.6-12.fc26 akmods-0.5.6-12.fc27

Last Closed:

2018-02-06 10:49:15 UTC

Type:

Bug

Attachments:

Description	Flags
akmod failure log	none

Description Stephen Gallagher 2017-11-28 20:26:33 UTC

Description of problem:
First off, I know that problems with the nVidia kernel driver are not Fedora's problem. However, I suspect that the issue I'm reporting here involves an interaction between the akmod and the offline updates process, which uses PackageKit and systemd's system-update.target.

The issue is that after an offline update involving a new kernel, the `/usr/bin/akmods` call in /etc/kernel/postinst.d/akmodsposttrans apparently executes and fails. After the system has rebooted at the end of the offline update process, the akmod is marked as failed, and I need to manually call `sudo akmods --force` and then reboot again in order to restore the driver. (Due to a bug in the nouveau driver, I have to use the nVidia driver to get output on an external monitor from my laptop, so it's obvious when it hasn't been rebuilt properly.)

Version-Release number of selected component (if applicable):

How reproducible:
Consistently every time a kernel update occurs during offline updates.

Steps to Reproduce:
1. Install Fedora 27 on a system with an nVidia graphics adapter (in my case I have 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M1000M] (rev a2) )
2. Install the nVidia kmod from the negativo17 repository appropriately. Verify that it works properly.
3. Use `pkcon update --only-download` to queue a kernel update for offline updates
4. `pkcon offline-trigger && reboot`

Actual results:
After rebooting, the nVidia driver has failed to build its kmod. The nouveau driver is activated as a fallback, which can be seen with `lsmod`.
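
As a rough illustration of the `lsmod` check mentioned above, the following sketch reads `lsmod`-style output from standard input and prints which of the two drivers is loaded. The helper name `active_gpu_driver` is hypothetical and is not part of akmods or the nVidia packaging.

```shell
#!/bin/sh
# Report which GPU kernel module appears in lsmod-style output read from stdin.
# active_gpu_driver is a hypothetical helper name, not part of akmods.
active_gpu_driver() {
    # Skip the header line, then remember a matching module name if one appears.
    awk 'NR > 1 && ($1 == "nvidia" || $1 == "nouveau") { found = $1 }
         END { print (found ? found : "none") }'
}

# Usage on a live system:
#   lsmod | active_gpu_driver
```

If this prints `nouveau` after an offline kernel update, the fallback described above is in effect.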

Expected results:
The nVidia driver should be regenerated from the akmods and be activated on boot.

Additional info:
I can't see any sign that SELinux is getting in the way in audit.log or `audit2why`, but I have not expressly tried setting permissive mode before doing this.

Comment 1 Sergio Basto 2017-11-28 21:53:20 UTC

Please check bug #1474969, and please attach /var/cache/akmods/nvidiasomething/5.1.30-1-for-4.13.12-300.fc27.x86_64.failed.log

Comment 2 Stephen Gallagher 2017-11-29 01:47:05 UTC

Created attachment 1360126 [details]
akmod failure log

Attaching the log as requested.

Comment 3 Stephen Gallagher 2017-11-29 01:58:40 UTC

In particular, the error "Session terminated, killing shell..." suggests to me that my hypothesis might be correct: systemd may be terminating system-update.target before the posttrans script has completed.

I notice that the posttrans script actually executes with '&', causing it to go into the background and return success immediately. Perhaps the solution is to drop that and force the process to complete before returning control to RPM?

I'm requesting info from Thorsten Leemhuis, as he appears to have been the person who contributed this to the original rpmfusion version of the package and I'm hoping he knows whether there was a reason to use job control here.

My guess is that it was some sort of protection against the build hanging indefinitely and blocking the upgrade process. But if that's the case, maybe it would be best to solve this with a timeout of some kind.

Comment 4 Nicolas Chauvet (kwizart) 2017-11-29 07:31:16 UTC

The problem with not using & is that there would be a circular dependency.
We would not be allowed to install another RPM produced by akmods without first leaving the previous transaction.

There is probably another way to handle that with the kernel posttrans script,
such as writing to a socket/pipe that would trigger akmods.service.
Once a service is triggered, I expect that systemd would wait for it to end before any reboot.

Another means would be for akmods to register under system-update.target.
But then its tasks might differ in this context from when akmods is started at boot.

There is another context where akmods might be ended prematurely: when end users reboot right after an RPM transaction with a kernel (kernel-devel) update.
Another approach I've envisioned is to take an RPM lock early when running akmods.

Comment 5 Thorsten Leemhuis 2017-11-29 08:01:34 UTC

(In reply to Nicolas Chauvet (kwizart) from comment #4)
> The problem with not using & is that there would be a circular dependency.
> We would not be allowed to install another RPM produced by akmods without
> first leaving the previous transaction.

FWIW & IIRC: Back in the old days akmod circumvented this by actually building the module in the posttrans and letting the daemon install it immediately after the RPM transaction that brought the new kernel (and thus triggered building the new module/akmod package). That leaves only a very small time window where things can go wrong (not sure if that works for offline updates; those came way later). And yes, I think there even was a "timeout of some kind" to prevent the "build hanging indefinitely and blocking the upgrade process", like Stephen mentioned (but maybe that was still on the todo list back then; not sure).

Comment 6 Hans de Goede 2017-11-29 13:14:36 UTC

Is there a way to detect that we're doing an offline update? Then we could decide to drop the "&" from the posttrans script only in the offline update case ...

Also, aren't we running some akmods service on boot? Shouldn't this catch the problem and build the module at boot, since it was not built before?

Comment 7 Hans de Goede 2017-11-29 13:16:48 UTC

Ah I see now that part of the problem is the build being marked as failed. We should really not mark builds as failed when the build got interrupted / killed.

Does anyone have suggestions for how to avoid marking builds as failed when they get killed, rather than when they exit due to compiler/linker errors?
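
One possible way to tell the two cases apart, sketched here under the assumption that the build wrapper can see the raw exit status of the build process: shells report a status above 128 when a command is terminated by a signal, whereas compiler or linker errors normally produce a small non-zero status. The helper name `classify_exit` is hypothetical and is not part of akmods.

```shell
#!/bin/sh
# Classify an exit status. POSIX shells report a status above 128 when a
# command was terminated by a signal (bash uses 128 + signal number).
# classify_exit is a hypothetical helper, not part of akmods.
classify_exit() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo success
    elif [ "$status" -gt 128 ]; then
        # Killed by a signal, e.g. 143 = 128 + SIGTERM (15).
        echo interrupted
    else
        echo failed
    fi
}

# Usage: run the build, then decide whether to write a failure marker.
#   some_build_command; classify_exit "$?"
```

A status that has passed through intermediate tools may no longer reflect the signal, so this is only a starting point.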

Comment 8 Stephen Gallagher 2017-11-29 13:21:45 UTC

@Hans, the service that runs at boot calls akmods without --force. It detects that the previous attempt to build the module failed and skips it, assuming it to be broken.
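
To see which builds the boot-time service would skip, one could list the failure logs left behind for a given kernel. This is a sketch only: it assumes the cache layout and the `-for-<kernel>.failed.log` suffix match the path quoted in comment 1, and `list_failed_builds` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the failure logs left for a given kernel, one per line.
# The directory layout and the "-for-<kernel>.failed.log" suffix follow the
# path quoted in comment 1; list_failed_builds is a hypothetical helper.
list_failed_builds() {
    cache_dir=$1
    kernel=$2
    for log in "$cache_dir"/*/*-for-"$kernel".failed.log; do
        # An unmatched glob stays literal, so check the file really exists.
        [ -e "$log" ] && printf '%s\n' "$log"
    done
    return 0
}

# Usage on a live system:
#   list_failed_builds /var/cache/akmods "$(uname -r)"
```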

As my analysis shows, I suspect it's failing because systemd ends up rebooting the computer mid-build, which marks it broken even though it might have succeeded.

As for detecting the state, we can probably interrogate systemd. Zbigniew might know more...

Comment 9 Stephen Gallagher 2017-11-29 13:27:06 UTC

Also, I'm not comfortable with just dropping the &, I don't think. We need to have a timeout, or else we might end up with a situation where some build is hanging (e.g. waiting for user input because a new option was missed in an answerfile or something like that) and thus preventing the update process from ever completing.
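
A minimal sketch of such a timeout, using coreutils `timeout`. This is an illustration of the idea, not the change that was shipped; the wrapper name `run_with_limit` and the limits shown are hypothetical.

```shell
#!/bin/sh
# Run a command with an upper time limit using coreutils timeout(1).
# run_with_limit is a hypothetical wrapper; the limits are illustrative.
run_with_limit() {
    limit=$1
    shift
    # Send SIGTERM after $limit seconds, then SIGKILL 5 seconds later
    # if the command is still running.
    timeout -k 5 "$limit" "$@"
    status=$?
    # timeout exits with 124 when the time limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Usage (hypothetical limit):
#   run_with_limit 600 akmods --force
```

With a wrapper like this, a hung build would delay the update by a bounded amount rather than block it indefinitely.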

If this happens in the offline-updates mode, I'm not sure the system will be recoverable without a rescue disk. (I don't know when the magic file that causes systemd to boot to that mode gets removed.)

Comment 10 nicolas.vieville 2017-11-29 14:31:31 UTC

Hello,

Sorry for this "maybe naive" suggestion, but wouldn't it be possible to try something like the dnf system-upgrade plugin: a one-shot service triggered by the existence of a file in the root tree? See:

https://github.com/rpm-software-management/dnf-plugin-system-upgrade/blob/master/dnf-system-upgrade.service

The file triggering the akmod build would be removed by the service itself (ExecStopPost directive of the service).
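
The flag-file idea can be modelled in plain shell as follows. This is a sketch of the mechanism only, not a systemd unit: the flag path and both function names are hypothetical, and the flag is removed after the run regardless of the outcome, mirroring what an ExecStopPost step would do.

```shell
#!/bin/sh
# Flag-file trigger, modelled in plain shell.
# FLAG is a hypothetical path; neither function is part of akmods.
FLAG="${FLAG:-/var/lib/akmods-rebuild-requested}"

# Called from the kernel posttrans step: only records that a build is wanted.
request_rebuild() {
    : > "$FLAG"
}

# Called later (for example at boot): runs the given command only if the
# flag exists, then removes the flag whether or not the command succeeded.
run_if_requested() {
    [ -e "$FLAG" ] || return 0
    "$@"
    status=$?
    rm -f "$FLAG"
    return "$status"
}

# Usage:
#   request_rebuild                 # during the update transaction
#   run_if_requested akmods --force # once the transaction has finished
```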

Cordially,


-- 
NVieville

Comment 11 Nicolas Chauvet (kwizart) 2017-12-14 21:17:35 UTC

@stephen
Is there any option for verbosity with pkcon?

It seems like I cannot use pkcon to trigger updates. There is a reboot into system-update.target, Plymouth shows updates in progress, but then it reboots immediately.
This is on f27 with dnf showing updates are available (including kernel updates).

Comment 12 Stephen Gallagher 2017-12-14 21:20:20 UTC

(In reply to Nicolas Chauvet (kwizart) from comment #11)
> @stephen
> Is there any option for verbosity with pkcon?
> 
> It seems like I cannot use pkcon to trigger updates. There is a reboot into
> system-update.target, Plymouth shows updates in progress, but then it
> reboots immediately.
> This is on f27 with dnf showing updates are available (including kernel
> updates).

While in the offline updates mode, pkcon logs everything to the journal. You can increase the verbosity by following the instructions at https://www.freedesktop.org/software/PackageKit/pk-bugs.html

Comment 13 Nicolas Chauvet (kwizart) 2017-12-14 21:45:09 UTC

There is progress from pk-offline-updates up to 6%.
Then there is an error from dnf, quoting:
---
dnf[1110] Use "dnf system-upgrade reboot" to start upgrade
systemd[1] Failed to start System Upgrade using DNF.
systemd[1]: dnf-system-upgrade.service: Unit entered failed state.
systemd[1]: dnf-system-upgrade.service: Failed with result 'exit-code'.
systemd[1]: Rebooting: service failed
---
I'm removing the python3-dnf-plugin-system-upgrade...

Comment 14 Nicolas Chauvet (kwizart) 2018-01-08 22:15:24 UTC

So here is my current status.
I don't see any way to hook into the offline update target.
Even trying to run akmods from a dedicated service, or using systemd-run instead of nohup, doesn't prevent the system from rebooting after the offline update has occurred.

What I need is to hook into the following process:
- User selects updates and reboots into offline updates mode (along with kernel/kernel-devel updates).
- systemd offline updates service succeeds (RPM transaction ends and starts akmods)
(- akmods builds and installs any appropriate kmod*)
- systemd offline updates target succeeds
- reboot with watchdog timeout

It seems that the watchdog timeout for reboot is triggered even if the akmods service is still working.


The current workaround is probably to detect when offline updates are in use and exit early (so that akmods can succeed on the next reboot).

Something like (in /etc/kernel/postinst.d/akmodsposttrans):
---
# Exit early during an offline update so that akmods can rebuild on the next boot.
if systemctl is-active --quiet system-update.target; then
    exit 0
fi
---

Comment 15 Fedora Update System 2018-01-26 18:15:34 UTC

akmods-0.5.6-12.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-97ee67cf09

Comment 16 Fedora Update System 2018-01-26 18:15:50 UTC

akmods-0.5.6-12.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2018-c5b163a57d

Comment 17 Fedora Update System 2018-01-28 22:34:32 UTC

akmods-0.5.6-12.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-c5b163a57d

Comment 18 Fedora Update System 2018-01-28 23:04:17 UTC

akmods-0.5.6-12.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-97ee67cf09

Comment 19 Fedora Update System 2018-02-06 10:49:15 UTC

akmods-0.5.6-12.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Comment 20 Fedora Update System 2018-02-06 15:29:59 UTC

akmods-0.5.6-12.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.