Bug 1518401
Summary: | Automatic regeneration of nVidia akmod fails on "offline" kernel update | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Stephen Gallagher <sgallagh> |
Component: | akmods | Assignee: | Nicolas Chauvet (kwizart) <kwizart> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | | |
Version: | 27 | CC: | fedora, gholms, hdegoede, hobbes1069, kwizart, leigh123linux, negativo17, nicolas.vieville, sergio, sgallagh, zbyszek |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | akmods-0.5.6-12.fc26 akmods-0.5.6-12.fc27 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2018-02-06 10:49:15 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Attachments: | | | |
Description
Stephen Gallagher
2017-11-28 20:26:33 UTC
Please check bug #1474969, and please attach /var/cache/akmods/nvidiasometing/5.1.30-1-for-4.13.12-300.fc27.x86_64.failed.log

Created attachment 1360126: akmod failure log

Attaching the log as requested.
In particular, the error "Session terminated, killing shell..." suggests to me that my hypothesis might be correct: systemd may be terminating system-update.target before the posttrans script has completed. I notice that the posttrans script actually executes with '&', causing it to go into the background and return success immediately. Perhaps the solution is to drop that and force the process to complete before returning control to RPM?

I'm requesting info from Thorsten Leemhuis, as he appears to have been the person who contributed this to the original rpmfusion version of the package, and I'm hoping he knows whether there was a reason to use job control here. My guess is that it was some sort of protection against the build hanging indefinitely and blocking the upgrade process. But if that's the case, maybe it would be best to solve this with a timeout of some kind.

The problem with not using & is that there would be a circular dependency: we would not be allowed to install another RPM produced by akmods if we don't leave the previous transaction.

There is probably another way to handle that with the kernel posttrans script, such as writing to a socket/pipe that would trigger akmods.service. Once a service is triggered, I expect that systemd would wait for it to end before any reboot.

Another means would be for akmods to register as part of system-update.target. But then the tasks might be different in this context than when akmods is started at boot.

There is another context where akmods might be ended prematurely: when end-users reboot right after an RPM transaction with a kernel (kernel-devel) update. Another way I've envisioned is to take an RPM lock early when running akmods.

(In reply to Nicolas Chauvet (kwizart) from comment #4)
> The problem with not using & is that there would be a circular dependency.
> We would not be allowed to install another rpm produced by akmods if we
> don't leave the previous transaction.
FWIW & IIRC: Back in the old days akmods circumvented this by actually building the module in the posttrans and letting the daemon install it immediately after the RPM transaction that brought the new kernel (and thus triggered building the new module/akmod package). That leaves only a very small time window where things can go wrong (not sure if that works for offline updates; those came way later). And yes, I think there even was a "timeout of some kind" to prevent the "build hanging indefinitely and blocking the upgrade process", as Stephen mentioned (but maybe that was still on the todo list back then; not sure).

Is there a way to detect that we're doing an offline update? Then we could decide to drop the "&" from the posttrans script only in the offline update case...

Also, aren't we running some akmods service on boot? Shouldn't this catch the problem and build the module at boot, since it was not built before?

Ah, I see now that part of the problem is the build being marked as failed. We should really not mark builds as failed when the build got interrupted / killed. Does anyone have suggestions for how to avoid marking builds as failed when they get killed, rather than when they exit due to compiler / linker errors?

@Hans, the service that runs at boot calls akmods without --force. It detects that the previous attempt to build the module failed and skips it, assuming it to be broken. As my analysis shows, I suspect it's failing because systemd ends up rebooting the computer mid-build, which marks it broken even though it might have succeeded. As for detecting the state, we can probably interrogate systemd. Zbigniew might know more...

Also, I'm not comfortable with just dropping the &. We need to have a timeout, or else we might end up with a situation where some build is hanging (e.g. waiting for user input because a new option was missed in an answerfile or something like that) and thus preventing the update process from ever completing. If this happens in the offline-updates mode, I'm not sure the system will be recoverable without a rescue disk. (I don't know when the magic file that causes systemd to boot to that mode gets removed.)

Hello,
Sorry for this "maybe naive" suggestion, but wouldn't it be possible to try something like the dnf system-upgrade plugin: a one-shot service based on the existence of a file in the root tree? See: https://github.com/rpm-software-management/dnf-plugin-system-upgrade/blob/master/dnf-system-upgrade.service
The file triggering the akmod build would be removed by the service itself (ExecStopPost directive of the service).
Cordially,
--
NVieville

@stephen
Is there any option for verbosity with pkcon? It seems like I cannot use pkcon to trigger updates. There is a reboot into system-update.target, plymouth displays updates in progress, but then it reboots immediately. This is on f27 with dnf showing updates are available (including kernel updates).

(In reply to Nicolas Chauvet (kwizart) from comment #11)
> @stephen
> Is there any option for verbosity with pkcon ?
>
> It seems like I cannot use pkcon to trigger updates. There is a reboot in
> system-update.target, plymouth display updates in progress, but then it
> reboots immediately.
> This is on f27 with dnf showing updates are available (including kernel
> updates).

While in the offline updates mode, pkcon logs everything to the journal. You can increase the verbosity by following the instructions at https://www.freedesktop.org/software/PackageKit/pk-bugs.html

There is progress from pk-offline-updates up to 6%. Then there is an error from dnf, quoting:
---
dnf[1110] Use "dnf system-upgrade reboot" to start upgrade
systemd[1] Failed to start System Upgrade using DNF.
systemd[1]: dnf-system-upgrade.service: Unit entered failed state.
systemd[1]: dnf-system-upgrade.service: Failed with result 'exit-code'.
systemd[1]: Rebooting: service failed
---
I'm removing the python3-dnf-plugin-system-upgrade...
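Two concerns raised earlier in this thread, bounding the build time and not recording an interrupted build as failed, can be illustrated with a small shell sketch. It assumes the `timeout` utility from GNU coreutils is available. The function names `classify_build_status` and `run_bounded_build` are hypothetical and are not part of akmods. The sketch relies on the shell convention that a command killed by signal N is reported with exit status 128+N, and on `timeout` returning 124 when the limit expires. It is only a heuristic: a program that deliberately exits with a status above 128 would also be classified as interrupted.

```shell
# Hypothetical helpers for illustration; not taken from the akmods source.

# Map an exit status to "ok", "interrupted", or "failed".
classify_build_status() {
    local status=$1
    if [ "$status" -eq 0 ]; then
        echo ok
    elif [ "$status" -eq 124 ] || [ "$status" -gt 128 ]; then
        # 124: timeout expired; >128: killed by a signal (128 + signal number)
        echo interrupted
    else
        # Ordinary non-zero exit, e.g. a compiler or linker error
        echo failed
    fi
}

# Run a command under a time limit and report how it ended.
# Usage: run_bounded_build SECONDS COMMAND [ARGS...]
run_bounded_build() {
    local limit=$1
    shift
    local status=0
    timeout "$limit" "$@" || status=$?
    classify_build_status "$status"
}
```

A caller could then write a failure marker only when the result is `failed`, and leave an `interrupted` build eligible for a retry at the next boot.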
So here is my current status. I don't see any way to hook into the offline update target. Even trying to run akmods from a dedicated service, or using systemd-run instead of nohup, doesn't prevent the system from rebooting after the offline update has occurred.

What I need is to hook into the following process:
- User selects updates and reboots into offline updates mode (along with a kernel/kernel-devel update).
- systemd offline updates service succeeds (RPM transaction ends and starts akmods)
- (akmods builds and installs any appropriate kmod*)
- systemd offline updates target succeeds
- reboot with watchdog timeout

It seems that the watchdog timeout for reboot is triggered even if the akmods service is working.

The current workaround is probably to detect when offline updates are in use and exit early (so that akmods can succeed on the next reboot). Something like (in /etc/kernel/postinst.d/akmodsposttrans):
---
systemctl is-active system-update.target &>/dev/null
RET=$?
[ "$RET" -eq 0 ] && exit 0
---

akmods-0.5.6-12.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-97ee67cf09

akmods-0.5.6-12.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2018-c5b163a57d

akmods-0.5.6-12.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-c5b163a57d

akmods-0.5.6-12.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-97ee67cf09

akmods-0.5.6-12.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

akmods-0.5.6-12.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.
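For readers who want to see the workaround from this thread as reusable shell, here is a minimal sketch. It wraps the same `systemctl is-active` check on `system-update.target` in a function and turns it into a decision a kernel post-install hook could act on. The names `in_offline_update` and `akmods_hook_action` are hypothetical and are not taken from the shipped akmods package; whether the released fix is structured this way is not stated in the bug.

```shell
# Hypothetical helpers for illustration; not taken from the akmods source.

# Succeed (return 0) when the offline-update target is currently active.
in_offline_update() {
    systemctl is-active --quiet system-update.target
}

# Print "defer" while an offline update is running, otherwise "build".
akmods_hook_action() {
    if in_offline_update; then
        echo defer
    else
        echo build
    fi
}
```

A hook script could then use `[ "$(akmods_hook_action)" = defer ] && exit 0` before starting a build, which matches the stated intent of exiting early so that akmods can succeed on the next reboot.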