Bug 1244394 - Upgrade stuck on script
Summary: Upgrade stuck on script
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: initial-setup
Version: 23
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Martin Kolman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker AcceptedFreezeException
Depends On:
Blocks: F23BetaFreezeException
TreeView+ depends on / blocked
 
Reported: 2015-07-18 12:17 UTC by Stanislav Graf
Modified: 2015-10-07 16:35 UTC (History)
14 users (show)

Fixed In Version: initial-setup-0.3.37-1.fc23
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-07 16:35:35 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1244558 0 unspecified CLOSED Switch initial-setup to use python3 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1251387 0 unspecified CLOSED DNF hangs when upgrading f22 to rawhide 2021-02-22 00:41:40 UTC

Internal Links: 1244558 1251387

Description Stanislav Graf 2015-07-18 12:17:21 UTC
Description of problem:
=======================
During my upgrade from latest F22 to F23, I hit an issue when upgrade to initial-setup-0.3.31-1.fc22.x86_64 got stuck (for hours) in cleanup phase. The process tree showed that dnf ran script and that ran systemctl. I had to kill that systemctl command to continue in my upgrade.


Version-Release number of selected component (if applicable):
=============================================================
F23, initial-setup-0.3.31-1.fc22.x86_64


How reproducible:
=================
I was upgrading only once.


Steps to Reproduce:
===================
1. From F22
2. dnf update --releasever=23

Actual results:
===============
Update of initial-setup is stucked in cleanup phase.

Expected results:
=================
Update continues.

Comment 1 Vratislav Podzimek 2015-07-20 09:44:56 UTC
This is how the relevant part of the initial-setup's .spec file looks like:

%post
if [ $1 -ne 2 -a ! -f /etc/sysconfig/initial-setup ]; then
  platform="$(arch)"
  if [ "$platform" = "s390" -o "$platform" = "s390x" ]; then
    echo "RUN_INITIAL_SETUP=YES" > /etc/sysconfig/initial-setup
  else
    %systemd_post initial-setup-text.service
  fi
fi

%preun
%systemd_preun initial-setup-text.service

%postun
%systemd_postun_with_restart initial-setup-text.service

%post gui
%systemd_post initial-setup-graphical.service

%preun gui
%systemd_preun initial-setup-graphical.service

%postun gui
%systemd_postun_with_restart initial-setup-graphical.service

I don't see anything that could got stuck other than the %systemd_* macros so I'm reassigning this to systemd.

Comment 2 Zbigniew Jędrzejewski-Szmek 2015-07-20 12:54:27 UTC
Those macros call systemctl to communicate with PID 1. If the upgrade was stuck in one of those macros, then mostly likely the real problem was that PID 1 was stuck itself or had a segmentation fault or similar.

To verify that systemctl is at fault, you could look (if the issue happens again) at pstree output for the dnf process.

If systemd had crashed (or similar), some kind of information should be visible in journalctl. It *may* have happened during the upgrade or earlier, hard to say. Is there anything interesting in the logs?

Comment 3 Stanislav Graf 2015-07-22 21:12:59 UTC
In /var/log/dnf.rpm.log:
~~~
Jul 17 23:37:14 INFO Cleanup: initial-setup-0.3.31-1.fc22.x86_64
Jul 17 23:51:07 INFO --- logging initialized ---
Jul 18 00:51:08 INFO --- logging initialized ---
Jul 18 01:51:22 INFO --- logging initialized ---
Jul 18 02:51:23 INFO --- logging initialized ---
Jul 18 08:28:17 INFO Cleanup: ibus-typing-booster-1.2.11-1.fc22.noarch
~~~

At the same time in journal log:
~~~
Jul 17 23:37:13 localhost.localdomain systemd[1]: Stopped TCG Core Services Daemon.
Jul 17 23:37:13 localhost.localdomain systemd[1]: Reloading.
Jul 17 23:37:13 localhost.localdomain systemd[1]: Configuration file /usr/lib/systemd/system/auditd.service is marked world-inaccessible. This has no effect as configuration data is accessible
 via APIs without restrictions. Proceeding anyway.
Jul 17 23:37:13 localhost.localdomain systemd[1]: [/usr/lib/systemd/system/initial-setup-text.service:21] Support for option SysVStartPriority= has been removed and it is ignored
Jul 17 23:37:13 localhost.localdomain systemd[1]: Reloading.
Jul 17 23:37:13 localhost.localdomain systemd[1]: Configuration file /usr/lib/systemd/system/auditd.service is marked world-inaccessible. This has no effect as configuration data is accessible
 via APIs without restrictions. Proceeding anyway.
Jul 17 23:37:13 localhost.localdomain systemd[1]: [/usr/lib/systemd/system/initial-setup-text.service:21] Support for option SysVStartPriority= has been removed and it is ignored
Jul 17 23:37:14 localhost.localdomain systemd[1]: Reloading.
Jul 17 23:37:14 localhost.localdomain systemd[1]: Configuration file /usr/lib/systemd/system/auditd.service is marked world-inaccessible. This has no effect as configuration data is accessible
 via APIs without restrictions. Proceeding anyway.
Jul 17 23:37:14 localhost.localdomain systemd[1]: [/usr/lib/systemd/system/initial-setup-text.service:21] Support for option SysVStartPriority= has been removed and it is ignored
Jul 17 23:37:14 localhost.localdomain audit[1]: <audit-1130> pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=initial-setup-text comm="systemd" exe="/usr/l
ib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 17 23:37:14 localhost.localdomain audit[1]: <audit-1131> pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=initial-setup-text comm="systemd" exe="/usr/l
ib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 17 23:37:14 localhost.localdomain systemd[1]: Stopping Initial Setup configuration program (text mode)...
Jul 17 23:37:14 localhost.localdomain systemd[1]: Starting Initial Setup configuration program (text mode)...
Jul 17 23:51:06 localhost.localdomain systemd[1]: Starting dnf makecache...
~~~

And during every boot in journal log:
~~~
Jul 18 08:48:37 localhost.localdomain root[25868]: Running initial_setup
Jul 18 08:48:37 localhost.localdomain python[25869]: detected unhandled Python exception in '/usr/lib/python2.7/site-packages/initial_setup/__main__.py'
~~~

Comment 4 Stanislav Graf 2015-08-02 09:24:13 UTC
Happened again with 'initial-setup' upgrade today.

$ pstree --arguments --long  --show-pids --ns-changes --uid-changes -Z 28007
dnf,28007,`unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023' -OO /bin/dnf update
  └─sh,28150,`unconfined_u:unconfined_r:rpm_script_t:s0-s0:c0.c1023' /var/tmp/rpm-tmp.2aTsdl 1
      └─systemctl,28173,`unconfined_u:unconfined_r:rpm_script_t:s0-s0:c0.c1023' try-restart initial-setup-text.service

$ rpm -q initial-setup
initial-setup-0.3.33-2.fc23.x86_64
initial-setup-0.3.35-1.fc23.x86_64

And killing the systemctl process solved the issue.

$ sudo kill 28173

$ rpm -q initial-setup
initial-setup-0.3.35-1.fc23.x86_64

$ ps aux | grep system
root         1  0.0  0.1 194276  9160 ?        Ss   Aug01   0:14 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
root       628  0.0  0.5  93380 45844 ?        Ss   Aug01   0:03 /usr/lib/systemd/systemd-journald
root       669  0.0  0.0  47492  7688 ?        Ss   Aug01   0:00 /usr/lib/systemd/systemd-udevd
dbus       852  0.0  0.0  66852  6476 ?        Ssl  Aug01   0:27 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root       868  0.0  0.0  24480  3156 ?        Ss   Aug01   0:00 /usr/lib/systemd/systemd-logind
gdm       1529  0.0  0.0  45132  4912 ?        Ss   Aug01   0:00 /usr/lib/systemd/systemd --user
root     27856  0.0  0.0  43012  4904 ?        Ss   11:08   0:00 /usr/lib/systemd/systemd --user
sgraf    28324  0.0  0.0  45260  5152 ?        Ss   11:11   0:00 /usr/lib/systemd/systemd --user
sgraf    30196  0.0  0.0 117060  2380 pts/0    S+   11:20   0:00 grep --color=auto system

In the mean time I can for example start/stop/status tuned service (using systemctl) without any issues.

Comment 5 Stanislav Graf 2015-08-02 09:32:21 UTC
(In reply to Stanislav Graf from comment #3)
[...]
> And during every boot in journal log:
> ~~~
> Jul 18 08:48:37 localhost.localdomain root[25868]: Running initial_setup
> Jul 18 08:48:37 localhost.localdomain python[25869]: detected unhandled
> Python exception in
> '/usr/lib/python2.7/site-packages/initial_setup/__main__.py'
> ~~~

This error during every boot got fixed with latest update. Possibly due to Bug 1244558. Maybe this BZ would be fixed also now.

Comment 6 Stanislav Graf 2015-09-08 18:15:01 UTC

*** This bug has been marked as a duplicate of bug 1244558 ***

Comment 7 Pete Walter 2015-09-13 18:40:57 UTC
Reopening as this sounds like a blocker to me.

Proposing as Fedora 23 Beta Blocker because it breaks the following criteria for KDE spin:

"For each one of the release-blocking package sets, it must be possible to successfully complete an upgrade from a fully updated installation of the previous stable Fedora release with that package set installed."

With this bug it is impossible to complete an upgrade.

Comment 8 Adam Williamson 2015-09-13 21:11:29 UTC
Well, this is a complete mess: it was closed as a dupe of a bug that was closed and then re-opened without a comment, and now this one has been re-opened.

I have no idea what any of this is meant to mean.

Comment 9 Dan Mossor [danofsatx] 2015-09-14 01:18:15 UTC
I ran into this today on a simple dnf update on a Fedora 23 Cinnamon. 

Sep 13 17:43:19 INFO Cleanup: initial-setup-gui-0.3.35-1.fc23.x86_64
Sep 13 17:43:19 INFO Cleanup: initial-setup-gui-0.3.35-1.fc23.x86_64
Sep 13 17:43:20 INFO Cleanup: anaconda-gui-23.19.2-1.fc23.x86_64
Sep 13 17:43:20 INFO Cleanup: anaconda-gui-23.19.2-1.fc23.x86_64
Sep 13 17:43:21 INFO Cleanup: initial-setup-0.3.35-1.fc23.x86_64
Sep 13 17:43:21 INFO Cleanup: initial-setup-0.3.35-1.fc23.x86_64
Sep 13 18:03:40 ERROR Non-fatal POSTUN scriptlet failure in rpm package initial-setup
Sep 13 18:03:40 ERROR Non-fatal POSTUN scriptlet failure in rpm package initial-setup



However, on a nearly identical system I updated less than 15 minutes after that there were no issues.

On the system that hung, no further progress was made through the cleanup phase, leaving (on this system) 177 packages in limbo. The only thing the next run of dnf did was uninstall the extra kernels that were still there, as they weren't removed yet. Attempting to resolve any leftover issues to this point has been unseccessful - dnf history redo is hitting too many errors to weed through, and I know of no other method to accomplish this.

Comment 10 Vratislav Podzimek 2015-09-14 06:30:34 UTC
Looks to me like the problem is that the initial-setup-text service is running (started) in the background. systemd restarts it when updating the package and gets stuck because the initial-setup-text prompts user for input somewhere and never terminates. However, the initial-setup-text service should never be running when the system is booted to the state where user logs in and does system update. Martin, could you please have a look at this?

Comment 11 Adam Williamson 2015-09-14 16:23:27 UTC
Can I suggest that the obvious fix here is simply this?

%postun
%systemd_postun initial-setup-text.service

%postun gui
%systemd_postun initial-setup-graphical.service

since, as noted, i-s should never be running as a daemon on an installed system, I can't see any situation in which using systemd_postun_with_restart would ever do anything *good*. No?

Comment 12 Adam Williamson 2015-09-14 16:48:14 UTC
Discussed at 2015-09-14 blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2015-09-14/f23-blocker-review.2015-09-14-16.00.log.txt .

We agreed that this is clearly a bad bug, but determining its blocker/FE status is a bit complex.

1) do we actually need to change the frozen package set to fix it?
2) when does the bug actually happen?

For 2), we're fairly sure the bug happens when initial-setup is running and you update it, but we're not sure under what circumstances initial-setup is running. It's never *supposed* to be running while the system is being used 'normally' (i.e. installing updates or upgrading).

For 1), any fix that involves changing %postun - e.g. the one I suggested in #c11 - benefits from happening in the frozen Beta package set, as %postun is taken from the 'before' version, not the 'after' version. However, it's conceivable that we can fix this elsewhere, e.g. by stopping any running initial-setup in the %pre of the 'after' package.

Taking all that together, we decided to grant this issue a freeze exception for now, and leave blocker status open depending on the answer to 2). We will decide whether to actually pull a fix through the freeze on the basis of exactly how this winds up being fixed (whether we change %postun and, if so, how vital it is that that change be included in Beta).

Comment 13 Adam Williamson 2015-09-17 07:04:12 UTC
vpodzime: mkolman: could we please have your thoughts on the issues raised in #c12 and the best way to fix this? thanks.

Comment 14 Martin Kolman 2015-09-17 13:09:05 UTC
(In reply to awilliam from comment #11)
> Can I suggest that the obvious fix here is simply this?
> 
> %postun
> %systemd_postun initial-setup-text.service
> 
> %postun gui
> %systemd_postun initial-setup-graphical.service
> 
> since, as noted, i-s should never be running as a daemon on an installed
> system, I can't see any situation in which using systemd_postun_with_restart
> would ever do anything *good*. No?

Yeah, that might work - trying it out now. :)

Comment 15 Martin Kolman 2015-09-17 14:31:13 UTC
Seems to work! Patch posted for review. :)

Comment 16 Adam Williamson 2015-09-17 15:04:26 UTC
mkolman: there's an issue with fixing it in %postun, noted in #c12 again: if we do (only) that, it should be fixed in the frozen Beta package set, because %postun is taken from the *old* version when updating, not the *new* version. So if we ship Beta with the 'broken' %postun, even if we ship an update with the 'fixed' %postun, you may hit the issue the first time you update.

This is why someone suggested instead/also having something in %pre, because %pre comes from the new package. The suggestion was to stop any running initial-setup service/process in %pre.

WDYT about that?

Comment 17 Adam Williamson 2015-09-17 16:51:53 UTC
Discussed at 2015-09-17 Fedora 23 Beta Go/No-Go meeting, acting as a blocker review meeting: https://meetbot-raw.fedoraproject.org/teams/f23_beta_go_no-go_meeting/f23_beta_go_no-go_meeting.2015-09-17-16.00.log.txt . Rejected as a blocker.

The main 'supported' scenario in which this would be encountered is a regular update to an installed F23 system, if somehow i-s is running (the scenarios where i-s is running have not yet been figured out; that's the unknown part of this bug). It follows logically that we don't need to change the frozen Beta tree or media to avoid that happening: we only need to ensure the first post-Beta update to initial-setup avoids the bug.

Therefore this bug is not a release blocker, but we are going to state that we require the next Fedora 23 initial-setup update to somehow avoid this issue happening - i.e. we want it to have a %pre fix/workaround or something else that means this bug will not be encountered *even when upgrading from an initial-setup package with the 'bad' %postun, with initial-setup running*.

The currently-pending i-s update will be un-submitted to ensure it doesn't go stable without a fix for this.

Comment 18 Adam Williamson 2015-09-17 16:52:53 UTC
We should also document this on the 'upgrading with yum' page for people doing (technically unsupported) live upgrades.

Comment 19 Martin Kolman 2015-09-18 13:14:41 UTC
(In reply to awilliam from comment #16)
> mkolman: there's an issue with fixing it in %postun, noted in #c12 again: if
> we do (only) that, it should be fixed in the frozen Beta package set,
> because %postun is taken from the *old* version when updating, not the *new*
> version. So if we ship Beta with the 'broken' %postun, even if we ship an
> update with the 'fixed' %postun, you may hit the issue the first time you
> update.
> 
> This is why someone suggested instead/also having something in %pre, because
> %pre comes from the new package. The suggestion was to stop any running
> initial-setup service/process in %pre.
> 
> WDYT about that?
Yeah, %pre should work according to the wiki[0].

BTW, I think I might have a preliminary theory why some people have IS running without knowing about it:
* unless a user specification has been found in kickstart IS will run on system startup if it is installed and its unit/-s enabled
- it does not actually check if there are user accounts on the system, it only checks the kickstart
* the Initial Setup services (initial-setup-graphical and initial-setup-text) have been until recently in the list of services to automatically enable once installed
- so it was possible to get IS pulled in & automatically activated after the installation, for example by installing a package group containing it
- this has been fixed as part of bug 1215224 about two months ago
* what about the users not noticing Initial Setup has been running on their machine ?
- the Initial Setup TUI starts on a console and tries to block all other consoles from initializing to prevent user meddling from interfering with initial system re-/configuration
- unfortunately this is not very robust and basically requires you to write all console names in the unit file,
  if you miss one the console won't be blocked and users can log-in on it and cause the mischief we are currently seeing
- there is a RFE[1] on systemd to provide this functionality (provide a target IS can use that blocks all consoles) but it does not seem to be getting much traction so far

[0] https://fedoraproject.org/wiki/Packaging:ScriptletSnippets#Scriptlet_Ordering
[1] http://lists.freedesktop.org/archives/systemd-devel/2015-July/033754.html

Comment 20 Adam Williamson 2015-09-28 18:08:29 UTC
mkolman: thanks for the explanation. So just to recap, sending out an update with a fix/workaround in %pre would satisfy the blocker process requirements relating to this bug.

Thanks for the i-s theory, that sounds plausible for sure. I suspect the most likely contingent of affected users would be those who installed with Workstation or without a desktop, then added a desktop which uses i-s after installation.

Comment 21 Martin Kolman 2015-09-29 14:35:06 UTC
Updated patch with the %pre based old-runnning-IS stopping & disabling code has been posted for review.

Comment 22 Fedora Update System 2015-09-30 13:41:43 UTC
initial-setup-0.3.37-1.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2015-f23124ced6

Comment 23 Fedora Update System 2015-10-01 23:53:37 UTC
initial-setup-0.3.37-1.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update initial-setup'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-f23124ced6

Comment 24 Fedora Update System 2015-10-02 01:49:28 UTC
initial-setup-0.3.37-1.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update initial-setup'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-f23124ced6

Comment 25 Fedora Update System 2015-10-07 16:35:31 UTC
initial-setup-0.3.37-1.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.