Bug 827699

Summary:	Hung script can hang entire installation
Product:	[Fedora] Fedora
Component:	anaconda
Version:	17
Status:	CLOSED CANTFIX
Severity:	high
Priority:	unspecified
Hardware:	Unspecified
OS:	Unspecified
Reporter:	Nadav Har'El <nyh>
Assignee:	Anaconda Maintenance Team <anaconda-maint-list>
QA Contact:	Fedora Extras Quality Assurance <extras-qa>
CC:	anaconda-maint-list, g.kaviyarasu, jonathan, nsoranzo, paolini, vanmeeuwen+fedora
Keywords:	Reopened
Doc Type:	Bug Fix
Type:	Bug
Last Closed:	2012-06-07 14:12:44 UTC

Description Nadav Har'El 2012-06-02 13:00:43 UTC

A typical upgrade (or install) involves installing literally thousands of packages, one after another. If one of these installations hangs - e.g., because of a bug in a post-install script - the *entire* installation hangs. An unusually experienced and resourceful user can switch to another terminal, kill the hung process, and cause the installation to resume, but a more realistic scenario is that the user gives up, reboots, and either tries to use the half-upgraded or half-installed system (a sure recipe for disaster) or retries the installation (which might not work again - and if it does, the user will have a big mess with duplicate packages). Most likely, a large percentage of these users will end up with an unusable system and leave Fedora for good.

Unfortunately, this is NOT a theoretical problem - it just happened in Fedora 17, where an upgrade hangs if you have the "sbcl" package installed (see bug 822008) - and it also happened to me in the past (I don't remember the exact details, but I believe it involved the "nethack" package in Fedora 14).

One logical solution is to time-limit scripts - if a single script hasn't finished after, say, 5 minutes, kill it. Of course, a reasonable limit needs to be decided on.

Implementing such a time limit is somewhat related to bug 826904: currently it appears (?) that Anaconda just runs the script and waits. Instead, Anaconda needs to continue running, and notice when the timeout period has passed.
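
For illustration only, here is a minimal Python sketch of the time-limit idea. It is not Anaconda code: the function name run_with_timeout and the SLEEPER stand-in for a hung script are hypothetical, and it assumes the script runs as a separate child process, which may not match how scriptlets are actually executed.

```python
import subprocess
import sys

# Hypothetical stand-in for a hung script: sleeps far longer than any limit used here.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or None if it exceeded the time limit."""
    try:
        # When the timeout expires, subprocess.run kills the child,
        # waits for it, and raises TimeoutExpired.
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

Calling run_with_timeout(SLEEPER, 300) would give up after 5 minutes. Note that this only kills the direct child, not anything the script itself may have spawned.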

Comment 1 Chris Lumens 2012-06-03 22:10:29 UTC


*** This bug has been marked as a duplicate of bug 826683 ***

Comment 2 Nadav Har'El 2012-06-04 03:41:11 UTC

Please don't close this bug as a duplicate of 826683...
By the way, 826683 itself is a duplicate of bug 822008.

This bug is a *generalization* of 822008 (and 826683) - it says that regardless of the specific problems in the "sbcl" package, which these bugs reported, Anaconda shouldn't have hung completely.

Comment 3 Nicola Soranzo 2012-06-07 12:57:06 UTC

(In reply to comment #2)
> Please don't close this bug as a duplicate of 826683...
> By the way, 826683 itself is a duplicate of bug 822008.
> 
> This bug is a *generalization* of 822008 (and 826683) - it says that
> regardless of the specific problems in the "sbcl" package, which these bugs
> reported, Anaconda shouldn't have hung completely.

I agree, reopening.
If the anaconda maintainers think that this is not solvable, don't close it as WONTFIX.

Comment 4 Chris Lumens 2012-06-07 14:12:44 UTC

There is no possible way we can fix this.  The package scriptlets themselves need to be tested and fixed.  We cannot simply wait for a given amount of time and kill the process if it's not working.  First, what process would we kill?  It's all running within anaconda (except for who knows what that the scriptlet spawns).  Second, a package scriptlet can do anything it wants.  It can hit the network.  It can convert databases to different formats.  It can take as long as it wants to.  We are not going to play the game of guessing how long is long enough, as we will never get it right.  File a bug against the package.

Comment 5 Nadav Har'El 2012-06-07 18:12:25 UTC

Chris, I strongly disagree with your decision and the logic leading to it.

The issue is that Fedora installation (or upgrade) is composed of several thousand (!) separate package installations, and it is inexcusable that a bug in one of them should be able to break the entire installation - like happened in this case for the second time in recent years.

Moreover, though theoretically a package could spend an hour converting a database or something - in practice, Fedora should NOT ALLOW packages to be such inconsiderate hogs during an upgrade - remember that install (and upgrade) length is an important part of the Fedora user experience, and Fedora should encourage install scripts to be as quick as possible - and definitely some sort of upper limit could be imposed. Remember that even if the upper limit was a ridiculously high 1 hour (!), it would have solved the bug in this case - as people who waited a whole hour would have their installation continue. If you think 1 hour is not enough, how about 6 hours? This is the amount of time that I waited hoping that the issue would clear up by itself...

I also don't understand your example that a scriptlet can "hit the network", implying that it can hang if the network is unavailable or slow. How is this excusable? Imagine that each of the 3,000 packages in my Fedora installation required a different Website to be reachable during upgrade, and if not it would just hang. How could I ever upgrade my machine if one of these 3,000 sites is bound to be unavailable at any time? 

It's not terribly important to find what to kill (although, if you really want, this can be done pretty well with process groups - and in a watertight manner by using cgroups and similar recent kernel mechanisms). All that matters is to continue running the *next* package. If the previous package also continues to run in the background, so be it (it's not perfect, but not as bad as the current situation). In bug 826904 I analyzed another Anaconda bug - its failing to refresh the display while installing packages - and guessed (without looking at the code, I admit) that when Anaconda starts a package installation, it just wait()s for the child process to finish. This is a broken behavior: Anaconda should continue running alongside the child process, so that it can continue to refresh its screen (this is bug 826904) and, eventually (e.g., after a long timeout), proceed to running the next child process without waiting for the previous one.
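
As a rough sketch of the two ideas in the paragraph above (keep running alongside the child, and use a process group to decide what to kill), something like the following Python could work on a POSIX system. Everything here is an assumption for illustration: run_in_group, on_tick and SLEEPER are hypothetical names, and this is not how Anaconda is actually structured.

```python
import os
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a hung scriptlet.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]


def run_in_group(cmd, timeout_seconds, on_tick=lambda: None, poll_interval=0.1):
    """Run cmd in its own session, calling on_tick() while it is still running.

    Returns the exit code, or None if the whole process group was killed
    after timeout_seconds.
    """
    # start_new_session=True makes the child call setsid(), so it becomes the
    # leader of a new process group whose id equals the child's pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    deadline = time.monotonic() + timeout_seconds
    while proc.poll() is None:
        if time.monotonic() >= deadline:
            try:
                # Signal every process still in the child's group.
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the group vanished between poll() and killpg()
            proc.wait()
            return None
        on_tick()  # e.g. refresh the screen instead of blocking in wait()
        time.sleep(poll_interval)
    return proc.returncode
```

A descendant that starts its own session escapes the group, which is why cgroups would be needed for a watertight version.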

Make no mistake - I don't think that what I'm describing is ideal or even fully "correct". Failing to properly install one package may have unknown effects on other packages, and I can probably list other problems with my suggested solution. But the current situation - that the upgrade will *NEVER* finish for a large percentage of the Fedora users, is much worse. Almost any bug (except disk corruption ;-)) is better than that. Anybody less invested in Fedora than myself (I've been using Red Hat Linux for 13 years) who had this problem and was forced to re-install his machine from scratch because of a botched upgrade, is almost sure to try Ubuntu next time....

Comment 6 Chris Lumens 2012-06-07 18:21:14 UTC

> The issue is that Fedora installation (or upgrade) is composed of several
> thousand (!) separate package installations, and it is inexcusable that a bug
> in one of them should be able to break the entire installation - like
> happened in this case for the second time in recent years.

The package needs to be fixed.  The failure to install a package can break the entire installation, too, and in that case it would need to be fixed as well.

> Moreover, though theoretically a package could spend an hour converting a
> database or something - in practice, Fedora should NOT ALLOW packages to be
> such inconsiderate hogs during an upgrade - remember that install (and
> upgrade) length is an important part of the Fedora user experience, and
> Fedora should encourage install scripts to be as quick as possible - and
> definitely some sort of upper limit could be imposed.

This would require a change to packaging policy, which is not in anaconda's domain.

> Remember that even if
> the upper limit was a ridiculously high 1 hour (!), it would have solved the
> bug in this case - as people who waited a whole hour would have their
> installation continue. If you think 1 hour is not enough, how about 6 hours?
> This is the amount of time that I waited hoping that the issue would clear
> up by itself...

Nothing we can choose will ever be correct, and will result in future bugs.

> I also don't understand your example that a scriptlet can "hit the network",
> implying that it can hang if the network is unavailable or slow. How is this
> excusable? Imagine that each of the 3,000 packages in my Fedora installation
> required a different Website to be reachable during upgrade, and if not it
> would just hang. How could I ever upgrade my machine if one of these 3,000
> sites is bound to be unavailable at any time?

The point is that package scriptlets have free rein.  They can do literally anything.  This could include doing something involving the network.  If you think this is wrong, work on getting packaging policy changed.

> In bug 826904 I analyzed another Anaconda
> bug - its failing to refresh the display while installing packages - and
> guessed (without looking at the code, I admit) that when Anaconda starts a
> package installation, it just wait()s for the child process to finish. This
> is a broken behavior: Anaconda should continue running alongside the child
> process, so that it can continue to refresh its screen (this is bug 826904)
> and, eventually (e.g., after a long timeout), proceed to running the next
> child process without waiting for the previous one.

The refresh behavior is being fixed in the newui branch.

We can't simply start installing a package while a previous one is still working.  That's not how package dependencies work.  One package may require that a previous package's scriptlet do something before it can be properly installed or run its own scriptlet.
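
To make the ordering constraint concrete, here is a small Python sketch using the standard graphlib module (Python 3.9+). It is only an illustration of dependency ordering in general, not of how rpm or anaconda compute it; install_order and the "glibc" entry are hypothetical, while the maxima/sbcl relationship comes from this report.

```python
from graphlib import TopologicalSorter


def install_order(requires):
    """Return package names ordered so each one comes after everything it requires.

    requires maps a package name to the set of packages it depends on.
    """
    return list(TopologicalSorter(requires).static_order())


# maxima requires sbcl; the glibc entry is a made-up example dependency.
EXAMPLE = {"maxima": {"sbcl"}, "sbcl": {"glibc"}, "glibc": set()}
```

With EXAMPLE, sbcl is ordered before maxima, so moving on to maxima while sbcl's scriptlet is still running would break the assumption that sbcl is fully set up by then.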

> But the current situation - that the upgrade will
> *NEVER* finish for a large percentage of the Fedora users, is much worse.

You are assuming that the number of people who have the broken package installed is large.

> Almost any bug (except disk corruption ;-)) is better than that. Anybody
> less invested in Fedora than myself (I've been using Red Hat Linux for 13
> years) who had this problem and was forced to re-install his machine from
> scratch because of a botched upgrade, is almost sure to try Ubuntu next
> time....

And yet, data corruption is possible here.  Package scriptlets could process user preferences (touching data in a user's home directory), convert a database from one format to another, or who knows what else.  These tasks could involve modifying files and if we simply killed the scriptlets part way through, we're talking data corruption.

If you don't like this, get involved in Fedora QA and make sure packages don't behave badly.  anaconda is not going to start second guessing packages, because that's a game we won't ever win.

Comment 7 Maurizio Paolini 2012-06-08 08:59:02 UTC

(In reply to comment #6)
> The package needs to be fixed.  The failure to install a package can break
> the entire installation, too, and in that case it would need to be fixed as
> well.

Of course the package needs to be fixed, that's clear!  But this is not the
point here.

Anyway, we cannot assume all packages have scriptlets that do not hang.

At least anaconda should provide some information to the user in case there
is something that *seems* bad (like a scriptlet that takes too long to
complete) and perhaps suggest some course of action, or ask the user
whether they want the scriptlet to be killed, leaving the responsibility to the
user.
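
A minimal Python sketch of this "warn and ask" behaviour might look like the following. It assumes the scriptlet runs as a separate child process, and the names run_with_prompt, ask_user and SLEEPER are hypothetical; it is an illustration, not anaconda code.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a scriptlet that never finishes on its own.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]


def run_with_prompt(cmd, warn_after_seconds, ask_user, poll_interval=0.1):
    """Run cmd; whenever warn_after_seconds pass without completion, ask the user.

    ask_user(elapsed_seconds) returns True to kill the process or False to keep
    waiting. Returns the exit code, or None if the user chose to kill it.
    """
    proc = subprocess.Popen(cmd)
    start = time.monotonic()
    next_warning = start + warn_after_seconds
    while proc.poll() is None:
        now = time.monotonic()
        if now >= next_warning:
            if ask_user(now - start):
                proc.kill()  # the decision, and the responsibility, is the user's
                proc.wait()
                return None
            next_warning = now + warn_after_seconds  # ask again later
        time.sleep(poll_interval)
    return proc.returncode
```

No fixed limit has to be guessed here: the installer only reports how long the scriptlet has been running and lets the user decide.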

I was also hit by this hangup problem, and ended up with an unusable system:
package sbcl is required by maxima, which, I think, is quite standard in
scientific environments. So the scenario described by Nadav Har'El
is not theoretical!

Some bounds on scriptlets could also be imposed, and in the first place
they should be used as little as possible... big tasks (like database
conversions) should *not* be done in scriptlets, simply because they are
run in very different situations (rpm -U, preupgrade, anaconda...) with
different user interaction scenarios.
It could be wise, for example, to decide that scriptlets *cannot* undertake
tasks that could lead to loss of data.

> This would require a change to packaging policy, which is not in anaconda's
> domain.

Perhaps some revision of packaging policy should be done, then.

> The point is that package scriptlets have free rein.  They can do literally
> anything.

This is not good, in my opinion.

> We can't simply start installing a package while a previous one is still
> working.  That's not how package dependencies work.  One package may require
> that a previous package's scriptlet do something before it can be properly
> installed or run its own scriptlet.
> 
> > But the current situation - that the upgrade will
> > *NEVER* finish for a large percentage of the Fedora users, is much worse.
> 
> You are assuming that the number of people who have the broken package
> installed is large.

Add me to the count, and all users who have maxima installed on their
Fedora 16.