Bug 1255858 (gui_pop)

Summary: kickstart guided install reboots at|before preparing for install
Product: Red Hat Enterprise Linux 7 Reporter: David Bulkow <david.bulkow>
Component: anacondaAssignee: Brian Lane <bcl>
Status: CLOSED ERRATA QA Contact: Release Test Team <release-test-team-automation>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 7.2CC: bcl, david.bulkow, dshea, jvavra, mbanas, tgummels
Target Milestone: rcKeywords: OtherQA
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: anaconda-21.48.22.57-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-03 23:09:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1238410, 1364088    
Attachments:
Description Flags
results of inst.nokill install
none
anaconda trace
none
trace 2 - X.log
none
trace 2 - anaconda.log
none
trace 2 - anaconda-tb-HxM7mp
none
trace 2 - anaconda-tb-all.log
none
trace 2 - anaconda-yum.conf
none
trace 2 - ifcfg.log
none
trace 2 - ks-script-PRFjKQ
none
trace 2 - ks-script-n_VNqK
none
trace 2 - libuser.9wfl83
none
trace 2 - packages
none
trace 2 - packages_required
none
trace 2 - packaging.log
none
trace 2 - partition-map
none
trace 2 - program.log
none
trace 2 - rpm-script.log
none
trace 2 - storage.log
none
trace 2 - storage.state
none
trace 2 - syslog
none
trace 3 - anaconda trace file none

Description David Bulkow 2015-08-21 17:32:46 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 David Shea 2015-08-21 17:34:06 UTC
Can you try running the install with inst.nokill on the boot command line, and attach the logs from /tmp after the install fails?

Comment 3 David Bulkow 2015-08-21 17:42:16 UTC
Welp, so many buttons, so little time...

That first comment was supposed to read as follows:

A GUI based kickstart driven install fails by orderly shutdown just around the time the display switches from the options (filled in by kickstart) and the "preparing for install" progress bar.

I have tried inst.text, serial redirect, inst.vnc and all have been successful - even with the kickstart.

A manual install, no kickstart, also succeeds.

Finally, I tried 'console=ttyS1 console=tty0 inst.graphical', which fails.

NOTE: the failure is not 100%.  I was able to slip past the failure mode 2 out of 35 tries.

There is no dialog box or indication of failure other than a blank screen.  When I manage to get to tty1 and poke around, I am able to see an orderly shutdown/reboot.  When running with console redirected to both serial and tty0 then I also see just an orderly shutdown.  No kernel messages, no error messages.

I have tried shunting our kickstart by removing the storage specification.  The machine sits happily in this state with both GUI (tty6) and tmux (tty1) available.  I'm digging through the environment now for signs of abnormality.

Comment 4 David Bulkow 2015-08-21 17:42:38 UTC
inst.nokill - coming right up.

Comment 5 David Bulkow 2015-08-21 18:43:53 UTC
Created attachment 1065685 [details]
results of inst.nokill install

Using inst.nokill install failed the second try.

I ran what we call a "customer" install for which we provide a kickstart file on separate media - inst.ks=cdrom:ks.cfg - and do the disk swap to get started.  In this instance the GUI popped before getting to the progress bar screen.

Comment 6 David Bulkow 2015-08-29 13:20:14 UTC
I have compared 7.1-GA install logs with 7.2 successful and unsuccessful.  Nothing stands out as obvious.

Things I've tried:
- inst.text is always successful
- inst.nokill makes a graphical install slightly more successful
- inst.debug=1 makes no difference in output, that I could tell

I think my next step is to dig into anaconda sources and try to enable more debugging.  I haven't seen a way to increase the debug level from the kernel command line.



We have also noticed that specifying an install group '@Stratus ftAUL --optional' when using a graphical install fails to install any of the group.  A text install happily installs the group.  Probably another bug...  What d'ya think?  I'll be happy to add another issue.

Comment 7 David Bulkow 2015-09-02 15:55:13 UTC
FYI, I've now seen the graphics install anaconda crash with 7.2-Beta-1.  Still no evidence in the log files that indicates what blew up.

Comment 8 Brian Lane 2015-09-02 22:58:18 UTC
I don't see anything in the logs that indicate any kind of problem. It was in the middle of installing packages it looks like, the last thing in rpm-script.log is: ghostscript-9.07-18.el7.x86_64 (840/1330) and syslog doesn't have any hardware problems logged.

When this fails with inst.nokill are the anaconda or anaconda-yum processes still running? If anaconda is, could you do this:

kill -USR2 `cat /var/run/anaconda.pid`

And attach the /tmp/anaconda-tb-* file as a text/plain attachment.

Comment 9 David Bulkow 2015-09-03 22:57:51 UTC
Created attachment 1070153 [details]
anaconda trace

Another inst.nokill + graphics pop

Anaconda was still running, no trace files in /tmp.
Issued SIGUSR2 to anaconda, generating the attached trace file.

Comment 10 Brian Lane 2015-09-08 17:37:25 UTC
Can you run this with a simplified kickstart? One without any of your custom packages? I can't see any reason for it to be quitting, but it is possible that a problem in one of your packages could cause anaconda-yum to exit unexpectedly.

Also, was just anaconda running or was anaconda-yum also running (that's the process that handles the actual package installation).

Comment 11 David Bulkow 2015-09-08 18:28:02 UTC
The answer to the second question is yes, anaconda-yum is running at the time of the event.

Comment 12 David Bulkow 2015-09-08 18:36:57 UTC
Do you want ALL the logs from a less inclusive kickstart?

Comment 13 David Bulkow 2015-09-08 18:55:14 UTC
This latest repro only adds prerequisites for our custom packages.  Nothing fancy with repos and all Red Hat software.

Comment 14 Brian Lane 2015-09-08 18:57:14 UTC
(In reply to David Bulkow from comment #12)
> Do you want ALL the logs from a less inclusive kickstart?

Yes, if it crashes. Thanks.

Comment 15 David Bulkow 2015-09-08 19:52:09 UTC
Created attachment 1071457 [details]
trace 2 - X.log

Comment 16 David Bulkow 2015-09-08 19:52:39 UTC
Created attachment 1071458 [details]
trace 2 - anaconda.log

Comment 17 David Bulkow 2015-09-08 19:53:13 UTC
Created attachment 1071459 [details]
trace 2 - anaconda-tb-HxM7mp

Comment 18 David Bulkow 2015-09-08 19:53:37 UTC
Created attachment 1071460 [details]
trace 2 - anaconda-tb-all.log

Comment 19 David Bulkow 2015-09-08 19:53:59 UTC
Created attachment 1071461 [details]
trace 2 - anaconda-yum.conf

Comment 20 David Bulkow 2015-09-08 19:54:16 UTC
Created attachment 1071462 [details]
trace 2 - ifcfg.log

Comment 21 David Bulkow 2015-09-08 19:54:52 UTC
Created attachment 1071463 [details]
trace 2 - ks-script-PRFjKQ

Comment 22 David Bulkow 2015-09-08 19:55:49 UTC
Created attachment 1071464 [details]
trace 2 - ks-script-n_VNqK

Comment 23 David Bulkow 2015-09-08 19:56:12 UTC
Created attachment 1071465 [details]
trace 2 - libuser.9wfl83

Comment 24 David Bulkow 2015-09-08 19:56:34 UTC
Created attachment 1071466 [details]
trace 2 - packages

Comment 25 David Bulkow 2015-09-08 19:56:55 UTC
Created attachment 1071467 [details]
trace 2 - packages_required

Comment 26 David Bulkow 2015-09-08 19:57:17 UTC
Created attachment 1071468 [details]
trace 2 - packaging.log

Comment 27 David Bulkow 2015-09-08 19:57:36 UTC
Created attachment 1071469 [details]
trace 2 - partition-map

Comment 28 David Bulkow 2015-09-08 19:57:55 UTC
Created attachment 1071470 [details]
trace 2 - program.log

Comment 29 David Bulkow 2015-09-08 19:58:21 UTC
Created attachment 1071471 [details]
trace 2 - rpm-script.log

Comment 30 David Bulkow 2015-09-08 19:58:42 UTC
Created attachment 1071472 [details]
trace 2 - storage.log

Comment 31 David Bulkow 2015-09-08 19:59:16 UTC
Created attachment 1071473 [details]
trace 2 - storage.state

Comment 32 David Bulkow 2015-09-08 19:59:32 UTC
Created attachment 1071474 [details]
trace 2 - syslog

Comment 33 David Bulkow 2015-09-08 20:00:28 UTC
The following were zero-length:
ks-script-PRFjKQ.log
ks-script-n_VNqK.log
sensitive-info.log

Comment 34 Brian Lane 2015-09-08 20:35:14 UTC
Sorry, I should have remembered to suggest running with inst.loglevel=debug, could you give that a try, with the limited package set. Thanks.

Comment 35 David Bulkow 2015-09-09 19:28:23 UTC
Do you want every file separately again?

There still isn't much in these logs that sticks out :-(

Comment 36 Brian Lane 2015-09-09 20:32:15 UTC
You only need to include the anaconda-tb-* file, it has everything we need in it.

Comment 37 David Bulkow 2015-09-09 23:48:54 UTC
Created attachment 1071973 [details]
trace 3 - anaconda trace file

This is from a simplified kickstart, @core (I removed 'Server with GUI' for this run) plus packages required for our software.

Comment 38 Brian Lane 2015-09-10 16:18:10 UTC
14:22:19,722 INFO anaconda: Installing boot loader
14:22:19,723 INFO anaconda: Performing post-installation setup tasks
14:22:19,747 INFO anaconda: Performing post-installation setup tasks
14:22:19,747 INFO anaconda: Thread Done: AnaInstallThread (139932915861248)

This was actually a successful run. Can you get this exact same kickstart to fail?

Comment 39 David Bulkow 2015-09-10 16:25:04 UTC
It hadn't quite reached the point where one presses the reboot button.  The graphics popped and I was left with the tmux session and anaconda still running.

There hasn't been anything interesting in these logs from the beginning.  I wonder if having your hands on one of these machines could help?

In the mean time I'm giving 7.2-Snap-1 a try this morning.

Comment 40 Brian Lane 2015-09-10 16:36:38 UTC
That's correct behavior when using nokill, anaconda stops in a loop. But the installation was finished in this case. In the others it wasn't done since anaconda-yum was still running and it hadn't logged anything about bootloader installation.

Comment 41 David Bulkow 2015-09-21 17:41:24 UTC
Brian,

7.2-Snap-1 is far less likely to reproduce this issue - 1 in 10?
7.2-Alpha-1 was more like 9 in 10.

I just finished round-tripping other install issues and will try to get another trace.  Unfortunately I don't see anything useful coming from these logs.  I've been comparing successful versus failed runs and nothing jumps right out.  It feels like something graphics related is blowing up, but not leaving any traces in logs.

As for the reboot button - I have been having to first press Reboot, then Ctrl-alt-delete to reboot from an install when things go right.

Is there more I can do in the graphics interface to catch faults - presuming there are ways to fail that don't get logged?

-dbul-

Comment 42 Brian Lane 2015-09-21 18:16:16 UTC
There's inst.loglevel=lock which will log a pile of things about yum locking, but at the point where it fails it is past all of that. When it starts up anaconda-yum it is supposed to be sitting in a loop watching its output and it shouldn't escape with anaconda-yum still installing things.

I may have hit this on Friday while doing a dvd + kickstart install. I only saw it once, but I'll keep looking.

Comment 45 Brian Lane 2016-02-11 00:31:18 UTC
I revisited this today, and still cannot get it to reproduce here. I've used your kickstart (with the slight modification of just doing autopart) in a KVM. I've grabbed it from http and from a swapped in cdrom image, I've inserted failures into anaconda-yum and anaconda's processing of the anaconda-yum output, I've stared at the code and dug into the underlying yum code a bit too.

I cannot find any code paths that would result in it jumping to the exit code without logging something. In anaconda-yum it will always pass a QUIT: back to anaconda. In anaconda it will always log what it gets from anaconda-yum, and if anaconda-yum quits unexpectedly the traceback will be logged in anaconda.log and a dialog displayed.

There must be something very different about how you are using this, or as a last resort some kind of hardware problem that's not showing up in syslog. eg. a out of memory problem would usually show up as a oom-killer entry.

I'm pretty much stuck. About all I can recommend is trying a different method of installation and see if it works better.

As an example of how puzzling this is, in the packaging.log there is this:

14:33:52,738 INFO packaging:  populate transaction set

which comes from anaconda-yum. That should be immediately followed by "check transaction set". The only way it can escape would result in either an error being logged, or a progress updates about retrying the populate transaction set.

In the anaconda code, where it processes the output from anaconda-yum, it logs everything, even unrecognized text. It can't hit either of the sys.exit() calls without something getting logged.

Which isn't to say I don't believe this is happening, it clearly is, but all the code paths should result in more detail than what we are seeing which makes me suspect that there is some other serious unknown issue going on.

Comment 46 David Bulkow 2016-02-12 16:39:58 UTC
This isn't statistically significant, but I have not yet seen this fail when I add inst.text to the command line.

The downside to a text install is that graphics.target (sp?) isn't enabled, leaving a problem we have to clean up if this install is being performed in manufacturing.  Some folks here have had their install automagically downgraded to a text install - related?

The graphics on this platform is provided by the on-board BMC, using some old spec Matrox design.  I have no doubt there could be trouble in this space, but I'm not sure how to dig that out of the data we've been getting.  Could it be possible that instead of an error exit the process just got sigKilled?

Also not statistically significant - I have not had this fail when no kickstart is specified/used.  This would indicate we've done something untoward in the kickstart, and we did believe this for a while, leading us to reduce the kickstart to it's minimum - still had this event occur.

If there would be value in seeing this first hand, I'd be glad to host a field trip - we're down the road about 20 mins from Westford.  Unfortunately we don't have this hardware in a Red Hat lab.

Comment 47 Brian Lane 2016-02-12 17:45:57 UTC
(In reply to David Bulkow from comment #46)
> This isn't statistically significant, but I have not yet seen this fail when
> I add inst.text to the command line.
> 
> The downside to a text install is that graphics.target (sp?) isn't enabled,
> leaving a problem we have to clean up if this install is being performed in
> manufacturing.  Some folks here have had their install automagically
> downgraded to a text install - related?
> 

If you add 'xconfig --startxonboot' to the kickstart you should be able to do a text installation of a GUI system.

> The graphics on this platform is provided by the on-board BMC, using some
> old spec Matrox design.  I have no doubt there could be trouble in this
> space, but I'm not sure how to dig that out of the data we've been getting. 
> Could it be possible that instead of an error exit the process just got
> sigKilled?

I don't think so, I tried killing things from outside (and inside) the processes and it always resulted in a clear reason being logged.

> 
> Also not statistically significant - I have not had this fail when no
> kickstart is specified/used.  This would indicate we've done something
> untoward in the kickstart, and we did believe this for a while, leading us
> to reduce the kickstart to it's minimum - still had this event occur.

Or that there's some timing issue involved.

A funny thing just happened. I saw this in a kvm with the ks pulled from network (no cd swapping). But since I didn't have nokill it just went click and rebooted. So I've probably seen this twice in all my attempts -- but something clearly went wrong.

I also just tried killing metacity (results in an error dialog box from anaconda) and Xorg (results in anaconda exiting but no system reboot).

Comment 48 Brian Lane 2016-02-12 23:23:50 UTC
I may have found it, give this updates.img a try by adding this to the cmdline (or copying it to a local http server):

inst.updates=https://bcl.fedorapeople.org/updates/1255858.img

The problem may be a race at the switch from the summary hub to the progress hub, there are 2 '_autoContinue clicking continue button' which was a problem fixed in the upstream version of Anaconda.

Comment 49 David Bulkow 2016-02-22 23:51:59 UTC
Sorry this took so long.  We've had runs where these install issues go away but then come back weeks later.  It was important to us that we test this thoroughly, particularly on machines where this seems to happen more frequently.

We have managed 18 successful installs plus 9 more on another machine using this patch.

I think this can be considered fixed.

Thank you for your efforts in digging this one out.

Comment 54 errata-xmlrpc 2016-11-03 23:09:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2158.html