Bug 983319 - Install gets stuck returning from Installation Destination spoke
Summary: Install gets stuck returning from Installation Destination spoke
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: rawhide
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: David Shea
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker
Duplicates: 998687
Depends On:
Blocks: F20AlphaBlocker
 
Reported: 2013-07-11 00:26 UTC by Adam Williamson
Modified: 2013-08-29 12:38 UTC
CC List: 14 users

Fixed In Version: anaconda-20.8-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-08-29 12:38:44 UTC
Type: Bug
Embargoed:


Attachments
the 'dialog on black background' effect (95.64 KB, image/png)
2013-07-11 00:27 UTC, Adam Williamson
the buggy state: note all controls are inactive and the warning for Installation Destination is still displayed (78.85 KB, image/png)
2013-07-11 00:27 UTC, Adam Williamson
strace before hitting 'reclaim space' (5.13 MB, text/plain)
2013-08-23 16:56 UTC, Brian Lane
strace after hitting 'reclaim space' (741.01 KB, text/plain)
2013-08-23 16:58 UTC, Brian Lane
lsof of the anaconda process after it gets stuck. (34.89 KB, text/plain)
2013-08-23 17:00 UTC, Brian Lane
backtrace (8.75 KB, text/x-log)
2013-08-23 18:35 UTC, David Shea
diff between 19.30.13 and 20.6 pyanaconda/ui/gui/utils.py (1.66 KB, text/plain)
2013-08-23 21:48 UTC, Steve Tyler

Description Adam Williamson 2013-07-11 00:26:35 UTC
I built a couple of Rawhide live images for some very early F20 testing, one last week, one today. Today's has anaconda 20.1-1. On both, I found I could run the installer and get to the hub and then Installation Destination, but after completing partitioning with a simple 'delete all disk contents' guided partitioning choice, the installer seems to get 'stuck' returning to the hub: the hub is visible but in a state where no controls are sensitive. It's not possible to begin installation or enter any other spoke.

There are no smoking guns in the anaconda logs.

The VM I'm testing in is the one I've used for most of my testing for a long time, but using VNC/vga graphics instead of SPICE/qxl because of a Rawhide virt bug, and rtl8139 networking instead of virtio because of another Rawhide bug. It has a single 15GB virtio hard disk with an existing Fedora install on it, very simple setup. Reproduction is simply to run the installer, accept U.S. English, click through the 'Timbuktu' warning, click Installation Destination, click Done, click Reclaim Space, choose to delete all partitions, click Reclaim space, and see the bug.

It's worth noting that whenever a dialog comes up - the Timbuktu warning, Installation Options, and Reclaim Space - everything behind it, including anaconda itself, entirely disappears. You just see the dialog floating on a completely black background (with the GNOME panel at the top). I don't recall F18 or F19 behaving that way. Looks like something changed on the GNOME/GTK+ side, and that could _possibly_ affect this bug?

I'll upload the live image I'm using for testing. I'm also going to attach screenshots of the 'dialog on black background' effect and of the bug state itself. And I'll build a KDE live image later and see if it affects that.

As things stand, this looks like an Alpha blocker: it's a showstopper for live installs.

Comment 1 Adam Williamson 2013-07-11 00:27:09 UTC
Created attachment 771920 [details]
the 'dialog on black background' effect

Comment 2 Adam Williamson 2013-07-11 00:27:49 UTC
Created attachment 771921 [details]
the buggy state: note all controls are inactive and the warning for Installation Destination is still displayed

Comment 3 Adam Williamson 2013-07-11 00:34:20 UTC
Actually, doesn't look like I need to upload my live image: the bug is reproducible with the current nightly, which you can find at http://koji.fedoraproject.org/koji/taskinfo?taskID=5592193 .

Comment 4 Adam Williamson 2013-07-11 00:35:01 UTC
CCing mclasen for any possible GNOME/GTK+ angle on this (as mentioned above, I'll check with a KDE live later to see if that reproduces the issue or not).

Comment 5 Adam Williamson 2013-07-16 19:27:40 UTC
What info are you waiting for, exactly? I couldn't check KDE last week as KDE live image composes were failing; I'll check if that's still the case currently.

Comment 6 Adam Williamson 2013-07-30 19:30:07 UTC
Just for the record, KDE is now building again, but I can't investigate this further until https://bugzilla.redhat.com/show_bug.cgi?id=986069 is fixed.

Comment 7 Adam Williamson 2013-08-16 16:46:04 UTC
I believe the anaconda team has reproduced this and is working on it, so un-setting needinfo.

Comment 8 Adam Williamson 2013-08-20 14:51:44 UTC
This is confirmed not to be live-specific and I know the team has reproduced it. Not sure why needinfo seems to keep being set (or if I just forgot to unset it with c#6).

Comment 9 Brian Lane 2013-08-20 17:07:45 UTC

*** This bug has been marked as a duplicate of bug 997149 ***

Comment 10 Adam Williamson 2013-08-21 16:59:48 UTC
Re-opening, as testing indicates that fixing 997149 has not fixed this problem. :(

Comment 11 Kamil Páral 2013-08-21 17:09:25 UTC
Discussed at 2013-08-21 blocker review meeting [1]. This is accepted as an Alpha blocker, because it violates the following F20 alpha release criterion: "The installer must be able to complete an installation using any supported locally connected storage interface." [2]

[1] http://meetbot.fedoraproject.org/fedora-blocker-review/2013-08-21/
[2] https://fedoraproject.org/wiki/Fedora_20_Alpha_Release_Criteria#Storage_interfaces

Comment 12 Adam Williamson 2013-08-21 17:48:17 UTC
Just to confirm, as viking-ice asked in the blocker meeting, this bug affects bare metal too. Just tested on my old laptop and it's stuck at the hub screen right now.

Comment 13 Brian Lane 2013-08-23 16:56:33 UTC
Created attachment 789668 [details]
strace before hitting 'reclaim space'

Comment 14 Brian Lane 2013-08-23 16:58:25 UTC
Created attachment 789669 [details]
strace after hitting 'reclaim space'

Note the end, where it does this forever:

--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
rt_sigreturn()                          = 20651408

Comment 15 Brian Lane 2013-08-23 17:00:27 UTC
Created attachment 789671 [details]
lsof of the anaconda process after it gets stuck.

fd 25 seems to be eventfd, maybe related? Maybe not.

I've got no idea what's going on here -- except that there is a good probability that it isn't anaconda. We typically don't see SIGSEGV.

Comment 16 David Shea 2013-08-23 18:35:31 UTC
Created attachment 789696 [details]
backtrace

The versions of the packages and debuginfos don't necessarily correspond to anything in any install image, so I make no promises about the validity of those line numbers. The crash is happening in anaconda_lb_move_window_to_parent on this line:

    if (!GTK_IS_WIDGET(parent) || !GTK_IS_WINDOW(window))

The parent looks fine, but the window pointer is pointing to garbage.

Comment 17 David Shea 2013-08-23 18:35:54 UTC
*** Bug 998687 has been marked as a duplicate of this bug. ***

Comment 18 Steve Tyler 2013-08-23 19:06:41 UTC
You Python programmers are spoiled ... :-)

anaconda_lb_move_window_to_parent() is in anaconda C code:
https://git.fedorahosted.org/cgit/anaconda.git/tree/widgets/src/lightbox.c

Comment 19 Steve Tyler 2013-08-23 19:19:20 UTC
My money is on a tardy callback:

$ less -N anaconda-20.6-1/widgets/src/lightbox.c
...
    125     /* make the shade move with the parent window */
    126     g_signal_connect(window, "configure-event",
    127                      G_CALLBACK (anaconda_lb_move_window_to_parent), lightbox);
...

Comment 20 Steve Tyler 2013-08-23 21:10:44 UTC
[PATCH] Fix a SIGSEGV when returning from storage spoke (#983319)
https://lists.fedorahosted.org/pipermail/anaconda-patches/2013-August/005533.html

" ...
   then the signal handler (that takes the (destroyed) lightbox as a
   parameter) was called.
 - Kaboom."

Thanks for the detailed explanation. That is what I guessed, but I never would have been able to figure out a fix ... :-)
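
The failure mode quoted from the patch message, a handler firing after the object passed as its user data has been destroyed, can be modeled without GTK+. What follows is only a minimal sketch in plain C. Window, Shade, shade_new(), shade_destroy() and window_emit_configure() are hypothetical stand-ins invented for illustration, not anaconda or GTK+ names, and disconnecting on destroy is one common remedy that is not necessarily what the actual patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: none of these names come from anaconda or GTK+. */
typedef void (*handler_fn)(void *user_data);

typedef struct {
    handler_fn fn;      /* NULL means no handler is connected */
    void *user_data;    /* passed back to fn on every event */
} Window;

typedef struct {
    Window *parent;     /* window whose events the shade follows */
    int moves;          /* how many times the shade was repositioned */
} Shade;

/* Event handler: user_data is only valid while the shade is alive. */
static void on_configure(void *user_data) {
    Shade *shade = user_data;
    shade->moves++;
}

/* Create a shade and connect its handler to the parent window. */
Shade *shade_new(Window *parent) {
    Shade *shade = calloc(1, sizeof *shade);
    if (shade == NULL)
        return NULL;
    shade->parent = parent;
    parent->fn = on_configure;
    parent->user_data = shade;
    return shade;
}

/* Disconnect before freeing, so a late event cannot reach freed memory. */
void shade_destroy(Shade *shade) {
    assert(shade != NULL);
    if (shade->parent->user_data == shade) {
        shade->parent->fn = NULL;
        shade->parent->user_data = NULL;
    }
    free(shade);
}

/* Deliver a pending event. Returns 1 if a handler ran, 0 otherwise. */
int window_emit_configure(Window *w) {
    if (w->fn == NULL)
        return 0;
    w->fn(w->user_data);
    return 1;
}
```

Without the disconnect step in shade_destroy(), a late window_emit_configure() call would hand on_configure() a dangling pointer, which matches the "window pointer is pointing to garbage" symptom from comment 16. In real GLib code the corresponding step would presumably be g_signal_handler_disconnect() before the lightbox is destroyed, or connecting with g_signal_connect_object() in the first place.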

Comment 21 Steve Tyler 2013-08-23 21:38:31 UTC
What puzzles me is why we don't see this in F19. lightbox.c is unchanged, and utils.py is changed in places that don't seem to be related:

$ diff -qs anaconda-19.30.13-1/widgets/src/lightbox.c anaconda-20.6-1/widgets/src/lightbox.c
Files anaconda-19.30.13-1/widgets/src/lightbox.c and anaconda-20.6-1/widgets/src/lightbox.c are identical

$ diff -qs anaconda-19.30.13-1/pyanaconda/ui/gui/utils.py anaconda-20.6-1/pyanaconda/ui/gui/utils.py
Files anaconda-19.30.13-1/pyanaconda/ui/gui/utils.py and anaconda-20.6-1/pyanaconda/ui/gui/utils.py differ

Comment 22 Steve Tyler 2013-08-23 21:48:57 UTC
Created attachment 789743 [details]
diff between 19.30.13 and 20.6 pyanaconda/ui/gui/utils.py

$ diff -u anaconda-19.30.13-1/pyanaconda/ui/gui/utils.py anaconda-20.6-1/pyanaconda/ui/gui/utils.py

Comment 23 Adam Williamson 2013-08-24 01:33:18 UTC
20.8 fixes this in my testing. Thanks!

Comment 24 Ales Kozumplik 2013-08-26 12:11:30 UTC
(In reply to Brian C. Lane from comment #14)
> Created attachment 789669 [details]
> strace after hitting 'reclaim space'
> 
> Note the end, where it does this forever:
> 
> --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
> rt_sigreturn()                          = 20651408

This could be caused by Anaconda trying to catch the SIGSEGV and write out some useful information about it via isys.handleSegv. That's a naive thing to try.

Comment 25 Steve Tyler 2013-08-26 14:15:22 UTC
Thanks for pointing that out, Ales. I was wondering about that SIGSEGV handler too. Do you think it could be removed?

While investigating Bug 998687, I found that:
1. No message was written, because a SIGSEGV storm occurs.
2. Disabling the SIGSEGV handler was required to get a core dump.
   (Bug 998687, Comment 7, Step 3)

Bug 998687 - SIGSEGV storm instead of Manual Partitioning
(Marked as a dupe of this one ...)

$ less -N anaconda-20.8-1/pyanaconda/isys/isys.c
...
    219 static PyObject * doSegvHandler(PyObject *s, PyObject *args) {
    220     void *array[20];
    221     size_t size;
    222     char **strings;
    223     size_t i;
    224 
    225     signal(SIGSEGV, SIG_DFL); /* back to default */
    226     
    227     size = backtrace (array, 20);
    228     strings = backtrace_symbols (array, size);
    229     
    230     printf ("Anaconda received SIGSEGV!.  Backtrace:\n");
    231     for (i = 0; i < size; i++)
    232         printf ("%s\n", strings[i]);
    233      
    234     free (strings);
    235     exit(1);
    236 }
...

Comment 26 Steve Tyler 2013-08-26 14:32:27 UTC
doSegvHandler() calls printf(), free(), and exit(), none of which are on the list of signal-safe functions:

POSIX:
2.4 Signal Concepts
2.4.3 Signal Actions
http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03

SIGNAL(7)
http://man7.org/linux/man-pages/man7/signal.7.html

Comment 27 Steve Tyler 2013-08-26 15:12:56 UTC
A SIGSEGV handler was installed in anaconda on 2006-02-23:

- install SIGSEGV handler
author	Peter Jones <pjones>	2006-02-23 20:00:49 (GMT)
https://git.fedorahosted.org/cgit/anaconda.git/commit/?id=a3f4e015863d8cff89844bb28e17fb52546e526e

doSegvHandler() in that commit is not signal-safe, and it appears to have been unchanged since then:
https://git.fedorahosted.org/cgit/anaconda.git/tree/isys/isys.c?id=a3f4e015863d8cff89844bb28e17fb52546e526e#n1439

Search results for SIGSEGV in commit messages:
https://git.fedorahosted.org/cgit/anaconda.git/log/?qt=grep&q=SIGSEGV

Comment 28 Steve Tyler 2013-08-26 16:21:08 UTC
(In reply to Ales Kozumplik from comment #24)
> (In reply to Brian C. Lane from comment #14)
> > Created attachment 789669 [details]
> > strace after hitting 'reclaim space'
> > 
> > Note the end, where it does this forever:
> > 
> > --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
> > rt_sigreturn()                          = 20651408
> 
> this could be caused by Anaconda trying to catch the sigsegv and write out
> some useful information about it via isys.handleSegv. That's a naive thing
> to try.

Bug 1001187 - SIGSEGV signal handler calls functions that are not signal-safe

Comment 29 Ales Kozumplik 2013-08-26 16:29:10 UTC
(In reply to Steve Tyler from comment #25)
> Thanks for pointing that out, Ales. I was wondering about that SIGSEGV
> handler too. Do you think it could be removed?
> 

It should, definitely. It was written for a different Anaconda:)

Comment 30 Steve Tyler 2013-08-26 16:48:26 UTC
(In reply to Ales Kozumplik from comment #29)
> (In reply to Steve Tyler from comment #25)
> > Thanks for pointing that out, Ales. I was wondering about that SIGSEGV
> > handler too. Do you think it could be removed?
> > 
> 
> It should, definitely. It was written for a different Anaconda:)

Thanks, Ales. I've copied both your comments to Bug 1001187.

Comment 31 Petr Schindler 2013-08-29 12:38:44 UTC
I'm closing this bug as it seems to be fixed in TC2 (anaconda-20.9-1)

