Bug 1289704 - Occasional Python interpreter crashes triggered by anaconda during keyboard layout enumeration(?)
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: python3
Version: 24
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Charalampos Stratakis
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-12-08 18:27 UTC by Adam Williamson
Modified: 2022-07-20 15:22 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-08 12:30:23 UTC
Type: Bug
Embargoed:


Attachments
the traceback (I think) (266.92 KB, text/plain)
2015-12-12 19:50 UTC, Adam Williamson

Description Adam Williamson 2015-12-08 18:27:07 UTC
Sorry for the vagueness of this report, but it's quite difficult to pin down.

It seems like for every night's openQA tests, one or two tests will fail for no immediately apparent reason. The installer seems to be crashing during repository configuration.

I've seen two variants; here is a case of each:

1. https://openqa.stg.fedoraproject.org/tests/1521
2. https://openqa.fedoraproject.org/tests/582

In the first case, the Python trace is shown on screen and captured by openQA - https://openqa.stg.fedoraproject.org/tests/1521/modules/_boot_to_anaconda/steps/12 . In the second case, it is not. However, the crash is recorded in the syslog:

17:20:18,253 CRIT anaconda: Anaconda crashed on signal 11
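For anyone wanting to pick these crashes out of a pile of uploaded syslogs, here's a minimal sketch. It assumes every crash line looks exactly like the excerpt above (time, comma, milliseconds, then the CRIT message); the real log may carry extra fields, and the `find_crashes` helper name is made up for illustration.

```python
import re

# Pattern modelled only on the single syslog excerpt quoted in this report;
# the real syslog format may differ (assumption).
CRASH_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"CRIT anaconda: Anaconda crashed on signal (?P<signal>\d+)$"
)


def find_crashes(lines):
    """Return (time, signal_number) for each line that reports a crash."""
    hits = []
    for line in lines:
        match = CRASH_RE.match(line.strip())
        if match:
            hits.append((match.group("time"), int(match.group("signal"))))
    return hits


sample = [
    "17:20:18,253 CRIT anaconda: Anaconda crashed on signal 11",
    "some unrelated log line",
]
print(find_crashes(sample))
```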

In both cases, anaconda/DNF seem to run into some issues setting up the repositories, though slightly different issues in each case. You can see the logs on the "Logs & Assets" tab for each test - see in particular the librepo logs:

1. https://openqa.stg.fedoraproject.org/tests/1521/file/dnf.librepo.log
2. https://openqa.fedoraproject.org/tests/582/file/dnf.librepo.log

In case 1, it seems like it keeps hitting different mirrors and getting a repomd.xml with a different checksum from the one it was expecting; in case 2 it seems to be looking for a particular repodata file and not finding it.

Unfortunately we don't have openQA set to upload the anaconda logs when an install *succeeds* so I can't easily compare the logs to those of a test run at almost the same time to see how different they are, but it certainly seems *plausible* that the crash is somehow related to the repo issues.

Ooh, actually, while I'm writing this, I have a theory. Comparing timestamps, anaconda seems to crash in both cases right around the time it's trying the ftp.linux.cz mirror. For case 1, we have:

16:13:11,162 CRIT anaconda: Anaconda crashed on signal 11
16:13:11 check_transfer_statuses: Transfer finished: repodata/repomd.xml (Effective url: ftp://ftp.linux.cz/pub/linux/fedora/linux/development/rawhide/x86_64/os/repodata/repomd.xml)

for case 2, we have:

17:20:18,253 CRIT anaconda: Anaconda crashed on signal 11
12:20:18 check_transfer_statuses: Error during transfer: Status code: 404 for http://ftp.linux.cz/pub/linux/fedora/linux/development/rawhide/x86_64/os/repodata/c7740d52753079186273e421b8ac9902e120d20c398faa484a0635b4a30e8213-filelists.xml.gz

(there's some kind of timezone / clock adjustment thing going on with the hour there, but it's clearly actually happening at the same time). So there's something screwy with that specific mirror, I think.
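The "same time, different hour" comparison can be sketched roughly as below. This is only an illustration: it assumes the two logs differ by a whole number of hours (which is what the case 2 timestamps suggest), and the function names and the two-second tolerance are invented here.

```python
from datetime import datetime


def parse_clock(stamp: str) -> datetime:
    """Parse an HH:MM:SS clock time, dropping any ',mmm' millisecond suffix."""
    return datetime.strptime(stamp.split(",")[0], "%H:%M:%S")


def same_instant_modulo_hours(a: str, b: str, tolerance_s: int = 2) -> bool:
    """True if two clock times agree once a whole-hour offset is ignored."""
    delta = abs((parse_clock(a) - parse_clock(b)).total_seconds())
    remainder = delta % 3600
    # The remainder may sit just below a full hour as well as just above zero.
    return min(remainder, 3600 - remainder) <= tolerance_s


# Timestamps copied from the syslog and dnf.librepo.log excerpts above.
print(same_instant_modulo_hours("16:13:11,162", "16:13:11"))  # case 1
print(same_instant_modulo_hours("17:20:18,253", "12:20:18"))  # case 2
```

Both calls report a match, which is all the "it's clearly happening at the same time" claim relies on.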

Now for case 2 I can definitely come up with a theory: we have a 404 there, and the 404 page for linux.cz is needlessly complicated and contains a bunch of non-ASCII characters:

http://ftp.linux.cz/agjkajpag

so I can certainly see why maybe we're managing to crash Python on some kind of unicode issue there. Case 1 is a bit more mysterious, though, because there's no 404 going on there - we do get the file we requested, it just didn't have the checksum we expected. But perhaps there's still something odd about the server's response?
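As a rough illustration of the kind of unicode trouble I mean, here's a sketch of strictly decoding a non-ASCII body as ASCII. The byte string is a made-up stand-in, not the real linux.cz 404 page, and the helper name is invented. Note that in plain Python this ends in an exception rather than a signal 11, so at best it shows the class of problem, not the crash itself.

```python
# Made-up stand-in for a 404 body containing non-ASCII characters (assumption).
body = "Stránka nenalezena".encode("utf-8")


def decode_strict_ascii(data: bytes) -> str:
    """Try a strict ASCII decode and describe the failure instead of raising."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return f"decode failed at byte {exc.start}"


print(decode_strict_ascii(body))
# A lenient decode never raises; undecodable bytes become U+FFFD.
print(body.decode("ascii", errors="replace"))
```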

Comment 1 Adam Williamson 2015-12-08 19:34:39 UTC
I think https://fedorahosted.org/fedora-infrastructure/ticket/5020 is contributing to / complicating this - mirrormanager appears to be providing stale information at present (it provides checksums for the repodata/repomd.xml file which librepo uses to check the mirror is current, except that mirrormanager itself is sending out *stale* data, so librepo is rejecting up-to-date mirrors and accepting stale ones). This would explain why so many mirrors got rejected, and it may be that the crash only happens when ftp.linux.cz fails the checks somehow.
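To make that failure mode concrete, here's a minimal sketch of checksum-based mirror selection. It is not librepo's actual code: the function names, mirror names and repomd contents are all invented, and SHA-256 is assumed as the hash. The point is only that if the expected checksum comes from stale data, the up-to-date mirror is rejected and the stale one is accepted.

```python
import hashlib


def repomd_matches(repomd_bytes: bytes, expected_sha256: str) -> bool:
    """Compare the SHA-256 of a downloaded repomd.xml with the expected value."""
    return hashlib.sha256(repomd_bytes).hexdigest() == expected_sha256.lower()


def pick_mirror(mirrors, expected_sha256):
    """Return the first mirror whose repomd.xml matches, or None if none do.

    mirrors is an iterable of (name, repomd_bytes) pairs (hypothetical shape).
    """
    for name, repomd_bytes in mirrors:
        if repomd_matches(repomd_bytes, expected_sha256):
            return name
    return None


# Invented payloads standing in for current and stale repomd.xml files.
fresh = b"<repomd>2015-12-08</repomd>"
stale = b"<repomd>2015-12-07</repomd>"
mirrors = [("mirror-a.example", fresh), ("mirror-b.example", stale)]

# The checksum handed out is computed from the stale file, as in the ticket.
stale_expected = hashlib.sha256(stale).hexdigest()
print(pick_mirror(mirrors, stale_expected))
```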

I've tried reproducing by setting linux.cz as the mirror directly, and by using a mirrorlist URL tuned to return only Czech mirrors (to give a higher chance of linux.cz being the top mirror in the list), and neither of those broke. I even tried with a *metalink* URL tuned for CZ mirrors:

https://mirrors.fedoraproject.org/metalink?repo=rawhide&arch=x86_64&country=cz

to make sure the checksum stuff kicked in, but I can't crash it that way either. So, I'm a bit stuck now.
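For reference, the country-tuned metalink URL can be built like this. It's just a sketch: the `metalink_url` helper is made up, and it only uses the three query parameters that appear in the URL above.

```python
from urllib.parse import urlencode

BASE = "https://mirrors.fedoraproject.org/metalink"


def metalink_url(repo: str, arch: str, country: str = "") -> str:
    """Build a metalink URL, optionally restricted to one country's mirrors."""
    params = {"repo": repo, "arch": arch}
    if country:
        params["country"] = country
    return f"{BASE}?{urlencode(params)}"


print(metalink_url("rawhide", "x86_64", "cz"))
```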

Comment 2 Adam Williamson 2015-12-08 19:42:46 UTC
Hmm, maybe my theory isn't so hot. Here's another case where it crashed with signal 11 and the trace didn't appear on screen:

https://openqa.fedoraproject.org/tests/620

but in that case it's hitting ftp://ftp.fi.muni.cz when it crashes, it looks like. And here's another one:

https://openqa.fedoraproject.org/tests/597

which looks a lot like the other cases - it black screens shortly after reaching the hub - but no 'signal 11' is recorded in the syslog...

to summarize, what the hell.

Comment 3 David Shea 2015-12-09 12:57:51 UTC
Can you get to the core dump? It should be in /tmp/anaconda.core.

Comment 4 Adam Williamson 2015-12-09 16:44:27 UTC
I'm trying - https://phab.qadevel.cloud.fedoraproject.org/D686

Comment 5 Adam Williamson 2015-12-10 17:38:43 UTC
This is continuing to happen daily - latest crop of failures is:

https://openqa.fedoraproject.org/tests/742
https://openqa.fedoraproject.org/tests/716
https://openqa.fedoraproject.org/tests/711
https://openqa.fedoraproject.org/tests/695

but none of them generated a /tmp/anaconda.core, it seems. (If you watch the videos, you'll see the very last thing is an attempt to send /tmp/anaconda.core to the server, which fails because it doesn't exist.) I haven't had any success making it happen locally, either (possibly because mirrormanager is geographical and I get a different set of mirrors from the openQA boxes, or I've just been unlucky).

Comment 6 Adam Williamson 2015-12-12 17:16:38 UTC
Woohoo, finally we have a core dump!

https://openqa.stg.fedoraproject.org/tests/2187
https://openqa.stg.fedoraproject.org/tests/2187/file/anaconda.core.tar.gz

that's from the 2015-12-12 Rawhide nightly, it'll be with whatever version of anaconda &c. are in Rawhide today.

Comment 7 Adam Williamson 2015-12-12 19:50:09 UTC
Created attachment 1105158
the traceback (I think)

I think this is the relevant traceback - if I'm reading it right, we're actually crashing somewhere in the unicode bits while dealing with keyboard layouts?

Comment 8 David Shea 2015-12-14 12:41:48 UTC
Great, so we found a bug in Python.

Comment 9 Adam Williamson 2015-12-14 17:01:28 UTC
do you think I should also file an upstream (Python) bug?

Comment 10 Adam Williamson 2015-12-15 18:55:54 UTC
Adjusting summary, but here's an interesting note: we got a couple of things fixed in mirrormanager today which resulted in (I think, I can only check failed tests for sure - we don't upload the anaconda logs for successful tests) all the tests using the first mirror they hit, which is a fast internal mirror. Previously, the tests were all using slower public mirrors, and they would often have rejected several mirrors due to the 'stale metadata' issue - https://fedorahosted.org/fedora-infrastructure/ticket/5020 (you can see mirrors being rejected for bad checksums in the dnf.librepo.log for several of the tests mentioned here). Notably, today, not one test hit the 'mystery crash'. So even though the traceback shows the crash happening in keyboard layout i18n stuff, not apparently anything to do with repository setup, it seems interesting that the crash stopped happening when the mirror issues were cleared up. Perhaps there's some non-obvious interaction going on here?

Comment 11 Fedora Admin XMLRPC Client 2016-01-29 13:07:22 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 12 Adam Williamson 2016-01-29 18:22:07 UTC
Just to underline #c10, we haven't seen this once since moving to faster mirrors. So there's definitely some kind of odd interaction going on with the repo setup stuff.

Comment 13 Jan Kurik 2016-02-24 15:32:15 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle.
Changing version to '24'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase

Comment 14 Fedora End Of Life 2017-07-25 19:36:58 UTC
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '24'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 15 Charalampos Stratakis 2017-07-26 11:09:28 UTC
Is this bug still observable?

Comment 16 Adam Williamson 2017-07-26 15:11:27 UTC
We still see 'mystery crashes' every so often, but I haven't checked the data on one of them lately to see if it still looks like *this* mystery crash. I'll try and find a few minutes to look at one soon.

Comment 17 Fedora End Of Life 2017-08-08 12:30:23 UTC
Fedora 24 changed to end-of-life (EOL) status on 2017-08-08. Fedora 24 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

