Red Hat Bugzilla – Bug 1289704
Occasional Python interpreter crashes triggered by anaconda during keyboard layout enumeration(?)
Last modified: 2017-08-08 08:30:23 EDT
Sorry for the vagueness of this report, but it's quite difficult to pin down.
It seems like for every night's openQA tests, one or two tests will fail for no immediately apparent reason. The installer seems to be crashing during repository configuration.
There are two variants I've seen, here are cases of each:
In the first case, the Python trace is shown on screen and captured by openQA - https://openqa.stg.fedoraproject.org/tests/1521/modules/_boot_to_anaconda/steps/12 . In the second case, it is not. However, the crash is recorded in the syslog:
17:20:18,253 CRIT anaconda: Anaconda crashed on signal 11
In both cases, anaconda/DNF seem to run into some issues setting up the repositories, though slightly different issues in each case. You can see the logs on the "Logs & Assets" tab for each test - see in particular the librepo logs:
in case 1, it seems like it keeps hitting different mirrors and getting a repomd.xml with a different checksum from the one it was expecting; in case 2 it seems to be looking for a particular repodata file and not finding it.
Unfortunately we don't have openQA set to upload the anaconda logs when an install *succeeds* so I can't easily compare the logs to those of a test run at almost the same time to see how different they are, but it certainly seems *plausible* that the crash is somehow related to the repo issues.
Ooh, actually, while I'm writing this, I have a theory. Comparing timestamps, anaconda seems to crash in both cases right around the time it's trying the ftp.linux.cz mirror. For case 1, we have:
16:13:11,162 CRIT anaconda: Anaconda crashed on signal 11
16:13:11 check_transfer_statuses: Transfer finished: repodata/repomd.xml (Effective url: ftp://ftp.linux.cz/pub/linux/fedora/linux/development/rawhide/x86_64/os/repodata/repomd.xml)
for case 2, we have:
17:20:18,253 CRIT anaconda: Anaconda crashed on signal 11
12:20:18 check_transfer_statuses: Error during transfer: Status code: 404 for http://ftp.linux.cz/pub/linux/fedora/linux/development/rawhide/x86_64/os/repodata/c7740d52753079186273e421b8ac9902e120d20c398faa484a0635b4a30e8213-filelists.xml.gz
(there's some kind of timezone / clock adjustment thing going on with the hour there, but it's clearly actually happening at the same time). So there's something screwy with that specific mirror, I think.
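A rough sketch of the timestamp comparison described above, in case anyone wants to repeat it on other failures. It splits the gap between a syslog line and a librepo line into a whole-hour offset plus a leftover in seconds. The function names are made up for illustration, the log lines are trimmed copies of the ones quoted in this report, and it assumes the clock skew is a whole number of hours.

```python
import re
from datetime import timedelta

# Leading HH:MM:SS on both the syslog and the librepo log lines.
TIME_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})")

# Log lines quoted in this report (trimmed after the message text).
CASE1_SYSLOG = "16:13:11,162 CRIT anaconda: Anaconda crashed on signal 11"
CASE1_LIBREPO = "16:13:11 check_transfer_statuses: Transfer finished: repodata/repomd.xml"
CASE2_SYSLOG = "17:20:18,253 CRIT anaconda: Anaconda crashed on signal 11"
CASE2_LIBREPO = "12:20:18 check_transfer_statuses: Error during transfer: Status code: 404"


def time_of_day(line):
    """Return the leading HH:MM:SS of a log line as a timedelta."""
    match = TIME_RE.match(line)
    if match is None:
        raise ValueError("no leading timestamp in: %r" % line)
    hours, minutes, seconds = (int(part) for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def offset_and_residual(line_a, line_b):
    """Split the gap between two lines into whole hours and leftover seconds.

    Assumption: any clock/timezone skew is a whole number of hours.
    """
    delta = (time_of_day(line_a) - time_of_day(line_b)).total_seconds()
    hours = round(delta / 3600)
    return hours, delta - hours * 3600


if __name__ == "__main__":
    print(offset_and_residual(CASE1_SYSLOG, CASE1_LIBREPO))
    print(offset_and_residual(CASE2_SYSLOG, CASE2_LIBREPO))
```

A leftover of zero seconds after removing the hour offset is what "happening at the same time" means here. Sub-second precision from the syslog is ignored.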
Now for case 2 I can definitely come up with a theory: we have a 404 there, and the 404 page for linux.cz is needlessly complicated and contains a bunch of non-ASCII characters:
so I can certainly see why maybe we're managing to crash Python on some kind of unicode issue there. Case 1 is a bit more mysterious, though, because there's no 404 going on there - we do get the file we requested, it just didn't have the checksum we expected. But perhaps there's still something odd about the server's response?
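To illustrate the kind of failure this theory has in mind, here is a minimal sketch. The page body is a made-up stand-in, not the real linux.cz 404 page, and nothing here is taken from anaconda, DNF or librepo. It only shows what strict ASCII decoding of non-ASCII bytes does in plain Python.

```python
# Hypothetical stand-in for a 404 body containing non-ASCII characters.
BODY = "Stránka nenalezena".encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(try_decode(BODY, "ascii"))   # strict ASCII cannot decode this body
    print(try_decode(BODY, "utf-8"))
```

Note that in pure Python code this surfaces as a UnicodeDecodeError exception rather than a signal 11, so if a Unicode problem is involved in the segfault it would presumably have to be somewhere in native code.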
I think https://fedorahosted.org/fedora-infrastructure/ticket/5020 is contributing to / complicating this - mirrormanager appears to be providing stale information at present (it provides checksums for the repodata/repomd.xml file which librepo uses to check the mirror is current, except that mirrormanager itself is sending out *stale* data, so librepo is rejecting up-to-date mirrors and accepting stale ones). This would explain why so many mirrors got rejected, and it may be that the crash only happens when ftp.linux.cz fails the checks somehow.
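A minimal sketch of why stale checksum data would have this effect. This is not librepo's actual implementation. The function names, the mirror names and the choice of SHA-256 are all assumptions for illustration. The idea is just that each mirror's repomd.xml is hashed and compared with the checksum the metalink advertised.

```python
import hashlib


def sha256_hex(data):
    """Hex SHA-256 digest of some bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def split_mirrors(fetched, expected_sha256):
    """Sort mirrors into accepted/rejected by repomd.xml checksum.

    fetched maps a mirror name to the repomd.xml bytes it served.
    """
    accepted, rejected = [], []
    for mirror, repomd in fetched.items():
        if sha256_hex(repomd) == expected_sha256.lower():
            accepted.append(mirror)
        else:
            rejected.append(mirror)
    return accepted, rejected


# Hypothetical data: one mirror is current, one is a day behind.
OLD_REPOMD = b"<repomd>yesterday</repomd>"
NEW_REPOMD = b"<repomd>today</repomd>"
FETCHED = {
    "current-mirror.example": NEW_REPOMD,
    "stale-mirror.example": OLD_REPOMD,
}

# The metalink advertises yesterday's checksum (the stale-data situation).
STALE_EXPECTED = sha256_hex(OLD_REPOMD)

if __name__ == "__main__":
    accepted, rejected = split_mirrors(FETCHED, STALE_EXPECTED)
    print("accepted:", accepted)
    print("rejected:", rejected)
```

With a stale expected checksum, the up-to-date mirror lands in the rejected list and the stale one is accepted, which matches the pattern of rejections in the dnf.librepo.log files.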
I've tried reproducing by setting linux.cz as the mirror directly, and by using a mirrorlist URL tuned to return only Czech mirrors (to give a higher chance of linux.cz being the top mirror in the list), and neither of those broke. I even tried with a *metalink* URL tuned for CZ mirrors:
to make sure the checksum stuff kicked in, but I can't crash it that way either. So, I'm a bit stuck now.
Hmm, maybe my theory isn't so hot. Here's another case where it crashed with signal 11 and the trace didn't appear on screen:
but in that case it's hitting ftp://ftp.fi.muni.cz when it crashes, it looks like. And here's another one:
which looks a lot like the other cases - it black screens shortly after reaching the hub - but no 'signal 11' is recorded in the syslog...
to summarize, what the hell.
Can you get to the core dump? It should be in /tmp/anaconda.core.
I'm trying - https://phab.qadevel.cloud.fedoraproject.org/D686
This is continuing to happen daily - latest crop of failures is:
but none of them generated a /tmp/anaconda.core, it seems. (If you watch the videos, you'll see the very last thing is an attempt to send /tmp/anaconda.core to the server, which fails because it doesn't exist). I haven't had any success making it happen locally, either (possibly because mirrormanager is geographical and I get a different set of mirrors from the openQA boxes, or I've just been unlucky).
Woohoo, finally we have a core dump!
that's from the 2015-12-12 Rawhide nightly, it'll be with whatever version of anaconda &c. are in Rawhide today.
Created attachment 1105158 [details]
the traceback (I think)
I think this is the relevant traceback - if I'm reading it right, we're actually crashing somewhere in the unicode bits while dealing with keyboard layouts?
Great, so we found a bug in Python.
do you think I should also file an upstream (Python) bug?
Adjusting summary, but here's an interesting note: we got a couple of things fixed in mirrormanager today, which (I think) resulted in all the tests using the first mirror they hit, which is a fast internal mirror. I can only check failed tests for sure, since we don't upload the anaconda logs for successful tests. Previously, the tests were all using slower public mirrors, and they would often have rejected several mirrors due to the 'stale metadata' issue - https://fedorahosted.org/fedora-infrastructure/ticket/5020 (you can see mirrors being rejected for bad checksums in the dnf.librepo.log for several of the tests mentioned here). Notably, today, not one test hit the 'mystery crash'. So even though the traceback shows the crash happening in keyboard layout i18n stuff, apparently nothing to do with repository setup, it seems interesting that the crash stopped happening when the mirror issues were cleared up. Perhaps there's some non-obvious interaction going on here?
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.
Just to underline #c10, we haven't seen this once since moving to faster mirrors. So there's definitely some kind of odd interaction going on with the repo setup stuff.
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle.
Changing version to '24'.
More information and reason for this action is here:
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged change the 'version' to a later Fedora
version prior this bug is closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Is this bug still observable?
We still see 'mystery crashes' every so often, but I haven't checked the data from one of them lately to see if it still looks like *this* mystery crash. I'll try and find a few minutes to look at one soon.
Fedora 24 changed to end-of-life (EOL) status on 2017-08-08. Fedora 24 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this bug.
Thank you for reporting this bug and we are sorry it could not be fixed.