In openQA and local VM testing of current Rawhide and F29 nightly composes, installer images do not boot to the graphical installer. Instead they boot to a black screen with a cursor and nothing more.
So far I've no idea what's causing this. anaconda did not change in the relevant time frame. The most likely suspects are all the things that changed in *both* the Rawhide 20181025.n.0 compose *and* the F29 20181031.n.0 compose, which is these:
alsa-lib (new version 1.1.7-2.fc30 / 1.1.7-2.fc29)
cairo (new version 1.16.0-1.fc30 / 1.16.0-1.fc29)
dracut (new version 049-11.git20181024.fc30 / 049-11.git20181024.fc29)
libarchive (new version 3.3.3-1.fc30 / 3.3.3-1.fc29)
libinput (new version 1.12.2-1.fc30 / 1.12.2-1.fc29)
libtasn1 (new version 4.13-5.fc30 / 4.13-5.fc29)
python3 (new version 3.7.1-1.fc30 / 3.7.1-1.fc29)
rpm (new version 220.127.116.11-2.fc30 / 18.104.22.168-1.fc29)
of those, it's hard to guess which is the culprit, so I just picked one. Will CC all relevant maintainers.
You can download an image that suffers from this bug here, for investigation:
Note the text install test boots OK and completes successfully, so this is specific to the *graphical* installer somehow. So far I haven't spotted any smoking guns in the logs.
Created attachment 1499700 [details]
anaconda.log (from Rawhide 20181030.n.0 openQA test fail)
Created attachment 1499701 [details]
dnf.librepo.log (from Rawhide 20181030.n.0 openQA test fail)
Created attachment 1499702 [details]
packaging.log (from Rawhide 20181030.n.0 openQA test fail)
Created attachment 1499703 [details]
program.log (from Rawhide 20181030.n.0 openQA test fail)
Created attachment 1499704 [details]
storage.log (from Rawhide 20181030.n.0 openQA test fail)
Created attachment 1499705 [details]
syslog (from Rawhide 20181030.n.0 openQA test fail)
Created attachment 1499708 [details]
X.log (from Rawhide 20181030.n.0 openQA test fail)
Created attachment 1499709 [details]
tarball of /var/log (from Rawhide 20181030.n.0 openQA test fail)
Adam is there anything we can do at the moment from the python side to help you with that? Or should we wait for more info/investigation?
Well, I mean, *someone* needs to figure out what's actually going on. Anyone is more than welcome to try. :P It's hard to be more specific until we have some idea of exactly what's getting stuck and where.
I'll probably start swinging a fairly blunt ax at it soon, by doing custom image builds based on 1021.n.0 with one of the 'suspect' packages added at a time until it starts breaking...
BTW The Anaconda installer on a Live iso starts fine?
I don't actually know, as both live images seem to encounter other issues in testing before they try and start anaconda. I'll test locally in a minute.
Lives are really badly broken and have been since 1014.n.0, apparently - they just don't behave like live images at all. Workstation live boots to gnome-initial-setup, KDE live boots to sddm. I will file this separately, of course.
You can get into a new user session with the Workstation live. Running 'sudo liveinst' from that seems to get stuck during installer init, so that might well be the same problem as this bug.
So it seems python3 really *is* the culprit. I built a test image locally with lorax from the 20181021.n.0 Rawhide tree; as expected, that boots fine. Then I ran an identical build except for adding a single side repo with python3-3.7.1-1.fc30 in it; that image has the bug, it does not boot to the installer.
So I guess next up is figuring out why Python 3.7.1 breaks anaconda!
Was that python3-3.7.0-10.fc30 -> python3-3.7.1-1.fc30,
or python3-3.7.0-9.fc30 -> python3-3.7.1-1.fc30?
Answering my own (lazy) question: https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20181021.n.0/compose/Everything/x86_64/os/Packages/p/ has python3-3.7.0-10.fc30.
If you can share the instructions how to build the iso with a custom built package, I can try to bisect the CPython commit that started this.
In the meantime, I've kicked of a 3.7.0-11 scratch build to eliminate "something got changed in the buildroot" problem: https://koji.fedoraproject.org/koji/taskinfo?taskID=30598946
You can also grab simple-koji-ci built RPMs of 3.7.1rc1 and 3.7.1rc2 from:
(In reply to Miro Hrončok from comment #19)
> You can also grab simple-koji-ci built RPMs of 3.7.1rc1 and 3.7.1rc2 from:
Oh, those got garbage collected already, so resubmitted:
Building a custom installer image is quite easy actually. Install lorax, create a scratch directory, and do this:
lorax -p Fedora -v Rawhide -r Rawhide -s https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20181021.n.0/compose/Everything/x86_64/os/ -s https://www.happyassassin.net/temp/repo/x86_64/ ./results/
The image winds up as ./results/images/boot.iso . Of course, that command creates an image based on Fedora-Rawhide-20181021.n.0 , with additional packages from https://www.happyassassin.net/temp/repo/x86_64/ . Create your own side repo and change the base as you desire. (I think you can use a local side repo too, but I've just always used a remote one as I have a pet web server to use anyway). Each time you want to re-do this, you have to delete results/ , lorax will bail if the result directory exists when you run it.
If you don't get to it first, I'll try some more custom images tomorrow.
Successfully built an rc1 iso.
I'll take this.
3.7.0rc1 boots into Anaconda (in GNOME Boxes).
3.7.0rc2 boots into black screen with a blinking _ cursor (shows Anaconda starting text and then this). I'll give it a couple more minutes (maybe it just takes a while to show up), but it seems long enough.
3b699932e5ac3e76031bbb6d700fbea07492641d is the first new commit
Author: Miss Islington (bot) <firstname.lastname@example.org>
Date: Sun Oct 7 00:09:59 2018 -0700
bpo-6721: Hold logging locks across fork() (GH-4071) (#9291)
bpo-6721: When os.fork() was called while another thread holds a logging lock, the child process may deadlock when it tries to log. This fixes that by acquiring all logging locks before fork and releasing them afterwards.
A regression test that fails before this change is included.
Within the new unittest itself: There is a small _potential_ due to mixing of fork and a thread in the child process if the parent's thread happened to hold a non-reentrant library call lock (malloc?) when the os.fork() happens. buildbots and time will tell if this actually manifests itself in this test or not. :/ A functionality test that avoids that would be a challenge.
An alternate test that isn't trying to produce the deadlock itself but just checking that the release and acquire calls are made would be the next best alternative if so.
(cherry picked from commit 19003841e965bbf56fd06824d6093620c1b66f9e)
Co-authored-by: Gregory P. Smith <email@example.com> [Google]
:040000 040000 54bff49aed96e3c886b25711c9313a44f0c209c1 a5268f5ecd2b53aa62dc49a10637483cd629cc87 M Lib
:040000 040000 de53a0eb6d67653599a6b0ce28f782112a257911 b077206b973a7ce6b696185c939d5f2739c42536 M Misc
I have a 3.7.1 built with this commit reverted, however gnome boxes won't work any more.
So here it is again: https://koji.fedoraproject.org/koji/taskinfo?taskID=30608435
I wonder what Anaconda does that this affects it.
Thanks a lot! I'm building a test image with that Python build now.
(I use virt-manager rather than Boxes, FWIW. It's a bit of a better fit for testing work.)
virt-manager works for me!
However my latest built iso is from 3b699932e and I cannot rebuild it ATM with lorax, because I am on train and the internet is too slow (and lorax seems to always download, no cache etc.).
Hope you'll get lucky with that iso. If not, I can try again once I get more stable connection.
Anyway, even if it works with it, I don't feel like reverting that commit is a "solution", more like an immediate workaround. How urgent is this, should we do it (assuming it works) and investigate later, or do we have time to investigate first?
Confirmed a test image built with the Python 3 build linked above (https://koji.fedoraproject.org/koji/taskinfo?taskID=30608435 ) boots fine.
I would really like the revert for now, please - this is basically a showstopper bug for Rawhide testing, and we are *REALLY* trying to keep Rawhide basically functional from day to day. The problem if we leave a bug like this in is that none of the tests will get past the installer and so we won't have any idea what *other* bugs are creeping in while this one is being fixed; by the time this one is fixed, maybe ten other bugs have backed up "behind" it.
Of course, we can still treat investigating this issue as urgent, I'd just really like us not to leave Rawhide broken while we do that.
I'll wait for the CI build, I'll rebuild an iso with the result, retest and push and build if it works.
Let me know about Fedora 29, it seems that it is not needed there (yet? ever?).
We will need it for 29 too, as we compose regularly for Atomic. Those composes are already broken by this, since 3.7.1 update was pushed stable.
Looks like there might be more cases either related or similar:
- also stuck at UI startup
- fails both in GUI *and TUI*
- Initial Setup on F29 IoT edition blocks boot
- UI startup is the last thing in logs but
no UI can be seen
python3-3.7.1-2.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-230c3c19f5
I can no longer work on this any more (I am sick).
Reassigning back to the default.
It seems like the logging antideadlock commit actually might have introduced some deadlock. I suggest we try a reproducer that doesn't involve building isos and we bring this to upstream.
well, I think you can reproduce by just trying to run anaconda from a live image, if the live image doesn't have the python3 with the revert.
I suppose it may be possible to reproduce by just running initial-setup on an installed system with the non-reverted python3, too? I'll try and check on that soon.
FWIW anaconda does it's own logging + locks so I'll look in there:
python3-3.7.1-2.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-230c3c19f5
Here's an odd thing: F29 updates-testing composes are still failing in what appears to be the same way, even though they include the python3-3.7.1-2.fc29 build. I'm not sure why this is, yet.
This log seems to show python3-3.7.1-2.fc29 is in the installer build:
really not sure what's going on there yet.
python3-3.7.1-2.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.
(In reply to Adam Williamson from comment #38)
> Here's an odd thing: F29 updates-testing composes are still failing in what
> appears to be the same way, even though they include the
> python3-3.7.1-2.fc29 build. I'm not sure why this is, yet.
> This log seems to show python3-3.7.1-2.fc29 is in the installer build:
> really not sure what's going on there yet.
openqa test are fine now for AH ISO generated from 20181109 for both Fedora-29-updates (https://openqa.fedoraproject.org/tests/306830) and Fedora-29-updates-testing (https://openqa.fedoraproject.org/tests/306833).
I think reason of openqa test failure in previous fedora-29-updates-testing could be the Python version available in the buildroot.
If we look at root.log from runroot task for AH ISO from fedora-29-updates-tetsing compose 20181108 (https://kojipkgs.fedoraproject.org//work/tasks/7628/30747628/root.log contains python3-3.7.1-1.fc29) and 20181109 (https://kojipkgs.fedoraproject.org//work/tasks/7628/30747628/root.log contains python3-3.7.1-2.fc29) in buildroot becasue python3-3.7.1-2.fc29 was pushed to stable later before 20181109 compose run.
hmm, still, it's odd that the actual installer build log shows the newer python...oh well, if it's working now, that's fine.
Yeah, even I was wondering if we are providing latest updates-testing repo (which should contain latest Python3 from updates-testing) to lorax for composing ISO, then why was the problem. So, from my debugging I can only think of having older Python3 in the buildroot affecting somehow.
OK. I've reset the urgency and removed the blocker status.
This is now "solved" however in a most ugly fashion.
Hence, keeping open.
I am trying to investigate it.
I am working with the latest live Workstation image and I am able to reproduce the problem, which means that:
- anaconda works properly with the fixed version of Python and
- anaconda GUI doesn't start when I use the "broken Python"
It seems that `anaconda` command runs in the background as `/usr/bin/python3 /sbin/anaconda --liveinst --method=livecd:/dev/mapper/live-base`.
Interesting is that this command works with both Python versions and I see only one difference between `anaconda` and the longer command - `anaconda` reboots a computer when quits but the longer command doesn't.
Is there any other difference between commands:
- /usr/bin/python3 /sbin/anaconda --liveinst --method=livecd:/dev/mapper/live-base
I've tested it again and it seems that there is a difference between commands `anaconda` and `/usr/bin/python3 /sbin/anaconda` because the later one works with both Python versions and the former one not.
After some more testing, I can say that there is no difference between executing `anaconda` and `/usr/bin/python3 /sbin/anaconda` but the behavior of anaconda itself isn't deterministic.
I did more than ten tests (all with the newest Python) and in a few of them, GUI started. What happened every time is that the anaconda window was registered into Gnome - it was visible in app switcher but the window itself was blank (transparent).
I don't know anaconda well enough to debug this issue more but I am willing to help anybody.
Anaconda team. Please help a fellow snake team.
We reverted an upstream commit in Python 3.7 to make the installer work. But we don't want to carry that revert patch forever.
What is that anaconda does, that this commit renders it broken?
(In reply to Miro Hrončok from comment #47)
> Anaconda team. Please help a fellow snake team.
> We reverted an upstream commit in Python 3.7 to make the installer work. But
> we don't want to carry that revert patch forever.
> What is that anaconda does, that this commit renders it broken?
Looking at the Python commit I really suspects it deadlocks the Anaconda logging code, eq. a log() somewhere during Anaconda startup never returns due to the Python logging machinery getting stuck on something.
IIRC we already had something similar in Blivet some time ago due to lazy log evaluation:
log.debug("doing %s, self.foo)
Where foo was a property that when evaluated resulted in another attempt to log which deadlocked the whole thing. I'm not sure if it Blivet has a custom logging lock or it was something in Python itself causing the issue. Adding Vojta to CC to elaborate.
Also the Python commit mentions process forking so I'm wondering if the issue might be possibly related to Anaconda/Blivet or another library calling some utility with Subprocess & associated logging. Especially Blivet calls a lot of utilities during the startup phase when it is scanning available storage, which could easily trigger this if it is actually a race condition.
Anaconda freezes during import of modules, while we hold the imp's lock:
I wanted to replace imp with importlib, but it is not so easy. However,
it seems to be enough to just remove the locking for now:
Hi Miro, is the suggested solution acceptable for the Python team, or should I investigate it further?
(In reply to Vendula Poncova from comment #50)
> Hi Miro, is the suggested solution acceptable for the Python team, or should
> I investigate it further?
Is it confirmed that python, without having the commit reverted, in conjunction with the locking removal, fixes the issue?
If so then I'd say it's acceptable.
(In reply to Charalampos Stratakis from comment #51)
> Is it confirmed that python, without having the commit reverted, in
> conjunction with the locking removal, fixes the issue?
I was testing anaconda with python3-3.7.1-1.fc30 and I wasn't able to reproduce the issue with the removed locking.
That anaconda is in the repos or just locally built?
I'll prep a python3 PR to remove the patch.
Here's a build to test it with: https://koji.fedoraproject.org/koji/taskinfo?taskID=31819177
I can do the test, but please answer my question: Does anaconda in rawhide contain the fix?
(In reply to Miro Hrončok from comment #55)
> Here's a build to test it with:
Ok, I will test it. Thanks!
> I can do the test, but please answer my question: Does anaconda in rawhide
> contain the fix?
The fix will be part of anaconda-30.16-1. We will release it this week.
(In reply to Vendula Poncova from comment #56)
> (In reply to Miro Hrončok from comment #55)
> > Here's a build to test it with:
> > https://koji.fedoraproject.org/koji/taskinfo?taskID=31819177
> Ok, I will test it. Thanks!
I can confirm that Anaconda works with the fix and python3-3.7.2-2.fc30 from the build.
Thanks! Please report here once anaconda-30.16-1 is built in rawhide.
Will the fix go to Fedora 29 as well?
Please don't build/tag 30.16 until https://bugzilla.redhat.com/show_bug.cgi?id=1663585 is sorted out, because it will break composes again.
(In reply to Miro Hrončok from comment #58)
> Thanks! Please report here once anaconda-30.16-1 is built in rawhide.
> Will the fix go to Fedora 29 as well?
We normally don't update Anaconda post Fedora GA & there are no install image respins post GA, so no.
(In reply to Adam Williamson from comment #59)
> Please don't build/tag 30.16 until
> https://bugzilla.redhat.com/show_bug.cgi?id=1663585 is sorted out, because
> it will break composes again.
I currently plan to build anaconda-30.16-1 tomorrow with fix for this bug and bug 1663585.
(In reply to Martin Kolman from comment #60)
> (In reply to Miro Hrončok from comment #58)
> > Thanks! Please report here once anaconda-30.16-1 is built in rawhide.
> > Will the fix go to Fedora 29 as well?
> We normally don't update Anaconda post Fedora GA & there are no install
> image respins post GA, so no.
That's not entirely true. Atomic do two weekly releases, and the community respins a bunch of stuff even if it's not official.
"there are no install image respins post GA, so no."
this is no longer strictly true; the two-week Atomic builds include an Atomic installer image. I'm not sure whether they build the image itself from post-release updates, though, or just use a *payload* generated from post-release packages but keep the *installer itself* built from the frozen release tree.
I'd rather drop that patch from Fedora 29 as well, can you make an exception?
(In reply to Peter Robinson from comment #61)
> (In reply to Martin Kolman from comment #60)
> > (In reply to Miro Hrončok from comment #58)
> > > Thanks! Please report here once anaconda-30.16-1 is built in rawhide.
> > >
> > > Will the fix go to Fedora 29 as well?
> > We normally don't update Anaconda post Fedora GA & there are no install
> > image respins post GA, so no.
> That's not entirely true. Atomic do two weekly releases, and the community
> respins a bunch of stuff even if it's not official.
(In reply to Adam Williamson from comment #62)
> "there are no install image respins post GA, so no."
> this is no longer strictly true; the two-week Atomic builds include an
> Atomic installer image. I'm not sure whether they build the image itself
> from post-release updates, though, or just use a *payload* generated from
> post-release packages but keep the *installer itself* built from the frozen
> release tree.
You are correct, I just didn't want to go into unnecessary details. We already did a couple post GA builds in the past for Atomic or recently for the IoT edition. But in all cases this was more or less exception due to the blocking/urgent nature of the fixes contained in the builds. We certainly do not mind fixing urgent issues in this way, especially if the fixes are minor and not very invasive.
If more is expected from the Anaconda team post GA it should documented somewhere so that we can take it into account during planing and possibly others might have to do so as well (Fedora QA ?). We already maintain 3 to 4 (2 RHEL & 1-2 Fedora versions) Anaconda versions in parallel, so possibly adding one more would be something to take into account.
In this concrete case if there is a good reason for the Python team to drop the workaround patch from Python for F29, then we could backport the Anaconda fix to F29 as well. Still I would recommend waiting a bit after the Anaconda fix has been in Rawhide for a bit in case some subtle regressions show up.
I don't think anything "more" is expected, but as this is a potentially installer-breaking bug, obviously if the Python change is backported to F29, we'd want the fix for this backported too :) It certainly seems to count as an "urgent issue".
Why backport the fix to F29 and drop the patch form Python:
1. the reverted commit is a bugfix. somebody might get hit by this bug. if a fix for anaconda exists, I rather have that fix applied than carrying a potentially dangerous workaround by having this bugfix not applied.
2. we advertise Fedora as a platform for using various Python versions to test user's code - if users test on our 3.7 and it works, they might be surprised if it doesn't work somewhere else.
3. as much careful as we try to be, we might accidentally backport the commit that removes the workaround
4. Fedora 29 and master are still in sync git wise and I'd like to keep it that way if possible
I don't think many users will actually be affected by the bug the reverted commit fixes, so the severity of carrying that patch is not huge. The other two points are trivial.
I would not argue for this if the fix for Fedora 29 was particularly nasty or complicated. However this shall be simple backport of one very small commit.
I think that backporting a bugfix in one application (even an important one like anaconda) beats reverting a bugfix in a language interpreter. However that's juts my view and I'm awfully biased here :D
Sounds like good reasons to backport the Anaconda fix. :)
Still, would suggest to do it after it is in Rawhide for a while, so maybe sometime next week ?
anaconda-30.16-1 has been built with the Python fix and fix for bug 1663585:
Now is the time I'd really love to fast forward the f29 branch. Can we have this bug fixed in anaconda and do a joined bodhi update please?
(In reply to Miro Hrončok from comment #70)
> Now is the time I'd really love to fast forward the f29 branch. Can we have
> this bug fixed in anaconda and do a joined bodhi update please?
Sure, this is the F29 Anaconda build with the patch from Rawhide backported:
As the Bodhi update will have to contain packages from 2 different groups of maintainers I think a proven package will have to create it.
Thank you. I'm in the provenpackagers group.
anaconda-29.24.7-2.fc29 python3-3.7.2-4.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-00870e8bfc
anaconda-29.24.7-2.fc29, python3-3.7.2-4.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-00870e8bfc
Adam, Martin, Vendula, Lumír, Sinny - Thank You all!
anaconda-29.24.7-2.fc29, python3-3.7.2-4.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.
This might be relevant https://bugzilla.redhat.com/show_bug.cgi?id=1691434
Some things deadlock in python3*-3.7.2-4.fc29.x86_64
that worked in python3*-3.7.1-4.fc29.x86_64
These don't involve anaconda.
What is "the python fix" mentioned in comment #69?
Is it https://github.com/python/cpython/commit/3b699932e5ac3e76031bbb6d700fbea07492641d ? (That seems to cause problems.)
Or is it reverting that commit? (That seems to fix problems.)
Or something else?
The "the python fix" mentioned in comment #69 is reverting the reverting of that commit.
The "fix" removed a lock:
- Don't acquire the imp's lock (#1644936) (vponcova)
+diff --git a/pyanaconda/core/util.py b/pyanaconda/core/util.py
+index 6106671c8..084c32da9 100644
+@@ -1354,7 +1354,6 @@ def collect(module_pattern, path, pred):
+ module_path = None
+ (fo, module_path, module_flags) = imp.find_module(mod_name, [path])
+ module = sys.modules.get(module_pattern % mod_name)
+@@ -1427,8 +1426,6 @@ def collect(module_pattern, path, pred):
+ log.error("Failed to import module %s from path %s in collect: %s", mod_name, module_path, imperr)
+ if mod_info and mod_info: # pylint: disable=unsubscriptable-object
+ mod_info.close() # pylint: disable=unsubscriptable-object