Red Hat Bugzilla – Bug 869061
No output to journal after double-switch-root
Last modified: 2015-10-21 03:49:40 EDT
Created attachment 631759 [details]
So as part of the 'fedup' upgrade tool, we're doing a switch-root back into a specially prepared upgrade filesystem after the system hits 'sysinit.target' or thereabouts.
After the second switch-root, there is no output to the journal from any process except systemd.
I've got a service (upgrade.service) which runs a script (/bin/upgrade) which starts:
echo "starting upgrade hook"
upgrade.service has 'StandardOutput=journal' (I've also tried 'journal+console', to no avail).
The service definitely runs - I can see it in the logs if I run with systemd.log_level=debug systemd.log_target=journal, etc.
If I examine the fds for /bin/upgrade in /proc, fd1 and fd2 are both unix sockets (although I can't tell what they're connected to).
journald is definitely running; if I run 'logger' to send data to syslog, it shows up in the journal. As do all the messages from systemd.
A full log (journalctl -a -m > upgrade.log) is attached.
Ah, this is probably important. dmesg has several lines reading:
systemd-journald: Failed to parse log priority line.
which (ISTR) means some kind of version mismatch somewhere?
Created attachment 632452 [details]
dmesg (from different run) showing journal error messages
Proposing as a blocker for Fedora 18 beta due to conditional violation of the following F18 beta release criterion :
For each one of the release-blocking package sets ('minimal', and the package sets for each one of the release-blocking desktops), it must be possible to successfully complete an upgrade from a fully updated installation of the previous stable Fedora release with that package set installed, using any officially recommended upgrade mechanisms. The upgraded system must meet all release criteria.
While this bug doesn't directly prevent the upgrade process, it interferes with the log output of fedup. I'm not certain that this qualifies as a blocker but I want to start the conversation about whether this is enough to take it as a blocker.
Since this is a conditional violation of the beta upgrade criterion, I'm also proposing as NTH for F18 beta.
If and when the journal becomes the default logging solution for the distribution we can look into criteria to match that but until then this does not affect any criteria thus should just be treated as a normal bug...
-1 Beta -1 NTH
Johann: AIUI this prevents us getting any logs from the upgrade process, as the upgrade process - running in dracut - can only use the journal for logging. dracut env is different from regular env.
Yes and I'm perfectly very well aware of that I just dont think it's a release blocking issue and I dont think it's wise to pull in something that messes with logging in *general* just before we release hence - to the NTH
Perhaps a better grounds for blocker status here is the little used:
"A bug is considered a Beta blocker bug if any of the following criteria are met:
Bug hinders execution of required Beta test plans or dramatically reduces test coverage"
on the basis that if we can't get logs from the upgrader it rather hinders testing of the upgrader (which is part of the required Beta test plan).
I'm +1 blocker. Without logs for failures the efficacy of testing fedup is significantly reduced. We know there will be upgrade failures. But if we don't know why they fail, then how can they be fixed other than precisely reproducing the precedent system? I think that's unworkable. We'll have all kinds of hearsay anecdotes and no real data from beta testers.
As I mentioned in my blocker status mail, a key question here is whether this actually needs to be fixed in the frozen Beta package set or not. I'm not sufficiently familiar with precisely how the upgrader is going to work to know this, but Will should be able to tell us.
Hmm, so let's see if I understood this properly:
a) the initrd runs, journald #1 inside it
b) we kill everything (except plymouth, but including journald), and systemd/PID1 switches root to the main system
c) the host system runs, journald #2 inside of it, until all storage is setup
d) we kill everything (excepting plymouth, but including journald), and systemd/PID1 swizcjes root to the fedup system
e) fedup runs, journald #3 inside of it
Now, if somebody invokes "/usr/bin/logger" in a), in c) it will work correctly, but it won't in e) anymore?
Has it been verified that this is not an SELinux issue?
Discussed at 2012-10-29 QA meeting acting as a blocker review meeting: http://meetbot.fedoraproject.org/fedora-meeting/2012-10-29/fedora-qa.2012-10-29-15.00.html . Agreed we cannot decide on blocker status of this bug without an answer to the questions in comment #9 - is there a reason this fix needs to be in the frozen 18 Beta package set? Or can it be fixed in F17 packages or F18 update repo?
(In reply to comment #10)
> Hmm, so let's see if I understood this properly:
> a) the initrd runs, journald #1 inside it
> b) we kill everything (except plymouth, but including journald), and
> systemd/PID1 switches root to the main system
> c) the host system runs, journald #2 inside of it, until all storage is setup
> d) we kill everything (excepting plymouth, but including journald), and
> systemd/PID1 swizcjes root to the fedup system
> e) fedup runs, journald #3 inside of it
This is all basically correct. But there's some evidence that journald #3 might actually be starting in the middle of d).
Part d) is really a few steps:
1. kill all processes
2. mess with mounts
3. serialize state
4. execute new systemd
and it seems like journald might be getting *restarted* by systemd between #1 and #4. This would mean we still have an old journald running, which could explain why it fails to parse messages from the newer systemd.
> Now, if somebody invokes "/usr/bin/logger" in a), in c) it will work
> correctly, but it won't in e) anymore?
Correct. BUT: if you restart journald in e), then run logger, it works.
I'll try to gather logs to show this behavior.
> Has it been verified that this is not an SELinux issue?
Would that cause the "Failed to parse log priority line" messages?
(In reply to comment #12)
> (In reply to comment #10)
> > Hmm, so let's see if I understood this properly:
> > a) the initrd runs, journald #1 inside it
> > b) we kill everything (except plymouth, but including journald), and
> > systemd/PID1 switches root to the main system
> > c) the host system runs, journald #2 inside of it, until all storage is setup
> > d) we kill everything (excepting plymouth, but including journald), and
> > systemd/PID1 swizcjes root to the fedup system
> > e) fedup runs, journald #3 inside of it
> This is all basically correct. But there's some evidence that journald #3
> might actually be starting in the middle of d).
As the killing and switch root is actually done from PID1, synchronously its unlikely we activate journald again here at the same time, was PID1 is the one process which would have to activate here...
> Part d) is really a few steps:
> 1. kill all processes
> 2. mess with mounts
> 3. serialize state
> 4. execute new systemd
> and it seems like journald might be getting *restarted* by systemd between
> #1 and #4.
WE don't run the main event loop of PID here, so this is not really possible
> This would mean we still have an old journald running, which
> could explain why it fails to parse messages from the newer systemd.
> > Now, if somebody invokes "/usr/bin/logger" in a), in c) it will work
> > correctly, but it won't in e) anymore?
> Correct. BUT: if you restart journald in e), then run logger, it works.
Could you check whether journald for some reason managed to survie the killing spree? Maybe check "starttime" in /proc/$PID/stat or whether it is from before or after the transition?
> > Has it been verified that this is not an SELinux issue?
> Would that cause the "Failed to parse log priority line" messages?
That's usually an indicaiton that an old systemd tried to talk to a new journald or vice versa. This kinda sounds as if the killing spree that takes place during the transition for some reason spares journald...
If you can isolate a test-case I can easily reproduce (best would be without needing anaconda sources ;-), that would be truly useful.
Discussed at 2012-10-31 blocker review meeting: http://meetbot.fedoraproject.org/fedora-qa/2012-10-31/f18beta-blocker-review-6.2012-10-31-16.00.log.txt . Agreed we still cannot decide on the blocker status of this issue as it is still not clear what the root cause is and whether the fix needs to be in the frozen Beta package set. It appears no clear decision has been made (still) on exactly how the upgrade process is going to work, whether we are going to be serving out an upgrade.img from our Fedora mirrors, if so how it's going to be built and from what sources, etc. This makes it rather hard to evaluate issues like this.
BTW, an OK work-around could be to simply manually restart systemd-journald.service (but not systemd-journald.socket!) after the final transition? That should definitely get things going again. In order not to delay F18 due to this issue this might be a good enough workaround, no? We can then look into fixing this properly with more time?
Yeah, restarting journald seems to fix things. Oddly, when it restarts, it dumps a bunch of queued messages that were sitting in dmesg into the journal - for example the "Failed to parse log priority line" messages actually show up in journal after the restart.
I've added a hack to restart journald before running the upgrade, which should make this not a blocker, at least.
Looks fine update/package/ship and let's take this one off the proposed blocker list
Discussed at 2012-11-05 QA meeting acting as a blocker review meeting. Agreed to take this off the blocker list as it should be fixed enough now, and with a clearer understanding of the bug, it probably wasn't a blocker anyway as the log messages owuld've been available through a different mechanism.
I've tested with the latest from git and while the fedup output is indeed routed to journald now, there is still no output on tty0 behind plymouth.
upgrade.service specifies that output should be going to console and journald:
However, it doesn't appear as if this is actually happening. While fedup is running, there is no output behind the plymouth splash. I believe that Will has spoken with plymouth devs and they don't think that it's an issue with plymouth.
I'm not sure if this needs a new bug or if this is another symptom of the switching. Can someone more familiar with journald/systemd give their opinion on whether or not this is related and if a new bug is needed? I can supply logs or debug information if needed.
Starting to suspect that both the journal and the console going dark are the same bug - something in systemd just not sending output to the right place(s) anymore, or the sockets it was using have gone away, or something.
Restarting journald is an awful hack that partly works around this, but I'd really like to fix the root cause.
systemd guys: can you reproduce this problem? A simple (ish) reproducer is:
* Install fedup on F17 system
* Run `fedup-cli --network 18 --instrepo 'FEDORAMIRROR/releases/test/18-Beta/Fedora/$basearch/os/'
* Reboot and add to boot args:
'rd.upgrade.test rd.upgrade.debugshell rd.upgrade.break=pre'
There will be a debug shell on tty2 (exit the shell to cause it to re-exec and be in the right root), and the upgrade will drop to a dracut shell before doing anything.
It removes the boot entry when it starts, but you can re-run `fedup-cli` and add '--skippkgs' to make it skip downloading/verifying packages and just set up the boot images and configure the bootloader.
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '18'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 18's end of life.
Thank you for reporting this issue and we are sorry that we may not be
able to fix it before Fedora 18 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged change the 'version' to a later Fedora
version prior to Fedora 18's end of life.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.
(As we did not run this process for some time, it could affect also pre-Fedora 23 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)
More information and reason for this action is here:
I tried an update from F21 to F22 based on comment 22 and journal messages were written just fine.