Bug 974383 - Reboot after install often delays for ~1 minute
Summary: Reboot after install often delays for ~1 minute
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: 19
Hardware: All
OS: All
unspecified
medium
Target Milestone: ---
Assignee: Brian Lane
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: https://fedoraproject.org/wiki/Common...
Depends On:
Blocks: 994188
TreeView+ depends on / blocked
 
Reported: 2013-06-14 06:34 UTC by Adam Williamson
Modified: 2013-08-17 05:55 UTC (History)
22 users (show)

Fixed In Version: anaconda-20.5-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 994188 (view as bug list)
Environment:
Last Closed: 2013-08-16 17:35:16 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
journal when waiting for reboot (DVD or netinst) (407.45 KB, text/plain)
2013-06-14 08:35 UTC, Kamil Páral
no flags Details
journal with systemd debug (301.03 KB, text/plain)
2013-06-14 12:00 UTC, Kamil Páral
no flags Details
journal after reboot is triggered (6.71 KB, text/plain)
2013-06-21 12:35 UTC, Kamil Páral
no flags Details
anaconda-693.strace (244.97 KB, text/plain)
2013-06-24 12:21 UTC, Kamil Páral
no flags Details
anaconda-737.strace (726 bytes, text/plain)
2013-06-24 12:22 UTC, Kamil Páral
no flags Details
xorg-738.strace (195.43 KB, text/plain)
2013-06-24 12:22 UTC, Kamil Páral
no flags Details
[x5.strace] strace capture during installer reboot (described in Comment 24) (316.52 KB, text/plain)
2013-07-13 04:03 UTC, Steve Tyler
no flags Details
[x6.strace] strace log during reboot with strace attached to anaconda and metacity (515.90 KB, text/plain)
2013-07-13 05:41 UTC, Steve Tyler
no flags Details
[x7.strace] strace log showing network port 8080 being opened (1.64 MB, text/plain)
2013-07-15 11:01 UTC, Steve Tyler
no flags Details
[20.6-1] screenshot showing "4m[terminated]" message that is displayed for about 7 seconds (4.68 KB, image/png)
2013-08-16 19:24 UTC, Steve Tyler
no flags Details

Description Adam Williamson 2013-06-14 06:34:04 UTC
At least Kamil and I have been seeing this throughout Fedora 19 testing, but never got around to filing a bug on it.

When you finish installation there's a button to reboot. Often, when you click it, the system delays, sitting at a console with some irrelevant info on it, for about a minute (for me) before restarting. (For a long time I just thought it was broken, and forced shutdown on my test VMs at that point). I don't know if it's 1 minute or longer for Kamil. I think other testers have mentioned seeing this too.

bcl thinks this is probably a systemd thing, at least as a first guess. We'll try and provide some more detailed info later, I just wanted to get the bug filed before I forgot about it again.

Comment 1 Jóhann B. Guðmundsson 2013-06-14 07:18:21 UTC
There are only three systemd related shutdown problems... 

The network related ones, the disk setup related ones and when users replace /dev/console with a FIFO  :)

You need to follow http://www.freedesktop.org/wiki/Software/systemd/Debugging/#index2h1 and provide us with shutdown-log.txt ( probably need to save it to a usb stick )

Comment 2 Kamil Páral 2013-06-14 08:35:33 UTC
Created attachment 761139 [details]
journal when waiting for reboot (DVD or netinst)

Before I gather any better information, here's some basic one:

I tested LiveCD >5 times and it doesn't happen for them. Neither if you reboot right after startup, nor if you reboot after installation is finished. Everything is quick. It seems to affect just Anaconda-only environment (DVD, netinst, PXE), both bare metal and VM.

I managed to quickly save journal during waiting for reboot once, it is attached. I'll try to provide better logs. From a quick look, it seems like NetworkManager could cause the problem:

Jun 12 10:21:05 localhost.localdomain NetworkManager[677]: <error> [1371032465.581113] [nm-dbus-manager.c:278] nm_dbus_manager_init_bus(): Could not get the system bus.  Make sure the message bus daemon is running!  Message: Failed to connect to socket /var/run/dbus/system_bus_socket: Connection refused

Comment 3 Harald Hoyer 2013-06-14 09:11:45 UTC
systemd[1]: Walked on cycle path to run-install-source.mount/stop
systemd[1]: Walked on cycle path to network.target/stop
systemd[1]: Walked on cycle path to NetworkManager.service/stop
systemd[1]: Walked on cycle path to basic.target/stop
systemd[1]: Walked on cycle path to sockets.target/stop
systemd[1]: Walked on cycle path to dbus.socket/stop
systemd[1]: Walked on cycle path to sysinit.target/stop
systemd[1]: Walked on cycle path to local-fs.target/stop
systemd[1]: Breaking ordering cycle by deleting job run-install-source.mount/stop
systemd[1]: Job run-install-source.mount/stop deleted to break ordering cycle starting with local-fs.target/stop

Seems like we have a problem with dependencies.

Comment 4 Kamil Páral 2013-06-14 10:28:58 UTC
I booted F19 TC3 DVD 5 times, and hit Quit right after displaying anaconda. Once it rebooted immediately, 4 times it took about 1:30 min.

Comment 5 Kamil Páral 2013-06-14 12:00:03 UTC
(In reply to Kamil Páral from comment #4)
> 4 times it took about 1:30 min.

Ctrl+Alt+Del doesn't speed up the process. However, shell on tty2 still works during this time.

I tried to follow http://www.freedesktop.org/wiki/Software/systemd/Debugging/#index2h1 (I created a secondary disk for that in VM), however for some reason either debug.sh was not executed, or something else failed and the log was not written.

So I at least booted with systemd debug and saved the journal the same way as before, during the reboot timeout period. Attached.

Comment 6 Kamil Páral 2013-06-14 12:00:37 UTC
Created attachment 761228 [details]
journal with systemd debug

Comment 7 Harald Hoyer 2013-06-14 14:24:05 UTC
systemd[1]: Failed to load configuration for reboot.service: No such file or directory
systemd[1]: SELinux access check scon=system_u:system_r:initrc_t:s0 tcon=unconfined_u:object_r:power_unit_file_t:s0 tclass=service perm=start path=/usr/lib/systemd/system/reboot.target cmdline=(null): 0
systemd[1]: Trying to enqueue job reboot.target/start/replace-irreversibly
systemd[1]: Fixing conflicting jobs by deleting job reboot.target/stop
systemd[1]: Deleting job systemd-reboot.service/stop as dependency of job reboot.target/stop
systemd[1]: Deleting job umount.target/stop as dependency of job systemd-reboot.service/stop
systemd[1]: Deleting job -.mount/start as dependency of job umount.target/stop
systemd[1]: Deleting job systemd-update-utmp-shutdown.service/start as dependency of job -.mount/start
systemd[1]: Deleting job systemd-random-seed-save.service/start as dependency of job -.mount/start

Comment 8 Lennart Poettering 2013-06-21 11:12:07 UTC
So my guess here is that the boot-up never finished. You end up in a weird mix of starting things and stopping things there I guess. There's probably a service still in the process of starting up, which we need to wait for to fail before we can then immediately shut it down again to bring the system down again.

Can you enable systemd-debug-shell.service (which will give you an unconditional debug shell on tty9. Then, when this is stuck again, please type "systemctl list-jobs" and paste the output here, thanks.

Comment 9 Kamil Páral 2013-06-21 12:20:24 UTC
Please note that although the output I provided was usually gathered by booting the DVD and then restarting it right after anaconda started, the same reboot delay occurs even after the full installation is complete, easily 30 minutes or more. Can some service be in the process of starting up for that long? Wouldn't it be killed by systemd long ago?

I tried to start systemd-debug-shell.service, but there is no such service on F19 TC6 DVD.

This is the output of "systemctl list-jobs" during the reboot delay:
472 reboot.target                      start waiting
473 systemd-reboot.service             start waiting
474 shutdown.target                    start waiting
563 -.mount                            stop  waiting
560 umount.target                      start waiting
486 sockets.target                     stop  waiting
487 dbus.socket                        stop  waiting
490 NetworkManager.service             stop  waiting
494 iscsid.socket                      stop  waiting
498 iscsiuio.socket                    stop  waiting
499 avahi-daemon.socket                stop  waiting
500 lvm2-lvmetad.service               stop  waiting
501 local-fs-pre.target                stop  waiting
502 systemd-vconsole-setup.service     stop  waiting
507 anaconda.target                    stop  waiting
508 anaconda-shell        stop  waiting
512 anaconda.service                   stop  running
516 rsyslog.service                    stop  waiting
517 syslog.socket                      stop  waiting
519 systemd-ask-password-plymouth.path stop  waiting
520 sysinit.target                     stop  waiting
521 basic.target                       stop  waiting
532 systemd-tmpfiles-clean.timer       stop  waiting
533 cryptsetup.target                  stop  waiting
537 paths.target                       stop  waiting
539 systemd-sysctl.service             stop  waiting
540 lvm2-monitor.service               stop  waiting
544 swap.target                        stop  waiting
545 local-fs.target                    stop  waiting
547 fedora-readonly.service            stop  waiting
556 systemd-remount-fs.service         stop  waiting
557 timers.target                      stop  waiting
561 tmp.mount                          stop  waiting
562 run-install-repo.mount             stop  waiting
564 systemd-initctl.socket             stop  waiting
565 systemd-shutdownd.socket           stop  waiting
566 lvm2-lvmetad.socket                stop  waiting
567 systemd-udevd-control.socket       stop  waiting
568 systemd-journald.socket            stop  waiting
569 systemd-journald.service           stop  waiting
575 proc-sys-fs-binfmt_misc.automount  stop  waiting
576 var-lib-nfs-rpc_pipefs.mount       stop  waiting
577 final.target                       start waiting
578 plymouth-reboot.service            start waiting

Comment 10 Kamil Páral 2013-06-21 12:35:16 UTC
If I kill (systemctl kill) anaconda.service during the reboot delay, the machine reboots _immediately_. And I mean immediately, I wonder whether it at least unmounts the drives. The reboot is instant.

In the journal I see following messages concerning anaconda after reboot is triggered:

Jun 21 12:24:42 localhost.localdomain systemd-journal[559]: Suppressed 287 messages from /system/anaconda.service
Jun 21 12:24:42 localhost.localdomain anaconda[746]: Problems running the window manager: 1

Have the journal missed 287 messages? Why?

Maybe anaconda is erroneously trying to start another window manager, during reboot? Brian, can you check?

Comment 11 Kamil Páral 2013-06-21 12:35:55 UTC
Created attachment 763819 [details]
journal after reboot is triggered

Comment 12 Lennart Poettering 2013-06-21 13:47:26 UTC
(In reply to Kamil Páral from comment #9)

> This is the output of "systemctl list-jobs" during the reboot delay:

[...]

> 512 anaconda.service                   stop  running

[...]

So this is the problem. anaconda is slow at shutting down, and everything else waits for that. Reassigning to anaconda.

Comment 13 Lennart Poettering 2013-06-21 13:53:38 UTC
(In reply to Kamil Páral from comment #10)
> If I kill (systemctl kill) anaconda.service during the reboot delay, the
> machine reboots _immediately_. And I mean immediately, I wonder whether it
> at least unmounts the drives. The reboot is instant.

Hmm, if you have the suspicion that things are too fast here, and something reboots forcibly too early, then you could drop in a script in /usr/lib/systemd/system-shutdown/. Make it execute "sleep 10"  or so, mark it executable. All executables from that directory are executed immediately before issuing the kernel reboot() syscall. Hence, if this change will cause your reboot to wait for 10 more seconds, then everything should be OK. Note that we in fact are pretty fast these days in shutting down. What is slow is mostly the disk syncs. But this might be a non-issue with install media?

> In the journal I see following messages concerning anaconda after reboot is
> triggered:
> 
> Jun 21 12:24:42 localhost.localdomain systemd-journal[559]: Suppressed 287
> messages from /system/anaconda.service
> Jun 21 12:24:42 localhost.localdomain anaconda[746]: Problems running the
> window manager: 1
> 
> Have the journal missed 287 messages? Why?

Seems anaconda is writing a ton of messages in a very short time. journald will rate limit per-service, so that one service cannot cause everybody else's logs to be flushed out. You can increase this limit in /etc/systemd/journald.conf using 

RateLimitInterval=10s
RateLimitBurst=200

(this is the default of a per-service limit of 200 msgs per 10s, feel free to bump this to any values you like. You can also turn off the limiting by setting it to 0. See journald.conf(5) for more info).

Comment 14 Lennart Poettering 2013-06-21 13:55:51 UTC
Regarding the ratelimiting issue also see bug 965803.

Comment 15 Kamil Páral 2013-06-21 14:46:48 UTC
(In reply to Lennart Poettering from comment #13)
> Hence, if this change will cause
> your reboot to wait for 10 more seconds, then everything should be OK.

I tried it and it really waits 10 seconds before rebooting. So it should be fine. I'm quite impressed with the shutdown speed.

> Seems anaconda is writing a ton of messages in a very short time. journald
> will rate limit per-service, so that one service cannot cause everybody
> else's logs to be flushed out. You can increase this limit in
> /etc/systemd/journald.conf using 
> 
> RateLimitInterval=10s
> RateLimitBurst=200

I relaxed those to 0, and the "Suppressed XXX messages" line is no longer in journal. But there are no extra messages from anaconda either, the journal looks almost the same as in comment 11.

Comment 16 Brian Lane 2013-06-21 17:47:59 UTC
I can't reproduce this often enough to debug it.

I did see (via pstree -p) that anaconda and Xorg were still running and lsof -p <anaconda pid> showed a bunch of files still open.

So, if you can reproduce this I'd like to see:

1. What happens if you kill -9 <xorg>
   Is Xorg what's holding onto things?

2. strace -p <anaconda>

3. strace -p <xorg>


At the point where anaconda calls systemctl --no-wall reboot there is nothing else for it to do, it should just exit.

Also, if you can reproduce this reliably, does it also happen in text?

What I did to get information off the system was to run netcat on another system like so:

nc -kl <local ip> 8000

And on the install you can do:

lsof -p XXX > /dev/tcp/<other ip>/8000

to send things to the other system.

Comment 17 Adam Williamson 2013-06-21 17:51:15 UTC
I can't reproduce entirely *reliably*, but it happens to me quite often, so I'll try and remember to check back in this tab next time it happens and provided the requested info. I'm sure Kamil will do the same.

Comment 18 Kamil Páral 2013-06-24 12:20:29 UTC
This is the pstree after DVD is booted:
systemd,1 --switched-root --system --deserialize 26
  |-NetworkManager,672 --no-daemon
  |   |-dhclient,759 -d -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-ens3.pid -lf...
  |   |-{NetworkManager},678
  |   |-{NetworkManager},679
  |   `-{NetworkManager},690
  |-at-spi-bus-laun,750
  |   |-dbus-daemon,753 --config-file=/etc/at-spi2/accessibility.conf --nofork --print-address 3
  |   |   `-{dbus-daemon},755
  |   |-{at-spi-bus-laun},752
  |   `-{at-spi-bus-laun},754
  |-at-spi2-registr,757 --use-gnome-session
  |   `-{at-spi2-registr},758
  |-bash,652 --login
  |   `-pstree,807 -pa
  |-chronyd,785 -u chrony
  |-dbus-daemon,747 --fork --print-pid 5 --print-address 7 --session
  |   `-{dbus-daemon},748
  |-dbus-daemon,670 --system --address=systemd: --nofork --nopidfile --systemd-activation
  |   `-{dbus-daemon},671
  |-dbus-launch,746 --autolaunch 0f2b464e718eba5e75bbddefd8972f04 --binary-syntax --close-stderr
  |-lvmetad,591
  |-multipathd,590
  |   |-{multipathd},592
  |   |-{multipathd},594
  |   |-{multipathd},595
  |   |-{multipathd},596
  |   `-{multipathd},597
  |-polkitd,680 --no-debug
  |   |-{polkitd},681
  |   |-{polkitd},682
  |   |-{polkitd},683
  |   |-{polkitd},684
  |   `-{polkitd},685
  |-rsyslogd,644 -n
  |   |-{rsyslogd},666
  |   |-{rsyslogd},667
  |   |-{rsyslogd},668
  |   `-{rsyslogd},669
  |-spice-vdagentd,656
  |-systemd-journal,562
  |-systemd-logind,687
  |-systemd-udevd,586
  |-tmux,692 -u -f /usr/share/anaconda/tmux.conf start
  |   |-anaconda,693 /sbin/anaconda
  |   |   |-Xorg,738 -br -logfile /tmp/X.log :1 vt7 -s 1440 -ac -nolisten tcp -dpi 96 -noreset
  |   |   |-anaconda,741 /sbin/anaconda
  |   |   |   `-metacity,742 --display :1 --sm-disable
  |   |   `-{anaconda},737
  |   |-bash,694 --login
  |   |-tail,695 -F /tmp/anaconda.log
  |   |-tail,696 -F /tmp/storage.log
  |   `-tail,697 -F /tmp/program.log
  `-tmux,704 -u attach -t anaconda


When I try to reboot, the pstree looks like this:

systemd,1 --switched-root --system --deserialize 26
  |-NetworkManager,672 --no-daemon
  |   |-dhclient,759 -d -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-ens3.pid -lf...
  |   |-{NetworkManager},678
  |   |-{NetworkManager},679
  |   `-{NetworkManager},690
  |-anaconda,693 /sbin/anaconda
  |   |-(Xorg,738)
  |   |-(anaconda,741)
  |   `-{anaconda},737
  |-bash,652 --login
  |   |-pstree,829 -pa
  |   |-strace,811 -p 693
  |   `-strace,818 -p 737
  |-lvmetad,591
  |-rsyslogd,644 -n
  |   |-{rsyslogd},666
  |   |-{rsyslogd},667
  |   |-{rsyslogd},668
  |   `-{rsyslogd},669
  |-systemd-journal,562
  `-systemd-udevd,586


I have attached strace to anaconda 693, Xorg 738 and anaconda 737 before triggering reboot, logs attached. Xorg is exited normally, but both anaconda 693 and 737 and hung and killed after the timeout is reached.

Anaconda 693 is waiting like this:

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 26
setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(26, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
getsockname(26, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
listen(26, 5)                           = 0
accept(26, 0x7fff8079a760, [16])        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=738, si_status=0, si_utime=166, si_stime=12} ---
accept(26,

Anaconda 737 is waiting like this:

restart_syscall(<... resuming interrupted call ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=1, si_uid=0} ---
rt_sigreturn()                          = -1 EINTR (Interrupted system call)
read(27, 0x7fcd45260dd0, 16)            = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=27, events=POLLIN}], 1, 4294967295) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
restart_syscall(<... resuming interrupted call ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
restart_syscall(<... resuming interrupted call ...>

I have verified that killing just anaconda 737 thread is enough for allowing the system to reboot correctly. X can't be killed at that moment (it's a zombie process or whatever it is called, I think).

Comment 19 Kamil Páral 2013-06-24 12:21:58 UTC
Created attachment 764562 [details]
anaconda-693.strace

Comment 20 Kamil Páral 2013-06-24 12:22:07 UTC
Created attachment 764563 [details]
anaconda-737.strace

Comment 21 Kamil Páral 2013-06-24 12:22:18 UTC
Created attachment 764564 [details]
xorg-738.strace

Comment 22 Kamil Páral 2013-06-24 12:26:49 UTC
(In reply to Brian C. Lane from comment #16)
> Also, if you can reproduce this reliably, does it also happen in text?

Yes it does. Again, just killing the anaconda thread (i.e. not the whole process, just the thread) is enough to make the computer reboot.

Comment 23 Jeff Bastian 2013-07-01 16:46:06 UTC
(In reply to Kamil Páral from comment #9)
> I tried to start systemd-debug-shell.service, but there is no such service
> on F19 TC6 DVD.


The service has changed names: it's just debug-shell.service now.

~]$ rpm -ql systemd | grep debug
/usr/lib/systemd/system/debug-shell.service
...

Comment 24 Steve Tyler 2013-07-13 04:01:02 UTC
The attached strace log was captured during reboot with strace attached to two anaconda processes and two tmux processes. Pids are:
anaconda:   677, 727
tmux:       676, 684

Rebooting did not stall, but the strace log has a number of interesting records:

1. The "Problems running the window manager: 1" message from anaconda occurs four times.

2. The "terminated" message that sometimes appears while rebooting is stalled comes from tmux:

$ grep -n terminated x5.strace 
2095:[pid   684] 1373684860.159943 write(1, "[terminated]\n", 13) = 13

3. Anaconda appears to be trying to log a traceback:

$ grep -n 'Traceback' x5.strace 
2195:[pid   727] 1373684860.206649 write(2, "Traceback (most recent call last):\n", 35 <unfinished ...>

4. There are a number of writes to stderr that fail:

$ grep -n 'Exception ' x5.strace 
2387:[pid   727] 1373684860.287284 write(2, "Exception ", 10) = -1 EIO (Input/output error)

$ grep -n 'SystemExit' x5.strace 
2389:[pid   727] 1373684860.288092 write(2, "SystemExit", 10) = -1 EIO (Input/output error)

$ grep -n 'Popen' x5.strace 
2397:[pid   727] 1373684860.296731 write(2, "<bound method Popen.__del__ of <subprocess.Popen object at 0x7fa4f776c610>>", 75) = -1 EIO (Input/output error)

==
Procedure:

Start the installer:
$ qemu-kvm -m 4096 -hda f19-test-1.img -cdrom ~/xfr/fedora/F19/Fedora-19-x86_64-DVD.iso -vga std -boot menu=on

Start a minimal install with default disk partitioning.
Set a root password.

When the Reboot button is sensitive, switch to the installer console and run:

# pgrep anaconda
677
727

# pgrep tmux
676
684

# strace -f -ttt -s 512 -p 677 -p 727 -p 676 -p 684 2>&1 | ssh stephent 'cat > tmp/x5.strace'

Comment 25 Steve Tyler 2013-07-13 04:03:55 UTC
Created attachment 772935 [details]
[x5.strace] strace capture during installer reboot (described in Comment 24)

Comment 26 Steve Tyler 2013-07-13 05:41:02 UTC
Created attachment 772938 [details]
[x6.strace] strace log during reboot with strace attached to anaconda and metacity

For this log, strace was attached to two anaconda processes and the metacity process:
anaconda:                   678, 728
metacity:                   729
systemctl --no-wall reboot: 13543 (determined from the strace log)

This snippet appears to show metacity reporting a fatal error during shutdown that anaconda attempts to log. Then systemctl gets a SIGTERM.

$ less -N x6.strace
...
   1673 [pid   729] 1373691542.246442 write(2, "Window manager warning: ", 24) = 24
   1674 [pid   729] 1373691542.246867 write(2, "Fatal IO error 11 (Resource temporarily unavailable) on display ':1'.\n", 70) = 70
   1675 [pid   728] 1373691542.248712 <... open resumed> ) = -1 ENOENT (No such file or directory)
   1676 [pid   728] 1373691542.249217 write(3, "04:59:02,239 ERR anaconda: Problems running the window manager: 1\n", 66) = 66
   1677 [pid   728] 1373691542.250482 sendto(5, "<140>anaconda: Problems running the window manager: 1\0", 54, 0, NULL, 0 <unfinished ...>
   1678 [pid   729] 1373691542.257112 exit_group(1) = ?
   1679 [pid 13543] 1373691542.257697 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
   1680 [pid   728] 1373691542.260231 <... sendto resumed> ) = 54
   1681 [pid   728] 1373691542.267414 write(6, "ERR anaconda: Problems running the window manager: 1\n", 53 <unfinished ...>
   1682 [pid 13543] 1373691542.268505 +++ killed by SIGTERM +++
   1683 [pid   678] 1373691542.268876 <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 13543
   1684 [pid   678] 1373691542.269223 rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7fe624627890}, NULL, 8) = 0
   1685 [pid   678] 1373691542.269897 rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x7fe624627890}, NULL, 8) = 0
   1686 [pid   678] 1373691542.270861 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
   1687 [pid   678] 1373691542.271232 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=13543, si_status=SIGTERM, si_utime=0, si_stime=0} ---
...

==
strace capture commands:

# pgrep anaconda
678
728

# pgrep metacity
729

# strace -f -ttt -s 512 -p 678 -p 728 -p 729 2>&1 | ssh stephent 'cat > tmp/x6.strace'

Comment 27 Steve Tyler 2013-07-13 06:18:36 UTC
"systemctl --no-wall reboot" execs "systemd-tty-ask-password-agent".
Could that hang?

$ egrep -n 'pid [13543|13544]' x6.strace | egrep 'clone|exec'
1181:[pid 13543] 1373691542.126780 execve("/bin/sh", ["sh", "-c", "systemctl --no-wall reboot"], [/* 22 vars */]) = 0
1323:[pid 13543] 1373691542.143002 execve("/usr/bin/systemctl", ["systemctl", "--no-wall", "reboot"], [/* 22 vars */]) = 0
1491:[pid 13543] 1373691542.150489 clone(Process 13544 attached
1524:[pid 13544] 1373691542.151277 execve("/usr/bin/systemd-tty-ask-password-agent", ["/usr/bin/systemd-tty-ask-password-agent", "--watch"], [/* 22 vars */]) = 0
1613:[pid 13543] 1373691542.156330 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb942101b50) = 13544

Comment 28 Steve Tyler 2013-07-15 08:09:39 UTC
Attached log x6.strace shows that anaconda (728) never completes a wait4() for metacity (729).

In lines 1704,1707, there is a wait4(729, ..., WNOHANG ), but it returns 0, which means 729 exists, but has not yet changed state (man wait4, man waitpid).

In line 1719, metacity exits, but there is no subsequent wait4(729, ...) before anaconda exits in line 2239.

$ egrep -n 'pid   (728|729)' x6.strace | egrep 'wait|exit|kill'
1678:[pid   729] 1373691542.257112 exit_group(1) = ?
1704:[pid   728] 1373691542.280907 wait4(729,  <unfinished ...>
1707:[pid   728] 1373691542.282485 <... wait4 resumed> 0x7fff8c0a2aec, WNOHANG, NULL) = 0
1719:[pid   729] 1373691542.285973 +++ exited with 1 +++
1730:[pid   728] 1373691542.289687 wait4(725,  <unfinished ...>
1732:[pid   728] 1373691542.290684 <... wait4 resumed> 0x7fff8c0a328c, WNOHANG, NULL) = -1 ECHILD (No child processes)
2220:[pid   728] 1373691542.557420 exit_group(1) = ?
2239:[pid   728] 1373691542.564746 +++ exited with 1 +++

Comment 29 Steve Tyler 2013-07-15 09:08:33 UTC
Lines 1655,1656 of x6.strace show metacity (729) and anaconda (728) receiving a SIGTERM essentially simultaneously from si_pid 1. Could that cause a race?

Line 1749 shows anaconda receiving a second SIGTERM from si_pid 1.

$ egrep -n 'pid   (728|729)' x6.strace | grep SIGTERM
1655:[pid   729] 1373691542.231323 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
1656:[pid   728] 1373691542.231826 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
1749:[pid   728] 1373691542.299697 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
1825:[pid   728] 1373691542.323044 rt_sigaction(SIGTERM, {SIG_DFL, [], SA_RESTORER, 0x7fe6250c9fa0}, {0x7fe6253e4080, [], SA_RESTORER, 0x7fe6250c9fa0}, 8) = 0

Comment 30 Steve Tyler 2013-07-15 09:33:20 UTC
(In reply to Kamil Páral from comment #10)
...
> Maybe anaconda is erroneously trying to start another window manager, during
> reboot? Brian, can you check?
...

I believe the apparently repeated "Problems ..." messages are actually very similar messages being written to different log files or consoles (based on the consecutive file descriptors (3, 5, 6, 7) and the slight differences in the strings):

$ grep -n Problems x6.strace
1676:[pid   728] 1373691542.249217 write(3, "04:59:02,239 ERR anaconda: Problems running the window manager: 1\n", 66) = 66
1677:[pid   728] 1373691542.250482 sendto(5, "<140>anaconda: Problems running the window manager: 1\0", 54, 0, NULL, 0 <unfinished ...>
1681:[pid   728] 1373691542.267414 write(6, "ERR anaconda: Problems running the window manager: 1\n", 53 <unfinished ...>
1700:[pid   728] 1373691542.278728 write(7, "ERR: Problems running the window manager: 1\n", 44 <unfinished ...>

Comment 31 Steve Tyler 2013-07-15 11:01:15 UTC
Created attachment 773686 [details]
[x7.strace] strace log showing network port 8080 being opened

(In reply to Kamil Páral from comment #18)
...
> Anaconda 693 is waiting like this:
> 
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 26
> setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(26, {sa_family=AF_INET, sin_port=htons(8080),
...

The attached log file (x7.strace) shows a very similar sequence.
The code is opening network port 8080. Why is it doing that?
(Unfortunately, I didn't record what processes correspond to the process ids.)

$ less -N x7.strace:
...
  11264 [pid   682] 1373698241.012535 open("/tmp/.pdbrc", O_RDONLY) = -1 ENOENT (No such file or directory)
  11265 [pid   682] 1373698241.012743 open(".pdbrc", O_RDONLY) = -1 ENOENT (No such file or directory)
  11266 [pid   682] 1373698241.012936 write(1, "Serving on port 8080\n", 21) = -1 EIO (Input/output error)
  11267 [pid   682] 1373698241.013091 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 32
  11268 [pid   682] 1373698241.014071 setsockopt(32, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
  11269 [pid   682] 1373698241.014596 bind(32, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  11270 [pid   682] 1373698241.015057 getsockname(32, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
  11271 [pid   682] 1373698241.015340 listen(32, 5) = 0
  11272 [pid   682] 1373698241.015611 accept(32,  <unfinished ...>
...

Comment 32 Steve Tyler 2013-07-15 11:18:15 UTC
Is anaconda running the Extended Python debugger?

$ grep Serving /usr/lib/python2.7/site-packages/epdb/*.py
/usr/lib/python2.7/site-packages/epdb/epdb.py:                print 'Serving on port %s' % port
/usr/lib/python2.7/site-packages/epdb/epdb.py:                print 'Serving on port %s' % port

$ rpm -q python-epdb
python-epdb-0.11-9.fc19.noarch

Comment 33 Steve Tyler 2013-07-15 11:22:42 UTC
(In reply to Steve Tyler from comment #32)
> Is anaconda running the Extended Python debugger?

$ grep -R epdb anaconda-19.30.13-1
anaconda-19.30.13-1/anaconda:    import epdb
anaconda-19.30.13-1/anaconda:    epdb.serve(skip=1)

Comment 34 Steve Tyler 2013-07-15 17:29:56 UTC
(In reply to Kamil Páral from comment #19)
> Created attachment 764562 [details]
> anaconda-693.strace

This log file also shows epdb (Extended Python debugger) being accessed.
This string is being written to stdout, although the write fails:
"Serving on port 8080\n".

Kamil: The strace "-ttt" option will put timestamps on strace records:
man strace.

Process 693 attached
...
open("/usr/lib/python2.7/site-packages/epdb/telnetclient.pyc", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0100644) = 30
...
open("/tmp/.pdbrc", O_RDONLY)           = -1 ENOENT (No such file or directory)
open(".pdbrc", O_RDONLY)                = -1 ENOENT (No such file or directory)
write(1, "Serving on port 8080\n", 21)  = -1 EIO (Input/output error)
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 26
setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(26, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
...
accept(26,  <unfinished ...>
...

Comment 35 Steve Tyler 2013-07-15 19:02:22 UTC
Now I remember that x7.strace captured pid 1 as well as the anaconda processes, and that it proceeded through the hang until reboot.

You can see anaconda (682) being sent SIGTERM twice and then SIGKILL.
Note that in line 11272, anaconda calls accept() *after* receiving SIGTERM. 
In line 14762, anaconda is finally terminated by SIGKILL.

$ egrep -n 'pid   (682|  1)' x7.strace | egrep 'accept|kill|SIGKILL|SIGTERM|exit'
...
8217:[pid     1] 1373698240.404294 kill(682, SIGTERM <unfinished ...>
8219:[pid   682] 1373698240.404314 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
8222:[pid     1] 1373698240.404347 <... kill resumed> ) = 0
8223:[pid     1] 1373698240.404448 kill(682, SIGCONT <unfinished ...>
8231:[pid     1] 1373698240.404651 <... kill resumed> ) = 0
...
8589:[pid     1] 1373698240.444575 kill(682, SIGTERM) = 0
...
8672:[pid   682] 1373698240.449700 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
...
11272:[pid   682] 1373698241.015611 accept(32,  <unfinished ...>
11309:[pid   682] 1373698241.345584 <... accept resumed> 0x7fff7a27a0c0, [16]) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
11311:[pid   682] 1373698241.346185 accept(32,  <unfinished ...>
14727:[pid     1] 1373698330.456232 kill(682, SIGKILL <unfinished ...>
14730:[pid     1] 1373698330.462081 <... kill resumed> ) = 0
14731:[pid     1] 1373698330.462104 kill(682, SIGCONT) = 0
14762:[pid   682] 1373698330.471266 +++ killed by SIGKILL +++
14771:[pid     1] 1373698330.472480 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=682, si_status=SIGKILL, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
14785:[pid     1] 1373698330.473539 waitid(P_PID, 682, {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=682, si_status=SIGKILL, si_utime=0, si_stime=0}, WEXITED, NULL) = 0
14819:[pid     1] 1373698330.478360 accept4(9, {sa_family=AF_LOCAL, NULL}, [2], SOCK_CLOEXEC) = 14
15399:[pid     1] 1373698330.508387 kill(665, SIGTERM <unfinished ...>
15574:[pid     1] 1373698330.519215 <... kill resumed> ) = 0
15575:[pid     1] 1373698330.519224 kill(665, SIGCONT) = 0

Comment 36 Steve Tyler 2013-07-15 21:43:21 UTC
(In reply to Kamil Páral from comment #4)
> I booted F19 TC3 DVD 5 times, and hit Quit right after displaying anaconda.
> Once it rebooted immediately, 4 times it took about 1:30 min.

x7.strace confirms your time.

The time between anaconda calling accept() and pid 1 sending anaconda SIGKILL is about 1.5 minutes:
1373698330.456232 - 1373698241.346185 = 89.110047 seconds

Snippet from x7.strace:
$ egrep -n 'pid   (682|  1)' x7.strace | egrep 'accept|kill|SIGKILL|SIGTERM|exit'
...
11311:[pid   682] 1373698241.346185 accept(32,  <unfinished ...>
...
14727:[pid     1] 1373698330.456232 kill(682, SIGKILL <unfinished ...>
...

Comment 37 Steve Tyler 2013-08-06 17:16:31 UTC
[Copied from Bug 994188, Comment 1]

Brian C. Lane 2013-08-06 17:03:11 UTC

Found the problem, patch posted to anaconda-patches. Amount to not using os.system to call systemctl for shutdown since it blocks.

Comment 38 Steve Tyler 2013-08-06 17:19:39 UTC
Thanks, Brian:

[PATCH] Don't wait for systemctl shutdown command to exit (#974383)
https://lists.fedorahosted.org/pipermail/anaconda-patches/2013-August/005112.html

Comment 39 Harald Hoyer 2013-08-07 10:35:30 UTC
or use the "--no-block" option of systemctl

Comment 40 Steve Tyler 2013-08-16 19:24:37 UTC
Created attachment 787447 [details]
[20.6-1] screenshot showing "4m[terminated]" message that is displayed for about 7 seconds

With anaconda 20.6-1, I'm still seeing the "4m[terminated]" message as shown in the attached screenshot. Reboot delay is about 7 seconds.

Procedure:
1. Start installer from rawhide boot.iso:
$ qemu-kvm -m 4096 -hda f20-test-1.img -cdrom ~/xfr/fedora/rawhide/boot.iso -vga std -boot menu=on
2. After the Welcome dialog is displayed, click Quit.
3. Click "Yes" and start counting seconds.

The "4m[terminated]" message is displayed for about 7 seconds.

The rawhide boot.iso doesn't have a version number:
$ isoinfo -lR -i boot.iso | egrep 'vmlinuz|squashfs'
-rwxr-xr-x   2    0    0         5653880 Aug 14 2013 [  31326 00]  vmlinuz 
-rw-r--r--   1    0    0       222838784 Aug 15 2013 [  49515 00]  squashfs.img 
-rwxr-xr-x   2    0    0         5653880 Aug 14 2013 [  31326 00]  vmlinuz

Comment 41 Steve Tyler 2013-08-16 19:31:39 UTC
(In reply to Steve Tyler from comment #40)
...
> The "4m[terminated]" message is displayed for about 7 seconds.
...

Just to be clear, that message looks unpolished at best and corrupt at worst.
Users shouldn't be seeing it all ...

As reported in Comment 24, the message appears to be coming from tmux.

Comment 42 Adam Williamson 2013-08-17 05:55:23 UTC
That has nothing at all to do with this bug, though.


Note You need to log in before you can comment on or make changes to this bug.