Bug 974383
Summary: | Reboot after install often delays for ~1 minute | |||
---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> | |
Component: | anaconda | Assignee: | Brian Lane <bcl> | |
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | |
Severity: | medium | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 19 | CC: | awilliam, bcl, dshea, fedora, g.kaviyarasu, harald, jbastian, johannbg, jonathan, kparal, lnykryn, mkolman, mrmazda, msekleta, notting, plautrba, sbueno, stephent98, systemd-maint, vanmeeuwen+fedora, vpavlin, zbyszek | |
Target Milestone: | --- | Keywords: | CommonBugs | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | All | |||
Whiteboard: | https://fedoraproject.org/wiki/Common_F19_bugs#reboot-delay | |||
Fixed In Version: | anaconda-20.5-1 | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 994188 (view as bug list) | Environment: | ||
Last Closed: | 2013-08-16 17:35:16 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 994188 | |||
Attachments: |
Description
Adam Williamson
2013-06-14 06:34:04 UTC
There are only three systemd related shutdown problems... The network related ones, the disk setup related ones and when users replace /dev/console with a FIFO :) You need to follow http://www.freedesktop.org/wiki/Software/systemd/Debugging/#index2h1 and provide us with shutdown-log.txt ( probably need to save it to a usb stick ) Created attachment 761139 [details]
journal when waiting for reboot (DVD or netinst)
Before I gather any better information, here's some basic one:
I tested LiveCD >5 times and it doesn't happen for them. Neither if you reboot right after startup, nor if you reboot after installation is finished. Everything is quick. It seems to affect just Anaconda-only environment (DVD, netinst, PXE), both bare metal and VM.
I managed to quickly save journal during waiting for reboot once, it is attached. I'll try to provide better logs. From a quick look, it seems like NetworkManager could cause the problem:
Jun 12 10:21:05 localhost.localdomain NetworkManager[677]: <error> [1371032465.581113] [nm-dbus-manager.c:278] nm_dbus_manager_init_bus(): Could not get the system bus. Make sure the message bus daemon is running! Message: Failed to connect to socket /var/run/dbus/system_bus_socket: Connection refused
systemd[1]: Walked on cycle path to run-install-source.mount/stop systemd[1]: Walked on cycle path to network.target/stop systemd[1]: Walked on cycle path to NetworkManager.service/stop systemd[1]: Walked on cycle path to basic.target/stop systemd[1]: Walked on cycle path to sockets.target/stop systemd[1]: Walked on cycle path to dbus.socket/stop systemd[1]: Walked on cycle path to sysinit.target/stop systemd[1]: Walked on cycle path to local-fs.target/stop systemd[1]: Breaking ordering cycle by deleting job run-install-source.mount/stop systemd[1]: Job run-install-source.mount/stop deleted to break ordering cycle starting with local-fs.target/stop Seems like we have a problem with dependencies. I booted F19 TC3 DVD 5 times, and hit Quit right after displaying anaconda. Once it rebooted immediately, 4 times it took about 1:30 min. (In reply to Kamil Páral from comment #4) > 4 times it took about 1:30 min. Ctrl+Alt+Del doesn't speed up the process. However, shell on tty2 still works during this time. I tried to follow http://www.freedesktop.org/wiki/Software/systemd/Debugging/#index2h1 (I created a secondary disk for that in VM), however for some reason either debug.sh was not executed, or something else failed and the log was not written. So I at least booted with systemd debug and saved the journal the same way as before, during the reboot timeout period. Attached. Created attachment 761228 [details]
journal with systemd debug
systemd[1]: Failed to load configuration for reboot.service: No such file or directory systemd[1]: SELinux access check scon=system_u:system_r:initrc_t:s0 tcon=unconfined_u:object_r:power_unit_file_t:s0 tclass=service perm=start path=/usr/lib/systemd/system/reboot.target cmdline=(null): 0 systemd[1]: Trying to enqueue job reboot.target/start/replace-irreversibly systemd[1]: Fixing conflicting jobs by deleting job reboot.target/stop systemd[1]: Deleting job systemd-reboot.service/stop as dependency of job reboot.target/stop systemd[1]: Deleting job umount.target/stop as dependency of job systemd-reboot.service/stop systemd[1]: Deleting job -.mount/start as dependency of job umount.target/stop systemd[1]: Deleting job systemd-update-utmp-shutdown.service/start as dependency of job -.mount/start systemd[1]: Deleting job systemd-random-seed-save.service/start as dependency of job -.mount/start So my guess here is that the boot-up never finished. You end up in a weird mix of starting things and stopping things there I guess. There's probably a service still in the process of starting up, which we need to wait for to fail before we can then immediately shut it down again to bring the system down again. Can you enable systemd-debug-shell.service (which will give you an unconditional debug shell on tty9. Then, when this is stuck again, please type "systemctl list-jobs" and paste the output here, thanks. Please note that although the output I provided was usually gathered by booting the DVD and then restarting it right after anaconda started, the same reboot delay occurs even after the full installation is complete, easily 30 minutes or more. Can some service be in the process of starting up for that long? Wouldn't it be killed by systemd long ago? I tried to start systemd-debug-shell.service, but there is no such service on F19 TC6 DVD. This is the output of "systemctl list-jobs" during the reboot delay: 472 reboot.target start waiting 473 systemd-reboot.service start waiting 474 shutdown.target start waiting 563 -.mount stop waiting 560 umount.target start waiting 486 sockets.target stop waiting 487 dbus.socket stop waiting 490 NetworkManager.service stop waiting 494 iscsid.socket stop waiting 498 iscsiuio.socket stop waiting 499 avahi-daemon.socket stop waiting 500 lvm2-lvmetad.service stop waiting 501 local-fs-pre.target stop waiting 502 systemd-vconsole-setup.service stop waiting 507 anaconda.target stop waiting 508 anaconda-shell stop waiting 512 anaconda.service stop running 516 rsyslog.service stop waiting 517 syslog.socket stop waiting 519 systemd-ask-password-plymouth.path stop waiting 520 sysinit.target stop waiting 521 basic.target stop waiting 532 systemd-tmpfiles-clean.timer stop waiting 533 cryptsetup.target stop waiting 537 paths.target stop waiting 539 systemd-sysctl.service stop waiting 540 lvm2-monitor.service stop waiting 544 swap.target stop waiting 545 local-fs.target stop waiting 547 fedora-readonly.service stop waiting 556 systemd-remount-fs.service stop waiting 557 timers.target stop waiting 561 tmp.mount stop waiting 562 run-install-repo.mount stop waiting 564 systemd-initctl.socket stop waiting 565 systemd-shutdownd.socket stop waiting 566 lvm2-lvmetad.socket stop waiting 567 systemd-udevd-control.socket stop waiting 568 systemd-journald.socket stop waiting 569 systemd-journald.service stop waiting 575 proc-sys-fs-binfmt_misc.automount stop waiting 576 var-lib-nfs-rpc_pipefs.mount stop waiting 577 final.target start waiting 578 plymouth-reboot.service start waiting If I kill (systemctl kill) anaconda.service during the reboot delay, the machine reboots _immediately_. And I mean immediately, I wonder whether it at least unmounts the drives. The reboot is instant. In the journal I see following messages concerning anaconda after reboot is triggered: Jun 21 12:24:42 localhost.localdomain systemd-journal[559]: Suppressed 287 messages from /system/anaconda.service Jun 21 12:24:42 localhost.localdomain anaconda[746]: Problems running the window manager: 1 Have the journal missed 287 messages? Why? Maybe anaconda is erroneously trying to start another window manager, during reboot? Brian, can you check? Created attachment 763819 [details]
journal after reboot is triggered
(In reply to Kamil Páral from comment #9) > This is the output of "systemctl list-jobs" during the reboot delay: [...] > 512 anaconda.service stop running [...] So this is the problem. anaconda is slow at shutting down, and everything else waits for that. Reassigning to anaconda. (In reply to Kamil Páral from comment #10) > If I kill (systemctl kill) anaconda.service during the reboot delay, the > machine reboots _immediately_. And I mean immediately, I wonder whether it > at least unmounts the drives. The reboot is instant. Hmm, if you have the suspicion that things are too fast here, and something reboots forcibly too early, then you could drop in a script in /usr/lib/systemd/system-shutdown/. Make it execute "sleep 10" or so, mark it executable. All executables from that directory are executed immediately before issuing the kernel reboot() syscall. Hence, if this change will cause your reboot to wait for 10 more seconds, then everything should be OK. Note that we in fact are pretty fast these days in shutting down. What is slow is mostly the disk syncs. But this might be a non-issue with install media? > In the journal I see following messages concerning anaconda after reboot is > triggered: > > Jun 21 12:24:42 localhost.localdomain systemd-journal[559]: Suppressed 287 > messages from /system/anaconda.service > Jun 21 12:24:42 localhost.localdomain anaconda[746]: Problems running the > window manager: 1 > > Have the journal missed 287 messages? Why? Seems anaconda is writing a ton of messages in a very short time. journald will rate limit per-service, so that one service cannot cause everybody else's logs to be flushed out. You can increase this limit in /etc/systemd/journald.conf using RateLimitInterval=10s RateLimitBurst=200 (this is the default of a per-service limit of 200 msgs per 10s, feel free to bump this to any values you like. You can also turn off the limiting by setting it to 0. See journald.conf(5) for more info). Regarding the ratelimiting issue also see bug 965803. (In reply to Lennart Poettering from comment #13) > Hence, if this change will cause > your reboot to wait for 10 more seconds, then everything should be OK. I tried it and it really waits 10 seconds before rebooting. So it should be fine. I'm quite impressed with the shutdown speed. > Seems anaconda is writing a ton of messages in a very short time. journald > will rate limit per-service, so that one service cannot cause everybody > else's logs to be flushed out. You can increase this limit in > /etc/systemd/journald.conf using > > RateLimitInterval=10s > RateLimitBurst=200 I relaxed those to 0, and the "Suppressed XXX messages" line is no longer in journal. But there are no extra messages from anaconda either, the journal looks almost the same as in comment 11. I can't reproduce this often enough to debug it. I did see (via pstree -p) that anaconda and Xorg were still running and lsof -p <anaconda pid> showed a bunch of files still open. So, if you can reproduce this I'd like to see: 1. What happens if you kill -9 <xorg> Is Xorg what's holding onto things? 2. strace -p <anaconda> 3. strace -p <xorg> At the point where anaconda calls systemctl --no-wall reboot there is nothing else for it to do, it should just exit. Also, if you can reproduce this reliably, does it also happen in text? What I did to get information off the system was to run netcat on another system like so: nc -kl <local ip> 8000 And on the install you can do: lsof -p XXX > /dev/tcp/<other ip>/8000 to send things to the other system. I can't reproduce entirely *reliably*, but it happens to me quite often, so I'll try and remember to check back in this tab next time it happens and provided the requested info. I'm sure Kamil will do the same. This is the pstree after DVD is booted: systemd,1 --switched-root --system --deserialize 26 |-NetworkManager,672 --no-daemon | |-dhclient,759 -d -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-ens3.pid -lf... | |-{NetworkManager},678 | |-{NetworkManager},679 | `-{NetworkManager},690 |-at-spi-bus-laun,750 | |-dbus-daemon,753 --config-file=/etc/at-spi2/accessibility.conf --nofork --print-address 3 | | `-{dbus-daemon},755 | |-{at-spi-bus-laun},752 | `-{at-spi-bus-laun},754 |-at-spi2-registr,757 --use-gnome-session | `-{at-spi2-registr},758 |-bash,652 --login | `-pstree,807 -pa |-chronyd,785 -u chrony |-dbus-daemon,747 --fork --print-pid 5 --print-address 7 --session | `-{dbus-daemon},748 |-dbus-daemon,670 --system --address=systemd: --nofork --nopidfile --systemd-activation | `-{dbus-daemon},671 |-dbus-launch,746 --autolaunch 0f2b464e718eba5e75bbddefd8972f04 --binary-syntax --close-stderr |-lvmetad,591 |-multipathd,590 | |-{multipathd},592 | |-{multipathd},594 | |-{multipathd},595 | |-{multipathd},596 | `-{multipathd},597 |-polkitd,680 --no-debug | |-{polkitd},681 | |-{polkitd},682 | |-{polkitd},683 | |-{polkitd},684 | `-{polkitd},685 |-rsyslogd,644 -n | |-{rsyslogd},666 | |-{rsyslogd},667 | |-{rsyslogd},668 | `-{rsyslogd},669 |-spice-vdagentd,656 |-systemd-journal,562 |-systemd-logind,687 |-systemd-udevd,586 |-tmux,692 -u -f /usr/share/anaconda/tmux.conf start | |-anaconda,693 /sbin/anaconda | | |-Xorg,738 -br -logfile /tmp/X.log :1 vt7 -s 1440 -ac -nolisten tcp -dpi 96 -noreset | | |-anaconda,741 /sbin/anaconda | | | `-metacity,742 --display :1 --sm-disable | | `-{anaconda},737 | |-bash,694 --login | |-tail,695 -F /tmp/anaconda.log | |-tail,696 -F /tmp/storage.log | `-tail,697 -F /tmp/program.log `-tmux,704 -u attach -t anaconda When I try to reboot, the pstree looks like this: systemd,1 --switched-root --system --deserialize 26 |-NetworkManager,672 --no-daemon | |-dhclient,759 -d -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-ens3.pid -lf... | |-{NetworkManager},678 | |-{NetworkManager},679 | `-{NetworkManager},690 |-anaconda,693 /sbin/anaconda | |-(Xorg,738) | |-(anaconda,741) | `-{anaconda},737 |-bash,652 --login | |-pstree,829 -pa | |-strace,811 -p 693 | `-strace,818 -p 737 |-lvmetad,591 |-rsyslogd,644 -n | |-{rsyslogd},666 | |-{rsyslogd},667 | |-{rsyslogd},668 | `-{rsyslogd},669 |-systemd-journal,562 `-systemd-udevd,586 I have attached strace to anaconda 693, Xorg 738 and anaconda 737 before triggering reboot, logs attached. Xorg is exited normally, but both anaconda 693 and 737 and hung and killed after the timeout is reached. Anaconda 693 is waiting like this: socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 26 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(26, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 getsockname(26, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0 listen(26, 5) = 0 accept(26, 0x7fff8079a760, [16]) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=738, si_status=0, si_utime=166, si_stime=12} --- accept(26, Anaconda 737 is waiting like this: restart_syscall(<... resuming interrupted call ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} --- --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=1, si_uid=0} --- rt_sigreturn() = -1 EINTR (Interrupted system call) read(27, 0x7fcd45260dd0, 16) = -1 EAGAIN (Resource temporarily unavailable) poll([{fd=27, events=POLLIN}], 1, 4294967295) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) restart_syscall(<... resuming interrupted call ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) restart_syscall(<... resuming interrupted call ...> I have verified that killing just anaconda 737 thread is enough for allowing the system to reboot correctly. X can't be killed at that moment (it's a zombie process or whatever it is called, I think). Created attachment 764562 [details]
anaconda-693.strace
Created attachment 764563 [details]
anaconda-737.strace
Created attachment 764564 [details]
xorg-738.strace
(In reply to Brian C. Lane from comment #16) > Also, if you can reproduce this reliably, does it also happen in text? Yes it does. Again, just killing the anaconda thread (i.e. not the whole process, just the thread) is enough to make the computer reboot. (In reply to Kamil Páral from comment #9) > I tried to start systemd-debug-shell.service, but there is no such service > on F19 TC6 DVD. The service has changed names: it's just debug-shell.service now. ~]$ rpm -ql systemd | grep debug /usr/lib/systemd/system/debug-shell.service ... The attached strace log was captured during reboot with strace attached to two anaconda processes and two tmux processes. Pids are: anaconda: 677, 727 tmux: 676, 684 Rebooting did not stall, but the strace log has a number of interesting records: 1. The "Problems running the window manager: 1" message from anaconda occurs four times. 2. The "terminated" message that sometimes appears while rebooting is stalled comes from tmux: $ grep -n terminated x5.strace 2095:[pid 684] 1373684860.159943 write(1, "[terminated]\n", 13) = 13 3. Anaconda appears to be trying to log a traceback: $ grep -n 'Traceback' x5.strace 2195:[pid 727] 1373684860.206649 write(2, "Traceback (most recent call last):\n", 35 <unfinished ...> 4. There are a number of writes to stderr that fail: $ grep -n 'Exception ' x5.strace 2387:[pid 727] 1373684860.287284 write(2, "Exception ", 10) = -1 EIO (Input/output error) $ grep -n 'SystemExit' x5.strace 2389:[pid 727] 1373684860.288092 write(2, "SystemExit", 10) = -1 EIO (Input/output error) $ grep -n 'Popen' x5.strace 2397:[pid 727] 1373684860.296731 write(2, "<bound method Popen.__del__ of <subprocess.Popen object at 0x7fa4f776c610>>", 75) = -1 EIO (Input/output error) == Procedure: Start the installer: $ qemu-kvm -m 4096 -hda f19-test-1.img -cdrom ~/xfr/fedora/F19/Fedora-19-x86_64-DVD.iso -vga std -boot menu=on Start a minimal install with default disk partitioning. Set a root password. When the Reboot button is sensitive, switch to the installer console and run: # pgrep anaconda 677 727 # pgrep tmux 676 684 # strace -f -ttt -s 512 -p 677 -p 727 -p 676 -p 684 2>&1 | ssh stephent 'cat > tmp/x5.strace' Created attachment 772935 [details] [x5.strace] strace capture during installer reboot (described in Comment 24) Created attachment 772938 [details]
[x6.strace] strace log during reboot with strace attached to anaconda and metacity
For this log, strace was attached to two anaconda processes and the metacity process:
anaconda: 678, 728
metacity: 729
systemctl --no-wall reboot: 13543 (determined from the strace log)
This snippet appears to show metacity reporting a fatal error during shutdown that anaconda attempts to log. Then systemctl gets a SIGTERM.
$ less -N x6.strace
...
1673 [pid 729] 1373691542.246442 write(2, "Window manager warning: ", 24) = 24
1674 [pid 729] 1373691542.246867 write(2, "Fatal IO error 11 (Resource temporarily unavailable) on display ':1'.\n", 70) = 70
1675 [pid 728] 1373691542.248712 <... open resumed> ) = -1 ENOENT (No such file or directory)
1676 [pid 728] 1373691542.249217 write(3, "04:59:02,239 ERR anaconda: Problems running the window manager: 1\n", 66) = 66
1677 [pid 728] 1373691542.250482 sendto(5, "<140>anaconda: Problems running the window manager: 1\0", 54, 0, NULL, 0 <unfinished ...>
1678 [pid 729] 1373691542.257112 exit_group(1) = ?
1679 [pid 13543] 1373691542.257697 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
1680 [pid 728] 1373691542.260231 <... sendto resumed> ) = 54
1681 [pid 728] 1373691542.267414 write(6, "ERR anaconda: Problems running the window manager: 1\n", 53 <unfinished ...>
1682 [pid 13543] 1373691542.268505 +++ killed by SIGTERM +++
1683 [pid 678] 1373691542.268876 <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 13543
1684 [pid 678] 1373691542.269223 rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7fe624627890}, NULL, 8) = 0
1685 [pid 678] 1373691542.269897 rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x7fe624627890}, NULL, 8) = 0
1686 [pid 678] 1373691542.270861 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
1687 [pid 678] 1373691542.271232 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=13543, si_status=SIGTERM, si_utime=0, si_stime=0} ---
...
==
strace capture commands:
# pgrep anaconda
678
728
# pgrep metacity
729
# strace -f -ttt -s 512 -p 678 -p 728 -p 729 2>&1 | ssh stephent 'cat > tmp/x6.strace'
"systemctl --no-wall reboot" execs "systemd-tty-ask-password-agent". Could that hang? $ egrep -n 'pid [13543|13544]' x6.strace | egrep 'clone|exec' 1181:[pid 13543] 1373691542.126780 execve("/bin/sh", ["sh", "-c", "systemctl --no-wall reboot"], [/* 22 vars */]) = 0 1323:[pid 13543] 1373691542.143002 execve("/usr/bin/systemctl", ["systemctl", "--no-wall", "reboot"], [/* 22 vars */]) = 0 1491:[pid 13543] 1373691542.150489 clone(Process 13544 attached 1524:[pid 13544] 1373691542.151277 execve("/usr/bin/systemd-tty-ask-password-agent", ["/usr/bin/systemd-tty-ask-password-agent", "--watch"], [/* 22 vars */]) = 0 1613:[pid 13543] 1373691542.156330 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb942101b50) = 13544 Attached log x6.strace shows that anaconda (728) never completes a wait4() for metacity (729). In lines 1704,1707, there is a wait4(729, ..., WNOHANG ), but it returns 0, which means 729 exists, but has not yet changed state (man wait4, man waitpid). In line 1719, metacity exits, but there is no subsequent wait4(729, ...) before anaconda exits in line 2239. $ egrep -n 'pid (728|729)' x6.strace | egrep 'wait|exit|kill' 1678:[pid 729] 1373691542.257112 exit_group(1) = ? 1704:[pid 728] 1373691542.280907 wait4(729, <unfinished ...> 1707:[pid 728] 1373691542.282485 <... wait4 resumed> 0x7fff8c0a2aec, WNOHANG, NULL) = 0 1719:[pid 729] 1373691542.285973 +++ exited with 1 +++ 1730:[pid 728] 1373691542.289687 wait4(725, <unfinished ...> 1732:[pid 728] 1373691542.290684 <... wait4 resumed> 0x7fff8c0a328c, WNOHANG, NULL) = -1 ECHILD (No child processes) 2220:[pid 728] 1373691542.557420 exit_group(1) = ? 2239:[pid 728] 1373691542.564746 +++ exited with 1 +++ Lines 1655,1656 of x6.strace show metacity (729) and anaconda (728) receiving a SIGTERM essentially simultaneously from si_pid 1. Could that cause a race? Line 1749 shows anaconda receiving a second SIGTERM from si_pid 1. $ egrep -n 'pid (728|729)' x6.strace | grep SIGTERM 1655:[pid 729] 1373691542.231323 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} --- 1656:[pid 728] 1373691542.231826 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} --- 1749:[pid 728] 1373691542.299697 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} --- 1825:[pid 728] 1373691542.323044 rt_sigaction(SIGTERM, {SIG_DFL, [], SA_RESTORER, 0x7fe6250c9fa0}, {0x7fe6253e4080, [], SA_RESTORER, 0x7fe6250c9fa0}, 8) = 0 (In reply to Kamil Páral from comment #10) ... > Maybe anaconda is erroneously trying to start another window manager, during > reboot? Brian, can you check? ... I believe the apparently repeated "Problems ..." messages are actually very similar messages being written to different log files or consoles (based on the consecutive file descriptors (3, 5, 6, 7) and the slight differences in the strings): $ grep -n Problems x6.strace 1676:[pid 728] 1373691542.249217 write(3, "04:59:02,239 ERR anaconda: Problems running the window manager: 1\n", 66) = 66 1677:[pid 728] 1373691542.250482 sendto(5, "<140>anaconda: Problems running the window manager: 1\0", 54, 0, NULL, 0 <unfinished ...> 1681:[pid 728] 1373691542.267414 write(6, "ERR anaconda: Problems running the window manager: 1\n", 53 <unfinished ...> 1700:[pid 728] 1373691542.278728 write(7, "ERR: Problems running the window manager: 1\n", 44 <unfinished ...> Created attachment 773686 [details] [x7.strace] strace log showing network port 8080 being opened (In reply to Kamil Páral from comment #18) ... > Anaconda 693 is waiting like this: > > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 26 > setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > bind(26, {sa_family=AF_INET, sin_port=htons(8080), ... The attached log file (x7.strace) shows a very similar sequence. The code is opening network port 8080. Why is it doing that? (Unfortunately, I didn't record what processes correspond to the process ids.) $ less -N x7.strace: ... 11264 [pid 682] 1373698241.012535 open("/tmp/.pdbrc", O_RDONLY) = -1 ENOENT (No such file or directory) 11265 [pid 682] 1373698241.012743 open(".pdbrc", O_RDONLY) = -1 ENOENT (No such file or directory) 11266 [pid 682] 1373698241.012936 write(1, "Serving on port 8080\n", 21) = -1 EIO (Input/output error) 11267 [pid 682] 1373698241.013091 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 32 11268 [pid 682] 1373698241.014071 setsockopt(32, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 11269 [pid 682] 1373698241.014596 bind(32, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 11270 [pid 682] 1373698241.015057 getsockname(32, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0 11271 [pid 682] 1373698241.015340 listen(32, 5) = 0 11272 [pid 682] 1373698241.015611 accept(32, <unfinished ...> ... Is anaconda running the Extended Python debugger? $ grep Serving /usr/lib/python2.7/site-packages/epdb/*.py /usr/lib/python2.7/site-packages/epdb/epdb.py: print 'Serving on port %s' % port /usr/lib/python2.7/site-packages/epdb/epdb.py: print 'Serving on port %s' % port $ rpm -q python-epdb python-epdb-0.11-9.fc19.noarch (In reply to Steve Tyler from comment #32) > Is anaconda running the Extended Python debugger? $ grep -R epdb anaconda-19.30.13-1 anaconda-19.30.13-1/anaconda: import epdb anaconda-19.30.13-1/anaconda: epdb.serve(skip=1) (In reply to Kamil Páral from comment #19) > Created attachment 764562 [details] > anaconda-693.strace This log file also shows epdb (Extended Python debugger) being accessed. This string is being written to stdout, although the write fails: "Serving on port 8080\n". Kamil: The strace "-ttt" option will put timestamps on strace records: man strace. Process 693 attached ... open("/usr/lib/python2.7/site-packages/epdb/telnetclient.pyc", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0100644) = 30 ... open("/tmp/.pdbrc", O_RDONLY) = -1 ENOENT (No such file or directory) open(".pdbrc", O_RDONLY) = -1 ENOENT (No such file or directory) write(1, "Serving on port 8080\n", 21) = -1 EIO (Input/output error) socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 26 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(26, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 ... accept(26, <unfinished ...> ... Now I remember that x7.strace captured pid 1 as well as the anaconda processes, and that it proceeded through the hang until reboot. You can see anaconda (682) being sent SIGTERM twice and then SIGKILL. Note that in line 11272, anaconda calls accept() *after* receiving SIGTERM. In line 14762, anaconda is finally terminated by SIGKILL. $ egrep -n 'pid (682| 1)' x7.strace | egrep 'accept|kill|SIGKILL|SIGTERM|exit' ... 8217:[pid 1] 1373698240.404294 kill(682, SIGTERM <unfinished ...> 8219:[pid 682] 1373698240.404314 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} --- 8222:[pid 1] 1373698240.404347 <... kill resumed> ) = 0 8223:[pid 1] 1373698240.404448 kill(682, SIGCONT <unfinished ...> 8231:[pid 1] 1373698240.404651 <... kill resumed> ) = 0 ... 8589:[pid 1] 1373698240.444575 kill(682, SIGTERM) = 0 ... 8672:[pid 682] 1373698240.449700 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} --- ... 11272:[pid 682] 1373698241.015611 accept(32, <unfinished ...> 11309:[pid 682] 1373698241.345584 <... accept resumed> 0x7fff7a27a0c0, [16]) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 11311:[pid 682] 1373698241.346185 accept(32, <unfinished ...> 14727:[pid 1] 1373698330.456232 kill(682, SIGKILL <unfinished ...> 14730:[pid 1] 1373698330.462081 <... kill resumed> ) = 0 14731:[pid 1] 1373698330.462104 kill(682, SIGCONT) = 0 14762:[pid 682] 1373698330.471266 +++ killed by SIGKILL +++ 14771:[pid 1] 1373698330.472480 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=682, si_status=SIGKILL, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0 14785:[pid 1] 1373698330.473539 waitid(P_PID, 682, {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=682, si_status=SIGKILL, si_utime=0, si_stime=0}, WEXITED, NULL) = 0 14819:[pid 1] 1373698330.478360 accept4(9, {sa_family=AF_LOCAL, NULL}, [2], SOCK_CLOEXEC) = 14 15399:[pid 1] 1373698330.508387 kill(665, SIGTERM <unfinished ...> 15574:[pid 1] 1373698330.519215 <... kill resumed> ) = 0 15575:[pid 1] 1373698330.519224 kill(665, SIGCONT) = 0 (In reply to Kamil Páral from comment #4) > I booted F19 TC3 DVD 5 times, and hit Quit right after displaying anaconda. > Once it rebooted immediately, 4 times it took about 1:30 min. x7.strace confirms your time. The time between anaconda calling accept() and pid 1 sending anaconda SIGKILL is about 1.5 minutes: 1373698330.456232 - 1373698241.346185 = 89.110047 seconds Snippet from x7.strace: $ egrep -n 'pid (682| 1)' x7.strace | egrep 'accept|kill|SIGKILL|SIGTERM|exit' ... 11311:[pid 682] 1373698241.346185 accept(32, <unfinished ...> ... 14727:[pid 1] 1373698330.456232 kill(682, SIGKILL <unfinished ...> ... [Copied from Bug 994188, Comment 1] Brian C. Lane 2013-08-06 17:03:11 UTC Found the problem, patch posted to anaconda-patches. Amount to not using os.system to call systemctl for shutdown since it blocks. Thanks, Brian: [PATCH] Don't wait for systemctl shutdown command to exit (#974383) https://lists.fedorahosted.org/pipermail/anaconda-patches/2013-August/005112.html or use the "--no-block" option of systemctl Created attachment 787447 [details]
[20.6-1] screenshot showing "4m[terminated]" message that is displayed for about 7 seconds
With anaconda 20.6-1, I'm still seeing the "4m[terminated]" message as shown in the attached screenshot. Reboot delay is about 7 seconds.
Procedure:
1. Start installer from rawhide boot.iso:
$ qemu-kvm -m 4096 -hda f20-test-1.img -cdrom ~/xfr/fedora/rawhide/boot.iso -vga std -boot menu=on
2. After the Welcome dialog is displayed, click Quit.
3. Click "Yes" and start counting seconds.
The "4m[terminated]" message is displayed for about 7 seconds.
The rawhide boot.iso doesn't have a version number:
$ isoinfo -lR -i boot.iso | egrep 'vmlinuz|squashfs'
-rwxr-xr-x 2 0 0 5653880 Aug 14 2013 [ 31326 00] vmlinuz
-rw-r--r-- 1 0 0 222838784 Aug 15 2013 [ 49515 00] squashfs.img
-rwxr-xr-x 2 0 0 5653880 Aug 14 2013 [ 31326 00] vmlinuz
(In reply to Steve Tyler from comment #40) ... > The "4m[terminated]" message is displayed for about 7 seconds. ... Just to be clear, that message looks unpolished at best and corrupt at worst. Users shouldn't be seeing it all ... As reported in Comment 24, the message appears to be coming from tmux. That has nothing at all to do with this bug, though. |