+++ This bug was initially created as a clone of Bug #822745 +++

Description of problem:
gkrellm -s <remotesys> freezes up and can't be restarted

Version-Release number of selected component (if applicable):
gkrellm-2.3.5-5.fc16.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Run a display of a remote machine
2. After a while, it freezes.
3. It cannot be restarted. Only a reboot of the remote machine fixes it.

Actual results:
Unreliable performance

Expected results:
Reliable performance

Additional info:
I monitor remote machines with gkrellm, i.e.,
    gkrellm -s octopus &
Normally, this works perfectly, but every now and then, the display either freezes up with the title bar blinking red, or disappears. And then it cannot be restarted.

If I try to restart the display (omitting the "&"), nothing happens; the command will wait forever without starting a new display for octopus. If I then ssh to octopus and issue this command
    systemctl restart gkrellmd.service
it takes about 1:20 m:s before systemd is able to restart this service. When it finally does, the hanging request for a new display abruptly terminates. Reissuing that request still hangs.

In some mysterious way, the gkrellm network link gets into a zombie-like state that can't be resuscitated by restarting the remote gkrellmd daemon. The only sure way to get the display working again is to reboot the remote machine. That's too drastic a step, usually.

I have no idea how this can happen, but here are some possible clues:

The failure seems related to intermittency of the network link. Two local machines that are on wired ethernet connections rarely, if ever, fail. Two other local machines with wireless connections fail, say, every few days. I've recently upgraded the wireless devices from DLink 802.11g to a TP-Link 802.11n router and Rosewill RNX-N180UBE adapters with bigger directional antennae. These use the r8712u driver from the staging area.
This setup seems to provide a stronger and more stable wireless link. The gkrellm failure rate has diminished, but is still too frequent.

Octopus is hundreds of miles away, connected by an ipsec tunnel that has been quite reliable until recently. Now the display is down and I daren't reboot the machine.

BZ 795141 may be related to my problem, but seems focused on a particular wireless adapter and kernel support that are different from mine.

I am currently running Fedora 16 with all the latest updates. The problem has existed for many previous kernels and Fedora releases.

--- Additional comment from Hans de Goede on 2012-05-18 03:08:49 EDT ---

If only a reboot fixes things this sounds like it is likely a kernel issue, or an issue with some stateful firewall in between. I'm afraid I think there is very little we can do without further debugging by you.

I assume you've had this setup for a longer period of time? And that it used to work fine before? In that case can you think of any changes on either the gkrellm client or server side which happened around the time you first started seeing this problem?

Is the server (so the one which needs a reboot to fix) running a firewall? Have you tried restarting the firewall? (so clear all tables, then reload).

Can you try using an older kernel on the server?

--- Additional comment from David A. De Graaf on 2012-05-18 13:56:56 EDT ---

Hans, I'm rather bewildered by this buggy behaviour. I've delayed reporting it precisely because I haven't any useful data to help debug it.

I can't correlate the onset of these hangups with any specific update; I can't recall when remote gkrellm just worked without hangups. I simply guessed that the hangups were due to brief network dropouts, which seems reasonable to me, although I am disappointed that the communication protocol is not more robust.

What seems unreasonable to me is that some state info is preserved even when gkrellmd.service is stopped, and restarted.
One would expect that all memory of the defective session would be expunged. Where can this memory be hiding?

What's perplexing is the excessive time it takes for
    systemctl stop gkrellmd.service
to bring gkrellmd to a stop - about 90 seconds. I'm tempted to blame systemd for yet another breakdown, but that would be unfair.

Firewalls are not to blame; within my LAN there are none, and the distant machine talks to me through an ipsec encrypted tunnel.

The hangups occur between machines currently running kernel-3.3.5-2.fc16.x86_64, but the frequency seems not dependent on which kernel was in use. I have no older kernels available to test, unfortunately.

I wonder, more than anything, if anybody else has this problem. Doesn't anyone else use the -s option to monitor remote machines? Do they experience hangups, too?

What info can you think of that I can supply? Here are the last few lines of strace gkrellm -s octopus:
...
close(5) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(19150), sin_addr=inet_addr("192.168.1.2")}, 16) = 0
sendto(5, "gkrellm 2.3.5\n", 14, 0, NULL, 0) = 14
select(6, [5], NULL, NULL, {15, 0}) = 0 (Timeout)
select(6, [5], NULL, NULL, {15, 0}) = 0 (Timeout)
select(6, [5], NULL, NULL, {15, 0}) = 0 (Timeout)
select(6, [5], NULL, NULL, {15, 0}
The timeout repeats about every 15 sec.

--- Additional comment from Hans de Goede on 2012-05-19 04:59:26 EDT ---

Given that a daemon restart does not fix this, whereas a reboot does, it is likely that the cause of this lies outside of gkrellm.

About the long time it takes systemctl to stop gkrellmd, that is because systemd actually tracks running processes, so it does not just send a signal to stop gkrellmd, it then actually waits for the process to exit. There is a 90 sec timeout on this after which systemd kills the process the hard way. So it seems that gkrellmd is stuck in such a state that it won't exit properly either.
Have you checked after waiting for the systemd timeout that the gkrellmd process really is gone? Maybe it sticks around as a zombie (and somehow is still holding open the listening port).

Other ideas I have are:
- stop gkrellmd, try restarting the IPsec connection (if possible), start gkrellmd
- stop gkrellmd, start gkrellmd on a different port

--- Additional comment from David A. De Graaf on 2012-05-19 13:17:07 EDT ---

> About the long time it takes systemctl to stop gkrellmd, that is because
> systemd actually tracks running process, so it does not just send a signal
> to stop gkrellmd, it then actually waits for the process to exit. There is
> a 90 sec timeout on this after which systemd kills the process the hard way.
> So it seems that gkrellmd is stuck in such a state that it won't exit
> properly either.

Is that not a bug? Surely there's a more correct way for systemctl to stop this process. Perhaps this is one of the contributing factors why shutting down a system has become such a long-winded and frustrating procedure. But that's beside the point here.

I did confirm that
    systemctl stop gkrellmd.service
does kill the gkrellmd process. Both
    ps f -ely
and
    pgrep gkrellmd
show no such process.

I also tested your other very good ideas, sadly with negative results. I stopped gkrellmd on the distant machine, stopped the ipsec tunnel, restarted the ipsec tunnel, restarted gkrellmd. Then I could still not locally start a new display, e.g.,
    gkrellm -s octopus

Then I tried a different port. On the distant machine I edited /etc/gkrellmd.conf to say
    #port 19150
    port 19151
and restarted gkrellmd.service. On the local machine I ran
    gkrellm -s octopus --port 19151
with the same result - the command just hung without starting the display.

Would you agree that this points to a kernel problem?

--- Additional comment from Hans de Goede on 2012-05-19 13:20:30 EDT ---

(In reply to comment #4)
> Would you agree that this points to a kernel problem?

It seems it does, yes.
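For readers following the exchange above: the stop behaviour Hans describes (send a signal, wait for the process to exit, then kill it the hard way once a timeout expires) can be sketched roughly as follows. This is only an illustration of the pattern, not systemd's actual implementation; the function name `stop_with_timeout` is made up for this example, and the 90 second figure comes from the thread.

```python
import subprocess
import sys


def stop_with_timeout(proc: subprocess.Popen, timeout: float) -> str:
    """Ask a child process to exit; force-kill it if it has not exited in time.

    Hypothetical helper illustrating the "signal, wait, then kill" pattern
    described in the thread. It is not how systemd itself is written.
    """
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=timeout)
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()  # the "hard way" (SIGKILL on POSIX)
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed"


if __name__ == "__main__":
    # A child that simply sleeps; it exits promptly on the polite request.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_with_timeout(child, timeout=5.0))
```

If the helper returns "killed" only after the full timeout, the process ignored or could not act on the polite request, which would match the roughly 90 second delay reported here.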
--- Additional comment from David A. De Graaf on 2012-05-22 12:39:20 EDT ---

Here's an interesting, but probably irrelevant, discovery. Although I cannot restart
    gkrellm -s octopus
because, apparently, the kernel on octopus is in some sort of zombie state with respect to gkrellmd, I can display octopus activity by this devious method:
    $ nohup ssh -Y octopus gkrellm &
This runs gkrellm on octopus instead of my local datium machine, but displays it locally via the ssh -Y trusted X11 forwarding. gkrellmd is not involved at all.

It is interesting to note that the bandwidth used is a steady 20K (I think this means 20 Kilobits/sec), while the other ethernet-connected -s displays run in spurts between 700 and 2.6K, while the wireless-connected -s displays consume 1.5KB to 3.0KB. Evidently the X11 forwarding is much less efficient.

--- Additional comment from Fedora End Of Life on 2013-01-16 08:39:15 EST ---

This message is a reminder that Fedora 16 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events.
Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

--- Additional comment from Fedora End Of Life on 2013-02-13 09:31:10 EST ---

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
Sadly, this bug is alive and well even though F16 has expired.

I've learned a new command,
    lsof -n -i -P
and can report a new factoid that may be helpful. With the display of the remote machine hung, e.g., gkrellm -s octopus, when I log in to that remote machine and run that lsof command I get two relevant lines:
    gkrellmd 11478 gkrellmd 3u IPv4 5684634 0t0 TCP *:19150 (LISTEN)
    gkrellmd 11478 gkrellmd 4u IPv6 5684635 0t0 TCP *:19150 (LISTEN)
showing that gkrellmd is still listening on port 19150 over both IPv4 and IPv6.

Then I stop gkrellmd on the remote machine,
    systemctl stop gkrellmd.service
and verify that gkrellmd is no longer listening on either port.

Then I edit /etc/gkrellmd.conf to comment out this line:
    ## allow-host ::ffff:127.0.0.1
Then I restart gkrellmd and observe, to my surprise, that TWO lines are still present; it is listening on both IPv4 and IPv6.

This suggests to me that the very long shutdown orchestrated by systemd was improper or inadequate in some way, and that the restart was not a full and complete restart. The restart evidently failed to reread /etc/gkrellmd.conf because it failed to notice my instruction to not listen on IPv6.

My suspicions that systemd is culpable are renewed. At the least, it is not acceptable that systemd should take about 90 sec. to stop this or any process. That's just absurd.
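The lsof check described above could be scripted so that the listening sockets can be compared before and after a restart. The sketch below parses text in the column layout shown in the two sample lines (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME, followed by the state). The names `Listener` and `parse_listeners` are invented for this example, and it reports only sockets in the LISTEN state, so it says nothing about established or half-dead client connections.

```python
from typing import List, NamedTuple


class Listener(NamedTuple):
    command: str
    pid: int
    family: str   # "IPv4" or "IPv6", taken from the TYPE column
    address: str  # e.g. "*:19150", taken from the NAME column


def parse_listeners(lsof_output: str) -> List[Listener]:
    """Extract listening TCP sockets from `lsof -n -i -P` style output.

    Assumes whitespace-separated columns in the order
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME (STATE).
    """
    listeners = []
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and anything not in LISTEN state.
        if len(fields) < 10 or fields[-1] != "(LISTEN)":
            continue
        if not fields[1].isdigit():
            continue
        listeners.append(
            Listener(fields[0], int(fields[1]), fields[4], fields[8])
        )
    return listeners


if __name__ == "__main__":
    sample = (
        "gkrellmd 11478 gkrellmd 3u IPv4 5684634 0t0 TCP *:19150 (LISTEN)\n"
        "gkrellmd 11478 gkrellmd 4u IPv6 5684635 0t0 TCP *:19150 (LISTEN)\n"
    )
    for entry in parse_listeners(sample):
        print(entry)
```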
Sorry, I'm wrong. I just edited /etc/gkrellmd.conf differently - overriding the default port -
    #port 19150
    port 19151
and found that after stopping and restarting gkrellmd, the two listening lines were listening on the revised port 19151. So my theory that /etc/gkrellmd.conf is not being reread is totally wrong.

However, an attempt to use that new listening port still fails -
    gkrellm --port 19151 -s octopus
although
    telnet octopus 19151
does confirm that something is listening. Puzzling!
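One possible reading of the telnet result above: a TCP connect can succeed as soon as the remote kernel completes the handshake, even if the daemon never goes on to answer, so "something is listening" and "the daemon replies" are separate questions. The sketch below separates the two. The greeting bytes are the ones visible in the earlier strace output; the function name `probe`, the result strings, and the assumption that a healthy server sends anything at all in response are mine, not taken from gkrellm's documentation.

```python
import socket


def probe(host: str, port: int,
          greeting: bytes = b"gkrellm 2.3.5\n",
          timeout: float = 15.0) -> str:
    """Connect, send a greeting, and report whether any reply arrives.

    Hypothetical diagnostic helper: it does not speak the real gkrellm
    protocol beyond the greeting line seen in the strace output.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect-failed"      # refused, unreachable, or timed out
    with sock:
        try:
            sock.sendall(greeting)
            data = sock.recv(4096)   # bounded by the socket timeout set above
        except socket.timeout:
            return "connected-no-reply"
        except OSError:
            return "connection-error"
    return "replied" if data else "closed-by-peer"


if __name__ == "__main__":
    # Host and port as used in the thread; adjust for your own setup.
    print(probe("octopus", 19151))
```

A result of "connected-no-reply" would correspond to the repeated select() timeouts in the strace output: the connection is accepted at the TCP level but no data ever comes back.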
This message is a reminder that Fedora 17 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.