Bug 822745
Summary: | Remote gkrellm network link hangs and can't be restarted | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | David A. De Graaf <dad> |
Component: | gkrellm | Assignee: | Hans de Goede <hdegoede> |
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | | |
Version: | 16 | CC: | hdegoede, ville.skytta |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | | |
: | 917732 | Environment: | |
Last Closed: | 2013-02-13 14:31:08 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
David A. De Graaf
2012-05-18 03:30:32 UTC
If only a reboot fixes things this sounds like it is likely a kernel issue, or an issue with some stateful firewall in between. I'm afraid I think there is very little we can do without further debugging by you.

I assume you've had this setup for a longer period of time? And that it used to work fine before? In that case can you think of any changes on either the gkrellm client or server side which happened around the time you first started seeing this problem?

Is the server (so the one which needs a reboot to fix) running a firewall? Have you tried restarting the firewall? (so clear all tables, then reload).

Can you try using an older kernel on the server?

Hans, I'm rather bewildered by this buggy behaviour. I've delayed reporting it precisely because I haven't any useful data to help debug it. I can't correlate the onset of these hangups with any specific update; I can't recall when remote gkrellm just worked without hangups. I simply guessed that the hangups were due to brief network dropouts, which seems reasonable to me, although I am disappointed that the communication protocol is not more robust.

What seems unreasonable to me is that some state info is preserved even when gkrellmd.service is stopped and restarted. One would expect that all memory of the defective session would be expunged. Where can this memory be hiding?

What's perplexing is the excessive time it takes for systemctl stop gkrellmd.service to bring gkrellmd to a stop - about 90 seconds. I'm tempted to blame systemd for yet another breakdown, but that would be unfair.

Firewalls are not to blame; within my LAN there are none, and the distant machine talks to me through an ipsec encrypted tunnel.

The hangups occur between machines currently running kernel-3.3.5-2.fc16.x86_64, but the frequency seems not dependent on which kernel was in use. I have no older kernels available to test, unfortunately.

I wonder, more than anything, if anybody else has this problem.
Doesn't anyone else use the -s option to monitor remote machines? Do they experience hangups, too?

What info can you think of that I can supply? Here are the last few lines of strace gkrellm -s octopus:

    ...
    close(5) = 0
    socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
    connect(5, {sa_family=AF_INET, sin_port=htons(19150), sin_addr=inet_addr("192.168.1.2")}, 16) = 0
    sendto(5, "gkrellm 2.3.5\n", 14, 0, NULL, 0) = 14
    select(6, [5], NULL, NULL, {15, 0}) = 0 (Timeout)
    select(6, [5], NULL, NULL, {15, 0}) = 0 (Timeout)
    select(6, [5], NULL, NULL, {15, 0}) = 0 (Timeout)
    select(6, [5], NULL, NULL, {15, 0}

The timeout repeats about every 15 sec.

Given that a daemon restart does not fix this, whereas a reboot does, it is likely that the cause of this lies outside of gkrellm.

About the long time it takes systemctl to stop gkrellmd, that is because systemd actually tracks running processes, so it does not just send a signal to stop gkrellmd, it then actually waits for the process to exit. There is a 90 sec timeout on this after which systemd kills the process the hard way. So it seems that gkrellmd is stuck in such a state that it won't exit properly either.

Have you checked after waiting for the systemd timeout that the gkrellmd process really is gone? Maybe it sticks around as a zombie (and somehow is still holding open the listening port).

Other ideas I have are:
- stop gkrellmd, try restarting the IPsec connection (if possible), start gkrellmd
- stop gkrellmd, start gkrellmd on a different port

> About the long time it takes systemctl to stop gkrellmd, that is because
> systemd actually tracks running processes, so it does not just send a signal
> to stop gkrellmd, it then actually waits for the process to exit. There is a
> 90 sec timeout on this after which systemd kills the process the hard way. So
> it seems that gkrellmd is stuck in such a state that it won't exit properly
> either.

Is that not a bug? Surely there's a more correct way for systemctl to stop this process.
Perhaps this is one of the contributing factors why shutting down a system has become such a long-winded and frustrating procedure. But that's beside the point here.

I did confirm that systemctl stop gkrellmd.service does kill the gkrellmd process. Both ps f -ely and pgrep gkrellmd show no such process.

I also tested your other very good ideas, sadly with negative results. I stopped gkrellmd on the distant machine, stopped the ipsec tunnel, restarted the ipsec tunnel, and restarted gkrellmd. I still could not start a new display locally with, e.g.,

    gkrellm -s octopus

Then I tried a different port. On the distant machine I edited /etc/gkrellmd.conf to say

    #port 19150
    port 19151

and restarted gkrellmd.service. On the local machine I ran

    gkrellm -s octopus --port 19151

with the same result - the command just hung without starting the display.

Would you agree that this points to a kernel problem?

(In reply to comment #4)
> Would you agree that this points to a kernel problem?

It seems it does, yes.

Here's an interesting, but probably irrelevant, discovery. Although I cannot restart gkrellm -s octopus because, apparently, the kernel on octopus is in some sort of zombie state with respect to gkrellmd, I can display octopus activity by this devious method:

    $ nohup ssh -Y octopus gkrellm &

This runs gkrellm on octopus instead of my local datium machine, but displays it locally via the ssh -Y trusted X11 forwarding. gkrellmd is not involved at all.

It is interesting to note that the bandwidth used is a steady 20K (I think this means 20 Kilobits/sec), while the other ethernet-connected -s displays run in spurts between 700. and 2.6K, while the wireless-connected -s display consumes 1.5KB to 3.0KB. Evidently the X11 forwarding is much less efficient.

This message is a reminder that Fedora 16 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 16.
It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
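
For anyone trying to reproduce the hang without the GUI client, the exchange visible in the strace output above (connect, send the version greeting, wait up to 15 seconds for a reply) can be approximated with a small probe. This is only a sketch: the function name probe_gkrellmd and its status labels are made up for illustration, the greeting string and default port are copied from the trace, and it does not implement the real gkrellm protocol.

```python
import select
import socket


def probe_gkrellmd(host, port=19150, hello=b"gkrellm 2.3.5\n", timeout=15.0):
    """Classify how a gkrellmd-style TCP endpoint reacts to a client greeting.

    Hypothetical helper, not part of gkrellm. Returns one of:
    "refused", "unreachable", "silent", "closed", "responded".
    """
    try:
        # Same first step as the trace: a plain TCP connect.
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"          # nothing is listening on that port
    except OSError:
        return "unreachable"      # connect timed out, no route, DNS failure, ...

    with sock:
        try:
            # Greeting copied from the sendto() call in the trace.
            sock.sendall(hello)
            # Wait for any reply, like the client's select() with {15, 0}.
            readable, _, _ = select.select([sock], [], [], timeout)
            if not readable:
                return "silent"   # connected, but no answer within the timeout
            data = sock.recv(4096)
        except OSError:
            return "closed"       # connection reset while talking
        return "responded" if data else "closed"


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    print(target, probe_gkrellmd(target))
```

A result of "silent" would correspond to the repeated select timeouts in the trace: the TCP connection is accepted but nothing answers the greeting. A result of "refused" would instead suggest that no process is listening on the port at all.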