Bug 822745

Summary:	Remote gkrellm network link hangs and can't be restarted
Product:	[Fedora] Fedora	Reporter:	David A. De Graaf <dad>
Component:	gkrellm	Assignee:	Hans de Goede <hdegoede>
Status:	CLOSED WONTFIX	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	16	CC:	hdegoede, ville.skytta
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	917732		Environment:
Last Closed:	2013-02-13 14:31:08 UTC	Type:	Bug

Description David A. De Graaf 2012-05-18 03:30:32 UTC

Description of problem:
gkrellm -s <remotesys> freezes up and can't be restarted

Version-Release number of selected component (if applicable):
gkrellm-2.3.5-5.fc16.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1.  Run a display of a remote machine
2.  After a while, it freezes.
3.  It cannot be restarted.  Only a reboot of the remote machine fixes it.
  
Actual results:
Unreliable performance

Expected results:
Reliable performance

Additional info:
I monitor remote machines with gkrellm, i.e.,
    gkrellm -s octopus &

Normally, this works perfectly, but every now and then, the display
either freezes up with the title bar blinking red, or disappears.
And then it cannot be restarted.

If I try to restart the display (omitting the "&"), nothing happens;
the command will wait forever without starting a new display for octopus.

If I then ssh to octopus and issue this command
    systemctl restart gkrellmd.service
it takes about 1 minute 20 seconds before systemd is able to restart this
service.  When it finally does, the hanging request for a new display
abruptly terminates.  Reissuing that request hangs again.

In some mysterious way, the gkrellm network link gets into a
zombie-like state that can't be resuscitated by restarting the remote
gkrellmd daemon.  The only sure way to get the display working again is
to reboot the remote machine.  That's too drastic a step, usually.

I have no idea how this can happen, but here are some possible clues:

The failure seems related to intermittency of the network link.
Two local machines that are on wired ethernet connections rarely, if
ever, fail.  Two other local machines with wireless connections fail,
say, every few days.  I've recently upgraded the wireless devices from
DLink 802.11g to a TP-Link 802.11n router and Rosewill RNX-N180UBE
adapters with bigger directional antennae.  These use the r8712u driver
from the staging area.  This setup seems to provide a stronger and
more stable wireless link.  The gkrellm failure rate has diminished,
but is still too frequent.

Octopus is hundreds of miles away, connected by
an ipsec tunnel that has been quite reliable until recently.  Now the
display is down and I daren't reboot the machine.

BZ 795141 may be related to my problem, but seems focused on a
particular wireless adapter and kernel support that are different from
mine.

I am currently running Fedora 16 with all the latest updates.  The
problem has existed for many previous kernels and Fedora releases.

Comment 1 Hans de Goede 2012-05-18 07:08:49 UTC

If only a reboot fixes things this sounds like it is likely a kernel issue, or an issue with some stateful firewall in between. I'm afraid I think there is very little we can do without further debugging by you.

I assume you've had this setup for a long time, and that it used to work fine before? In that case, can you think of any changes on either the gkrellm client or server side which happened around the time you first started seeing this problem?

Is the server (so the one which needs a reboot to fix) running a firewall? Have you tried restarting the firewall? (so clear all tables, then reload).

Can you try using an older kernel on the server?

Comment 2 David A. De Graaf 2012-05-18 17:56:56 UTC

Hans, I'm rather bewildered by this buggy behaviour.  I've delayed
reporting it precisely because I haven't any useful data to help debug
it.  I can't correlate the onset of these hangups with any specific
update;  I can't recall when remote gkrellm just worked without
hangups.  I simply guessed that the hangups were due to brief network
dropouts, which seems reasonable to me, although I am disappointed 
that the communication protocol is not more robust.

What seems unreasonable to me is that some state info is preserved
even when gkrellmd.service is stopped, and restarted.  One would expect
that all memory of the defective session would be expunged.
Where can this memory be hiding?

What's perplexing is the excessive time it takes for
   systemctl stop gkrellmd.service
to bring gkrellmd to a stop - about 90 seconds.  I'm tempted to blame
systemd for yet another breakdown, but that would be unfair.

Firewalls are not to blame;  within my LAN there are none, and the
distant machine talks to me through an ipsec encrypted tunnel.
The hangups occur between machines currently running
kernel-3.3.5-2.fc16.x86_64, but the frequency seems not dependent on
which kernel was in use.

I have no older kernels available to test, unfortunately.

I wonder, more than anything, if anybody else has this problem.
Doesn't anyone else use the -s option to monitor remote machines?
Do they experience hangups, too?

What info can you think of that I can supply?
Here are the last few lines of   strace gkrellm -s octopus:
  ...
  close(5)                                = 0
  socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
  connect(5, {sa_family=AF_INET, sin_port=htons(19150), sin_addr=inet_addr("192.168.1.2")}, 16) = 0
  sendto(5, "gkrellm 2.3.5\n", 14, 0, NULL, 0) = 14
  select(6, [5], NULL, NULL, {15, 0})     = 0 (Timeout)
  select(6, [5], NULL, NULL, {15, 0})     = 0 (Timeout)
  select(6, [5], NULL, NULL, {15, 0})     = 0 (Timeout)
  select(6, [5], NULL, NULL, {15, 0}

The timeout repeats about every 15 sec.
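
The handshake visible in the strace output can be reproduced outside gkrellm to test whether the remote end ever answers. The following is only a rough sketch: the function name probe_gkrellmd is hypothetical, the port and greeting string are copied from the trace above, and nothing is assumed about what a healthy gkrellmd sends back beyond "some bytes".

```python
import select
import socket


def probe_gkrellmd(host, port=19150, hello=b"gkrellm 2.3.5\n", timeout=15.0):
    """Connect, send the client greeting, and wait once for any reply.

    Returns "reply" if the server sent data, "closed" if it closed the
    connection without sending anything, and "timeout" if nothing arrived
    within `timeout` seconds (the situation shown in the strace output).
    """
    # Greeting and port are taken from the trace; they are not verified
    # against the gkrellm sources.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(hello)
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            return "timeout"
        data = sock.recv(4096)
        return "reply" if data else "closed"
```

A result of "timeout" would mean the TCP connection was accepted but nothing came back, which matches the repeating select() timeouts in the trace.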

Comment 3 Hans de Goede 2012-05-19 08:59:26 UTC

Given that a daemon restart does not fix this, whereas a reboot does, it is likely that the cause of this lies outside of gkrellm.

About the long time it takes systemctl to stop gkrellmd, that is because systemd actually tracks running processes, so it does not just send a signal to stop gkrellmd, it then actually waits for the process to exit. There is a 90 sec timeout on this after which systemd kills the process the hard way. So it seems that gkrellmd is stuck in such a state that it won't exit properly either.
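
The stop behaviour described here (signal, wait, then kill the hard way) can be illustrated with a small sketch. This is not systemd's implementation, only the same pattern applied to a child process; the helper name stop_with_timeout is hypothetical and the 90 second default is simply the figure quoted above.

```python
import subprocess


def stop_with_timeout(proc, timeout=90.0):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL.

    `proc` is a subprocess.Popen object. Returns "exited" if the process
    stopped after SIGTERM and "killed" if it had to be force-killed.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout)
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed"
```

A process that ignores or cannot act on SIGTERM takes the full timeout before it is killed, which would account for a stop that takes roughly 90 seconds.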

Have you checked after waiting for the systemd timeout that the gkrellmd process really is gone? Maybe it sticks around as a zombie (and somehow is still holding open the listening port).
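
One way to check whether anything is still holding the port after the daemon is gone is to look at the kernel's socket table. The sketch below parses text in the /proc/net/tcp layout and lists sockets bound to a given local port; the function name sockets_on_port is hypothetical, and the state codes are the usual Linux TCP state numbers.

```python
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def sockets_on_port(proc_net_tcp_text, port=19150):
    """Return (state, rx_queue, inode) for each socket with local port `port`.

    `proc_net_tcp_text` is the full text of /proc/net/tcp (or tcp6).
    """
    rows = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 10:
            continue
        # local_address is HEXIP:HEXPORT; only the port is needed here
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        state = TCP_STATES.get(fields[3], fields[3])
        rx_queue = int(fields[4].split(":")[1], 16)  # field is tx_queue:rx_queue
        rows.append((state, rx_queue, int(fields[9])))
    return rows
```

On the server this could be called as sockets_on_port(open("/proc/net/tcp").read()). Entries that remain for port 19150 after gkrellmd has been stopped, or entries with an inode of 0, would be worth including in a kernel bug report.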

Other ideas I have are:
- stop gkrellmd, try restarting the IPsec connection (if possible), start gkrellmd
- stop gkrellmd, start gkrellmd on a different port

Comment 4 David A. De Graaf 2012-05-19 17:17:07 UTC

> About the long time it takes systemctl to stop gkrellmd, that is because
> systemd actually tracks running processes, so it does not just send a signal to
> stop gkrellmd, it then actually waits for the process to exit. There is a 90
> sec timeout on this after which systemd kills the process the hard way. So it
> seems that gkrellmd is stuck in such a state that it won't exit properly
> either.

Is that not a bug?  Surely there's a more correct way for systemctl to stop this process.  Perhaps this is one of the reasons why shutting down a system has become such a long-winded and frustrating procedure.
But that's beside the point here.

I did confirm that  systemctl stop gkrellmd.service   does kill the gkrellmd process.  Both  ps f -ely   and   pgrep gkrellmd   show no such process.

I also tested your other very good ideas, sadly with negative results.
I stopped gkrellmd on the distant machine, stopped the ipsec tunnel,
restarted the ipsec tunnel, and restarted gkrellmd.  I still could not start a new display locally, e.g.,    gkrellm -s octopus

Then I tried a different port.  On the distant machine I edited /etc/gkrellmd.conf
to say 
   #port 19150
   port 19151
and restarted gkrellmd.service.

On the local machine I ran
    gkrellm -s octopus --port 19151
with the same result - the command just hung without starting the display.

Would you agree that this points to a kernel problem?

Comment 5 Hans de Goede 2012-05-19 17:20:30 UTC

(In reply to comment #4)
> Would you agree that this points to a kernel problem?

It seems it does, yes.

Comment 6 David A. De Graaf 2012-05-22 16:39:20 UTC

Here's an interesting, but probably irrelevant, discovery.

Although I cannot restart    gkrellm -s octopus   because, apparently,
the kernel on octopus is in some sort of zombie state with respect to
gkrellmd, I can display octopus activity by this devious method:

  $ nohup ssh -Y octopus gkrellm &
  
This runs gkrellm on octopus instead of my local datium machine, but 
displays it locally via the    ssh -Y    trusted X11 forwarding.  
gkrellmd is not involved at all.

It is interesting to note that the bandwidth used is a steady 20K (I
think this means 20 kilobits/sec), while the other ethernet-connected
-s  displays run in spurts between 700 and 2.6K, and the
wireless-connected  -s  displays consume 1.5KB to 3.0KB.

Evidently the X11 forwarding is much less efficient.

Comment 7 Fedora End Of Life 2013-01-16 13:39:15 UTC

This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 8 Fedora End Of Life 2013-02-13 14:31:10 UTC

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.