917732 – Remote gkrellm network link hangs and can't be restarted

Summary: Remote gkrellm network link hangs and can't be restarted

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	gkrellm
Sub Component:
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Hans de Goede
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-03-04 16:26 UTC by David A. De Graaf
Modified:	2013-08-01 00:35 UTC
CC List:	2 users
Fixed In Version:
Clone Of:	822745
Environment:
Last Closed:	2013-08-01 00:35:35 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description David A. De Graaf 2013-03-04 16:26:29 UTC

+++ This bug was initially created as a clone of Bug #822745 +++

Description of problem:
gkrellm -s <remotesys> freezes up and can't be restarted

Version-Release number of selected component (if applicable):
gkrellm-2.3.5-5.fc16.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1.  Run a display of a remote machine
2.  After a while, it freezes.
3.  It cannot be restarted.  Only a reboot of the remote machine fixes it.
  
Actual results:
Unreliable performance

Expected results:
Reliable performance

Additional info:
I monitor remote machines with gkrellm, ie,
    gkrellm -s octopus &

Normally, this works perfectly, but every now and then, the display
either freezes up with the title bar blinking red, or disappears.
And then it cannot be restarted.

If I try to restart the display (omitting the "&"), nothing happens;
the command will wait forever without starting a new display for octopus.

If I then ssh to octopus and issue this command
    systemctl restart gkrellmd.service
it takes about 1 min 20 s before systemd is able to restart this
service.  When it finally does, the hanging request for a new display
abruptly terminates.  Reissuing that request still hangs.

In some mysterious way, the gkrellm network link gets into a
zombie-like state that can't be resuscitated by restarting the remote
gkrellmd daemon.  The only sure way to get the display working again is
to reboot the remote machine.  That's too drastic a step, usually.

I have no idea how this can happen, but here are some possible clues:

The failure seems related to intermittency of the network link.
Two local machines that are on wired ethernet connections rarely, if
ever, fail.  Two other local machines with wireless connections fail,
say, every few days.  I've recently upgraded the wireless devices from
DLink 802.11g to a TP-Link 802.11n router and Rosewill RNX-N180UBE
adapters with bigger directional antennae.  These use the r8712u driver
from the staging area.  This setup seems to provide a stronger and
more stable wireless link.  The gkrellm failure rate has diminished,
but is still too frequent.

Octopus is hundreds of miles away, connected by
an ipsec tunnel that has been quite reliable until recently.  Now the
display is down and I daren't reboot the machine.

BZ 795141 may be related to my problem, but seems focused on a
particular wireless adapter and kernel support that are different from
mine.

I am currently running Fedora 16 with all the latest updates.  The
problem has existed for many previous kernels and Fedora releases.

--- Additional comment from Hans de Goede on 2012-05-18 03:08:49 EDT ---

If only a reboot fixes things this sounds like it is likely a kernel issue, or an issue with some stateful firewall in between. I'm afraid I think there is very little we can do without further debugging by you.

I assume you've had this setup for a longer period of time? And that it used to work fine before? In that case, can you think of any changes on either the gkrellm client or server side which happened around the time you first started seeing this problem?

Is the server (so the one which needs a reboot to fix) running a firewall? Have you tried restarting the firewall? (so clear all tables, then reload).

Can you try using an older kernel on the server?

--- Additional comment from David A. De Graaf on 2012-05-18 13:56:56 EDT ---

Hans, I'm rather bewildered by this buggy behaviour.  I've delayed
reporting it precisely because I haven't any useful data to help debug
it.  I can't correlate the onset of these hangups with any specific
update;  I can't recall when remote gkrellm just worked without
hangups.  I simply guessed that the hangups were due to brief network
dropouts, which seems reasonable to me, although I am disappointed 
that the communication protocol is not more robust.

What seems unreasonable to me is that some state info is preserved
even when gkrellmd.service is stopped, and restarted.  One would expect
that all memory of the defective session would be expunged.
Where can this memory be hiding?

What's perplexing is the excessive time it takes for
   systemctl stop gkrellmd.service
to bring gkrellmd to a stop - about 90 seconds.  I'm tempted to blame
systemd for yet another breakdown, but that would be unfair.

Firewalls are not to blame;  within my LAN there are none, and the
distant machine talks to me through an ipsec encrypted tunnel.
The hangups occur between machines currently running
kernel-3.3.5-2.fc16.x86_64, but the frequency seems not dependent on
which kernel was in use.

I have no older kernels available to test, unfortunately.

I wonder, more than anything, if anybody else has this problem.
Doesn't anyone else use the -s option to monitor remote machines?
Do they experience hangups, too?

What info can you think of that I can supply?
Here are the last few lines of   strace gkrellm -s octopus:
  ...
  close(5)                                = 0
  socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
  connect(5, {sa_family=AF_INET, sin_port=htons(19150),
  sin_addr=inet_addr("192.168.1.2")}, 16) = 0
  sendto(5, "gkrellm 2.3.5\n", 14, 0, NULL, 0) = 14
  select(6, [5], NULL, NULL, {15, 0})     = 0 (Timeout)
  select(6, [5], NULL, NULL, {15, 0})     = 0 (Timeout)
  select(6, [5], NULL, NULL, {15, 0})     = 0 (Timeout)
  select(6, [5], NULL, NULL, {15, 0}

The timeout repeats about every 15 sec.
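
The trace above shows the connection succeeding, the client greeting being sent, and then repeated 15-second waits with no reply. A minimal Python sketch of the same sequence may help separate "cannot connect" from "connected, but the server never answers". The function name probe_gkrellmd and the status labels are hypothetical; the port, greeting string and timeout are copied from the trace, and nothing here is part of gkrellm itself.

```python
import socket

def probe_gkrellmd(host, port=19150, greeting=b"gkrellm 2.3.5\n", timeout=15.0):
    """Mimic the connect/send/wait sequence seen in the strace output.

    Returns (status, detail). status is one of 'connect-failed',
    'silent', 'closed' or 'replied' (labels invented for this sketch).
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # Refused, unreachable, or the connect itself timed out.
        return ("connect-failed", str(exc))
    with sock:
        sock.sendall(greeting)
        sock.settimeout(timeout)
        try:
            data = sock.recv(4096)
        except socket.timeout:
            # Same situation as the repeated select() timeouts in the trace.
            return ("silent", "")
        except OSError as exc:
            return ("closed", str(exc))
        if not data:
            # The peer closed the connection without sending anything.
            return ("closed", "")
        return ("replied", data)
```

Calling probe_gkrellmd("octopus") from the client machine would be expected to report 'silent' while the link is in the hung state described here, assuming the hang looks the same to this script as it does to gkrellm.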

--- Additional comment from Hans de Goede on 2012-05-19 04:59:26 EDT ---

Given that a daemon restart does not fix this, whereas a reboot does, it is likely that the cause of this lies outside of gkrellm.

About the long time it takes systemctl to stop gkrellmd: that is because systemd actually tracks running processes, so it does not just send a signal to stop gkrellmd, it then actually waits for the process to exit. There is a 90 sec timeout on this, after which systemd kills the process the hard way. So it seems that gkrellmd is stuck in such a state that it won't exit properly either.
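
The stop-then-wait-then-kill behaviour described above can be illustrated with a short Python sketch. This is not systemd's code, only the same pattern applied to a child process: a polite SIGTERM, a bounded wait, then SIGKILL. The function name stop_with_timeout is hypothetical, and the 90-second default simply mirrors the figure mentioned in this comment.

```python
import subprocess

def stop_with_timeout(proc, timeout=90.0):
    """Ask a child process to exit, and force it if it does not comply.

    proc is a subprocess.Popen object. Returns 'terminated' if SIGTERM
    was enough, or 'killed' if SIGKILL was needed after the timeout.
    """
    proc.terminate()  # SIGTERM: the process may handle, delay or ignore it
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL: cannot be caught or ignored
        proc.wait()   # reap the process so it does not linger as a zombie
        return "killed"
```

A process that takes the full timeout to stop, as gkrellmd does here, is one that did not exit in response to the first signal.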

Have you checked after waiting for the systemd timeout that the gkrellmd process really is gone? Maybe it sticks around as a zombie (and somehow is still holding open the listening port).

Other ideas I have are:
- stop gkrellmd, try restarting the IPsec connection (if possible), start gkrellmd
- stop gkrellmd, start gkrellmd on a different port

--- Additional comment from David A. De Graaf on 2012-05-19 13:17:07 EDT ---

> About the long time it takes systemctl to stop gkrellmd, that is because
> systemd actually tracks running processes, so it does not just send a signal to
> stop gkrellmd, it then actually waits for the process to exit. There is a 90
> sec timeout on this after which systemd kills the process the hard way. So it
> seems that gkrellmd is stuck in such a state that it won't exit properly
> either.

Is that not a bug?  Surely there's a more correct way for systemctl to stop this process.  Perhaps this is one of the contributing factors why shutting down a system has become such a long-winded and frustrating procedure.
But that's beside the point here.

I did confirm that  systemctl stop gkrellmd.service   does kill the gkrellmd process.  Both  ps f -ely   and   pgrep gkrellmd   show no such process.

I also tested your other very good ideas, sadly with negative results.
I stopped gkrellmd on the distant machine, stopped the ipsec tunnel,
restarted ipsec tunnel, restarted gkrellmd.  Then I could still not locally start a new display, eg,    gkrellm -s octopus

Then I tried a different port.  On the distant machine I edited /etc/gkrellmd.conf
to say 
   #port 19150
   port 19151
and restarted gkrellmd.service

On the local machine I ran
    gkrellm -s octopus --port 19151
with the same result - the command just hung without starting the display.

Would you agree that this points to a kernel problem?

--- Additional comment from Hans de Goede on 2012-05-19 13:20:30 EDT ---

(In reply to comment #4)
> Would you agree that this points to a kernel problem?

It seems it does, yes.

--- Additional comment from David A. De Graaf on 2012-05-22 12:39:20 EDT ---

Here's an interesting, but probably irrelevant, discovery.

Although I cannot restart    gkrellm -s octopus   because, apparently
the kernel on octopus is in some sort of zombie state with respect to
gkrellmd,  I can display octopus activity by this devious method:

  $ nohup ssh -Y octopus gkrellm &
  
This runs gkrellm on octopus instead of my local datium machine, but 
displays it locally via the    ssh -Y    trusted X11 forwarding.  
gkrellmd is not involved at all.

It is interesting to note that the bandwidth used is a steady 20K (I
think this means 20 Kilobits/sec), while the other ethernet-connected
-s  displays run in spurts between 700 and 2.6K, and the
wireless-connected  -s  display consumes 1.5KB to 3.0KB.

Evidently the X11 forwarding is much less efficient.

--- Additional comment from Fedora End Of Life on 2013-01-16 08:39:15 EST ---

This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

--- Additional comment from Fedora End Of Life on 2013-02-13 09:31:10 EST ---

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 1 David A. De Graaf 2013-03-04 16:53:40 UTC

Sadly, this bug is alive and well even though F16 has expired.

I've learned a new command,   lsof -n -i -P , and can report a new factoid that may be helpful.  With the display of the remote machine hung, eg, gkrellm -s octopus, when I log in to that remote machine and run that lsof command I get two relevant lines:
  gkrellmd  11478 gkrellmd    3u  IPv4 5684634      0t0  TCP *:19150 (LISTEN)
  gkrellmd  11478 gkrellmd    4u  IPv6 5684635      0t0  TCP *:19150 (LISTEN)
showing that gkrellmd is still listening on port 19150 over both IPv4 and IPv6.
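
For repeated checks, the relevant lsof lines can be picked out with a few lines of Python. This is only a sketch: the function name listening_sockets is hypothetical, and it assumes the default lsof column layout seen in the two lines above (COMMAND, PID, USER, FD, TYPE, DEVICE, SIZE/OFF, NODE, NAME, then the state).

```python
def listening_sockets(lsof_output, command="gkrellmd"):
    """Return (pid, ip_version, address) for each LISTEN line of one command."""
    found = []
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip headers, other commands, and sockets that are not listening.
        if len(fields) < 10 or fields[0] != command or fields[-1] != "(LISTEN)":
            continue
        found.append((int(fields[1]), fields[4], fields[8]))
    return found

# The two lines reported in this comment:
sample = """\
  gkrellmd  11478 gkrellmd    3u  IPv4 5684634      0t0  TCP *:19150 (LISTEN)
  gkrellmd  11478 gkrellmd    4u  IPv6 5684635      0t0  TCP *:19150 (LISTEN)
"""
print(listening_sockets(sample))
# [(11478, 'IPv4', '*:19150'), (11478, 'IPv6', '*:19150')]
```

An empty result after stopping the service would correspond to "no longer listening on either port".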

Then I stop gkrellmd on the remote machine,  systemctl stop gkrellmd.service,  and verify that gkrellmd is no longer listening on either port.

Then I edit /etc/gkrellmd.conf to comment out this line:
  ## allow-host   ::ffff:127.0.0.1

Then I restart gkrellmd and observe, to my surprise, that TWO lines are still present;  listening on both ports, IPv4 and IPv6.

This suggests to me that the very long shutdown orchestrated by systemd was improper or inadequate in some way, and that the restart was not a full and complete restart.  The restart evidently failed to reread /etc/gkrellmd.conf because it failed to notice my instruction to not listen on IPv6.

My suspicions that systemd is culpable are renewed.  At the least, it is not acceptable that systemd should take about 90 sec. to stop this or any process.
That's just absurd.

Comment 2 David A. De Graaf 2013-03-04 20:51:10 UTC

Sorry, I'm wrong.
I just edited /etc/gkrellmd.conf differently - overriding the default port -
  #port 19150
  port 19151
and found that after stopping and restarting gkrellmd, the two listening lines were listening on the revised port 19151.

So my theory that /etc/gkrellmd.conf is not being reread is totally wrong.

However, an attempt to use that new listening port still fails -
  gkrellm --port 19151 -s octopus
although    telnet octopus 19151   does confirm that something is listening.
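
One possible reading of the telnet result, offered as an assumption rather than a diagnosis: the kernel completes a TCP connection on a listening socket before the application calls accept(), so a successful connect only shows that a socket is listening, not that the daemon is servicing it. The self-contained Python sketch below demonstrates that effect on the loopback interface; the function name is hypothetical and no gkrellmd is involved.

```python
import socket

def connect_without_accept(timeout=1.0):
    """Connect to a listener that never calls accept(); report whether it answered."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)               # listening, but accept() is never called
    port = listener.getsockname()[1]
    # The connect still succeeds: the kernel queues it in the listen backlog.
    client = socket.create_connection(("127.0.0.1", port), timeout=timeout)
    try:
        client.sendall(b"hello\n")   # buffered by the kernel, never read
        try:
            answered = bool(client.recv(1))
        except socket.timeout:
            answered = False         # connected, yet no reply ever arrives
    finally:
        client.close()
        listener.close()
    return answered
```

The connect succeeds and the reply never comes, which looks the same from the client side as the combination of a working telnet and a hanging gkrellm. Whether that is what happens on octopus is not established here.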

Puzzling!

Comment 3 Fedora End Of Life 2013-07-03 22:36:10 UTC

This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 4 Fedora End Of Life 2013-08-01 00:35:39 UTC

Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.