Bug 168665 - Broken network loopdevice
Status: CLOSED DUPLICATE of bug 155892
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2005-09-19 05:36 EDT by Alfred Mikota
Modified: 2007-11-30 17:07 EST
CC: 4 users

Doc Type: Bug Fix
Last Closed: 2006-01-04 14:57:29 EST

Attachments
Patch (2.00 KB, patch) - 2005-09-29 15:42 EDT, Lon Hohberger
command output: ifconfig -a (4.29 KB, text/plain) - 2005-10-03 03:55 EDT, Alfred Mikota
command output: lsof (262.95 KB, text/plain) - 2005-10-03 03:56 EDT, Alfred Mikota
command output: netstat -a (61.98 KB, text/plain) - 2005-10-03 03:56 EDT, Alfred Mikota
command output: netstat -r (953 bytes, text/plain) - 2005-10-03 03:57 EDT, Alfred Mikota
command output: ps -ef (9.48 KB, text/plain) - 2005-10-03 03:58 EDT, Alfred Mikota
command output: socklist (33.16 KB, text/plain) - 2005-10-03 03:58 EDT, Alfred Mikota

Description Alfred Mikota 2005-09-19 05:36:06 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

Description of problem:
I have two cluster nodes (lunx10, lunx11) running clumanager-1.2.24-0.1.

[root@lunx11 usr]# clustat
Cluster Status - icoserve_cluster                                      11:25:57
Cluster Quorum Incarnation #13
Shared State: Shared Raw Device Driver v1.2

  Member             Status
  ------------------ ----------
  lunx11             Active     <-- You are here
  lunx10             Active

  Service        Status   Owner (Last)     Last Transition Chk Restarts
  -------------- -------- ---------------- --------------- --- --------
  aim-oracle     started  lunx11           15:01:12 Sep 14  60        0
  aim-corba      started  lunx11           15:00:39 Sep 14  60        0
  aim-core01     started  lunx11           15:02:49 Sep 14  60        0
  aim-node01     started  lunx10           09:48:30 Sep 19  60        0
  aim-node02     started  lunx10           09:49:46 Sep 19  60        0
  aim-node03     started  lunx10           09:51:00 Sep 19  60        0
  aim-nodelta    started  lunx11           15:08:47 Sep 14  60        0
  aim-icoweb     started  lunx11           15:09:59 Sep 14  60        0

Each service binds a virtual IP address (output truncated):

[root@lunx11 usr]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:09:6B:F5:BC:F8
          inet addr:160.160.2.211  Bcast:160.160.255.255  Mask:255.255.0.0
eth0:0    Link encap:Ethernet  HWaddr 00:09:6B:F5:BC:F8
          inet addr:160.160.210.3  Bcast:160.160.255.255  Mask:255.255.0.0
eth0:1    Link encap:Ethernet  HWaddr 00:09:6B:F5:BC:F8
          inet addr:160.160.210.1  Bcast:160.160.255.255  Mask:255.255.0.0
eth0:2    Link encap:Ethernet  HWaddr 00:09:6B:F5:BC:F8
          inet addr:160.160.210.2  Bcast:160.160.255.255  Mask:255.255.0.0
eth0:6    Link encap:Ethernet  HWaddr 00:09:6B:F5:BC:F8
          inet addr:160.160.210.8  Bcast:160.160.255.255  Mask:255.255.0.0
eth0:7    Link encap:Ethernet  HWaddr 00:09:6B:F5:BC:F8
          inet addr:160.160.210.4  Bcast:160.160.255.255  Mask:255.255.0.0
eth1      Link encap:Ethernet  HWaddr 00:09:6B:F5:BC:F9
          inet addr:192.168.200.11  Bcast:192.168.200.255  Mask:255.255.255.0
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0

[root@lunx10 log]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:09:6B:F5:BD:EC
          inet addr:160.160.2.210  Bcast:160.160.255.255  Mask:255.255.0.0
eth0:0    Link encap:Ethernet  HWaddr 00:09:6B:F5:BD:EC
          inet addr:160.160.210.5  Bcast:160.160.255.255  Mask:255.255.0.0
eth0:1    Link encap:Ethernet  HWaddr 00:09:6B:F5:BD:EC
          inet addr:160.160.210.6  Bcast:160.160.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
eth0:2    Link encap:Ethernet  HWaddr 00:09:6B:F5:BD:EC
          inet addr:160.160.210.7  Bcast:160.160.255.255  Mask:255.255.0.0
eth1      Link encap:Ethernet  HWaddr 00:09:6B:F5:BD:ED
          inet addr:192.168.200.10  Bcast:192.168.200.255  Mask:255.255.255.0
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0


After the cluster has been running for some time, the following messages are
written to /var/log/messages:



lunx11:
Sep 18 20:07:36 lunx11 clusvcmgrd[493]: <crit> Couldn't connect to member #0: Connection timed out
Sep 18 20:07:36 lunx11 clusvcmgrd[605]: <crit> Couldn't connect to member #0: Connection timed out
Sep 18 20:07:36 lunx11 clusvcmgrd[605]: <crit> _clu_process_unlock: cluster lock not held
Sep 18 20:07:36 lunx11 clusvcmgrd[605]: <alert> Unhandled Exception at clulock.c:408 in _clu_process_unlock
Sep 18 20:07:37 lunx11 clulockd[7373]: <warning> Denied 160.160.2.210: Broken pipe
Sep 18 20:07:37 lunx11 clulockd[7373]: <err> select error: Broken pipe
Sep 18 20:07:37 lunx11 clulockd[7373]: <warning> Denied 160.160.2.211: Broken pipe
Sep 18 20:07:37 lunx11 clulockd[7373]: <err> select error: Broken pipe
Sep 18 20:07:37 lunx11 clulockd[7373]: <warning> Denied 160.160.2.211: Broken pipe
Sep 18 20:07:37 lunx11 clulockd[7373]: <err> select error: Broken pipe
Sep 18 20:07:58 lunx11 clulockd[7373]: <warning> Potential recursive lock #5 grant to member #1, PID7401

lunx10:
Sep 18 20:07:36 lunx10 clusvcmgrd[7401]: <crit> Couldn't connect to member #0: Connection timed out



Starting from this point, network connectivity over the loopback device is
disrupted; for example, it is not possible to rcp to root@localhost.

After relocating some services to the other host, operation continues
normally for some time...

Any comments?

Thanks very much for your comments


Version-Release number of selected component (if applicable):
clumanager-1.2.24-0.1

How reproducible:
Couldn't Reproduce

Steps to Reproduce:
I am not able to reproduce it on demand, but this behavior occurs every 2-3 days.

Additional info:
Comment 1 Lon Hohberger 2005-09-19 11:00:57 EDT
It looks like a network-layer problem causing the cluster to go bonkers, not the
other way around.

Max aliases per device (e.g. eth0:0, eth0:1...) should be 256 (so, eth0:255 is
the max), so that shouldn't really matter.

The cluster software doesn't really use the loopback device for much - certainly
nothing fancy.  It certainly doesn't mess with firewall rules, routing, etc.  It
uses lo to connect to localhost to find the master server, but not for locks
themselves (it talks to the cluster member's IP -- even if it's local).

So, in your logs, you're seeing both member #0 and member #1 failing to connect
to member #0's IP address (160.160.2.210, I think)...  At least it's consistent,
which should make it easier to track down.
Comment 2 Lon Hohberger 2005-09-29 15:42:38 EDT
Created attachment 119435 [details]
Patch

This patch adds retry code in the unlock path.  Since I don't know what the
problem in the network stack causing the issue might be, this is a workaround.
The patch does two things:

(1) Don't do _clu_process_unlock if we fail to release the cluster lock (this
fixes the "Alert: Unhandled Exception" at line 408 in clulock.c).

(2) Retry if we time out connecting to the lock keeper.  This also covers the
case of the lock keeper's death, as we re-query the local master for the
information if the connection fails.

This won't prevent the messages from occurring, but it should prevent the
unhandled exception problem.

Let me know if you would like test packages.
Comment 3 Alfred Mikota 2005-10-03 02:44:34 EDT
Thanks a lot!
Test packages would be very helpful!
Comment 4 Alfred Mikota 2005-10-03 03:55:10 EDT
Created attachment 119527 [details]
command output: ifconfig -a
Comment 5 Alfred Mikota 2005-10-03 03:56:06 EDT
Created attachment 119528 [details]
command output: lsof
Comment 6 Alfred Mikota 2005-10-03 03:56:54 EDT
Created attachment 119529 [details]
command output: netstat -a
Comment 7 Alfred Mikota 2005-10-03 03:57:38 EDT
Created attachment 119530 [details]
command output: netstat -r
Comment 8 Alfred Mikota 2005-10-03 03:58:03 EDT
Created attachment 119531 [details]
command output: ps -ef
Comment 9 Alfred Mikota 2005-10-03 03:58:40 EDT
Created attachment 119532 [details]
command output: socklist
Comment 10 Alfred Mikota 2005-10-03 04:03:04 EDT
Today I had the same erroneous behaviour on the cluster.
For tracing purposes I issued some commands; I have attached their output to
this bug report.

As you can see, clulockd has a lot of open connections, which (as I believe)
destroys the IP stack. Could you please check this?

Thanks
Alfred
Comment 11 Alfred Mikota 2005-10-03 04:48:22 EDT
One other issue:

I set the debug level to DEBUG for clulockd.
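
A sketch of how this was presumably done (the clulockd%loglevel key is the one
used later in this bug; the value 7 for DEBUG is my assumption, based on the
syslog numbering in which NOTICE is 5):

    # assumed: raise clulockd logging to syslog DEBUG (7); 5 (NOTICE) restores it
    cludb -p clulockd%loglevel 7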

The client prints the following output:


LUNX10:
=======
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> LOCK_MASTER_QUERY
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> replying LOCK_ACK 0
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> Verified connect from member #1 
(160.160.2.210)
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> Processing message on 11
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> Received 20 bytes from peer
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> LOCK_MASTER_QUERY
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> replying LOCK_ACK 0
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> Verified connect from member #1 
(160.160.2.210)
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> Processing message on 11
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> Received 20 bytes from peer
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> LOCK_MASTER_QUERY
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> replying LOCK_ACK 0
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> Verified connect from member #1 
(160.160.2.210)
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> Processing message on 11
Oct  3 10:48:22 lunx10 clulockd[7354]: <debug> Received 20 bytes from peer



LUNX11:
=======

Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Verified connect from member #0 
(160.160.2.211)
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Processing message on 10
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Received 20 bytes from peer
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> LOCK_MASTER_QUERY
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> replying LOCK_ACK 0
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Verified connect from member #0 
(160.160.2.211)
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Processing message on 10
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Received 188 bytes from peer
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> LOCK_UNLOCK
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> lockd_unlock: member #0 lock 7
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> ACK: lock unlocked
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> ioctl(fd,SIOCGARP,ar [eth0]): No 
such device or address
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Connect: Member #1 
(160.160.2.210) [IPv4]
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Processing message on 10
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Received 188 bytes from peer
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> LOCK_LOCK | LOCK_TRYLOCK
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> lockd_trylock: member #1 lock 0
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Replying ACK
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> ioctl(fd,SIOCGARP,ar [eth0]): No 
such device or address
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Connect: Member #1 
(160.160.2.210) [IPv4]
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Processing message on 10
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> Received 188 bytes from peer
Oct  3 10:48:54 lunx11 clulockd[7479]: <debug> LOCK_UNLOCK


Is this a normal condition?

Thanks in advance!

Alfred
Comment 12 Lon Hohberger 2005-10-03 11:55:58 EDT
The ioctl() message is odd, but it seems to be working just fine otherwise (it
already has a MAC address, so it looks up the source IP).  The ARP check ensures
that the connecting node is in the same subnet as the node accepting the
connection.  If we can't get the node's MAC address initially, then we deny the
connection.
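
A rough manual equivalent of that ARP lookup, purely as a diagnostic (this is
not something clumanager itself runs), would be to check the kernel ARP table
for the peer from the shell:

    # does eth0 have an ARP entry for the peer? (roughly what SIOCGARP checks)
    arp -n -i eth0 160.160.2.210
    # or read the kernel ARP table directly
    grep 160.160.2.210 /proc/net/arp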

In some cases, it doesn't always work correctly.  The recently-released 1.2.28
package has a workaround for this:

(a) stop cluster on both nodes
(b) run "cludb -p cluster%msgsvc_noarp 1" on one node
(c) copy /etc/cluster.xml to other node
(d) start cluster

This disables the ARP checking (which isn't really needed anyway).
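
A minimal shell sketch of steps (a)-(d), assuming the clumanager init script
is used to stop/start the cluster and that the peer node is lunx10 (adjust
both to your environment):

    # (a) stop the cluster on both nodes
    service clumanager stop
    # (b) on one node, disable the ARP check
    cludb -p cluster%msgsvc_noarp 1
    # (c) copy the updated configuration to the other node (hostname assumed)
    scp /etc/cluster.xml root@lunx10:/etc/cluster.xml
    # (d) start the cluster on both nodes
    service clumanager start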
Comment 13 Lon Hohberger 2005-10-03 12:26:21 EDT
I don't think that a bunch of connections in TIME_WAIT should have much of an
effect, actually.  You get a lot of sockets left in TIME_WAIT from a
high-traffic web server as well.
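
A quick, generic way to see how many sockets are sitting in TIME_WAIT at any
given moment (standard netstat, nothing clumanager-specific):

    # count sockets currently in the TIME_WAIT state
    netstat -ant | grep -c TIME_WAIT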
Comment 14 Lon Hohberger 2005-10-04 17:38:43 EDT
I'll put up a test package tomorrow -- currently, what it does is basically
retry if this happens.  It's a hard thing to detect when it happens (after all,
it could be anything preventing the connection).  However, since it's typically
in the unlock path, it makes it easier to deal with.

The fault ("Unhandled Exception") is easy to get rid of. ;)
Comment 15 Alfred Mikota 2005-10-05 02:57:02 EDT
I will upgrade clumanager from 1.2.24-0.1 to 1.2.28-1 and hope this will work
around the ARP checks.

clulockd has a lot of connections in the TIME_WAIT state (300-500 connections).
Since a "normal" TCP connection doesn't reside in TIME_WAIT, I can hardly
believe that this is a normal condition.

Furthermore, I made some changes in /etc/sysctl.conf to keep the IP stack more
stable:

net.ipv4.tcp_fin_timeout = 25
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_tw_buckets = 1440000
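
These settings can be applied without a reboot using standard sysctl:

    # reload the values from /etc/sysctl.conf
    sysctl -p /etc/sysctl.conf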


Comment 16 Lon Hohberger 2005-10-05 13:17:21 EDT
Please have a look at the following link, which explains how TCP works:

http://www.faqs.org/rfcs/rfc793.html

* Clulockd and other cluster daemons open and close many "normal" TCP
connections quickly, which is why many connections appear in the TIME-WAIT
state.  (Like a web server with several clients at once).  This is normal and
expected behavior.

* Your cluster configuration isn't exceedingly large, which raises the question: 
What is different about your particular environment that makes this break, when
it has worked for years in similar (and, of course, more complex) environments?
 [This is not something I can answer; you will have to go through Red Hat
Support: http://www.redhat.com/apps/support/ ]

* There are essentially two problems here:

(1)  The "Unhandled Exception" (which is a bug in clumanager) is easy to fix. 
It should retry where it's giving up, but it doesn't.  I can fix that, and I'll
provide packages later today.  However, that doesn't solve the other problem...

(2)  The reason (1) appeared in the first place -- connection timeout during
unlock (at that point, all connections over any IP fail) -- is not as simple. 
Since you seem to think it is related to the TCP sockets in TIME-WAIT, you can
try tweaking things like TCP time-wait reusing/recycling:

echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

This should reduce the number of sockets in TIME-WAIT.  If this fixes (2), great!

If you were curious about how many ports can be opened/in use at once, try:

cat /proc/sys/net/ipv4/ip_local_port_range

On my system, it's about 28,000 (the delta between the two is how many ports you
have available for connect() operations...).  If you run out of ports during a
connect, the caller will get EAGAIN, which is an immediate error (not a timeout
like you have observed).
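
As a small illustration, the number of ports available for outgoing connect()
calls is simply the difference between the two values in that file (a generic
check, not clumanager-specific):

    # print how many ephemeral ports are available for connect()
    awk '{ print $2 - $1, "ephemeral ports available" }' \
        /proc/sys/net/ipv4/ip_local_port_range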
Comment 17 Anu Matthew 2005-10-12 11:43:34 EDT
Adding related info, from a 4 node cluster with bonded i/f (clumanager-1.2.26.1-1). 

member id="0" is logging messages: 

Oct 10 11:19:56 dvllinux12 clulockd[3897]: <debug> ioctl(fd,SIOCGARP,ar
[bond0]): No such device or address
Oct 10 11:19:56 dvllinux12 clulockd[3897]: <debug> Connect: Member #1
(165.89.127.29) [ETHER]
Oct 10 11:19:56 dvllinux12 clulockd[3897]: <warning> Denied 165.89.127.29:
Broken pipe
Oct 10 11:19:56 dvllinux12 clulockd[3897]: <err> select error: Broken pipe
Oct 10 11:19:56 dvllinux12 clulockd[3897]: <debug> ioctl(fd,SIOCGARP,ar
[bond0]): No such device or address
Oct 10 11:19:56 dvllinux12 clulockd[3897]: <debug> Connect: Member #1
(165.89.127.30) [ETHER]
Oct 10 11:19:56 dvllinux12 clulockd[3897]: <warning> Denied 165.89.127.30:
Broken pipe
Oct 10 11:19:56 dvllinux12 clulockd[3897]: <err> select error: Broken pipe
Oct 10 11:19:56 dvllinux12 clulockd[3897]: <debug> ioctl(fd,SIOCGARP,ar
[bond0]): No such device or address
Oct 10 11:19:56 dvllinux12 clulockd[3897]: <debug> Connect: Member #1
(165.89.127.7) [IPv4]


The rest of the nodes log the "ioctl(fd,SIOCGARP,ar [bond0]): No such device
or address" message, but not the "select error: Broken pipe" error.

Oct  9 04:03:25 prdlinux11 clulockd[3811]: <debug> replying LOCK_ACK 0
Oct  9 04:03:27 prdlinux11 cluquorumd[3771]: <debug> ioctl(fd,SIOCGARP,ar
[bond0]): No such device or address
Oct  9 04:03:27 prdlinux11 cluquorumd[3771]: <debug> Connect: Member #3
(165.89.127.30) [IPv4]
Oct  9 04:03:27 prdlinux11 cluquorumd[3771]: <debug> Q: Received
QUORUM_PSWITCH_X, fd9


Just so that you know, and also because I'm curious: why is only node 0
getting the Broken Pipe messages?

--Ev
Comment 18 Lon Hohberger 2005-10-12 14:04:00 EDT
Sorry for the delay, I posted packages which should remove the lock problem here:

http://people.redhat.com/lhh/packages.html
Comment 19 Alfred Mikota 2005-10-18 01:50:47 EDT
As far as I can tell, the network problem comes up as soon as Oracle's RMAN is
doing a full backup of the database.
Your last package seems to make the exception disappear.

The timeouts on the cluster manager persist:

lunx10 cluster.log
Oct 18 03:00:15 lunx11 clulockd[7427]: <warning> Denied 160.160.2.210: Broken pipe
Oct 18 03:00:15 lunx11 clulockd[7427]: <err> select error: Broken pipe
Oct 18 03:00:49 lunx11 clulockd[7427]: <warning> Denied 160.160.2.210: Broken pipe
Oct 18 03:00:49 lunx11 clulockd[7427]: <err> select error: Broken pipe
Oct 18 03:01:17 lunx11 clulockd[7427]: <warning> Denied 160.160.2.210: Broken pipe
Oct 18 03:01:17 lunx11 clulockd[7427]: <err> select error: Broken pipe
Oct 18 03:02:36 lunx11 clusvcmgrd[7540]: <err> Unable to obtain cluster lock:
Connection timed out
Oct 18 03:02:37 lunx11 clulockd[7427]: <warning> Denied 160.160.2.211: Broken pipe
Oct 18 03:02:37 lunx11 clulockd[7427]: <err> select error: Broken pipe


lunx11 cluster.log:
Oct 18 03:00:15 lunx11 clulockd[7427]: <warning> Denied 160.160.2.210: Broken pipe
Oct 18 03:00:15 lunx11 clulockd[7427]: <err> select error: Broken pipe
Oct 18 03:00:49 lunx11 clulockd[7427]: <warning> Denied 160.160.2.210: Broken pipe
Oct 18 03:00:49 lunx11 clulockd[7427]: <err> select error: Broken pipe
Oct 18 03:01:17 lunx11 clulockd[7427]: <warning> Denied 160.160.2.210: Broken pipe
Oct 18 03:01:17 lunx11 clulockd[7427]: <err> select error: Broken pipe
Oct 18 03:02:36 lunx11 clusvcmgrd[7540]: <err> Unable to obtain cluster lock:
Connection timed out
Oct 18 03:02:37 lunx11 clulockd[7427]: <warning> Denied 160.160.2.211: Broken pipe
Oct 18 03:02:37 lunx11 clulockd[7427]: <err> select error: Broken pipe

Comment 20 Alfred Mikota 2005-10-18 01:52:59 EDT
Sorry, lunx10 cluster.log was the wrong one:

Oct 18 03:00:15 lunx10 clusvcmgrd[7458]: <crit> Couldn't connect to member #0:
Connection timed out
Oct 18 03:00:49 lunx10 clusvcmgrd[7458]: <crit> Couldn't connect to member #0:
Connection timed out

Comment 21 Lon Hohberger 2005-10-18 13:20:42 EDT
* The clumanager package fixes the unhandled exception - but doesn't make the
timeouts go away (it fixes the actual bug with clumanager).

* Did the /proc tweaks help?  Given that there are many thousands of available
TCP ports (something like 28,000), I didn't think this would help, but it was
worth a try.
Comment 22 Anu Matthew 2005-10-18 15:48:11 EDT
The /proc tweaks did not help on my aforementioned 4-node cluster, running
clumanager-1.2.26.1-1.

I have the following TCP parameters enabled:

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1

This timeout thing got so bad that the cluster assumed that one of the
services had failed, and ended up restarting (as usual) the package, which was
actually healthy.
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <err> select error: Broken pipe 
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <debug> ioctl(fd,SIOCGARP,ar
[bond0]): No such device or address 
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <debug> Connect: Member #1
(165.89.127.30) [ETHER] 
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <warning> Denied 165.89.127.30:
Broken pipe
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <err> select error: Broken pipe 
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <debug> Verified connect from member
#0 (165.89.127.6) 
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <warning> Denied 165.89.127.6: Broken
pipe
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <err> select error: Broken pipe 
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <debug> ioctl(fd,SIOCGARP,ar
[bond0]): No such device or address 
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <debug> Connect: Member #1
(165.89.127.29) [ETHER] 
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <warning> Denied 165.89.127.29:
Broken pipe
Oct 17 11:11:48 dvllinux12 clulockd[3897]: <err> select error: Broken pipe 

Oct 17 11:13:06 prdlinux11 clusvcmgrd[3904]: <err> Unable to obtain cluster
lock: No locks available
Oct 17 11:13:08 prdlinux11 cluquorumd[3771]: <notice> IPv4 TB @ 165.89.127.254
Online
Oct 17 11:13:23 prdlinux11 clusvcmgrd[3904]: <crit> Couldn't connect to member
#0: Connection timed out
Oct 17 11:13:23 prdlinux11 clusvcmgrd[3904]: <err> Unable to obtain cluster
lock: No locks available
Oct 17 11:13:46 prdlinux11 clusvcmgrd[2805]: <crit> Couldn't connect to member
#0: Connection timed out
Oct 17 11:13:46 prdlinux11 clusvcmgrd[2805]: <err> Unable to obtain cluster
lock: No locks available
*Oct 17 11:13:46 prdlinux11 clusvcmgrd[2805]: <warning> Restarting locally
failed service ploracm1 *
Oct 17 11:13:46 prdlinux11 clusvcmgrd: [2806]: <notice> service notice: Stopping
service ploracm1 ...
Oct 17 11:13:46 prdlinux11 clusvcmgrd: [2806]: <notice> service notice: Running
user script '/etc/cluster/scripts/ploracm1 stop'
Oct 17 11:13:46 prdlinux11 su(pam_unix)[2848]: session opened for user oracle by
(uid=0)
Oct 17 11:14:09 prdlinux11 cluquorumd[3771]: <notice> IPv4 TB @ 165.89.127.254
Offline
Oct 17 11:14:25 prdlinux11 cluquorumd[3771]: <notice> IPv4 TB @ 165.89.127.254
Online
Oct 17 11:15:10 prdlinux11 clusvcmgrd[3904]: <crit> Couldn't connect to member
#0: Connection timed out
Oct 17 11:15:10 prdlinux11 clusvcmgrd[3904]: <err> Unable to obtain cluster
lock: No locks available
Oct 17 11:15:17 prdlinux11 su(pam_unix)[3088]: session opened for user root by
root(uid=0)
Oct 17 11:20:00 prdlinux11 sshd(pam_unix)[3489]: session opened for user root by
(uid=0)
Oct 17 11:20:00 prdlinux11 sshd(pam_unix)[3489]: session closed for user root
Oct 17 11:20:00 prdlinux11 sshd(pam_unix)[3503]: session opened for user root by
(uid=0) 


All of this is of course on clumanager-1.2.26.1-1; I have not gone to
clumanager-1.2.28, as it would require all nodes to be shut down if we were to
change "cluster%msgsvc_noarp" to YES.

All nodes have bonded NICs using the e1000 module that came with
2.4.21-32.ELsmp.

I just don't want to face the music if this still happens even after patching
to 1.2.28. Hence all this!

--Ev 

Comment 23 Lon Hohberger 2005-10-18 18:22:12 EDT
You can upgrade to 1.2.28 if you want to, but at this point, I don't think this
is going to help much.  This *really* looks like a kernel problem.  The
msgsvc_noarp will quiet *some* things down, but... that's not the root of it.

Ok, so, when this happens:

(a) TCP no longer works to 127.0.0.1 or to other nodes from the affected node
(b) ICMP no longer works (the IP tiebreaker uses ICMP packets (ping!), and even
*IT* is bouncing).

It's very likely this is a kernel problem and very unlikely that it is a
clumanager problem (though clumanager may cause it to occur somehow).  It would
definitely speed things up to file a support request if you have not already
done so.
Comment 31 Lon Hohberger 2005-10-25 11:25:35 EDT
Can you install this errata if you have not already?

http://rhn.redhat.com/errata/RHBA-2005-624.html

In the mean time, I will investigate a way to alleviate some of the socket
inefficiencies in the lock code.
Comment 32 Lon Hohberger 2005-10-25 22:29:58 EDT
http://people.redhat.com/packages.html

* I fixed a race condition in clulockd that caused a deadlock when signals
were received (and thus clulockd would not respond, leading to timeouts).  I
saw this deadlock behavior while reloading the configuration after adding an
additional 19 services.  Consequently, I tested with:

   while : ; do killall -HUP clulockd; done

...while starting/stopping the other member (which causes lots of lock requests).

* I crafted a patch to use local socket (UNIX domain sockets) for certain lock
communications/operations.  This may not be in an errata because it might be
considered destabilizing.  However, it improves performance, significantly
reduces the number of TCP sockets left in TIME_WAIT, and it should not break
rolling upgrade.

* The patch for retrying in the event of a failed/timed-out unlock is also
included as part of the local socket comms patch.

Notes:

* Running clulockd in debug mode will destroy any hope of reasonable lock
performance, but it is good for testing.  You will probably see "Broken pipe"
errors when running clulockd at the debug log level (and at other levels under
high load); fortunately, lock requesters will retry, so unless you see
starvation messages, these can usually be ignored.


Let me know if this alleviates your problem(s).
Comment 35 Alfred Mikota 2005-10-27 01:50:32 EDT
Hi!
I will try the cluster manager this evening. Thanks a lot in advance!
Comment 38 Lon Hohberger 2005-10-28 11:58:02 EDT
x86_64 packages (contains an incremental build fix for that platform):

http://people.redhat.com/lhh/clumanager-1.2.28-3locksig.src.rpm
http://people.redhat.com/lhh/clumanager-1.2.28-3locksig.x86_64.rpm
Comment 39 Anu Matthew 2005-10-29 11:08:13 EDT
2.28-2locksig rpm has a dependency on libgulm.so.6

# rpm -Uvh ./clumanager-1.2.28-2locksig.i386.rpm  

error: Failed dependencies:
       libgulm.so.6 is needed by clumanager-1.2.28-2locksig 

I had to use --nodeps to force the install. 
Comment 40 Lon Hohberger 2005-10-31 10:39:19 EST
That should not cause any problems; I will look into it, though.  The gulm
bridge uses it (it is a special STONITH module).
Comment 41 Alfred Mikota 2005-10-31 15:08:36 EST
clumanager-1.2.28-2locksig.i386.rpm works quite well. I haven't seen the
corrupted loopback device for two days, but as you already mentioned, I also
believe that clumanager is not the trigger for the loopback network device
problem.

Under heavy network load, clumanager writes the following lines to
/var/log/cluster.log:

Oct 31 21:00:52 lunx11 clulockd[21979]: <warning> Denied 160.160.2.210: Broken pipe
Oct 31 21:00:52 lunx11 clulockd[21979]: <err> select error: Broken pipe
Oct 31 21:01:00 lunx11 clulockd[21979]: <warning> Denied 160.160.2.210: Broken pipe
Oct 31 21:01:00 lunx11 clulockd[21979]: <err> select error: Broken pipe
Oct 31 21:01:16 lunx11 clulockd[21979]: <warning> Denied 160.160.2.210: Broken pipe
Oct 31 21:01:16 lunx11 clulockd[21979]: <err> select error: Broken pipe
Oct 31 21:01:34 lunx11 clulockd[21979]: <warning> Denied 160.160.2.210: Broken pipe
Oct 31 21:01:34 lunx11 clulockd[21979]: <err> select error: Broken pipe
Oct 31 21:01:42 lunx11 clulockd[21979]: <warning> Denied 160.160.2.210: Broken pipe
Oct 31 21:01:42 lunx11 clulockd[21979]: <err> select error: Broken pipe


Is this a normal condition?

Cheers
Alfred
Comment 42 Lon Hohberger 2005-10-31 15:40:09 EST
Yes, that's a possible condition during heavy load of any sort.  In this case,
the requestor retries.  Most people do not run cluster nodes / cluster networks
at max. utilization, as this increases the chance for failures.

We should *probably* make a more intelligent error handler for this case so that
these messages don't appear in the logs as often.

The point here is that it should not end up in endless "Timeout" loops that
cause the services to become unavailable.  It's even possible to see "timeout"
errors periodically, but things should continue normally.

Also, there should be an order of magnitude fewer sockets in TIME_WAIT.  While
this probably wasn't an actual problem, it certainly doesn't hurt to have
fewer.  If you'd like, I can build a package without this new functionality
and see if it causes the loopback problem again...
Comment 43 Alfred Mikota 2005-11-07 04:31:42 EST
Joern from Red Hat Technical Support gave me the following first answer:

"I got a first reply from our engineers.
They say that you should install at least 2.4.21-32.0.1.EL because the
2.4.21-27.0.2.ELsmp kernel is ancient, and known broken (it contains 2 known
data corruptors)."


I updated one cluster node to kernel version 2.4.21-37.ELsmp.
Since then, the loopback interface has not broken under heavy load! I will
observe the two cluster nodes and inform you as soon as I know whether the
problem occurs again or not.

Yes, it seems not to be a clumanager problem - it's a kernel problem.

Take care!

Alfred
Comment 44 Alfred Mikota 2005-11-08 02:29:34 EST
After 2 days the error occurred again, so I will ask Red Hat Support for further help.
Comment 45 Lon Hohberger 2005-11-08 09:43:58 EST
Did it cause any problems, or did it just show up once?
Comment 46 Alfred Mikota 2005-11-09 02:58:41 EST
The cluster manager seems to handle this situation without problems, but
interprocess communication is interrupted.
Comment 47 Lon Hohberger 2005-11-10 16:20:23 EST
Ok, well, we *know* it's below the cluster at this point.
Comment 48 Lon Hohberger 2005-11-10 16:29:27 EST
I'm going to file separate bugzillas for the two clulockd issues.
Comment 49 Lon Hohberger 2005-11-10 16:48:37 EST
Bugzillas: 172893 and 172894 address the clumanager issues contained in this
bugzilla.  Beyond that, I think it is a kernel problem.
Comment 50 Alfred Mikota 2005-11-11 06:59:16 EST
I noticed the following line in cluster.log:

Nov 10 21:19:52 lunx11 clumembd[6748]: <debug> Verified connect from member #0
(127.0.0.1) 

Does the breakage of the loopback device have an impact on the cluster, since
it uses 127.0.0.1?



Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> LOCK_UNLOCK 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> lockd_unlock: member #0 lock 5 
Nov 10 21:19:52 lunx11 clumembd[6748]: <debug> Verified connect from member #0
(127.0.0.1) 
Nov 10 21:19:52 lunx11 clumembd[6748]: <debug> MB: New connect: fd11 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> ACK: lock unlocked 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> accepted local connection 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> Processing message on 10 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> Received 20 bytes from peer 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> LOCK_MASTER_QUERY 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> replying LOCK_ACK 0 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> Processing message on 10 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> Received 188 bytes from peer 
Nov 10 21:19:52 lunx11 clulockd[6752]: <debug> LOCK_UNLOCK 

Alfred
Comment 51 Lon Hohberger 2005-11-11 10:40:22 EST
Possibly.  However, attempting to work around this behavior is a bad idea -- we
need the real bug fixed.  Can you summarize what problems are actually still
occurring with the current test packages?  As I understand it:

    Periodic loss of ability to communicate to the loopback device (lo).

* At the time "lo" communication is disrupted, how long is it for, and does the
cluster take recovery actions (e.g. STONITH, failover, etc.)?

* Can you connect to other hosts (e.g. via other devices besides lo) at the time
lo communication is disrupted?

* The existence of periodic "timeout" errors from clumanager does not
necessarily mean that there is a communication problem.  The timeout used by
clumanager during a connect operation is usually 5-10 seconds.  The server
taking too long to respond, perhaps because of load, can cause periodic
"timeout" errors.  The problem I fixed was a deadlock which caused an infinite
timeout loop.

* You should probably go back to NOTICE mode for locking (instead of DEBUG). 
DEBUG mode significantly impacts locking performance, and locking is clearly not
the problem.  Also, setting the log level to DEBUG can increase the risk of
timeout errors because of the amount of logging: if syslog backs up, clulockd
will block until syslog is ready to accept the log info, and thus, not respond
to connect attempts.

   cludb -p clulockd%loglevel 5
Comment 52 Lon Hohberger 2005-11-11 10:50:12 EST
Also, if syslog is backed up, anything trying to *use* syslog will also wedge
temporarily.
Comment 53 Alfred Mikota 2005-11-15 01:09:38 EST
Hi Lon!

Here is a short summary of the current situation.

I am running the latest clumanager package you provided:
clumanager-1.2.28.6-0.1nogfs

As far as I can see, network communication on the loopback device (and only
this device) breaks when heavy load (e.g. an Oracle RMAN backup) is running on
one machine.

Network connection to the other machine is still possible.

When the heavy load occurs, the following lines appear in cluster.log:

Nov 15 01:02:57 lunx11 clulockd[6718]: <warning> Denied 160.160.2.210: Broken pipe
Nov 15 01:02:57 lunx11 clulockd[6718]: <err> select error: Broken pipe
Nov 15 01:02:57 lunx11 clulockd[6718]: <warning> Denied 160.160.2.210: Broken pipe
Nov 15 01:02:57 lunx11 clulockd[6718]: <err> select error: Broken pipe
Nov 15 01:02:57 lunx11 clulockd[6718]: <warning> Denied 160.160.2.210: Broken pipe
Nov 15 01:02:57 lunx11 clulockd[6718]: <err> select error: Broken pipe
Nov 15 01:07:23 lunx11 clulockd[6718]: <warning> Denied 160.160.2.210: Broken pipe
Nov 15 01:07:23 lunx11 clulockd[6718]: <err> select error: Broken pipe
Nov 15 01:07:30 lunx11 clulockd[6718]: <warning> Denied 160.160.2.210: Broken pipe
Nov 15 01:07:30 lunx11 clulockd[6718]: <err> select error: Broken pipe
Nov 15 01:21:59 lunx11 clumembd[6714]: <info> Membership View #3:0x00000001
Nov 15 01:22:00 lunx11 cluquorumd[6710]: <warning> --> Commencing STONITH <--
Nov 15 01:22:03 lunx11 cluquorumd[6710]: Power to NPS outlet(s) lunx10 turned /Off.
Nov 15 01:22:03 lunx11 cluquorumd[6710]: <notice> STONITH: lunx10 has been fenced!
Nov 15 01:22:07 lunx11 cluquorumd[6710]: Power to NPS outlet(s) lunx10 turned /On.
Nov 15 01:22:07 lunx11 cluquorumd[6710]: <notice> STONITH: lunx10 is no longer
fenced off.
Nov 15 01:22:07 lunx11 clusvcmgrd[6721]: <info> Quorum Event: View #4 0x00000001


This causes the other host to be fenced off.
The other host, which is being fenced off, has no entries in cluster.log.

Is there a way to keep clumanager from fencing off the other host during this
heavy load (e.g. clusvcadm -S; clusvcadm -u)? Should I move the database that
causes the heavy load to a third host outside the cluster?


I am using RHEL3 U6 with the latest updates of all packages.
Comment 54 Lon Hohberger 2005-11-15 10:17:46 EST
Thank you!  That was enormously helpful.  Local communication disruption under
load is certainly not a clumanager issue.

In the meantime, you can do this to hopefully help with STONITH problems:

    cludb -p clumembd%rtp 90
    cludb -p cluquorumd%rtp 90

...and increase the failover time a bit.  It should help.
Comment 55 Lon Hohberger 2005-11-15 13:54:59 EST
I think this remaining issue is *strongly* related to # 155892

- It only happens under load, and after awhile
- It only happens to the lo interface (other interfaces are unaffected)

Adding 155892 as a dependency.
Comment 56 Alfred Mikota 2005-11-23 10:01:56 EST
Hi Lon!

One question: since installing the new RPM package, I have the problem that
once a week the other member is fenced off without any warning. All I can see
are the following entries in cluster.log:


Nov 23 01:08:42 lunx11 clulockd[6718]: <err> select error: Broken pipe
Nov 23 13:06:27 lunx11 clumembd[6714]: <info> Membership View #5:0x00000001
Nov 23 13:06:28 lunx11 cluquorumd[6710]: <warning> --> Commencing STONITH <--
Nov 23 13:06:31 lunx11 cluquorumd[6710]: Power to NPS outlet(s) lunx10 turned /Off.
Nov 23 13:06:32 lunx11 cluquorumd[6710]: <notice> STONITH: lunx10 has been fenced!
Nov 23 13:06:35 lunx11 cluquorumd[6710]: Power to NPS outlet(s) lunx10 turned /On.
Nov 23 13:06:35 lunx11 cluquorumd[6710]: <notice> STONITH: lunx10 is no longer
fenced off.
Nov 23 13:06:35 lunx11 clusvcmgrd[6721]: <info> Quorum Event: View #7 0x00000001
Nov 23 13:06:37 lunx11 clusvcmgrd[6721]: <info> State change: lunx11 DOWN
Nov 23 13:11:41 lunx11 clumembd[6714]: <notice> Member lunx10 UP
Nov 23 13:11:42 lunx11 clumembd[6714]: <info> Membership View #6:0x00000003
Nov 23 13:11:42 lunx11 clusvcmgrd[6721]: <info> Quorum Event: View #8 0x00000003
Nov 23 13:11:42 lunx11 clusvcmgrd[6721]: <info> State change: Local UP
Nov 23 13:11:43 lunx11 clusvcmgrd[6721]: <info> State change: lunx10 UP


Do you have any ideas?


Thanks in advance!

Alfred
Comment 57 Lon Hohberger 2005-11-28 12:17:10 EST
Increase your failover time in the GUI to 30 seconds and/or see comment #54.

Nov 23 13:06:27 lunx11 clumembd[6714]: <info> Membership View #5:0x00000001

...means clumembd on lunx10 stopped sending heartbeats, or lunx11 stopped
receiving them, for the duration of the failover time.  There is a wide range
of reasons this could happen; the most likely is system load.
Comment 58 Lon Hohberger 2006-01-04 14:57:29 EST

*** This bug has been marked as a duplicate of 155892 ***
