Bug 211777

Summary:	dlm/cman sctp errors in xen guests with fc6t3
Product:	[Fedora] Fedora	Reporter:	Thorsten Scherf <tscherf>
Component:	cman	Assignee:	Christine Caulfield <ccaulfie>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	6	CC:	mattdm, nhorman
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	FC6	Doc Type:	Bug Fix
Last Closed:	2007-04-07 12:04:06 UTC	Type:	---
Bug Blocks:	213965

Description Thorsten Scherf 2006-10-22 16:41:08 UTC

Description of problem:
The service did not relocate to another host (2-node setup).

Version-Release number of selected component (if applicable):
rgmanager-2.0.8-1.fc6

How reproducible:
2-node setup
Apache HA
run clustat on node 1
run clustat on node 2


Steps to Reproduce:
1. 2-node HA setup
2. Run clustat on node 1
3. Run clustat on node 2
  
Actual results:
node:1
[root@server1 ~]# clustat -f
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                               1 Online, Local
  server2                               2 Online

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:www          server1                        started

/var/log/messages:
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: recover 1
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: add member 1
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: total members 1 error 0
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: dlm_recover_directory
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: dlm_recover_directory 0 entries
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: recover 1 done: 0 ms
Oct 22 18:30:40 server1 clurgmgrd[1811]: <notice> Resource Group Manager Starting 
Oct 22 18:30:40 server1 clurgmgrd: [1811]: <info> Executing /etc/init.d/httpd stop 
Oct 22 18:30:41 server1 clurgmgrd: [1811]: <info> /dev/xvdb1 is not mounted 
Oct 22 18:30:44 server1 clurgmgrd[1811]: <notice> Starting stopped service service:www
Oct 22 18:30:44 server1 clurgmgrd: [1811]: <info> mounting /dev/xvdb1 on /webdata 
Oct 22 18:30:44 server1 kernel: kjournald starting.  Commit interval 5 seconds
Oct 22 18:30:44 server1 kernel: EXT3 FS on xvdb1, internal journal
Oct 22 18:30:44 server1 kernel: EXT3-fs: recovery complete.
Oct 22 18:30:44 server1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Oct 22 18:30:44 server1 clurgmgrd: [1811]: <info> Adding IPv4 address 192.168.0.130 to eth0
Oct 22 18:30:44 server1 avahi-daemon[1494]: Registering new address record for 192.168.0.130 on eth0.
Oct 22 18:30:46 server1 in.rdiscd[2110]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
Oct 22 18:30:46 server1 in.rdiscd[2110]: Failed joining addresses 
Oct 22 18:30:46 server1 clurgmgrd: [1811]: <info> Executing /etc/init.d/httpd start 
Oct 22 18:30:46 server1 clurgmgrd[1811]: <notice> Service service:www started 
Oct 22 18:30:54 server1 clurgmgrd: [1811]: <info> Executing /etc/init.d/httpd status
Oct 22 18:31:21 server1 kernel: dlm: rgmanager: recover 3
Oct 22 18:31:21 server1 kernel: dlm: rgmanager: add member 2
Oct 22 18:31:21 server1 kernel: dlm: Initiating association with node 2
Oct 22 18:36:54 server1 kernel: dlm: Can't start SCTP association - retrying
Oct 22 18:36:54 server1 kernel: dlm: Initiating association with node 2

node:2
[root@server2 ~]# clustat -f
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                               1 Online
  server2                               2 Online, Local

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:www          none                           uninitialized

/var/log/messages:
Oct 22 18:29:25 server2 ccsd[1678]: Initial status:: Quorate 
Oct 22 18:31:21 server2 kernel: SCTP: Hash tables configured (established 1365 bind 1638)
Oct 22 18:31:21 server2 kernel: Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1189
Oct 22 18:31:21 server2 kernel: dlm: rgmanager: recover 1
Oct 22 18:31:21 server2 kernel: dlm: rgmanager: add member 1
Oct 22 18:31:21 server2 kernel: dlm: rgmanager: add member 2
Oct 22 18:31:21 server2 kernel: dlm: Initiating association with node 1
Oct 22 18:31:30 server2 clurgmgrd[1758]: <notice> Resource Group Manager Starting 
Oct 22 18:31:31 server2 clurgmgrd: [1758]: <info> Executing /etc/init.d/httpd stop 
Oct 22 18:31:31 server2 clurgmgrd: [1758]: <info> /dev/xvdb1 is not mounted 
Oct 22 18:36:54 server2 kernel: dlm: Can't start SCTP association - retrying
Oct 22 18:36:54 server2 kernel: dlm: Initiating association with node 1


Expected results:
clustat on both nodes should list server1 as the owner of the HA www service and
should list the service as started, not as uninitialized.
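
A minimal sketch for comparing the two nodes' views of the service table,
assuming the `clustat -f` output layout shown above (one row per service with
name, owner and state as three whitespace-separated fields). The function names
`parse_services` and `inconsistent_services` are hypothetical helpers, not part
of rgmanager.

```python
def parse_services(clustat_output):
    """Return {service_name: (owner, state)} from clustat's service table.

    Assumes rows look like the ones in this report, e.g.
    "  service:www          server1                        started".
    """
    services = {}
    for line in clustat_output.splitlines():
        fields = line.split()
        # Only rows whose first field starts with "service:" are service rows.
        if len(fields) == 3 and fields[0].startswith("service:"):
            services[fields[0]] = (fields[1], fields[2])
    return services


def inconsistent_services(view_a, view_b):
    """Return the sorted names of services whose owner/state differ between two nodes."""
    a = parse_services(view_a)
    b = parse_services(view_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Fed the two outputs from this report, `inconsistent_services` would flag
`service:www`, since node 1 reports `server1 / started` and node 2 reports
`none / uninitialized`.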

Additional info:

cluster.conf:
<?xml version="1.0"?>
<cluster alias="www" config_version="3" name="alpha_cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="server1" votes="1" nodeid="1">
                        <fence/>
                </clusternode>
                <clusternode name="server2" votes="1" nodeid="2">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources>
                        <ip address="192.168.0.130" monitor_link="1"/>
                        <fs device="/dev/xvdb1" force_fsck="0" force_unmount="0" fsid="4677" fstype="ext3" mountpoint="/webdata" name="documentroot" options="" self_fence="0"/>
                        <script file="/etc/init.d/httpd" name="httpd"/>
                </resources>
                <service autostart="1" name="www">
                        <ip ref="192.168.0.130"/>
                        <fs ref="documentroot"/>
                        <script ref="httpd"/>
                </service>
        </rm>
</cluster>


This is a very simple cluster setup in a Xen environment.

Comment 1 Lon Hohberger 2006-10-23 17:10:14 UTC

should be the same problem as 211701

Comment 2 Thorsten Scherf 2006-10-23 18:40:36 UTC

Additional info: when calling s-c-cluster, it always states that this node is
not part of a cluster, and the management tab disappears.

system-config-cluster-1.0.29-1.0

Comment 3 Lon Hohberger 2006-10-23 22:44:33 UTC


*** This bug has been marked as a duplicate of 211701 ***

Comment 4 Thorsten Scherf 2006-10-24 21:49:10 UTC

problem remains with rgmanager-2.0.15-1.i386.rpm.

Comment 5 Lon Hohberger 2006-10-26 20:48:54 UTC

OK, I thought it was the same, but it's obviously different:

Oct 22 18:36:54 server2 kernel: dlm: Can't start SCTP association - retrying
Oct 22 18:36:54 server2 kernel: dlm: Initiating association with node 1
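
A minimal sketch for spotting this symptom in a syslog file, assuming the
message text and syslog prefix match the lines quoted above. The helper name
`count_sctp_retries` is hypothetical.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Oct 22 18:36:54 server2 kernel: dlm: Can't start SCTP association - retrying
RETRY_RE = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s(?P<host>\S+)\skernel:\sdlm:\s"
    r"Can't start SCTP association - retrying"
)


def count_sctp_retries(lines):
    """Return a Counter mapping hostname -> number of DLM SCTP retry messages."""
    hits = Counter()
    for line in lines:
        match = RETRY_RE.match(line)
        if match:
            hits[match.group("host")] += 1
    return hits
```

For example, `count_sctp_retries(open("/var/log/messages"))` on each node would
show whether both sides are stuck retrying the association, as they are in this
report.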

Comment 6 Lon Hohberger 2006-10-26 20:55:29 UTC

This looks like a DLM bug which should have been fixed in the current cman package.

Comment 7 Lon Hohberger 2006-10-26 20:57:17 UTC

Does this happen with a current cman package?

Comment 8 Lon Hohberger 2006-10-26 21:23:45 UTC

Apparently this happens with current cman/dlm, but only in Xen guests.

Comment 10 Thorsten Scherf 2006-10-26 21:35:07 UTC

[root@server2 ~]# rpm -q cman
cman-2.0.18-2.fc6
[root@server2 ~]# rpm -q rgmanager
rgmanager-2.0.15-1 (from the rhel5 beta channel)
[root@server2 ~]# uname -r
2.6.18-1.2798.fc6xen

Comment 11 Christine Caulfield 2006-10-27 14:51:29 UTC

This is the same as #212472 which is open on the Xen kernel as #212550.

I'll leave this open as we may need to close it separately for FC6 & RHEL5.

Comment 12 Neil Horman 2007-01-24 18:27:39 UTC

Have you tried setting the sctp sndbuf_policy to 1 in /proc/sys/net/sctp?  The
error you are seeing looks like it could perhaps come from a limitation of
sendbuffer space.  Setting the policy to 1 will relax some of the memory
allocation limits placed on sctp sockets and allow you to more easily create
associations.
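
A minimal sketch of applying that setting, assuming the sctp module is loaded
(so that /proc/sys/net/sctp/sndbuf_policy exists) and the script runs as root.
The helper name `set_sndbuf_policy` and its `path` parameter are hypothetical;
the parameter exists only so the function can be tried against an ordinary file.

```python
from pathlib import Path

# procfs file behind the net.sctp.sndbuf_policy sysctl.
SNDBUF_POLICY = Path("/proc/sys/net/sctp/sndbuf_policy")


def set_sndbuf_policy(value=1, path=SNDBUF_POLICY):
    """Write `value` to the policy file and return (old, new) as strings.

    Writing to the real procfs file requires root privileges.
    """
    old = path.read_text().strip()
    path.write_text(f"{value}\n")
    new = path.read_text().strip()
    return old, new
```

Calling `set_sndbuf_policy(1)` as root is equivalent to writing `1` into the
file by hand. The change does not survive a reboot; to make it persistent, a
`net.sctp.sndbuf_policy = 1` line in /etc/sysctl.conf would be the usual route.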

Comment 13 Matthew Miller 2007-04-06 19:13:44 UTC

Fedora Core 5 and Fedora Core 6 are, as we're sure you've noticed, no longer
test releases. We're cleaning up the bug database and making sure important bug
reports filed against these test releases don't get lost. It would be helpful if
you could test this issue with a released version of Fedora or with the latest
development / test release. Thanks for your help and for your patience.

[This is a bulk message for all open FC5/FC6 test release bugs. I'm adding
myself to the CC list for each bug, so I'll see any comments you make after this
and do my best to make sure every issue gets proper attention.]

Comment 14 Thorsten Scherf 2007-04-07 08:55:55 UTC

the bug seems to be fixed in fedora core 6.

Comment 15 Matthew Miller 2007-04-07 12:04:06 UTC

thanks!