Bug 211777 - dlm/cman sctp errors in xen guests with fc6t3
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: cman
Version: 6
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
Keywords: Reopened
Depends On:
Blocks: 213965
Reported: 2006-10-22 12:41 EDT by Thorsten Scherf
Modified: 2007-11-30 17:11 EST

See Also:
Fixed In Version: FC6
Doc Type: Bug Fix
Last Closed: 2007-04-07 08:04:06 EDT

Attachments: None
Description Thorsten Scherf 2006-10-22 12:41:08 EDT
Description of problem:
service did not relocate to another host (2 node setup)

Version-Release number of selected component (if applicable):
rgmanager-2.0.8-1.fc6

How reproducible:
2 node setup
apache ha
run clustat on node 1
run clustat on node 2


Steps to Reproduce:
1. 2 node ha setup
2. run clustat on node 1
3. run clustat on node 2
  
Actual results:
node:1
[root@server1 ~]# clustat -f
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                               1 Online, Local
  server2                               2 Online

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:www          server1                        started

/var/log/messages:
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: recover 1
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: add member 1
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: total members 1 error 0
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: dlm_recover_directory
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: dlm_recover_directory 0 entries
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: recover 1 done: 0 ms
Oct 22 18:30:40 server1 clurgmgrd[1811]: <notice> Resource Group Manager Starting 
Oct 22 18:30:40 server1 clurgmgrd: [1811]: <info> Executing /etc/init.d/httpd stop 
Oct 22 18:30:41 server1 clurgmgrd: [1811]: <info> /dev/xvdb1 is not mounted 
Oct 22 18:30:44 server1 clurgmgrd[1811]: <notice> Starting stopped service service:www
Oct 22 18:30:44 server1 clurgmgrd: [1811]: <info> mounting /dev/xvdb1 on /webdata 
Oct 22 18:30:44 server1 kernel: kjournald starting.  Commit interval 5 seconds
Oct 22 18:30:44 server1 kernel: EXT3 FS on xvdb1, internal journal
Oct 22 18:30:44 server1 kernel: EXT3-fs: recovery complete.
Oct 22 18:30:44 server1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Oct 22 18:30:44 server1 clurgmgrd: [1811]: <info> Adding IPv4 address 192.168.0.130 to eth0
Oct 22 18:30:44 server1 avahi-daemon[1494]: Registering new address record for 192.168.0.130 on eth0.
Oct 22 18:30:46 server1 in.rdiscd[2110]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
Oct 22 18:30:46 server1 in.rdiscd[2110]: Failed joining addresses 
Oct 22 18:30:46 server1 clurgmgrd: [1811]: <info> Executing /etc/init.d/httpd start 
Oct 22 18:30:46 server1 clurgmgrd[1811]: <notice> Service service:www started 
Oct 22 18:30:54 server1 clurgmgrd: [1811]: <info> Executing /etc/init.d/httpd status
Oct 22 18:31:21 server1 kernel: dlm: rgmanager: recover 3
Oct 22 18:31:21 server1 kernel: dlm: rgmanager: add member 2
Oct 22 18:31:21 server1 kernel: dlm: Initiating association with node 2
Oct 22 18:36:54 server1 kernel: dlm: Can't start SCTP association - retrying
Oct 22 18:36:54 server1 kernel: dlm: Initiating association with node 2

node:2
[root@server2 ~]# clustat -f
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                               1 Online
  server2                               2 Online, Local

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:www          none                           uninitialized

/var/log/messages:
Oct 22 18:29:25 server2 ccsd[1678]: Initial status:: Quorate 
Oct 22 18:31:21 server2 kernel: SCTP: Hash tables configured (established 1365 bind 1638)
Oct 22 18:31:21 server2 kernel: Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1189
Oct 22 18:31:21 server2 kernel: dlm: rgmanager: recover 1
Oct 22 18:31:21 server2 kernel: dlm: rgmanager: add member 1
Oct 22 18:31:21 server2 kernel: dlm: rgmanager: add member 2
Oct 22 18:31:21 server2 kernel: dlm: Initiating association with node 1
Oct 22 18:31:30 server2 clurgmgrd[1758]: <notice> Resource Group Manager Starting 
Oct 22 18:31:31 server2 clurgmgrd: [1758]: <info> Executing /etc/init.d/httpd stop 
Oct 22 18:31:31 server2 clurgmgrd: [1758]: <info> /dev/xvdb1 is not mounted 
Oct 22 18:36:54 server2 kernel: dlm: Can't start SCTP association - retrying
Oct 22 18:36:54 server2 kernel: dlm: Initiating association with node 1


Expected results:
clustat should list server1 as the owner of the ha www-service and should also
list the service as started, not as uninitialized.

Additional info:

cluster.conf:
<?xml version="1.0"?>
<cluster alias="www" config_version="3" name="alpha_cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="server1" votes="1" nodeid="1">
                        <fence/>
                </clusternode>
                <clusternode name="server2" votes="1" nodeid="2">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources>
                        <ip address="192.168.0.130" monitor_link="1"/>
                        <fs device="/dev/xvdb1" force_fsck="0" force_unmount="0" fsid="4677" fstype="ext3" mountpoint="/webdata" name="documentroot" options="" self_fence="0"/>
                        <script file="/etc/init.d/httpd" name="httpd"/>
                </resources>
                <service autostart="1" name="www">
                        <ip ref="192.168.0.130"/>
                        <fs ref="documentroot"/>
                        <script ref="httpd"/>
                </service>
        </rm>
</cluster>


This is a very simple cluster setup in a Xen environment.
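
A minimal sketch, assuming Python 3, of how the key settings in a cluster.conf like the one above could be summarized when comparing the two nodes. The function name `summarize_cluster_conf` and the shape of the returned dictionary are hypothetical; only the element and attribute names come from the configuration shown in this report.

```python
import xml.etree.ElementTree as ET


def summarize_cluster_conf(xml_text):
    """Return a small summary of a cluster.conf document given as a string."""
    root = ET.fromstring(xml_text)
    cman = root.find("cman")  # may be absent in other configurations
    return {
        # Cluster name from the <cluster name="..."> attribute.
        "name": root.get("name"),
        # Map each <clusternode> name to its integer nodeid.
        "nodes": {
            node.get("name"): int(node.get("nodeid"))
            for node in root.iter("clusternode")
        },
        # True when <cman two_node="1"/> is present.
        "two_node": cman is not None and cman.get("two_node") == "1",
        "expected_votes": (
            int(cman.get("expected_votes")) if cman is not None else None
        ),
        # Names of the <service> entries under <rm>.
        "services": [svc.get("name") for svc in root.findall("rm/service")],
    }
```

Running it on the same file from both nodes and comparing the results is one way to rule out a configuration mismatch before looking at the DLM messages.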
Comment 1 Lon Hohberger 2006-10-23 13:10:14 EDT
should be the same problem as 211701
Comment 2 Thorsten Scherf 2006-10-23 14:40:36 EDT
additional info: when calling s-c-cluster, it always states that this node is
not part of a cluster and the management tab disappears.

system-config-cluster-1.0.29-1.0
Comment 3 Lon Hohberger 2006-10-23 18:44:33 EDT

*** This bug has been marked as a duplicate of 211701 ***
Comment 4 Thorsten Scherf 2006-10-24 17:49:10 EDT
problem remains with rgmanager-2.0.15-1.i386.rpm.
Comment 5 Lon Hohberger 2006-10-26 16:48:54 EDT
Ok, I thought it was the same, but it's obviously different

Oct 22 18:36:54 server2 kernel: dlm: Can't start SCTP association - retrying
Oct 22 18:36:54 server2 kernel: dlm: Initiating association with node 1
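
A minimal sketch, assuming Python 3 and syslog lines in the format quoted in this report, of counting how often each node logs the SCTP retry message. The function name `count_sctp_retries` and the regular expression are illustrative assumptions, not part of any cluster tool.

```python
import re
from collections import Counter

# Matches syslog lines such as:
# "Oct 22 18:36:54 server2 kernel: dlm: Can't start SCTP association - retrying"
RETRY_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel: dlm: "
    r"Can't start SCTP association - retrying"
)


def count_sctp_retries(lines):
    """Count DLM SCTP association retry messages per host."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts
```

For example, `count_sctp_retries(open("/var/log/messages"))` would scan the local log on one node.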
Comment 6 Lon Hohberger 2006-10-26 16:55:29 EDT
This looks like a DLM bug which should have been fixed in the current cman package.
Comment 7 Lon Hohberger 2006-10-26 16:57:17 EDT
Does this happen with a current cman package?
Comment 8 Lon Hohberger 2006-10-26 17:23:45 EDT
Apparently this happens with current cman/dlm, but only in Xen guests.
Comment 10 Thorsten Scherf 2006-10-26 17:35:07 EDT
[root@server2 ~]# rpm -q cman
cman-2.0.18-2.fc6
[root@server2 ~]# rpm -q rgmanager
rgmanager-2.0.15-1 (from the rhel5 beta channel)
[root@server2 ~]# uname -r
2.6.18-1.2798.fc6xen
Comment 11 Christine Caulfield 2006-10-27 10:51:29 EDT
This is the same as #212472 which is open on the Xen kernel as #212550.

I'll leave this open as we may need to close it separately for FC6 & RHEL5
Comment 12 Neil Horman 2007-01-24 13:27:39 EST
Have you tried setting the sctp sndbuf_policy to 1 in /proc/sys/net/sctp?  The
error you are seeing looks like it could perhaps come from a limitation of
sendbuffer space.  Setting the policy to 1 will relax some of the memory
allocation limits placed on sctp sockets and allow you to more easily create
associations.
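
A minimal sketch of that suggestion, assuming Python 3 run as root on a node where the sctp module is loaded. The helper names `read_sndbuf_policy` and `set_sndbuf_policy` are hypothetical; only the `/proc/sys/net/sctp` location and the `sndbuf_policy` name come from the comment above. The directory is a parameter so the helpers can be tried against a scratch directory first.

```python
from pathlib import Path

# Location of the SCTP sysctls named in the comment above.
SCTP_SYSCTL_DIR = Path("/proc/sys/net/sctp")


def read_sndbuf_policy(sysctl_dir=SCTP_SYSCTL_DIR):
    """Return the current sndbuf_policy value as an integer."""
    return int((Path(sysctl_dir) / "sndbuf_policy").read_text().strip())


def set_sndbuf_policy(value, sysctl_dir=SCTP_SYSCTL_DIR):
    """Write sndbuf_policy and return the value read back afterwards."""
    (Path(sysctl_dir) / "sndbuf_policy").write_text(f"{int(value)}\n")
    return read_sndbuf_policy(sysctl_dir)
```

Calling `set_sndbuf_policy(1)` has the same effect as `echo 1 > /proc/sys/net/sctp/sndbuf_policy`. A value written this way does not persist across reboots.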
Comment 13 Matthew Miller 2007-04-06 15:13:44 EDT
Fedora Core 5 and Fedora Core 6 are, as we're sure you've noticed, no longer
test releases. We're cleaning up the bug database and making sure important bug
reports filed against these test releases don't get lost. It would be helpful if
you could test this issue with a released version of Fedora or with the latest
development / test release. Thanks for your help and for your patience.

[This is a bulk message for all open FC5/FC6 test release bugs. I'm adding
myself to the CC list for each bug, so I'll see any comments you make after this
and do my best to make sure every issue gets proper attention.]
Comment 14 Thorsten Scherf 2007-04-07 04:55:55 EDT
the bug seems to be fixed in fedora core 6. 
Comment 15 Matthew Miller 2007-04-07 08:04:06 EDT
thanks!
