Bug 212472 - dlm sctp errors with current rhel5 cman/dlm in xen guests
Status: CLOSED DUPLICATE of bug 212550
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Depends On: 212550
Reported: 2006-10-26 17:25 EDT by Lon Hohberger
Modified: 2009-04-16 18:29 EDT
Doc Type: Bug Fix
Last Closed: 2006-10-30 15:44:46 EST
Description Lon Hohberger 2006-10-26 17:25:20 EDT
+++ This bug was initially created as a clone of Bug #211777 +++

Description of problem:
Service did not relocate to another host (2-node setup).

Version-Release number of selected component (if applicable):
rgmanager-2.0.8-1.fc6

How reproducible:
2 node setup
apache ha
run clustat on node 1
run clustat on node 2


Steps to Reproduce:
1. 2-node HA setup
2. Run clustat on node 1
3. Run clustat on node 2
  
Actual results:
node:1
[root@server1 ~]# clustat -f
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                               1 Online, Local
  server2                               2 Online

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:www          server1                        started

/var/log/messages:
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: recover 1
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: add member 1
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: total members 1 error 0
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: dlm_recover_directory
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: dlm_recover_directory 0 entries
Oct 22 18:30:31 server1 kernel: dlm: rgmanager: recover 1 done: 0 ms
Oct 22 18:30:40 server1 clurgmgrd[1811]: <notice> Resource Group Manager Starting 
Oct 22 18:30:40 server1 clurgmgrd: [1811]: <info> Executing /etc/init.d/httpd stop 
Oct 22 18:30:41 server1 clurgmgrd: [1811]: <info> /dev/xvdb1 is not mounted 
Oct 22 18:30:44 server1 clurgmgrd[1811]: <notice> Starting stopped service service:www
Oct 22 18:30:44 server1 clurgmgrd: [1811]: <info> mounting /dev/xvdb1 on /webdata 
Oct 22 18:30:44 server1 kernel: kjournald starting.  Commit interval 5 seconds
Oct 22 18:30:44 server1 kernel: EXT3 FS on xvdb1, internal journal
Oct 22 18:30:44 server1 kernel: EXT3-fs: recovery complete.
Oct 22 18:30:44 server1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Oct 22 18:30:44 server1 clurgmgrd: [1811]: <info> Adding IPv4 address 192.168.0.130 to eth0
Oct 22 18:30:44 server1 avahi-daemon[1494]: Registering new address record for 192.168.0.130 on eth0.
Oct 22 18:30:46 server1 in.rdiscd[2110]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
Oct 22 18:30:46 server1 in.rdiscd[2110]: Failed joining addresses 
Oct 22 18:30:46 server1 clurgmgrd: [1811]: <info> Executing /etc/init.d/httpd start 
Oct 22 18:30:46 server1 clurgmgrd[1811]: <notice> Service service:www started 
Oct 22 18:30:54 server1 clurgmgrd: [1811]: <info> Executing /etc/init.d/httpd status
Oct 22 18:31:21 server1 kernel: dlm: rgmanager: recover 3
Oct 22 18:31:21 server1 kernel: dlm: rgmanager: add member 2
Oct 22 18:31:21 server1 kernel: dlm: Initiating association with node 2
Oct 22 18:36:54 server1 kernel: dlm: Can't start SCTP association - retrying
Oct 22 18:36:54 server1 kernel: dlm: Initiating association with node 2

node:2
[root@server2 ~]# clustat -f
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  server1                               1 Online
  server2                               2 Online, Local

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:www          none                           uninitialized

/var/log/messages:
Oct 22 18:29:25 server2 ccsd[1678]: Initial status:: Quorate 
Oct 22 18:31:21 server2 kernel: SCTP: Hash tables configured (established 1365 bind 1638)
Oct 22 18:31:21 server2 kernel: Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1189
Oct 22 18:31:21 server2 kernel: dlm: rgmanager: recover 1
Oct 22 18:31:21 server2 kernel: dlm: rgmanager: add member 1
Oct 22 18:31:21 server2 kernel: dlm: rgmanager: add member 2
Oct 22 18:31:21 server2 kernel: dlm: Initiating association with node 1
Oct 22 18:31:30 server2 clurgmgrd[1758]: <notice> Resource Group Manager Starting 
Oct 22 18:31:31 server2 clurgmgrd: [1758]: <info> Executing /etc/init.d/httpd stop 
Oct 22 18:31:31 server2 clurgmgrd: [1758]: <info> /dev/xvdb1 is not mounted 
Oct 22 18:36:54 server2 kernel: dlm: Can't start SCTP association - retrying
Oct 22 18:36:54 server2 kernel: dlm: Initiating association with node 1
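
The retry loop visible in both logs can be picked out mechanically. The following is a minimal sketch, assuming syslog lines shaped exactly like the excerpts above; the function name `summarize_dlm_sctp` and the regular expression are illustrative and not part of cman, dlm, or rgmanager.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the excerpts above, e.g.
# "Oct 22 18:36:54 server1 kernel: dlm: Can't start SCTP association - retrying"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+(?P<host>\S+)\s+kernel:\s+dlm:\s+(?P<msg>.*)$"
)
PEER_RE = re.compile(r"Initiating association with node (\d+)")

# Sample input copied from the two logs in this report.
SAMPLE_LOG = [
    "Oct 22 18:31:21 server1 kernel: dlm: rgmanager: add member 2",
    "Oct 22 18:31:21 server1 kernel: dlm: Initiating association with node 2",
    "Oct 22 18:36:54 server1 kernel: dlm: Can't start SCTP association - retrying",
    "Oct 22 18:36:54 server1 kernel: dlm: Initiating association with node 2",
    "Oct 22 18:31:21 server2 kernel: dlm: Initiating association with node 1",
    "Oct 22 18:36:54 server2 kernel: dlm: Can't start SCTP association - retrying",
    "Oct 22 18:36:54 server2 kernel: dlm: Initiating association with node 1",
]

def summarize_dlm_sctp(lines):
    """Count association attempts and SCTP retry errors per logging host."""
    summary = defaultdict(lambda: {"attempts": 0, "retries": 0, "peers": set()})
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a kernel dlm line
        entry = summary[match.group("host")]
        msg = match.group("msg")
        peer = PEER_RE.match(msg)
        if peer:
            entry["attempts"] += 1
            entry["peers"].add(int(peer.group(1)))
        elif msg.startswith("Can't start SCTP association"):
            entry["retries"] += 1
    return dict(summary)

report = summarize_dlm_sctp(SAMPLE_LOG)
for host in sorted(report):
    info = report[host]
    print(host, "attempts:", info["attempts"], "retries:", info["retries"],
          "peers:", sorted(info["peers"]))
```

A host that keeps logging attempts and retries against the same peer without ever finishing recovery matches the symptom reported here.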


Expected results:
clustat should list server1 as the owner of the HA www service and should also
list the service as started, not as uninitialized.

Additional info:

cluster.conf:
<?xml version="1.0"?>
<cluster alias="www" config_version="3" name="alpha_cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="server1" votes="1" nodeid="1">
                        <fence/>
                </clusternode>
                <clusternode name="server2" votes="1" nodeid="2">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources>
                        <ip address="192.168.0.130" monitor_link="1"/>
                        <fs device="/dev/xvdb1" force_fsck="0" force_unmount="0" fsid="4677" fstype="ext3" mountpoint="/webdata" name="documentroot" options="" self_fence="0"/>
                        <script file="/etc/init.d/httpd" name="httpd"/>
                </resources>
                <service autostart="1" name="www">
                        <ip ref="192.168.0.130"/>
                        <fs ref="documentroot"/>
                        <script ref="httpd"/>
                </service>
        </rm>
</cluster>
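
To check what a configuration like this declares (node IDs, two-node mode, and which resources the service references), it can be read with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `describe_cluster` is hypothetical, and the sample is an abbreviated copy of the file above rather than a complete cluster.conf.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the cluster.conf from this report (resource
# definitions omitted for brevity).
SAMPLE_CONF = """<?xml version="1.0"?>
<cluster alias="www" config_version="3" name="alpha_cluster">
  <clusternodes>
    <clusternode name="server1" votes="1" nodeid="1"><fence/></clusternode>
    <clusternode name="server2" votes="1" nodeid="2"><fence/></clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <rm>
    <service autostart="1" name="www">
      <ip ref="192.168.0.130"/>
      <fs ref="documentroot"/>
      <script ref="httpd"/>
    </service>
  </rm>
</cluster>
"""

def describe_cluster(xml_text):
    """Return cluster name, node IDs, two_node flag, and service references."""
    root = ET.fromstring(xml_text.strip())
    cman = root.find("cman")
    nodes = {n.get("name"): int(n.get("nodeid")) for n in root.iter("clusternode")}
    services = {}
    for svc in root.findall("rm/service"):
        # Each child of <service> references a resource by its "ref" attribute.
        services[svc.get("name")] = [(child.tag, child.get("ref")) for child in svc]
    return {
        "name": root.get("name"),
        "two_node": cman is not None and cman.get("two_node") == "1",
        "nodes": nodes,
        "services": services,
    }

info = describe_cluster(SAMPLE_CONF)
print(info["name"], info["nodes"], "two_node:", info["two_node"])
print(info["services"])
```

The node IDs it reports (1 and 2) are the same numbers the dlm messages refer to in "Initiating association with node N".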


This is a very simple cluster setup in a Xen environment.

-- Additional comment from lhh@redhat.com on 2006-10-23 13:10 EST --
Should be the same problem as bug 211701.

-- Additional comment from tscherf@redhat.com on 2006-10-23 14:40 EST --
Additional info: when calling s-c-cluster, it always states that this node is
not part of a cluster, and the management tab disappears.

system-config-cluster-1.0.29-1.0


-- Additional comment from lhh@redhat.com on 2006-10-23 18:44 EST --


*** This bug has been marked as a duplicate of 211701 ***

-- Additional comment from tscherf@redhat.com on 2006-10-24 17:49 EST --
problem remains with rgmanager-2.0.15-1.i386.rpm.


-- Additional comment from lhh@redhat.com on 2006-10-26 16:48 EST --
Ok, I thought it was the same, but it's obviously different

Oct 22 18:36:54 server2 kernel: dlm: Can't start SCTP association - retrying
Oct 22 18:36:54 server2 kernel: dlm: Initiating association with node 1

-- Additional comment from lhh@redhat.com on 2006-10-26 16:55 EST --
This looks like a DLM bug which should have been fixed in the current cman package.

-- Additional comment from lhh@redhat.com on 2006-10-26 16:57 EST --
Does this happen with a current cman package?

-- Additional comment from lhh@redhat.com on 2006-10-26 17:23 EST --
Apparently this happens with current cman/dlm, but only in Xen guests.

-- Additional comment from lhh@redhat.com on 2006-10-26 17:24 EST --
Note: This happens with rgmanager and gfs; it's not specific to one or the
other.  This has been reproduced on rhel5 with gfs and rgmanager.
Comment 1 Lon Hohberger 2006-10-26 17:27:10 EDT
This causes rgmanager to hang as well as gfs.
Comment 2 Kiersten (Kerri) Anderson 2006-10-26 17:34:42 EDT
Setting blocker-beta request flags and Devel ACK.

Also seeing the same errors when trying to mount a filesystem on the second
node.  The first node mounts fine, but the dlm seems to have a problem initiating
the connection to the first node from the second node.  Xen nodes kanderso-xen-03
and kanderso-xen-04 are currently in this state, with kanderso-xen-03 being the
first mounter, and kanderso-xen-04 hanging on its mount.gfs.

Getting the following messages on the console for kanderso-xen-03:

[root@kanderso-xen-03 ~]# dlm: clusterfs-0: recover 3
dlm: clusterfs-0: add member 4
dlm: Initiating association with node 4

[root@kanderso-xen-03 ~]# cman_tool services
type             level name         id       state
fence            0     default      00010003 none
[1 2 3 4 5 6 7 8 9]
dlm              1     clusterfs-0  00030003 none
[3 4]
gfs              2     clusterfs-0  00020003 none
[3 4]
[root@kanderso-xen-03 ~]# dlm: Can't start SCTP association - retrying
dlm: Initiating association with node 4
dlm: Can't start SCTP association - retrying
dlm: Initiating association with node 4

[root@kanderso-xen-03 ~]# dlm: Can't start SCTP association - retrying
dlm: Initiating association with node 4
dlm: Can't start SCTP association - retrying
dlm: Initiating association with node 4
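
The `cman_tool services` listing above shows both nodes 3 and 4 as members of the dlm and gfs groups even though the association never completes. A minimal sketch for turning that listing into structured data follows; it assumes the five-column layout and bracketed member lines exactly as printed in this comment (other cman versions may print differently), and `parse_cman_services` is a hypothetical helper name.

```python
# Sample text copied from the console output in this comment.
SAMPLE_SERVICES = """\
type             level name         id       state
fence            0     default      00010003 none
[1 2 3 4 5 6 7 8 9]
dlm              1     clusterfs-0  00030003 none
[3 4]
gfs              2     clusterfs-0  00020003 none
[3 4]
"""

HEADER = ["type", "level", "name", "id", "state"]

def parse_cman_services(text):
    """Parse the listing into dicts with type, level, name, id, state, members."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # Member line: belongs to the most recently seen group.
            if groups:
                groups[-1]["members"] = [int(x) for x in line[1:-1].split()]
            continue
        fields = line.split()
        if fields == HEADER or len(fields) != 5:
            continue  # column header or an unrecognized line
        gtype, level, name, gid, state = fields
        groups.append({
            "type": gtype,
            "level": int(level),
            "name": name,
            "id": gid,
            "state": state,
            "members": [],
        })
    return groups

for group in parse_cman_services(SAMPLE_SERVICES):
    print(group["type"], group["name"], group["members"])
```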

I will leave the Xen cluster nodes up overnight if you want to play with them.
They are hosted on the heater-01 and heater-02 machines, and are running the
latest RPMs as of yesterday's build.


Comment 3 RHEL Product and Program Management 2006-10-26 17:45:39 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.
Comment 5 Christine Caulfield 2006-10-27 08:30:27 EDT
It seems that lksctp-tools also exhibits this problem, so I've raised
bz#212550 against the kernel.
Comment 6 Paul Gampe 2006-10-27 18:45:00 EDT
Approved as beta blocker during today's meeting.
Comment 7 Christine Caulfield 2006-10-30 05:06:30 EST
Herbert Xu has posted a patch for this in the kernel bz entry:

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=139699
Comment 8 Kiersten (Kerri) Anderson 2006-10-30 15:44:46 EST
Closing this one as a duplicate of bz 212550.  Once the referenced patch was
applied, I was able to get my Xen cluster running and GFS filesystems mounted.

*** This bug has been marked as a duplicate of 212550 ***
Comment 9 Nate Straz 2007-12-13 12:22:12 EST
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.
