Bug 722469 - [RFE] Add support for redundant ring for standalone Corosync (not RHEL HA/Cluster stack)
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: low
Target Milestone: rc
Target Release: ---
Assigned To: Jan Friesse
QA Contact: Cluster QE
Keywords: FutureFeature, TechPreview
Duplicates: 504022
Depends On:
Blocks: 732635 733298 743047 758821 758823
Reported: 2011-07-15 08:58 EDT by Jan Friesse
Modified: 2016-04-26 09:37 EDT (History)
16 users

See Also:
Fixed In Version: corosync-1.4.1-3.el6
Doc Type: Technology Preview
Doc Text:
TechPreview known issues: - A double ring failure will result in spinning of the corosync process. - The corosync redundant ring appears to meet our technology preview quality requirements, but DLM relies on SCTP, which is nonfunctional; as a result, many features of the cluster software that rely on DLM do not work properly.
Story Points: ---
Clone Of:
: 732635
Environment:
Last Closed: 2011-12-06 06:51:18 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
handle rollover in active rrp properly (2.21 KB, patch)
2011-07-19 11:12 EDT, Jan Friesse
no flags Details | Diff
Handle rollover in passive rrp properly (6.96 KB, patch)
2011-07-19 11:13 EDT, Jan Friesse
no flags Details | Diff
redundant ring automatic recovery (18.55 KB, patch)
2011-07-19 11:13 EDT, Jan Friesse
no flags Details | Diff
totemconfig: Check interfaces address integrity (1.68 KB, patch)
2011-08-19 06:07 EDT, Jan Friesse
no flags Details | Diff
notes taken during redundant ring testing (12.39 KB, text/plain)
2011-08-25 11:49 EDT, Jaroslav Kortus
no flags Details
cpgbench statistics snip from RR recovery testing (4.08 KB, text/plain)
2011-08-25 11:49 EDT, Jaroslav Kortus
no flags Details
Handle endless loop if all ifaces are faulty (3.04 KB, patch)
2011-08-29 09:28 EDT, Jan Friesse
no flags Details | Diff
rrp: Higher threshold in passive mode for mcast (5.46 KB, patch)
2011-08-29 09:28 EDT, Jan Friesse
no flags Details | Diff
Patch which allows threshold setting < 5 (842 bytes, patch)
2011-09-08 04:12 EDT, Jan Friesse
no flags Details | Diff


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 61832 None None None Never

Description Jan Friesse 2011-07-15 08:58:27 EDT
Description of problem:
Add Technology Preview support for redundant ring with the autorecovery feature.
Comment 8 Jan Friesse 2011-07-19 11:12:40 EDT
Created attachment 513826 [details]
handle rollover in active rrp properly
Comment 9 Jan Friesse 2011-07-19 11:13:03 EDT
Created attachment 513827 [details]
Handle rollover in passive rrp properly
Comment 10 Jan Friesse 2011-07-19 11:13:23 EDT
Created attachment 513828 [details]
redundant ring automatic recovery
Comment 12 Jan Friesse 2011-07-29 09:27:21 EDT
How to setup RRP:
Using only corosync:
1.) Copy 

interface {
          ringnumber: 0
          bindnetaddr: 1.2.3.x
          mcastaddr: 226.1.2.3
          mcastport: 5401
}

and change ringnumber to a higher value (such as 1), adjust bindnetaddr, and so on, so that in the end there are two interface sections (one for each NIC).

2.) Add 

rrp_mode: [passive|active]

into the totem section.
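Taken together, the two steps above might produce a totem section roughly like this minimal sketch (the ring 1 addresses are illustrative placeholders for the second NIC's network; each ring should have its own bindnetaddr and mcastaddr):

```
totem {
          version: 2
          rrp_mode: passive

          interface {
                    ringnumber: 0
                    bindnetaddr: 1.2.3.0
                    mcastaddr: 226.1.2.3
                    mcastport: 5401
          }
          interface {
                    ringnumber: 1
                    bindnetaddr: 1.2.4.0
                    mcastaddr: 226.1.2.4
                    mcastport: 5405
          }
}
```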

Using cman:
https://fedorahosted.org/cluster/wiki/MultiHome

<clusternode name="node1" votes="1" nodeid="1"> 
    <altname name="node1a"/>
    <fence> 
        <method name="single"> 
            <device name="apc" port="1"/> 
        </method> 
    </fence> 
</clusternode> 

or

<clusternode name="node1" votes="1" nodeid="1"> 
    <altname name="node1a" port="6899" mcast="239.192.99.27"/>
    <fence> 
        <method name="single"> 
            <device name="apc" port="1"/> 
        </method> 
    </fence> 
</clusternode> 

This will set up active rrp mode automatically.

After cman is started,

corosync-cfgtool -s

displays information about two or more rings.

Autorecovery tests:
-------------------
In either passive or active rrp_mode, corosync should be able to survive:
- removal/iptables blocking/... of one NIC, sending all traffic through the second one
- removal/failure/... of the switch for a NIC, sending all traffic through the second one

The status of the rings is displayed in the log and also by

corosync-cfgtool -s

Autorecovery must be able to recover the ring as soon as the NIC is in a working state again (re-add/iptables unblock/... of the NIC and/or replacement/repair of the switch).
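A hypothetical helper for the iptables-based failure simulation mentioned above (block_ring, unblock_ring, and the DRY_RUN switch are my own illustrative names, not part of any test suite). Blocking both input and output for the ring's address avoids the pitfalls of ifdown; with DRY_RUN=1 (the default here) the commands are only printed, and you would run as root with DRY_RUN=0 to actually apply them:

```shell
# Simulate a ring NIC failure by dropping all traffic for its address.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

block_ring() {  # $1 = IP the ring interface is bound to, e.g. 192.168.1.4
    run iptables -A INPUT -d "$1" -j DROP
    run iptables -A OUTPUT -s "$1" -j DROP
}

unblock_ring() {  # remove the rules so autorecovery can restore the ring
    run iptables -D INPUT -d "$1" -j DROP
    run iptables -D OUTPUT -s "$1" -j DROP
}

block_ring 192.168.1.4    # simulate ring failure
unblock_ring 192.168.1.4  # let autorecovery bring the ring back
```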

Rollover test (almost impossible to test):
------------------------------------------
There are constants in totemrrp.c that control this, but changing them requires a change in the source.

Without such a change, the only way to test that rollover works correctly is to send 2^32 packets; on a non-faulty network/computer/switch there should be no strange behavior such as lost membership or the recovery algorithm running (totem recovery, not rrp autorecovery!). In other words, everything should work as if no rollover had ever happened.
Comment 18 Jaroslav Kortus 2011-08-18 08:56:43 EDT
There are some errors at startup:

Aug 18 07:42:54 marathon-02 corosync[2480]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 18 07:42:57 marathon-02 fenced[2551]: fenced 3.0.12.1 started
Aug 18 07:42:57 marathon-02 dlm_controld[2577]: dlm_controld 3.0.12.1 started
Aug 18 07:42:57 marathon-02 dlm_controld[2577]: /sys/kernel/config/dlm/cluster/comms/1: mkdir failed: 17
Aug 18 07:42:57 marathon-02 dlm_controld[2577]: /sys/kernel/config/dlm/cluster/comms/2: mkdir failed: 17
Aug 18 07:42:57 marathon-02 dlm_controld[2577]: /sys/kernel/config/dlm/cluster/comms/3: mkdir failed: 17
Aug 18 07:42:57 marathon-02 dlm_controld[2577]: /sys/kernel/config/dlm/cluster/comms/4: mkdir failed: 17
Aug 18 07:42:57 marathon-02 dlm_controld[2577]: /sys/kernel/config/dlm/cluster/comms/5: mkdir failed: 17
Aug 18 07:42:58 marathon-02 gfs_controld[2625]: gfs_controld 3.0.12.1 started

This happens on each node of the cluster.

Later when I shut down one interface on one node that belongs to RR (ring1) the following starts appearing in logs:
Aug 18 07:45:53 marathon-04 corosync[2488]:   [TOTEM ] Incrementing problem counter for seqid 2002 iface 192.168.1.4 to [1 of 10]
Aug 18 07:45:55 marathon-04 corosync[2488]:   [TOTEM ] ring 1 active with no faults
Aug 18 07:45:57 marathon-04 corosync[2488]:   [TOTEM ] Incrementing problem counter for seqid 2007 iface 192.168.1.4 to [1 of 10]
Aug 18 07:45:59 marathon-04 corosync[2488]:   [TOTEM ] ring 1 active with no faults
Aug 18 07:46:02 marathon-04 corosync[2488]:   [TOTEM ] Incrementing problem counter for seqid 2012 iface 192.168.1.4 to [1 of 10]
Aug 18 07:46:04 marathon-04 corosync[2488]:   [TOTEM ] ring 1 active with no faults
[... repeats forever ...]

This situation results in very degraded performance:
from:
54064 messages received     1 bytes per write  10.008 Seconds runtime  5402.308 TP/s   0.005 MB/s
to:
  222 messages received     1 bytes per write  12.755 Seconds runtime    17.406 TP/s   0.000 MB/s
(cpgbench from sts-rhel6.2)

If in this state one other node is shut down (echo b > /proc/sysrq-trigger), the cluster freezes after 'corosync[2480]:   [TOTEM ] A processor failed, forming new configuration.' and does not recover. My expectation here would be to form the original configuration via ring0, drop ring1, and fence the failing node.


$ corosync-fplay 
failed to open /var/lib/corosync/fdata: No such file or directory

Do I need anything special to trigger the data collection? It would be quite handy to have it on by default :).

Due to all the facts above I'm switching this back to ASSIGNED.
Comment 19 Jan Friesse 2011-08-18 09:12:59 EDT
(In reply to comment #18)
> There are some errors at startup:
> 
> Aug 18 07:42:54 marathon-02 corosync[2480]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Aug 18 07:42:57 marathon-02 fenced[2551]: fenced 3.0.12.1 started
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]: dlm_controld 3.0.12.1 started
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/1: mkdir failed: 17
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/2: mkdir failed: 17
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/3: mkdir failed: 17
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/4: mkdir failed: 17
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/5: mkdir failed: 17

This seems to be a DLM problem, not a corosync one. Can you please try pure corosync?

> Aug 18 07:42:58 marathon-02 gfs_controld[2625]: gfs_controld 3.0.12.1 started
> 
> This happens on each node of the cluster.
> 
> Later when I shut down one interface on one node that belongs to RR (ring1) the
> following starts appearing in logs:
> Aug 18 07:45:53 marathon-04 corosync[2488]:   [TOTEM ] Incrementing problem
> counter for seqid 2002 iface 192.168.1.4 to [1 of 10]
> Aug 18 07:45:55 marathon-04 corosync[2488]:   [TOTEM ] ring 1 active with no
> faults
> Aug 18 07:45:57 marathon-04 corosync[2488]:   [TOTEM ] Incrementing problem
> counter for seqid 2007 iface 192.168.1.4 to [1 of 10]
> Aug 18 07:45:59 marathon-04 corosync[2488]:   [TOTEM ] ring 1 active with no
> faults
> Aug 18 07:46:02 marathon-04 corosync[2488]:   [TOTEM ] Incrementing problem
> counter for seqid 2012 iface 192.168.1.4 to [1 of 10]
> Aug 18 07:46:04 marathon-04 corosync[2488]:   [TOTEM ] ring 1 active with no
> faults
> [... repeats forever ...]
> 

This is not normal. It looks like an automatic recovery message is received... this shouldn't happen if the interface is really down. How did you shut down that interface?

> This situation results in very degraded performance:
> from:
> 54064 messages received     1 bytes per write  10.008 Seconds runtime  5402.308
> TP/s   0.005 MB/s
> to:
>   222 messages received     1 bytes per write  12.755 Seconds runtime    17.406
> TP/s   0.000 MB/s
> (cpgbench from sts-rhel6.2)
> 

This is expected

> If in this state one other node is shut down (echo b> /proc/sysrq-trigger) the
> cluster freezes after 'corosync[2480]:   [TOTEM ] A processor failed, forming
> new configuration.' and does not recover. My expectation here would be to form
> original configuration via ring0, drop ring1 and fence the failing node.
> 

This is how it should work.

> 
> $ corosync-fplay 
> failed to open /var/lib/corosync/fdata: No such file or directory
> 
> Do I need anything special to trigger the data collection? Would be quite handy
> to have it on by default :).

corosync-fplay is for playing the fdata file (maybe that is why it is called fplay, not fcreate), not for creating it. Creating is done by corosync-blackbox.

> 
> Due to all the facts above I'm switching this back to ASSIGNED.
Comment 20 Steven Dake 2011-08-18 11:34:53 EDT
(In reply to comment #18)
> There are some errors at startup:
> 
> Aug 18 07:42:54 marathon-02 corosync[2480]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Aug 18 07:42:57 marathon-02 fenced[2551]: fenced 3.0.12.1 started
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]: dlm_controld 3.0.12.1 started
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/1: mkdir failed: 17
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/2: mkdir failed: 17
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/3: mkdir failed: 17
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/4: mkdir failed: 17
> Aug 18 07:42:57 marathon-02 dlm_controld[2577]:
> /sys/kernel/config/dlm/cluster/comms/5: mkdir failed: 17
> Aug 18 07:42:58 marathon-02 gfs_controld[2625]: gfs_controld 3.0.12.1 started
> 
> This happens on each node of the cluster.
> 

This appears to be a dlm problem.  Does this occur on non-RR setups?

> Later when I shut down one interface on one node that belongs to RR (ring1) the
> following starts appearing in logs:
> Aug 18 07:45:53 marathon-04 corosync[2488]:   [TOTEM ] Incrementing problem
> counter for seqid 2002 iface 192.168.1.4 to [1 of 10]
> Aug 18 07:45:55 marathon-04 corosync[2488]:   [TOTEM ] ring 1 active with no
> faults
> Aug 18 07:45:57 marathon-04 corosync[2488]:   [TOTEM ] Incrementing problem
> counter for seqid 2007 iface 192.168.1.4 to [1 of 10]
> Aug 18 07:45:59 marathon-04 corosync[2488]:   [TOTEM ] ring 1 active with no
> faults
> Aug 18 07:46:02 marathon-04 corosync[2488]:   [TOTEM ] Incrementing problem
> counter for seqid 2012 iface 192.168.1.4 to [1 of 10]
> Aug 18 07:46:04 marathon-04 corosync[2488]:   [TOTEM ] ring 1 active with no
> faults
> [... repeats forever ...]
> 

ifconfig downing the interface won't work as a simulation method.  Both input and output packets on the unicast socket must be blocked.  I don't understand how you got this result - please post your scripts, as there are 3 Tested-by signoffs on the patch and I tested prior to merge.

> This situation results in very degraded performance:
> from:
> 54064 messages received     1 bytes per write  10.008 Seconds runtime  5402.308
> TP/s   0.005 MB/s
> to:
>   222 messages received     1 bytes per write  12.755 Seconds runtime    17.406
> TP/s   0.000 MB/s
> (cpgbench from sts-rhel6.2)
> 
> If in this state one other node is shut down (echo b> /proc/sysrq-trigger) the
> cluster freezes after 'corosync[2480]:   [TOTEM ] A processor failed, forming
> new configuration.' and does not recover. My expectation here would be to form
> original configuration via ring0, drop ring1 and fence the failing node.
> 

Need more info on how you got into this state - the cluster should always recover if a node is terminated.

As for reforming on the secondary ring, this wouldn't work because the secondary ring also goes through the terminated node.  If the RFE is to rebuild a ring to eliminate a processor that has a failed network interface, this also wouldn't work because the entire network switch could fail, essentially eliminating all processors.
 
> 
> $ corosync-fplay 
> failed to open /var/lib/corosync/fdata: No such file or directory
> 
> Do I need anything special to trigger the data collection? Would be quite handy
> to have it on by default :).
> 

Agreed - I think having the blackbox always record data to disk is helpful and something we could consider for RHEL 7.  I had looked into this at one point, relying on madvise MADV_DONTNEED to provide good throughput to the backing store, but I think I couldn't get it to work reliably.  I'll CC Angus to comment on the feasibility of this method with libqb in RHEL 7.
> Due to all the facts above I'm switching this back to ASSIGNED.
Comment 21 Steven Dake 2011-08-18 11:36:18 EDT
Angus,

Please comment on feasibility of last comment in Comment #20 for a libqb RFE.
Comment 22 Jan Friesse 2011-08-18 11:38:46 EDT
OK, so after a not-so-brief look at the problem, here are a few issues:
- cman sets active mode. At least (if I remember correctly) I was told that passive mode is what we are interested in.
- cman sets token to 10000, but our rrp_problem_count_timeout is 2000. In other words, before the next token is lost, the problem counter is decremented -> the interface is never marked as faulty -> the cluster freezes.
- In passive mode, rrp_problem_count_threshold is too small for current networks. Running cpgbench can mark a non-faulty device as faulty.
- Running active mode on an inactive device causes problems because of the bind to 127.0.0.1.
- After all interfaces are marked dead, there is no auto recovery (why?)

So the proposed solution:
- Make passive the default
- Set rrp_problem_count_threshold to some higher value like 30
- This makes the cluster a little slower before an interface is marked dead, but things keep moving forward

Or another solution:
- Increase rrp_problem_count_timeout to 2*token. This means that recovery will take +- 100 sec (really almost 2 minutes), and the whole cluster will look totally frozen for those 2 minutes.
- We can make rrp_problem_count_threshold lower (like 5 or so) to improve things. But the question is: is 50 sec really a win?
- Disallow binding to 127.0.0.1 in active mode.

In both cases, somehow solve the last point (if possible).

I hope it's more than clear which solution is preferred.
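For reference, the first proposed solution would correspond to a totem section roughly like the following sketch (values taken from the proposal above, with token as set by cman; treat this as an illustration, not a shipped default):

```
totem {
          token: 10000
          rrp_mode: passive
          # higher than the old default so load such as cpgbench
          # does not mark a healthy interface as faulty
          rrp_problem_count_threshold: 30
}
```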
Comment 23 Angus Salkeld 2011-08-18 19:06:17 EDT
(In reply to comment #21)
> Angus,
> 
> Please comment on feasibility of last comment in Comment #20 for a libqb RFE.

Currently "corosync-blackbox" triggers corosync to write the flight
data to a file. We could:
1) add an option to corosync-fplay (say -l for "live") to get live
   output.
2) as you suggest, write every log entry inline to a file - yikes, I'd
   need to experiment to see what the impact would be.
Comment 24 Fabio Massimo Di Nitto 2011-08-19 02:35:55 EDT
(In reply to comment #22)
> Ok so after not so brief look to problem, here are few problems:
> - cman sets active mode. At least (if I remember it correctly) I was told that
> passive mode is what we are interested in.
> - cman sets token to 10000, but our rrp_problem_count_timeout is 2000. In other
> words, sooner then next token is lost, problem counter is decremented ->
> interface is never ever marked as faulty -> froze of cluster.
> - passive mode rrp_problem_count_threshold is too small for current networks.
> Running cpgbench can mark as faulty non faulty device
> - running active mode on intactive device causes problem because of bind to
> 127.0.0.1
> - After all interfaces are marked dead, there is no auto recovery (why?)
> 
> So proposed solution:
> - Make passive default
> - Set rrp_problem_count_threshold to some higher value lake 30
> - This makes cluster little slower before interface is marked dead, but things
> are still going forward
> 
> Or another solution:
> - Increase rrp_problem_count_timeout to 2*token. This means, that recovery will
> take +- 100sec (really almost 2 minutes), and whole cluster will look like
> totally frozen for that 2 minutes.
> - we can make rrp_problem_count_threshold lower (like 5 or so) to make things
> better. But question is, is really 50 sec win?
> - disallow binding to 127.0.0.1 on active mode.
> 
> In both of them, solve somehow (if possible) last point.
> 
> I hope that it's more then clear which solution is preferred.

I am OK with whatever change is required in cman as long as:

1) I get a BZ with the final solution requirements ASAP
2) to make my life a bit simpler, you tell me which objdb keys I have to set, and how :)
Comment 25 Jan Friesse 2011-08-19 06:07:31 EDT
Created attachment 518999 [details]
totemconfig: Check interfaces address integrity

Two interfaces (in RRP mode) shouldn't have equal unicast or
multicast addresses.

This handles a few problems:
- almost no false positives with cpgbench running
- recovery/failure is detected faster and much more reliably

In my testing there is one huge problem, and it's the automatic binding to 127.0.0.1. This feature may work without RRP, but it's pretty problematic with RRP. So a few more questions:
- What is that autobinding good for?
- Do we really need the ability to autobind to 127.0.0.1 on start? (Simply because after start it looks like there is no problem at all with this.)
Comment 26 Steven Dake 2011-08-19 13:59:50 EDT
Autobinding is a problem and I am hopeful we can rid ourselves of the need for it in the future.  The reason it is needed today is that if a user runs corosync without an active interface, corosync messaging won't work at all.  Ideally it should work, which is why it binds to 127.0.0.1.  What should really happen instead is that messaging should automatically be transported through loopback if no active interface is available, as if it were a single 127.0.0.1 node.

We do not allow ifdown of the active interface in the cluster suite or corosync in typical environments.  If you have a proposal for addressing these needs, please file a separate RFE.

Regards
-steve
Comment 29 Steven Dake 2011-08-22 15:22:04 EDT
Honza,

Please file a new RFE for the integrity verification patch in comment #25.  We will address the new rfe for 6.3.  A proper environment should have network isolation since that is the entire motive of this feature.

Regards
-steve
Comment 32 Jan Friesse 2011-08-23 03:24:17 EDT
(In reply to comment #29)
> Honza,
> 
> Please file a new RFE for the integrity verification patch in comment #25.  We
> will address the new rfe for 6.3.  A proper environment should have network
> isolation since that is the entire motive of this feature.
> 
> Regards
> -steve

Steve,
Steve,
I agree with moving the verification patch to 6.3, but you are not right about the network isolation feature. The main problem is the following:

- a node with RRP has 2 NICs and two listening multicast sockets - two fds
- if these two mcast sockets have the same mcast address, every message sent to that multicast address is also delivered back on the second fd
- no physical separation will prevent this issue
- the reason is mcast loopback

On the other hand, it's not a big deal; it just slows down failure detection.
Comment 33 Steven Dake 2011-08-23 11:20:33 EDT
Honza,

You're correct - my apologies for the error.  I believe we can push this to 6.3 or fix it in the cman component in the short term.
Comment 34 Jaroslav Kortus 2011-08-25 11:49:00 EDT
Created attachment 519912 [details]
notes taken during redundant ring testing
Comment 35 Jaroslav Kortus 2011-08-25 11:49:54 EDT
Created attachment 519915 [details]
cpgbench statistics snip from RR recovery testing
Comment 36 Jaroslav Kortus 2011-08-25 11:52:31 EDT
1. Start
Starting is OK, but it sometimes results in delays that were not present before.
Starting cman also makes all rings non-Faulty on most (but not all) cluster nodes.

2. Normal run
During normal run there are messages that should not appear (recovering
non-faulty ring).

3. Recovery 
Corosync is able to recover from failures while running cpgbench.
However, there are cases that need extra care (see below).

3.1 Recovery of ring 1
This one was running as expected, successfully handling failures of ring 1
during both cpgbench and QUICKHIT tests (d_io, dd_io)

3.2 Recovery of ring 0
This one was handled well only with the cpgbench test. Mounting a GFS2 filesystem
with failed ring 0 was not possible (it looks like the traffic is still directed
to the failing ring).

Another interesting effect was a VERY big increase in throughput while
ring 0 was down (5-10 times). This also needs extra investigation, to determine
whether there are some artificial blocks that can be avoided or something is not
done properly when ring 0 is down.

3.3 Recovery after last ring is broken
This behaved as expected, fencing the first node to break the last functional ring.

3.4 Both rings failure
This was not handled correctly and so far is the most serious failure.
The cluster finally decides for one ring and forms new membership without the
node that had it failed. At this time fencing event was expected, but it did
not happen. It was fenced much later when it rejoined the cluster after I rebooted it.

I've seen similar behaviour when corosync forms new memberships and the missing
nodes are not fenced; these cases were usually connected with packet loss.

Please see the attached notes with more detail.

For the reasons described, I'm switching this back to ASSIGNED.
Comment 37 Jaroslav Kortus 2011-08-25 11:55:09 EDT
Tested with corosync-1.4.1-1.el6.x86_64, default settings according to the fedorahosted multihome manual (i.e. no distinct mcast addresses). All network blocks/failures were done with iptables (so no ifdown this time).
Comment 38 Steven Dake 2011-08-25 12:43:54 EDT
Jaroslav,

Thanks for the notes.
Comment 40 Jan Friesse 2011-08-26 02:52:06 EDT
Jardo,
this is a scratch build of cluster with the patch from Fabio:
https://brewweb.devel.redhat.com/taskinfo?taskID=3585102

Can you please retest as soon as possible? (Especially because Florian Haas, Jiaju Zhang, and I were testing corosync rrp heavily, so the problem simply MUST be the same mcast address for both NICs.)

Thanks for this extra work.
Comment 41 Fabio Massimo Di Nitto 2011-08-26 03:19:49 EDT
(In reply to comment #40)
> Jardo,
> this is scratch build of cluster with patch from Fabbio:
> https://brewweb.devel.redhat.com/taskinfo?taskID=3585102
> 
> Can you please retest as soon as possible? (especially because I + Florian Haas
> and Jiaju Zhang ware testing corosync rrp heavily, so problem simply MUST be
> with same mcast addresses for NIC).
> 
> Thanks for this extra work.

Is this the "one mcast address per ring" patch?
Comment 42 Jan Friesse 2011-08-26 03:36:36 EDT
(In reply to comment #41)
> (In reply to comment #40)
> > Jardo,
> > this is scratch build of cluster with patch from Fabbio:
> > https://brewweb.devel.redhat.com/taskinfo?taskID=3585102
> > 
> > Can you please retest as soon as possible? (especially because I + Florian Haas
> > and Jiaju Zhang ware testing corosync rrp heavily, so problem simply MUST be
> > with same mcast addresses for NIC).
> > 
> > Thanks for this extra work.
> 
> Is this the "one mcast address per ring" patch?

Yes, it is. Simply because "same mcast for all rings" is the last thing I know of that can cause problems.
Comment 43 Jaroslav Kortus 2011-08-26 08:40:57 EDT
Well, I have just tested it with a different mcast address, C&P from the multihome howto:
<altname port="6899" mcast="239.192.99.27" name="marathon-01a"/>

and unfortunately it was the worst case seen so far. With the old build it starts complaining about ring 1 failing. This is probably fixed in the new build, as that one assembles the cluster cleanly.

After running cpgbench from one node the cluster melted down, with 3 of the 5 cluster nodes running corosync at 100% CPU (probably an endless loop? strace produced no output). With the new build only one node was affected and fenced, then another one, until there was no quorum.

The reason for these meltdowns is that very shortly after cpgbench is run, both rings fail.

Is it worth retesting without the specified mcast address/port?

cman-3.0.12.1-14.el6.jf.1.x86_64
corosync-1.4.1-1.el6.x86_64
Comment 44 Jan Friesse 2011-08-29 09:28:34 EDT
Created attachment 520395 [details]
Handle endless loop if all ifaces are faulty

If all interfaces were faulty, passive_mcast_flush_send and related
functions ended up in an endless loop. This is now handled: if there is no
live interface, the message is dropped.
Comment 45 Jan Friesse 2011-08-29 09:28:52 EDT
Created attachment 520397 [details]
rrp: Higher threshold in passive mode for mcast

The patch adds a new configurable variable, rrp_problem_count_mcast_threshold,
which by default is 10 times rrp_problem_count_threshold and is used as the
threshold for multicast packets in passive mode. The variable is unused in
active mode.
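As described in the patch, the new variable could be set explicitly like this (a sketch; when unset it is derived as 10x rrp_problem_count_threshold, and it only matters in passive mode):

```
totem {
          rrp_mode: passive
          rrp_problem_count_threshold: 10
          # defaults to 10 * rrp_problem_count_threshold if unset
          rrp_problem_count_mcast_threshold: 100
}
```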
Comment 46 Jan Friesse 2011-08-29 09:31:20 EDT
Test package containing "Handle endless loop if all ifaces are faulty" and "Higher threshold in passive mode for mcast" is https://brewweb.devel.redhat.com/taskinfo?taskID=3589167

Please use with cman-3.0.12.1-14.el6.jf.1.x86_64 and without extra configuration (so NO <altname port="6899" ...>)
Comment 48 Perry Myers 2011-08-30 08:13:36 EDT
*** Bug 504022 has been marked as a duplicate of this bug. ***
Comment 57 Jan Friesse 2011-09-06 02:58:55 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
TechPreview known issues:
- A double ring failure will result in spinning of the
corosync process.
- The corosync redundant ring appears to meet our technology
preview quality requirements, but DLM relies on SCTP, which is nonfunctional. As a result, many features of the cluster software that rely on DLM do not work properly.
Comment 59 Jan Friesse 2011-09-08 04:12:00 EDT
Created attachment 522059 [details]
Patch which allows threshold setting < 5
Comment 63 errata-xmlrpc 2011-12-06 06:51:18 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html
Comment 64 Red Hat Bugzilla 2013-10-03 20:26:22 EDT
Removing external tracker bug with the id 'DOC-61832' as it is not valid for this tracker

Note You need to log in before you can comment on or make changes to this bug.