Bug 623176 - token loss during recovery can trigger abort
token loss during recovery can trigger abort
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais (Show other bugs)
5.5
All Linux
urgent Severity urgent
: rc
: ---
Assigned To: Jan Friesse
Cluster QE
: ZStream
: 556804 631835 (view as bug list)
Depends On:
Blocks: 675783 694182
  Show dependency treegraph
 
Reported: 2010-08-11 10:23 EDT by Jaroslav Kortus
Modified: 2011-07-21 03:47 EDT (History)
8 users (show)

See Also:
Fixed In Version: openais-0.80.6-29.el5
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 675783 (view as bug list)
Environment:
Last Closed: 2011-07-21 03:47:20 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
core of aisexec from crashed node. (1.52 MB, application/x-bzip2)
2010-08-11 10:25 EDT, Jaroslav Kortus
no flags Details
Backport from corosync master branch (5.12 KB, patch)
2011-03-14 05:00 EDT, Jan Friesse
no flags Details | Diff
Backport of d99fba72e65545d8a3573b754525bd2ec8dcc540 (3.21 KB, patch)
2011-03-29 05:57 EDT, Jan Friesse
no flags Details | Diff
Backport of 336741ee96caf3ae435b609ee8a76788902c9edf (8.35 KB, patch)
2011-03-29 05:57 EDT, Jan Friesse
no flags Details | Diff
Backport of Corosync 0eabeee63eca7a4cc1d907607057ac668fafbcae (2.57 KB, patch)
2011-03-29 06:22 EDT, Jan Friesse
no flags Details | Diff
Backport of 0fcf760977a73ef78bacff47e61c30ba21929faa (2.68 KB, patch)
2011-03-29 06:22 EDT, Jan Friesse
no flags Details | Diff
Backport of Corosync 0eabeee63eca7a4cc1d907607057ac668fafbcae (2.61 KB, patch)
2011-03-30 05:13 EDT, Jan Friesse
no flags Details | Diff

  None (edit)
Description Jaroslav Kortus 2010-08-11 10:23:37 EDT
Description of problem:
During recovery phase, aisexec crashed on 2 of 3 nodes.
The setup was 3-node ia64 cluster (a1, a2, a3) and a3 has been set up to have 80% network loss.

This, over time, resulted in lost token and need to form new configuration. When this happened, 2 of 3 nodes have crashed: Program terminated with signal 6, Aborted.

Version-Release number of selected component (if applicable):
openais-0.80.6-16.el5_5.7

How reproducible:
happened just once, perhaps exact timing of token drop to recovery phase is needed?.

Steps to Reproduce:
1. form cluster, start cpgverify (7 per node)
2. setup 80% network loss (tc qdisc add dev eth0 parent 1:20 handle 2:0 netem delay 100ms loss 80% reorder 10% gap 5 delay 100ms)
3. wait for the lossy node to get fenced
  
Actual results:
core may happen

Expected results:
fence, new configuration, live happy ever after

Additional info:


==== Logs from a2 which as only one did not crash =====
==== note the line with: kernel: dlm: invalid h_nodeid 0 from 3 lockspace 20001 ====
Aug 10 10:13:26 a2 openais[22450]: [TOTEM] Retransmit List: cd7b cd95 cdaf cde6 ce01 ce02 ce1b ce1c ce1d ce84 ce85 ce86 ce87  
Aug 10 10:13:26 a2 openais[22450]: [TOTEM] Retransmit List: ce86 ce87 cd7b cd95 cde6 ce02 ce1d ce84 ce85  
Aug 10 10:13:36 a2 openais[22450]: [TOTEM] The token was lost in the OPERATIONAL state. 
Aug 10 10:13:36 a2 openais[22450]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). 
Aug 10 10:13:36 a2 openais[22450]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). 
Aug 10 10:13:36 a2 openais[22450]: [TOTEM] entering GATHER state from 2. 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] Saving state aru cd7a high seq received cebb 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] Storing new sequence id for ring 80 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] entering COMMIT state. 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] entering RECOVERY state. 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] position [0] member 10.15.89.179: 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] previous ring seq 124 rep 10.15.89.179 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] aru cd7a high delivered cd7a received flag 0 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] position [1] member 10.15.89.180: 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] previous ring seq 124 rep 10.15.89.179 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] aru cd7a high delivered cd7a received flag 0 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] position [2] member 10.15.89.181: 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] previous ring seq 124 rep 10.15.89.179 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] aru cebe high delivered cebe received flag 1 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] copying all old ring messages from cd7b-cebb. 
Aug 10 10:13:37 a2 openais[22450]: [TOTEM] Originated 312 messages in RECOVERY. 
Aug 10 10:13:47 a2 openais[22450]: [TOTEM] The token was lost in the RECOVERY state. 
Aug 10 10:13:47 a2 openais[22450]: [TOTEM] Restoring instance->my_aru cd7a my high seq received cebb 
Aug 10 10:13:47 a2 openais[22450]: [TOTEM] entering GATHER state from 5. 
Aug 10 10:13:48 a2 kernel: dlm: invalid h_nodeid 0 from 3 lockspace 20001
Aug 10 10:13:59 a2 kernel: dlm: closing connection to node 1
Aug 10 10:13:59 a2 kernel: dlm: closing connection to node 3
Aug 10 10:13:59 a2 openais[22450]: [TOTEM] entering GATHER state from 0. 
Aug 10 10:13:59 a2 openais[22450]: [TOTEM] Creating commit token because I am the rep. 



==== on the crashed node it looks nearly the same, just different ending ====
Aug 10 10:13:47 a1 openais[22748]: [TOTEM] Restoring instance->my_aru cd7a my high seq received cebb 
Aug 10 10:13:47 a1 openais[22748]: [TOTEM] entering GATHER state from 5. 
Aug 10 10:13:47 a1 gfs_controld[22777]: cluster is down, exiting
Aug 10 10:13:47 a1 dlm_controld[22772]: cluster is down, exiting
Aug 10 10:13:47 a1 kernel: dlm: closing connection to node 3
Aug 10 10:13:47 a1 kernel: dlm: closing connection to node 2
Aug 10 10:13:47 a1 kernel: dlm: closing connection to node 1
Aug 10 10:13:47 a1 fenced[22767]: cluster is down, exiting
Comment 1 Jaroslav Kortus 2010-08-11 10:25:17 EDT
Created attachment 438199 [details]
core of aisexec from crashed node.
Comment 4 Steven Dake 2011-02-07 12:21:02 EST
Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.
#0  0xa000000000010620 in __kernel_syscall_via_break ()
(gdb) where
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000000000f8b00 in raise () from /lib/libc.so.6.1
#2  0x20000000000fbe80 in abort () from /lib/libc.so.6.1
#3  0x20000000000e87b0 in __assert_fail () from /lib/libc.so.6.1
#4  0x400000000001a970 in memb_state_commit_token_update (
    instance=0x20000000017d0010, commit_token=0x60000000001437a8)
    at totemsrp.c:2681
#5  memb_state_commit_enter (instance=0x20000000017d0010, 
    commit_token=0x60000000001437a8) at totemsrp.c:1754
#6  0x4000000000022030 in message_handler_memb_commit_token (
    instance=0x20000000017d0010, msg=0x60000000001437a8, 
    msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at totemsrp.c:4062
#7  0x4000000000017a20 in main_deliver_fn (context=0x20000000017d0010, 
    msg=0x60000000001437a8, msg_len=1325018) at totemsrp.c:4135
#8  0x4000000000011a10 in none_mcast_recv (rrp_instance=<value optimized out>, 
    iface_no=0, context=0x20000000017d0010, msg=0x60000000001437a8, 
    msg_len=312) at totemrrp.c:476
#9  0x4000000000011d10 in rrp_deliver_fn (context=0x60000000001181b0, 
    msg=0x60000000001437a8, msg_len=312) at totemrrp.c:1319
#10 0x400000000000dcc0 in net_deliver_fn (handle=0, fd=<value optimized out>, 
    revents=1, data=0xc000000000000712) at totemnet.c:695
#11 0x4000000000007290 in poll_run (handle=0) at aispoll.c:402
---Type <return> to continue, or q <return> to quit---
#12 0x4000000000039380 in main (argc=<value optimized out>, 
    argv=<value optimized out>) at main.c:628
(gdb) 
(gdb) up
#1  0x20000000000f8b00 in raise () from /lib/libc.so.6.1
(gdb) up
#2  0x20000000000fbe80 in abort () from /lib/libc.so.6.1
(gdb) up
#3  0x20000000000e87b0 in __assert_fail () from /lib/libc.so.6.1
(gdb) up
#4  0x400000000001a970 in memb_state_commit_token_update (
    instance=0x20000000017d0010, commit_token=0x60000000001437a8)
    at totemsrp.c:2681
2681		assert (commit_token->memb_index <= commit_token->addr_entries);
(gdb) prnt commit_token
Undefined command: "prnt".  Try "help".
(gdb) print *commit_token
Cannot access memory at address 0x0
(gdb) print commit_token
$1 = <value optimized out>
(gdb) up
#5  memb_state_commit_enter (instance=0x20000000017d0010, 
    commit_token=0x60000000001437a8) at totemsrp.c:1754
1754		memb_state_commit_token_update (instance, commit_token);
(gdb) print *commit_token
$2 = {header = {type = 4 '\004', encapsulated = 0 '\000', 
    endian_detector = 65314, nodeid = 1}, token_seq = 6, ring_id = {rep = {
      nodeid = 1, family = 2, 
      addr = "\n\017Y\263\b\000\002\000\n\017Y\263\b\000\004"}, seq = 128}, 
  retrans_flg = 0, memb_index = 4, addr_entries = 3, 
  end_of_commit_token = 0x60000000001437de "\001"}
(gdb) print instance->memb_state
$4 = MEMB_STATE_GATHER
Comment 5 Steven Dake 2011-02-07 12:30:01 EST
(gdb) print instance->my_proc_list_entries
$6 = 3
(gdb) print instance->my_failed_list_entries
$7 = 0
(gdb) up
#6  0x4000000000022030 in message_handler_memb_commit_token (
    instance=0x20000000017d0010, msg=0x60000000001437a8, 
    msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at totemsrp.c:4062
4062					memb_state_commit_enter (instance, memb_commit_token);
(gdb) print sub_entries
$8 = 3
(gdb) print sub[0]
$9 = {addr = {{nodeid = 1, family = 2, 
      addr = "\n\017Y\263\b\000\002\000\n\017Y\263\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print sub[1]
$10 = {addr = {{nodeid = 2, family = 2, 
      addr = "\n\017Y\264\b\000\002\000\n\017Y\264\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print sub[2]
$11 = {addr = {{nodeid = 3, family = 2, 
      addr = "\n\017Y\265\b\000\002\000\n\017Y\265\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print my_proc_list[0]
No symbol "my_proc_list" in current context.
(gdb) print instance->my_proc_list[0]
$12 = {addr = {{nodeid = 1, family = 2, 
      addr = "\n\017Y\263\b\000\002\000\n\017Y\263\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_proc_list[1]
$13 = {addr = {{nodeid = 2, family = 2, 
      addr = "\n\017Y\264\b\000\002\000\n\017Y\264\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_proc_list[2]
$14 = {addr = {{nodeid = 3, family = 2, 
      addr = "\n\017Y\265\b\000\002\000\n\017Y\265\b\000\004"}, {nodeid = 0, 
      family = 0, addr = '\000' <repeats 15 times>}}}
(gdb) print instance->my_ring_id
$15 = {rep = {nodeid = 1, family = 2, 
    addr = "\n\017Y\263\b\000\002\000\n\017Y\263\b\000\004"}, seq = 124}
Comment 6 Steven Dake 2011-02-07 13:21:59 EST
From the logs, we see the processor lost a token in the recovery state, then entered the gather state.  From the core dump, we can see the processor attempted to enter the commit state from the gather state, but then segfaulted.

It would be nice to have a reproducer, but I believe the issue is that the full ring sequence number is restored on entering recovery state (ring_state_restore()) and again when a token timeout occurs during recovery.  This is corrupting the ring sequence numbers.

What happens then is as follows:
nodes 1,2,3 with seq = 4
commit token seq = 8
commit token sent from 1->2 with seq = 8
commit token sent from 2->3 with seq = 8
commit token sent from 3->1 (1 enters recover state)
commit token sent from 1->2 (2 drops token for long period of time)
failure to send commit token from 3->1 because of token loss
node 1 has token timeout in recovery state, which restores ring sequence number to 4 (!!), and enters gather
finally that commit token makes it to node 1, and the commit token is accepted as a new commit token (rather then being rejected)
When that commit token makes it back around to 1 in recovery, it should start the recovery process.  But because it is in gather, it thinks its a valid commit token (even though its from a now discarded ring configuration)

the code is as follows:
                case MEMB_STATE_GATHER:
                        memb_set_subtract (sub, &sub_entries,
                                instance->my_proc_list, instance->my_proc_list_entries,
                                instance->my_failed_list, instance->my_failed_list_entries);

                        if (memb_set_equal (addr,
                                memb_commit_token->addr_entries,
                                sub,
                                sub_entries) &&

                                memb_commit_token->ring_id.seq > instance->my_ring_id.seq) {

                                memb_state_commit_enter (instance, memb_commit_token);
                        }
                        break;

in this case, the commit token seq is 8, node 1's seq is 4 (because of ring_state_restore).  Then node 1 thinks this is a valid commit token to form a new ring.  Instead node 1 should reject the commit token, which will trigger a token loss in the recovery state on all nodes, resulting in a new round of consensus.

The totem spec states for token loss timeout expired in recovery state:
"1. discard all new messages received on the new ring
2. empty retrans_message queue
3. determine current old ring aru
4. call shift_to_gather"

It appears the code currently takes point 3 and expands it into restoring all state about the current ring, which is wrong.
Comment 7 Steven Dake 2011-02-07 13:31:29 EST
*** Bug 631835 has been marked as a duplicate of this bug. ***
Comment 9 Jan Friesse 2011-03-03 09:43:47 EST
Reproducer (in code) now exists for corosync https://bugzilla.redhat.com/attachment.cgi?id=482098&action=diff
Comment 10 Steven Dake 2011-03-04 11:54:17 EST
brilliant work on comment #9.
Comment 11 Steven Dake 2011-03-04 16:12:14 EST
Honza,

Please backport the resolution from corosync master patch.

Thanks!
-steve
Comment 12 Jan Friesse 2011-03-14 05:00:25 EDT
Created attachment 484103 [details]
Backport from corosync master branch

Backport of Corosync 6aa47fde953bf2179f5bd2dd07815fc7d80f47bb
Comment 13 Jan Friesse 2011-03-16 06:05:44 EDT
Patch sent to ML.
Comment 14 Lon Hohberger 2011-03-21 19:26:20 EDT
*** Bug 556804 has been marked as a duplicate of this bug. ***
Comment 15 Jan Friesse 2011-03-24 11:49:53 EDT
According to Steve, this patch introduced a regression in the code which was
exposed during QE testing.  I'm flipping the state.
Comment 16 Steven Dake 2011-03-28 14:26:19 EDT
following patches need backports
6aa47fde953bf2179f5bd2dd07815fc7d80f47bb
d99fba72e65545d8a3573b754525bd2ec8dcc540
336741ee96caf3ae435b609ee8a76788902c9edf
0eabeee63eca7a4cc1d907607057ac668fafbcae
0fcf760977a73ef78bacff47e61c30ba21929faa
Comment 17 Jan Friesse 2011-03-29 05:57:00 EDT
Created attachment 488395 [details]
Backport of d99fba72e65545d8a3573b754525bd2ec8dcc540
Comment 18 Jan Friesse 2011-03-29 05:57:31 EDT
Created attachment 488396 [details]
Backport of 336741ee96caf3ae435b609ee8a76788902c9edf
Comment 19 Jan Friesse 2011-03-29 06:22:26 EDT
Created attachment 488400 [details]
Backport of Corosync 0eabeee63eca7a4cc1d907607057ac668fafbcae
Comment 20 Jan Friesse 2011-03-29 06:22:55 EDT
Created attachment 488401 [details]
Backport of 0fcf760977a73ef78bacff47e61c30ba21929faa
Comment 21 Jan Friesse 2011-03-30 05:13:32 EDT
Created attachment 488697 [details]
Backport of Corosync 0eabeee63eca7a4cc1d907607057ac668fafbcae

Fixed missing instance->old_ring_state_saved = 1;
Comment 30 errata-xmlrpc 2011-07-21 03:47:20 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1012.html

Note You need to log in before you can comment on or make changes to this bug.