Bug 729650

Summary: Toggling an interface down/up leaves the cluster unable to synchronize
Product: Red Hat Enterprise Linux 5 Reporter: Zdenek Kraus <zkraus>
Component: openais Assignee: Steven Dake <sdake>
Status: CLOSED CANTFIX QA Contact: Cluster QE <mspqa-list>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 5.7 CC: cluster-maint, edamato
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-08-10 14:59:42 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---

Description Zdenek Kraus 2011-08-10 13:11:17 UTC
Description of problem:
Two openais nodes are synchronized and operational. After the interface on one node is turned off, the cluster splits. When that interface is turned back on, both nodes start to synchronize, but they keep sending multicast traffic and never finish synchronizing. No service is using openais.


Version-Release number of selected component (if applicable):
Host: 
  Linux <HOSTNAME> 2.6.35.13-92.fc14.x86_64 #1 SMP Sat May 21 17:26:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
  Fedora release 14 (Laughlin)

Guests: 
  1. Linux rhel5x 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
  Red Hat Enterprise Linux Server release 5.7 (Tikanga)
  2. Linux rhel5i 2.6.18-274.el5 #1 SMP Fri Jul 8 17:39:55 EDT 2011 i686 i686 i386 GNU/Linux 
  Red Hat Enterprise Linux Server release 5.7 (Tikanga)

component:
  openais-0.80.6-30.el5


How reproducible:
98%

Steps to Reproduce:
0. The VMs are connected through the virtual bridge virbr0, which is isolated from the internet. The interface on each VM is eth1.
1. Configure openais on both VMs for the same cluster (e.g. bindnetaddr: 192.168.5.0)
2. service iptables stop; newgrp ais; service openais start
NOTE: successfully synchronized
3. rhel5i # ifdown eth1
NOTE: the cluster is split, rhel5i is connected to localhost
4. rhel5i # ifup eth1
NOTE: the nodes start to synchronize
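
Steps 3 and 4 can be sketched as a small script to run as root on rhel5i. This is only an illustration: the function names, the DRY_RUN switch, and the 30 second pause between ifdown and ifup are assumptions, not part of the original reproduction. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of steps 3-4 (run on rhel5i as root).
# IFACE comes from step 0; DOWN_SECS is an assumed pause, not from the report.
IFACE="${IFACE:-eth1}"
DOWN_SECS="${DOWN_SECS:-30}"
# DRY_RUN=1 (the default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

toggle_iface() {
    run ifdown "$IFACE"      # step 3: the cluster splits
    run sleep "$DOWN_SECS"   # leave the interface down for a while
    run ifup "$IFACE"        # step 4: the nodes start to synchronize
}

toggle_iface
```

Setting DRY_RUN=0 executes the commands for real, which takes eth1 out of service on the node where the script runs.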
  
Actual results:
Repeated multicast traffic and log records like these:
Aug 10 15:06:01.890461 [TOTEM] Sending initial ORF token
Aug 10 15:06:01.891348 [TOTEM] entering OPERATIONAL state.
Aug 10 15:06:01.906107 [TOTEM] entering GATHER state from 11.
Aug 10 15:06:02.717460 [TOTEM] entering GATHER state from 0.
Aug 10 15:06:02.717528 [TOTEM] Creating commit token because I am the rep.
Aug 10 15:06:02.717557 [TOTEM] Storing new sequence id for ring 1bbdc
Aug 10 15:06:02.717604 [TOTEM] entering COMMIT state.
Aug 10 15:06:02.717627 [TOTEM] entering RECOVERY state.
Aug 10 15:06:02.718393 [TOTEM] position [0] member 192.168.5.2:
Aug 10 15:06:02.718406 [TOTEM] previous ring seq 113620 rep 192.168.5.2
Aug 10 15:06:02.718412 [TOTEM] aru 0 high delivered 0 received flag 1
Aug 10 15:06:02.718418 [TOTEM] Did not need to originate any messages in recovery.
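
One possible way to spot this loop in a longer log is to count how often totem enters the GATHER and OPERATIONAL states against how many nodejoin messages CLM records. The helper below is a rough sketch: the name totem_summary is made up for illustration, it reads the log from stdin, and where openais writes its log depends on the local logging configuration.

```shell
# Summarize totem membership activity from an openais log read on stdin.
# totem_summary is a hypothetical helper name, not an openais tool.
totem_summary() {
    awk '
        /\[TOTEM\] entering GATHER state/      { gather++ }
        /\[TOTEM\] entering OPERATIONAL state/ { oper++ }
        /\[CLM *\] got nodejoin message/       { join++ }
        END {
            # Uninitialized counters print as 0.
            printf "gather=%d operational=%d nodejoin=%d\n", gather, oper, join
        }
    '
}
```

Fed the twelve log lines above, it prints gather=2 operational=1 nodejoin=0. A GATHER count that keeps growing while nodejoin stays at 0 matches the behaviour described in this report.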


Expected results:
The nodes synchronize after a few moments, with log records like:
Aug 10 15:07:01.914060 [TOTEM] entering GATHER state from 11.
Aug 10 15:07:01.916320 [TOTEM] Storing new sequence id for ring 1bdfc
Aug 10 15:07:01.916479 [TOTEM] entering COMMIT state.
Aug 10 15:07:01.917316 [TOTEM] entering RECOVERY state.
Aug 10 15:07:01.918193 [TOTEM] position [0] member 192.168.5.1:
Aug 10 15:07:01.918207 [TOTEM] previous ring seq 114168 rep 192.168.5.1
Aug 10 15:07:01.918214 [TOTEM] aru 0 high delivered 0 received flag 1
Aug 10 15:07:01.918221 [TOTEM] position [1] member 192.168.5.2:
Aug 10 15:07:01.918227 [TOTEM] previous ring seq 114168 rep 192.168.5.2
Aug 10 15:07:01.918254 [TOTEM] aru c high delivered c received flag 1
Aug 10 15:07:01.918261 [TOTEM] Did not need to originate any messages in recovery.
Aug 10 15:07:01.920487 [CLM  ] CLM CONFIGURATION CHANGE
Aug 10 15:07:01.920504 [CLM  ] New Configuration:
Aug 10 15:07:01.920515 [CLM  ]  r(0) ip(192.168.5.2) 
Aug 10 15:07:01.920521 [CLM  ] Members Left:
Aug 10 15:07:01.920526 [CLM  ] Members Joined:
Aug 10 15:07:01.920540 [CLM  ] CLM CONFIGURATION CHANGE
Aug 10 15:07:01.920547 [CLM  ] New Configuration:
Aug 10 15:07:01.920554 [CLM  ]  r(0) ip(192.168.5.1) 
Aug 10 15:07:01.920561 [CLM  ]  r(0) ip(192.168.5.2) 
Aug 10 15:07:01.920566 [CLM  ] Members Left:
Aug 10 15:07:01.920572 [CLM  ] Members Joined:
Aug 10 15:07:01.920578 [CLM  ]  r(0) ip(192.168.5.1) 
Aug 10 15:07:01.920595 [SYNC ] This node is within the primary component and will provide service.
Aug 10 15:07:01.921395 [TOTEM] entering OPERATIONAL state.
Aug 10 15:07:01.923995 [CLM  ] got nodejoin message 192.168.5.1
Aug 10 15:07:01.924281 [CLM  ] got nodejoin message 192.168.5.2


Additional info:
With the interface down, openais even exited silently on two occasions.

Comment 1 Steven Dake 2011-08-10 14:59:42 UTC
Do not take an interface out of service when using openais.  This can't be fixed.

Regards
-steve