Bug 751901

Summary: all but one node in a three node cluster drop out when cumin starts
Product: Red Hat Enterprise MRG
Reporter: Patrick O'Brien <pobrien>
Component: qpid-cpp
Assignee: mick <mgoulish>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high
Priority: high
Version: 2.0
CC: aconway, freznice, jross, pmoravec, whenry
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2013-02-24 20:57:36 UTC

Description Patrick O'Brien 2011-11-08 00:46:50 UTC
Description of problem:
Launching cumin appears to cause all but one node in a three-node cluster to drop out of the cluster.

Version-Release number of selected component (if applicable):
MRG 2.0
RHEL 5.6 x86_64 for the MRG nodes
RHEL 5.5 x86_64 for the Cumin node
cumin-0.1.4794-1.el5
mrg-release-2.0.0-1.el5
qpid-cpp-server-cluster-0.10-8.el5

How reproducible:
Every time I start cumin the nodes drop out, so it is reliably reproducible.


Steps to Reproduce:
1. Install the MRG 2.0 qpid packages on the cluster nodes, configure them (a sketch of the relevant configuration follows below), and start qpidd.
2. Install the MRG 2.0 cumin packages on the cumin node, configure it, and start cumin.
3. Tail the logs and wait for nodes to drop out.
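
As a reference for step 1, the cluster settings that matter here can be sketched as a minimal qpidd.conf. This is illustrative only (the actual config files were not attached); the cluster name, store directory, ACL path, and port are taken from the broker log below, and the auth setting is an assumption:

    # /etc/qpidd.conf -- sketch, not the reporter's actual file
    cluster-name=TEST                   # log: "joining cluster TEST"
    data-dir=/data/mrg/data             # log: "store-dir=/data/mrg/data"
    acl-file=/data/mrg/conf/qpidd.acl   # log: "Read ACL file ..."
    port=5671                           # log: "Listening on TCP port 5671"
    auth=no                             # assumption: auth settings unknown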
  
Actual results:

All but one of the nodes drop out of the cluster.

Expected results:

None of the nodes drop out of the cluster.

Additional info:

Here is the qpid log from one of the cluster members that ejected itself:

2011-11-03 11:06:44 notice Journal "TplStore": Created
2011-11-03 11:06:44 notice Store module initialized; store-dir=/data/mrg/data
2011-11-03 11:06:44 notice Initializing CPG
2011-11-03 11:06:44 notice Cluster store state: empty
2011-11-03 11:06:44 notice cluster(172.18.244.23:532 PRE_INIT) configuration change: 172.18.244.23:532 172.18.244.24:4735 172.18.244.25:20126
2011-11-03 11:06:44 notice cluster(172.18.244.23:532 PRE_INIT) Members joined: 172.18.244.23:532
2011-11-03 11:06:56 notice Listening on TCP port 5671
2011-11-03 11:06:56 notice Read ACL file "/data/mrg/conf/qpidd.acl"
2011-11-03 11:06:56 notice cluster(172.18.244.23:532 INIT) cluster-uuid = 95b01437-7399-4b95-be06-68e1907a1586
2011-11-03 11:06:56 notice cluster(172.18.244.23:532 JOINER) joining cluster TEST
2011-11-03 11:06:56 notice Broker running
2011-11-03 11:06:56 notice cluster(172.18.244.23:532 UPDATEE) receiving update from 172.18.244.25:20126
2011-11-03 11:06:56 notice Journal "public.queue.test1": Created
2011-11-03 11:06:56 notice Journal "public.queue.test2": Created
2011-11-03 11:07:32 notice Running in a cluster, marking store dirty.
2011-11-03 11:07:32 notice cluster(172.18.244.23:532 CATCHUP) update complete, starting catch-up.
2011-11-03 11:07:32 error Execution exception: invalid-argument: cumin.domain.com.7520.2: confirmed < (1197643+0) but only sent < (1197641+0) (qpid/SessionState.cpp:154)
2011-11-03 11:07:32 critical cluster(172.18.244.23:532 CATCHUP/error) local error 1054676016 did not occur on member 172.18.244.24:4735: invalid-argument: cumin.domain.com.7520.2: confirmed < (1197643+0) but only sent < (1197641+0) (qpid/SessionState.cpp:154)
2011-11-03 11:07:32 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: cumin.domain.com.7520.2: confirmed < (1197643+0) but only sent < (1197641+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
2011-11-03 11:07:32 notice cluster(172.18.244.23:532 LEFT/error) leaving cluster TEST
2011-11-03 11:07:33 notice Shut down
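
The failing check is the AMQP 0-10 session-state invariant that a peer may only confirm commands the sender has actually sent; the confirmed point must never pass the send point. A minimal sketch of that invariant (illustrative only; this is not the actual qpid/SessionState.cpp code, and the names are made up):

    #include <stdint.h>
    #include <sstream>
    #include <stdexcept>

    typedef uint32_t SequenceNumber;   // AMQP 0-10 command ids are 32-bit serial numbers

    // Sketch of the invariant behind "confirmed < (1197643+0) but only sent < (1197641+0)"
    void checkConfirmed ( SequenceNumber confirmed, SequenceNumber sent )
    {
      if ( confirmed > sent )
      {
        std::ostringstream msg;
        msg << "invalid-argument: confirmed < (" << confirmed
            << "+0) but only sent < (" << sent << "+0)";
        throw std::invalid_argument ( msg.str() );
      }
    }

Here this broker's copy of session cumin.domain.com.7520.2 recorded 1197643 commands confirmed but only 1197641 sent, so the check fired locally. Because the same error did not occur on member 172.18.244.24:4735, the cluster's error-consistency check (qpid/cluster/ErrorCheck.cpp) forced the inconsistent member to leave.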

Comment 1 Alan Conway 2011-11-23 15:19:40 UTC
According to the log above, the broker had only recently started and had just caught up with the rest of the cluster when it failed. Do we know what the pattern of starting/stopping brokers was leading up to the failure? Do we have logs for the other brokers? In a cluster failure it's useful to get logs from all the brokers, not just the failed ones, to see what was going on at the time of the failure.

Comment 2 William Henry 2011-12-01 17:47:38 UTC
Can we please add the JVM package/version information?

Comment 3 mick 2011-12-02 19:20:58 UTC
Here is what I did to try to reproduce this bug.  I will include a synopsis, and then all the scripts and the C++ program that I used.

synopsis
{
  start a 3-cluster
  run a script that repeatedly starts cumin, stops cumin, dumps the cumin 
  database, and restarts cumin.
  while that script is running, run a C++ program that randomly declares
  new queues, and chooses which queue to send to or receive from.
  When the C++ program decides to send to a queue, it chooses a random 
  number of messages to send in a burst.  When it decides to receive, it
  randomly chooses which queue it wants, and then drains that queue.
  
  result: The total cumin-cycle count was 2000, during which time 12,000
          queues were created and several million messages were sent and
          received.  No luck on the repro: all 3 original brokers were
          still alive at the end.
}



scripts
{
  starting the 3-cluster
  {
    #! /bin/bash

    QPIDD=/usr/sbin/qpidd

    rm -rf /tmp/mick
    mkdir  /tmp/mick

    for N in 1 2 3
    do
      $QPIDD                                      \
       --cluster-name  cluster_flock              \
        -p 581$N                                  \
        --data-dir /tmp/mick/data_$N              \
        --auth=no                                 \
        --mgmt-enable=yes                         \
        --log-enable=info+                        \
        --log-enable=debug+:cluster               \
        --log-to-file /tmp/mick/qpidd_$N.log      \
        -d
      echo "============================================"
      echo "started broker $N from $QPIDD"
      echo "============================================"
    done
  }


  cycling cumin -- this script uses 3 other scripts, included below.
  {
    #! /bin/bash


    \rm -rf ./cumin_output
    mkdir  ./cumin_output

    let count=1

    while [ $count -lt 501 ]
    do
      echo " "
      echo "------------- test $count -------------"
      echo "starting cumin..."
      sleep 2
      ./start_cumin $count &
      sleep 5
      echo "Sleeping 15 seconds..."
      sleep 15


      echo "stopping cumin and dumping DB..."
      sleep 2
      ./stop_cumin &
      sleep 5
      echo "Sleeping 15 seconds..."
      sleep 15


      echo "reinstalling cumin db..."
      sleep 2
      ./cumin_reinstall &
      sleep 5
      echo "Sleeping 15 seconds..."
      sleep 15

      let count+=1

      echo -n "Check brokers...  "
      ps -aef | grep qpidd | grep -v grep | wc -l
    done
  }

  start_cumin
  {
    #! /bin/bash

    N=$1

    output=cumin_output/$N
    echo "starting cumin..."
    # the here-doc terminator must start in column 0, or bash never finds it
    /home/mick/cumin_trunk/cumin/bin/cumin-data --debug 2> $output << yes_dang_it
yes
yes_dang_it
  }


  stop_cumin
  {
    #! /bin/bash

    echo "Killing cumin..."
    pkill cumin
    echo "dumping database..."
    /home/mick/cumin_trunk/cumin/bin/cumin-database drop << say_yes
yes
say_yes
    echo "cumin stopped and dumped."
  }


  cumin_reinstall
  {
    #! /bin/bash
    /home/mick/cumin_trunk/cumin/bin/cumin-database install << say_yes
yes
say_yes
  }

# end of the scripts  
}


the c++ program
{
#include <vector>
#include <string>
#include <sstream>
#include <iostream>

#include <stdio.h>      // fprintf, sprintf
#include <unistd.h>     // getpid
#include <cstdlib>      // srand, ::rand

#include <qpid/messaging/Connection.h>
#include <qpid/messaging/Message.h>
#include <qpid/messaging/Receiver.h>
#include <qpid/messaging/Sender.h>
#include <qpid/messaging/Session.h>
#include <qpid/messaging/Duration.h>
#include <qpid/messaging/exceptions.h>  // NoMessageAvailable




using namespace std;
using namespace qpid::messaging;


struct kolchak_queue
{
  string     queue_name;
  string     address;
  Session &  session;
  Sender     sender;
  Receiver   receiver;

  kolchak_queue ( Session &    _session,
                  const char * _queue_name
                );
};





kolchak_queue::kolchak_queue ( Session &    _session,
                               const char * _queue_name
                             )
  : queue_name ( _queue_name ),   // init list matches member declaration order
    session ( _session )
{
  address = queue_name;
  address += ";{create:always,node:{type:queue}}";
  sender   = session.createSender   ( address );
  receiver = session.createReceiver ( address );
}








struct kolchak
{
  kolchak ( string & _broker,
            string & _options
          );

  int seed;

  string     broker;
  Connection cnx;
  Session    session;

  void run ( int n_steps );

  int rand ( int max_desired_value );

  // Actions -----------------------
  void declare_queue ( );
  void send ( );
  void drain ( );
  // End actions -----------------------

  void print_report ( );

  vector < kolchak_queue * > queues;

  int message_count,
      message_burst_count,
      drain_count,
      report_count;
};





kolchak::kolchak ( string & _broker,
                   string & _options
                 ) :
  broker ( _broker ),
  cnx ( _broker, _options ),
  message_count ( 0 ),
  message_burst_count ( 0 ),
  drain_count ( 0 ),
  report_count ( 0 )
{
  seed = getpid();
  srand ( seed );
  fprintf ( stderr, "MDEBUG kolchak seed is: %d\n", seed );

  cnx.open();
  session  = cnx.createSession();
}





int
kolchak::rand ( int max_plus_one )
{
  // Divide by RAND_MAX + 1 so the result is in [0, 1).  Dividing by
  // RAND_MAX would let ::rand() == RAND_MAX produce exactly max_plus_one,
  // an out-of-range index for the queues vector.
  double zero_to_one = (double) ::rand() / ( (double) RAND_MAX + 1.0 );
  return (int) ( zero_to_one * (double) max_plus_one );
}





void
kolchak::run ( int n_steps )
{
  for ( int step = 0; step < n_steps; ++ step )
  {
    if ( step && ( ! (step % 100) ) )
    {
      print_report ( );
    }

    int num = rand ( 1000 );

    fprintf ( stderr, "step %d choice %d ", step, num );

    if ( num < 50 )
    {
      declare_queue ( );
    }
    else
    if ( num < 500 )
    {
      send ( );
    }
    else
    {
      drain ( );
    }
  }
}





void
kolchak::declare_queue ( )
{
  char name[1000];
  sprintf ( name, "q_%d", (int) queues.size() );  // cast: %d expects int, size() returns size_t
  kolchak_queue * q = new kolchak_queue ( session, name );
  queues.push_back ( q );

  fprintf ( stderr, "declared queue |%s|\n", name );
}





void
kolchak::send ( )
{
  int n_queues = queues.size();
  int queue_index;

  if ( n_queues < 1 )
  {
    fprintf ( stderr, "no queues to send to yet.\n" );
    return;
  }

  if ( n_queues == 1 )
    queue_index = 0;
  else
  {
    // Randomly choose which queue to send them to.
    queue_index = rand ( queues.size() );
  }

  // Randomly choose how many messages to send.
  int n_messages = rand ( 1000 );

  string msg_str("message...");
  Message msg ( msg_str );

  kolchak_queue * q = queues [ queue_index ];

  for ( int i = 0; i < n_messages; ++ i )
  {
    q->sender.send ( msg );
  }

  message_count += n_messages;
  ++ message_burst_count;

  fprintf ( stderr,
            "sent %d messages to queue |%s|\n",
            n_messages,
            q->queue_name.c_str()
          );
}





void
kolchak::print_report ( )
{
  ++ report_count;
  fprintf ( stderr, "\nreport %d --------------------------\n", report_count );

  fprintf ( stderr, "%d queues declared.\n", queues.size() );

  fprintf ( stderr,
            "%d messages sent in %d bursts.\n",
            message_count,
            message_burst_count
          );

  fprintf ( stderr,
            "%d queues drained.\n",
            drain_count
          );
  fprintf ( stderr, "--------------------------\n\n" );
}





void
kolchak::drain ( )
{
  int n_queues = queues.size();
  int queue_index;

  if ( n_queues < 1 )
  {
    fprintf ( stderr, "no queues to drain yet.\n" );
    return;
  }

  if ( n_queues == 1 )
    queue_index = 0;
  else
  {
    // Randomly choose which queue to drain.
    queue_index = rand ( queues.size() );
  }

  kolchak_queue * q = queues [ queue_index ];

  while ( 1 )
  {
    try
    {
      Message message = q->receiver.fetch ( Duration::SECOND * 1 );
    }
    catch (const NoMessageAvailable& e)
    {
      break;
    }
  }

  session.acknowledge();
  fprintf ( stderr, "drained queue |%s|\n", q->queue_name.c_str() );

  ++ drain_count;
}





int
main ( int argc, char** argv )
{
  string broker  = argc > 1 ? argv[1] : "localhost:5811";
  string options("");

  kolchak k ( broker, options ) ;

  k.run ( 10000 );

  return 0;
}


// end of C++ program

}




build script for the c++ program
{
    #! /bin/bash

    echo "============================================================="
    echo "   making kolchak "
    echo "============================================================="
    # the library must come after the source file that uses it, or
    # linkers run with --as-needed will drop it
    g++ -g -O2                         \
    -I$QPID_ROOT/cpp/include           \
    -I$QPID_ROOT/cpp/src               \
    -L$QPID_ROOT/cpp/src/.libs         \
    kolchak.cpp                        \
    -lqpidmessaging                    \
    -o kolchak
}

Comment 4 Justin Ross 2011-12-07 21:07:15 UTC
We have not yet been able to reproduce this issue, and the window for 0.14-mrg (2.2) is closing.  Targeting 2.3.