Bug 1496228

Summary: Cassandra readiness probe can incorrectly fail in multi node setup
Product: OpenShift Container Platform
Reporter: Matt Wringe <mwringe>
Component: Hawkular
Assignee: Ruben Vargas Palma <rvargasp>
Status: CLOSED CURRENTRELEASE
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Priority: unspecified
Docs Contact:
Version: 3.2.1
CC: aos-bugs, jcosta, juzhao
Target Milestone: ---
Target Release: 3.2.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1494673
Environment:
Last Closed: 2019-11-21 18:38:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1494673, 1511627, 1511628, 1511629, 1511631
Bug Blocks:
Attachments: sample_nodetool_output (flags: none)

Description Matt Wringe 2017-09-26 17:30:44 UTC
+++ This bug was initially created as a clone of Bug #1494673 +++

Our Cassandra readiness probe parses the output of 'nodetool status' to determine whether the Cassandra instance is in the 'up' and 'normal' state.

Our string parsing of that output can break in certain situations: if the current host's ip address, as a string, is contained within the ip address of another node in the cluster, then we end up parsing two lines of the output instead of just one.

For instance, consider the case where we have two nodes in our Cassandra cluster whose ip addresses are '172.17.0.3' and '172.17.0.30' ('72.17.0.3' and '172.17.0.3' would also cause a problem).

Because of how we parse this output, our script incorrectly tries to handle both entries from 'nodetool status' instead of just the one.

This causes the readiness probe to receive unexpected information and fail.

If the pod is brought down and restarted, it should be granted a new ip address that no longer conflicts with the other node's address, and the probe can then succeed.

--- Additional comment from Matt Wringe on 2017-09-22 15:16:20 EDT ---

Simple PR which fixes this issue by checking for whitespace before and after the ip address, which prevents the script from treating two different addresses as the same: https://github.com/openshift/origin-metrics/pull/380
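The substring match and the whitespace-anchored fix can be sketched as follows. This is an illustrative reproduction, not the actual readiness script: the variable names are assumptions, and the sample rows mirror the attached sample_nodetool_output, where 10.1.0.6 is a prefix of 10.1.0.60.

```shell
#!/bin/sh
# Illustrative sketch (assumed names, not the actual readiness script):
# two cluster nodes where one address is a string prefix of the other,
# as in the attached sample nodetool output.
NODETOOL_OUTPUT='UN  10.1.0.6   523.8 KB   256   ?   89f099c3  rack1
DN  10.1.0.60  ?          256   ?   1cbb8dc3  rack1'
POD_IP="10.1.0.6"

# Buggy match: a plain substring grep also hits the 10.1.0.60 row.
buggy=$(echo "$NODETOOL_OUTPUT" | grep -c "$POD_IP")
echo "plain grep matched $buggy lines"      # 2

# Fixed match: require whitespace on both sides of the address, so only
# this pod's own row is selected.
fixed=$(echo "$NODETOOL_OUTPUT" | grep -c "[[:space:]]$POD_IP[[:space:]]")
echo "anchored grep matched $fixed lines"   # 1
```

With the plain grep, the script goes on to read two status columns ("UN" and "DN") where it expects one, which is what makes the probe fail.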

Comment 4 Matt Wringe 2017-09-29 14:33:13 UTC
Created attachment 1332416 [details]
sample_nodetool_output

sample nodetool output

Comment 6 Junqi Zhao 2017-09-30 09:14:46 UTC
Tested based on Comment 5; this issue is fixed in metrics-cassandra:3.2.1-17.
Contents of the /tmp/output file:
*********************************************************************
cat: /etc/ld.so.conf.d/*.conf: No such file or directory
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.1.0.6  523.8 KB   256          ?       89f099c3-0a24-4201-84d6-9dc5c0b373ad  rack1
DN  10.1.0.60  ?          256          ?       1cbb8dc3-6758-4578-b3a1-80f4d6db5077  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
*********************************************************************

Tested first with metrics-cassandra:3.2.1-16; it showed the following error:
*****************************************************************************
sh-4.2$ source /opt/apache-cassandra/bin/cassandra-docker-ready.sh
sh: [: too many arguments
Cassandra not in the up and normal state. Current state is UN
DN
error: error executing remote command: Error executing command in container: Error executing in Docker Container: 1

*****************************************************************************
It showed Cassandra was not in the up and normal state; this was expected, since 3.2.1-16 does not contain the fix.
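The "sh: [: too many arguments" line in the 3.2.1-16 output is consistent with the double match described above: when two rows are matched, the state variable holds two tokens, and an unquoted test expansion passes both to test(1). A minimal sketch, assuming a variable layout like the following (the variable name is hypothetical):

```shell
#!/bin/sh
# Illustrative sketch (assumed variable name, not the actual script):
# after the buggy grep matched two rows, the state holds two tokens.
STATE="UN
DN"
# Unquoted, $STATE expands to two words, so test(1) sees
# [ UN DN = UN ], prints "sh: [: too many arguments", and returns false.
if [ $STATE = "UN" ]; then
  echo "Cassandra is in the up and normal state. It is now ready."
else
  echo "Cassandra not in the up and normal state. Current state is $STATE"
fi
```

The else branch then prints "Current state is UN" followed by "DN" on the next line, matching the observed output above.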
 
Tested again with metrics-cassandra:3.2.1-17
sh-4.2$ source /opt/apache-cassandra/bin/cassandra-docker-ready.sh 
Cassandra is in the up and normal state. It is now ready.

Setting it to VERIFIED.