Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1147098

Summary: The server needs to handle failures inserting raw data
Product: [JBoss] JBoss Operations Network
Component: Core Server, Storage Node
Version: JON 3.2, JON 3.2.1, JON 3.2.2, JON 3.2.3
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Target Milestone: ER01
Target Release: JON 3.3.2
Hardware: Unspecified
OS: Unspecified
Keywords: Regression, Triaged
Reporter: John Sanda <jsanda>
Assignee: Michael Burman <miburman>
QA Contact: Mike Foley <mfoley>
CC: fbrychta, hrupp, loleary, mfoley, miburman, spinder, vnguyen
Doc Type: Bug Fix
Clone Of: 1147097
Bug Depends On: 1147097
Bug Blocks: 1212980
Last Closed: 2015-04-30 16:09:54 UTC
Attachments: verification logs

Description John Sanda 2014-09-26 21:51:29 UTC
+++ This bug was initially created as a clone of Bug #1147097 +++

Description of problem:
MetricsServer.addNumericData is the method responsible for storing raw metrics. If an error occurs, we log a warning and simply drop the data point(s) for which the error(s) occurred. Prior to RHQ 4.9, if errors occurred, the server threw an exception, and the agent would resend the report at some point in the future.

We could throw an exception as was done prior to RHQ 4.9, but that could be inefficient. Suppose we are inserting 10,000 data points and an error occurs on the last one. We would incur the network I/O overhead of sending the whole measurement report back and forth again, and then we would re-insert data that has already been inserted successfully.

A better approach would be to just let the server handle it. We log any data that we fail to insert. The log should be stored on disk so that we do not lose data across restarts. At some point after the failure(s), we go through the log and retry inserting the data. This will also help us better handle bursts in traffic, for example after agents with lots of spooled measurement reports reconnect to the server.
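The disk-backed retry log proposed above could be sketched roughly like this. This is a minimal illustration only, not RHQ code: the `FailedInsertLog` class, the CSV file format, and the `insert` callback are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of a disk-backed log of raw data points that failed to insert. */
public class FailedInsertLog {
    private final Path logFile;

    public FailedInsertLog(Path logFile) {
        this.logFile = logFile;
    }

    /** Append a failed data point; the file survives server restarts. */
    public void record(String scheduleId, long timestamp, double value) throws IOException {
        String line = scheduleId + "," + timestamp + "," + value + System.lineSeparator();
        Files.write(logFile, line.getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Retry every logged data point; rewrite the log with only the ones that fail again. */
    public void retry(Predicate<String> insert) throws IOException {
        if (!Files.exists(logFile)) return;
        List<String> stillFailing = new ArrayList<>();
        for (String line : Files.readAllLines(logFile)) {
            if (!insert.test(line)) {
                stillFailing.add(line); // insert failed again; keep it for the next pass
            }
        }
        Files.write(logFile, stillFailing); // truncates to the remaining failures
    }
}
```

A background job would call `retry` periodically after a failure, passing a callback that attempts the real storage write.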


Comment 1 Simeon Pinder 2015-01-19 20:53:00 UTC
Moving into the CR01 target milestone, as this missed the ER01 cutoff.

Comment 4 Michael Burman 2015-03-27 13:28:00 UTC
PR #162

Comment 5 Michael Burman 2015-04-09 18:23:13 UTC
In the release/jon3.3.x:

commit c9cd10cf0f541fe48cba03c03ce8fc1049b34f3f
Author: Michael Burman <miburman>
Date:   Mon Mar 16 16:14:41 2015 +0200

    [BZ 1147098] Add all the raw data first to a queue and from the queue
    write it to the Cassandra. Use the buffer to prevent bursts from
    overloading the Storage Node. Also, in case the queue is full, return an
    exception to the agent for agent queueing.
    
    (cherry picked from commit 5127e55c2cccb4ee10b96671bcffb56d45f1486d)
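The fix described in the commit can be sketched as a bounded in-memory buffer in front of the storage layer. This is a hypothetical illustration, not the actual PR #162 change; the exception message mirrors the one quoted later in this bug, but the class names here are invented.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: buffer raw data in a bounded queue before writing to Cassandra. */
public class RawDataBuffer {
    /** Thrown back to the agent so it can queue the report and retry later. */
    public static class MeasurementStorageException extends RuntimeException {
        public MeasurementStorageException(String msg) {
            super(msg);
        }
    }

    private final BlockingQueue<double[]> queue;

    public RawDataBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Accept a data point, or reject it immediately if the buffer is full. */
    public void add(double[] dataPoint) {
        if (!queue.offer(dataPoint)) {
            throw new MeasurementStorageException("The server is overloaded, queue is full.");
        }
    }

    /** Drain side: a writer thread would take() from here and write to Cassandra. */
    public double[] take() throws InterruptedException {
        return queue.take();
    }
}
```

The key design point is the non-blocking `offer`: a full queue fails fast and pushes backpressure to the agent instead of dropping data or blocking the server.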

Comment 6 Simeon Pinder 2015-04-13 04:14:51 UTC
Moving to ON_QA for testing with latest cumulative patch build:
https://brewweb.devel.redhat.com//buildinfo?buildID=429507

Note: Build maps to JON 3.3.2 ER01 build.

Comment 7 Viet Nguyen 2015-04-16 20:55:48 UTC
General verification steps:

1. Increase the CPU system/user load collection interval to 30s.
2. Shut down the storage node (rhqctl stop/start --storage) for a couple of minutes and observe whether there are gaps in the graph.

However, the server failed to reconnect to the storage node, so I had to restart both the server and the storage node (https://bugzilla.redhat.com/show_bug.cgi?id=1212627)


A. Confirmed the issue is in 3.3.0 GA. See the "Before" screen capture.

B. Stopped the storage node only, but the server had to be restarted as well (see BZ 1212627 mentioned above). There seemed to be one missing data point ("After gap1" png).

C. Stopped both the storage node and the server for about 13 minutes. Data were backfilled for that period; no gaps observed ("After no-gaps" png).

Recommended action: reverify after fixing BZ 1212627

Comment 8 Viet Nguyen 2015-04-16 20:58:25 UTC
Created attachment 1015382 [details]
verification logs

Comment 9 Michael Burman 2015-04-17 12:12:01 UTC
These test procedures are incorrect and exercise the wrong issue. When you shut down the storage node, the server starts reporting to agents that it is in maintenance mode and won't accept metrics. That behavior hasn't been changed.

What you need to do is keep the storage node up and make sure it's not able to ingest all the events that are generated. In that case, it should start throwing things back to the agent once the internal queue is filled. This is possible with a very low request limit, for example (or by otherwise preventing the Cassandra layer from processing the writes).

Comment 11 Viet Nguyen 2015-04-17 20:05:56 UTC
What is the exception to look for once the internal queue is full?

Comment 12 Michael Burman 2015-04-18 20:38:39 UTC
MeasurementStorageException, "The server is overloaded, queue is full."

Comment 13 Viet Nguyen 2015-04-21 04:11:56 UTC
I'm still not able to reproduce the bug in the 3.3.0 GA build with the following steps:

1. Set request min = 5
2. Disrupt the connection to Cassandra (NoHostAvailableException) without causing the server to go into maintenance mode:
 # iptables -A INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.20 -j DROP

Please suggest a verification strategy.  Thanks.

Comment 14 Michael Burman 2015-04-21 05:39:38 UTC
Interrupting the connection to Cassandra is not the key (the connection to the Storage Node has to stay alive). One solution is to reduce the RequestLimit so low that the server can't handle the load you're pushing to it (you need to push more metrics to fill up the queue). Using perftest plus a low RequestLimit should do the trick.

Comment 15 Viet Nguyen 2015-04-21 22:16:10 UTC
Is there a way to inspect the queue size, rate limit, etc.? I have not been able to reproduce any exceptions in 3.3.0 or 3.3.2 with perftest metric schedules set at a 30s interval (~10K metrics per minute).

#perftest plugin config
-Drhq.perftest.scenario=configurable-1 -Drhq.perftest.server-a-count=10 -Drhq.perftest.service-a-count=50

#rhq-server.properties
rhq.storage.request-limit=50

Comment 16 Michael Burman 2015-04-22 06:58:02 UTC
Yes, through JMX (because the rhq-server plugin isn't updated) you can view the available queue size as well as the rate limits. 10k/minute isn't enough; I ran this test for a week at 100k/minute and couldn't fill the queue with a single storage node (on a VM).

So with default settings it's going to be painful without slowing Cassandra down a lot.
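Reading an attribute over JMX can be sketched as below. The actual RHQ MBean name and attribute for the queue size are not given in this bug, so a standard platform MBean stands in; substitute the real ObjectName when connecting to the server's JMX endpoint.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/**
 * Minimal sketch of reading a metric over JMX. The RHQ storage-client MBean
 * name/attribute for queue size are hypothetical here; this example queries
 * the platform MBean server so it runs anywhere.
 */
public class JmxPeek {
    public static Object readAttribute(String objectName, String attribute) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        return server.getAttribute(new ObjectName(objectName), attribute);
    }
}
```

Against a remote server you would obtain an `MBeanServerConnection` via `JMXConnectorFactory` instead of the platform MBean server, but the `getAttribute` call is the same.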

Comment 19 errata-xmlrpc 2015-04-30 16:09:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0920.html