Bug 1147098
| Summary: | The server needs to handle failures inserting raw data | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Operations Network | Reporter: | John Sanda <jsanda> |
| Component: | Core Server, Storage Node | Assignee: | Michael Burman <miburman> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Foley <mfoley> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | JON 3.2, JON 3.2.1, JON 3.2.2, JON 3.2.3 | CC: | fbrychta, hrupp, loleary, mfoley, miburman, spinder, vnguyen |
| Target Milestone: | ER01 | Keywords: | Regression, Triaged |
| Target Release: | JON 3.3.2 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1147097 | Environment: | |
| Last Closed: | 2015-04-30 16:09:54 UTC | Type: | Bug |
| Bug Depends On: | 1147097 | | |
| Bug Blocks: | 1212980 | | |
| Attachments: | | | |
Description
John Sanda
2014-09-26 21:51:29 UTC
Moving into the CR01 target milestone as this missed the ER01 cutoff.

PR #162. In the release/jon3.3.x branch:
```
commit c9cd10cf0f541fe48cba03c03ce8fc1049b34f3f
Author: Michael Burman <miburman>
Date:   Mon Mar 16 16:14:41 2015 +0200

    [BZ 1147098] Add all the raw data first to a queue and from the queue
    write it to the Cassandra. Use the buffer to prevent bursts from
    overloading the Storage Node. Also, in case the queue is full, return an
    exception to the agent for agent queueing.

    (cherry picked from commit 5127e55c2cccb4ee10b96671bcffb56d45f1486d)
```
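The buffering strategy described in the commit message can be sketched as a bounded queue sitting between the measurement-report endpoint and the Cassandra writer. This is a minimal illustration under assumed names (`RawDataBuffer`, `addRawData`, `takeNext` are hypothetical, not the actual RHQ classes); only the exception message matches what is reported later in this bug:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the fix: raw data goes into a bounded queue first, and a
// background writer thread drains it to Cassandra. The queue absorbs
// bursts; when it is full, the caller gets an exception so the agent
// can spool the measurements locally and retry later.
public class RawDataBuffer {

    /** Thrown when the server-side queue is full; the agent should requeue. */
    public static class MeasurementStorageException extends RuntimeException {
        public MeasurementStorageException(String msg) { super(msg); }
    }

    private final BlockingQueue<double[]> queue;

    public RawDataBuffer(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking insert: reject immediately instead of stalling the caller. */
    public void addRawData(double[] dataPoint) {
        if (!queue.offer(dataPoint)) {
            throw new MeasurementStorageException(
                    "The server is overloaded, queue is full.");
        }
    }

    /** Called by the background writer thread to drain points to Cassandra. */
    public double[] takeNext() throws InterruptedException {
        return queue.take();
    }

    public int size() {
        return queue.size();
    }
}
```

The non-blocking `offer` is the important design choice: a blocking `put` would stall the agent's HTTP request during a burst, whereas failing fast pushes the backpressure out to the agent, which already knows how to spool and resend.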
Moving to ON_QA for testing with the latest cumulative patch build: https://brewweb.devel.redhat.com//buildinfo?buildID=429507 Note: the build maps to the JON 3.3.2 ER01 build.

General verification steps:
1. Increase the CPU system/user load collection interval to 30s.
2. Shut down the storage node (rhqctl stop/start --storage) for a couple of minutes and observe whether there are gaps in the graph.

However, the server failed to reconnect to the storage node, so I had to restart both the server and the storage node (https://bugzilla.redhat.com/show_bug.cgi?id=1212627).

A. Confirmed the issue is present in 3.3.0 GA. See the "Before" screen capture.
B. Stopped the storage node only, but the server had to be restarted as well (see BZ 1212627 mentioned above). There appeared to be one missing data point ("After gap1" png).
C. Stopped both the storage node and the server for about 13 minutes. Data were backfilled for that period; no gaps observed ("After no-gaps" png).

Recommended action: reverify after fixing BZ 1212627.

Created attachment 1015382 [details]
verification logs
These test procedures are incorrect and exercise the wrong issue. When you shut down the storage node, the server starts reporting to agents that it is in maintenance mode and won't accept metrics; that behavior has not changed. What you need to do is keep the storage node up and make sure it cannot ingest all the events being generated. In that case, the server should start throwing data back to the agent once its internal queue fills up. This can be done with a very low request limit, for example (or by otherwise preventing the Cassandra layer from processing the writes).

What is the exception to look for once the internal queue is full?

MeasurementStorageException, "The server is overloaded, queue is full."

I'm still not able to reproduce the bug in the 3.3.0 GA build with the following steps:
1. Set request min = 5
2. Disrupt the connection to Cassandra (No Host Available exception) without causing the server to go into maintenance mode:
# iptables -A INPUT -p tcp --dport 9142 -m statistic --mode random --probability 0.20 -j DROP
Please suggest a verification strategy. Thanks.

Interrupting the connection to Cassandra is not the key (the connection to the storage node has to stay alive). One solution is to reduce the request limit so low that the server cannot handle the load you're pushing at it (you need to push enough metrics to fill up the queue). Using perftest plus a low request limit should do the trick.

Is there a way to inspect the queue size, rate limit, etc.? I have not been able to reproduce any exceptions in 3.3.0 or 3.3.2 with perftest metric schedules set at a 30s interval (~10K metrics per minute):
# perftest plugin config
-Drhq.perftest.scenario=configurable-1
-Drhq.perftest.server-a-count=10
-Drhq.perftest.service-a-count=50
# rhq-server.properties
rhq.storage.request-limit=50

Yes, through JMX (because the rhq-server plugin isn't updated) you can view the available queue size as well as the rate limits.
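The "throw it back to the agent" side of the design can be sketched as well. This is a hypothetical illustration, not the actual RHQ agent code (the `AgentSpooler` class, the boolean-returning sender, and the flush-oldest-first policy are all assumptions): an agent that gets a rejection keeps the data point in a local spool and retries it with a later report instead of losing it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Sketch of agent-side requeueing: when the server rejects a data point
// because its internal queue is full, the agent spools the point locally
// and retries it (oldest first) with the next report.
public class AgentSpooler {

    /** Sender returns true on success, false when the server rejected the data. */
    private final Predicate<double[]> sender;
    private final Deque<double[]> spool = new ArrayDeque<>();

    public AgentSpooler(Predicate<double[]> sender) {
        this.sender = sender;
    }

    /** Report one data point, flushing any previously spooled points first. */
    public void report(double[] dataPoint) {
        spool.addLast(dataPoint);
        // Flush oldest-first; stop at the first rejection and keep the rest.
        while (!spool.isEmpty() && sender.test(spool.peekFirst())) {
            spool.pollFirst();
        }
    }

    public int spooledCount() {
        return spool.size();
    }
}
```

With this shape, a full server queue degrades gracefully: nothing is dropped, the backlog simply accumulates on the agent until the storage node catches up.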
10k/minute isn't enough. I ran this test for a week at 100k/minute and couldn't fill the queue with a single storage node (on a VM). So with default settings this is going to be painful to reproduce without slowing Cassandra down considerably.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0920.html