Bug 1161806 - Aggregation timeslices not properly computed due to DST changes
Summary: Aggregation timeslices not properly computed due to DST changes
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Storage Node
Version: JON 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: CR02
Target Release: JON 3.3.0
Assignee: Stefan Negrea
QA Contact: Filip Brychta
URL:
Whiteboard:
Keywords: Regression
Depends On:
Blocks: 1206671
Reported: 2014-11-07 23:04 UTC by Stefan Negrea
Modified: 2015-03-27 17:43 UTC
Clone Of:
Clones: 1206671
Last Closed: 2014-12-11 14:00:21 UTC


Description Stefan Negrea 2014-11-07 23:04:31 UTC
Description of problem:
When switching from DST to non-DST time (or the other way around), the aggregation time slices are not computed properly, so data is not aggregated.

Version-Release number of selected component (if applicable):
3.3.0 ER01

How reproducible:
Every time.

Steps to Reproduce:
1. Set the system date and time to just a few days after a DST or non-DST time change.
2. Collect metrics
3. Check data aggregation

Actual results:
Data collected after the time change does not get aggregated until the time change is 2 weeks in the past.

Expected results:
Data aggregation continues to run properly after the time change.


Additional info:
Because of the time change, time slices are skewed by an hour, so slice queries fail to retrieve any data. This happens on both time changes, to DST and to non-DST.
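
The one-hour skew can be reproduced outside JON with a minimal sketch. It assumes 6 hour slices floored on the local wall clock, a server in the America/Chicago zone, and the 2014-11-02 US transition; the class and helper names are hypothetical, and it uses the JDK `java.time` API rather than the actual server metrics code.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

public class WallClockSliceSkew {

    // Hypothetical helper (not the RHQ API): start of the wall-clock slice
    // that contains t, for slices sliceHours long in the given zone.
    static Instant wallClockSliceStart(Instant t, ZoneId zone, int sliceHours) {
        ZonedDateTime hour = t.atZone(zone).truncatedTo(ChronoUnit.HOURS);
        return hour.minusHours(hour.getHour() % sliceHours).toInstant();
    }

    // How many whole hours a slice start sits past the previous UTC slice boundary.
    static long hoursOffUtcGrid(Instant sliceStart, int sliceHours) {
        return Math.floorMod(sliceStart.getEpochSecond() / 3600, (long) sliceHours);
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("America/Chicago"); // assumed server time zone
        Instant beforeChange = Instant.parse("2014-11-01T19:30:00Z"); // 14:30 CDT
        Instant afterChange = Instant.parse("2014-11-03T20:30:00Z");  // 14:30 CST
        for (Instant t : new Instant[] {beforeChange, afterChange}) {
            Instant start = wallClockSliceStart(t, zone, 6);
            System.out.println(t + " -> slice start " + start
                + ", hours off UTC grid: " + hoursOffUtcGrid(start, 6));
        }
    }
}
```

Both timestamps fall in the local 12:00 to 18:00 slice, but the slice start is 17:00 UTC before the change and 18:00 UTC after it, so slice boundaries recorded before the transition no longer line up with the ones computed afterwards.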

In the current state, data will eventually get aggregated after about 2 weeks, since the aggregation time window shifts due to data retention. But the system will then attempt to aggregate about 2 weeks' worth of data at once, most likely causing a system crash.

This problem is only applicable to JON 3.3, since there are significant changes to the data storage and aggregation code. This bug is a release blocker because users will notice a lack of aggregate data for about 2 weeks after a DST or non-DST transition.

Comment 1 Stefan Negrea 2014-11-07 23:17:37 UTC
Fixed the slice being wrongly computed after a DST transition. If the duration of the time slice is more than one hour, account for the timezone offset and adjust the time slice start time. This addresses the issue where the query for time slice data was not returning data due to a skewed start time.


master branch commit:

commit 6b31fdb7a6ec78d1db18aa340e900f4151898ec6
Author: Stefan Negrea <snegrea@redhat.com>
Date:   Fri Nov 7 17:13:05 2014 -0600

    [BZ 1161806] Fix the slice being wrongly computed after a DST transition. If the duration of the time slice is more than one h

Comment 2 Simeon Pinder 2014-11-10 01:56:04 UTC
As the DST failure is still consistently occurring across all pg and ora tests, I'm going to cherry-pick this to release/jon3.3.x and run the entire regression suite again to see if we get all green. If on further investigation of the cherry-pick it is decided that this is too risky then I can revert.  As of now all other regression tests pass for CR02/GA build.

Cherry-picked to release/jon3.3.x with commit: 2fe8255ef8ede

@Heiko, Do you have any other reservations/modifications with/for this fix? 

Moving to MODIFIED.

Comment 3 Simeon Pinder 2014-11-11 14:30:03 UTC
Moving this back to ASSIGNED: Stefan indicates that, after conversing with John Sanda and more detailed testing, a little more work is required to finish fixing this issue.

Comment 7 Stefan Negrea 2014-11-12 14:57:40 UTC
A new set of changes has been added to the master branch:

commit a45eb3e477cbb9b284da28d8049c4a9514584d62
Author: Stefan Negrea <snegrea@redhat.com>
Date:   Tue Nov 11 15:39:01 2014 -0600

    [BZ 1161806] Update the migration code to handle cases where the data spans a DST change. Without this c

commit 8a5df3d3922db1652220222b1d7384a40bc37093
Author: Stefan Negrea <snegrea@redhat.com>
Date:   Tue Nov 11 13:51:58 2014 -0600

    [BZ 1161806] Fix broken unit tests due to timeslice shift to UTC timezone based.

commit cf281ecb648c80bd8659a02ca19af0eddb13c684
Author: Stefan Negrea <snegrea@redhat.com>
Date:   Tue Nov 11 13:46:02 2014 -0600

    [BZ 1161806] Shift the timeslice to a correct UTC timeslice during the index migration. This will gurant

commit ffd54287c7c02a3552d3f6c7288a00a6cd85bed9
Author: Stefan Negrea <snegrea@redhat.com>
Date:   Mon Nov 10 16:56:10 2014 -0600

    [BZ 1161806] Switch the entire server metrics code to use an UTC timezone in all the code. This will avo

Comment 8 Stefan Negrea 2014-11-12 15:01:50 UTC
To reiterate the issue: only the 6 hour and 24 hour aggregates were impacted, because the aggregation slices were set in wall clock intervals. This caused issues immediately after DST transitions and in cases where an HA environment had JON servers distributed across different timezones.
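
A small sketch of the HA case, under stated assumptions: two servers in hypothetical zones (America/Chicago and Europe/Prague) floor the same timestamp on their own wall clocks. The helper is illustrative and uses the JDK `java.time` API, not the server metrics code. With whole-hour zone offsets the 1 hour slice start is identical on both servers, while the 6 hour and 24 hour slice starts differ.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

public class SliceAgreementAcrossZones {

    // Hypothetical helper (not the RHQ API): wall-clock slice start in a zone.
    static Instant wallClockSliceStart(Instant t, ZoneId zone, int sliceHours) {
        ZonedDateTime hour = t.atZone(zone).truncatedTo(ChronoUnit.HOURS);
        return hour.minusHours(hour.getHour() % sliceHours).toInstant();
    }

    public static void main(String[] args) {
        // Both zones are assumptions chosen for illustration only.
        ZoneId serverA = ZoneId.of("America/Chicago");
        ZoneId serverB = ZoneId.of("Europe/Prague");
        Instant t = Instant.parse("2014-11-12T15:30:00Z");

        for (int sliceHours : new int[] {1, 6, 24}) {
            Instant a = wallClockSliceStart(t, serverA, sliceHours);
            Instant b = wallClockSliceStart(t, serverB, sliceHours);
            System.out.println(sliceHours + "h slice: A=" + a + " B=" + b
                + " agree=" + a.equals(b));
        }
    }
}
```

The 1 hour case agrees only because both offsets are whole hours; a zone with a half-hour offset would disagree there as well.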

The new set of fixes switches the entire server metrics code to use the UTC timezone. This avoids any DST issues, since there is no DST in the UTC timezone. It also avoids problems in HA environments where the servers are distributed across different timezones.
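
As an illustration of why a UTC-based slice is stable, here is a minimal sketch that floors the epoch timestamp to a multiple of the slice duration. The class and method names are hypothetical and this is not the committed code; it only shows that the result does not depend on any server's local zone or on DST.

```java
import java.time.Duration;
import java.time.Instant;

public class UtcSliceSketch {

    // Hypothetical helper: start of the UTC-aligned slice containing t.
    // The Unix epoch starts at 00:00 UTC, so flooring epoch millis to a
    // multiple of the slice length yields boundaries on the UTC grid.
    static Instant utcSliceStart(Instant t, Duration slice) {
        long sliceMillis = slice.toMillis();
        long floored = Math.floorDiv(t.toEpochMilli(), sliceMillis) * sliceMillis;
        return Instant.ofEpochMilli(floored);
    }

    public static void main(String[] args) {
        Instant beforeChange = Instant.parse("2014-11-01T19:30:00Z");
        Instant afterChange = Instant.parse("2014-11-03T20:30:00Z");
        System.out.println(utcSliceStart(beforeChange, Duration.ofHours(6)));
        System.out.println(utcSliceStart(afterChange, Duration.ofHours(6)));
        System.out.println(utcSliceStart(afterChange, Duration.ofHours(24)));
    }
}
```

Every server computes the same boundaries (00:00, 06:00, 12:00 and 18:00 UTC for 6 hour slices) regardless of where it runs.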


Also, the migration job from the old indexes to the new indexes was updated to translate all the index entries into UTC timezone based slice entries.
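
The actual decision process for the index migration is in the attached diagram and is not reproduced here. Purely as an assumption-labeled sketch, one way to translate an old wall-clock slice entry is to map it to every UTC-aligned slice it overlaps; all names below are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class IndexMigrationSketch {

    // Hypothetical helper: start of the UTC-aligned slice containing t.
    static Instant utcSliceStart(Instant t, Duration slice) {
        long sliceMillis = slice.toMillis();
        return Instant.ofEpochMilli(
            Math.floorDiv(t.toEpochMilli(), sliceMillis) * sliceMillis);
    }

    // Assumed approach (not confirmed by the bug report): an old slice
    // [oldStart, oldStart + slice) is translated into all UTC slices it overlaps.
    static List<Instant> utcSlicesCovering(Instant oldStart, Duration slice) {
        List<Instant> result = new ArrayList<>();
        Instant oldEnd = oldStart.plus(slice);
        for (Instant s = utcSliceStart(oldStart, slice); s.isBefore(oldEnd); s = s.plus(slice)) {
            result.add(s);
        }
        return result;
    }

    public static void main(String[] args) {
        Duration sixHours = Duration.ofHours(6);
        // Old entry aligned to a UTC-5 wall clock: overlaps two UTC slices.
        System.out.println(utcSlicesCovering(Instant.parse("2014-11-01T17:00:00Z"), sixHours));
        // Entry already on the UTC grid: maps to itself.
        System.out.println(utcSlicesCovering(Instant.parse("2014-11-03T18:00:00Z"), sixHours));
    }
}
```

Under this assumption an entry that is already UTC-aligned is left unchanged, while a skewed entry produces two UTC entries so that no raw data is missed by later aggregation.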

Comment 9 Stefan Negrea 2014-11-12 16:42:54 UTC
Created attachment 956797 [details]
Metrics Aggregation Index Update

A diagram to illustrate the decision process for updating metrics aggregation indexes to UTC timezone slices.

Comment 10 John Sanda 2014-11-12 19:28:12 UTC
Changes have been merged to the release/jon3.3.x branch.

commit hashes:
2fe8255ef8
db908a0e8
c04cd203
818d986fc
21fa2ecdc

Comment 11 Simeon Pinder 2014-11-14 04:48:20 UTC
Moving to ON_QA as available for test with build:
https://brewweb.devel.redhat.com//buildinfo?buildID=398756

Comment 16 Stefan Negrea 2014-11-20 14:42:48 UTC
This BZ was discovered and further verified by the following sets of developer integration tests:

1) org.rhq.server.metrics.MetricsServerTest
2) org.rhq.cassandra.schema.ReplaceIndexTest

These tests failed around the DST transition and kept failing until a proper fix was developed. Many other tests for the aggregation code did not report an error before or after the change, which means no regressions were detected for the fix.

Also, the code fix itself went through review by the development team.

Comment 17 Filip Brychta 2014-11-21 15:17:36 UTC
Verified on
Version: 3.3.0.GA
Build Number: 4f16df3:e347f77
