Bug 739194

Summary: No Version#0 changeset created for drift configurations created during network outage, even after the outage is repaired (scenario #7)
Product: [Other] RHQ Project
Reporter: Mike Foley <mfoley>
Component: drift
Assignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.1
CC: jsanda
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-02-07 19:29:09 UTC
Type: ---
Bug Depends On:    
Bug Blocks: 707225    
Attachments:
Agent log (flags: none)

Description Mike Foley 2011-09-16 18:22:29 UTC
Description of problem:  No Version#0 changeset created for drift configurations created during network outage, even after the outage is repaired (scenario #7)


How reproducible:
100%. Both jsanda and mfoley have reproduced this.

Steps to Reproduce:
1.  There is a network partition.  My configuration was the server on Linux and
the agent on my PC, connected by wifi.
2.  I simulated a network failure (I turned wifi off).
3.  I created a drift configuration during the network outage.
4.  I turned wifi back on, resolving the network failure.
5.  I did not observe a version #0 changeset for the new drift configuration.

  
Actual results:
no version #0 changeset



Expected results:
a version #0 changeset

Additional info:

Comment 1 Mike Foley 2011-09-16 18:25:13 UTC
Created attachment 523604 [details]
Agent log

Comment 2 John Sanda 2011-10-04 18:19:40 UTC
This issue should be resolved now with changes introduced around error handling.

commit hash: 3f3397557aedabbd11420c022344776d08f76e2e

From the commit log...

    This commit introduces several changes and a changed work flow to
    address some boundary conditions that can arise when the server fails
    to receive a change set report. The issue stems from the way we stream
    the change set report to the server. Because the request is processed
    asynchronously, we cannot know with certainty if/when errors arise in the
    comm layer.
    
    When DriftDetector runs, a new snapshot file is generated, and now a
    copy of the previous version snapshot is maintained as well. After the
    server processes the change set, it now sends an ack to the agent. This
    lets the agent know that the change set was successfully persisted on
    the server. The agent then cleans up, deleting the previous version
    snapshot, and the change set zip file.
    
    If drift detection runs again before the agent receives the
    acknowledgement, drift detection is skipped. The most likely scenario for
    not receiving an acknowledgement would be a network error or a down
    server.
    
    If any errors occur during drift detection, which includes sending the
    change set to the server, the agent will attempt to revert back to the
    previous version snapshot. This is to ensure we have a consistent
    snapshot on disk with which to work.
    
    This commit also fixes a bug in the drift inventory sync code. In
    situations where there are existing change sets on the server, and the
    agent has to fetch a snapshot from the server, the snapshot version was
    getting set incorrectly. This is because the snapshot was not being
    built correctly. Change sets were being applied out of order. This is
    fixed now.
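
The work flow in the commit log above can be sketched roughly as follows. This is a minimal in-memory model, not the RHQ implementation: the class `AckWorkflowSketch`, its methods, and the use of plain version numbers in place of snapshot files are all assumptions made for illustration. It shows the three behaviors the commit log describes: keeping the previous snapshot until the server acks, skipping detection while an ack is outstanding, and reverting when an error occurs.

```java
import java.util.function.IntConsumer;

// Hypothetical in-memory model of the ack-based work flow.
// All names here are illustrative and are not taken from the RHQ code base.
class AckWorkflowSketch {
    private int currentVersion;       // version of the snapshot "on disk"
    private Integer previousVersion;  // copy kept until the server acks; null when absent

    AckWorkflowSketch(int initialVersion) {
        this.currentVersion = initialVersion;
    }

    /**
     * Runs one detection pass. Returns false if the pass was skipped
     * because an earlier change set has not been acknowledged yet.
     */
    boolean detect(IntConsumer sendChangeSet) {
        if (previousVersion != null) {
            return false; // still waiting for the ack, so skip this run
        }
        previousVersion = currentVersion; // keep a copy of the previous snapshot
        currentVersion++;                 // "generate" the new snapshot
        try {
            sendChangeSet.accept(currentVersion);
            return true;
        } catch (RuntimeException e) {
            revert(); // any error: fall back to the previous snapshot
            throw e;
        }
    }

    /** Server confirmed it persisted the change set: clean up the copy. */
    void onAck() {
        previousVersion = null;
    }

    /** Restore the previous snapshot so the next run starts from a consistent state. */
    void revert() {
        if (previousVersion != null) {
            currentVersion = previousVersion;
            previousVersion = null;
        }
    }

    int currentVersion() {
        return currentVersion;
    }

    boolean awaitingAck() {
        return previousVersion != null;
    }
}
```

In this model a silently lost report leaves `awaitingAck()` true, so later runs are skipped until something calls `revert()`, which is the role the inventory sync plays after a reconnect.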

Comment 3 Mike Foley 2011-10-04 21:41:46 UTC
documenting the behavior here:

1) I did *not* get a version #0 changeset after repairing the outage
2) I did *not* get a version #0 changeset after repairing the outage and clicking the "detect now" button
3) only after a subsequent change was detected as drift did a changeset appear, and that change was picked up as version #0

This is different from what I expected. I expected a version #0 changeset after the network outage was repaired and I clicked the "detect now" button.

jsanda ... can you clarify if the behavior i am seeing is correct or not?

Comment 4 John Sanda 2011-10-05 13:57:41 UTC
You definitely should get that initial change set at some point after the agent reconnects with the server. I think I see the problem. I forgot to handle the base case. For any version greater than zero, the agent checks to see if there is a copy of the previous snapshot before doing a drift scan. That previous snapshot only gets removed when the server acknowledges that it processed the change set successfully, so its presence lets the agent know that something may have gone wrong. This would likely be the case during a network outage. When the agent and server reconnect, an inventory sync runs, and the agent will revert to the previous snapshot, which will trigger the agent to resend the change set to the server (assuming that drift is still present). I need to put similar logic in place for the initial change set.
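
To illustrate the missing base case: for version 0 there is no previous snapshot, so a check based only on its presence can never flag an unacknowledged initial change set. The sketch below shows one possible shape for a check that covers both cases. It is an assumption for illustration only; the class `ResendCheckSketch`, the method `needsResend`, and the idea of tracking a separate ack flag for the initial change set are hypothetical and do not describe the fix that was actually committed.

```java
// Hypothetical sketch; none of these names come from the RHQ code base.
class ResendCheckSketch {
    /**
     * Decides whether the agent should treat the last change set as
     * possibly lost and resend it after reconnecting.
     *
     * @param currentVersion         version of the snapshot on disk
     * @param previousSnapshotExists true if a copy of the previous snapshot is still on disk
     * @param initialChangeSetAcked  true if the server acknowledged the version 0 change set
     *                               (an assumed extra piece of state)
     */
    static boolean needsResend(int currentVersion,
                               boolean previousSnapshotExists,
                               boolean initialChangeSetAcked) {
        if (currentVersion == 0) {
            // Base case: no previous snapshot can exist for version 0,
            // so rely on an explicit ack flag instead.
            return !initialChangeSetAcked;
        }
        // Versions greater than zero: a leftover previous snapshot means
        // the server never acknowledged the change set.
        return previousSnapshotExists;
    }
}
```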

Comment 5 John Sanda 2011-11-21 20:39:56 UTC
I have retested this and got the expected results. I think this was fixed some time ago and I just forgot to update the BZ.

Comment 6 Mike Foley 2012-02-07 19:29:09 UTC
changing status of VERIFIED BZs for JON 2.4.2 and JON 3.0 to CLOSED/CURRENTRELEASE

Comment 7 Mike Foley 2012-02-07 19:30:03 UTC
marking VERIFIED BZs to CLOSED/CURRENTRELEASE