Bug 1599625

Summary: [GSS](6.4.z) Host controllers can not connect to domain after creating a rollout plan and restarting the master host controller
Product: [JBoss] JBoss Enterprise Application Platform 6 Reporter: tmiyargi
Component: Domain Management Assignee: Jiri Ondrusek <jondruse>
Status: CLOSED CURRENTRELEASE QA Contact: Peter Mackay <pmackay>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.4.21 CC: bmaxwell, brian.stansberry, dandread, dcihak, jbaesner, jondruse, pmackay, rstancel
Target Milestone: CR1   
Target Release: EAP 6.4.21   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-08-19 12:45:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1567790    

Description tmiyargi 2018-07-10 08:34:39 UTC
Creating a rollout plan and then restarting the DC host prevents the other hosts from connecting to the master again. The slave HC fails to connect with the error JBAS014687: Resource is immutable, and the DC logs many errors like:

JBAS012119:  cancelled task by interrupting thread Thread[Host Controller Service Threads - 117,5,Host Controller Service Threads]

To reproduce, create a domain with a master and a slave, then create a rollout plan and reload the master like this:

rollout-plan add --name=my-plan --content={rollout groupa^groupb}
/host=my-dc:reload

Comment 4 Brian Stansberry 2018-07-10 14:59:14 UTC
What doesn't work is a slave HC reconnecting to the master following a loss of connectivity. A common cause is the master being reloaded, which is the specific case reported here. Other things that cause reconnection, e.g. a network outage detected by the slave and later resolved, would result in the same problem.

There is a guard in the code that rejects a particular call path for providing updates to rollout-plan resources unless the resource is in a kind of "initial" state, i.e. the state it would be in early in HC boot. When the slave HC reconnects, it syncs its local copy of the domain-wide model with what the master currently has, and while doing that it uses the call path that's being rejected.

Best fix is probably to eliminate that guard as the value it provides is basically theoretical, a check against EAP developers doing something wrong that is hard to imagine actually being done. Trying to work around the call path that trips the guard would add complexity to already complex code.
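
For readers unfamiliar with the code, the guard described in comment 4 can be illustrated with a minimal, self-contained Java sketch. This is not the EAP source: every class and method name here (RolloutPlanGuardSketch, PlanResource, SlaveModel, writeModel, syncFrom) is hypothetical, and the model is reduced to a map of plan names to content strings. It assumes only what the comment states: a write path that is accepted while the resource is in its initial state and rejected afterwards, and a reconnect sync that goes through that same path.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: all names are hypothetical, not taken from the EAP code base.
class RolloutPlanGuardSketch {

    /** Stand-in for a rollout-plan resource in an HC's local copy of the domain model. */
    static class PlanResource {
        private String content; // null represents the "initial" state (nothing written yet)

        /** Guarded call path: accepted only while the resource is still in its initial state. */
        void writeModel(String newContent) {
            if (content != null) {
                // Loosely mirrors the reported "JBAS014687: Resource is immutable"
                throw new UnsupportedOperationException("Resource is immutable");
            }
            content = newContent;
        }

        String getContent() {
            return content;
        }
    }

    /** Stand-in for a slave HC's local copy of the domain-wide model. */
    static class SlaveModel {
        final Map<String, PlanResource> plans = new LinkedHashMap<>();

        /** Sync with the master: every plan is pushed through the guarded path. */
        void syncFrom(Map<String, String> masterPlans) {
            for (Map.Entry<String, String> e : masterPlans.entrySet()) {
                plans.computeIfAbsent(e.getKey(), k -> new PlanResource())
                     .writeModel(e.getValue());
            }
        }
    }

    /** Returns true if the second sync (the reconnect) is rejected by the guard. */
    static boolean reconnectIsRejected() {
        Map<String, String> master = new LinkedHashMap<>();
        master.put("my-plan", "{rollout groupa^groupb}");

        SlaveModel slave = new SlaveModel();
        slave.syncFrom(master); // first sync: resource is in its initial state, so it is accepted
        try {
            slave.syncFrom(master); // reconnect sync: resource is no longer initial
            return false;
        } catch (UnsupportedOperationException ex) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("reconnect rejected: " + reconnectIsRejected());
    }
}
```

In this toy model, dropping the check in writeModel would let the second sync simply overwrite the local copy, which corresponds to the fix proposed above. How the real resource tracks its "initial" state is not described in this report.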