Bug 1563627

Summary: Service and VM retirement are non-deterministic, running in parallel
Product: Red Hat CloudForms Management Engine
Reporter: Gellert Kis <gekis>
Component: Automate
Assignee: drew uhlmann <duhlmann>
Status: CLOSED CURRENTRELEASE
QA Contact: Niyaz Akhtar Ansari <nansari>
Severity: high
Priority: high
Docs Contact:
Version: 5.8.0
CC: cpelland, duhlmann, gekis, igortiunov, mkanoor, mshriver, obarenbo, smallamp, tfitzger
Target Milestone: GA
Keywords: TestOnly, ZStream
Target Release: 5.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 5.10.0.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1570950, 1570951
Environment: all CFME (bug should be cloned)
Last Closed: 2019-01-24 14:29:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: Bug
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Unknown
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1570950, 1570951

Comment 2 drew uhlmann 2018-04-04 14:27:27 UTC
Could you please ask the customer to supply logs for the scenario where both retirements process concurrently?

Comment 5 drew uhlmann 2018-04-05 20:14:24 UTC
We had a change to update_service_retirement_status.rb, which lives at ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb, line 35. Could the customer please check that that line is correct? We fixed a typo there in https://github.com/ManageIQ/manageiq-content/pull/189/files, and the line should read ```if step.downcase == 'startretirement'```.
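For context, a minimal sketch of where such a comparison sits in an automate method; only the quoted comparison comes from the fix itself, the surrounding lines (including reading the step from $evm.root['ae_state']) are my assumption about the context:

```ruby
# Illustrative automate-method fragment; only the comparison on the `if`
# line is confirmed by the fix, the surrounding lines are assumed context.
step = $evm.root['ae_state']   # name of the current state-machine step

# With the old typo the string never matched, so the branch guarding the
# StartRetirement step was silently skipped on every run.
if step.downcase == 'startretirement'
  $evm.log('info', "Updating retirement status for step [#{step}]")
end
```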

Comment 8 drew uhlmann 2018-04-06 16:44:22 UTC
I can see that there is custom code in use for the retirement. The start_retirement step doesn't include the update_service_retirement_status method, and the custom method bumpretirementdate runs immediately after start_retirement, which clears the retirement_state (see the sketch below). Will the customer please disable all of the custom code, run the stock ManageIQ Service retirement state machine, and, if two Service retirement state machines still run for the same Service, resend the logs showing it?
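To spell out why that matters, a hypothetical timeline; service.start_retirement and service.retiring? are the methods named in this BZ, while the direct assignment merely stands in for whatever the customer's bumpretirementdate step actually does:

```ruby
service.start_retirement   # sets retirement_state to 'retiring'
service.retiring?          # => true; a second retirement would be refused

# The custom bumpretirementdate step runs immediately afterwards and clears
# the state (illustrated here as a direct assignment):
service.retirement_state = nil
service.retiring?          # => false; another appliance is now free to
                           #    start a second retirement state machine
```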

Comment 9 ITD27M01 2018-04-09 12:29:29 UTC
Hi drew uhlmann,

I will create the logs for you, but could you first analyze the following:

In an environment with multiple worker appliances that have the automation role enabled, the request_service_retire event can occur on more than one appliance.

Assume the following (intended) scenario:

1. [1] The first appliance checks the service retirement state via the service.retiring? and service.retired? methods.
2. [1] The first appliance succeeds in starting retirement: it sets the retirement state by invoking service.start_retirement and initiates the retirement state machine.
3. [2] The second appliance checks the service retirement state via service.retiring? and service.retired?.
4. [2] The second appliance fails to start because the service retirement_state is already 'retiring'.

But in our universe there can be another sequence:

1. [1] The first appliance checks the service retirement state via service.retiring? and service.retired?.
2. [2] The second appliance checks the service retirement state via service.retiring? and service.retired?.
3. [1] The first appliance succeeds in start_retirement: it sets the retirement state by invoking service.start_retirement and initiates the retirement state machine.
4. [2] The second appliance also succeeds in start_retirement: it sets the retirement state by invoking service.start_retirement and initiates a second retirement state machine.

This is a classic "race condition" problem. Race conditions have a reputation for being difficult to reproduce and debug, since the end result is nondeterministic and depends on the relative timing between interfering threads. Problems occurring in production systems can therefore disappear when running in debug mode, when additional logging is added, or when a debugger is attached (the so-called "Heisenbug"). It is therefore better to avoid race conditions through careful software design than to attempt to fix them afterwards.
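To make the window concrete, here is a minimal sketch of the check-then-act pattern described above, together with one standard way to close it: an atomic, database-level compare-and-set. The ActiveRecord calls are generic Ruby illustration under my assumptions (a Service model with a retirement_state column); initiate_retirement_state_machine is a hypothetical helper, and none of this is the fix that was actually merged:

```ruby
# Racy check-then-act: both appliances can pass the check before either
# has persisted 'retiring', so both go on to start a state machine.
unless service.retiring? || service.retired?
  service.start_retirement   # sets retirement_state = 'retiring'
  # ... initiate the retirement state machine ...
end

# Atomic alternative (illustrative): claim the service in a single UPDATE
# guarded by the current state, so exactly one caller wins the race.
claimed = Service.where(id: service.id, retirement_state: nil)
                 .update_all(retirement_state: 'retiring')
initiate_retirement_state_machine(service) if claimed == 1  # hypothetical helper
```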

Comment 11 drew uhlmann 2018-04-10 13:45:16 UTC
Since services are not tied to zones, the scheduler in any zone can initiate the retirement. This is expected behavior. I'm in the process of perusing the logs.

Comment 12 Tina Fitzgerald 2018-04-10 14:47:31 UTC
Hi Gellert,

Can you let us know when the new set of logs is ready?

Thanks,
Tina

Comment 15 drew uhlmann 2018-04-11 13:02:21 UTC
I have a PR open: https://github.com/ManageIQ/manageiq/pull/17280. However, without the logs from the last run, the best I can say is that this PR may not completely solve the issue in this BZ, but it will help out-of-the-box retirement in any case.

Comment 18 Gellert Kis 2018-04-11 13:54:06 UTC
(Log files available)

Comment 19 Satoe Imaishi 2018-04-23 20:03:20 UTC
*** Bug 1568522 has been marked as a duplicate of this bug. ***

Comment 23 Sudhir Mallamprabhakara 2019-01-24 14:29:13 UTC
There is not much QE can do about this one; it will have to go out unverified.

https://bugzilla.redhat.com/show_bug.cgi?id=1570950#c5