Bug 1563627

Summary: Service and VM retirement are non-deterministic, running in parallel
Product: Red Hat CloudForms Management Engine
Reporter: Gellert Kis <gekis>
Component: Automate
Assignee: drew uhlmann <duhlmann>
Status: CLOSED CURRENTRELEASE
QA Contact: Niyaz Akhtar Ansari <nansari>
Severity: high
Priority: high
Docs Contact:
Version: 5.8.0
CC: cpelland, duhlmann, gekis, igortiunov, mkanoor, mshriver, obarenbo, smallamp, tfitzger
Target Milestone: GA
Keywords: TestOnly, ZStream
Target Release: 5.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 5.10.0.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1570950, 1570951
Environment: all CFME (bug should be cloned)
Last Closed: 2019-01-24 14:29:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: Bug
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Unknown
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1570950, 1570951

Comment 2 drew uhlmann 2018-04-04 14:27:27 UTC
Could you please ask the customer to supply logs for the scenario where both retirements process concurrently?

Comment 5 drew uhlmann 2018-04-05 20:14:24 UTC
We had a change to update_service_retirement_status.rb, which lives at ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb, line 35. Could the customer please check that that line is correct? We fixed a typo there in https://github.com/ManageIQ/manageiq-content/pull/189/files, and the line should read ```if step.downcase == 'startretirement'```.
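For context, a minimal sketch of where such a comparison sits in an automate method; only the quoted comparison comes from the fix itself, the surrounding lines (including reading the step from $evm.root['ae_state']) are my assumption about the context:

```ruby
# Illustrative automate-method fragment; only the comparison on the `if`
# line is confirmed by the fix, the surrounding lines are assumed context.
step = $evm.root['ae_state']   # name of the current state-machine step

# With the old typo the string never matched, so the branch guarding the
# StartRetirement step was silently skipped on every run.
if step.downcase == 'startretirement'
  $evm.log('info', "Updating retirement status for step [#{step}]")
end
```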

Comment 8 drew uhlmann 2018-04-06 16:44:22 UTC
I can see that there is custom code in use for the retirement. The start_retirement step doesn't include the update_service_retirement_status method, and the custom method bumpretirementdate runs immediately after start_retirement, which clears the retirement_state (see the sketch below). Will the customer please disable all of the custom code, run the stock ManageIQ Service retirement state machine, and, if two Service retirement state machines still run for the same Service, resend the logs showing it?
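To spell out why that matters, a hypothetical timeline; service.start_retirement and service.retiring? are the methods named in this BZ, while the direct assignment merely stands in for whatever the customer's bumpretirementdate step actually does:

```ruby
service.start_retirement   # sets retirement_state to 'retiring'
service.retiring?          # => true; a second retirement would be refused

# The custom bumpretirementdate step runs immediately afterwards and clears
# the state (illustrated here as a direct assignment):
service.retirement_state = nil
service.retiring?          # => false; another appliance is now free to
                           #    start a second retirement state machine
```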

Comment 9 ITD27M01 2018-04-09 12:29:29 UTC
Hi drew uhlmann,

I will create the logs for you, but could you first analyze the following:

In an environment with multiple worker appliances that have the automation role enabled, the request_service_retire event can occur on more than one appliance.

Assume the following (intended) scenario:

1. [1] The first appliance checks the service retirement state via the service.retiring? and service.retired? methods.
2. [1] The first appliance succeeds in starting retirement: it sets the retirement state by invoking service.start_retirement and initiates the retirement state machine.
3. [2] The second appliance checks the service retirement state via service.retiring? and service.retired?.
4. [2] The second appliance fails to start because the service retirement_state is already 'retiring'.

But in our universe there can be another sequence:

1. [1] The first appliance checks the service retirement state via service.retiring? and service.retired?.
2. [2] The second appliance checks the service retirement state via service.retiring? and service.retired?.
3. [1] The first appliance succeeds in start_retirement: it sets the retirement state by invoking service.start_retirement and initiates the retirement state machine.
4. [2] The second appliance also succeeds in start_retirement: it sets the retirement state by invoking service.start_retirement and initiates a second retirement state machine.

This is a classic "race condition" problem. Race conditions have a reputation for being difficult to reproduce and debug, since the end result is nondeterministic and depends on the relative timing between interfering threads. Problems occurring in production systems can therefore disappear when running in debug mode, when additional logging is added, or when a debugger is attached (the so-called "Heisenbug"). It is therefore better to avoid race conditions through careful software design than to attempt to fix them afterwards.
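To make the window concrete, here is a minimal sketch of the check-then-act pattern described above, together with one standard way to close it: an atomic, database-level compare-and-set. The ActiveRecord calls are generic Ruby illustration under my assumptions (a Service model with a retirement_state column); initiate_retirement_state_machine is a hypothetical helper, and none of this is the fix that was actually merged:

```ruby
# Racy check-then-act: both appliances can pass the check before either
# has persisted 'retiring', so both go on to start a state machine.
unless service.retiring? || service.retired?
  service.start_retirement   # sets retirement_state = 'retiring'
  # ... initiate the retirement state machine ...
end

# Atomic alternative (illustrative): claim the service in a single UPDATE
# guarded by the current state, so exactly one caller wins the race.
claimed = Service.where(id: service.id, retirement_state: nil)
                 .update_all(retirement_state: 'retiring')
initiate_retirement_state_machine(service) if claimed == 1  # hypothetical helper
```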

Comment 11 drew uhlmann 2018-04-10 13:45:16 UTC
Since services are not tied to zones, the scheduler in any zone can initiate the retirement. This is expected behavior. I'm in the process of perusing the logs.

Comment 12 Tina Fitzgerald 2018-04-10 14:47:31 UTC
Hi Gellert,

Can you let us know when the new set of logs is ready?

Thanks,
Tina

Comment 15 drew uhlmann 2018-04-11 13:02:21 UTC
I have a PR open: https://github.com/ManageIQ/manageiq/pull/17280. However, without the logs from the last run, the best I can say is that this PR may not completely solve the issue in this BZ, but it will help out-of-the-box retirement in any case.

Comment 18 Gellert Kis 2018-04-11 13:54:06 UTC
(Log files available)

Comment 19 Satoe Imaishi 2018-04-23 20:03:20 UTC
*** Bug 1568522 has been marked as a duplicate of this bug. ***

Comment 23 Sudhir Mallamprabhakara 2019-01-24 14:29:13 UTC
There is not much QE can do about this one; it will have to go out unverified.

https://bugzilla.redhat.com/show_bug.cgi?id=1570950#c5