Bug 1563627 - Service and VM retirement are non-deterministic, running parallel
Summary: Service and VM retirement are non-deterministic, running parallel
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Automate
Version: 5.8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.10.0
Assignee: drew uhlmann
QA Contact: Niyaz Akhtar Ansari
URL:
Whiteboard:
Duplicates: 1568522
Depends On:
Blocks: 1570950 1570951
Reported: 2018-04-04 10:58 UTC by Gellert Kis
Modified: 2019-02-26 07:14 UTC (History)

Fixed In Version: 5.10.0.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1570950 1570951
Environment:
all CFME (bug should be cloned)
Last Closed: 2019-01-24 14:29:13 UTC
Category: Bug
Cloudforms Team: Unknown
Target Upstream Version:


Comment 2 drew uhlmann 2018-04-04 14:27:27 UTC
Could you please ask the customer to supply logs for the scenario for when both retirements process concurrently?

Comment 5 drew uhlmann 2018-04-05 20:14:24 UTC
We changed update_service_retirement_status.rb, which lives at ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb. Could the customer please check that line 35 of that file is correct? We fixed a typo there in https://github.com/ManageIQ/manageiq-content/pull/189/files and the line should read ```if step.downcase == 'startretirement'```.
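For anyone checking that line by hand, here is a minimal sketch of why the exact spelling matters. The helper name `start_retirement_step?` is made up for illustration and is not part of manageiq-content; only the comparison itself comes from the line quoted above. Because the step name is lowercased before the comparison, the literal on the right must be all lowercase and spelled exactly, otherwise the branch can never be taken.

```ruby
# Hypothetical helper (not the shipped method): it isolates the comparison
# quoted above so its behavior can be checked on its own.
def start_retirement_step?(step)
  step.to_s.downcase == 'startretirement'
end

puts start_retirement_step?('StartRetirement')  # true: case is ignored
puts start_retirement_step?('startretirement')  # true
puts start_retirement_step?('StartRetirment')   # false: a misspelling never matches
```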

Comment 8 drew uhlmann 2018-04-06 16:44:22 UTC
I see that custom code is in use for the retirement. The start_retirement method does not include the update_service_retirement_status method, and the custom method bumpretirementdate runs immediately after start_retirement and thus clears the retirement_state. Could the customer please disable all of the custom code, run the ManageIQ Service retirement state machine, and, if two Service retirement state machines still run for the same Service, resend the logs showing that?
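A rough sketch of the effect described here, under stated assumptions: `ServiceStub` and `request_retirement` are hypothetical stand-ins, not ManageIQ code, and `retiring?` is assumed to simply test whether retirement_state is 'retiring'. It shows how a step that clears retirement_state right after start_retirement lets a second request pass the guard.

```ruby
# Hypothetical stand-in for a service record; only retirement_state is modeled.
ServiceStub = Struct.new(:retirement_state) do
  def retiring?
    retirement_state == 'retiring'
  end

  def start_retirement
    self.retirement_state = 'retiring'
  end
end

# Guard as described in this thread: do not start if already retiring.
def request_retirement(service)
  return :skipped if service.retiring?

  service.start_retirement
  :started
end

service = ServiceStub.new(nil)
first = request_retirement(service)   # state becomes 'retiring'
service.retirement_state = nil        # what a state-clearing custom step would do
second = request_retirement(service)  # guard no longer sees 'retiring'

puts [first, second].inspect          # [:started, :started]
```

Without the line that resets retirement_state, the second call returns :skipped.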

Comment 9 ITD27M01 2018-04-09 12:29:29 UTC
Hi drew uhlmann

I will create the logs for you, but could you first analyze the following:

In an environment with multiple worker appliances that have the automation role enabled, the request_service_retire event can occur on any of them.


Assume the following scenario:

1. [1] The first appliance checks the service retirement state with the service.retiring? and service.retired? methods.

2. [1] The first appliance succeeds: it calls service.start_retirement, which sets the retirement state of the service, and initiates the retirement state machine.

3. [2] The second appliance checks the service retirement state with the service.retiring? and service.retired? methods.

4. [2] The second appliance fails to start because the service retirement_state is retiring.



But in our universe another sequence is possible:

1. [1] The first appliance checks the service retirement state with the service.retiring? and service.retired? methods.

2. [2] The second appliance checks the service retirement state with the service.retiring? and service.retired? methods.

3. [1] The first appliance succeeds: it calls service.start_retirement, which sets the retirement state of the service, and initiates the retirement state machine.

4. [2] The second appliance also succeeds: it calls service.start_retirement, which sets the retirement state of the service, and initiates a second retirement state machine.



This is a classic race condition. Race conditions have a reputation for being difficult to reproduce and debug, since the end result is nondeterministic and depends on the relative timing between interfering threads. Problems occurring in production systems can therefore disappear when running in debug mode, when additional logging is added, or when a debugger is attached; such a bug is often referred to as a "Heisenbug". It is therefore better to avoid race conditions by careful software design than to attempt to fix them afterwards.
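The two sequences above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the ManageIQ implementation and not necessarily what any fix does: `RetirableService` and `try_start_retirement` are hypothetical names, and the in-process Mutex only stands in for whatever atomic check-and-set the shared database would have to provide, since a mutex does not span appliances.

```ruby
# Hypothetical in-memory model of a service with a retirement state.
class RetirableService
  def initialize
    @retirement_state = nil
    @lock = Mutex.new
  end

  def retiring?
    @retirement_state == 'retiring'
  end

  def start_retirement
    @retirement_state = 'retiring'
  end

  # Check and set as one step; returns true only for the caller that wins.
  def try_start_retirement
    @lock.synchronize do
      return false if retiring?

      start_retirement
      true
    end
  end
end

# Second sequence, replayed deterministically: both appliances check
# before either one sets the state, so both go on to start.
racy = RetirableService.new
ok1 = !racy.retiring?
ok2 = !racy.retiring?
racy_starts = [ok1, ok2].count { |ok| ok && racy.start_retirement }

# Atomic variant under real thread contention: exactly one caller wins.
safe = RetirableService.new
threads = Array.new(8) { Thread.new { safe.try_start_retirement } }
safe_starts = threads.map(&:value).count(true)

puts "racy state machines started: #{racy_starts}"  # 2
puts "safe state machines started: #{safe_starts}"  # 1
```

The point of the design is that the check and the state change cannot be separated by another caller; where that guarantee lives (database, lock service, or elsewhere) is left open here.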

Comment 11 drew uhlmann 2018-04-10 13:45:16 UTC
Since services are not tied to zones, the scheduler in any zone can initiate the retirement. This is expected behavior. I'm in the process of perusing the logs.

Comment 12 Tina Fitzgerald 2018-04-10 14:47:31 UTC
Hi Gellert,

Can you let us know when the new set of logs is ready?

Thanks,
Tina

Comment 15 drew uhlmann 2018-04-11 13:02:21 UTC
I have a PR open: https://github.com/ManageIQ/manageiq/pull/17280. Without the logs from the last run, however, the most I can say is that this PR may not completely solve the issue in this BZ, but it will help the out-of-the-box retirement anyway.

Comment 18 Gellert Kis 2018-04-11 13:54:06 UTC
(Log files available)

Comment 19 Satoe Imaishi 2018-04-23 20:03:20 UTC
*** Bug 1568522 has been marked as a duplicate of this bug. ***

Comment 23 Sudhir Mallamprabhakara 2019-01-24 14:29:13 UTC
not much QE can do about this one; it will have to go out unverified. 

https://bugzilla.redhat.com/show_bug.cgi?id=1570950#c5

