Bug 1563627
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Service and VM retirement are non-deterministic, running parallel | | |
| Product: | Red Hat CloudForms Management Engine | Reporter: | Gellert Kis <gekis> |
| Component: | Automate | Assignee: | drew uhlmann <duhlmann> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Niyaz Akhtar Ansari <nansari> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.8.0 | CC: | cpelland, duhlmann, gekis, igortiunov, mkanoor, mshriver, obarenbo, smallamp, tfitzger |
| Target Milestone: | GA | Keywords: | TestOnly, ZStream |
| Target Release: | 5.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 5.10.0.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | : | 1570950 1570951 (view as bug list) |
| Environment: | all CFME (bug should be cloned) | | |
| Last Closed: | 2019-01-24 14:29:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | Bug |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | Unknown | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1570950, 1570951 | | |
Comment 2
drew uhlmann
2018-04-04 14:27:27 UTC
We had a change to update_service_retirement_status.rb, which lives at ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb:35. Could the customer please check that that line is correct? We fixed a typo here: https://github.com/ManageIQ/manageiq-content/pull/189/files, and that line should read `if step.downcase == 'startretirement'`.

I am also seeing that custom code is in use for the retirement. The start_retirement method does not call the update_service_retirement_status method, and the custom method bumpretirementdate runs immediately after start_retirement and thus clears the retirement_state. Will the customer please disable all of the custom code, run the stock ManageIQ Service retirement state machine, and, if two Service retirement state machines still run for the same Service, resend the logs showing it?

---

Hi drew uhlmann,

I will create the logs for you, but before that, could you analyze the following? In an environment with multiple worker appliances that have the automation role enabled, the request_service_retire event can occur on any appliance. Assume the following scenario:

1. First appliance checks the service retirement state via the service.retiring? and service.retired? methods.
2. First appliance succeeds in starting retirement: it sets the retirement state for the service by calling service.start_retirement and initiates the retirement state machine.
3. Second appliance checks the service retirement state via service.retiring? and service.retired?.
4. Second appliance fails to start because the service's retirement_state is already "retiring".

But in our universe there can be another sequence:

1. First appliance checks the service retirement state via service.retiring? and service.retired?.
2. Second appliance checks the service retirement state via service.retiring? and service.retired?.
3. First appliance succeeds in starting retirement: it sets the retirement state by calling service.start_retirement and initiates the retirement state machine.
4. Second appliance also succeeds in starting retirement: it likewise sets the retirement state by calling service.start_retirement and initiates a second retirement state machine.

This is a classical "race condition" problem. Race conditions have a reputation for being difficult to reproduce and debug, since the end result is nondeterministic and depends on the relative timing of the interfering threads. Problems occurring in production systems can therefore disappear when running in debug mode, when additional logging is added, or when a debugger is attached (a so-called "Heisenbug"). It is therefore better to avoid race conditions through careful software design than to attempt to fix them afterwards.

---

Since services are not tied to zones, the scheduler in any zone can initiate the retirement. This is expected behavior. I'm in the process of perusing the logs.

---

Hi Gellert,

Can you let us know when the new set of logs is ready?

Thanks,
Tina

---

Have PR open: https://github.com/ManageIQ/manageiq/pull/17280. However, without the logs from the last run, I can at best say that this PR may not completely solve the issue in this BZ, but it will help out-of-the-box retirement anyway.

---

(Log files available)

---

*** Bug 1568522 has been marked as a duplicate of this bug. ***

---

Not much QE can do about this one; it will have to go out unverified.

https://bugzilla.redhat.com/show_bug.cgi?id=1570950#c5
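The two sequences above describe a check-then-act window: both appliances pass the `retiring?`/`retired?` check before either one writes `retirement_state`. A minimal Ruby sketch of the problem and of an atomic test-and-set fix is shown below; `FakeService` and its methods are illustrative stand-ins, not ManageIQ code, and the actual fix in the linked PR may be implemented differently (e.g. at the database level):

```ruby
require 'thread'

# Stand-in for a service record; retirement_state is nil until retirement starts.
class FakeService
  attr_reader :retirement_state

  def initialize
    @retirement_state = nil
    @lock = Mutex.new
  end

  # Check-then-act (racy): two callers can both observe nil and both "win",
  # which is exactly the second sequence described in the comment above.
  def start_retirement_racy
    return false unless @retirement_state.nil?   # check
    sleep 0.01                                   # widen the race window
    @retirement_state = 'retiring'               # act
    true
  end

  # Atomic test-and-set: the check and the write happen under one lock,
  # so exactly one caller starts the retirement state machine.
  def start_retirement_atomic
    @lock.synchronize do
      return false unless @retirement_state.nil?
      @retirement_state = 'retiring'
      true
    end
  end
end

# With the atomic version, exactly one of the concurrent callers succeeds.
service = FakeService.new
results = 2.times.map { Thread.new { service.start_retirement_atomic } }.map(&:value)
puts results.count(true)   # => 1
```

For processes on separate appliances an in-memory `Mutex` is not enough; the same idea would need a shared arbiter, such as a conditional `UPDATE ... WHERE retirement_state IS NULL` against the database, where the affected-row count tells each appliance whether it won.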