Bug 1748353 - multiple workers start the same retirement when retirement date is reached
Summary: multiple workers start the same retirement when retirement date is reached
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Automate
Version: 5.10.8
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: GA
Target Release: 5.12.0
Assignee: Tina Fitzgerald
QA Contact: Devidas Gaikwad
Docs Contact: Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks: 1764197 1767824
Reported: 2019-09-03 12:22 UTC by Felix Dewaleyne
Modified: 2023-03-24 15:22 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1764197 1767824
Environment:
Last Closed: 2020-06-10 12:33:21 UTC
Category: Bug
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:



Description Felix Dewaleyne 2019-09-03 12:22:00 UTC
Description of problem:
When the retirement date of the customer's service is reached, several workers attempt to retire it, leading to multiple retirement confirmations at the end.
This doesn't happen if one uses the "retire now" button.

Version-Release number of selected component (if applicable):
5.10.7

How reproducible:
customer environment

Steps to Reproduce:
1. Create an email retirement method and add it to Automate
2. Provision a service with a retirement date
3. Reach the retirement date
4. Watch the logs

Actual results:
The retirement is started multiple times:

[----] I, [2019-08-29T13:11:29.511158 #25844:8daf54]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) <AEMethod [/PLUSDev/Service/Retirement/StateMachines/Methods/check_service_retired]> Starting 
[----] I, [2019-08-29T13:11:29.831179 #25849:c9eaa0]  INFO -- : Validating Notification type: vm_retired
[----] I, [2019-08-29T13:11:29.832566 #25849:c9eaa0]  INFO -- : Calling Create Notification type: vm_retired subject type: VmOrTemplate id: 1000000169078 options: {}
[----] I, [2019-08-29T13:11:29.968993 #25844:bfd4c0]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) <AEMethod check_service_retired> Checking if all service tasks have been retired.
[----] I, [2019-08-29T13:11:29.972948 #25844:bfd4c0]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) <AEMethod check_service_retired> Service RetireCheck with <retry> for state <active> and status <Ok>
[----] I, [2019-08-29T13:11:29.985149 #25844:bfd4c0]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) <AEMethod check_service_retired> Service task Service Retire for: pwstest2.sbg.ac.at is not retired, setting retry.
[----] I, [2019-08-29T13:11:30.009834 #25844:8daf54]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) <AEMethod [/PLUSDev/Service/Retirement/StateMachines/Methods/check_service_retired]> Ending
[----] I, [2019-08-29T13:11:30.009944 #25844:8daf54]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) Method exited with rc=MIQ_OK
[----] I, [2019-08-29T13:11:30.010237 #25844:8daf54]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) Followed  Relationship [miqaedb:/Service/Retirement/StateMachines/Methods/CheckServiceRetired#create]
[----] I, [2019-08-29T13:11:30.010316 #25844:8daf54]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) Processed State=[CheckServiceRetired] with Result=[retry]
[----] I, [2019-08-29T13:11:30.010571 #25844:8daf54]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) In State=[CheckServiceRetired], invoking [on_exit] method=[update_service_retirement_status(status => 'Checked Service retirement')]
[----] I, [2019-08-29T13:11:30.274676 #25844:8daf54]  INFO -- : Q-task_id([r1000000003670_service_retire_task_1000000007792]) Updated namespace [Service/Retirement/StateMachines/ServiceRetirement/update_service_retirement_status  ManageIQ/Service/Retirement/StateMachines]

Expected results:
The retirement is not run multiple times concurrently.

Additional info:
This does not happen when using "retire now".
Heavy automation is in use, but I still cannot explain why several workers were started, and only when the retirement date is reached rather than with "retire now".

Comment 4 Tina Fitzgerald 2019-09-03 17:39:02 UTC
Retirement as a Request (RaaR) was introduced in 5.10. Part of the RaaR work included creating a new method to initiate the retirement process. The new method was created because we wanted to keep the retire_now method for backward compatibility. Retirement should work properly either way, depending on the Automate model in use. Most of the details of these changes aren't significant to the reported issue; what is important is the timing of entering the retirement state machine.

retire_now is event based: a request_*_event is raised when the method is called, and the event processing includes a call to the retirement state machine. The object's retirement_state is set to retiring in the first step of the state machine. Once retirement_state == retiring, the scheduler will not try to retire the object again because it considers it already retiring.

Retirement as a request has a very different workflow. A retire request is created when the method is called, the request has to be approved, and then tasks are created to process each of the retireable objects. The tasks then enter the retirement state machine.
The object's retirement_state is set to retiring in the first step of the state machine. At this point, the scheduler will not try to retire the object again because it considers it already retiring.
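
To illustrate why the timing matters, here is a minimal Python sketch of the guard described above. It is not ManageIQ code: the Service and Scheduler classes, the check and start_state_machine names, and the retirement_due flag are all hypothetical. It only assumes what is stated above, namely that the scheduler skips an object once retirement_state is retiring, and that the state is set in the first step of the state machine.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Service:
    """Hypothetical stand-in for a retireable object."""
    name: str
    retirement_due: bool = True
    retirement_state: Optional[str] = None  # unset until the state machine starts


@dataclass
class Scheduler:
    """Hypothetical scheduler that records one retire request per eligible check."""
    requests: List[str] = field(default_factory=list)

    def check(self, service: Service) -> bool:
        # Skip anything the state machine has already marked as retiring.
        if not service.retirement_due or service.retirement_state == "retiring":
            return False
        self.requests.append(service.name)
        return True


def start_state_machine(service: Service) -> None:
    # First step of the (simulated) retirement state machine.
    service.retirement_state = "retiring"


service = Service("example-service")
scheduler = Scheduler()

# Two checks run before any task has entered the state machine.
scheduler.check(service)
scheduler.check(service)

start_state_machine(service)

# A later check is skipped because retirement_state is now "retiring".
scheduler.check(service)

print(len(scheduler.requests))  # 2
```

In this simplified model, every check that runs before the first state machine step records another request, which is the window the request-based workflow leaves open.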

Service retirement has an additional challenge in that Services are not zone based. Retirement of VMs is queued to the VM's zone, but Services are queued without a zone, which means that service retirement will run anywhere a worker is available to pick up the work.
 
We recently made a change to have Service retirement checked at the region level instead of the zone level. Previously, if you had a region with 3 zones, you could have 3 retirement requests for the same service. With this change, you will only have 1 retirement request for the service.
  
https://github.com/ManageIQ/manageiq/pull/19143
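
The difference between zone-level and region-level checking can be sketched as follows. This is a simplified Python model, not the implementation in the PR above: the zone names, the service name, and both function names are hypothetical, and it assumes only that a zoneless service is visible to every zone's check.

```python
# Hypothetical region with three zones; names are illustrative only.
ZONES = ["zone_a", "zone_b", "zone_c"]
DUE_SERVICES = ["example-service"]  # services are not tied to any zone


def requests_per_zone_check(zones, services):
    """Each zone runs its own check, so every zone sees the zoneless service."""
    return [(zone, svc) for zone in zones for svc in services]


def requests_per_region_check(services):
    """A single region-wide check sees each service once."""
    return [("region", svc) for svc in services]


print(len(requests_per_zone_check(ZONES, DUE_SERVICES)))  # 3
print(len(requests_per_region_check(DUE_SERVICES)))       # 1
```

With three zones, the per-zone model yields three requests for the one service, while the region-wide model yields one, matching the numbers given above.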

Comment 6 Tina Fitzgerald 2019-09-06 20:42:37 UTC
Hi Felix,

The target release was just changed to 5.11.1.

Thanks,
Tina

Comment 7 Tina Fitzgerald 2019-10-08 17:08:35 UTC
Hi Dennis,

The PR referenced in comment 4 will resolve this issue. The PR has been merged, but is not tagged for backporting to Hammer. Should we add the hammer label and change the target release?

Please advise.

Thanks,
Tina

Comment 8 dmetzger 2019-10-22 12:52:31 UTC
Updating this for inclusion in 5.10.12

