Bug 1440972 - Service Retirements (which work correctly) result in two separate emails to service owner
Summary: Service Retirements (which work correctly) result in two separate emails to s...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Automate
Version: 5.7.0
Hardware: x86_64
OS: Linux
high
medium
Target Milestone: GA
: 5.9.0
Assignee: Tina Fitzgerald
QA Contact: Landon LaSmith
URL:
Whiteboard: service:retirement:email
Depends On:
Blocks: 1496947
TreeView+ depends on / blocked
 
Reported: 2017-04-10 21:11 UTC by Thomas Hennessy
Modified: 2021-06-10 12:10 UTC (History)
15 users (show)

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1496947 (view as bug list)
Environment:
Last Closed: 2018-04-17 16:53:06 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
spreadsheet detailing following two messages resulting in email from same retirement service (13.83 KB, application/zip)
2017-05-01 18:15 UTC, Thomas Hennessy
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1568522 0 high CLOSED Retirement can run in parallel in case of more than worker appliance 2021-06-10 15:51:58 UTC

Internal Links: 1568522

Description Thomas Hennessy 2017-04-10 21:11:19 UTC
Description of problem:Customer is creating services using standard service dialog  process but is changing the retirement date to a date several days after service isinstantionated.  Servide and associated VMs do get retired but multiple emails are output to the service owner indicating service retirement.


Version-Release number of selected component (if applicable): Version: 5.7.1.3


How reproducible:


Steps to Reproduce:
1.create a service using CFME with normal owner notification at service retirement
2. use appliance console to alter the retirement date to tomorrow
3. wai

Actual results: two emails are sent to service owner


Expected results:only one email is expected


Additional info:

I have chased down the scheduler actually created two separate messages at the time that retirement is to happen based on some service names provided by the customer.  So once we have these two messages set up, the appear to be processed in very close time synchronization and each of them results in an email being sent out.  According to the customer this happens > 50% of the time when he locally "improves" the retirement date on services.

I want to emphasize that retirement *do* happen as expected.  The BUG is that two emails are sent when only one is expected.

I have logs from multiple days showing this behavior as well as copies of the services table at different points in time.

I think this is a problem where the pre-defined warning time messages are not found and there is likely some logic error that causes two messages to be created into the miq_queue table where one is likely to be intended as the warning message and the other is the retirement message, but they both show up as retirement messages to the user.

Comment 2 Thomas Hennessy 2017-04-10 21:15:07 UTC
comment from the customer today:
I had some services retire 'naturally' this weekend and this issue still occurred. I altered my provisioning code so that no retirement_warn value was assigned to these services, only a retires_on value. It looks like we will need a bz for this.

I will create a google disk location where these logs are provided by the customer.

Comment 4 Thomas Hennessy 2017-04-11 10:38:00 UTC
diagnostic materials are loaded into google drive with following url
https://drive.google.com/drive/folders/0Bw3u5g60tmjuSHlLcmJiRFE5OVE?usp=sharing
the folder "BZ 1440972" contains the materials for this case.

Comment 5 Tina Fitzgerald 2017-04-26 20:48:21 UTC
Hi Tom,

Investigating issue and looking at retired emails for ESTE-20170403-115154 
xxricfw10_gtk_142528_tar_analyzd\contains_genericmailer_loglines:[----] I, [2017-04-05T12:15:33.268660 #25549:3d2264]  INFO -- : MIQ(GenericMailer.deliver_queue) starting: method: automation_notification args: {:to=>"jnovotni", :from=>"cfme.do.not.reply", :subject=>"Service Retirement Alert for ESTE-20170403-115154", :content_type=>nil, :body=>"Hello, <br><br>Your service named [ESTE-20170403-115154] has been retired.<br><br> Thank you,<br> Virtualization Infrastructure Team”}

xxricfw14_gtk_142545_tar_analyzd\contains_genericmailer_loglines:[----] I, [2017-04-05T12:15:28.617078 #31361:5437948]  INFO -- : MIQ(GenericMailer.deliver_queue) starting: method: automation_notification args: {:to=>"jnovotni", :from=>"cfme.do.not.reply", :subject=>"Service Retirement Alert for ESTE-20170403-115154", :content_type=>nil, :body=>"Hello, <br><br>Your service named [ESTE-20170403-115154] has been retired.<br><br> Thank you,<br> Virtualization Infrastructure Team"}

I can see where the worker 142528 starts the service retirement state machine and queues a retry:
[----] I, [2017-04-05T12:14:20.012702 #25565:3e312c]  INFO -- : MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"Service", :object_id=>10000000000190, :attrs=>{:event_type=>"request_service_retire", "Service::service"=>10000000000190, :service_id=>10000000000190, :retirement_initiator=>"system", :type=>"Service", "MiqEvent::miq_event"=>10000002337888, :miq_event_id=>10000002337888, "EventStream::event_stream"=>10000002337888, :event_stream_id=>10000002337888}, :instance_name=>"Event", :user_id=>10000000000001, :miq_group_id=>10000000000002, :tenant_id=>10000000000001, :automate_message=>nil, :state=>"CheckServiceRetired", :ae_fsm_started=>nil, :ae_state_started=>"2017-04-05 16:12:05 UTC", :ae_state_retries=>3, :ae_state_previous=>"---\n\"/IGT/Service/Retirement/StateMachines/ServiceRetirement/Default\":\n  ae_state: CheckServiceRetired\n  ae_state_retries: 3\n  ae_state_started: 2017-04-05 16:12:05 UTC\n"} for object [Service.10000000000190] with state [CheckServiceRetired] to Automate for delivery in [60] seconds

worker 145545 picks up the queue items and completes the retirement state machine, sending the email.
[----] I, [2017-04-05T12:15:20.827910 #31361:ee113c]  INFO -- : MIQ(MiqPriorityWorker::Runner#get_message_via_drb) Message id: [10000017213940], MiqWorker id: [10000000743799], Zone: [IGT RI Dev vCenter], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [dequeue], Deliver On: [2017-04-05 16:15:20 UTC], Data: [], Args: [{:object_type=>"Service", :object_id=>10000000000190, :attrs=>{:event_type=>"request_service_retire", "Service::service"=>10000000000190, :service_id=>10000000000190, :retirement_initiator=>"system", :type=>"Service", "MiqEvent::miq_event"=>10000002337888, :miq_event_id=>10000002337888, "EventStream::event_stream"=>10000002337888, :event_stream_id=>10000002337888}, :instance_name=>"Event", :user_id=>10000000000001, :miq_group_id=>10000000000002, :tenant_id=>10000000000001, :automate_message=>nil, :state=>"CheckServiceRetired", :ae_fsm_started=>nil, :ae_state_started=>"2017-04-05 16:12:05 UTC", :ae_state_retries=>3, :ae_state_previous=>"---\n\"/IGT/Service/Retirement/StateMachines/ServiceRetirement/Default\":\n  ae_state: CheckServiceRetired\n  ae_state_retries: 3\n  ae_state_started: 2017-04-05 16:12:05 UTC\n"}], Dequeued in: [207.957307195] seconds 

within the same time period, worker 142528 gets a queue item and also continues the retirement state machine, sending the email as well.

----] I, [2017-04-05T12:15:24.335532 #25549:3e312c]  INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [10000017213943], MiqWorker id: [10000000743783], Zone: [IGT RI Dev vCenter], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [100], State: [dequeue], Deliver On: [2017-04-05 16:15:23 UTC], Data: [], Args: [{:object_type=>"Service", :object_id=>10000000000190, :attrs=>{:event_type=>"request_service_retire", "Service::service"=>10000000000190, :service_id=>10000000000190, :retirement_initiator=>"system", :type=>"Service", "MiqEvent::miq_event"=>10000002337892, :miq_event_id=>10000002337892, "EventStream::event_stream"=>10000002337892, :event_stream_id=>10000002337892}, :instance_name=>"Event", :user_id=>10000000000001, :miq_group_id=>10000000000002, :tenant_id=>10000000000001, :automate_message=>nil, :state=>"CheckServiceRetired", :ae_fsm_started=>nil, :ae_state_started=>"2017-04-05 16:12:08 UTC", :ae_state_retries=>3, :ae_state_previous=>"---\n\"/IGT/Service/Retirement/StateMachines/ServiceRetirement/Default\":\n  ae_state: CheckServiceRetired\n  ae_state_retries: 3\n  ae_state_started: 2017-04-05 16:12:08 UTC\n"}], Dequeued in: [192.899790396] seconds

I searched the logs and can't find where Message ID: 10000017213943 is put on the queue.

Can you see if you can find the message id: 10000017213943 in the provided logs?

Thanks,
Tina

Comment 6 Thomas Hennessy 2017-05-01 18:13:41 UTC
Tina,
As we discussed, I discovered that the appliance which created the message above was not provided by the customer,
but I do have an alternative group of log ies for a different day, and a tracking that I hopw will save you lots and lots of time.

I created a spreadsheet that I used to track two different messages that were associated with the two retirement messages that the customer reported.  I am uploading that as an attachment to the case.  I hope that you have openoffice or libreoffice and can use that to open the following file:

Tracking origins of two separate email messages.xlsb

the logs I used to generate the above are from March 31 but we have all of the appliance logs so this might be more successful.

the directory containing the above is in the original location identified in comment 4 above as is the spreadsheet.  you can use the following url

https://drive.google.com/drive/folders/0Bw3u5g60tmjuTklkRWxpT1V6S2c?usp=sharing

Comment 7 Thomas Hennessy 2017-05-01 18:15:37 UTC
Created attachment 1275460 [details]
spreadsheet detailing following two messages resulting in email from same retirement service

Comment 8 Tina Fitzgerald 2017-05-03 15:32:43 UTC
Hi Tom,

Thanks so much for the analysis!
It would have taken me a long time to sift through all of the logs looking for the relevant information.

This is an interesting issue in that workers in two zones are running the service retirement check within seconds of each other.

The timing is such that both workers raise the request_service_retire event on the same Service within seconds of each other.
(2017-03-31T13:32:51.483866 vs. 2017-03-31T13:32:53.911381)

And enter the Service retirement state machine within seconds of each other.
  
Normally, the Service retirement state machine will prevent a second state machine from processing the same Service, but the timing in this case is allowing both processes to continue.

The service retirement check is performed daily and I suspect the two workers are running the retirement check around the same time because they were started at the same time.  If we could possibly stagger the start times of the two appliances, it would probably alleviate the email issue while we're working on a fix.

Please let know if you have any questions.  

Thanks,
Tina

Comment 9 Thomas Hennessy 2017-05-09 13:22:52 UTC
Thanks Tina

Comment 13 Tina Fitzgerald 2017-09-05 19:42:45 UTC
Hi Marianne,

Sorry for the delayed response.  I started working on a solution to this issue, but have not gotten back to it in the last few weeks.  

The solution is to lock the object while we update the retirement_state as "queued" when the retirement event is raised. 

I'll move back to assigned since I'm not actively working on it. 

Thanks,
Tina

Comment 14 Jarryd Novotni 2017-09-19 13:15:41 UTC
I'm not sure if this helps any, but this was my original observation about the issue that I put on the case... "There is a check in Service/Retirement/StateMachines/Methods/start_retirement to see if the service is retiring, 'if service.retiring?', but when this check fails, the worker does an MIQ_ABORT, which results in the service.retirement_state eventually being set to 'error', which results in service.retiring? returning false. So when another worker comes along, it's able to execute that state machine."

Comment 15 Tina Fitzgerald 2017-09-26 20:51:47 UTC
Hi Jarryd,

Thanks so much for your observation! The issue turned out to be that we have a typo in the update_service_retirement_status Automate method. The method doesn't update the retirement_state if the step is start retirement, but the typo resulted in the retirement_state getting set every time.
Thanks again for catching that.

Regards,
Tina

Comment 16 Tina Fitzgerald 2017-09-26 20:56:10 UTC
https://github.com/ManageIQ/manageiq-content/pull/189

commit 0ccaec8c32156314fc7a19a4df21c2cb3d87ed9f
Author: william fitzgerald <wfitzger>
Date:   Fri Sep 22 12:43:47 2017 -0400

    Fixed typo in  Service Retirement method.
    
    update_service_retirement_status method had incorrect step name.
    
    This caused a problem when trying to retire a service that is already in the process of being retired.

diff --git a/content/automate/ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb b/content/automate/ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.cla
index 831d9bc..5aac6b0 100644
--- a/content/automate/ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb
+++ b/content/automate/ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb
@@ -32,7 +32,7 @@ updated_message += "Current Retry Number [#{$evm.root['ae_state_retries']}]" if
 # Update Status for on_error for all states other than the first state which is start retirement
 # in the retirement state machine.
 if $evm.root['ae_result'] == 'error'
-  if step.downcase == 'start retirement'
+  if step.downcase == 'startretirement'
     msg = 'Cannot continue because Service is '
     msg += service ? "#{service.retirement_state}." : 'nil.'
     $evm.log('info', msg)

Comment 18 Landon LaSmith 2018-01-26 18:48:06 UTC
VERIFIED. This fix is present in 5.9.0.17


Note You need to log in before you can comment on or make changes to this bug.