Description of problem:
The customer is creating services using the standard service dialog process but is changing the retirement date to a date several days after the service is instantiated. The service and associated VMs do get retired, but multiple emails are sent to the service owner indicating service retirement.

Version-Release number of selected component (if applicable):
Version: 5.7.1.3

How reproducible:

Steps to Reproduce:
1. Create a service using CFME with normal owner notification at service retirement.
2. Use the appliance console to alter the retirement date to tomorrow.
3. Wait.

Actual results:
Two emails are sent to the service owner.

Expected results:
Only one email is expected.

Additional info:
I have chased down that the scheduler actually created two separate messages at the time that retirement is to happen, based on some service names provided by the customer. Once these two messages are set up, they appear to be processed in very close time synchronization, and each of them results in an email being sent out. According to the customer this happens > 50% of the time when he locally "improves" the retirement date on services. I want to emphasize that retirements *do* happen as expected. The BUG is that two emails are sent when only one is expected. I have logs from multiple days showing this behavior, as well as copies of the services table at different points in time. I think this is a problem where the pre-defined warning time messages are not found, and there is likely some logic error that causes two messages to be created in the miq_queue table, where one is likely intended as the warning message and the other as the retirement message, but they both show up as retirement messages to the user.
Comment from the customer today: "I had some services retire 'naturally' this weekend and this issue still occurred. I altered my provisioning code so that no retirement_warn value was assigned to these services, only a retires_on value." It looks like we will need a BZ for this. I will create a Google Drive location for the logs provided by the customer.
Diagnostic materials are loaded into Google Drive at the following URL: https://drive.google.com/drive/folders/0Bw3u5g60tmjuSHlLcmJiRFE5OVE?usp=sharing. The folder "BZ 1440972" contains the materials for this case.
Hi Tom,

Investigating the issue and looking at the retirement emails for ESTE-20170403-115154:

xxricfw10_gtk_142528_tar_analyzd\contains_genericmailer_loglines:
[----] I, [2017-04-05T12:15:33.268660 #25549:3d2264] INFO -- : MIQ(GenericMailer.deliver_queue) starting: method: automation_notification args: {:to=>"jnovotni", :from=>"cfme.do.not.reply", :subject=>"Service Retirement Alert for ESTE-20170403-115154", :content_type=>nil, :body=>"Hello, <br><br>Your service named [ESTE-20170403-115154] has been retired.<br><br> Thank you,<br> Virtualization Infrastructure Team"}

xxricfw14_gtk_142545_tar_analyzd\contains_genericmailer_loglines:
[----] I, [2017-04-05T12:15:28.617078 #31361:5437948] INFO -- : MIQ(GenericMailer.deliver_queue) starting: method: automation_notification args: {:to=>"jnovotni", :from=>"cfme.do.not.reply", :subject=>"Service Retirement Alert for ESTE-20170403-115154", :content_type=>nil, :body=>"Hello, <br><br>Your service named [ESTE-20170403-115154] has been retired.<br><br> Thank you,<br> Virtualization Infrastructure Team"}

I can see where worker 142528 starts the service retirement state machine and queues a retry:

[----] I, [2017-04-05T12:14:20.012702 #25565:3e312c] INFO -- : MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"Service", :object_id=>10000000000190, :attrs=>{:event_type=>"request_service_retire", "Service::service"=>10000000000190, :service_id=>10000000000190, :retirement_initiator=>"system", :type=>"Service", "MiqEvent::miq_event"=>10000002337888, :miq_event_id=>10000002337888, "EventStream::event_stream"=>10000002337888, :event_stream_id=>10000002337888}, :instance_name=>"Event", :user_id=>10000000000001, :miq_group_id=>10000000000002, :tenant_id=>10000000000001, :automate_message=>nil, :state=>"CheckServiceRetired", :ae_fsm_started=>nil, :ae_state_started=>"2017-04-05 16:12:05 UTC", :ae_state_retries=>3, :ae_state_previous=>"---\n\"/IGT/Service/Retirement/StateMachines/ServiceRetirement/Default\":\n ae_state: CheckServiceRetired\n ae_state_retries: 3\n ae_state_started: 2017-04-05 16:12:05 UTC\n"} for object [Service.10000000000190] with state [CheckServiceRetired] to Automate for delivery in [60] seconds

Worker 142545 picks up the queue item and completes the retirement state machine, sending the email:

[----] I, [2017-04-05T12:15:20.827910 #31361:ee113c] INFO -- : MIQ(MiqPriorityWorker::Runner#get_message_via_drb) Message id: [10000017213940], MiqWorker id: [10000000743799], Zone: [IGT RI Dev vCenter], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [dequeue], Deliver On: [2017-04-05 16:15:20 UTC], Data: [], Args: [{:object_type=>"Service", :object_id=>10000000000190, :attrs=>{:event_type=>"request_service_retire", "Service::service"=>10000000000190, :service_id=>10000000000190, :retirement_initiator=>"system", :type=>"Service", "MiqEvent::miq_event"=>10000002337888, :miq_event_id=>10000002337888, "EventStream::event_stream"=>10000002337888, :event_stream_id=>10000002337888}, :instance_name=>"Event", :user_id=>10000000000001, :miq_group_id=>10000000000002, :tenant_id=>10000000000001, :automate_message=>nil, :state=>"CheckServiceRetired", :ae_fsm_started=>nil, :ae_state_started=>"2017-04-05 16:12:05 UTC", :ae_state_retries=>3, :ae_state_previous=>"---\n\"/IGT/Service/Retirement/StateMachines/ServiceRetirement/Default\":\n ae_state: CheckServiceRetired\n ae_state_retries: 3\n ae_state_started: 2017-04-05 16:12:05 UTC\n"}], Dequeued in: [207.957307195] seconds

Within the same time period, worker 142528 gets a queue item and also continues the retirement state machine, sending the email as well.
[----] I, [2017-04-05T12:15:24.335532 #25549:3e312c] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [10000017213943], MiqWorker id: [10000000743783], Zone: [IGT RI Dev vCenter], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [100], State: [dequeue], Deliver On: [2017-04-05 16:15:23 UTC], Data: [], Args: [{:object_type=>"Service", :object_id=>10000000000190, :attrs=>{:event_type=>"request_service_retire", "Service::service"=>10000000000190, :service_id=>10000000000190, :retirement_initiator=>"system", :type=>"Service", "MiqEvent::miq_event"=>10000002337892, :miq_event_id=>10000002337892, "EventStream::event_stream"=>10000002337892, :event_stream_id=>10000002337892}, :instance_name=>"Event", :user_id=>10000000000001, :miq_group_id=>10000000000002, :tenant_id=>10000000000001, :automate_message=>nil, :state=>"CheckServiceRetired", :ae_fsm_started=>nil, :ae_state_started=>"2017-04-05 16:12:08 UTC", :ae_state_retries=>3, :ae_state_previous=>"---\n\"/IGT/Service/Retirement/StateMachines/ServiceRetirement/Default\":\n ae_state: CheckServiceRetired\n ae_state_retries: 3\n ae_state_started: 2017-04-05 16:12:08 UTC\n"}], Dequeued in: [192.899790396] seconds

I searched the logs and can't find where Message id: [10000017213943] is put on the queue. Can you see if you can find message id 10000017213943 in the provided logs?

Thanks,
Tina
Tina,

As we discussed, I discovered that logs from the appliance which created the message above were not provided by the customer, but I do have an alternative group of log files for a different day, and a tracking spreadsheet that I hope will save you lots and lots of time. I created the spreadsheet to track the two different messages that were associated with the two retirement emails that the customer reported. I am uploading it as an attachment to the case. I hope that you have OpenOffice or LibreOffice and can use that to open the following file: Tracking origins of two separate email messages.xlsb

The logs I used to generate the above are from March 31, but we have all of the appliance logs, so this might be more successful. The directory containing the above is in the original location identified in comment 4 above, as is the spreadsheet. You can use the following URL: https://drive.google.com/drive/folders/0Bw3u5g60tmjuTklkRWxpT1V6S2c?usp=sharing
Created attachment 1275460 [details] Spreadsheet detailing the two messages that resulted in emails from the same service retirement
Hi Tom,

Thanks so much for the analysis! It would have taken me a long time to sift through all of the logs looking for the relevant information.

This is an interesting issue in that workers in two zones are running the service retirement check within seconds of each other. The timing is such that both workers raise the request_service_retire event on the same Service within seconds of each other (2017-03-31T13:32:51.483866 vs. 2017-03-31T13:32:53.911381) and enter the Service retirement state machine within seconds of each other. Normally, the Service retirement state machine will prevent a second state machine from processing the same Service, but the timing in this case is allowing both processes to continue.

The service retirement check is performed daily, and I suspect the two workers are running the retirement check around the same time because they were started at the same time. If we could possibly stagger the start times of the two appliances, it would probably alleviate the email issue while we're working on a fix.

Please let me know if you have any questions.

Thanks,
Tina
Thanks Tina
Hi Marianne,

Sorry for the delayed response. I started working on a solution to this issue but have not gotten back to it in the last few weeks. The solution is to lock the object while we update the retirement_state to "queued" when the retirement event is raised. I'll move this back to assigned since I'm not actively working on it.

Thanks,
Tina
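A minimal sketch of that locking idea, using an in-memory stand-in. In the real fix this would be a database row lock on the Service record rather than a Mutex, and the class and method names here (FakeService, claim_retirement!) are hypothetical, not ManageIQ API:

```ruby
require "thread"

# Hypothetical in-memory stand-in for a Service record. The lock simulates
# holding a row lock while retirement_state is checked and updated.
class FakeService
  attr_reader :retirement_state

  def initialize
    @retirement_state = nil
    @lock = Mutex.new
  end

  # The first caller marks the service "queued" and wins; later callers see
  # the state under the same lock and back off, so only one retirement
  # event is raised for the service.
  def claim_retirement!
    @lock.synchronize do
      return false if @retirement_state == "queued"
      @retirement_state = "queued"
      true
    end
  end
end

service = FakeService.new
# Two "workers" race to retire the same service; exactly one wins the claim.
winners = 2.times.map { Thread.new { service.claim_retirement! } }.map(&:value)
```

With the check and the update performed atomically, the second worker never raises a duplicate request_service_retire event, so only one email is sent.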
I'm not sure if this helps any, but this was my original observation about the issue that I put on the case... "There is a check in Service/Retirement/StateMachines/Methods/start_retirement to see if the service is retiring, 'if service.retiring?', but when this check fails, the worker does an MIQ_ABORT, which results in the service.retirement_state eventually being set to 'error', which results in service.retiring? returning false. So when another worker comes along, it's able to execute that state machine."
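To illustrate the failure mode described above in a sketch (FakeService and its states are a simplified stand-in; only the retiring? guard mirrors the check in start_retirement):

```ruby
# Simplified stand-in showing how an "error" retirement_state defeats the
# service.retiring? guard that is supposed to block a second worker.
class FakeService
  attr_accessor :retirement_state

  def retiring?
    retirement_state == "retiring"
  end
end

service = FakeService.new

service.retirement_state = "retiring" # first worker enters the state machine
service.retiring?                     # true: the guard would block a second worker

service.retirement_state = "error"    # MIQ_ABORT path eventually sets "error"
service.retiring?                     # false: a second worker now passes the guard
```

Once the aborting worker's error handling clears the "retiring" state, the guard no longer holds, and the racing worker can run the full retirement state machine a second time.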
Hi Jarryd, Thanks so much for your observation! The issue turned out to be that we have a typo in the update_service_retirement_status Automate method. The method doesn't update the retirement_state if the step is start retirement, but the typo resulted in the retirement_state getting set every time. Thanks again for catching that. Regards, Tina
https://github.com/ManageIQ/manageiq-content/pull/189

commit 0ccaec8c32156314fc7a19a4df21c2cb3d87ed9f
Author: william fitzgerald <wfitzger>
Date:   Fri Sep 22 12:43:47 2017 -0400

    Fixed typo in Service Retirement method.

    update_service_retirement_status method had incorrect step name.
    This caused a problem when trying to retire a service that is
    already in the process of being retired.

diff --git a/content/automate/ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb b/content/automate/ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.cla
index 831d9bc..5aac6b0 100644
--- a/content/automate/ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb
+++ b/content/automate/ManageIQ/Service/Retirement/StateMachines/ServiceRetirement.class/__methods__/update_service_retirement_status.rb
@@ -32,7 +32,7 @@ updated_message += "Current Retry Number [#{$evm.root['ae_state_retries']}]" if
 # Update Status for on_error for all states other than the first state which is start retirement
 # in the retirement state machine.
 if $evm.root['ae_result'] == 'error'
-  if step.downcase == 'start retirement'
+  if step.downcase == 'startretirement'
     msg = 'Cannot continue because Service is '
     msg += service ? "#{service.retirement_state}." : 'nil.'
     $evm.log('info', msg)
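For clarity, the reason the original comparison never matched, assuming the step name delivered to the Automate method is "StartRetirement" as the fixed string implies:

```ruby
step = "StartRetirement"            # step name as seen by the Automate method

step.downcase == "start retirement" # false: the guard never fired, so the
                                    # retirement_state was updated on every error

step.downcase == "startretirement"  # true: the fixed comparison correctly
                                    # skips the state update for the first step
```

Because the old string contained a space, the on_error branch always fell through and set the retirement_state, which is what allowed the second worker past the retiring? check.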
VERIFIED. This fix is present in 5.9.0.17