Bug 1313063

Summary: Synchronization issue in the MiqQueue between two workers causes automation_task to fail.
Product: Red Hat CloudForms Management Engine
Component: Automate
Version: 5.3.0
Target Milestone: GA
Target Release: 5.6.0
Hardware: All
OS: All
Severity: high
Priority: high
Status: CLOSED INSUFFICIENT_DATA
Whiteboard: automate
Doc Type: Bug Fix
Type: Bug
Reporter: Josh Carter <jocarter>
Assignee: mkanoor
QA Contact: Milan Falešník <mfalesni>
CC: jhardy, jocarter, mkanoor, obarenbo, tfitzger, vsira
Last Closed: 2016-06-21 17:48:53 UTC

Comment 8 mkanoor 2016-03-01 21:44:58 UTC
The sequence of events is:
(1) The first worker, which was running Automate, creates a queue entry when it detects a retry is needed. The queue entry has the "delay" for the retry along with the state machine variables.
(2) The worker ends the job.
(3) Some other worker picks up the queue entry and runs with it.

When the request gets dispatched by the other worker, we should see the state machine variables.

Can you check whether there is a delay between the retries? You can induce this delay by setting

$evm.root['ae_retry_interval'] = 5

which will introduce a 5-second delay.
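In an automate method that issues the retry, the setting looks roughly like this (a minimal sketch using a stand-in for the engine-provided $evm object; `ae_result` and `ae_retry_interval` are the standard state-machine keys):

```ruby
require 'ostruct'

# Stand-in for the engine-provided $evm service object (illustration only;
# in a real automate method, $evm already exists).
evm = OpenStruct.new(root: {})

# Tell the state machine to retry this state, and wait 5 seconds before
# the retry queue entry is dispatched to a worker.
evm.root['ae_result']         = 'retry'
evm.root['ae_retry_interval'] = 5
```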

Comment 9 mkanoor 2016-03-01 22:39:08 UTC
The log shows that the method did have parameters when it failed. It has a list of volumes.


[----] I, [2016-02-26T17:51:32.989601 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]
[----] I, [2016-02-26T17:51:32.990052 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - VOLUME ID: b5847a04-09bc-4ef7-a346-08a782e17af5
[----] I, [2016-02-26T17:51:32.990717 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - Disk number of the disks that are getting modified :
[----] I, [2016-02-26T17:51:32.991039 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space -  &&& Start Testing GEM method &&&
[----] I, [2016-02-26T17:51:32.991407 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - CloudForms Automate Method Started
[----] I, [2016-02-26T17:51:32.996126 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> Running kinit - run_kinit
[----] I, [2016-02-26T17:51:33.330137 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> kinit output: Password for svc.cldauto.COM:
 - run_kinit
[----] I, [2016-02-26T17:51:33.330521 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> Connecting to WinRM on URL: http://wpdospwa00125.ads.aexp.com:5985/wsman with user svc.cldauto.COM - winRM
[----] I, [2016-02-26T17:51:33.330849 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> Using Realm: ADS.AEXP.COM - winRM
[----] E, [2016-02-26T17:52:33.883444 #2613:61b1d58] ERROR -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> [execution expired]
/opt/rh/ruby193/root/usr/share/gems/gems/httpclient-2.2.7/lib/httpclient/session.rb:800:in `initialize'
/opt/rh/ruby193/root/usr/share/gems/gems/httpclient-2.2.7/lib/httpclient/session.rb:800:in `new'
/opt/rh/ruby193/root/usr/share/gems/gems/httpclient-2.2.7/lib/httpclient/session.rb:800:in `create_socket'

Comment 11 mkanoor 2016-03-01 23:20:09 UTC
Which is the log file that has the beginning of this transaction? I can't seem to find the starting point.
This log line seems to be from the method (#2613:61b1d58):

[----] I, [2016-02-26T17:51:32.989601 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]

Where does this transaction start, is that in a different log file?

Comment 13 mkanoor 2016-03-02 16:53:33 UTC
The logs don't have the beginning of this task

The first line is:

[----] I, [2016-02-26T17:51:32.989601 #2613:61b1d58]  INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]

So we would need the beginning of that task #2613:61b1d58

Also, could you validate whether they have tried setting
$evm.root['ae_retry_interval'] = 5

in the method that issues the retry.

This will introduce a 5 second delay before the next queue entry gets dispatched.

Comment 14 Victor Sira 2016-03-02 17:45:53 UTC
Yes, that's all we have in the logs for that task. Do we need to increase verbosity in the logs to capture additional information? Also, how can I tell where the tasks begin?

The ae_retry_interval for the specific method is not set, so it uses the default value. Note that we don't have retry logic for this particular error condition (i.e., for the info not being available in the state_var), so I'm not sure if that would help. We could add code that works around the issue by adding retry logic, if that's recommended.

Comment 15 mkanoor 2016-03-03 15:35:29 UTC
Does this fail consistently or sporadically?

Comment 16 Victor Sira 2016-03-03 16:43:21 UTC
Out of 10 requests we'll get 1 to 2 failures. It only happens for Windows VMs (we have a similar state machine handling Linux VMs) and, as I mentioned, only in multi-worker environments.

Comment 17 mkanoor 2016-03-03 17:19:26 UTC
Would it be possible to get to the beginning of the dequeuing, to see if all the state data was read properly? By the time we see the failure, the log has rolled over, so there should be rolled-over logs. A fresh run that shows the beginning of the task would also help.

If you search for MiqAeEngine.deliver in the evm.log files (please check the rolled-over logs too), we would see what was getting delivered.

The log level is OK; it's just that we are missing the starting data, which might be in the rolled-over logs.

Comment 18 Victor Sira 2016-03-03 23:28:25 UTC
I'm not able to find anything in that log or the next one after rotation (the rotation didn't hit around the time this issue happened, btw).

Here is what I was doing in case my search is not correct:

zcat evm.log-20160227.gz  |grep  automation_task_70000000012070 |grep MiqAeEngine.deliver 


We also tried another fresh query today but were not able to reproduce. We'll attempt again tomorrow.

Comment 19 mkanoor 2016-03-04 14:53:39 UTC
Another thing to search for is

MIQ(MiqAeEngine.deliver) Requeuing

This is logged when we get a retry and store things back in the queue for the next launch. That should tell us whether we are storing partial info.
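A search along these lines would surface the requeue entries for the failing task (a sketch: the sample log line below is fabricated so the command is runnable here; on a real appliance you would search evm.log and its rotated copies, using zgrep for the .gz files):

```shell
# Recreate a one-line sample log so the command is runnable here; against a
# real appliance, search evm.log and its rotated copies instead.
printf '%s\n' \
  '[----] I, [2016-02-26T17:50:01 #2613]  INFO -- : Q-task_id([automation_task_70000000012070]) MIQ(MiqAeEngine.deliver) Requeuing ...' \
  > evm.log.sample

# The actual search: requeue lines, narrowed to the failing task id.
grep 'MIQ(MiqAeEngine.deliver) Requeuing' evm.log.sample \
  | grep 'automation_task_70000000012070'
```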

Comment 23 mkanoor 2016-03-10 18:34:20 UTC
Yes, that's correct; we need the logs that show us dequeueing the entry. You don't have to change the log level.

Comment 25 mkanoor 2016-03-14 22:04:22 UTC
Based on the data that I see in the logs, the data that we dequeued had these 3 entries:

entry1 = {'disk_num' => '0', 'slot' => '04', 'vol_num' => '0', 'mount' => 'System Rese', 'capacity' => '99', 'capacity_type' => 'MB', 'free' => '71', 'free_type' => 'MB', 'volume_id' => 'b797dacf-3da8-4d9e-bf00-4ed19165740e'}

entry2 = {'disk_num' => '0', 'slot' => '04', 'vol_num' => '1', 'mount' => 'C', 'capacity' => '99', 'capacity_type' => 'MB', 'free' => '27', 'free_type' => 'MB', 'volume_id' => 'b797dacf-3da8-4d9e-bf00-4ed19165740e'}

entry3 = {'disk_num' => '1', 'slot' => '05', 'vol_num' => '2', 'mount' => 'F', 'capacity' => '49', 'capacity_type' => 'GB', 'free' => '49', 'free_type' => 'GB', 'volume_id' => 'b5847a04-09bc-4ef7-a346-08a782e17af5'}

When the method ran, it only mentioned the last entry:

<AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]

Are they saying they can't access the first two entries, only the last one? We have the data all the way up to when the method launches.

In the first entry the mount is "System Rese"; is that valid, or does it look incomplete?

Comment 26 Victor Sira 2016-03-16 17:52:26 UTC
As I mentioned in 10 above, the input that you see logged here:

<AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]

It doesn't come from the queue but from the REST parameters (they look similar because they were obtained in the same format). The issue is that none of the messages in the queue made it in time before this method executed, and it failed.

The "System Rese" mount is normal; that is a reserved Windows partition for recovery.

The problem I have now is that they don't want to do additional testing until we can provide an answer on why the information you're looking for in the logs wasn't logged in the first place.

Let me know if you prefer a phone call to clarify all this.

Comment 27 mkanoor 2016-04-04 18:06:42 UTC
Josh any updates on this issue?

Comment 28 mkanoor 2016-04-04 18:08:02 UTC
Had a meeting with the customer about adding additional checks in their automate method around the call to fetch the state variables.
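The defensive check discussed here could look roughly like this (a sketch using a stand-in for the engine-provided $evm object; `get_state_var` is the automate call that fetches state-machine variables, and the variable name `volumes` is hypothetical):

```ruby
require 'ostruct'

# Stand-in for the engine-provided $evm object (illustration only; on an
# appliance, $evm.get_state_var is supplied by the automate engine).
evm = OpenStruct.new(root: {})
def evm.get_state_var(_name)
  nil # simulate the failure case: state variables not yet visible
end

# Guard the fetch: if the state variable is missing (the race seen in
# this bug), request a retry instead of proceeding with incomplete data.
volumes = evm.get_state_var('volumes')
if volumes.nil?
  evm.root['ae_result']         = 'retry'
  evm.root['ae_retry_interval'] = 5
else
  evm.root['ae_result'] = 'ok'
end
```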

Comment 29 Victor Sira 2016-04-05 22:24:16 UTC
Code was added. However, we haven't seen the issue since that was done a couple of weeks ago. Is it feasible to leave this open a few more weeks to give the issue enough time to recur?

Comment 31 mkanoor 2016-05-24 14:09:14 UTC
Josh,
Can this be closed?
Thanks,
Madhu

Comment 32 Josh Carter 2016-06-21 17:48:53 UTC
Madhu, 

Closing this because the support case was closed and no further information was provided.