| Summary: | Synchronization issue in the MiqQueue between two workers causes automation_task to fail. | | |
|---|---|---|---|
| Product: | Red Hat CloudForms Management Engine | Reporter: | Josh Carter <jocarter> |
| Component: | Automate | Assignee: | mkanoor |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Milan Falešník <mfalesni> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.3.0 | CC: | jhardy, jocarter, mkanoor, obarenbo, tfitzger, vsira |
| Target Milestone: | GA | | |
| Target Release: | 5.6.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | automate | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-06-21 17:48:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Comment 8
mkanoor
2016-03-01 21:44:58 UTC
The log shows that the method did have parameters when it failed. It has a list of volumes.
[----] I, [2016-02-26T17:51:32.989601 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]
[----] I, [2016-02-26T17:51:32.990052 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - VOLUME ID: b5847a04-09bc-4ef7-a346-08a782e17af5
[----] I, [2016-02-26T17:51:32.990717 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - Disk number of the disks that are getting modified :
[----] I, [2016-02-26T17:51:32.991039 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - &&& Start Testing GEM method &&&
[----] I, [2016-02-26T17:51:32.991407 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - CloudForms Automate Method Started
[----] I, [2016-02-26T17:51:32.996126 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> Running kinit - run_kinit
[----] I, [2016-02-26T17:51:33.330137 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> kinit output: Password for svc.cldauto.COM:
- run_kinit
[----] I, [2016-02-26T17:51:33.330521 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> Connecting to WinRM on URL: http://wpdospwa00125.ads.aexp.com:5985/wsman with user svc.cldauto.COM - winRM
[----] I, [2016-02-26T17:51:33.330849 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> Using Realm: ADS.AEXP.COM - winRM
[----] E, [2016-02-26T17:52:33.883444 #2613:61b1d58] ERROR -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> [execution expired]
/opt/rh/ruby193/root/usr/share/gems/gems/httpclient-2.2.7/lib/httpclient/session.rb:800:in `initialize'
/opt/rh/ruby193/root/usr/share/gems/gems/httpclient-2.2.7/lib/httpclient/session.rb:800:in `new'
/opt/rh/ruby193/root/usr/share/gems/gems/httpclient-2.2.7/lib/httpclient/session.rb:800:in `create_socket'
Which log file has the beginning of this transaction? I can't seem to find the starting point.
This log line seems to come from the method (#2613:61b1d58):
[----] I, [2016-02-26T17:51:32.989601 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]
Where does this transaction start, is that in a different log file?
The logs don't have the beginning of this task
The first line is
1951 [----] I, [2016-02-26T17:51:32.989601 #2613:61b1d58] INFO -- : Q-task_id([automation_task_70000000012070]) <AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]
So we would need the beginning of that task #2613:61b1d58
Also, could you validate whether they have tried setting
$evm.root['ae_retry_interval'] = 5
in the method that issues the retry.
This will introduce a 5 second delay before the next queue entry gets dispatched.
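The retry request suggested above is set on the automate method's root object. A minimal sketch of that pattern follows; note that `EvmStub` and `request_retry` are hypothetical stand-ins added here so the snippet runs outside the appliance, since the real `$evm` service object only exists inside the CloudForms automate engine.

```ruby
# EvmStub is a stand-in for the engine-provided $evm object (assumption:
# only the `root` hash is needed for this sketch).
class EvmStub
  attr_reader :root

  def initialize
    @root = {}
  end
end

$evm = EvmStub.new

# Ask the state machine to retry the current state, waiting `interval`
# seconds before the next queue entry is dispatched.
def request_retry(evm, interval = 5)
  evm.root['ae_result']         = 'retry'
  evm.root['ae_retry_interval'] = interval
end

request_retry($evm)
```

With `ae_retry_interval` set to 5, the next dispatch of the queue entry is delayed by five seconds, which gives the other worker time to finish writing its data.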
Yes, that's all we have in the logs for that task. Do we need to increase verbosity in the logs to capture additional information? Also, how can I tell where the tasks begin? The ae_retry_interval for the specific method is not set, so it uses the default value. Note we don't have retry logic for this particular error condition; the retry is for not having the info available in the state_var, so I'm not sure that would help. We could add code that works around the issue by adding retry logic if that's recommended.

Does this fail consistently or sporadically?

Out of 10 requests we'll get 1 to 2 failures. It only happens for Windows VMs (we have a similar state machine handling Linux VMs) and, as I mentioned, only in multi-worker environments.

Would it be possible to get to the beginning of the dequeuing to see if all the state data was read properly? By the time we see the failure the log has rolled over. There should be rolled-over logs. A fresh run that shows the beginning of the task would also help. If you search for MiqAeEngine.deliver in the evm.logs (please check the rolled-over logs too) we would see what was getting delivered. The log level is OK; we are just missing the starting data that might be in the rolled-over logs. Fresh logs would help.

I'm not able to find anything in that log or the next one after rotation (the rotation didn't hit around the time this issue happened, by the way). Here is what I was doing in case my search is not correct:

zcat evm.log-20160227.gz | grep automation_task_70000000012070 | grep MiqAeEngine.deliver

We also tried another fresh query today but were not able to reproduce. We'll attempt again tomorrow.

Another thing to search for is MIQ(MiqAeEngine.deliver) Requeuing. This is when we get a retry and we store things back in the queue for the next launch. That should tell us if we are storing partial info.

Yes, that's correct, we need the logs that show us dequeueing the entry.
You don't have to change the log level. Based on the data that I see in the logs, the entry we dequeued had these 3 entries:
entry1 = {'disk_num' => '0', 'slot' => '04', 'vol_num' => '0', 'mount' => 'System Rese', 'capacity' => '99', 'capacity_type' => 'MB', 'free' => '71', 'free_type' => 'MB', 'volume_id' => 'b797dacf-3da8-4d9e-bf00-4ed19165740e'}
entry2 = {'disk_num' => '0', 'slot' => '04', 'vol_num' => '1', 'mount' => 'C', 'capacity' => '99', 'capacity_type' => 'MB', 'free' => '27', 'free_type' => 'MB', 'volume_id' => 'b797dacf-3da8-4d9e-bf00-4ed19165740e'}
entry3 = {'disk_num' => '1', 'slot' => '05', 'vol_num' => '2', 'mount' => 'F', 'capacity' => '49', 'capacity_type' => 'GB', 'free' => '49', 'free_type' => 'GB', 'volume_id' => 'b5847a04-09bc-4ef7-a346-08a782e17af5'}
When the method ran, it only mentioned the last entry:
<AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]
Are they saying they can't access the first two entries, only the last one?
We have the data all the way up to when the method launches.
In the first entry the mount value is 'System Rese'; is that valid or does it look incomplete?
As I mentioned in comment 10 above, the input that you see logged here:
<AEMethod executeinguesttool_w> modify disk space - list of volumes: [{"mount"=>"F", "size"=>131, "volumeID"=>"b5847a04-09bc-4ef7-a346-08a782e17af5"}]
doesn't come from the queue but from the REST parameters (they look similar because they were obtained in the same format). The issue is that none of the messages in the queue made it in time before this method executed, and it failed.
The "System Rese" value is normal; that is the reserved Windows partition used for recovery.
The problem I have now is that they don't want to do additional testing until we can provide an answer on why the information you're looking for in the logs wasn't logged in the first place.
Let me know if you prefer a phone call to clarify all this.
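Since the failure mode is that the queued state data is not yet available when the method runs, the kind of defensive check discussed here could guard the state-variable fetch and fall back to a retry. A hedged sketch follows; `StateVarStub`, `fetch_volumes`, and the `:volume_list` name are assumptions invented for illustration (the stub mimics `$evm`'s `state_var_exist?`/`get_state_var` calls so the snippet runs outside the engine).

```ruby
# StateVarStub stands in for $evm's state-variable API, which is only
# available inside the CloudForms automate engine.
class StateVarStub
  attr_reader :root

  def initialize(vars = {})
    @vars = vars
    @root = {}
  end

  def state_var_exist?(name)
    @vars.key?(name)
  end

  def get_state_var(name)
    @vars[name]
  end
end

# Return the volume list if the other worker has stored it; otherwise
# mark the state for retry (with a delay) instead of failing outright.
# :volume_list is a hypothetical state-variable name for this sketch.
def fetch_volumes(evm)
  unless evm.state_var_exist?(:volume_list)
    evm.root['ae_result']         = 'retry'
    evm.root['ae_retry_interval'] = 5
    return nil
  end
  evm.get_state_var(:volume_list)
end
```

The point of the guard is that a not-yet-synchronized queue entry becomes a delayed retry rather than a hard failure of the automation_task.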
Josh, any updates on this issue?

Had a meeting with the customer about adding additional checks in their automate method around the call to fetch the state variables. The code was added. However, we haven't seen the issue since that was done a couple of weeks ago. Is it feasible to leave this open a few more weeks to give enough time for this issue to happen?

Josh, can this be closed? Thanks, Madhu

Madhu, closing this due to the support case closing and no information being provided back.