Created attachment 1197117 [details]
Example of output after several SSAs
Description of problem:
SSA fails on Windows workloads but works on Linux ones. It runs through the whole process but fails after "Scanning for metadata from VM" with the error: Finished job timed out after 3043.763708524 seconds of inactivity. Inactivity threshold [3000 seconds].
The snapshot is created correctly and is downloaded to the CF appliance.
We found an error connecting to OSP <Fog> excon.error #<Excon::Error::Unauthorized: Expected() <=> Actual(401 Unauthorized)
Full trace can be found in: https://paste.fedoraproject.org/419745/81216514/ and fog output in: https://paste.fedoraproject.org/419746/72812562/
Version-Release number of selected component (if applicable):
we are using 220.127.116.11 in CFME and OSP 9
Connect CF to OSP and Upload a Windows image
Steps to Reproduce:
This is an internal demo project, so you can access the environment from the VPN if you want.
The two paste.fedoraproject.org files are not available. In addition, the attached screenshot has an error stating "Failed to create vm snapshot with EMS. Error: [NoMethodError]: [undefined method 'metadata' for nil:NilClass]". This does not match the error in your description of this BZ at all. Do you have any information related to the actual error you are reporting? Alternatively, do you have any background information related to the error in the screenshot?
Created attachment 1220961 [details]
New screenshot showing windows SSA error message
Created attachment 1220962 [details]
Windows image to be analysed is the image from cloudbase.it: https://cloudbase.it/windows-cloud-images/
This is also not the error in this BZ description. You stated above that this was a timeout issue. Are there three different issues here? Can you open separate BZs for each one if so? Thanks. Also can you please provide access to the appliance and Openstack provider in question. Thanks much.
Jerry, understood that the information provided does not match the preliminary findings mentioned in the bug report.
As background: Victor and I are building a demo and SSA on the windows image has consistently never worked. The appliance instance that Victor used for the initial bug report has long been destroyed, i.e. no log can be retrieved related to the original bug report.
We can, however, provide screenshots, logs and access to three different openstack environments and cloudforms appliances where SSA on Windows does not work today.
Do you suggest opening another bugzilla to continue the analysis and closing this one, since we cannot provide any updates?
Created attachment 1221379 [details]
New Screenshot showing windows SSA status message showing results of both image and instance scans
Previous image contained only an image scan
Created attachment 1221380 [details]
evm log dated 2016-11-16
For an image scan, there is the error message mentioned in the previous bug report in the logs:
[----] W, [2016-11-16T12:57:32.424030 #53725:f53994] WARN -- : MIQ(VmScan#timeout!) Job: guid: [1620a31e-abec-11e6-a010-020000000111], job timed out after 3048.072594672 seconds of inactivity. Inactivity threshold [3000 seconds], aborting
[----] E, [2016-11-16T12:57:46.974131 #53717:f53994] ERROR -- : MIQ(VmScan#process_abort) job aborting, job timed out after 3048.072594672 seconds of inactivity. Inactivity threshold [3000 seconds]
[----] I, [2016-11-16T12:57:47.035964 #53717:f53994] INFO -- : MIQ(VmScan#process_finished) job finished, job timed out after 3048.072594672 seconds of inactivity. Inactivity threshold [3000 seconds]
[----] I, [2016-11-16T12:57:47.060489 #53717:f53994] INFO -- : MIQ(VmScan#dispatch_finish) Dispatch Status is 'finished'
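The timeout lines above carry both the measured inactivity and the configured threshold. For triaging similar reports, a minimal Ruby sketch for pulling those two numbers out of an evm.log line could look like the following. The helper name `parse_inactivity_timeout` is hypothetical and not part of CFME, and the pattern is assumed only from the message format quoted in this log.

```ruby
# Hypothetical triage helper, not part of CFME/ManageIQ.
# The pattern is assumed from the VmScan timeout messages quoted in this BZ.
TIMEOUT_PATTERN = /job timed out after (?<elapsed>\d+(?:\.\d+)?) seconds of inactivity\.\s+Inactivity threshold \[(?<threshold>\d+) seconds\]/

# Returns a hash with the elapsed inactivity, the threshold and the overrun
# (all in seconds), or nil when the line is not a timeout message.
def parse_inactivity_timeout(line)
  match = TIMEOUT_PATTERN.match(line)
  return nil unless match

  elapsed = match[:elapsed].to_f
  threshold = match[:threshold].to_i
  { elapsed: elapsed, threshold: threshold, overrun: (elapsed - threshold).round(3) }
end

line = "MIQ(VmScan#process_abort) job aborting, job timed out after " \
       "3048.072594672 seconds of inactivity. Inactivity threshold [3000 seconds]"
result = parse_inactivity_timeout(line)
puts result.inspect if result
```

A small overrun past the threshold, as in these lines, suggests the job was idle for the whole window rather than progressing slowly.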
I tested SSA on the same Windows VM. It failed with the following errors:
[----] E, [2017-04-18T03:19:44.370855 #38245:36112c] ERROR -- : MIQ(MiqQueue#deliver) Message id: , Error: [wrong number of arguments (given 2, expected 1)]
[----] E, [2017-04-18T03:19:44.371140 #38245:36112c] ERROR -- : [ArgumentError]: wrong number of arguments (given 2, expected 1) Method:[rescue in deliver]
[----] E, [2017-04-18T03:19:44.371228 #38245:36112c] ERROR -- : /var/www/miq/vmdb/app/models/storage.rb:725:in `perf_capture'
/var/www/miq/vmdb/app/models/miq_queue.rb:347:in `block in deliver'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch'
BZ https://bugzilla.redhat.com/show_bug.cgi?id=1437644 has already been opened for this problem. We have to wait until that fix is merged here.
Updated the dependency.
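For readers unfamiliar with the message in the trace above: Ruby raises this ArgumentError when a method that accepts one argument is called with two. A minimal sketch of that failure mode follows. `StorageStub`, its `perf_capture` signature and the `deliver` helper are hypothetical stand-ins, and nothing here reflects the actual code in storage.rb or miq_queue.rb.

```ruby
# Hypothetical stand-in; the real Storage#perf_capture signature is not shown in this BZ.
class StorageStub
  def perf_capture(interval_name)
    "captured #{interval_name}"
  end
end

# Hypothetical, simplified dispatcher: calls the method with the queued
# arguments and reports an arity mismatch instead of crashing.
def deliver(target, method_name, args)
  target.public_send(method_name, *args)
rescue ArgumentError => err
  "[#{err.class}]: #{err.message}"
end

puts deliver(StorageStub.new, :perf_capture, ["realtime"])
puts deliver(StorageStub.new, :perf_capture, ["realtime", "extra"])
```

The second call passes one argument more than the method accepts, which is the kind of caller/callee mismatch the trace points at.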
thanks for the analysis - what I find odd is that on CFME 5.6 SSA does work for Linux VMs, it only doesn't work for Windows. If the performance capture is on the code path I would assume that it should also fail for Linux VMs.
(In reply to Wolfram Richter from comment #12)
> thanks for the analysis - what I find odd is that on CFME 5.6 SSA does work
> for Linux VMs, it only doesn't work for Windows. If the performance capture
> is on the code path I would assume that it should also fail for Linux VMs.
Yes, you are right. This is not the root cause of SSA failing on Windows VMs. It only blocked us from reproducing the issue. I'll retest it after the fix is patched.
May I borrow your https://cloudforms.hailstorm2.coe.muc.redhat.com? I found some suspicious code and need to prove it's the root cause.
you can use https://cloudforms.hailstorm1.coe.muc.redhat.com/ (CFME 18.104.22.168, same credentials). Hailstorm2 is currently being reinstalled as a testbed for CFME 5.8 beta 2.
Thank you, Wolfram. In this new appliance, somehow the openstack provider's validation failed. Can you revalidate it?
Sorry, I see that all my OpenStack environments seem to have keystone problems - I'll report back when I have a usable env.
https://cloudforms.hailstorm2.coe.muc.redhat.com/ (CFME 22.214.171.124) and its openstack are in a working condition again, the others (with CFME 5.7) will probably reappear tomorrow
https://cloudforms.hailstorm3.coe.muc.redhat.com/ (CFME 126.96.36.199) and its openstack are also working again. The hailstorm2 env. will probably be of limited availability since I'm working on the underlying RHEL (hailstorm3 will be stable).
The scan target here is an OpenStack image of Windows Server 2012. The root causes of the SSA failure are:
1. Its OS device signature is shorter than expected (22 vs. 60), which causes 'Unable to mount filesystem. Reason:[No root filesystem found.]';
2. vmConfig is undefined for MiqOpenStackImage, which causes the job to hang and finally time out.
I'll send PRs to fix them. Thank you for your environments.
There are two PRs for this BZ:
They have been merged into the master branch.
Worked with Ido Ovadia and cross-checked that SSA is gathering all required data from Windows Instance and Image.