Bug 1372672

Summary: SSA Fails in Windows workloads but not in Linux ones on OSP9
Product: Red Hat CloudForms Management Engine Reporter: Victor Estival <vestival>
Component: SmartState Analysis Assignee: Hui Song <hsong>
Status: CLOSED CURRENTRELEASE QA Contact: Ido Ovadia <iovadia>
Severity: high Docs Contact:
Priority: high    
Version: 5.6.0 CC: cbolz, dajohnso, iovadia, jhardy, jkeselma, obarenbo, roliveri, sbulage, simaishi, vestival, wrichter
Target Milestone: GA Keywords: TestOnly, ZStream
Target Release: 5.9.0   
Hardware: Unspecified   
OS: Windows   
Whiteboard: openstack:smartstate
Fixed In Version: 5.9.0.1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1450514 1450515 1459235 Environment:
Last Closed: 2018-03-06 14:57:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Openstack Target Upstream Version:
Embargoed:
Bug Depends On: 1437644    
Bug Blocks: 1450514, 1450515, 1459235    
Attachments:
Description Flags
Example of output after several SSAs
none
New screenshot showing windows SSA error message
none
evm.log
none
New Screenshot showing windows SSA status message showing results of both image and instance scans
none
evm log dated 2016-11-16
none

Description Victor Estival 2016-09-02 10:44:40 UTC
Created attachment 1197117 [details]
Example of output after several SSAs

Description of problem:
SSA fails on Windows workloads but works on Linux ones. It runs through the whole process but fails after "Scanning for metadata from VM" with the error: Finished job timed out after 3043.763708524 seconds of inactivity. Inactivity threshold [3000 seconds].

The snapshot is created correctly and is downloaded to the CF appliance.

We also found an error when connecting to OSP: <Fog> excon.error #<Excon::Error::Unauthorized: Expected([201]) <=> Actual(401 Unauthorized)

The full trace can be found at https://paste.fedoraproject.org/419745/81216514/ and the fog output at https://paste.fedoraproject.org/419746/72812562/


Version-Release number of selected component (if applicable): 
We are using CFME 5.6.1.2 and OSP 9.


How reproducible: 
Connect CF to OSP and upload a Windows image.


Steps to Reproduce:
1.
2.
3.

Actual results: 
SSA fails


Expected results: 
SSA works


Additional info:
This is an internal demo project, so you can access the environment over the VPN if you want.

Comment 2 Jerry Keselman 2016-11-10 19:25:07 UTC
The two paste.fedoraproject.org files are not available. In addition, the attached screenshot has an error stating "Failed to create vm snapshot with EMS. Error: [NoMethodError]: [undefined method 'metadata' for nil:NilClass]". This does not match the error in your description of this BZ at all. Do you have any information related to the actual error you are reporting? Alternatively, do you have any background information related to the error in the screenshot?

Comment 3 Wolfram Richter 2016-11-15 21:50:20 UTC
Created attachment 1220961 [details]
New screenshot showing windows SSA error message

Comment 4 Wolfram Richter 2016-11-15 21:52:55 UTC
Created attachment 1220962 [details]
evm.log

Comment 5 Wolfram Richter 2016-11-15 21:53:39 UTC
The Windows image to be analysed is the one from cloudbase.it: https://cloudbase.it/windows-cloud-images/

Comment 6 Jerry Keselman 2016-11-15 21:56:40 UTC
Wolfram,

This is also not the error in this BZ description. You stated above that this was a timeout issue. Are there three different issues here? If so, can you open separate BZs for each one? Also, can you please provide access to the appliance and OpenStack provider in question? Thanks much.

Comment 7 Wolfram Richter 2016-11-15 22:09:28 UTC
Jerry, understood that the information provided does not match the preliminary findings mentioned in the bug report. 

As background: Victor and I are building a demo, and SSA on the Windows image has never worked. The appliance instance that Victor used for the initial bug report has long been destroyed, i.e. no logs related to the original bug report can be retrieved.

We can, however, provide screenshots, logs and access to three different OpenStack environments and CloudForms appliances where SSA on Windows does not work today.

Do you suggest opening another Bugzilla to continue the analysis and closing this one, since we cannot provide any updates on the original report?

Comment 9 Wolfram Richter 2016-11-16 20:58:11 UTC
Created attachment 1221379 [details]
New Screenshot showing windows SSA status message showing results of both image and instance scans

The previous screenshot contained only an image scan.

Comment 10 Wolfram Richter 2016-11-16 21:03:00 UTC
Created attachment 1221380 [details]
evm log dated 2016-11-16

For an image scan, the logs contain the error message mentioned in the original bug description:

[----] W, [2016-11-16T12:57:32.424030 #53725:f53994]  WARN -- : MIQ(VmScan#timeout!) Job: guid: [1620a31e-abec-11e6-a010-020000000111], job timed out after 3048.072594672 seconds of inactivity.  Inactivity threshold [3000 seconds], aborting
[----] E, [2016-11-16T12:57:46.974131 #53717:f53994] ERROR -- : MIQ(VmScan#process_abort) job aborting, job timed out after 3048.072594672 seconds of inactivity.  Inactivity threshold [3000 seconds]
[----] I, [2016-11-16T12:57:47.035964 #53717:f53994]  INFO -- : MIQ(VmScan#process_finished) job finished, job timed out after 3048.072594672 seconds of inactivity.  Inactivity threshold [3000 seconds]
[----] I, [2016-11-16T12:57:47.060489 #53717:f53994]  INFO -- : MIQ(VmScan#dispatch_finish) Dispatch Status is 'finished'
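When several scans have run, it can help to pull only the inactivity timeouts out of evm.log. The following is a minimal sketch, assuming the log lines keep the exact layout shown above; the helper name `timed_out_jobs` and the regular expression are illustrative assumptions and are not part of CFME.

```ruby
# Hypothetical helper (not part of CFME): finds VmScan inactivity timeouts
# in evm.log lines that follow the layout quoted in this comment.
TIMEOUT_RE = /
  \[(?<time>\d{4}-\d{2}-\d{2}T[\d:.]+)\s\#\d+:\h+\]   # timestamp, pid and thread id
  \s+WARN\s--\s:\sMIQ\(VmScan\#timeout!\)\sJob:\sguid:\s
  \[(?<guid>[\h-]+)\],\sjob\stimed\sout\safter\s
  (?<elapsed>[\d.]+)\sseconds\sof\sinactivity\.\s+
  Inactivity\sthreshold\s\[(?<threshold>\d+)\sseconds\]
/x

# Returns one hash per timed-out job; lines that do not match are skipped.
def timed_out_jobs(lines)
  lines.map do |line|
    m = TIMEOUT_RE.match(line)
    next unless m

    {
      guid:      m[:guid],
      time:      m[:time],
      elapsed:   m[:elapsed].to_f,
      threshold: m[:threshold].to_i
    }
  end.compact
end
```

Calling `timed_out_jobs(File.readlines("evm.log"))` on a copy of the log returns one entry per aborted scan, which makes it easy to compare the elapsed time against the threshold for each job GUID.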

Comment 11 Hui Song 2017-04-18 14:09:38 UTC
Wolfram,

I tested SSA on the same Windows VM. It failed with the following errors:

[----] E, [2017-04-18T03:19:44.370855 #38245:36112c] ERROR -- : MIQ(MiqQueue#deliver) Message id: [919000000161855], Error: [wrong number of arguments (given 2, expected 1)]
[----] E, [2017-04-18T03:19:44.371140 #38245:36112c] ERROR -- : [ArgumentError]: wrong number of arguments (given 2, expected 1)  Method:[rescue in deliver]
[----] E, [2017-04-18T03:19:44.371228 #38245:36112c] ERROR -- : /var/www/miq/vmdb/app/models/storage.rb:725:in `perf_capture'
/var/www/miq/vmdb/app/models/storage.rb:717:in `perf_capture_hourly'
/var/www/miq/vmdb/app/models/miq_queue.rb:347:in `block in deliver'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
......

BZ https://bugzilla.redhat.com/show_bug.cgi?id=1437644 has already been opened for this problem. We have to wait until that fix is merged.

Updated the dependency.
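For readers unfamiliar with the message in the trace above: "wrong number of arguments (given 2, expected 1)" is Ruby's standard error when a caller passes more arguments than the method accepts. The sketch below reproduces the message with a hypothetical method; it does not show the actual code in storage.rb, which is tracked in the BZ referenced above.

```ruby
# Hypothetical stand-in for a method that accepts exactly one argument.
# This is NOT the real perf_capture from storage.rb.
def capture(interval)
  "captured #{interval}"
end

# Calls the one-argument method with two arguments and returns the
# resulting error message instead of letting the exception propagate.
def arity_error_message
  capture("hourly", "2017-04-18")
rescue ArgumentError => e
  e.message
end

puts arity_error_message
puts method(:capture).arity # number of required arguments the method declares
```

This kind of mismatch usually means the caller and the callee were updated independently, which fits an error that appears on every scan regardless of the guest OS.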

Comment 12 Wolfram Richter 2017-04-18 14:28:19 UTC
Hui,

thanks for the analysis - what I find odd is that on CFME 5.6 SSA does work for Linux VMs, it only doesn't work for Windows. If the performance capture is on the code path I would assume that it should also fail for Linux VMs. 

Cheers,
Wolfram

Comment 13 Hui Song 2017-04-18 14:33:42 UTC
(In reply to Wolfram Richter from comment #12)
> Hui,
> 
> thanks for the analysis - what I find odd is that on CFME 5.6 SSA does work
> for Linux VMs, it only doesn't work for Windows. If the performance capture
> is on the code path I would assume that it should also fail for Linux VMs. 
> 
> Cheers,
> Wolfram

Yes, you are right. This is not the root cause of SSA failing on Windows VMs. It only blocked us from reproducing the issue. I'll retest after the fix is applied.

Thanks,

Hui

Comment 14 Hui Song 2017-04-20 13:54:44 UTC
Wolfram,

May I borrow your https://cloudforms.hailstorm2.coe.muc.redhat.com? I found some suspicious code and need to confirm that it's the root cause.

Thanks,

Comment 15 Wolfram Richter 2017-04-20 14:09:21 UTC
You can use https://cloudforms.hailstorm1.coe.muc.redhat.com/ (CFME 5.7.0.17, same credentials). Hailstorm2 is currently being reinstalled as a testbed for CFME 5.8 beta 2.

Comment 16 Hui Song 2017-04-20 14:35:19 UTC
Thank you, Wolfram. On this new appliance, the OpenStack provider's validation somehow failed. Can you revalidate it?

Comment 17 Wolfram Richter 2017-04-20 15:28:34 UTC
Sorry, I see that all my OpenStack environments seem to have keystone problems - I'll report back when I have a usable env.

Comment 18 Wolfram Richter 2017-04-20 22:08:01 UTC
https://cloudforms.hailstorm2.coe.muc.redhat.com/ (CFME 5.8.0.10) and its OpenStack are in working condition again. The others (with CFME 5.7) will probably reappear tomorrow.

Comment 19 Wolfram Richter 2017-04-22 08:26:19 UTC
https://cloudforms.hailstorm3.coe.muc.redhat.com/ (CFME 5.7.2.1) and its OpenStack are also working again. The hailstorm2 environment will probably be of limited availability since I'm working on the underlying RHEL (hailstorm3 will be stable).

Comment 20 Hui Song 2017-04-24 14:21:06 UTC
Wolfram,

The scan target here is an OpenStack image of Windows Server 2012. The root causes of the SSA failure are:
1. Its OS device signature is shorter than expected (22 vs. 60), which causes 'Unable to mount filesystem. Reason:[No root filesystem found.]';
2. vmConfig is undefined for MiqOpenstackImage, which causes the job to hang and finally time out.
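The two failure modes above can be illustrated with a small sketch. Everything in it is hypothetical (method names, return shapes, the stand-in classes) and it is not the ManageIQ implementation or the eventual fix; it only shows how a hard length expectation rejects a shorter signature, and how checking for a missing method up front lets a job fail with an explicit error rather than stalling until the inactivity threshold.

```ruby
# All names here are hypothetical; this is not the ManageIQ code.

# Failure mode 1: a fixed length expectation rejects a shorter signature.
# The default of 60 is the figure quoted in the comment above.
def check_signature(signature, expected_length = 60)
  return [:ok, signature] if signature.length >= expected_length

  [:error, "signature length #{signature.length} is shorter than expected #{expected_length}"]
end

# Failure mode 2: the scan target does not define vmConfig.
class ImageWithoutConfig; end # stand-in for a target lacking vmConfig

class ImageWithConfig # stand-in for a target that defines vmConfig
  def vmConfig
    { "name" => "demo" }
  end
end

# Reports a missing method immediately so the caller can abort the job
# with a clear reason instead of waiting for a timeout.
def read_config(target)
  unless target.respond_to?(:vmConfig)
    return [:error, "#{target.class} does not define vmConfig"]
  end

  [:ok, target.vmConfig]
end
```

In this sketch a 22-character signature produces an explicit length error, and a target without vmConfig is reported immediately, so neither condition has to surface as a generic mount failure or a 3000-second timeout.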

I'll send pull requests to fix them. Thank you for providing your environments.

Comment 21 Hui Song 2017-05-01 20:25:14 UTC
There are two PRs for this BZ:

https://github.com/ManageIQ/manageiq-gems-pending/pull/143
https://github.com/ManageIQ/manageiq-gems-pending/pull/144

They have been merged into the master branch.

Comment 26 Ido Ovadia 2018-02-28 11:47:35 UTC
Verified
========
5.9.0.22

Comment 27 Satyajit Bulage 2018-02-28 12:05:38 UTC
Hello Dave,

Worked with Ido Ovadia and cross-checked that SSA is gathering all required data from Windows Instance and Image.

Thanks,
Satyajit Bulage.