1372672 – SSA Fails in Windows workloads but not in Linux ones on OSP9

Summary: SSA Fails in Windows workloads but not in Linux ones on OSP9

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	SmartState Analysis
Sub Component:
Version:	5.6.0
Hardware:	Unspecified
OS:	Windows
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.9.0
Assignee:	Hui Song
QA Contact:	Ido Ovadia
Docs Contact:
URL:
Whiteboard:	openstack:smartstate
Depends On:	1437644
Blocks:	1450514 1450515 1459235

Reported:	2016-09-02 10:44 UTC by Victor Estival
Modified:	2018-03-06 14:57 UTC
CC List:	11 users
Fixed In Version:	5.9.0.1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1450514 1450515 1459235
Environment:
Last Closed:	2018-03-06 14:57:52 UTC
Category:	---
Cloudforms Team:	Openstack
Target Upstream Version:
Embargoed:

Attachments
Example of output after several SSAs (574.85 KB, image/png) 2016-09-02 10:44 UTC, Victor Estival
New screenshot showing windows SSA error message (198.43 KB, image/png) 2016-11-15 21:50 UTC, Wolfram Richter
evm.log (12.22 MB, application/x-gzip) 2016-11-15 21:52 UTC, Wolfram Richter
New Screenshot showing windows SSA status message showing results of both image and instance scans (150.20 KB, image/png) 2016-11-16 20:58 UTC, Wolfram Richter
evm log dated 2016-11-16 (17.84 MB, application/x-gzip) 2016-11-16 21:03 UTC, Wolfram Richter

Description Victor Estival 2016-09-02 10:44:40 UTC

Created attachment 1197117 [details]
Example of output after several SSAs

Description of problem:
SSA fails on Windows workloads but works on Linux ones. It runs through the whole process but fails after "Scanning for metadata from VM" with the error: Finished job timed out after 3043.763708524 seconds of inactivity. Inactivity threshold [3000 seconds].

The snapshot is created correctly and is downloaded to the CF appliance.

We found an error connecting to OSP: <Fog> excon.error     #<Excon::Error::Unauthorized: Expected([201]) <=> Actual(401 Unauthorized)

Full trace can be found in: https://paste.fedoraproject.org/419745/81216514/ and fog output in: https://paste.fedoraproject.org/419746/72812562/


Version-Release number of selected component (if applicable): 
We are using CFME 5.6.1.2 and OSP 9.


How reproducible: 
Connect CF to OSP and Upload a Windows image


Steps to Reproduce:
1.
2.
3.

Actual results: 
SSA fails


Expected results: 
SSA works


Additional info:
This is an internal demo project, so you can access the environment over the VPN if you want.

Comment 2 Jerry Keselman 2016-11-10 19:25:07 UTC

The two paste.fedoraproject.org files are not available. In addition, the attached screenshot has an error stating "Failed to create vm snapshot with EMS. Error: [NoMethodError]: [undefined method 'metadata' for nil:NilClass]". This does not match the error in your description of this BZ at all. Do you have any information related to the actual error you are reporting? Alternatively, do you have any background information related to the error in the screenshot?

Comment 3 Wolfram Richter 2016-11-15 21:50:20 UTC

Created attachment 1220961 [details]
New screenshot showing windows SSA error message

Comment 4 Wolfram Richter 2016-11-15 21:52:55 UTC

Created attachment 1220962 [details]
evm.log

Comment 5 Wolfram Richter 2016-11-15 21:53:39 UTC

Windows image to be analysed is the image from cloudbase.it: https://cloudbase.it/windows-cloud-images/

Comment 6 Jerry Keselman 2016-11-15 21:56:40 UTC

Wolfram,

This is also not the error in this BZ description. You stated above that this was a timeout issue. Are there three different issues here? If so, can you open a separate BZ for each one? Also, can you please provide access to the appliance and OpenStack provider in question? Thanks much.

Comment 7 Wolfram Richter 2016-11-15 22:09:28 UTC

Jerry, understood that the information provided does not match the preliminary findings mentioned in the bug report. 

As background: Victor and I are building a demo, and SSA on the Windows image has never worked. The appliance instance that Victor used for the initial bug report has long been destroyed, i.e. no log related to the original bug report can be retrieved.

We can, however, provide screenshots, logs and access to three different OpenStack environments and CloudForms appliances where SSA on Windows does not work today.

Do you suggest opening another Bugzilla to continue the analysis and closing this one, since we cannot provide any updates?

Comment 9 Wolfram Richter 2016-11-16 20:58:11 UTC

Created attachment 1221379 [details]
New Screenshot showing windows SSA status message showing results of both image and instance scans

Previous image contained only an image scan

Comment 10 Wolfram Richter 2016-11-16 21:03:00 UTC

Created attachment 1221380 [details]
evm log dated 2016-11-16

For an image scan, there is the error message mentioned in the previous bug report in the logs:

[----] W, [2016-11-16T12:57:32.424030 #53725:f53994]  WARN -- : MIQ(VmScan#timeout!) Job: guid: [1620a31e-abec-11e6-a010-020000000111], job timed out after 3048.072594672 seconds of inactivity.  Inactivity threshold [3000 seconds], aborting
[----] E, [2016-11-16T12:57:46.974131 #53717:f53994] ERROR -- : MIQ(VmScan#process_abort) job aborting, job timed out after 3048.072594672 seconds of inactivity.  Inactivity threshold [3000 seconds]
[----] I, [2016-11-16T12:57:47.035964 #53717:f53994]  INFO -- : MIQ(VmScan#process_finished) job finished, job timed out after 3048.072594672 seconds of inactivity.  Inactivity threshold [3000 seconds]
[----] I, [2016-11-16T12:57:47.060489 #53717:f53994]  INFO -- : MIQ(VmScan#dispatch_finish) Dispatch Status is 'finished'
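
For illustration, the inactivity check behind these messages can be sketched as follows. This is a minimal sketch and not the ManageIQ implementation: the method name `inactivity_timeout_message` and the constant `INACTIVITY_THRESHOLD` are hypothetical, and only the 3000-second threshold and the message wording are taken from the log above.

```ruby
# Hypothetical sketch of an inactivity check; this is not ManageIQ code.
INACTIVITY_THRESHOLD = 3000 # seconds, the threshold reported in the log

# Returns an abort message when the job has been idle for longer than the
# threshold, or nil while the job still counts as active.
# Both timestamps are plain numbers of seconds to keep the sketch simple.
def inactivity_timeout_message(last_update, now, threshold = INACTIVITY_THRESHOLD)
  idle = now - last_update
  return nil if idle <= threshold

  "job timed out after #{idle} seconds of inactivity.  " \
    "Inactivity threshold [#{threshold} seconds]"
end

# A job idle for about 3048 seconds is aborted; one idle for 120 seconds is not.
puts inactivity_timeout_message(0.0, 3048.072594672).inspect
puts inactivity_timeout_message(0.0, 120.0).inspect
```

The point of the sketch is that the timeout only reports that the scan stopped making progress; it does not say why the job went idle.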

Comment 11 Hui Song 2017-04-18 14:09:38 UTC

Wolfram,

I tested SSA on the same Windows VM. It failed with the following errors:

[----] E, [2017-04-18T03:19:44.370855 #38245:36112c] ERROR -- : MIQ(MiqQueue#deliver) Message id: [919000000161855], Error: [wrong number of arguments (given 2, expected 1)]
[----] E, [2017-04-18T03:19:44.371140 #38245:36112c] ERROR -- : [ArgumentError]: wrong number of arguments (given 2, expected 1)  Method:[rescue in deliver]
[----] E, [2017-04-18T03:19:44.371228 #38245:36112c] ERROR -- : /var/www/miq/vmdb/app/models/storage.rb:725:in `perf_capture'
/var/www/miq/vmdb/app/models/storage.rb:717:in `perf_capture_hourly'
/var/www/miq/vmdb/app/models/miq_queue.rb:347:in `block in deliver'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:91:in `block in timeout'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `block in catch'
/opt/rh/rh-ruby23/root/usr/share/ruby/timeout.rb:33:in `catch'
......

BZ https://bugzilla.redhat.com/show_bug.cgi?id=1437644 has already been opened for this problem. We have to wait until that fix is merged here.

Updated the dependency.
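
As an aside, the ArgumentError text in the trace is the message Ruby raises when a method is called with more arguments than it declares. The sketch below reproduces only that message; `StorageSketch` and `arity_error_message` are hypothetical stand-ins, and the sketch makes no claim about the actual signature or callers of `perf_capture` in storage.rb.

```ruby
# Hypothetical stand-in class; this is not the real Storage model.
class StorageSketch
  # Declares exactly one parameter.
  def perf_capture(interval_name)
    "captured #{interval_name}"
  end
end

# Calls the one-parameter method with two arguments and returns the
# resulting error message.
def arity_error_message
  StorageSketch.new.perf_capture("hourly", Time.now)
  nil
rescue ArgumentError => e
  e.message
end

puts arity_error_message
```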

Comment 12 Wolfram Richter 2017-04-18 14:28:19 UTC

Hui,

thanks for the analysis - what I find odd is that on CFME 5.6 SSA does work for Linux VMs, it only doesn't work for Windows. If the performance capture is on the code path I would assume that it should also fail for Linux VMs. 

Cheers,
Wolfram

Comment 13 Hui Song 2017-04-18 14:33:42 UTC

(In reply to Wolfram Richter from comment #12)
> Hui,
> 
> thanks for the analysis - what I find odd is that on CFME 5.6 SSA does work
> for Linux VMs, it only doesn't work for Windows. If the performance capture
> is on the code path I would assume that it should also fail for Linux VMs. 
> 
> Cheers,
> Wolfram

Yes, you are right. This is not the root cause of SSA failing on Windows VMs. It only blocked us from reproducing the issue. I'll retest after the fix is patched.

Thanks,

Hui

Comment 14 Hui Song 2017-04-20 13:54:44 UTC

Wolfram,

May I borrow your https://cloudforms.hailstorm2.coe.muc.redhat.com? I found some suspicious code and need to confirm that it is the root cause.

Thanks,

Comment 15 Wolfram Richter 2017-04-20 14:09:21 UTC

You can use https://cloudforms.hailstorm1.coe.muc.redhat.com/ (CFME 5.7.0.17, same credentials). Hailstorm2 is currently being reinstalled as a testbed for CFME 5.8 beta 2.

Comment 16 Hui Song 2017-04-20 14:35:19 UTC

Thank you, Wolfram. In this new appliance, somehow the openstack provider's validation failed. Can you revalidate it?

Comment 17 Wolfram Richter 2017-04-20 15:28:34 UTC

Sorry, I see that all my OpenStack environments seem to have keystone problems - I'll report back when I have a usable env.

Comment 18 Wolfram Richter 2017-04-20 22:08:01 UTC

https://cloudforms.hailstorm2.coe.muc.redhat.com/ (CFME 5.8.0.10) and its openstack are in a working condition again, the others (with CFME 5.7) will probably reappear tomorrow

Comment 19 Wolfram Richter 2017-04-22 08:26:19 UTC

https://cloudforms.hailstorm3.coe.muc.redhat.com/ (CFME 5.7.2.1) and its OpenStack are also working again. The hailstorm2 environment will probably be of limited availability since I'm working on the underlying RHEL (hailstorm3 will be stable).

Comment 20 Hui Song 2017-04-24 14:21:06 UTC

Wolfram,

The scan target here is an OpenStack image of Windows Server 2012. The root causes of the SSA failure are:
1. Its OS device signature is shorter than expected (22 vs. 60), which causes 'Unable to mount filesystem. Reason:[No root filesystem found.]';
2. vmConfig is undefined for MiqOpenStackImage, which causes the job to hang and eventually time out.

I'll send pull requests to fix them. Thank you for your environments.
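
The two failure modes listed above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `EXPECTED_SIGNATURE_LENGTH`, `strict_signature_match?`, `InstanceSketch`, `ImageSketch` and `config_or_error` are hypothetical names, the strict equality check is only one way a length expectation could reject a shorter signature, and none of it is taken from the manageiq-gems-pending code or the eventual fixes.

```ruby
# Hypothetical names throughout; this is not the manageiq-gems-pending code.
EXPECTED_SIGNATURE_LENGTH = 60 # assumed from the "22 vs. 60" figures above

# Cause 1 (assumed mechanism): a strict length check rejects a shorter
# signature, so no device qualifies as the root filesystem.
def strict_signature_match?(signature)
  signature.bytesize == EXPECTED_SIGNATURE_LENGTH
end

class InstanceSketch
  def vmConfig
    { "displayname" => "instance" }
  end
end

class ImageSketch
  # vmConfig is deliberately left undefined, mirroring cause 2.
end

# Cause 2: asking the image-like object for vmConfig has no method to call.
# The guard reports the gap instead of letting the call fail mid-scan.
def config_or_error(target)
  return "undefined method vmConfig for #{target.class}" unless target.respond_to?(:vmConfig)

  target.vmConfig
end

puts strict_signature_match?("x" * 22)
puts config_or_error(ImageSketch.new)
```

If the missing-method failure is never reported back to the job, the job would simply stop updating, which is consistent with the inactivity timeout seen in the earlier logs.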

Comment 21 Hui Song 2017-05-01 20:25:14 UTC

There are two PRs for this BZ:

https://github.com/ManageIQ/manageiq-gems-pending/pull/143
https://github.com/ManageIQ/manageiq-gems-pending/pull/144

Both have been merged into the master branch.

Comment 26 Ido Ovadia 2018-02-28 11:47:35 UTC

Verified
========
5.9.0.22

Comment 27 Satyajit Bulage 2018-02-28 12:05:38 UTC

Hello Dave,

Worked with Ido Ovadia and cross-checked that SSA is gathering all required data from Windows Instance and Image.

Thanks,
Satyajit Bulage.
