1382326 – SmartState Analysis failure

Summary: SmartState Analysis failure

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	SmartState Analysis
Sub Component:
Version:	5.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.8.0
Assignee:	Erez Freiberger
QA Contact:	Pavel Zagalsky
Docs Contact:
URL:
Whiteboard:	container:smartstate
Depends On:
Blocks:	1406023

Reported:	2016-10-06 11:15 UTC by Pavel Zagalsky
Modified:	2017-06-12 16:20 UTC
CC List:	8 users
Fixed In Version:	5.8.0.0
Doc Text:
Clone Of:
Clones:	1406023
Environment:
Last Closed:	2017-06-12 16:20:38 UTC
Category:	---
Cloudforms Team:	Container Management
Target Upstream Version:
Embargoed:

Attachments
SSA Scan fail (11.78 KB, text/plain) 2016-10-06 11:15 UTC, Pavel Zagalsky	no flags
OpenSCAP Fail 5.7.10 (460.51 KB, text/plain) 2016-11-14 11:48 UTC, Pavel Zagalsky	no flags
OSE VM findings (1.79 KB, text/plain) 2016-12-19 12:35 UTC, Jaroslav Henner	no flags

Description Pavel Zagalsky 2016-10-06 11:15:01 UTC

Created attachment 1207892 [details]
SSA Scan fail

Description of problem:
After running a compliance test on an image, an error appears in the Tasks menu.

How reproducible:
Always

Steps to Reproduce:
1. Enable SSA in EVM --> Configuration
2. Add a policy to the provider
3. Select a Container Image and run Smart State Analysis on it

Actual results:
The SSA fails and the Tasks menu shows this error:
job timed out after 355.310125967 seconds of inactivity. Inactivity threshold [300 seconds]	
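
For readers unfamiliar with this message, an inactivity check that produces text of this shape can be sketched as follows. This is only an illustration and not the CFME implementation: the method name inactivity_timeout_message and its parameters are hypothetical, and the only thing taken from this report is the wording of the message.

```ruby
# Hypothetical sketch of an inactivity check; names are illustrative only
# and are not taken from the CFME code base.
def inactivity_timeout_message(last_update, now, threshold_seconds)
  # Subtracting two Time objects yields the elapsed seconds as a Float.
  idle = now - last_update
  # A job that reported progress recently enough is left alone.
  return nil if idle <= threshold_seconds

  "job timed out after #{idle} seconds of inactivity. " \
    "Inactivity threshold [#{threshold_seconds} seconds]"
end

# Example: a job that last reported progress 355.5 seconds ago,
# checked against a 300 second threshold.
last_update = Time.utc(2016, 10, 6, 11, 0, 0)
puts inactivity_timeout_message(last_update, last_update + 355.5, 300)
```

Under this reading, the job is flagged because nothing updated it for longer than the threshold, not because the scan itself reported a failure.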

Expected results:
The SSA should pass successfully

Additional info:
Log file with further info attached

Comment 2 Jeff Teehan 2016-10-11 19:50:36 UTC

I haven't been able to get it to work either, using my go-to VMs for this.  Here is the output for Ubuntu, RHEL, and Windows.  All of these worked in 5.6.1.

	Status = Error	10/11/16 16:43:58 UTC	10/11/16 15:52:54 UTC	10/11/16 15:52:42 UTC	finished	job timed out after 3048.5864924 seconds of inactivity. Inactivity threshold [3000 seconds]	Scan from Vm RHEL72SSA	admin	EVM	
	Status = Error	10/11/16 15:49:58 UTC	10/11/16 14:59:21 UTC	10/11/16 14:59:18 UTC	finished	job timed out after 3022.1714621 seconds of inactivity. Inactivity threshold [3000 seconds]	Scan from Vm WS2012R2SSA	admin	EVM	
	Status = Error	10/11/16 15:28:53 UTC	10/11/16 14:38:16 UTC	10/11/16 14:38:07 UTC	finished	job timed out after 3022.2359251 seconds of inactivity. Inactivity threshold [3000 seconds]	Scan from Vm UBU1404	admin	EVM

Comment 3 Rich Oliveri 2016-10-11 20:17:29 UTC

Just to be clear, you're trying to perform SSA on a container, not a VM, correct?

Comment 4 Jeff Teehan 2016-10-11 20:32:26 UTC

My three were Azure VMs

Comment 5 Rich Oliveri 2016-10-11 20:39:05 UTC

Jeff, yes but the original description says "container image", so it might not be related to what you're seeing on Azure.

Azure will throttle requests based on usage, which can cause this problem. Errors in the log may shed more light on it.

If the problems are related, it would have to be at a very high level (like an appliance issue) because the 2 code paths are very different.

Comment 6 Pavel Zagalsky 2016-10-13 06:42:30 UTC

My attempts were on a Container Image

Comment 8 Pavel Zagalsky 2016-11-07 14:21:41 UTC

I do not remember, but I will try to test it again

Comment 9 Pavel Zagalsky 2016-11-14 11:48:36 UTC

Created attachment 1220366 [details]
OpenSCAP Fail 5.7.10

Comment 10 Pavel Zagalsky 2016-11-14 11:49:32 UTC

Erez, I added an updated log from an image scan on 5.7.10

Comment 11 Pavel Zagalsky 2016-11-16 09:55:21 UTC

I checked it again on 5.7.0.10 with a new OpenShift setup and got this while trying to scan an nginx image I pulled from Docker.io:

[----] E, [2016-11-16T04:41:39.731551 #2871:899144] ERROR -- : Q-task_id([d34c1cc2-abe0-11e6-a2c0-001a4a1697bb]) MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::Scanning::Job#process_abort) job aborting, cannot analyze non docker images
[----] E, [2016-11-16T04:41:39.756713 #2871:899144] ERROR -- : Q-task_id([d34c1cc2-abe0-11e6-a2c0-001a4a1697bb]) MIQ(MiqQueue#deliver) Message id: [24904], Error: [undefined method `[]' for nil:NilClass]
[----] E, [2016-11-16T04:41:39.756921 #2871:899144] ERROR -- : Q-task_id([d34c1cc2-abe0-11e6-a2c0-001a4a1697bb]) [NoMethodError]: undefined method `[]' for nil:NilClass  Method:[rescue in deliver]
[----] E, [2016-11-16T04:41:39.757103 #2871:899144] ERROR -- : Q-task_id([d34c1cc2-abe0-11e6-a2c0-001a4a1697bb]) /var/www/miq/vmdb/app/models/manageiq/providers/kubernetes/container_manager/scanning/job.rb:195:in `cleanup'
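
The trace above suggests that the cleanup step indexed into a value that was nil, possibly because the job aborted before that value was populated. A minimal sketch of that failure mode and one way to guard against it is shown below. Everything in it is an assumption for illustration: the class SketchScanJob, the options hash, and the :pod_name key are hypothetical and are not taken from job.rb, and this is not presented as the fix that was shipped.

```ruby
# Hypothetical stand-in for a scanning job; not the ManageIQ class.
class SketchScanJob
  attr_reader :options, :deleted

  def initialize(options = nil)
    @options = options # may still be nil if the job aborts before setup
    @deleted = []
  end

  # Unguarded cleanup: raises NoMethodError when options is nil,
  # which masks the original abort reason in the log.
  def cleanup_unsafe
    delete_pod(options[:pod_name])
  end

  # Guarded cleanup: skip resources that were never created.
  def cleanup
    pod_name = options && options[:pod_name]
    return :nothing_to_clean if pod_name.nil?

    delete_pod(pod_name)
    :cleaned
  end

  private

  # Records the name instead of calling a real API (illustration only).
  def delete_pod(name)
    @deleted << name
  end
end

aborted = SketchScanJob.new
puts aborted.cleanup # no exception even though setup never ran
```

With a guard of this kind, an aborted job would still log its original abort message ("cannot analyze non docker images") without the secondary NoMethodError.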

Comment 14 Jaroslav Henner 2016-12-19 12:22:24 UTC

I am not sure how relevant the following findings are to this bug, or whether this is a separate one, but after starting the SmartState Analysis, the CPU load average of the VM running OpenShift jumps from 0.5 to around 30. I saw the image_inspector process at the top of `top` sorted by CPU%. The VM is laggy for quite a while (1 minute or more) after the image_inspector process seems to be gone; kswap and loop processes are active during that time.

Comment 15 Jaroslav Henner 2016-12-19 12:35:15 UTC

Created attachment 1233371 [details]
OSE VM findings

It seems the high load is created by the image extraction process. Maybe the timeout of the image scan in CFME is caused by the VM being overloaded by the I/O of the image_scanning or extraction process.
