Bug 1382326

Summary: SmartState Analysis failure
Product: Red Hat CloudForms Management Engine Reporter: Pavel Zagalsky <pzagalsk>
Component: SmartState Analysis Assignee: Erez Freiberger <efreiber>
Status: CLOSED CURRENTRELEASE QA Contact: Pavel Zagalsky <pzagalsk>
Severity: high Docs Contact:
Priority: high    
Version: 5.7.0 CC: cpelland, dajohnso, fsimonce, jhardy, jhenner, jteehan, obarenbo, pzagalsk
Target Milestone: GA Keywords: TestOnly
Target Release: 5.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: container:smartstate
Fixed In Version: 5.8.0.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1406023 (view as bug list) Environment:
Last Closed: 2017-06-12 16:20:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Container Management Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1406023    
Attachments:
Description             Flags
SSA Scan fail           none
OpenSCAP Fail 5.7.10    none
OSE VM findings         none

Description Pavel Zagalsky 2016-10-06 11:15:01 UTC
Created attachment 1207892
SSA Scan fail

Description of problem:
After running a compliance test on a container image, an error appears in the Tasks menu.

How reproducible:
Always

Steps to Reproduce:
1. Enable SSA in EVM --> Configuration
2. Add a policy to the provider
3. Select a Container Image and run Smart State Analysis on it

Actual results:
The SSA fails and the Tasks menu shows this error:
job timed out after 355.310125967 seconds of inactivity. Inactivity threshold [300 seconds]

Expected results:
The SSA should pass successfully

Additional info:
Log file with further info attached

Comment 2 Jeff Teehan 2016-10-11 19:50:36 UTC
I haven't been able to get it to work either, using my go-to VMs for this. Here is the output for Ubuntu, RHEL, and Windows. All of these worked in 5.6.1.

	Status = Error	10/11/16 16:43:58 UTC	10/11/16 15:52:54 UTC	10/11/16 15:52:42 UTC	finished	job timed out after 3048.5864924 seconds of inactivity. Inactivity threshold [3000 seconds]	Scan from Vm RHEL72SSA	admin	EVM	
	Status = Error	10/11/16 15:49:58 UTC	10/11/16 14:59:21 UTC	10/11/16 14:59:18 UTC	finished	job timed out after 3022.1714621 seconds of inactivity. Inactivity threshold [3000 seconds]	Scan from Vm WS2012R2SSA	admin	EVM	
	Status = Error	10/11/16 15:28:53 UTC	10/11/16 14:38:16 UTC	10/11/16 14:38:07 UTC	finished	job timed out after 3022.2359251 seconds of inactivity. Inactivity threshold [3000 seconds]	Scan from Vm UBU1404	admin	EVM
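
To compare how far each job ran past its inactivity threshold, the error messages above can be parsed with a small script. This is only a sketch for reading the pasted task rows: `TIMEOUT_RE` and `parse_timeout` are hypothetical names, not part of CFME, and the message format is assumed to match the lines quoted in this bug.

```ruby
# Hypothetical helper (not CFME code): pull the elapsed time and the
# inactivity threshold out of a task error message quoted in this bug.
TIMEOUT_RE = /job timed out after ([\d.]+) seconds of inactivity\. Inactivity threshold \[(\d+) seconds\]/

def parse_timeout(message)
  match = TIMEOUT_RE.match(message)
  return nil unless match # message is not an inactivity timeout

  elapsed = match[1].to_f
  threshold = match[2].to_i
  { elapsed: elapsed, threshold: threshold, overshoot: (elapsed - threshold).round(2) }
end

messages = [
  "job timed out after 355.310125967 seconds of inactivity. Inactivity threshold [300 seconds]",
  "job timed out after 3048.5864924 seconds of inactivity. Inactivity threshold [3000 seconds]"
]

# Print one parsed hash per message.
messages.each { |m| p parse_timeout(m) }
```

The container image scan in the description reports a 300 second threshold, while the Azure VM scans report 3000 seconds, so the two sets of failures were running under different timeout settings.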

Comment 3 Rich Oliveri 2016-10-11 20:17:29 UTC
Just to be clear, you're trying to perform SSA on a container, not a VM, correct?

Comment 4 Jeff Teehan 2016-10-11 20:32:26 UTC
My three were Azure VMs

Comment 5 Rich Oliveri 2016-10-11 20:39:05 UTC
Jeff, yes but the original description says "container image", so it might not be related to what you're seeing on Azure.

Azure will throttle requests based on usage, which can cause this problem. Errors in the log may shed more light on it.

If the problems are related, it would have to be at a very high level (like an appliance issue) because the 2 code paths are very different.

Comment 6 Pavel Zagalsky 2016-10-13 06:42:30 UTC
My attempts were on a Container Image

Comment 8 Pavel Zagalsky 2016-11-07 14:21:41 UTC
I do not remember, but I will try to test it again

Comment 9 Pavel Zagalsky 2016-11-14 11:48:36 UTC
Created attachment 1220366
OpenSCAP Fail 5.7.10

Comment 10 Pavel Zagalsky 2016-11-14 11:49:32 UTC
Erez, I added an updated log from an image scan on 5.7.10

Comment 11 Pavel Zagalsky 2016-11-16 09:55:21 UTC
I checked it again on 5.7.0.10 with a new OpenShift setup and got the following while trying to scan an nginx image pulled from Docker.io:

[----] E, [2016-11-16T04:41:39.731551 #2871:899144] ERROR -- : Q-task_id([d34c1cc2-abe0-11e6-a2c0-001a4a1697bb]) MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::Scanning::Job#process_abort) job aborting, cannot analyze non docker images
[----] E, [2016-11-16T04:41:39.756713 #2871:899144] ERROR -- : Q-task_id([d34c1cc2-abe0-11e6-a2c0-001a4a1697bb]) MIQ(MiqQueue#deliver) Message id: [24904], Error: [undefined method `[]' for nil:NilClass]
[----] E, [2016-11-16T04:41:39.756921 #2871:899144] ERROR -- : Q-task_id([d34c1cc2-abe0-11e6-a2c0-001a4a1697bb]) [NoMethodError]: undefined method `[]' for nil:NilClass  Method:[rescue in deliver]
[----] E, [2016-11-16T04:41:39.757103 #2871:899144] ERROR -- : Q-task_id([d34c1cc2-abe0-11e6-a2c0-001a4a1697bb]) /var/www/miq/vmdb/app/models/manageiq/providers/kubernetes/container_manager/scanning/job.rb:195:in `cleanup'
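
One plausible reading of the log above is that the job aborted early ("cannot analyze non docker images") and `cleanup` then indexed into a value that had never been set. The actual code at job.rb:195 is not shown in this bug, so the sketch below only illustrates the general failure mode and one way to guard against it. The `options` hash and the `:pod` / `:name` keys are hypothetical.

```ruby
# Illustrative only: not the real ManageIQ cleanup method.
# `options`, :pod and :name are made-up names for this sketch.

def cleanup_unguarded(options)
  # Raises NoMethodError when options[:pod] is nil, because nil has no [] method.
  options[:pod][:name]
end

def cleanup_guarded(options)
  # Hash#dig returns nil instead of raising when an intermediate key is missing.
  pod_name = options.dig(:pod, :name)
  return :nothing_to_clean if pod_name.nil?

  "deleting #{pod_name}"
end

begin
  cleanup_unguarded({})
rescue NoMethodError => e
  puts "unguarded cleanup failed: #{e.class}"
end

p cleanup_guarded({})
p cleanup_guarded({ pod: { name: "scan-pod" } })
```

With a guard like this, an aborted job would report only the original abort reason instead of a second, less informative NoMethodError from cleanup.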

Comment 14 Jaroslav Henner 2016-12-19 12:22:24 UTC
I am not sure how relevant the following findings are for this bug, or whether this is a separate one, but after starting the SmartState Analysis, the CPU load average of the VM running OpenShift jumps from 0.5 to about 30. I saw the image_inspector process at the top of `top` sorted by CPU%. The VM stays laggy for quite a while (1 minute or more) after the image_inspector process seems to be gone; kswap and loop processes are active during that time.

Comment 15 Jaroslav Henner 2016-12-19 12:35:15 UTC
Created attachment 1233371
OSE VM findings

It seems that the high load is created by the image extraction process. Maybe the image scan timeout in CFME is caused by the VM being overloaded by the IO of the image scanning or extraction process.