Bug 1367473

Summary: SmartState Analysis not working for container images
Product: Red Hat CloudForms Management Engine
Component: SmartState Analysis
Version: 5.6.0
Reporter: Prasad Mukhedkar <pmukhedk>
Assignee: Rich Oliveri <roliveri>
QA Contact: Dave Johnson <dajohnso>
CC: cpelland, jhardy, mtayer, obarenbo, pmukhedk
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Target Milestone: GA
Target Release: 5.7.0
Hardware: x86_64
OS: Linux
Whiteboard: container
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2016-08-19 08:00:42 UTC

Description Prasad Mukhedkar 2016-08-16 13:35:37 UTC
SmartState Analysis for container images is not working. The fleecing task gets stuck in "waiting_to_start" status indefinitely. In the log I see the following:

 [----] W, [2016-08-11T09:34:32.938155 #3020:b09988]  WARN -- : Q-task_id([job_dispatcher]) MIQ(JobProxyDispatcher#dispatch_to_ems) SKIPPING remaining Container Image scan jobs for Ext Management System [99000000000001] in dispatch since there are [3] active scans in zone [default]

This is what I see in the database:

vmdb_production=# select guid,state,status,message,name,dispatch_status from jobs where dispatch_status='active';
                 guid                 |      state      | status |      message      |           name           | dispatch_status 
--------------------------------------+-----------------+--------+-------------------+--------------------------+-----------------
 7e4d8d06-48de-11e6-9c8c-005056957282 | waiting_to_scan | ok     | process initiated | Container image analysis | active
 7e497702-48de-11e6-9c8c-005056957282 | waiting_to_scan | ok     | process initiated | Container image analysis | active
 7e4bbc2e-48de-11e6-9c8c-005056957282 | waiting_to_scan | ok     | process initiated | Container image analysis | active
(3 rows)


vmdb_production=# select guid,state,status,message,name,dispatch_status from jobs where dispatch_status!='active';
                 guid                 |      state       | status |      message      |           name           | dispatch_status 
--------------------------------------+------------------+--------+-------------------+--------------------------+-----------------
 00759e7c-5a5f-11e6-872e-005056957282 | waiting_to_start | ok     | process initiated | Container image analysis | pending
 29434312-5a60-11e6-872e-005056957282 | waiting_to_start | ok     | process initiated | Container image analysis | pending
(394 rows)
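The dispatcher warning above shows new scans being skipped because three scans are already active in the zone, so the pending backlog keeps growing. One way to confirm the shape of that backlog from a Rails console would be something like `Job.group(:state, :dispatch_status).count` (assuming the standard Job model); below is a self-contained plain-Ruby sketch of the same grouping, with hypothetical stand-in data matching the row counts above:

```ruby
# Plain-Ruby stand-in for rows from the jobs table (hypothetical data,
# not the real ActiveRecord model).
Job = Struct.new(:state, :dispatch_status)

jobs = [Job.new("waiting_to_scan", "active")] * 3 +
       [Job.new("waiting_to_start", "pending")] * 394

# Mirrors Job.group(:state, :dispatch_status).count
counts = jobs.group_by { |j| [j.state, j.dispatch_status] }
             .transform_values(&:size)

puts counts.inspect
```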

Other errors/messages in the logs:

[----] I, [2016-08-12T06:27:35.161662 #8285:b09988]  INFO -- : MIQ(MiqGenericWorker::Runner) ID [99000000031743] PID [8285] GUID [7fcc106a-6041-11e6-872e-005056957282] Exit request received. Worker exiting.
------------------------

[----] I, [2016-08-11T08:26:11.311503 #25633:b09988]  INFO -- : MIQ(ManageIQ::Providers::OpenshiftEnterprise::ContainerManager::MetricsCollectorWorker::Runner) ID [99000000028351] PID [25633] GUID [6ece4a2c-5f8c-11e6-872e-005056957282] Exit request received. Worker exiting.


----------

[----] E, [2016-08-11T07:13:18.040908 #11818:b09988] ERROR -- : MIQ(Job.check_jobs_for_timeout) Couldn't find VmOrTemplate with 'id'=99000000000003
[----] I, [2016-08-11T07:14:10.374479 #11845:b09988]  INFO -- : MIQ(MiqQueue.put) Message id: [99000002812075],  id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [Job.check_jobs_for_timeout], Timeout: [600], Priority: [90], State: [ready], Deliver On: [], Data: [], Args: []

Can we remove the jobs from the database? Would that help?
We don't have conclusive information in the logs to understand
why the execution of the active tasks is failing. I don't see
any timeout either.

Customer database restored on: 10.65.200.236 (root:smartvm)

Comment 2 Mooli Tayer 2016-08-17 09:04:41 UTC
Prasad is this a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1366143 ?

That happens if we have three failed jobs already stuck in the queue but their status isn't reported correctly.

Comment 4 Mooli Tayer 2016-08-17 12:32:35 UTC
Quick fix[1]:
cd /var/www/miq/vmdb/
source /etc/default/evm
bin/rails c

irb(main):016:0> Job.update(:state => 'finished')
irb(main):014:0> Job.destroy_all

[1] since only "finished" or "waiting_to_start" jobs can be deleted.

Comment 6 Mooli Tayer 2016-08-17 12:41:08 UTC
(In reply to Mooli Tayer from comment #4)
> Quick fix[1]:
> cd /var/www/miq/vmdb/
> source /etc/default/evm
> bin/rails c
> 
> irb(main):016:0> Job.update(:state => 'finished')
> irb(main):014:0> Job.destroy_all
> 
> [1] since only "finished" or "waiting_to_start" jobs can be deleted.

Actually, that's very bad. I copied it from what I provided to QE.

We don't want to delete all of a customer's job history;
just update and delete the jobs that are stuck.
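The correction in comment 6 amounts to scoping the update and delete to only the stuck rows instead of the whole jobs table. In a Rails console that would look something like `Job.where(:state => 'waiting_to_scan', :dispatch_status => 'active')` followed by a targeted update and destroy (hedged sketch, not verified against this appliance). Since the real model can't run standalone, here is a self-contained plain-Ruby illustration of that filtering, with hypothetical stand-in data:

```ruby
# Plain-Ruby stand-in for the ActiveRecord Job model (hypothetical data).
Job = Struct.new(:guid, :state, :dispatch_status)

jobs = [
  Job.new("7e4d8d06", "waiting_to_scan",  "active"),   # stuck mid-scan
  Job.new("7e497702", "waiting_to_scan",  "active"),   # stuck mid-scan
  Job.new("00759e7c", "waiting_to_start", "pending"),  # queued; leave alone
  Job.new("deadbeef", "finished",         "finished"), # history; leave alone
]

# Mirrors Job.where(:state => 'waiting_to_scan', :dispatch_status => 'active')
stuck = jobs.select { |j| j.state == "waiting_to_scan" && j.dispatch_status == "active" }

# Mark only the stuck jobs finished (per the footnote, only "finished" or
# "waiting_to_start" jobs can be deleted), then remove just those rows,
# keeping the rest of the customer's job history intact.
stuck.each { |j| j.state = "finished" }
jobs.reject! { |j| stuck.include?(j) }

puts jobs.map(&:guid).inspect   # surviving job history
```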