Bug 1254564

Summary:	SmartState times out if snapshot creation takes too long
Product:	Red Hat CloudForms Management Engine	Reporter:	Christian Jung <cbolz>
Component:	SmartState Analysis	Assignee:	Hui Song <hsong>
Status:	CLOSED ERRATA	QA Contact:	Ramesh A <rananda>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.4.0	CC:	bascar, cpelland, hsong, jhardy, mfeifer, ncatling, obarenbo, roliveri, tcarlin
Target Milestone:	GA	Keywords:	ZStream
Target Release:	5.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	5.5.0.1	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	1259809 (view as bug list)		Environment:
Last Closed:	2015-12-08 13:27:43 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1259809

Description Christian Jung 2015-08-18 12:01:28 UTC

Description of problem:
We are using OSP6 with CF 3.2 with Ceph as a backend storage. Creating a snapshot in this particular storage configuration takes about 20 minutes.

This triggers a timeout in CloudForms. Since the snapshot creation fails, the SmartState does not complete, also.


Version-Release number of selected component (if applicable):
cfme-5.4.1.0-1.el6cf.x86_64

How reproducible:
always

Steps to Reproduce:
1. create an instance in OSP and make sure that snapshot creation takes more than 5 minutes
2. run a SmartState Analysis task
3. task will fail

Actual results:
[----] I, [2015-08-17T15:48:36.779019 #37612:fc1ea8] INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [49000001671049], MiqWorker id: [49000000001775], Zone: [default], Role: [smartstate], Server: [], Ident: [generic], Target id: [], Insta
nce id: [49000000000023], Task id: [], Command: [Job.signal], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [:abort, "job timed out after 346.399386357 seconds of inactivity. Inactivity threshold [3000
seconds]", "error"], Dequeued in: [5.638601999] seconds
[----] I, [2015-08-17T15:48:36.779148 #37612:fc1ea8] INFO -- : MIQ(MiqQueue.deliver) Message id: [49000001671049], Delivering...
[----] I, [2015-08-17T15:48:36.791850 #37612:fc1ea8] INFO -- : MIQ(Event.raise_evm_event): Event Raised [vm_scan_abort]
[----] I, [2015-08-17T15:48:36.860858 #37612:fc1ea8] INFO -- : MIQ(Event.raise_evm_event): Alert for Event [vm_scan_abort]
[----] I, [2015-08-17T15:48:36.861076 #37612:fc1ea8] INFO -- : MIQ(MiqAlert.evaluate_alerts) [vm_scan_abort] Target: VmOpenstack Name: [cfme167], Id: [49000000000374]
[----] E, [2015-08-17T15:48:36.867684 #37612:fc1ea8] ERROR -- : action-abort: job aborting, job timed out after 346.399386357 seconds of inactivity. Inactivity threshold [3000 seconds]
[----] I, [2015-08-17T15:48:36.878189 #37612:fc1ea8] INFO -- : action-finished: job finished, job timed out after 346.399386357 seconds of inactivity. Inactivity threshold [3000 seconds]

Expected results:


Additional info:

Comment 5 Rich Oliveri 2015-09-03 14:55:55 UTC

changed to use correct timeout value for fleecing job on openstack provider

https://github.com/ManageIQ/manageiq/pull/4126

Comment 6 Christian Jung 2015-09-09 15:46:49 UTC

HiHo,
I applied the provided fix to my test appliance. Now I see a different error message:
Failed to create evm snapshot with EMS. Error: []: [execution expired] 

I'll upload the full logs.

Regards,
Christian

Comment 10 Rich Oliveri 2015-10-26 16:26:16 UTC

*** Bug 1259024 has been marked as a duplicate of this bug. ***

Comment 11 Ramesh A 2015-11-02 16:31:01 UTC

Good to go.  Verified and working fine in 5.5.0.8-beta1.4.20151027164951_4ab7fea.

This was verified through the commit change but in terms of actually verifying that it handles slow snapshots, we cannot test that at the moment.  QE is investigating ways to freeze the snapshot process to force the issue but we are not there yet, scanning works, and do not view this as a blocker to release. 

Procedure followed to timeout scenario:
=======================================
Disabled the smart proxy role and performed a SSA.  Got timeout exception after 3012.133582419 seconds of inactivity.  Please find the log snippet below.


evm.log snippet:
================
[----] I, [2015-11-02T07:09:34.858221 #3065:41f988]  INFO -- : MIQ(MiqQueue.put) Message id: [4136],  id: [], Zone: [default], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"ManageIQ::Providers::Openstack::CloudManager::Vm", :object_id=>121, :attrs=>{:event_type=>"vm_scan_abort", "VmOrTemplate::vm"=>121, :vm_id=>121, :host=>nil, "MiqEvent::miq_event"=>303, :miq_event_id=>303, "EventStream::event_stream"=>303, :event_stream_id=>303}, :instance_name=>"Event", :user_id=>1, :miq_group_id=>1, :tenant_id=>1, :automate_message=>nil}]
[----] E, [2015-11-02T07:09:34.858518 #3065:41f988] ERROR -- : MIQ(VmScan#process_abort) job aborting, job timed out after 3012.133582419 seconds of inactivity.  Inactivity threshold [3000 seconds]
[----] I, [2015-11-02T07:09:34.947919 #3065:41f988]  INFO -- : MIQ(VmScan#process_finished) job finished, job timed out after 3012.133582419 seconds of inactivity.  Inactivity threshold [3000 seconds]
[----] I, [2015-11-02T07:09:34.993475 #3065:41f988]  INFO -- : MIQ(VmScan#dispatch_finish) Dispatch Status is 'finished'

Comment 13 errata-xmlrpc 2015-12-08 13:27:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2551