Bug 1571278

Summary:	Ansible job failed after restoring the database
Product:	Red Hat CloudForms Management Engine	Reporter:	Neha Chugh <nchugh>
Component:	Appliance	Assignee:	Joe Rafaniello <jrafanie>
Status:	CLOSED NOTABUG	QA Contact:	Dave Johnson <dajohnso>
Severity:	medium	Docs Contact:
Priority:	high
Version:	5.9.0	CC:	abellott, bascar, cpelland, dclarizi, jrafanie, lavenel, ncarboni, nchugh, obarenbo
Target Milestone:	GA	Keywords:	Reopened
Target Release:	5.9.3
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-05-18 15:32:58 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1561041

Description Neha Chugh 2018-04-24 12:38:45 UTC

Description of problem:
After restoring the database, ordering a service catalog item based on an Ansible playbook fails with "Job launching failed".

Version-Release number of selected component (if applicable):
5.8.1.5

How reproducible:
Reproduced in the customer environment.

Steps to Reproduce:

1. After restoring the CFME appliance, ordering the service catalog item based on an Ansible playbook fails with the following exception at the "check_completed" stage:

[----] I, [2018-04-20T06:14:32.204298 #2923:53cdd4]  INFO -- : Q-task_id([service_template_provision_task_1000000000064]) <AEMethod check_completed> Starting check_completed
[----] E, [2018-04-20T06:14:32.348547 #2923:53cdd4] ERROR -- : Q-task_id([service_template_provision_task_1000000000064]) <AEMethod check_completed> Error in check completed: Job launching failed
[----] I, [2018-04-20T06:14:32.374230 #2923:53cdd4]  INFO -- : Q-task_id([service_template_provision_task_1000000000064]) <AEMethod check_completed> Ending check_completed
[----] I, [2018-04-20T06:14:32.392113 #2923:54b140]  INFO -- : Q-task_id([service_template_provision_task_1000000000064]) <AEMethod [/ManageIQ/Service/Generic/StateMachines/GenericLifecycle/check_completed]> Ending

We have checked the Git repository paths, credentials, repositories, and playbooks. All appear to be in sync with the original environment, but the catalog item still fails after the service is ordered.



Actual results:

A "Job launching failed" exception is raised after ordering the service.

Expected results:
The service should execute without any exception.

Additional info:
After restoring the environment, the "Embedded Ansible" role was not enabled by default, so we enabled it manually. We are not sure whether this is the intended behavior.

Comment 3 Joe Rafaniello 2018-04-24 16:25:42 UTC

Hi Neha,

Embedded Ansible uses the host setting in config/database.yml for the awx database. If the appliance you are backing up also hosts the database, you need to back up and then restore the awx database along with the vmdb_production database.
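A minimal sketch of backing up both databases together, assuming a logical pg_dump/pg_restore backup (what the 4.5 internal backup uses), that the awx database is literally named `awx`, that the commands run as the local `postgres` user, and that `/tmp/db_backup` is a hypothetical backup directory. It only builds the command lines and does not run them, and the restore side assumes the target databases already exist.

```python
from __future__ import annotations

from pathlib import Path

# Databases named in this bug; both must be backed up and restored together
# when Embedded Ansible is in use. "awx" as the literal database name is an
# assumption.
DATABASES = ["vmdb_production", "awx"]


def dump_commands(backup_dir: str, user: str = "postgres") -> list[list[str]]:
    """Build one pg_dump command per database, in custom archive format."""
    cmds = []
    for db in DATABASES:
        target = Path(backup_dir) / f"{db}.dump"
        cmds.append(["pg_dump", "--username", user, "--format=custom",
                     "--file", str(target), db])
    return cmds


def restore_commands(backup_dir: str, user: str = "postgres") -> list[list[str]]:
    """Build one pg_restore command per database.

    --clean drops existing objects before recreating them; the target
    databases are assumed to already exist (no --create).
    """
    cmds = []
    for db in DATABASES:
        source = Path(backup_dir) / f"{db}.dump"
        cmds.append(["pg_restore", "--username", user, "--dbname", db,
                     "--clean", str(source)])
    return cmds


if __name__ == "__main__":
    # Print the commands instead of executing them; once verified on the
    # appliance, each list can be passed to subprocess.run(cmd, check=True).
    for cmd in dump_commands("/tmp/db_backup") + restore_commands("/tmp/db_backup"):
        print(" ".join(cmd))
```

Keeping both databases in one list makes it harder to back up vmdb_production while forgetting awx, which is the situation described in this bug.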

Also, if you want to replace the appliance but keep the original appliance's identity (and its assigned roles, such as Embedded Ansible), you need to back up and restore the GUID file in /var/www/miq/vmdb, as described in section 1.2, "Backing Up Current Appliances", of https://access.redhat.com/documentation/en-us/red_hat_cloudforms/4.6/html/migrating_to_red_hat_cloudforms_4.6/index#backup_45-46

Comment 4 Joe Rafaniello 2018-04-25 18:20:26 UTC

Neha, I just realized I provided the 4.6 documentation. Here is the 4.5 documentation showing how to do a binary backup, which will also back up the awx database:

https://access.redhat.com/documentation/en-us/red_hat_cloudforms/4.5/html/general_configuration/configuration#binary-backup-and-restore-database

In 4.6, we recommend using pg_basebackup if you're using Embedded Ansible, since it will "Preserve data at the file system level by performing a binary backup. This includes all databases, users and roles, and other objects."
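A hedged sketch of the binary approach: the function below only builds a pg_basebackup command line. The target directory, the `postgres` user, and the option set (tar format, gzip compression, WAL fetched at the end of the backup) are assumptions, not the appliance's own backup settings; check them against the pg_basebackup reference for the PostgreSQL version on the appliance. Because pg_basebackup copies the whole cluster, the awx database is included without being named.

```python
from __future__ import annotations

import shlex


def basebackup_command(target_dir: str, user: str = "postgres") -> list[str]:
    """Build a pg_basebackup command for a binary backup of the whole cluster.

    -U        database user to connect as (assumed: postgres)
    -D        directory to write the backup to (created if absent, else must be empty)
    -F t      tar format output
    -z        gzip-compress the tar output
    -X fetch  collect the WAL needed for a consistent backup at the end
    """
    return [
        "pg_basebackup",
        "-U", user,
        "-D", target_dir,
        "-F", "t",
        "-z",
        "-X", "fetch",
    ]


if __name__ == "__main__":
    # Shell-quoted for display only; run it with subprocess.run(cmd, check=True).
    print(shlex.join(basebackup_command("/tmp/basebackup")))
```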

In addition, you should back up and restore the appliance identity files, GUID and REGION, as described here (otherwise you won't inherit the existing server roles):

https://access.redhat.com/documentation/en-us/red_hat_cloudforms/4.5/html/migrating_to_red_hat_cloudforms_4.5/index#backup_42-45
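The identity files can be preserved with a plain file copy. This is a minimal sketch assuming the files are named `GUID` and `REGION` and live directly in /var/www/miq/vmdb, as the comments in this bug indicate; the backup directory is a hypothetical placeholder, and the vmdb directory is a parameter so the sketch can be tried against a scratch directory first.

```python
from __future__ import annotations

import shutil
from pathlib import Path

VMDB_DIR = "/var/www/miq/vmdb"        # location named in this bug
IDENTITY_FILES = ("GUID", "REGION")   # appliance identity files


def backup_identity(backup_dir: str, vmdb_dir: str = VMDB_DIR) -> list[Path]:
    """Copy GUID and REGION from the appliance into backup_dir."""
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in IDENTITY_FILES:
        # copy2 also preserves timestamps and permission bits
        copied.append(Path(shutil.copy2(Path(vmdb_dir) / name, dest / name)))
    return copied


def restore_identity(backup_dir: str, vmdb_dir: str = VMDB_DIR) -> list[Path]:
    """Copy the saved GUID and REGION back onto the replacement appliance."""
    restored = []
    for name in IDENTITY_FILES:
        restored.append(
            Path(shutil.copy2(Path(backup_dir) / name, Path(vmdb_dir) / name))
        )
    return restored
```

Run `backup_identity` on the original appliance and `restore_identity` on the replacement; per the comment above, without these files the restored appliance does not inherit the existing server roles.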

In 4.5, our internal backup/restore still uses pg_dump/pg_restore, so you will also need to manually back up and restore the awx database, or use pg_basebackup to do everything.

Note that our internal backup/restore was moved to pg_basebackup in 4.6; see:
https://bugzilla.redhat.com/show_bug.cgi?id=1495192

Comment 5 Neha Chugh 2018-04-26 07:10:05 UTC

Thanks, Joe, for the update. I have updated the case with all the relevant BZ information, and we are currently waiting for the customer's response after they implement the changes.

I will update the BZ once I get confirmation from the customer, and will close the BZ if everything goes well.

Regards,
Neha Chugh

Comment 6 Joe Rafaniello 2018-05-03 18:18:04 UTC

Hi Neha,

Do you have an update on this case? Were they able to test restoring a CFME database appliance?

Comment 7 Neha Chugh 2018-05-04 03:59:02 UTC

Hello Joe,

I got an update from the customer; they are still checking the configuration on their end. I am still waiting for their confirmation.

I guess we can wait a day or so for their confirmation; otherwise we can close this BZ and reopen it if required.

Thank you once again for all your inputs.

Regards,
Neha Chugh

Comment 8 Joe Rafaniello 2018-05-09 21:49:41 UTC

I'll close this bug since we've not received a response. If it's still an issue, we can reopen.

Comment 9 Neha Chugh 2018-05-18 08:08:21 UTC

Hello Joe,

I got an update from the customer: they tried both logical and binary backups. The logical backup did not work for them, but the binary backup did, so they are fine with that.

The customer has encountered one more issue:

~~~

If you go to Services -> My Services in CloudForms in the restored environment, select one service from the active services, and go to the Provisioning tab, I don't see the standard output produced by an Ansible execution that happened before the backup, although this output is present in the original CloudForms instance where the backup was taken. It shows the error message: "Standard Output stdout capture is missing". Any idea why that is happening?


~~~

Do we have a solution for the "stdout capture is missing" error? I found a solution for Ansible Tower, but I am not sure whether it applies to Embedded Ansible.

Can you please suggest a solution?

Thanks,
Neha Chugh

Comment 10 Nick Carboni 2018-05-18 15:31:01 UTC

Standard out for ansible is stored on the filesystem, not in the database. So a database backup and restore is not expected to also include the stdout for previously run jobs.

Can you post the solution you found for Ansible Tower? It's likely that the same will apply here.

Comment 11 Nick Carboni 2018-05-18 15:32:58 UTC

Also, this is a very different issue than the original problem. I'm going to close this again, as the original issue was addressed. If the stdout issue is something we want to address in CF, please open another bug.