Bug 1439783 - Embedded Ansible role does not migrate cleanly to another appliance
Summary: Embedded Ansible role does not migrate cleanly to another appliance
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assignee: Nick Carboni
QA Contact: luke couzens
URL:
Whiteboard:
Depends On:
Blocks: 1460803
Reported: 2017-04-06 14:11 UTC by Jared Deubel
Modified: 2020-08-13 09:02 UTC

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1460803 (view as bug list)
Environment:
Last Closed: 2018-03-06 15:29:25 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:


Description Jared Deubel 2017-04-06 14:11:26 UTC
Description of problem:
When the embedded Ansible role migrates to another appliance, the setup script fails and the worker is killed (with the below error). This error is frequently


evm.log
=========================================================
[----] E, [2017-04-06T09:32:10.399965 #5289:8d2609c] ERROR -- : AwesomeSpawn: /opt/ansible-installer/setup.sh exit code: 2
[----] E, [2017-04-06T09:32:10.400062 #5289:8d2609c] ERROR -- : AwesomeSpawn:
[----] E, [2017-04-06T09:32:10.412155 #5289:8d2609c] ERROR -- : [AwesomeSpawn::CommandResultError]: /opt/ansible-installer/setup.sh exit code: 2  Method:[rescue in do_before_work_loop]
[----] E, [2017-04-06T09:32:10.415106 #5289:8d2609c] ERROR -- : /opt/rh/cfme-gemset/gems/awesome_spawn-1.4.1/lib/awesome_spawn.rb:105:in `run!'
/var/www/miq/vmdb/lib/embedded_ansible.rb:100:in `block in run_setup_script'
/var/www/miq/vmdb/lib/embedded_ansible.rb:110:in `with_inventory_file'
/var/www/miq/vmdb/lib/embedded_ansible.rb:93:in `run_setup_script'
/var/www/miq/vmdb/lib/embedded_ansible.rb:53:in `start'
/var/www/miq/vmdb/app/models/embedded_ansible_worker/runner.rb:37:in `setup_ansible'
/var/www/miq/vmdb/app/models/embedded_ansible_worker/runner.rb:13:in `do_before_work_loop'
/var/www/miq/vmdb/app/models/embedded_ansible_worker/runner.rb:7:in `prepare'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:127:in `start'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:21:in `start_worker'
/var/www/miq/vmdb/app/models/embedded_ansible_worker.rb:10:in `block in start_runner'
[----] E, [2017-04-06T09:32:10.990525 #5289:8d2609c] ERROR -- : AwesomeSpawn: /bin/systemctl exit code: 5
[----] E, [2017-04-06T09:32:10.992021 #5289:8d2609c] ERROR -- : AwesomeSpawn: Failed to stop postgresql-9.4.service: Unit postgresql-9.4.service not loaded.
[----] E, [2017-04-06T09:32:10.993829 #5289:8d2609c] ERROR -- : MIQ(EmbeddedAnsibleWorker::Runner) ID [1000000000107] PID [5289] GUID [329d8bc4-1acd-11e7-b2e4-525400806779] Error in before_exit: /bin/systemctl exit code: 5
=========================================================


Rake evm:status showing EmbeddedAnsibleWorker is started:
=================================================================================
[root@localhost vmdb]# rake evm:status
Checking EVM status...
 Zone    | Server | Status  |            ID |  PID |  SPID | URL                     | Started On           | Last Heartbeat       | Master? | Active Roles
---------+--------+---------+---------------+------+-------+-------------------------+----------------------+----------------------+---------+-------------------------------------------------------------------------------------------------------------------------
 default | EVM 2  | started | 1000000000002 | 2442 | 15582 | druby://127.0.0.1:36004 | 2017-04-06T13:55:00Z | 2017-04-06T14:01:43Z | false   | automate:database_operations:embedded_ansible:ems_operations:reporting:smartstate:user_interface:web_services:websocket

 Worker Type           | Status   |            ID |  PID | SPID  |     Server id | Queue Name / URL      | Started On           | Last Heartbeat
-----------------------+----------+---------------+------+-------+---------------+-----------------------+----------------------+----------------------
 EmbeddedAnsibleWorker | started  | 1000000000129 | 2442 | 15670 | 1000000000002 |                       | 2017-04-06T13:55:41Z | 2017-04-06T14:01:47Z
 MiqGenericWorker      | started  | 1000000000125 | 2750 | 15634 | 1000000000002 | generic               | 2017-04-06T13:55:06Z | 2017-04-06T14:01:56Z
 MiqGenericWorker      | started  | 1000000000124 | 2742 | 15633 | 1000000000002 | generic               | 2017-04-06T13:55:05Z | 2017-04-06T14:01:55Z
 MiqPriorityWorker     | started  | 1000000000127 | 2766 | 15636 | 1000000000002 | generic               | 2017-04-06T13:55:06Z | 2017-04-06T14:01:57Z
 MiqPriorityWorker     | started  | 1000000000126 | 2758 | 15635 | 1000000000002 | generic               | 2017-04-06T13:55:06Z | 2017-04-06T14:01:57Z
 MiqReportingWorker    | creating | 1000000000130 | 2804 |       | 1000000000002 | reporting             |                      | 2017-04-06T13:55:41Z
 MiqReportingWorker    | started  | 1000000000131 | 2808 | 15675 | 1000000000002 | reporting             | 2017-04-06T13:55:43Z | 2017-04-06T14:01:54Z
 MiqScheduleWorker     | started  | 1000000000128 | 2774 | 15637 | 1000000000002 |                       | 2017-04-06T13:55:07Z | 2017-04-06T14:01:57Z
 MiqUiWorker           | started  | 1000000000133 | 2837 |       | 1000000000002 | http://127.0.0.1:3000 | 2017-04-06T13:55:45Z | 2017-04-06T14:01:48Z
 MiqWebServiceWorker   | started  | 1000000000134 | 2847 |       | 1000000000002 | http://127.0.0.1:4000 | 2017-04-06T13:55:46Z | 2017-04-06T14:01:52Z
 MiqWebsocketWorker    | started  | 1000000000132 | 2829 |       | 1000000000002 | http://127.0.0.1:5000 | 2017-04-06T13:55:44Z | 2017-04-06T14:01:49Z
=================================================================================

Systemctl Status not showing EmbeddedAnsibleWorker:
=================================================================================
● evmserverd.service - EVM server daemon
   Loaded: loaded (/usr/lib/systemd/system/evmserverd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2017-04-06 09:52:29 EDT; 11min ago
 Main PID: 2442 (ruby)
   CGroup: /system.slice/evmserverd.service
           ├─2442 MIQ Server
           ├─2742 MIQ: MiqGenericWorker id: 1000000000124, queue: generic
           ├─2750 MIQ: MiqGenericWorker id: 1000000000125, queue: generic
           ├─2758 MIQ: MiqPriorityWorker id: 1000000000126, queue: generic
           ├─2766 MIQ: MiqPriorityWorker id: 1000000000127, queue: generic
           ├─2774 MIQ: MiqScheduleWorker id: 1000000000128
           ├─2808 MIQ: MiqReportingWorker id: 1000000000131, queue: reporting
           ├─2829 puma 3.3.0 (tcp://127.0.0.1:5000) [MIQ: Web Server Worker]
           ├─2837 puma 3.3.0 (tcp://127.0.0.1:3000) [MIQ: Web Server Worker]
           └─2847 puma 3.3.0 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
=================================================================================

Old EmbeddedAnsibleWorker rows are not being cleaned up in the miq_workers table:
=================================================================================
vmdb_production=# select status,started_on,stopped_on,last_heartbeat,pid,type from miq_workers where type = 'EmbeddedAnsibleWorker';
  status  |         started_on         | stopped_on |       last_heartbeat       |  pid  |         type          
----------+----------------------------+------------+----------------------------+-------+-----------------------
 starting | 2017-04-06 13:50:37.784141 |            | 2017-04-06 13:50:37.785323 |  5289 | EmbeddedAnsibleWorker
 started  | 2017-04-06 13:55:41.572457 |            | 2017-04-06 14:09:31.820392 |  2442 | EmbeddedAnsibleWorker
 started  | 2017-04-05 18:55:52.871573 |            | 2017-04-05 21:12:12.656854 | 12372 | EmbeddedAnsibleWorker
=================================================================================
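
One way to tell the leftover rows from the live one is by heartbeat age. A minimal Ruby sketch, assuming a row counts as stale when it has no stopped_on value and its last heartbeat is older than a chosen threshold. The `WorkerRow` struct, the `stale_rows` helper, the 10-minute threshold, and the reference time are illustrative assumptions, not part of the CFME code base; the timestamps are copied from the query output and treated as UTC:

```ruby
# Hypothetical stand-in for a miq_workers row; only the columns queried above.
WorkerRow = Struct.new(:status, :stopped_on, :last_heartbeat, :pid)

# Returns rows that were never marked stopped but have not sent a heartbeat
# within `threshold` seconds of `now`. The threshold is an assumption.
def stale_rows(rows, now:, threshold: 600)
  rows.select do |row|
    row.stopped_on.nil? && (now - row.last_heartbeat) > threshold
  end
end

rows = [
  WorkerRow.new("starting", nil, Time.utc(2017, 4, 6, 13, 50, 37), 5289),
  WorkerRow.new("started",  nil, Time.utc(2017, 4, 6, 14, 9, 31),  2442),
  WorkerRow.new("started",  nil, Time.utc(2017, 4, 5, 21, 12, 12), 12372),
]

# Reference time chosen just after the query above was run (assumption).
stale = stale_rows(rows, now: Time.utc(2017, 4, 6, 14, 10, 0))
puts stale.map(&:pid).inspect
```

With these inputs, only the row for PID 2442 (the running server process) has a recent heartbeat; the rows for PIDs 5289 and 12372 are reported as stale.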


Version-Release number of selected component (if applicable):
5.8.0.9-alpha2

Comment 2 luke couzens 2017-04-07 09:22:53 UTC
Hi Jared, what is your setup here, some sort of HA?

How are you trying to migrate embedded ansible?

Comment 3 Jared Deubel 2017-04-07 11:40:16 UTC
(In reply to luke couzens from comment #2)
> Hi Jared, what is your setup here, some sort of HA?
> 
> How are you trying to migrate embedded ansible?

The role is turned on in multiple appliances. When the original appliance that it was on goes down, the role migrates to the secondary appliance.

Comment 4 Nick Carboni 2017-04-17 15:51:12 UTC
I was not able to reproduce this issue using version 5.8.0.10-beta1.

These are the steps I followed:
1. Configure a 2 server installation (same region + zone)
2. Assign the embedded ansible role to both servers
3. Find the server the role is active on (for me it was Server 1 and I used the diagnostics tab Zone view)
4. Run `systemctl stop evmserverd` on Server 1
5. Observe that the role is started on Server 2

Also, I only see one row in the miq_workers table after this process.

Could you get the logs from the latest setup log in /var/log/tower?
That should give us a more verbose explanation of the error.

Comment 5 Nick Carboni 2017-04-19 15:15:13 UTC
Closing this as it works fine for me and we can't get the logs from the appliance that had the issue.

Comment 6 Dave Johnson 2017-04-24 03:52:39 UTC
Alex, we need to make sure we have a test case for the procedure in comment 4.

Comment 8 Nick Carboni 2017-06-02 20:12:32 UTC
I'm not sure why this would be a problem ...

I'm able to successfully enable the role, but I also don't have the postgresql-9.4 service on the system.

Will need the contents of /var/log/tower/setup*.log to figure out why we are trying to do something with that service.
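
For anyone collecting these logs, a minimal Ruby sketch of picking the most recent one. It assumes the files match `setup*.log` under /var/log/tower as described above; the `latest_setup_log` helper name is hypothetical:

```ruby
# Returns the most recently modified file matching setup*.log in `dir`,
# or nil when no such file exists. The helper name is illustrative only.
def latest_setup_log(dir = "/var/log/tower")
  Dir.glob(File.join(dir, "setup*.log")).max_by { |path| File.mtime(path) }
end

log = latest_setup_log
puts(log || "no setup log found")
```

Selecting by modification time rather than by file name avoids relying on how the installer names its log files, which is not stated in this report.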

Comment 12 Nick Carboni 2017-06-05 20:30:21 UTC
It seems like there might be some conflicting python dependencies on the system in question.

The line that failed seems to suggest that there are system packages installed that are not expected to be installed on our appliance.

The diff of the package list shows me the following python related packages as differences:
(My test appliance is the '<' side while the customer's appliance is the '>' side.)

424c506,507
< pyserial 2.6-5.el7
---
> pyparsing 1.5.6-9.el7
> pytalloc 2.1.6-1.el7
433d515
< python-crypto 2.6.1-7.el7
441d522
< python-ecdsa 0.11-4.el7
455a537
> python-kerberos 1.1-15.el7
457a540
> python-krbV 1.0.90-8.el7
468d550
< python-paramiko 1.15.2-3.el7
489a572
> python2-crypto 2.6.1-9.el7
490a574
> python2-ecdsa 0.13-4.el7
492a577
> python2-paramiko 1.16.1-1.el7

One of these packages could have caused the conflicting dependency, alternatively they can try `yum provides /usr/bin/six.rb` and uninstall the returned package.

Another option would be (if the customer put that file there themselves) to delete (or move) that file and see if that helps.

To rerun the setup playbook (which will retry the ansible tower setup) they will need to remove the `/etc/tower/SECRET_KEY` file from the appliance.

I would also recommend turning off the role in the meantime, as we will continue to attempt to start the tower services.

As an enhancement I'll change our embedded ansible code to only consider the role "configured" if we successfully ran the setup playbook. This will allow the setup to re-run without having to remove the SECRET_KEY file.

Comment 13 Nick Carboni 2017-06-05 21:18:35 UTC
Ah some issues with the last comment.
> One of these packages could have caused the conflicting dependency, alternatively they can try `yum provides /usr/bin/six.rb` and uninstall the returned package.

Should be `yum provides /usr/bin/six.py`

Also, it (clearly) looks like the file isn't actually there so this comment is also not helpful:

> Another option would be (if the customer put that file there themselves) to delete (or move) that file and see if that helps.

Maybe something in their PATH is misconfigured? Try getting the output of `env` from that appliance.

> To rerun the setup playbook (which will retry the ansible tower setup) they will need to remove the `/etc/tower/SECRET_KEY` file from the appliance.

Make that remove the contents of the file. It turns out things break if we remove it entirely.
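
Blanking the file while leaving it in place can be sketched in Ruby as follows. This is only an illustration run against a temporary file: the `blank_file` helper is a hypothetical name, and on an appliance the path would be `/etc/tower/SECRET_KEY` as mentioned above:

```ruby
require "tempfile"

# Empties the file at `path` without deleting it.
# Raises Errno::ENOENT if the file does not exist.
def blank_file(path)
  File.truncate(path, 0)
end

# Demonstration on a temporary file rather than the real key file.
Tempfile.create("SECRET_KEY") do |file|
  file.write("example-key")
  file.flush
  blank_file(file.path)
  puts File.exist?(file.path) # the file is still present
  puts File.size(file.path)   # and its size is now zero
end
```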

Comment 14 Nick Carboni 2017-06-06 13:42:15 UTC
This looks like a very similar issue to what the customer is seeing.

https://groups.google.com/forum/#!topic/ansible-project/Z6herrX4i78

Comment 16 CFME Bot 2017-06-06 15:56:32 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/42eb2f8deefcf4b5390d7ac31dbaa195f289afcf

commit 42eb2f8deefcf4b5390d7ac31dbaa195f289afcf
Author:     Nick Carboni <ncarboni>
AuthorDate: Mon Jun 5 17:49:01 2017 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Tue Jun 6 10:09:27 2017 -0400

    Handle additional case for /etc/tower/SECRET_KEY
    
    Previously if we had a value in the database, but the file
    didn't exist on the filesystem .configured? would raise an error
    when it should really just return false and we will write out
    the value from the database to the filesystem.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1439783
    https://bugzilla.redhat.com/show_bug.cgi?id=1458886

 lib/embedded_ansible.rb           | 1 +
 spec/lib/embedded_ansible_spec.rb | 6 ++++++
 2 files changed, 7 insertions(+)
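
The behavior described in this commit message can be sketched roughly as below. This is not the code from the commit: the standalone `configured?` function, its parameters, and the exact comparison are assumptions made for illustration; only the "return false when the file is missing" behavior comes from the message above:

```ruby
# Hypothetical sketch: reports true only when the key file exists and its
# contents match the key stored in the database. A missing file yields false
# instead of raising, so the caller can write the stored key back out.
def configured?(key_file, stored_key)
  return false unless File.exist?(key_file)
  return false if stored_key.nil? || stored_key.empty?

  File.read(key_file) == stored_key
end

# A value is stored, but the file is absent: false rather than an exception.
puts configured?("/nonexistent/SECRET_KEY", "example-key")
```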

Comment 17 CFME Bot 2017-06-06 15:56:44 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/f80e6dd559994125bce144f1ddcd6753324dcc69

commit f80e6dd559994125bce144f1ddcd6753324dcc69
Author:     Nick Carboni <ncarboni>
AuthorDate: Tue Jun 6 08:55:59 2017 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Tue Jun 6 10:09:31 2017 -0400

    Remove the secret key from the database when the setup fails
    
    This will force `.configured?` to false the next time `.start` is run
    allowing us to retry the configuration.
    
    Before this change, users would have to blank the SECRET_KEY file
    on the filesystem to force a retry.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1439783
    https://bugzilla.redhat.com/show_bug.cgi?id=1458886

 lib/embedded_ansible.rb           |  4 ++++
 spec/lib/embedded_ansible_spec.rb | 12 +++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)
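
A rough Ruby sketch of the retry behavior this commit message describes. The `KeyStore` struct stands in for the database record and `run_setup` for the setup step; both names, and the choice to return false instead of re-raising, are assumptions rather than the actual implementation:

```ruby
# Hypothetical in-memory stand-in for the database record holding the key.
KeyStore = Struct.new(:secret_key)

# Runs the given setup block. If it raises, the stored key is cleared so that
# a later "configured?" style check fails and setup is attempted again.
def run_setup(store)
  yield
  true
rescue StandardError
  store.secret_key = nil
  false
end

store = KeyStore.new("example-key")
ok = run_setup(store) { raise "setup.sh exit code: 2" }
puts ok.inspect               # setup reported as failed
puts store.secret_key.inspect # stored key has been cleared
```

Clearing the stored value instead of asking users to blank the file on disk keeps the retry decision in one place, which matches the intent stated in the message.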

Comment 20 luke couzens 2017-10-12 16:07:22 UTC
Verified in 5.9.0.2

