Bug 1570387 - [mixed_version] scale out failed, Failed to initialize erlang distribution on mixed versions
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
Assignee: Michele Baldessari
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-22 13:51 UTC by Ronnie Rasouli
Modified: 2018-05-29 08:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-29 08:34:47 UTC
Target Upstream Version:



Description Ronnie Rasouli 2018-04-22 13:51:45 UTC
Description of problem:
During a mixed-version scale out (undercloud updated and upgraded to RHOS13, scaling out a new node for RHOS12), the scale out failed and the rabbit docker services are in an unhealthy state.



[root@controller-2 heat-admin]# rabbitmqctl
{"init terminating in do_boot",{badarg,[{erl_prim_loader,check_file_result,3,[]},{init,get_boot,1,[]},{init,get_boot,2,[]},{init,do_boot,3,[]}]}}
init terminating in do_boot ()
[root@controller-2 heat-admin]# rabbitmqctl status
{"init terminating in do_boot",{badarg,[{erl_prim_loader,check_file_result,3,[]},{init,get_boot,1,[]},{init,get_boot,2,[]},{init,do_boot,3,[]}]}}
init terminating in do_boot ()

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.15-3.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy rhos12 
2. update and upgrade undercloud to RHOS13
3. perform scale out of additional compute node

Actual results:
Scale out failed 
openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerDeployment_Step2.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: a6d8d8b3-88b0-442c-9ea0-84990354cb57
  status: UPDATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq/Pacemaker::Resource::Ocf[rabbitmq]/Pcmk_resource[rabbitmq]/ensure: change from absent to present failed: pcs -f /var/lib/pacemaker/cib/puppet-cib-backup20180422-857912-1xds9iz create failed: Error: Resource 'rabbitmq-clone' does not exist",
            "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq/Exec[rabbitmq-ready]: Skipping because of failed dependencies",
            "Error: Failed to apply catalog: Command is still failing after 180 seconds expired!"
        ]
    }
    	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/d97aba07-1381-4822-840c-f57cb619d31f_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=4    changed=1    unreachable=0    failed=1

    (truncated, view all with --long)
  deploy_stderr: |


Expected results:
Scale out should work on RHOS12 from undercloud in RHOS13

Additional info:

Comment 2 Peter Lemenkov 2018-04-23 08:49:55 UTC
(In reply to Ronnie Rasouli from comment #0)

At first glance this looks like a file/directory permissions issue. The Erlang VM cannot load bytecode for some reason. I'm looking into this.

Comment 3 Peter Lemenkov 2018-04-23 09:02:36 UTC
OK, first of all: an upgrade from 3.6.5 to 3.6.15 is not supported by upstream due to compatibility-breaking changes in 3.6.7. The entire cluster must be upgraded simultaneously (all nodes stopped, all nodes upgraded, then all nodes started). See this changelog for further details:

https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_7
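To make the version constraint above concrete, here is a minimal shell sketch that checks whether an upgrade crosses the 3.6.7 boundary. The function names version_lt and needs_full_stop_upgrade are hypothetical helpers, not part of RabbitMQ or TripleO, and the sketch assumes GNU sort with -V support.

```shell
# Hypothetical helpers; assumes GNU coreutils (sort -V).

# version_lt A B: succeeds if version A sorts strictly before version B.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# needs_full_stop_upgrade FROM TO: succeeds if FROM is older than 3.6.7
# and TO is 3.6.7 or newer, i.e. the upgrade crosses the breaking release.
needs_full_stop_upgrade() {
    version_lt "$1" 3.6.7 && ! version_lt "$2" 3.6.7
}

# Example:
#   needs_full_stop_upgrade 3.6.5 3.6.15 && echo "stop all nodes first"
```

This only compares version strings; it does not inspect a running cluster.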

Also, there is indeed a permissions issue, as shown in the file sosreport-controller-2-20180422141037/sos_commands/rabbitmq/docker_exec_-t_rabbitmq-bundle-docker-2_rabbitmqctl_report:

Error: Failed to initialize erlang distribution: {{shutdown,
                                                   {failed_to_start_child,
                                                    auth,
                                                    {"Error when reading /var/lib/rabbitmq/.erlang.cookie: eacces",

Comment 4 Peter Lemenkov 2018-04-23 13:35:44 UTC
Ok, preliminary results. I really don't know how it worked before.

At the very least, /var/lib/rabbitmq/.erlang.cookie must have the proper ownership and access mode. So I ran the following command on every node:

  docker exec -it <docker id> chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie

At least it restores rabbitmqctl operation within the container.
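
To confirm the state of the cookie before or after the chown, something like the following sketch could be used. check_cookie is a hypothetical helper, and the stat format string assumes GNU stat; inside the container the same stat call can be run through docker exec.

```shell
# Hypothetical helper; assumes GNU stat (-c FORMAT).
# Prints "owner:group mode" for the cookie file, or fails if it is missing.
check_cookie() {
    cookie="${1:-/var/lib/rabbitmq/.erlang.cookie}"
    if [ ! -e "$cookie" ]; then
        echo "missing: $cookie"
        return 1
    fi
    # %U = owner name, %G = group name, %a = octal access mode
    stat -c '%U:%G %a' "$cookie"
}

# Inside a container (placeholder id as in the command above):
#   docker exec -it <docker id> stat -c '%U:%G %a' /var/lib/rabbitmq/.erlang.cookie
```

With the eacces error from comment 3, the thing to look for is whether the owner is the rabbitmq user and whether that owner can read the file.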

Comment 5 Ronnie Rasouli 2018-04-24 07:46:53 UTC
Correction: the version is
rabbitmq-server-3.6.5-5.el7ost.noarch

Comment 14 Ronnie Rasouli 2018-05-29 08:34:47 UTC
Looks like an indirect fix.

