Description of problem:
During a scale-out with mixed versions (undercloud updated and upgraded to RHOS13, scaling out a new node for RHOS12), the scale-out failed and the rabbit docker services are in an unhealthy state.

[root@controller-2 heat-admin]# rabbitmqctl
{"init terminating in do_boot",{badarg,[{erl_prim_loader,check_file_result,3,[]},{init,get_boot,1,[]},{init,get_boot,2,[]},{init,do_boot,3,[]}]}}
init terminating in do_boot ()
[root@controller-2 heat-admin]# rabbitmqctl status
{"init terminating in do_boot",{badarg,[{erl_prim_loader,check_file_result,3,[]},{init,get_boot,1,[]},{init,get_boot,2,[]},{init,do_boot,3,[]}]}}
init terminating in do_boot ()

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.15-3.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy RHOS12
2. Update and upgrade the undercloud to RHOS13
3. Perform a scale-out of an additional compute node

Actual results:
Scale out failed

openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerDeployment_Step2.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: a6d8d8b3-88b0-442c-9ea0-84990354cb57
  status: UPDATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq/Pacemaker::Resource::Ocf[rabbitmq]/Pcmk_resource[rabbitmq]/ensure: change from absent to present failed: pcs -f /var/lib/pacemaker/cib/puppet-cib-backup20180422-857912-1xds9iz create failed: Error: Resource 'rabbitmq-clone' does not exist",
            "Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq/Exec[rabbitmq-ready]: Skipping because of failed dependencies",
            "Error: Failed to apply catalog: Command is still failing after 180 seconds expired!"
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/d97aba07-1381-4822-840c-f57cb619d31f_playbook.retry

    PLAY RECAP *********************************************************************
    localhost : ok=4 changed=1 unreachable=0 failed=1

    (truncated, view all with --long)
  deploy_stderr: |

Expected results:
Scale out should work on RHOS12 from an undercloud on RHOS13

Additional info:
(In reply to Ronnie Rasouli from comment #0)
> [root@controller-2 heat-admin]# rabbitmqctl
> {"init terminating in
> do_boot",{badarg,[{erl_prim_loader,check_file_result,3,[]},{init,get_boot,1,
> []},{init,get_boot,2,[]},{init,do_boot,3,[]}]}}
> init terminating in do_boot ()

At first glance this looks like a file/directory permissions issue. The Erlang VM cannot load bytecode for some reason. I'm looking at this.
OK, first of all: an upgrade from 3.6.5 to 3.6.15 is not supported by upstream because of compatibility-breaking changes in 3.6.7. The entire cluster must be upgraded simultaneously (all nodes stopped, all nodes upgraded, then all nodes started). See this changelog for further details:

https://github.com/rabbitmq/rabbitmq-server/releases/tag/rabbitmq_v3_6_7

Also, there is indeed a permissions issue, as shown in the file sosreport-controller-2-20180422141037/sos_commands/rabbitmq/docker_exec_-t_rabbitmq-bundle-docker-2_rabbitmqctl_report:

Error: Failed to initialize erlang distribution: {{shutdown,
    {failed_to_start_child, auth,
        {"Error when reading /var/lib/rabbitmq/.erlang.cookie: eacces",
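To make the version constraint above concrete, here is a minimal sketch of a check for whether an upgrade path crosses the 3.6.7 boundary and therefore needs the whole cluster stopped first. The function names version_lt and crosses_boundary are hypothetical helpers, not part of any RabbitMQ tooling, and the sketch assumes plain three-part upstream versions without the RPM release suffix (3.6.5, not 3.6.5-5.el7ost).

```shell
#!/bin/sh
# Hypothetical helpers (not part of RabbitMQ): decide whether an upgrade
# from OLD to NEW crosses a compatibility boundary (3.6.7 by default).

# version_lt A B: succeeds when dotted version A sorts strictly before B.
# Assumes three numeric components, e.g. 3.6.5.
version_lt() {
    [ "$1" = "$2" ] && return 1
    first=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
    [ "$first" = "$1" ]
}

# crosses_boundary OLD NEW [BOUNDARY]: succeeds when OLD < BOUNDARY <= NEW,
# i.e. when a rolling (node-by-node) upgrade would mix incompatible versions.
crosses_boundary() {
    boundary="${3:-3.6.7}"
    version_lt "$1" "$boundary" && ! version_lt "$2" "$boundary"
}
```

For the versions in this bug, `crosses_boundary 3.6.5 3.6.15` succeeds, which matches the statement that mixing these versions in one cluster is unsupported.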
OK, preliminary results. I really don't know how it worked before. At the very least, /var/lib/rabbitmq/.erlang.cookie is required to have a proper access mode. So I ran the following command on every node:

docker exec -it <docker id> chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie

At least it restores rabbitmqctl operation within the container.
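A quick way to confirm whether the cookie is usable before and after the chown could look like the sketch below. check_cookie is a hypothetical helper, not an existing tool. It relies on the assumption that Erlang wants the cookie readable by the user running the node and not accessible by group or others. It is meant to be run inside the container as the rabbitmq user, since a readability test run as root always passes.

```shell
#!/bin/sh
# Hypothetical helper: report whether an Erlang cookie file looks usable
# for the current user. Prints one of: ok, missing, unreadable, too-open.
check_cookie() {
    cookie="$1"
    if [ ! -e "$cookie" ]; then
        echo "missing"
        return 1
    fi
    # Note: this test always passes when run as root.
    if [ ! -r "$cookie" ]; then
        echo "unreadable"
        return 1
    fi
    # Characters 5-10 of the ls mode string are the group and other bits;
    # all of them must be unset for an owner-only file.
    if [ "$(ls -ld "$cookie" | cut -c5-10)" != "------" ]; then
        echo "too-open"
        return 1
    fi
    echo "ok"
}
```

In this bug the expected result before the fix would be "unreadable" for /var/lib/rabbitmq/.erlang.cookie, given the eacces error in the sosreport.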
Correction: the version is rabbitmq-server-3.6.5-5.el7ost.noarch
Looks like an indirect fix.