Bug 1659072 - osp8 overcloud deployment fails with : epmd reports: node 'rabbit' not running at all
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Oyvind Albrigtsen
QA Contact: michal novacek
URL:
Whiteboard:
Depends On:
Blocks: 1647587 1692889
Reported: 2018-12-13 14:07 UTC by pkomarov
Modified: 2019-08-06 10:21 UTC
CC List: 13 users

Fixed In Version: resource-agents-4.1.1-16.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1692889
Environment:
Last Closed:
Target Upstream Version:


Links
System                                   ID       Priority  Status  Summary  Last Updated
Github ClusterLabs resource-agents pull  1276     None      None    None     2018-12-13 18:07:08 UTC
Red Hat Knowledge Base (Solution)        3756061  None      None    None     2018-12-14 09:19:36 UTC

Description pkomarov 2018-12-13 14:07:51 UTC
Description of problem:
OSP 8 overcloud deployment fails with: epmd reports: node 'rabbit' not running at all

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. HA deployment of OSP 8
[stack@undercloud-0 ~]$ cat core_puddle_version 
2018-11-15.1
[stack@undercloud-0 ~]$
[stack@undercloud-0 ~]$ rhos-release -L

Installed repositories (rhel-7.6):
  8-director
  8
  ceph-2
  ceph-osd-2
  rhel-7.6


[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'pcs status|grep FAILED'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED controller-0 (Monitoring)


controller-0 | SUCCESS | rc=0 >>
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED controller-0 (Monitoring)

controller-1 | SUCCESS | rc=0 >>
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED controller-0 (Monitoring)

[stack@undercloud-0 ~]$ openstack stack resource list overcloud|grep -v COMP
+-------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+
| resource_name                             | physical_resource_id                          | resource_type                                     | resource_status | updated_time        |
+-------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+
| ControllerNodesPostDeployment             | be5be2cf-1615-41a5-89d8-7c9b45af19b6          | OS::TripleO::ControllerPostDeployment             | CREATE_FAILED   | 2018-12-13T06:58:31 |
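+-------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+
[stack@undercloud-0 ~]$ openstack software deployment list |grep -v COMP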
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+
| id                                   | config_id                            | server_id                            | action | status   |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+
| c248efc4-c2a7-4641-a40f-323ff301d3e3 | d84dd9ef-d281-408f-a1f9-ae0ef2bfe9fd | 174e9834-1c28-4ec4-ab4d-b3c54dda5262 | CREATE | FAILED   |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+

[stack@undercloud-0 ~]$ openstack software config show d84dd9ef-d281-408f-a1f9-ae0ef2bfe9fd
The output is at the link below; it is essentially the post-deployment Pacemaker setup from puppet-tripleo:manifests/profile/base/pacemaker.pp.

http://pastebin.test.redhat.com/683353

On controller-0, from /var/log/messages:

Dec 13 05:29:31 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:31 controller-0 systemd: Started Session c14 of user rabbitmq.
Dec 13 05:29:32 controller-0 lrmd[8850]:  notice: rabbitmq_monitor_10000:20269:stderr [ /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster: line 206: 1 ]
Dec 13 05:29:32 controller-0 lrmd[8850]:  notice: rabbitmq_monitor_10000:20269:stderr [ ...done. * 2 : syntax error: invalid arithmetic operator (error token is "...done. * 2 ") ]
Dec 13 05:29:32 controller-0 crmd[8853]:  notice: controller-0-rabbitmq_monitor_10000:74 [ /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster: line 206: 1\n...done. * 2 : syntax error: invalid arithmetic operator (error token is "...done. * 2 ")\n ]
Dec 13 05:29:33 controller-0 crmd[8853]:  notice: Result of start operation for memcached on controller-0: 0 (ok)
Dec 13 05:29:34 controller-0 ntpd[7663]: 0.0.0.0 c61c 0c clock_step +0.424865 s
Dec 13 05:29:34 controller-0 ntpd[7663]: 0.0.0.0 c614 04 freq_mode
Dec 13 05:29:34 controller-0 systemd: Time has been changed
Dec 13 05:29:35 controller-0 ntpd[7663]: 0.0.0.0 c618 08 no_sys_peer
Dec 13 05:29:36 controller-0 crmd[8853]:  notice: controller-0-rabbitmq_monitor_10000:74 [ /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster: line 206: 1\n...done. * 2 : syntax error: invalid arithmetic operator (error token is "...done. * 2 ")\n ]
Dec 13 05:29:36 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:36 controller-0 systemd: Started Session c15 of user rabbitmq.
Dec 13 05:29:36 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:36 controller-0 systemd: Started Session c16 of user rabbitmq.
Dec 13 05:29:36 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:36 controller-0 systemd: Started Session c17 of user rabbitmq.
Dec 13 05:29:36 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:36 controller-0 systemd: Started Session c18 of user rabbitmq.
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ Error: unable to connect to node 'rabbit@controller-0': nodedown ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ DIAGNOSTICS ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ =========== ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ attempted to contact: ['rabbit@controller-0'] ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ rabbit@controller-0: ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [   * connected to epmd (port 4369) on controller-0 ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [   * epmd reports: node 'rabbit' not running at all ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [                   no other nodes on controller-0 ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [   * suggestion: start the node ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ current node details: ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ - node name: 'rabbitmqctl20809@controller-0' ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ - home dir: /var/lib/rabbitmq ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ - cookie hash: J9D9g2x1Jd/x/mEaud74Uw== ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]

Comment 3 John Eckersberg 2018-12-13 17:54:46 UTC
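This is because the rabbitmq-server version in OSP 8 formats the output of eval calls differently from the versions in newer OSP releases, and the resource agent feeds that output into a shell arithmetic expression (line 206 of rabbitmq-cluster in the monitor errors above).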

On 8:

[root@localhost ~]# rabbitmqctl eval 'testing.'
testing
...done.
[root@localhost ~]# 

On 9 through 14:
[root@localhost ~]# rabbitmqctl eval 'testing.'
testing
[root@localhost ~]# 
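The monitor failure can be reproduced without a RabbitMQ node. This is a minimal sketch, assuming the agent multiplies the raw eval result by 2 in shell arithmetic, as the "...done. * 2" error token in the logs suggests. The sample strings and the `double_eval_result` helper are hypothetical, not code taken from rabbitmq-cluster.

```shell
#!/bin/bash
# Sample `rabbitmqctl eval` results modelled on the transcripts above
# (assumed values, not captured from a real node).
osp8_output=$(printf '1\n...done.\n')   # OSP 8: result followed by "...done."
osp9_output=$(printf '1\n')             # OSP 9 through 14: result only

# Hypothetical helper mimicking what the "...done. * 2" error token suggests
# the agent does: multiply the raw eval result by 2 in shell arithmetic.
# The subshell confines an arithmetic syntax error to this call, which then
# returns non-zero instead of aborting the caller.
double_eval_result() {
    ( echo $(( $1 * 2 )) ) 2>/dev/null
}

double_eval_result "$osp9_output"    # prints 2
double_eval_result "$osp8_output" || echo "arithmetic failed on OSP 8 output"
```

With the OSP 8 style output the expression expands to "1", a newline, then "...done. * 2", which bash rejects as an invalid arithmetic operator, matching the stderr lines logged by lrmd.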

On the OSP 8 version, the "...done." line can be suppressed by passing the -q flag:

[root@localhost ~]# rpm -q rabbitmq-server
rabbitmq-server-3.3.5-34.el7ost.noarch
[root@localhost ~]# rhos-release -L
Installed repositories (rhel-7.6):
  8
  ceph-1.3
  ceph-osd-1.3
  rhel-7.6
[root@localhost ~]# rabbitmqctl eval -q 'testing.'
testing
[root@localhost ~]#

So we must update the resource agent to always use -q for eval.
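A minimal sketch of that change, assuming the agent's eval calls can be routed through a single helper. The names `rmq_eval` and `rmq_eval_int`, the `RABBITMQCTL` override variable and the integer check are hypothetical illustrations, not the code merged in the upstream pull request; the `eval -q` argument order follows the OSP 8 transcript above.

```shell
#!/bin/bash
# Command used to talk to RabbitMQ. The override variable is hypothetical;
# it only lets this sketch be exercised without a running node.
RABBITMQCTL=${RABBITMQCTL:-rabbitmqctl}

# Hypothetical wrapper: always pass -q to eval so that rabbitmq-server 3.3.5
# (OSP 8) does not append "...done." to the result.
rmq_eval() {
    "$RABBITMQCTL" eval -q "$1"
}

# Hypothetical extra guard: return the result only if it is a bare
# non-negative integer, so unexpected output fails the call instead of
# breaking a later shell arithmetic expression.
rmq_eval_int() {
    local out
    out=$(rmq_eval "$1") || return 1
    [[ $out =~ ^[0-9]+$ ]] || return 1
    echo "$out"
}
```

Routing every eval through one wrapper keeps a single call site from silently missing -q. The integer check is optional hardening: if some other rabbitmq-server version added a different trailer, the call would fail cleanly rather than with a shell arithmetic syntax error.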

