Bug 1659072

Summary: osp8 overcloud deployment fails with : epmd reports: node 'rabbit' not running at all
Product: Red Hat Enterprise Linux 7 Reporter: pkomarov
Component: resource-agents Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED WONTFIX QA Contact: michal novacek <mnovacek>
Severity: high Docs Contact:
Priority: high    
Version: 7.6 CC: agk, aherr, amarecek, apevec, bperkins, cfeist, cluster-maint, fdinitto, jeckersb, lhh, lmiccini, michele, mnovacek, oblaut
Target Milestone: rc Keywords: Triaged, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: resource-agents-4.1.1-16.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1692889 (view as bug list) Environment:
Last Closed: 2021-03-15 07:32:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1647587, 1692889    

Description pkomarov 2018-12-13 14:07:51 UTC
Description of problem:
osp8 overcloud deployment fails with : epmd reports: node 'rabbit' not running at all  

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. HA deployment of osp8
[stack@undercloud-0 ~]$ cat core_puddle_version 
2018-11-15.1[stack@undercloud-0 ~]$ 
[stack@undercloud-0 ~]$ rhos-release -L

Installed repositories (rhel-7.6):
  8-director
  8
  ceph-2
  ceph-osd-2
  rhel-7.6


[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'pcs status|grep FAILED'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED controller-0 (Monitoring)


controller-0 | SUCCESS | rc=0 >>
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED controller-0 (Monitoring)

controller-1 | SUCCESS | rc=0 >>
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED controller-0 (Monitoring)

[stack@undercloud-0 ~]$ openstack stack resource list overcloud|grep -v COMP
+-------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+
| resource_name                             | physical_resource_id                          | resource_type                                     | resource_status | updated_time        |
+-------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+
| ControllerNodesPostDeployment             | be5be2cf-1615-41a5-89d8-7c9b45af19b6          | OS::TripleO::ControllerPostDeployment             | CREATE_FAILED   | 2018-12-13T06:58:31 |
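+-------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+

[stack@undercloud-0 ~]$ openstack software deployment list |grep -v COMP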
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+
| id                                   | config_id                            | server_id                            | action | status   |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+
| c248efc4-c2a7-4641-a40f-323ff301d3e3 | d84dd9ef-d281-408f-a1f9-ae0ef2bfe9fd | 174e9834-1c28-4ec4-ab4d-b3c54dda5262 | CREATE | FAILED   |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+

[stack@undercloud-0 ~]$ openstack software config show d84dd9ef-d281-408f-a1f9-ae0ef2bfe9fd
The output is in the link below (it is basically the post-deployment pacemaker setup: puppet-tripleo:manifests/profile/base/pacemaker.pp)

http://pastebin.test.redhat.com/683353

# On the controller, from /var/log/messages:

Dec 13 05:29:31 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:31 controller-0 systemd: Started Session c14 of user rabbitmq.
Dec 13 05:29:32 controller-0 lrmd[8850]:  notice: rabbitmq_monitor_10000:20269:stderr [ /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster: line 206: 1 ]
Dec 13 05:29:32 controller-0 lrmd[8850]:  notice: rabbitmq_monitor_10000:20269:stderr [ ...done. * 2 : syntax error: invalid arithmetic operator (error token is "...done. * 2 ") ]
Dec 13 05:29:32 controller-0 crmd[8853]:  notice: controller-0-rabbitmq_monitor_10000:74 [ /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster: line 206: 1\n...done. * 2 : syntax error: invalid arithmetic operator (error token is "...done. * 2 ")\n ]
Dec 13 05:29:33 controller-0 crmd[8853]:  notice: Result of start operation for memcached on controller-0: 0 (ok)
Dec 13 05:29:34 controller-0 ntpd[7663]: 0.0.0.0 c61c 0c clock_step +0.424865 s
Dec 13 05:29:34 controller-0 ntpd[7663]: 0.0.0.0 c614 04 freq_mode
Dec 13 05:29:34 controller-0 systemd: Time has been changed
Dec 13 05:29:35 controller-0 ntpd[7663]: 0.0.0.0 c618 08 no_sys_peer
Dec 13 05:29:36 controller-0 crmd[8853]:  notice: controller-0-rabbitmq_monitor_10000:74 [ /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster: line 206: 1\n...done. * 2 : syntax error: invalid arithmetic operator (error token is "...done. * 2 ")\n ]
Dec 13 05:29:36 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:36 controller-0 systemd: Started Session c15 of user rabbitmq.
Dec 13 05:29:36 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:36 controller-0 systemd: Started Session c16 of user rabbitmq.
Dec 13 05:29:36 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:36 controller-0 systemd: Started Session c17 of user rabbitmq.
Dec 13 05:29:36 controller-0 su: (to rabbitmq) root on none
Dec 13 05:29:36 controller-0 systemd: Started Session c18 of user rabbitmq.
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ Error: unable to connect to node 'rabbit@controller-0': nodedown ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ DIAGNOSTICS ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ =========== ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ attempted to contact: ['rabbit@controller-0'] ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ rabbit@controller-0: ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [   * connected to epmd (port 4369) on controller-0 ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [   * epmd reports: node 'rabbit' not running at all ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [                   no other nodes on controller-0 ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [   * suggestion: start the node ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ current node details: ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ - node name: 'rabbitmqctl20809@controller-0' ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ - home dir: /var/lib/rabbitmq ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [ - cookie hash: J9D9g2x1Jd/x/mEaud74Uw== ]
Dec 13 05:29:38 controller-0 lrmd[8850]:  notice: rabbitmq_stop_0:20585:stderr [  ]

Comment 3 John Eckersberg 2018-12-13 17:54:46 UTC
This is because the rabbitmq-server version in OSP8 has a different output format for eval calls than the newer OSP versions do.

On 8:

[root@localhost ~]# rabbitmqctl eval 'testing.'
testing
...done.
[root@localhost ~]# 

On 9 through 14:
[root@localhost ~]# rabbitmqctl eval 'testing.'
testing
[root@localhost ~]# 

On the 8 version, the "...done." output can be suppressed by passing the -q flag:

[root@localhost ~]# rpm -q rabbitmq-server
rabbitmq-server-3.3.5-34.el7ost.noarch
[root@localhost ~]# rhos-release -L
Installed repositories (rhel-7.6):
  8
  ceph-1.3
  ceph-osd-1.3
  rhel-7.6
[root@localhost ~]# rabbitmqctl eval -q 'testing.'
testing
[root@localhost ~]#

So we must update the resource agent to always use -q for eval.
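
To illustrate why the trailing "...done." line breaks the monitor, here is a minimal sketch that needs no broker. The two functions standing in for the eval output, and the helper double_if_integer, are hypothetical names made up for this example; they are not part of rabbitmqctl or of the rabbitmq-cluster agent. The "* 2" simply mirrors the expression visible in the logged error, and the integer check is a defensive pattern added for the demonstration, not the fix proposed above (which is to pass -q).

```shell
#!/bin/bash
# Hypothetical stand-ins for the two output formats shown above. These are
# NOT the real rabbitmqctl; they only print the same text.
eval_without_q() { printf '1\n...done.\n'; }   # OSP8 format without -q
eval_with_q()    { printf '1\n'; }             # OSP8 with -q, or OSP9 and later

# Hypothetical helper: double a value only if it is a plain integer.
# Feeding "1<newline>...done." straight into $(( ... * 2 )) is what produces
# the "invalid arithmetic operator" message seen in /var/log/messages.
double_if_integer() {
    local raw="$1"
    if [[ "$raw" =~ ^[0-9]+$ ]]; then
        echo $(( raw * 2 ))
    else
        echo "not-an-integer"
        return 1
    fi
}

# "|| true" keeps the script going even when the helper rejects its input.
noisy_result=$(double_if_integer "$(eval_without_q)") || true
quiet_result=$(double_if_integer "$(eval_with_q)") || true

echo "without -q: $noisy_result"   # prints: without -q: not-an-integer
echo "with -q:    $quiet_result"   # prints: with -q:    2
```

With the quiet format the captured value is a bare number and the arithmetic succeeds; with the OSP8 default format the extra line makes the value non-numeric, which is why suppressing it with -q resolves the monitor failure.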

Comment 21 RHEL Program Management 2021-03-15 07:32:30 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.