Bug 1120826 - haproxy will not run, PCSD will not go online
Summary: haproxy will not run, PCSD will not go online
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Assignee: Chris Feist
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-07-17 19:30 UTC by John H Terpstra
Modified: 2015-04-13 22:26 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-04-13 22:26:23 UTC


Attachments (Terms of Use)
Haproxy config file (446 bytes, text/plain)
2014-07-17 19:30 UTC, John H Terpstra
Earlier crm_report file (253.35 KB, application/octet-stream)
2014-07-17 20:20 UTC, John H Terpstra
crm_report - current - part A (needs to be concatenated with part B) (15.00 MB, application/octet-stream)
2014-07-17 20:21 UTC, John H Terpstra
crm_report - current - part B (needs to be concatenated with Part A) (7.74 MB, application/octet-stream)
2014-07-17 20:23 UTC, John H Terpstra
SOS Report (split part A) (18.00 MB, application/octet-stream)
2014-07-17 20:24 UTC, John H Terpstra
sosreport (split part B) (18.00 MB, application/octet-stream)
2014-07-17 20:25 UTC, John H Terpstra
SOS Report (split part C - final) (8.70 MB, application/octet-stream)
2014-07-17 20:25 UTC, John H Terpstra

Description John H Terpstra 2014-07-17 19:30:56 UTC
Created attachment 918802 [details]
Haproxy config file

Description of problem:
Haproxy is configured to enable RabbitMQ to operate in HA mode.  Haproxy can be started using systemctl, but fails when run via pacemaker. When pacemaker is started, lb-haproxy starts and is then shut down shortly after.  Disabling and then re-enabling the resource does not restart haproxy.

Version-Release number of selected component (if applicable):
pacemaker-1.1.10-21.el7_0.x86_64
haproxy-2.5-0.3.dev22.el7.x86_64
rabbitmq-server-3.1.5-6.3.el7ost.noarch

How reproducible:
haproxy.conf file is included. 

Steps to Reproduce:
1. Install haproxy.conf file in /etc/haproxy directory
2. Execute the following:
a) pcs resource create lb-haproxy systemd:haproxy op monitor start-delay=10s --clone
b) pcs resource create rabbit-vip IPaddr2 ip=$BROKER_VIP (set elsewhere)
c) pcs resource create msg-rabbit systemd:rabbitmq-server --clone
3. Run pcs status to monitor progress of cluster services
4. Attempt to restart lb-haproxy; it fails (the following steps demonstrate the issue):
a) pcs resource disable lb-haproxy-clone
b) pcs resource enable lb-haproxy-clone
c) pcs status

Actual results:
The clone set lb-haproxy-clone stops and cannot be restarted via the disable/enable sequence.

Expected results:
The disable/enable sequence is expected to restart lb-haproxy.

Additional info:
The crm_report and log files have not shed light on the potential cause(s).
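For reference, a clone set stuck in the Stopped state can be spotted mechanically in "pcs status" output. A minimal sketch; the status fragment below is illustrative, not captured from this cluster:

```shell
# Sample `pcs status` fragment (illustrative, not from this cluster).
cat > /tmp/pcs-status.txt <<'EOF'
 Clone Set: lb-haproxy-clone [lb-haproxy]
     Stopped: [ rh7cntl1 rh7cntl2 rh7cntl3 ]
 Clone Set: msg-rabbit-clone [msg-rabbit]
     Started: [ rh7cntl1 rh7cntl2 rh7cntl3 ]
EOF

# Print the name of each clone set whose following member line says Stopped.
awk '/Clone Set:/ { name = $3 }
     /Stopped:/   { print name }' /tmp/pcs-status.txt
# prints: lb-haproxy-clone
```

In a live cluster the input would come from "pcs status" directly rather than a saved file.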

Comment 2 Chris Feist 2014-07-17 20:05:32 UTC
Does haproxy start on one node if you start with a fresh cluster and use this command:

pcs resource create lb-haproxy systemd:haproxy op monitor start-delay=10s

Comment 3 John H Terpstra 2014-07-17 20:07:37 UTC
Yes.

I also have the  crm_report and the sos report from the first node, but both files are too large to attach to this bug report.

Comment 4 John H Terpstra 2014-07-17 20:20:07 UTC
Created attachment 918820 [details]
Earlier crm_report file

This is an older file. An additional crm_report and current sosreport will also be attached.

Comment 5 John H Terpstra 2014-07-17 20:21:11 UTC
Created attachment 918821 [details]
crm_report - current - part A (needs to be concatenated with part B)

Comment 6 John H Terpstra 2014-07-17 20:23:06 UTC
Created attachment 918822 [details]
crm_report - current - part B (needs to be concatenated with Part A)

To reconstitute:
cat RHEL7HACluster.tar.bz2.a* > RHEL7HACluster.tar.bz2
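The split/concatenate round-trip can be demonstrated end to end; the file names below are hypothetical stand-ins for the actual attachment parts:

```shell
# Create a sample archive-sized file (stand-in for the real tarball).
dd if=/dev/urandom of=/tmp/sample.tar.bz2 bs=1024 count=5120 2>/dev/null

# Split into 2 MB pieces with an 'a*' suffix, as the attachments were.
split -b 2M /tmp/sample.tar.bz2 /tmp/sample.tar.bz2.a

# Reconstitute: shell glob expansion orders the parts lexically.
cat /tmp/sample.tar.bz2.a* > /tmp/rejoined.tar.bz2

# Verify the round-trip is byte-identical.
cmp /tmp/sample.tar.bz2 /tmp/rejoined.tar.bz2 && echo OK
```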

Comment 7 John H Terpstra 2014-07-17 20:24:09 UTC
Created attachment 918823 [details]
SOS Report (split part A)

Comment 8 John H Terpstra 2014-07-17 20:25:13 UTC
Created attachment 918824 [details]
sosreport (split part B)

Comment 9 John H Terpstra 2014-07-17 20:25:53 UTC
Created attachment 918825 [details]
SOS Report (split part C - final)

Comment 10 Chris Feist 2014-07-17 22:53:14 UTC
John,

Can you send me the output of 'service haproxy status' on all 3 nodes?

Also, I did see this potential issue:

haproxy is set to bind to 192.168.123.39 (for the rabbitmq frontend), but that IP (resource: rabbit-vip) will only be running on one node.  And I'm pretty sure you only want to run haproxy on one node as well.

So I would change the configuration to something like this (this will put rabbit-vip & lb-haproxy on the same node, and require that rabbit-vip start before the haproxy service starts):

a) pcs resource create rabbit-vip IPaddr2 ip=$BROKER_VIP --group haproxy_group
b) pcs resource create lb-haproxy systemd:haproxy op monitor start-delay=10s --group haproxy_group
c) pcs resource create msg-rabbit systemd:rabbitmq-server --clone

If this still doesn't work, can you post the output of 'pcs status' (on one node) and 'service haproxy status' (on all 3 nodes)?

Comment 11 Steve Reichard 2014-07-18 14:45:10 UTC
Currently only one service is under HAProxy and cluster control.


This experiment is attempting to add them one by one.  When they are all added, each service will have a separate VIP, so HAProxy will be partially used on each cluster system, and usage will shift as the VIPs relocate.

This is the recommended practice from Fabio's OSP how-to recipes.

I currently have a complete cluster configured that was deployed using Puppet, if you would like to see its values.

spr

Comment 12 Andrew Beekhof 2014-07-22 07:09:25 UTC
I see this in one of the configs:

        <meta_attributes id="lb-haproxy-clone-meta_attributes">
          <nvpair id="lb-haproxy-clone-meta_attributes-target-role" name="target-role" value="Stopped"/>
        </meta_attributes>

That would certainly prevent haproxy from being started (and cause pacemaker to stop haproxy if it found it running).

It also looks like rabbitmq is suffering from the systemd-start-returns-too-early issue that we now have a z-stream for.  I'd recommend an upgrade.

The symptom of that is a tight loop of:

Jul 16 22:45:26 [19028] rh7cntl1.rcbd.lab       lrmd:     info: log_execute: 	executing - rsc:msg-rabbit action:start call_id:110098
Jul 16 22:45:26 [19028] rh7cntl1.rcbd.lab       lrmd:     info: systemd_async_dispatch: 	Call to start passed: /org/freedesktop/systemd1/job/2894604
Jul 16 22:45:26 [19028] rh7cntl1.rcbd.lab       lrmd:     info: log_finished: 	finished - rsc:msg-rabbit action:start call_id:110098  exit-code:0 exec-time:348ms queue-time:0ms
Jul 16 22:45:26 [19028] rh7cntl1.rcbd.lab       lrmd:     info: pcmk_dbus_get_property: 	Calling: GetAll on org.freedesktop.systemd1
Jul 16 22:45:26 [19028] rh7cntl1.rcbd.lab       lrmd:     info: cancel_recurring_action: 	Cancelling operation msg-rabbit_status_60000
Jul 16 22:45:26 [19028] rh7cntl1.rcbd.lab       lrmd:     info: log_execute: 	executing - rsc:msg-rabbit action:stop call_id:110101
Jul 16 22:45:26 [19028] rh7cntl1.rcbd.lab       lrmd:     info: systemd_async_dispatch: 	Call to stop passed: /org/freedesktop/systemd1/job/2894708
Jul 16 22:45:26 [19028] rh7cntl1.rcbd.lab       lrmd:     info: log_finished: 	finished - rsc:msg-rabbit action:stop call_id:110101  exit-code:0 exec-time:86ms queue-time:0ms

Comment 13 Andrew Beekhof 2014-07-22 07:15:29 UTC
Specifically, it doesn't appear that it was ever set otherwise:

# for f in `ls -1 rh7cntl3/pengine/pe-input-*.bz2`; do bzcat $f | grep lb-haproxy-clone-meta_attributes-target-role | grep -v Stopped ; done | wc -l
       0
# for f in `ls -1 rh7cntl3/pengine/pe-input-*.bz2`; do bzcat $f | grep lb-haproxy-clone-meta_attributes-target-role | grep Stopped ; done | wc -l
    4000
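The same check can be run against the CIB XML directly rather than a grep loop. A sketch using an inline fragment; the real pe-input files are bzip2-compressed, so substitute "bzcat rh7cntl3/pengine/pe-input-NNN.bz2" for the saved snippet when checking an actual report:

```shell
# Minimal CIB fragment containing the offending meta attribute
# (illustrative copy of the config quoted in comment 12).
cat > /tmp/cib-snippet.xml <<'EOF'
<clone id="lb-haproxy-clone">
  <meta_attributes id="lb-haproxy-clone-meta_attributes">
    <nvpair id="lb-haproxy-clone-meta_attributes-target-role"
            name="target-role" value="Stopped"/>
  </meta_attributes>
</clone>
EOF

# Count occurrences of target-role=Stopped in the fragment.
grep -c 'target-role.*Stopped' /tmp/cib-snippet.xml
# prints: 1
```

Clearing the attribute should correspond to what 'pcs resource enable lb-haproxy-clone' does, since enable resets the target-role meta attribute.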

