Bug 1600169 - Cannot add new node to pcs cluster - Error: Error connecting to controller-3 (HTTP error: 408)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.5
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ondrej Mular
QA Contact: pkomarov
URL:
Whiteboard:
Duplicates: 1564218
Depends On:
Blocks: 1599758
 
Reported: 2018-07-11 15:04 UTC by Michele Baldessari
Modified: 2021-12-10 16:37 UTC
CC List: 10 users

Fixed In Version: pcs-0.9.165-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-30 08:06:40 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
Red Hat Product Errata RHBA-2018:3066 (last updated 2018-10-30 08:07:49 UTC)

Description Michele Baldessari 2018-07-11 15:04:19 UTC
Description of problem:
(split off from https://bugzilla.redhat.com/show_bug.cgi?id=1599758)
So this has been hit only rarely and is not reproducible (we are trying to get a reproducer going with debug enabled). We did hit it some time ago (via https://bugzilla.redhat.com/show_bug.cgi?id=1564218) but never reproduced it again, so we chalked it up to a networking hiccup.

The short story is the following:
[heat-admin@controller-0 ~]$ sudo pcs cluster node add controller-3
Disabling SBD service...
controller-3: sbd disabled
Sending 'corosync authkey' to 'controller-3'
controller-3: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-3'
Error: Error connecting to controller-3 (HTTP error: 408)
Error: Errors have occurred, therefore pcs is unable to continue
[heat-admin@controller-0 ~]$ sudo pcs cluster node add controller-3
Disabling SBD service...
controller-3: sbd disabled
Sending 'corosync authkey' to 'controller-3'
controller-3: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-3'
Error: Error connecting to controller-3 (HTTP error: 408)
Error: Errors have occurred, therefore pcs is unable to continue
 
Added pcsd debug on both controller-0 and controller-3 so that I could capture logs, and of course now it just worked :/
[root@controller-0 ~]# strace -f -y -o /tmp/strace pcs cluster node add controller-3
Disabling SBD service...
controller-3: sbd disabled
Sending 'corosync authkey' to 'controller-3'
controller-3: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-3'
controller-3: successful distribution of the file 'pacemaker_remote authkey'
controller-0: Corosync updated
controller-2: Corosync updated
Setting up corosync...
controller-3: Succeeded
Synchronizing pcsd certificates on nodes controller-3...
controller-3: Success
Restarting pcsd on the nodes in order to reload the certificates...
controller-3: Success

Even though there was no debug log from pcsd on either controller, we did see the following on controller-3:
::ffff:172.17.1.12 - - [11/Jul/2018:13:37:53 +0000] "POST /remote/put_file HTTP/1.1" 200 62 0.0063
::ffff:172.17.1.12 - - [11/Jul/2018:13:37:53 +0000] "POST /remote/put_file HTTP/1.1" 200 62 0.0064
controller-0.localdomain - - [11/Jul/2018:13:37:53 UTC] "POST /remote/put_file HTTP/1.1" 200 62
- -> /remote/put_file
I, [2018-07-11T13:38:05.178904 #19074]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2018-07-11T13:38:05.179022 #19074]  INFO -- : CIB USER: hacluster, groups:
I, [2018-07-11T13:38:05.183918 #19074]  INFO -- : Return Value: 1
W, [2018-07-11T13:38:05.184009 #19074]  WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file
W, [2018-07-11T13:38:05.184058 #19074]  WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
controller-0.localdomain - - [11/Jul/2018:13:37:53 UTC] "POST /remote/put_file HTTP/1.1" 408 321
- -> /remote/put_file

It looks as if the first put_file succeeded with 200 but the second one failed with 408. We have a rather large pacemaker authkey (4096 bytes, due to how heat has a very limited way of generating password characters); not sure if that rings a bell.

We will try to reproduce this with debug logs enabled and report here.

Version-Release number of selected component (if applicable):
pcs-0.9.162-5.el7_5.1.x86_64

Comment 3 Damien Ciabrini 2018-07-11 16:19:56 UTC
Reproduced the HTTP error with the following minimal env:
 . 2 VMs running RHEL 7.5, named rhel1 and rhel2
 . the following package versions:
    pacemaker-cli-1.1.18-11.el7_5.3.x86_64
    pcs-0.9.162-5.el7_5.1.x86_64
    pacemaker-remote-1.1.18-11.el7_5.3.x86_64
    pacemaker-libs-1.1.18-11.el7_5.3.x86_64
    pacemaker-cluster-libs-1.1.18-11.el7_5.3.x86_64
    pacemaker-cts-1.1.18-11.el7_5.3.x86_64
    pacemaker-1.1.18-11.el7_5.3.x86_64

The reproducer consists of setting up the cluster with pre-existing auth keys, as we do in OpenStack.

on rhel1:
dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1; chown hacluster. /etc/pacemaker/authkey; chmod 640 /etc/pacemaker/authkey
dd if=/dev/urandom of=/etc/corosync/authkey bs=128 count=1; chown root. /etc/corosync/authkey; chmod 400 /etc/corosync/authkey
pcs cluster setup --force --name foo --encryption 1 rhel1
pcs cluster start

on rhel2:
# make sure no pre-existing cluster is in the way
pcs cluster destroy
rm -f /etc/pacemaker/authkey /etc/corosync/authkey

on rhel1:                                                                                                                  
[root@rhel1 ~]# pcs cluster node add rhel2
Disabling SBD service...
rhel2: sbd disabled
Sending 'corosync authkey' to 'rhel2'
rhel2: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'rhel2'
Error: Error connecting to rhel2 (HTTP error: 408)
Error: Errors have occurred, therefore pcs is unable to continue

Comment 5 Ondrej Mular 2018-07-20 13:10:10 UTC
I've managed to reproduce the issue in my testing environment only under heavy CPU load but still not very reliably (~ 30% success rate).

After further investigation, I found out that the cURL library (used by pcs to communicate with pcsd) adds the HTTP header 'Expect: 100-continue' to requests with a larger body (apparently anything over about 1 kB, so the 4096-byte pacemaker authkey is enough to trigger it). This header is not supported by pcsd.

More about the Expect header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expect

Under some circumstances, this can cause a race condition in which pcs is unable to send the whole HTTP request to pcsd, and pcsd therefore times out with a 408 response code.
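
As an aside, here is a minimal pycurl sketch (illustrative only, not pcs code; the endpoint and body sizes are arbitrary) showing libcurl adding the header once a POST body crosses its internal threshold:

# Minimal illustrative sketch. Assumes network access to httpbin.org; any
# HTTP endpoint that accepts POSTs will do. With VERBOSE enabled, libcurl
# dumps the outgoing request headers to stderr: the 4096-byte body produces
# "Expect: 100-continue", the small one does not.
import pycurl
from io import BytesIO

def post(body):
    buf = BytesIO()
    handle = pycurl.Curl()
    handle.setopt(pycurl.URL, "http://httpbin.org/post")  # illustrative endpoint
    handle.setopt(pycurl.POSTFIELDS, body)
    handle.setopt(pycurl.VERBOSE, 1)                # print request headers
    handle.setopt(pycurl.WRITEFUNCTION, buf.write)  # discard the response body
    handle.perform()
    handle.close()

post("x" * 100)   # small body: no Expect header
post("x" * 4096)  # authkey-sized body: libcurl adds "Expect: 100-continue"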

So, the solution might be to force the cURL library not to send the Expect header. After this change, I was unable to reproduce the issue anymore. Here is the change I applied:

index f7fe241..d54e856 100644
--- a/pcs/common/node_communicator.py
+++ b/usr/lib/python2.7/site-packages/pcs/common/node_communicator.py
@@ -532,6 +532,7 @@ def _create_request_handle(request, cookies, timeout):
     handle.setopt(pycurl.SSL_VERIFYHOST, 0)
     handle.setopt(pycurl.SSL_VERIFYPEER, 0)
     handle.setopt(pycurl.NOSIGNAL, 1) # required for multi-threading
+    handle.setopt(pycurl.HTTPHEADER, ["Expect: "])
     if cookies:
         handle.setopt(
             pycurl.COOKIE, _dict_to_cookies(cookies).encode("utf-8")
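
(Passing "Expect: " with nothing after the colon is libcurl's documented way of suppressing an internally generated header via CURLOPT_HTTPHEADER, so the request body goes out immediately instead of waiting for a "100 Continue" that pcsd never sends.)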

Michele, can you please test this patch in your environment and let me know if the issue still persists?

Comment 6 Ondrej Mular 2018-07-23 07:49:53 UTC
Moving to POST as we'll include the patch in the following build.

Comment 7 Damien Ciabrini 2018-07-23 10:32:05 UTC
Hey Ondrej,

> index f7fe241..d54e856 100644
> --- a/pcs/common/node_communicator.py
> +++ b/usr/lib/python2.7/site-packages/pcs/common/node_communicator.py
> @@ -532,6 +532,7 @@ def _create_request_handle(request, cookies, timeout):
>      handle.setopt(pycurl.SSL_VERIFYHOST, 0)
>      handle.setopt(pycurl.SSL_VERIFYPEER, 0)
>      handle.setopt(pycurl.NOSIGNAL, 1) # required for multi-threading
> +    handle.setopt(pycurl.HTTPHEADER, ["Expect: "])
>      if cookies:
>          handle.setopt(
>              pycurl.COOKIE, _dict_to_cookies(cookies).encode("utf-8")
> 
> Michele, can you please test this patch in your environment and let me know
> if the issue still persists?

Just applied that patch on my previous setup and restarted pcsd, and now node rhel2 correctly gets added to the cluster. Works perfectly, thanks!

Comment 9 pkomarov 2018-09-20 10:13:25 UTC
Verified.

env:

[stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-7.5):
  13
  ceph-3
  ceph-osd-3
  rhel-7.5
[stack@undercloud-0 ~]$ cat core_puddle_version 
2018-08-31.1

[stack@undercloud-0 ~]$ ansible overcloud -mshell -b -a'rpm -qa|grep pcs'

controller-0 | SUCCESS | rc=0 >>
pcs-0.9.162-5.el7_5.1.x86_64

controller-1 | SUCCESS | rc=0 >>
pcs-0.9.162-5.el7_5.1.x86_64

controller-2 | SUCCESS | rc=0 >>
pcs-0.9.162-5.el7_5.1.x86_64


# fail test before pcs update:

[root@controller-1 ~]# pcs cluster destroy --all
Requesting stop of service pacemaker_remote on 'overcloud-novacomputeiha-0', 'overcloud-novacomputeiha-1'
Warning: 172.17.1.10: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60059 milliseconds)
Warning: 172.17.1.13: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60057 milliseconds)
Requesting remove remote node files from 'overcloud-novacomputeiha-0', 'overcloud-novacomputeiha-1'
Warning: 172.17.1.10: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60060 milliseconds)
Warning: 172.17.1.13: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60058 milliseconds)
controller-0: Stopping Cluster (pacemaker)...
controller-1: Stopping Cluster (pacemaker)...
controller-2: Stopping Cluster (pacemaker)...
controller-0: Successfully destroyed cluster
controller-1: Successfully destroyed cluster
controller-2: Successfully destroyed cluster
[root@controller-1 ~]# dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1; chown hacluster. /etc/pacemaker/authkey; chmod 640 /etc/pacemaker/authkey
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000547725 s, 7.5 MB/s
[root@controller-1 ~]# dd if=/dev/urandom of=/etc/corosync/authkey bs=128 count=1; chown root. /etc/corosync/authkey; chmod 400 /etc/corosync/authkey
1+0 records in
1+0 records out
128 bytes (128 B) copied, 0.000478243 s, 268 kB/s
[root@controller-1 ~]# pcs cluster setup --force --name foo --encryption 1 controller-1
Destroying cluster on nodes: controller-1...
controller-1: Stopping Cluster (pacemaker)...
controller-1: Successfully destroyed cluster

Sending 'corosync authkey', 'pacemaker_remote authkey' to 'controller-1'
controller-1: successful distribution of the file 'corosync authkey'
controller-1: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
controller-1: Succeeded

Synchronizing pcsd certificates on nodes controller-1...
controller-1: Success
Restarting pcsd on the nodes in order to reload the certificates...
controller-1: Success
[root@controller-1 ~]# pcs cluster start
Starting Cluster...
[root@controller-1 ~]# pcs cluster node add controller-2
Disabling SBD service...
controller-2: sbd disabled
Sending 'corosync authkey' to 'controller-2'
controller-2: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-2'
Error: Error connecting to controller-2 (HTTP error: 408)
Error: Errors have occurred, therefore pcs is unable to continue



# test pcs update:
[stack@undercloud-0 pacemaker_rpms]$ ansible overcloud -mshell -b -a'rpm -qa|grep pcs'

overcloud-novacomputeiha-0 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64

overcloud-novacomputeiha-1 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64

controller-1 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64

controller-0 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64

controller-2 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64


# test adding a cluster node:

[root@controller-1 ~]# pcs cluster node add controller-2
Disabling SBD service...
controller-2: sbd disabled
Sending 'corosync authkey' to 'controller-2'
controller-2: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-2'
controller-2: successful distribution of the file 'pacemaker_remote authkey'
controller-1: Corosync updated
Setting up corosync...
controller-2: Succeeded
Synchronizing pcsd certificates on nodes controller-2...
controller-2: Success
Restarting pcsd on the nodes in order to reload the certificates...
controller-2: Success

Comment 11 errata-xmlrpc 2018-10-30 08:06:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3066

Comment 12 Michele Baldessari 2018-11-21 20:57:26 UTC
*** Bug 1564218 has been marked as a duplicate of this bug. ***

