Bug 1600169

Summary: Cannot add new node to pcs cluster - Error: Error connecting to controller-3 (HTTP error: 408)
Product: Red Hat Enterprise Linux 7 Reporter: Michele Baldessari <michele>
Component: pcs Assignee: Ondrej Mular <omular>
Status: CLOSED ERRATA QA Contact: pkomarov
Severity: high Docs Contact:
Priority: medium    
Version: 7.5 CC: ahrechan, cfeist, chjones, cluster-maint, dciabrin, idevat, michele, omular, pkomarov, tojeline
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: pcs-0.9.165-2.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-30 08:06:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1599758    

Description Michele Baldessari 2018-07-11 15:04:19 UTC
Description of problem:
(split off from https://bugzilla.redhat.com/show_bug.cgi?id=1599758)
So this has rarely been hit and is not reproducible (we are trying to get a reproducer going with debug enabled). We did hit it some time ago (via https://bugzilla.redhat.com/show_bug.cgi?id=1564218) but never reproduced it again, so we chalked it up to a networking hiccup.

The short story is the following:
[heat-admin@controller-0 ~]$ sudo pcs cluster node add controller-3
Disabling SBD service...
controller-3: sbd disabled
Sending 'corosync authkey' to 'controller-3'
controller-3: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-3'
Error: Error connecting to controller-3 (HTTP error: 408)
Error: Errors have occurred, therefore pcs is unable to continue
[heat-admin@controller-0 ~]$ sudo pcs cluster node add controller-3
Disabling SBD service...
controller-3: sbd disabled
Sending 'corosync authkey' to 'controller-3'
controller-3: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-3'
Error: Error connecting to controller-3 (HTTP error: 408)
Error: Errors have occurred, therefore pcs is unable to continue
 
Added pcsd debug on both controller-0 and controller-3 so that I could capture logs
and of course now it just worked :/
[root@controller-0 ~]# strace -f -y -o /tmp/strace pcs cluster node add controller-3
Disabling SBD service...
controller-3: sbd disabled
Sending 'corosync authkey' to 'controller-3'
controller-3: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-3'
controller-3: successful distribution of the file 'pacemaker_remote authkey'
controller-0: Corosync updated
controller-2: Corosync updated
Setting up corosync...
controller-3: Succeeded
Synchronizing pcsd certificates on nodes controller-3...
controller-3: Success
Restarting pcsd on the nodes in order to reload the certificates...
controller-3: Success

Even though there was no debug log output from pcsd on either controller, we did see the following on controller-3:
::ffff:172.17.1.12 - - [11/Jul/2018:13:37:53 +0000] "POST /remote/put_file HTTP/1.1" 200 62 0.0063
::ffff:172.17.1.12 - - [11/Jul/2018:13:37:53 +0000] "POST /remote/put_file HTTP/1.1" 200 62 0.0064
controller-0.localdomain - - [11/Jul/2018:13:37:53 UTC] "POST /remote/put_file HTTP/1.1" 200 62
- -> /remote/put_file
I, [2018-07-11T13:38:05.178904 #19074]  INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2018-07-11T13:38:05.179022 #19074]  INFO -- : CIB USER: hacluster, groups:
I, [2018-07-11T13:38:05.183918 #19074]  INFO -- : Return Value: 1
W, [2018-07-11T13:38:05.184009 #19074]  WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file
W, [2018-07-11T13:38:05.184058 #19074]  WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
controller-0.localdomain - - [11/Jul/2018:13:37:53 UTC] "POST /remote/put_file HTTP/1.1" 408 321
- -> /remote/put_file

It looks as if the first put_file succeeded with 200 but the second one failed with 408. We have a rather large pacemaker authkey (4096 bytes, due to how heat has a very limited way of generating password characters); not sure if that rings a bell.

We will try to reproduce this with debug logs enabled and report here.

Version-Release number of selected component (if applicable):
pcs-0.9.162-5.el7_5.1.x86_64

Comment 3 Damien Ciabrini 2018-07-11 16:19:56 UTC
Reproduced the HTTP error with the following minimal env:
 . 2 VMs running RHEL 7.5, named rhel1 and rhel2
 . the following package versions:
    pacemaker-cli-1.1.18-11.el7_5.3.x86_64
    pcs-0.9.162-5.el7_5.1.x86_64
    pacemaker-remote-1.1.18-11.el7_5.3.x86_64
    pacemaker-libs-1.1.18-11.el7_5.3.x86_64
    pacemaker-cluster-libs-1.1.18-11.el7_5.3.x86_64
    pacemaker-cts-1.1.18-11.el7_5.3.x86_64
    pacemaker-1.1.18-11.el7_5.3.x86_64

The reproducer consists of setting up the cluster with pre-existing auth keys, as we do in OpenStack.

on rhel1:
dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1; chown hacluster. /etc/pacemaker/authkey; chmod 640 /etc/pacemaker/authkey
dd if=/dev/urandom of=/etc/corosync/authkey bs=128 count=1; chown root. /etc/corosync/authkey; chmod 400 /etc/corosync/authkey
pcs cluster setup --force --name foo --encryption 1 rhel1
pcs cluster start

on rhel2:
# make sure no pre-existing cluster is in the way
pcs cluster destroy
rm -f /etc/pacemaker/authkey /etc/corosync/authkey

on rhel1:                                                                                                                  
[root@rhel1 ~]# pcs cluster node add rhel2
Disabling SBD service...
rhel2: sbd disabled
Sending 'corosync authkey' to 'rhel2'
rhel2: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'rhel2'
Error: Error connecting to rhel2 (HTTP error: 408)
Error: Errors have occurred, therefore pcs is unable to continue

Comment 5 Ondrej Mular 2018-07-20 13:10:10 UTC
I've managed to reproduce the issue in my testing environment only under heavy CPU load, and still not very reliably (~30% success rate).

After further investigation, I found out that the cURL library (used in pcs to communicate with pcsd) adds the HTTP header 'Expect: 100-continue' to requests with a larger body size (apparently a 1 kB pacemaker authkey should already be sufficient). This header is not supported in pcsd.

More about the Expect header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expect

Under some circumstances, this can cause a race condition in which pcs is not able to send the whole HTTP request to pcsd, and pcsd therefore times out with a 408 response code.
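
The race can be illustrated with a self-contained sketch using plain Python sockets (this only mimics the behavior; it is not pcsd's actual Ruby code, and the 0.5 s timeout is just a stand-in for pcsd's request timeout): a server that ignores "Expect: 100-continue" waits for a body the client is still holding back, and the server-side timeout fires first.

```python
import socket
import threading

def serve_one(listener):
    # Toy server: reads the request headers, ignores "Expect: 100-continue",
    # then waits for the body with a short timeout -- mimicking how pcsd
    # answers 408 when the body never arrives in time.
    conn, _ = listener.accept()
    conn.settimeout(0.5)  # stand-in for pcsd's request timeout
    data = b""
    while b"\r\n\r\n" not in data:  # read until end of request headers
        data += conn.recv(4096)
    try:
        conn.recv(4096)  # wait for the body; the client is waiting for "100 Continue"
        status = "200 OK"
    except socket.timeout:
        status = "408 Request Timeout"
    conn.sendall(("HTTP/1.1 %s\r\nContent-Length: 0\r\n\r\n" % status).encode())
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=serve_one, args=(listener,)).start()

# Client sends the headers with "Expect: 100-continue" and holds back the
# 4096-byte body until it sees an interim response, the way libcurl does
# (libcurl does give up waiting after about a second, but the server-side
# timeout can win the race).
client = socket.socket()
client.connect(listener.getsockname())
client.sendall(b"POST /remote/put_file HTTP/1.1\r\nHost: pcsd\r\n"
               b"Content-Length: 4096\r\nExpect: 100-continue\r\n\r\n")
response_line = client.recv(4096).decode().splitlines()[0]
client.close()
print(response_line)  # -> HTTP/1.1 408 Request Timeout
```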

So, the solution might be to force the cURL library not to use the Expect header. After this change, I was unable to reproduce the issue anymore. Here is the change I applied:

index f7fe241..d54e856 100644
--- a/pcs/common/node_communicator.py
+++ b/usr/lib/python2.7/site-packages/pcs/common/node_communicator.py
@@ -532,6 +532,7 @@ def _create_request_handle(request, cookies, timeout):
     handle.setopt(pycurl.SSL_VERIFYHOST, 0)
     handle.setopt(pycurl.SSL_VERIFYPEER, 0)
     handle.setopt(pycurl.NOSIGNAL, 1) # required for multi-threading
+    handle.setopt(pycurl.HTTPHEADER, ["Expect: "])
     if cookies:
         handle.setopt(
             pycurl.COOKIE, _dict_to_cookies(cookies).encode("utf-8")

Michele, can you please test this patch in your environment and let me know if the issue still persists?

Comment 6 Ondrej Mular 2018-07-23 07:49:53 UTC
Moving to POST as we'll include the patch in the following build.

Comment 7 Damien Ciabrini 2018-07-23 10:32:05 UTC
Hey Ondrej,

> index f7fe241..d54e856 100644
> --- a/pcs/common/node_communicator.py
> +++ b/usr/lib/python2.7/site-packages/pcs/common/node_communicator.py
> @@ -532,6 +532,7 @@ def _create_request_handle(request, cookies, timeout):
>      handle.setopt(pycurl.SSL_VERIFYHOST, 0)
>      handle.setopt(pycurl.SSL_VERIFYPEER, 0)
>      handle.setopt(pycurl.NOSIGNAL, 1) # required for multi-threading
> +    handle.setopt(pycurl.HTTPHEADER, ["Expect: "])
>      if cookies:
>          handle.setopt(
>              pycurl.COOKIE, _dict_to_cookies(cookies).encode("utf-8")
> 
> Michele, can you please test this patch in your environment and let me know
> if the issue still persists?

Just applied that patch on my previous setup and restarted pcsd, and now node rhel2 correctly gets added to the cluster. Works perfectly, thanks!

Comment 9 pkomarov 2018-09-20 10:13:25 UTC
Verified,

env : 

[stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-7.5):
  13
  ceph-3
  ceph-osd-3
  rhel-7.5
[stack@undercloud-0 ~]$ cat core_puddle_version 
2018-08-31.1

[stack@undercloud-0 ~]$ ansible overcloud -mshell -b -a'rpm -qa|grep pcs'

controller-0 | SUCCESS | rc=0 >>
pcs-0.9.162-5.el7_5.1.x86_64

controller-1 | SUCCESS | rc=0 >>
pcs-0.9.162-5.el7_5.1.x86_64

controller-2 | SUCCESS | rc=0 >>
pcs-0.9.162-5.el7_5.1.x86_64


#fail test before pcs update : 

[root@controller-1 ~]# pcs cluster destroy --all
Requesting stop of service pacemaker_remote on 'overcloud-novacomputeiha-0', 'overcloud-novacomputeiha-1'
Warning: 172.17.1.10: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60059 milliseconds)
Warning: 172.17.1.13: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60057 milliseconds)
Requesting remove remote node files from 'overcloud-novacomputeiha-0', 'overcloud-novacomputeiha-1'
Warning: 172.17.1.10: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60060 milliseconds)
Warning: 172.17.1.13: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60058 milliseconds)
controller-0: Stopping Cluster (pacemaker)...
controller-1: Stopping Cluster (pacemaker)...
controller-2: Stopping Cluster (pacemaker)...
controller-0: Successfully destroyed cluster
controller-1: Successfully destroyed cluster
controller-2: Successfully destroyed cluster
[root@controller-1 ~]# dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1; chown hacluster. /etc/pacemaker/authkey; chmod 640 /etc/pacemaker/authkey
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000547725 s, 7.5 MB/s
[root@controller-1 ~]# dd if=/dev/urandom of=/etc/corosync/authkey bs=128 count=1; chown root. /etc/corosync/authkey; chmod 400 /etc/corosync/authkey
1+0 records in
1+0 records out
128 bytes (128 B) copied, 0.000478243 s, 268 kB/s
[root@controller-1 ~]# pcs cluster setup --force --name foo --encryption 1 controller-1
Destroying cluster on nodes: controller-1...
controller-1: Stopping Cluster (pacemaker)...
controller-1: Successfully destroyed cluster

Sending 'corosync authkey', 'pacemaker_remote authkey' to 'controller-1'
controller-1: successful distribution of the file 'corosync authkey'
controller-1: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
controller-1: Succeeded

Synchronizing pcsd certificates on nodes controller-1...
controller-1: Success
Restarting pcsd on the nodes in order to reload the certificates...
controller-1: Success
[root@controller-1 ~]# pcs cluster start
Starting Cluster...
[root@controller-1 ~]# pcs cluster node add controller-2
Disabling SBD service...
controller-2: sbd disabled
Sending 'corosync authkey' to 'controller-2'
controller-2: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-2'
Error: Error connecting to controller-2 (HTTP error: 408)
Error: Errors have occurred, therefore pcs is unable to continue



#test pcs update :

[stack@undercloud-0 pacemaker_rpms]$  ansible overcloud -mshell -b -a'rpm -qa|grep pcs'

overcloud-novacomputeiha-0 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64

overcloud-novacomputeiha-1 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64

controller-1 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64

controller-0 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64

controller-2 | SUCCESS | rc=0 >>
pcs-0.9.165-2.el7.x86_64
pcs-snmp-0.9.165-2.el7.x86_64


# test adding a cluster node : 

[root@controller-1 ~]# pcs cluster node add controller-2
Disabling SBD service...
controller-2: sbd disabled
Sending 'corosync authkey' to 'controller-2'
controller-2: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'controller-2'
controller-2: successful distribution of the file 'pacemaker_remote authkey'
controller-1: Corosync updated
Setting up corosync...
controller-2: Succeeded
Synchronizing pcsd certificates on nodes controller-2...
controller-2: Success
Restarting pcsd on the nodes in order to reload the certificates...
controller-2: Success

Comment 11 errata-xmlrpc 2018-10-30 08:06:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3066

Comment 12 Michele Baldessari 2018-11-21 20:57:26 UTC
*** Bug 1564218 has been marked as a duplicate of this bug. ***