Bug 1600169
| Summary: | Cannot add new node to pcs cluster - Error: Error connecting to controller-3 (HTTP error: 408) | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Michele Baldessari <michele> |
| Component: | pcs | Assignee: | Ondrej Mular <omular> |
| Status: | CLOSED ERRATA | QA Contact: | pkomarov |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.5 | CC: | ahrechan, cfeist, chjones, cluster-maint, dciabrin, idevat, michele, omular, pkomarov, tojeline |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | pcs-0.9.165-2.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-30 08:06:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1599758 | | |
Description
Michele Baldessari, 2018-07-11 15:04:19 UTC
Reproduced the HTTP error with the following minimal environment:

- 2 VMs running RHEL 7.5, named rhel1 and rhel2
- the following package versions:

      pacemaker-cli-1.1.18-11.el7_5.3.x86_64
      pcs-0.9.162-5.el7_5.1.x86_64
      pacemaker-remote-1.1.18-11.el7_5.3.x86_64
      pacemaker-libs-1.1.18-11.el7_5.3.x86_64
      pacemaker-cluster-libs-1.1.18-11.el7_5.3.x86_64
      pacemaker-cts-1.1.18-11.el7_5.3.x86_64
      pacemaker-1.1.18-11.el7_5.3.x86_64

The reproducer consists of setting up the cluster with pre-existing auth keys, as we do in OpenStack.

On rhel1:

    dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1; chown hacluster. /etc/pacemaker/authkey; chmod 640 /etc/pacemaker/authkey
    dd if=/dev/urandom of=/etc/corosync/authkey bs=128 count=1; chown root. /etc/corosync/authkey; chmod 400 /etc/corosync/authkey
    pcs cluster setup --force --name foo --encryption 1 rhel1
    pcs cluster start

On rhel2:

    # make sure no pre-existing cluster is in the way
    pcs cluster destroy
    rm -f /etc/pacemaker/authkey /etc/corosync/authkey

On rhel1:

    [root@rhel1 ~]# pcs cluster node add rhel2
    Disabling SBD service...
    rhel2: sbd disabled
    Sending 'corosync authkey' to 'rhel2'
    rhel2: successful distribution of the file 'corosync authkey'
    Sending remote node configuration files to 'rhel2'
    Error: Error connecting to rhel2 (HTTP error: 408)
    Error: Errors have occurred, therefore pcs is unable to continue

I've managed to reproduce the issue in my testing environment only under heavy CPU load, and even then not very reliably (~30% success rate). After further investigation, I found out that the cURL library (used in pcs to communicate with pcsd) adds the HTTP header 'Expect: 100-continue' to requests with a larger body (apparently a body of about 1 kB is already sufficient, so the pacemaker authkey triggers it). This header is not supported in pcsd.
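The stall behind the 408 can be sketched with a small, self-contained model (hypothetical client and server written against the Python stdlib, not pcs/pcsd code, and with the timeout shortened): a client that honours 'Expect: 100-continue' withholds the body until the server answers '100 Continue', while a server that ignores the Expect header silently waits for the body, so both sides block until the server's request timeout produces the 408.

```python
# Minimal model of the stall (illustrative only, not pcs/pcsd code):
# the client sends 'Expect: 100-continue' and waits for an interim
# response before sending the body; the server does not implement
# Expect and waits for the body instead, then times out with 408.
import socket
import threading

SERVER_TIMEOUT = 0.5  # stand-in for pcsd's much longer request timeout

def fake_pcsd(conn):
    """Server that ignores Expect: read the headers, wait for a body
    that never arrives, then give up with 408."""
    conn.settimeout(SERVER_TIMEOUT)
    data = b""
    while b"\r\n\r\n" not in data:
        data += conn.recv(4096)
    try:
        conn.recv(4096)  # body never comes: the client is waiting on us
        status = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
    except socket.timeout:
        status = b"HTTP/1.1 408 Request Timeout\r\nContent-Length: 0\r\n\r\n"
    conn.sendall(status)
    conn.close()

def post_with_expect(addr, body):
    """Client that honours Expect: only send the body after '100 Continue'."""
    s = socket.create_connection(addr)
    s.sendall(b"POST / HTTP/1.1\r\nHost: x\r\nContent-Length: %d\r\n"
              b"Expect: 100-continue\r\n\r\n" % len(body))
    reply = s.recv(4096).decode()
    if reply.startswith("HTTP/1.1 100"):  # never happens with fake_pcsd
        s.sendall(body)
        reply = s.recv(4096).decode()
    s.close()
    return reply.split("\r\n")[0]

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
worker = threading.Thread(target=lambda: fake_pcsd(srv.accept()[0]))
worker.start()
result = post_with_expect(srv.getsockname(), b"\x00" * 4096)
worker.join()
srv.close()
print(result)  # → HTTP/1.1 408 Request Timeout
```

In the real bug the body is the 4 kB pacemaker authkey and the timeout is pcsd's, but the handshake mismatch is the same.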
More about the Expect header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expect

Under some circumstances, this can cause a race condition in which pcs is not able to send the whole HTTP request to pcsd, and pcsd therefore times out with a 408 response code. So the solution might be to force the cURL library not to use the Expect header. After this change, I was unable to reproduce the issue anymore. Here is the change I applied:

    index f7fe241..d54e856 100644
    --- a/pcs/common/node_communicator.py
    +++ b/usr/lib/python2.7/site-packages/pcs/common/node_communicator.py
    @@ -532,6 +532,7 @@ def _create_request_handle(request, cookies, timeout):
         handle.setopt(pycurl.SSL_VERIFYHOST, 0)
         handle.setopt(pycurl.SSL_VERIFYPEER, 0)
         handle.setopt(pycurl.NOSIGNAL, 1)  # required for multi-threading
    +    handle.setopt(pycurl.HTTPHEADER, ["Expect: "])
         if cookies:
             handle.setopt(
                 pycurl.COOKIE, _dict_to_cookies(cookies).encode("utf-8")

Michele, can you please test this patch in your environment and let me know if the issue still persists?

Moving to POST as we'll include the patch in the following build.

Hey Ondrej,
> index f7fe241..d54e856 100644
> --- a/pcs/common/node_communicator.py
> +++ b/usr/lib/python2.7/site-packages/pcs/common/node_communicator.py
> @@ -532,6 +532,7 @@ def _create_request_handle(request, cookies, timeout):
> handle.setopt(pycurl.SSL_VERIFYHOST, 0)
> handle.setopt(pycurl.SSL_VERIFYPEER, 0)
> handle.setopt(pycurl.NOSIGNAL, 1) # required for multi-threading
> + handle.setopt(pycurl.HTTPHEADER, ["Expect: "])
> if cookies:
> handle.setopt(
> pycurl.COOKIE, _dict_to_cookies(cookies).encode("utf-8")
>
> Michele, can you please test this patch in your environment and let me know
> if the issue still persists?
Just applied that patch on my previous setup and restarted pcsd, and now node rhel2 is correctly added to the cluster. Works perfectly, thanks!
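As a rough model of why the one-line patch works (the names below are illustrative, not libcurl internals): libcurl adds 'Expect: 100-continue' on its own once a POST body exceeds roughly 1 KiB, and a user-supplied header with an empty value, like the `["Expect: "]` in the patch, tells libcurl to drop that header entirely. A sketch of that decision logic:

```python
# Toy model of libcurl's Expect handling (illustrative, not real libcurl
# code): a large POST body gets an implicit 'Expect: 100-continue', and a
# user-supplied header with an empty value suppresses it.

EXPECT_THRESHOLD = 1024  # libcurl adds Expect for bodies above ~1 KiB

def effective_headers(body, user_headers=()):
    """Return the headers that would go on the wire for this POST."""
    headers = {}
    if len(body) > EXPECT_THRESHOLD:
        headers["Expect"] = "100-continue"  # libcurl's implicit default
    for line in user_headers:
        name, _, value = line.partition(":")
        value = value.strip()
        if value:
            headers[name] = value
        else:
            headers.pop(name, None)         # "Expect: " blanks the header
    return headers

# The 4 KiB pacemaker authkey is big enough to trigger the header...
assert "Expect" in effective_headers(b"\x00" * 4096)
# ...unless it is explicitly blanked, as the pcs patch does:
assert "Expect" not in effective_headers(b"\x00" * 4096, ["Expect: "])
```

With the header suppressed, the client sends headers and body in one go, so pcsd never needs to understand the 100-continue handshake.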
Verified. Environment:

    [stack@undercloud-0 ~]$ rhos-release -L
    Installed repositories (rhel-7.5):
    13
    ceph-3
    ceph-osd-3
    rhel-7.5

    [stack@undercloud-0 ~]$ cat core_puddle_version
    2018-08-31.1

    [stack@undercloud-0 ~]$ ansible overcloud -mshell -b -a'rpm -qa|grep pcs'
    controller-0 | SUCCESS | rc=0 >>
    pcs-0.9.162-5.el7_5.1.x86_64
    controller-1 | SUCCESS | rc=0 >>
    pcs-0.9.162-5.el7_5.1.x86_64
    controller-2 | SUCCESS | rc=0 >>
    pcs-0.9.162-5.el7_5.1.x86_64

# Fail test before pcs update:

    [root@controller-1 ~]# pcs cluster destroy --all
    Requesting stop of service pacemaker_remote on 'overcloud-novacomputeiha-0', 'overcloud-novacomputeiha-1'
    Warning: 172.17.1.10: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60059 milliseconds)
    Warning: 172.17.1.13: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60057 milliseconds)
    Requesting remove remote node files from 'overcloud-novacomputeiha-0', 'overcloud-novacomputeiha-1'
    Warning: 172.17.1.10: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60060 milliseconds)
    Warning: 172.17.1.13: Connection timeout, try setting higher timeout in --request-timeout option (Connection timed out after 60058 milliseconds)
    controller-0: Stopping Cluster (pacemaker)...
    controller-1: Stopping Cluster (pacemaker)...
    controller-2: Stopping Cluster (pacemaker)...
    controller-0: Successfully destroyed cluster
    controller-1: Successfully destroyed cluster
    controller-2: Successfully destroyed cluster

    [root@controller-1 ~]# dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1; chown hacluster. /etc/pacemaker/authkey; chmod 640 /etc/pacemaker/authkey
    1+0 records in
    1+0 records out
    4096 bytes (4.1 kB) copied, 0.000547725 s, 7.5 MB/s

    [root@controller-1 ~]# dd if=/dev/urandom of=/etc/corosync/authkey bs=128 count=1; chown root. /etc/corosync/authkey; chmod 400 /etc/corosync/authkey
    1+0 records in
    1+0 records out
    128 bytes (128 B) copied, 0.000478243 s, 268 kB/s

    [root@controller-1 ~]# pcs cluster setup --force --name foo --encryption 1 controller-1
    Destroying cluster on nodes: controller-1...
    controller-1: Stopping Cluster (pacemaker)...
    controller-1: Successfully destroyed cluster
    Sending 'corosync authkey', 'pacemaker_remote authkey' to 'controller-1'
    controller-1: successful distribution of the file 'corosync authkey'
    controller-1: successful distribution of the file 'pacemaker_remote authkey'
    Sending cluster config files to the nodes...
    controller-1: Succeeded
    Synchronizing pcsd certificates on nodes controller-1...
    controller-1: Success
    Restarting pcsd on the nodes in order to reload the certificates...
    controller-1: Success

    [root@controller-1 ~]# pcs cluster start
    Starting Cluster...

    [root@controller-1 ~]# pcs cluster node add controller-2
    Disabling SBD service...
    controller-2: sbd disabled
    Sending 'corosync authkey' to 'controller-2'
    controller-2: successful distribution of the file 'corosync authkey'
    Sending remote node configuration files to 'controller-2'
    Error: Error connecting to controller-2 (HTTP error: 408)
    Error: Errors have occurred, therefore pcs is unable to continue

# Test pcs update:

    [stack@undercloud-0 pacemaker_rpms]$ ansible overcloud -mshell -b -a'rpm -qa|grep pcs'
    overcloud-novacomputeiha-0 | SUCCESS | rc=0 >>
    pcs-0.9.165-2.el7.x86_64
    pcs-snmp-0.9.165-2.el7.x86_64
    overcloud-novacomputeiha-1 | SUCCESS | rc=0 >>
    pcs-0.9.165-2.el7.x86_64
    pcs-snmp-0.9.165-2.el7.x86_64
    controller-1 | SUCCESS | rc=0 >>
    pcs-0.9.165-2.el7.x86_64
    pcs-snmp-0.9.165-2.el7.x86_64
    controller-0 | SUCCESS | rc=0 >>
    pcs-0.9.165-2.el7.x86_64
    pcs-snmp-0.9.165-2.el7.x86_64
    controller-2 | SUCCESS | rc=0 >>
    pcs-0.9.165-2.el7.x86_64
    pcs-snmp-0.9.165-2.el7.x86_64

# Test adding a cluster node:

    [root@controller-1 ~]# pcs cluster node add controller-2
    Disabling SBD service...
    controller-2: sbd disabled
    Sending 'corosync authkey' to 'controller-2'
    controller-2: successful distribution of the file 'corosync authkey'
    Sending remote node configuration files to 'controller-2'
    controller-2: successful distribution of the file 'pacemaker_remote authkey'
    controller-1: Corosync updated
    Setting up corosync...
    controller-2: Succeeded
    Synchronizing pcsd certificates on nodes controller-2...
    controller-2: Success
    Restarting pcsd on the nodes in order to reload the certificates...
    controller-2: Success

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3066

*** Bug 1564218 has been marked as a duplicate of this bug. ***