Bug 1464657 - Random connection issues during router deployment after upgrade
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.0
Assigned To: Ben Bennett
QA Contact: zhaozhanqi
Keywords: Reopened
Depends On:
Blocks:
Reported: 2017-06-24 06:02 EDT by Nicolas Nosenzo
Modified: 2018-03-28 10:05 EDT (History)
13 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: haproxy versions earlier than 1.8 could drop new connections during a reload. Consequence: Users would see intermittent connection problems, and the SYN eater workaround was difficult to deploy in some environments due to privilege concerns. Fix: Upgrade to haproxy 1.8 and pass the listening sockets to the new process during a reload. Result: The reload is seamless.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-28 10:05:01 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mrobson: needinfo-


Attachments
Router DC (4.73 KB, text/plain), 2017-06-24 06:05 EDT, Nicolas Nosenzo


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2775611 None None None 2018-02-14 18:09 EST
Origin (Github) 18385 None None None 2018-02-02 09:59 EST
Red Hat Product Errata RHBA-2018:0489 None None None 2018-03-28 10:05 EDT

Description Nicolas Nosenzo 2017-06-24 06:02:23 EDT
Description of problem:

After upgrading from 3.4 to 3.5, the router pods can no longer be deployed reliably. The deployment sometimes fails with:

Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer

***
% oc logs ha-router-zrh-60-deploy
--> Scaling up ha-router-zrh-60 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-60 up to 1
Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer
% oc logs ha-router-zrh-61-deploy
--> Scaling up ha-router-zrh-61 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-61 up to 1
    Scaling ha-router-zrh-58 down to 0
    Scaling ha-router-zrh-61 up to 2
Unable to connect to the server: read tcp 10.1.12.16:34500->172.30.0.1:443: read: connection reset by peer
***

Sometimes the deployment runs through, but most of the time it stops with this error.

Version-Release number of selected component (if applicable):
OCP 3.5.5.15-1
Docker 1.12.6-16

How reproducible:
Intermittently, in the customer's environment

Steps to Reproduce:

- Re-deploy the router pod:

--> Scaling up ha-router-zrh-99 from 0 to 2, scaling down ha-router-zrh-98 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-98 down to 1
    Scaling ha-router-zrh-99 up to 1
    Scaling ha-router-zrh-98 down to 0
    Scaling ha-router-zrh-99 up to 2
I0602 11:11:02.439849       1 helpers.go:221] Connection error: Get https://172.30.0.1:443/api/v1/namespaces/default/replicationcontrollers/ha-router-zrh-99: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer
F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Actual results:

F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Expected results:

Router deployments succeed consistently.

Additional info:
Comment 1 Nicolas Nosenzo 2017-06-24 06:05 EDT
Created attachment 1291398 [details]
Router DC
Comment 5 Ben Bennett 2017-07-12 11:39:09 EDT

*** This bug has been marked as a duplicate of bug 1462338 ***
Comment 13 Ben Bennett 2017-08-31 11:00:24 EDT
Can you run the script at https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#further-help while the router is being started, please? Something is wrong with the deployer pod's connection to the master, but it is not clear why it only affects the router pod deployments.
Comment 21 Weibin Liang 2017-10-30 11:58:14 EDT
Following the router scaling steps above, I cannot reproduce the issue in my environment when upgrading from v3.4.1.44.29 to v3.5.5.15.
Comment 33 openshift-github-bot 2018-02-02 00:01:10 EST
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/98f7fc6080bf5f528b29c465473e60fef17f30f2
Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)

https://github.com/openshift/origin/commit/49100d9e4f1ef1be511c3b28b5d5c1f8783f0f41
Merge pull request #18385 from knobunc/bug/bz1464657-seamless-handover-haproxy-reload

Automatic merge from submit-queue (batch tested with PRs 18390, 18389, 18290, 18377, 18385).

Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)
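As a rough illustration of what "passing open sockets" involves, the sketch below builds the haproxy reload command line. All paths here are assumptions for illustration only, not the router image's actual layout. With haproxy 1.8, the new process can fetch the bound listeners from the old process's stats socket via -x (the old instance must publish them with "expose-fd listeners" on its stats socket), and -sf asks the old process to finish its in-flight requests and exit. The sketch only prints the command instead of executing it.

```shell
#!/bin/sh
# Hypothetical paths, assumed for illustration only.
HAPROXY_CONFIG="/var/lib/haproxy/conf/haproxy.config"
HAPROXY_PID_FILE="/var/lib/haproxy/run/haproxy.pid"
HAPROXY_SOCKET="/var/lib/haproxy/run/haproxy.sock"

reload_cmd="haproxy -f $HAPROXY_CONFIG -p $HAPROXY_PID_FILE"

# If an old instance is running and exposes its stats socket, have the new
# process inherit its listening sockets (-x) and tell the old process to
# finish serving in-flight requests and then exit (-sf <old pid>).
if [ -S "$HAPROXY_SOCKET" ] && [ -f "$HAPROXY_PID_FILE" ]; then
    reload_cmd="$reload_cmd -x $HAPROXY_SOCKET -sf $(cat "$HAPROXY_PID_FILE")"
fi

# Sketch only: print the command rather than executing it.
echo "$reload_cmd"
```

Because the listening sockets are handed over rather than closed and re-opened, new connections arriving mid-reload are no longer reset, which is what makes the SYN eater workaround unnecessary.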
Comment 34 Ben Bennett 2018-02-02 09:59:54 EST
Fixed by https://github.com/openshift/origin/pull/18385
Comment 36 zhaozhanqi 2018-02-05 00:54:06 EST
Verified this bug on v3.9.0-0.36.0; the issue can no longer be reproduced.

steps:

1. Create a pod, service, and route.
2. Access the route with many concurrent requests:
 ab -v 2 -r -n 200000 -c 64 http://hello-pod-z1.apps.0205-hk4.qe.rhcloud.com/

3. Create and delete another route while step 2 is running.

4. Check the result of step 2: no "connection reset by peer" errors.
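The verification steps above can be wrapped into a small driver script. This is a sketch only: the route URL and the second service name are assumptions, and when oc or ab are not installed the script falls back to dry-run stubs that just print the commands they would have run.

```shell
#!/bin/sh
# Fall back to dry-run stubs when the real tools are unavailable (sketch only).
if ! command -v oc >/dev/null 2>&1; then oc() { echo "[dry-run] oc $*"; }; fi
if ! command -v ab >/dev/null 2>&1; then ab() { echo "[dry-run] ab $*"; }; fi

ROUTE_URL="http://hello-pod-z1.apps.example.com/"  # assumed route URL

# Step 2: sustained load in the background; -r keeps ab running past socket
# errors so any resets can be counted afterwards.
ab -v 2 -r -n 200000 -c 64 "$ROUTE_URL" > ab.out 2>&1 &
AB_PID=$!

# Step 3: churn a second route to force haproxy reloads during the load.
for i in 1 2 3; do
    oc expose service hello-pod-2 --name="churn-$i"  # assumed service name
    sleep 1
    oc delete route "churn-$i"
done

wait "$AB_PID"

# Step 4: count resets reported by ab; zero means the reload was seamless.
RESETS=$(grep -c "reset by peer" ab.out || true)
echo "resets seen: $RESETS"
```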
Comment 53 errata-xmlrpc 2018-03-28 10:05:01 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
