Bug 1464657

Summary: Random connection issues during router deployment after upgrade
Product: OpenShift Container Platform
Reporter: Nicolas Nosenzo <nnosenzo>
Component: Networking
Assignee: Ben Bennett <bbennett>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aos-bugs, bbennett, byount, cjo, dcbw, dzhukous, eparis, knakayam, mrobson, nnosenzo, rkharwar, sreber, weliang
Version: 3.5.1
Keywords: Reopened
Flags: mrobson: needinfo-
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: haproxy versions earlier than 1.8 could drop new connections during a reload.
Consequence: Users saw intermittent connection problems, and the "SYN eater" workaround was difficult to deploy in some environments due to privilege concerns (a rough sketch of that workaround follows the attachment list below).
Fix: Upgrade to haproxy 1.8 and pass the listening sockets to the new process on reload.
Result: The reload is seamless.
Last Closed: 2018-03-28 14:05:01 UTC
Type: Bug
Attachments:
Router DC (see comment 1)
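
The Doc Text above mentions the "SYN eater" workaround that predates the haproxy upgrade. As a rough illustration only (this is not the actual OpenShift script; the ports and the reload command are assumptions), the idea was to drop inbound SYNs around the reload so clients silently retransmit and land on the new process:

    # Illustrative SYN-eater sketch; ports and paths are assumed, not OpenShift's.
    iptables -I INPUT -p tcp --dport 80 --syn -j DROP
    iptables -I INPUT -p tcp --dport 443 --syn -j DROP
    haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf $(cat /run/haproxy.pid)
    iptables -D INPUT -p tcp --dport 80 --syn -j DROP
    iptables -D INPUT -p tcp --dport 443 --syn -j DROP
    # Requires CAP_NET_ADMIN, which is the privilege concern the Doc Text notes.

Clients never see a RST for an eaten SYN; TCP simply retransmits it, so the gap is invisible as long as the reload is quick.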

Description Nicolas Nosenzo 2017-06-24 10:02:23 UTC
Description of problem:

After upgrading from 3.4 to 3.5, the router pods can no longer be deployed reliably. Deployments intermittently fail with:

Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer

***
% oc logs ha-router-zrh-60-deploy
--> Scaling up ha-router-zrh-60 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-60 up to 1
Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer
% oc logs ha-router-zrh-61-deploy
--> Scaling up ha-router-zrh-61 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-61 up to 1
    Scaling ha-router-zrh-58 down to 0
    Scaling ha-router-zrh-61 up to 2
Unable to connect to the server: read tcp 10.1.12.16:34500->172.30.0.1:443: read: connection reset by peer
***

Sometimes the deployment runs through, but most of the time it stops with this error.

Version-Release number of selected component (if applicable):
OCP 3.5.5.15-1
Docker 1.12.6-16

How reproducible:
Intermittent; observed in the customer's environment.

Steps to Reproduce:

- Re-deploy the router pod:

--> Scaling up ha-router-zrh-99 from 0 to 2, scaling down ha-router-zrh-98 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-98 down to 1
    Scaling ha-router-zrh-99 up to 1
    Scaling ha-router-zrh-98 down to 0
    Scaling ha-router-zrh-99 up to 2
I0602 11:11:02.439849       1 helpers.go:221] Connection error: Get https://172.30.0.1:443/api/v1/namespaces/default/replicationcontrollers/ha-router-zrh-99: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer
F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Actual results:

F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Expected results:

Router deployments succeed every time.

Additional info:

Comment 1 Nicolas Nosenzo 2017-06-24 10:05:49 UTC
Created attachment 1291398 [details]
Router DC

Comment 5 Ben Bennett 2017-07-12 15:39:09 UTC

*** This bug has been marked as a duplicate of bug 1462338 ***

Comment 13 Ben Bennett 2017-08-31 15:00:24 UTC
Can you please run the script at https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#further-help while the router is starting? Something is wrong with the deployer pod's connection to the master, but it is not clear why only the router pod deployments are affected.

Comment 21 Weibin Liang 2017-10-30 15:58:14 UTC
Following the router scaling steps above, I cannot reproduce the issue in my environment after upgrading from v3.4.1.44.29 to v3.5.5.15.

Comment 33 openshift-github-bot 2018-02-02 05:01:10 UTC
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/98f7fc6080bf5f528b29c465473e60fef17f30f2
Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)

https://github.com/openshift/origin/commit/49100d9e4f1ef1be511c3b28b5d5c1f8783f0f41
Merge pull request #18385 from knobunc/bug/bz1464657-seamless-handover-haproxy-reload

Automatic merge from submit-queue (batch tested with PRs 18390, 18389, 18290, 18377, 18385).

Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)
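
For context, a minimal sketch of the seamless-reload wiring these commits adopt, with illustrative paths rather than the router image's actual layout:

    # haproxy.cfg must publish listener file descriptors on a stats socket:
    #
    #     global
    #         stats socket /var/run/haproxy.sock mode 600 level admin expose-fd listeners
    #
    # On reload, the new process fetches the listening sockets from the old one
    # over that socket (-x) and then asks it to finish and exit (-sf), so the
    # listeners are never closed and no SYN is dropped:
    haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -x /var/run/haproxy.sock \
        -sf $(cat /var/run/haproxy.pid)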

Comment 34 Ben Bennett 2018-02-02 14:59:54 UTC
Fixed by https://github.com/openshift/origin/pull/18385

Comment 36 zhaozhanqi 2018-02-05 05:54:06 UTC
Verified this bug on v3.9.0-0.36.0; the issue can no longer be reproduced.

steps:

1. Create a pod, service, and routes.
2. Access the route with many concurrent requests:
 ab -v 2 -r -n 200000 -c 64 http://hello-pod-z1.apps.0205-hk4.qe.rhcloud.com/ 

3. While step 2 is running, create and delete another route (a sketch follows the list).

4. Check the result of step 2: no 'connection reset by peer' errors.
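
A possible shape for step 3, with made-up service and route names, run while the ab command from step 2 is in flight:

    # Hypothetical names; any service in the project would do.
    while true; do
        oc expose service hello-service --name=churn-route
        sleep 5
        oc delete route churn-route
        sleep 5
    done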

Comment 53 errata-xmlrpc 2018-03-28 14:05:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489