Bug 1464657

Summary: Random connection issues during router deployment after upgrade
Product: OpenShift Container Platform
Reporter: Nicolas Nosenzo <nnosenzo>
Component: Networking
Assignee: Ben Bennett <bbennett>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aos-bugs, bbennett, byount, cjo, dcbw, dzhukous, eparis, knakayam, mrobson, nnosenzo, rkharwar, sreber, weliang
Version: 3.5.1
Keywords: Reopened
Flags: mrobson: needinfo-
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: haproxy versions earlier than 1.8 could drop new connections during a reload.
Consequence: Users saw intermittent connection problems, and the "SYN eater" workaround was difficult to deploy in some environments due to privilege concerns (a rough sketch of that workaround follows the attachment list below).
Fix: Upgrade to haproxy 1.8 and pass the listening sockets to the new process on reload.
Result: The reload is seamless.
Last Closed: 2018-03-28 14:05:01 UTC
Type: Bug
Attachments:
Router DC (see comment 1)
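
The Doc Text above mentions the "SYN eater" workaround that predates the haproxy upgrade. As a rough illustration only (this is not the actual OpenShift script; the ports and the reload command are assumptions), the idea was to drop inbound SYNs around the reload so clients silently retransmit and land on the new process:

    # Illustrative SYN-eater sketch; ports and paths are assumed, not OpenShift's.
    iptables -I INPUT -p tcp --dport 80 --syn -j DROP
    iptables -I INPUT -p tcp --dport 443 --syn -j DROP
    haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf $(cat /run/haproxy.pid)
    iptables -D INPUT -p tcp --dport 80 --syn -j DROP
    iptables -D INPUT -p tcp --dport 443 --syn -j DROP
    # Requires CAP_NET_ADMIN, which is the privilege concern the Doc Text notes.

Clients never see a RST for an eaten SYN; TCP simply retransmits it, so the gap is invisible as long as the reload is quick.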

Description Nicolas Nosenzo 2017-06-24 10:02:23 UTC
Description of problem:

After upgrading from 3.4 to 3.5, the router pods can no longer be deployed reliably. Deployments intermittently fail with:

Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer

***
% oc logs ha-router-zrh-60-deploy
--> Scaling up ha-router-zrh-60 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-60 up to 1
Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer
% oc logs ha-router-zrh-61-deploy
--> Scaling up ha-router-zrh-61 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-61 up to 1
    Scaling ha-router-zrh-58 down to 0
    Scaling ha-router-zrh-61 up to 2
Unable to connect to the server: read tcp 10.1.12.16:34500->172.30.0.1:443: read: connection reset by peer
***

Sometimes the deployment runs through, but most of the time it stops with this error.

Version-Release number of selected component (if applicable):
OCP 3.5.5.15-1
Docker 1.12.6-16

How reproducible:
Intermittent; observed in the customer's environment.

Steps to Reproduce:

- Re-deploy the router pod:

--> Scaling up ha-router-zrh-99 from 0 to 2, scaling down ha-router-zrh-98 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-98 down to 1
    Scaling ha-router-zrh-99 up to 1
    Scaling ha-router-zrh-98 down to 0
    Scaling ha-router-zrh-99 up to 2
I0602 11:11:02.439849       1 helpers.go:221] Connection error: Get https://172.30.0.1:443/api/v1/namespaces/default/replicationcontrollers/ha-router-zrh-99: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer
F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Actual results:

F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Expected results:

Router deployments succeed every time.

Additional info:

Comment 1 Nicolas Nosenzo 2017-06-24 10:05:49 UTC
Created attachment 1291398 [details]
Router DC

Comment 5 Ben Bennett 2017-07-12 15:39:09 UTC

*** This bug has been marked as a duplicate of bug 1462338 ***

Comment 13 Ben Bennett 2017-08-31 15:00:24 UTC
Can you please run the script at https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#further-help while the router is starting? Something is wrong with the deployer pod's connection to the master, but it is not clear why only the router pod deployments are affected.

Comment 21 Weibin Liang 2017-10-30 15:58:14 UTC
Following the router scaling steps above, I cannot reproduce the issue in my environment after upgrading from v3.4.1.44.29 to v3.5.5.15.

Comment 33 openshift-github-bot 2018-02-02 05:01:10 UTC
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/98f7fc6080bf5f528b29c465473e60fef17f30f2
Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)

https://github.com/openshift/origin/commit/49100d9e4f1ef1be511c3b28b5d5c1f8783f0f41
Merge pull request #18385 from knobunc/bug/bz1464657-seamless-handover-haproxy-reload

Automatic merge from submit-queue (batch tested with PRs 18390, 18389, 18290, 18377, 18385).

Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)
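
For context, a minimal sketch of the seamless-reload wiring these commits adopt, with illustrative paths rather than the router image's actual layout:

    # haproxy.cfg must publish listener file descriptors on a stats socket:
    #
    #     global
    #         stats socket /var/run/haproxy.sock mode 600 level admin expose-fd listeners
    #
    # On reload, the new process fetches the listening sockets from the old one
    # over that socket (-x) and then asks it to finish and exit (-sf), so the
    # listeners are never closed and no SYN is dropped:
    haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -x /var/run/haproxy.sock \
        -sf $(cat /var/run/haproxy.pid)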

Comment 34 Ben Bennett 2018-02-02 14:59:54 UTC
Fixed by https://github.com/openshift/origin/pull/18385

Comment 36 zhaozhanqi 2018-02-05 05:54:06 UTC
Verified this bug on v3.9.0-0.36.0; the issue can no longer be reproduced.

steps:

1. Create a pod, service, and routes.
2. Access the route with many concurrent requests:
 ab -v 2 -r -n 200000 -c 64 http://hello-pod-z1.apps.0205-hk4.qe.rhcloud.com/ 

3. While step 2 is running, create and delete another route (a sketch follows the list).

4. Check the result of step 2: no 'connection reset by peer' errors.
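
A possible shape for step 3, with made-up service and route names, run while the ab command from step 2 is in flight:

    # Hypothetical names; any service in the project would do.
    while true; do
        oc expose service hello-service --name=churn-route
        sleep 5
        oc delete route churn-route
        sleep 5
    done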

Comment 53 errata-xmlrpc 2018-03-28 14:05:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489