Bug 1464657 - Random connection issues during router deployment after upgrade
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.0
Assigned To: Ben Bennett
QA Contact: zhaozhanqi
Keywords: Reopened
Depends On:
Blocks:
Reported: 2017-06-24 06:02 EDT by Nicolas Nosenzo
Modified: 2018-03-28 10:05 EDT (History)
13 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: haproxy versions earlier than 1.8 could drop new connections during a reload. Consequence: Users would see intermittent connection problems, and the SYN eater workaround was difficult to deploy in some environments due to privilege concerns. Fix: Upgrade to haproxy 1.8 and pass the listening sockets to the new process during a reload. Result: The reload is seamless.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-28 10:05:01 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mrobson: needinfo-


Attachments
Router DC (4.73 KB, text/plain), 2017-06-24 06:05 EDT, Nicolas Nosenzo


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2775611 None None None 2018-02-14 18:09 EST
Origin (Github) 18385 None None None 2018-02-02 09:59 EST
Red Hat Product Errata RHBA-2018:0489 None None None 2018-03-28 10:05 EDT

Description Nicolas Nosenzo 2017-06-24 06:02:23 EDT
Description of problem:

After upgrading from 3.4 to 3.5, the router pods can no longer be deployed reliably. The deployment sometimes fails with:

Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer

***
% oc logs ha-router-zrh-60-deploy
--> Scaling up ha-router-zrh-60 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-60 up to 1
Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer
% oc logs ha-router-zrh-61-deploy
--> Scaling up ha-router-zrh-61 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-61 up to 1
    Scaling ha-router-zrh-58 down to 0
    Scaling ha-router-zrh-61 up to 2
Unable to connect to the server: read tcp 10.1.12.16:34500->172.30.0.1:443: read: connection reset by peer
***

Sometimes the deployment runs through, but most of the time it stops with this error.

Version-Release number of selected component (if applicable):
OCP 3.5.5.15-1
Docker 1.12.6-16

How reproducible:
Intermittently, in the customer's environment

Steps to Reproduce:

- Re-deploy the router pod:

--> Scaling up ha-router-zrh-99 from 0 to 2, scaling down ha-router-zrh-98 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-98 down to 1
    Scaling ha-router-zrh-99 up to 1
    Scaling ha-router-zrh-98 down to 0
    Scaling ha-router-zrh-99 up to 2
I0602 11:11:02.439849       1 helpers.go:221] Connection error: Get https://172.30.0.1:443/api/v1/namespaces/default/replicationcontrollers/ha-router-zrh-99: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer
F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Actual results:

F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Expected results:

Router deployments succeed consistently.

Additional info:
Comment 1 Nicolas Nosenzo 2017-06-24 06:05 EDT
Created attachment 1291398 [details]
Router DC
Comment 5 Ben Bennett 2017-07-12 11:39:09 EDT

*** This bug has been marked as a duplicate of bug 1462338 ***
Comment 13 Ben Bennett 2017-08-31 11:00:24 EDT
Can you run the script at https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#further-help while the router is being started, please? Something is wrong with the deployer pod's connection to the master, but it is not clear why it only affects the router pod deployments.
Comment 21 Weibin Liang 2017-10-30 11:58:14 EDT
Following the router scaling steps above, I cannot reproduce the issue in my environment when upgrading from v3.4.1.44.29 to v3.5.5.15.
Comment 33 openshift-github-bot 2018-02-02 00:01:10 EST
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/98f7fc6080bf5f528b29c465473e60fef17f30f2
Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)

https://github.com/openshift/origin/commit/49100d9e4f1ef1be511c3b28b5d5c1f8783f0f41
Merge pull request #18385 from knobunc/bug/bz1464657-seamless-handover-haproxy-reload

Automatic merge from submit-queue (batch tested with PRs 18390, 18389, 18290, 18377, 18385).

Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)
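As a rough illustration of what "passing open sockets" involves, the sketch below builds the haproxy reload command line. All paths here are assumptions for illustration only, not the router image's actual layout. With haproxy 1.8, the new process can fetch the bound listeners from the old process's stats socket via -x (the old instance must publish them with "expose-fd listeners" on its stats socket), and -sf asks the old process to finish its in-flight requests and exit. The sketch only prints the command instead of executing it.

```shell
#!/bin/sh
# Hypothetical paths, assumed for illustration only.
HAPROXY_CONFIG="/var/lib/haproxy/conf/haproxy.config"
HAPROXY_PID_FILE="/var/lib/haproxy/run/haproxy.pid"
HAPROXY_SOCKET="/var/lib/haproxy/run/haproxy.sock"

reload_cmd="haproxy -f $HAPROXY_CONFIG -p $HAPROXY_PID_FILE"

# If an old instance is running and exposes its stats socket, have the new
# process inherit its listening sockets (-x) and tell the old process to
# finish serving in-flight requests and then exit (-sf <old pid>).
if [ -S "$HAPROXY_SOCKET" ] && [ -f "$HAPROXY_PID_FILE" ]; then
    reload_cmd="$reload_cmd -x $HAPROXY_SOCKET -sf $(cat "$HAPROXY_PID_FILE")"
fi

# Sketch only: print the command rather than executing it.
echo "$reload_cmd"
```

Because the listening sockets are handed over rather than closed and re-opened, new connections arriving mid-reload are no longer reset, which is what makes the SYN eater workaround unnecessary.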
Comment 34 Ben Bennett 2018-02-02 09:59:54 EST
Fixed by https://github.com/openshift/origin/pull/18385
Comment 36 zhaozhanqi 2018-02-05 00:54:06 EST
Verified this bug on v3.9.0-0.36.0; the issue can no longer be reproduced.

steps:

1. Create a pod, service, and route.
2. Access the route with many concurrent requests:
 ab -v 2 -r -n 200000 -c 64 http://hello-pod-z1.apps.0205-hk4.qe.rhcloud.com/

3. Create and delete another route while step 2 is running.

4. Check the result of step 2: no "connection reset by peer" errors.
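The verification steps above can be wrapped into a small driver script. This is a sketch only: the route URL and the second service name are assumptions, and when oc or ab are not installed the script falls back to dry-run stubs that just print the commands they would have run.

```shell
#!/bin/sh
# Fall back to dry-run stubs when the real tools are unavailable (sketch only).
if ! command -v oc >/dev/null 2>&1; then oc() { echo "[dry-run] oc $*"; }; fi
if ! command -v ab >/dev/null 2>&1; then ab() { echo "[dry-run] ab $*"; }; fi

ROUTE_URL="http://hello-pod-z1.apps.example.com/"  # assumed route URL

# Step 2: sustained load in the background; -r keeps ab running past socket
# errors so any resets can be counted afterwards.
ab -v 2 -r -n 200000 -c 64 "$ROUTE_URL" > ab.out 2>&1 &
AB_PID=$!

# Step 3: churn a second route to force haproxy reloads during the load.
for i in 1 2 3; do
    oc expose service hello-pod-2 --name="churn-$i"  # assumed service name
    sleep 1
    oc delete route "churn-$i"
done

wait "$AB_PID"

# Step 4: count resets reported by ab; zero means the reload was seamless.
RESETS=$(grep -c "reset by peer" ab.out || true)
echo "resets seen: $RESETS"
```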
Comment 53 errata-xmlrpc 2018-03-28 10:05:01 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
