Description of problem:
After upgrading from 3.4 to 3.5, the router pods can no longer be deployed reliably. When deploying the pods, the deployment sometimes fails with:

Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer

***
% oc logs ha-router-zrh-60-deploy
--> Scaling up ha-router-zrh-60 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-60 up to 1
Unable to connect to the server: read tcp 10.1.11.32:35420->172.30.0.1:443: read: connection reset by peer

% oc logs ha-router-zrh-61-deploy
--> Scaling up ha-router-zrh-61 from 0 to 2, scaling down ha-router-zrh-58 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-58 down to 1
    Scaling ha-router-zrh-61 up to 1
    Scaling ha-router-zrh-58 down to 0
    Scaling ha-router-zrh-61 up to 2
Unable to connect to the server: read tcp 10.1.12.16:34500->172.30.0.1:443: read: connection reset by peer
***

Sometimes the deployment runs through, but most of the time it stops with this error message.

Version-Release number of selected component (if applicable):
OCP 3.5.5.15-1
Docker 1.12.6-16

How reproducible:
Sometimes, in the customer environment

Steps to Reproduce:
- Re-deploy the router pod

--> Scaling up ha-router-zrh-99 from 0 to 2, scaling down ha-router-zrh-98 from 2 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ha-router-zrh-98 down to 1
    Scaling ha-router-zrh-99 up to 1
    Scaling ha-router-zrh-98 down to 0
    Scaling ha-router-zrh-99 up to 2
I0602 11:11:02.439849       1 helpers.go:221] Connection error: Get https://172.30.0.1:443/api/v1/namespaces/default/replicationcontrollers/ha-router-zrh-99: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer
F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Actual results:
F0602 11:11:02.439922       1 helpers.go:116] Unable to connect to the server: read tcp 10.1.12.55:44712->172.30.0.1:443: read: connection reset by peer

Expected results:
Router deployment working every time.

Additional info:
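For reference, the redeploy can be triggered and watched with commands along these lines (a sketch; the DC name ha-router-zrh, the default namespace, and the <N> placeholder are assumptions based on the logs above):

  # Trigger a new router deployment and follow the deployer pod, which is
  # where the "connection reset by peer" error shows up:
  oc rollout latest dc/ha-router-zrh -n default
  oc get pods -n default -w                      # wait for ha-router-zrh-<N>-deploy to appear
  oc logs -f ha-router-zrh-<N>-deploy -n default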
Created attachment 1291398 [details] Router DC
*** This bug has been marked as a duplicate of bug 1462338 ***
Can you run the script at https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#further-help while the router is being started please? Something is wrong with the deployer pod's connection to the master, but I have no clue why it only affects the router pod deployments.
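In the meantime, a rough connectivity check (not a replacement for the SDN debug script) would be to poll the kubernetes service VIP from the node hosting the deployer pod while the router redeploys; 172.30.0.1:443 is the address taken from the error messages above, and the endpoint path is an assumption:

  # Poll the API service VIP once per second and flag any failed request:
  while true; do
    curl -sk -o /dev/null -w '%{http_code}\n' https://172.30.0.1:443/healthz || echo FAILED
    sleep 1
  done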
Following the router scaling steps above, I cannot reproduce the issue in my environment after upgrading from v3.4.1.44.29 to v3.5.5.15.
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/98f7fc6080bf5f528b29c465473e60fef17f30f2
Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)

https://github.com/openshift/origin/commit/49100d9e4f1ef1be511c3b28b5d5c1f8783f0f41
Merge pull request #18385 from knobunc/bug/bz1464657-seamless-handover-haproxy-reload

Automatic merge from submit-queue (batch tested with PRs 18390, 18389, 18290, 18377, 18385).

Make haproxy pass open sockets when reloading

This changes the way we do a reload to take advantage of haproxy 1.8's seamless reload feature (described in https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

Fixes bug 1464657 (https://bugzilla.redhat.com/show_bug.cgi?id=1464657)
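For context, haproxy 1.8's seamless reload works by having the new haproxy process fetch the open listening sockets from the old one over the stats socket, so existing and new connections are not reset during a reload. A minimal sketch of the mechanism (not the exact configuration or paths the router template uses; the socket and pid file paths here are assumptions):

  # haproxy.cfg must expose listener FDs on the stats socket, e.g. in the
  # global section:
  #   stats socket /var/lib/haproxy/run/haproxy.sock mode 600 level admin expose-fd listeners
  #
  # On reload, the new process takes over the sockets via -x while -sf lets
  # the old process finish in-flight requests:
  haproxy -f /etc/haproxy/haproxy.cfg \
          -x /var/lib/haproxy/run/haproxy.sock \
          -sf $(cat /var/run/haproxy.pid)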
Fixed by https://github.com/openshift/origin/pull/18385
Verified this bug on v3.9.0-0.36.0; the issue cannot be reproduced.

Steps:
1. Create pod/svc/routes.
2. Access the route with many concurrent requests: ab -v 2 -r -n 200000 -c 64 http://hello-pod-z1.apps.0205-hk4.qe.rhcloud.com/
3. Create and delete another route while step 2 is running.
4. Check the result of step 2: no 'connection reset by peer' errors.
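For reference, step 3 can be driven with a loop along these lines (a sketch; the service and route names are assumptions), which forces the router to reload haproxy repeatedly while the ab load test is in progress:

  # Repeatedly create and delete a second route so the router keeps reloading
  # haproxy during the load test from step 2:
  for i in $(seq 1 20); do
    oc expose svc hello-pod --name=churn-route
    sleep 5
    oc delete route churn-route
    sleep 5
  done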
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489