Description of problem:
"/usr/bin/oc get hostsubnet -o json -n default" times out consistently during openshift-ansible upgrade attempts.

Version-Release number of selected component (if applicable):
oc v3.7.0-0.104.0
kubernetes v1.7.0+695f48a16f
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
100% on this cluster at the moment

Steps to Reproduce:
1. Several oc invocations hang with this problem:
   a. /usr/bin/oc get hostsubnet -o json -n default
   b. oc get pods --all-namespaces

Actual results:
The command terminates with a timeout error.

Expected results:
The command should return without error.

Additional info:
- The condition arose after several failed attempts to run storage migration: https://bugzilla.redhat.com/show_bug.cgi?id=1489168
- "oc projects" returns quickly.
- When run with logging (/usr/bin/oc get hostsubnet -o json -n default --loglevel=8):

I0906 21:18:56.880943   64171 round_trippers.go:383] GET https://internal.api.free-int.openshift.com:443/oapi/v1/hostsubnets
I0906 21:18:56.880951   64171 round_trippers.go:390] Request Headers:
I0906 21:18:56.880956   64171 round_trippers.go:393]     Accept: application/json
I0906 21:18:56.880981   64171 round_trippers.go:393]     User-Agent: oc/v1.7.0+695f48a16f (linux/amd64) kubernetes/d2e5420
I0906 21:19:56.911301   64171 round_trippers.go:408] Response Status: 504 Gateway Timeout in 60030 milliseconds
I0906 21:19:56.911322   64171 round_trippers.go:411] Response Headers:
I0906 21:19:56.911328   64171 round_trippers.go:414]     Cache-Control: no-store
I0906 21:19:56.911332   64171 round_trippers.go:414]     Content-Type: text/plain; charset=utf-8
I0906 21:19:56.911336   64171 round_trippers.go:414]     Content-Length: 224
I0906 21:19:56.911340   64171 round_trippers.go:414]     Date: Wed, 06 Sep 2017 21:19:56 GMT
I0906 21:19:56.911377   64171 request.go:994] Response Body: {"metadata":{},"status":"Failure","message":"The list operation against hostsubnets could not be completed at this time, please try again.","reason":"ServerTimeout","details":{"name":"list","kind":"hostsubnets"},"code":500}
{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}
I0906 21:19:56.911699   64171 helpers.go:206] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server cannot complete the requested operation at this time, try again later (get hostsubnets)",
  "reason": "ServerTimeout",
  "details": {
    "kind": "hostsubnets",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"The list operation against hostsubnets could not be completed at this time, please try again.\",\"reason\":\"ServerTimeout\",\"details\":{\"name\":\"list\",\"kind\":\"hostsubnets\"},\"code\":500}"
      }
    ]
  },
  "code": 504
}]
F0906 21:19:56.911730   64171 helpers.go:120] Error from server (ServerTimeout): the server cannot complete the requested operation at this time, try again later (get hostsubnets)
Moving this bug to the upgrade component; we'll need to verify that the upgrade aborts when there isn't enough space to take the backup. It's possible that there was sufficient space when the backup was created, but the subsequent `oc adm migrate storage` run re-wrote enough data to fill the disk. So when we take our backup we should require 2x the amount of space needed for the backup rather than 1x; that would leave room to take the backup and then continue re-writing lots of data, as sketched below. The cluster in question has been restored to service after deleting all but the two most recent backups.
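For illustration, a minimal shell sketch of the proposed pre-backup check, assuming the backup lands on the same filesystem as /var/lib/etcd. It mirrors the df/du probes quoted in the verification comments below, but it is not the exact openshift-ansible implementation:

    # Require 2x the current etcd data size to be free before backing up,
    # so the backup itself plus subsequent storage-migration rewrites both fit.
    avail=$(df --output=avail -k /var/lib/etcd/ | tail -n 1)
    used=$(du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1)
    if [ "$((used * 2))" -gt "$avail" ]; then
        echo "$((used * 2)) Kb disk space required for etcd backup, ${avail} Kb available." >&2
        exit 1
    fi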
Upstream PR: https://github.com/openshift/openshift-ansible/pull/5377
Merged upstream
Bug verification blocked by bug 1451023.
Version: atomic-openshift-utils-3.7.0-0.126.4.git.0.3fc2b9b.el7.noarch

Steps:
1. Install OCP 3.6 with one dedicated etcd host.
   # openshift version
   openshift v3.6.173.0.37
   kubernetes v1.6.1+5115d708d7
   etcd 3.2.1
   # oc get pods --all-namespaces
   NAMESPACE      NAME                             READY     STATUS      RESTARTS   AGE
   default        docker-registry-3-ft0f5          1/1       Running     0          2h
   default        registry-console-1-klt6d         1/1       Running     0          2h
   default        router-1-pf2td                   1/1       Running     0          2h
   install-test   mongodb-1-1jjrn                  1/1       Running     0          2h
   install-test   nodejs-mongodb-example-1-build   0/1       Completed   0          2h
   install-test   nodejs-mongodb-example-1-r6gll   1/1       Running     0          2h
   mytest         mongodb-1-fxlkk                  1/1       Running     0          33s
   mytest         nodejs-mongodb-example-1-build   1/1       Running     0          41s
2. Prepare data on the etcd host to fill the /var/lib/etcd directory, then check l_avail_disk and l_etcd_disk_usage:
   # df --output=avail -k /var/lib/etcd/ | tail -n 1
   4662996
   # du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1
   3538672
3. Run the upgrade while 2*l_etcd_disk_usage > l_avail_disk > l_etcd_disk_usage:
   # ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml -v | tee log
4. The upgrade failed as expected, but with a minor error in the message:
   MSG: 3538672 Kb disk space required for etcd backup, 4662996 Kb available.
   The message should report 2*l_etcd_disk_usage (7077344 Kb) as the required space.
5. Clean some data off the etcd host, then check l_avail_disk and l_etcd_disk_usage again:
   # df --output=avail -k /var/lib/etcd/ | tail -n 1
   6112600
   # du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1
   2113888
6. Run the upgrade again while 2*l_etcd_disk_usage < l_avail_disk. The upgrade succeeded:
   # openshift version
   openshift v3.7.0-0.126.4
   kubernetes v1.7.0+80709908fd
   etcd 3.2.1

The major fix works well, but there is still the minor issue with the error message in step 4. Assigning the bug back for a further fix.
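As a convenience when reproducing steps 2-6, here is a small shell check (an illustrative sketch, not part of the playbooks) that reports which side of the new threshold a host falls on:

    # Classify the host against the reproduce conditions above: the upgrade
    # should abort when used < avail < 2*used and proceed when avail > 2*used.
    avail=$(df --output=avail -k /var/lib/etcd/ | tail -n 1)
    used=$(du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1)
    echo "avail=${avail} Kb used=${used} Kb 2*used=$((used * 2)) Kb"
    if [ "$avail" -gt "$((used * 2))" ]; then
        echo "Above the 2x threshold: the upgrade pre-check should pass."
    elif [ "$avail" -gt "$used" ]; then
        echo "In the 1x-2x window: the new check should abort the upgrade."
    else
        echo "Below 1x: even a plain backup would not fit."
    fi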
https://github.com/openshift/openshift-ansible/pull/5450
Verified on atomic-openshift-utils-3.7.0-0.127.0.git.0.b9941e4.el7.noarch.rpm; the hint message has been updated.

TASK [etcd_common : Check current etcd disk usage] **************************************************************************************************************************
ok: [x.x.x.x] => {
    "changed": false,
    "cmd": "du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1",
    "delta": "0:00:00.895742",
    "end": "2017-09-22 05:37:27.876136",
    "rc": 0,
    "start": "2017-09-22 05:37:26.980394"
}

STDOUT:
3700640

MSG:
7401280 Kb disk space required for etcd backup, 4939480 Kb available.

The required figure is now correctly doubled (7401280 Kb = 2 x 3700640 Kb).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188