Bug 1489182
Summary: | [free-int] /var disk space exhaustion during upgrades [was: API calls hanging with timeout] | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
Component: | Cluster Version Operator | Assignee: | Jan Chaloupka <jchaloup> |
Status: | CLOSED ERRATA | QA Contact: | liujia <jiajliu> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.7.0 | CC: | aos-bugs, decarr, jokerman, jupierce, mkhan, mmccomas, sdodson |
Target Milestone: | --- | Keywords: | DeliveryBlocker |
Target Release: | 3.7.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openshift-ansible-3.7.0-0.126.1.git.0.0bb5b0c.el7.noarch | Doc Type: | Bug Fix |
Doc Text: |
Cause: Master upgrade took more disk space than was initially estimated
Consequence: Insufficient disk space led to an etcd member "no space left on device" failure
Fix: Increase the estimation of disk space that needs to be available before master upgrade can start
Result: A master node is properly upgraded with enough disk space left after the upgrade
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2017-11-28 22:09:56 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Justin Pierce
2017-09-06 21:40:06 UTC
Moving this bug to the upgrade component. We need to verify that the upgrade aborts when there is not enough space to take the backup. It is possible that we had sufficient space when the backup was created, but that `oc adm migrate storage` then rewrote enough data to fill the disk. So when we take the backup, we should ensure that we have 2x the space required for a backup rather than 1x. That would leave room to take the backup and then continue rewriting large amounts of data.

The cluster in question has been restored to service after deleting all but the two most recent backups.

Merged upstream

Bug verification blocked by bug 1451023.

Version:
atomic-openshift-utils-3.7.0-0.126.4.git.0.3fc2b9b.el7.noarch

Steps:
1. Install OCP 3.6 with one dedicated etcd.
# openshift version
openshift v3.6.173.0.37
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# oc get pods --all-namespaces
NAMESPACE      NAME                             READY   STATUS      RESTARTS   AGE
default        docker-registry-3-ft0f5          1/1     Running     0          2h
default        registry-console-1-klt6d         1/1     Running     0          2h
default        router-1-pf2td                   1/1     Running     0          2h
install-test   mongodb-1-1jjrn                  1/1     Running     0          2h
install-test   nodejs-mongodb-example-1-build   0/1     Completed   0          2h
install-test   nodejs-mongodb-example-1-r6gll   1/1     Running     0          2h
mytest         mongodb-1-fxlkk                  1/1     Running     0          33s
mytest         nodejs-mongodb-example-1-build   1/1     Running     0          41s

2. Prepare data on the etcd host to fill the /var/lib/etcd directory. Check l_avail_disk and l_etcd_disk_usage.
# df --output=avail -k /var/lib/etcd/ | tail -n 1
4662996
# du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1
3538672

3. Run the upgrade when 2*l_etcd_disk_usage > l_avail_disk > l_etcd_disk_usage.
# ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml -v | tee log

4. The upgrade failed as expected, but the error message is slightly wrong.
MSG:
3538672 Kb disk space required for etcd backup, 4662996 Kb available.

The message should report 2*l_etcd_disk_usage as the required space.

5. Clean up some data on the etcd host, then check l_avail_disk and l_etcd_disk_usage again.
# df --output=avail -k /var/lib/etcd/ | tail -n 1
6112600
# du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1
2113888

6. Run the upgrade again when 2*l_etcd_disk_usage < l_avail_disk.
The upgrade succeeded.
# openshift version
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd
etcd 3.2.1

The main fix works well, but the error message in step 4 still needs a small correction. Assigning the bug back for a further fix.

Verified on atomic-openshift-utils-3.7.0-0.127.0.git.0.b9941e4.el7.noarch.rpm; the hint message has been updated.

TASK [etcd_common : Check current etcd disk usage] **************************************************************************************************************************
ok: [x.x.x.x] => {
    "changed": false,
    "cmd": "du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1",
    "delta": "0:00:00.895742",
    "end": "2017-09-22 05:37:27.876136",
    "rc": 0,
    "start": "2017-09-22 05:37:26.980394"
}

STDOUT:

3700640

MSG:

7401280 Kb disk space required for etcd backup, 4939480 Kb available.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
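For readers who want to reason about the pre-upgrade check discussed in this report, here is a minimal sketch of the 2x rule in Python. It is not the openshift-ansible implementation, which lives in Ansible tasks; the function name `check_backup_space` and the `factor` parameter are hypothetical, and treating the exact-equality case as a failure is an assumption, since the report only covers the strictly-greater and strictly-less cases. The message wording follows the MSG lines quoted in the report.

```python
def check_backup_space(etcd_usage_kb: int, avail_kb: int, factor: int = 2):
    """Return (ok, message) for a pre-upgrade etcd backup space check.

    etcd_usage_kb: size of /var/lib/etcd excluding old backups, in KiB
                   (what the report calls l_etcd_disk_usage).
    avail_kb:      free space on the same filesystem, in KiB
                   (what the report calls l_avail_disk).
    factor:        hypothetical knob; 2 matches the "2x rather than 1x"
                   rule proposed in the report.
    """
    required_kb = factor * etcd_usage_kb
    # Assumption: equality counts as insufficient; the report does not say.
    ok = required_kb < avail_kb
    message = (
        f"{required_kb} Kb disk space required for etcd backup, "
        f"{avail_kb} Kb available."
    )
    return ok, message


if __name__ == "__main__":
    # Figures taken from steps 2 and 5 of the verification above.
    for usage, avail in [(3538672, 4662996), (2113888, 6112600)]:
        ok, msg = check_backup_space(usage, avail)
        print("PASS" if ok else "FAIL", msg)
```

With the step 2 figures the check fails and reports 7077344 Kb required, which is the doubled value the verifier asked the message to show; with the step 5 figures it passes.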