Bug 1489182

Summary:	[free-int] /var disk space exhaustion during upgrades [was: API calls hanging with timeout]
Product:	OpenShift Container Platform	Reporter:	Justin Pierce <jupierce>
Component:	Cluster Version Operator	Assignee:	Jan Chaloupka <jchaloup>
Status:	CLOSED ERRATA	QA Contact:	liujia <jiajliu>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	3.7.0	CC:	aos-bugs, decarr, jokerman, jupierce, mkhan, mmccomas, sdodson
Target Milestone:	---	Keywords:	DeliveryBlocker
Target Release:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openshift-ansible-3.7.0-0.126.1.git.0.0bb5b0c.el7.noarch	Doc Type:	Bug Fix
Doc Text:	Cause: The master upgrade took more disk space than was initially estimated. Consequence: Insufficient disk space led to an etcd member "no space left on device" failure. Fix: Increase the estimate of disk space that must be available before the master upgrade can start. Result: A master node is upgraded properly, with enough disk space left after the upgrade.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-11-28 22:09:56 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Justin Pierce 2017-09-06 21:40:06 UTC

Description of problem:
"/usr/bin/oc get hostsubnet -o json -n default" times out consistently during openshift-ansible upgrade attempts.

Version-Release number of selected component (if applicable):
oc v3.7.0-0.104.0
kubernetes v1.7.0+695f48a16f
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
100% on this cluster at the moment

Steps to Reproduce:
1. Several oc invocations hang with this problem:
 a. /usr/bin/oc get hostsubnet -o json -n default
 b. oc get pods --all-namespaces


Actual results:
Command terminates with timeout error.

Expected results:
Command should return without error.

Additional info:
- The condition arose after several failed attempts to run storage migration. https://bugzilla.redhat.com/show_bug.cgi?id=1489168

- "oc projects" returns quickly.

- When run with logging (/usr/bin/oc get hostsubnet -o json -n default --loglevel=8):
I0906 21:18:56.880943   64171 round_trippers.go:383] GET https://internal.api.free-int.openshift.com:443/oapi/v1/hostsubnets
I0906 21:18:56.880951   64171 round_trippers.go:390] Request Headers:
I0906 21:18:56.880956   64171 round_trippers.go:393]     Accept: application/json
I0906 21:18:56.880981   64171 round_trippers.go:393]     User-Agent: oc/v1.7.0+695f48a16f (linux/amd64) kubernetes/d2e5420
I0906 21:19:56.911301   64171 round_trippers.go:408] Response Status: 504 Gateway Timeout in 60030 milliseconds
I0906 21:19:56.911322   64171 round_trippers.go:411] Response Headers:
I0906 21:19:56.911328   64171 round_trippers.go:414]     Cache-Control: no-store
I0906 21:19:56.911332   64171 round_trippers.go:414]     Content-Type: text/plain; charset=utf-8
I0906 21:19:56.911336   64171 round_trippers.go:414]     Content-Length: 224
I0906 21:19:56.911340   64171 round_trippers.go:414]     Date: Wed, 06 Sep 2017 21:19:56 GMT
I0906 21:19:56.911377   64171 request.go:994] Response Body: {"metadata":{},"status":"Failure","message":"The list operation against hostsubnets could not be completed at this time, please try again.","reason":"ServerTimeout","details":{"name":"list","kind":"hostsubnets"},"code":500}
{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}
I0906 21:19:56.911699   64171 helpers.go:206] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server cannot complete the requested operation at this time, try again later (get hostsubnets)",
  "reason": "ServerTimeout",
  "details": {
    "kind": "hostsubnets",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"The list operation against hostsubnets could not be completed at this time, please try again.\",\"reason\":\"ServerTimeout\",\"details\":{\"name\":\"list\",\"kind\":\"hostsubnets\"},\"code\":500}"
      }
    ]
  },
  "code": 504
}]
F0906 21:19:56.911730   64171 helpers.go:120] Error from server (ServerTimeout): the server cannot complete the requested operation at this time, try again later (get hostsubnets)

Comment 6 Scott Dodson 2017-09-07 16:18:21 UTC

Moving this bug to the upgrade component. We need to verify that the upgrade aborts when there is not enough space to take the backup. It's possible that we had sufficient space when we created the backup, but that the subsequent `oc adm migrate storage` run re-wrote enough data to fill the disk. So when we take our backup, I think we should ensure that we have 2x the amount of space required to take a backup rather than 1x. This would ensure that we could take the backup and then continue re-writing lots of data.

The cluster in question has been restored to service after deleting all but the two most recent backups.
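
The 2x rule proposed in this comment can be sketched as a small pre-flight check. This is only an illustration in Python, not the openshift-ansible implementation: the function name `backup_space_ok` and the `factor` parameter are hypothetical, both sizes are assumed to be in KB (as reported by `df -k` and `du -k`), and treating the exact boundary case as passing is also an assumption.

```python
def backup_space_ok(etcd_usage_kb: int, avail_kb: int, factor: int = 2) -> bool:
    """Return True when the free space covers `factor` times the etcd data size.

    factor=1 models a check that only leaves room for the backup itself;
    factor=2 models the proposal: room for the backup plus the data that a
    later storage migration may rewrite.
    """
    return avail_kb >= factor * etcd_usage_kb


if __name__ == "__main__":
    # Hypothetical sizes: 3,000,000 KB of etcd data, 4,000,000 KB free.
    usage_kb, avail_kb = 3_000_000, 4_000_000
    print(backup_space_ok(usage_kb, avail_kb, factor=1))  # 1x check
    print(backup_space_ok(usage_kb, avail_kb, factor=2))  # 2x check
```

With these hypothetical sizes the 1x check passes while the 2x check fails, which is the situation the proposal is meant to catch before the upgrade starts.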

Comment 7 Jan Chaloupka 2017-09-12 15:38:51 UTC

Upstream PR: https://github.com/openshift/openshift-ansible/pull/5377

Comment 8 Jan Chaloupka 2017-09-13 10:53:28 UTC

Merged upstream

Comment 9 liujia 2017-09-14 09:43:42 UTC

Bug verification blocked by bug 1451023.

Comment 10 liujia 2017-09-19 08:20:24 UTC

Version:
atomic-openshift-utils-3.7.0-0.126.4.git.0.3fc2b9b.el7.noarch

Steps:
1. Install OCP 3.6 with one dedicated etcd.
# openshift version
openshift v3.6.173.0.37
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# oc get pods --all-namespaces
NAMESPACE      NAME                             READY     STATUS      RESTARTS   AGE
default        docker-registry-3-ft0f5          1/1       Running     0          2h
default        registry-console-1-klt6d         1/1       Running     0          2h
default        router-1-pf2td                   1/1       Running     0          2h
install-test   mongodb-1-1jjrn                  1/1       Running     0          2h
install-test   nodejs-mongodb-example-1-build   0/1       Completed   0          2h
install-test   nodejs-mongodb-example-1-r6gll   1/1       Running     0          2h
mytest         mongodb-1-fxlkk                  1/1       Running     0          33s
mytest         nodejs-mongodb-example-1-build   1/1       Running     0          41s

2. Prepare data on the etcd host to fill the /var/lib/etcd directory. Check l_avail_disk and l_etcd_disk_usage.
# df --output=avail -k /var/lib/etcd/ | tail -n 1
4662996
# du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1
3538672 

3. Run upgrade when 2*l_etcd_disk_usage > l_avail_disk > l_etcd_disk_usage.
# ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml -v | tee log

4. The upgrade failed as expected, but the error message is slightly wrong.

MSG:

3538672 Kb disk space required for etcd backup, 4662996 Kb available.

The message should report 2*l_etcd_disk_usage as the required space.

5. Clean up some data on the etcd host, then check l_avail_disk and l_etcd_disk_usage again.
# df --output=avail -k /var/lib/etcd/ | tail -n 1
6112600
# du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1
2113888

6. Run the upgrade again when 2*l_etcd_disk_usage < l_avail_disk.

The upgrade succeeds.
# openshift version
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd
etcd 3.2.1

The main fix works well, but the error message in step 4 still reports 1x rather than 2x the etcd disk usage as the required space. Assigning the bug back for a further fix.
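
The arithmetic behind steps 3 to 6 can be replayed with the values measured in steps 2 and 5. The sketch below is a Python illustration only: `required_kb` and `precheck_message` are hypothetical names, and the message text simply mirrors the wording of the MSG shown in step 4, with the required figure set to 2*l_etcd_disk_usage as requested there.

```python
from typing import Optional


def required_kb(etcd_usage_kb: int, factor: int = 2) -> int:
    """Space the pre-check asks for: `factor` times the current etcd usage."""
    return factor * etcd_usage_kb


def precheck_message(etcd_usage_kb: int, avail_kb: int) -> Optional[str]:
    """Return None when the upgrade may proceed, otherwise the failure text."""
    need = required_kb(etcd_usage_kb)
    if need > avail_kb:
        return (f"{need} Kb disk space required for etcd backup, "
                f"{avail_kb} Kb available.")
    return None


if __name__ == "__main__":
    # Step 2 values: 2*3538672 = 7077344 > 4662996, so the check fails.
    print(precheck_message(3538672, 4662996))
    # Step 5 values: 2*2113888 = 4227776 < 6112600, so the check passes.
    print(precheck_message(2113888, 6112600))
```

For the step 2 values the sketch reports 7077344 Kb required rather than the 3538672 Kb shown in step 4; for the step 5 values it returns None, matching the successful upgrade in step 6.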

Comment 11 Jan Chaloupka 2017-09-19 08:45:44 UTC

https://github.com/openshift/openshift-ansible/pull/5450

Comment 12 liujia 2017-09-22 09:40:49 UTC

Verified on atomic-openshift-utils-3.7.0-0.127.0.git.0.b9941e4.el7.noarch.rpm; the hint message has been updated.

TASK [etcd_common : Check current etcd disk usage] **************************************************************************************************************************
ok: [x.x.x.x] => {
    "changed": false,
    "cmd": "du --exclude='*openshift-backup*' -k /var/lib/etcd/ | tail -n 1 | cut -f1",
    "delta": "0:00:00.895742",
    "end": "2017-09-22 05:37:27.876136",
    "rc": 0,
    "start": "2017-09-22 05:37:26.980394"
}

STDOUT:

3700640


MSG:

7401280 Kb disk space required for etcd backup, 4939480 Kb available.
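
The figures in this output can be cross-checked mechanically. The following Python sketch is an assumption-laden illustration: the regular expression is derived only from the single MSG line above, and `parse_precheck_msg` is a hypothetical helper, not part of openshift-ansible.

```python
import re

# Pattern inferred from the one MSG line quoted in this comment.
MSG_RE = re.compile(
    r"(\d+) Kb disk space required for etcd backup, (\d+) Kb available\."
)


def parse_precheck_msg(msg: str) -> tuple:
    """Extract (required_kb, available_kb) from the pre-check message."""
    match = MSG_RE.search(msg)
    if match is None:
        raise ValueError("message does not match the expected format")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    du_stdout = "3700640"  # STDOUT of the disk usage task above
    msg = "7401280 Kb disk space required for etcd backup, 4939480 Kb available."
    required, available = parse_precheck_msg(msg)
    print(required == 2 * int(du_stdout))  # required space is 2x the usage
    print(required > available)            # hence the pre-check fails
```

Both comparisons hold for the values above: 7401280 is exactly 2 * 3700640, and it exceeds the 4939480 Kb available.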

Comment 16 errata-xmlrpc 2017-11-28 22:09:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188