Description of problem:
A script was run to create 300 PVCs of 1 GB each. Each node has two devices, one of 1 TB and the other of 50 GB. After 219 volumes, the heketi logs show a "no space" error. The heketi topology info output also shows 0 free space across all devices on all three nodes.

Version-Release number of selected component (if applicable):
rhgs-volmanager-rhel7 v3.9.0

How reproducible:

Actual results:
PVC creation fails due to no space.

Expected results:
PVC creation should be successful.
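As a rough sanity check on these numbers, here is a minimal Python sketch of how much space the volumes should use. It assumes (not stated in the report) that every volume is replica 3 with one brick per node, reads the devices as 1024 GiB and 50 GiB, and ignores LVM and thin-pool metadata overhead. Under those assumptions, 219 volumes should take only about 219 GiB per node, so a topology showing 0 free space suggests that something other than the 219 requested volumes is consuming the space.

```python
# Rough capacity check for the scenario in this bug.
# ASSUMPTIONS (not stated in the report): every volume is replica 3,
# so each of the three nodes holds one brick per volume; LVM and
# thin-pool metadata overhead is ignored.

NODES = 3
DEVICE_SIZES_GIB = [1024, 50]  # "1Tb" and "50Gb" devices per node, read as GiB (assumption)
VOLUME_SIZE_GIB = 1
REPLICA = 3


def expected_free_per_node(volumes: int) -> float:
    """Free space expected on each node after `volumes` replica-3 volumes."""
    capacity = sum(DEVICE_SIZES_GIB)
    # Total brick space is volumes * size * replica, spread evenly over the nodes.
    used = volumes * VOLUME_SIZE_GIB * REPLICA / NODES
    return capacity - used


if __name__ == "__main__":
    for n in (219, 300):
        print(f"{n} volumes -> ~{expected_free_per_node(n):.0f} GiB free per node")
```

Under these assumptions, the script reports about 855 GiB free per node after 219 volumes and about 774 GiB after all 300.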
Logs are available here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1559834/log/
Updated the logs with heketi.db and the script used for PVC creation: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1559834/log/
Looking through the heketi logs, I see:

$ grep "Started async operation: Create Volume" heketi.log | wc -l
1062
$ grep "Started POST /volumes" heketi.log | wc -l
21624

I think this means heketi accepted 1062 requests for volume creation, while 21624 requests for volume create OR volume expand reached negroni. Either that many PVC requests were actually made, or the OpenShift storage provisioner issued them as part of its retry mechanism. I need logs from the provisioner to debug further.
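To re-check these counts on another copy of heketi.log, here is a minimal Python equivalent of the two grep | wc -l pipelines. It also reports the ratio between them (21624 / 1062 is roughly 20 POSTs per accepted create). It uses plain substring matching, which behaves like grep here because neither pattern contains regex metacharacters. The function name count_matches is hypothetical, and the log path defaults to heketi.log in the current directory.

```python
import sys
from pathlib import Path

# Same patterns as the grep commands above.
PATTERNS = {
    "accepted_creates": "Started async operation: Create Volume",
    # Substring match, so this also counts POST /volumes/<id>/expand.
    "post_volumes": "Started POST /volumes",
}


def count_matches(lines):
    """Count lines containing each pattern, like `grep PATTERN file | wc -l`."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern in line:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    path = Path(sys.argv[1] if len(sys.argv) > 1 else "heketi.log")
    if path.exists():
        with path.open(errors="replace") as f:
            counts = count_matches(f)
        ratio = counts["post_volumes"] / max(counts["accepted_creates"], 1)
        print(counts, f"~{ratio:.1f} POSTs per accepted create")
    else:
        print(f"{path} not found")
```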
oc describe pod heketi.....
  Type     Reason         Age                  From                                        Message
  ----     ------         ----                 ----                                        -------
  Warning  InspectFailed  1h (x19 over 5h)     kubelet, dhcp47-160.lab.eng.blr.redhat.com  Failed to inspect image "rhgs3/rhgs-volmanager-rhel7:v3.9.0": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  Unhealthy      59m (x380 over 23h)  kubelet, dhcp47-160.lab.eng.blr.redhat.com  Liveness probe failed: Get http://10.129.0.8:8080/hello: dial tcp 10.129.0.8:8080: getsockopt: connection refused
  Warning  Failed         38m (x55 over 6h)    kubelet, dhcp47-160.lab.eng.blr.redhat.com  Error: context deadline exceeded
  Normal   Pulled         34m (x257 over 23h)  kubelet, dhcp47-160.lab.eng.blr.redhat.com  Container image "rhgs3/rhgs-volmanager-rhel7:v3.9.0" already present on machine
  Normal   Killing        13m (x353 over 23h)  kubelet, dhcp47-160.lab.eng.blr.redhat.com  Killing container with id docker://heketi:Container failed liveness probe.. Container will be killed and recreated.
  Warning  Unhealthy      9m (x2760 over 23h)  kubelet, dhcp47-160.lab.eng.blr.redhat.com  Readiness probe failed: Get http://10.129.0.8:8080/hello: dial tcp 10.129.0.8:8080: getsockopt: connection refused
  Normal   Created        3m (x194 over 23h)   kubelet, dhcp47-160.lab.eng.blr.redhat.com  Created container
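The liveness and readiness failures above appear to be HTTP GETs against heketi's /hello endpoint on port 8080. Below is a minimal sketch for repeating the same check by hand from a host that can reach the pod IP. It assumes the usual Kubernetes httpGet rule, where any status from 200 to 399 counts as success. The function name probe_hello and the 1-second timeout are hypothetical choices, not taken from the heketi deployment.

```python
import urllib.error
import urllib.request


def probe_hello(url: str, timeout: float = 1.0) -> bool:
    """Return True if GET <url> answers with a 2xx/3xx status, like an httpGet probe."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # HTTP error statuses, "connection refused" and timeouts all land here,
        # matching the probe failures reported in the events above.
        return False
```

For example, probe_hello("http://10.129.0.8:8080/hello") returning False while heketi is processing the request backlog would match the connection-refused events.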
Logs from the provisioner have been added to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1559834/log/
*** This bug has been marked as a duplicate of bug 1554467 ***