Bug 1236473

Summary: Hypervisor summary shows incorrect total storage (Ceph)
Product: Red Hat OpenStack
Reporter: Martin Schuppert <mschuppe>
Component: openstack-nova
Assignee: Eoghan Glynn <eglynn>
Status: CLOSED WONTFIX
QA Contact: nlevinki <nlevinki>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.0 (Juno)
CC: bengland, berrange, brault, dasmith, ddomingo, ealcaniz, eglynn, kchamart, kimi.zhang, ndipanov, nlevine, rdopiera, rsussman, sbauza, sferdjao, sgordon, sknauss, vromanso, yeylon
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
When using Red Hat Ceph as a back end for ephemeral storage, the Compute service does not calculate the amount of available storage correctly. Specifically, Compute simply adds up the amount of available storage without factoring in replication. This results in grossly overstated available storage, which in turn could cause unexpected storage oversubscription. To determine the correct ephemeral storage capacity, query the Ceph service directly instead.
Story Points: ---
Clone Of:
Clones: 1332165
Environment:
Last Closed: 2016-03-31 12:58:38 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 743661, 1332165, 1336237, 1368279, 1430245    

Description Martin Schuppert 2015-06-29 07:24:59 UTC
Description of problem:

Since Ceph is used for ephemeral storage, Nova adds up the Ceph storage reported by each compute node rather than using the real amount of Ceph storage available in the cluster.

e.g. in an OpenStack deployment with three controllers and six compute nodes, where storage is provided by Ceph block storage in a Ceph storage cluster. Each OSD node has a dedicated local 1 TB hard disk, for a total Ceph storage capacity of 2.7 TB. In the dashboard, each compute node reports the whole Ceph OSD storage as its own storage capacity, so the overall storage capacity comes out as number of computes x Ceph storage. Instead of 2.7 TB we therefore see 16.3 TB of storage.
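
The numbers match the hypervisor stats below: 6 compute nodes x 2787 GB of raw Ceph capacity = 16722 GB (roughly 16.3 TB), which is exactly the local_gb value Nova reports.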

Each compute node now seems to report free storage capacity based on the whole Ceph storage minus the storage used by its running VMs. This means OpenStack sees much more storage than really exists and allows storage to be oversubscribed.

[root@controller-1 ~(openstack_admin)]# nova hypervisor-stats
+----------------------+--------+
| Property             | Value  |
+----------------------+--------+
| count                | 6      |
| current_workload     | 0      |
| disk_available_least | 16206  |
| free_disk_gb         | 12662  |
| free_ram_mb          | 599679 |
---> | local_gb             | 16722  |
| local_gb_used        | 4060   |
| memory_mb            | 772735 |
| memory_mb_used       | 173056 |
| running_vms          | 23     |
| vcpus                | 220    |
| vcpus_used           | 83     |
+----------------------+--------+

[root@controller-1 ~(openstack_admin)]# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    2787G     2701G       87739M          3.07
POOLS:
    NAME         ID     USED       %USED     MAX AVAIL     OBJECTS
    data         0           0         0         1348G           0
    metadata     1           0         0         1348G           0
    rbd          2           0         0         1348G           0
    images       3      12214M      0.43         1348G        1534
    volumes      4      33166M      1.16         1348G        7774

This will lead to problems when the Ceph cluster fills up while OpenStack, based on the Nova resource audit, still reports free storage for some or all compute nodes.
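
For reference, the raw cluster capacity can also be queried directly with python-rados, matching the ceph df output above. This is just a minimal sketch, assuming admin credentials and the default /etc/ceph/ceph.conf path on the node running it:

import rados

# connect with the same ceph.conf that nova-compute uses (path is an assumption)
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    # get_cluster_stats() returns raw cluster totals in KB:
    # 'kb', 'kb_used', 'kb_avail', 'num_objects'
    stats = cluster.get_cluster_stats()
    print("raw total: %.0f GB" % (stats['kb'] / 1024.0 / 1024.0))
    print("raw avail: %.0f GB" % (stats['kb_avail'] / 1024.0 / 1024.0))
finally:
    cluster.shutdown()

Note these are raw figures, so the usable capacity still has to be divided by the pool replication factor, as the Doc Text above points out.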

Version-Release number of selected component (if applicable):
openstack-nova-compute-2014.2.3-9.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1. configure RBD usage as explained in http://ceph.com/docs/master/rbd/rbd-openstack/
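
For reference, a minimal sketch of the relevant nova.conf settings on the compute nodes (the pool name, Ceph user, and secret UUID below are example values, not taken from this environment):

[libvirt]
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>

With images_type=rbd every compute node creates its ephemeral disks in the shared Ceph pool, which is why each node reports the full cluster capacity as its own.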

Actual results:
disk capacity reported by OpenStack is the RBD cluster capacity multiplied by the number of compute nodes

Expected results:
Maximum disk capacity is what the Ceph cluster reports

Additional info:

upstream bug: https://bugs.launchpad.net/nova/+bug/1387812

Comment 2 Martin Schuppert 2015-06-29 07:36:10 UTC
Related to this is what is being discussed in:

"nova hypervisor-stats shows wrong disk usage with shared storage" [1]

Let me know if I should file a separate BZ for this.

[1] https://bugs.launchpad.net/nova/+bug/1414432

Comment 4 Sahid Ferdjaoui 2015-07-24 13:16:46 UTC
Several attempts have been made to fix this upstream, but nothing has been merged (all were abandoned).

Comment 8 Eoghan Glynn 2015-10-09 11:21:17 UTC
Martin:

There is an upstream spec proposed that will help fix this, but it's in the early stages of discussion:

  https://review.openstack.org/225546

The problem is relatively well understood, but it needs a redesign of various scheduler aspects to resolve. So while the discussion is currently underway, the timeline would at best be Mitaka/OSP 9, and possibly later.

Comment 9 Stephen Gordon 2015-11-06 15:20:00 UTC
*** Bug 1248720 has been marked as a duplicate of this bug. ***