Bug 2066367
| Summary: | Small /srv partition on hardened whole-disk images makes swift (and any service based on it) unusable | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Filip Hubík <fhubik> |
| Component: | tripleo-ansible | Assignee: | Steve Baker <sbaker> |
| Status: | CLOSED ERRATA | QA Contact: | Joe H. Rahme <jhakimra> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 17.0 (Wallaby) | CC: | abishop, apevec, astillma, bshephar, cschwede, gfidente, gthiemon, hjensas, igallagh, ihrachys, jkreger, jparker, jparoly, jschluet, lkuchlan, lpeer, ltoscano, majopela, michjohn, njohnston, oschwart, pgrist, pweeks, sbaker, scohen, spower |
| Target Milestone: | ga | Keywords: | AutomationBlocker, Triaged |
| Target Release: | 17.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tripleo-ansible-3.3.1-0.20220506233512.96104ee.el8ost | Doc Type: | If docs needed, set a value |
| Doc Text: | The manual will mention this as part of https://issues.redhat.com/browse/RHOSPDOC-823 | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-21 12:19:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Filip Hubík
2022-03-21 15:26:59 UTC
Created attachment 1867241 [details]
oc_deploy_snippet.log
Weird, I don't see any reason why the glance service should reply 500, as from the log it is running and also handling requests around the timestamp at which the failure happened.

This is a duplicate of the following glance/tripleo bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=2064290
https://bugzilla.redhat.com/show_bug.cgi?id=2065282
Basically the secure-RBAC override file from TripleO has a bug in it which causes glance to fail with 500 errors when a file is uploaded.

(In reply to Christian Schwede (cschwede) from comment #10)
> Yes, they are - and this is most likely the reason why Swift fails. Good
> catch! I'm just wondering why /srv is set up as a separate small logical
> volume? Where is this defined?

I am not sure where this config is coming from (IR? TripleO?), but I noticed the disk-mount layout changed between the RHEL 8 and RHEL 9 jobs: in 8 the layout used paravirtualized devices (/dev/vdaX), whereas 9 uses device-mapper and LVM heavily (/dev/mapper/vg-*). Could this be coming from the RHEL 9 guest image directly, as it might handle storage devices differently by default now (compared to 8)?

[1] is the commit that defines the "growvols" partitioning, and the commit message states: 'There will be a mechanism in the "openstack overcloud node provision" command to specify different growvol arguments. This will not be required for most nodes, but swift storage nodes will have to grow both /srv and /var.'

However, I looked over the python-tripleoclient code that implements "openstack overcloud node provision" and I don't see any, er, provision for specifying growvol arguments that are intended to address Swift's needs.

The whole-disk-default blueprint [2] also says this: "Generally the /var volume should grow to take available disk space because this is where TripleO and OpenStack services store their state, but sometimes /srv will need to grow for Swift storage, and sometimes there may need to be a proportional split of multiple volumes. This suggests that there will be new tripleo-heat-templates variables which will specify the volume/proportion growth behaviour on a per-role basis."

Clearly some thought has been given to Swift, but maybe it's missing in the current implementation?

[1] https://review.opendev.org/c/openstack/tripleo-ansible/+/811536
[2] https://specs.openstack.org/openstack/tripleo-specs/specs/xena/whole-disk-default.html

The interface to change the default growvols arguments is in baremetal_deployment.yaml; it's documented here[1].
Growvols is called as an implicit playbook with default arguments after the baremetal nodes are deployed with the 'openstack overcloud node provision' command.
To provide custom arguments for growvols, the playbook must be added under 'ansible_playbooks' in baremetal_deployment.yaml.
Here is an example baremetal_deployment.yaml entry for swift:
- name: Object
  count: 10
  hostname_format: object-%index%
  ansible_playbooks:
    - playbook: /usr/share/ansible/tripleo-playbooks/cli-overcloud-node-growvols.yaml
      extra_vars:
        growvols_args: >
          /=8GB
          /tmp=1GB
          /var/log=10GB
          /var/log/audit=2GB
          /home=1GB
          /srv=500GB
          /var=100%
  defaults:
    profile: object
    networks:
      - network: internal_api
      - network: storage
    network_config:
      template: templates/multiple_nics/multiple_nics_dvr.j2
[1] https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/provisioning/baremetal_provision.html#grow-volumes-playbook
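For completeness, a sketch of how a role definition like the one above would then be consumed. The file and stack names here are illustrative, and the exact set of available flags depends on the tripleoclient release:

  # Provision the baremetal nodes; growvols runs afterwards as an implicit
  # playbook, picking up the custom growvols_args from the Object role above.
  openstack overcloud node provision \
    --stack overcloud \
    --output ~/overcloud-baremetal-deployed.yaml \
    ~/baremetal_deployment.yaml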
Harald has summed up well what is required. With the hardened whole-disk image there need to be volumes dedicated to specific purposes to comply with ANSSI requirements (data, packages, home, tmp, etc).

Most roles store their data in /var, except for swift object storage which uses /srv. This is why /srv is created but is tiny; most roles need all remaining disk space to be assigned to /var. This is also why the upstream documentation[1] uses /srv as its example of custom growvols arguments. This will definitely be documented for 17.0.

[1] https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/provisioning/baremetal_provision.html#grow-volumes-playbook

(In reply to Steve Baker from comment #17)
> Harald has summed up well what is required. With the hardened whole-disk
> image there needs to be volumes dedicated to specific purposes to comply
> with ANSSI requirements (data, packages, home, tmp, etc).
>
> Most roles store their data in /var, except for swift object storage which
> uses /srv. This is why /srv is created but is tiny, most roles need all
> remaining disk space to be assigned to /var.

This looks like a regression to me: any deployment without Ceph (and without an external storage backend for Glance) will now be unusable by default, because Glance can't store any images in Swift. If Swift is not disabled, there should be more disk space assigned to /srv (as much as possible).

*** Bug 2070241 has been marked as a duplicate of this bug. ***

*** Bug 2071654 has been marked as a duplicate of this bug. ***

I'll just share my 2 cents, as I've been tinkering with the values of /var and /srv in the p1/p2 CI environment (where non-Ceph topologies usually end up using about 6GB of /var and about 1GB of /srv per controller after a full Tempest run) to get reasonable results. I am still struggling a bit with the right approach to this issue, but I tend to agree with the "TripleO defaults" option.

Overall I have a question/suggestion. If TripleO has awareness of the service placement (let's put aside composable roles), the node profiles/roles assigned, and the relations between them (theoretically right at the "openstack overcloud node provision ..." step, via the input yaml file), it should also have awareness of the minimum requirements per service and aggregated across all services to be deployed on each node. If it does, partitions in the hardened image could be grown to their minimal operational sizes (step 1), and then the partition associated with the selected storage service grown to 100% of the remaining disk (step 2)? The same approach could work for the subsequent "deployed ceph" step (maybe it already does) on ceph nodes?

Speaking mainly about the "defaults" here, I am also assuming the configurability needs to be kept and documented, as people need to tweak the storage space based on their workloads and physical hardware anyway.

All in all, we merged the following workaround in IR to get reasonable CI results in the meantime, until the right solution is decided here: https://review.gerrithub.io/c/redhat-openstack/infrared/+/535621

I've proposed this upstream: https://review.opendev.org/c/openstack/tripleo-ansible/+/837438

Please post a review so we can come to a consensus on the splits for /srv and /var on Controller and ObjectStorage. I saw Filip's comment in the spreadsheet about CI needing at least 1GB /srv and > 6GB /var. If the CI root disk is at least 34GB then the default 10%/90% split will provide this, and the infrared override could be removed.
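To see what a given split actually produced on a deployed node, a quick check with plain LVM and df commands is enough; the volume group name shown is whatever the hardened whole-disk image uses (the /dev/mapper/vg-* paths mentioned earlier suggest "vg"):

  # List the logical volumes and their sizes on the hardened whole-disk image
  sudo lvs -o lv_name,lv_size,vg_name
  # Confirm the mounted capacity of the swift (/srv) and general-purpose (/var) volumes
  df -h /srv /var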
*** Bug 2077927 has been marked as a duplicate of this bug. ***

The code fix is to ensure real environments will give enough storage space to /srv by default to be useful, and the documentation will direct the deployer to consider their actual storage requirements and set a value for /srv growth that is more appropriate than the default.

The ambiguity is that this issue broke every job that deploys swift, and because the test disks are so small the code fix wasn't enough, so each job needed to be modified to either increase the disk size or override the defaults to give enough growth to /srv to store its workload. So even if some jobs are still broken, that is now unrelated to verifying the code fix.

And there will be a dedicated section in the manual to discuss /srv storage requirements, which is being tracked as part of https://issues.redhat.com/browse/RHOSPDOC-823, so I don't think this bug needs any extra Known Issue docs.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543