2252941 – [OSP17.1] SwiftRawDisks only takes sdX naming convention which can fail deployment when the introspection assign sdb for the OS disk.

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	puppet-swift
Sub Component:
Version:	17.1 (Wallaby)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	zstream
Target Release:	17.1
Assignee:	Christian Schwede (cschwede)
QA Contact:	Gal Amado
Docs Contact:	Andy Stillman
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-12-05 11:28 UTC by ggrimaux
Modified:	2024-10-09 11:38 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-10-09 11:38:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-30658	0	None	None	None	2023-12-05 11:31:59 UTC
Red Hat Knowledge Base (Article)	7048606	0	None	None	None	2023-12-23 17:05:40 UTC

Description ggrimaux 2023-12-05 11:28:40 UTC

Description of problem:
A client is building a lab and wants to deploy Swift on a second disk of the controller nodes.

During overcloud deployment, this sometimes fails with:
2023-11-24 13:04:48.702263 | 566fdacf-01b6-add0-8052-000000001dbd |      FATAL | Format SwiftRawDisks | controller-22 | item=sdb | error={"ansible_loop_var": "item", "changed": false, "cmd": "/sbin/mkfs.xfs -f -f -i size=1024 /dev/sdb", "item": "sdb", "msg": "mkfs.xfs: cannot open /dev/sdb: Device or resource busy", "rc": 1, "stderr": "mkfs.xfs: cannot open /dev/sdb: Device or resource busy\n", "stderr_lines": ["mkfs.xfs: cannot open /dev/sdb: Device or resource busy"], "stdout": "", "stdout_lines": []}

The cause is that, during introspection, the OS disk is sometimes enumerated as sdb instead of sda:
(undercloud) [stack@director-02 ~]$ openstack baremetal introspection data save controller-22 | jq .inventory.disks
[
  {
    "name": "/dev/sda",
    "model": "QEMU HARDDISK",
    "size": 3220151730176,
    "rotational": true,
    "wwn": null,
    "serial": "165ddab4-390d-4711-98d7-dc71e08e8ca2",
    "vendor": "QEMU",
    "wwn_with_extension": null,
    "wwn_vendor_extension": null,
    "hctl": "0:0:0:1",
    "by_path": "/dev/disk/by-path/pci-0000:08:00.0-scsi-0:0:0:1"
  },
  {
    "name": "/dev/sdb",
    "model": "QEMU HARDDISK",
    "size": 1073741824000,
    "rotational": true,
    "wwn": null,
    "serial": "d285f75b-f180-4d1e-b35b-3d4f13e62c42",
    "vendor": "QEMU",
    "wwn_with_extension": null,
    "wwn_vendor_extension": null,
    "hctl": "0:0:0:0",
    "by_path": "/dev/disk/by-path/pci-0000:08:00.0-scsi-0:0:0:0"
  }
]

The issue is not with the root device serial set at the baremetal level: the client is using it, and the OS does get installed on the right disk each time. It is the sdX naming of the disks that causes the problem:
(undercloud) [stack@director-02 ~]$ openstack baremetal node show controller-22 -c properties -f value
{'cpus': '24', 'memory_mb': '73728', 'local_gb': '999', 'cpu_arch': 'x86_64', 'capabilities': 'boot_option:local,node:controller-1,profile:control,cpu_aes:true,cpu_hugepages:true,cpu_hugepages_1g:true', 'root_device': {'serial': 'd285f75b-f180-4d1e-b35b-3d4f13e62c42'}}
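
The mismatch can be checked mechanically by comparing the root_device serial against the introspected inventory. The following is a minimal Python sketch, assuming the disk list was saved from the `jq .inventory.disks` output above (trimmed here to the fields used). The function name `split_disks` is hypothetical and not part of any OpenStack tooling.

```python
import json

# Trimmed copy of the introspection output above (only the fields used here).
DISKS_JSON = """
[
  {"name": "/dev/sda", "serial": "165ddab4-390d-4711-98d7-dc71e08e8ca2",
   "by_path": "/dev/disk/by-path/pci-0000:08:00.0-scsi-0:0:0:1"},
  {"name": "/dev/sdb", "serial": "d285f75b-f180-4d1e-b35b-3d4f13e62c42",
   "by_path": "/dev/disk/by-path/pci-0000:08:00.0-scsi-0:0:0:0"}
]
"""


def split_disks(disks, root_serial):
    """Return (os_disk, other_disks), matching the OS disk by serial."""
    os_disk = next((d for d in disks if d["serial"] == root_serial), None)
    others = [d for d in disks if d is not os_disk]
    return os_disk, others


disks = json.loads(DISKS_JSON)
# Serial taken from the root_device property of controller-22.
os_disk, others = split_disks(disks, "d285f75b-f180-4d1e-b35b-3d4f13e62c42")
print("OS disk:", os_disk["name"])
print("Not the OS disk:", [d["name"] for d in others])
```

For this node the OS disk resolves to /dev/sdb, which is why formatting sdb for Swift reports "Device or resource busy".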


In his templates he tries to deploy with this (as our doc says):
parameter_defaults:
  SwiftMountCheck: true
  SwiftRawDisks: {"sdb": {}}
  SwiftUseLocalDir: false


I found a variable that should be able to help us here but I am not able to make it work:
SwiftUseNodeDataLookup
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.1/html/overcloud_parameters/ref_object-storage-swift-parameters_overcloud_parameters#doc-wrapper

I believe this feature never worked as intended because of a typo in a sed command used at some point:
/usr/share/openstack-tripleo-heat-templates/deployment/swift/swift-storage-container-puppet.yaml
  hiera -c /etc/puppet/hiera.yaml swift::storage::disks::args | sed =e 's/=>/:/g'
('sed =e' doesn't work; it should be 'sed -e'. A new patch has been submitted to fix this typo.)
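
For reference, the corrected command (`sed -e 's/=>/:/g'`) is meant to turn the Ruby-style hash printed by hiera into JSON. A rough Python equivalent is sketched below. The sample string is an assumed shape for the hiera output, not one captured from a real node, and `puppet_hash_to_json` is a hypothetical name.

```python
import json


def puppet_hash_to_json(text):
    # Same substitution as sed -e 's/=>/:/g': replace every "=>" with ":".
    return text.replace("=>", ":")


sample = '{"sdb"=>{}}'  # assumed shape of the hiera output
converted = puppet_hash_to_json(sample)
print(converted)
print(json.loads(converted))  # parses once the "=>" separators are gone
```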



So I built a lab to replicate this issue (where I manually fixed the sed command btw).

I put the following in my templates:
resource_registry:
  OS::TripleO::ControllerExtraConfigPre: /usr/share/openstack-tripleo-heat-templates/puppet/extraconfig/pre_deploy/per_node.yaml
parameter_defaults:
  SwiftMountCheck: true
  SwiftUseLocalDir: false
  SwiftUseNodeDataLookup: true # Use NodeDataLookup for disk devices due to non-persistent disk names
  NodeDataLookup: |
    {
      "3043894b-6a6f-48ed-b359-1275778779a7": {
        "swift::storage::disks::args": {
          "pci-0000:08:00.0": {
            "base_dir": "/dev/disk/by-path/"
          }
        },
        "tripleo::profile::base::swift::ringbuilder::raw_disks": [
          ":%PORT%/pci-0000:08:00.0"
        ]
      },
      "2b9b218f-2d74-4ff4-a8d0-8538c77431c0": {
        "swift::storage::disks::args": {
          "pci-0000:08:00.0": {
            "base_dir": "/dev/disk/by-path/"
          }
        },
        "tripleo::profile::base::swift::ringbuilder::raw_disks": [
          ":%PORT%/pci-0000:08:00.0"
        ]
      },
      "ada67792-e5d5-4380-b62c-57240cf53749": {
        "swift::storage::disks::args": {
          "pci-0000:08:00.0": {
            "base_dir": "/dev/disk/by-path/"
          }
        },
        "tripleo::profile::base::swift::ringbuilder::raw_disks": [
          ":%PORT%/pci-0000:08:00.0"
        ]
      }
    }
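
Because the per-node entries above differ only in their key, the NodeDataLookup JSON can be generated rather than written by hand. This is a minimal sketch that only reproduces the structure shown above; it is not a fix for the deployment error. `build_node_data_lookup` is a hypothetical helper, and the hiera keys and device name are copied from the template as-is.

```python
import json


def build_node_data_lookup(node_ids, device, base_dir="/dev/disk/by-path/"):
    """Build one identical per-node entry for every node identifier."""
    return {
        node_id: {
            "swift::storage::disks::args": {device: {"base_dir": base_dir}},
            "tripleo::profile::base::swift::ringbuilder::raw_disks": [
                f":%PORT%/{device}"
            ],
        }
        for node_id in node_ids
    }


# Node identifiers used as keys in the template above.
node_ids = [
    "3043894b-6a6f-48ed-b359-1275778779a7",
    "2b9b218f-2d74-4ff4-a8d0-8538c77431c0",
    "ada67792-e5d5-4380-b62c-57240cf53749",
]
lookup = build_node_data_lookup(node_ids, "pci-0000:08:00.0")
print(json.dumps(lookup, indent=2))
```

The printed JSON can be pasted under `NodeDataLookup: |` in the environment file.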

When I deploy with this I get:
<13>Dec  4 16:46:51 puppet-user: Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Ring_object_device[0000:08:00.0] is already declared at (file: /etc/puppet/modules/tripleo/manifests/profile/base/swift/add_devices.pp, line: 47); cannot redeclare (file: /etc/puppet/modules/tripleo/manifests/profile/base/swift/add_devices.pp, line: 47) (file: /etc/puppet/modules/tripleo/manifests/profile/base/swift/add_devices.pp, line: 47, column: 3) (file: /etc/puppet/modules/tripleo/manifests/profile/base/swift/ringbuilder.pp, line: 130) on node controller-0.redhat.local

I tried other options, such as by-id with the disk's serial, but the name gets truncated oddly (only the last digit is kept, because of the '-') and the deployment fails with the same duplicate declaration error shown above.

I have my lab available if you want to test something out.

Please reach out to me.


Version-Release number of selected component (if applicable):
OSP17.1

How reproducible:
Random

Steps to Reproduce:
1. Introspect the node; if you are unlucky, sdb will be the OS disk.
2. Deploy Swift on sdb; this fails because the disk is already in use.

Actual results:
Failure to deploy swift on every deployment

Expected results:
Find a way to specify to swift which disk to deploy to.

Additional info:
I have a lab to test stuff out.
