Bug 2231360 - OSD addition always gets stuck when Ceph reports HEALTH_ERR due to full OSDs
Summary: OSD addition always gets stuck when Ceph reports HEALTH_ERR due to full OSDs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-cli
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Subham Rai
QA Contact: Joy John Pinto
URL:
Whiteboard:
Depends On:
Blocks: 2260844
 
Reported: 2023-08-11 11:34 UTC by Aman Agrawal
Modified: 2024-07-17 13:11 UTC (History)
10 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Option to modify thresholds for the Ceph `full`, `nearfull`, and `backfillfull` attributes. Depending on the cluster requirements, the `full`, `nearfull`, and `backfillfull` threshold values can be updated using the `odf-cli` command. For example: `odf set full <val>`, `odf set nearfull <val>`, `odf set backfillfull <val>`. Note: the value must be in the range 0.0 to 1.0, and it should not be very close to 1.0.
Clone Of:
Environment:
Last Closed: 2024-07-17 13:11:17 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage odf-cli pull 40 0 None open osd: add command to set osd full ratio 2024-05-03 16:57:56 UTC
Github red-hat-storage odf-cli pull 46 0 None open Bug 2231360: [release-4.16] osd: add command to set osd full ratio 2024-05-14 06:41:09 UTC
Red Hat Product Errata RHSA-2024:4591 0 None None None 2024-07-17 13:11:25 UTC

Comment 4 Santosh Pillai 2023-08-11 15:06:18 UTC
The OSD prepare pod is stuck while running the ceph-volume prepare command.
The backing disk for this PV is /dev/sdd on the compute-1 node.
From the ceph-volume logs on the compute-1 node where this OSD prepare pod is stuck:

```
[2023-08-11 10:11:21,471][ceph_volume.process][INFO  ] Running command: /usr/bin/ceph-bluestore-tool show-label --dev /dev/sdd
[2023-08-11 10:11:21,492][ceph_volume.process][INFO  ] stderr unable to read label for /dev/sdd: (2) No such file or directory
[2023-08-11 10:11:21,492][ceph_volume.devices.raw.list][DEBUG ] assuming device /dev/sdd is not BlueStore; ceph-bluestore-tool failed to get info from device: []
['unable to read label for /dev/sdd: (2) No such file or directory']
[2023-08-11 10:11:21,492][ceph_volume.devices.raw.list][INFO  ] device /dev/sdd does not have BlueStore information
```

Could be something wrong with the disk.

Comment 5 Santosh Pillai 2023-08-11 15:20:02 UTC
ceph-volume inventory on this disk:

```
sh-5.1# ceph-volume inventory /dev/sdd
 stderr: lsblk: /dev/sdd: not a block device
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 33, in <module>
    sys.exit(load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')())
  File "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 41, in __init__
    self.main(self.argv)
  File "/usr/lib/python3.9/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.9/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.9/site-packages/ceph_volume/inventory/main.py", line 50, in main
    self.format_report(Device(self.args.path, with_lsm=self.args.with_lsm))
  File "/usr/lib/python3.9/site-packages/ceph_volume/util/device.py", line 131, in __init__
    self._parse()
  File "/usr/lib/python3.9/site-packages/ceph_volume/util/device.py", line 225, in _parse
    dev = disk.lsblk(self.path)
  File "/usr/lib/python3.9/site-packages/ceph_volume/util/disk.py", line 243, in lsblk
    result = lsblk_all(device=device,
  File "/usr/lib/python3.9/site-packages/ceph_volume/util/disk.py", line 337, in lsblk_all
    raise RuntimeError(f"Error: {err}")
RuntimeError: Error: ['lsblk: /dev/sdd: not a block device']
```

Comment 6 Santosh Pillai 2023-08-11 15:24:43 UTC
sh-5.1# lsblk /dev/sdd
lsblk: /dev/sdd: not a block device
sh-5.1#
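The `lsblk` failure above can be reproduced outside of ceph-volume. As a rough sketch (the helper function is my own, not part of any tool; it mirrors the checks ceph-volume effectively relies on), a device path can be sanity-checked before blaming the prepare pod:

```shell
#!/bin/sh
# Minimal sketch: verify that a device path exists and is a block
# device, the two conditions the ceph-volume/lsblk errors point at.
check_block_device() {
  dev="$1"
  if [ ! -e "$dev" ]; then
    # matches the "(2) No such file or directory" from ceph-bluestore-tool
    echo "$dev: no such file or directory"
    return 1
  elif [ ! -b "$dev" ]; then
    # matches the "not a block device" from lsblk
    echo "$dev: not a block device"
    return 1
  fi
  echo "$dev: ok"
}
```

On the compute-1 node, `check_block_device /dev/sdd` would distinguish a missing device node from a node that exists but is not a block device, which narrows the fault to the disk or its presentation to the host.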

Comment 7 Travis Nielsen 2023-08-15 15:20:57 UTC
Moving out of 4.14 while the bad disk is being investigated.

Comment 10 Aman Agrawal 2023-08-24 05:43:10 UTC
I have no idea about its reproducibility. It was hit for the first time, if I am not wrong.

Comment 11 Santosh Pillai 2023-08-24 06:45:45 UTC
Ceph is complaining about the disk:
```
RuntimeError: Error: ['lsblk: /dev/sdd: not a block device']
```

This is coming directly from the `lsblk` output, so the issue could be with the disk. I can't investigate further since the cluster is no longer available, and I didn't get a chance to look into it while it was up.

Suggesting to retry the scenario and share a fresh cluster.
Meanwhile, I'm looking into the attached must-gather for any clues.

Comment 12 Santosh Pillai 2023-08-28 06:50:40 UTC
Nothing interesting in the must-gather.
Requesting QE to retry this scenario and provide the cluster again.
Thanks.

Comment 13 Santosh Pillai 2023-09-26 03:32:42 UTC
is the issue still reproducible?

Comment 14 Aman Agrawal 2023-09-26 12:32:52 UTC
(In reply to Santosh Pillai from comment #13)
> is the issue still reproducible?

It would take time to re-test it. Hence not a blocker for now.

Comment 15 Travis Nielsen 2023-10-05 18:14:02 UTC
In a new repro, the issue was that the cluster was full and the ceph health was HEALTH_ERR because of the full cluster. By increasing the full ratio [1], the OSD creation was able to complete. 

Shall we close this issue? 

[1] In the toolbox run: ceph osd set-full-ratio 0.9
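The toolbox workaround above can be sketched as follows. The `ceph osd set-full-ratio` command is real; the `validate_ratio`/`raise_full_ratio` wrappers and the 0.95 guard value are my additions for illustration, and the wrapper only echoes the command (a dry run) since it assumes no live cluster:

```shell
#!/bin/sh
# Guard: accept only ratios strictly above 0.0 and not too close to 1.0.
validate_ratio() {
  awk -v r="$1" 'BEGIN { exit !(r > 0.0 && r <= 0.95) }'
}

raise_full_ratio() {
  ratio="$1"
  validate_ratio "$ratio" || { echo "refusing ratio $ratio"; return 1; }
  # Inside the rook-ceph toolbox pod, against a live cluster, this would be:
  #   ceph osd dump | grep full_ratio     # inspect current thresholds
  #   ceph osd set-full-ratio "$ratio"    # raise temporarily
  # Remember to restore the default (0.85) once rebalancing finishes.
  echo "would run: ceph osd set-full-ratio $ratio"
}
```

The guard reflects the later Doc Text note that the value must stay in (0.0, 1.0) and not sit very close to 1.0.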

Comment 16 Aman Agrawal 2023-10-06 07:02:11 UTC
(In reply to Travis Nielsen from comment #15)
> In a new repro, the issue was that the cluster was full and the ceph health
> was HEALTH_ERR because of the full cluster. By increasing the full ratio
> [1], the OSD creation was able to complete. 
> 
> Shall we close this issue? 
> 
> [1] In the toolbox run: ceph osd set-full-ratio 0.9

Hi Travis, 

I don't think we should close the issue, because adding capacity had issues: the osd-prepare jobs were stuck in the Running state and never completed, meaning OSDs were never added until you changed the value via `ceph osd set-full-ratio 0.9`.
This won't be a feasible solution for customers, as we don't recommend that they use the toolbox, and adding OSDs while Ceph is reporting HEALTH_ERR would always be a problem.

We should probably track this BZ to improve adding OSDs in scenarios like this.

Comment 17 Travis Nielsen 2023-10-06 15:18:41 UTC
Sounds good to keep it open, with the new BZ title to improve this scenario. It does get into a bigger question of what Rook should automatically do when the cluster does fill up. Rook could potentially detect this health error that the OSDs are full and increase the full ratio at the moment of adding a new OSD. But we have to be very conservative with this automation. The admin needs to be aware and likely reduce load on the cluster to avoid issues even while the new OSDs are coming online.

Comment 18 Aman Agrawal 2023-10-06 15:58:40 UTC
(In reply to Travis Nielsen from comment #17)
> Sounds good to keep it open, with the new BZ title to improve this scenario.
> It does get into a bigger question of what Rook should automatically do when
> the cluster does fill up. Rook could potentially detect this health error
> that the OSDs are full and increase the full ratio at the moment of adding a
> new OSD. But we have to be very conservative with this automation. The admin
> needs to be aware and likely reduce load on the cluster to avoid issues even
> while the new OSDs are coming online.

I like the idea. Do you think we should raise a warning-level alert along with this change to make admins aware of the situation?

Comment 19 Santosh Pillai 2023-10-09 04:27:28 UTC
(In reply to Aman Agrawal from comment #16)
> (In reply to Travis Nielsen from comment #15)
> > In a new repro, the issue was that the cluster was full and the ceph health
> > was HEALTH_ERR because of the full cluster. By increasing the full ratio
> > [1], the OSD creation was able to complete. 
> > 
> > Shall we close this issue? 
> > 
> > [1] In the toolbox run: ceph osd set-full-ratio 0.9
> 
> Hi Travis, 
> 
> I don't think we should close the issue because adding capacity had issues
> as osd-prepare jobs were stuck in running state and never compeleted,
> meaning OSDs were never added until you changed the value for ceph osd
> set-full-ratio to 0.9.

IMO, the original issue for which the BZ was filed is different.
The original cluster had the following issue:
```
sh-5.1# lsblk /dev/sdd
lsblk: /dev/sdd: not a block device
```

The new issue is that the OSDs are full, due to which the user is not able to add new OSDs.

So we should either change the BZ title/description as Travis mentioned, or open a new one for not being able to add new OSDs when the existing OSDs are running full/nearfull.


> This won't be a feasible solution for customers as we don't recommend them
> using toolbox, and addition of OSDs when ceph is reporting HEALTH_ERR would
> always be a problem.
> 
> We should probably track this BZ to improve adding OSDs in scenarios like
> this.

Comment 20 Aman Agrawal 2023-10-09 09:49:38 UTC
(In reply to Santosh Pillai from comment #19)
> (In reply to Aman Agrawal from comment #16)
> > (In reply to Travis Nielsen from comment #15)
> > > In a new repro, the issue was that the cluster was full and the ceph health
> > > was HEALTH_ERR because of the full cluster. By increasing the full ratio
> > > [1], the OSD creation was able to complete. 
> > > 
> > > Shall we close this issue? 
> > > 
> > > [1] In the toolbox run: ceph osd set-full-ratio 0.9
> > 
> > Hi Travis, 
> > 
> > I don't think we should close the issue because adding capacity had issues
> > as osd-prepare jobs were stuck in running state and never compeleted,
> > meaning OSDs were never added until you changed the value for ceph osd
> > set-full-ratio to 0.9.
> 
> IMO, the original issue for which the BZ was filed is different than. 

OSDs were full in the original issue as well, and that's probably why one of the OSDs wasn't added.

> The Original cluster had the following issue:  
> ``` sh-5.1# lsblk /dev/sdd
> lsblk: /dev/sdd: not a block device
> sh-5.1#```
> 
> The new issue is about OSDs being full due to which user is not able to add
> new OSDs. 
> 
> So should either change the BZ title/description like Travis mentioned. Or
> open a new one for not being able to add new OSDs when existing OSDs are
> running full/near full. 
> 
> 
> > This won't be a feasible solution for customers as we don't recommend them
> > using toolbox, and addition of OSDs when ceph is reporting HEALTH_ERR would
> > always be a problem.
> > 
> > We should probably track this BZ to improve adding OSDs in scenarios like
> > this.

Comment 21 Santosh Pillai 2023-11-21 15:25:24 UTC
Still needs investigation

Comment 22 Mudit Agarwal 2024-01-02 10:20:43 UTC
Not a blocker

Comment 23 Santosh Pillai 2024-01-03 15:13:08 UTC
Moving it to 4.16 since it's not a blocker and a workaround is available.

Comment 24 Santosh Pillai 2024-02-14 12:15:33 UTC
(In reply to Travis Nielsen from comment #17)
> Sounds good to keep it open, with the new BZ title to improve this scenario.
> It does get into a bigger question of what Rook should automatically do when
> the cluster does fill up. Rook could potentially detect this health error
> that the OSDs are full and increase the full ratio at the moment of adding a
> new OSD. But we have to be very conservative with this automation. The admin
> needs to be aware and likely reduce load on the cluster to avoid issues even
> while the new OSDs are coming online.


Rather than automating this, could it be part of the ODF CLI tool?
The admin would have to run it when adding a new OSD while the existing OSDs are already full. It would need documentation.

Comment 25 Travis Nielsen 2024-02-14 20:05:38 UTC
As nice as it would be to adjust the OSD full ratio setting during OSD creation to allow for creation of the OSDs, it also suffers from several issues:
- How high to adjust the full ratio? If the default is 85%, should it be 90% or something else?
- If already adjusted higher and the OSDs still failed to add for some other reason, how can the customer proceed if the OSDs have now reached the new threshold?
- When would the threshold be returned to its previous level? If all goes well, after OSD creation is completed, but what if there is some error?

Exposing a command in the new odf CLI tool will better answer these concerns about the automation. While the CLI tool is less optimal since it requires the admin to run it, it does seem a better design:
- The admin can decide exactly when to increase the threshold and to what value
- The admin can decide when to return the threshold back to its previous value

Any concerns with requiring the CLI tool intervention?

Comment 26 Santosh Pillai 2024-02-15 05:11:07 UTC
(In reply to Travis Nielsen from comment #25)
> As nice as it would be to adjust the OSD full ratio setting during OSD
> creation to allow for creation of the OSDs, it also suffers from several
> issues:
> - How high to adjust the full ratio? If the default is 85%, should it be 90%
> or something else?
> - If already adjusted higher and the OSDs still failed to add for some other
> reason, how can the customer proceed if the OSDs have now reached the new
> threshold?
> - When would the threshold be returned to its previous level? If all goes
> well, after OSD creation is completed, but what if there is some error?

Agree, these questions will be difficult to answer since the workload might already be running on the cluster. If Rook increases the full ratio to say, 90%, and workloads are still running, the OSDs might reach 90% before the customer has added a new OSD. So it won't be very easy to automate via Rook. 

> 
> Exposing a command in the new odf CLI tool will better answer these concerns
> about the automation. While the CLI tool is less optimal since it requires
> the admin to run it, it does seem a better design:
> - The admin can decide exactly when to increase the threshold and to what
> value
> - The admin can decide when to return the threshold back to its previous
> value
> 
> Any concerns with requiring the CLI tool intervention?

No concerns. Apart from CLI, it should require documentation effort to help the admin know the exact steps to be taken. Steps I can think of are:
- Stop the workload 
- Increase the full ratio
- Add OSDs
- Change full ratio back to default.
- Wait for data rebalance.
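The steps above can be sketched as a dry-run checklist. The `odf set full <val>` syntax is taken from the eventual odf-cli enhancement; the function and the echo wrapper are illustrative, not a real tool, and the workload/OSD steps are site-specific so they stay as comments:

```shell
#!/bin/sh
# Dry-run sketch of the admin procedure for expanding a full cluster.
run() { echo "would run: $*"; }

expand_when_full() {
  # 1. Stop or throttle the workload (site-specific, not shown).
  run odf set full 0.9    # 2. raise the full threshold temporarily
  # 3. Add OSDs via the usual add-capacity flow.
  run odf set full 0.85   # 4. restore the default threshold
  # 5. Wait for data rebalancing to finish.
}
```

Keeping the raise and the restore in one procedure addresses Travis's concern in comment 25 about when the threshold gets returned to its previous level.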

Thoughts? @srai  and Aman

Comment 27 Aman Agrawal 2024-02-15 08:19:10 UTC
(In reply to Santosh Pillai from comment #26)
> (In reply to Travis Nielsen from comment #25)
> > As nice as it would be to adjust the OSD full ratio setting during OSD
> > creation to allow for creation of the OSDs, it also suffers from several
> > issues:
> > - How high to adjust the full ratio? If the default is 85%, should it be 90%
> > or something else?
> > - If already adjusted higher and the OSDs still failed to add for some other
> > reason, how can the customer proceed if the OSDs have now reached the new
> > threshold?
> > - When would the threshold be returned to its previous level? If all goes
> > well, after OSD creation is completed, but what if there is some error?
> 
> Agree, these questions will be difficult to answer since the workload might
> already be running on the cluster. If Rook increases the full ratio to say,
> 90%, and workloads are still running, the OSDs might reach 90% before the
> customer has added a new OSD. So it won't be very easy to automate via Rook. 
> 
> > 
> > Exposing a command in the new odf CLI tool will better answer these concerns
> > about the automation. While the CLI tool is less optimal since it requires
> > the admin to run it, it does seem a better design:
> > - The admin can decide exactly when to increase the threshold and to what
> > value
> > - The admin can decide when to return the threshold back to its previous
> > value
> > 
> > Any concerns with requiring the CLI tool intervention?
> 
> No concerns. Apart from CLI, it should require documentation effort to help
> the admin know the exact steps to be taken. Steps I can think of are:
> - Stop the workload 
> - Increase the full ratio
> - Add OSDs
> - Change full ratio back to default.
> - Wait for data rebalance.
> 
> Thoughts? @srai  and Aman

I don't think it is feasible to recommend stopping IOs. At times, it's difficult to do even for QE setups.

Isn't there a way to add OSDs successfully without stopping IOs in cases like this, with or without changing the current threshold of 85%?

Comment 28 Travis Nielsen 2024-02-15 23:00:27 UTC
> I don't think it is feasible to recommend stopping of IOs. At times, it's
> even difficult to do so for QE setups.
> 
> Isn't there a way to have successful OSD addition without the need to stop
> IOs in cases like this? This could be with or without changing the current
> threshold of 85%.

What is critical is for the new OSDs to be created and to start rebalancing before the cluster fills up again to the new threshold. 
So stopping IO isn't really necessary, it just puts the cluster at risk of filling up to the new threshold before the OSDs are ready to handle the load. Realistically, this shouldn't be an issue though as long as the OSDs are created immediately.

Comment 29 Subham Rai 2024-02-16 06:56:04 UTC
(In reply to Santosh Pillai from comment #26)
> (In reply to Travis Nielsen from comment #25)
> > As nice as it would be to adjust the OSD full ratio setting during OSD
> > creation to allow for creation of the OSDs, it also suffers from several
> > issues:
> > - How high to adjust the full ratio? If the default is 85%, should it be 90%
> > or something else?
> > - If already adjusted higher and the OSDs still failed to add for some other
> > reason, how can the customer proceed if the OSDs have now reached the new
> > threshold?
> > - When would the threshold be returned to its previous level? If all goes
> > well, after OSD creation is completed, but what if there is some error?
> 
> Agree, these questions will be difficult to answer since the workload might
> already be running on the cluster. If Rook increases the full ratio to say,
> 90%, and workloads are still running, the OSDs might reach 90% before the
> customer has added a new OSD. So it won't be very easy to automate via Rook. 
> 
> > 
> > Exposing a command in the new odf CLI tool will better answer these concerns
> > about the automation. While the CLI tool is less optimal since it requires
> > the admin to run it, it does seem a better design:
> > - The admin can decide exactly when to increase the threshold and to what
> > value
> > - The admin can decide when to return the threshold back to its previous
> > value
> > 
> > Any concerns with requiring the CLI tool intervention?
> 
> No concerns. Apart from CLI, it should require documentation effort to help
> the admin know the exact steps to be taken. Steps I can think of are:
> - Stop the workload 
> - Increase the full ratio
> - Add OSDs
> - Change full ratio back to default.
> - Wait for data rebalance.
> 
> Thoughts? @srai  and Aman

Yeah, sounds good to add it to the CLI tool. And yes, automatically increasing the ratio to 90% may not be the right solution, since we'll always cross that limit at some point, so it's better to leave it to the admin to update.

Comment 30 Aman Agrawal 2024-02-21 19:15:21 UTC
(In reply to Travis Nielsen from comment #28)
> > I don't think it is feasible to recommend stopping of IOs. At times, it's
> > even difficult to do so for QE setups.
> > 
> > Isn't there a way to have successful OSD addition without the need to stop
> > IOs in cases like this? This could be with or without changing the current
> > threshold of 85%.
> 
> What is critical is for the new OSDs to be created and to start rebalancing
> before the cluster fills up again to the new threshold. 
> So stopping IO isn't really necessary, it just puts the cluster at risk of
> filling up to the new threshold before the OSDs are ready to handle the
> load. Realistically, this shouldn't be an issue though as long as the OSDs
> are created immediately.

ACK, so the feasible solution is to allow immediate addition of OSDs. And I hope by CLI we don't mean interacting with the toolbox?

Comment 57 errata-xmlrpc 2024-07-17 13:11:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

