Bug 1948717 - [workload-dfg] some OSD devices see keyring auth failures
Summary: [workload-dfg] some OSD devices see keyring auth failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 5.0
Assignee: Juan Miguel Olmo
QA Contact: Vasishta
Docs Contact: Ranjini M N
URL:
Whiteboard:
Depends On:
Blocks: 1959686
 
Reported: 2021-04-12 19:38 UTC by Tim Wilkinson
Modified: 2021-08-30 08:29 UTC
CC List: 10 users

Fixed In Version: ceph-16.2.0-32.el8cp
Doc Type: Bug Fix
Doc Text:
.The `ceph-volume` commands do not block OSDs and devices and run as expected
Previously, `ceph-volume` commands such as `ceph-volume lvm list` and `ceph-volume inventory` did not complete, which prevented the execution of other `ceph-volume` commands for creating OSDs, listing devices, and listing OSDs. With this update, the default output of these commands is no longer added to the Cephadm log, and all `ceph-volume` commands run in a container launched by the cephadm binary complete as expected.
Clone Of:
Environment:
Last Closed: 2021-08-30 08:29:34 UTC
Embargoed:


Attachments
Script to purge ceph clusters created with cephadm (1.36 KB, text/plain)
2021-04-21 16:23 UTC, Juan Miguel Olmo


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 50526 0 None None None 2021-04-27 14:18:10 UTC
Github ceph ceph pull 41045 0 None open mgr/cephadm: ceph-volume verbose only when fails 2021-05-04 05:30:50 UTC
Red Hat Bugzilla 1960265 1 low CLOSED Containerized ceph-volume command output blocked 2021-07-14 05:08:27 UTC
Red Hat Issue Tracker RHCEPH-1266 0 None None None 2021-08-30 00:21:02 UTC
Red Hat Product Errata RHBA-2021:3294 0 None None None 2021-08-30 08:29:43 UTC

Comment 2 Juan Miguel Olmo 2021-04-13 09:26:20 UTC
I suspect that we are trying to get the osd.61 "auth" information before ceph-volume has been able to create the OSD.

Let's verify whether my suspicion is true:

Please provide the following information:

1. In the host "f22-h25-000-6048r":

# grep "AQAFmHBgUrzRBxAA71uZEEHIOj6lNc9H3FeT2w==" /var/log/ceph/c3ce7c54-995c-11eb-91f1-bc97e17cf960/ceph-volume.log

2. You extracted the following log lines from the manager log, unfortunately without time information.

PM[ERR]Failed to apply osd.defaultDG spec DriveGroupSpec(name=defaultDG->placement=PlacementSpec(hosts=[HostPlacementSpec(hostname='f22-h21-000-6048r.rdu2.scalelab.redhat.com', .....
...
 mgr_module.MonCommandFailed: auth get failed: failed to find osd.61 in keyring retval: -2

Can you provide the timestamps of these log lines?
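
For context, this is roughly the check the manager is performing, expressed on the CLI (a sketch only; osd.61 is taken from the error above, and the command returns ENOENT until ceph-volume has actually created the OSD):

# ceph auth get osd.61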

Comment 5 Juan Miguel Olmo 2021-04-13 14:56:57 UTC
@Tim: Please attach the manager.log and the ceph-volume.log

Comment 7 Juan Miguel Olmo 2021-04-14 09:34:16 UTC
@Tim: The manager log is the source of the excerpt you provided in previous comments (or does this excerpt not come from your system?).

The manager log can be obtained from the system journal (by default, all Ceph daemons log to the systemd journal):

https://docs.ceph.com/en/latest/cephadm/operations/#logging-to-stdout


You can also access the manager logs using the podman logs command on the mgr container.


In both cases the output will probably be huge, so you may want to filter with journalctl to get an excerpt of the log around the moment you issued the command (from 10 minutes before to 1 day after):

ceph orch apply -i osd_spec.yml

That time, based on the information you provided in your comments, should be 4/9/21 3:06:00.
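
For reference, a minimal journalctl sketch for pulling such an excerpt (the fsid, mgr daemon name, and output file are placeholders to adjust for your cluster, not values taken from this one):

# journalctl -u 'ceph-<fsid>@mgr.<hostname>.<id>' --since "2021-04-09 02:56:00" --until "2021-04-10 03:06:00" > mgr-excerpt.log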

Comment 10 Juan Miguel Olmo 2021-04-15 15:09:58 UTC
I am working on this bug; I still do not know the cause of the problem. A session together will only help if we are able, and want, to reproduce the problem. Just ping me.

Unfortunately, the manager log provided does not contain the most interesting time interval, which is the moment when the "osd apply" command was executed.
Manager log provided:
-- Logs begin at Fri 2021-04-09 17:19:38 UTC, end at Wed 2021-04-14 13:44:41 UTC. --

The command was executed: 4/9/21 3:06:00

In any case, it is useful.

Tim: can you provide the active monitor log (f18-h14-000-r640.rdu2.scalelab.redhat.com) for the interval 4/9/21 3:00:00 - 4/9/21 20:00?

If that is not possible, how can I access the cluster where this happened? Is it possible to reproduce the test?

Comment 14 Juan Miguel Olmo 2021-04-21 16:22:11 UTC
I discovered strange behavior in the cluster. Now we are cleaning the devices properly.

We have discovered that OSDs are not being created because the devices are not available.

Before trying to install the cluster, Tim cleaned the devices using "sgdisk --zap-all <device>", and after that command,
using LVM we can see that the device is apparently clean (no PVs on the device).

However, using lsblk, we found a device-mapper table created on each of the devices. The device also cannot be cleaned using "ceph orch device zap"; an error appears.
The only way I found to clean the device was to use "dmsetup wipe_table".

Open question: why is "ceph orch device zap" / lvm zap failing?

Details:
========
# lsblk
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                                                                                     8:0    0 465.8G  0 disk  
|-sda1                                                                                                  8:1    0     4G  0 part  
| `-md1                                                                                                 9:1    0     4G  0 raid1 [SWAP]
|-sda2                                                                                                  8:2    0   512M  0 part  
| `-md0                                                                                                 9:0    0   511M  0 raid1 /boot
`-sda3                                                                                                  8:3    0 461.3G  0 part  
  `-md2                                                                                                 9:2    0 461.1G  0 raid1 /
sdb                                                                                                     8:16   0 465.8G  0 disk  
|-sdb1                                                                                                  8:17   0     4G  0 part  
| `-md1                                                                                                 9:1    0     4G  0 raid1 [SWAP]
|-sdb2                                                                                                  8:18   0   512M  0 part  
| `-md0                                                                                                 9:0    0   511M  0 raid1 /boot
`-sdb3                                                                                                  8:19   0 461.3G  0 part  
  `-md2                                                                                                 9:2    0 461.1G  0 raid1 /
sdc                                                                                                     8:32   0   1.8T  0 disk  
sdd                                                                                                     8:48   0   1.8T  0 disk  
`-ceph--76f7baff--6f7a--4a3b--84c5--036926de3cc1-osd--block--66f069cc--fe05--460e--ac0e--cc55729f3211 253:25   0   1.8T  0 lvm   
sde                                                                                                     8:64   0   1.8T  0 disk  
`-ceph--0d173c50--ae5e--4aaa--ad09--a826dbe8a2eb-osd--block--c7b5bf3f--31a4--482c--9e81--998d1e2b1d4f 253:27   0   1.8T  0 lvm   
sdf                                                                                                     8:80   0   1.8T  0 disk  
`-ceph--fb248ca1--787e--4974--a4d0--cebc7bd699ef-osd--block--6f79a8b0--2342--43b7--8bee--36b316086617 253:29   0   1.8T  0 lvm   



If we try to zap the device with the orchestrator, we get an error:
# ceph orch device zap f22-h21-000-6048r.rdu2.scalelab.redhat.com /dev/sdr --force
Error EINVAL: Zap failed: /bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
....
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: --> Zapping: /dev/sdr
/bin/podman:  stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman:  stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman:  stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman:  stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman:  stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman:  stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman:  stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman:  stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman: -->  RuntimeError: could not complete wipefs on device: /dev/sdr
Traceback (most recent call last):
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 7924, in <module>
    main()
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 7912, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 1611, in _infer_fsid
    return func(ctx)
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 1695, in _infer_image
    return func(ctx)
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 4433, in command_ceph_volume
    out, err, code = call_throws(ctx, c.run_cmd(), verbosity=verbosity)
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 1411, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: /bin/podman run --rm --ipc=host --no-hosts --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=registry.redhat.io/rhceph-beta/rhceph-5-rhel8@sha256:24c617082680ef85c43c6e2c4fe462c69805d2f38df83e51f968cec6b1c097a2 -e NODE_NAME=f22-h21-000-6048r.rdu2.scalelab.redhat.com -e CEPH_USE_RANDOM_NONCE=1 -v /var/run/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960:/var/run/ceph:z -v /var/log/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960:/var/log/ceph:z -v /var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/selinux:/sys/fs/selinux:ro -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm registry.redhat.io/rhceph-beta/rhceph-5-rhel8@sha256:24c617082680ef85c43c6e2c4fe462c69805d2f38df83e51f968cec6b1c097a2 lvm zap --destroy /dev/sdr


The only way I found to delete the device tables is using dmsetup:

dmsetup wipe_table <table name found in the lsblk output>

for example:
dmsetup wipe_table ceph--fb248ca1--787e--4974--a4d0--cebc7bd699ef-osd--block--6f79a8b0--2342--43b7--8bee--36b316086617

cleans the table on the device (/dev/sdr in this case), and only at that point can we see the device as available in the orchestrator.
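
A sketch of the full cleanup, assuming the leftover ceph--* mappers are stale and no running OSD is using them (host and device names are the ones from the example above):

# for dm in $(dmsetup ls | awk '/^ceph--/ {print $1}'); do dmsetup wipe_table "$dm"; done
# ceph orch device zap f22-h21-000-6048r.rdu2.scalelab.redhat.com /dev/sdr --force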

Comment 15 Juan Miguel Olmo 2021-04-21 16:23:31 UTC
Created attachment 1774119 [details]
Script to purge ceph clusters created with cephadm

Comment 17 Juan Miguel Olmo 2021-04-27 11:21:55 UTC
Bug found... I will post the fix upstream.

There is no workaround possible (other than modifying the code in the cephadm script), so for new clusters we need to wait for the fix upstream or make the modification manually (use a patched cephadm script).

.....................................

Apart from the "lock" problem when executing "cephadm ceph-volume lvm" (which I think is the root cause of this bug), I have found a limitation in the number of OSDs that can be started on each host.

The asynchronous non-blocking I/O setting (fs.aio-max-nr) controls a feature that allows a process to initiate multiple I/O operations simultaneously without having to wait for any of them to complete. This helps boost performance for applications that are able to overlap processing and I/O.

This setting defaults to 65536, which is not enough if you want to have more than 21 OSDs running on the same host.

I have changed the setting on all the hosts with more than 21 OSDs:

# sysctl fs.aio-max-nr
fs.aio-max-nr = 65536

# sysctl -w fs.aio-max-nr=1048576
fs.aio-max-nr = 1048576

In my case, after changing the setting I also needed to restart the OSD services that had failed (there were 3 failed per host, because it was not possible to start more than 21 OSD services with fs.aio-max-nr = 65536).

IMPORTANT:

SET fs.aio-max-nr = max(65536, 3500 * (number of OSDs you want to have on the node)) before creating OSD services.
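
A minimal sketch of making the setting persistent across reboots (the drop-in file name is illustrative; 1048576 comfortably covers the 24 OSDs per host in this setup, since 3500 * 24 = 84000):

# echo "fs.aio-max-nr = 1048576" > /etc/sysctl.d/90-ceph-aio-max-nr.conf
# sysctl -p /etc/sysctl.d/90-ceph-aio-max-nr.conf
fs.aio-max-nr = 1048576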

Comment 18 Vikhyat Umrao 2021-04-27 13:56:01 UTC
(In reply to Juan Miguel Olmo from comment #17)
> Bug found... i will post the fix to upstream.
> 
> There is no workaround possible (just to modify the code in cephadm script),
> so for new clusters we need to wait for the fix in upstream or doing the
> modification manually( use a patched cephadm script).
> 
> .....................................
> 
> Apart of the "lock" problem executing "cephadm ceph-volume lvm" ( i think
> that the root cause of this bug). I have found a limitation in the number of
> OSDs that can be started in each host.
> 
> The Asynchronous non-blocking I/O (fs.aio-max-nr) feature that allows a
> process to initiate multiple I/O operations simultaneously without having to
> wait for any of them to complete. This helps boost performance for
> applications that are able to overlap processing and I/O.
> 
> This setting is set to 65536 by default and is not enough if you want to
> have more than 21 OSDs running in the same host. 
> 
> I have changed the setting in all the hosts with more than 21 OSDs :
> 
> # sysctl fs.aio-max-nr
> fs.aio-max-nr = 65536
> 
> # sysctl -w fs.aio-max-nr=1048576
> fs.aio-max-nr = 1048576
> 
> In my case, after change the setting I also needed to restart the osd
> services failed.  (there was 3 of them failed by hosts because it was not
> possible to start more than 21 osds services with fs.aio-max-nr = 65536)
> 
> IMPORTANT:
> 
> SET fs.aio-max-nr = max(65536, 3500 * (number of osds you want to have in
> the node)) before creating OSD services.

Thanks, Juan, great findings! Can you please explain a little bit more about the locking issue? Was it lock contention?

I think in DFG clusters we can live for now with manually modifying the cephadm scripts and wait for the fix to land in the RHCS 5 cephadm builds. Tim, I think we are good with that, right?

Comment 19 Tim Wilkinson 2021-04-27 14:13:26 UTC
Yes, with the modified cephadm we should be good to go as long as it is preserved. If I compare /usr/sbin/cephadm on the bootstrap to that on one of the OSD nodes, it's drastically different although they're both 16.2.0-4.el8cp. I'm not sure which one to save. I'll verify with Juan which one to preserve and see what is required to add the other five OSD nodes so we get to our target 192 OSDs.

Comment 20 Juan Miguel Olmo 2021-04-27 14:19:55 UTC
The diagnosis is in the upstream bug:
https://tracker.ceph.com/issues/50526

and the explanation of the fix is in the associated PR:
https://github.com/ceph/ceph/pull/41045

Comment 21 Vikhyat Umrao 2021-04-27 14:22:46 UTC
(In reply to Tim Wilkinson from comment #19)
> Yes, with the modified cephadm we should be good to go as long as it is
> preserved. If I compare /usr/sbin/cephadm on the bootstrap to that on one of
> the OSD nodes, it's drastically different although they're both
> 16.2.0-4.el8cp. I'm not sure which one to save. I'll verify with Juan which
> one to preserve and see what is required to add the other five OSD nodes so
> we get to our target 192 OSDs.

Thanks, Tim. I would recommend redeploying the complete cluster with this fix and seeing if the cluster comes back with 192 OSDs up/in, instead of just adding 5 more nodes.

Comment 22 Vikhyat Umrao 2021-04-27 14:29:21 UTC
(In reply to Juan Miguel Olmo from comment #20)
> In the upstream bug is the diagnose:
> https://tracker.ceph.com/issues/50526
> 
> and in the associated PR is the explanation of the fix:
> https://github.com/ceph/ceph/pull/41045

Thank you for sharing the links, Juan! So verbose output can cause this much trouble :)

Comment 23 Tim Wilkinson 2021-04-27 19:09:22 UTC
I made the fix to the existing env and applied the remaining five OSD nodes via the spec file. All OSDs started (it takes a few minutes for the 'up' OSDs to become 'in'). I've since purged the env and reran the deployment procedure. I applied the host_spec, made the changes to each of the OSD nodes, and applied the osd_spec file containing all eight OSD nodes. All expected OSDs (8*24 = 192) started.


  cluster:
    id:     51ff166a-a785-11eb-84b1-bc97e17cf960
    health: HEALTH_WARN
            3 stray daemon(s) not managed by cephadm

  services:
    mon: 3 daemons, quorum f18-h14-000-r640.rdu2.scalelab.redhat.com,f18-h09-000-r640,f18-h10-000-r640 (age 33m)
    mgr: f18-h14-000-r640.rdu2.scalelab.redhat.com.mdgfsr(active, since 42m), standbys: f22-h21-000-6048r.yvujdw
    osd: 192 osds: 192 up (since 4m), 192 in (since 9m)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   12 TiB used, 349 TiB / 361 TiB avail
    pgs:     1 active+clean




The OSD node changes:

1.)  

 # sysctl -w fs.aio-max-nr=1048576


2.)

 # vi /var/lib/ceph/<fsid>/cephadm.<uuid>
 # diff <new cephadm.<uuid>> <orig cephadm.<uuid>>  
 <     out, err, code = call_throws(ctx, c.run_cmd())
 ---
 >     verbosity = CallVerbosity.VERBOSE if ctx.log_output else CallVerbosity.VERBOSE_ON_FAILURE
 >     out, err, code = call_throws(ctx, c.run_cmd(), verbosity=verbosity)

Comment 24 Vikhyat Umrao 2021-04-28 05:21:54 UTC
(In reply to Tim Wilkinson from comment #23)
> I made the fix to the existing env and applied the remaining five OSD nodes
> via spec file. All OSDs started (it takes a few min for the up to become
> in). I've since purged the env and reran the deployment procedure. I applied
> the host_spec, made the changes to each of the OSD nodes, and applied the
> osd_spec file containing all eight OSD nodes. All expected OSDs (8*24)
> started.
> 
> 
>   cluster:
>     id:     51ff166a-a785-11eb-84b1-bc97e17cf960
>     health: HEALTH_WARN
>             3 stray daemon(s) not managed by cephadm
> 
>   services:
>     mon: 3 daemons, quorum
> f18-h14-000-r640.rdu2.scalelab.redhat.com,f18-h09-000-r640,f18-h10-000-r640
> (age 33m)
>     mgr: f18-h14-000-r640.rdu2.scalelab.redhat.com.mdgfsr(active, since
> 42m), standbys: f22-h21-000-6048r.yvujdw
>     osd: 192 osds: 192 up (since 4m), 192 in (since 9m)

Woot! amazing work guys!

Comment 25 Tim Wilkinson 2021-04-28 15:05:37 UTC
Additional info:
---------------

Podman version 2.2.1
Python version 3.6.8

Comment 26 Sebastian Wagner 2021-04-28 16:02:03 UTC
Tim, can you also provide:


> conmon --version

Comment 27 Tim Wilkinson 2021-04-28 17:08:08 UTC
conmon version 2.0.22

Comment 29 Juan Miguel Olmo 2021-05-04 07:25:23 UTC
@tserling:
I will try to speed up the merge to Pacific as much as possible, and will do the cherry-pick.

Comment 37 Juan Miguel Olmo 2021-05-13 08:24:55 UTC
@twilkins :

The test of upgrading the conmon package was to see if we could find a root cause for the blocked console output in the containerized ceph-volume command. But the fix we proposed was tested and verified to work OK. So I think we can leave this bug in "verified", and it is probably good to open a new bug pointing directly to the real problem we have detected, which is not directly related to cephadm.

I propose to use: 

Containerized Ceph-volume command output blocked
=================================================

Version-Release number: kernel-4.18.0-193.el8.x86_64

OS Version: RHEL 8
Podman version 2.2.1
Python version 3.6.8
Conmon version: 2.0.22


How to reproduce (only seen on hosts with a certain number of devices, for example 39):

   # cephadm ceph-volume inventory

This launches a ceph container to execute the command. The result is that the console output freezes and we can see a file lock:

   # lslocks |grep cephad
   python3         1463459  FLOCK   0B WRITE 0     0   0 /run/cephadm/9fa2b396-adb5-11eb-a2d3-bc97e17cf960.lock
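
To inspect the stuck run, a quick sketch (the PID is the one reported by lslocks above):

# ps -o pid,ppid,etime,cmd -p 1463459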

Comment 40 Juan Miguel Olmo 2021-05-25 07:36:48 UTC
It seems OK. Only one clarification: I think it is necessary to point out that ceph-volume is run in a container launched by the cephadm binary.

Comment 43 errata-xmlrpc 2021-08-30 08:29:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294

