I suspect that we are trying to get the osd.61 "auth" information before ceph-volume has been able to create the OSD. Let's verify whether my suspicion is correct. Please provide the following information:

1. On the host "f22-h25-000-6048r":

# grep "AQAFmHBgUrzRBxAA71uZEEHIOj6lNc9H3FeT2w==" /var/log/ceph/c3ce7c54-995c-11eb-91f1-bc97e17cf960/ceph-volume.log

2. You extracted the following log lines from the manager log, unfortunately without time information:

PM[ERR]Failed to apply osd.defaultDG spec DriveGroupSpec(name=defaultDG->placement=PlacementSpec(hosts=[HostPlacementSpec(hostname='f22-h21-000-6048r.rdu2.scalelab.redhat.com', .....
...
mgr_module.MonCommandFailed: auth get failed: failed to find osd.61 in keyring retval: -2

Can you provide the timestamps of these log lines?
@Tim: Please attach the manager.log and the ceph-volume.log
@Tim: The manager log is the source of the excerpt you provided in previous comments... (or is this excerpt not coming from your system?)

The manager log can be obtained from the system journal (by default, all Ceph daemons log to the systemd journal):
https://docs.ceph.com/en/latest/cephadm/operations/#logging-to-stdout

You can also access the manager logs using the "podman logs" command on the mgr container. In both cases the output is probably going to be huge, so with journalctl you can filter in order to get an excerpt of the log (10 minutes before, 1 day after) around the moment you issued the command:

ceph orch apply -i osd_spec.yml

and that time, based on the information you provided in your comments, should be: 4/9/21 3:06:00
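As a sketch of the kind of filtering suggested above: journalctl can do this directly with its real `--since`/`--until` flags, but if the log was already saved to a plain file, a string comparison on the timestamp field works too. The sample file, timestamps, and window below are hypothetical stand-ins, not taken from this cluster:

```shell
# Hypothetical saved log; ISO-style timestamps in the first field sort
# lexicographically, so plain string comparison in awk selects the window.
cat > /tmp/mgr-demo.log <<'EOF'
2021-04-09T02:50:00 earlier entry
2021-04-09T03:06:00 apply osd spec issued
2021-04-09T03:07:12 Failed to apply osd.defaultDG spec
2021-04-11T08:00:00 much later entry
EOF

# 10 minutes before the command, 1 day after:
awk '$1 >= "2021-04-09T02:56:00" && $1 <= "2021-04-10T03:06:00"' /tmp/mgr-demo.log
```

With the live journal, the equivalent would be `journalctl --since "2021-04-09 02:56:00" --until "2021-04-10 03:06:00"` plus a filter for the mgr unit.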
I am working on this bug, but I still do not know the cause of the problem. A session together will only help if we are able to, and want to, reproduce the problem. Just ping me.

Unfortunately, the manager log provided does not contain the most interesting time interval, which is the moment when the "osd apply" command was executed.

Manager log provided:
-- Logs begin at Fri 2021-04-09 17:19:38 UTC, end at Wed 2021-04-14 13:44:41 UTC. --

The command was executed: 4/9/21 3:06:00

In any case, it is useful. Tim: can you provide the active monitor log (f18-h14-000-r640.rdu2.scalelab.redhat.com) for the interval 4/9/21 3:00:00 - 4/9/21 20:00? If that is not possible, how can I access the cluster where this happened? Is it possible to reproduce the test?
Discovered strange behavior in the cluster. Now we are cleaning the devices properly.

We have discovered that OSDs are not being created because the devices are not available. Before trying to install the cluster, Tim cleaned the devices using "sgdisk --zap-all <device>", and after that command, using LVM, we can see that apparently the device is clean (no PVs on the device). However, using lsblk, we found a table created on each of the devices. The device also cannot be cleaned using "ceph orch device zap"; an error appears. The only way I found to clean the device was to use "dmsetup wipe_table".

Open question: why is "ceph orch device zap" / lvm zap failing?

Details:
========

# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                    8:0    0 465.8G  0 disk
|-sda1                 8:1    0     4G  0 part
| `-md1                9:1    0     4G  0 raid1 [SWAP]
|-sda2                 8:2    0   512M  0 part
| `-md0                9:0    0   511M  0 raid1 /boot
`-sda3                 8:3    0 461.3G  0 part
  `-md2                9:2    0 461.1G  0 raid1 /
sdb                    8:16   0 465.8G  0 disk
|-sdb1                 8:17   0     4G  0 part
| `-md1                9:1    0     4G  0 raid1 [SWAP]
|-sdb2                 8:18   0   512M  0 part
| `-md0                9:0    0   511M  0 raid1 /boot
`-sdb3                 8:19   0 461.3G  0 part
  `-md2                9:2    0 461.1G  0 raid1 /
sdc                    8:32   0   1.8T  0 disk
sdd                    8:48   0   1.8T  0 disk
`-ceph--76f7baff--6f7a--4a3b--84c5--036926de3cc1-osd--block--66f069cc--fe05--460e--ac0e--cc55729f3211
                     253:25   0   1.8T  0 lvm
sde                    8:64   0   1.8T  0 disk
`-ceph--0d173c50--ae5e--4aaa--ad09--a826dbe8a2eb-osd--block--c7b5bf3f--31a4--482c--9e81--998d1e2b1d4f
                     253:27   0   1.8T  0 lvm
sdf                    8:80   0   1.8T  0 disk
`-ceph--fb248ca1--787e--4974--a4d0--cebc7bd699ef-osd--block--6f79a8b0--2342--43b7--8bee--36b316086617
                     253:29   0   1.8T  0 lvm

If we try to zap the device with the orchestrator, we get an error:

# ceph orch device zap f22-h21-000-6048r.rdu2.scalelab.redhat.com /dev/sdr --force
Error EINVAL: Zap failed: /bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
....
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman: --> Zapping: /dev/sdr
/bin/podman: stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman: stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman: stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman: stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman: stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman: stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman: stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman: stderr: wipefs: error: /dev/sdr: probing initialization failed: Device or resource busy
/bin/podman: --> failed to wipefs device, will try again to workaround probable race condition
/bin/podman: --> RuntimeError: could not complete wipefs on device: /dev/sdr

Traceback (most recent call last):
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 7924, in <module>
    main()
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 7912, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 1611, in _infer_fsid
    return func(ctx)
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 1695, in _infer_image
    return func(ctx)
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 4433, in command_ceph_volume
    out, err, code = call_throws(ctx, c.run_cmd(), verbosity=verbosity)
  File "/var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349", line 1411, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: /bin/podman run --rm --ipc=host --no-hosts --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=registry.redhat.io/rhceph-beta/rhceph-5-rhel8@sha256:24c617082680ef85c43c6e2c4fe462c69805d2f38df83e51f968cec6b1c097a2 -e NODE_NAME=f22-h21-000-6048r.rdu2.scalelab.redhat.com -e CEPH_USE_RANDOM_NONCE=1 -v /var/run/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960:/var/run/ceph:z -v /var/log/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960:/var/log/ceph:z -v /var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /var/lib/ceph/bd63f262-a2ae-11eb-95e8-bc97e17cf960/selinux:/sys/fs/selinux:ro -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm registry.redhat.io/rhceph-beta/rhceph-5-rhel8@sha256:24c617082680ef85c43c6e2c4fe462c69805d2f38df83e51f968cec6b1c097a2 lvm zap --destroy /dev/sdr

The only way I found to delete the device tables is using dmsetup:

dmsetup wipe_table <table found in the lsblk output>

For example:

dmsetup wipe_table ceph--fb248ca1--787e--4974--a4d0--cebc7bd699ef-osd--block--6f79a8b0--2342--43b7--8bee--36b316086617

cleans the table on the device (/dev/sdr in this case), and only at that point do we see the device as available in the orchestrator.
Created attachment 1774119 [details] Script to purge ceph clusters created with cephadm
Bug found... I will post the fix upstream.

There is no workaround possible (other than modifying the code in the cephadm script), so for new clusters we need to wait for the fix upstream, or make the modification manually (use a patched cephadm script).

.....................................

Apart from the "lock" problem executing "cephadm ceph-volume lvm" (which I think is the root cause of this bug), I have found a limitation in the number of OSDs that can be started on each host.

Asynchronous non-blocking I/O (fs.aio-max-nr) allows a process to initiate multiple I/O operations simultaneously without having to wait for any of them to complete. This helps boost performance for applications that are able to overlap processing and I/O.

This setting is 65536 by default, which is not enough if you want to have more than 21 OSDs running on the same host.

I have changed the setting on all the hosts with more than 21 OSDs:

# sysctl fs.aio-max-nr
fs.aio-max-nr = 65536

# sysctl -w fs.aio-max-nr=1048576
fs.aio-max-nr = 1048576

In my case, after changing the setting, I also needed to restart the failed OSD services (there were 3 of them failed per host, because it was not possible to start more than 21 OSD services with fs.aio-max-nr = 65536).

IMPORTANT:

Set fs.aio-max-nr = max(65536, 3500 * (number of OSDs you want to have in the node)) before creating OSD services.
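The sizing rule above can be sketched as a small shell helper. The 24-OSDs-per-host value is just an example taken from this cluster's layout; applying the value with sysctl requires root:

```shell
# Compute fs.aio-max-nr = max(65536, 3500 * OSDs-per-host), per the rule above.
osds_per_host=24                      # example: this cluster runs 24 OSDs/host
need=$((3500 * osds_per_host))
if [ "$need" -lt 65536 ]; then
    need=65536                        # never go below the kernel default
fi
echo "fs.aio-max-nr = $need"

# To apply immediately (root):  sysctl -w fs.aio-max-nr=$need
# To persist across reboots, put the same "fs.aio-max-nr = ..." line
# in a file under /etc/sysctl.d/.
```

For 24 OSDs per host this prints `fs.aio-max-nr = 84000`; for 21 or fewer OSDs the default 65536 already suffices, which matches the ~21-OSD ceiling observed above.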
(In reply to Juan Miguel Olmo from comment #17)
> Bug found... i will post the fix to upstream.
> [...]
> SET fs.aio-max-nr = max(65536, 3500 * (number of osds you want to have in
> the node)) before creating OSD services.

Thanks, Juan, great findings! Can you please explain a little bit more about the locking issue? Was it lock contention?

I think in DFG clusters we can, for now, live with manually modifying the cephadm scripts, and then wait for the fix to land in RHCS 5 cephadm builds. Tim, I think we are good with that, right?
Yes, with the modified cephadm we should be good to go, as long as it is preserved. If I compare /usr/sbin/cephadm on the bootstrap node to that on one of the OSD nodes, it's drastically different, although they're both 16.2.0-4.el8cp. I'm not sure which one to save. I'll verify with Juan which one to preserve and see what is required to add the other five OSD nodes so we get to our target of 192 OSDs.
The diagnosis is in the upstream bug:
https://tracker.ceph.com/issues/50526

and the explanation of the fix is in the associated PR:
https://github.com/ceph/ceph/pull/41045
(In reply to Tim Wilkinson from comment #19)
> Yes, with the modified cephadm we should be good to go as long as it is
> preserved. [...] I'll verify with Juan which
> one to preserve and see what is required to add the other five OSD nodes so
> we get to our target 192 OSDs.

Thanks, Tim. I would recommend redeploying the complete cluster with this fix and seeing if the cluster comes back with 192 OSDs up/in, instead of just adding 5 more nodes.
(In reply to Juan Miguel Olmo from comment #20)
> In the upstream bug is the diagnose:
> https://tracker.ceph.com/issues/50526
>
> and in the associated PR is the explanation of the fix:
> https://github.com/ceph/ceph/pull/41045

Thank you for sharing the links, Juan! So verbose output can cause all this :)
I made the fix to the existing env and applied the remaining five OSD nodes via spec file. All OSDs started (it takes a few minutes for "up" to become "in"). I've since purged the env and reran the deployment procedure. I applied the host_spec, made the changes to each of the OSD nodes, and applied the osd_spec file containing all eight OSD nodes. All expected OSDs (8*24) started.

  cluster:
    id:     51ff166a-a785-11eb-84b1-bc97e17cf960
    health: HEALTH_WARN
            3 stray daemon(s) not managed by cephadm

  services:
    mon: 3 daemons, quorum f18-h14-000-r640.rdu2.scalelab.redhat.com,f18-h09-000-r640,f18-h10-000-r640 (age 33m)
    mgr: f18-h14-000-r640.rdu2.scalelab.redhat.com.mdgfsr(active, since 42m), standbys: f22-h21-000-6048r.yvujdw
    osd: 192 osds: 192 up (since 4m), 192 in (since 9m)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   12 TiB used, 349 TiB / 361 TiB avail
    pgs:     1 active+clean

The OSD node changes:

1.) # sysctl -w fs.aio-max-nr=1048576

2.) # vi /var/lib/ceph/<fsid>/cephadm.<uuid>

# diff <new cephadm.<uuid>> <orig cephadm.<uuid>>
<         out, err, code = call_throws(ctx, c.run_cmd())
---
>         verbosity = CallVerbosity.VERBOSE if ctx.log_output else CallVerbosity.VERBOSE_ON_FAILURE
>         out, err, code = call_throws(ctx, c.run_cmd(), verbosity=verbosity)
(In reply to Tim Wilkinson from comment #23)
> I made the fix to the existing env and applied the remaining five OSD nodes
> via spec file. All OSDs started (it takes a few min for the up to become
> in). I've since purged the env and reran the deployment procedure. I applied
> the host_spec, made the changes to each of the OSD nodes, and applied the
> osd_spec file containing all eight OSD nodes. All expected OSDs (8*24)
> started.
> [...]
>     osd: 192 osds: 192 up (since 4m), 192 in (since 9m)

Woot! Amazing work, guys!
Additional info:
---------------
Podman version 2.2.1
Python version 3.6.8
Tim, can you also provide: > conmon --version
conmon version 2.0.22
@tserling: I will try to speed up the merge to Pacific as much as possible, and will do the cherry-pick.
@twilkins: The test upgrading the conmon package was to see if we could find a root cause for the lock of console output in the containerized ceph-volume command. But the fix we proposed was tested and verified to work, so I think we can leave this bug in "verified", and it is probably good to open a new bug pointing directly at the real problem we have detected, which is not directly related to cephadm.

I propose to use:

Containerized ceph-volume command output blocked
=================================================

Version-Release number:
kernel-4.18.0-193.el8.x86_64
OS Version: RHEL 8
Podman version 2.2.1
Python version 3.6.8
Conmon version: 2.0.22

How to reproduce (only seen on hosts with a certain number of devices, for example 39):

# cephadm ceph-volume inventory

This launches a Ceph container to execute the command. The result is that the console output freezes and we can see a file lock:

# lslocks | grep cephad
python3 1463459 FLOCK 0B WRITE 0 0 0 /run/cephadm/9fa2b396-adb5-11eb-a2d3-bc97e17cf960.lock
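As an illustrative sketch of that FLOCK entry (not the cephadm code itself): a second attempt to take an exclusive flock on the same file fails while the first holder keeps it, which is the same mechanism behind the per-cluster `.lock` file shown above. This uses `flock(1)` from util-linux; the lock path is a hypothetical stand-in:

```shell
lock=/tmp/demo-cephadm.lock
exec 9>"$lock"          # open the lock file on fd 9
flock -x 9              # take an exclusive lock, as cephadm does per cluster

# A second, non-blocking attempt from another process fails while we hold it.
if ! flock -n -x "$lock" -c true 2>/dev/null; then
    echo "lock is held: a second cephadm-style invocation would block here"
fi

exec 9>&-               # close fd 9, releasing the lock
```

Without `-n`, the second invocation would simply hang, which is what a frozen `cephadm ceph-volume inventory` holding the lock looks like from the outside.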
It seems OK. Only one clarification: I think it is worth remarking that ceph-volume runs in a container launched by the cephadm binary.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3294