Description of problem:
[cephadm] 5.0 - Invalid inputs are accepted and services are deployed when using "ceph orch apply mon/mgr" commands to deploy services.

Version-Release number of selected component (if applicable):
Using recent ceph image registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest
ceph version 16.0.0-7953.el8cp (aac7c5c7d5f82d2973c366730f65255afd66e515) pacific (dev)

How reproducible:

Steps to Reproduce:
1. Install a 5.0 cluster
2. Enter the cephadm shell
3. Deploy mon/mgr services, passing invalid inputs
4. Observe the behaviour

Actual results:
Invalid inputs/parameters are accepted without any error/warning messages. Output below for reference:

[ceph: root@magna094 /]# ceph orch apply mgr 12345
Scheduled mgr update...
[ceph: root@magna094 /]# ceph orch apply mon 6789
Scheduled mon update...
[ceph: root@magna094 /]# ceph orch ls
NAME                               RUNNING  REFRESHED  AGE  PLACEMENT                  IMAGE NAME                                                                                                                     IMAGE ID
alertmanager                       1/1      41s ago    10w  count:1                    docker.io/prom/alertmanager:v0.20.0                                                                                            0881eb8f169f
crash                              9/9      44s ago    10w  *                          mix                                                                                                                            dd0a3c51082c
grafana                            1/1      41s ago    10w  count:1                    docker.io/ceph/ceph-grafana:6.6.2                                                                                              a0dce381714a
iscsi.iscsi                        1/1      41s ago    4d   magna094;count:1           registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:4b985089d14513ccab29c42e1531bfcb2e98a614c497726153800d72a2ac11f0   dd0a3c51082c
mds.test                           3/3      44s ago    4d   count:3                    registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c
mgr                                2/12345  43s ago    12s  count:12345                mix                                                                                                                            dd0a3c51082c
mon                                5/6789   44s ago    4s   count:6789                 mix                                                                                                                            dd0a3c51082c
nfs.foo                            1/1      43s ago    4d   count:1                    registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c
node-exporter                      9/9      44s ago    10w  *                          docker.io/prom/node-exporter:v0.18.1                                                                                           e5a616e4b9cf
osd.None                           7/0      44s ago    -    <unmanaged>                mix                                                                                                                            dd0a3c51082c
osd.all-available-devices          16/20    44s ago    3w   <unmanaged>                registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c
osd.dashboard-admin-1605876982239  4/4      44s ago    4w   *                          registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c
prometheus                         1/1      41s ago    10w  count:1                    docker.io/prom/prometheus:v2.18.1                                                                                              de242295e225
rgw.myorg.us-east-1                2/2      43s ago    7w   magna092;magna093;count:2  registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.0-rhel-8-containers-candidate-93648-20201117204824                dd0a3c51082c
rgw.test_realm.test_zone           0/2      -          -    count:2                    <unknown>                                                                                                                      <unknown>

[ceph: root@magna094 /]# ceph -s
  cluster:
    id:     c97c2c8c-0942-11eb-ae18-002590fbecb6
    health: HEALTH_ERR
            Module 'diskprediction_local' has failed: No module named 'sklearn'
            3 daemons have recently crashed

  services:
    mon: 5 daemons, quorum magna094,magna067,magna073,magna093,magna077 (age 30m)
    mgr: magna067.cudixx(active, since 4w), standbys: magna094.hussmr
    mds: test:1 {0=test.magna076.xymdrn=up:active} 2 up:standby
    osd: 27 osds: 27 up (since 3w), 27 in (since 4w)
    rgw: 2 daemons active (myorg.us-east-1.magna092.bxiihn, myorg.us-east-1.magna093.nhekwk)

  data:
    pools:   21 pools, 617 pgs
    objects: 452 objects, 427 KiB
    usage:   10 GiB used, 25 TiB / 25 TiB avail
    pgs:     617 active+clean

  io:
    client:   937 B/s rd, 0 op/s rd, 0 op/s wr

[ceph: root@magna094 /]#

Expected results:

Additional info:
In the case of the apply command, the first positional arg is the service type and the second is the placement (there are more, but those are the only ones relevant here). So in a command like "ceph orch apply mgr 12345", "mgr" is the service we are applying and "12345" is the placement. Integers are considered valid placements; in this case it's saying to put a mgr on up to 12345 hosts. In a typical use case you would do something like "ceph orch apply mgr 3" to put down 3 mgr daemons when you don't care which hosts they're on. So the command is accepted and runs like normal.

Do you have anything in mind when you say it should provide an error/warning message here? Technically both args provided, "mgr" and "12345", are valid for the service type and placement, so it doesn't make sense to generate an error saying there were invalid arguments. Maybe we should output back to the user how each arg in the apply command is being used? For example, in this case, add something to the output saying that "mgr" is being used as the service type and "12345" as the placement, to try to avoid confusion?
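To make the parsing concrete, here is a minimal sketch of how a bare integer placement argument becomes a host count. This is not the actual cephadm source; "PlacementSketch" is a made-up name used only for illustration:

  # Simplified sketch of the placement parsing described above; not the
  # real ceph code, just an illustration of the behaviour.
  from dataclasses import dataclass, field
  from typing import List, Optional

  @dataclass
  class PlacementSketch:
      count: Optional[int] = None    # "place on up to N hosts"
      hosts: List[str] = field(default_factory=list)

      @classmethod
      def from_string(cls, arg: str) -> "PlacementSketch":
          if arg.isdigit():
              # A bare integer is a valid placement, so "12345" is happily
              # interpreted as count=12345 with no sanity check on size.
              return cls(count=int(arg))
          # Otherwise treat the arg as a semicolon-separated host list,
          # e.g. "magna092;magna093".
          return cls(hosts=arg.split(";"))

  print(PlacementSketch.from_string("12345"))              # count=12345
  print(PlacementSketch.from_string("magna092;magna093"))  # host list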
We have a PR to improve the output of the "apply" command so that it shows the placement: https://github.com/ceph/ceph/pull/38689

But in the case of mon and mgr deployments, wouldn't it be nice to limit the number of mons or mgrs to min(5, number of nodes in the cluster)? See the sketch below.
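For illustration, the suggested cap could look something like this (a hedged sketch; "clamp_mon_mgr_count" is a hypothetical helper, not existing ceph code):

  MAX_MON_MGR = 5  # the ceiling suggested above

  def clamp_mon_mgr_count(requested: int, num_cluster_nodes: int) -> int:
      """Limit a mon/mgr count to min(5, number of nodes in the cluster)."""
      return min(requested, MAX_MON_MGR, num_cluster_nodes)

  print(clamp_mon_mgr_count(6789, 9))  # -> 5 on a 9-node cluster
  print(clamp_mon_mgr_count(3, 9))     # -> 3; sensible requests pass through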
@Adam,

(In reply to Adam King from comment #1)
> In the case of the apply command, the first positional arg is the
> service type and the second is the placement (there are more, but
> those are the only ones relevant here). So in a command like
> "ceph orch apply mgr 12345", "mgr" is the service we are applying
> and "12345" is the placement. Integers are considered valid
> placements; in this case it's saying to put a mgr on up to 12345
> hosts. In a typical use case you would do something like
> "ceph orch apply mgr 3" to put down 3 mgr daemons when you don't
> care which hosts they're on. So the command is accepted and runs
> like normal.
>
> Do you have anything in mind when you say it should provide an
> error/warning message here? Technically both args provided, "mgr"
> and "12345", are valid for the service type and placement, so it
> doesn't make sense to generate an error saying there were invalid
> arguments. Maybe we should output back to the user how each arg in
> the apply command is being used? For example, in this case, add
> something to the output saying that "mgr" is being used as the
> service type and "12345" as the placement, to try to avoid
> confusion?

We should throw an error/warning message saying that we cannot have 12345 hosts on which to place the mgrs/mons. Even though integers are valid placements, there should be a limit on the values that can be passed, because the cluster cannot possibly have 12345 hosts for the mons/mgrs.
For placing a max on the placement count, which should fix this:

tracker: https://tracker.ceph.com/issues/49960
PR: https://github.com/ceph/ceph/pull/40376

What exactly the max should be is still being debated, so those numbers aren't set yet, but we've agreed upstream that a max should exist. Conceptually the check would look like the sketch below.
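A rough sketch of the idea, not the actual change in that PR ("MAX_COUNT" is a placeholder since the real limit was still being decided, and "validate_placement_count" is a hypothetical name):

  MAX_COUNT = 1000  # placeholder value only; the real max was undecided

  def validate_placement_count(count: int) -> None:
      # Reject implausibly large counts up front instead of scheduling them.
      if count > MAX_COUNT:
          raise ValueError(
              f"placement count {count} exceeds the maximum allowed "
              f"count of {MAX_COUNT}"
          )

  validate_placement_count(12345)  # raises ValueError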
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294