Describe the issue:
The procedure for restoring ceph-monitor quorum is not correct. The bad mons cannot be deleted from the monmap because of a permission issue.

Describe the task you were trying to accomplish:

Test Procedure:

1. Stop 2 worker nodes.

oviner:auth$ oc get nodes
NAME              STATUS     ROLES                  AGE    VERSION
compute-0         NotReady   worker                 3d     v1.27.4+deb2c60
compute-1         NotReady   worker                 3d     v1.27.4+deb2c60
compute-2         Ready      worker                 3d     v1.27.4+deb2c60
control-plane-0   Ready      control-plane,master   3d1h   v1.27.4+deb2c60
control-plane-1   Ready      control-plane,master   3d1h   v1.27.4+deb2c60
control-plane-2   Ready      control-plane,master   3d1h   v1.27.4+deb2c60

oviner:auth$ oc get pods -l app=rook-ceph-mon
NAME                               READY   STATUS        RESTARTS      AGE
rook-ceph-mon-a-576dc56947-l2cqx   0/2     Pending       0             20h
rook-ceph-mon-b-569d6c5877-fvxf2   2/2     Terminating   0             21h
rook-ceph-mon-b-569d6c5877-hclhg   0/2     Pending       0             20h
rook-ceph-mon-c-6646b847ff-r9m4j   2/2     Running       1 (12h ago)   3d

2. Stop the rook-ceph-operator so that the mons are not failed over while you are modifying the monmap.

$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0
deployment.apps/rook-ceph-operator scaled

3. Open the YAML file and copy the command and arguments from the mon container.

$ oc -n openshift-storage get deployment rook-ceph-mon-c -o yaml > rook-ceph-mon-c-deployment.yaml

4. Clean up the copied command and args fields to form a pastable command as follows:

ceph-mon \
  --fsid=8b24e1e2-00f9-4d81-a721-4ee4095fba99 \
  --keyring=/etc/ceph/keyring-store/keyring \
  --default-log-to-stderr=true \
  --default-err-to-stderr=true \
  --default-mon-cluster-log-to-stderr=true \
  --default-log-stderr-prefix=debug \
  --default-log-to-file=false \
  --default-mon-cluster-log-to-file=false \
  --mon-host=$(ROOK_CEPH_MON_HOST) \
  --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS) \
  --id=c \
  --setuser=ceph \
  --setgroup=ceph \
  --foreground \
  --public-addr=172.30.53.157 \
  --setuser-match-path=/var/lib/ceph/mon/ceph-c/store.db \
  --public-bind-addr=$(ROOK_POD_IP) \
  --extract-monmap=${monmap_path}

5. Patch the rook-ceph-mon-c Deployment to stop this mon from running without deleting the mon pod.

$ oc -n openshift-storage patch deployment rook-ceph-mon-c --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'
$ oc -n openshift-storage patch deployment rook-ceph-mon-c -p '{"spec": {"template": {"spec": {"containers": [{"name": "mon", "command": ["sleep", "infinity"], "args": []}]}}}}'

6. Connect to the pod of a healthy mon [mon-c]:

$ oc -n openshift-storage exec -it rook-ceph-mon-c-765cbb446f-4xgzw bash
[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmap_path=/tmp/monmap

7. Review the contents of the monmap.

[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 3
fsid 8b24e1e2-00f9-4d81-a721-4ee4095fba99
last_changed 2023-08-21T10:15:51.349720+0000
created 2023-08-21T10:13:54.902037+0000
min_mon_release 17 (quincy)
election_strategy: 1
0: v2:172.30.122.31:3300/0 mon.a
1: v2:172.30.85.192:3300/0 mon.b
2: v2:172.30.53.157:3300/0 mon.c

8. Remove the bad mons from the monmap. [Failed]

[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmaptool ${monmap_path} --rm a
monmaptool: monmap file /tmp/monmap
monmaptool: removing a
monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
bufferlist::write_file(/tmp/monmap): failed to open file: (13) Permission denied
monmaptool: error writing to '/tmp/monmap': (13) Permission denied

Suggestions for improvement:
We need to find the correct procedure for restoring ceph-monitor quorum.

Document URL:
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.13/html/troubleshooting_openshift_data_foundation/restoring-ceph-monitor-quorum-in-openshift-data-foundation_rhodf#doc-wrapper

Chapter/Section Number and Title:
Chapter 12. Restoring ceph-monitor quorum in OpenShift Data Foundation

Product Version:
ODF Version: odf-operator.v4.14.0-111.stable
OCP Version: 4.14.0-0.nightly-2023-08-11-055332
Platform: Vsphere

Environment Details:

Any other versions of this document that also need this update:

Additional information:
For more info: https://docs.google.com/document/d/1Xu6L4ibi-0PWD9Y8ezeXRQ-TsHRPtnHH-eRaw0pDRec/edit
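
Steps 7 and 8 rely on reading the mon names out of the `monmaptool --print` output by eye. As a rough sketch only (the function names `list_mons` and `mons_to_remove` are hypothetical and not part of the documented procedure), the member lines shown in step 7 could be parsed like this, assuming they always have the form `<rank>: <address> mon.<name>`:

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the documented procedure.

# Read `monmaptool --print` output on stdin and print one mon name per line.
# Assumes member lines look like "0: v2:172.30.122.31:3300/0 mon.a".
list_mons() {
  awk '/^[0-9]+: / && $NF ~ /^mon\./ { sub(/^mon\./, "", $NF); print $NF }'
}

# Print every mon name found on stdin except the healthy one passed as $1.
mons_to_remove() {
  list_mons | awk -v keep="$1" '$0 != keep'
}
```

For the monmap printed in step 7, `monmaptool --print "${monmap_path}" | mons_to_remove c` would be expected to list `a` and `b`, the two mons this report tries to remove. This only identifies the candidates; it does not address the permission error in step 8.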
Hit the same issue on the RDR environment when trying to recover the mons that were out of quorum.

Permission error
----------------
monmaptool ${monmap_path} --rm b
monmaptool: monmap file /tmp/monmap
monmaptool: removing b
monmaptool: writing epoch 5 to /tmp/monmap (2 monitors)
bufferlist::write_file(/tmp/monmap): failed to open file: (13) Permission denied
monmaptool: error writing to '/tmp/monmap': (13) Permission denied

Followed the below URL:
-----------------------
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.13/html/troubleshooting_openshift_data_foundation/restoring-ceph-monitor-quorum-in-openshift-data-foundation_rhodf

Product version:
ODF - 4.13-219
OCP - 4.13.0-0.nightly-2023-08-11-101506
Platform: Vsphere
Hi Anjana,

This bug needs to be fixed in the docs for 4.13, 4.14, and 4.15. Can you please prioritize this BZ and provide a fix soon?
I was doing some testing, and managed to get it to work:

sh-5.1# monmaptool --rm a /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: removing a
monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
bufferlist::write_file(/tmp/monmap): failed to open file: (13) Permission denied
monmaptool: error writing to '/tmp/monmap': (13) Permission denied

sh-5.1# monmaptool --print monmap
monmaptool: monmap file monmap
epoch 3
fsid 32deba18-6e88-4f41-8401-dede5787e344
last_changed 2024-09-02T02:22:52.521379+0000
created 2024-09-02T02:22:19.434678+0000
min_mon_release 18 (reef)
election_strategy: 1
0: v2:172.30.195.83:3300/0 mon.a
1: v2:172.30.163.106:3300/0 mon.b
2: v2:172.30.193.96:3300/0 mon.c

sh-5.1# monmaptool --add d 1.1.1.1:567 monmap
monmaptool: monmap file monmap
monmaptool: writing epoch 3 to monmap (4 monitors)

sh-5.1# monmaptool --print monmap
monmaptool: monmap file monmap
epoch 3
fsid 32deba18-6e88-4f41-8401-dede5787e344
last_changed 2024-09-02T02:22:52.521379+0000
created 2024-09-02T02:22:19.434678+0000
min_mon_release 18 (reef)
election_strategy: 1
0: v2:172.30.195.83:3300/0 mon.a
1: v2:172.30.163.106:3300/0 mon.b
2: v2:172.30.193.96:3300/0 mon.c
3: v2:1.1.1.1:567/0 mon.d

sh-5.1# monmaptool rm a monmap
monmaptool: too many arguments
monmaptool -h for usage

sh-5.1# monmaptool --rm a monmap
monmaptool: monmap file monmap
monmaptool: removing a
monmaptool: writing epoch 3 to monmap (3 monitors)

I don't know why, but as soon as I added to the mon map, I was able to remove from the mon map... seems like a bug to me?
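
One difference visible in the session above is the file being written: the failing command targets /tmp/monmap, while the commands that succeed target a file named monmap in the current directory. Whether that explains the behavior is not established in this thread. As a sketch under that assumption, the monmap could be copied to a file the current user owns before editing; `writable_copy` is a hypothetical helper name, and it assumes `mktemp -d` can create a directory in the pod:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the documented procedure.

# Copy a monmap into a freshly created private directory so later edits
# write to a file owned by the current user. Prints the path of the copy.
writable_copy() {
  local src="$1" dir
  dir="$(mktemp -d)" || return 1
  cp -- "$src" "$dir/monmap" || return 1
  # cp carries over the source mode, so make sure the owner can write.
  chmod u+w "$dir/monmap" || return 1
  printf '%s\n' "$dir/monmap"
}

# Assumed usage inside the mon pod (not verified in this thread):
#   work="$(writable_copy /tmp/monmap)"
#   monmaptool --rm a "$work"
#   monmaptool --print "$work"
```

If the copy can be edited, the modified file would still have to be the one injected back into the mon in the later steps of the procedure, which this thread does not cover.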