Bug 2235311 - [DOC] Restoring ceph-monitor quorum procedure: the bad mons cannot be deleted from the monmap because of a permission issue [NEEDINFO]
Summary: [DOC] Restoring ceph-monitor quorum procedure: the bad mons cannot be deleted from the monmap because of a permission issue
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: documentation
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Anjana Suparna Sriram
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-28 11:48 UTC by Oded
Modified: 2024-09-03 00:50 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
hnallurv: needinfo? (asriram)
kelwhite: needinfo? (asriram)



Description Oded 2023-08-28 11:48:27 UTC
Describe the issue:
The documented procedure for restoring ceph-monitor quorum is not correct:
the bad mons cannot be deleted from the monmap because of a permission issue.

Describe the task you were trying to accomplish:
Test Procedure:
1. Stop two worker nodes:
oviner:auth$ oc get nodes
NAME              STATUS     ROLES                  AGE    VERSION
compute-0         NotReady   worker                 3d     v1.27.4+deb2c60
compute-1         NotReady   worker                 3d     v1.27.4+deb2c60
compute-2         Ready      worker                 3d     v1.27.4+deb2c60
control-plane-0   Ready      control-plane,master   3d1h   v1.27.4+deb2c60
control-plane-1   Ready      control-plane,master   3d1h   v1.27.4+deb2c60
control-plane-2   Ready      control-plane,master   3d1h   v1.27.4+deb2c60

oviner:auth$ oc get pods -l app=rook-ceph-mon
NAME                               READY   STATUS        RESTARTS      AGE
rook-ceph-mon-a-576dc56947-l2cqx   0/2     Pending       0             20h
rook-ceph-mon-b-569d6c5877-fvxf2   2/2     Terminating   0             21h
rook-ceph-mon-b-569d6c5877-hclhg   0/2     Pending       0             20h
rook-ceph-mon-c-6646b847ff-r9m4j   2/2     Running       1 (12h ago)   3d
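
(With compute-0 and compute-1 down, mon-a and mon-b are stuck Pending and only mon-c is still Running, i.e. two of the three mons are out of quorum, which is exactly the state this restore procedure targets.)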


2. Stop the rook-ceph-operator so that the mons are not failed over while you are modifying the monmap:
$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0
deployment.apps/rook-ceph-operator scaled
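
Note: once recovery is complete, the operator must be scaled back up so it resumes managing the mons; the reverse command, shown here for completeness:

$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=1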


3. Save the mon deployment to a YAML file and copy the command and arguments from the mon container:
$ oc -n openshift-storage get deployment rook-ceph-mon-c -o yaml > rook-ceph-mon-c-deployment.yaml
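
A minimal alternative if you only want those two fields rather than the whole YAML; this sketch assumes the mon container is the first container in the pod spec:

$ oc -n openshift-storage get deployment rook-ceph-mon-c \
    -o jsonpath='{.spec.template.spec.containers[0].command}{"\n"}{.spec.template.spec.containers[0].args}{"\n"}'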


4. Clean up the copied command and args fields to form a pastable command, as follows:
ceph-mon \
        --fsid=8b24e1e2-00f9-4d81-a721-4ee4095fba99 \
        --keyring=/etc/ceph/keyring-store/keyring \
        --default-log-to-stderr=true \
        --default-err-to-stderr=true \
        --default-mon-cluster-log-to-stderr=true \
        --default-log-stderr-prefix=debug \
        --default-log-to-file=false \
        --default-mon-cluster-log-to-file=false \
        --mon-host=$(ROOK_CEPH_MON_HOST) \
        --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS) \
        --id=c \
        --setuser=ceph \
        --setgroup=ceph \
        --foreground \
        --public-addr=172.30.53.157 \
        --setuser-match-path=/var/lib/ceph/mon/ceph-c/store.db \
        --public-bind-addr=$(ROOK_POD_IP) \
        --extract-monmap=${monmap_path}
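
(The --extract-monmap=${monmap_path} flag at the end is the one argument added to the original command line; it tells ceph-mon to write its current monmap to that path and exit instead of starting the daemon.)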

5. Patch the rook-ceph-mon-c Deployment to stop this mon from running, without deleting the mon pod:
$ oc -n openshift-storage patch deployment rook-ceph-mon-c  --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'
$ oc -n openshift-storage patch deployment rook-ceph-mon-c -p '{"spec": {"template": {"spec": {"containers": [{"name": "mon", "command": ["sleep", "infinity"], "args": []}]}}}}'
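
To confirm the patch took effect before exec'ing in, the mon pods can be re-listed; a sanity check, not part of the documented procedure:

$ oc -n openshift-storage get pods -l app=rook-ceph-mon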

6. Connect to the pod of the healthy mon [mon-c]:
$ oc -n openshift-storage exec -it rook-ceph-mon-c-765cbb446f-4xgzw -- bash
[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]#  monmap_path=/tmp/monmap
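
At this point the procedure pastes the cleaned-up ceph-mon command from step 4, ending in --extract-monmap=${monmap_path}, to dump the current monmap; abbreviated here, with the full argument list as in step 4:

[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# ceph-mon \
        --fsid=8b24e1e2-00f9-4d81-a721-4ee4095fba99 \
        --keyring=/etc/ceph/keyring-store/keyring \
        ... \
        --extract-monmap=${monmap_path}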

7. Review the contents of the monmap:
[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 3
fsid 8b24e1e2-00f9-4d81-a721-4ee4095fba99
last_changed 2023-08-21T10:15:51.349720+0000
created 2023-08-21T10:13:54.902037+0000
min_mon_release 17 (quincy)
election_strategy: 1
0: v2:172.30.122.31:3300/0 mon.a
1: v2:172.30.85.192:3300/0 mon.b
2: v2:172.30.53.157:3300/0 mon.c
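
(Note that the extracted map still lists all three mons, including the dead mon.a and mon.b; those are the entries the next step tries to remove.)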

8. Remove the bad mons from the monmap [failed]:
[root@rook-ceph-mon-c-765cbb446f-4xgzw ceph]# monmaptool ${monmap_path} --rm a
monmaptool: monmap file /tmp/monmap
monmaptool: removing a
monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
bufferlist::write_file(/tmp/monmap): failed to open file: (13) Permission denied
monmaptool: error writing to '/tmp/monmap': (13) Permission denied
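
A possible workaround, offered as an assumption rather than a verified fix: the extracted /tmp/monmap appears to be owned by a different uid than the exec shell, so monmaptool cannot rewrite it in place. Copying the map to a fresh file owned by the current user and editing the copy should avoid the EACCES (the /tmp/monmap.copy name is illustrative):

# cp ${monmap_path} /tmp/monmap.copy        # the copy is owned by the current user
# monmaptool /tmp/monmap.copy --rm a        # edit the writable copy instead
# monmap_path=/tmp/monmap.copy              # point later steps (e.g. --inject-monmap) at the copy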


Suggestions for improvement:
The documentation needs a correct, working procedure for restoring ceph-monitor quorum; the current one fails at step 8 above.

Document URL:
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.13/html/troubleshooting_openshift_data_foundation/restoring-ceph-monitor-quorum-in-openshift-data-foundation_rhodf#doc-wrapper

Chapter/Section Number and Title:
Chapter 12. Restoring ceph-monitor quorum in OpenShift Data Foundation

Product Version:
ODF Version: odf-operator.v4.14.0-111.stable
OCP Version: 4.14.0-0.nightly-2023-08-11-055332
Platform: vSphere

Environment Details:

Any other versions of this document that also needs this update:


Additional information:
For more info, see:
https://docs.google.com/document/d/1Xu6L4ibi-0PWD9Y8ezeXRQ-TsHRPtnHH-eRaw0pDRec/edit

Comment 2 kmanohar 2023-08-28 12:00:42 UTC
Hit the same issue on an RDR environment when trying to recover the mons that were out of quorum.

Permission error
----------------

monmaptool ${monmap_path} --rm b
monmaptool: monmap file /tmp/monmap
monmaptool: removing b
monmaptool: writing epoch 5 to /tmp/monmap (2 monitors)
bufferlist::write_file(/tmp/monmap): failed to open file: (13) Permission denied
monmaptool: error writing to '/tmp/monmap': (13) Permission denied

Followed the below URL:
-----------------------
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.13/html/troubleshooting_openshift_data_foundation/restoring-ceph-monitor-quorum-in-openshift-data-foundation_rhodf

Product version:
ODF: 4.13-219
OCP: 4.13.0-0.nightly-2023-08-11-101506
Platform: vSphere

Comment 3 Harish NV Rao 2024-03-05 05:49:04 UTC
Hi Anjana, this bug needs to be fixed in the docs for 4.13, 4.14, and 4.15. Can you please prioritize this BZ and provide a fix soon?

Comment 6 kelwhite 2024-09-03 00:50:45 UTC
I was doing some testing, and managed to get it to work:

sh-5.1# monmaptool --rm a /tmp/monmap 
monmaptool: monmap file /tmp/monmap
monmaptool: removing a
monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
bufferlist::write_file(/tmp/monmap): failed to open file: (13) Permission denied
monmaptool: error writing to '/tmp/monmap': (13) Permission denied

sh-5.1#  monmaptool --print  monmap 
monmaptool: monmap file monmap
epoch 3
fsid 32deba18-6e88-4f41-8401-dede5787e344
last_changed 2024-09-02T02:22:52.521379+0000
created 2024-09-02T02:22:19.434678+0000
min_mon_release 18 (reef)
election_strategy: 1
0: v2:172.30.195.83:3300/0 mon.a
1: v2:172.30.163.106:3300/0 mon.b
2: v2:172.30.193.96:3300/0 mon.c

sh-5.1# monmaptool --add d 1.1.1.1:567 monmap 
monmaptool: monmap file monmap
monmaptool: writing epoch 3 to monmap (4 monitors)

sh-5.1#  monmaptool --print  monmap 
monmaptool: monmap file monmap
epoch 3
fsid 32deba18-6e88-4f41-8401-dede5787e344
last_changed 2024-09-02T02:22:52.521379+0000
created 2024-09-02T02:22:19.434678+0000
min_mon_release 18 (reef)
election_strategy: 1
0: v2:172.30.195.83:3300/0 mon.a
1: v2:172.30.163.106:3300/0 mon.b
2: v2:172.30.193.96:3300/0 mon.c
3: v2:1.1.1.1:567/0 mon.d

sh-5.1# monmaptool rm a monmap 
monmaptool: too many arguments
monmaptool -h for usage

sh-5.1# monmaptool --rm a monmap 
monmaptool: monmap file monmap
monmaptool: removing a
monmaptool: writing epoch 3 to monmap (3 monitors)

I don't know why, but as soon as I added to the mon map, I was able to remove from the mon map... seems like a bug to me?
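
One detail that may explain it: the --rm that failed targeted /tmp/monmap, while the later --add and --rm both operate on a monmap file in the working directory, so the writable path, rather than the preceding --add, may be the real variable. A quick way to separate the two (hypothetical follow-up commands, same session assumed):

sh-5.1# monmaptool --rm b /tmp/monmap    # expected to still fail if the path is the problem
sh-5.1# monmaptool --rm b monmap         # expected to succeed with no prior --add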

