Bug 1969383
| Summary: | [RHCS-baremetal] issue with Recovering the Ceph Monitor Store on 4.x | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | skanta | ||||
| Component: | RADOS | Assignee: | skanta | ||||
| Status: | CLOSED NOTABUG | QA Contact: | skanta | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 4.2 | CC: | akupczyk, assingh, bhubbard, ceph-eng-bugs, gsitlani, hfukumot, kjosy, nojha, rzarzyns, sseshasa, vereddy, vumrao | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 5.1 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-10-19 19:07:55 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 1973033, 1995854, 1995859 | ||||||
| Bug Blocks: | |||||||
| Attachments: | | | | | | | |
Document Reference: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html-single/troubleshooting_guide/index?lb_target=production#recovering-the-ceph-monitor-store

Performed the steps as mentioned in the doc and faced an issue at Step 2 of the "4.7.1. Recovering the Ceph Monitor store when using BlueStore" chapter: the parameter "$osd_nodes" is never initialized or exported in the script.

Error snippet:
---------------
[root@mon ~]# for host in $osd_nodes; do
echo "$host"
rsync -avz $ms $host:$ms
rsync -avz $db $host:$db
rsync -avz $db_slow $host:$db_slow
-----------------------------------------------

As a workaround, I executed the script with "for host in ceph-bharath-1623199697633-node3-mon-osd ceph-bharath-1623199697633-node4-osd-client ceph-bharath-1623199697633-node5-osd ceph-bharath-1623199697633-node6-osd ceph-bharath-1623199697633-node7-osd". Execution still failed with error messages; for more details, please check the attached log file.

Created attachment 1789773 [details]
Log file
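For reference, the missing initialization is a one-liner that the doc's script needs before its loop. A minimal sketch (the hostnames "nodeA nodeB" are placeholders for illustration, not real nodes):

```shell
# Define and export the OSD host list that the doc's loop iterates over.
# "nodeA nodeB" are placeholder hostnames (assumption for illustration).
osd_nodes="nodeA nodeB"
export osd_nodes

for host in $osd_nodes; do
    echo "would rsync mon store data to $host"
    # rsync -avz "$ms" "$host:$ms"   # as in the doc, once $ms/$db are set
done
```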
Performed the following steps to recover the MON-
PROCEDURE
---------
Perform the steps as a root user on the installer node
ssh-keygen
copy ssh key to all OSD nodes
ssh-copy-id root@<osd nodes>
cd /root/
ms=/tmp/monstore/
mkdir $ms
Create a script file with the following code-
NOTE: I included a sleep command after stopping and starting the service. This needs to be modified: the script should check that the service has actually stopped or started before proceeding with the further steps.
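One way to replace the fixed sleep is a small polling helper; this is a sketch of my own, not from the official doc (the retry count and the systemctl check in the usage comment are assumptions):

```shell
# Poll a command until it succeeds, retrying up to MAX_TRIES times, one
# second apart. Returns non-zero if the command never succeeds.
wait_until() {
    max_tries=$1; shift
    i=0
    while ! "$@"; do
        i=$((i + 1))
        [ "$i" -lt "$max_tries" ] || return 1
        sleep 1
    done
}

# Intended use inside the recovery script (hypothetical OSD id 3):
# systemctl stop ceph-osd@3.service
# wait_until 60 sh -c '! systemctl is-active --quiet ceph-osd@3.service'
```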
vi recover.sh
#!/bin/bash -x
ms=/tmp/monstore/
rm -rf $ms
mkdir $ms
for host in ceph-bharath-1623727358372-node3-mon-osd ceph-bharath-1623727358372-node4-osd-client ceph-bharath-1623727358372-node5-osd ceph-bharath-1623727358372-node6-osd ceph-bharath-1623727358372-node7-osd;
do
echo "The Host name is :$host"
ssh -l root $host "rm -rf $ms"
ssh -l root $host "mkdir $ms"
scp -r $ms"store.db" $host:$ms
rm -rf $ms
mkdir $ms
#ssh -l root $host "mkdir $ms"
ssh -t root@$host <<'EOT'
ms=/tmp/monstore
for osd in /var/lib/ceph/osd/ceph-*;
do
IN=$osd
arrIN=(${IN//-/ })
systemctl stop ceph-osd@${arrIN[1]}.service
sleep 5
echo "ceph-objectstore-tool --type bluestore --data-path $osd --op update-mon-db --no-mon-config --mon-store-path $ms"
ceph-objectstore-tool --type bluestore --data-path $osd --op update-mon-db --no-mon-config --mon-store-path $ms
systemctl start ceph-osd@${arrIN[1]}.service
sleep 5
echo "Pulling data finished in OSD-${arrIN[1]}"
done
EOT
scp -r $host:$ms"*" $ms
echo "Finished Pulling data: $host"
done
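The `arrIN=(${IN//-/ })` split in the script works because the OSD data path contains exactly one dash before the id; an equivalent and simpler idiom strips everything up to the last `-`. A small sketch using an example path:

```shell
# Extract the numeric OSD id from a data path such as /var/lib/ceph/osd/ceph-3.
osd=/var/lib/ceph/osd/ceph-3   # example path (placeholder id)
osd_id=${osd##*-}              # strip everything through the last '-'
echo "$osd_id"
```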
Execute ./recover.sh
Create a file with all keyrings (/tmp/all.keyring) and use this file in the rebuild step below.
[root@MON]# ceph-authtool /etc/ceph/ceph.client.admin.keyring -n mon. --cap mon 'allow *' --gen-key
After generating the key, append all the existing keyrings to the /etc/ceph/ceph.client.admin.keyring file.
Example:-
[root@ceph-bharath-1623839999591-node1-mon-mgr-installer /]# find / -name *.keyring
/etc/ceph/ceph.client.admin.keyring
/etc/ceph/ceph.mgr.ceph-bharath-1623839999591-node1-mon-mgr-installer.keyring
/etc/ceph/ceph.client.crash.keyring
/var/lib/ceph/bootstrap-mds/ceph.keyring
/var/lib/ceph/bootstrap-mgr/ceph.keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring
/var/lib/ceph/bootstrap-rbd/ceph.keyring
/var/lib/ceph/bootstrap-rbd-mirror/ceph.keyring
/var/lib/ceph/bootstrap-rgw/ceph.keyring
/var/lib/ceph/tmp/ceph.mon..keyring
[root@ceph-bharath-1623839999591-node1-mon-mgr-installer /]#
[root@ceph-bharath-1623839999591-node1-mon-mgr-installer /]# cat /etc/ceph/ceph.mgr.ceph-bharath-1623839999591-node1-mon-mgr-installer.keyring /etc/ceph/ceph.client.crash.keyring /var/lib/ceph/bootstrap-mds/ceph.keyring /var/lib/ceph/bootstrap-mgr/ceph.keyring /var/lib/ceph/bootstrap-osd/ceph.keyring /var/lib/ceph/bootstrap-rbd/ceph.keyring /var/lib/ceph/bootstrap-rbd-mirror/ceph.keyring /var/lib/ceph/bootstrap-rgw/ceph.keyring /var/lib/ceph/tmp/ceph.mon..keyring >>/etc/ceph/ceph.client.admin.keyring
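The long manual `cat` above can be scripted. A hedged sketch (the function name and the find-based collection are my own, not from the guide):

```shell
# Append every *.keyring found under the given directories to the admin
# keyring, skipping the admin keyring itself. Sketch only.
append_keyrings() {
    admin=$1; shift
    find "$@" -name '*.keyring' ! -path "$admin" -exec cat {} + >> "$admin"
}

# Intended use on the MON node (paths from the example above):
# append_keyrings /etc/ceph/ceph.client.admin.keyring /etc/ceph /var/lib/ceph
```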
[root@MON]# ceph-monstore-tool /tmp/monstore get monmap -- --out /tmp/monmap
[root@MON]# monmaptool /tmp/monmap --print
Note the "No such file or directory" error message if the monmap is missing.
Example-
[root@ceph-bharath-1623839999591-node1-mon-mgr-installer /]# monmaptool /tmp/monmap --print
monmaptool: monmap file /tmp/monmap
monmaptool: couldn't open /tmp/monmap: (2) No such file or directory
[root@ceph-bharath-1623839999591-node1-mon-mgr-installer /]#
If the monmap is missing, create a new one:
[root@MON]# monmaptool --create --add <mon-id> <mon-a-ip> --enable-all-features --clobber /root/monmap.mon-a --fsid <fsid>
Get the mon-id, mon-a-ip, and fsid details from the /etc/ceph/ceph.conf file.
Example -
[root@ceph-bharath-1623839999591-node1-mon-mgr-installer ceph-ceph-bharath-1623839999591-node1-mon-mgr-installer]# cat /etc/ceph/ceph.conf
[client]
rgw crypt require ssl = False
rgw crypt s3 kms encryption keys = testkey-1=YmluCmJvb3N0CmJvb3N0LWJ1aWxkCmNlcGguY29uZgo= testkey-2=aWIKTWFrZWZpbGUKbWFuCm91dApzcmMKVGVzdGluZwo=
# Please do not change this file directly since it is managed by Ansible and will be overwritten
[global]
cluster network = 10.0.208.0/22
fsid = 345ecf3f-1494-4b35-80cb-1df54355362b
mon host = [v2:10.0.210.146:3300,v1:10.0.210.146:6789],[v2:10.0.209.3:3300,v1:10.0.209.3:6789],[v2:10.0.208.15:3300,v1:10.0.208.15:6789]
mon initial members = ceph-bharath-1623839999591-node1-mon-mgr-installer,ceph-bharath-1623839999591-node2-mon,ceph-bharath-1623839999591-node3-mon-osd
mon_max_pg_per_osd = 1024
osd pool default crush rule = -1
osd_default_pool_size = 2
osd_pool_default_pg_num = 64
osd_pool_default_pgp_num = 64
public network = 10.0.208.0/22
[mon]
mon_allow_pool_delete = True
[root@ceph-bharath-1623839999591-node1-mon-mgr-installer ceph-ceph-bharath-1623839999591-node1-mon-mgr-installer]#
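Pulling values such as the fsid out of ceph.conf can be scripted instead of copied by hand; a small sketch (the helper name `conf_get` is mine, not from the doc):

```shell
# Print the value of a "key = value" entry from a ceph.conf-style file.
# Handles keys containing spaces, e.g. "mon host". Sketch only.
conf_get() {  # conf_get KEY FILE
    sed -n "s/^$1[[:space:]]*=[[:space:]]*//p" "$2" | head -n 1
}

# e.g. fsid=$(conf_get fsid /etc/ceph/ceph.conf)
```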
[root@ceph-bharath-1623839999591-node1-mon-mgr-installer /]# monmaptool --create --addv ceph-bharath-1623839999591-node2-mon [v2:10.0.209.3:3300,v1:10.0.209.3:6789] --addv ceph-bharath-1623839999591-node1-mon-mgr-installer [v2:10.0.210.146:3300,v1:10.0.210.146:6789] --addv ceph-bharath-1623839999591-node3-mon-osd [v2:10.0.208.15:3300,v1:10.0.208.15:6789] --enable-all-features --clobber /root/monmap.mon-a --fsid 345ecf3f-1494-4b35-80cb-1df54355362b
To check the generated monmap
[root@MON]# monmaptool /root/monmap.mon-a --print
Import the monmap:
[root@MON]# ceph-monstore-tool /tmp/monstore rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring --monmap /root/monmap.mon-a
[root@MON]# chown -R ceph:ceph /tmp/monstore
Back up the corrupted store. Repeat this step for all Ceph Monitor nodes:
[root@MON]# mv /var/lib/ceph/mon/ceph-HOSTNAME/store.db /var/lib/ceph/mon/ceph-HOSTNAME/store.db.corrupted
Replace the corrupted store. Repeat this step for all Ceph Monitor nodes:
[root@MON]# scp -r /tmp/monstore/store.db HOSTNAME:/var/lib/ceph/mon/ceph-HOSTNAME/
Replace HOSTNAME with the host name of the Monitor node
Example:
scp -r /tmp/monstore/store.db ceph-bharath-1623839999591-node2-mon:/var/lib/ceph/mon/ceph-ceph-bharath-1623839999591-node2-mon/
Change the owner of the new store. Repeat this step for all Ceph Monitor nodes:
[root@MON]# chown -R ceph:ceph /var/lib/ceph/mon/ceph-HOSTNAME/store.db
Replace HOSTNAME with the host name of the Monitor node
Example :
chown -R ceph:ceph /var/lib/ceph/mon/ceph-ceph-bharath-1623839999591-node3-mon-osd/store.db
Start all the Ceph Monitor daemons:
[root@MON]# systemctl start ceph-mon.target
Example :
[root@ceph-bharath-1623839999591-node2-mon /]# systemctl start ceph-mon.target
The "4.7.1. Recovering the Ceph Monitor store when using BlueStore" procedure in the Troubleshooting Guide needs to be modified.
The script provided in the guide was modified as follows:
#!/bin/bash -x
ms=/tmp/monstore/
rm -rf $ms
mkdir $ms
for host in ceph-bharath-1623727358372-node3-mon-osd ceph-bharath-1623727358372-node4-osd-client ceph-bharath-1623727358372-node5-osd ceph-bharath-1623727358372-node6-osd ceph-bharath-1623727358372-node7-osd;
do
echo "The Host name is :$host"
ssh -l root $host "rm -rf $ms"
ssh -l root $host "mkdir $ms"
scp -r $ms"store.db" $host:$ms
rm -rf $ms
mkdir $ms
#ssh -l root $host "mkdir $ms"
ssh -t root@$host <<'EOT'
ms=/tmp/monstore
for osd in /var/lib/ceph/osd/ceph-*;
do
IN=$osd
arrIN=(${IN//-/ })
systemctl stop ceph-osd@${arrIN[1]}.service
sleep 5
echo "ceph-objectstore-tool --type bluestore --data-path $osd --op update-mon-db --no-mon-config --mon-store-path $ms"
ceph-objectstore-tool --type bluestore --data-path $osd --op update-mon-db --no-mon-config --mon-store-path $ms
systemctl start ceph-osd@${arrIN[1]}.service
sleep 5
echo "Pulling data finished in OSD-${arrIN[1]}"
done
EOT
scp -r $host:$ms"*" $ms
echo "Finished Pulling data: $host"
done
For testing purposes, I hardcoded the OSD hostnames and included the sleep commands in the script.
The developer needs to verify this and provide a refined script. To track the issue, I am raising a dependency bug.
Description of problem:

This is an internal functional bug tracker to track the verification of the steps mentioned in the document - https://access.redhat.com/solutions/4202871

We performed the steps mentioned in the above document and noticed the following:

1) Ensure all OSD containers are stopped. Command: podman stop <container-id>

Issues:
-------
1.1. After stopping all OSD containers, "ceph osd tree" output shows 5 OSDs as UP even though the services are down (noticed on bare metal also).
1.2. Services are still running; stopped them by executing "systemctl stop <servicename>". This is not the case with 5.0. If this is expected in 4.x, please ignore.

2) Add Ceph repos on Ceph nodes based on their roles

Observations:
-------------
Please modify according to 4.x.

3) Install packages on the nodes based on their role

for Ceph-MON nodes: # yum install -y ceph-mon ceph-test rsync -------> successfully installed
for Ceph Storage nodes: # yum install -y ceph-osd ceph-test rsync -------> successfully installed

4) On the Ceph storage nodes, mount all Ceph data disks into temporary locations

Issues:
-------
4.1. The document lists data partitions with:
# ceph-disk list
The command needs to change to:
# ceph-volume lvm list
4.2. # mkdir /mnt/ceph-tmp-001
# mount /dev/sda1 /mnt/ceph-tmp-001
The mount-OSDs step failed with the following errors.

Error output on non-encrypted OSDs:
-----------------------------------
[root@ceph-pdhiran-1623053041789-node3-osd cephuser]# mount /dev/vg1/data-lv4 /mnt/ceph-tmp-001
mount: /mnt/ceph-tmp-001: wrong fs type, bad option, bad superblock on /dev/mapper/vg1-data--lv4, missing codepage or helper program, or other error.

On encrypted OSDs:
[root@ceph-pdhiran-1623053041789-node10-osd cephuser]# mount /dev/vg1/data-lv1 /mnt/ceph-tmp-001
mount: /mnt/ceph-tmp-001: unknown filesystem type 'crypto_LUKS'.

Ceph Version:
-------------
ceph version 14.2.11-179.el8cp (29de9ae52bcc20e38eb86cb8e4163bff2d1719c8) nautilus (stable)