Bug 2000412
| Field | Value |
|---|---|
| Summary | infrastructure-playbooks/cephadm-adopt.yml fails to start iscsi daemons while converting the storage cluster daemons to run cephadm |
| Product | [Red Hat Storage] Red Hat Ceph Storage |
| Component | Ceph-Ansible |
| Version | 5.0 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Reporter | Gopi <gpatta> |
| Assignee | Guillaume Abrioux <gabrioux> |
| QA Contact | Ameena Suhani S H <amsyedha> |
| CC | aschoen, ceph-eng-bugs, ceph-qe-bugs, gabrioux, gmeno, gsitlani, kdreyer, mgowri, mmurthy, nthomas, tserlin, vereddy, xiubli, ykaul |
| Flags | gpatta: needinfo+ |
| Target Milestone | --- |
| Target Release | 5.0z2 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | ceph-ansible-6.0.16-1.el8cp |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clones | 2007683 |
| Last Closed | 2021-12-08 13:57:04 UTC |
| Type | Bug |
| Regression | --- |
| Bug Depends On | 2026861 |

Doc Text:

Cause: The cephadm-adopt playbook makes cephadm start the iscsi service before stopping the containers that were managed by ceph-ansible. This leaves the tcmu-runner process unable to open devices, and the only remedy is to restart the containers.
Consequence: iscsigw daemons do not work properly after the adoption is done.
Fix: Stop the iscsi daemon containers that were managed by ceph-ansible before starting the new containers managed by cephadm. Also, migrate the iscsigw services before the OSDs (the clients should be stopped before the server).
Result: iscsigw daemons work properly after the adoption is done.
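The stop-before-adopt ordering described in the Doc Text can be sketched as Ansible tasks. This is an illustrative fragment only, modeled on the task names visible in the cephadm-adopt.yml run later in this report; it is not the shipped playbook source, and the unit names are the ones seen on this report's gateway nodes.

```yaml
# Sketch: stop and disable the legacy ceph-ansible iscsigw units *before*
# cephadm deploys its own iscsi containers, so tcmu-runner can re-open its
# devices under the new container. Illustrative only, not the actual fix.
- name: stop and disable iscsigw systemd services
  systemd:
    name: "{{ item }}"
    state: stopped
    enabled: false
  failed_when: false   # the units may be absent on non-gateway hosts
  loop:
    - rbd-target-api
    - rbd-target-gw
    - tcmu-runner

- name: reset failed iscsigw systemd units
  command: "systemctl reset-failed {{ item }}"
  changed_when: false
  failed_when: false
  loop:
    - rbd-target-api
    - rbd-target-gw
    - tcmu-runner
```

The same reasoning drives the play-ordering part of the fix: the iscsigw plays must run before the OSD plays, since the iSCSI gateways are clients of the OSDs.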
Hi Xiubo, you need to become root after logging in as cephuser (run "sudo su"), or prefix your commands with "sudo".

Hi Guillaume, can you look into this issue?

Hi Guillaume,
Setup is still there.
[ceph: root@ceph-gp-rbd-fqj2nr-node1-installer ~]# ceph -s
cluster:
id: 7c80e5b9-c8ae-4fbb-b9e5-36b4288182d0
health: HEALTH_WARN
mons are allowing insecure global_id reclaim
insufficient standby MDS daemons available
1 pools have too many placement groups
services:
mon: 3 daemons, quorum ceph-gp-rbd-fqj2nr-node2,ceph-gp-rbd-fqj2nr-node3,ceph-gp-rbd-fqj2nr-node1-installer (age 5d)
mgr: ceph-gp-rbd-fqj2nr-node1-installer(active, since 5d), standbys: ceph-gp-rbd-fqj2nr-node2
mds: 1/1 daemons up
osd: 12 osds: 12 up (since 5d), 12 in (since 6d)
data:
volumes: 1/1 healthy
pools: 4 pools, 193 pgs
objects: 1.09k objects, 4.1 GiB
usage: 13 GiB used, 167 GiB / 180 GiB avail
pgs: 193 active+clean
io:
client: 853 B/s rd, 0 op/s rd, 0 op/s wr
[ceph: root@ceph-gp-rbd-fqj2nr-node1-installer ~]# ceph orch ls
NAME RUNNING REFRESHED AGE PLACEMENT
crash 0/6 - 5d label:ceph
iscsi.rbd 0/2 - 5d count:2;label:iscsigws
mds.cephfs 0/3 - 5d count:3;label:mdss
mgr 0/2 - 5d count:2;label:mgrs
mon 0/3 - 5d count:3;label:mons
[ceph: root@ceph-gp-rbd-fqj2nr-node1-installer ~]# ceph orch host ls
HOST ADDR LABELS STATUS
ceph-gp-rbd-fqj2nr-node1-installer 10.0.210.115 mgrs mons ceph
ceph-gp-rbd-fqj2nr-node2 10.0.209.21 mgrs mons osds ceph
ceph-gp-rbd-fqj2nr-node3 10.0.210.56 iscsigws mons osds ceph
ceph-gp-rbd-fqj2nr-node4 10.0.208.78 mdss ceph
ceph-gp-rbd-fqj2nr-node5 10.0.209.206 iscsigws mdss osds ceph
ceph-gp-rbd-fqj2nr-node6 10.0.210.179 grafana-server mdss monitoring ceph
[ceph: root@ceph-gp-rbd-fqj2nr-node1-installer ~]#
[root@ceph-gp-rbd-fqj2nr-node3 cephuser]# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2021-09-01 22:26:22 EDT; 5 days ago
Docs: man:tcmu-runner(8)
Main PID: 7350 (tcmu-runner)
Tasks: 5 (limit: 23465)
Memory: 45.6M
CGroup: /system.slice/tcmu-runner.service
└─7350 /usr/bin/tcmu-runner
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Hi Guillaume, I don't have the setup right now, but I will share a new setup soon by reproducing the issue. Thanks, Gopi

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:5020
Description of problem:

Upgraded the setup from 4.x to 5.x; the upgrade was successful and the iscsi daemons started properly. When converting the storage cluster daemons to run cephadm using "ansible-playbook infrastructure-playbooks/cephadm-adopt.yml -i hosts", the iscsi daemons failed to start until we started them manually.

Version-Release number of selected component (if applicable):

ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

How reproducible:

100%

Steps to Reproduce:
1. Install a 4.x build on a fresh cluster.
2. Configure iscsi on the cluster.
3. Upgrade from 4.x to 5.x and check the iscsi configuration.
4. Convert the storage cluster daemons to run cephadm.
5. Check the iscsi configuration.

Actual results:

No tcmu-runner portals active on the cluster:

cluster:
id: 7c80e5b9-c8ae-4fbb-b9e5-36b4288182d0
health: HEALTH_WARN
mons are allowing insecure global_id reclaim
insufficient standby MDS daemons available
1 pools have too many placement groups
services:
mon: 3 daemons, quorum ceph-gp-rbd-fqj2nr-node2,ceph-gp-rbd-fqj2nr-node3,ceph-gp-rbd-fqj2nr-node1-installer (age 4m)
mgr: ceph-gp-rbd-fqj2nr-node1-installer(active, since 26m), standbys: ceph-gp-rbd-fqj2nr-node2
mds: 1/1 daemons up
osd: 12 osds: 12 up (since 3m), 12 in (since 19h)
data:
volumes: 1/1 healthy
pools: 4 pools, 193 pgs
objects: 1.09k objects, 4.1 GiB
usage: 12 GiB used, 167 GiB / 180 GiB avail
pgs: 193 active+clean

Expected results:

tcmu-runner portals should be shown, as below:

cluster:
id: 7c80e5b9-c8ae-4fbb-b9e5-36b4288182d0
health: HEALTH_WARN
mons are allowing insecure global_id reclaim
insufficient standby MDS daemons available
1 pools have too many placement groups
services:
mon: 3 daemons, quorum ceph-gp-rbd-fqj2nr-node2,ceph-gp-rbd-fqj2nr-node3,ceph-gp-rbd-fqj2nr-node1-installer (age 51s)
mgr: ceph-gp-rbd-fqj2nr-node1-installer(active, since 23m), standbys: ceph-gp-rbd-fqj2nr-node2
mds: 1/1 daemons up
osd: 12 osds: 12 up (since 45s), 12 in (since 19h)
tcmu-runner: 1 daemon active (1 hosts)
data:
volumes: 1/1 healthy
pools: 4 pools, 193 pgs
objects: 1.09k objects, 4.1 GiB
usage: 12 GiB used, 167 GiB / 180 GiB avail
pgs: 193 active+clean

Additional info:

I restarted the API services manually; after that I could see the portals.

On the client side, before converting the storage daemons with cephadm:

[root@ceph-gp-rbd-fqj2nr-node7 ~]# multipath -ll
3600140580c6fc634c3449dfb4f6264b4 dm-0 LIO-ORG,TCMU device
size=50G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='queue-length 0' prio=50 status=enabled
| `- 2:0:0:0 sda 8:0 active ready running
`-+- policy='queue-length 0' prio=10 status=enabled
  `- 3:0:0:0 sdb 8:16 active ready running
[root@ceph-gp-rbd-fqj2nr-node7 cephuser]# mount /dev/mapper/3600140580c6fc634c3449dfb4f6264b4 /tmp/iscsi_ditr
[root@ceph-gp-rbd-fqj2nr-node7 cephuser]# cd /tmp/iscsi_ditr
[root@ceph-gp-rbd-fqj2nr-node7 iscsi_ditr]# dd if=/dev/zero of=file1.txt count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.19474 s, 336 MB/s
[root@ceph-gp-rbd-fqj2nr-node7 iscsi_ditr]# dd if=/dev/zero of=file2.txt count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.96971 s, 270 MB/s

Cluster status at this point:

cluster:
id: 7c80e5b9-c8ae-4fbb-b9e5-36b4288182d0
health: HEALTH_WARN
mons are allowing insecure global_id reclaim
insufficient standby MDS daemons available
1 pools have too many placement groups
services:
mon: 3 daemons, quorum ceph-gp-rbd-fqj2nr-node2,ceph-gp-rbd-fqj2nr-node3,ceph-gp-rbd-fqj2nr-node1-installer (age 12m)
mgr: ceph-gp-rbd-fqj2nr-node1-installer(active, since 11m), standbys: ceph-gp-rbd-fqj2nr-node2
mds: 1/1 daemons up
osd: 12 osds: 12 up (since 9m), 12 in (since 19h)
tcmu-runner: 2 daemons active (2 hosts)

Run infrastructure-playbooks/cephadm-adopt.yml:

TASK [update the placement of iscsigw hosts] ************************************
Wednesday 01 September 2021 22:05:19 -0400 (0:00:00.311) 0:04:36.762 ***
ok: [ceph-gp-rbd-fqj2nr-node3 -> ceph-gp-rbd-fqj2nr-node1-installer]

PLAY [stop and remove legacy iscsigw daemons] ***********************************

TASK [stop and disable iscsigw systemd services] ********************************
Wednesday 01 September 2021 22:05:21 -0400 (0:00:02.119) 0:04:38.881 ***
changed: [ceph-gp-rbd-fqj2nr-node3] => (item=rbd-target-api)
changed: [ceph-gp-rbd-fqj2nr-node3] => (item=rbd-target-gw)
changed: [ceph-gp-rbd-fqj2nr-node3] => (item=tcmu-runner)

TASK [reset failed iscsigw systemd units] ***************************************
Wednesday 01 September 2021 22:05:29 -0400 (0:00:07.798) 0:04:46.679 ***
ok: [ceph-gp-rbd-fqj2nr-node3] => (item=rbd-target-api)
ok: [ceph-gp-rbd-fqj2nr-node3] => (item=rbd-target-gw)
ok: [ceph-gp-rbd-fqj2nr-node3] => (item=tcmu-runner)

TASK [remove iscsigw systemd unit files] ****************************************
Wednesday 01 September 2021 22:05:30 -0400 (0:00:00.769) 0:04:47.448 ***
changed: [ceph-gp-rbd-fqj2nr-node3] => (item=rbd-target-api)
changed: [ceph-gp-rbd-fqj2nr-node3] => (item=rbd-target-gw)
changed: [ceph-gp-rbd-fqj2nr-node3] => (item=tcmu-runner)

PLAY [stop and remove legacy iscsigw daemons] ***********************************

TASK [stop and disable iscsigw systemd services] ********************************
Wednesday 01 September 2021 22:05:30 -0400 (0:00:00.692) 0:04:48.141 ***
changed: [ceph-gp-rbd-fqj2nr-node5] => (item=rbd-target-api)
changed: [ceph-gp-rbd-fqj2nr-node5] => (item=rbd-target-gw)
changed: [ceph-gp-rbd-fqj2nr-node5] => (item=tcmu-runner)

TASK [reset failed iscsigw systemd units] ***************************************
Wednesday 01 September 2021 22:05:38 -0400 (0:00:08.101) 0:04:56.243 ***
ok: [ceph-gp-rbd-fqj2nr-node5] => (item=rbd-target-api)
ok: [ceph-gp-rbd-fqj2nr-node5] => (item=rbd-target-gw)
ok: [ceph-gp-rbd-fqj2nr-node5] => (item=tcmu-runner)

Check the "ceph -s" status:
--------------------------

cluster:
id: 7c80e5b9-c8ae-4fbb-b9e5-36b4288182d0
health: HEALTH_WARN
mons are allowing insecure global_id reclaim
insufficient standby MDS daemons available
1 pools have too many placement groups
services:
mon: 3 daemons, quorum ceph-gp-rbd-fqj2nr-node2,ceph-gp-rbd-fqj2nr-node3,ceph-gp-rbd-fqj2nr-node1-installer (age 4m)
mgr: ceph-gp-rbd-fqj2nr-node1-installer(active, since 4m), standbys: ceph-gp-rbd-fqj2nr-node2
mds: 1/1 daemons up
osd: 12 osds: 12 up (since 2m), 12 in (since 19h)
data:
volumes: 1/1 healthy

[root@ceph-gp-rbd-fqj2nr-node3 cephuser]# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Docs: man:tcmu-runner(8)
Sep 01 22:05:28 ceph-gp-rbd-fqj2nr-node3 conmon[116638]: teardown: Sending SIGTERM to PID 81
Sep 01 22:05:28 ceph-gp-rbd-fqj2nr-node3 conmon[116638]: teardown: Waiting PID 81 to terminate .
Sep 01 22:05:28 ceph-gp-rbd-fqj2nr-node3 conmon[116638]:
Sep 01 22:05:28 ceph-gp-rbd-fqj2nr-node3 conmon[116638]: (process:81): GLib-GObject-CRITICAL **: 22:05:28.996: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Sep 01 22:05:29 ceph-gp-rbd-fqj2nr-node3 conmon[116638]:
Sep 01 22:05:29 ceph-gp-rbd-fqj2nr-node3 conmon[116638]: teardown: Process 81 is terminated
Sep 01 22:05:29 ceph-gp-rbd-fqj2nr-node3 sh[125169]: 5a95091781868327a26d71f3ab91b0152d438179b4fa9b98fd55e9181e3df0dc
Sep 01 22:05:29 ceph-gp-rbd-fqj2nr-node3 systemd[1]: tcmu-runner.service: Main process exited, code=exited, status=143/n/a
Sep 01 22:05:29 ceph-gp-rbd-fqj2nr-node3 systemd[1]: tcmu-runner.service: Failed with result 'exit-code'.
Sep 01 22:05:29 ceph-gp-rbd-fqj2nr-node3 systemd[1]: Stopped TCMU Runner.
[root@ceph-gp-rbd-fqj2nr-node3 cephuser]#

Restart API services and check the status:
------------------------------------------

[root@ceph-gp-rbd-fqj2nr-node3 cephuser]# systemctl enable rbd-target-api
Created symlink /etc/systemd/system/multi-user.target.wants/rbd-target-api.service → /usr/lib/systemd/system/rbd-target-api.service.
[root@ceph-gp-rbd-fqj2nr-node3 cephuser]# systemctl start rbd-target-api
[root@ceph-gp-rbd-fqj2nr-node3 cephuser]# systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2021-09-01 22:12:24 EDT; 4s ago
Docs: man:tcmu-runner(8)
Main PID: 125704 (tcmu-runner)
Tasks: 24 (limit: 23505)
Memory: 37.8M
CGroup: /system.slice/tcmu-runner.service
└─125704 /usr/bin/tcmu-runner

cluster:
id: 7c80e5b9-c8ae-4fbb-b9e5-36b4288182d0
health: HEALTH_WARN
mons are allowing insecure global_id reclaim
insufficient standby MDS daemons available
1 pools have too many placement groups
services:
mon: 3 daemons, quorum ceph-gp-rbd-fqj2nr-node2,ceph-gp-rbd-fqj2nr-node3,ceph-gp-rbd-fqj2nr-node1-installer (age 12m)
mgr: ceph-gp-rbd-fqj2nr-node1-installer(active, since 11m), standbys: ceph-gp-rbd-fqj2nr-node2
mds: 1/1 daemons up
osd: 12 osds: 12 up (since 9m), 12 in (since 19h)
tcmu-runner: 2 daemons active (2 hosts)

Check on client side for creating files:
----------------------------------------

[root@ceph-gp-rbd-fqj2nr-node7 iscsi_ditr]# dd if=/dev/zero of=file3.txt count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 208.984 s, 5.1 MB/s
[root@ceph-gp-rbd-fqj2nr-node7 iscsi_ditr]# cd
[root@ceph-gp-rbd-fqj2nr-node7 ~]# cd /tmp/iscsi_ditr/
[root@ceph-gp-rbd-fqj2nr-node7 iscsi_ditr]# dd if=/dev/zero of=file4.txt count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.1933 s, 256 MB/s
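The manual recovery used above (re-enabling rbd-target-api and confirming that tcmu-runner comes back) can be collected into a small helper. This is a sketch only, built from the exact commands shown in this report; `recover_iscsigw` is a hypothetical function name, not part of any shipped tooling.

```shell
#!/usr/bin/env bash
# Sketch of the manual workaround from this report. Run on each affected
# iSCSI gateway node; recover_iscsigw is a hypothetical helper name.
recover_iscsigw() {
  systemctl enable rbd-target-api   # recreate the multi-user.target symlink
  systemctl start rbd-target-api    # rbd-target-api brings tcmu-runner back up
  systemctl is-active tcmu-runner   # expect "active" once recovery succeeds
}
# Invoke manually on the gateway node:
# recover_iscsigw
```

After running this on both gateways, the tcmu-runner line should reappear in "ceph -s" (e.g. "tcmu-runner: 2 daemons active (2 hosts)").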