Bug 1388247

Summary: [rbd-mirror] : image is shown as non-primary image on both cluster and unable to write to image after promotion/demotion of image on slave cluster(demotion/promotion on master cluster)
Product: Red Hat Ceph Storage Reporter: Rachana Patel <racpatel>
Component: DocumentationAssignee: Bara Ancincova <bancinco>
Status: CLOSED CURRENTRELEASE QA Contact: Rachana Patel <racpatel>
Severity: high Docs Contact:
Priority: unspecified    
Version: 2.1CC: ceph-eng-bugs, hnallurv, jdillama, kdreyer, kurs, uboppana
Target Milestone: rc   
Target Release: 2.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-28 09:37:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Rachana Patel 2016-10-24 20:38:01 UTC
Description of problem:
========================
Demote image from master cluster and promote it from slave/remote cluster. Write to image using write-bench and then again demoted image.(command status was successful) but unable to promote image from master cluster.
After 2-3 unsucceful try of promoting from master and demoting from slave cluster, it is showing image as primary on both cluster and unable to do write to image.

Version-Release number of selected component (if applicable):
================================================================
10.2.3-8.el7cp.x86_64

How reproducible:
=================
Haven't tried. will update this later


Steps to Reproduce:
===================
1. created 2 cluster- Master and slave-  with 3 MON and 3 OSD node(9 OSD total).each cluster has one rbd-client.
2. setup rbd-mirroring between cluster
3. created image on master node . write to image and allowed it to sync to slave.
4. demoted image from matser cluster
[root@magna025 ubuntu]# rbd mirror image demote pool1/bugre1 --cluster master
Image demoted to non-primary
[root@magna025 ubuntu]# rbd mirror image status  pool1/bugre1  --cluster master
bugre1:
  global_id:   2e0b038f-f863-4cc1-a6a0-36a0f7ded729
  state:       up+stopped
  description: remote image is non-primary or local image is primary
  last_update: 2016-10-24 19:02:55

5. promoted same image from slave cluster
[root@magna078 ubuntu]#  rbd mirror image promote pool1/bugre1 --cluster slave
Image promoted to primary

[root@magna078 ubuntu]# rbd mirror image status  pool1/bugre1  --cluster slave
bugre1:
  global_id:   2e0b038f-f863-4cc1-a6a0-36a0f7ded729
  state:       up+stopped
  description: remote image is non-primary or local image is primary
  last_update: 2016-10-24 19:04:28

6. write to image from slave cluster
[root@magna078 ubuntu]# rbd bench-write pool1/bugre1  --io-total 100M --cluster slave
bench-write  io_size 4096 io_threads 16 bytes 104857600 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1      1234   1247.82  5111090.57
    2      1448    720.63  2951705.20

on master :-
bugre1:
  global_id:   2e0b038f-f863-4cc1-a6a0-36a0f7ded729
  state:       up+replaying
  description: replaying, master_position=[object_number=2, tag_tid=4, entry_tid=2182], mirror_position=[object_number=3, tag_tid=4, entry_tid=63], entries_behind_master=2119
  last_update: 2016-10-24 19:08:26

7. Demote image from slave cluster
[root@magna078 ubuntu]#  rbd mirror image demote pool1/bugre1 --cluster slave
Image demoted to non-primary

8. promote same image from master  cluster
[root@magna025 ubuntu]# rbd mirror image promote pool1/bugre1 --cluster master
rbd: error promoting image to primary
2016-10-24 19:10:21.043848 7faedb03bd80 -1 librbd: image is still primary within a remote cluster


--> On failure checked status on slave cluster and redo demotion

[root@magna078 ubuntu]# rbd mirror image status  pool1/bugre1  --cluster slave
bugre1:
  global_id:   2e0b038f-f863-4cc1-a6a0-36a0f7ded729
  state:       up+stopped
  description: remote image is non-primary or local image is primary
  last_update: 2016-10-24 19:10:28

[root@magna078 ubuntu]#  rbd mirror image demote pool1/bugre1 --cluster slave
rbd: error demoting image to non-primary
2016-10-24 19:10:52.691154 7f85d50fed80 -1 librbd: image is not currently the primary

--> it failed..try twice again.

--> tried to promote image from master cluster and that also failed(tried twice)


[root@magna025 ubuntu]# rbd mirror image promote pool1/bugre1 --cluster master
rbd: error promoting image to primary
2016-10-24 19:10:25.701081 7f614bf13d80 -1 librbd: image is still primary within a remote cluster
[root@magna025 ubuntu]# rbd mirror image promote pool1/bugre1 --cluster master
rbd: error promoting image to primary2016-10-24 19:11:08.763724 7f4628b2ad80 -1 librbd: image is still primary within a remote cluster




Actual results:
================
after few min check status on both cluster and both is showing image as primary and writing to image is failing from both cluster


[root@magna078 ubuntu]# rbd mirror image status  pool1/bugre1  --cluster slave
bugre1:
  global_id:   2e0b038f-f863-4cc1-a6a0-36a0f7ded729
  state:       up+stopped
  description: remote image is non-primary or local image is primary
  last_update: 2016-10-24 19:15:58

[root@magna025 ubuntu]# rbd mirror image status  pool1/bugre1  --cluster master 
bugre1:
  global_id:   2e0b038f-f863-4cc1-a6a0-36a0f7ded729
  state:       up+stopped
  description: remote image is non-primary or local image is primary
  last_update: 2016-10-24 19:37:25


master cluster
=============
[root@magna025 ubuntu]# rbd bench-write pool1/bugre1  --io-total 100M --cluster master
bench-write  io_size 4096 io_threads 16 bytes 104857600 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
2016-10-24 19:55:29.448677 7f7b8affd700 -1 librbd::journal::StandardPolicy: local image not promoted
2016-10-24 19:55:29.448686 7f7b8affd700 -1 librbd::exclusive_lock::AcquireRequest: failed to allocate journal tag: (1) Operation not permitted
2016-10-24 19:55:29.489704 7f7b8affd700 -1 librbd::ExclusiveLock: failed to acquire exclusive lock:(1) Operation not permitted
^C
[root@magna025 ubuntu]# systemctl status -l ceph-rbd-mirror@master
● ceph-rbd-mirror - Ceph rbd mirror daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-rbd-mirror@.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-10-21 18:09:38 UTC; 3 days ago
 Main PID: 11347 (rbd-mirror)
   CGroup: /system.slice/system-ceph\x2drbd\x2dmirror.slice/ceph-rbd-mirror
           └─11347 /usr/bin/rbd-mirror -f --cluster master --id master --setuser ceph --setgroup ceph

Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909158 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c0939d0 [3/0ecb8cd7-8e05-4169-a947-955a364acae4] start: already running
Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909161 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c094f10 [3/d73bea26-e095-4cc2-b48c-63fda1773123] start: already running
Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909165 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c096450 [3/e3dc42e6-99ed-44b4-8de2-5a195ff897f5] start: already running
Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909168 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c097990 [3/1e1d4355-7655-4755-8219-de1fd6b70db1] start: already running
Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909171 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c098ed0 [3/d98a3b15-820b-4b57-9d8f-51c36be8cdf8] start: already running
Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909174 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c09b950 [3/b422e2f9-1323-4c7b-9318-d041381b8427] start: already running
Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909177 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c09ce90 [3/fb51b78f-c884-4287-ae0b-7b4791c1150f] start: already running
Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909180 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c08a460 [3/2e0b038f-f863-4cc1-a6a0-36a0f7ded729] start: already running
Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909183 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c0a2370 [3/faf4090f-2fff-4552-b8c7-52b7f1886e3f] start: already running
Oct 24 19:55:24 magna025 rbd-mirror[11347]: 2016-10-24 19:55:24.909187 7fd2cbfff700 -1 rbd::mirror::ImageReplayer: 0x7fd24c0a38b0 [3/c69ef283-987e-4712-841a-3d554e435349] start: already running



slave cluster:
==============
[root@magna078 ubuntu]# rbd bench-write pool1/bugre1  --io-total 100M --cluster slave
bench-write  io_size 4096 io_threads 16 bytes 104857600 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
2016-10-24 19:54:54.810612 7fa0ed6f9700 -1 librbd::journal::StandardPolicy: local image not promoted
2016-10-24 19:54:54.810624 7fa0ed6f9700 -1 librbd::exclusive_lock::AcquireRequest: failed to allocate journal tag: (1) Operation not permitted
2016-10-24 19:54:54.839937 7fa0ed6f9700 -1 librbd::ExclusiveLock: failed to acquire exclusive lock:(1) Operation not permitted


^C   
[root@magna078 ubuntu]# systemctl status -l ceph-rbd-mirror@slave
● ceph-rbd-mirror - Ceph rbd mirror daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-rbd-mirror@.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-10-21 18:44:27 UTC; 3 days ago
 Main PID: 11803 (rbd-mirror)
   CGroup: /system.slice/system-ceph\x2drbd\x2dmirror.slice/ceph-rbd-mirror
           └─11803 /usr/bin/rbd-mirror -f --cluster slave --id slave --setuser ceph --setgroup ceph

Oct 24 18:47:05 magna078 rbd-mirror[11803]: 2016-10-24 18:47:05.795258 7fcebcf90700 -1 rbd::mirror::image_replayer::CreateImageRequest: 0x7fce000bca90 handle_create_image: failed to create local image: (17) File exists
Oct 24 18:47:05 magna078 rbd-mirror[11803]: 2016-10-24 18:47:05.795264 7fcebcf90700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7fceac02af30 handle_create_local_image: failed to create local image: (17) File exists
Oct 24 18:47:05 magna078 rbd-mirror[11803]: 2016-10-24 18:47:05.797140 7fcebcf90700 -1 rbd::mirror::ImageReplayer: 0x7fce38020d20 [3/2e0b038f-f863-4cc1-a6a0-36a0f7ded729] operator(): start failed: (17) File exists
Oct 24 19:02:55 magna078 rbd-mirror[11803]: 2016-10-24 19:02:55.426754 7fce6e7fc700 -1 rbd::mirror::image_sync::ImageCopyRequest: 0x7fce1c10e2d0 compute_snap_map: failed to locate snapshot: .rbd-mirror.0f454219-dd15-49ba-968a-f822833d46e8.db782658-bd31-443c-b95e-5ab40d38831e
Oct 24 19:02:55 magna078 rbd-mirror[11803]: 2016-10-24 19:02:55.426765 7fce6e7fc700 -1 rbd::mirror::ImageSync: 0x7fcdd8105c10 handle_copy_image: failed to copy image: (2) No such file or directory
Oct 24 19:02:55 magna078 rbd-mirror[11803]: 2016-10-24 19:02:55.426773 7fce6e7fc700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7fceac055ee0 handle_image_sync: failed to sync remote image: (2) No such file or directory
Oct 24 19:02:55 magna078 rbd-mirror[11803]: 2016-10-24 19:02:55.434806 7fcebcf90700 -1 rbd::mirror::ImageReplayer: 0x7fce38020d20 [3/2e0b038f-f863-4cc1-a6a0-36a0f7ded729] operator(): start failed: (2) No such file or directory
Oct 24 19:19:05 magna078 rbd-mirror[11803]: 2016-10-24 19:19:05.905403 7fcebcf90700 -1 rbd::mirror::ImageReplayer: 0x7fce38020d20 [3/2e0b038f-f863-4cc1-a6a0-36a0f7ded729] start: already running
Oct 24 19:21:05 magna078 rbd-mirror[11803]: 2016-10-24 19:21:05.922989 7fcebcf90700 -1 rbd::mirror::ImageReplayer: 0x7fce38020d20 [3/2e0b038f-f863-4cc1-a6a0-36a0f7ded729] start: already running
Oct 24 19:45:06 magna078 rbd-mirror[11803]: 2016-10-24 19:45:06.060784 7fcebcf90700 -1 rbd::mirror::ImageReplayer: 0x7fce38020d20 [3/2e0b038f-f863-4cc1-a6a0-36a0f7ded729] start: already running


Expected results:
================
- As mentioned in step 7 - demotion was successful so it should not show image as primary
- should not show same status on both cluster (claiming self primary or other image as non primary) 
- User should be able to write on image.(if it is primary)

Additional info:

Comment 10 John Poelstra 2016-10-26 23:49:18 UTC
Discussed at the program meeting... there is no code to fix here, it is a documentation issue.

Comment 15 Rachana Patel 2016-11-10 01:28:35 UTC
Doc text is as per comment 12, hence moving to verified