Description of problem:

1. Host boots and connects to storage
2. RHEL activates all LVs
3. VDSM deactivates all LVs on bootstrap

All of this works well. However, for the Hosted-Engine storage domain, most if not all images are active and not open right after vdsm initialization, because they are all activated again a few seconds after vdsm initialization deactivates them. These are stale LVs, and this is undesirable: it even caused corruption before Nir's --refresh patch. We don't want to rely on --refresh all the time; these LVs cannot be active.

It's ovirt-ha-agent that asks VDSM to prepare all images in the HE SD, and the unused LVs are never deactivated. See:

# cat /etc/ovirt-hosted-engine/hosted-engine.conf | grep sdUUID
sdUUID=b1806393-a63b-4c0e-a4ab-4fad369c1654

Now let's see how many active-but-not-open images (disks with tags, see IU_) we have there:

# lvs -o +tags | grep b1806393-a63b-4c0e-a4ab-4fad369c1654 | grep IU_ | grep '\-wi\-a\-' | wc -l
13

13 cannot be just OVFs or Hosted-Engine conf volumes. Let's see how many are not active:

# lvs -o +tags | grep b1806393-a63b-4c0e-a4ab-4fad369c1654 | grep IU_ | grep '\-wi\-\-\-' | wc -l
0

One example, for a disk that I created 1 minute ago:

# lvs -o +tags | grep 85b71ffb-47e3-47bf-af7a-ce135655cc4f
  de5c96de-6c8d-4b37-a7ba-8d922d95a63c  b1806393-a63b-4c0e-a4ab-4fad369c1654  -wi-a-----  1.00g  IU_85b71ffb-47e3-47bf-af7a-ce135655cc4f,MD_15,PU_00000000-0000-0000-0000-000000000000

From my investigation, this is what happens:

4. ovirt-ha-agent asks vdsm to prepare all images of the HE SD, so vdsm activates all of them right after boot:

    def _initialize_storage_images(self):
        [....]
        img.prepare_images()
        [....]

   The prepare_images docstring describes this clearly enough:

    def prepare_images(self):
        """
        It scans for all the available images and volumes on the
        hosted-engine storage domain and for each of them calls
        prepareImage on VDSM. prepareImage will create the needed
        symlinks and it will activate the LV if on block devices.
        """

5. So once we see this in the agent log, the images are all active:

MainThread::INFO::2017-04-19 15:43:16,141::hosted_engine::639::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2017-04-19 15:43:16,142::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2017-04-19 15:43:18,353::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2017-04-19 15:43:18,669::hosted_engine::666::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2017-04-19 15:43:18,669::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images

6. And nobody asks vdsm to tear the unused ones down, because ovirt-ha-agent only calls teardown_images() on this exception:

    def _initialize_storage_images(self):
        [....]
        try:
            sserver.connect_storage_server()
        except ex.DuplicateStorageConnectionException:
            [....]
            img.teardown_images()
            [....]

Version-Release number of selected component (if applicable):
It's reproducible on pretty much every RHV version, all the way from 3.6 to the latest. Just tested it on:
ovirt-hosted-engine-ha-2.0.6-1.el7ev.noarch
vdsm-4.18.21-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Add a disk to the hosted_storage domain
2. Wait one ovirt-ha-agent cycle
3. Check the host's LVs

Actual results:
Stale LVs are active

Expected results:
No stale LVs
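The grep pipeline in the description can be wrapped into a small reusable check that parses lvs reporting output instead of the default table. A minimal sketch; the function names are mine, not part of vdsm, and it assumes the standard lvm2 `lvs` options `--noheadings`, `--separator` and `-o`:

```python
import subprocess


def parse_stale(lvs_output):
    """Return names of LVs that carry an image tag (IU_*) and are
    active but not open: lv_attr char 5 is 'a', char 6 is not 'o'."""
    stale = []
    for line in lvs_output.splitlines():
        if not line.strip():
            continue
        name, attr, tags = (f.strip() for f in line.split("|"))
        if "IU_" in tags and attr[4] == "a" and attr[5] != "o":
            stale.append(name)
    return stale


def stale_lvs(sd_uuid):
    """Run lvs against the storage-domain VG and return its stale LVs."""
    out = subprocess.check_output(
        ["lvs", "--noheadings", "--separator", "|",
         "-o", "lv_name,lv_attr,lv_tags", sd_uuid],
        universal_newlines=True)
    return parse_stale(out)
```

Unlike grepping the aligned table, this keys directly off the lv_attr positions, so an open LV ('o' in position 6) is never miscounted as stale.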
Those are only disks on the HE SD, which should only be the HE VM disks, right?
(In reply to Yaniv Kaul from comment #2)
> Those are only disks on the HE SD, which should only be the HE VM disks,
> right?

No, we allow using the HE SD as a normal SD, so a user can create as many disks as they want in the HE SD and attach them to VMs.
The code allows that, but we have always said it is not supported. It is going to change, but we haven't gotten to it yet.
(In reply to Martin Sivák from comment #5)
> The code allows that, but we have always said it is not supported. It is
> going to change, but we haven't gotten to it yet.

Hi Martin,

Thanks for linking bug 1275552. I think it needs to be escalated ASAP; I will do it now.

If it is not supported, this must be made VERY clear, including a warning or a block on the action in the Portal. A small note buried in
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.0/html-single/self-hosted_engine_guide/
saying "The self-hosted engine requires a shared storage domain dedicated to the Manager virtual machine." is not enough. We must add a warning/experimental label as we have for OVS. Depending on the names of the storage domains, `hosted_storage` can even be the default one offered when creating a new disk (as it is in our labs), due to alphabetical ordering.

I understand the previous bugs related to 1275552 were mostly about performance or compromised HA. But this BZ (stale LVs) can lead to data corruption, as we have already seen in the past with stale LVs (BZ1358348). We cannot rely on the lvm refresh patch to always save VMs in the HE SD from corruption; that is a safety-net mechanism. This is very serious.
We plan to make the HE SD a normal SD in the system, I'll be closing the other bugs.
(In reply to Yaniv Dary from comment #7)
> We plan to make the HE SD a normal SD in the system, I'll be closing the
> other bugs.

So please do. There's no actionable item here for storage. Setting devel cond-nack until a clear requirement arises.
Any improvement on this in 4.2?
(In reply to Yaniv Lavi from comment #9)
> Any improvement on this in 4.2?

It will probably be worse on this side. In the node-zero flow, just after setup, the hosted-engine storage domain is active in the engine and it is the master storage domain. No other storage domain is required on the technical side to start using the system. Although we don't recommend it in our documentation, the user can create other VMs on the hosted-engine storage domain and the engine doesn't complain at all, so in the end the user can have more disks on the hosted-engine storage domain and therefore more LVs.

On the ovirt-ha-agent side it's still almost the same: prepare_images will prepare all the images found on the storage domain, and this will result in LVs that are active but not open.
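The missing piece described above is an unconditional cleanup pass: after scanning the domain, tear down every image that is not actually in use so its LVs get deactivated on block storage. A rough illustration of the idea only; `cli`, `get_image_list` and `teardown_image` are hypothetical stand-ins, not the real vdsm or ovirt-ha-agent API:

```python
def teardown_unused_images(cli, sp_uuid, sd_uuid, images_in_use):
    """Tear down every image on the HE storage domain that is not in
    active use, so its LVs are deactivated on block storage.

    `cli` is a hypothetical vdsm-client-like wrapper; the real fix
    lives in ovirt-hosted-engine-ha, this only sketches the logic.
    """
    torn_down = []
    for img_uuid in cli.get_image_list(sd_uuid):
        if img_uuid in images_in_use:
            continue  # HE VM disk, conf volume, etc. -- keep prepared
        cli.teardown_image(sp_uuid, sd_uuid, img_uuid)
        torn_down.append(img_uuid)
    return torn_down
```

The point is that teardown runs on every cycle, not only in the DuplicateStorageConnectionException path quoted in the description.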
I've deployed a clean environment over iSCSI using ansible and created 10 disks of 2GB each on the hosted storage. Here is what I saw from the host just after creating the disks:

alma03 ~]# lvs -o +tags
  LV                                    VG                                    Attr        LSize    LV Tags
  0196b277-7292-4512-aeb5-71795dd58ce9  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_0740975b-3d77-4e63-b045-b6de00580139,MD_9,PU_00000000-0000-0000-0000-000000000000
  02c9ade8-3d76-45f3-85b4-104b997b54af  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_9bd6d63e-3915-446e-8040-910100fb48d8,MD_8,PU_00000000-0000-0000-0000-000000000000
  17ed51d4-df2e-40c1-a4eb-ced43546ed16  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-ao----  1.00g    IU_cc594e63-b22b-40ca-9c12-6c9576c10372,MD_4,PU_00000000-0000-0000-0000-000000000000
  1a06eeb9-4230-4a44-ba7d-291616fed6ac  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_59eb6bf7-4c4a-41b2-a6e5-42744bcb0b93,MD_11,PU_00000000-0000-0000-0000-000000000000
  29be3876-d794-422c-97bc-a1101d83530b  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  1.00g    IU_e8ba1447-e2c8-4d0d-bbda-016e35e3483d,MD_6,PU_00000000-0000-0000-0000-000000000000
  47c56bed-8332-4b99-9083-31ba817bed3c  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_fe0ab22c-8d92-4ca9-9445-05eb142b3f59,MD_13,PU_00000000-0000-0000-0000-000000000000
  663f0706-b715-41b0-86e6-b42d00af9447  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_34bb3d89-5a8c-414d-8a2d-e42677e49cc6,MD_14,PU_00000000-0000-0000-0000-000000000000
  77f70737-2c51-41f4-9181-8c1be655027a  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  1.00g    IU_42c15741-4d08-4360-b968-0f43e6abd284,MD_5,PU_00000000-0000-0000-0000-000000000000
  816db21b-b746-4566-bd95-932acc5a6814  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_8868b703-d716-4c2c-8fe0-deab42f298bd,MD_10,PU_00000000-0000-0000-0000-000000000000
  a20c86c4-ef74-4d72-9c7a-fe1ed0ed7739  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_bd35c468-4501-4189-99c9-bbe24d6fbf87,MD_12,PU_00000000-0000-0000-0000-000000000000
  a52b729b-836f-46f7-8046-1c345e0143d8  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-ao----  50.00g   IU_1f0754c4-2066-44bf-a044-94c8ea279b41,MD_7,PU_00000000-0000-0000-0000-000000000000
  b1648a60-a21c-4ae2-a3f4-dd21eacce714  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_9f22be1b-2277-48d6-be3d-4cf2ac57f651,MD_16,PU_00000000-0000-0000-0000-000000000000
  c843ead8-3f03-4a97-8bde-3655308e466d  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_beaf4558-cd4a-4ce7-bff1-2e9a8adabef1,MD_17,PU_00000000-0000-0000-0000-000000000000
  ce68569a-5537-4cdc-ac4e-2de72ce25259  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_9eea0a1e-3139-42c8-8700-c08c44812518,MD_15,PU_00000000-0000-0000-0000-000000000000
  ids                                   0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-ao----  128.00m
  inbox                                 0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  128.00m
  leases                                0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  2.00g
  master                                0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-ao----  1.00g
  metadata                              0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  512.00m
  outbox                                0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  128.00m
  xleases                               0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  1.00g

Then I restarted the ha-agent and broker and checked again:

[root@alma03 ~]# systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent
[root@alma03 ~]# lvs -o +tags
  LV                                    VG                                    Attr        LSize    LV Tags
  0196b277-7292-4512-aeb5-71795dd58ce9  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_0740975b-3d77-4e63-b045-b6de00580139,MD_9,PU_00000000-0000-0000-0000-000000000000
  02c9ade8-3d76-45f3-85b4-104b997b54af  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_9bd6d63e-3915-446e-8040-910100fb48d8,MD_8,PU_00000000-0000-0000-0000-000000000000
  17ed51d4-df2e-40c1-a4eb-ced43546ed16  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-ao----  1.00g    IU_cc594e63-b22b-40ca-9c12-6c9576c10372,MD_4,PU_00000000-0000-0000-0000-000000000000
  1a06eeb9-4230-4a44-ba7d-291616fed6ac  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_59eb6bf7-4c4a-41b2-a6e5-42744bcb0b93,MD_11,PU_00000000-0000-0000-0000-000000000000
  29be3876-d794-422c-97bc-a1101d83530b  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  1.00g    IU_e8ba1447-e2c8-4d0d-bbda-016e35e3483d,MD_6,PU_00000000-0000-0000-0000-000000000000
  47c56bed-8332-4b99-9083-31ba817bed3c  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_fe0ab22c-8d92-4ca9-9445-05eb142b3f59,MD_13,PU_00000000-0000-0000-0000-000000000000
  663f0706-b715-41b0-86e6-b42d00af9447  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_34bb3d89-5a8c-414d-8a2d-e42677e49cc6,MD_14,PU_00000000-0000-0000-0000-000000000000
  77f70737-2c51-41f4-9181-8c1be655027a  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  1.00g    IU_42c15741-4d08-4360-b968-0f43e6abd284,MD_5,PU_00000000-0000-0000-0000-000000000000
  816db21b-b746-4566-bd95-932acc5a6814  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_8868b703-d716-4c2c-8fe0-deab42f298bd,MD_10,PU_00000000-0000-0000-0000-000000000000
  a20c86c4-ef74-4d72-9c7a-fe1ed0ed7739  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_bd35c468-4501-4189-99c9-bbe24d6fbf87,MD_12,PU_00000000-0000-0000-0000-000000000000
  a52b729b-836f-46f7-8046-1c345e0143d8  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-ao----  50.00g   IU_1f0754c4-2066-44bf-a044-94c8ea279b41,MD_7,PU_00000000-0000-0000-0000-000000000000
  b1648a60-a21c-4ae2-a3f4-dd21eacce714  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_9f22be1b-2277-48d6-be3d-4cf2ac57f651,MD_16,PU_00000000-0000-0000-0000-000000000000
  c843ead8-3f03-4a97-8bde-3655308e466d  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_beaf4558-cd4a-4ce7-bff1-2e9a8adabef1,MD_17,PU_00000000-0000-0000-0000-000000000000
  ce68569a-5537-4cdc-ac4e-2de72ce25259  0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-------  2.00g    IU_9eea0a1e-3139-42c8-8700-c08c44812518,MD_15,PU_00000000-0000-0000-0000-000000000000
  ids                                   0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-ao----  128.00m
  inbox                                 0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  128.00m
  leases                                0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  2.00g
  master                                0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-ao----  1.00g
  metadata                              0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  512.00m
  outbox                                0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  128.00m
  xleases                               0d528e5a-43f8-4b73-b53c-61a909def9e7  -wi-a-----  1.00g

All the newly created disks were -wi------- as expected; they were neither active/open nor active-but-not-open. Moving to verified.

Worked for me with these components on the host:
rhvm-appliance-4.2-20180202.0.el7.noarch
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.4 (Maipo)
Linux 3.10.0-693.19.1.el7.x86_64 #1 SMP Thu Feb 1 12:34:44 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
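The verification above hinges on reading the lv_attr column correctly. A small classification helper of my own, based on lvm's documented attr layout (character 5 is the state flag, character 6 the device-open flag), makes the three cases explicit:

```python
def lv_state(attr):
    """Classify an lv_attr string such as '-wi-a-----'.

    Index 4 is the state flag ('a' = active); index 5 is the
    device-open flag ('o' = open).
    """
    active = attr[4] == "a"
    opened = attr[5] == "o"
    if active and opened:
        return "active/open"        # in use, e.g. the HE VM disk
    if active:
        return "active, not open"   # the stale case this bug tracked
    return "inactive"               # the expected state after the fix
```

With this reading, -wi-ao---- (the HE VM disk) is fine, -wi------- (unused disks after the fix) is the desired state, and -wi-a----- on an IU_-tagged disk LV is the stale state reported here.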
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1472