Bug 2023344 - vdsmd fails to shut down cleanly when it tries to deactivate a used LV
Summary: vdsmd fails to shut down cleanly when it tries to deactivate a used LV
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.50.0.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.10
Target Release: 4.40.100.1
Assignee: Nir Soffer
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-11-15 13:19 UTC by Milan Zamazal
Modified: 2022-01-19 07:00 UTC
CC List: 3 users

Fixed In Version: vdsm-4.40.100.1
Doc Type: Bug Fix
Doc Text:
Cause: Vdsm tried to deactivate logical volumes during shutdown. Consequence: Vdsm shutdown could fail, or programs using logical volumes managed by Vdsm could fail. Fix: Vdsm no longer deactivates logical volumes during shutdown. Result: Vdsm shuts down faster, and programs using logical volumes managed by Vdsm are not affected.
Clone Of:
Environment:
Last Closed: 2022-01-19 07:00:13 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44016 0 None None None 2021-11-15 13:24:09 UTC
oVirt gerrit 117616 0 master MERGED monitor: Do not tear down domain during shutdown 2021-11-15 14:59:40 UTC
oVirt gerrit 117633 0 ovirt-4.4.z MERGED monitor: Do not tear down domain during shutdown 2021-11-15 16:12:09 UTC

Description Milan Zamazal 2021-11-15 13:19:53 UTC
Description of problem:

In some environments, when Vdsm is being stopped by systemd, it tries to tear down a logical volume that is still in use. This can block further shutdown steps performed by vdsmd, causing the service TimeoutStopSec timeout to be exceeded, vdsmd to be killed forcefully by systemd, and the vdsmd shutdown procedure to be left unfinished.
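
The timeout in question is the systemd unit's TimeoutStopSec; its effective value can be inspected with:

    systemctl show vdsmd -p TimeoutStopSec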

Version-Release number of selected component (if applicable):

current master (4.5 development)

How reproducible:

100% in ost-basic-suite-master-host-1 after test_004_basic_sanity.py completes while vm0 is running there.

Steps to Reproduce:
1. Run `systemctl stop vdsmd` on the host.
2. Check vdsm.log.

Actual results:

vdsmd doesn't complete its shutdown actions. The end of the log looks like this (note that it doesn't end with the "Exiting" message that would indicate vdsmd stopped its activity completely):

    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 3a729aa1-8e14-4ea0-8794-e3d67fbde542 (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 0187cf2f-2344-48de-a2a0-dd007315399f (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,831+0000 INFO  (monitor/0187cf2) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_exported/0187cf2f-2344-48de-a2a0-dd007315399f/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,838+0000 INFO  (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,844+0000 INFO  (monitor/15fa3d6) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs__he/15fa3d6c-671b-46ef-af9a-00337011fa26/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,864+0000 INFO  (monitor/3a729aa) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share2/3a729aa1-8e14-4ea0-8794-e3d67fbde542/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,865+0000 INFO  (monitor/186180c) [storage.check] Stop checking '/dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/metadata' (check:135)
    2021-11-12 07:09:32,867+0000 INFO  (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
    2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
    2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n  Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
    2021-11-12 07:09:38,066+0000 WARN  (monitor/186180c) [storage.lvm] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'vgchange\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--available\', \'n\', \'186180cb-5cc5-4aa4-868a-9e1ed7965ddf\'] rc=5 out=[] err=[\'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\', \'  Can\\\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\']' (lvm:532)
    2021-11-12 07:09:38,067+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405d40a7e761def4ed09aac63282$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|^/dev/mapper/36001405ece10ae48e99470295edda1bb$|^/dev/mapper/36001405f269e8e9efb34e5f9170b349b$|^/dev/mapper/36001405f8c5fe6c239e46a0976b842df$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
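
The VG cannot be deactivated because its "ids" LV is open; sanlock keeps the lockspace LV open while the host is connected to storage. One way to see which device-mapper devices are open:

    dmsetup info -c -o name,open

A non-zero Open count means some process holds the device.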

Expected results:

vdsmd stops cleanly within TimeoutStopSec interval, logging "[vds] Exiting" as its last log message.

Comment 1 Nir Soffer 2021-11-15 14:00:51 UTC
(In reply to Milan Zamazal from comment #0) 
> Expected results:
> 
> vdsmd stops cleanly within TimeoutStopSec interval, logging "[vds] Exiting"
> as its last log message.

This is not possible to guarantee. The expected result is that storage is not
torn down during shutdown. Tearing down storage must be done only when the
engine asks to deactivate a storage domain or to disconnect from storage.

When the vdsmd service stops, storage should be left as is. Storage domains
that were active before vdsm was stopped should remain active.
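
The linked patches ("monitor: Do not tear down domain during shutdown") implement this by skipping teardown on the shutdown path. A minimal sketch of the pattern, with illustrative names rather than vdsm's actual code (the log above already shows a shutdown=True flag passed when monitors stop):

    # Minimal sketch of the fix pattern; names are illustrative,
    # not vdsm's actual API.

    class DomainMonitor:

        def __init__(self, domain):
            self.domain = domain

        def stop(self, shutdown=False):
            # The periodic checks stop in both cases.
            self._stop_checking()

            if shutdown:
                # Service shutdown: leave the domain (and its LVs) as is,
                # so sanlock and running VMs are not disturbed and vdsmd
                # does not block on deactivating an in-use LV.
                return

            # Engine-initiated deactivation or disconnect: tearing down
            # the domain here is intended.
            self.domain.teardown()

        def _stop_checking(self):
            # Stop path checkers, timers, etc. (elided).
            pass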

Comment 2 Nir Soffer 2021-11-15 20:35:35 UTC
Testing this change:

1. Make sure host is up in a data center with active storage domain

2. List active special volumes:

# lvs | egrep 'ids|inbox|leases|measter|metadata|outbox|xleases'
  ids                                  313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-ao---- 128.00m                                                    
  inbox                                313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a----- 128.00m                                                    
  leases                               313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a-----   2.00g                                                    
  metadata                             313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a----- 128.00m                                                    
  outbox                               313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a----- 128.00m                                                    
  xleases                              313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a-----   1.00g                                                    
  ids                                  4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-ao---- 128.00m                                                    
  inbox                                4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a----- 128.00m                                                    
  leases                               4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a-----   2.00g                                                    
  metadata                             4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a----- 128.00m                                                    
  outbox                               4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a----- 128.00m                                                    
  xleases                              4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a-----   1.00g                                                    
  ids                                  feab3738-c158-4d48-8a41-b5a95c057a50 -wi-ao---- 128.00m                                                    
  inbox                                feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a----- 128.00m                                                    
  leases                               feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a-----   2.00g                                                    
  metadata                             feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a----- 128.00m                                                    
  outbox                               feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a----- 128.00m                                                    
  xleases                              feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a-----   1.00g                                                    
  ids                                  ffc11717-c272-4a41-be97-9ac75d401413 -wi-ao---- 128.00m                                                    
  inbox                                ffc11717-c272-4a41-be97-9ac75d401413 -wi-a----- 128.00m                                                    
  leases                               ffc11717-c272-4a41-be97-9ac75d401413 -wi-a-----   2.00g                                                    
  metadata                             ffc11717-c272-4a41-be97-9ac75d401413 -wi-a----- 128.00m                                                    
  outbox                               ffc11717-c272-4a41-be97-9ac75d401413 -wi-a----- 128.00m                                                    
  xleases                              ffc11717-c272-4a41-be97-9ac75d401413 -wi-a-----   1.00g                                                    
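For reference: in the attr column the fifth character ('a') means the LV is active and the sixth ('o') means its device is open, so only the "ids" LVs above are open (sanlock holds the lockspace). On recent LVM versions open LVs can also be listed directly, assuming the lv_device_open report field is available:

    lvs -o vg_name,lv_name,lv_attr -S 'lv_device_open=open'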


3. On the host, stop the vdsmd service

    systemctl stop vdsmd

4. Make sure this log does not appear:

   INFO  ... [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
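
For example, assuming the default vdsm log location:

    grep 'Tearing down domain' /var/log/vdsm/vdsm.log

should return nothing after the stop.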

5. Check that the active special volumes were not deactivated

To reproduce, use vdsm from 4.4.9. After stopping it, all the special LVs except
"ids" will be inactive.

Comment 3 sshmulev 2021-11-24 13:38:17 UTC
Verified.

Versions:
ovirt-engine-4.4.10-0.17.el8ev
vdsm-4.40.100.1-1.el8ev

Steps to reproduce as described in comment 2.

Also tried to reproduce it on 4.4.9 with the same steps, and got the line:

    2021-11-24 15:29:18,672+0200 INFO  (monitor/6cf08a4) [storage.BlockSD] Tearing down domain 6cf08a47-7cdc-42b1-b7d5-7c685a0ef45e (blockSD:995)

In this version (4.4.10) it does not reproduce, and the special volumes were not deactivated.

Comment 5 Sandro Bonazzola 2022-01-19 07:00:13 UTC
This bugzilla is included in the oVirt 4.4.10 release, published on January 18th 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.4.10 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

