Bug 2023344

Summary: vdsmd fails to shut down cleanly when it tries to deactivate a used LV
Product: [oVirt] vdsm
Component: Core
Version: 4.50.0.1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Milan Zamazal <mzamazal>
Assignee: Nir Soffer <nsoffer>
QA Contact: sshmulev
CC: bugs, dfodor, eshames
Keywords: ZStream
Target Milestone: ovirt-4.4.10
Target Release: 4.40.100.1
Flags: pm-rhel: ovirt-4.4+
Fixed In Version: vdsm-4.40.100.1
Doc Type: Bug Fix
Doc Text:
Cause: Vdsm tried to deactivate logical volumes during shutdown. Consequence: Vdsm shutdown could fail, or programs using logical volumes managed by Vdsm could fail. Fix: Vdsm no longer deactivates logical volumes during shutdown. Result: Vdsm shuts down faster, and programs using logical volumes managed by Vdsm are not affected.
Last Closed: 2022-01-19 07:00:13 UTC
Type: Bug
oVirt Team: Storage

Description Milan Zamazal 2021-11-15 13:19:53 UTC
Description of problem:

In some environments, when vdsmd is being stopped by systemd, Vdsm tries to tear down a logical volume that is still in use. This can block the remaining shutdown steps performed by vdsmd, causing the service to exceed its TimeoutStopSec timeout, so systemd kills vdsmd forcefully and the vdsmd shutdown procedure is left unfinished.
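
For reference (an illustrative command, not part of the original report), the effective stop timeout of the unit can be queried from systemd; the runtime property is named TimeoutStopUSec even though the unit option is TimeoutStopSec:

    # Show the effective stop timeout for the vdsmd unit (value is host-specific)
    systemctl show vdsmd -p TimeoutStopUSec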

Version-Release number of selected component (if applicable):

current master (4.5 development)

How reproducible:

100% in ost-basic-suite-master-host-1 after test_004_basic_sanity.py completes while vm0 is running there.

Steps to Reproduce:
1. Run `systemctl stop vdsmd` on the host.
2. Check vdsm.log.
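
For convenience (an illustrative command, not part of the original steps; the path shown is the default vdsm log location), the end of the log can be inspected directly on the host:

    # Show the last log lines written during shutdown
    tail -n 100 /var/log/vdsm/vdsm.log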

Actual results:

vdsmd doesn't complete its shutdown actions. The end of the log looks like this (note that it does not end with the "Exiting" message that would indicate vdsmd stopped its activity completely):

    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 3a729aa1-8e14-4ea0-8794-e3d67fbde542 (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 0187cf2f-2344-48de-a2a0-dd007315399f (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,831+0000 INFO  (monitor/0187cf2) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_exported/0187cf2f-2344-48de-a2a0-dd007315399f/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,838+0000 INFO  (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,844+0000 INFO  (monitor/15fa3d6) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs__he/15fa3d6c-671b-46ef-af9a-00337011fa26/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,864+0000 INFO  (monitor/3a729aa) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share2/3a729aa1-8e14-4ea0-8794-e3d67fbde542/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,865+0000 INFO  (monitor/186180c) [storage.check] Stop checking '/dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/metadata' (check:135)
    2021-11-12 07:09:32,867+0000 INFO  (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
    2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
    2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n  Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
    2021-11-12 07:09:38,066+0000 WARN  (monitor/186180c) [storage.lvm] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'vgchange\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--available\', \'n\', \'186180cb-5cc5-4aa4-868a-9e1ed7965ddf\'] rc=5 out=[] err=[\'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\', \'  Can\\\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\']' (lvm:532)
    2021-11-12 07:09:38,067+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405d40a7e761def4ed09aac63282$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|^/dev/mapper/36001405ece10ae48e99470295edda1bb$|^/dev/mapper/36001405f269e8e9efb34e5f9170b349b$|^/dev/mapper/36001405f8c5fe6c239e46a0976b842df$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
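
To see which logical volume keeps the volume group busy (an illustrative command, not part of the original report; the VG name is taken from the log above), the LV attributes can be listed: the sixth lv_attr character is "o" for an LV that is currently open.

    # List LVs in the affected VG with their attribute bits;
    # "o" in the 6th lv_attr position marks an open (in-use) LV.
    lvs -o vg_name,lv_name,lv_attr 186180cb-5cc5-4aa4-868a-9e1ed7965ddf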

Expected results:

vdsmd stops cleanly within the TimeoutStopSec interval, logging "[vds] Exiting" as its last log message.

Comment 1 Nir Soffer 2021-11-15 14:00:51 UTC
(In reply to Milan Zamazal from comment #0) 
> Expected results:
> 
> vdsmd stops cleanly within TimeoutStopSec interval, logging "[vds] Exiting"
> as its last log message.

This is not possible to guarantee. The expected result is not to tear down
storage during shutdown. Tearing down storage must be done only when the engine
asks to deactivate a storage domain or disconnect from storage.

When the vdsmd service stops, storage should be left as is. Storage domains
that were active before vdsm was stopped should remain active.

Comment 2 Nir Soffer 2021-11-15 20:35:35 UTC
Testing this change:

1. Make sure the host is up in a data center with an active storage domain

2. List active special volumes:

# lvs | egrep 'ids|inbox|leases|measter|metadata|outbox|xleases'
  ids                                  313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-ao---- 128.00m                                                    
  inbox                                313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a----- 128.00m                                                    
  leases                               313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a-----   2.00g                                                    
  metadata                             313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a----- 128.00m                                                    
  outbox                               313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a----- 128.00m                                                    
  xleases                              313e6d78-80f7-41ab-883b-d1bddf77a5da -wi-a-----   1.00g                                                    
  ids                                  4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-ao---- 128.00m                                                    
  inbox                                4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a----- 128.00m                                                    
  leases                               4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a-----   2.00g                                                    
  metadata                             4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a----- 128.00m                                                    
  outbox                               4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a----- 128.00m                                                    
  xleases                              4345b708-77ab-4477-9d7b-b9e6d053cb1f -wi-a-----   1.00g                                                    
  ids                                  feab3738-c158-4d48-8a41-b5a95c057a50 -wi-ao---- 128.00m                                                    
  inbox                                feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a----- 128.00m                                                    
  leases                               feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a-----   2.00g                                                    
  metadata                             feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a----- 128.00m                                                    
  outbox                               feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a----- 128.00m                                                    
  xleases                              feab3738-c158-4d48-8a41-b5a95c057a50 -wi-a-----   1.00g                                                    
  ids                                  ffc11717-c272-4a41-be97-9ac75d401413 -wi-ao---- 128.00m                                                    
  inbox                                ffc11717-c272-4a41-be97-9ac75d401413 -wi-a----- 128.00m                                                    
  leases                               ffc11717-c272-4a41-be97-9ac75d401413 -wi-a-----   2.00g                                                    
  metadata                             ffc11717-c272-4a41-be97-9ac75d401413 -wi-a----- 128.00m                                                    
  outbox                               ffc11717-c272-4a41-be97-9ac75d401413 -wi-a----- 128.00m                                                    
  xleases                              ffc11717-c272-4a41-be97-9ac75d401413 -wi-a-----   1.00g                                                    


3. On the host, stop the vdsmd service

    systemctl stop vdsmd

4. Make sure this log message does not appear:

   INFO  ... [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)

5. Check that the active special volumes were not deactivated.
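
One way to do this (an illustrative command, not part of the original steps) is to list the special LVs together with their attribute bits; the fifth lv_attr character is "a" when the LV is active:

    # Special LVs should still show "a" (active) in the 5th lv_attr position
    lvs -o vg_name,lv_name,lv_attr | egrep 'ids|inbox|leases|metadata|outbox|xleases'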

To reproduce, use vdsm from 4.4.9. After stopping it, all the special lvs except
"ids" will be inactive.

Comment 3 sshmulev 2021-11-24 13:38:17 UTC
Verified.

Versions:
ovirt-engine-4.4.10-0.17.el8ev
vdsm-4.40.100.1-1.el8ev

Steps to reproduce as described in comment 2.

Also tried to reproduce it on 4.4.9 with the same steps and got the line:

    2021-11-24 15:29:18,672+0200 INFO  (monitor/6cf08a4) [storage.BlockSD] Tearing down domain 6cf08a47-7cdc-42b1-b7d5-7c685a0ef45e (blockSD:995)

In this version (4.4.10), it does not reproduce, and the special volumes were not deactivated.

Comment 5 Sandro Bonazzola 2022-01-19 07:00:13 UTC
This bugzilla is included in the oVirt 4.4.10 release, published on January 18th 2022.

Since the problem described in this bug report should be resolved in oVirt 4.4.10 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.