Bug 2012843 - Copying master file system can fail or miss data updated during the copy
Summary: Copying master file system can fail or miss data updated during the copy
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.50
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
: ---
Assignee: Nir Soffer
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-10-11 12:55 UTC by Nir Soffer
Modified: 2022-05-25 11:21 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2022-05-24 13:13:30 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?


Attachments (Terms of Use)
vdsm and change_master.py logs reproducing on NFS storage (1.28 MB, application/x-xz)
2021-10-11 12:55 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43789 0 None None None 2021-10-11 12:59:39 UTC
oVirt gerrit 116413 0 master NEW task: Fix locking when switching master 2021-10-11 13:00:51 UTC

Description Nir Soffer 2021-10-11 12:55:27 UTC
Created attachment 1831861 [details]
vdsm and change_master.py logs reproducing on NFS storage

Description of problem:

When copying the master file system from the old master domain during
StoragePool.switchMaster or StorageDomain.deactivate, vdsm uses tar to copy
the contents of the master file system. If another task changes state during
the copy, the copy can fail or be incorrect.

We have 2 issues:

1. tar fails with "/usr/bin/tar: ./path: file changed as we read it"

2021-09-14 00:33:44,369+0300 ERROR (tasks/1) [storage.taskmanager.task] (Task='928b287c-fdd0-487f-bec5-f98d63b63ba2') Unexpected error (task:877)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 989, in switchMaster
    self.masterMigrate(oldMasterUUID, newMasterUUID, masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 914, in masterMigrate
    exclude=('./lost+found',))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileUtils.py", line 101, in tarCopy
    raise se.TarCommandError(errors)
vdsm.storage.exception.TarCommandError: Tar command failed: ({'reader': {'cmd': ['/usr/bin/tar', 'cf', '-', '--exclude=./lost+found', '-C', '/rhev/data-center/mnt/sparse:_export_00/60edfd87-a97c-497c-85a2-2f044993bc2e/master', '.'], 'rc': 1, 'err': '/usr/bin/tar: ./tasks/1b1ef2f3-6e2b-4dde-9e07-f720569ccd76: file changed as we read it\n/usr/bin/tar: ./tasks/1b1ef2f3-6e2b-4dde-9e07-f720569ccd76.temp: File removed before we read it\n/usr/bin/tar: ./tasks: file changed as we read it\n'}},)

This will fail the copy and the API call. The user can try the operation
again.
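
To tell this race apart from other tar failures when reading logs, the
'rc' and 'err' values from the TarCommandError above can be classified.
The following is only a sketch: parse_tar_stderr and is_copy_race are
hypothetical helpers, not part of vdsm, and the message list contains
only the two messages seen in the traceback above.

```python
# Messages GNU tar printed in the traceback above when the tree changed
# while it was being read. This list is an assumption based on that log.
RACE_MESSAGES = (
    "file changed as we read it",
    "File removed before we read it",
)


def parse_tar_stderr(err):
    """Return a list of (path, message) tuples, one per tar stderr line."""
    entries = []
    for line in err.splitlines():
        # Drop the "/usr/bin/tar: " prefix.
        parts = line.split(": ", 1)
        if len(parts) != 2:
            continue
        # The message is the text after the last ": " separator.
        path, sep, msg = parts[1].rpartition(": ")
        if sep:
            entries.append((path, msg))
    return entries


def is_copy_race(rc, err):
    """True if tar exited with 1 and reported only race messages."""
    entries = parse_tar_stderr(err)
    return (rc == 1 and bool(entries)
            and all(msg in RACE_MESSAGES for _, msg in entries))
```

A caller could use is_copy_race(rc, err) to decide whether retrying the
operation makes sense. It does not make the copy itself consistent.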

2. Task status is persisted after the task file was already copied with tar.

The tasks on the new master storage domain will not include the correct status
of the task. This can cause failure when trying to roll back the tasks on
the new master storage domain.

This is a silent failure: we don't know whether it happens, and we have no way
to detect the issue in the logs.
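
One possible way to make this case visible, sketched here as an
illustration only, is to record file sizes and modification times under
the master directory before and after the copy and log any difference.
snapshot and changed_during_copy are hypothetical names, not vdsm APIs,
and this only detects the race; it does not prevent it.

```python
import os


def snapshot(root):
    """Map each relative file path under root to (size, mtime_ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except FileNotFoundError:
                # The file was removed while we were walking the tree.
                continue
            state[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return state


def changed_during_copy(before, after):
    """Return sorted paths that were added, removed, or modified."""
    return sorted(
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path))
```

Taking snapshot(src) before and after the tar copy and logging a
non-empty changed_during_copy() result would show which task files may
be stale on the new master. A change that keeps both size and mtime
would still be missed.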

Version-Release number of selected component (if applicable):
- Exists since first vdsm version.
- More important since 4.4.4, when StoragePool.switchMaster was introduced.

How reproducible:
Very hard to reproduce. I reproduced it once when running 200 switch master
operations in a loop while moving disks between other storage domains.

Steps to Reproduce:
1. Run change_master.py example from the sdk in a loop
2. At the same time, move disks between storage domains in a loop

I reproduced the issue only on NFS storage, but it should affect any storage
type.
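
The loop in step 1 can be driven by a small script such as the sketch
below. run_in_loop is a hypothetical helper, and the command line
passed to it (the path to the sdk example and its arguments) depends on
the environment and is not given in this report.

```python
import subprocess
import sys


def run_in_loop(cmd, iterations):
    """Run cmd up to `iterations` times.

    Returns the index of the first failing run, or None if all runs
    exited with status 0.
    """
    for i in range(iterations):
        result = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if result.returncode != 0:
            # Show the error output of the failing run, then stop so
            # the vdsm log can be inspected close to the failure.
            sys.stderr.write(result.stderr.decode(errors="replace"))
            return i
    return None
```

Stopping at the first failure keeps the interesting part of the vdsm
log near its end. The disk move loop from step 2 has to run separately
at the same time.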

Comment 2 Arik 2022-05-24 13:13:12 UTC
Moved to GitHub: https://github.com/oVirt/vdsm/issues/203

