Bug 2012843

Summary: Copying master file system can fail or miss data updated during the copy
Product: [oVirt] vdsm Reporter: Nir Soffer <nsoffer>
Component: General Assignee: Nir Soffer <nsoffer>
Status: CLOSED DEFERRED QA Contact: Evelina Shames <eshames>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.50 CC: ahadas, bugs
Target Milestone: --- Flags: pm-rhel: ovirt-4.5?
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-05-24 13:13:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
vdsm and change_master.py logs reproducing on NFS storage none

Description Nir Soffer 2021-10-11 12:55:27 UTC
Created attachment 1831861
vdsm and change_master.py logs reproducing on NFS storage

Description of problem:

When copying the master file system from the old master domain during
StoragePool.switchMaster or StorageDomain.deactivate, vdsm uses tar to copy
the contents of the master file system. If another task changes state during
the copy, the copy can fail or be incorrect.

We have 2 issues:

1. tar fails with "/usr/bin/tar: ./path: file changed as we read it"

2021-09-14 00:33:44,369+0300 ERROR (tasks/1) [storage.taskmanager.task] (Task='928b287c-fdd0-487f-bec5-f98d63b63ba2') Unexpected error (task:877)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 989, in switchMaster
    self.masterMigrate(oldMasterUUID, newMasterUUID, masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 914, in masterMigrate
    exclude=('./lost+found',))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileUtils.py", line 101, in tarCopy
    raise se.TarCommandError(errors)
vdsm.storage.exception.TarCommandError: Tar command failed: ({'reader': {'cmd': ['/usr/bin/tar', 'cf', '-', '--exclude=./lost+found', '-C', '/rhev/data-center/mnt/sparse:_export_00/60edfd87-a97c-497c-85a2-2f044993bc2e/master', '.'], 'rc': 1, 'err': '/usr/bin/tar: ./tasks/1b1ef2f3-6e2b-4dde-9e07-f720569ccd76: file changed as we read it\n/usr/bin/tar: ./tasks/1b1ef2f3-6e2b-4dde-9e07-f720569ccd76.temp: File removed before we read it\n/usr/bin/tar: ./tasks: file changed as we read it\n'}},)

This will fail the copy and the API call. The user can try the operation
again.
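
The failure mode can be sketched outside vdsm. The following is a minimal
illustration, not vdsm's fileUtils.tarCopy: it pipes a tar reader into a tar
writer and returns both exit codes. The reader command mirrors the one in the
traceback; the writer command ("tar xf - -C dst") and the helper name tar_copy
are assumptions made for this example. GNU tar documents exit status 1 as
"some files differ", which is what it returns when a file changes while it is
being archived.

```python
import subprocess
import tempfile


def tar_copy(src, dst, exclude=()):
    """Copy the contents of src into dst through a tar pipe.

    Hypothetical helper for illustration only.
    Returns (reader_rc, writer_rc, reader_stderr).
    """
    # Same shape as the reader command shown in the traceback.
    reader_cmd = ["tar", "cf", "-"]
    reader_cmd.extend("--exclude=" + path for path in exclude)
    reader_cmd.extend(["-C", src, "."])
    # Assumed writer side: extract the stream into dst.
    writer_cmd = ["tar", "xf", "-", "-C", dst]

    # Collect reader stderr in a temporary file so a long error list
    # cannot block the reader on a full pipe.
    with tempfile.TemporaryFile() as errors:
        reader = subprocess.Popen(
            reader_cmd, stdout=subprocess.PIPE, stderr=errors)
        writer = subprocess.Popen(writer_cmd, stdin=reader.stdout)
        # Drop our copy of the pipe; only the writer should hold it.
        reader.stdout.close()
        writer_rc = writer.wait()
        reader_rc = reader.wait()
        errors.seek(0)
        reader_err = errors.read().decode("utf-8", errors="replace")

    return reader_rc, writer_rc, reader_err
```

A caller that treats any non-zero reader_rc as fatal fails the whole copy when
a single task file changes, as in the traceback above. A caller that ignores
rc 1 risks keeping a file that was copied in an inconsistent state.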

2. Task status is persisted after the task file was already copied by tar.

The tasks on the new master storage domain will not include the correct status
of the task. This can cause failure when trying to roll back the tasks on
the new master storage domain.

This is a silent failure: we don't know whether it happens, and we have no way
to detect the issue in the logs.
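
One way to make this issue visible would be to compare the source and the copy
after tar finishes. The sketch below is an assumption about how such a check
could look, not something vdsm does; tree_digests and stale_files are
hypothetical names. It hashes every regular file under both trees and reports
paths that are missing from the copy or whose content differs.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file path relative to root to the SHA-256 of its content."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def stale_files(src, dst):
    """Return sorted source paths that are missing or different in dst."""
    src_digests = tree_digests(src)
    dst_digests = tree_digests(dst)
    return sorted(
        path for path, digest in src_digests.items()
        if dst_digests.get(path) != digest
    )
```

The result is only meaningful if nothing can modify the source between the
copy and the comparison; otherwise the check has the same race as the copy.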

Version-Release number of selected component (if applicable):
- Exists since first vdsm version.
- More important since 4.4.4, when StoragePool.switchMaster was introduced.

How reproducible:
Very hard to reproduce. I reproduced it once when running 200 switch master
operations in a loop, while moving disks between other storage domains.

Steps to Reproduce:
1. Run the change_master.py example from the sdk in a loop
2. At the same time, move disks between storage domains in a loop

I reproduced the issue only on NFS storage, but it should affect any storage
type.

Comment 2 Arik 2022-05-24 13:13:12 UTC
Moved to GitHub: https://github.com/oVirt/vdsm/issues/203