Created attachment 1831861 [details]
vdsm and change_master.py logs reproducing on NFS storage

Description of problem:

When copying the master file system from the old master domain during
StoragePool.switchMaster or StorageDomain.deactivate, vdsm uses tar to copy
the contents of the master file system. If another task changes state during
the copy, the copy can fail or be incorrect.

We have 2 issues:

1. tar fails with "/usr/bin/tar: ./path: file changed as we read it"

2021-09-14 00:33:44,369+0300 ERROR (tasks/1) [storage.taskmanager.task] (Task='928b287c-fdd0-487f-bec5-f98d63b63ba2') Unexpected error (task:877)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 989, in switchMaster
    self.masterMigrate(oldMasterUUID, newMasterUUID, masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 914, in masterMigrate
    exclude=('./lost+found',))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileUtils.py", line 101, in tarCopy
    raise se.TarCommandError(errors)
vdsm.storage.exception.TarCommandError: Tar command failed: ({'reader': {'cmd': ['/usr/bin/tar', 'cf', '-', '--exclude=./lost+found', '-C', '/rhev/data-center/mnt/sparse:_export_00/60edfd87-a97c-497c-85a2-2f044993bc2e/master', '.'], 'rc': 1, 'err': '/usr/bin/tar: ./tasks/1b1ef2f3-6e2b-4dde-9e07-f720569ccd76: file changed as we read it\n/usr/bin/tar: ./tasks/1b1ef2f3-6e2b-4dde-9e07-f720569ccd76.temp: File removed before we read it\n/usr/bin/tar: ./tasks: file changed as we read it\n'}},)

This fails the copy and the API call. The user can try the operation again.

2. Task status is persisted after the task's file was already copied with tar.

The tasks on the new master storage domain will not include the correct
status of the task. This can cause a failure when trying to roll back the
tasks on the new master storage domain.

This is a silent failure: we don't know if it happens, and we don't have a
way to detect the issue in the logs.

Version-Release number of selected component (if applicable):
- Exists since the first vdsm version.
- More important since 4.4.4, when StoragePool.switchMaster was introduced.

How reproducible:
Very hard to reproduce. I reproduced it once when running 200 switch master
operations in a loop, while moving disks between other storage domains.

Steps to Reproduce:
1. Run the change_master.py example from the sdk in a loop
2. At the same time, move disks between storage domains in a loop

I reproduced the issue only on NFS storage, but it should affect any storage
type.
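For triaging issue 1 in collected logs, the following is a minimal sketch that classifies the 'err' string from a TarCommandError as a concurrent-change failure. The helper names (racy_paths, is_concurrent_change) are hypothetical and are not part of vdsm; the two message patterns and the rc value of 1 are taken from the log above. It cannot detect issue 2, which leaves no trace in tar's output.

```python
import re

# The two messages GNU tar printed in the log above when files changed
# underneath it. Other tar versions or locales may word these differently.
_RACE_PATTERNS = (
    re.compile(r"^/usr/bin/tar: (?P<path>.+): file changed as we read it$"),
    re.compile(r"^/usr/bin/tar: (?P<path>.+): File removed before we read it$"),
)


def racy_paths(tar_stderr):
    """Return the paths tar reported as changed or removed during the copy."""
    paths = []
    for line in tar_stderr.splitlines():
        for pattern in _RACE_PATTERNS:
            match = pattern.match(line)
            if match:
                paths.append(match.group("path"))
                break
    return paths


def is_concurrent_change(rc, tar_stderr):
    """True if rc is 1 and every non-empty stderr line is a race message."""
    lines = [line for line in tar_stderr.splitlines() if line.strip()]
    return rc == 1 and bool(lines) and len(racy_paths(tar_stderr)) == len(lines)


# The 'err' value from the traceback above.
EXAMPLE_ERR = (
    "/usr/bin/tar: ./tasks/1b1ef2f3-6e2b-4dde-9e07-f720569ccd76: "
    "file changed as we read it\n"
    "/usr/bin/tar: ./tasks/1b1ef2f3-6e2b-4dde-9e07-f720569ccd76.temp: "
    "File removed before we read it\n"
    "/usr/bin/tar: ./tasks: file changed as we read it\n"
)

if __name__ == "__main__":
    print(racy_paths(EXAMPLE_ERR))
    print(is_concurrent_change(1, EXAMPLE_ERR))
```

Requiring that every stderr line matches keeps the check conservative: any unrelated tar error mixed into the output makes it return False, so a real I/O failure is not mistaken for a race.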
Moved to GitHub: https://github.com/oVirt/vdsm/issues/203