(First, I'm sorry if I didn't file this bug in the right place.)

Description of problem:
I'm using this script: https://github.com/wefixit-AT/oVirtBackup which, apart from the fact that it uses the old API, does a good job of backing up VMs to the export storage domain. During the process, sometimes (on rare occasions, on my biggest VMs), the snapshot fails (I still don't know exactly why; I'm still trying to reproduce it). The main problem is that it puts all the VMs on this host into a non-responsive state, not only the one concerned by the snapshot. For information, all the VMs concerned are stored on the same host, over NFS. Someone on IRC thinks it might come from a network problem, but in that case there should be something explicit about it (and, so far, I haven't found anything).

Version-Release number of selected component (if applicable):

How reproducible:
Mostly happening on >400 GB VMs, randomly

Steps to Reproduce:
1. Make a snapshot
2. Have it fail

Actual results:
The VMs on this host are unresponsive; only force-removing the snapshot makes the other VMs responsive again (the VMs concerned by the snapshot have to be rebooted to become responsive again).

Expected results:
In the best case, no VM should become non-responsive, but the problem should at least be contained to the VMs concerned by the snapshot.

Additional info:
I'm available on IRC to try to reproduce the problem if needed (under the name slx).
Can you please attach an ovirt-log-collector report including the host where this issue happened?
I wish I could, but the ovirt-log-collector command doesn't work:

ovirt-log-collector --version
-bash: ovirt-log-collector: command not found

It is not on the engine, nor on the node, and I can't find it anywhere on the system. I don't really understand why, since the documentation says it's included in oVirt. Should I install it from https://github.com/oVirt/ovirt-log-collector? (I'm not really sure I'm able to do it, but I could try.)

Since this cluster is in production, I had to disable the backup of the biggest VMs to avoid another crash. So the last occurrence of this problem was on November 4th (if I'm not mistaken), which means I still have a bit of time (but not much) to back up the relevant log files another way if you want.
(In reply to Slx from comment #2)
> i wish i could :
> the ovirt-log-collector command doesn't work :
>
> ovirt-log-collector --version
> -bash: ovirt-log-collector: command not found

Please be sure you've installed it. On Engine host:
"yum install ovirt-log-collector"

> Not on the engine, nor on the node, and i can't find it on the system. I
> don't really understand why since the documentation says it's included in
> ovirt, should i do a installation from
> https://github.com/oVirt/ovirt-log-collector (not really sure i'm able to do
> it, but i could try)
>
>
>
> Since this cluster is in production, i had to disable the backup of the
> biggest VMs to avoid another crash.
> So my last occurence of this problem is on November 4th (if i don't make any
> mistake), this means i have still a bit of time (but not much) to backup the
> concerned logs file another way if you want.
The logs have been sent to Tal Nisan (the links are in your mailbox). Thank you for your help.
I remember that when we spoke on IRC/mail, you said you had new hardware coming in (as it seemed to be a configuration issue). Is this still relevant?
The new switch has been running since last week, so we haven't yet been able to conduct more tests on this matter, sorry. May I come back to you in a few weeks on this subject?
I will close it for now. Please reopen if you encounter this again.
For information, after more tests, this problem does not seem to happen anymore. We think (but could not confirm) that it might have come from the old hardware, or from a software RAID check running at the same time as the snapshot. Thank you for the help, and sorry for the disturbance.