Bug 1647491 - All VMs on the same host crash when a snapshot fails on one of them
Summary: All VMs on the same host crash when a snapshot fails on one of them
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.6.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.3.1
Target Release: ---
Assignee: Benny Zlotnik
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-07 15:29 UTC by Slx
Modified: 2019-03-12 08:30 UTC
CC List: 3 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-02-19 15:24:08 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.3?
sbonazzo: blocker?
sbonazzo: planning_ack?
sbonazzo: devel_ack?
sbonazzo: testing_ack?



Description Slx 2018-11-07 15:29:05 UTC
(First, I'm sorry if I didn't categorize this bug in the right place.)

Description of problem:

I'm using this script: https://github.com/wefixit-AT/oVirtBackup which, aside from the fact that it uses the old API, does a good job of backing up VMs to the export storage domains.
During the process, the snapshot sometimes fails (a rare occurrence, and mostly on my biggest VMs); I still don't know exactly why and am still trying to reproduce it.
The main problem is that it puts all the VMs on that host into a non-responsive state, not only the one concerned by the snapshot.
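
For reference, a minimal sketch of the kind of live-snapshot request such a backup script issues, written here against the current ovirtsdk4 Python SDK rather than the deprecated v3 API the script actually uses; the engine URL, credentials and VM name are placeholders, not values from this report:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details (not taken from this report).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=big-vm')[0]  # hypothetical VM name
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

# Request a live snapshot without memory state, as a backup snapshot typically does.
snapshot = snapshots_service.add(
    types.Snapshot(description='backup snapshot', persist_memstate=False),
)

# Wait for the snapshot to leave the LOCKED state; the failure described
# in this report happens during this creation phase.
snapshot_service = snapshots_service.snapshot_service(snapshot.id)
while snapshot_service.get().snapshot_status == types.SnapshotStatus.LOCKED:
    time.sleep(10)

connection.close()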

For information, all the VMs concerned run on the same host, with their disks on NFS storage.

Someone on the IRC chat thinks it might come from a network problem, but there should be something explicit about it somewhere (and, for now, I haven't found anything).


Version-Release number of selected component (if applicable):


How reproducible:

Happens randomly, mostly on VMs larger than 400 GB.


Steps to Reproduce:
1. Make a snapshot
2. Have it fail

Actual results:
All VMs on the host become unresponsive; only a force remove of the snapshot makes the other VMs responsive again (the VM concerned by the snapshot has to be rebooted to regain a responsive state).
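
For context, the snapshot cleanup that unblocks the other VMs can also be scripted; a minimal sketch, again assuming the ovirtsdk4 Python SDK, with placeholder connection details, VM name and snapshot description (this is an ordinary snapshot removal, not necessarily the exact force-remove used here):

import ovirtsdk4 as sdk

# Placeholder connection details (not taken from this report).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=big-vm')[0]  # hypothetical VM name
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

# Delete the leftover backup snapshot; the engine performs the merge.
for snap in snapshots_service.list():
    if snap.description == 'backup snapshot':  # hypothetical description
        snapshots_service.snapshot_service(snap.id).remove()

connection.close()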


Expected results:
In the best case, no VM should become unresponsive; at the very least, the problem should be contained to the VM concerned by the snapshot.


Additional info:
I'm available on the IRC chat (under the nick slx) to try to reproduce the problem if needed.

Comment 1 Sandro Bonazzola 2018-11-08 07:44:13 UTC
Can you please attach an ovirt-log-collector report including the host where this issue happened?

Comment 2 Slx 2018-11-09 09:10:12 UTC
I wish I could:
the ovirt-log-collector command doesn't work:

ovirt-log-collector --version
-bash: ovirt-log-collector: command not found

It's not on the engine, nor on the node, and I can't find it on the system. I don't really understand why, since the documentation says it's included in oVirt. Should I install it from https://github.com/oVirt/ovirt-log-collector? (I'm not really sure I'm able to do it, but I could try.)



Since this cluster is in production, I had to disable the backup of the biggest VMs to avoid another crash.
So my last occurrence of this problem was on November 4th (if I'm not mistaken), which means I still have a bit of time (but not much) to back up the relevant log files another way if you want.

Comment 3 Sandro Bonazzola 2018-11-12 12:55:08 UTC
(In reply to Slx from comment #2)
> I wish I could:
> the ovirt-log-collector command doesn't work:
> 
> ovirt-log-collector --version
> -bash: ovirt-log-collector: command not found

Please be sure you've installed it. On the Engine host:
"yum install ovirt-log-collector"


Comment 4 Slx 2018-11-12 14:54:33 UTC
Logs have been sent to Tal Nisan (with links in your mailbox).
Thank you for your help.

Comment 5 Benny Zlotnik 2019-02-19 12:14:48 UTC
I remember that when we spoke on IRC/mail you said you had new hardware coming in (as it seemed to be a configuration issue). Is this still relevant?

Comment 6 Slx 2019-02-19 13:02:01 UTC
The new switch has only been running since last week, so we haven't been able to conduct more tests on this matter yet, sorry.
May I come back to you in a few weeks on this subject?

Comment 7 Fred Rolland 2019-02-19 15:24:08 UTC
I will close it for now.
Please reopen if you encounter this again.

Comment 8 Slx 2019-03-12 08:30:05 UTC
For information, after more tests, this problem does not seem to have happened again.
We think (but could not confirm) that it might have come from the old hardware or a software RAID check occurring at the same time as the snapshot.

Thank you for the help, and sorry for the disturbance.

