Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1142904

Summary: [TEXT] RHEVM - cluster fencing disabled on 3.4 cluster - host is non-responsive - expected alert appears only once but will not reappear
Product: Red Hat Enterprise Virtualization Manager
Reporter: sefi litmanovich <slitmano>
Component: ovirt-engine
Assignee: Moti Asayag <masayag>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.5.0
CC: bazulay, gklein, lpeer, mperina, oourfali, pstehlik, rbalakri, Rhev-m-bugs, slitmano, srevivo, ykaul
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-20 01:10:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: engine log (flags: none)

Description sefi litmanovich 2014-09-17 15:17:08 UTC
Created attachment 938507 [details]
engine log

Description of problem:

Steps to Reproduce:
1. 3.4 DC -> 3.4 Cluster -> 2 hosts (one with a working power management agent configured).
2. Disable cluster fencing in Edit Cluster -> Fencing Policy.
3. Stop the network on host_with_pm.
4. Relevant messages in the audit log:


'Host my_host is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.'

'Host my_host is not responding.'

'Host other_host from cluster Cluster34 was chosen as a proxy to execute Status command on Host my_host'

'Host other_host from cluster Cluster34 was chosen as a proxy to execute Restart command on Host my_host'

'Host my_host became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy.'

5. Manually fence the host to restart the network.
6. After the host is up again, stop the network again.



Expected results:

The same messages appear in the audit log.
Alternatively, once the first issue is resolved (i.e. the host is back up), the last alert could be deleted, and then reappear if the issue occurs again.

Actual results:

Instead of the last alert about the fencing policy, the following message is issued (every 3 minutes):

'Host my_host is not responding. It will stay in Connecting state for a grace period of 120 seconds and after that an attempt to fence the host will be issued.'

Comment 1 Oved Ourfali 2014-09-21 06:14:39 UTC
Not sure I follow.
The alert remains there? Is that the issue?
I guess it is relevant for other cluster levels as well, not only 3.4. Am I right?

In addition, please don't set a target release by yourself in the future.

Comment 2 Martin Perina 2014-09-25 13:05:39 UTC
The actual problem is that alert messages are shown only once. This is the scenario:

1) Configure 3.5 cluster with 2 hosts and NFS storage, disable Skip fencing on cluster connectivity issues

2) Block connection from engine to host1 (connection from host1 to storage must be available)

3) Fencing flow will finish with alert:

   Host host1 became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy

4) Unblock connection from engine to host1 and wait until the host is Up

5) Block connection from engine to host1 again

6) Fencing flow will finish, but alert mentioned in 3) is not shown

It looks like alerts of the same type are shown only once, even when their timestamps differ. This is a slightly different scenario, but all fencing alerts share the same configuration, so most probably it affects all of them.

Comment 3 sefi litmanovich 2014-10-02 07:56:09 UTC
Please see this related bz I opened, which emesika has resolved.
It sounds to me as though this is basically the same issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1133611

Comment 4 Martin Perina 2014-11-03 13:14:19 UTC
Currently, ALERT audit log messages with the same type and hostname are stored in the database only once. If an alert of the same type already exists for the same host, the new alert is not saved to the database and is ignored.

To solve this, we propose adding a boolean attribute "repeatable" to the AuditLog class, so a developer can specify whether multiple alerts of the same type for the same host may be stored in the database. By default, "repeatable" will be set to false in order not to break backward compatibility.
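The de-duplication described above can be sketched roughly as follows. This is a minimal illustration, not the actual ovirt-engine code: the class name AlertStore, the save method, and the alert-type string are all hypothetical; only the (type, host) uniqueness rule and the proposed "repeatable" flag come from this report.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: alerts are keyed by (type, host) and stored only
// once, unless the alert type is marked as repeatable.
class AlertStore {
    private final Set<String> stored = new HashSet<>();

    // Returns true if the alert is saved, false if it is suppressed
    // as a duplicate of an already-stored alert.
    boolean save(String type, String host, boolean repeatable) {
        String key = type + "|" + host;
        if (repeatable) {
            // Repeatable alerts (the proposed behavior) are always saved.
            return true;
        }
        // Set.add returns false when the key is already present,
        // which models the current "stored only once" behavior.
        return stored.add(key);
    }
}
```

With repeatable=false this reproduces the reported behavior: the second fencing-policy alert for the same host is dropped, which is why it appears in engine.log but never again in the audit log.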

Comment 5 Antonin Pagac 2015-04-29 12:33:26 UTC
Build 3.6.0-1

After the first disconnect of a host I can see this message in the UI:

2015-Apr-29, 14:06 Host host1 became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy.

and in engine.log:

2015-04-29 14:06:09,436 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-46) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host host1 became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy.

However, on the second disconnect I can see only these messages in the UI:

2015-Apr-29, 14:09 Host host1 is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.
2015-Apr-29, 14:10 Host host1 is non responsive.

but in the engine.log:

2015-04-29 14:10:29,720 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-30) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host host1 became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy.


Is this expected behaviour? Should the repeated message be seen only in engine.log and not in the UI? Martin, can you please advise?

Comment 6 Martin Perina 2015-05-11 15:29:37 UTC
Hmm, you are right, it doesn't work again, even on oVirt master. I will investigate why, because it definitely worked in November.

Comment 7 Oved Ourfali 2015-06-08 08:04:06 UTC
(In reply to Martin Perina from comment #6)
> Hmm, you are right, it doesn't work again even on oVirt master. I will
> investigate why, because it definitely worked in November

Was this addressed?

Comment 8 Martin Perina 2015-06-08 08:14:09 UTC
Sorry, I moved it to MODIFIED by mistake, should be ASSIGNED.

It worked in November when posted, but it doesn't work now; something has changed in audit log event handling. I will investigate.