Bug 1142904
| Summary: | [TEXT] RHEVM - cluster fencing disabled on 3.4 cluster - host is non-responsive -expected alert appears only once but will not reappear | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | sefi litmanovich <slitmano> | ||||
| Component: | ovirt-engine | Assignee: | Moti Asayag <masayag> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Petr Matyáš <pmatyas> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 3.5.0 | CC: | bazulay, gklein, lpeer, mperina, oourfali, pstehlik, rbalakri, Rhev-m-bugs, slitmano, srevivo, ykaul | ||||
| Target Milestone: | ovirt-3.6.0-rc | ||||||
| Target Release: | 3.6.0 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2016-04-20 01:10:12 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Not sure I follow. The alert remains there? Is that the issue? I guess it is relevant for other cluster levels as well, not only 3.4. Am I right? In addition, please don't set a target release by yourself in the future.

The actual problem is that alert messages are shown only once. This is the scenario:
1) Configure a 3.5 cluster with 2 hosts and NFS storage, disable Skip fencing on cluster connectivity issues
2) Block connection from engine to host1 (connection from host1 to storage must be available)
3) Fencing flow will finish with alert: Host host1 became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy
4) Unblock connection from engine to host1 and wait until the host is Up
5) Block connection from engine to host1 again
6) Fencing flow will finish, but the alert mentioned in 3) is not shown

It looks like alerts of the same type, even with different times, are shown only once. This is a slightly different scenario, but all fencing alerts have the same configuration, so most probably it affects all of them.

Please see this related bz I opened which emesika has resolved. It sounds to me as though this is basically the same issue: https://bugzilla.redhat.com/show_bug.cgi?id=1133611

Currently, ALERT audit log messages with the same type and hostname are stored in the database only once. If an alert of the same type already exists for the same host, the new alert is ignored and not saved to the db. To solve this, we propose adding a boolean attribute "repeatable" to the AuditLog class, so a developer can specify whether multiple alerts of the same type for the same host will be stored in the database. By default "repeatable" will be set to false in order not to break backward compatibility.

Build 3.6.0-1

After the first disconnect of a host I can see the message in UI:
2015-Apr-29, 14:06 Host host1 became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy.

and in engine.log:
2015-04-29 14:06:09,436 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-46) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host host1 became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy.

However, on the second disconnect I can see only these messages in UI:
2015-Apr-29, 14:09 Host host1 is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.
2015-Apr-29, 14:10 Host host1 is non responsive.

but in the engine.log:
2015-04-29 14:10:29,720 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-30) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host host1 became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy.

Is this expected behaviour? Should the repeated message be seen only in the engine.log and not in UI? Martin, can you please advise?

Hmm, you are right, it doesn't work again even on oVirt master. I will investigate why, because it definitely worked in November.

(In reply to Martin Perina from comment #6)
> Hmm, you are right, it doesn't work again even on oVirt master. I will
> investigate why, because it definitely worked in November

Was this addressed?

Sorry, I moved it to MODIFIED by mistake; it should be ASSIGNED. It worked in November when posted, but it doesn't work now. Something has changed in audit log event handling; I will investigate.
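The "repeatable" proposal from the comments above can be illustrated with a minimal sketch. This is not the ovirt-engine implementation: `AlertEntry`, `AlertStore`, and the alert type string are hypothetical stand-ins for the real AuditLog class and its database table, and the only behaviour modelled is the rule described in the comment (an alert of the same type for the same host is ignored unless it is marked repeatable).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an audit log alert; NOT the real ovirt-engine AuditLog class.
final class AlertEntry {
    final String type;
    final String hostName;
    final String timestamp;
    final boolean repeatable; // false by default in the proposal, for backward compatibility

    AlertEntry(String type, String hostName, String timestamp, boolean repeatable) {
        this.type = type;
        this.hostName = hostName;
        this.timestamp = timestamp;
        this.repeatable = repeatable;
    }
}

// Hypothetical in-memory store standing in for the audit log database table.
final class AlertStore {
    private final List<AlertEntry> saved = new ArrayList<>();

    // Returns true if the alert was stored, false if it was ignored as a duplicate.
    boolean save(AlertEntry alert) {
        if (!alert.repeatable && exists(alert.type, alert.hostName)) {
            return false; // behaviour described in the bug: same type + host is stored only once
        }
        saved.add(alert);
        return true;
    }

    // Checks whether an alert of this type was already stored for this host.
    private boolean exists(String type, String hostName) {
        for (AlertEntry e : saved) {
            if (e.type.equals(type) && e.hostName.equals(hostName)) {
                return true;
            }
        }
        return false;
    }

    int count() {
        return saved.size();
    }
}

class RepeatableAlertSketch {
    public static void main(String[] args) {
        AlertStore store = new AlertStore();
        String type = "FENCE_DISABLED_ALERT"; // made-up type name for illustration

        // Non-repeatable: the second alert for host1 is dropped.
        store.save(new AlertEntry(type, "host1", "2015-04-29 14:06:09", false));
        store.save(new AlertEntry(type, "host1", "2015-04-29 14:10:29", false));
        System.out.println(store.count()); // prints 1

        // Repeatable: the alert is stored even though one already exists for host1.
        store.save(new AlertEntry(type, "host1", "2015-04-29 14:10:29", true));
        System.out.println(store.count()); // prints 2
    }
}
```

In this sketch the duplicate check is skipped only when the caller opts in, which mirrors the stated intent of keeping existing alerts unchanged unless a developer marks them as repeatable.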
Created attachment 938507 [details] engine log

Description of problem:

Steps to Reproduce:
1. 3.4 DC -> 3.4 Cluster -> 2 hosts (one with a working pm agent configured).
2. Disable cluster fencing in edit_cluster->fencing_policy.
3. Stop network on host_with_pm.
4. Relevant messages in audit log:
'Host my_host is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.'
'Host my_host is not responding.'
'Host other_host from cluster Cluster34 was chosen as a proxy to execute Status command on Host my_host'
'Host other_host from cluster Cluster34 was chosen as a proxy to execute Restart command on Host my_host'
'Host my_host became Non Responsive and was not restarted due to disabled fencing in the Cluster Fencing Policy.'
5. Manually fence the host to restart the network.
6. After the host is up again, stop the network again.

Expected results:
The same messages appear in the audit log. Maybe once the first issue is resolved, meaning after the host is back up, the last alert can be deleted, so that it re-appears if the issue occurs again.

Actual results:
Instead of the last alert informing about the fencing policy, the following message is issued (every 3 minutes):
'Host my_host is not responding. It will stay in Connecting state for a grace period of 120 seconds and after that an attempt to fence the host will be issued.'