Description of problem:
In these two scenarios the [1] "rhevm failed to elect SPM for a DC" event is not generated:

1. A single data center with 2 hosts (one of them SPM) and 1 NFS storage domain. Put the SPM host into maintenance, then, while the second host is in 'contending' status, block it from the storage server (iptables DROP). The DC and host remained in that state for 12 minutes. Then the host status returned to normal, along with a "Failed to Reconstruct Master Domain for Data Center Default." event.

2. A single data center with 2 hosts and 1 NFS data domain, all up, one of the hosts being SPM. Block the storage server on both hosts with iptables DROP: the host that was SPM becomes Non Operational since it cannot connect to the storage, the second host contends and eventually becomes SPM, while the storage domain itself turns to Unknown state.

Comments:
1. The [1] event is generated only if there is a network exception that prevents the connection to VDSM and, in addition, fencing fails for some reason (single host in the data center).

Version-Release number of selected component (if applicable):
SF13.1

How reproducible:
Always
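For reference, a minimal sketch of the iptables blocking step described above; the storage server address is hypothetical and the exact rules may differ depending on the NFS setup:

  STORAGE_IP=10.35.0.50   # hypothetical NFS storage server address
  # Drop all traffic to/from the storage server on the host under test
  iptables -I OUTPUT -d "$STORAGE_IP" -j DROP
  iptables -I INPUT  -s "$STORAGE_IP" -j DROP
  # Later, to restore connectivity:
  # iptables -D OUTPUT -d "$STORAGE_IP" -j DROP
  # iptables -D INPUT  -s "$STORAGE_IP" -j DROP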
Ilanit, I'm not sure what we want to solve here, can you please elaborate? From what I can see, that AuditLog entry is shown only in a very specific case (and we should possibly edit its text regardless). When there is a problem, the engine keeps trying to select an SPM indefinitely, so I guess we don't want to flood the log with that event on every attempt; at what phase would we need to log it? It seems to me it would just be annoying to the user and would flood the event log.
The main point in this bug is that there is no consistency in when the "rhevm failed to elect SPM for a DC" event is sent: for a network exception that prevents the connection to VDSM we send it, while for the two cases described in the bug description we don't. Why should it differ? If, for example, this event is used for event notifications, the user will be notified for some cases of SPM election failure and not for others.
What is the scenario to generate this message, please?
"Fencing failed on Storage Pool Manager for Data Center ${StoragePoolName}. Setting status to Non-Operational" how to reproduce the scenario that generates the above error ?
David, where do you see this? Do you see ${StoragePoolName} in the UI? If so, this is a different bug.
Hi, the message I expect to see, and can now see, is: "Fencing failed on Storage Pool Manager south-01.xx.xx.xx.com for Data Center DC33-is12-Tabs. Setting status to Non-Operational." But what scenario should I use?
3.3/is12 & is13 ---------------------- The main problem I see disregarding the Text issue, Is the recovering process - it fails With two hosts ,1 DC I removed all iptables rules I tried "Select SPM" - failed "Host south-01.lab.bos.redhat.com was force selected by admin@internal" Only the first host that was SPM succeeds eventualy - Even after I recover host1 to SPM I try to "Select SPM" for the second, and it fails "Host south-01.lab.bos.redhat.com was force selected by admin@internal" - after couple of tries it succeeds to become SPM with 1 host 1 DC I removed all iptables rules I tried "Select SPM" - failed host is active but - "Failed to connect Host south-05.lab.bos.redhat.com to Storage Servers" "Failed to Reconstruct Master Domain for Data Center DC33-is13-BOS."
Created attachment 796301: south05
Created attachment 796302: south01
Created attachment 796303: south07
Closing - RHEV 3.3 Released