Bug 1412012

Summary: HPE [RFE] [Fault Management] Low latency fault detection and notification on failure
Product: Red Hat OpenStack Reporter: hrushi <hrushikesh.gangur>
Component: openstack-aodh Assignee: Aaron Smith <aasmith>
Status: CLOSED NOTABUG QA Contact: Aaron Smith <aasmith>
Severity: high Docs Contact:
Priority: low    
Version: unspecified CC: aasmith, apevec, atelang, fbaudin, hrushikesh.gangur, jschluet, lhh, markmc, mburns, mmagr, pkilambi, pvaanane, srevivo
Target Milestone: ga Keywords: FutureFeature, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-05-17 16:58:44 UTC Type: Feature Request
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1476900, 1521118    

Description hrushi 2017-01-11 00:50:26 UTC
Description of problem:

The NFV use case dealing with VNF availability and manageability needs immediate notification from the VIM when virtualized resources become unavailable, so that the VNFs running on them can be recovered.

Workload high availability combined with a "notify only" option, using notification as a service, is a critical ask from most telco service providers. After spawning a VNF, the VNF Manager subscribes to the alarm engine so that it is notified immediately if any virtualized resource failure impacts the VNF. On notification, the VNF Manager can take appropriate action. In this scenario, a virtualization failure means a compute node failure. A few implementations use ceilometer-aodh as the alarming service, with remote Pacemaker clustering for compute node failure detection.

This requirement suggests productizing this implementation through OSPd.

Comment 1 hrushi 2017-01-11 00:52:03 UTC
Additional Info:
https://wiki.opnfv.org/display/doctor/Doctor+Home

Comment 2 Franck Baudin 2017-01-20 14:06:00 UTC
We do have an ongoing activity in OPNFV Doctor and we are working on the end-to-end solution. But nothing will be ready for RHOSP12. I will let the experts comment.

Comment 3 Franck Baudin 2017-01-27 14:05:46 UTC
This is heavily under development upstream in OPNFV and won't be ready for product inclusion before it is ready upstream. This means it is out of scope for RHOSP12, and I believe for RHOSP13 as well; let's reassess when upstream is ready.

Comment 4 Franck Baudin 2017-01-27 14:11:10 UTC
Let's flag it for RHOSP13, as this is the furthest we can postpone it, and reassess in 4 months when scoping RHOSP13.

Comment 5 Aaron Smith 2017-01-30 13:12:49 UTC
(In reply to Franck Baudin from comment #2)
> We do have an ongoing activity in OPNFV Doctor and we are working on the
> end-to-end solution. But nothing will be ready for RHOSP12. I will let the
> experts comment.

We have been monitoring the Doctor and Barometer projects closely. Our focus is on the monitoring and notification aspect of an NFV HA solution. We hope to work with the Doctor and Barometer projects to define a monitoring and notification framework that operates within the performance constraints typical in NFV: failure detection + notification + repair < 50 ms. Failure detection will focus on NIC, kernel, and VM failures on a node. For node failure, distinguishing a switch failure from a node failure will be a central theme.

Comment 6 Franck Baudin 2017-07-05 13:47:04 UTC
Won't be included in RHOSP13; this is covered by the Red Hat SLA monitoring initiative, not by Telemetry.