Bug 1236372

Summary: HAProxy health check for nova_ec2 fails for all backend nodes
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: rhosp-director
Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA
QA Contact: Giulio Fidente <gfidente>
Severity: high
Docs Contact:
Priority: high
Version: 7.0 (Kilo)
CC: abeekhof, achernet, bperkins, dmacpher, eglynn, fdinitto, gfidente, hbrock, jstransk, kbasil, mburns, nbarcet, rhel-osp-director-maint, sgordon
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A misconfiguration of the health check for Nova EC2 API caused HAProxy to believe the API was down. This meant the API was unreachable through HAProxy. This fix corrects the health check to query the API service state correctly. Now the Nova EC2 API is reachable through HAProxy.
Last Closed: 2016-04-07 21:37:52 UTC
Type: Bug

Description Marius Cornea 2015-06-28 10:53:36 UTC
Description of problem:
HAProxy health check for nova_ec2 fails for all backend nodes in an HA setup with 3 controller nodes. The backend nova_ec2 API is up but requires authentication, so it returns a 400 response to the HAProxy checks.

Version-Release number of selected component (if applicable):
openstack-puppet-modules-2015.1.7-5.el7ost.noarch

How reproducible:
100%

Actual results:
[root@overcloud-controller-2 ~]# journalctl -l -o cat -u haproxy | grep -i nova_ec2
Server nova_ec2/overcloud-controller-0 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server nova_ec2/overcloud-controller-1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server nova_ec2/overcloud-controller-2 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
proxy nova_ec2 has no server available!

[root@overcloud-controller-2 ~]# grep -A5 nova_ec2 /etc/haproxy/haproxy.cfg 
listen nova_ec2
  bind 192.0.2.17:8773 
  option httpchk GET /
  server overcloud-controller-0 192.0.2.23:8773 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 192.0.2.19:8773 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 192.0.2.21:8773 check fall 5 inter 2000 rise 2

[root@overcloud-controller-2 ~]# curl -I http://192.0.2.23:8773
HTTP/1.1 400 Bad Request
Content-Type: text/xml
Content-Length: 203
Date: Sun, 28 Jun 2015 10:50:22 GMT

[root@overcloud-controller-2 ~]# curl -I http://192.0.2.23:8773
HTTP/1.1 400 Bad Request
Content-Type: text/xml
Content-Length: 203
Date: Sun, 28 Jun 2015 10:50:24 GMT

[root@overcloud-controller-2 ~]# curl -I http://192.0.2.23:8773
HTTP/1.1 400 Bad Request
Content-Type: text/xml
Content-Length: 203
Date: Sun, 28 Jun 2015 10:50:28 GMT

[root@overcloud-controller-2 ~]# lsof -i :8773 -n -P
COMMAND    PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
nova-api 12241    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12643    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12644    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12645    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12646    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12766    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12767    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12770    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12772    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12821    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12822    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12823    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
nova-api 12824    nova    6u  IPv4 130484      0t0  TCP 192.0.2.21:8773 (LISTEN)
haproxy  20229 haproxy   18u  IPv4  48484      0t0  TCP 192.0.2.17:8773 (LISTEN)


Expected results:
HAProxy detects a valid nova_ec2 API status.

Comment 3 chris alfonso 2015-06-29 17:11:46 UTC
Giulio, please let us know once you have had a chance to look at this to assess whether it's really a blocking issue.

Comment 4 chris alfonso 2015-06-29 17:12:38 UTC
JK.

Jiri, please take a look and let us know.

Comment 5 Jiri Stransky 2015-06-30 09:04:52 UTC
Thanks Marius for excellent info in the bug description. EC2 API returns 400 unless authentication details are provided, and HAProxy then thinks there's some problem with the service.
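
To illustrate the behaviour described above: with "option httpchk", HAProxy by default accepts only 2xx and 3xx responses as a passing check. A minimal Python sketch of that rule follows; the function name httpchk_passes is hypothetical and this is not HAProxy's actual code.

```python
def httpchk_passes(status_code: int) -> bool:
    """Mimic HAProxy's default httpchk rule: 2xx and 3xx count as healthy."""
    return 200 <= status_code < 400


# The EC2 API answers unauthenticated requests with 400, so the check fails.
for code in (200, 302, 400):
    print(code, "UP" if httpchk_passes(code) else "DOWN")
```

Under this rule a service that is running but answers 400 is indistinguishable, from the load balancer's point of view, from one that is broken.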

However, the core issue for us here is that EC2 API probably shouldn't be running at all, as it's not present in the Pacemaker HA docs and we don't have it pacemakerized. It's getting started and haproxied because the puppet modules take over the defaults from non-pacemaker deployment.

Comment 6 Jiri Stransky 2015-06-30 09:06:13 UTC
I don't think having it running would negatively affect the other services, but we probably shouldn't run services which we don't support. I think disabling it should be easy.

Comment 7 Jiri Stransky 2015-06-30 10:05:51 UTC
(Correction of my comment #5 -- we do pacemakerize the EC2 API because the APIs are served by a single service openstack-nova-api, it's just listening on multiple ports.)

The Pacemaker HA docs probably don't disable EC2 API because they don't set the enabled_apis config option and the default is "ec2,osapi_compute,metadata" [1]. But there's no mention of port 8773 in the loadbalancer doc [2]. So in the ref arch the EC2 API is enabled (accessible on physical IPs) but not HAProxied (not accessible on VIP)? I can update the config to do exactly that, but I'm not sure if that's the expected correct state. Andrew, can you please shed some light on this? Maybe we should disable the EC2 API entirely, or add it to HAProxy?

[1] https://github.com/beekhof/osp-ha-deploy/blob/f73eec96ddd9c7f2c85dbb0348ff909f144631ec/pcmk/nova.scenario
[2] https://github.com/beekhof/osp-ha-deploy/blob/f73eec96ddd9c7f2c85dbb0348ff909f144631ec/pcmk/lb.scenario

Comment 9 David Vossel 2015-07-01 15:55:26 UTC
(In reply to Jiri Stransky from comment #7)
> (Correction of my comment #5 -- we do pacemakerize the EC2 API because the
> APIs are served by a single service openstack-nova-api, it's just listening
> on multiple ports.)
> 
> The Pacemaker HA docs probably don't disable EC2 API because they don't set
> the enabled_apis config option and the default is
> "ec2,osapi_compute,metadata" [1]. But there's no mention of port 8773 in the
> loadbalancer doc [2]. So in the ref arch the EC2 API is enabled (accessible
> on physical IPs) but not HAProxied (not accessible on VIP)? I can update the
> config to do exactly that but i'm not sure if that's the expected correct
> state. Andrew can you please shed some light on this? Maybe we should
> disable the EC2 API entirely, or add it to HAProxy?


I can narrow the question down. Do we want to support the EC2 API?

yes - add port 8773 to the load balancer.
no - disable EC2 with the enabled_apis config option.

The fact that EC2 is enabled yet not load balanced indicates that this was an oversight on our part. The decision as to whether or not we are interested in exposing the EC2 API is not something I can answer.
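
As a rough illustration of the "no" option, disabling EC2 amounts to dropping "ec2" from the enabled_apis list. The Python sketch below starts from the default value quoted in comment #7; the helper name without_api is hypothetical, and in a real deployment the value lives in the Nova configuration and would be managed by the puppet modules rather than edited like this.

```python
def without_api(enabled_apis: str, api: str) -> str:
    """Return the comma-separated API list with one entry removed."""
    kept = [name.strip() for name in enabled_apis.split(",") if name.strip() != api]
    return ",".join(kept)


default_apis = "ec2,osapi_compute,metadata"  # default quoted in comment #7
print(without_api(default_apis, "ec2"))
```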

Comment 12 Jiri Stransky 2015-07-02 08:00:58 UTC
Talked to jayg on daily scrum - Astapor doesn't expose EC2 API because it doesn't behave well under HA. There's no port 8773 in Astapor's Nova loadbalancer manifest either [1], maybe the right thing could be to stay consistent with Astapor then.

[1] https://github.com/redhat-openstack/astapor/blob/d793eeb5f559874bab95177189aacbdcf06c092e/puppet/modules/quickstack/manifests/load_balancer/nova.pp

Comment 13 Stephen Gordon 2015-07-02 15:37:56 UTC
> I can narrow the question down. Do we want to support the EC2 API.
> 
> yes - add port 8773 to the load balancer.
> no - disable EC2 with the enabled_apis config option.
> 
> The fact that EC2 is enabled yet not load balanced indicates that this was
> an oversight on our part. The decision as to whether or not we are
> interested in exposing the EC2 API is not something I can answer.

The EC2 API is supported from a Nova point of view, albeit deprecated at this point in the hope that we will eventually move to the newer out-of-tree EC2 API implementation, which includes broader coverage of the relevant APIs. We do have a number of customers that use it.

I do not, however, believe this is a release blocker, as long as at worst we release with a similar level of inclusivity with regards to the EC2 API as we did in the OpenStack Platform 6 deployment architecture.

Comment 14 Jiri Stransky 2015-09-17 11:59:19 UTC
Done upstream. Can be backported once we get acks.

Comment 19 Giulio Fidente 2016-03-29 15:11:48 UTC
With openstack-puppet-modules-7.0.16-1.el7ost.noarch.rpm, the nova_ec2 HAProxy listener no longer uses httpchk; the backends are still seen as DOWN, though, because we don't enable the EC2 API by default.

I think the BZ can be closed as the problem won't be seen anymore if the service is enabled.
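
For context, without httpchk a plain "check" on a server line falls back to a TCP connect test, so a backend with nothing listening on port 8773 is still reported DOWN. A rough Python sketch of such a layer-4 probe follows; the function name tcp_check and the timeout value are assumptions for illustration, not HAProxy internals.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Backend address taken from the haproxy.cfg excerpt in the description.
    print("UP" if tcp_check("192.0.2.23", 8773) else "DOWN")
```

Once the EC2 API is enabled and listening, a probe of this kind succeeds regardless of the HTTP status the service would return.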

Comment 21 errata-xmlrpc 2016-04-07 21:37:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0604.html