Bug 1238700

Summary: NameNode HA for HDP2 does not set up Oozie correctly
Product: Red Hat OpenStack Reporter: Luigi Toscano <ltoscano>
Component: openstack-saharaAssignee: Elise Gafford <egafford>
Status: CLOSED ERRATA QA Contact: Luigi Toscano <ltoscano>
Severity: medium Docs Contact:
Priority: high    
Version: 7.0 (Kilo)CC: kbasil, matt, mimccune, mlopes, pkshiras, yeylon
Target Milestone: ga   
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-sahara-2015.1.0-5.el7ost Doc Type: Bug Fix
Doc Text:
Prior to this update, while NameNode HA for HDP was functional and feature-complete upstream, Sahara continued to point Oozie at a single NameNode IP for all jobs. Consequently, Oozie and Sahara's EDP were only successful when a single, arbitrary node was designated active (in an A/P HA model). This update addresses this issue by directing Oozie to the nameservice, rather than any one namenode. As a result, Oozie and EDP jobs can succeed regardless of which NameNode is active.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-08-05 13:28:21 UTC Type: Bug

Description Luigi Toscano 2015-07-02 12:45:39 UTC
Description of problem:
The NameNode HA feature for HDP, described here:
http://specs.openstack.org/openstack/sahara-specs/specs/kilo/hdp-plugin-enable-hdfs-ha.html
while it is able to turn a cluster into a NameNode HA configuration, does not update the Oozie configuration, which still points to one of the two NameNodes. If Oozie points to the standby node, job execution does not even start and fails with a misleading error:

2015-07-01 13:46:58.882 15549 WARNING sahara.service.edp.job_manager [-] Can't run job execution 437c1c6a-72e8-4b86-b036-6fa4b5657538 (reason: type Status report
 message
 description This request requires HTTP authentication. )

and keystone reports
2015-07-01 13:47:46.604 31419 WARNING keystone.token.controllers [-] User 0545bfa11fc444bb8782acb14f3e871e is unauthorized for tenant bd133d1e161345a69a15778cf7a580ca
2015-07-01 13:47:46.605 31419 WARNING keystone.common.wsgi [-] Authorization failed. The request you have made requires authentication. from x.y.z.t

which translates to "User admin is unauthorized for tenant services".

These errors, which could certainly be improved, seem to be a red herring. The real issue is that Oozie itself raises an exception:
2015-07-01 09:50:07,693 INFO BaseJobServlet:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] AuthorizationException
org.apache.oozie.service.AuthorizationException: E0501: Could not perform authorization operation, Operation category READ is not supported in state standby
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
        [...]

This is the cluster configuration used, detailed by node groups and number of nodes for each of them:

* master-ha-common (1 node)
   - AMBARI_SERVER
   - HISTORYSERVER
   - OOZIE_SERVER
   - RESOURCEMANAGER
   - SECONDARY_NAMENODE

* master-ha-nn (2 nodes)
   - NAMENODE
   - ZOOKEEPER_SERVER
   - JOURNALNODE

* master-ha-node (1 node)
   - ZOOKEEPER_SERVER
   - JOURNALNODE

* worker-ha (3 nodes)
   - DATANODE
   - HDFS_CLIENT
   - MAPREDUCE2_CLIENT
   - NODEMANAGER
   - OOZIE_CLIENT
   - PIG
   - YARN_CLIENT
   - ZOOKEEPER_CLIENT

The configuration key hdfs.nnha is set to true, as described in the documentation.
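
For reference, the node groups and the HA flag above map onto a Sahara cluster template roughly as follows. This is a hedged sketch only: flavor and image fields are omitted, and the hadoop_version value and the "HDFSHA" config section label are my assumptions; check the HDP plugin documentation for the exact section that holds hdfs.nnha.

    # Hypothetical cluster template body (Python dict mirroring the JSON
    # accepted by the Sahara API); only the fields relevant to this bug are
    # shown, and the hadoop_version and "HDFSHA" section label are assumed.
    cluster_template = {
        "plugin_name": "hdp",
        "hadoop_version": "2.0.6",
        "node_groups": [
            {"name": "master-ha-common", "count": 1,
             "node_processes": ["AMBARI_SERVER", "HISTORYSERVER",
                                "OOZIE_SERVER", "RESOURCEMANAGER",
                                "SECONDARY_NAMENODE"]},
            {"name": "master-ha-nn", "count": 2,
             "node_processes": ["NAMENODE", "ZOOKEEPER_SERVER",
                                "JOURNALNODE"]},
            {"name": "master-ha-node", "count": 1,
             "node_processes": ["ZOOKEEPER_SERVER", "JOURNALNODE"]},
            {"name": "worker-ha", "count": 3,
             "node_processes": ["DATANODE", "HDFS_CLIENT",
                                "MAPREDUCE2_CLIENT", "NODEMANAGER",
                                "OOZIE_CLIENT", "PIG", "YARN_CLIENT",
                                "ZOOKEEPER_CLIENT"]},
        ],
        "cluster_configs": {"HDFSHA": {"hdfs.nnha": True}},
    }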

I tested using a beta version of RHEL-OSP7, so basically Kilo, but the relevant code did not change in master:
openstack-sahara-common-2015.1.0-4.el7ost.noarch
openstack-sahara-engine-2015.1.0-4.el7ost.noarch
openstack-sahara-api-2015.1.0-4.el7ost.noarch

The discussion is mirrored upstream on the linked launchpad bug.

Comment 3 Elise Gafford 2015-07-02 13:08:41 UTC
From http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.3/bk_using_Ambari_book/content/install-ha_2x.html:

   If you are using Oozie, you need to use the Nameservice URI instead of the NameNode URI in your workflow files. For example, where the Nameservice ID is mycluster:

  <workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
      <start to="mr-node"/>
      <action name="mr-node">
          <map-reduce>
              <job-tracker>${jobTracker}</job-tracker>
              <name-node>hdfs://mycluster</name-node>

From http://172.24.4.230:8080/#/main/hosts/my-hdp2ha-b090a511-master-ha-common-f2ebfbc6-001.novalocal/configs (which pulls from hdfs-site.xml):

  dfs.nameservices: my-hdp2ha-b090a511
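
Putting the two together, the value Oozie should be handed for this cluster is the nameservice URI, not any single NameNode address (illustrative values, taken from the dfs.nameservices setting above and the cluster info quoted below):

    nameservice_uri = "hdfs://my-hdp2ha-b090a511"  # what Oozie needs with NameNode HA
    single_nn_uri = "hdfs://172.24.4.229:8020"     # what Sahara currently passes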

From your cluster definition (which should be valid):
  | info                       | {u'HDFS': {u'NameNode':                         |
  |                            | u'hdfs://172.24.4.229:8020', u'Web UI':         |
  |                            | u'http://172.24.4.229:50070'}, u'JobFlow':      |
  |                            | {u'Oozie': u'http://172.24.4.230:11000'},       |
  |                            | u'MapReduce2': {u'Web UI':                      |
  |                            | u'http://172.24.4.230:19888', u'History         |
  |                            | Server': u'172.24.4.230:10020'}, u'Yarn':       |
  |                            | {u'Web UI': u'http://172.24.4.230:8088',        |
  |                            | u'ResourceManager': u'172.24.4.230:8050'},      |
  |                            | u'Ambari Console': {u'Web UI':                  |
  |                            | u'http://172.24.4.230:8080'}}                   |
  
From sahara/sahara/service/edp/oozie/engine.py:
        nn_path = self.get_name_node_uri(self.cluster)
        ...
        job_parameters = {
            "jobTracker": rm_path,
            "nameNode": nn_path,
            "user.name": hdfs_user,
            oozie_libpath_key: oozie_libpath,
            app_path: "%s%s" % (nn_path, path_to_workflow),
            "oozie.use.system.libpath": "true"}

From sahara/sahara/plugins/hdp/edp_engine.py:
    def get_name_node_uri(self, cluster):
        return cluster['info']['HDFS']['NameNode']

We've succeeded in setting up a highly available cluster, but we're hard-coding ourselves into only using Oozie through one of the nodes, rather than using the nameservice. I believe this to be the root cause of the issue; however, fixing it is a non-trivial change, as nameservice designation is dynamic. To be discussed.
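
For illustration, a minimal sketch of the direction such a fix could take. This is a hedged sketch, not necessarily how the final fix is implemented, and the 'NameService' info key is a hypothetical name used only here:

    # In sahara/sahara/plugins/hdp/edp_engine.py (sketch only):
    def get_name_node_uri(self, cluster):
        hdfs_info = cluster['info']['HDFS']
        # Hypothetical key: assume the plugin records the nameservice ID in
        # the cluster info when hdfs.nnha is enabled.
        if 'NameService' in hdfs_info:
            # Point Oozie at the nameservice so jobs survive a failover.
            return 'hdfs://%s' % hdfs_info['NameService']
        # Non-HA clusters keep the current single-NameNode behaviour.
        return hdfs_info['NameNode']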

It is notable that this bug does not wholly block the RFE: through a certain legalistic interpretation, Sahara does now support clusters with HDP HA namenodes. The issue is that whenever the active namenode is not the one at which Oozie is (permanently) pointed through Sahara, Oozie will be unavailable, and with it, Sahara's EDP interface.

Comment 4 Elise Gafford 2015-07-06 21:22:20 UTC
Reproduced locally; repaired; review posted upstream.
https://review.openstack.org/#/c/198895/

Comment 5 Elise Gafford 2015-07-07 22:13:09 UTC
After several positive reviews and no negatives, backporting.

Comment 7 Luigi Toscano 2015-07-13 17:53:54 UTC
The HA NameNode is correctly set up and Oozie points to the active NameNode, even if the active NameNode disappears and is replaced by the standby instance. See also rhbz#1149055.

Tested on:
openstack-sahara-api-2015.1.0-5.el7ost.noarch
openstack-sahara-engine-2015.1.0-5.el7ost.noarch
openstack-sahara-common-2015.1.0-5.el7ost.noarch

Comment 9 errata-xmlrpc 2015-08-05 13:28:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548