Bug 1238700 - NameNode HA for HDP2 does not set up Oozie correctly
Summary: NameNode HA for HDP2 does not set up Oozie correctly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-sahara
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Linux
Priority: high
Severity: medium
Target Milestone: ga
Target Release: 7.0 (Kilo)
Assignee: Elise Gafford
QA Contact: Luigi Toscano
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-02 12:45 UTC by Luigi Toscano
Modified: 2015-08-27 04:31 UTC
CC List: 6 users

Fixed In Version: openstack-sahara-2015.1.0-5.el7ost
Doc Type: Bug Fix
Doc Text:
Prior to this update, while NameNode HA for HDP was functional and feature-complete upstream, Sahara continued to point Oozie at a single NameNode IP for all jobs. Consequently, Oozie and Sahara's EDP were only successful when a single, arbitrary node was designated active (in an A/P HA model). This update addresses this issue by directing Oozie to the nameservice, rather than any one namenode. As a result, Oozie and EDP jobs can succeed regardless of which NameNode is active.
Clone Of:
Environment:
Last Closed: 2015-08-05 13:28:21 UTC
Target Upstream Version:
Embargoed:




Links:
* Launchpad 1470841 (never updated)
* Red Hat Bugzilla 1149055 (urgent, CLOSED): [RFE][sahara]: Enable HDFS NameNode High Availability in HDP 2.* plugin (last updated 2021-02-22 00:41:40 UTC)
* Red Hat Product Errata RHEA-2015:1548 (normal, SHIPPED_LIVE): Red Hat Enterprise Linux OpenStack Platform Enhancement Advisory (2015-08-05 17:07:06 UTC)

Internal Links: 1149055

Description Luigi Toscano 2015-07-02 12:45:39 UTC
Description of problem:
The NameNode HA feature for HDP, described here:
http://specs.openstack.org/openstack/sahara-specs/specs/kilo/hdp-plugin-enable-hdfs-ha.html
is able to turn a cluster into a NameNode HA configuration, but it does not change the configuration of Oozie, which still points to one of the two NameNodes. If Oozie points to the standby node, job execution does not even start, failing with a puzzling error:

2015-07-01 13:46:58.882 15549 WARNING sahara.service.edp.job_manager [-] Can't run job execution 437c1c6a-72e8-4b86-b036-6fa4b5657538 (reason: type Status report
 message
 description This request requires HTTP authentication. )

and keystone reports
2015-07-01 13:47:46.604 31419 WARNING keystone.token.controllers [-] User 0545bfa11fc444bb8782acb14f3e871e is unauthorized for tenant bd133d1e161345a69a15778cf7a580ca
2015-07-01 13:47:46.605 31419 WARNING keystone.common.wsgi [-] Authorization failed. The request you have made requires authentication. from x.y.z.t

which translates to "User admin is unauthorized for tenant services".

These errors, whose wording could certainly be improved, seem to be a red herring. The real issue is that Oozie fails with the following exception:
2015-07-01 09:50:07,693 INFO BaseJobServlet:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] AuthorizationException
org.apache.oozie.service.AuthorizationException: E0501: Could not perform authorization operation, Operation category READ is not supported in state standby
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
        [...]

This is the cluster configuration used, detailed by node groups and number of nodes for each of them:

* master-ha-common (1 node)
   - AMBARI_SERVER
   - HISTORYSERVER
   - OOZIE_SERVER
   - RESOURCEMANAGER
   - SECONDARY_NAMENODE

* master-ha-nn (2 nodes)
   - NAMENODE
   - ZOOKEEPER_SERVER
   - JOURNALNODE

* master-ha-node (1 node)
   - ZOOKEEPER_SERVER
   - JOURNALNODE

* worker-ha (3 nodes)
   - DATANODE
   - HDFS_CLIENT
   - MAPREDUCE2_CLIENT
   - NODEMANAGER
   - OOZIE_CLIENT
   - PIG
   - YARN_CLIENT
   - ZOOKEEPER_CLIENT

The configuration key hdfs.nnha is set to true, as described in the documentation.
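
For reference, this is roughly where that flag sits in a cluster template request body. This is only a sketch: the "HDFSHA" section name is an assumption based on the spec above, so check the plugin's config listing for the exact target.

    # Sketch of a cluster template enabling NameNode HA in the HDP plugin.
    # The "HDFSHA" section name is an assumption taken from the spec linked
    # above; the exact config target may differ.
    cluster_template = {
        "name": "hdp2-nn-ha",
        "plugin_name": "hdp",
        "hadoop_version": "2.0.6",
        "cluster_configs": {
            "HDFSHA": {
                "hdfs.nnha": True,  # enable NameNode HA provisioning
            },
        },
    }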

I tested using a beta version of RHEL-OSP7 (so basically Kilo), but the relevant code has not changed in master:
openstack-sahara-common-2015.1.0-4.el7ost.noarch
openstack-sahara-engine-2015.1.0-4.el7ost.noarch
openstack-sahara-api-2015.1.0-4.el7ost.noarch

The discussion is mirrored upstream in the linked Launchpad bug.

Comment 3 Elise Gafford 2015-07-02 13:08:41 UTC
From http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.3/bk_using_Ambari_book/content/install-ha_2x.html:

   If you are using Oozie, you need to use the Nameservice URI instead of the NameNode URI in your workflow files. For example, where the Nameservice ID is mycluster:

  <workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
      <start to="mr-node"/>
      <action name="mr-node">
          <map-reduce>
              <job-tracker>${jobTracker}</job-tracker>
            <name-node>hdfs://mycluster</name-node>
            [...]

From http://172.24.4.230:8080/#/main/hosts/my-hdp2ha-b090a511-master-ha-common-f2ebfbc6-001.novalocal/configs (which pulls from hdfs-site.xml):

  dfs.nameservices: my-hdp2ha-b090a511

From your cluster definition (which should be valid):
  info: {
      "HDFS": {"NameNode": "hdfs://172.24.4.229:8020",
               "Web UI": "http://172.24.4.229:50070"},
      "JobFlow": {"Oozie": "http://172.24.4.230:11000"},
      "MapReduce2": {"Web UI": "http://172.24.4.230:19888",
                     "History Server": "172.24.4.230:10020"},
      "Yarn": {"Web UI": "http://172.24.4.230:8088",
               "ResourceManager": "172.24.4.230:8050"},
      "Ambari Console": {"Web UI": "http://172.24.4.230:8080"}
  }
  
From sahara/sahara/service/edp/oozie/engine.py:
        nn_path = self.get_name_node_uri(self.cluster)
        ...
        job_parameters = {
            "jobTracker": rm_path,
            "nameNode": nn_path,
            "user.name": hdfs_user,
            oozie_libpath_key: oozie_libpath,
            app_path: "%s%s" % (nn_path, path_to_workflow),
            "oozie.use.system.libpath": "true"}

From sahara/sahara/plugins/hdp/edp_engine.py:
    def get_name_node_uri(self, cluster):
        return cluster['info']['HDFS']['NameNode']
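
Plugging this cluster's values into the two snippets above makes the problem concrete. A sketch only, using the addresses from the cluster info; the actual rendered properties vary per job:

    # What Sahara builds today for this cluster: one fixed NameNode IP,
    # which breaks whenever that node happens to be in standby state.
    job_parameters = {
        "jobTracker": "172.24.4.230:8050",
        "nameNode": "hdfs://172.24.4.229:8020",
    }

    # What NameNode HA requires instead: the nameservice URI, so that HDFS
    # client-side failover resolves whichever NameNode is currently active.
    job_parameters["nameNode"] = "hdfs://my-hdp2ha-b090a511"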

We've succeeded in setting up a highly available cluster, but we're hard-coding ourselves into only using Oozie through one of the nodes, rather than using the nameservice. I believe this to be the root cause of the issue; however, fixing it is a non-trivial change, as nameservice designation is dynamic. To be discussed.
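
For illustration, the fix would need to take roughly this shape. This is a hypothetical sketch, not the actual patch: the 'NameService' key is invented here, and as noted above the real change has to derive the nameservice ID dynamically (e.g. from dfs.nameservices in hdfs-site.xml).

    def get_name_node_uri(self, cluster):
        # cluster['info']['HDFS'] is populated by the HDP plugin when the
        # cluster starts.
        hdfs_info = cluster['info']['HDFS']
        # 'NameService' is a hypothetical key used for illustration; the
        # real fix must obtain the nameservice ID dynamically.
        nameservice = hdfs_info.get('NameService')
        if nameservice:
            return 'hdfs://%s' % nameservice
        # Non-HA clusters keep the existing single-NameNode behaviour.
        return hdfs_info['NameNode']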

It is notable that this bug does not wholly block the RFE: through a certain legalistic interpretation, Sahara does now support clusters with HDP HA namenodes. The issue is that whenever the active namenode is not the one at which Oozie is (permanently) pointed through Sahara, Oozie will be unavailable, and with it, Sahara's EDP interface.

Comment 4 Elise Gafford 2015-07-06 21:22:20 UTC
Reproduced locally; repaired; review posted upstream.
https://review.openstack.org/#/c/198895/

Comment 5 Elise Gafford 2015-07-07 22:13:09 UTC
After several positive reviews and no negatives, backporting.

Comment 7 Luigi Toscano 2015-07-13 17:53:54 UTC
The HA NameNode is correctly set up and Oozie points to the active NameNode, even if the active NameNode disappears and is replaced by the standby instance. See also rhbz#1149055.

Tested on:
openstack-sahara-api-2015.1.0-5.el7ost.noarch
openstack-sahara-engine-2015.1.0-5.el7ost.noarch
openstack-sahara-common-2015.1.0-5.el7ost.noarch

Comment 9 errata-xmlrpc 2015-08-05 13:28:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548

