Multi-node clustering with JBoss EAP 7.0 (session replication, load balancing, sticky sessions and failover)


Creating a JBoss EAP cluster in an active-passive configuration

July 4, 2016

EAP cluster configuration for Simply Energy, Version 1.3


Overview of the cluster

The cluster configuration we are trying to achieve is an active-passive configuration.

We will be running two clusters as part of this configuration.

The active cluster will have two active nodes, each running a single instance of JBoss EAP 7.0.0.

The two active nodes will run on two different machines but will be grouped together as part of one server group, active-cluster-server-group.

The nodes will be configured with a domain controller in a master/slave configuration.

The configuration and deployment of the slaves is always controlled from a single point, the domain controller (master).

All configuration and resources are controlled through a single profile: full-ha-active for the active cluster and full-ha-passive for the passive cluster.

The passive cluster will have two passive nodes, each running a single instance of JBoss EAP.

The two passive nodes will run on two different machines but will be grouped together as part of one server group, passive-cluster-server-group.

One machine will run Apache HTTP Server as the software load balancer.

The active and passive clusters will be fronted by a software load balancer, in our case Red Hat JBoss Core Services Apache HTTP Server, connecting to them on ports 6667 and 6668 respectively: 6667 points to the active cluster group and 6668 to the passive cluster group.

Both clusters will point to the same database instance.

Please note:

As of writing this document, failover, load balancing and sticky sessions were achieved; however, in EAP 7.0.0 the default configuration does not support session replication. Session replication was observed to work only when two EAP 7.0 instances run on the same node on different ports, so session replication remains a topic requiring further study and investigation by the author.


High-level architecture of the EAP clustered environment

(Diagram: active cluster architecture)

(Diagram: passive cluster architecture)

 

Detailed instructions for setting up the environment

Prerequisite software on all machines:

JDK 1.8.0_72

Installation of Software

Machine 1:

Install Red Hat JBoss Core Services Apache HTTP Server for Linux (whichever flavour is available); this machine is the load balancer (LB).

Machines 2, 3, 4, 5:

Install Red Hat JBoss Enterprise Application Platform 7.0.0.

I will cover the setup of the active cluster here; the passive cluster is similar, differing only in the mod_cluster ports.

  • Go to the installation directory of JBoss on Machine 2.
  • This machine will be converted into the domain controller, or master node.
  • This machine is node1.
  • On node1, create the admin management user by running:

/opt/jboss-eap-7.0/bin/add-user.sh admin password1!

 

  • On node2 (slave), Machine 3
  • Just run /opt/jboss-eap-7.0/bin/add-user.sh with no arguments

What type of user do you wish to add?

a) Management User (mgmt-users.properties)
b) Application User (application-users.properties)

(a): a

  • Simply press enter to accept the default selection of (a)

Enter the details of the new user to add.

Realm (ManagementRealm) :

  • Once again simply press enter to continue

Username : node2

  • Enter the username and press enter

Password : password1!

  • Enter password1! as the password, and press enter

Re-enter Password : password1!

  • Enter password1! again to confirm, and press enter

About to add user ‘node2’ for realm ‘ManagementRealm’

Is this correct yes/no? yes

  • Type yes and press enter to continue

Is this new user going to be used for one AS process to connect to another AS process? e.g. for a slave host controller connecting to the master or for a Remoting connection for server to server EJB calls.

yes/no?

  • Type yes and press enter

To represent the user add the following to the server-identities definition

<secret value="cGFzc3dvcmQxIQ==" />

  • Copy and paste the provided secret hash into your notes. The host XML file of any server that is not the domain controller needs to provide this password hash, along with the associated username, to connect to the domain controller.

 

 

 

  • Add Application User

An application user is also required on every EAP node to allow a remote Java client to authenticate and invoke a deployed EJB. Creating an application user requires the interactive mode, so run add-user.sh and provide the following on each EAP 7 installation (twice for the active cluster, once per node).

 

 

What type of user do you wish to add?

a) Management User (mgmt-users.properties)
b) Application User (application-users.properties)

(a): b

  • Enter b and press enter to add an application user

Enter the details of the new user to add.

Realm (ApplicationRealm) :

  • Simply press enter to continue

Username : ejbcaller

  • Enter ejbcaller as the username and press enter

Password : password1!

  • Enter password1! as the password, and press enter

Re-enter Password : password1!

  • Enter password1! again to confirm, and press enter

What roles do you want this user to belong to? (Please enter a comma separated list, or leave blank for none)[ ]:

  • Press enter to leave this blank and continue

About to add user ‘ejbcaller’ for realm ‘ApplicationRealm’

Is this correct yes/no? yes

  • Type yes and press enter to continue

…[confirmation of user being added to files]

Is this new user going to be used for one AS process to connect to another AS process? e.g. for a slave host controller connecting to the master or for a Remoting connection for server to server EJB calls.

yes/no? no

  • Type no and press enter to complete user setup

At this point, all the required users have been created; move forward to making minimal edits to the server configuration files before starting the servers.

 

 

Domain controller (master), node1: manual configuration

I have illustrated the manual configuration below; the same can be achieved from the JBoss CLI (jboss-cli.sh).
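For reference, a minimal sketch of the CLI route for the two main pieces of master configuration described below, the profile clone and the server group (this assumes the domain controller is already running; the names and controller address are the ones used in this guide):

/opt/jboss-eap-7.0/bin/jboss-cli.sh --connect --controller=<IP_ADDRESS_OF_MASTER>:9990

# Clone the shipped full-ha profile into full-ha-active
/profile=full-ha:clone(to-profile=full-ha-active)

# Create the server group bound to the new profile
/server-group=active-cluster-server-group:add(profile=full-ha-active,socket-binding-group=full-ha-sockets)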

 

 

  • Go to node1.

Go to the host.xml file located under /opt/jboss-eap-7.0/domain/configuration/host.xml

Change the host name:

<host xmlns="urn:jboss:domain:4.1" name="node1">

  • Create a properties file, call it domain.properties, under /opt/jboss-eap-7.0/domain/configuration/

 

Add the following content to it:

jboss.bind.address.management=<IP_ADDRESS_OF_NODE_1>
jboss.bind.address=<IP_ADDRESS_OF_NODE_1>

Save it.

 

 

  • Edit the servers section in the host.xml file to contain only this:

<servers>
    <server name="node1-active-server" group="active-cluster-server-group" auto-start="true">
        <system-properties>
            <property name="jboss.tx.node.id" value="111" boot-time="true"/>
        </system-properties>
    </server>
</servers>

 

 

  • Open the domain.xml file located at /opt/jboss-eap-7.0/domain/configuration/domain.xml.

Copy the profile full-ha and create another profile with the name full-ha-active, so that your profiles section looks like:

<profiles>
    <profile name="full-ha">
        ...
    </profile>
    <profile name="full-ha-active">
        ...
    </profile>
</profiles>

 

 

  • In the profile full-ha-active, configure the datasource as follows:

<datasource jta="false" jndi-name="java:/jdbc/cisDS" pool-name="CisDS" enabled="true" spy="true" use-ccm="false">
    <connection-url>jdbc:oracle:thin:@hsndsd58z1.hsntech.int:1531:SERGN1D</connection-url>
    <driver-class>oracle.jdbc.OracleDriver</driver-class>
    <driver>oracle</driver>
    <pool>
        <min-pool-size>1</min-pool-size>
        <max-pool-size>4</max-pool-size>
    </pool>
    <security>
        <user-name>DEV01</user-name>
        <password>DEV01</password>
    </security>
    <validation>
        <validate-on-match>false</validate-on-match>
        <background-validation>false</background-validation>
    </validation>
    <statement>
        <share-prepared-statements>false</share-prepared-statements>
    </statement>
</datasource>

<drivers>
    <driver name="oracle" module="oracle.jdbc">
        <xa-datasource-class>oracle.jdbc.xa.client.OracleXADataSource</xa-datasource-class>
    </driver>
    <driver name="h2" module="com.h2database.h2">
        <xa-datasource-class>org.h2.jdbcx.JdbcDataSource</xa-datasource-class>
    </driver>
</drivers>
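If you prefer the CLI mentioned earlier, the equivalent datasource can also be added with the high-level data-source command; a sketch mirroring the XML above (flags abbreviated to the essentials):

data-source add --profile=full-ha-active --name=CisDS --jndi-name=java:/jdbc/cisDS --driver-name=oracle --connection-url=jdbc:oracle:thin:@hsndsd58z1.hsntech.int:1531:SERGN1D --user-name=DEV01 --password=DEV01 --jta=false --min-pool-size=1 --max-pool-size=4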

 

 

 

  • Make sure the server-groups element in domain.xml looks like the following:

<server-groups>
    <server-group name="active-cluster-server-group" profile="full-ha-active">
        <jvm name="default">
            <heap size="1000m" max-size="1000m"/>
        </jvm>
        <socket-binding-group ref="full-ha-sockets"/>
    </server-group>
</server-groups>

 

 

  • In the modules directory /opt/jboss-eap-7.0/modules, have the oracle.jdbc module referenced by the driver configuration above expanded (a directory containing a module.xml and the Oracle JDBC jar), for example as sketched below.
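For illustration, the module typically lives under /opt/jboss-eap-7.0/modules/oracle/jdbc/main/ with the driver jar next to a module.xml along these lines (the jar name ojdbc7.jar is an assumption; use the Oracle driver jar you actually have):

<?xml version="1.0" encoding="UTF-8"?>
<module xmlns="urn:jboss:module:1.3" name="oracle.jdbc">
    <resources>
        <!-- Oracle JDBC driver jar placed in the same directory; the name is an assumption -->
        <resource-root path="ojdbc7.jar"/>
    </resources>
    <dependencies>
        <module name="javax.api"/>
        <module name="javax.transaction.api"/>
    </dependencies>
</module>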

 

This step concludes the configuration of the master.

 

Now you can run the master server by executing the following command:

/opt/jboss-eap-7.0/bin/domain.sh --domain-config=domain.xml --host-config=host.xml -P <path to domain.properties that you saved earlier>

Or:

/opt/jboss-eap-7.0/bin/domain.sh -b <IP_ADDRESS_OF_NODE1> -Djboss.bind.address.management=<IP_ADDRESS_OF_NODE1> -Djboss.bind.address=<IP_ADDRESS_OF_NODE1>

 

Check the master's server logs and make sure everything looks fine, with no errors in the logs.

Go to the admin console to see the server you have configured:

http://<IP_ADDRESS_OF_MASTER>:9990/console/App.html#home

user id: admin
password: password1!

Go to Runtime -> Server Groups -> active-cluster-server-group, and the servers will be listed.
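As an alternative to the console, the same check can be done from the CLI (a sketch; the host and server names are the ones configured above):

/opt/jboss-eap-7.0/bin/jboss-cli.sh --connect --controller=<IP_ADDRESS_OF_MASTER>:9990
/host=node1/server-config=node1-active-server:read-attribute(name=status)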

 

Configuration of the slave, node2

  • On node2 (Machine 3), go to the location where we installed JBoss and open /opt/jboss-eap-7.0/domain/configuration/host-slave.xml

  • Create a properties file, call it domain.properties, under /opt/jboss-eap-7.0/domain/configuration/

 

Add the following content to it:

jboss.bind.address.management=<IP_ADDRESS_OF_NODE_2>
jboss.bind.address=<IP_ADDRESS_OF_NODE_2>
jboss.domain.master.address=<IP_ADDRESS_OF_NODE_1>

Save it.

 

 

 

  • In host-slave.xml make the following changes:

<host xmlns="urn:jboss:domain:4.1">

becomes

<host xmlns="urn:jboss:domain:4.1" name="node2">

  • Also replace the secret value tag

<secret value="cGFzc3dvcmQxIQ=="/>

with the secret value that you stored earlier.

This allows communication between the master and the node.

 

  • Then change the domain controller element to:

<domain-controller>
    <remote protocol="remote" host="${jboss.domain.master.address:<IP_ADDRESS_OF_MASTER>}" port="9999" security-realm="ManagementRealm" username="admin"/>
</domain-controller>

 

 

  • In the interfaces section, change the IP addresses:

<interfaces>
    <interface name="management">
        <inet-address value="${jboss.bind.address.management:<IP_ADDRESS_OF_SLAVE>}"/>
    </interface>
    <interface name="public">
        <inet-address value="${jboss.bind.address:<IP_ADDRESS_OF_SLAVE>}"/>
    </interface>
    <interface name="unsecured">
        <inet-address value="<IP_ADDRESS_OF_SLAVE>"/>
    </interface>
</interfaces>

 

 

 

  • Configure this node to be part of the server group that you created earlier by changing the servers section to:

<server name="node2-active-server" group="active-cluster-server-group" auto-start="true">
    <system-properties>
        <property name="jboss.tx.node.id" value="222" boot-time="true"/>
    </system-properties>
</server>

Save it.

 

  • In the modules directory /opt/jboss-eap-7.0/modules, have the same oracle.jdbc module expanded as on the master.

Please note that no datasource or other resource configuration is required on the slave node; the profile of all the slave nodes is controlled from a single point, the master.

 

  • Start the node2 slave by executing the command:

/opt/jboss-eap-7.0/bin/domain.sh -b <IP_ADDRESS_OF_SLAVE> -Djboss.domain.master.address=<IP_ADDRESS_OF_MASTER> --host-config=host-slave.xml

Or:

/opt/jboss-eap-7.0/bin/domain.sh --host-config=host-slave.xml -P <path to domain.properties that you saved earlier>

 

 

Check the slave's server logs and make sure everything looks fine, with no errors in the logs.

Go to the admin console to see that the server you have configured is present:

http://<IP_ADDRESS_OF_MASTER>:9990/console/App.html#home

user id: admin
password: password1!

Go to Runtime -> Server Groups -> active-cluster-server-group, and the servers will be listed.

 


Now, the good part about the domain and the server group is that a single deployment artefact can be deployed to all nodes that belong to a server group in a single operation through the master domain controller.

 

This is how you do it:

Go to the admin console on the master/domain controller.

Go to Deployments -> active-cluster-server-group -> deployments.

Add the artefact, select it and deploy, and you can see it is deployed on all nodes.
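The same deployment can be scripted through the CLI (a sketch; the war path is illustrative):

/opt/jboss-eap-7.0/bin/jboss-cli.sh --connect --controller=<IP_ADDRESS_OF_MASTER>:9990
deploy /tmp/cluster-demo.war --server-groups=active-cluster-server-group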

 

 

Go to the individual addresses to check the deployment:

http://masternode:8080/context
http://slavenode:8080/context

 

 

 

Configuring the load balancer for the EAP cluster.

 

 

 

  • On Machine 1

Go to the location where we installed JBCS: /opt/jbcs-httpd24-2.4/

The jbcs-httpd24-2.4/httpd directory created by extracting the ZIP archive is the top-level directory for Apache HTTP Server. This is referred to in this documentation as HTTPD_HOME.

 

Follow this documentation for installing Apache HTTP Server on Linux:

 

https://access.redhat.com/documentation/en/red-hat-jboss-core-services-apache-http-server/2.4/apache-http-server-installation-guide/chapter-2-installing-jboss-core-services-apache-http-server-on-red-hat-enterprise-linux

 

Go to HTTPD_HOME (jbcs-httpd24-2.4/httpd) and take a backup of the existing httpd.conf file.

The following files need to be put in the conf directory. These files should only be used as a reference, as they were initially written for a Windows environment; the Linux installation will need a different set of parameters.

 

For example, the server root should be modified from

ServerRoot "C:/jbcs-httpd24-2.4/etc/httpd"

to the Linux environment:

ServerRoot "/opt/jbcs-httpd24-2.4/httpd"

Similarly, change ServerAdmin Administrator@localhost to ServerAdmin root@localhost.
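Since the EAP nodes will register with the balancer over MCPM, httpd needs a listener on the port the proxies point at (6667 for the active cluster). A minimal sketch of that part of the configuration, assuming the JBCS mod_cluster modules are installed under HTTPD_HOME/modules (the passive cluster would get an equivalent block on 6668):

LoadModule proxy_cluster_module modules/mod_proxy_cluster.so
LoadModule cluster_slotmem_module modules/mod_cluster_slotmem.so
LoadModule manager_module modules/mod_manager.so
LoadModule advertise_module modules/mod_advertise.so

Listen <IP_ADDRESS_OF_MACHINE_1>:6667
<VirtualHost <IP_ADDRESS_OF_MACHINE_1>:6667>
    # Accept mod_cluster (MCPM) registrations from the EAP nodes
    EnableMCPMReceive

    # Status page used later in this guide
    <Location /mod_cluster_manager>
        SetHandler mod_cluster-manager
        Require all granted
    </Location>
</VirtualHost>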

  • Once the load balancer is working and has no errors, configure mod_cluster on the master controller.

Go to the location on the master (node1): /opt/jboss-eap-7.0/domain/configuration/domain.xml

Open the domain.xml file and change the modcluster subsystem in the full-ha-active profile to:

<subsystem xmlns="urn:jboss:domain:modcluster:2.0">
    <mod-cluster-config advertise-socket="modcluster" proxies="my-proxy-one" balancer="mycluster" advertise="false" sticky-session-force="false" connector="ajp">
        <dynamic-load-provider>
            <load-metric type="busyness"/>
        </dynamic-load-provider>
    </mod-cluster-config>
</subsystem>

 

  • In the <socket-binding-groups> section of domain.xml, in the full-ha-sockets group referenced by the active server group, add:

<outbound-socket-binding name="my-proxy-one">
    <remote-destination host="<IP_ADDRESS_OF_MACHINE_1>" port="6667"/>
</outbound-socket-binding>

Save the file.

 

Go to the admin console:

Runtime -> active-cluster-server-group -> node1-active-server: reload
Runtime -> active-cluster-server-group -> node2-active-server: reload

 

 

  • Come back to Machine 1, where we have the load balancer installed.

Access the URL:

http://<IP_ADDRESS_OF_MACHINE_1>:6667/mod_cluster_manager

You should see the mod_cluster manager page listing both nodes.

This concludes the configuration of the EAP clusters.

Now, to test the load balancing, failover and sticky-session parts, deploy the test application (cluster-demo) on the server group using the EAP admin console. Accessing it should open a Hello World page, and the request will go to node1 or node2. Shut down the node the request is going to from the EAP admin console and access the URL through the LB again:

http://<ip_address_of_machine_1>/cluster-demo

It should still come up with the Hello World page without any hiccups.

 

 

Changes to be made on the master domain.xml to make SESSION REPLICATION work

 

  • Go to the location on the master node: /opt/jboss-eap-7.0/domain/configuration/domain.xml

In the jgroups subsystem

<subsystem xmlns="urn:jboss:domain:jgroups:4.0">

change the default channel stack to tcp:

<channels default="ee">
    <channel name="ee" stack="tcp"/>
</channels>

 

  • Now, in the same subsystem, edit the stack named tcp: in its TCPPING protocol, set initial_hosts to the master and slave addresses:

<property name="initial_hosts"><IP_ADDRESS_OF_MASTER_NODE>[7600],<IP_ADDRESS_OF_SLAVE_NODE>[7600]</property>

so that the stack looks like the following (shown here with the example addresses 10.50.35.73 and 10.50.35.72):

<stack name="tcp">
    <transport type="TCP" socket-binding="jgroups-tcp"/>
    <protocol type="TCPPING">
        <property name="initial_hosts">10.50.35.73[7600],10.50.35.72[7600]</property>
        <property name="num_initial_members">2</property>
        <property name="port_range">0</property>
        <property name="timeout">3600</property>
    </protocol>
    <protocol type="MERGE3"/>
    <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
    <protocol type="FD"/>
    <protocol type="VERIFY_SUSPECT"/>
    <protocol type="pbcast.NAKACK2"/>
    <protocol type="UNICAST3"/>
    <protocol type="pbcast.STABLE"/>
    <protocol type="pbcast.GMS"/>
    <protocol type="MFC"/>
    <protocol type="FRAG2"/>
</stack>

 

Save the file

 

  • Now start the master with the following command:

/opt/jboss-eap-7.0/bin/domain.sh -b <IP_ADDRESS_OF_MASTER> -Djboss.bind.address.management=<IP_ADDRESS_OF_MASTER> -Djboss.bind.address.private=<IP_ADDRESS_OF_MASTER>

Or:

/opt/jboss-eap-7.0/bin/domain.sh --domain-config=domain.xml --host-config=host.xml -P <path to domain.properties that you saved earlier>

Make sure you add the property jboss.bind.address.private=<IP_ADDRESS_OF_MASTER> to that properties file.

 

Similarly, start the slave with the following command:

/opt/jboss-eap-7.0/bin/domain.sh -b <IP_ADDRESS_OF_SLAVE> -Djboss.domain.master.address=<IP_ADDRESS_OF_MASTER> --host-config=host-slave.xml -Djboss.bind.address.private=<IP_ADDRESS_OF_SLAVE>

Or:

/opt/jboss-eap-7.0/bin/domain.sh --host-config=host-slave.xml -P <path to domain.properties that you saved earlier>

Make sure you add the property jboss.bind.address.private=<IP_ADDRESS_OF_SLAVE> to that properties file.

Clustering won't happen until you start the master and slave nodes with the above commands.

  • Now, to test session replication, do the following:

Go to the URL http://<ip_address_of_machine_1>/cluster-demo; it should open the Hello World page for you.

Access the URL http://<ip_address_of_machine_1>/cluster-demo/put.jsp

You will see a message like "time now is 2:28PM", and the request will go to node1 or node2.

Shut down the node the request is going to from the EAP admin console, and again access the URL through the LB:

http://<ip_address_of_machine_1>/cluster-demo/get.jsp

You should see the same message, "time now is 2:28PM", served by the surviving node.
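Conceptually, put.jsp and get.jsp boil down to a session write and a session read; a minimal sketch of the idea (the actual cluster-demo sources may differ, and the application's web.xml must contain <distributable/> for the session to be replicated):

put.jsp:

<%
    // Store a value in the HTTP session; this is the state that gets replicated
    session.setAttribute("time", "time now is " + new java.util.Date());
    out.println(session.getAttribute("time"));
%>

get.jsp:

<%
    // Read the value back; after failover it should survive on the other node
    out.println(session.getAttribute("time"));
%>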

 

This concludes EAP clustering on two nodes with load balancing, failover, sticky sessions and session failover (session replication).

The same instructions need to be carried out for setting up the passive cluster.


Beep beep boop…. Let's Learn Hadoop

What is Hadoop…..

===========================

The Hadoop project provides and supports the development of open source software that supplies a framework for the development of highly scalable distributed computing applications.

The Hadoop framework handles processing details so that developers can concentrate on application logic.

Some Background:

==========================

If you visit the page http://hadoop.apache.org/ you will find that:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

=================================================================================================

Let's get back to Hadoop Core.

  • Hadoop Core, the flagship sub-project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor.

The Hadoop Core project provides the basic services for building a cloud computing environment with commodity hardware, and the APIs for developing software that will run on that cloud.

The two fundamental pieces of Hadoop Core are:

a) the MapReduce framework, the cloud computing environment;

b) the Hadoop Distributed File System (HDFS).

Within the Hadoop Core framework, MapReduce is often referred to as mapred, and HDFS is often referred to as dfs.

The Hadoop Core MapReduce framework requires a distributed file system; as long as there is a DFS (Distributed File System) plug-in available to the framework, it should be fine.

Hadoop Core supports:

  • CloudStore (http://kosmosfs.sourceforge.net/)
  • the Amazon Simple Storage Service (S3) file system (http://aws.amazon.com/s3/)

By the way, the Hadoop Core framework comes with plug-ins for HDFS, CloudStore, and S3.

When HDFS is used as the shared file system, Hadoop is able to take advantage of knowledge about which node hosts a physical copy of input data, and will attempt to schedule the task that is to read that data to run on that machine.

OK, before we jump into some more details, let's see what the MapReduce model is.

Definition:

The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
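To make the map and reduce jobs concrete, here is the classic word-count example written against the org.apache.hadoop.mapreduce API; a minimal sketch of the model rather than production code (the class name and the use of a combiner are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: break each input line into (word, 1) tuples.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: called once per unique key; combines the tuples into a smaller set (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-reduce on each map node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The map emits one (word, 1) tuple per token; the reduce (and the optional combiner) sums the tuples per unique word, which is exactly the "larger set in, smaller set out" behaviour described above.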

MapReduce is a fairly new platform designed around processing parallelizable challenges on very large datasets spread across a large array of computers. Each computer is often labeled a node, and the collection of nodes is labeled a cluster. Right in the name itself are the two primary tasks: map and reduce.

Now, if we take a closer look at MapReduce, we observe the following.

The top-level node, also referred to as the master node, receives the input. It then divides this input into several smaller inputs and begins to distribute these to other computer nodes. Depending on the infrastructure, these computer nodes can further divide their input and pass it off to additional sub-nodes under them, similar to a tree structure. Each node, when complete, passes its result up the tree.

The second primary step to look at is the reduce step. As we just mentioned, in the map step problems are distributed in a top-down, tree-style methodology. When the results of the sub-problems are returned to a parent node, and from there to its parent, and so on up to the master node, the results are parsed for the final output.

As long as each operation is independent of the others, it can be performed in parallel.

If the entire MapReduce process is put into steps, it looks like this:

************************************************************************************************************************************************************

Step one is where the map itself is determined.

Step two is the actual mapping, run for each assignment from step one.

Step three is the partitioning down the tree of nodes.

The remaining steps are the actual comparison of results and the application of the reduce function until the desired output is found.

************************************************************************************************************************************************************

The map function basically takes an array of key/value pairs; it is these keys that are subdivided and handed off to the other nodes.

Reducers are assigned by the partition-function process, and the keys get distributed from there.

For reduce, the input is based on the application's comparison function during the actual map step. So, for every key that is unique, the reduce function is called, and it may produce zero or more results. The final step in the MapReduce process is the actual storing of the result itself; this can be a database, file system, or some other writable medium.

=================================================================================================

OK, the theory has been good so far; how about we install Hadoop… Cloudera…

Apologies, I have not been updating this blog for a while…

There are different ways of installing Hadoop.

  • To enable unit-testing of Hadoop programs, Hadoop needs to be installed in stand-alone mode. This process is relatively straightforward for Linux-based systems, but it is more involved for Windows-based systems.

  • To enable simulation of Hadoop programs in a real cluster, Hadoop provides a pseudo-distributed cluster mode of operation.

Stand-Alone Mode:

Stand-alone is the simplest mode of operation and most suitable for debugging. In this mode, the Hadoop processes run in a single JVM. Although this mode is obviously the least efficient from a performance perspective, it is the most efficient for development turnaround time.
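A quick way to exercise stand-alone mode (a sketch, assuming an unpacked Apache Hadoop tarball; the directory layout and examples jar name vary by version):

mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input output
cat output/part-r-00000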

Pseudo-Distributed Cluster:

In this mode, Hadoop runs on a single node in a pseudo-distributed manner, and each daemon runs in a separate Java process. This mode is used to simulate a clustered environment.

Multi-node Cluster Installation:

In this mode, Hadoop is indeed set up on a cluster of machines. It is the most complex to set up and is often a task for an experienced Linux system administrator. From a logical perspective, it is identical to the pseudo-distributed cluster.

VM Image:

The method below is a VM image installation, a speedy way to get started with Hadoop development; however, as a prerequisite, this method requires a VM player to be installed on your machine.

I have used the VM image installation method.

Here I illustrate the installation, assuming you have Oracle VirtualBox installed (Oracle VirtualBox 4.3.20 in my case).

1) Please download the QuickStart VM from the link:

http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-4-7-x.html

Download the VirtualBox image (June 2014, SHA1: 0b12484778925d58a02ea442a92b00d38ef74e04).

I prefer using Firefox to download, as IE and Chrome tend to time out once the download reaches the 2 GB mark; I might be wrong about the choice of browsers, so you may want to give them a try.

The two files contained in the zip are:

cloudera-quickstart-vm-4.7.0-0-virtualbox
cloudera-quickstart-vm-4.7.0-0-virtualbox-disk1

2) In Oracle VirtualBox:

1. Open VirtualBox
2. Select File -> Import Appliance

Now select the file cloudera-quickstart-vm-4.7.0-0-virtualbox and click Next.

3) Leave the RAM size at 4096 MB.

4) Click on Import.

5) Then, in Oracle VirtualBox, start the VM.

6) Click on Cloudera Manager when the welcome screen appears.

7) Log in as cloudera / cloudera.

8) You should be able to see all the services running.


Disclaimer: The article "Beep beep boop… Let's Learn Hadoop" posted on this blog is collected from numerous resources available on the internet; you might find some examples that are owned by the author, and such shall be explicitly marked with a note (Example: Author: Name).