
Big Data? Hadoop might be your thing



Susan May
Blog
22nd Jun, 2016

Software as a Service (SaaS) can turn out to be Sales and Marketing’s biggest demon if the business plan behind it is a faulty one. The SaaS model itself has little to take the blame for: after all, the initial plan was to be shared with entities that could construct the ideal enterprise-level solution and provide constant support during the implementation. Hadoop, like many other big data solutions, has been a game changer, but unlike the others, Hadoop is more accessible. Among others, its benefits include:

Computing Power: Its distributed computing model quickly processes big data. The more computing nodes you use, the more processing power you have.

Flexibility: Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images, and videos.

Fault tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. And it automatically stores multiple copies of all data.

Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data.

Scalability: You can easily grow your system simply by adding more nodes. Little administration is required.

- SAS

Most of us have complained about enterprise Java having a low tolerance for large volumes of data. Code repositories do try to address that, but there is just not enough power in that approach. Those of you looking to set up Hadoop’s standalone mode can now do so in a few relatively easy steps.

Virtual Box


You will need to get started with a Unix flavour of your choice. I am not for ditching the base OS, especially if you are using Windows; even if you are not using Windows, I still recommend a virtual machine with a Unix flavour. Ubuntu is an optimal choice, but Red Hat also works well.

You can follow the step-by-step guide for the installation over here. Just ensure that you give the VM enough RAM, as Hadoop is not a light piece of software to configure.
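If you prefer to provision the VM from the command line, a minimal sketch using VirtualBox’s VBoxManage tool could look like the one below. The VM name, memory size, and CPU count are only illustrative assumptions, so size them to your own machine:

$ VBoxManage createvm --name "hadoop-vm" --ostype "Ubuntu_64" --register
$ VBoxManage modifyvm "hadoop-vm" --memory 4096 --cpus 2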

Map Reduce


Think of MapReduce as splitting your data into chunks so that it can be processed and stored more efficiently. There are two sets of operations to it: the map tasks and the reduce tasks. Once the reduce phase is complete, the results are written back to the file system, typically in a different format. Optimal and simple.
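To make the two phases concrete, consider the word-count job shown later in this post. Assuming two input lines like the ones used in Apache’s own tutorial, the data flows roughly like this:

Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop

Map output (one <word, 1> pair per token):
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Reduce output (the pairs for each word summed up):
Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2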

The framework uses the same nodes for compute and storage, though, which is one of the criticisms levelled at it. The framework is also divided between a master JobTracker and a slave TaskTracker per cluster node. Since your data is stored on these clusters, the JobTracker schedules the jobs and the TaskTrackers execute them.

Take a look at the source code below for an example; it comes pretty much straight from Apache’s own Hadoop tutorial.


 

 

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: emits a <word, 1> pair for every token in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Job setup: output types, mapper/combiner/reducer classes, formats and paths.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
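Assuming a standalone installation with HADOOP_HOME pointing at it, compiling and running the job looks roughly like this; the exact name of the core jar depends on your Hadoop version, and the paths are illustrative:

$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-core.jar -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .
$ bin/hadoop jar wordcount.jar WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output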

 

Output

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

- Apache

 

You can always write your MapReduce jobs on the platform and in the language of your choice, such as Python or Golang. If you are adopting the framework, though, you should always make a note to check how well it supports the language you pick, especially if that language’s scripting interface is undergoing a change, as with Angular or PHP.
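For example, the streaming facility that ships with Hadoop lets you plug any executable in as the mapper and reducer. The jar location and script names below are assumptions that will differ per installation:

$ bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input /usr/joe/wordcount/input -output /usr/joe/wordcount/output_py \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py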

 

Getting used to the ecosystem


For most users this is the tricky bit, even after getting through the commands on the VM. You can follow any walkthrough for setting up the standalone installation, but which components you should work through is still up for debate. My recommendation is to get going with Hive, as its tables and query language are very similar to SQL, only with more functions for data summarisation.
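To give a flavour of how familiar Hive feels, a query run from the command line might look like the following sketch; the orders table and its columns are purely hypothetical:

$ hive -e "SELECT product, COUNT(*) AS total FROM orders GROUP BY product"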

You will also be needing Sqoop if you plan on pulling in multiple sources or slices of data for your BI solution. Typically, if you are working with e-commerce, your data sources may be many, and I know for a fact that porting data from Oracle or MySQL databases is not an easy task. Oracle databases are particularly difficult to port for most solutions, but components like Sqoop make the task a tad bit simpler.
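As a rough sketch, a Sqoop import from a MySQL database into HDFS might look like this; the connection string, username, table name, and target directory are illustrative assumptions:

$ sqoop import --connect jdbc:mysql://dbhost/shop --username retail -P \
    --table orders --target-dir /user/hadoop/orders --num-mappers 1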

HBase, Ambari, Solr, etc. are also names you should keep in mind. Of course, you will have your favourites, but go through as many components as possible if you want to make the most of Hadoop. Happy hunting!


Susan May

Writer, Developer, Explorer

Susan is a gamer, internet scholar and an entrepreneur, specialising in Big Data, Hadoop, Web Development and many other technologies. She is the author of several articles published on Zeolearn and KnowledgeHut blogs. She has gained a lot of experience by working as a freelancer and is now working as a trainer. As a developer, she has spoken at various international tech conferences around the globe about Big Data.


Website : https://www.zeolearn.com
