Archive: June, 2016

Analyzing Apache logs with Pig

June 27, 2016 by S4

Filed under Hadoop, Pig

Last modified June 27, 2016

Analyzing log files, churning through them, and extracting meaningful information is a potential use case for Hadoop. We don’t have to write MapReduce programs for these analyses; tools like Pig and Hive can handle this log analysis instead. This post just gives you a starting point for the analysis. Let us consider Pig for Apache log …

Implementing basic SQL Update statement in Hive

June 27, 2016 by S4

Filed under Hadoop, Hive

Last modified June 27, 2016

Hive is not meant for point-to-point queries, so SQL update functionality is rarely required there; that is probably why Hive doesn’t have update functionality for rows, or rather for individual columns in a row. There will be cases where Hive suits your use case much better, but the same can’t be …

Joins with plain Map Reduce or Multiple Inputs

June 27, 2016 by S4

Filed under Hadoop, Hive

Last modified June 27, 2016

As a MapReduce developer, I’d never recommend writing joins of data sets in custom MapReduce code. Hadoop offers intelligent and powerful tools such as Hive and Pig that can easily join huge data sets with your choice of join type (inner, outer, etc.). But if such a scenario arises where you …

Optimizing Joins in hive/Sorting Java Heap issues with hive joins

June 27, 2016 by S4

Filed under Hadoop, Hive

Last modified June 27, 2016

In Hadoop we tend to use Hive extensively, since it is a SQL-like language and makes it easier to frame our jobs over stored structured data. (Pig is great too, but it takes a little time to get comfortable with Pig Latin.) But as beginners we often …

Generating Recommendations with mahout for Boolean data sets

June 27, 2016 by S4

Filed under Hadoop, Mahout

Last modified June 27, 2016

Boolean data sets are input data sets that don’t have a preference value, i.e. the input would be of the format UserId1,ItemId1 UserId2,ItemId2. Here the data records only whether a user likes an item or not; there is no preference value associated with it. When we use Boolean …

Map Reduce best practices

June 27, 2016 by S4

Filed under Hadoop

Last modified June 27, 2016

1. Use larger HDFS blocks for better performance. If smaller HDFS blocks are used, more time is spent seeking records on disk. This is a massive overhead when we deal with large files.
2. Always use a Combiner if possible for local aggregation. Shuffle and sort is a really expensive process, so try reducing the number of records involved for …
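
As a rough illustration of point 2, the Python sketch below simulates a word count over two map tasks and counts how many records would cross the shuffle with and without a combiner. It is a plain in-memory simulation, not Hadoop code; the function names (`map_task`, `combine`, `reduce_all`) and the sample splits are hypothetical and chosen only for this example.

```python
from collections import Counter

def map_task(lines):
    """Emit one (word, 1) record per word, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(records):
    """Locally sum the counts per key on the map side (the combiner step)."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return list(totals.items())

def reduce_all(shuffled):
    """Sum the values per key across all map outputs (the reducer step)."""
    totals = Counter()
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)

# Hypothetical input splits, one per map task.
splits = [
    ["a b a", "a c"],
    ["b b a"],
]

map_outputs = [map_task(split) for split in splits]
without_combiner = [rec for out in map_outputs for rec in out]
with_combiner = [rec for out in map_outputs for rec in combine(out)]

print(len(without_combiner), "records shuffled without a combiner")
print(len(with_combiner), "records shuffled with a combiner")
```

Both record lists reduce to the same final counts; the combiner only changes how many records have to be shuffled and sorted along the way.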

Use Compression with Mapreduce

June 27, 2016 by S4

Filed under Hadoop

Last modified June 27, 2016

Hadoop is intended for storing large data volumes, so compression becomes practically mandatory here. Several compression formats are available, such as gzip, bzip2, and LZO. Of these, bzip2 and LZO (once indexed) are splittable; bzip2 offers better compression, but decompressing it is expensive. When we weigh both space and time, LZO is the more advisable choice. Also …
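
To get a feel for the space versus time trade-off described above, here is a minimal Python sketch that compresses the same buffer with gzip and bzip2 and reports the compressed size and decompression time. It uses only the Python standard library, so LZO is not covered; the `measure` helper and the sample log line are assumptions made for this example, and the numbers will differ on real data and inside a Hadoop job.

```python
import bz2
import gzip
import time

def measure(name, compress, decompress, data):
    """Return (name, compressed size, seconds to decompress) for one codec."""
    packed = compress(data)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == data  # the round trip must be lossless
    return name, len(packed), elapsed

# Hypothetical sample: repetitive log-like text, which compresses well.
line = b'127.0.0.1 - - [27/Jun/2016:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024\n'
data = line * 20000

results = [
    measure("gzip", gzip.compress, gzip.decompress, data),
    measure("bzip2", bz2.compress, bz2.decompress, data),
]
for name, size, seconds in results:
    print(f"{name}: {size} bytes compressed, {seconds:.4f}s to decompress")
```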

Evaluating Mahout based Recommender Implementations

June 27, 2016 by S4

Filed under Hadoop, Mahout

Last modified June 27, 2016

Mahout gives you an option to evaluate your generated recommendations against the actual preference values. In Mahout’s recommender evaluators, a part of the real preference data set is held out as test data. These test preferences are absent from the training data set (actual data set – test data set), which is fed to the recommender under evaluation (i.e. all …
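
The hold-out idea described above can be sketched in plain Python as follows. This is not Mahout’s API: the item-mean estimator is a deliberately simple stand-in for the recommender under evaluation, and the function names and the sample preferences are hypothetical.

```python
import random

def split_preferences(prefs, test_fraction, seed=0):
    """Hold out a fraction of (user, item, value) triples as test data."""
    rng = random.Random(seed)
    shuffled = prefs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]  # (training, test)

def item_mean_estimator(training):
    """Stand-in recommender: estimate a preference as the item's mean value."""
    sums, counts = {}, {}
    for _, item, value in training:
        sums[item] = sums.get(item, 0.0) + value
        counts[item] = counts.get(item, 0) + 1
    overall = sum(sums.values()) / sum(counts.values())

    def estimate(user, item):
        if item in counts:
            return sums[item] / counts[item]
        return overall  # fall back to the global mean for unseen items

    return estimate

def average_absolute_difference(estimate, test):
    """Mean of |estimated - actual| over the held-out preferences."""
    return sum(abs(estimate(u, i) - v) for u, i, v in test) / len(test)

# Hypothetical preference data: (user id, item id, preference value).
prefs = [
    (1, 101, 5.0), (1, 102, 3.0), (2, 101, 4.0), (2, 103, 2.0),
    (3, 102, 3.5), (3, 103, 2.5), (4, 101, 4.5), (4, 102, 3.0),
    (5, 103, 2.0), (5, 101, 5.0),
]

training, test = split_preferences(prefs, test_fraction=0.3)
estimate = item_mean_estimator(training)
print("training size:", len(training), "test size:", len(test))
print("average absolute difference:", average_absolute_difference(estimate, test))
```

A lower score means the estimates sit closer to the held-out preference values; the recommender never sees the test preferences during training.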

Mahout Recommendations with Data Sets containing Alpha Numeric Item Ids

June 27, 2016 by S4

Filed under Hadoop, Mahout

Last modified June 27, 2016

With real-world data we can’t always ensure that the input supplied to us for generating recommendations contains only integer values for user and item IDs. If either of these values is not an integer, the default data models that Mahout provides won’t be suitable for processing our data. …
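
One generic way around the integer-ID constraint described above is to translate each alphanumeric ID to an integer before building the data model, and to translate back when presenting results. The Python sketch below shows that idea only; it is not Mahout code, and the `IdMapper` class and the sample rows are hypothetical.

```python
class IdMapper:
    """Assign a stable integer to each distinct string ID, and map it back."""

    def __init__(self):
        self._to_int = {}
        self._to_str = []

    def to_int(self, string_id):
        """Return the integer for string_id, assigning the next free one if new."""
        if string_id not in self._to_int:
            self._to_int[string_id] = len(self._to_str)
            self._to_str.append(string_id)
        return self._to_int[string_id]

    def to_str(self, int_id):
        """Return the original string ID for a previously assigned integer."""
        return self._to_str[int_id]

# Hypothetical input rows: (user id, alphanumeric item id, preference value).
rows = [("u1", "ITEM-A12", 4.0), ("u2", "ITEM-B07", 3.0), ("u1", "ITEM-B07", 5.0)]

users, items = IdMapper(), IdMapper()
numeric_rows = [(users.to_int(u), items.to_int(i), v) for u, i, v in rows]
print(numeric_rows)
```

The in-memory mapping has to be kept (or persisted) alongside the numeric data so that recommended item integers can be turned back into the original IDs.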