Category: Hadoop

Generating Recommendations with Mahout for Boolean data sets

June 27, 2016 by S4

Filed under Hadoop, Mahout

Last modified June 27, 2016

Generating Recommendations with Mahout for Boolean data sets (data sets with no preference value). Boolean data sets are input data sets that don't have a preference value, i.e. the input would be of the format UserId1,ItemId1 UserId2,ItemId2. Here the data only records that a user either likes an item or doesn't; there is no preference value associated with it. When we use Boolean …
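
As a rough illustration of where the full post heads, the sketch below (not the post's exact code; the file name bool-data.csv is a placeholder) loads a preference-less userId,itemId file into Mahout's GenericBooleanPrefDataModel and generates user-based recommendations with Tanimoto similarity:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class BooleanRecommenderSketch {
      public static void main(String[] args) throws Exception {
        // bool-data.csv is a placeholder: lines of the form userId,itemId (no preference column)
        DataModel model = new GenericBooleanPrefDataModel(
            GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("bool-data.csv"))));

        // Tanimoto similarity only looks at which items co-occur, so it suits Boolean data
        UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        GenericBooleanPrefUserBasedRecommender recommender =
            new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }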

Map Reduce best practices

June 27, 2016 by S4

Filed under Hadoop

Last modified June 27, 2016

Map Reduce best practices 1. Use larger HDFS blocks for better performance. If smaller HDFS blocks are used, more time is spent seeking records on disk, which is a massive overhead when we deal with large files. 2. Always use a Combiner where possible for local aggregation. Shuffle and sort is a really expensive phase, so try reducing the number of records involved for …
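
Point 2 is the easiest to see in code. Below is a minimal word-count sketch (illustrative, not taken from the post) where the reducer doubles as a combiner, so counts are pre-aggregated on the map side before shuffle and sort:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountWithCombiner {

      // Emits (word, 1) for every token in the input split
      public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Sums counts; safe to reuse as a combiner because addition is associative and commutative
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
            sum += value.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // local aggregation before shuffle and sort
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }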

Use Compression with MapReduce

June 27, 2016 by S4

Filed under Hadoop

Last modified June 27, 2016

Use Compression with MapReduce. Hadoop is intended for storing large data volumes, so compression becomes a mandatory requirement here. There are different compression formats available, like gzip, bzip2 and LZO. Of these, bzip2 and LZO are splittable; bzip2 offers better compression, but decompressing it is expensive. When we look at both space and time, LZO is more advisable. Also …
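
A minimal sketch of enabling compression from a job driver is shown below; the codec choices are illustrative (the gzip codec ships with Hadoop, while the LZO codec is a separate add-on), not a recommendation from the post:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressionConfigSketch {
      public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output to cut shuffle traffic
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
            GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed job");

        // Compress the final job output as well
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
      }
    }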

Evaluating Mahout based Recommender Implementations

June 27, 2016 by S4

Filed under Hadoop, Mahout

Last modified June 27, 2016

Evaluating Mahout based Recommender Implementations. Mahout provides an option to evaluate your generated recommendations against the actual preference values. In Mahout recommender evaluators, a part of the real preference data set is kept aside as test data. These test preferences are left out of the training data set (actual data set minus test data set) that is fed to the recommender under evaluation (i.e. all …
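
A rough sketch of such an evaluation, using Mahout's AverageAbsoluteDifferenceRecommenderEvaluator to hold back 30% of the preferences as test data (the file name ratings.csv and the recommender wiring are assumptions, not the post's exact code):

    import java.io.File;

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class EvaluatorSketch {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder file

        // The evaluator rebuilds the recommender on each training split it creates
        RecommenderBuilder builder = new RecommenderBuilder() {
          @Override
          public Recommender buildRecommender(DataModel trainingData) throws TasteException {
            PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(trainingData);
            return new GenericUserBasedRecommender(trainingData,
                new NearestNUserNeighborhood(10, similarity, trainingData), similarity);
          }
        };

        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        // 0.7 = fraction of each user's preferences used for training, 1.0 = evaluate with all users
        double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);
        System.out.println("Average absolute difference: " + score);
      }
    }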

Mahout Recommendations with Data Sets containing Alpha Numeric Item Ids

June 27, 2016 by S4

Filed under Hadoop, Mahout

Last modified June 27, 2016

Mahout Recommendations with Data Sets containing Alpha Numeric Item Ids. In real-world data we can't always ensure that the input supplied to us for generating recommendations contains only integer values for user and item IDs. If these values, or either one of them, are not integers, the default data models that Mahout provides won't be suitable for processing our data. …
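
One common workaround, sketched below under the assumption that the raw item IDs are strings, is Mahout's MemoryIDMigrator, which hashes string IDs to the long IDs the data models expect and remembers the mapping for the reverse lookup:

    import org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator;

    public class IdMigrationSketch {
      public static void main(String[] args) throws Exception {
        MemoryIDMigrator migrator = new MemoryIDMigrator();

        // Convert an alphanumeric item ID into a long that Mahout's data models can store
        String rawItemId = "ITEM-A42";               // illustrative value
        long numericId = migrator.toLongID(rawItemId);
        migrator.storeMapping(numericId, rawItemId); // remember the mapping for the reverse lookup

        // ... build the DataModel / recommender using the numeric IDs ...

        // Translate a recommended numeric ID back to the original string
        System.out.println(migrator.toStringID(numericId)); // prints ITEM-A42
      }
    }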

Mahout Recommendations in Distributed mode with Hadoop Map Reduce

June 27, 2016 by S4

Filed under Hadoop, Mahout

Last modified June 27, 2016

Mahout Recommendations in Distributed mode with Hadoop Map Reduce. The implementation of Mahout recommendations in a distributed environment is completely different from the standalone version. In a distributed environment the concepts of data model and neighborhood cease to exist, as data is distributed across multiple machines and computations are not based only on local data. In essence, when we take Mahout into distributed mode there …
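
For context, the distributed flavour is normally driven through org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. The sketch below runs it via ToolRunner with illustrative HDFS paths and a co-occurrence similarity; the argument values are assumptions, not the post's exact command:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

    public class DistributedRecommenderSketch {
      public static void main(String[] args) throws Exception {
        // Input lines on HDFS are expected as userId,itemId[,preference]
        String[] jobArgs = {
            "--input", "/user/hadoop/recommend/input",    // illustrative HDFS path
            "--output", "/user/hadoop/recommend/output",  // illustrative HDFS path
            "--similarityClassname", "SIMILARITY_COOCCURRENCE",
            "--numRecommendations", "5"
        };
        int exitCode = ToolRunner.run(new Configuration(), new RecommenderJob(), jobArgs);
        System.exit(exitCode);
      }
    }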

Installation of Apache Spark

June 19, 2016 by S4

Filed under Hadoop

Last modified June 19, 2016

Installation of Apache Spark. We are going to look at the installation of Apache Spark on Hadoop. Let's set up Hadoop YARN here once again from scratch, with screenshots, as I received some comments that my installation guide needs more screenshots. In this post, we will look at creating a new user account on Ubuntu 14.04 and installing …

ARCHITECTURE OF HDFS WRITE AND READ

June 19, 2016 by S4

Filed under Hadoop

Last modified June 19, 2016

ARCHITECTURE OF HDFS WRITE AND READ. HDFS is a distributed file system designed to overcome some of the limitations around huge amounts of storage, scalability and redundancy. It deals with huge amounts of data (terabytes, petabytes and exabytes). As the machines in a big data cluster are commodity hardware, there is a risk of machine failure, so data is stored reliably on multiple machines to keep it available …
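
As a client-side illustration of the write and read paths, the sketch below writes a small file to HDFS and reads it back through the FileSystem API (the NameNode address and the file path are assumptions):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hdfs-demo.txt");        // illustrative path

        // Write: the client asks the NameNode for target DataNodes, then streams the data to them
        try (FSDataOutputStream out = fs.create(path, true)) {
          out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets block locations from the NameNode and pulls data from DataNodes
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
          System.out.println(reader.readLine());
        }

        fs.close();
      }
    }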

Install MongoDB on Ubuntu

June 17, 2016 by S4

Filed under Hadoop, MongoDB

Last modified June 17, 2016

Install MongoDB on Ubuntu. Configure the Package Management System (APT). The Ubuntu package management tools (i.e. dpkg and apt) ensure package consistency and authenticity by requiring that distributors sign packages with GPG keys. Issue the following command to import the MongoDB public GPG key: sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10 Create a /etc/apt/sources.list.d/mongodb.list file using the following command: echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list …