“Data, data, data. I cannot make bricks without clay.” - Sherlock Holmes
As technology has advanced, the amount of data has grown every day. Back in the 70s and 80s, few people used computers, so very little data was fed into them. Now almost everyone owns a gadget, a laptop or a mobile phone, and generates data every day, every hour and every minute.
Several factors drive this growth. One of the main ones is the Internet of Things (IoT), which has led to the development of smart devices. Another is social media. We humans are social animals who love to interact, and social media gives us exactly that: something like 60 million likes and other interactions happen almost every minute, generating a great deal of data. Data has been observed to grow exponentially, and this growth is what gave rise to Big Data.
Big Data, as the name suggests, is a very large collection of data, but it is also a problem statement: it describes the inability of traditional systems to process that data. When those systems were designed, nobody expected to deal with data produced at such speed and in such quantity, so traditional systems can neither store nor process it.
To identify a Big Data problem, IBM has suggested the 5 V's:
Volume - Traditional systems are unable to store and process such huge amounts of data.
Variety - Data generated nowadays is often not in the structured form that traditional systems can handle; much of it is semi-structured or unstructured.
Velocity - Data is generated today at a very high speed, faster than traditional systems can keep up with.
Value - Traditional systems cannot easily find the valuable data within the overall mass of data.
Veracity - Data can be sparse or unreliable, i.e. the data we get may not always be correct.
To address these problems, the Apache Hadoop framework stores and processes data sets in a parallel and distributed fashion. Hadoop has two core parts: HDFS and MapReduce.
HDFS (Hadoop Distributed File System) provides storage for data in a distributed environment. It follows a master/slave architecture, in which data is stored on slave nodes (DataNodes) and managed by a master node (the NameNode). In this manner, a large file can be split across many machines at the same time while the master keeps track of where each piece is stored.
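To make the master/slave split concrete, here is a minimal toy sketch in Python. It is not the HDFS API: the class `ToyCluster`, its method names, the tiny block size and the round-robin placement are all assumptions made for illustration. Real HDFS uses much larger blocks and its own placement policy, and clients write to DataNodes directly. The sketch only shows the idea that the master holds metadata while the slaves hold the actual bytes.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyCluster:
    """Hypothetical in-memory stand-in for one NameNode plus several DataNodes."""

    def __init__(self, node_names, replication=2):
        # Each DataNode is modelled as a dict: block id -> block bytes.
        self.data_nodes = {name: {} for name in node_names}
        # The NameNode stores only metadata: filename -> [(block id, [node names])].
        self.name_node = {}
        self.replication = min(replication, len(node_names))

    def put(self, filename, data, block_size):
        names = sorted(self.data_nodes)
        placements = []
        for index, block in enumerate(split_into_blocks(data, block_size)):
            block_id = f"{filename}#{index}"
            # Assumed round-robin placement; not the real HDFS policy.
            targets = [names[(index + r) % len(names)]
                       for r in range(self.replication)]
            for node in targets:
                self.data_nodes[node][block_id] = block
            placements.append((block_id, targets))
        self.name_node[filename] = placements

    def get(self, filename):
        # Ask the NameNode where the blocks live, then read the first replica of each.
        return b"".join(self.data_nodes[nodes[0]][block_id]
                        for block_id, nodes in self.name_node[filename])


if __name__ == "__main__":
    cluster = ToyCluster(["dn1", "dn2", "dn3"], replication=2)
    cluster.put("log.txt", b"abcdefghij", block_size=4)
    print(cluster.name_node["log.txt"])
    print(cluster.get("log.txt"))
```

Note that the file's contents never pass through `name_node`. It only records which node holds which block, which is why a single master can coordinate a large amount of stored data.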
MapReduce - This is the programming unit of Hadoop. It uses the advantages of the distributed framework to process large data sets. It consists of two distinct tasks, Map and Reduce. The Map task produces key-value pairs as an intermediate result; these are given to the Reduce task as input, which combines them into a smaller, aggregated output known as the reduced data.
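The Map and Reduce steps can be illustrated with the classic word-count example. The sketch below is plain Python running on one machine, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this illustration, and the grouping step stands in for the work the framework does between the two tasks.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the intermediate values by key before they reach the reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    intermediate = []
    for line in lines:  # on a real cluster, each line or split could be mapped in parallel
        intermediate.extend(map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Because each call to `map_phase` looks at only one line and each call to `reduce_phase` looks at only one key, both steps can be spread over many machines, which is the property the distributed framework exploits.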
Hadoop works with various tools for analyzing data, such as Hive (which is similar to SQL), Pig, Spark, LIV, etc. All of these tools are scalable and reliable. A tool is selected according to the need, and the data is processed with it.