The best answer here is Python, because its Pandas library provides easy-to-use data structures and high-performance data analysis tools.
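As a quick illustration, here is a minimal Pandas sketch; the file name sales.csv and the column names region and revenue are hypothetical.

import pandas as pd

# Load a CSV file into a DataFrame (file and column names are hypothetical).
df = pd.read_csv("sales.csv")

# Inspect the first rows and get summary statistics for numeric columns.
print(df.head())
print(df.describe())

# Group-by aggregation: total revenue per region.
print(df.groupby("region")["revenue"].sum())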
Data Science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data. It is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Here are the differences between Data Science, Machine Learning, and AI:

Definition
Data Science: Not exactly a subset of machine learning, but it uses machine learning to analyse data and make future predictions.
Machine Learning: A subset of AI that focuses on a narrow range of activities.
Artificial Intelligence: A wide term that covers applications ranging from robotics to text analysis.

Role
Data Science: It can take on a business role.
Machine Learning: It is a purely technical role.
Artificial Intelligence: It is a combination of both business and technical aspects.

Scope
Data Science: A broad term for diverse disciplines; it is not merely about developing and training models.
Machine Learning: Fits within the data science spectrum.
Artificial Intelligence: A sub-field of computer science.

Relationship to AI
Data Science: Loosely integrated with AI.
Machine Learning: A sub-field of AI, tightly integrated with it.
Artificial Intelligence: A sub-field of computer science covering tasks such as planning, moving around in the world, recognizing objects and sounds, speaking, translating, performing social or business transactions, and creative work.
The YARN ResourceManager is responsible for managing the resources in a cluster and scheduling applications. Prior to Hadoop 2.4, the ResourceManager was a single point of failure in a YARN cluster.
YARN removes this single point of failure by providing High Availability (HA) through an active-standby ResourceManager pair. When the active ResourceManager fails, control switches to the standby ResourceManager, and all halted applications resume from the last state saved in the state store. This allows failover to be handled without any performance degradation in the following situations:
Unplanned events such as machine crashes
Planned maintenance events such as software or hardware upgrades to the machine running the ResourceManager
ResourceManager HA requires the ZooKeeper and HDFS services to be running.
Hadoop Streaming is an API that allows writing Mappers and Reducers in any language. It uses Unix standard streams as the interface between Hadoop and the user application.
Streaming is naturally suited to text processing. The data view is line-oriented: each line is processed as a key-value pair separated by a tab character. The Reduce function reads lines, sorted by key, from standard input and writes its results to standard output.
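As a minimal sketch, here is the classic word-count example written as two Python scripts; the file names mapper.py and reducer.py are just conventions, and the scripts would be passed to the hadoop-streaming jar through its -mapper and -reducer options.

mapper.py:

#!/usr/bin/env python3
import sys

# Mapper: emit "word<TAB>1" for every word read from standard input.
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

reducer.py:

#!/usr/bin/env python3
import sys

# Reducer: input arrives sorted by key, so all counts for a word are contiguous.
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

# Flush the last key after the loop ends.
if current_word is not None:
    print(f"{current_word}\t{current_count}")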
The Standby NameNode introduced in Hadoop 2.x ensures High Availability in Hadoop clusters, which was not present in Hadoop 1.x. In a Hadoop 1.x cluster (one NameNode, multiple DataNodes), the NameNode was a single point of failure: if it went down, there was no backup and the entire cluster became unavailable. Hadoop 2.x solved this by adding a Standby NameNode to the cluster, so that a pair of NameNodes runs in an active-standby configuration (the Standby NameNode should not be confused with the Secondary NameNode, which only performs checkpointing and is not a failover node). The Standby NameNode acts as a backup for the NameNode metadata: it receives block reports from the DataNodes and maintains a synced copy of the active NameNode's edit logs, and if the active NameNode goes down, the Standby NameNode takes charge and keeps the cluster available.
Data Block: HDFS stores data by splitting a large file into smaller chunks known as blocks, so each file is stored as a set of data blocks. These data blocks are replicated and distributed across multiple DataNodes.
Input Split: An input split represents the amount of data processed by an individual Mapper at a time. In MapReduce, the number of Map tasks equals the number of input splits, so the split size is what controls how many Map tasks a job runs.
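A small worked example may help; it assumes the common HDFS default block size of 128 MB, the MapReduce default of one split per block, and a hypothetical 300 MB input file.

import math

BLOCK_SIZE_MB = 128            # assumed HDFS default block size
SPLIT_SIZE_MB = BLOCK_SIZE_MB  # assumed MapReduce default: split size == block size
file_size_mb = 300             # hypothetical input file

num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)  # 3 blocks: 128 + 128 + 44 MB
num_splits = math.ceil(file_size_mb / SPLIT_SIZE_MB)  # 3 input splits -> 3 Map tasks
print(num_blocks, num_splits)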
HDFS is designed for storing and processing big data, so it is not prepared to efficiently store or process numerous small files. Small files generate a lot of overhead on the NameNode and the DataNodes, and reading through them normally causes many seeks and much hopping from one DataNode to another to retrieve each small file. All of this adds up to inefficient data read/write operations.
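A rough back-of-the-envelope sketch of the NameNode side of this overhead, using the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (an approximation, not an exact figure):

BYTES_PER_OBJECT = 150       # commonly cited approximation of NameNode heap per object
num_small_files = 1_000_000  # hypothetical: one million small files, one block each

# Each small file contributes at least a file object and a block object to the namespace.
heap_bytes = num_small_files * 2 * BYTES_PER_OBJECT
print(f"~{heap_bytes / 1024 / 1024:.0f} MB of NameNode heap")  # roughly 286 MB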
No. HDFS grants an exclusive write lease on a file, so when two clients try to write to the same file simultaneously, only the first is allowed to write; the second client has to wait until the first has completed its job and the lease is released. This does not apply to reading a file: multiple clients can read a file simultaneously. Hadoop is therefore built around write once, read many (WORM) semantics.