Common Data Science Interview Questions

What are the important skills to have in Python with regard to data analysis?

The following are some of the important skills that come in handy when performing data analysis with Python.

  • Good understanding of the built-in data types, especially lists, dictionaries, tuples, and sets.
  • Mastery of N-dimensional NumPy arrays.
  • Mastery of pandas DataFrames.
  • Ability to perform element-wise vector and matrix operations on NumPy arrays.
  • Familiarity with the Anaconda distribution and the conda package manager.
  • Familiarity with scikit-learn.
  • Ability to write efficient list comprehensions instead of traditional for loops (see the sketch after this list).
  • Ability to write small, clean functions (important for any developer), preferably pure functions that don’t alter objects.
  • Knowing how to profile the performance of a Python script and how to optimize bottlenecks.
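
As a quick sketch of the comprehension, vectorization, pure-function, and profiling points above (the array sizes and the normalize helper are illustrative only):

```python
import numpy as np

# List comprehension instead of an explicit accumulator loop.
squares = [x * x for x in range(10) if x % 2 == 0]

# Element-wise (vectorized) NumPy operations instead of Python loops.
a = np.arange(1_000_000, dtype=np.float64)
b = np.sqrt(a) + 2 * a  # one C-level pass over the data

# A small, pure function: returns a new array, never mutates its input.
def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length without modifying the original."""
    norm = np.linalg.norm(v)
    return v / norm if norm else v.copy()

# Profiling a script to find bottlenecks, e.g. with the standard library:
#   python -m cProfile -s cumulative my_script.py
```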

What is Data Science?

Data Science is a blend of various tools, algorithms, and machine learning principles, with the goal of discovering hidden patterns in raw data.

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

Differentiate between Data Science, Machine Learning and AI.

Here are the differences between Data Science, Machine Learning, and AI.

Definition
  • Data Science: Not exactly a subset of machine learning, but it uses machine learning to analyse data and make future predictions.
  • Machine Learning: A subset of AI that focuses on a narrow range of activities.
  • Artificial Intelligence: A broad term that covers applications ranging from robotics to text analysis.

Role
  • Data Science: Can take on a business role.
  • Machine Learning: A purely technical role.
  • Artificial Intelligence: A combination of both business and technical aspects.

Scope
  • Data Science: A broad term for diverse disciplines; not merely about developing and training models.
  • Machine Learning: Fits within the data science spectrum.
  • Artificial Intelligence: A sub-field of computer science.

Relationship to AI
  • Data Science: Loosely integrated with AI.
  • Machine Learning: A sub-field of AI, tightly integrated with it.
  • Artificial Intelligence: Comprises various tasks such as planning, moving around in the world, recognizing objects and sounds, speaking, translating, performing social or business transactions, and creative work.

What is HA in YARN ResourceManager?

The YARN ResourceManager is responsible for managing the resources in a cluster and scheduling applications. Prior to Hadoop 2.4, the ResourceManager was a single point of failure in a YARN cluster.

Hadoop 2.4 introduced High Availability (HA): an active-standby ResourceManager pair that removes this single point of failure. When the active ResourceManager fails, control switches to the standby ResourceManager, and all halted applications resume from the last state saved in the state store. This allows failover to be handled without significant performance degradation in the following situations:

  • Unplanned events such as machine crashes
  • Planned maintenance events such as software or hardware upgrades to the machine running the ResourceManager

Note that ResourceManager HA requires the ZooKeeper and HDFS services to be running; a configuration sketch follows below.
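
For context, here is a minimal yarn-site.xml sketch of such an active-standby pair. The host names, cluster ID, and ZooKeeper quorum are placeholders, and property names should be verified against your Hadoop version (newer releases, for example, read the ZooKeeper quorum from hadoop.zk.address):

```xml
<!-- Minimal ResourceManager HA sketch; hostnames and IDs are placeholders. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>
<!-- ZooKeeper quorum used for leader election and the state store. -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```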

What is Hadoop Streaming?

Hadoop Streaming is an API that allows writing Mappers and Reducers in any language. It uses Unix standard streams as the interface between Hadoop and the user application.

Streaming is naturally suited to text processing. The data view is line-oriented: each line is processed as a key-value pair separated by a tab character. The reduce function reads lines from standard input, sorted by key, and writes its results to standard output.
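
For illustration, here is a minimal word-count pair written for Hadoop Streaming in Python; the file names and the streaming JAR path are placeholders that vary by installation:

```python
#!/usr/bin/env python3
# mapper.py -- emits one "word<TAB>1" pair per token on stdin.
# Submitted via the streaming JAR, e.g. (path varies by installation):
#   hadoop jar .../hadoop-streaming-*.jar -input /data/in -output /data/out \
#       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- stdin arrives sorted by key, so counts can be
# accumulated per word and flushed whenever the key changes.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

The tab-separated key-value convention and the key-sorted reducer input are exactly the guarantees described above.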

What is HA in a NameNode?

The Standby NameNode introduced in Hadoop 2.x ensures High Availability in Hadoop clusters, which was not present in Hadoop 1.x. In a Hadoop 1.x cluster (one NameNode, multiple DataNodes), the NameNode was a single point of failure: if it went down, owing to the lack of a backup, the entire cluster became unavailable. (The Secondary NameNode of Hadoop 1.x only performed periodic checkpointing of the namespace; it was not a failover node.)

Hadoop 2.x solved this single point of failure by adding a Standby NameNode, so that a pair of NameNodes runs in an active-standby configuration. The Standby NameNode acts as a backup for the NameNode metadata: it receives block reports from the DataNodes and maintains a synced copy of the edit logs with the active NameNode. If the active NameNode goes down, the Standby NameNode takes charge and keeps the cluster available.
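
For reference, a minimal hdfs-site.xml sketch of such an active-standby NameNode pair follows. The nameservice ID, host names, and JournalNode quorum are placeholders for illustration:

```xml
<!-- Minimal NameNode HA sketch; nameservice and hosts are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<!-- Shared edit log (a JournalNode quorum) that keeps the standby in sync. -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```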

What is the difference between Data Block and Input Split?

Data Block: HDFS stores a large file by splitting it into smaller chunks known as blocks (128 MB by default in Hadoop 2.x), so each file is stored as a set of data blocks. These data blocks are replicated and distributed across multiple DataNodes.

Input Split: An input split is a logical chunk of data processed by a single Mapper at a time. In MapReduce, the number of Map tasks launched for a job equals the number of input splits, so the split size effectively controls map-side parallelism. By default, the split size equals the block size, but it can be configured independently.
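
As a rough worked example of the relationship between the two (the file and chunk sizes are hypothetical, and the ceiling division ignores the small tolerance MapReduce allows on the last split):

```python
import math

# Hypothetical sizes, assuming the Hadoop 2.x default of 128 MB
# for both the HDFS block size and the input split size.
MB = 1024**2
file_size  = 300 * MB   # a 300 MB input file
block_size = 128 * MB   # physical storage chunking (HDFS)
split_size = 128 * MB   # logical processing chunking (MapReduce)

blocks = math.ceil(file_size / block_size)  # 3 blocks: 128 + 128 + 44 MB
splits = math.ceil(file_size / split_size)  # 3 splits -> 3 Map tasks
print(blocks, splits)                       # 3 3

# Shrinking only the split size raises map-side parallelism without
# changing how the file is physically stored:
print(math.ceil(file_size / (64 * MB)))     # 5 splits -> 5 Map tasks
```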

Explain the small files problem in Hadoop.

HDFS is designed for storing and processing big data, so it is not prepared to efficiently store or process numerous small files. Every file, block, and directory is represented as an object in the NameNode's memory, so a large number of small files generates significant overhead on the NameNode. Reading small files also causes a lot of seeks and hopping from one DataNode to another to retrieve each file. All of this adds up to inefficient data read/write operations.
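
A back-of-the-envelope sketch of the NameNode overhead, assuming the commonly cited figure of roughly 150 bytes of heap per namespace object (a rule of thumb, not an exact number):

```python
# Rough NameNode heap estimate; ~150 bytes per namespace object
# (file, block, or directory) is a rule of thumb, not an exact figure.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Approximate heap (GB) for the given number of files and blocks."""
    objects = num_files * (1 + blocks_per_file)  # one inode + its blocks
    return objects * BYTES_PER_OBJECT / 1024**3

# 10 million single-block small files...
print(f"{namenode_heap_gb(10_000_000):.2f} GB")   # ~2.79 GB of metadata
# ...versus the same number of blocks packed into 10,000 large files:
print(f"{namenode_heap_gb(10_000, 1000):.2f} GB") # ~1.40 GB
```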

Can two clients write to an HDFS file simultaneously?

No. HDFS follows a single-writer model. When a client opens a file for writing, the NameNode grants it a lease on that file; a second client that tries to write to the same file is rejected until the first client has completed its job and the lease is released. This does not apply to reading a file: multiple clients can read the same file simultaneously. Hadoop is therefore built around write-once-read-many (WORM) semantics.