Data Science Interview Questions- Part 5

Question 81: What is the difference between Causation and Correlation?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_81_What_is_the_diffe.mp3
  • Causation means that one event directly brings about another: a change in the cause produces a change in the effect.
  • Correlation measures the strength and direction of the statistical relationship between two or more variables.
  • Causation generally implies that the variables are associated, but correlation doesn’t necessarily denote causation.
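
As a rough illustration of the correlation side, the sketch below computes the Pearson correlation coefficient by hand. The two series are hypothetical figures made up for this example: they move together strongly, yet neither causes the other (both would plausibly be driven by hot weather).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures, invented for illustration only.
ice_cream_sales = [10, 20, 30, 40, 50]
sunburn_cases = [3, 5, 8, 9, 12]

r = pearson_r(ice_cream_sales, sunburn_cases)
print(f"correlation: {r:.3f}")  # close to 1, but says nothing about cause
```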

Question 82: What happens if two users access the same HDFS file at the same time?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_82_What_happens_if_t.mp3

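When the first user has the file open for writing, the second user’s attempt to write to it will be rejected, because the HDFS NameNode supports exclusive writes only: one writer per file at a time. Multiple users can still read the same file simultaneously.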

Question 83: What is PyTorch?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_83_What_is_PyTorch_.mp3

PyTorch is a Python-based scientific computing package for numerical computation with tensors. It can run on GPUs to speed up calculations. PyTorch is used as a GPU-capable alternative to NumPy and for research and development in machine learning, mainly the development of neural networks.

Question 84: How should you maintain a deployed model?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_84_How_should_you_ma.mp3

The steps to maintain a deployed model are:

Monitor- All deployed models need constant monitoring to track their performance and accuracy. Whenever something changes, you need to know how the change affects the results and confirm that the model is still doing what it’s supposed to do.

Evaluate- The evaluation metrics of the current model are calculated to determine whether a new algorithm is needed.

Compare- The candidate models are compared with each other to determine which one performs best.

Rebuild- The best-performing model is rebuilt on the current state of the data.

Question 85: What information is gained in a decision tree algorithm?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_85_What_information_.mp3

Information gain is the expected reduction in entropy achieved by splitting a node on a given attribute. It decides how the tree is built: at each node, the split with the highest information gain is chosen. For a parent node R and a set E of K training examples, information gain is the difference between the entropy before the split and the weighted average entropy of the child nodes after the split.
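
The calculation can be sketched in a few lines of Python. This is a minimal illustration rather than a full decision tree: the function names and the "yes"/"no" labels are assumptions made up for the example, and entropy is measured in bits (log base 2).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 "yes" and 5 "no" examples, split into two children.
parent = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

gain = information_gain(parent, split)
print(f"information gain: {gain:.3f} bits")
```

A split that separates the classes perfectly would recover the full parent entropy of 1 bit; a split that leaves both children as mixed as the parent would have a gain of 0.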

Question 86: What is Dropout?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_86_What_is_Dropout_.mp3

In Data Science, the term “dropout” refers to randomly dropping visible and hidden units of a neural network during training. Temporarily removing a fraction of the nodes (for example, up to 20%) helps prevent the network from overfitting the data, although it typically increases the number of iterations the network needs to converge.
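
A minimal sketch of the idea, assuming the common "inverted dropout" variant, in which the surviving activations are scaled up during training so that their expected value is unchanged. The function name and the 20% rate are illustrative choices, not taken from any particular library.

```python
import random

def dropout(activations, rate=0.2, training=True, rng=random):
    """Inverted dropout: zero each unit with probability `rate` while training."""
    if not training or rate == 0.0:
        # At inference time every unit is kept and nothing is rescaled.
        return list(activations)
    keep = 1.0 - rate
    # Survivors are divided by `keep` to preserve the expected activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)       # seeded so the example is repeatable
layer_output = [1.0] * 10    # hypothetical activations of one layer
dropped = dropout(layer_output, rate=0.2, rng=rng)
print(dropped)
```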

Question 87: What is a Bias-Variance trade-off in Data Science?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_87_What_is_a_biasva.mp3

When building a model using Data Science or Machine Learning, our goal is to build one that has both low bias and low variance. Bias and variance are both sources of error: bias comes from an overly simplistic model, and variance from an overly complicated one. High accuracy is therefore only achievable if we are aware of the tradeoff between bias and variance.

Bias is an error that occurs when a model is too simple to capture the patterns in a dataset. To reduce bias, we need to make our model more complex. However, if we make the model too complex, it becomes too sensitive to the particular training data and starts fitting noise, leading to high variance. So, the tradeoff between bias and variance is that if we increase the complexity, the bias reduces and the variance increases, and if we reduce complexity, the bias increases and the variance reduces. Our goal is to find a point at which our model is complex enough to give low bias but not so complex that it ends up having high variance.

Question 88: What is RMSE?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_88_What_is_RMSE_Ans.mp3

RMSE stands for the root mean square error. It is a measure of accuracy in regression. RMSE allows us to calculate the magnitude of error produced by a regression model. The way RMSE is calculated is as follows:

First, we calculate the errors in the predictions made by the regression model. For this, we calculate the differences between the actual and the predicted values. Then, we square the errors.

After this step, we calculate the mean of the squared errors, and finally, we take the square root of the mean of these squared errors. This number is the RMSE, and a model with a lower value of RMSE is considered to produce lower errors, i.e., the model will be more accurate.
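
The steps above can be sketched directly in Python. The actual and predicted values are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    # Steps 1 and 2: compute the errors and square them.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Steps 3 and 4: take the mean of the squared errors, then the square root.
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical regression output, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

print(f"RMSE: {rmse(actual, predicted):.4f}")
```

Here the errors are 1, 0, -1 and -2, so the mean squared error is 1.5 and the RMSE is its square root.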

Question 89: What is a kernel function in SVM?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_89_What_is_a_kernel_.mp3

In the SVM algorithm, a kernel function is a special mathematical function. In simple terms, a kernel function takes data as input and converts it into a required form. It relies on the kernel trick: the kernel computes the similarity between two points as if they had been mapped into a higher-dimensional space, without ever computing that mapping explicitly. Using the kernel function, data that is not linearly separable (cannot be separated using a straight line) in its original space can become linearly separable in the transformed space.
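
To make the kernel trick concrete, the sketch below uses one well-known kernel, the homogeneous polynomial kernel of degree 2 on 2-D points. It checks that the kernel value equals an ordinary dot product after an explicit mapping into 3-D. The function names are illustrative, and this is only one of many possible kernels.

```python
import math

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel (x . y)^2, computed in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def explicit_map(x):
    """Explicit 3-D feature map whose dot product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 4.0)
k_value = poly2_kernel(x, y)                          # no mapping needed
mapped_value = dot(explicit_map(x), explicit_map(y))  # same result via 3-D
print(k_value, mapped_value)
```

The kernel gives the same number as the dot product in the higher-dimensional space, which is why an SVM can work in that space without ever constructing it.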

Question 90: How can we select an appropriate value of k in k-means?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_90_How_can_we_select.mp3

Selecting the correct value of k is an important aspect of k-means clustering. We can make use of the elbow method to pick the appropriate k value. To do this, we run the k-means algorithm on a range of values, e.g., 1 to 15. For each value of k, we compute a score called the inertia, or within-cluster sum of squares.

Inertia is calculated as the sum of the squared distances between each point and the centroid of its cluster. As k increases from a low value, the inertia at first decreases sharply. After a certain value of k in the range, the drop in the inertia value becomes quite small. This "elbow" point is the value of k that we need to choose for the k-means clustering algorithm.
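
The elbow method can be sketched with a deliberately small, pure-Python k-means on 1-D data. Several things here are simplifying assumptions for illustration: the data is made up, the centroids are initialised deterministically from evenly spaced sorted points rather than randomly, and a real project would more likely use a library implementation.

```python
def kmeans_inertia(points, k, iterations=20):
    """Run a basic 1-D k-means and return the within-cluster sum of squares."""
    ordered = sorted(points)
    # Assumption: deterministic start from evenly spaced sorted points.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

# Hypothetical data with three obvious groups.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
inertias = {k: kmeans_inertia(data, k) for k in range(1, 6)}

for k, value in inertias.items():
    print(k, round(value, 2))
```

For this data the inertia falls steeply up to k = 3 and barely changes afterwards, so the elbow suggests three clusters.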

Question 91: What is batch normalization?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_91_What_is_batch_nor.mp3

Batch normalization is a method for improving the performance and stability of a neural network. It normalizes the inputs to each layer so that, within each mini-batch, the activations have a mean of 0 and a standard deviation of 1.
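
A minimal sketch of the normalization step for a single feature across one mini-batch. The `gamma` and `beta` parameters stand in for the scale and shift that a network would normally learn, and `eps` is a small constant added for numerical stability. The default values are illustrative assumptions.

```python
import math

def batch_normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch to about zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # Scale (gamma) and shift (beta) would be learned during training.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]   # hypothetical activations for one feature
normalized = batch_normalize(batch)
print([round(v, 3) for v in normalized])
```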

Question 92: What is an Activation function?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_92_What_is_an_Activa.mp3

An activation function is a function that is incorporated into an artificial neural network to help the network learn complicated patterns in the input data. By analogy with a neuron in the human brain, the activation function takes the neuron’s combined input and determines what signal should be sent on to the following neuron.
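
As a small illustration, the sketch below defines two widely used activation functions, sigmoid and ReLU, and applies one to the weighted input of a single artificial neuron. The weights, bias and inputs are arbitrary values chosen for the example.

```python
import math

def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive values through and blocks negative ones."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs followed by the activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical inputs and parameters; the weighted sum here is -1.0.
inputs, weights, bias = [1.0, 2.0], [0.5, -1.0], 0.5
print(neuron(inputs, weights, bias, relu))     # ReLU blocks the negative sum
print(neuron(inputs, weights, bias, sigmoid))  # sigmoid maps it below 0.5
```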

Question 93: How to detect if the time series data is stationary?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_93_How_to_detect_if_.mp3

Time series data is considered stationary when its mean and variance are constant over time. If neither the mean nor the variance changes over a period of time in the dataset, then we can conclude that, for that period, the data is stationary.
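
One informal way to apply this definition is to split the series into consecutive windows and compare their means and variances. The sketch below does that on two made-up series. It is only a rough check under these assumptions; a formal statistical test such as the Augmented Dickey-Fuller test is the more usual choice in practice.

```python
import statistics

def window_stats(series, window):
    """Mean and variance of consecutive, non-overlapping windows."""
    stats = []
    for start in range(0, len(series) - window + 1, window):
        chunk = series[start:start + window]
        stats.append((statistics.fmean(chunk), statistics.pvariance(chunk)))
    return stats

# Hypothetical series: one fluctuating around a fixed level, one trending up.
flat = [10, 12, 11, 9] * 6
trending = [t + v for t, v in enumerate(flat)]

print(window_stats(flat, 8))      # same mean and variance in every window
print(window_stats(trending, 8))  # the mean keeps rising: not stationary
```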

Question 94: What happens when some of the assumptions required for linear regression are violated?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_94_What_happens_when.mp3

These assumptions may be violated lightly (i.e., some minor violations) or strongly (i.e., the majority of the data violates them). The two kinds of violation affect a linear regression model differently.

Strong violations of these assumptions make the results unreliable to the point of being unusable. Light violations of these assumptions give the results greater bias or variance.

Question 95: How to deal with unbalanced binary classification?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_95_How_to_deal_with_.mp3

The following approaches help in dealing with unbalanced binary classification:

  • Use other metrics to evaluate the model’s performance, such as precision/recall, F1 score, etc.
  • Re-sample the data using strategies such as undersampling (decreasing the sample size of the bigger class), oversampling (raising the sample size of the smaller class using repetition, SMOTE, and other similar strategies), and so on.
  • Use k-fold cross-validation.
  • Use ensemble learning such that each decision tree only takes into account a portion of the bigger class and the complete sample of the smaller class.
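
The first point can be illustrated with a short sketch. The labels are hypothetical: 95 negatives and 5 positives, scored against a "model" that always predicts the majority class. Accuracy looks excellent while precision, recall and F1 expose the problem.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical unbalanced data and a model that always predicts class 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("precision, recall, F1:", precision_recall_f1(y_true, y_pred))
```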

Question 96: How can outlier values be treated?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_96_How_can_outlier_v.mp3

Outlier values can be detected with univariate or any other graphical analysis technique. If there are only a few outlier values, each one may be evaluated separately, but if there are many, the values can be replaced with either the 99th or the 1st percentile values.

Not every extreme value is an outlier value. The most typical methods for handling outlier values are:

  • Adjust the value such that it is inside a certain range.
  • Just eliminate the value.
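
Capping values at the 1st and 99th percentiles can be sketched as follows. The nearest-rank percentile used here is one of several common definitions and is an assumption of this example, as are the function names and the data.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of already sorted values (pct is an integer)."""
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Clamp every value into the [lower_pct, upper_pct] percentile range."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]

# Hypothetical data: 99 ordinary values and one extreme value.
data = list(range(1, 100)) + [10_000]
capped = cap_outliers(data)
print(max(data), "->", max(capped))
```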

Question 97: Which cross-validation method would you use for a batch of time series data?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_97_Which_crossvalid.mp3

A time series is fundamentally ordered chronologically and is not made up of randomly distributed data, so standard k-fold cross-validation is not appropriate. Instead, use approaches like forward-chaining, where you fit the model on past data and then evaluate it on the data that follows.
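
A minimal sketch of forward-chaining splits, assuming the observations are already sorted by time and identified by their index. The function name and the way leftover samples are assigned to the last fold are illustrative choices.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) with training always before testing."""
    fold_size = n_samples // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_end = i * fold_size
        # The final fold absorbs any samples left over by the integer division.
        test_end = train_end + fold_size if i < n_folds else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(forward_chaining_splits(n_samples=10, n_folds=4))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each fold trains on a longer prefix of the series and tests on the block that immediately follows, so the model never sees the future.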

Question 98: What is the difference between Point Estimates and Confidence Interval?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_98_What_is_the_diffe.mp3

Point Estimates: A point estimate is a single number that serves as an estimate of a population parameter. Maximum likelihood estimation and the method of moments are two common techniques for deriving point estimators of population parameters.

Confidence Interval: A confidence interval provides a range of values that is likely to contain the population parameter. It also indicates how likely it is that the interval contains the parameter. This likelihood is represented by the confidence coefficient (or confidence level), which is written as 1-alpha, where alpha is the significance level.
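
The difference can be sketched for the mean of a sample. This example assumes a normal approximation (a z-based interval), which is a simplification: for small samples a t-based interval would usually be preferred. The sample values are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Point estimate of the mean plus a normal-approximation interval."""
    n = len(sample)
    point_estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    alpha = 1 - confidence                       # significance level
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * std_error
    return point_estimate, (point_estimate - margin, point_estimate + margin)

# Hypothetical measurements, for illustration only.
sample = [52, 48, 50, 51, 49, 53, 47, 50, 52, 48]

estimate, (low, high) = mean_confidence_interval(sample, confidence=0.95)
print(f"point estimate: {estimate}, 95% interval: ({low:.2f}, {high:.2f})")
```

Raising the confidence level (lowering alpha) widens the interval around the same point estimate.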

Question 99: Define quality assurance and six sigma?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_99_Define_quality_as.mp3

Quality assurance: an activity or set of activities focused on maintaining a desired level of quality by minimizing mistakes and defects.

Six sigma: a specific type of quality assurance methodology composed of a set of techniques and tools for process improvement. A six sigma process is one in which 99.99966% of all outcomes are free of defects, i.e., about 3.4 defects per million opportunities.

Question 100: What do you understand by feature vectors?

Answer:

https://www.synergisticit.com/wp-content/uploads/2023/05/Question_100_What_do_you_unde.mp3

A feature vector is the set of variables whose values describe the characteristics of a single observation in a dataset. Feature vectors serve as the input vectors to a machine learning model.

The post Data Science Interview Questions- Part 5 appeared first on SynergisticIT.



