Ans: Supervised learning means learning from examples. We have input data along with the corresponding output (labels), and we train a model to predict that output from the feature columns we already have.
Ans: EDA (Exploratory Data Analysis) is the process of summarising and visualising a dataset. We perform different analyses and plots to understand the data's structure and draw insightful results from it before modelling.
Ans: Linear Regression
Reasons :
- Simple
- Works as a strong baseline for many data problems
- Can be enhanced by feature engineering and tuning
Ans: Logistic regression is used when the target variable is categorical rather than continuous, e.g. email spam detection (a yes/no type of problem).
It models the probability of an event occurring rather than predicting a continuous value.
Ans : .describe()
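As a quick illustration of `.describe()` on a made-up DataFrame (the column names here are hypothetical, not from the original question):

```python
import pandas as pd

# Hypothetical example dataset
df = pd.DataFrame({"age": [23, 31, 45, 27, 38],
                   "salary": [40, 55, 90, 48, 72]})

# .describe() summarises each numeric column: count, mean, std,
# min, the 25%/50%/75% quartiles, and max.
summary = df.describe()
print(summary.loc["mean", "age"])  # mean of the age column -> 32.8
```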
Ans: The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one in your data.
A small p-value (typically below 0.05) is evidence against the null hypothesis; a large p-value means the data are consistent with the null.
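A minimal sketch of the idea, computing a one-sided exact binomial p-value by hand (the coin-toss numbers are made up for illustration):

```python
from math import comb

# Null hypothesis: the coin is fair (p = 0.5).
# Hypothetical observation: 60 heads in 100 tosses.
n, heads, p = 100, 60, 0.5

# p-value = P(X >= 60 | fair coin): the probability of a result at
# least as extreme as the one observed, assuming the null is true.
p_value = sum(comb(n, k) * p**k * (1 - p)**(n - k)
              for k in range(heads, n + 1))
print(round(p_value, 4))  # ~0.028 - below 0.05, so we would reject the null
```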
Ans: A confusion matrix is used for classification problems. It tabulates predicted classes against actual classes, and the metrics derived from it (accuracy, precision, recall, etc.) help us understand how well the model generalises to the test data.
Ans : Hypothesis testing involves a null hypothesis and an alternative hypothesis. We make a statement and assume it to be the null hypothesis, and then, after statistical inference, we either fail to reject the null hypothesis or reject it in favour of the alternative.
Ans : Imputation is the way to replace missing and null values with values suitable for the data sample. For nearly normally distributed data, the mean is a reasonable choice; for skewed or non-Gaussian data, the median is generally preferred because it is robust to outliers. For categorical data, the mode is the usual choice.
Sometimes the right imputation strategy also depends on the dataset and domain knowledge.
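A small sketch of mean versus median imputation with pandas, on a made-up series containing a skewing outlier:

```python
import pandas as pd

# Hypothetical column with missing values; 95 is a skewing outlier.
s = pd.Series([10, 12, 11, None, 95, 13, None])

# Mean imputation suits roughly normal data; median imputation is
# more robust when the data are skewed or contain outliers.
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

print(s.mean())    # 28.2 - pulled up by the outlier
print(s.median())  # 12.0 - robust central value
```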
Ans: Sometimes, some data points are either too small or too big from the normal data points available in the dataset. These are called outliers
Ans : In simple words, precision tells you, out of everything you predicted as positive, how often it was actually positive, and recall tells you, out of the actual positive data, how often you predicted it correctly.
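In code, with hypothetical confusion-matrix counts:

```python
# Hypothetical counts: true positives, false positives, false negatives.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much we caught

print(precision)  # 0.8
print(recall)     # 0.666...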
Ans : k-NN (k-nearest neighbours) is a supervised algorithm in which we choose a constant k and classify a data point by a majority vote among its k nearest neighbours in the training data.
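A minimal sketch of k-NN classification on made-up 2-D points (the data and helper name are illustrative only):

```python
from collections import Counter
from math import dist

# Hypothetical labelled training points: (coordinates, class label).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"),
         ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]

def knn_predict(point, k=3):
    # Sort training points by distance to the query point, take the
    # k nearest, and vote by majority class label.
    nearest = sorted(train, key=lambda t: dist(t[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((5.0, 4.8)))  # "B" - surrounded by class B points
```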
Ans : The bias-variance trade-off is like a see-saw: reducing one tends to increase the other. When a model is simple and has few features, it shows high bias and low variance; when a model is too complex with too many features, it shows low bias and high variance. We tune model complexity to balance the two.
Ans : Underfitting is when a model is too simple to capture the pattern in the data and predicts poorly even on the data it was trained on, whereas overfitting is when a model tries to fit every single data point (including noise) and cannot be generalised, so it fails on new, unseen data.
Ans: Box plot
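A box plot flags outliers using the IQR rule, which can be sketched by hand on made-up data:

```python
import statistics

# Hypothetical data; 95 is an obvious outlier.
data = [10, 12, 11, 13, 12, 14, 11, 95]

q1, _, q3 = statistics.quantiles(data, n=4)  # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the box plot "whisker" limits

outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [95]
```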
- NumPy uses vectorized operations instead of explicit Python loops
- NumPy is written in C, which runs behind the scenes and makes it faster
- NumPy arrays are more compact than lists, i.e. they take less storage than lists
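The vectorization point can be illustrated with the same element-wise operation written both ways:

```python
import numpy as np

# A Python list needs an interpreted loop (or comprehension) per element.
xs = list(range(5))
list_result = [x * 2 for x in xs]

# A NumPy array does the whole operation in one vectorized C call.
arr = np.arange(5)
numpy_result = arr * 2

print(list_result)            # [0, 2, 4, 6, 8]
print(numpy_result.tolist())  # [0, 2, 4, 6, 8]
```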
Yes, it is possible using the indexers .loc[] and .iloc[], where .loc[] is used for label-based indexing (the index and column names) and .iloc[] is used for integer-position-based indexing.
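A short sketch of the difference, on a hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical frame with string labels as the index.
df = pd.DataFrame({"score": [85, 90, 78]}, index=["alice", "bob", "carol"])

print(df.loc["bob", "score"])  # label-based lookup -> 90
print(df.iloc[1, 0])           # integer-position lookup (row 1, col 0) -> 90
```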
A dataset or data frame is still valuable if the null values present are less than roughly 30-40%; if the null values exceed 50%, we don't get much valuable information from it. It is good to add information to a dataset but bad to exaggerate it, so it is often better to leave missing values as they are and continue with the analysis rather than manipulate the available information.
There are different types of joins between datasets: inner join, outer join, left join, and right join. For merging datasets, a full outer join is often preferred because it keeps the rows common to both datasets along with the rows that are unique to each.
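A minimal example of a full outer join with pandas (the frames and key name are made up):

```python
import pandas as pd

# Hypothetical frames sharing the key "id"; ids 2 and 3 are common,
# id 1 is unique to the left frame and id 4 to the right.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [70, 80, 90]})

# how="outer" keeps common rows plus rows unique to either side,
# filling the missing side with NaN.
merged = pd.merge(left, right, on="id", how="outer")
print(len(merged))  # 4 rows: ids 1, 2, 3, 4
```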
An ordered categorical variable has some kind of ordering or hierarchy in its values, like low/high salary bands or the months of the year, but an unordered categorical variable has no notion of high or low, e.g. colours or types of loans.
The median is often preferred over the mean to describe the characteristics of a population because the mean aggregates the total and describes how it would look if evenly distributed, and it is strongly affected by outliers, so for skewed data it may not reflect the typical value; the median is robust to such outliers.
Correlation measures both the strength and direction of the relationship between two variables on a standardised scale from -1 to 1, whereas the magnitude of covariance depends on the units of the variables, so only its sign (the direction) is easy to interpret. For this reason, correlation is usually preferred.
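The scale-dependence of covariance can be seen on made-up data:

```python
import numpy as np

# Hypothetical perfectly linearly related variables.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

print(np.cov(x, y)[0, 1])        # covariance in the data's own units
print(np.cov(x, 100 * y)[0, 1])  # 100x larger just from rescaling y
print(np.corrcoef(x, y)[0, 1])   # correlation: 1.0 regardless of scale
```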
An event that is affected by previous events, i.e. by events that have already happened.
If there are ‘n’ people in a group then how many a. total number of handshakes b. unique handshakes are possible
1. a – n(n-1), since each of the n people shakes hands with the other n-1 (each handshake counted twice) 2. b – n(n-1)/2, since each handshake is shared by two people
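A quick check of the two formulas for an example n (the value 5 is arbitrary):

```python
from math import comb

n = 5
ordered = n * (n - 1)      # ordered pairs: each handshake counted twice
unique = n * (n - 1) // 2  # unique handshakes

print(ordered, unique)  # 20 10
print(comb(n, 2))       # C(n, 2) agrees with the unique count: 10
```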
- Tossing a coin ‘n’ number of times
- Asking n randomly selected people whether they are older than 30 years.
- Drawing 3 red balls from a bag, putting each one back after drawing it.
No, both are not the same: if there is not sufficient evidence to support the alternative hypothesis, it means we fail to reject the null hypothesis, but it does not mean that we have accepted the null hypothesis.
For a given dataset, if we already know what the correct output should look like, then it is supervised ML, but in the case of unsupervised ML we have no labelled output to learn from.
RSS stands for Residual Sum of Squares; it is the sum of the squared residuals, i.e. the squared differences between the observed values and the values predicted by the model.
The R² score mainly tells you how well the line fits the given set of data; for example, if the R² score is 0.73, it means the model explains or covers 73% of the variance present in the given set of data.
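The score can be computed as 1 - RSS/TSS; a small sketch on made-up values:

```python
# Hypothetical observed values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.2, 6.9, 9.1]

mean_y = sum(y_true) / len(y_true)
rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
tss = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares

r2 = 1 - rss / tss
print(round(r2, 4))  # 0.995 - predictions explain most of the variance
```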
The post Data Science with Python Interview Questions and Answers appeared first on Besant Technologies.