
Ok, so you’ve reached the stage where you think you have a good grasp of the subject and now want to look further down the road? Well, that was my reason for first googling ‘Machine Learning interview questions’. Maybe it was yours too when you googled “facebook data scientist interview questions”, “amazon machine learning interview”, “machine learning quiz”, “interview questions for data scientist”, blah, blah and blah. One thing is for sure: if we’re googling this, we’re most likely sort of in love with Machine Learning by now. Haha, don’t know about you but I AM.

Enough of chitchat, let’s get down to business. Let’s get down to Machine Learning interview questions! I curated this list of sample interview answers by googling a lot! And with a lot, I mean A LOT. Plus, I also cross-checked the questions with some of the professionals working in this field, and trust me, I gained so much knowledge just by writing this article, and why not? Learning by asking questions is my favourite way of learning, but for that, you also need the art of asking questions (it’s a golden skill). I hope you also experience the same and get what you were looking for. (And do not worry, I’m not gonna answer the age old question : "Different types of Machine Learning!" EVERYBODY KNOWS THAT!)

To not go overboard with ‘Structural Representation’, I kept it simple and divided the Machine Learning interview questions into 3 levels of difficulty (had some fun with the names though):

I’m just a noobie
I’ve made 10 projects you silly
I publish one research paper a day, dumbass.

(Psst….I also have a bonus question in the end for you. Don’t tell anyone ok?)

Machine Learning interview questions

a. I’m just a noobie (Easy)

Q1. What are the different types of Machine Learning?

Haha …..just kidding!! Got you :p

Real Q1. How does Deep Learning differ from Machine Learning?

It’s like asking how a potato is different from vegetables! Exactly, they aren’t different. Deep Learning is just a subset of Machine Learning: the part that uses Neural Networks with many layers.
Both kinds of model get progressively better at whatever they are being trained on, and both do it the same way, through an optimization algorithm that reduces the error on the training data. The real difference is where the features come from. With a classical Machine Learning model, an engineer usually has to step in and hand-craft the features, and tinker with them again when the model starts giving wrong predictions (high error).
A Deep Learning model, on the other hand, learns its own features directly from the raw data through the layers of its Neural Network. It still needs humans to pick the architecture and the hyperparameters, though.
The catch is that a Deep Learning model usually needs a lot more data and a lot more training to outperform a classical Machine Learning model. But given the right ingredients, it can outperform them like crazy!

For more detailed explanation, just go and read (after this though, promise?)
“Machine Learning Deep Learning - Comparison in 2020 (Updated!)”

Q2. Which is more important to you– model accuracy, or model performance?

First of all, model accuracy is just one “evaluation metric” among the many that make up model performance.
Model accuracy is nothing but the percentage of predictions your model got right out of the total number of predictions.
There are tons of other metrics too (precision, recall, F1-score and so on), and which one to use depends on things like the type of problem, the requirements and the cost of each kind of mistake.
Lastly, since accuracy is only one part of model performance, a high accuracy does not guarantee a good model. On an imbalanced dataset, a model that always predicts the majority class can score a very high accuracy and still be useless, so overall model performance is what matters more to me.
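To see why accuracy alone can mislead, here is a minimal sketch on made-up data (the labels and the two helper functions are my own illustration, not from any library): 5 positives out of 100 samples, and a lazy model that always answers "negative".

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the actual positives that the model found."""
    found = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return found / actual if actual else 0.0


# Hypothetical imbalanced dataset: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

95% accuracy, and yet the model never catches a single positive case.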

Q3. Explain Classification and Regression

Classification and Regression are the two categories of Supervised Learning. The only difference between them is the type of question asked. Confused? Ok wait…

Classification:

As the name suggests, it CLASSIFies things.
It basically is the discrete answer to anything. Male/Female, Buy/Don’t Buy, 0/1
Understand it like this. ‘Will it rain today?’ → ‘Yes’; ‘Is he old?’ → ‘No’

Regression:

As the name suggests, it REGRESS….ok wait, this won’t work with Regression.
It basically is the type of problem where the answers cannot be discrete but continuous. Age, Price
Understand it like this. ‘How much will it rain today?’ → ‘2mm’; ‘How old is he?’ → ‘22’

This is how you can frame the same question as a Classification or a Regression problem. This is generally what I did to understand the concept.

Q4. What is the Training/Test split?

The first thing we learn in Machine Learning is the good old rule of thumb of splitting the dataset into 80% Training and 20% Testing sets. And as the legend says, it works ...most of the time.
The thing is, it totally depends on our dataset volume. For example:

Assume that we have a dataset with 1 million samples. According to the rule of thumb, 20% of our dataset, a freaking 200,000 samples, would be allocated to the test set. That is wasteful. With such a huge dataset, we can easily allocate just 1% to the test set, which is still 10,000 samples (which sounds just about right). The evaluation won’t be hampered; if anything, the model will perform better because it gets more data to train on. And do you realize what this move did? It changed the split to 99:1!
Again, let’s assume that we have a small dataset with just 500 samples. In that case, we might have to allocate a little more than 20%, let’s say 30% of our dataset (150 samples), to the test set, just so that the model evaluation is done properly. Now that makes it a 70:30 split!
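The two splits above can be sketched in a few lines. `split_dataset` is a hypothetical helper written for this example (in practice you would probably reach for a library function), and the datasets are just ranges of integers standing in for samples.

```python
import random


def split_dataset(samples, test_fraction, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Huge dataset: 1% is already plenty for testing (99:1 split).
big_train, big_test = split_dataset(range(1_000_000), 0.01)
# Small dataset: give the test set a bigger share (70:30 split).
small_train, small_test = split_dataset(range(500), 0.30)

print(len(big_train), len(big_test))      # 990000 10000
print(len(small_train), len(small_test))  # 350 150
```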

b. I’ve made 10 projects you silly (Medium)

Q5. What is a Confusion Matrix?

For that, we need to understand 4 different types of answers:

Ok, so you’re in love. And you, hopeless lover, send a love letter to your girl and ask your best friend, Trump, to go and find out the reply. He comes back and spits out:

“Yeah Man, you got it!” (Positive answer)
Listening to this, you run to her and hug her….

She hugs you back. No confusion. Trump was right. → True Positive
She slaps you. Trump was wrong. → False Positive

“Sorry buddy, we’ll find someone better for you don’t worry..” (Negative answer)
Listening to this, you run back to home and cry in your pillow…

She kept waiting for you and cried when you didn’t come. Trump was wrong. → False Negative
No confusion, everybody is clear. Trump was right. → True Negative

And THAT is what a Confusion Matrix is made up of. It contains the counts of True Positives, True Negatives, False Positives and False Negatives.

(Sorry for the weird analogy!)

Confusion Matrix

This matrix is very useful for analyzing/evaluating the predictions and making further changes accordingly.
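Counting the four kinds of answers takes only a loop. This is a small sketch with invented data: `she_said_yes` plays the true labels and `trump_said_yes` plays the predictions (both lists and the function name are assumptions made for the example).

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            tp += 1  # said yes, and it was a yes
        elif predicted and not actual:
            fp += 1  # said yes, but it was a no
        elif not predicted and actual:
            fn += 1  # said no, but it was a yes
        else:
            tn += 1  # said no, and it was a no
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


she_said_yes = [1, 0, 1, 0, 1, 0]    # what really happened
trump_said_yes = [1, 1, 0, 0, 1, 0]  # what your friend reported

print(confusion_counts(she_said_yes, trump_said_yes))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```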

Q6. What’s the difference between Type I and Type II error?

This looks like WHAT THE HECK!? But trust me, it’s a lot easier than it sounds. Remember the False Positives and False Negatives we discussed earlier?

TYPE I is just another name for False Positives, and
TYPE II is just another name for False Negatives.

A very cool example I read (source: springboard.com), which I always refer to when I get confused (quoting it completely here):

“A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a baby.”

Q7. What are Bias and Variance?

These twins are just two more terms in the Machine Learning family. But do not overlook them, they’re really important.

Bias is the difference between the model’s average prediction and the original label. High bias tells us that our lovely model isn’t accurate, usually because it is too simple (it underfits).
Variance, on the other hand, is how much the predictions change when the model is trained on different training sets. High variance shows fluctuations, i.e., that the model is not stable (it overfits).

Q8. What do you understand by Precision and Recall?

Did you notice one thing? The Machine Learning family contains a lot of twins! Type I/Type II, Bias/Variance, Regression/Classification. And here it is again, yet another twin ...Precision and Recall.

In the simplest words, Recall is the share of the actual positives that your model managed to find, i.e., True Positives / (True Positives + False Negatives). Note that this is not the same thing as accuracy.
Whereas Precision is the ratio of the number of events you recalled correctly to the total number of events you recalled (a mix of correct and wrong recalls), i.e., True Positives / (True Positives + False Positives).

Precision and Recall
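As a quick worked example with made-up counts: say the model flagged 10 events as positive, 8 of them correctly (TP = 8, FP = 2), and it missed another 8 real positives (FN = 8). The function names below are my own.

```python
def precision(tp, fp):
    """Of everything the model flagged as positive, how much was right?"""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that was really positive, how much did the model find?"""
    return tp / (tp + fn)


print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=8))     # 0.5
```

So the model is right 80% of the time when it says "positive", but it only finds half of the real positives.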

Q9. How is KNN different from K-means clustering?

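KNN stands for K-Nearest Neighbours, whereas K-means clustering stands for, well, K-means clustering. The only thing common between them is the letter ‘K’, and that creates all the confusion. KNN is a Supervised Learning algorithm: it needs labelled data and predicts the label of a new point from the labels of its K closest neighbours. K-means is an Unsupervised Learning algorithm: it needs no labels and simply groups the data into K clusters around K centroids. So even the ‘K’ doesn’t mean the same thing: it is the number of neighbours in KNN and the number of clusters in K-means.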
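Here is a bare-bones sketch of both on 1-D toy data, just to show that KNN uses labels while K-means never sees any. Everything here (the data, the function names, the starting centroids) is invented for illustration and is far simpler than a real implementation.

```python
from collections import Counter


def knn_predict(points, labels, query, k=3):
    """Supervised: vote among the labels of the k nearest labelled points."""
    nearest = sorted(range(len(points)), key=lambda i: abs(points[i] - query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


def kmeans(points, centroids, n_iter=10):
    """Unsupervised: no labels, just move the k centroids to their cluster means."""
    centroids = list(centroids)
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            closest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[closest].append(p)
        # An empty cluster keeps its old centroid.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["small", "small", "small", "big", "big", "big"]

print(knn_predict(points, labels, query=9.0))  # big
print(kmeans(points, centroids=[0.0, 5.0]))    # [2.0, 11.0]
```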

Q10. What is the "Curse of Dimensionality?"

You lost your car keys. And you’ve to go home as soon as possible for the delicious dinner. White sauce pasta, beer, mozzarella cheese ‘n’ corn pizza and some chicken wings oh my god! Wait, let’s come back to Machine Learning interview questions now (*sigh), now you’ve to search for your keys asap or else the food will get cold. Now just imagine,

I say, “Ok, so this is a line, and you have to search for the keys on this line only!” Won’t that be super-duper easy? Just keep following the line.
I say, “Ok, so I think you must have lost them on the third floor.” Now the work got tedious, right? You now have to search for the small keys on the whole floor.
I say, “Ok man, you could’ve lost them anywhere in this building. Good luck!” Now you really need that ‘Good Luck’. Now the job really got heated up, ’cause you’ll have to search for the keys up and down the floors! You gotta turn on the Mr. Sherlock mode now.

Did you understand? How difficult it gets to reach the goal as the dimension keeps on incrementing? Line→ 1D, Floor→ 2D, Building→ 3D

The same happens in a model. The more features there are, the bigger the space the model has to search and the tougher it gets to minimize the loss. A good number of features is always recommended, but too much of anything ain’t good either!
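A rough way to put numbers on the key hunt: assume (purely for illustration) that you chop every dimension into 10 spots to check. The number of spots then explodes with the number of dimensions.

```python
def cells_to_search(cells_per_axis, dimensions):
    """Number of spots to check when every axis is cut into the same number of cells."""
    return cells_per_axis ** dimensions


for dims, place in [(1, "line"), (2, "floor"), (3, "building"), (10, "10 features")]:
    print(f"{place}: {cells_to_search(10, dims)} spots")
# line: 10 spots
# floor: 100 spots
# building: 1000 spots
# 10 features: 10000000000 spots
```

With the same amount of data, every extra feature leaves the space emptier, which is exactly why the search gets harder.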

Q11. Explain the Bias-Variance Tradeoff.

Predictive models have a tradeoff between bias (how far the model’s predictions are from the truth on average) and variance (how much the model changes when the training data changes).
Simpler models are stable (low variance) but they don't get close to the truth (high bias).
More complex models are more prone to being overfit (high variance) but they are expressive enough to get close to the truth (low bias).
The best model for a given problem usually lies somewhere in the middle.
We need to aim for Low Bias + Low Variance to build a good model.

Bias-Variance Tradeoff
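For the mathematically inclined, the tradeoff is usually written as a decomposition of the expected squared error. The symbols here are my own assumptions, not from the article: the data follows y = f(x) + noise, the noise has zero mean and variance sigma squared, f-hat is the trained model, and the expectation runs over different training sets and the noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The noise term cannot be reduced by any model, so all we can do is trade the first two terms against each other.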

c. I publish one research paper a day, you dumbass. (I’ve to tell?)

Q12. Is it better to have too many false positives or too many false negatives? Explain.

Well, there is no definitive answer for that. Actually, it depends on the question as well as on the domain for which we are brainstorming to solve the problem. If you’re using Machine Learning in the Healthcare/Medicine domain, then a false negative is super risky, since the report will not show any health problem when a person is actually unwell, and c’mon, who wants to play with people’s lives when it comes to it? It’s better to falsely accuse someone of a disease rather than risking it otherwise. Similarly, if Machine Learning is used in spam detection, then a false positive is very risky because the algorithm may classify an important email as spam. Just imagine a Google job offer email rotting in your spam for a year! (Felt it?)
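In practice you often get to choose which mistake to make more of by moving the decision threshold. A small sketch with invented scores and labels (a score at or above the threshold counts as a "positive" prediction):

```python
def count_errors(scores, labels, threshold):
    """Return (false positives, false negatives) at a given decision threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


# Hypothetical model scores and the true labels that go with them.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.2, 0.5, 0.8):
    print(threshold, count_errors(scores, labels, threshold))
# 0.2 (2, 0)
# 0.5 (1, 1)
# 0.8 (0, 2)
```

A medical screening model would lean towards the low threshold (no missed patients), a spam filter towards the high one (no lost job offers).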

Q13. Explain the Ensemble learning technique in Machine Learning.

Now we’re talking about the big stuff! A good Ensemble learning technique can take your models to heights they have never seen before! (Believe me, I’ve experienced it myself.)

Ensemble learning is a technique in which multiple Machine Learning models are created and then merged together to produce more accurate results. Isn’t that just great? So simple yet so powerful. A general Machine Learning model is built by using the entire training data set. However, in one popular kind of Ensemble Learning (bagging), multiple subsets are sampled from the training data set and each subset is used to build a separate model. After the models are trained, their predictions are combined (by voting or averaging) in such a way that the variance in the output is reduced.
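Why does merging help? A back-of-the-envelope sketch, under the big assumption that the models make their mistakes independently of each other (real models never fully do): the chance that a majority vote is right can be computed with the binomial formula. The function name is my own.

```python
from math import comb


def majority_vote_accuracy(n_models, p_correct):
    """Probability that more than half of n independent models are right.

    Assumes an odd number of models, each correct with probability p_correct.
    """
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


print(round(majority_vote_accuracy(1, 0.7), 3))  # 0.7
print(round(majority_vote_accuracy(3, 0.7), 3))  # 0.784
```

Three so-so models voting together already beat any one of them alone, and the gain grows with more models as long as their errors stay reasonably independent.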

Q14. Explain Principal Component Analysis (PCA).

Remember the ‘Curse of dimensionality’ we talked about before? Well, what can you do about it?

Principal Component Analysis Example

PCA to the rescue.

PCA is a part of Feature Engineering wherein we try to reduce the number of features while losing as little information as possible, by combining them into uncorrelated linear combinations.
These new features (also called principal components) sequentially maximize the variance represented (i.e. the first principal component has the most variance, the second principal component has the second most, and so on).
As a result, PCA is useful for dimensionality reduction because you can set an arbitrary variance cutoff and keep only the first few components needed to reach it.
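To make the "variance cutoff" idea concrete, here is a tiny sketch for just two features, where the eigenvalues of the 2x2 covariance matrix have a closed form. The data and the function name are invented; for real work you would use a linear algebra library instead of this hand-rolled version.

```python
from math import sqrt


def pca_2d_explained_variance(xs, ys):
    """Share of the total variance captured by each of the two principal components."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix: the variances along the components.
    centre = (sxx + syy) / 2
    spread = sqrt(((sxx - syy) / 2) ** 2 + sxy**2)
    first, second = centre + spread, centre - spread
    total = first + second
    return first / total, second / total


# Two strongly correlated (hypothetical) features: y is roughly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

first_share, second_share = pca_2d_explained_variance(xs, ys)
print(first_share, second_share)
```

Here the first component carries almost all of the variance, so with a cutoff of, say, 95% you could drop the second one and go from two features to one.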

Q15. Explain the difference between L1 and L2 regularization.

Before that, why do we even need regularization?

“To avoid overfitting”

How to prevent overfitting:
Both L1 and L2 regularization help prevent overfitting by imposing a penalty that shrinks the coefficients.

Difference between L1 and L2:
L2 (Ridge) penalizes the sum of the squared coefficients, so it shrinks all the coefficients towards zero but eliminates none, while L1 (Lasso) penalizes the sum of their absolute values and can shrink some coefficients exactly to zero, performing variable selection.

Which to use?
If all the features are correlated with the label, ridge tends to outperform lasso, as the coefficients are never zero in ridge. If only a subset of the features is correlated with the label, lasso tends to outperform ridge, as in the lasso model some coefficients can be shrunk to zero.
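The difference is easiest to see in the simplest possible setting: a single coefficient w whose unpenalized best value is known. Under that simplifying assumption (my own, for illustration), minimizing 0.5 * (w - w_best)**2 plus the penalty has a closed-form answer: the 0.5 * lam * w**2 penalty (ridge) rescales the coefficient, while the lam * |w| penalty (lasso) subtracts a fixed amount and snaps small coefficients to zero.

```python
def ridge_shrink(w_best, lam):
    """Minimiser of 0.5*(w - w_best)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return w_best / (1.0 + lam)


def lasso_shrink(w_best, lam):
    """Minimiser of 0.5*(w - w_best)**2 + lam*abs(w): soft thresholding."""
    if abs(w_best) <= lam:
        return 0.0  # small coefficients are eliminated entirely
    return w_best - lam if w_best > 0 else w_best + lam


coefficients = [3.0, -0.3, 0.2]  # hypothetical unpenalized coefficients
print([ridge_shrink(w, 0.5) for w in coefficients])  # all smaller, none exactly zero
print([lasso_shrink(w, 0.5) for w in coefficients])  # [2.5, 0.0, 0.0]
```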

Bonus Machine learning interview questions time..!

QSuper. What’s your favorite algorithm, and can you explain it to me in less than a minute?

Trust me when I say this, I’ve heard this question almost everywhere. And this, my friend, is not a question that I can hand you the answer to. It totally depends on your perspective and liking. The only motive for putting this question in this list of machine learning interview questions is to ask you to learn at least one algorithm inside out. It doesn’t matter if it is as simple as Linear Regression; the fact that you know it best is what matters. Knowledge is essential.

(My personal favourite is Random Forest just if you wanna know!)

So, these were some top Machine Learning interview questions about learning ...duh! Machine Learning obviously!
But do not limit yourself! There is a lot to learn out there. Many other important topics which I just couldn't include in my article. The possibilities are endless!


Transfer learning...? You’ve only heard about it right? You’ve seen “transfer learning deep learning”, “transfer learning tensorflow”, “transfer learning tutorial keras” all over the internet but don’t have a clear picture. Welcome.

Ever created a Deep Neural Network (or specifically, a Convolutional Neural Network)? It’s a complex and time-consuming process, right? And because of that, ever wondered how things are done in real-life scenarios, where high accuracy is required with low computational resources and in less time?

If you’ve ever created a deep learning model, you know how long it takes to train it on a GPU let alone CPU! Good Deep learning models are computationally expensive and take a lot of time. I personally trained a DL model for 4 days straight on my laptop with GPU, the craziest part is, it still wasn’t trained completely.

So the question is, how do we train Big Deep Learning models on our petty and cute machines? The answer is simple, Transfer Learning.

Again, let’s see what Miss Wikipedia has to say,

Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.

Well, if you’ve been reading my blogs, you know the drill. Let’s make this definition simpler to understand.

We, humans, have come up with ways to transfer our knowledge between tasks. We tend to recognize and apply knowledge gained from the previous experience/task to a similar kind of task that we encounter. The more the similarity in the two tasks, the more and better knowledge we can apply. Isn’t that just true?

For example, if you know how to ride roller-skates, it gets easier to learn to skateboard. If you know how to ride a bicycle, it gets easier to learn to ride a motorbike. And for all the techies out there, if you know one programming language, it gets easier to learn a completely new one with completely different syntax! I personally experienced that ;)

What do we observe in the above scenarios? C’mon, we’re machine learners here, you gotta find the pattern… we don’t start learning new things completely from scratch. We already have some basic knowledge of how to perform the new task because we have experience of learning a similar task before. This saves us a lot of time and energy.

The same goes for Transfer Learning in Deep Learning.

If you ask me to explain Transfer Learning in one word, I would say it means ‘a head start’.

Traditionally, what do we do exactly?

We create a Network Structure.
We assign some random parameters (weights and biases).
We iterate and optimize those parameters to increase accuracy.

So tell me, what process takes time? And computational resources? Yes, optimization!

What if I tell you, you don’t have to optimize any parameter, in fact, you don’t even have to create any model or … wait for it …basically perform any of the above steps!

Imagine a vast, high performing model is made for you, the parameters are already optimized for you and that too using big datasets, you just have to use it “right away”! Well, this is what I meant by ‘Head Start’. Well, not “right away”, we still need to tune a few things.

And that is the power of TRANSFER LEARNING.

People/Researchers have created several Networks that are (generally) very vast and are trained on various huge datasets like ImageNet. The result is that these networks have learned very complex features, their parameters are very highly optimized, and because they were trained on huge datasets, they generalize more flexibly. Basically, the same parameters can be used again on a similar kind of dataset because the features will be somewhat the same. Won’t they?

How to use Transfer Learning?

There are multiple ways you can use Transfer Learning. It basically depends on the size of your data, that’s all.

1. Short On Data?

Transfer Learning got you covered. Do not worry.

If you don’t have a vast amount of data, what you can do is leave most of the layers of the model as they are and tune only the last layers according to your needs, i.e., freeze the early layers and train only the last few. This helps keep your model from overfitting, and most of the pre-trained parameters get reused. And Transfer learning is mostly used like this only.

2. Oh, so your grandpa left you the Data?

Well if you’re rich with data, there are again two ways you can go with!

Re-train from Scratch: This, as the name suggests, is nothing but retraining the whole model from scratch, i.e., from randomly initialized parameters! I understand your doubt …how is this transfer learning in that case, and how is it even helpful?

Well, strictly speaking, only the Network Structure is transferred here, not the learned parameters. If you’ve got a huge dataset, training from scratch will ensure that the model is optimized specially for that one special dataset of yours. And moreover, the Network Structure you’ll be using has already been very carefully designed to give high accuracy. So, re-training from scratch really does work. It will take time and computational resources, but it can also give very high accuracy!

The recommended way: Everybody wants to know the best possible way. Which in the case of Transfer learning is, not to re-train from scratch, but re-train some deep layers.

What I mean with that is, recall what we did when we had a small dataset, we left most of the layers in their original condition and fine-tuned the last layers. But now that we’re rich (yeahh!), we can now afford to train more layers than before and put Transfer learning to some good use.

So, what is done is, a few of the layers in the beginning (also called shallow layers) are left as they are, pre-optimized. And we train a few of the deep layers again so that they get better optimized for our dataset.

Why do we do this? The answer is simple. See, the shallow layers of most CNNs detect edges and a few other very basic features that generalize very easily. Specific features (according to the dataset) are extracted in the deeper layers. Now did you get it?

Though, there’s one thing that we HAVE to do! We have to change the output layer to be in sync with our data (like the number of classes: ImageNet has 1000, your data might have just 2). Just imagine the fiasco if we don’t do it!
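The freezing recipe can be written down as a plan without touching any deep learning framework. The sketch below is pure Python and entirely hypothetical: the layer names, the `prepare_for_transfer` function and the (name, trainable) tuples are made up for illustration and are not the API of Keras, PyTorch or any other library.

```python
def prepare_for_transfer(layer_names, n_trainable, n_classes):
    """Plan a fine-tuning run as a list of (layer name, trainable?) pairs.

    The old output layer is dropped, the last n_trainable remaining layers
    are unfrozen, and a fresh output layer sized for our classes is added.
    """
    body = layer_names[:-1]  # everything except the old output layer
    n_frozen = max(len(body) - n_trainable, 0)
    plan = [(name, index >= n_frozen) for index, name in enumerate(body)]
    plan.append((f"output_{n_classes}_classes", True))  # new head is always trained
    return plan


# A made-up pre-trained network with a 1000-class ImageNet-style output.
pretrained = ["conv1", "conv2", "conv3", "conv4", "dense1", "output_1000_classes"]

small_data_plan = prepare_for_transfer(pretrained, n_trainable=1, n_classes=2)
big_data_plan = prepare_for_transfer(pretrained, n_trainable=3, n_classes=2)

print(small_data_plan)  # only dense1 and the new output layer are trainable
print(big_data_plan)    # conv3, conv4, dense1 and the new output layer are trainable
```

In a real framework the same idea boils down to loading the pre-trained weights, marking the early layers as non-trainable and attaching a new output layer.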

I know it can be hard to believe that Transfer Learning can even work! Trust me, I was in awe too when I first heard about this process, but Transfer learning does work. And not only does it work, it works remarkably well. One explanation is that it skips most of the time taken by back-propagation and gradient descent, i.e., the time taken for updating the weights and biases to reach their optimum values. Second, as the networks are trained on huge datasets, they’ve got a vast knowledge base and thus work amazingly on our personal datasets, given that our dataset is somewhat similar to the data they were trained on. For example, if the model was trained on car images, Transfer learning can make it work on truck images pretty well. I hope you’ve got the intuition by now.

Well, this is all there is to know about Transfer Learning. Nowadays, Transfer Learning is used almost everywhere in industry. Nobody (except Researchers) tries to create a network from scratch; instead, people fall back on big networks that are already made, some of which are ResNets, the VGG versions, AlexNet and many more. It’s really fun to use them.

So why wait more, go on and create your first Transfer Learning project!


Backpropagation Gentle Introduction

In the last article, we got to know what exactly Neural Networks are and created one from scratch! Yay. There was, however, one thing that I thought should be carried forward to another article just so I can explain it better (it was beyond the scope of that article). Yeah, you guessed it right, it’s nothing else but Backpropagation!

Well, if you ask me to explain Backprop in one line, I would say;

Backpropagation is a way to compute the gradient that decides the cost-optimization step.

Puzzled? Don’t be. Let’s dive deeper.

But before diving into complex explanations and calculus, we should understand why we are even studying Backpropagation and what it basically does.

Let’s recall the workflow we came up with in the last article (if you haven’t read my article on Neural Networks in a Nutshell, then I strongly recommend you at least skim through it):

Take Info → Do some magic → Give output → Calculate the mistakes (errors) → Optimize the magic → Repeat

Basically, on a high level:

1. We took the inputs.
2. We performed Forward Propagation.
3. We computed the Cost Function.
4. We performed Back Propagation.
5. We performed Gradient Descent to optimize the parameters (weights and biases).
6. Repeat from step 2 until the desired results are achieved.
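The whole workflow fits in a few lines for the smallest possible "network": a single sigmoid neuron with one weight and one bias. This is a sketch under my own assumptions (toy data, squared-error cost, zero initialization so the run is reproducible), not the network from the last article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, alpha=0.5, epochs=2000):
    """Train one sigmoid neuron; returns (w, b, mean cost of the last epoch)."""
    w, b = 0.0, 0.0  # zeros instead of random values, to keep the run reproducible
    mean_cost = 0.0
    for _ in range(epochs):  # repeat from forward propagation
        dw = db = cost = 0.0
        for x, y in zip(xs, ys):  # take the inputs
            y_hat = sigmoid(w * x + b)  # forward propagation
            cost += 0.5 * (y_hat - y) ** 2  # cost function (squared error)
            dz = (y_hat - y) * y_hat * (1 - y_hat)  # back propagation (chain rule)
            dw += dz * x
            db += dz
        w -= alpha * dw / len(xs)  # gradient descent update
        b -= alpha * db / len(xs)
        mean_cost = cost / len(xs)
    return w, b, mean_cost


# Toy task: negative inputs belong to class 0, positive inputs to class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b, final_cost = train(xs, ys)
print(sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))  # close to 1 and close to 0
```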

That means, our core job is to optimize the parameters, right? For that, we need to optimize/change the weights and biases. There we go, that ‘change’ is computed by backpropagation. Thinking where calculus fits in? Tell me, what do we call ‘change’ in mathematics?

Derivative.

Okey dokey! Now we’re ready to dive deeper…finally!

2-L Basic Neural Network

Let’s go with this basic NN. Now, I’ll just write the necessary equations (which I’ve already explained in my last article. Again, you might wanna check that out I’m telling you!) required to do backpropagation.

Layer 1

Now,
y → original labels
y_hat → predicted labels

and here, m = 1

Therefore, total error,

Now, we’re left with just one teeny-tiny job (haha), i.e. to find the ‘change’.

Enter Backpropagation.

If you’re anything like me, I’m sure you too have a hard time understanding what exactly a ‘derivative’ means. Whenever I asked my teachers, they would just say that it’s the slope of the tangent …or …or the rate of change …or breaking down a value into smaller values. To be honest, I never understood it, until I came across backpropagation.

Let me put it this way: say you’ve got a car. What does a car’s speed depend on? On a high level, its engine (duh!), wheels and maybe spoilers (tell me if I’m wrong, not a car expert you know!). So now, we’ve got three components that affect the speed of the car.

Now, one question, how will the speed of the car be affected if I change the engine? Ok, let me put this another way, we have to find the change in speed w.r.t the engine, right?

Now, let’s add maths to the mix.

Let ‘v’ be the speed of the car,

e → engine, w → wheels, s → spoilers

So,

c1, c2 and c3: constants

NOTE: Here, we’re using a special type of derivative called a Partial Derivative, which is basically used when a function depends on multiple variables (here, e, w, and s).

Therefore,

The other variables (w and s) are treated as constants, since we are not differentiating w.r.t. them.
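One way to write the car example down, assuming (my simplification) that the speed is just a weighted sum of the three components with the constants c1, c2 and c3 as weights:

```latex
v = c_1 e + c_2 w + c_3 s
\qquad\Longrightarrow\qquad
\frac{\partial v}{\partial e} = c_1
```

The terms with w and s do not change when only e changes, so they vanish, and what is left tells us how much the speed changes per unit change in the engine.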

I hope you guys got a little intuition of what exactly is meant by a derivative. Ok then! Let’s move forward…

Now the big question arises: on what variable/parameter does the error even depend? Let’s backpropagate (pun intended ;)

How did we reach our error? Cost Function? Right.

What does Cost Function depend on? Original and Predicted labels? Right. Can we change the original labels? Duh! NO! What’s left? Yes, Predicted labels.

Now, we’ve rolled back to our forward propagation step. How did we reach predicted labels again?

Here we see that the Forward Propagation basically depends on the inputs and the weights we initialized. Are you getting what I’m trying to point at? We cannot change the input (how convenient would that be if we could? Imagine!). So, we’re only left to tinker with the weights and biases. That means, they can be customized as per our convenience, so…let’s!

Now to see how W_7 is affecting our cost function (remember the intuition?), let’s take the help of Miss. Maths,

The derivative can be expanded like this, right?

now, let’s solve it term by term,

note, a_4 = y_hat

Derivative of Sigmoid

expanding z_4 as in Forward Propagation

Finally, bringing all the terms together…
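In symbols, the chain of dependencies (error, then a_4, then z_4, then W_7) gives the expansion below. Two things in it are my assumptions: a_k stands for whichever activation W_7 multiplies inside z_4 (the index depends on the network diagram), and the first factor is left unexpanded because it depends on the cost function used.

```latex
\frac{\partial e_t}{\partial W_7}
  = \frac{\partial e_t}{\partial a_4}
    \cdot \frac{\partial a_4}{\partial z_4}
    \cdot \frac{\partial z_4}{\partial W_7},
\qquad
\frac{\partial a_4}{\partial z_4} = a_4 \, (1 - a_4),
\qquad
\frac{\partial z_4}{\partial W_7} = a_k
```

For instance, if the cost happened to be the squared error 0.5 (a_4 - y)^2, the first factor would simply be (a_4 - y).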

For simplicity, let’s just say we’ve got dW_7 (rather than saying it de_t/dW_7).
Now, we just repeat this process w.r.t every weight (W1-W9) and every bias (b1 and b2) to find respective derivatives.

Now what? See, we’ve got the change in the parameters that we wanted ever since the beginning of the article. Now we just need to update the respective parameter.

Not so fast kid!

See, if we only talk about what Backpropagation does, then we’re done. The sole purpose of backpropagation is to calculate the change to be made. Updating the parameters is a whole other step.
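A nice sanity check that backpropagation really computes "the change": compare the chain-rule derivative with a brute-force estimate obtained by nudging the weight a tiny bit. The sketch uses a single sigmoid neuron with a squared-error cost, and the values of w, b, x and y are arbitrary numbers picked for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, b, x, y):
    """Squared error of a single sigmoid neuron on one sample."""
    return 0.5 * (sigmoid(w * x + b) - y) ** 2


def backprop_dw(w, b, x, y):
    """d(cost)/dw via the chain rule: d(cost)/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    return (a - y) * a * (1 - a) * x


def numeric_dw(w, b, x, y, eps=1e-6):
    """d(cost)/dw estimated by nudging w up and down a little."""
    return (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)


w, b, x, y = 0.4, -0.1, 1.5, 1.0  # arbitrary example values
print(backprop_dw(w, b, x, y))
print(numeric_dw(w, b, x, y))  # agrees with the line above to many decimals
```

Notice that nothing got updated here. Both functions only report the slope; using it is the next step’s job.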

And that step is, ‘Gradient Descent’.

Gradient Descent and Backpropagation (Intuition):

In a nutshell, Gradient Descent is the most basic optimization algorithm, which takes some calculated steps towards the minima of the Cost Function until it converges, thus minimizing the error and maximizing accuracy!

What Gradient Descent basically does is that it updates the value of a weight/bias.
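The standard form of that update, written with J for the cost function (my notation) and alpha for the learning rate, is:

```latex
W := W - \alpha \, \frac{\partial J}{\partial W},
\qquad
b := b - \alpha \, \frac{\partial J}{\partial b}
```

The derivatives are exactly what Backpropagation hands over, and the minus sign makes us walk downhill.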

Gradient Descent takes a parameter called ‘Learning Rate (alpha)’. Now, this is interesting. The Learning Rate decides how big the steps towards the minima will be (or sometimes, if it is too big, even away from the minima).

Now I know what you’re thinking. First, we did Backprop to find the change, and now we’re saying that the learning rate decides the change, what is this fiasco?

Stick with me for just one more minute.

So basically, the derivative we got from Backprop decides in what direction we need to move (obviously towards the minima, but the minima could be either to the left or to the right of the current position, right?) and how steep the slope is there, and the learning rate scales how big a step we take in that direction.

Learning Rate Fiasco: There’s a catch with the Learning Rate (alpha) which I think I should tell you right away. See, if we set alpha to be very small, our function will take a very long time to converge to the minima and, on the other hand, if we set it too high, our function might never even converge! The following gifs will make things clearer.

Small Learning Rate

Large Learning Rate
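You can watch the same behaviour in numbers. As a toy example of my own choosing, take the cost f(w) = w**2 (minimum at w = 0, derivative 2*w), start at w = 1 and run 50 steps of gradient descent with three different learning rates.

```python
def gradient_descent(alpha, steps=50, w=1.0):
    """Minimise f(w) = w**2, whose derivative is 2*w, starting from w."""
    for _ in range(steps):
        w = w - alpha * 2 * w  # the gradient descent update
    return w


for alpha in (0.001, 0.1, 1.1):
    print(alpha, gradient_descent(alpha))
# 0.001 -> still close to the start (too slow)
# 0.1   -> very close to the minimum at 0
# 1.1   -> a huge number (every step overshoots and it diverges)
```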

So, generally, the value is taken from among the following:
alpha = 0.001, 0.003, 0.01, 0.03 or 0.1 (i.e., 0.001 ≤ alpha ≤ 0.1)
And of course, you’re all allowed to take any value and check what works best for you.

Now what, we’ve calculated everything we need for Backpropagation. We just need to update the parameters using the above Gradient Descent Equation.

Now if we go back to the workflow, we’ve optimized our function, and now, until the desired results are reached, we have to iterate (repeat) the workflow again and again and again and again and again and again and again……


Parthik Talks