Category: Data Science
-
Reinventing Capitalism in the Age of Big Data: Summary
I finished this book a few months ago and realized it would be useful to recap the points I learnt from it. The book talks about how data is the new currency of this era, and how the capitalist landscape will change as a result. It also proposes some steps to manage this change, and […]
-
Activation Functions
The structure of a deep learning model consists mainly of nodes and the connections between them. Most of the time, every node is connected to every node in the next layer, which we call a Dense layer. Within each node is a mathematical equation that decides, based on the input values and their weights, what […]
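Not from the post itself, but as a minimal sketch of the idea above: each node computes a weighted sum of its inputs plus a bias, then passes the result through an activation function (ReLU here; the function and all values are purely illustrative).

```python
import numpy as np

def relu(z):
    # ReLU activation: pass positive values through, zero out the rest
    return np.maximum(0.0, z)

def node_output(inputs, weights, bias):
    # Weighted sum of inputs plus bias, squashed by the activation
    z = np.dot(inputs, weights) + bias
    return relu(z)

# Example: three inputs feeding a single node in a dense layer
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(node_output(x, w, bias=0.2))
```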
-
The Interpretation of ROC and AUC
The ROC curve and its AUC are common metrics for evaluating the performance of a model. In this post, we dig deeper to find out how to interpret the results, and what corrective actions to take to improve them. What is it? The ROC curve, or Receiver Operating Characteristic curve, works on binary classification […]
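As a quick illustration of the metric being discussed (not from the post; the labels and scores below are made up), scikit-learn computes the curve and its area directly:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy binary labels and predicted probabilities, for illustration only
y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# roc_curve sweeps the decision threshold and returns the curve's points
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC of 1.0 means perfect ranking; 0.5 means no better than random
print("AUC:", roc_auc_score(y_true, y_score))
```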
-
Regularization
One of the major problems in training a machine learning model is overfitting. As your model gets more and more complex, it starts to memorize the patterns in the training data. This makes it perform poorly on unseen data, which contains new patterns. Overfitting is the result of low bias and high variance, where it […]
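A minimal sketch of one common regularization technique, assuming an L2-penalized linear model (ridge regression) as the example; the dataset here is synthetic:

```python
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data with more features than the signal really needs
X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha controls the strength of the L2 penalty: a larger alpha shrinks
# the weights more, trading a little bias for lower variance
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```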
-
Microsoft Kaggle Competition
This is the write-up of my solution for the Microsoft Malware Prediction challenge: https://www.kaggle.com/c/microsoft-malware-prediction. I got pretty high up the leaderboard, but it was nothing I was proud of, because: I grossly overfitted my model; the final result was a blend of another kernel; all my attempts at feature engineering failed; and I’m […]
-
Model Capacity
While studying the book Deep Learning by Ian Goodfellow, I came across the concept of model capacity, and it was really intuitive in helping me understand a model’s representation of a given problem. This ties to the concepts of overfitting and underfitting. Capacity: put simply, the capacity of the model is the complexity of the […]
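As an illustrative sketch (not from the post or the book), polynomial degree is a simple capacity knob: a low degree underfits a noisy sine wave, while a very high degree has enough capacity to memorize the noise.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

# Higher degree = higher capacity: training fit improves monotonically,
# but past some point the model is fitting noise, not signal
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    print(degree, model.score(X, y))
```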
-
Counts Based Featurization
While doing the Microsoft Malware Classification challenge, I encountered a way of representing features called Count Based Features (CBF). CBF works well with very high-cardinality features: it replaces each of the many categories in the data with the number of its occurrences. This representation is helpful because it extracts out a […]
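A minimal pandas sketch of the idea, with a hypothetical device_model column standing in for a high-cardinality feature:

```python
import pandas as pd

df = pd.DataFrame({"device_model": ["A12", "B7", "A12", "C3", "A12", "B7"]})

# Replace each category with how often it appears in the data,
# collapsing a high-cardinality column into a single numeric one
counts = df["device_model"].value_counts()
df["device_model_count"] = df["device_model"].map(counts)
print(df)
```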
-
LightGBM
For some time, XGBoost was considered the Kaggle-killer, being the winning model for most prediction problems. Recently, Microsoft released its own gradient-boosting framework called LightGBM, and it is way faster than XGBoost. In this post, I’m going to touch on the interesting portions of LightGBM. What is LightGBM? Similar to XGBoost, LightGBM is a […]
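A minimal sketch of training LightGBM through its scikit-learn wrapper (the data is synthetic and the parameters are illustrative, not from the post):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# LightGBM grows trees leaf-wise and bins features into histograms,
# which is where most of its speed advantage over XGBoost comes from
model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
print("validation accuracy:", model.score(X_valid, y_valid))
```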
-
Feature Engineering
Feature engineering is one of the neglected areas of machine learning. Most discussions revolve around model training (parameter tuning, cross-validation). While that is really important, feature engineering matters just as much, but I can’t seem to find good resources that talk about it. I suspect this is because, to perform feature engineering, […]
-
Microsoft Kaggle Challenge: Adversarial Validation
Overview: This was a concept I came across while doing a Kaggle challenge issued by Microsoft to predict whether a computer would get hit by malware or not. This challenge was different from their previous one, where they wanted you to predict the malware class of a given binary. This challenge was really […]
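A minimal sketch of adversarial validation under its usual formulation (not from the post; X_train and X_test below are random placeholders): label each row by whether it came from train or test, then check how well a classifier separates them. An AUC near 0.5 means the two sets look alike; an AUC near 1.0 means their distributions differ and a plain random validation split may be misleading.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholders for your actual train/test feature matrices
X_train = np.random.rand(500, 10)
X_test = np.random.rand(500, 10)

# Stack both sets and label rows by their origin: 0 = train, 1 = test
X = np.vstack([X_train, X_test])
y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

# If the classifier can tell train from test, the distributions differ
clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print("adversarial AUC:", auc)
```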