Evolution of Machine Learning and Artificial Intelligence




by Meghal Dani, Jaswant Singh



Artificial intelligence is changing many aspects of society, and the field itself is evolving quickly as new algorithms appear. Reviewing the classic papers is a good way to build a solid understanding of it. Below are some of the most important algorithms in machine learning and deep learning, each with a reference to help you get started.



Machine Learning

Linear Regression [1]

Linear regression is a data analysis technique that predicts a dependent variable as a linear function of one or more independent variables. Consider the function

y = α + βx, where x is the independent variable and y is the dependent variable. This is the equation of a line with slope β and intercept α.
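As a minimal sketch of how α and β can be estimated, the snippet below fits the line by ordinary least squares in plain Python. The function name `fit_line` and the toy data are illustrative assumptions, not taken from the referenced paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimate of alpha and beta in y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx                  # slope
    alpha = mean_y - beta * mean_x    # intercept
    return alpha, beta


# Toy data generated from y = 1 + 2x (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
alpha, beta = fit_line(xs, ys)
print(alpha, beta)
```

On noise-free data like this the fit recovers the generating line; on real data it returns the line that minimizes the sum of squared errors.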

Ridge Regression [2] 

As we add features to a linear regression, the number of β coefficients grows. The model becomes more flexible, but it may also overfit. Regularization is commonly applied to avoid this.

Ridge regression is a regularized linear regression: it adds a penalty on the squared magnitude of the coefficients, which shrinks them toward zero, keeps the model simpler, and reduces overfitting.
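To make the effect of the penalty concrete, here is a small sketch for the single-feature case, assuming the intercept is left unpenalized and the penalty is `lam * beta**2`. The name `fit_ridge_1d` and the value of `lam` are hypothetical choices for illustration.

```python
def fit_ridge_1d(xs, ys, lam):
    """Minimize sum((y - alpha - beta*x)^2) + lam * beta^2 for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / (sxx + lam)          # lam > 0 shrinks the slope toward zero
    alpha = mean_y - beta * mean_x    # intercept is not penalized
    return alpha, beta


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
_, beta_ols = fit_ridge_1d(xs, ys, lam=0.0)    # no penalty: plain least squares
_, beta_ridge = fit_ridge_1d(xs, ys, lam=5.0)  # penalized: smaller slope
print(beta_ols, beta_ridge)
```

With `lam=0` the result is the ordinary least squares slope; increasing `lam` pulls the coefficient toward zero, which is the "punishment" of large coefficients described above.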

K Nearest-Neighbor (KNN) [3]

KNN is a supervised algorithm. To classify a new sample, it finds the k training examples closest to it and outputs the class chosen by a majority vote among those neighbors. The distance between points is usually measured as Euclidean distance.
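The voting procedure can be sketched in a few lines of Python. The function name `knn_predict`, the toy points, and the choice `k=3` are assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
near_a = knn_predict(points, labels, (0.5, 0.5))
near_b = knn_predict(points, labels, (5.5, 5.5))
print(near_a, near_b)
```

Note that there is no training step: all the work happens at prediction time, when distances to the stored examples are computed.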


Support Vector Machine (SVM) [4]

SVM is a supervised algorithm used mostly for classification, though it can also be used for regression. In the simplest case, each data point has two features (in general there can be N) and can be plotted in 2-D space; classification is then performed by finding a line that separates the two classes. This separating boundary is called a hyperplane: a line for two features, a plane for three. Among all separating hyperplanes, the SVM picks the one with the largest margin to the nearest points of each class. A new data point is assigned a class according to the side of the hyperplane on which it falls.

Sometimes the data cannot be separated linearly. In such cases a kernel is used to map the data, implicitly, into a higher-dimensional space in which the points can be separated.
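The idea of mapping to a higher dimension can be illustrated without training an SVM at all. In the sketch below, one class sits near zero on a line and the other class sits on both sides of it, so no single threshold on x separates them. After the explicit map x -> (x, x^2), a threshold on the second coordinate does. The feature map, the data, and the hand-picked threshold are assumptions for illustration; a real SVM would learn the separating hyperplane and would usually apply the kernel implicitly.

```python
def feature_map(x):
    """Lift a 1-D point into 2-D as (x, x^2)."""
    return (x, x * x)


inner = [-1.0, 0.0, 1.0]         # class "inner": close to zero
outer = [-4.0, -3.0, 3.0, 4.0]   # class "outer": on both sides of "inner"


def classify(x, threshold=5.0):
    """Separate the classes with the horizontal line x^2 = threshold in the lifted space."""
    _, x_squared = feature_map(x)
    return "outer" if x_squared > threshold else "inner"


predictions_inner = [classify(x) for x in inner]
predictions_outer = [classify(x) for x in outer]
print(predictions_inner, predictions_outer)
```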

Decision Trees [5]  

As the name suggests, decision tree algorithms are based on tree-like structures in which each internal node represents a "test" on an attribute and each leaf node is a final decision for classification. A decision tree is a supervised algorithm that splits the population into sub-categories based on attributes (variables). Each split, or branching, in the tree is chosen using a criterion such as the Gini index, chi-square, information gain (entropy), or variance reduction, depending on the type of target variable. The path from the root to a leaf node determines the decision for a sample.
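As a rough sketch of how one split is chosen, the code below scores every candidate threshold on a single numeric attribute by weighted Gini impurity and keeps the best one. The helper names `gini` and `best_split` and the toy data are assumptions; a full tree would apply this recursively to each resulting subset.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, larger for a more mixed node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best test 'value <= threshold'."""
    best = (None, float("inf"))
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]
threshold, score = best_split(values, labels)
print(threshold, score)
```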

Random Forest [6]

This model is made up of many decision trees. When each tree is built, the training data points are randomly sampled, and when nodes are split, random subsets of the features are considered. Training each tree on a different sample keeps the trees from all making the same errors, which lowers the variance of the forest as a whole and reduces overfitting. The variance of a single tree could instead be reduced by limiting its depth, but that comes at the expense of higher bias; Random Forest lowers variance without that cost.

The final prediction of a Random Forest is made by averaging the predictions of the individual trees (for classification, by majority vote).
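The two sources of randomness and the final vote can be sketched as follows. To keep the example short, each "tree" is reduced to a one-feature stump that predicts the class whose mean on that feature is nearest; this is a stand-in, not a full decision tree. All names, the seed, and the toy data are assumptions.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """One-feature stand-in for a tree: remember each class's mean on `feature`."""
    sums, counts = {}, Counter(labels)
    for row, label in zip(rows, labels):
        sums[label] = sums.get(label, 0.0) + row[feature]
    means = {label: sums[label] / counts[label] for label in sums}
    return feature, means


def stump_predict(stump, row):
    feature, means = stump
    return min(means, key=lambda label: abs(row[feature] - means[label]))


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw as many rows as the data has, with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random feature choice for this stump.
        feature = rng.randrange(len(rows[0]))
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def forest_predict(forest, row):
    """Majority vote over the individual predictions."""
    votes = Counter(stump_predict(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


rows = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
labels = ["a", "a", "a", "b", "b", "b"]
forest = train_forest(rows, labels)
print(forest_predict(forest, (1.5, 1.5)), forest_predict(forest, (8.5, 8.5)))
```

An individual stump trained on an unlucky bootstrap sample can be wrong, but the vote across many stumps is far more stable, which is the variance reduction described above.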


Gradient Boosting [7]

Ensembling is a common technique in machine learning in which several models predict the same target, on the reasoning that many models together will perform better than one. Ensembling methods are broadly classified into bagging and boosting. In bagging, models are trained independently on bootstrap samples of the data and their outputs are combined, for example by averaging or voting. Random Forest uses bagging: the predictions of the different trees are combined.

In boosting, models are combined sequentially. Gradient Boosting is a boosting algorithm in which each new model learns from the mistakes of the previous ones: predictions are updated by adding new models that reduce the remaining error, which for squared-error loss means fitting the residuals. XGBoost [8] is an implementation of gradient-boosted decision trees designed for speed and performance.
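The sequential idea can be sketched for squared-error loss, where each new model is fitted to the residuals left by the models before it. The weak learner here is a regression stump, and the number of rounds, learning rate, and toy data are assumptions for illustration; this is not how XGBoost is implemented.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one constant per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, split, low, high = best
    return lambda x: low if x <= split else high


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)          # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 2.0, 4.1, 5.0, 5.2]
baseline = sum(ys) / len(ys)
model = gradient_boost(xs, ys)
mse_before = mse(lambda x: baseline, xs, ys)
mse_after = mse(model, xs, ys)
print(mse_before, mse_after)
```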


Neural Network

Deep Neural Network

A neural network consists of an input layer, one or more hidden layers, and an output layer. When the number of hidden layers grows from one to many, the network is considered a Deep Neural Network. A network with more than two or three layers is usually considered deep, though there is no exact definition. Deep networks differ from classical machine learning algorithms mainly in that they learn features from the raw input and so need far less manual feature engineering.
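The layered structure can be sketched as a forward pass through one hidden layer. The weights below are chosen by hand (an assumption for illustration, not learned values) so that the network computes XOR, a function a model with no hidden layer cannot represent.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hand-picked parameters: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]
b2 = [0.0]


def forward(x):
    hidden = relu(dense(x, W1, b1))    # hidden layer with ReLU activation
    return dense(hidden, W2, b2)[0]    # linear output layer


outputs = [forward(x) for x in ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
print(outputs)
```

In practice the weights are learned by backpropagation rather than set by hand, and deeper networks stack more `dense` and activation steps.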

Convolutional Neural Network (CNN) 

CNNs are a variant of neural networks used mostly in computer vision. Their usual building blocks are convolutional layers, pooling layers, and normalization layers. They take an image as input, extract features, and then perform a task such as image classification, object detection, or segmentation.
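The two core operations can be sketched in plain Python. The example slides a hand-written vertical-edge kernel over a tiny image and then max-pools the result. The kernel and image are assumptions for illustration; in a real CNN the kernel values are learned, and libraries implement these operations far more efficiently.

```python
def conv2d(image, kernel):
    """'Valid' 2-D convolution as used in CNNs (no padding, stride 1, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, kernel)   # 4x4 feature map, non-zero only at the edge
pooled = max_pool(features)        # 2x2 summary of the feature map
print(features)
print(pooled)
```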

Deep learning has advanced and many variants of the CNN architecture have been developed. Several well-known architectures have shown steady improvement over the years and success in competitions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). AlexNet [9], VGG [10], and ResNet [11] are among the best known.

After AlexNet, networks grew deeper to improve performance; VGG, for example, has more layers than AlexNet.

Very deep neural networks are difficult to train because of the vanishing gradient problem. ResNet introduced the residual block to address it. Residual blocks use skip connections, which keep the gradient in the early layers from vanishing during backpropagation. Other well-known networks include:

Faster R-CNN [12]

Faster R-CNN is an object detection network, state of the art when it was introduced, that combines a Region Proposal Network (RPN) with Fast R-CNN. The RPN generates bounding box proposals around possible objects; features are pooled from these proposals and used to identify the object classes. A regression layer at the end refines the bounding box coordinates.
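The coordinate refinement done by the regression layer can be sketched with the box parameterization used in the R-CNN family: the network predicts offsets relative to a reference box rather than absolute coordinates. The function name `apply_deltas` and the example numbers are assumptions for illustration.

```python
import math


def apply_deltas(box, deltas):
    """Refine a box (center_x, center_y, width, height) with predicted offsets.

    tx, ty shift the center in units of the box size;
    tw, th rescale the width and height on a log scale.
    """
    cx, cy, w, h = box
    tx, ty, tw, th = deltas
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)
unchanged = apply_deltas(proposal, (0.0, 0.0, 0.0, 0.0))   # zero offsets keep the box
shifted = apply_deltas(proposal, (0.1, -0.05, 0.0, 0.0))   # move the center only
print(unchanged, shifted)
```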

U-Net [13]

U-Net outperformed the previous best method on the ISBI 2012 challenge for segmentation of neuronal structures, won the ISBI cell tracking challenge in 2015, and has since been popular for segmentation tasks, especially in the biomedical domain. The architecture is U-shaped and consists of two parts: a contracting path (standard convolution and pooling layers) and an expansive path (transposed 2D convolution layers).


Machine Translation

RNN [14]

We cannot understand a sentence or a video if we treat each word or frame from scratch and forget what came earlier. The feed-forward networks discussed above have the same limitation. Recurrent Neural Networks (RNNs) were designed to address it. An RNN can be thought of as a neural network with a loop: at each step it passes a hidden state to the next step so that information persists. This is how, while reading a sentence, it keeps track of the preceding words and builds up meaning. RNNs find major applications in Natural Language Processing (NLP).
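The loop can be sketched with a single scalar hidden state that is updated at every step from the current input and the previous state. The weights `w_x` and `w_h` and the input sequences are assumptions for illustration; real RNNs use vectors and learned weight matrices.

```python
import math


def run_rnn(sequence, w_x=1.0, w_h=0.5, b=0.0):
    """Return the hidden state after each step of a one-unit tanh RNN."""
    h = 0.0
    states = []
    for x in sequence:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


early = run_rnn([1.0, 0.0, 0.0])   # signal at the start, then silence
late = run_rnn([0.0, 0.0, 1.0])    # same inputs in a different order
print(early)
print(late)
```

The two sequences contain the same values, yet the final states differ: the state carries information about order. The state for `early` also shrinks at every step, a small illustration of how older information fades in a plain RNN.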

LSTM [15]

Plain RNNs struggle to retain information over long sequences. Long Short-Term Memory (LSTM) networks are a special kind of RNN that address this problem with a memory cell, so retaining information for long periods is their default behavior. The basic architecture uses gates, which decide whether to pass information on to the next step or forget it when it is no longer useful. These gates are what set LSTMs apart from plain RNNs.
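The gating mechanism can be sketched with a scalar LSTM cell using the standard forget, input, and output gates. The parameter dictionary and its hand-set values are assumptions chosen to make the gates' effect visible; in a trained network these values are learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` holds weights (w*, u*) and biases (b*)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    c = f * c_prev + i * g     # keep part of the old memory, add part of the new
    h = o * math.tanh(c)       # expose part of the memory as the hidden state
    return h, c


# Forget gate held open (bias +10), input gate held shut (bias -10).
keep = {"wf": 0.0, "uf": 0.0, "bf": 10.0,
        "wi": 0.0, "ui": 0.0, "bi": -10.0,
        "wo": 0.0, "uo": 0.0, "bo": 0.0,
        "wg": 0.0, "ug": 0.0, "bg": 0.0}

h, c = 0.0, 1.0                      # start with something stored in memory
for x in [0.3, -0.2, 0.5] * 10:      # 30 steps of unrelated input
    h, c = lstm_step(x, h, c, keep)
kept_cell = c

# Same cell, but with the forget gate held shut: the memory is erased in one step.
forget = dict(keep, bf=-10.0)
_, erased_cell = lstm_step(0.3, 0.0, 1.0, forget)
print(kept_cell, erased_cell)
```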

Bi-LSTM [16]

A Bi-LSTM is two independent LSTMs put together so that information flows both ways: one reads the sequence from past to future and the other from future to past. When a word has to be predicted not only from the words before it but also from the words that follow, Bi-LSTMs are a better choice.
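The bidirectional wiring can be sketched independently of the cell type. To keep the example short, a simple tanh recurrence stands in for each LSTM (an assumption; a real Bi-LSTM would use two LSTM cells with separate learned weights). Each position ends up with a pair of states, one summarizing the past and one summarizing the future.

```python
import math


def run_recurrent(sequence, w_x=1.0, w_h=0.5):
    """Simple tanh recurrence used here as a stand-in for an LSTM."""
    h = 0.0
    states = []
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(sequence):
    """Pair each position's forward state with its backward state."""
    forward = run_recurrent(sequence)
    # Run over the reversed sequence, then flip back to align with positions.
    backward = run_recurrent(sequence[::-1])[::-1]
    return list(zip(forward, backward))


a = bidirectional([1.0, 0.0, 0.0])
b = bidirectional([1.0, 0.0, 5.0])   # only the last element differs
print(a[0], b[0])
```

At the first position the forward states of the two sequences are identical, because the forward pass has seen only the first element, while the backward states differ, because the backward pass has already read the changed last element.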



This article walked through basic machine learning algorithms for classification and regression problems. They are widely used and build a strong foundation for further study. It then moved on to deep neural networks and presented well-known architectures for image classification, object detection, and segmentation. Beyond image-based problems, it covered RNNs and LSTMs, which are used for textual and time-series data. Graphs are now also widely used to solve problems in medical image analysis, 3D vision, and neuroimaging, and understanding graph neural networks and their variants can help in developing new algorithms in this area.


  1. Jeffrey M. Stanton (2001) Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors, Journal of Statistics Education, 9:3, DOI: 10.1080/10691898.2001.11910537
  2. Hoerl, Arthur E., and Robert W. Kennard. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics, vol. 12, no. 1, 1970, pp. 55–67. JSTOR, www.jstor.org/stable/1267351. Accessed 28 July 2020.
  3. T. Cover and P. Hart, "Nearest neighbor pattern classification," in IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, January 1967, doi: 10.1109/TIT.1967.1053964.
  4. Hearst, Marti & Dumais, S.T. & Osman, E. & Platt, John & Scholkopf, B. (1998). Support vector machines. Intelligent Systems and their Applications, IEEE. 13. 18 - 28. 10.1109/5254.708428.
  5. Tom M. Mitchell, (1997). Chapter 3. Decision Tree Learning. Machine Learning, Singapore, McGraw-Hill.
  6. Tin Kam Ho, "Random decision forests," Proceedings of 3rd International Conference on Document Analysis and Recognition, Montreal, Quebec, Canada, 1995, pp. 278-282 vol.1, doi: 10.1109/ICDAR.1995.598994.
  7. Friedman, Jerome. (2002). Stochastic Gradient Boosting. Computational Statistics & Data Analysis. 38. 367-378. 10.1016/S0167-9473(01)00065-2.
  8. Chen, Tianqi, and Carlos Guestrin. “XGBoost.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016).
  9. Krizhevsky, Alex & Sutskever, Ilya & Hinton, Geoffrey. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems. 25. 10.1145/3065386.
  10. Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556.
  11. He, Kaiming & Zhang, Xiangyu & Ren, Shaoqing & Sun, Jian. (2016). Deep Residual Learning for Image Recognition. 770-778. 10.1109/CVPR.2016.90.
  12. Ren, Shaoqing & He, Kaiming & Girshick, Ross & Sun, Jian. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 39. 10.1109/TPAMI.2016.2577031.
  13. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.
  14. Giles, C.Lee & Kuhn, Gary & Williams, Ronald. (1994). Dynamic recurrent neural networks: Theory and applications. IEEE Transactions on Neural Networks. 5. 153-156. 10.1109/TNN.1994.8753425.
  15. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9, 1735-1780.
  16. Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM networks. Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., 4, 2047-2052 vol. 4.




Meghal Dani 

I graduated from IIIT-Delhi and am currently a researcher at Tata Research and Innovation Labs, working in the area of Deep Learning and Artificial Intelligence. I am interested in areas where technology is put to use for healthcare.




Jaswant Singh

I am currently working with TCS Research & Innovation Labs after completing my masters in Industrial Engineering & Operations Research (IEOR) from IIT Bombay. My areas of expertise include Computer Vision and Medical Image Analysis.