Deep learning is a form of machine learning for nonlinear high-dimensional pattern matching and prediction. By adopting a Bayesian probabilistic perspective, we provide a number of insights, including more efficient algorithms for optimization and hyper-parameter tuning, and an explanation of why deep learning finds good predictors. Traditional high-dimensional data reduction techniques, such as principal component analysis (PCA), partial least squares (PLS), reduced rank regression (RRR), and projection pursuit regression (PPR), are all shown to be shallow learners. Their deep learning counterparts exploit multiple layers of data reduction, which provide performance gains. We discuss stochastic gradient descent (SGD) training optimization and Dropout (DO) regularization, which provide estimation and variable selection, as well as Bayesian regularization, which is central to finding the weights and connections in a network that optimize the bias-variance trade-off. To illustrate our methodology, we provide an analysis of spatio-temporal data. Finally, we conclude with directions for future research.
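
To make the two central training ingredients named above concrete, here is a minimal sketch, not the paper's implementation, of SGD training with Dropout on a one-hidden-layer network. The synthetic data, network width, learning rate, keep probability, and batch size are all illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's code): SGD + inverted
# Dropout for a one-hidden-layer ReLU network on synthetic regression data.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic nonlinear regression data (illustrative assumption).
X = rng.normal(size=(500, 10))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

# One hidden layer of 32 ReLU units (width chosen arbitrarily).
W1 = rng.normal(scale=0.1, size=(10, 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 1))
b2 = np.zeros(1)

lr, p_keep, batch = 0.01, 0.8, 32
for epoch in range(200):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        b_idx = idx[start:start + batch]
        xb, yb = X[b_idx], y[b_idx, None]

        # Forward pass; inverted Dropout rescales kept hidden units by 1/p.
        h = np.maximum(xb @ W1 + b1, 0.0)
        mask = (rng.random(h.shape) < p_keep) / p_keep
        h_do = h * mask
        pred = h_do @ W2 + b2

        # Backward pass for the squared-error loss.
        g_pred = 2.0 * (pred - yb) / len(xb)
        g_W2 = h_do.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_h = (g_pred @ W2.T) * mask * (h > 0)
        g_W1 = xb.T @ g_h
        g_b1 = g_h.sum(axis=0)

        # Plain SGD update on each mini-batch.
        W1 -= lr * g_W1; b1 -= lr * g_b1
        W2 -= lr * g_W2; b2 -= lr * g_b2

# At prediction time Dropout is disabled; inverted Dropout already
# rescaled activations during training, so no extra correction is needed.
mse = np.mean((np.maximum(X @ W1 + b1, 0.0) @ W2 + b2 - y[:, None]) ** 2)
print(f"training MSE: {mse:.4f}")
```

Randomly zeroing hidden units in each mini-batch is what gives Dropout its variable-selection flavor: the network cannot rely on any single unit, which acts as a regularizer on the learned weights.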