Predictive analytics with gradient boosting in clinical medicine
Predictive analytics play an important role in clinical research. An accurate predictive model can help clinicians stratify risk thereby allowing the identification of a target population which might benefit from a certain intervention. Conventionally, predictive analytics is performed using parametric modeling which comes with a number of assumptions. For example, generalized linear regression models require linearity and additivity to hold for the underlying data. However, these assumptions may not hold in practice. Especially in the era of big data, a large number of covariates or features can be extracted from an electronic database which might have complex interactions and higher-order terms among the covariates. Conventional modeling methods have trouble capturing such high-dimensional relationships. However, some sophisticated machine learning techniques have been invented to handle this situation. Gradient boosting is one of these techniques which is able to recursively fit a weak learner to the residual so as to improve model performance with a gradually increasing number of iterations. It can automatically discover complex data structure, including nonlinearity and high-order interactions, even in the context of hundreds, thousands, or tens-of-thousands of potential predictors. This paper aims to introduce how gradient boosting works. The principles behind this learning machine are explained with a small example in a step-by-step manner. The formal implementation of gradient tree boosting is then illustrated with the caret package. In the simulated example complexity of data structure is created by generating certain interactions between the covariates. This example shows that gradient boosting can better capture these complex relationships than a generalized linear model-based approach.