Parallel XGBoost Grid Search Using Multiprocessing

Redesigning your application to run multithreaded on a multicore machine is a little like learning to swim by jumping into the deep end.
- Herb Sutter, chair of the ISO C++ standards committee, Microsoft.

XGBoost is one of the most widely used and accurate algorithms across a range of machine learning applications. Part of the algorithm's success depends on choosing the right hyperparameters, which can take quite a while if you are doing it manually. A way out is to automate and parallelize the whole process and identify the best parameters in the least possible time. Enter multiprocessing...

multiprocessing is a Python package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock (this matters when running code in an interactive environment) by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.

The way multiprocessing works is that it requires

  • a function
  • a list of values for the function to be applied to
  • the number of worker processes, passed as an argument to multiprocessing.Pool; use cpu_count() to get the number of available cores

An example is shown below.

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)                   # pool of 5 worker processes
    print(p.map(f, [1, 2, 3]))    # applies f to each value in parallel, prints [1, 4, 9]

You can either follow along or check the references for the full code...

Step 1: Data Preparation

You can use any of your training datasets to start with. Make sure you have already pre-processed the data (missing value imputation, null treatment and converting categorical variables to numeric) so that it is ready for XGBoost training.
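
As a minimal sketch of that kind of preprocessing (the file name raw_train.csv and the column cat_col are placeholders, not part of the original dataset), it could look like this:

import pandas as pd

# hypothetical raw file; 'raw_train.csv' and 'cat_col' are placeholder names
raw = pd.read_csv('raw_train.csv')

# impute missing values in numeric columns (here: with the median)
num_cols = raw.select_dtypes(include='number').columns
raw[num_cols] = raw[num_cols].fillna(raw[num_cols].median())

# one-hot encode categorical variables, then save the cleaned data used below
pd.get_dummies(raw, columns=['cat_col']).to_csv('train.csv', index=False)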

You would also want to convert your dataset into a DMatrix object. A DMatrix is an internal data structure used by XGBoost which is optimized for both memory efficiency and training speed.

For demonstration purposes, I have used a very small dataset and transformed it into a DMatrix object. I saved the object as train.buffer so that the parallel workers can refer to the same dataset.

import pandas as pd
import xgboost as xgb

train = pd.read_csv('train.csv')
# dep_var is the binary target column; everything else is a feature
train_dm = xgb.DMatrix(train.drop('dep_var', axis = 1), label = train.dep_var.values)
# save to disk so every worker process can load the same dataset
train_dm.save_binary('train.buffer')


Step 2: Creating Parameter Space

The list of parameter tuples created below is what will be fed into the multiprocessing map function.

I have selected 4 dimensions on which I will tune parameters (learning rate, min_child_weight, max_depth and number of trees). You can choose any number of dimensions; however, the number of models to be built grows exponentially as you add more dimensions.

from itertools import product

param_lr = [0.01, 0.05, 0.1]          # learning rate (eta)
param_cn = [5, 10, 15, 20]            # min_child_weight
param_depth = [3, 4, 5]               # max_depth
param_trees = [10, 50, 100, 200]      # number of boosting rounds
paramlist = list(product(param_lr, param_cn, param_depth, param_trees))
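
As a quick check, the grid above expands to 3 × 4 × 3 × 4 = 144 combinations, which is exactly the number of models trained in the results section:

print(len(paramlist))   # 144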

Step 3: Defining the Training Function

The function is defined such that each process first reads the saved DMatrix training object, extracts the individual parameters from the tuple, trains a model and returns the model object as output.

Note: To run this in an interactive Python notebook, you need to save this function as a .py file and import it as a library in the notebook.

def XGBGridSearch(p):
    # import inside the function so each worker process has xgboost available
    import xgboost as xgb

    # every worker reads the same saved DMatrix from disk
    train_dm = xgb.DMatrix('train.buffer')

    # p is a tuple: (learning rate, min_child_weight, max_depth, number of trees)
    param = {
        'max_depth': p[2],
        'eta': p[0],
        'objective': 'binary:logistic',
        'silent': 1,
        'min_child_weight': p[1]
    }
    model = xgb.train(param, train_dm, p[3])   # p[3] = number of boosting rounds
    return model
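
Before handing the full list to a pool, it can help to sanity-check the function on a single tuple (a hypothetical quick test, assuming train.buffer from Step 1 is in the working directory):

import XGBGridSearch

# paramlist[0] is (0.01, 5, 3, 10): eta, min_child_weight, max_depth, number of trees
single_model = XGBGridSearch.XGBGridSearch(paramlist[0])
print(single_model)   # should print an xgboost.core.Booster object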

Step 4: Executing Grid Search

I saved the function as XGBGridSearch.py and imported it in my notebook as a library. Now I simply pass the function from the library (without the parentheses) to the map method along with the list of parameter tuples, and voila...!

from multiprocessing import Pool, cpu_count
import XGBGridSearch

if __name__ == '__main__':
    pool = Pool(processes = cpu_count())   # one worker per logical core
    output = pool.map(XGBGridSearch.XGBGridSearch, paramlist)
    pool.close()
    pool.join()
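
The grid search above stops at training the models. As a sketch of how you might then pick the best parameters (this assumes a hypothetical held-out validation set saved as valid.buffer, which is not part of the original code), each returned Booster can be scored and the lowest-loss combination selected:

import numpy as np
import xgboost as xgb

# hypothetical follow-up: valid.buffer is an assumed held-out validation DMatrix
valid_dm = xgb.DMatrix('valid.buffer')
y_valid = valid_dm.get_label()

def logloss(y_true, y_pred, eps = 1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# score every trained model and pick the parameter tuple with the lowest loss
scores = [logloss(y_valid, m.predict(valid_dm)) for m in output]
best_params = paramlist[int(np.argmin(scores))]
print(best_params)   # (eta, min_child_weight, max_depth, number of trees)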

Results

I also ran an embarrassing for loop to compare it with the power of multiprocessing. A total of 144 models were trained and stored as model objects in a pandas dataframe. It took 33.5 seconds (since the dataset was very small).
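
For reference, the sequential baseline was of roughly this shape (a sketch reusing the same XGBGridSearch function and paramlist; the exact loop is in the referenced notebook):

import pandas as pd
import XGBGridSearch

# sequential baseline: train the 144 models one after another
results = []
for p in paramlist:
    results.append({'params': p, 'model': XGBGridSearch.XGBGridSearch(p)})

models_df = pd.DataFrame(results)   # model objects stored in a pandas dataframe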

The multiprocessing code ran in 12.1 seconds. This means that multiprocessing trained all 144 models in roughly a third of the time taken by the for loop. The pool used 12 worker processes, as returned by cpu_count() on the hexa-core CPU.

References

  1. iPython Notebook
  2. Multiprocessing official documentation
  3. Multiprocessing in Python on Windows and Jupyter/Ipython — Making it work