python – sklearn dumping model using joblib, dumps multiple files. Which one is the correct model?

To save everything into one file, set the `compress` argument of `joblib.dump` to `True` or to an integer compression level (`1`, for example).

You should know, however, that the separate on-disk representation of NumPy arrays is what enables joblib's main features: because of it, joblib can save and load objects containing large NumPy arrays faster than pickle, and, unlike pickle, it can correctly save and load objects with memmapped NumPy arrays. If you want single-file serialization of the whole object (and don't need to save memmapped arrays), plain pickle may be the better choice; as far as I know, joblib's dump/load will then work at roughly the same speed as pickle.
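To answer the title question directly: `joblib.dump` returns the list of files it created, and the first entry is the main file you pass to `joblib.load`. A minimal sketch (the dict standing in for a fitted estimator and the file names are illustrative, not from the original; recent joblib versions write a single file by default, while old versions also wrote companion `.npy` files that had to stay next to the main one):

```python
import os
import tempfile

import numpy as np
import joblib  # old code used sklearn.externals.joblib, which behaved the same way

model = {"weights": np.arange(10)}  # stand-in for a fitted estimator

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# dump() returns the list of file names it wrote; filenames[0] is the
# main file. Any companion .npy files (old joblib) must remain in the
# same directory for load() to work.
filenames = joblib.dump(model, path)
restored = joblib.load(filenames[0])
```

So you never pick among the files yourself: load the `.pkl` you named in the `dump` call and keep the rest alongside it.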

import pickle
import numpy as np
from sklearn.externals import joblib  # in recent versions: import joblib

vector = np.arange(0, 10**7)

%timeit joblib.dump(vector, 'vector.pkl')
# 1 loops, best of 3: 818 ms per loop
# file size ~ 80 MB
%timeit vector_load = joblib.load('vector.pkl')
# 10 loops, best of 3: 47.6 ms per loop

# Compressed
%timeit joblib.dump(vector, 'vector.pkl', compress=1)
# 1 loops, best of 3: 1.58 s per loop
# file size ~ 15.1 MB
%timeit vector_load = joblib.load('vector.pkl')
# 1 loops, best of 3: 442 ms per loop

# Pickle
%%timeit
with open('vector.pkl', 'wb') as f:
    pickle.dump(vector, f)
# 1 loops, best of 3: 927 ms per loop
%%timeit
with open('vector.pkl', 'rb') as f:
    vector_load = pickle.load(f)
# 10 loops, best of 3: 94.1 ms per loop
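As noted above, one thing pickle cannot do is map a saved array back from disk instead of copying it into RAM. A small sketch of joblib's `mmap_mode` option on `load` (the array and file name are illustrative, assuming a standalone `joblib` install):

```python
import os
import tempfile

import numpy as np
import joblib

arr = np.arange(10**6)
path = os.path.join(tempfile.mkdtemp(), "arr.pkl")
joblib.dump(arr, path)

# mmap_mode='r' memory-maps the array data from disk rather than
# loading a full copy into RAM; the result is a read-only np.memmap.
arr_mm = joblib.load(path, mmap_mode='r')
```

This is handy when several processes need read access to the same large array without each holding its own copy in memory.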
