Why is joblib in Docker so slow?

joblib inside a Docker container can become very slow when you hand it big arrays. This post explains why and shows a workaround.

Problem

When you pass big arrays to joblib.Parallel inside a Docker container, the parallel processing can take a long time to start, or may never start at all.
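As a rough illustration, a call shaped like the one below can sit at the memmap dump stage for a long time before any worker runs. This is only a sketch; the array size and the col_mean worker are made up for the example.

import numpy as np
from joblib import Parallel, delayed

# Each task receives the full ~800 MB array, so joblib memory-maps it for the
# worker processes; if the memmap area is too small, the dump stalls before
# any worker starts computing.
big = np.random.rand(10_000, 10_000)

def col_mean(a, i):
    # hypothetical worker, just to give the pool something to do
    return a[:, i].mean()

out = Parallel(n_jobs=-1)(delayed(col_mean)(big, i) for i in range(8))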

Why?

joblib memory-maps large input arrays to a temporary folder so the worker processes can share them without copying. That folder is taken from the JOBLIB_TEMP_FOLDER environment variable; when it is not set, joblib falls back to /dev/shm. Inside a Docker container, /dev/shm defaults to only 64 MB, which is nowhere near enough for big arrays, so the dump grinds away (or fails) before the workers ever start.

https://pythonhosted.org/joblib/generated/joblib.Parallel.html
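To confirm that /dev/shm is the bottleneck, check how much shared memory the container actually has. A minimal sketch using only the standard library:

import shutil

# In an unconfigured Docker container this typically reports about 64 MB in
# total, far smaller than a multi-gigabyte array.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 1e6:.0f} MB free of {total / 1e6:.0f} MB")

(Docker's --shm-size flag can raise this limit, but the fixes below stay on the joblib side and do not require changing how the container is started.)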

Solution

  • Set the JOBLIB_TEMP_FOLDER environment variable to a folder on a large enough disk (see the sketch after the snippet below), or
  • Pass the temp_folder argument to Parallel:

from joblib import Parallel, delayed  # sklearn.externals.joblib is deprecated; import joblib directly

r = Parallel(n_jobs=-1, temp_folder=".")(delayed(hogehoge)(bigarray[i]) for i in List_i)
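The first option can also be done from inside Python, as long as it runs before Parallel is used. A minimal sketch, assuming /tmp sits on a disk large enough for the memmapped arrays (the path is only an example):

import os

# Assumption: /tmp is backed by a large enough disk. Set this before calling
# joblib.Parallel so the memmap files land here instead of /dev/shm.
os.environ["JOBLIB_TEMP_FOLDER"] = "/tmp"

Either way, the memmap files move off /dev/shm and onto ordinary disk, trading a little I/O speed for enough room to hold the arrays.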

(Summary, originally in Japanese)

When you pass large data to joblib's Parallel on Docker, the parallel processing can take a very long time to start, or never start at all. joblib uses the folder specified by JOBLIB_TEMP_FOLDER as its memmap area, and when nothing is specified it uses /dev/shm.

On Docker, however, the default /dev/shm is very small, which causes the problem above. To fix it, either set JOBLIB_TEMP_FOLDER or pass the temp_folder argument to Parallel. https://pythonhosted.org/joblib/generated/joblib.Parallel.html