How do I use a constantly changing Python library in a Docker image or a new container?
Problem description
I organize my code in a Python package (usually in a virtual environment such as virtualenv and/or conda) and then usually call:
python <path_to/my_project/setup.py> develop
so that I can use the most recent version of my code. Since I mostly develop statistical or machine learning algorithms, I prototype a lot and change my code daily. However, the recommended way to run our experiments on the clusters I have access to has recently become Docker. I have learned about Docker and think I have a rough idea of how to make it work, but I wasn't quite sure whether my solutions are good or whether there might be better ones out there.
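(For reference, pip provides the same editable install; a minimal equivalent sketch, with a placeholder path:)
# editable ("develop") install: source changes take effect without reinstalling
pip install -e ~/my_project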
The first solution I thought of copies the code into my Docker image with:
COPY /path_to/my_project /my_project
RUN pip install /my_project
and then pip-installs it. The issue with this solution is that I have to actually build a new image each time, which seems silly, and I was hoping for something better. To do this I was thinking of having a bash file like:
#BASH FILE TO BUILD AND REBUILD MY STUFF
# build the image with the newest version of my project code;
# it pip-installs the code and its dependencies
docker build -t image_name .
docker run --rm image_name python run_ML_experiment_file.py
docker kill current_container # not sure how to get the id of the container
docker rmi image_name
As I said, my intuition tells me this is silly, so I was hoping there is a single-command way to do this with Docker, or with a single Dockerfile. Also, note that the command should use -v ~/data/:/data to be able to get the data, plus some other volume/mount to write to (on the host) when training finishes.
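Such a run command might look like this (a sketch; image_name and ~/results are hypothetical placeholders):
# mount the training data and a host directory to write results back to
docker run --rm -v ~/data/:/data -v ~/results/:/results image_name python run_ML_experiment_file.py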
Another solution I thought of was to put all the Python dependencies and other dependencies my library needs into the Dockerfile (and hence into the image), and then somehow execute the installation of my library in the running container, maybe with docker exec [OPTIONS] CONTAINER COMMAND as:
docker exec CONTAINER pip install /path_to/my_project
in the running container. After that I could run the real experiment I want with the same exec command:
docker exec CONTAINER python run_ML_experiment_file.py
However, I still don't know how to systematically get the container ID (because I probably don't want to look it up every time I do this).
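For what it's worth, the ID lookup can be avoided by naming the container with --name, or by capturing the ID that docker run -d prints (my_experiment and image_name are placeholders):
# option 1: a fixed name means exec never needs an id
docker run -d --name my_experiment image_name sleep infinity
docker exec my_experiment pip install /path_to/my_project
docker exec my_experiment python run_ML_experiment_file.py
# option 2: capture the id that docker run -d prints
CID=$(docker run -d image_name sleep infinity)
docker exec "$CID" python run_ML_experiment_file.py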
Ideally, the best conceptual solution in my head would be to simply have the Dockerfile know from the beginning which directory it should mount (i.e. /path_to/my_project) and then somehow do python [/path_to/my_project] develop inside the image, so that it is always linked to the potentially changing Python package/project. That way I could run my experiments with a single Docker command, as in:
docker run --rm -v ~/data/:/data image_name python run_ML_experiment_file.py
and not have to explicitly update the image myself every time (which includes not having to reinstall the parts of the image that should be static), since it is always in sync with the real library. Also, having some other script build a new image from scratch each time is not what I am looking for, and it would be nice to avoid writing any bash, if possible.
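A hedged sketch of such a single run command (it does still lean on an inline bash -c; the image name and paths are placeholders):
# mount data and the live project, install it in develop mode, then run
docker run --rm -v ~/data/:/data -v ~/my_tf_proj:/my_tf_proj image_name bash -c "cd /my_tf_proj && python setup.py develop && python run_ML_experiment_file.py"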
I think I am very close to a good solution. Instead of building a new image each time, I will simply have the CMD command run python develop, as follows:
# install my library (only when a container is spun up)
CMD python ~/my_tf_proj/setup.py develop
The advantage is that it only installs my library whenever I run a new container. This solves the development issue, because re-creating a new image takes too long. Though I just realized that if I use the CMD command then I can't run other commands passed to my docker run, so what I actually mean is to use ENTRYPOINT.
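One common way to get this behavior (a sketch, not necessarily the exact setup; entrypoint.sh and all paths are hypothetical) is a tiny entrypoint script that installs the mounted package and then execs whatever command docker run was given:
#!/bin/bash
# entrypoint.sh: install the bind-mounted library, then run the requested command
set -e
cd /my_tf_proj && python setup.py develop
exec "$@"
with, in the Dockerfile:
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["python", "run_ML_experiment_file.py"]
A plain docker run then installs the latest mounted sources and falls back to the CMD, while any command given to docker run overrides the CMD but still goes through the install step first.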
Right now the only remaining issue is that I am having trouble with volumes, because I can't successfully link to my host project directory from within the Dockerfile (which seems to require an absolute path for some reason). I am currently doing this (which doesn't seem to work):
VOLUME /absolute_path_to/my_tf_proj /my_tf_proj
Why can't I link using the VOLUME command in my Dockerfile? My main intention with VOLUME is to make my library (and the other files this image always needs) accessible when the CMD command tries to install my library. Is it possible to just have my library available all the time when a container is started?
Ideally I just want the library to be installed automatically when a container is run and, since the most recent version of the library is always required, installed when the container is initialized.
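(As background: VOLUME in a Dockerfile only declares a mount point inside the image and cannot reference a host path, since an image must not depend on any particular host. A host directory can only be attached at run time, e.g., with placeholder paths:)
# host paths can only be bound when the container is started, not at build time
docker run --rm -v /absolute_path_to/my_tf_proj:/my_tf_proj image_name python /my_tf_proj/run_ML_experiment_file.py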
For reference, my non-working Dockerfile currently looks as follows:
# This means you derive your docker image from the tensorflow docker image
# FROM gcr.io/tensorflow/tensorflow:latest-devel-gpu
FROM gcr.io/tensorflow/tensorflow
#FROM python
FROM ubuntu
RUN mkdir ~/my_tf_proj/
# mounts my tensorflow lib/proj from host to the container
VOLUME /absolute_path_to/my_tf_proj
#
RUN apt-get update
#
RUN apt-get install -y vim
#
RUN apt-get install -qy python3
RUN apt-get install -qy python3-pip
RUN pip3 install --upgrade pip
#RUN apt-get install -y python python-dev python-distribute python-pip
# have the dependencies for my tensorflow library
RUN pip3 install numpy
RUN pip3 install keras
RUN pip3 install namespaces
RUN pip3 install pdb
# install my library (only when a container is spun up)
#CMD python ~/my_tf_proj/setup.py develop
ENTRYPOINT python ~/my_tf_proj/setup.py develop
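For comparison, a hedged sketch of how that file might be repaired, assuming the TensorFlow image is the base actually wanted (only one FROM can be in effect, the base image already ships Python and pip, and VOLUME only declares a mount point; entrypoint.sh is the hypothetical script sketched above):
FROM gcr.io/tensorflow/tensorflow
# install system tools and the extra python deps in cached layers
RUN apt-get update && apt-get install -y vim
RUN pip install numpy keras
# mount point only; bind the host project here with -v when running
VOLUME /my_tf_proj
# install the mounted library at start-up, then run the given command
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["python", "run_ML_experiment_file.py"]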
As a side remark:
Also, for some reason I need to run RUN apt-get update before I can even install pip or vim in my container. Do people know why? I wanted to have these because, just in case I want to attach to the container with a bash terminal, they would be really helpful.
Does Docker just force you to apt-get update so that you always have the most recent version of the software in the container?
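(The usual reason, for what it's worth: Ubuntu/Debian base images delete the apt package lists to keep the image small, so apt-get update must repopulate them before any apt-get install can work. The conventional Dockerfile pattern chains the two so a cached layer never holds stale lists:)
# update and install in a single RUN layer
RUN apt-get update && apt-get install -y vim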
Bounty:
What would a solution with COPY look like? And perhaps docker build -f path/Dockerfile .? See: How does one build a docker image from the home user directory?
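One hedged way the COPY route can work from a home-directory project: COPY sources are resolved relative to the build context, so pass the project directory as the context and point -f at the Dockerfile (paths are placeholders):
# the last argument is the build context; inside the Dockerfile,
# "COPY . /my_tf_proj" then copies that directory into the image
docker build -f /path_to/Dockerfile -t image_name ~/my_tf_proj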
Solution
During development it is, IMO, perfectly fine to map/mount the host directory with your ever-changing sources into the Docker container. The rest (the Python version and the other libraries you depend upon) can all be installed in the normal way in the Docker container.
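Concretely, that might look like this (a sketch with placeholder paths; setting PYTHONPATH makes the mounted package importable without any install step):
# dependencies are baked into the image; only the changing sources are mounted
docker run --rm -v ~/my_tf_proj:/my_tf_proj -e PYTHONPATH=/my_tf_proj image_name python run_ML_experiment_file.py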
Once things stabilize, I remove the map/mount and add the package to the list of items to install with pip. I do have a separate container running devpi, so I can pip-install packages whether I push them all the way to PyPI or just push them to my local devpi container.
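Pointing pip at such a local index looks roughly like this (host and index names are placeholders; 3141 is devpi's default port):
# install from the local devpi index instead of public PyPI
pip install --index-url http://localhost:3141/root/dev/+simple/ my_package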
Container creation can be sped up even if you use the common (but more limited) python [path_to_project/setup.py] develop. In that case your Dockerfile should look like:
# the following seldom changes, only when a package is added to setup.py
COPY /some/older/version/of/project/plus/dependent/packages /older/setup
RUN pip install /older/setup/your_package.tar.gz
# the following changes all the time, but that is only a small amount of work
COPY /latest/version/of/project /project
RUN python /project/setup.py develop
If the first COPY results in changes to the files under /older/setup, the image is rebuilt from that layer onward; otherwise Docker reuses its build cache.
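The same layer-caching idea is often written with a requirements file (a sketch; file and path names are placeholders):
# cached as long as requirements.txt itself is unchanged
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
# only these cheap layers are rebuilt when the sources change
COPY . /project
RUN pip install -e /project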
Running python ... develop still takes extra time, and you need to rebuild/restart the container. Since my packages can all also just be copied in/linked to (in addition to being installed), that is still a large overhead. Instead, I run a small program in the container that checks whether the (mounted/mapped) sources have changed and then automatically reruns whatever I am developing/testing. So I only have to save a new version and watch the output of the container.
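A minimal sketch of such a watcher as a shell loop (the answer does not show its actual program; the path and the command to rerun are placeholders):
#!/bin/bash
# rerun the experiment whenever a mounted source file changes (checksum poll)
while true; do
  sum=$(find /my_tf_proj -name '*.py' -type f -exec md5sum {} + | md5sum)
  if [ "$sum" != "$last" ]; then
    last=$sum
    python run_ML_experiment_file.py
  fi
  sleep 2
done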