This Week in Data: Python Environment Freshness

(“This Week in Glean Data” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

By: Perry McManis and Chelsea Troy

A note on audience: the intended reader for this post is a data scientist or analyst, product owner or manager, or similar who uses Python regularly but has not had the opportunity to work with engineering processes to the degree they may like. Experienced engineers may still benefit from the friendly reminder to keep their environments fresh and up-to-date.

When was the last time you remade your local Python environment? One month ago? Six months ago? 1997?

Wait, please, don’t leave. I know, I might as well have asked when you last cleaned out the food trap in your dishwasher, and I apologize. But this is almost as important. Almost.

If you don’t recall when, go ahead and check when you made your currently most used environment. It might surprise you how long ago it was.

# See this helpful Stack Overflow post by Timur Shtatland: https://stackoverflow.com/a/69109373
Mac: conda env list -v -v -v | grep -v '^#' | perl -lane 'print $F[-1]' | xargs /bin/ls -lrtd
Linux: conda env list | grep -v '^#' | perl -lane 'print $F[-1]' | xargs ls -lrt1d
# On Windows, use conda env list to find the top-level directory of your envs,
# e.g. C:\Users\yourname\miniconda3\envs, then list it with creation times:
Windows: conda env list
Windows: dir /T:C C:\Users\yourname\miniconda3\envs

Don’t feel bad, though, if it does surprise you, or if the answer is one you’d rather not admit publicly. Python environments are hard. Not in the everything-is-hard-until-you-know-how way, but in the why doesn’t this work? This worked last week! way. So the temptation is often to just not mess with things. Especially if you have that one environment you’ve been using for the last 4 years, you know, the one propped up with popsicle sticks and duct tape? But I’d like to propose that you consider regularly remaking your environments, and that you build your own processes for doing so.

It is my opinion that if you can, you should be working in a fresh environment.

Much like a best-by date, what counts as fresh is contextual. But if you start getting that when did I stand this env up? feeling, it’s time. Working in a fresh environment has a few benefits. First, it makes it more likely that other folks will be able to duplicate it easily. Just as an accurate forecast becomes harder to make the further into the future you look, the further you get from the date you completed a task in a changing ecosystem, the less likely it is that the task can be completed successfully again.

Perhaps even more relevant: packages regularly release security updates, APIs improve, and functionality that you originally had to implement yourself may even get an official release. Official releases, especially for higher-level programming languages like Python, are often highly optimized. For many researchers, those optimizations are out of the scope of their work, and rightly so. But the built-in version of that calculation in your favorite stats package not only has several engineers working to make it run as quickly as possible; you also get the benefit of many other researchers testing it concurrently with you.

These issues can collide spectacularly when people get stuck trying to replicate your environment because a required version of a package has been deprecated. And if you never update your own environment, it could take someone else bringing it up before you even notice that one of the packages you rely on is no longer available, that an API has moved from experimental to release, or that it has been removed altogether.
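
A quick way to gauge how stale an environment has become, run from inside the environment in question, is to ask for a report of outdated packages. This is only a minimal sketch; both commands just report and change nothing:

# show installed packages that have newer releases available
pip list --outdated
# conda can do a similar check as a dry run
conda update --all --dry-run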

There is no best way of making fresh environments, but I have a few suggestions you might consider.

I will preface this by saying that my preference is for command line tools, and these suggestions reflect that. Using graphical interfaces is a perfectly valid way to handle your environments; I’m just not that familiar with them, so while I think the ideas of environment freshness still apply, you will have to find your own way with them. More generally, I would encourage you to develop your own processes anyway. These are suggestions on where to start, and not all of them need to find their way into your routines.

If you are completely unfamiliar with these environments, and you’ve been working in your base environment, I would recommend in the strongest terms possible that you immediately back it up. Python environments are shockingly easy to break beyond repair, and they tend to do so at the worst possible time in the worst possible way. Think live demo in front of the whole company that’s being simulcast on YouTube. LeVar Burton is in the audience. You don’t want to disappoint him, do you? The easiest way to quickly make a backup is to create a new environment through the normal means, confirm it has everything you need in it, and make a copy of the whole install folder of the original.
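
If you use conda, another low-effort option is to let conda clone the environment for you rather than copying the folder by hand. A minimal sketch, with base_backup purely as an example name:

# clone the base environment into a new one you can fall back to
conda create --name base_backup --clone base
# also export the package list so the environment can be rebuilt elsewhere
conda env export -n base > base_backup.yml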

If you’re not in the habit of making new environments, then the next time you need to update a package you use constantly, consider making an entirely new environment for it. As an added bonus, this gives you a fallback in case something goes wrong. If you’ve not done this before, one of the easiest ways is to use pip’s freeze-format output.

# in your existing environment, export installed packages with pinned versions
pip list --format=freeze > requirements.txt
# create and switch to a fresh environment (installing python so pip lives inside it)
conda create -n {new env name} python
conda activate {new env name}
pip install -r requirements.txt
# then upgrade the package you actually set out to update
pip install {package} --upgrade

When you create your requirements.txt file, it’s usually a pretty good idea to read through it. A common gotcha is seeing local file paths in place of version numbers for packages that were installed from local or version-controlled sources; pip list --format=freeze avoids most of that, which is why we used it here instead of pip freeze. But it never hurts to check.
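
For illustration, the difference you are scanning for looks roughly like this (the version number and the second package name are made up for the example):

# a pinned version, which anyone can install
numpy==1.24.3
# a local file reference, which only resolves on the machine it came from
somepackage @ file:///Users/yourname/builds/somepackage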

Take a look at your version numbers: are any of them really out of date? That is something we want to fix, but often some of our important packages have dependencies that require specific versions, and we have to be careful not to break them. We can work around that, while still getting the newest versions we can, by removing those dependencies from our requirements file and installing our most critical packages separately. That way we let pip or conda resolve the newest versions of everything that will work together. For example, if I need pandas, and I know pandas depends on numpy, I can remove both from my requirements file and let pip handle those dependencies for me.

# make sure pip itself is current
pip install --upgrade pip
# install everything else (pandas and numpy removed from requirements.txt)
pip install -r requirements.txt
# let pip pull the newest pandas, along with a compatible numpy
pip install pandas

Something you may notice is that this block looks like it could be packaged up, since it’s just a collection of commands. And indeed it can be. We can put it in a shell script and, with a bit of work, add a command line argument so it fires off a new environment for us in one go. It can also be expanded with shell commands for cases where we need a compiler, a tool from another language, even a GitHub repo, and so on. Assuming we have a way to run shell scripts, let’s call this create_env.sh:

# one way to make conda's activate and deactivate commands work inside a bash script
eval "$(conda shell.bash hook)"

# leave whatever environment we are in and build a fresh one named by the first argument
conda deactivate
conda create -y -n $1 python
conda activate $1

# system compilers some packages need to build (may require sudo on your machine)
apt install gcc
apt install g++

pip install --upgrade pip
pip install pystan==2.19.1.1
python3 -m pip install prophet --no-cache-dir

pip install -r requirements.txt
pip install scikit-learn

# pull down any repositories we need alongside the environment
git clone https://github.com/mozilla-mobile/fenix.git
cd ./fenix

echo "Finished creating new environment: $1"

And because the script takes the environment name as an argument, we can now call bash create_env.sh newenv and be ready to go.

It will likely take some experimentation the first time or two. But once you know the steps you need to follow, getting new environment creation down to just a few minutes is as easy as packaging those steps up. And if you want to share, you can send your setup script rather than a list of instructions. Including it in your repository with a descriptive name and a mention in your README.md is a low-effort way to help other folks get going with less friction.
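
For instance, adding it to a project might look something like this (a sketch; the file names and commit message are only illustrations):

# make the script executable and commit it alongside the README that mentions it
chmod +x create_env.sh
git add create_env.sh README.md
git commit -m "Add environment setup script"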

There are of course tons of other great ways to package environments, like Docker. I would encourage you to read up on them if you are interested in reproducibility beyond the simpler case of rebuilding your local environment with regularity. There are a huge number of fascinating and immensely powerful tools out there to explore, should you wish to bring even more rigor to your Python working environments.
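
To give a flavor of what that looks like at the command line, assuming you have already written a Dockerfile for your project (the image name here is made up), building and stepping into such an environment goes roughly like this:

# build an image from the Dockerfile in the current directory
docker build -t my-analysis-env .
# start a throwaway container and open a Python shell inside it
docker run -it --rm my-analysis-env python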

In the end, the main thing is to work out a fast, repeatable method that lets you get your environment up and running again from scratch, one that works for you. Then, when you get the feeling that your environment has been around for a while, you won’t have to worry that making a new one will be an all-day, or even worse, all-week affair. By investing in your own process, you will save yourself loads of time in the long run, you may even save your colleagues some, and hopefully you’ll spare yourself some frustration, too.

Like anything, the key to working out your process is repetition. The first time will be hard, though maybe some of the tips here can make it a bit easier. But the second time will be easier. And after a handful, you will have developed a framework that makes your work more portable, more resilient, and less angering, even beyond Python.