On vendorizing - bitprophet.org

On what?

Vendorizing. Including dependencies inside your own source tree as if they were part of your application. Also known as ‘bundling’, ‘omnibus’ etc. I’m going to be talking about Python software specifically, but it’s a general problem.

I’m weighing the pros and cons of vendorizing vs relying on externally-installed libraries, as I don’t see a strong consensus on the topic – and am curious what others with different experiences and use cases have to say.

Why are you writing this?

Here’s the problem sparking this discussion:

Invoke (under development) needs the Fluidity library to operate, and relies¹ on a feature added in version 0.2.1.
Unfortunately, Fluidity 0.2.1 was never actually released or tagged so we can’t simply tell our setup.py to require fluidity-sm==0.2.1 or even rely on a tag-tarball reference.
That leaves us with the following options:
- Wait for upstream to comply and fix the packaging problem.
  - This removes control over our own release schedule, which is unacceptable. As OSS authors ourselves, we sympathize with the “real life & the day job take priority” angle, but have to look out for our own interests.
- Fork and release our own package on PyPI as e.g. fluidity-invoke.
  - This works, but has many of the drawbacks of the vendorizing option and offers few of the benefits.
  - It also confuses things re: project ownership and who should receive/act on bug reports. Users new to the space might focus on your fork instead of upstream, forcing you to either handle their problems, or redirect them.
- Keep a copy of Fluidity inside the Invoke source tree.
  - The option under discussion – read on!

What’s at stake?

Distributing software tends to involve the following concerns, many (but not all) of which relate strongly to the question of vendorizing:

- Ease of installation
- Dependency clashes when installed
- Dealing with poorly packaged dependencies

Let’s take these one at a time.

Ease of installation

Every dependency a project has is an opportunity for installation to go wrong – disappearing download locations (PyPI or otherwise), broken setup.py files, non-Python requirements like a C compiler or version control tool, etc.

This is exacerbated for newbies who have no recourse when problems crop up, unlike power users who might know to e.g. slap on -M to pip to get mirror support, or already have a build toolchain installed.

Vendorizing doesn’t solve the problem for C extensions, but it wraps up almost everything else. If the user was able to obtain your software, they have everything needed to run it.

Dependency clashes

In general

Imagine your library depends on sub-library foo but only works with versions 1.2 and above. Or maybe it works with 1.0 and up, but 1.1 had a buggy setup.py that was never fixed, so now you only support 1.0 and 1.2+. (Point being: this can get complicated.)

Unfortunately, somebody using your lib needs an unrelated tool which also relies on foo, and only works with foo 0.9. They’re in trouble: with foo 0.9 or lower, your library won’t work, and with 1.0 or higher, this other dependency breaks.

“Dependency hell” is a big problem because real-world software often has a tangled web of dependency versions – the above is only one, simple example.

If your library, or this other tool (or both) vendorized foo, this problem and any like it become non-issues.
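To make the clash concrete, here is a minimal sketch in Python. The release list and the two requirement functions are hypothetical, modeled on the foo example above; real installers parse and compare versions far more carefully than these tuples do.

```python
# Hypothetical released versions of "foo", as (major, minor) tuples.
FOO_RELEASES = [(0, 8), (0, 9), (1, 0), (1, 1), (1, 2), (1, 3)]


def works_with_your_lib(version):
    """Your library: foo 1.0 and up, except the broken 1.1 release."""
    return version >= (1, 0) and version != (1, 1)


def works_with_other_tool(version):
    """The unrelated tool: only foo 0.9."""
    return version == (0, 9)


def compatible_releases(releases, *requirements):
    """Return the releases that satisfy every requirement at once."""
    return [v for v in releases if all(req(v) for req in requirements)]


both = compatible_releases(FOO_RELEASES, works_with_your_lib, works_with_other_tool)
print(both)  # an empty list: no single install of foo satisfies both
```

Each requirement is satisfiable on its own; it is only the shared, single installed copy of foo that makes the combination impossible.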

‘Incidental’ tools

A sub-case is where the clashing dependencies belong to code that is not typically loaded into the same process. For example, a task runner like Invoke or Fabric is used to manage a project but may not import the project’s code or be imported into it².

Incidental/utility tools should ideally behave like standalone applications, in the same way their C analogues like make effectively do, in order to have minimal impact on the projects they’re used in. Vendorizing enables this goal.

Poorly packaged dependencies

This is the problem Invoke is facing: one of our dependencies is not properly packaged, making it difficult for us to depend on it normally. More generally, this category covers the desire/need to use unreleased-but-stable features, or to use an unofficial fork of a library – or even a lib not listed on PyPI at all³.

As mentioned above, a dependency’s setup.py can also cause problems for packagers by being flat-out broken, or by referencing sub-dependencies that conflict with our users’ environments.

Why might vendorizing not be the right solution?

Most of the above focuses on why bundling/vendorizing is the right solution. But it’s clearly not a panacea or we’d see it done far more frequently. Here’s where the approach fails:

Removes control from power users

We’d like to think otherwise, but package maintainers lack omniscience and can’t foresee all problems and version conflicts. Vendorizing helps to solve many conflicts, as per above, but makes life difficult for users who truly need to override the maintainer’s decision re: what dependency versions are included.

Theoretically, we could engineer libraries to use bundled dependencies as a fallback (e.g. try/except at import time) but doing so exposes us to many of the above problems all over again.
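A minimal sketch of that import-time fallback might look like the following. The helper name `import_with_fallback` and the package path `mylib.vendor.foo` are hypothetical, invented for illustration; only `importlib.import_module` is a real standard-library call.

```python
import importlib


def import_with_fallback(preferred, fallback):
    """Import `preferred` if it is installed, else use the bundled copy."""
    try:
        return importlib.import_module(preferred)
    except ImportError:
        return importlib.import_module(fallback)


# Hypothetical usage inside a project that bundles foo:
#     foo = import_with_fallback("foo", "mylib.vendor.foo")
# Demonstrated here with a stdlib module standing in for the bundled copy.
module = import_with_fallback("no_such_external_package", "json")
print(module.__name__)  # json
```

Note the catch: whenever an external copy is installed, it wins, regardless of whether it is a version the library actually works with.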

Package installation tools get us most of the way there already

Many of the above problems can be solved with judicious use of pip, setup.py and/or requirements.txt files, because they do handle the more common ~90% of use cases. Vendorizing is more of a paranoid approach where one assumes the un-handled 10% will come into play (or, again as with Invoke, where we’re already running into it.)

Vendorizing clutters the main project’s codebase and version control

Once you add dependencies to your source tree, all sorts of common tools and tasks related to “your project” – grep, sloccount, linters, test loaders and so forth – have to be told to ignore these extra directories, or risk a ton of extra noise.

This includes version control – updates to the dependencies’ directories, even if done in large single commits, clutter your Git or Mercurial history.
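As a rough illustration of the "have to be told to ignore" burden, here is a sketch of a tiny line counter that must explicitly skip bundled code. The directory name `vendor` is an assumption; use whatever name your tree actually has.

```python
import os


def count_python_lines(root, ignored_dirs=("vendor",)):
    """Count lines in .py files under root, skipping ignored directories."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into ignored directories.
        dirnames[:] = [d for d in dirnames if d not in ignored_dirs]
        for name in filenames:
            if name.endswith(".py"):
                with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                    total += sum(1 for _ in f)
    return total
```

Every tool that walks the source tree needs an equivalent of that `ignored_dirs` setting, each in its own configuration format.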

It’s extra work!

Any dependencies require effort to stay abreast of upstream changes. However, this typically just means you have to manage version numbers in setup.py. And if your requirements are loose enough (e.g. >=1.0,<2.0 to say “any 1.x is fine”) your users can reap the benefits of new upstream features without you having to release anything yourself.
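To show what a loose range such as `>=1.0,<2.0` means in practice, here is a simplified sketch of a range check. It is not how pip or setuptools implement version matching; the function names are hypothetical and it only understands plain numeric versions.

```python
import operator

# Comparison operators supported by this sketch (a small subset of real specs).
_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        ">": operator.gt, "<": operator.lt}


def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2); numeric parts only, no padding."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, spec):
    """Check a version string against a spec such as '>=1.0,<2.0'."""
    candidate = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators before one-character ones.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not _OPS[symbol](candidate, bound):
                    return False
                break
        else:
            raise ValueError("unsupported clause: %r" % clause)
    return True


print(satisfies("1.7.3", ">=1.0,<2.0"))  # True
print(satisfies("2.0", ">=1.0,<2.0"))    # False
```

Any future 1.x release passes the check without the depending project changing anything, which is exactly the benefit described above.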

The vendorized approach involves much more: you’re now repeating the work of upstream, re: merging new changes into your existing tree and dealing with the possible conflicts that arise.

Additionally, you’ll often end up curating your local copy, fixing bugs or adding small features. Now you have to send those upstream, and even worse, deal with the resulting patch from the maintainer if (or when) they decide to tweak your submission a bit.

In short, you take on all the problems that come with a private fork of that dependency, multiplied for each new dependency you decide to vendorize.

A counterpoint: vendorizing is most feasible when the dependency changes very infrequently (and thus, this sort of administrative work is kept to a minimum.) Conversely, projects that are being collaborated on heavily are a bad choice for vendorizing because this overhead becomes insurmountable.

Storage and bandwidth inefficiency

If taken to its logical conclusion, vendorizing could be very wasteful re: storage space and bandwidth required to download multiple copies of common dependencies. Imagine if you had a Django project with a dozen dependencies which all bundled Django itself, as unlikely/silly as that would be.

One can counter that storage is a non-issue, but download speeds still vary widely, so it’s still a valid point in favor of not bundling your deps.

What to do?

I’m not entirely sure. Clearly the solution is different for each project, and I’m not intending to prescribe a general approach – just to solve it for Invoke specifically.

I’m interested in your feedback: what use cases or angles am I missing, and/or what about the above do you disagree with? Please leave a comment or email me. I’ll also try to incorporate feedback into the post itself over time.

¹ Technically we don’t rely on it – it’s mostly useful for debugging. The problem still stands, though!
² Though this sort of purity is not terrifically common, it’s still important to consider.
³ The argument that “well they should host it on PyPI, WTF!” holds no water here: I still want that software, and I can’t ask my users to install it via easy_install/pip. So it’s my problem to solve.