[Numpy-discussion] Announcing toydist, improving distribution and packaging situation

David_Cournapeau · December 29, 2009, 2:34pm

I would like to note that buildout is a solution to a problem that I
don't care to solve. In my experience, this issue is particularly
difficult to explain to people accustomed to buildout; I have not found
a way to explain it very well yet.

Buildout and virtualenv both work by sandboxing from the system python:
the sandboxes do not see each other, which may be useful for
development, but as a deployment solution for the casual user who may
not be familiar with python, it is useless. A scientist who installs
numpy, scipy, etc. to try things out wants to have everything
available in one python interpreter, and does not want to jump between
different virtualenvs and whatnot to try different packages.

This has strong consequences on how you look at things from a packaging POV:
- uninstall is crucial
- a package bringing down python is a big no-no (this happens way too
often when you install things through setuptools)
- if something fails, the recovery should be trivial - the person
doing the installation may not know much about python
- you cannot use sandboxing as a replacement for backward
compatibility (that's why I don't care much about all the discussion
about versioning - I don't think it is very useful as long as python
itself does not support it natively).

In the context of ruby, this article makes a similar point:
http://www.madstop.com/ruby/ruby_has_a_distribution_problem.html

David

···

On Tue, Dec 29, 2009 at 10:27 PM, René Dudfield <renesd@...149...> wrote:

Buildout is what a lot of the python community are using now.

Gael_Varoquaux1 · December 29, 2009, 3:55pm

I think that you are pointing out a large source of misunderstanding
in packaging discussions. People behind setuptools, pip or buildout care
about having a working ensemble of packages that delivers an application
(often a web application)[1]. You and I, and many scientific developers, see
libraries as building blocks that need to be assembled by the user, the
scientist using them to do new science. Thus the idea of isolation is not
something that we can accept, because it means that we are restricting
the user to a set of libraries.

Our definition of user is not the same as the user targeted by buildout.
Our user does not push buttons; he writes code. However, unlike the
developer targeted by buildout and distutils, our user does not want or
need to learn about packaging.

Trying to make the debate clearer...

Gaël

[1] I know your position on why simply focusing on sandboxing a working
ensemble of libraries is not a replacement for backward compatibility,
and will only create impossible problems in the long run. While I agree
with you, this is not my point here.


_Chris.Barker · December 29, 2009, 9:29pm

David Cournapeau wrote:

Buildout and virtualenv both work by sandboxing from the system python:
the sandboxes do not see each other, which may be useful for
development,

And certain kinds of deployment, like web servers or installed tools.

but as a deployment solution for the casual user who may
not be familiar with python, it is useless. A scientist who installs
numpy, scipy, etc. to try things out wants to have everything
available in one python interpreter, and does not want to jump between
different virtualenvs and whatnot to try different packages.

Absolutely true -- which is why Python desperately needs package version selection of some sort. I've been tooting this horn on and off for years but never got any interest at all from the core python developers.

I see putting packages in with no version like having non-versioned dynamic libraries in a system -- i.e. dll hell. If I have a bunch of stuff running just fine with the various package versions I've installed, but then I start working on something (maybe just testing, maybe something more real) that requires the latest version of a package, I have a few choices:
   - install the new package and hope I don't break too much
   - use something like virtualenv, which requires a lot of overhead to set up and use (my evidence is personal: despite working with a team that uses it, somehow I've never gotten around to using it for my dev work, even though, in theory, it should be a good solution)
   - setuptools does supposedly support multiple version installs and selection, but it's ugly and poorly documented enough that I've never figured out how to use it.

This has been addressed with a handful of ad hoc solutions: wxPython has wxversion.select, and I think PyGTK has something, and who knows what else. It would be really nice to have a standard solution available.
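
As a rough illustration of what such version selection amounts to, here is a minimal sketch. It is not the wxversion or setuptools implementation: the `select()` helper, the `demo_pkg` name and the `<name>-<version>` directory layout are all assumptions made for the example.

```python
import os
import sys
import tempfile


def select(name, version, root):
    """Put root/<name>-<version> at the front of sys.path.

    Hypothetical helper: the directory layout and the function name are
    assumptions made for this sketch, not an existing API.
    """
    path = os.path.join(root, "%s-%s" % (name, version))
    if not os.path.isdir(path):
        raise ImportError("%s %s is not installed under %s" % (name, version, root))
    sys.path.insert(0, path)
    return path


def make_demo_install(root, name, version):
    """Create a fake single-module package that records its own version."""
    pkg_dir = os.path.join(root, "%s-%s" % (name, version))
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, name + ".py"), "w") as f:
        f.write("__version__ = %r\n" % version)


# Two versions installed side by side under one root directory.
root = tempfile.mkdtemp()
make_demo_install(root, "demo_pkg", "0.1")
make_demo_install(root, "demo_pkg", "0.2")

select("demo_pkg", "0.2", root)  # must happen before the first import
import demo_pkg

selected_version = demo_pkg.__version__
```

The selection only works if it runs before the first import of the package, which is one reason a standard, interpreter-level mechanism would be more robust than per-package helpers.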

Note that the usual response I've gotten is to use py2exe or something to distribute, so you're defining the whole stack. That's good for some things, but not all (though py2app's "alias" bundles are nice), and really pretty worthless for development. Also, many, many packages are a pain to use with py2exe and friends anyway (see my forthcoming other long post...)

- you cannot use sandboxing as a replacement for backward
compatibility (that's why I don't care much about all the discussion
about versioning - I don't think it is very useful as long as python
itself does not support it natively).

could be -- I'd love to have Python support it natively, though wxversion isn't too bad.

-Chris

···

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@...236...

Nathaniel_Smith · January 3, 2010, 11:05am

What I do -- and documented for people in my lab to do -- is set up
one virtualenv in my user account, and use it as my default python. (I
'activate' it from my login scripts.) The advantage of this is that
easy_install (or pip) just works, without any hassle about permissions
etc. This should be easier, but I think the basic approach is sound.
"Integration with the package system" is useless; the advantage of
distribution packages is that distributions can provide a single
coherent system with consistent version numbers across all packages,
etc., and the only way to "integrate" with that is to, well, get the
packages into the distribution.

On another note, I hope toydist will provide a "source prepare" step,
that allows arbitrary code to be run on the source tree. (For, e.g.,
cython->C conversion, ad-hoc template languages, etc.) IME this is a
very common pain point with distutils; there is just no good way to do
it, and it has to be supported in the distribution utility in order to
get everything right. In particular:
  -- Generated files should never be written to the source tree
itself, but only the build directory
  -- Building from a source checkout should run the "source prepare"
step automatically
  -- Building a source distribution should also run the "source
prepare" step, and stash the results in such a way that when later
building from the source distribution, this step can be skipped. This is a
common requirement for user convenience, and necessary if you want to
avoid arbitrary code execution during builds.
And if you just set up the distribution util so that the only place
you can specify arbitrary code execution is in the "source prepare"
step, then even people who know nothing about packaging will
automatically get all of the above right.
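
To make the three requirements concrete, here is a minimal sketch of such a step. Everything in it is hypothetical (the function names, the `.tmpl` convention, the `prepared` stash directory); it is not toydist or distutils code, only one way the behaviour described above could look.

```python
import os
import shutil
import tempfile


def source_prepare(src_dir, out_dir):
    """Hypothetical prepare step: turn every *.tmpl file in src_dir into a
    generated file in out_dir. The source tree is only read, never written."""
    os.makedirs(out_dir, exist_ok=True)
    generated = []
    for name in sorted(os.listdir(src_dir)):
        if name.endswith(".tmpl"):
            with open(os.path.join(src_dir, name)) as f:
                text = f.read().replace("@NAME@", "demo")
            target = os.path.join(out_dir, name[:-len(".tmpl")])
            with open(target, "w") as f:
                f.write(text)
            generated.append(target)
    return generated


def build(src_dir, build_dir):
    """Build from a checkout or from an unpacked source distribution.

    If the tree carries a 'prepared' directory (stashed by make_sdist),
    reuse it and run no arbitrary code; otherwise run the prepare step."""
    stash = os.path.join(src_dir, "prepared")
    gen_dir = os.path.join(build_dir, "generated")
    if os.path.isdir(stash):
        shutil.copytree(stash, gen_dir)
        return "reused"
    source_prepare(src_dir, gen_dir)
    return "prepared"


def make_sdist(src_dir, sdist_dir):
    """Copy the sources and stash the prepare step's output next to them."""
    shutil.copytree(src_dir, sdist_dir)
    source_prepare(src_dir, os.path.join(sdist_dir, "prepared"))


# Demo: a checkout with one template file.
work = tempfile.mkdtemp()
checkout = os.path.join(work, "checkout")
os.makedirs(checkout)
with open(os.path.join(checkout, "module.c.tmpl"), "w") as f:
    f.write("/* generated for @NAME@ */\n")

from_checkout = build(checkout, os.path.join(work, "build1"))
make_sdist(checkout, os.path.join(work, "sdist"))
from_sdist = build(os.path.join(work, "sdist"), os.path.join(work, "build2"))
checkout_files = sorted(os.listdir(checkout))
```

In this sketch the checkout still contains only `module.c.tmpl` after both builds, and the build from the source distribution never calls `source_prepare`.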

Cheers,
-- Nathaniel


Gael_Varoquaux1 · January 3, 2010, 11:11am

That works because either you use packages that don't have many hard-core
compiled dependencies, or these are already installed.

Think about installing VTK or ITK this way, or even something simpler such
as umfpack. I think that you would lose most of your users. In my lab,
I do lose users on such packages actually.

Besides, what you are describing is possible without package isolation: it
is simply the use of a per-user local site-packages, which is now
semi-automatic in Python 2.6 using the '.local' directory. I do agree that,
in a research lab, this is a best practice.
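
For reference, the per-user directory can be inspected from the standard `site` module. This is a small sketch, assuming a Python version that has `site.getusersitepackages()` and `site.getuserbase()` (these were added after 2.6, which only exposed `site.USER_SITE` and `site.USER_BASE`).

```python
import site

# Per-user site-packages directory (PEP 370); on Linux this is typically
# ~/.local/lib/pythonX.Y/site-packages.
user_site = site.getusersitepackages()
user_base = site.getuserbase()

# ENABLE_USER_SITE is True when the directory is honoured, False when the
# user disabled it, and None when it is disabled for security reasons.
print("user base:", user_base)
print("user site:", user_site)
print("enabled:  ", site.ENABLE_USER_SITE)
```

Packages installed with `python setup.py install --user` end up in that directory and are visible to the ordinary interpreter without any activation step.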

Gaël


David_Cournapeau · January 3, 2010, 12:23pm

Buildout and virtualenv both work by sandboxing from the system python:
the sandboxes do not see each other, which may be useful for
development, but as a deployment solution for the casual user who may
not be familiar with python, it is useless. A scientist who installs
numpy, scipy, etc. to try things out wants to have everything
available in one python interpreter, and does not want to jump between
different virtualenvs and whatnot to try different packages.

What I do -- and documented for people in my lab to do -- is set up
one virtualenv in my user account, and use it as my default python. (I
'activate' it from my login scripts.) The advantage of this is that
easy_install (or pip) just works, without any hassle about permissions
etc.

It just works if you happen to be able to build everything from
sources. That alone means you ignore the majority of users I intend to
target.

No other community (except maybe Ruby) pushes those isolated install
solutions as general deployment solutions. If it were such a great
idea, other people would have picked up those solutions.

This should be easier, but I think the basic approach is sound.
"Integration with the package system" is useless; the advantage of
distribution packages is that distributions can provide a single
coherent system with consistent version numbers across all packages,
etc., and the only way to "integrate" with that is to, well, get the
packages into the distribution.

Another way is to provide our own repository for a few major
distributions, with automatically built packages. This is how most
open source providers work. Miguel de Icaza explains this well:

I hope we will be able to reuse much of the opensuse build service
infrastructure.

On another note, I hope toydist will provide a "source prepare" step,
that allows arbitrary code to be run on the source tree. (For, e.g.,
cython->C conversion, ad-hoc template languages, etc.) IME this is a
very common pain point with distutils; there is just no good way to do
it, and it has to be supported in the distribution utility in order to
get everything right. In particular:
-- Generated files should never be written to the source tree
itself, but only the build directory
-- Building from a source checkout should run the "source prepare"
step automatically
-- Building a source distribution should also run the "source
prepare" step, and stash the results in such a way that when later
building the source distribution, this step can be skipped. This is a
common requirement for user convenience, and necessary if you want to
avoid arbitrary code execution during builds.

Build directories are hard to implement right. I don't think toydist
will support this directly. IMO, those advanced builds warrant a real
build tool - one main goal of toydist is to make integration with waf
or scons much easier. Both waf and scons have the concept of a build
directory, which should do everything you described.

David


Nathaniel_Smith · January 3, 2010, 11:42pm

What I do -- and documented for people in my lab to do -- is set up
one virtualenv in my user account, and use it as my default python. (I
'activate' it from my login scripts.) The advantage of this is that
easy_install (or pip) just works, without any hassle about permissions
etc.

It just works if you happen to be able to build everything from
sources. That alone means you ignore the majority of users I intend to
target.

No other community (except maybe Ruby) pushes those isolated install
solutions as general deployment solutions. If it were such a great
idea, other people would have picked up those solutions.

AFAICT, R works more-or-less identically (once I convinced it to use a
per-user library directory); install.packages() builds from source,
and doesn't automatically pull in and build random C library
dependencies.

I'm not advocating the 'every app in its own world' model that
virtualenv's designers had in mind, but virtualenv is very useful to
give each user their own world. Normally I only use a fraction of
virtualenv's power this way, but sometimes it's handy that they've
solved the more general problem -- I can easily move my environment
out of the way and rebuild if I've done something stupid, or
experiment with new python versions in isolation, or whatever. And
when you *do* have to reproduce some old environment -- if only to
test that the new improved environment gives the same results -- then
it's *really* handy.

This should be easier, but I think the basic approach is sound.
"Integration with the package system" is useless; the advantage of
distribution packages is that distributions can provide a single
coherent system with consistent version numbers across all packages,
etc., and the only way to "integrate" with that is to, well, get the
packages into the distribution.

Another way is to provide our own repository for a few major
distributions, with automatically built packages. This is how most
open source providers work. Miguel de Icaza explains this well:

OpenSUSE Build System - Miguel de Icaza

I hope we will be able to reuse much of the opensuse build service
infrastructure.

Sure, I'm aware of the opensuse build service, have built third-party
packages for my projects, etc. It's a good attempt, but also has a lot
of problems, and when talking about scientific software it's totally
useless to me :-). First, I don't have root on our compute cluster.
Second, even if I did I'd be very leery about installing third-party
packages because there is no guarantee that the version numbering will
be consistent between the third-party repo and the real distro repo --
suppose that the distro packages 0.1, then the third party packages
0.2, then the distro packages 0.3, will upgrades be seamless? What if
the third party screws up the version numbering at some point? Debian
has "epochs" to deal with this, but third-parties can't use them and
maintain compatibility. What if the person making the third party
packages is not an expert on these random distros that they don't even
use? Will bug reporting tools work properly? Distros are complicated.
Third, while we shouldn't advocate that people screw up backwards
compatibility, version skew is a real issue. If I need one version of
a package and my lab-mate needs another and we have submissions due
tomorrow, then filing bugs is a great idea but not a solution. Fourth,
even if we had expert maintainers taking care of all these third-party
packages and all my concerns were answered, I couldn't convince our
sysadmin of that; he's the one who'd have to clean up if something
went wrong, and we don't have a big budget for overtime.

Let's be honest -- scientists, on the whole, suck at IT
infrastructure, and small individual packages are not going to be very
expertly put together. IMHO any real solution should take this into
account, keep them sandboxed from the rest of the system, and focus on
providing the most friendly and seamless sandbox possible.

On another note, I hope toydist will provide a "source prepare" step,
that allows arbitrary code to be run on the source tree. (For, e.g.,
cython->C conversion, ad-hoc template languages, etc.) IME this is a
very common pain point with distutils; there is just no good way to do
it, and it has to be supported in the distribution utility in order to
get everything right. In particular:
-- Generated files should never be written to the source tree
itself, but only the build directory
-- Building from a source checkout should run the "source prepare"
step automatically
-- Building a source distribution should also run the "source
prepare" step, and stash the results in such a way that when later
building the source distribution, this step can be skipped. This is a
common requirement for user convenience, and necessary if you want to
avoid arbitrary code execution during builds.

Build directories are hard to implement right. I don't think toydist
will support this directly. IMO, those advanced builds warrant a real
build tool - one main goal of toydist is to make integration with waf
or scons much easier. Both waf and scons have the concept of a build
directory, which should do everything you described.

Maybe I was unclear -- proper build directory handling is nice,
Cython/Pyrex's distutils integration gets it wrong (not their fault,
distutils is just impossible to do anything sensible with, as you've
said), and I've never found build directories hard to implement
(perhaps I'm missing something). But what I'm really talking about is
having a "pre-build" step that integrates properly with the source and
binary packaging stages, and that's not something waf or scons have
any particular support for, AFAIK.

-- Nathaniel


David_Cournapeau · January 4, 2010, 7:25am

What I do -- and documented for people in my lab to do -- is set up
one virtualenv in my user account, and use it as my default python. (I
'activate' it from my login scripts.) The advantage of this is that
easy_install (or pip) just works, without any hassle about permissions
etc.

It just works if you happen to be able to build everything from
sources. That alone means you ignore the majority of users I intend to
target.

No other community (except maybe Ruby) pushes those isolated install
solutions as general deployment solutions. If it were such a great
idea, other people would have picked up those solutions.

AFAICT, R works more-or-less identically (once I convinced it to use a
per-user library directory); install.packages() builds from source,
and doesn't automatically pull in and build random C library
dependencies.

As mentioned by Robert, this is different from the usual virtualenv
approach. Per-user app installation is certainly a useful (and
uncontroversial) feature.

And R does support automatically-built binary installers.

Sure, I'm aware of the opensuse build service, have built third-party
packages for my projects, etc. It's a good attempt, but also has a lot
of problems, and when talking about scientific software it's totally
useless to me :-). First, I don't have root on our compute cluster.

True, non-root install is a problem. Nothing *prevents* dpkg from running
in a non-root environment in principle if the package itself does not
require it, but it is not really supported by the tools ATM.

Second, even if I did I'd be very leery about installing third-party
packages because there is no guarantee that the version numbering will
be consistent between the third-party repo and the real distro repo --
suppose that the distro packages 0.1, then the third party packages
0.2, then the distro packages 0.3, will upgrades be seamless? What if
the third party screws up the version numbering at some point? Debian
has "epochs" to deal with this, but third-parties can't use them and
maintain compatibility.

Actually, at least with .deb-based distributions, this issue has a
solution. As packages have their own version in addition to the
upstream version, PPA-built packages have their own versions.

https://help.launchpad.net/Packaging/PPA/BuildingASourcePackage
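
To show how the upstream version and the packaging version live side by side, here is a small sketch that splits a Debian-style version string of the form `[epoch:]upstream[-revision]`. The example strings are made up for illustration, and the sketch deliberately does not implement dpkg's full comparison rules.

```python
from collections import namedtuple

DebVersion = namedtuple("DebVersion", "epoch upstream revision")


def parse_deb_version(text):
    """Split '[epoch:]upstream[-revision]' into its three parts."""
    epoch = 0
    if ":" in text:
        # The epoch is the integer before the first colon.
        head, text = text.split(":", 1)
        epoch = int(head)
    # The revision is whatever follows the last hyphen, if there is one.
    if "-" in text:
        upstream, revision = text.rsplit("-", 1)
    else:
        upstream, revision = text, ""
    return DebVersion(epoch, upstream, revision)


# Illustrative strings only: a distro package and a third-party rebuild of
# the same upstream release differ in the revision part alone.
distro = parse_deb_version("0.2-1")
third_party = parse_deb_version("0.2-1~ppa1")
with_epoch = parse_deb_version("1:0.1-2")
```

Here `distro.upstream` and `third_party.upstream` are both `"0.2"`, so the packaging revision can vary without touching the upstream number; a non-zero epoch such as the one in `with_epoch` takes precedence over the rest of the string when versions are compared.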

Of course, this assumes a simple versioning scheme in the first place,
instead of the cluster-fck that versioning has become within python
packages (again, the scheme used in python is much more complicated
than everyone else's, and it seems that nobody has ever stopped and
thought 5 minutes about the consequences, and whether this complexity
was a good idea in the first place).

What if the person making the third party
packages is not an expert on these random distros that they don't even
use?

I think simple rules/conventions + build farms would solve most
issues. The problem is that if you allow total flexibility as input, then
automatic and simple solutions become impossible. Certainly, PPA and
the build service provide a much better experience than anything
pypi has ever given me.

Third, while we shouldn't advocate that people screw up backwards
compatibility, version skew is a real issue. If I need one version of
a package and my lab-mate needs another and we have submissions due
tomorrow, then filing bugs is a great idea but not a solution.

Nothing prevents you from using virtualenv in that case (I may sound
dismissive of those tools, but I am really not. I use them myself.
What I strongly react to is when those are pushed as the de facto
standard method).

Fourth,
even if we had expert maintainers taking care of all these third-party
packages and all my concerns were answered, I couldn't convince our
sysadmin of that; he's the one who'd have to clean up if something
went wrong, and we don't have a big budget for overtime.

I am not advocating using only packaged, binary installers. I am
advocating using them as much as possible where it makes sense - on
windows and mac os x in particular.

Toydist also aims at making it easier to build and customize installs.
Although not yet implemented, a --user-like scheme would be quite simple
to implement, because the toydist installer internally uses autoconf-like
directory descriptions (of which --user is a special case).
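
A minimal sketch of what an autoconf-like directory description could look like. The variable names follow autoconf conventions, but the scheme contents, `resolve()` and `user_scheme()` are assumptions for illustration, not toydist's actual code.

```python
import os
from string import Template

# Hypothetical defaults, modelled on autoconf's directory variables.
DEFAULT_SCHEME = {
    "prefix": "/usr/local",
    "eprefix": "$prefix",
    "bindir": "$eprefix/bin",
    "libdir": "$eprefix/lib",
    "sitedir": "$libdir/python$py_version/site-packages",
}


def resolve(scheme, **variables):
    """Expand $var references repeatedly until every path is resolved."""
    values = dict(scheme, **variables)
    for _ in range(len(values)):
        expanded = {k: Template(v).substitute(values) for k, v in values.items()}
        if expanded == values:
            break
        values = expanded
    return values


def user_scheme(scheme):
    """A --user-like install only overrides the prefix."""
    return dict(scheme, prefix=os.path.join(os.path.expanduser("~"), ".local"))


system_paths = resolve(DEFAULT_SCHEME, py_version="2.6")
user_paths = resolve(user_scheme(DEFAULT_SCHEME), py_version="2.6")
```

Because every directory is derived from `prefix`, a per-user install needs no special handling beyond a different starting value.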

If you need sandboxed installs, customized installs, toydist will not
prevent it. It is certainly my intention to make it possible to use
virtualenv and co (you already can by building eggs, actually). I hope
that by having our own "SciPi", we can actually have a more reliable
approach. For example, the static dependency description + mandated
metadata would make this much easier and more robust, as there would
not be a need to run a setup.py to get the dependencies.
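
As an illustration of why a static description helps, the sketch below reads dependencies from a declarative file without executing any package code. The INI-style layout and the field names are assumptions for the example; they are not toydist's actual file format.

```python
import configparser

# Hypothetical static description; the section and field names are
# assumptions for this sketch, not toydist's actual file format.
STATIC_DESCRIPTION = """
[package]
name = example
version = 0.1
depends = numpy, scipy
"""


def read_dependencies(text):
    """Return (name, version, dependencies) without executing any code."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    pkg = parser["package"]
    depends = [d.strip() for d in pkg.get("depends", "").split(",") if d.strip()]
    return pkg["name"], pkg["version"], depends


name, version, depends = read_dependencies(STATIC_DESCRIPTION)
```

An index or an installer can run this kind of parsing on untrusted packages safely, which is not true of importing and running a setup.py.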

If you look at hackageDB (Introduction | Hackage), they have a very
simple index structure, which makes it easy to download it entirely
and reuse it locally to avoid any internet access.
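
To sketch the idea of working from a locally stored index: the flat JSON mapping below is a made-up format (not Hackage's), the package names and versions are illustrative, and `install_order()` assumes there are no dependency cycles.

```python
import json
import os
import tempfile

# Hypothetical index: one flat mapping from package name to metadata.
INDEX = {
    "example": {"version": "0.1", "depends": ["libfoo"]},
    "libfoo": {"version": "1.0", "depends": ["libbar"]},
    "libbar": {"version": "2.0", "depends": []},
}


def save_index(index, path):
    """Stand-in for downloading the whole index once."""
    with open(path, "w") as f:
        json.dump(index, f)


def load_index(path):
    """Read the local copy; no network access is needed from here on."""
    with open(path) as f:
        return json.load(f)


def install_order(index, name, seen=None):
    """List dependencies first, each package once (assumes no cycles)."""
    seen = [] if seen is None else seen
    if name in seen:
        return seen
    for dep in index[name]["depends"]:
        install_order(index, dep, seen)
    seen.append(name)
    return seen


index_path = os.path.join(tempfile.mkdtemp(), "index.json")
save_index(INDEX, index_path)
local_index = load_index(index_path)
order = install_order(local_index, "example")
```

With the whole index available as a single small file, dependency resolution becomes a local computation.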

Let's be honest -- scientists, on the whole, suck at IT
infrastructure, and small individual packages are not going to be very
expertly put together. IMHO any real solution should take this into
account, keep them sandboxed from the rest of the system, and focus on
providing the most friendly and seamless sandbox possible.

I agree packages will not always be well put together - but I don't
see why this would be worse than the current situation. I also
strongly disagree about the sandboxing as the solution of choice. For
most users, having only one install of most packages is the typical
use-case. Once you start sandboxing, you create artificial barriers
between the sandboxes, and this becomes too complicated for most users
IMHO.

Maybe I was unclear -- proper build directory handling is nice,
Cython/Pyrex's distutils integration gets it wrong (not their fault,
distutils is just impossible to do anything sensible with, as you've
said), and I've never found build directories hard to implement

It is simple if you have a good infrastructure in place (node
abstraction, etc...), but that infrastructure is hard to get right.

But what I'm really talking about is
having a "pre-build" step that integrates properly with the source and
binary packaging stages, and that's not something waf or scons have
any particular support for, AFAIK.

Could you explain with a concrete example what a pre-build stage would
look like? I don't think I understand what you want.

cheers,

David


_Dag_Sverre_Seljebot · January 4, 2010, 8:48am

I use Sage for this very reason, and others use EPD or FEMHub or
Python(x,y) for the same reasons.

Rolling this into the Python package distribution scheme seems backwards
though, since a lot of binary packages that have nothing to do with Python
are used as well -- the Python stuff is simply thin wrappers around what
should ideally be located in /usr/lib or similar (but are nowadays
compiled into the Python extension .so because of distribution problems).

To solve the exact problem you (and I) have I think the best solution is
to integrate the tools mentioned above with what David is planning (SciPI
etc.). Or if that isn't good enough, find generic "userland package
manager" that has nothing to do with Python (I'm sure a dozen
half-finished ones must have been written but didn't look), finish it, and
connect it to SciPI.

What David does (I think) is separate the concerns. This makes the task
feasible, and also has the advantage of convenience for the people that
*do* want to use Ubuntu, Red Hat or whatever to roll out scientific
software on hundreds of clients.

Dag Sverre

···

Nathaniel Smith <njs@...503...> wrote:

On Sun, Jan 3, 2010 at 4:23 AM, David Cournapeau <cournape@...149...> wrote:

Another way is to provide our own repository for a few major
distributions, with automatically built packages. This is how most
open source providers work. Miguel de Icaza explains this well:

OpenSUSE Build System - Miguel de Icaza

I hope we will be able to reuse much of the opensuse build service
infrastructure.

Sure, I'm aware of the opensuse build service, have built third-party
packages for my projects, etc. It's a good attempt, but also has a lot
of problems, and when talking about scientific software it's totally
useless to me :-). First, I don't have root on our compute cluster.

David_Cournapeau · January 4, 2010, 9:11am

Rolling this into the Python package distribution scheme seems backwards
though, since a lot of binary packages that have nothing to do with Python
are used as well

Yep, exactly.

To solve the exact problem you (and I) have I think the best solution is
to integrate the tools mentioned above with what David is planning (SciPI
etc.). Or if that isn't good enough, find generic "userland package
manager" that has nothing to do with Python (I'm sure a dozen
half-finished ones must have been written but didn't look), finish it, and
connect it to SciPI.

You have 0install, autopackage and klik, to cite the ones I know
about. I wish people had looked at those before rolling toy solutions
to complex problems.

What David does (I think) is separate the concerns.

Exactly - you've described this better than I did.

David
