Merge pull request #40 from flennerhag/dev
0.1.5
flennerhag authored Jul 18, 2017
2 parents 0778271 + d3679f5 commit c717a67
Showing 26 changed files with 1,099 additions and 761 deletions.
12 changes: 0 additions & 12 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,18 +11,6 @@ matrix:
- os: linux
python: 2.7
sudo: required
- os: osx
language: generic
sudo: required
python: 3.6
- os: osx
language: generic
sudo: required
python: 3.5
- os: osx
language: generic
sudo: required
python: 2.7

install:
- set -e
5 changes: 2 additions & 3 deletions .travis/install.sh
Expand Up @@ -8,8 +8,7 @@ if [[ $TRAVIS_OS_NAME == 'linux' ]]; then
sudo apt-get update;
fi

pip install coverage coveralls nose-exclude flake8 psutil scikit-learn;

pip install -U coverage coveralls nose-exclude flake8 psutil scipy numpy scikit-learn;
pip install -r requirements.txt;
python setup.py install;

echo "Installation complete"
41 changes: 41 additions & 0 deletions docs/API.rst
Expand Up @@ -66,4 +66,45 @@ Visualization
exp_var_plot


For developers
==============

The following base classes are good starting points for building new ensembles.
You may want to study the source code directly.

.. _indexer-api:

Indexers
^^^^^^^^

.. currentmodule:: mlens.base

.. autosummary::

IdTrain
BlendIndex
FoldIndex
SubsetIndex
FullIndex
ClusteredSubsetIndex

.. _estimation-api:

Estimation routines
^^^^^^^^^^^^^^^^^^^

.. currentmodule:: mlens.parallel

.. autosummary::

ParallelProcessing
ParallelEvaluation
Stacker
Blender
SubStacker
SingleRun
Evaluation
BaseEstimator


.. _Scikit-learn: http://scikit-learn.org/stable/
31 changes: 31 additions & 0 deletions docs/dev.rst
@@ -0,0 +1,31 @@
.. Development
.. _dev:

Hacking ML-Ensemble
===================


.. py:currentmodule:: mlens.parallel.estimation
ML-Ensemble implements a modular design that allows straightforward
development of new ensemble classes. The backend is agnostic to the type of
ensemble it is being asked to perform computation on, and only at the moment
of computation will ensemble-specific code be needed. To implement a new
ensemble type, three objects are needed:

1. A cross-validation strategy. This amounts to implementing an
indexer class. See :ref:`current indexers <indexer-api>` for examples.

2. An estimation engine. This is the actual class that will run the
estimation. The :class:`BaseEstimator` class implements most of the
heavy lifting, and unless special-purpose fit and/or predict procedures
are required, the only thing needed is a method for indexing the
base learners to each new feature generated by the cross-validation
strategy. See :ref:`current estimation engines <estimation-api>` for examples.

3. A front-end API. This typically only implements a constructor and an
``add`` method. The ``add`` method specifies the indexer to use and
parses keyword arguments. It is also advised to differentiate between
hidden layers and the meta layer, where cross-validation is not desired.
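
As a rough illustration of the first item, a cross-validation strategy boils
down to an object that hands out train and test indices for each fold. The
sketch below is standalone Python and does not use the ML-Ensemble base
classes; the class name ``SimpleFoldIndex`` and its ``partition`` method are
hypothetical and only meant to show the idea. See the existing indexers for
the interface the backend actually expects.

```python
class SimpleFoldIndex(object):
    """Hypothetical K-fold indexer; illustration only, not part of mlens."""

    def __init__(self, n_splits=2):
        if n_splits < 2:
            raise ValueError("n_splits must be at least 2")
        self.n_splits = n_splits
        self.n_samples = None

    def fit(self, X):
        """Store the number of observations to index."""
        self.n_samples = len(X)
        if self.n_samples < self.n_splits:
            raise ValueError("more folds than samples")
        return self

    def partition(self):
        """Yield (train_indices, test_indices) for each fold."""
        base, extra = divmod(self.n_samples, self.n_splits)
        start = 0
        for fold in range(self.n_splits):
            # The first `extra` folds get one additional observation.
            stop = start + base + (1 if fold < extra else 0)
            test = list(range(start, stop))
            train = list(range(0, start)) + list(range(stop, self.n_samples))
            yield train, test
            start = stop
```

Each base learner would be fitted on the ``train`` indices and asked to
predict the ``test`` indices, so that every observation receives exactly one
out-of-fold prediction.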
20 changes: 10 additions & 10 deletions docs/ensemble_tutorial.rst
Expand Up @@ -282,16 +282,16 @@ can differ between layers.
Very few limitations are imposed on the estimator: it must have a ``fit``
method that takes ``X`` (and possibly ``y``) as inputs, and there must be
a method that generates class labels (i.e. partition ids) for a passed dataset.
The default method is ``predict``, but
you can specify another method with the ``attr`` option when adding a layer.
This level of generality does impose some responsibility on the user. In
particular, it is up to the user to ensure that sensible partitions are created.
Problems to watch out for are too small partitions (too many clusters, too uneven
cluster sizes) and clusters with too little variation: for instance with only
a single class label in the entire partition, base learners have nothing to
learn.

So let's see how to do this in practice. For instance, we can use an unsupervised K-Means
The default method is ``predict``, but you can specify another method with the
``attr`` option when adding a layer, and which data to pass to this method with
the ``partition_on`` option (``'X'``, ``'y'`` or ``'both'``). This level of
generality does impose some responsibility on the user. In particular, it is up
to the user to ensure that sensible partitions are created. Problems to watch
out for are too small partitions (too many clusters, too uneven cluster sizes)
and clusters with too little variation: for instance, with only a single class
label in the entire partition, base learners have nothing to learn.
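
One way to guard against these problems is to inspect the partition ids before
fitting the ensemble. The helper below is a minimal sketch in plain Python and
is not part of ML-Ensemble; the function name ``check_partitions`` and the
``min_size`` threshold are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def check_partitions(partition_ids, y, min_size=2):
    """Return a dict mapping partition id -> list of problems found."""
    sizes = Counter(partition_ids)
    labels = defaultdict(set)
    for pid, label in zip(partition_ids, y):
        labels[pid].add(label)

    problems = {}
    for pid in sorted(sizes):
        found = []
        if sizes[pid] < min_size:
            found.append("too small")
        if len(labels[pid]) < 2:
            # Only one class label: base learners have nothing to learn.
            found.append("single class label")
        if found:
            problems[pid] = found
    return problems
```

Here ``partition_ids`` would be the output of the partition estimator's
labelling method (``predict`` by default) and ``y`` the class labels of the
same observations.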

Let's see how to do this in practice. For instance, we can use an unsupervised K-Means
clustering estimator to partition the data, like so::

from sklearn.cluster import KMeans
83 changes: 39 additions & 44 deletions docs/gotchas.rst
@@ -1,5 +1,41 @@
Known limitations
=================
.. Known issues
Troubleshooting
===============

Here we collect a set of subtle potential issues and limitations that may
explain odd behavior that you have encountered. Feel free to reach out if your
problem is not addressed here.

.. _third-party-issues:

Bad interaction with third-party packages
-----------------------------------------

Parallel processing with generic Python objects is a difficult task, and while
ML-Ensemble is routinely tested to function seamlessly with Scikit-learn, other machine
learning libraries can cause bad behaviour during parallel estimations. This
is unfortunately a fundamental problem rooted in how `Python runs processes in parallel`_,
and in particular that Python is not thread-safe. ML-Ensemble is by default configured
to avoid such issues to the greatest extent possible, but issues can occur.

In particular, an ensemble can run on either multiprocessing or multithreading.
For standard Scikit-learn use cases, the GIL_ can be released and
multithreading used. This will speed up estimation and consume less memory.
However, Python is not inherently thread-safe, so this strategy is not stable.
For this reason, the safest choice to avoid corrupting the estimation process
is to use multiprocessing instead. This requires creating a sub-process to run
each job, and so adds overhead both in terms of job management
and sharing memory. As of this writing, the default setting in ML-Ensemble is
``'multiprocessing'``, but you can change this variable globally: see :ref:`configs`.

In Python 3.4+, ML-Ensemble defaults to ``'forkserver'`` on Unix systems
and ``'spawn'`` on Windows for generating sub-processes. These require more
overhead than the default ``'fork'`` method, but avoid corrupting the thread
state and as such are much more stable against third-party conflicts. These
conflicts are caused by each worker thinking it has more threads available
than it actually does, leading to deadlocks and race conditions. For more
information on this issue see the `Scikit-learn FAQ`_.
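
The selection rule described above can be sketched with the standard library
alone. This is an illustration of the logic, not ML-Ensemble's own
implementation; the function name ``default_start_method`` is hypothetical,
while ``multiprocessing.get_all_start_methods`` and
``multiprocessing.get_context`` are standard library functions available from
Python 3.4.

```python
import multiprocessing
import sys


def default_start_method():
    """Pick a start method following the rule described above (a sketch)."""
    available = multiprocessing.get_all_start_methods()
    if sys.platform == "win32":
        return "spawn"
    if "forkserver" in available:
        return "forkserver"
    # Fall back to the platform default, which is listed first.
    return available[0]


if __name__ == "__main__":
    # A context keeps the choice local instead of changing it globally.
    context = multiprocessing.get_context(default_start_method())
    print(context.get_start_method())
```

Using a context object rather than ``multiprocessing.set_start_method`` avoids
interfering with other libraries that configure the start method themselves.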

Array copying during fitting
----------------------------
Expand All @@ -18,47 +54,6 @@ of folds beyond 2 does not significantly impact performance and at this time
of writing this is the suggested approach. For further information on
avoiding copying data during estimation, see :ref:`memory`.


Third-party multiprocessed objects
----------------------------------

ML-Ensemble runs by default on multi-threading. This requires releasing the
GIL_, which can cause race conditions. In standard use cases, releasing the
GIL is harmless since input data is shared in read-only mode and output arrays
are partitioned. If you experience issues with multithreading, you can try
switching to multiprocessing either by the ``backend`` argument or by changing
the global default (``mlens.config.BACKEND``). Estimation is then parallelized
on processes instead of threads, and thus keeps the GIL in place. Multiprocessing however
is not without its issues and can interact badly with third-party classes that
are also multiprocessed, which can lead to deadlocks. This issue is due to a
limitation of how `Python runs processes in parallel`_ and is an issue beyond
the scope of ML-Ensemble.

If you experience issues on both multithreading and multiprocessing, the simplest
solution is to turn off parallelism by setting ``n_jobs`` to ``1``. Start by
switching off parallelism in the learners of the ensemble as this will not
impact the training speed of the ensemble, and only switch off parallelism in the
ensemble as a last resort.

In Python 3.4+, it is possible to use a ``forkserver`` backend within the
native Python ``multiprocessing`` library. To do this, set the multiprocessing
start method to ``forkserver`` as below. ::

    import multiprocessing

    # You can put imports and functions/class definitions other than mlens here

    if __name__ == '__main__':

        multiprocessing.set_start_method('forkserver')

        # Import mlens here

        # Your execution here

Note that this solution is currently experimental.
Further information can be found here_.

File permissions on Windows
---------------------------

Expand All @@ -77,5 +72,5 @@ memory performance issues, create an issue at the `issue tracker`_.
.. _GIL: https://wiki.python.org/moin/GlobalInterpreterLock
.. _view: http://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html
.. _Python runs processes in parallel: https://wiki.python.org/moin/ParallelProcessing
.. _here: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
.. _Scikit-learn FAQ: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
.. _issue tracker: https://github.com/flennerhag/mlens/issues
5 changes: 4 additions & 1 deletion docs/index.rst
Expand Up @@ -14,7 +14,9 @@ ML-Ensemble is open for contributions at all levels. There are
some low-hanging fruit, such as building introductory examples, use cases and
general benchmarks. If you would like to get involved, reach out to the
project's Github_ repository. We are currently in beta testing, so please do
report any bugs or issues by creating an issue_.
report any bugs or issues by creating an issue_. If you are interested in
contributing to development, see :ref:`dev` for a quick introduction to
ensemble implementation, or check out the issue tracker.

Core Features
-------------
Expand Down Expand Up @@ -142,6 +144,7 @@ ensemble output. Output is summarized for easy comparison of performance. ::
memory
benchmarks
scaling
dev
gotchas

.. toctree::
20 changes: 16 additions & 4 deletions docs/mlens_configs.rst
Expand Up @@ -9,21 +9,33 @@ Global configurations

ML-Ensemble allows a set of low-level global configurations to tailor the
behavior of classes during estimation. Every variable is accessible through
``mlens.config``.
``mlens.config``. Alternatively, all variables can be set as global
environmental variables, where the exported variable name is
``MLENS_[VARNAME]``.

* ``mlens.config.BACKEND``
configures the global default backend during parallelized estimation.
Default is ``'threading'``. Options are ``'multiprocessing'`` and
``'forkserver'``. See joblib_ for further information.
``'forkserver'``. See joblib_ for further information. Alter with the
``set_backend`` function.

* ``mlens.config.DTYPE``
determines the default dtype of numpy arrays created during estimation; in
particular, the prediction matrices of each intermediate layer. Default is
:obj:`numpy.float32`.
:obj:`numpy.float32`. Alter with the ``set_backend`` function.

* ``mlens.config.TMPDIR``
The directory where temporary folders are created during estimation.
Default uses the tempfile_ function ``gettempdir()``.
Default uses the tempfile_ function ``gettempdir()``. Alter with the
``set_backend`` function.

* ``mlens.config.START_METHOD``
The method used by the job manager to generate a new job. ML-Ensemble
defaults to ``forkserver`` on Unix with Python 3.4+, and ``spawn`` on
Windows. For older Python versions, the default is ``fork``. This method
has the least overhead, but it can cause issues with third-party software.
See :ref:`third-party-issues` for details. Set this variable with the
``set_start_method`` function.
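
The ``MLENS_[VARNAME]`` naming convention for environment variables can be
illustrated with a small lookup helper. This is only a sketch of the pattern:
the function ``read_setting`` is hypothetical, and how ML-Ensemble itself
parses these variables is not shown here.

```python
import os


def read_setting(varname, default, environ=None):
    """Return the value of MLENS_<VARNAME> if exported, else the default."""
    environ = os.environ if environ is None else environ
    return environ.get("MLENS_" + varname.upper(), default)


# Example with an explicit mapping standing in for the real environment.
fake_env = {"MLENS_BACKEND": "multiprocessing"}
backend = read_setting("backend", "threading", fake_env)
start_method = read_setting("start_method", "fork", fake_env)
```

In practice the variable would be exported in the shell before Python starts,
so that it is visible when ``mlens.config`` is first imported.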

.. _joblib: https://pythonhosted.org/joblib/parallel.html
.. _tempfile: https://docs.python.org/3/library/tempfile.html
29 changes: 17 additions & 12 deletions docs/updates.rst
Expand Up @@ -7,18 +7,23 @@
Change log
==========

* 10/04/2017: Release_ of version 0.1.3
- Initial stable version released.

* 11/07/2017: Release_ of version 0.1.4
- Prediction array dtype option (default=float32)
- :ref:`Feature propagation <propa-tutorial>`
- :ref:`Clustered subsemble partitioning <subsemble-tutorial>`
- No memmaps passed to estimators (only ndarray views)
- Threading as default global backend (changeable through mlens.config.BACKEND)
- Global configuration (mlens.config)
- Scoring exception handling

* 04/2017: Release_ of version 0.1.3
- Initial stable version released.

* 07/2017: Release_ of version 0.1.4
- Prediction array dtype option (default=float32)
- :ref:`Feature propagation <propa-tutorial>`
- :ref:`Clustered subsemble partitioning <subsemble-tutorial>`
- No memmaps passed to estimators (only ndarray views)
- Global configuration (mlens.config)
- Scoring exception handling

* 07/2017: Release_ of version 0.1.5
- Possible to set environmental variables
- ``spawn`` as default start method for parallel jobs (w. multiprocessing)
- Possible to specify ``y`` as partition input in :ref:`Clustered subsemble partitioning <subsemble-tutorial>`
- Minor bug fixes
- Refactored backend for streamlined front-end feature development

.. _Release: https://github.com/flennerhag/mlens/releases
.. _Feature propagation:
4 changes: 3 additions & 1 deletion mlens/__init__.py
Expand Up @@ -7,8 +7,10 @@
ML-Ensemble, a Python library for memory efficient parallelized ensemble
learning.
"""
# Initialize configurations
import mlens.config

__version__ = "0.1.4.dev0"
__version__ = "0.1.5"


__all__ = ['base',