Serialization

This module provides load() and dump() functions that can serve as drop-in replacements for the respective functions from the standard pickle module. The main differences between them and the standard ones are:

  • The dump is physically a tarball, in which the pickle is stored as the ‘_pkl’ file.
  • A special file ‘_parameters’ in the tarball can contain the data of a selected set of Theano shared variables. This data is referenced from ‘_pkl’ using the persistent id mechanism, which means that no duplication takes place. The goal here is to save the values of the parameters (which is what these shared variables are in most cases) in the most robust way possible. The format of the ‘_parameters’ file is the one used by numpy.savez(), i.e. a zip file of numpy arrays.
  • More objects can be dumped into the archive using the add_to_dump() function. If an object shares parameters with one already dumped, you can avoid dumping those parameters a second time thanks to the persistent id mechanism.
  • dump() strives to catch situations in which the user tries to pickle a function or a class not defined in the global namespace, and to give a meaningful warning.

In short, this module provides a dumping mechanism that allows for greater robustness and persistence than standard pickling.
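The persistent id mechanism behind ‘_parameters’ can be sketched with the standard pickle module alone. In the sketch below, SharedVariable is a hypothetical stand-in for a Theano shared variable and an in-memory dict stands in for the ‘_parameters’ file; this illustrates the mechanism, not the actual Blocks implementation:

```python
import io
import pickle

# Hypothetical stand-in for a Theano shared variable (for illustration only).
class SharedVariable(object):
    def __init__(self, name, value):
        self.name, self.value = name, value

store = {}  # stands in for the '_parameters' member of the tarball

class ParameterPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Shared variables are replaced by a named reference; their data
        # goes to the external store instead of into the pickle stream.
        if isinstance(obj, SharedVariable):
            store[obj.name] = obj
            return obj.name
        return None  # everything else is pickled inline as usual

class ParameterUnpickler(pickle.Unpickler):
    def persistent_load(self, name):
        # References are resolved back against the external store.
        return store[name]

W = SharedVariable('mlp/linear_0.W', [[0.0, 0.1], [0.2, 0.3]])
model = {'W': W, 'cost': 'cross_entropy'}

buf = io.BytesIO()
ParameterPickler(buf).dump(model)

buf.seek(0)
restored = ParameterUnpickler(buf).load()
assert restored['W'] is W  # the reference resolves to the same object
```

Because the pickler emits only a reference for each parameter, the same parameter data is never written twice, which is what makes dumping several objects that share parameters cheap.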

Examples

Consider a standard main loop (without an algorithm and a data stream for brevity)

>>> from theano import tensor
>>> from blocks.main_loop import MainLoop
>>> from blocks.bricks import MLP, Tanh, Softmax
>>> from blocks.model import Model
>>> mlp = MLP([Tanh(), None], [784, 10, 10])
>>> x = tensor.matrix('features')
>>> y = tensor.lmatrix('targets')
>>> cost = Softmax().categorical_cross_entropy(
...            y.flatten(), mlp.apply(tensor.flatten(x, outdim=2)))
>>> main_loop = MainLoop(None, None, model=Model(cost))

Let’s see how the main loop is dumped by dump()

>>> from blocks.serialization import dump, load
>>> import tarfile
>>> with open('main_loop.tar', 'wb') as dst:
...     dump(main_loop, dst)
>>> tarball = tarfile.open('main_loop.tar', 'r')
>>> tarball 
<tarfile.TarFile object at ...>
>>> tarball.getnames()
['_pkl']
>>> tarball.close()

As promised, the dump is a tarball. Since we did not ask for any additional magic, it just contains the pickled main loop in the ‘_pkl’ file.

Let’s do something more interesting:

>>> with open('main_loop.tar', 'wb') as dst:
...     dump(main_loop, dst,
...          parameters=main_loop.model.parameters)
>>> tarball = tarfile.open('main_loop.tar', 'r')
>>> tarball.getnames()
['_parameters', '_pkl']

As requested by specifying the parameters argument, the parameters were saved in a zip file.

>>> import numpy
>>> ps = numpy.load(tarball.extractfile(tarball.getmember('_parameters')))
>>> sorted(ps.keys()) 
['|mlp|linear_0.W', '|mlp|linear_0.b', '|mlp|linear_1.W', '|mlp|lin...]
>>> ps.close()

The names for parameters are chosen intelligently to reflect their position in the brick hierarchy, if they belong to bricks, and by simply using the .name attribute, if they do not.

The loading of the main loop as a whole still works:

>>> with open('main_loop.tar', 'rb') as src:
...     main_loop_loaded = load(src)
>>> main_loop_loaded 
<blocks.main_loop.MainLoop object at ...>

Additionally, this module provides the convenience routine load_parameters():

>>> from blocks.serialization import load_parameters
>>> with open('main_loop.tar', 'rb') as src:
...     parameters = load_parameters(src)
>>> sorted(parameters.keys()) 
['/mlp/linear_0.W', '/mlp/linear_0.b', '/mlp/linear_1.W', '/mlp/line...]

Loading parameters saved by dump() with load_parameters() ensures that their hierarchical names are compatible with Model and Selector classes.

TODO: Add information about add_to_dump().
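The append step performed by add_to_dump() can be illustrated with tarfile’s append mode. The member name ‘extra’ and the payloads below are invented for the example, and the real function additionally handles the persistent id bookkeeping described above:

```python
import io
import pickle
import tarfile

# Start from an archive holding only a pickled object, as dump() would.
with tarfile.open('main_loop.tar', 'w') as tar:
    blob = pickle.dumps({'iteration': 0})
    info = tarfile.TarInfo('_pkl')
    info.size = len(blob)
    tar.addfile(info, io.BytesIO(blob))

# Append a second pickle under its own name, as add_to_dump() does.
with tarfile.open('main_loop.tar', 'a') as tar:
    blob = pickle.dumps({'best_cost': 0.5})
    info = tarfile.TarInfo('extra')
    info.size = len(blob)
    tar.addfile(info, io.BytesIO(blob))

with tarfile.open('main_loop.tar', 'r') as tar:
    print(tar.getnames())  # ['_pkl', 'extra']
```

The appended object can then be retrieved with load() by passing 'extra' as the name argument.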

blocks.serialization.add_to_dump(object_, file_, name, parameters=None, use_cpickle=False, protocol=2, **kwargs)[source]

Pickles an object to an existing tar archive.

This function allows you to dump more objects into an existing archive. If the object you want to dump possesses the same set of shared variables as the object already dumped, you can pass them to the parameters argument, which avoids serializing them a second time. However, this won’t work if the shared variables you pass to the parameters argument are not already in the archive.

Parameters:
  • object (object) – The object to pickle.
  • file (file) – The destination for saving, opened in read-write mode (r+).
  • name (str) – The name of the object you are dumping. It will be used as a file name in the archive. ‘_pkl’ and ‘_parameters’ are reserved names and can’t be used.
  • parameters (list, optional) – Shared variables whose internal numpy arrays should be saved separately in the _parameters field of the tar file. Must be a subset of the parameters already in the archive.
  • use_cpickle (bool) – Use cPickle instead of pickle. Setting it to True will disable the warning message if you try to pickle objects from the main module, so be sure that there is no warning before turning this flag on. Default: False.
  • protocol (int, optional) – The pickling protocol to use. Unlike Python’s built-in pickle, the default is set to 2 instead of 0 for Python 2. The Python 3 default (level 3) is maintained.
  • **kwargs – Keyword arguments to be passed to pickle.Pickler.
blocks.serialization.continue_training(path)[source]

Continues training using checkpoint.

Parameters:path (str) – Path to checkpoint.

Notes

Python picklers can unpickle objects from the global namespace only if they are present in the namespace where unpickling happens. Global functions are often needed for mapping, filtering and other data stream operations. If the main loop uses global objects and this function fails with a message like ` AttributeError: 'module' object has no attribute '...' `, it means that you need to import these objects.
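The failure mode above is easy to reproduce with plain pickle; the module and function names below are invented for the demonstration:

```python
import pickle
import sys
import types

# Build a throwaway module with a global function (hypothetical names).
module = types.ModuleType('my_stream_ops')
exec("def scale(x):\n    return 2 * x\n", module.__dict__)
sys.modules['my_stream_ops'] = module

# Global functions pickle by reference: module name plus attribute name,
# no code is stored in the pickle itself.
blob = pickle.dumps(module.scale)

# If the attribute cannot be found at unpickling time, loading fails.
del module.scale
try:
    pickle.loads(blob)
except AttributeError:
    pass  # unpickling fails because 'scale' is no longer resolvable
```

The fix is simply to make sure the referenced names are importable (and bound) in the process that does the unpickling.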

Examples

This function can be used in two ways: in the script where the main loop is defined, or in a different script. For the latter option, see the Notes section.

blocks.serialization.dump(object_, file_, parameters=None, use_cpickle=False, protocol=2, **kwargs)[source]

Pickles an object, optionally saving its parameters separately.

Parameters:
  • object (object) – The object to pickle. If None, only the parameters passed to the parameters argument will be saved.
  • file (file) – The destination for saving.
  • parameters (list, optional) – Shared variables whose internal numpy arrays should be saved separately in the _parameters field of the tar file.
  • pickle_object (bool) – If False, object_ will not be serialized, only its parameters. This flag can be used when object_ is not serializable, but one still want to save its parameters. Default: True
  • use_cpickle (bool) – Use cPickle instead of pickle. Setting it to true will disable the warning message if you try to pickle objects from the main module, so be sure that there is no warning before turning this flag on. Default: False.
  • protocol (int, optional) – The pickling protocol to use. Unlike Python’s built-in pickle, the default is set to 2 instead of 0 for Python 2. The Python 3 default (level 3) is maintained.
  • **kwargs – Keyword arguments to be passed to pickle.Pickler.
blocks.serialization.dump_and_add_to_dump(object_, file_, parameters=None, to_add=None, use_cpickle=False, protocol=2, **kwargs)[source]

Calls both dump() and add_to_dump() to serialize several objects.

This function is used to serialize several objects at the same time, using the persistent id mechanism. Its main advantage is that it can be used with secure_dump().

Parameters:
  • object (object) – The object to pickle. If None, only the parameters passed to the parameters argument will be saved.
  • file (file) – The destination for saving.
  • parameters (list, optional) – Shared variables whose internal numpy arrays should be saved separately in the _parameters field of the tar file.
  • to_add (dict of objects) – A {‘name’: object} dictionary of additional objects to save in the tar archive. Its keys will be used as file names in the archive.
  • use_cpickle (bool) – Use cPickle instead of pickle. Setting it to true will disable the warning message if you try to pickle objects from the main module, so be sure that there is no warning before turning this flag on. Default: False.
  • protocol (int, optional) – The pickling protocol to use. Unlike Python’s built-in pickle, the default is set to 2 instead of 0 for Python 2. The Python 3 default (level 3) is maintained.
  • **kwargs – Keyword arguments to be passed to pickle.Pickler.
blocks.serialization.load(file_, name='_pkl', use_cpickle=False, **kwargs)[source]

Loads an object saved using the dump function.

By default, this function loads the object saved by the dump function. If other objects have been added to the archive using the add_to_dump function, you can load them by passing their name to the name parameter.

Parameters:
  • file (file) – The file that contains the object to load.
  • name (str) – Name of the object to load. Default is ‘_pkl’, meaning that the original object which was dumped is loaded.
  • use_cpickle (bool) – Use cPickle instead of pickle. Default: False.
  • **kwargs – Keyword arguments to be passed to pickle.Unpickler. Used for e.g. specifying the encoding so as to load legacy Python pickles under Python 3.x.
Returns:
Return type:The object saved in file_.

blocks.serialization.load_parameters(file_)[source]

Loads the parameter values saved by dump().

This function loads the parameters that have been saved separately by dump(), i.e. the ones passed to its parameters argument.

Parameters:file (file) – The source to load the parameters from.
Returns:
Return type:A dictionary of (parameter name, numpy array) pairs.
blocks.serialization.secure_dump(object_, path, dump_function=<function dump>, **kwargs)[source]

Robust serialization: does not corrupt your files when it fails.

Parameters:
  • object (object) – The object to be saved to the disk.
  • path (str) – The destination for saving.
  • dump_function (function) – The function that is used to perform the serialization. Must take an object and file object as arguments. By default, dump() is used. An alternative would be pickle.dump().
  • **kwargs – Keyword arguments to be passed to dump_function.
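A guarantee like this is commonly implemented by dumping to a temporary file and atomically renaming it into place only on success. Below is a minimal sketch of that pattern; secure_dump_sketch is a hypothetical helper written for illustration, not the Blocks implementation:

```python
import os
import pickle
import tempfile

def secure_dump_sketch(object_, path, dump_function=pickle.dump, **kwargs):
    """Serialize object_ to path without corrupting an existing file on failure."""
    # Create the temporary file in the destination directory so that the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            dump_function(object_, tmp, **kwargs)
        # Renaming within one filesystem is atomic: readers see either the
        # old complete file or the new complete file, never a torn one.
        os.replace(tmp_path, path)
    except Exception:
        os.remove(tmp_path)
        raise

secure_dump_sketch({'step': 1}, 'checkpoint.pkl')
```

If dump_function raises, the partially written temporary file is deleted and any previous checkpoint at path is left untouched.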