Bricks

blocks.bricks.application(*args, **kwargs)[source]

Decorator for methods that apply a brick to inputs.

Parameters:
  • *args, optional – The application method to wrap.
  • **kwargs, optional – Attributes to attach to this application.

Notes

This decorator replaces application methods with Application instances. It also sets the attributes given as keyword arguments to the decorator.

Note that this decorator purposely does not wrap the original method using e.g. wraps() or update_wrapper(), since that would make the class impossible to pickle (see notes at Application).

Examples

>>> class Foo(Brick):
...     @application(inputs=['x'], outputs=['y'])
...     def apply(self, x):
...         return x + 1
...     @application
...     def other_apply(self, x):
...         return x - 1
>>> foo = Foo()
>>> Foo.apply.inputs
['x']
>>> foo.apply.outputs
['y']
>>> Foo.other_apply 
<blocks.bricks.base.Application object at ...>
class blocks.bricks.Brick(name=None, children=None)[source]

Bases: blocks.graph.annotations.Annotation

A brick encapsulates Theano operations with parameters.

A brick goes through the following stages:

  1. Construction: The call to __init__() constructs a Brick instance with a name and creates any child bricks as well.
  2. Allocation of parameters:
    1. Allocation configuration of children: The push_allocation_config() method configures any children of this block.
    2. Allocation: The allocate() method allocates the shared Theano variables required for the parameters. Also allocates parameters for all children.
  3. The following can be done in either order:
    1. Application: By applying the brick to a set of Theano variables a part of the computational graph of the final model is constructed.
    2. The initialization of parameters:
      1. Initialization configuration of children: The push_initialization_config() method configures any children of this block.
      2. Initialization: This sets the initial values of the parameters by a call to initialize(), which is needed to call the final compiled Theano function. Also initializes all children.

Not all stages need to be called explicitly. Step 3(a) will automatically allocate the parameters if needed. Similarly, step 3(b.2) and 2(b) will automatically perform steps 3(b.1) and 2(a) if needed. They only need to be called separately if greater control is required. The only two methods which always need to be called are an application method to construct the computational graph, and the initialize() method in order to initialize the parameters.

At each different stage, a brick might need a certain set of configuration settings. All of these settings can be passed to the __init__() constructor. However, by default many bricks support lazy initialization. This means that the configuration settings can be set later.

Note

Some arguments to __init__() are always required, even when lazy initialization is enabled. Other arguments must be given before calling allocate(), while others yet only need to be given in order to call initialize(). Always read the documentation of each brick carefully.

Lazy initialization can be turned off by setting Brick.lazy = False. In this case, there is no need to call initialize() manually anymore, but all the configuration must be passed to the __init__() method.

Parameters:name (str, optional) – The name of this brick. This can be used to filter the application of certain modifications by brick names. By default, the brick receives the name of its class (lowercased).
name

str – The name of this brick.

print_shapes

bool – False by default. If True, it logs the shapes of all the input and output variables, which can be useful for debugging.

parameters

list of TensorSharedVariable and None – After calling the allocate() method this attribute will be populated with the shared variables storing this brick’s parameters. Allows for None so that parameters can always be accessed at the same index, even if some parameters are only defined given a particular configuration.

children

list of bricks – The children of this brick.

allocated

bool – False if allocate() has not been called yet. True otherwise.

initialized

bool – False if initialize() has not been called yet. True otherwise.

allocation_config_pushed

bool – False if neither allocate() nor push_allocation_config() has been called yet. True otherwise.

initialization_config_pushed

bool – False if neither initialize() nor push_initialization_config() has been called yet. True otherwise.

Notes

To provide support for lazy initialization, apply the lazy() decorator to the __init__() method.

Brick implementations must call the __init__() constructor of their parent using super(BrickImplementation, self).__init__(**kwargs) at the beginning of the overriding __init__.

The methods _allocate() and _initialize() need to be overridden if the brick needs to allocate shared variables and initialize their values in order to function.

A brick can have any number of methods which apply the brick on Theano variables. These methods should be decorated with the application() decorator.

If a brick has children, they must be listed in the children attribute. Moreover, if the brick wants to control the configuration of its children, the _push_allocation_config() and _push_initialization_config() methods need to be overridden.

Examples

Most bricks have lazy initialization enabled.

>>> import theano
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> from blocks.bricks import Linear
>>> linear = Linear(input_dim=5, output_dim=3,
...                 weights_init=IsotropicGaussian(),
...                 biases_init=Constant(0))
>>> x = theano.tensor.vector()
>>> linear.apply(x)  # Calls linear.allocate() automatically
linear_apply_output
>>> linear.initialize()  # Initializes the weight matrix
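
When finer control is needed, the stages listed above can also be invoked explicitly. A minimal sketch, reusing the same imports (the push calls are no-ops here since Linear has no children):

>>> linear2 = Linear(input_dim=5, output_dim=3,
...                  weights_init=IsotropicGaussian(),
...                  biases_init=Constant(0))
>>> linear2.push_allocation_config()      # configure children (no children here)
>>> linear2.allocate()                    # create the shared parameter variables
>>> y = linear2.apply(x)                  # build part of the computation graph
>>> linear2.push_initialization_config()  # configure children for initialization
>>> linear2.initialize()                  # set the initial parameter values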
allocate()[source]

Allocate shared variables for parameters.

Based on the current configuration of this Brick create Theano shared variables to store the parameters. After allocation, parameters are accessible through the parameters attribute.

This method calls the allocate() method of all children first, allowing the _allocate() method to override the parameters of the children if needed.

Raises:ValueError – If the configuration of this brick is insufficient to determine the number of parameters or their dimensionality to be initialized.

Notes

This method sets the parameters attribute to an empty list. This is in order to ensure that calls to this method completely reset the parameters.

children
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
get_dims(names)[source]

Get list of dimensions for a set of input/output variables.

Parameters:names (list) – The variable names.
Returns:dims – The dimensions of the sources.
Return type:list
get_hierarchical_name(parameter, delimiter='/')[source]

Return hierarchical name for a parameter.

Returns a path of the form brick1/brick2/brick3.parameter1. The delimiter is configurable.

Parameters:delimiter (str) – The delimiter used to separate brick names in the path.
get_unique_path()[source]

Returns unique path to this brick in the application graph.

initialize()[source]

Initialize parameters.

Initialize parameters, such as weight matrices and biases.

Notes

If the brick has not allocated its parameters yet, this method will call the allocate() method in order to do so.

parameters
print_shapes = False

See Brick.print_shapes

push_allocation_config()[source]

Push the configuration for allocation to child bricks.

Bricks can configure their children, based on their own current configuration. This will be automatically done by a call to allocate(), but if you want to override the configuration of child bricks manually, then you can call this function manually.

push_initialization_config()[source]

Push the configuration for initialization to child bricks.

Bricks can configure their children, based on their own current configuration. This will be automatically done by a call to initialize(), but if you want to override the configuration of child bricks manually, then you can call this function manually.

blocks.bricks.lazy(allocation=None, initialization=None)[source]

Makes the initialization lazy.

This decorator allows the user to define positional arguments which will not be needed until the allocation or initialization stage of the brick. If these arguments are not passed, it will automatically replace them with a custom None object. It is assumed that the missing arguments can be set after initialization by setting attributes with the same name.

Parameters:
  • allocation (list) – A list of argument names that are needed for allocation.
  • initialization (list) – A list of argument names that are needed for initialization.

Examples

>>> class SomeBrick(Brick):
...     @lazy(allocation=['a'], initialization=['b'])
...     def __init__(self, a, b, c='c', d=None):
...         print(a, b, c, d)
>>> brick = SomeBrick('a')
a NoneInitialization c None
>>> brick = SomeBrick(d='d', b='b')
NoneAllocation b c d
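
The omitted arguments can then be supplied as attributes of the same name before the corresponding stage runs. A minimal sketch, assuming the Linear brick (which is lazy in input_dim and output_dim):

>>> from blocks.bricks import Linear
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> linear = Linear(output_dim=3)       # input_dim not known yet
>>> linear.input_dim = 5                # must be set before allocate()
>>> linear.weights_init = IsotropicGaussian()
>>> linear.biases_init = Constant(0)
>>> linear.initialize()                 # allocates and initializes the parameters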
class blocks.bricks.BatchNormalization(**kwargs)[source]

Bases: blocks.bricks.interfaces.RNGMixin, blocks.bricks.interfaces.Feedforward

Normalizes activations, parameterizes a scale and shift.

Parameters:
  • input_dim (int or tuple) – Shape of a single input example. It is assumed that a batch axis will be prepended to this.
  • broadcastable (tuple, optional) – Tuple of the same length as input_dim which specifies which of the per-example axes should be averaged over to compute means and standard deviations. For example, in order to normalize over all spatial locations in a (batch_index, channels, height, width) image, pass (False, True, True). The batch axis is always averaged out.
  • conserve_memory (bool, optional) – Use an implementation that stores less intermediate state and therefore uses less memory, at the expense of 5-10% speed. Default is True.
  • epsilon (float, optional) – The stabilizing constant for the minibatch standard deviation computation (when the brick is run in training mode). Added to the variance inside the square root, as in the batch normalization paper.
  • scale_init (object, optional) – Initialization object to use for the learned scaling parameter ($\gamma$ in [BN]). By default, uses constant initialization of 1.
  • shift_init (object, optional) – Initialization object to use for the learned shift parameter ($\beta$ in [BN]). By default, uses constant initialization of 0.
  • mean_only (bool, optional) – Perform “mean-only” batch normalization as described in [SK2016].
  • learn_scale (bool, optional) – Whether to include a learned scale parameter ($\gamma$ in [BN]) in this brick. Default is True. Has no effect if mean_only is True (i.e. a scale parameter is never learned in mean-only mode).
  • learn_shift (bool, optional) – Whether to include a learned shift parameter ($\beta$ in [BN]) in this brick. Default is True.

Notes

In order for trained models to behave sensibly immediately upon deserialization, by default, this brick runs in inference mode, using a population mean and population standard deviation (initialized to zeros and ones respectively) to normalize activations. It is expected that the user will adapt these during training in some fashion, independently of the training objective, e.g. by taking a moving average of minibatch-wise statistics.

In order to train with batch normalization, one must obtain a training graph by transforming the original inference graph. See apply_batch_normalization() for a routine to transform graphs, and batch_normalization() for a context manager that may enable shorter compile times (every instance of BatchNormalization is itself a context manager, entry into which causes applications to be in minibatch “training” mode, however it is usually more convenient to use batch_normalization() to enable this behaviour for all of your graph’s BatchNormalization bricks at once).

Note that training in inference mode should be avoided, as this brick introduces scale and shift parameters (tagged with the PARAMETER role) that, in the absence of batch normalization, usually make things unstable. If you must do this, filter for and remove BATCH_NORM_SHIFT_PARAMETER and BATCH_NORM_SCALE_PARAMETER from the list of parameters you are training, and this brick should behave as a (somewhat expensive) no-op.

This Brick accepts scale_init and shift_init arguments but is not an instance of Initializable, and will therefore not receive pushed initialization config from any parent brick. In almost all cases, you will probably want to stick with the defaults (unit scale and zero offset), but you can explicitly pass one or both initializers to override this.

This has the necessary properties to be inserted into a blocks.bricks.conv.ConvolutionalSequence as-is, in which case the input_dim should be omitted at construction, to be inferred from the layer below.
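
As a minimal sketch of the two modes described above (population statistics by default, minibatch statistics inside the brick's context manager), assuming a 5-dimensional input:

>>> import theano
>>> from blocks.bricks import BatchNormalization
>>> bn = BatchNormalization(input_dim=5)
>>> bn.initialize()
>>> x = theano.tensor.matrix('x')
>>> y_inference = bn.apply(x)       # inference mode: population statistics
>>> with bn:                        # minibatch "training" mode for this application
...     y_training = bn.apply(x)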

[BN](1, 2, 3, 4) Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. ICML (2015), pp. 448-456.
[SK2016]Tim Salimans and Diederik P. Kingma. Weight normalization: a simple reparameterization to accelerate training of deep neural networks. arXiv 1602.07868.
apply
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
image_size
normalization_axes
num_channels
num_output_channels
output_dim
class blocks.bricks.SpatialBatchNormalization(**kwargs)[source]

Bases: blocks.bricks.bn.BatchNormalization

Convenient subclass for batch normalization across spatial inputs.

Parameters:input_dim (int or tuple) – The input size of a single example. Must be length at least 2. It’s assumed that the first axis of this tuple is a “channels” axis, which should not be summed over, and all remaining dimensions are spatial dimensions.

Notes

See BatchNormalization for more details (and additional keyword arguments).

class blocks.bricks.BatchNormalizedMLP(**kwargs)[source]

Bases: blocks.bricks.sequences.MLP

Convenient subclass for building an MLP with batch normalization.

Parameters:
  • conserve_memory, mean_only, learn_scale, learn_shift (optional) – See BatchNormalization and the Notes below.

Notes

All other parameters are the same as MLP. Each activation brick is wrapped in a Sequence containing an appropriate BatchNormalization brick and the activation that follows it.

By default, the contained Linear bricks will not contain any biases, as they could be canceled out by the biases in the BatchNormalization bricks being added. Pass use_bias with a value of True if you really want this for some reason.

mean_only, learn_scale and learn_shift are pushed down to all created BatchNormalization bricks as allocation config.
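
A minimal construction sketch, assuming the same arguments as MLP plus the options above:

>>> from blocks.bricks import BatchNormalizedMLP, Tanh
>>> from blocks.initialization import IsotropicGaussian
>>> mlp = BatchNormalizedMLP(activations=[Tanh(), Tanh()], dims=[30, 20, 10],
...                          weights_init=IsotropicGaussian(0.01))
>>> mlp.initialize()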

conserve_memory

Conserve memory.

class blocks.bricks.Feedforward(name=None, children=None)[source]

Bases: blocks.bricks.base.Brick

Declares an interface for bricks with one input and one output.

Many bricks have just one input and just one output (activations, Linear, MLP). To make such bricks interchangeable in most contexts they should share an interface for configuring their input and output dimensions. This brick declares such an interface.

input_dim

int – The input dimension of the brick.

output_dim

int – The output dimension of the brick.

class blocks.bricks.Initializable(**kwargs)[source]

Bases: blocks.bricks.interfaces.RNGMixin, blocks.bricks.base.Brick

Base class for bricks which push parameter initialization.

Many bricks will initialize children which perform a linear transformation, often with biases. This brick allows the weights and biases initialization to be configured in the parent brick and pushed down the hierarchy.

Parameters:
  • weights_init (object) – A NdarrayInitialization instance which will be used to initialize the weight matrix. Required by initialize().
  • biases_init (object, optional) – A NdarrayInitialization instance that will be used to initialize the biases. Required by initialize() when use_bias is True. Only supported by bricks for which has_biases is True.
  • use_bias (bool, optional) – Whether to use a bias. Defaults to True. Required by initialize(). Only supported by bricks for which has_biases is True.
  • rng (numpy.random.RandomState) –
has_biases

bool – False if the brick does not support biases and only has weights_init. For an example of this, see Bidirectional. If this is False, the brick does not support the arguments biases_init or use_bias.

has_biases = True
class blocks.bricks.LinearLike(**kwargs)[source]

Bases: blocks.bricks.interfaces.Initializable

Initializable subclass with logic for Linear-like classes.

Notes

Provides W and b properties that can be overridden in subclasses to implement pre-application transformations on the weights and biases. Application methods should refer to self.W and self.b rather than accessing the parameters list directly.

This assumes a layout of the parameters list with the weights coming first and biases (if use_bias is True) coming second.
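
For illustration, a hypothetical subclass (not part of the library) could rescale the stored weights before every application by overriding the W property:

>>> from blocks.bricks import Linear
>>> class ScaledLinear(Linear):  # hypothetical example
...     @property
...     def W(self):
...         # transform the stored weight matrix before apply() uses it
...         return 2 * super(ScaledLinear, self).W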

W
b
class blocks.bricks.Random(theano_seed=None, **kwargs)[source]

Bases: blocks.bricks.base.Brick

A mixin class for Bricks which need Theano RNGs.

Parameters:theano_seed (int or list, optional) – Seed to use for a MRG_RandomStreams object.
seed_rng = <mtrand.RandomState object>
theano_rng

Returns Brick’s Theano RNG, or a default one.

The default seed can be set through blocks.config.

theano_seed
class blocks.bricks.Linear(**kwargs)[source]

Bases: blocks.bricks.interfaces.LinearLike, blocks.bricks.interfaces.Feedforward

A linear transformation with optional bias.

Brick which applies a linear (affine) transformation by multiplying the input with a weight matrix. By default, a bias term is added (see Initializable for information on disabling this).

Parameters:
  • input_dim (int) – The dimension of the input. Required by allocate().
  • output_dim (int) – The dimension of the output. Required by allocate().

Notes

See Initializable for initialization parameters.

A linear transformation with bias is a matrix multiplication followed by a vector summation.

\[f(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}\]
apply

Apply the linear transformation.

Parameters:input (TensorVariable) – The input on which to apply the transformation
Returns:output – The transformed input plus optional bias
Return type:TensorVariable
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
class blocks.bricks.Bias(**kwargs)[source]

Bases: blocks.bricks.interfaces.Feedforward, blocks.bricks.interfaces.Initializable

Add a bias (i.e. sum with a vector).

apply

Add the bias to the input.

Parameters:input (TensorVariable) – The input to which to add the bias
Returns:output – The input with the bias added
Return type:TensorVariable
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
input_dim
output_dim
class blocks.bricks.Maxout(**kwargs)[source]

Bases: blocks.bricks.base.Brick

Maxout pooling transformation.

A brick that does max pooling over groups of input units. If you use this code in a research project, please cite [GWFM13].

[GWFM13]Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, Maxout networks, ICML (2013), pp. 1319-1327.
Parameters:num_pieces (int) – The size of the groups the maximum is taken over.

Notes

Maxout applies a set of linear transformations to a vector and selects for each output dimension the result with the highest value.
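
A minimal sketch; the assumption here is that the last dimension of the input is a multiple of num_pieces and is divided by it in the output:

>>> import theano
>>> from blocks.bricks import Maxout
>>> maxout = Maxout(num_pieces=4)
>>> x = theano.tensor.matrix('x')   # last dimension divisible by 4
>>> y = maxout.apply(x)             # last dimension reduced by a factor of 4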

apply

Apply the maxout transformation.

Parameters:input (TensorVariable) – The input on which to apply the transformation
Returns:output – The transformed input
Return type:TensorVariable
class blocks.bricks.LinearMaxout(**kwargs)[source]

Bases: blocks.bricks.interfaces.Initializable, blocks.bricks.interfaces.Feedforward

Maxout pooling following a linear transformation.

This code combines the Linear brick with a Maxout brick.

Parameters:
  • input_dim (int) – The dimension of the input. Required by allocate().
  • output_dim (int) – The dimension of the output. Required by allocate().
  • num_pieces (int) – The number of linear functions. Required by allocate().

Notes

See Initializable for initialization parameters.

apply

Apply the linear transformation followed by maxout.

Parameters:input (TensorVariable) – The input on which to apply the transformations
Returns:output – The transformed input
Return type:TensorVariable
input_dim
class blocks.bricks.Identity(name=None, children=None)[source]

Bases: blocks.bricks.interfaces.Activation

Elementwise application of identity function.

apply

Apply the identity function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply identity to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.Tanh(name=None, children=None)[source]

Bases: blocks.bricks.interfaces.Activation

Elementwise application of tanh function.

apply

Apply the tanh function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply tanh to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.Logistic(name=None, children=None)[source]

Bases: blocks.bricks.interfaces.Activation

Elementwise application of logistic function.

apply

Apply the logistic function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply logistic to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.Softplus(name=None, children=None)[source]

Bases: blocks.bricks.interfaces.Activation

Elementwise application of softplus function.

apply

Apply the softplus function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply softplus to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.Rectifier(name=None, children=None)[source]

Bases: blocks.bricks.interfaces.Activation

Elementwise application of rectifier function.

apply

Apply the rectifier function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply rectifier to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.LeakyRectifier(leak=0.01, **kwargs)[source]

Bases: blocks.bricks.interfaces.Activation

Elementwise application of the leaky rectifier function.

apply

Apply the leaky rectifier function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply the leaky rectifier to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.Softmax(name=None, children=None)[source]

Bases: blocks.bricks.base.Brick

A softmax brick.

Works with 2-dimensional inputs only. If you need more, see NDimensionalSoftmax.

apply

Standard softmax.

Parameters:input (Variable) – A matrix, each row contains unnormalized log-probabilities of a distribution.
Returns:output_ – A matrix with probabilities in each row for each distribution from input_.
Return type:Variable
categorical_cross_entropy

Computationally stable cross-entropy for pre-softmax values.

Parameters:
  • y (TensorVariable) – In the case of a matrix argument, each row represents a probability distribution. In the vector case, each element represents a distribution by specifying the position of 1 in a 1-hot vector.
  • x (TensorVariable) – A matrix, each row contains unnormalized probabilities of a distribution.
Returns:

cost – A vector of cross-entropies between respective distributions from y and x.

Return type:

TensorVariable

log_probabilities

Normalize log-probabilities.

Converts unnormalized log-probabilities (exponents of which do not sum to one) into actual log-probabilities (exponents of which sum to one).

Parameters:input (Variable) – A matrix, each row contains unnormalized log-probabilities of a distribution.
Returns:output – A matrix with normalized log-probabilities in each row for each distribution from input_.
Return type:Variable
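
A minimal sketch of the typical classification use, assuming integer class labels for y:

>>> import theano
>>> from blocks.bricks import Softmax
>>> scores = theano.tensor.matrix('scores')   # unnormalized log-probabilities
>>> labels = theano.tensor.lvector('labels')  # integer class indices
>>> softmax = Softmax()
>>> probabilities = softmax.apply(scores)
>>> cost = softmax.categorical_cross_entropy(labels, scores).mean()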
class blocks.bricks.NDimensionalSoftmax(name=None, children=None)[source]

Bases: blocks.bricks.simple.Softmax

A wrapped brick class.

This brick was automatically constructed by wrapping Softmax with WithExtraDims.

See also

BrickWrapper
For explanation of brick wrapping.

Softmax WithExtraDims

apply

Wraps the application method with reshapes.

Parameters:extra_ndim (int, optional) – The number of extra dimensions. Default is zero.

See also

Softmax.apply()
For documentation of the wrapped application method.
apply_delegate()[source]
categorical_cross_entropy

Wraps the application method with reshapes.

Parameters:extra_ndim (int, optional) – The number of extra dimensions. Default is zero.

See also

Softmax.categorical_cross_entropy()
For documentation of the wrapped application method.
categorical_cross_entropy_delegate()[source]
decorators = [<blocks.bricks.wrappers.WithExtraDims object>]
log_probabilities

Wraps the application method with reshapes.

Parameters:extra_ndim (int, optional) – The number of extra dimensions. Default is zero.

See also

Softmax.log_probabilities()
For documentation of the wrapped application method.
log_probabilities_delegate()[source]
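
A minimal sketch for temporal data, where one extra leading ‘time’ axis is merged with the batch axis internally:

>>> import theano
>>> from blocks.bricks import NDimensionalSoftmax
>>> softmax = NDimensionalSoftmax()
>>> x = theano.tensor.tensor3('x')                 # (time, batch, features)
>>> probabilities = softmax.apply(x, extra_ndim=1)
>>> log_probs = softmax.log_probabilities(x, extra_ndim=1)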
class blocks.bricks.Sequence(application_methods, **kwargs)[source]

Bases: blocks.bricks.base.Brick

A sequence of bricks.

This brick applies a sequence of bricks, assuming that their in- and outputs are compatible.

Parameters:application_methods (list) – List of BoundApplication or Brick instances to apply. For Brick instances, the apply method is used.
apply
apply_inputs()[source]
apply_outputs()[source]
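
A minimal sketch chaining two application methods (assuming compatible dimensions):

>>> import theano
>>> from blocks.bricks import Linear, Rectifier, Sequence
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> sequence = Sequence([Linear(10, 5, weights_init=IsotropicGaussian(),
...                             biases_init=Constant(0)).apply,
...                      Rectifier().apply])
>>> sequence.initialize()
>>> x = theano.tensor.matrix('x')
>>> y = sequence.apply(x)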
class blocks.bricks.FeedforwardSequence(application_methods, **kwargs)[source]

Bases: blocks.bricks.sequences.Sequence, blocks.bricks.interfaces.Feedforward

A sequence where the first and last bricks are feedforward.

Parameters:application_methods (list) – List of BoundApplication to apply. The first and last application method should belong to a Feedforward brick.
input_dim
output_dim
class blocks.bricks.MLP(**kwargs)[source]

Bases: blocks.bricks.sequences.FeedforwardSequence, blocks.bricks.interfaces.Initializable

A simple multi-layer perceptron.

Parameters:
  • activations (list of Brick, BoundApplication, or None) – A list of activations to apply after each linear transformation. Give None to not apply any activation. It is assumed that the application method to use is apply. Required for __init__().
  • dims (list of ints) – A list of input dimensions, as well as the output dimension of the last layer. Required for allocate().
  • prototype (Brick, optional) – The transformation prototype. A copy will be created for every activation. If not provided, an instance of Linear will be used.

Notes

See Initializable for initialization parameters.

Note that the weights_init, biases_init (as well as use_bias if set to a value other than the default of None) configurations will overwrite those of the layers each time the MLP is re-initialized. For more fine-grained control, push the configuration to the child layers manually before initialization.

>>> from blocks.bricks import Tanh
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> mlp = MLP(activations=[Tanh(), None], dims=[30, 20, 10],
...           weights_init=IsotropicGaussian(),
...           biases_init=Constant(1))
>>> mlp.push_initialization_config()  # Configure children
>>> mlp.children[0].weights_init = IsotropicGaussian(0.1)
>>> mlp.initialize()
input_dim
output_dim
class blocks.bricks.WithExtraDims[source]

Bases: blocks.bricks.wrappers.BrickWrapper

Wraps a brick’s applications to handle inputs with extra dimensions.

A brick can often be reused even when data has more dimensions than in the default setting. An example is a situation when one wants to apply categorical_cross_entropy() to temporal data, that is, when an additional ‘time’ axis is prepended to both its x and y inputs.

This wrapper adds reshapes required to use application methods of a brick with such data by merging the extra dimensions with the first non-extra one. Two key assumptions are made: that all inputs and outputs have the same number of extra dimensions and that these extra dimensions are equal throughout all inputs and outputs.

While this might be inconvenient, the wrapped brick does not try to guess the number of extra dimensions, but demands it as an argument. The considerations of simplicity and reliability motivated this design choice. Upon availability in Blocks of a mechanism to request the expected number of dimensions for an input of a brick, this can be reconsidered.

wrap(wrapped, namespace)[source]

Wrap an application of the base brick.

This method should be overridden to write into its namespace argument all required changes.

Parameters:
  • mcs (type) – The metaclass.
  • wrapped (Application) – The application to be wrapped.
  • namespace (dict) – The namespace of the class being created.
class blocks.bricks.lookup.LookupTable(**kwargs)[source]

Bases: blocks.bricks.interfaces.Initializable, blocks.bricks.interfaces.Feedforward

Encapsulates representations of a range of integers.

This brick can be used to embed integers, e.g. word indices, into a vector space.

Parameters:
  • length (int) – The size of the lookup table, or in other words, one plus the maximum index for which a representation is contained.
  • dim (int) – The dimensionality of representations.

Notes

See Initializable for initialization parameters.
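
A minimal embedding sketch, assuming integer indices of shape (time, batch):

>>> import theano
>>> from blocks.bricks.lookup import LookupTable
>>> from blocks.initialization import IsotropicGaussian
>>> lookup = LookupTable(length=10000, dim=16,
...                      weights_init=IsotropicGaussian(0.01))
>>> lookup.initialize()
>>> indices = theano.tensor.lmatrix('indices')   # (time, batch) integer indices
>>> embeddings = lookup.apply(indices)           # (time, batch, 16)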

W
apply

Perform lookup.

Parameters:indices (TensorVariable) – The indices of interest. The dtype must be integer.
Returns:output – Representations for the indices of the query. Has \(k+1\) dimensions, where \(k\) is the number of dimensions of the indices parameter. The last dimension stands for the representation element.
Return type:TensorVariable
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
has_bias = False
input_dim
output_dim

Convolutional bricks

class blocks.bricks.conv.AveragePooling(**kwargs)[source]

Bases: blocks.bricks.conv.Pooling

Average pooling layer.

Parameters:include_padding (bool, optional) – When calculating an average, include zeros that are the result of zero padding added by the padding argument. A value of True is only accepted if ignore_border is also True. False by default.

Notes

For documentation on the remainder of the arguments to this class, see MaxPooling.

class blocks.bricks.conv.Convolutional(**kwargs)[source]

Bases: blocks.bricks.interfaces.LinearLike

Performs a 2D convolution.

Parameters:
  • filter_size (tuple) – The height and width of the filter (also called kernels).
  • num_filters (int) – Number of filters per channel.
  • num_channels (int) – Number of input channels in the image. For the first layer this is normally 1 for grayscale images and 3 for color (RGB) images. For subsequent layers this is equal to the number of filters output by the previous convolutional layer. The filters are pooled over the channels.
  • batch_size (int, optional) – Number of examples per batch. If given, this will be passed to Theano convolution operator, possibly resulting in faster execution.
  • image_size (tuple, optional) – The height and width of the input (image or feature map). If given, this will be passed to the Theano convolution operator, resulting in possibly faster execution times.
  • step (tuple, optional) – The step (or stride) with which to slide the filters over the image. Defaults to (1, 1).
  • border_mode ({'valid', 'full'}, optional) – The border mode to use, see scipy.signal.convolve2d() for details. Defaults to ‘valid’.
  • tied_biases (bool) – Setting this to False will untie the biases, yielding a separate bias for every location at which the filter is applied. If True, it indicates that the biases of every filter in this layer should be shared amongst all applications of that filter. Defaults to True.
apply

Perform the convolution.

Parameters:input (TensorVariable) – A 4D tensor with the axes representing batch size, number of channels, image height, and image width.
Returns:output – A 4D tensor of filtered images (feature maps) with dimensions representing batch size, number of filters, feature map height, and feature map width.

The height and width of the feature map depend on the border mode. For ‘valid’ it is image_size - filter_size + 1 while for ‘full’ it is image_size + filter_size - 1.

Return type:TensorVariable
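
A minimal sketch, assuming RGB input images and the default ‘valid’ border mode:

>>> import theano
>>> from blocks.bricks.conv import Convolutional
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> conv = Convolutional(filter_size=(3, 3), num_filters=16, num_channels=3,
...                      weights_init=IsotropicGaussian(0.01),
...                      biases_init=Constant(0))
>>> conv.initialize()
>>> images = theano.tensor.tensor4('images')   # (batch, 3, height, width)
>>> feature_maps = conv.apply(images)          # (batch, 16, height - 2, width - 2)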
static conv2d_impl(input, filters, input_shape=None, filter_shape=None, border_mode='valid', subsample=(1, 1), filter_flip=True, image_shape=None, filter_dilation=(1, 1), num_groups=1, unshared=False, **kwargs)[source]

This function will build the symbolic graph for convolving a mini-batch of a stack of 2D inputs with a set of 2D filters. The implementation is modelled after Convolutional Neural Networks (CNN).

Parameters:
  • input (symbolic 4D tensor) – Mini-batch of feature map stacks, of shape (batch size, input channels, input rows, input columns). See the optional parameter input_shape.
  • filters (symbolic 4D or 6D tensor) – Set of filters used in CNN layer of shape (output channels, input channels, filter rows, filter columns) for normal convolution and (output channels, output rows, output columns, input channels, filter rows, filter columns) for unshared convolution. See the optional parameter filter_shape.
  • input_shape (None, tuple/list of len 4 or 6 of int or Constant variable) – The shape of the input parameter. Optional, possibly used to choose an optimal implementation. You can give None for any element of the list to specify that this element is not known at compile time.
  • filter_shape (None, tuple/list of len 4 or 6 of int or Constant variable) – The shape of the filters parameter. Optional, possibly used to choose an optimal implementation. You can give None for any element of the list to specify that this element is not known at compile time.
  • border_mode (str, int or a tuple of two ints or pairs of ints) –

    Either of the following:

    'valid': apply filter wherever it completely overlaps with the input. Generates output of shape: input shape - filter shape + 1.
    'full': apply filter wherever it partly overlaps with the input. Generates output of shape: input shape + filter shape - 1.
    'half': pad input with a symmetric border of filter rows // 2 rows and filter columns // 2 columns, then perform a valid convolution. For filters with an odd number of rows and columns, this leads to the output shape being equal to the input shape.
    int: pad input with a symmetric border of zeros of the given width, then perform a valid convolution.
    (int1, int2): (for 2D) pad input with a symmetric border of int1 and int2, then perform a valid convolution.
    (int1, (int2, int3)) or ((int1, int2), int3): (for 2D) pad input with one symmetric border of int1 or int3, and one asymmetric border of (int2, int3) or (int1, int2).
  • subsample (tuple of len 2) – Factor by which to subsample the output. Also called strides elsewhere.
  • filter_flip (bool) – If True, will flip the filter rows and columns before sliding them over the input. This operation is normally referred to as a convolution, and this is the default. If False, the filters are not flipped and the operation is referred to as a cross-correlation.
  • image_shape (None, tuple/list of len 4 of int or Constant variable) – Deprecated alias for input_shape.
  • filter_dilation (tuple of len 2) – Factor by which to subsample (stride) the input. Also called dilation elsewhere.
  • num_groups (int) – Divides the image, kernel and output tensors into num_groups separate groups, each of which carries out convolutions separately.
  • unshared (bool) – If true, then unshared or ‘locally connected’ convolution will be performed. A different filter will be used for each region of the input.
  • kwargs – Any other keyword arguments are accepted for backwards compatibility, but will be ignored.
Returns:

Set of feature maps generated by convolutional layer. Tensor is of shape (batch size, output channels, output rows, output columns)

Return type:

Symbolic 4D tensor

Notes

If cuDNN is available, it will be used on the GPU. Otherwise, the CorrMM (“caffe-style”) convolution will be used.

This is only supported in Theano 0.8 or the development version until it is released.

The parameter filter_dilation is an implementation of dilated convolution.

get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
static get_output_shape(image_shape, kernel_shape, border_mode, subsample, filter_dilation=None)[source]

This function compute the output shape of convolution operation.

Parameters:
  • image_shape (tuple of int (symbolic or numeric)) – The input image shape. Its four (or five) elements must correspond respectively to: batch size, number of input channels, height and width (and possibly depth) of the image. None where undefined.
  • kernel_shape (tuple of int (symbolic or numeric)) – The kernel shape. For a normal convolution, its four (for 2D convolution) or five (for 3D convolution) elements must correspond respectively to: number of output channels, number of input channels, height and width (and possibly depth) of the kernel. For an unshared 2D convolution, its six elements must correspond to: number of output channels, height and width of the output, number of input channels, height and width of the kernel. None where undefined.
  • border_mode (string, int (symbolic or numeric), tuple of int (symbolic or numeric), or pairs of ints) – If it is a string, it must be ‘valid’, ‘half’ or ‘full’. If it is a tuple, its two (or three) elements respectively correspond to the padding on the height and width (and possibly depth) axis. For asymmetric padding, provide a pair of ints for each dimension.
  • subsample (tuple of int (symbolic or numeric)) – Its two or three elements respectively correspond to the subsampling on the height and width (and possibly depth) axis.
  • filter_dilation (tuple of int (symbolic or numeric)) – Its two or three elements correspond respectively to the dilation on the height and width axis.
  • Note – The shape of the convolution output does not depend on the unshared or num_groups parameters.
Returns:

output_shape – Tuple of int corresponding to the output image shape. Its four elements must correspond respectively to: batch size, number of output channels, height and width of the image. None where undefined.

Return type:

tuple of int
num_output_channels
class blocks.bricks.conv.ConvolutionalSequence(**kwargs)[source]

Bases: blocks.bricks.sequences.Sequence, blocks.bricks.interfaces.Initializable, blocks.bricks.interfaces.Feedforward

A sequence of convolutional (or pooling) operations.

Parameters:
  • layers (list) – List of convolutional bricks (i.e. Convolutional, ConvolutionalActivation, or Pooling bricks), or application methods from such bricks. Activation bricks that operate elementwise can also be included.
  • num_channels (int) – Number of input channels in the image. For the first layer this is normally 1 for grayscale images and 3 for color (RGB) images. For subsequent layers this is equal to the number of filters output by the previous convolutional layer.
  • batch_size (int, optional) – Number of images in batch. If given, will be passed to theano’s convolution operator resulting in possibly faster execution.
  • image_size (tuple, optional) – Width and height of the input (image/featuremap). If given, will be passed to theano’s convolution operator resulting in possibly faster execution.
  • border_mode ('valid', 'full' or None, optional) – The border mode to use, see scipy.signal.convolve2d() for details. Unlike with Convolutional, this defaults to None, in which case no default value is pushed down to child bricks at allocation time. Child bricks will in this case need to rely on either a default border mode (usually valid) or one provided at construction and/or after construction (but before allocation).
  • tied_biases (bool, optional) – Same meaning as in Convolutional. Defaults to None, in which case no value is pushed to child Convolutional bricks.

Notes

The passed convolutional operators should be constructed lazily, that is, without specifying the batch_size, num_channels and image_size. The main feature of ConvolutionalSequence is that it will set the input dimensions of a layer to the output dimensions of the previous layer by the push_allocation_config() method.

The push behaviour of tied_biases mirrors that of use_bias or any initialization configuration: only an explicitly specified value is pushed down the hierarchy. border_mode also has this behaviour. The reason the border_mode parameter behaves the way it does is that pushing a single default border_mode makes it very difficult to have child bricks with different border modes. Normally, such things would be overridden after push_allocation_config(), but this is a particular hassle as the border mode affects the allocation parameters of every subsequent child brick in the sequence. Thus, only an explicitly specified border mode will be pushed down the hierarchy.
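
A minimal sketch; the convolutional children are constructed lazily and receive num_channels and image_size from the sequence at allocation time:

>>> from blocks.bricks import Rectifier
>>> from blocks.bricks.conv import (Convolutional, MaxPooling,
...                                 ConvolutionalSequence)
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> conv_sequence = ConvolutionalSequence(
...     layers=[Convolutional(filter_size=(3, 3), num_filters=8),
...             Rectifier(),
...             MaxPooling(pooling_size=(2, 2))],
...     num_channels=3, image_size=(32, 32), border_mode='valid',
...     weights_init=IsotropicGaussian(0.01), biases_init=Constant(0))
>>> conv_sequence.initialize()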

get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
class blocks.bricks.conv.ConvolutionalTranspose(**kwargs)[source]

Bases: blocks.bricks.conv.Convolutional

Performs the transpose of a 2D convolution.

Parameters:
  • num_filters (int) – Number of filters at the output of the transposed convolution, i.e. the number of channels in the corresponding convolution.
  • num_channels (int) – Number of channels at the input of the transposed convolution, i.e. the number of output filters in the corresponding convolution.
  • step (tuple, optional) – The step (or stride) of the corresponding convolution. Defaults to (1, 1).
  • image_size (tuple, optional) – Image size of the input to the transposed convolution, i.e. the output of the corresponding convolution. Required for tied biases. Defaults to None.
  • unused_edge (tuple, optional) – Tuple of pixels added to the inferred height and width of the output image, whose values would be ignored in the corresponding forward convolution. Must be such that 0 <= unused_edge[i] <= step[i]. Note that this parameter is ignored if original_image_size is specified in the constructor or manually set as an attribute.
  • original_image_size (tuple, optional) – The height and width of the image that forms the output of the transpose operation, which is the input of the original (non-transposed) convolution. By default, this is inferred from image_size to be the size that has each pixel of the original image touched by at least one filter application in the original convolution. Degenerate cases with dropped border pixels (in the original convolution) are possible, and can be manually specified via this argument. See notes below.

See also

Convolutional
For the documentation of other parameters.

Notes

By default, original_image_size is inferred from image_size as being the minimum size of image that could have produced this output. Let hanging[i] = original_image_size[i] - image_size[i] * step[i]. Any value of hanging[i] greater than filter_size[i] - step[i] will result in border pixels that are ignored by the original convolution. With this brick, any original_image_size such that filter_size[i] - step[i] < hanging[i] < filter_size[i] for all i can be validly specified. However, no value will be output by the transposed convolution itself for these extra hanging border pixels, and they will be determined entirely by the bias.

conv2d_impl(input_, W, input_shape, subsample, border_mode, filter_shape)[source]

This function will build the symbolic graph for convolving a mini-batch of a stack of 2D inputs with a set of 2D filters. The implementation is modelled after Convolutional Neural Networks (CNN).

Parameters:
  • input (symbolic 4D tensor) – Mini-batch of feature map stacks, of shape (batch size, input channels, input rows, input columns). See the optional parameter input_shape.
  • filters (symbolic 4D or 6D tensor) – Set of filters used in CNN layer of shape (output channels, input channels, filter rows, filter columns) for normal convolution and (output channels, output rows, output columns, input channels, filter rows, filter columns) for unshared convolution. See the optional parameter filter_shape.
  • input_shape (None, tuple/list of len 4 or 6 of int or Constant variable) – The shape of the input parameter. Optional, possibly used to choose an optimal implementation. You can give None for any element of the list to specify that this element is not known at compile time.
  • filter_shape (None, tuple/list of len 4 or 6 of int or Constant variable) – The shape of the filters parameter. Optional, possibly used to choose an optimal implementation. You can give None for any element of the list to specify that this element is not known at compile time.
  • border_mode (str, int or a tuple of two ints or pairs of ints) –

    Either of the following:

    'valid': apply filter wherever it completely overlaps with the input. Generates output of shape: input shape - filter shape + 1.
    'full': apply filter wherever it partly overlaps with the input. Generates output of shape: input shape + filter shape - 1.
    'half': pad input with a symmetric border of filter rows // 2 rows and filter columns // 2 columns, then perform a valid convolution. For filters with an odd number of rows and columns, this leads to the output shape being equal to the input shape.
    int: pad input with a symmetric border of zeros of the given width, then perform a valid convolution.
    (int1, int2): (for 2D) pad input with a symmetric border of int1 and int2, then perform a valid convolution.
    (int1, (int2, int3)) or ((int1, int2), int3): (for 2D) pad input with one symmetric border of int1 or int3, and one asymmetric border of (int2, int3) or (int1, int2).
  • subsample (tuple of len 2) – Factor by which to subsample the output. Also called strides elsewhere.
  • filter_flip (bool) – If True, will flip the filter rows and columns before sliding them over the input. This operation is normally referred to as a convolution, and this is the default. If False, the filters are not flipped and the operation is referred to as a cross-correlation.
  • image_shape (None, tuple/list of len 4 of int or Constant variable) – Deprecated alias for input_shape.
  • filter_dilation (tuple of len 2) – Factor by which to subsample (stride) the input. Also called dilation elsewhere.
  • num_groups (int) – Divides the image, kernel and output tensors into num_groups separate groups, each of which carries out convolutions separately.
  • unshared (bool) – If true, then unshared or ‘locally connected’ convolution will be performed. A different filter will be used for each region of the input.
  • kwargs – Any other keyword arguments are accepted for backwards compatibility, but will be ignored.
Returns:

Set of feature maps generated by convolutional layer. Tensor is of shape (batch size, output channels, output rows, output columns)

Return type:

Symbolic 4D tensor

Notes

If cuDNN is available, it will be used on the GPU. Otherwise, the CorrMM (“caffe-style”) convolution will be used.

This is only supported in Theano 0.8 or the development version until it is released.

The parameter filter_dilation is an implementation of dilated convolution.

get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
original_image_size
class blocks.bricks.conv.Flattener(name=None, children=None)[source]

Bases: blocks.bricks.base.Brick

Flattens the input.

It may be used to pass multidimensional objects like images or feature maps of convolutional bricks into bricks which allow only two dimensional input (batch, features) like MLP.

apply
class blocks.bricks.conv.MaxPooling(**kwargs)[source]

Bases: blocks.bricks.conv.Pooling

Max pooling layer.

Parameters:
  • pooling_size (tuple) – The height and width of the pooling region i.e. this is the factor by which your input’s last two dimensions will be downscaled.
  • step (tuple, optional) – The vertical and horizontal shift (stride) between pooling regions. By default this is equal to pooling_size. Setting this to a lower number results in overlapping pooling regions.
  • input_dim (tuple, optional) – A tuple of integers representing the shape of the input. The last two dimensions will be used to calculate the output dimension.
  • padding (tuple, optional) – A tuple of integers representing the vertical and horizontal zero-padding to be applied to each of the top and bottom (vertical) and left and right (horizontal) edges. For example, an argument of (4, 3) will apply 4 pixels of padding to the top edge, 4 pixels of padding to the bottom edge, and 3 pixels each for the left and right edge. By default, no padding is performed.
  • ignore_border (bool, optional) – Whether or not to do partial downsampling based on borders where the extent of the pooling region reaches beyond the edge of the image. If True, a (5, 5) image with (2, 2) pooling regions and (2, 2) step will be downsampled to shape (2, 2), otherwise it will be downsampled to (3, 3). True by default.

Notes

Warning

As of this writing, setting ignore_border to False with a step not equal to the pooling size will force Theano to perform pooling computations on CPU rather than GPU, even if you have specified a GPU as your computation device. Additionally, Theano will only use [cuDNN] (if available) for pooling computations with ignore_border set to True. You can ensure that the entire input is captured by at least one pool by using the padding argument to add zero padding prior to pooling being performed.

[cuDNN]NVIDIA cuDNN.
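
A minimal sketch with non-overlapping (2, 2) pooling over 4D feature maps:

>>> import theano
>>> from blocks.bricks.conv import MaxPooling
>>> pool = MaxPooling(pooling_size=(2, 2))
>>> x = theano.tensor.tensor4('x')   # (batch, channels, height, width)
>>> pooled = pool.apply(x)           # last two dimensions halved (ignore_border=True)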
class blocks.bricks.conv.Pooling(**kwargs)[source]

Bases: blocks.bricks.interfaces.Initializable, blocks.bricks.interfaces.Feedforward

Base Brick for pooling operations.

This should generally not be instantiated directly; see MaxPooling.

apply

Apply the pooling (subsampling) transformation.

Parameters:input (TensorVariable) – A tensor with dimension greater than or equal to 2. The last two dimensions will be downsampled. For example, with images this means that the last two dimensions should represent the height and width of your image.
Returns:output – A tensor with the same number of dimensions as input_, but with the last two dimensions downsampled.
Return type:TensorVariable
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
image_size
num_channels
num_output_channels

Routing bricks

class blocks.bricks.parallel.Distribute(**kwargs)[source]

Bases: blocks.bricks.parallel.Fork

Transform an input and add it to other inputs.

This brick is designed for the following scenario: one has a group of variables and another separate variable, and one needs to somehow distribute information from the latter across the former. We call that “to distribute a variable across other variables”, and refer to the separate variable as “the source” and to the variables from the group as “the targets”.

Given a prototype brick, a Parallel brick makes several copies of it (each with its own parameters). At the application time the copies are applied to the source and the transformation results are added to the targets (in the literal sense).

>>> from theano import tensor
>>> from blocks.initialization import Constant
>>> x = tensor.matrix('x')
>>> y = tensor.matrix('y')
>>> z = tensor.matrix('z')
>>> distribute = Distribute(target_names=['x', 'y'], source_name='z',
...                         target_dims=[2, 3], source_dim=3,
...                         weights_init=Constant(2))
>>> distribute.initialize()
>>> new_x, new_y = distribute.apply(x=x, y=y, z=z)
>>> new_x.eval({x: [[2, 2]], z: [[1, 1, 1]]}) 
array([[ 8.,  8.]]...
>>> new_y.eval({y: [[1, 1, 1]], z: [[1, 1, 1]]}) 
array([[ 7.,  7.,  7.]]...
Parameters:
  • target_names (list) – The names of the targets.
  • source_name (str) – The name of the source.
  • target_dims (list) – A list of target dimensions, corresponding to target_names.
  • source_dim (int) – The dimension of the source input.
  • prototype (Feedforward, optional) – The transformation prototype. A copy will be created for every input. By default a linear transformation is used.
target_dims

list

source_dim

int

Notes

See Initializable for initialization parameters.

apply

Distribute the source across the targets.

Parameters:**kwargs (dict) – The source and the target variables.
Returns:output – The new target variables.
Return type:list
apply_inputs()[source]
apply_outputs()[source]
class blocks.bricks.parallel.Fork(**kwargs)[source]

Bases: blocks.bricks.parallel.Parallel

Several outputs from one input by applying similar transformations.

Given a prototype brick, a Fork brick makes several copies of it (each with its own parameters). At the application time the copies are applied to the input to produce different outputs.

A typical usecase for this brick is to produce inputs for gates of gated recurrent bricks, such as GatedRecurrent.

>>> from theano import tensor
>>> from blocks.initialization import Constant
>>> x = tensor.matrix('x')
>>> fork = Fork(output_names=['y', 'z'],
...             input_dim=2, output_dims=[3, 4],
...             weights_init=Constant(2), biases_init=Constant(1))
>>> fork.initialize()
>>> y, z = fork.apply(x)
>>> y.eval({x: [[1, 1]]}) 
array([[ 5.,  5.,  5.]]...
>>> z.eval({x: [[1, 1]]}) 
array([[ 5.,  5.,  5.,  5.]]...
Parameters:
  • output_names (list of str) – Names of the outputs to produce.
  • input_dim (int) – The input dimension.
  • prototype (Feedforward, optional) – The transformation prototype. A copy will be created for every input. By default an affine transformation is used.
input_dim

int – The input dimension.

output_dims

list – The output dimensions as a list of integers, corresponding to output_names.

apply
apply_outputs()[source]
class blocks.bricks.parallel.Merge(**kwargs)[source]

Bases: blocks.bricks.parallel.Parallel

Merges several variables by applying a transformation and summing.

Parameters:
  • input_names (list) – The input names.
  • input_dims (list) – A list of input dimensions, corresponding to input_names.
  • output_dim (int) – The output dimension of the merged variables.
  • prototype (Feedforward, optional) – A transformation prototype. A copy will be created for every input. If None, a linear transformation is used.
  • child_prefix (str, optional) – A prefix for children names. By default “transform” is used.
Warning

Note that if you want to have a bias you can pass a Linear brick as a prototype, but this will result in several redundant biases. It is a better idea to use merge.children[0].use_bias = True.
input_names

list – The input names.

input_dims

list – List of input dimensions corresponding to input_names.

output_dim

int – The output dimension.

Examples

>>> from theano import tensor
>>> from blocks.initialization import Constant
>>> a = tensor.matrix('a')
>>> b = tensor.matrix('b')
>>> merge = Merge(input_names=['a', 'b'], input_dims=[3, 4],
...               output_dim=2, weights_init=Constant(1.))
>>> merge.initialize()
>>> c = merge.apply(a=a, b=b)
>>> c.eval({a: [[1, 1, 1]], b: [[2, 2, 2, 2]]})  
array([[ 11.,  11.]]...
apply
apply_inputs()[source]
class blocks.bricks.parallel.Parallel(**kwargs)[source]

Bases: blocks.bricks.interfaces.Initializable

Apply similar transformations to several inputs.

Given a prototype brick, a Parallel brick makes several copies of it (each with its own parameters). At the application time every copy is applied to the respective input.

>>> from theano import tensor
>>> from blocks.initialization import Constant
>>> x, y = tensor.matrix('x'), tensor.matrix('y')
>>> parallel = Parallel(
...     prototype=Linear(use_bias=False),
...     input_names=['x', 'y'], input_dims=[2, 3], output_dims=[4, 5],
...     weights_init=Constant(2))
>>> parallel.initialize()
>>> new_x, new_y = parallel.apply(x=x, y=y)
>>> new_x.eval({x: [[1, 1]]}) 
array([[ 4.,  4.,  4.,  4.]]...
>>> new_y.eval({y: [[1, 1, 1]]}) 
array([[ 6.,  6.,  6.,  6.,  6.]]...
Parameters:
  • input_names (list) – The input names.
  • input_dims (list) – List of input dimensions, given in the same order as input_names.
  • output_dims (list) – List of output dimensions.
  • prototype (Feedforward) – The transformation prototype. A copy will be created for every input.
  • child_prefix (str, optional) – The prefix for children names. By default “transform” is used.
input_names

list – The input names.

input_dims

list – Input dimensions.

output_dims

list – Output dimensions.

Notes

See Initializable for initialization parameters.

apply
apply_inputs()[source]
apply_outputs()[source]

Recurrent bricks

Recurrent architectures

class blocks.bricks.recurrent.architectures.GatedRecurrent(**kwargs)[source]

Bases: blocks.bricks.recurrent.base.BaseRecurrent, blocks.bricks.interfaces.Initializable

Gated recurrent neural network.

Gated recurrent neural network (GRNN) as introduced in [CvMG14]. Every unit of a GRNN is equipped with update and reset gates that facilitate better gradient propagation.

Parameters:
  • dim (int) – The dimension of the hidden state.
  • activation (Brick or None) – The brick to apply as activation. If None a Tanh brick is used.
  • gate_activation (Brick or None) – The brick to apply as activation for gates. If None a Logistic brick is used.

Notes

See Initializable for initialization parameters.

[CvMG14] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP (2014), pp. 1724-1734.
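
Examples

The reference contains no usage example for this brick, so here is a sketch (the dimensions and constant initialization are purely illustrative, not part of the original documentation). It shows the typical pattern of pairing GatedRecurrent with a Fork that produces the inputs and gate_inputs from a single input sequence:

>>> from theano import tensor
>>> from blocks.bricks.parallel import Fork
>>> from blocks.bricks.recurrent.architectures import GatedRecurrent
>>> from blocks.initialization import Constant
>>> x = tensor.tensor3('x')  # (time, batch, 5); the input size 5 is assumed
>>> gru = GatedRecurrent(dim=3, weights_init=Constant(0.1))
>>> fork = Fork(output_names=['inputs', 'gate_inputs'],
...             input_dim=5, output_dims=[3, 2 * 3],
...             weights_init=Constant(0.1), biases_init=Constant(0))
>>> gru.initialize()
>>> fork.initialize()
>>> inputs, gate_inputs = fork.apply(x)
>>> states = gru.apply(inputs=inputs, gate_inputs=gate_inputs)

The resulting states variable is the (time, batch, dim) sequence of hidden states.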
apply

Apply the gated recurrent transition.

Parameters:
  • states (TensorVariable) – The 2 dimensional matrix of current states in the shape (batch_size, dim). Required for one_step usage.
  • inputs (TensorVariable) – The 2 dimensional matrix of inputs in the shape (batch_size, dim).
  • gate_inputs (TensorVariable) – The 2 dimensional matrix of inputs to the gates in the shape (batch_size, 2 * dim).
  • mask (TensorVariable) – A 1D binary array in the shape (batch,) which is 1 if there is data available, 0 if not. Assumed to be all ones if not given.
Returns:

output – Next states of the network.

Return type:

TensorVariable

get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_states
state_to_gates
state_to_state
class blocks.bricks.recurrent.architectures.LSTM(**kwargs)[source]

Bases: blocks.bricks.recurrent.base.BaseRecurrent, blocks.bricks.interfaces.Initializable

Long Short Term Memory.

Every unit of an LSTM is equipped with input, forget and output gates. This implementation is based on code by Mohammad Pezeshki that implements the architecture used in [GSS03] and [Grav13]. It aims to do as many computations in parallel as possible and expects the last dimension of the input to be four times the output dimension.

Unlike a vanilla LSTM as described in [HS97], this model has peephole connections from the cells to the gates. The output gates receive information about the cells at the current time step, while the other gates only receive information about the cells at the previous time step. All ‘peephole’ weight matrices are diagonal.

[GSS03] Gers, Felix A., Nicol N. Schraudolph, and Jürgen Schmidhuber, Learning precise timing with LSTM recurrent networks, Journal of Machine Learning Research 3 (2003), pp. 115-143.
[Grav13] Graves, Alex, Generating sequences with recurrent neural networks, arXiv preprint arXiv:1308.0850 (2013).
[HS97] Sepp Hochreiter and Jürgen Schmidhuber, Long Short-Term Memory, Neural Computation 9(8) (1997), pp. 1735-1780.
Parameters:
  • dim (int) – The dimension of the hidden state.
  • activation (Brick, optional) – The activation function. The default and by far the most popular is Tanh.
  • gate_activation (Brick or None) – The brick to apply as activation for gates (input/output/forget). If None a Logistic brick is used.

Notes

See Initializable for initialization parameters.
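
Examples

A usage sketch, not taken from the reference documentation; the Linear brick and the dimensions below are illustrative and only serve to produce an input whose last dimension is four times dim, as required above:

>>> from theano import tensor
>>> from blocks.bricks import Linear
>>> from blocks.bricks.recurrent.architectures import LSTM
>>> from blocks.initialization import Constant
>>> x = tensor.tensor3('x')  # (time, batch, 5)
>>> x_to_lstm = Linear(name='x_to_lstm', input_dim=5, output_dim=4 * 3,
...                    weights_init=Constant(0.1), biases_init=Constant(0))
>>> lstm = LSTM(dim=3, weights_init=Constant(0.1), biases_init=Constant(0))
>>> x_to_lstm.initialize()
>>> lstm.initialize()
>>> h, c = lstm.apply(x_to_lstm.apply(x))  # states and cells, both (time, batch, 3)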

apply

Apply the Long Short Term Memory transition.

Parameters:
  • states (TensorVariable) – The 2 dimensional matrix of current states in the shape (batch_size, features). Required for one_step usage.
  • cells (TensorVariable) – The 2 dimensional matrix of current cells in the shape (batch_size, features). Required for one_step usage.
  • inputs (TensorVariable) – The 2 dimensional matrix of inputs in the shape (batch_size, features * 4). The inputs need to be four times the dimension of the LSTM brick to ensure that each of the four gates receives a different transformation of the input. See [Grav13] equations 7 to 10 for more details. The inputs are split in this order: input gates, forget gates, cells and output gates.
  • mask (TensorVariable) – A 1D binary array in the shape (batch,) which is 1 if there is data available, 0 if not. Assumed to be all ones if not given.
Returns:

  • states (TensorVariable) – Next states of the network.
  • cells (TensorVariable) – Next cell activations of the network.

get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_states
class blocks.bricks.recurrent.architectures.SimpleRecurrent(**kwargs)[source]

Bases: blocks.bricks.recurrent.base.BaseRecurrent, blocks.bricks.interfaces.Initializable

The traditional recurrent transition.

The most well-known recurrent transition: a matrix multiplication, optionally followed by a non-linearity.

Parameters:
  • dim (int) – The dimension of the hidden state
  • activation (Brick) – The brick to apply as activation.

Notes

See Initializable for initialization parameters.
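
Examples

A usage sketch (not from the reference documentation; dimensions and initialization are illustrative). The inputs are expected to already have size dim:

>>> from theano import tensor
>>> from blocks.bricks import Tanh
>>> from blocks.bricks.recurrent.architectures import SimpleRecurrent
>>> from blocks.initialization import Constant
>>> x = tensor.tensor3('x')  # (time, batch, 3)
>>> rnn = SimpleRecurrent(dim=3, activation=Tanh(), weights_init=Constant(0.1))
>>> rnn.initialize()
>>> h = rnn.apply(x)  # iterates over the first (time) axis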

W
apply

Apply the simple transition.

Parameters:
  • inputs (TensorVariable) – The 2D inputs, in the shape (batch, features).
  • states (TensorVariable) – The 2D states, in the shape (batch, features).
  • mask (TensorVariable) – A 1D binary array in the shape (batch,) which is 1 if there is data available, 0 if not. Assumed to be all ones if not given.
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_states

Helper bricks for recurrent networks

class blocks.bricks.recurrent.misc.Bidirectional(**kwargs)[source]

Bases: blocks.bricks.interfaces.Initializable

Bidirectional network.

A bidirectional network is a combination of forward and backward recurrent networks which process the input sequence in opposite orders.

Parameters:prototype (instance of BaseRecurrent) – A prototype brick from which the forward and backward bricks are cloned.

Notes

See Initializable for initialization parameters.
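
Examples

A construction sketch (not from the reference documentation; the prototype and dimensions are illustrative):

>>> from theano import tensor
>>> from blocks.bricks import Tanh
>>> from blocks.bricks.recurrent.architectures import SimpleRecurrent
>>> from blocks.bricks.recurrent.misc import Bidirectional
>>> from blocks.initialization import Constant
>>> x = tensor.tensor3('x')  # (time, batch, 3)
>>> bidir = Bidirectional(SimpleRecurrent(dim=3, activation=Tanh()),
...                       weights_init=Constant(0.1))
>>> bidir.initialize()
>>> h = bidir.apply(x)  # forward and backward states concatenated: (time, batch, 2 * 3)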

apply

Applies forward and backward networks and concatenates outputs.

apply_delegate()[source]
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
has_bias = False
class blocks.bricks.recurrent.misc.RecurrentStack(transitions, fork_prototype=None, states_name='states', skip_connections=False, **kwargs)[source]

Bases: blocks.bricks.recurrent.base.BaseRecurrent, blocks.bricks.interfaces.Initializable

Stack of recurrent networks.

Builds a stack of recurrent layers from a supplied list of BaseRecurrent objects. The apply method of each object must have sequences, contexts, states and outputs parameters, such as those required by the recurrent decorator from blocks.bricks.recurrent.

In Blocks in general each brick can have an apply method, and this method has attributes that list the names of the arguments that can be passed to the method and the names of the outputs returned by the method. The attributes of the apply method of this class are made by concatenating the attributes of the apply methods of each of the transitions from which the stack is made. In order to avoid conflicts, the names appearing in the states and outputs attributes of the apply method of each layer are renamed: the names of the bottom layer are used as-is, and a suffix of the form ‘#<n>’ is added to the names from the other layers, where ‘<n>’ is the number of the layer, starting from 1 for the first layer above the bottom.

The contexts of all layers are merged into a single list of unique names, and no suffix is added. Different layers with the same context name will receive the same value.

The names that appear in sequences are treated in the same way as the names of states and outputs if skip_connections is True. The only exception is the “mask” element that may appear in the sequences attribute of all layers: no suffix is added to it and all layers receive the same mask value. If you set skip_connections to False then only the arguments of the sequences of the bottom layer will appear in the sequences attribute of the apply method of this class. When using this class with skip_connections set to True, you can supply all inputs to all layers using a single fork which is created with output_names set to the apply.sequences attribute of this class. For example, SequenceGenerator will create such a fork.

Whether or not skip_connections is set, each layer above the bottom also receives an input (values for its sequences arguments) from a fork of the states of the layer below it. This fork is not to be confused with the external fork discussed in the previous paragraph. It is assumed that all states attributes have a “states” argument name (this can be configured with the states_name parameter). The output argument with this name is forked and then added to all the elements appearing in the sequences of the next layer (except for “mask”). If skip_connections is False then this fork has a bias by default. This allows direct usage of this class with input supplied only to the first layer. But if you do supply inputs to all layers (by setting skip_connections to True) then by default there is no bias, and the external fork you use to supply the inputs should have its own separate bias.

Parameters:
  • transitions (list) – List of recurrent units to use in each layer, each derived from BaseRecurrent. Note: a suffix with the layer number is added to the transitions’ names.
  • fork_prototype (FeedForward, optional) – A prototype for the transformation applied to states_name from the states of each layer. The transformation is used when the states_name argument from the outputs of one layer is used as input to the sequences of the next layer. By default a Linear transformation is used, with a bias if skip_connections is False. If you supply your own prototype you have to enable/disable the bias depending on the value of skip_connections.
  • states_name (string) – In a stack of RNNs the state of each layer is used as input to the next. The states_name identifies the argument of the states and outputs attributes of each layer that should be used for this task. By default the argument is called “states”. To be more precise, this is the name of the argument in the outputs attribute of the apply method of each transition (layer). It is used, via a fork, as the sequences (input) of the next layer. The same element should also appear in the states attribute of the apply method.
  • skip_connections (bool) – By default False. When True, the sequences of all layers are added to the sequences of the apply method of this class. When False only the sequences of the bottom layer appear in the sequences of the apply method of this class; in this case the default fork used internally between layers has a bias (see fork_prototype). External code can inspect the sequences attribute of the apply method of this class to decide which arguments it needs (and in what order). With skip_connections you can control what is exposed to the external code: if it is False then the external code is expected to supply inputs only to the bottom layer, and if it is True then the external code is expected to supply inputs to all layers. There is one subtlety: the external inputs to the layers above the bottom layer are added to a fork of the states of the layer below. As a result the outputs of two forks are added together, which would be problematic if both had a bias. It is assumed that the external fork has a bias, and therefore by default the internal fork has no bias when skip_connections is True.

Notes

See BaseRecurrent for more initialization parameters.
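
Examples

A construction sketch, not from the reference documentation (dimensions and initialization are illustrative). With the default skip_connections=False only the bottom layer's sequences are exposed; the states of each layer are forked internally into the inputs of the layer above:

>>> from theano import tensor
>>> from blocks.bricks import Tanh
>>> from blocks.bricks.recurrent.architectures import LSTM, SimpleRecurrent
>>> from blocks.bricks.recurrent.misc import RecurrentStack
>>> from blocks.initialization import Constant
>>> x = tensor.tensor3('x')  # (time, batch, 4)
>>> stack = RecurrentStack(
...     [SimpleRecurrent(dim=4, activation=Tanh()), LSTM(dim=4)],
...     weights_init=Constant(0.1), biases_init=Constant(0))
>>> stack.initialize()
>>> outputs = stack.apply(inputs=x)  # one entry per name in stack.apply.outputs
...                                  # (e.g. 'states', 'states#1', 'cells#1')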

apply

Apply the stack of transitions.

Parameters:
  • low_memory (bool) – Use the slower but more memory-efficient implementation.
  • *args (TensorVariable, optional) – Positional arguments in the order in which they appear in self.apply.sequences, followed by self.apply.contexts.
  • **kwargs (TensorVariable) – Named arguments defined in self.apply.sequences, self.apply.states or self.apply.contexts.
Returns:

outputs – The outputs of all transitions as defined in self.apply.outputs

Return type:

(list of) TensorVariable

See also

See the docstring of this class for the arguments appearing in the lists self.apply.sequences, self.apply.states and self.apply.contexts. See recurrent() for all other parameters such as iterate and return_initial_states; note that reverse is currently not implemented.

do_apply(*args, **kwargs)[source]

Apply the stack of transitions.

This is the undecorated implementation of the apply method. A method with an @application decorator should call this method with iterate=True to indicate that the iteration over all steps should be done internally by this method. A method with a @recurrent decorator should have iterate=False (or unset) to indicate that the iteration over all steps is done externally.

get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_states
low_memory_apply
normal_inputs(level)[source]
static split_suffix(name)[source]
static suffix(name, level)[source]
static suffixes(names, level)[source]

Base definitions for recurrent bricks

class blocks.bricks.recurrent.base.BaseRecurrent(name=None, children=None)[source]

Bases: blocks.bricks.base.Brick

Base class for brick with recurrent application method.

has_bias = False
initial_states

Return initial states for an application call.

Default implementation assumes that the recurrent application method is called apply. It fetches the state names from apply.states and returns a zero matrix for each of them.

SimpleRecurrent, LSTM and GatedRecurrent override this method with trainable initial states initialized with zeros.

Parameters:
  • batch_size (int) – The batch size.
  • *args – The positional arguments of the application call.
  • **kwargs – The keyword arguments of the application call.
initial_states_outputs()[source]
blocks.bricks.recurrent.base.recurrent(*args, **kwargs)[source]

Wraps an apply method to allow its iterative application.

This decorator allows you to implement only one step of a recurrent network and enjoy applying it to sequences for free. The idea behind it is that in its most general form the information flow of an RNN can be described as follows: depending on the context and driven by input sequences, the RNN updates its states and produces output sequences.

Given a method describing one step of an RNN and a specification of which of its inputs are the elements of the input sequence, which are the states and which are the contexts, this decorator returns an application method which implements the whole RNN loop. The returned application method also has additional parameters; see the documentation of the recurrent_apply inner function below.

Parameters:
  • sequences (list of strs) – Specifies which of the arguments are elements of input sequences.
  • states (list of strs) – Specifies which of the arguments are the states.
  • contexts (list of strs) – Specifies which of the arguments are the contexts.
  • outputs (list of strs) – Names of the outputs. The outputs whose names match with those in the state parameter are interpreted as next step states.
Returns:

recurrent_apply – The new application method that applies the RNN to sequences.

Return type:

Application
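
Examples

As an illustration (a sketch, not from the reference documentation; the Accumulator brick is hypothetical), a brick that keeps a running sum over its input sequence only has to describe a single step; the decorator turns it into a full iteration over the time axis:

>>> from theano import tensor
>>> from blocks.bricks.recurrent.base import BaseRecurrent, recurrent
>>> class Accumulator(BaseRecurrent):
...     """Accumulate inputs over time: states_t = states_{t-1} + inputs_t."""
...     def __init__(self, dim, **kwargs):
...         super(Accumulator, self).__init__(**kwargs)
...         self.dim = dim
...     @recurrent(sequences=['inputs'], states=['states'],
...                contexts=[], outputs=['states'])
...     def apply(self, inputs, states):
...         # One step of the recurrence; the decorator handles the loop.
...         return states + inputs
...     def get_dim(self, name):
...         if name in ('inputs', 'states'):
...             return self.dim
...         return super(Accumulator, self).get_dim(name)
>>> x = tensor.tensor3('x')  # (time, batch, dim)
>>> running_sum = Accumulator(dim=3).apply(x)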

Attention bricks

This module defines the interface of attention mechanisms and a few concrete implementations. For a gentle introduction and usage examples see the tutorial TODO.

An attention mechanism decides which part of the input to pay attention to. It is typically used as a component of a recurrent network, though one can imagine it used in other settings as well. When the input is big and has a certain structure, for instance when it is a sequence or an image, an attention mechanism can be applied to extract only the information which is relevant for the network in its current state.

For the purpose of documentation clarity, we fix the following terminology in this file:

  • network is the network, typically a recurrent one, which uses the attention mechanism.
  • The network has states. Using this word in plural might seem weird, but some recurrent networks like LSTM do have several states.
  • The big structured input, to which the attention mechanism is applied, is called the attended. When it has variable structure, e.g. a sequence of variable length, there might be a mask associated with it.
  • The information extracted by the attention from the attended is called glimpse, more specifically glimpses because there might be a few pieces of this information.

Using this terminology, the attention mechanism computes glimpses given the states of the network and the attended.

An example: in the machine translation network from [BCB] the attended is a sequence of so-called annotations, that is states of a bidirectional network that was driven by word embeddings of the source sentence. The attention mechanism assigns weights to the annotations. The weighted sum of the annotations is further used by the translation network to predict the next word of the generated translation. The weights and the weighted sum are the glimpses. A generalized attention mechanism for this paper is represented here as SequenceContentAttention.

class blocks.bricks.attention.AbstractAttention(**kwargs)[source]

Bases: blocks.bricks.base.Brick

The common interface for attention bricks.

First, see the module-level docstring for terminology.

A generic attention mechanism functions as follows. Its inputs are the states of the network and the attended. Given these two, it produces so-called glimpses, that is, it extracts information from the attended which is necessary for the network in its current states.

For computational reasons we separate the process described above into two stages:

1. The preprocessing stage, preprocess(), includes computations that do not involve the states. These can often be performed in advance. The outcome of this stage is called preprocessed_attended.

2. The main stage, take_glimpses(), includes all the rest.

When an attention mechanism is applied sequentially, some glimpses from the previous step might be necessary to compute the new ones. A typical example for that is when the focus position from the previous step is required. In such cases take_glimpses() should specify such need in its interface (its docstring explains how to do that). In addition initial_glimpses() should specify some sensible initialization for the glimpses to be carried over.

Todo

Only single attended is currently allowed.

preprocess() and initial_glimpses() might end up needing masks, which are currently not provided for them.

Parameters:
  • state_names (list) – The names of the network states.
  • state_dims (list) – The state dimensions corresponding to state_names.
  • attended_dim (int) – The dimension of the attended.
state_names

list

state_dims

list

attended_dim

int

get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_glimpses(batch_size, attended)[source]

Return sensible initial values for carried over glimpses.

Parameters:
  • batch_size (int or Variable) – The batch size.
  • attended (Variable) – The attended.
Returns:

initial_glimpses – The initial values for the requested glimpses. These might simply consist of zeros or be somehow extracted from the attended.

Return type:

list of Variable

preprocess

Perform the preprocessing of the attended.

Stage 1 of the attention mechanism, see AbstractAttention docstring for an explanation of stages. The default implementation simply returns attended.

Parameters:attended (Variable) – The attended.
Returns:preprocessed_attended – The preprocessed attended.
Return type:Variable
take_glimpses(attended, preprocessed_attended=None, attended_mask=None, **kwargs)[source]

Extract glimpses from the attended given the current states.

Stage 2 of the attention mechanism, see AbstractAttention for an explanation of stages. If preprocessed_attended is not given, this should trigger stage 1.

This application method must declare its inputs and outputs. The glimpses to be carried over are identified by their presence in both inputs and outputs list. The attended must be the first input, the preprocessed attended must be the second one.

Parameters:
  • attended (Variable) – The attended.
  • preprocessed_attended (Variable, optional) – The preprocessed attended computed by preprocess(). When not given, preprocess() should be called.
  • attended_mask (Variable, optional) – The mask for the attended. This is required in the case of padded structured output, e.g. when a number of sequences are forced to be the same length. The mask identifies the positions of the attended that actually contain information.
  • **kwargs (dict) – Includes the states and the glimpses to be carried over from the previous step in the case when the attention mechanism is applied sequentially.
class blocks.bricks.attention.AbstractAttentionRecurrent(name=None, children=None)[source]

Bases: blocks.bricks.recurrent.base.BaseRecurrent

The interface for attention-equipped recurrent transitions.

When a recurrent network is equipped with an attention mechanism its transition typically consists of two steps: (1) the glimpses are taken by the attention mechanism and (2) the next states are computed using the current states and the glimpses. It is required for certain use cases (such as the sequence generator) that, apart from a do-it-all recurrent application method, interfaces for the first and the second steps of the transition are provided.

apply(**kwargs)[source]

Compute next states taking glimpses on the way.

compute_states(**kwargs)[source]

Compute next states given current states and glimpses.

take_glimpses(**kwargs)[source]

Compute glimpses given the current states.

class blocks.bricks.attention.AttentionRecurrent(transition, attention, distribute=None, add_contexts=True, attended_name=None, attended_mask_name=None, **kwargs)[source]

Bases: blocks.bricks.attention.AbstractAttentionRecurrent, blocks.bricks.interfaces.Initializable

Combines an attention mechanism and a recurrent transition.

This brick equips a recurrent transition with an attention mechanism. In order to do this two more contexts are added: one to be attended and a mask for it. It is also possible to use the contexts of the given recurrent transition for these purposes and not add any new ones, see the add_contexts parameter.

At the beginning of each step the attention mechanism produces glimpses; these glimpses together with the current states are used to compute the next state and finish the transition. In some cases glimpses from the previous steps are also necessary for the attention mechanism, e.g. in order to focus on an area close to the one from the previous step. This is also supported: such glimpses become states of the new transition.

To let the user control the way glimpses are used, this brick also takes a “distribute” brick as a parameter that distributes the information from glimpses across the sequential inputs of the wrapped recurrent transition.

Parameters:
  • transition (BaseRecurrent) – The recurrent transition.
  • attention (Brick) – The attention mechanism.
  • distribute (Brick, optional) – Distributes the information from glimpses across the input sequences of the transition. By default a Distribute is used, and those inputs containing the “mask” substring in their name are not affected.
  • add_contexts (bool, optional) – If True, new contexts for the attended and the attended mask are added to this transition, otherwise existing contexts of the wrapped transition are used. True by default.
  • attended_name (str) – The name of the attended context. If None, “attended” or the first context of the recurrent transition is used, depending on the value of the add_contexts flag.
  • attended_mask_name (str) – The name of the mask for the attended context. If None, “attended_mask” or the second context of the recurrent transition is used, depending on the value of the add_contexts flag.

Notes

See Initializable for initialization parameters.

Wrapping your recurrent brick with this class makes all the states mandatory. If you feel this is a limitation for you, try to make it better! This restriction does not apply to sequences and contexts: those keep being as optional as they were for your brick.

Those coming to Blocks from Groundhog might recognize that this is a RecurrentLayerWithSearch, but on steroids :)

apply

Preprocess a sequence attending the attended context at every step.

Preprocesses the attended context and runs do_apply(). See do_apply() documentation for further information.

apply_contexts()[source]
apply_delegate()[source]
compute_states

Compute current states when glimpses have already been computed.

Combines an application of the distribute brick that alters the sequential inputs of the wrapped transition and an application of the wrapped transition. All unknown keyword arguments go to the wrapped transition.

Parameters:**kwargs – Should contain everything that self.transition needs and in addition the current glimpses.
Returns:current_states – Current states computed by self.transition.
Return type:list of TensorVariable
compute_states_outputs()[source]
do_apply

Process a sequence attending the attended context every step.

In addition to the original sequence this method also requires its preprocessed version, the one computed by the preprocess method of the attention mechanism. Unknown keyword arguments are passed to the wrapped transition.

Parameters:**kwargs – Should contain current inputs, previous step states, contexts, the preprocessed attended context, previous step glimpses.
Returns:outputs – The current step states and glimpses.
Return type:list of TensorVariable
do_apply_contexts()[source]
do_apply_outputs()[source]
do_apply_sequences()[source]
do_apply_states()[source]
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_states
initial_states_outputs()[source]
take_glimpses

Compute glimpses with the attention mechanism.

A thin wrapper over self.attention.take_glimpses: takes care of choosing and renaming the necessary arguments.

Parameters:**kwargs – Must contain the attended, previous step states and glimpses. Can optionally contain the attended mask and the preprocessed attended.
Returns:glimpses – Current step glimpses.
Return type:list of TensorVariable
take_glimpses_outputs()[source]
class blocks.bricks.attention.GenericSequenceAttention(**kwargs)[source]

Bases: blocks.bricks.attention.AbstractAttention

Logic common for sequence attention mechanisms.

compute_weighted_averages

Compute weighted averages of the attended sequence vectors.

Parameters:
  • weights (Variable) – The weights. The shape must be equal to the attended shape without the last dimension.
  • attended (Variable) – The attended. The index in the sequence must be the first dimension.
Returns:

weighted_averages – The weighted averages of the attended elements. The shape is equal to the attended shape with the first dimension dropped.

Return type:

Variable

compute_weights

Compute weights from energies in softmax-like fashion.

Todo

Use Softmax.

Parameters:
  • energies (Variable) – The energies. Must be of the same shape as the mask.
  • attended_mask (Variable) – The mask for the attended. The index in the sequence must be the first dimension.
Returns:

weights – Non-negative weights that sum to 1, of the same shape as energies.

Return type:

Variable
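
In symbols (a sketch, ignoring any numerical stabilization the implementation may apply), this is a masked softmax over the sequence (first) dimension: for energies \(e_{ti}\) and mask values \(m_{ti} \in \{0, 1\}\), with sequence index \(t\) and batch index \(i\),

\[w_{ti} = \frac{m_{ti} \exp(e_{ti})}{\sum_{t'} m_{t'i} \exp(e_{t'i})}\]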

class blocks.bricks.attention.SequenceContentAttention(**kwargs)[source]

Bases: blocks.bricks.attention.GenericSequenceAttention, blocks.bricks.interfaces.Initializable

Attention mechanism that looks for relevant content in a sequence.

This is the attention mechanism used in [BCB]. The idea in a nutshell:

  1. The states and the sequence are transformed independently,
  2. The transformed states are summed with every transformed sequence element to obtain match vectors,
  3. A match vector is transformed into a single number interpreted as energy,
  4. Energies are normalized in a softmax-like fashion. The resulting weights, which sum to one, are called attention weights,
  5. A weighted average of the sequence elements with the attention weights is computed.

In terms of the AbstractAttention documentation, the sequence is the attended. The weighted averages from 5 and the attention weights from 4 form the set of glimpses produced by this attention mechanism.

Parameters:
  • state_names (list of str) – The names of the network states.
  • attended_dim (int) – The dimension of the sequence elements.
  • match_dim (int) – The dimension of the match vector.
  • state_transformer (Brick) – A prototype for state transformations. If None, a linear transformation is used.
  • attended_transformer (Feedforward) – The transformation to be applied to the sequence. If None an affine transformation is used.
  • energy_computer (Feedforward) – Computes the energy from the match vector. If None, an affine transformation preceded by \(\tanh\) is used.

Notes

See Initializable for initialization parameters.

[BCB] Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.
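
Examples

A construction and application sketch (not from the reference documentation; all dimensions are illustrative). In practice this brick is usually passed as the attention argument of a SequenceGenerator, but it can also be applied directly:

>>> from theano import tensor
>>> from blocks.bricks.attention import SequenceContentAttention
>>> from blocks.initialization import Constant
>>> attention = SequenceContentAttention(
...     state_names=['states'], state_dims=[8],
...     attended_dim=10, match_dim=12,
...     weights_init=Constant(0.1), biases_init=Constant(0))
>>> attention.initialize()
>>> attended = tensor.tensor3('attended')        # (time, batch, 10)
>>> attended_mask = tensor.matrix('attended_mask')
>>> states = tensor.matrix('states')             # (batch, 8)
>>> weighted_averages, weights = attention.take_glimpses(
...     attended, attended_mask=attended_mask, states=states)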
compute_energies
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_glimpses
preprocess

Preprocess the sequence for computing attention weights.

Parameters:attended (TensorVariable) – The attended sequence, time is the 1-st dimension.
take_glimpses

Compute attention weights and produce glimpses.

Parameters:
  • attended (TensorVariable) – The sequence, time is the 1-st dimension.
  • preprocessed_attended (TensorVariable) – The preprocessed sequence. If None, is computed by calling preprocess().
  • attended_mask (TensorVariable) – A 0/1 mask specifying available data. 0 means that the corresponding sequence element is fake.
  • **states – The states of the network.
Returns:

  • weighted_averages (Variable) – Linear combinations of sequence elements with the attention weights.
  • weights (Variable) – The attention weights. The first dimension is batch, the second is time.

take_glimpses_inputs()[source]
class blocks.bricks.attention.ShallowEnergyComputer(**kwargs)[source]

Bases: blocks.bricks.sequences.Sequence, blocks.bricks.interfaces.Initializable, blocks.bricks.interfaces.Feedforward

A simple energy computer: first tanh, then weighted sum.

Parameters:use_bias (bool, optional) – Whether a bias should be added to the energies. Does not change anything if softmax normalization is used to produce the attention weights, but might be useful when e.g. spherical softmax is used.
input_dim
output_dim

Sequence generators

Recurrent networks are often used to generate/model sequences. Examples include language modelling, machine translation, handwriting synthesis, etc. A typical pattern in this context is that sequence elements are generated one after another, and every generated element is fed back into the recurrent network state. Sometimes an attention mechanism is also used to condition sequence generation on some structured input like another sequence or an image.

This module provides SequenceGenerator that builds a sequence generating network from three main components:

  • a core recurrent transition, e.g. LSTM or GatedRecurrent
  • a readout component that can produce sequence elements using the network state and the information from the attention mechanism
  • an attention mechanism (see attention for more information)

Implementation-wise, SequenceGenerator fully relies on BaseSequenceGenerator. At the level of the latter an attention mechanism is mandatory; moreover it must be a part of the recurrent transition (see AttentionRecurrent). To simulate optional attention, SequenceGenerator wraps the pure recurrent network in FakeAttentionRecurrent.

class blocks.bricks.sequence_generators.AbstractEmitter(name=None, children=None)[source]

Bases: blocks.bricks.base.Brick

The interface for the emitter component of a readout.

readout_dim

int – The dimension of the readout. Is given by the Readout brick when allocation configuration is pushed.

See also

Readout

SoftmaxEmitter
for integer outputs

Notes

An important detail about the emitter cost is that it will be evaluated with inputs of different dimensions so it has to be flexible enough to handle this. The two ways in which it can be applied are:

1. In BaseSequenceGenerator.cost_matrix(), where it will be applied to the whole sequence at once.

2. In BaseSequenceGenerator.generate(), where it will be applied to only one step of the sequence.

cost(readouts, outputs)[source]

Implements the respective method of Readout.

emit(readouts)[source]

Implements the respective method of Readout.

initial_outputs(batch_size)[source]

Implements the respective method of Readout.

class blocks.bricks.sequence_generators.AbstractFeedback(name=None, children=None)[source]

Bases: blocks.bricks.base.Brick

The interface for the feedback component of a readout.

feedback(outputs)[source]

Implements the respective method of Readout.

class blocks.bricks.sequence_generators.AbstractReadout(**kwargs)[source]

Bases: blocks.bricks.interfaces.Initializable

The interface for the readout component of a sequence generator.

The readout component of a sequence generator is a bridge between the core recurrent network and the output sequence.

Parameters:
  • source_names (list) – A list of the source names (outputs) that are needed for the readout part e.g. ['states'] or ['states', 'weighted_averages'] or ['states', 'feedback'].
  • readout_dim (int) – The dimension of the readout.
source_names

list

readout_dim

int

See also

BaseSequenceGenerator
see how exactly a readout is used
Readout
the typically used readout brick
cost(readouts, outputs)[source]

Compute generation cost of outputs given readouts.

Parameters:
  • readouts (Variable) – Readouts produced by the readout() method, of shape (…, readout_dim).
  • outputs (Variable) – Outputs whose cost should be computed. Should have as many or one fewer dimensions than readouts. If readouts has n dimensions, the first n - 1 dimensions of outputs should match those of readouts.
emit(readouts)[source]

Produce outputs from readouts.

Parameters:readouts (Variable) – Readouts produced by the readout() method, of shape (batch_size, readout_dim).
feedback(outputs)[source]

Feeds outputs back to be used as inputs of the transition.

initial_outputs(batch_size)[source]

Compute initial outputs for the generator’s first step.

In the notation from the BaseSequenceGenerator documentation this method should compute \(y_0\).

readout(**kwargs)[source]

Compute the readout vector from states, glimpses, etc.

Parameters:**kwargs (dict) – Contains sequence generator states, glimpses, contexts and feedback from the previous outputs.
class blocks.bricks.sequence_generators.BaseSequenceGenerator(**kwargs)[source]

Bases: blocks.bricks.interfaces.Initializable

A generic sequence generator.

This class combines two components, a readout network and an attention-equipped recurrent transition, into a context-dependent sequence generator. A third component must also be given, which forks the feedback from the readout network to obtain inputs for the transition.

The class provides two methods: generate() and cost(). The former is to actually generate sequences and the latter is to compute the cost of generating given sequences.

The generation algorithm description follows.

Definitions and notation:

  • States \(s_i\) of the generator are the states of the transition as specified in transition.state_names.
  • Contexts of the generator are the contexts of the transition as specified in transition.context_names.
  • Glimpses \(g_i\) are intermediate entities computed at every generation step from states, contexts and the previous step glimpses. They are computed in the transition’s apply method when not given or by explicitly calling the transition’s take_glimpses method. The set of glimpses considered is specified in transition.glimpse_names.
  • Outputs \(y_i\) are produced at every step and form the output sequence. A generation cost \(c_i\) is assigned to each output.

Algorithm:

  1. Initialization.

    \[\begin{split}y_0 = readout.initial\_outputs(contexts)\\ s_0, g_0 = transition.initial\_states(contexts)\\ i = 1\\\end{split}\]

    By default all recurrent bricks from recurrent have trainable initial states initialized with zeros. Subclass them or BaseRecurrent directly to get custom initial states.

  2. New glimpses are computed:

    \[g_i = transition.take\_glimpses( s_{i-1}, g_{i-1}, contexts)\]
  3. A new output is generated by the readout and its cost is computed:

    \[\begin{split}f_{i-1} = readout.feedback(y_{i-1}) \\ r_i = readout.readout(f_{i-1}, s_{i-1}, g_i, contexts) \\ y_i = readout.emit(r_i) \\ c_i = readout.cost(r_i, y_i)\end{split}\]

    Note that the new glimpses and the old states are used at this step. The reason for not merging all readout methods into one is to make an efficient implementation of cost() possible.

  4. New states are computed and iteration is done:

    \[\begin{split}f_i = readout.feedback(y_i) \\ s_i = transition.compute\_states(s_{i-1}, g_i, fork.apply(f_i), contexts) \\ i = i + 1\end{split}\]
  5. Back to step 2 if the desired sequence length has not been yet reached.

A scheme of the algorithm described above is given in the figure sequence_generator_scheme.png (not reproduced here).
Parameters:
  • readout (instance of AbstractReadout) – The readout component of the sequence generator.
  • transition (instance of AbstractAttentionRecurrent) – The transition component of the sequence generator.
  • fork (Brick) – The brick to compute the transition’s inputs from the feedback.

See also

Initializable
for initialization parameters
SequenceGenerator
more user-friendly interface to this brick
cost

Returns the average cost over the minibatch.

The cost is computed by averaging the sum of per token costs for each sequence over the minibatch.

Warning

Note that the computed cost can be problematic when batches consist of vastly different sequence lengths.

Parameters:
  • outputs (TensorVariable) – The 3(2) dimensional tensor containing output sequences. The axis 0 must stand for time, the axis 1 for the position in the batch.
  • mask (TensorVariable) – The binary matrix identifying fake outputs.
Returns:

cost – Theano variable for cost, computed by summing over timesteps and then averaging over the minibatch.

Return type:

Variable

Notes

The contexts are expected as keyword arguments.

Adds an AUXILIARY variable with the average cost per sequence element to the computational graph, under the name per_sequence_element.

cost_matrix

Returns generation costs for output sequences.

See also

cost()
Scalar cost.
generate

A sequence generation step.

Parameters:outputs (TensorVariable) – The outputs from the previous step.

Notes

The contexts, previous states and glimpses are expected as keyword arguments.

generate_delegate()[source]
generate_outputs()[source]
generate_states()[source]
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_states
initial_states_outputs()[source]
class blocks.bricks.sequence_generators.FakeAttentionRecurrent(transition, **kwargs)[source]

Bases: blocks.bricks.attention.AbstractAttentionRecurrent, blocks.bricks.interfaces.Initializable

Adds fake attention interface to a transition.

BaseSequenceGenerator requires its transition brick to support AbstractAttentionRecurrent interface, that is to have an embedded attention mechanism. For the cases when no attention is required (e.g. language modeling or encoder-decoder models), FakeAttentionRecurrent is used to wrap a usual recurrent brick. The resulting brick has no glimpses and simply passes all states and contexts to the wrapped one.

Todo

Get rid of this brick and support attention-less transitions in BaseSequenceGenerator.

apply
apply_delegate()[source]
compute_states
compute_states_delegate()[source]
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_states
initial_states_outputs()[source]
take_glimpses
class blocks.bricks.sequence_generators.LookupFeedback(num_outputs=None, feedback_dim=None, **kwargs)[source]

Bases: blocks.bricks.sequence_generators.AbstractFeedback, blocks.bricks.interfaces.Initializable

A feedback brick for the case when readouts are integers.

Stores and retrieves distributed representations of integers.

feedback
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
class blocks.bricks.sequence_generators.Readout(emitter=None, feedback_brick=None, merge=None, merge_prototype=None, post_merge=None, merged_dim=None, **kwargs)[source]

Bases: blocks.bricks.sequence_generators.AbstractReadout

Readout brick with separated emitter and feedback parts.

Readout combines a few bits and pieces into an object that can be used as the readout component in BaseSequenceGenerator. This includes an emitter brick, to which emit(), cost() and initial_outputs() calls are delegated, a feedback brick to which feedback() functionality is delegated, and a pipeline to actually compute readouts from all the sources (see the source_names attribute of AbstractReadout).

The readout computation pipeline is constructed from the merge and post_merge bricks, whose responsibilities are described in the respective docstrings.

Parameters:
  • emitter (an instance of AbstractEmitter) – The emitter component.
  • feedback_brick (an instance of AbstractFeedback) – The feedback component.
  • merge (Brick, optional) – A brick that takes the sources given in source_names as an input and combines them into a single output. If given, merge_prototype cannot be given.
  • merge_prototype (FeedForward, optional) – If merge isn’t given, the transformation given by merge_prototype is applied to each input before being summed. By default a Linear transformation without biases is used. If given, merge cannot be given.
  • post_merge (Feedforward, optional) – This transformation is applied to the merged inputs. By default Bias is used.
  • merged_dim (int, optional) – The input dimension of post_merge, i.e. the output dimension of merge (or merge_prototype). If not given, it is assumed to be the same as readout_dim (i.e. post_merge is assumed to not change dimensions).
  • **kwargs (dict) – Passed to the parent’s constructor.

See also

BaseSequenceGenerator
see how exactly a readout is used

AbstractEmitter, AbstractFeedback

cost
emit
feedback
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_outputs
readout
class blocks.bricks.sequence_generators.SequenceGenerator(readout, transition, attention=None, add_contexts=True, **kwargs)[source]

Bases: blocks.bricks.sequence_generators.BaseSequenceGenerator

A more user-friendly interface for BaseSequenceGenerator.

Parameters:
  • readout (instance of AbstractReadout) – The readout component for the sequence generator.
  • transition (instance of BaseRecurrent) – The recurrent transition to be used in the sequence generator. Will be combined with attention, if that one is given.
  • attention (object, optional) – The attention mechanism to be added to transition, an instance of AbstractAttention.
  • add_contexts (bool) – If True, the AttentionRecurrent wrapping the transition will add additional contexts for the attended and its mask.
  • **kwargs (dict) – All keyword arguments are passed to the base class. If the fork keyword argument is not provided, a Fork is created that forks all of the transition's sequential inputs without a “mask” substring in their name.
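
Examples

A construction sketch (not from the reference documentation; the dimensions and the choice of a SimpleRecurrent transition without attention are illustrative). The Readout delegates emission and feedback to a SoftmaxEmitter and a LookupFeedback, which is the usual setup for integer output sequences:

>>> from theano import tensor
>>> from blocks.bricks import Tanh
>>> from blocks.bricks.recurrent.architectures import SimpleRecurrent
>>> from blocks.bricks.sequence_generators import (
...     LookupFeedback, Readout, SequenceGenerator, SoftmaxEmitter)
>>> from blocks.initialization import Constant
>>> generator = SequenceGenerator(
...     Readout(readout_dim=10, source_names=['states'],
...             emitter=SoftmaxEmitter(), feedback_brick=LookupFeedback(10, 8)),
...     SimpleRecurrent(dim=8, activation=Tanh()),
...     weights_init=Constant(0.1), biases_init=Constant(0))
>>> generator.initialize()
>>> outputs = tensor.lmatrix('outputs')  # (time, batch) integer sequences
>>> mask = tensor.matrix('mask')
>>> cost = generator.cost(outputs, mask=mask)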
class blocks.bricks.sequence_generators.SoftmaxEmitter(initial_output=0, **kwargs)[source]

Bases: blocks.bricks.sequence_generators.AbstractEmitter, blocks.bricks.interfaces.Initializable, blocks.bricks.interfaces.Random

A softmax emitter for the case of integer outputs.

Interprets readout elements as energies corresponding to their indices.

Parameters:initial_output (int or a scalar Variable) – The initial output.
cost
emit
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_outputs
probs
class blocks.bricks.sequence_generators.TrivialEmitter(**kwargs)[source]

Bases: blocks.bricks.sequence_generators.AbstractEmitter

An emitter for the trivial case when readouts are outputs.

Parameters:readout_dim (int) – The dimension of the readout.

Notes

By default cost() always returns a zero tensor.

cost
emit
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.
initial_outputs
class blocks.bricks.sequence_generators.TrivialFeedback(**kwargs)[source]

Bases: blocks.bricks.sequence_generators.AbstractFeedback

A feedback brick for the case when readouts are outputs.

feedback
get_dim(name)[source]

Get dimension of an input/output variable of a brick.

Parameters:name (str) – The name of the variable.

Cost bricks

class blocks.bricks.cost.AbsoluteError(name=None, children=None)[source]

Bases: blocks.bricks.cost.CostMatrix

cost_matrix
class blocks.bricks.cost.BinaryCrossEntropy(name=None, children=None)[source]

Bases: blocks.bricks.cost.CostMatrix

cost_matrix
class blocks.bricks.cost.CategoricalCrossEntropy(name=None, children=None)[source]

Bases: blocks.bricks.cost.Cost

apply
class blocks.bricks.cost.Cost(name=None, children=None)[source]

Bases: blocks.bricks.base.Brick

apply
class blocks.bricks.cost.CostMatrix(name=None, children=None)[source]

Bases: blocks.bricks.cost.Cost

Base class for costs which can be calculated element-wise.

Assumes that the data has format (batch, features).

apply
cost_matrix
class blocks.bricks.cost.MisclassificationRate(top_k=1)[source]

Bases: blocks.bricks.cost.Cost

Calculates the misclassification rate for a mini-batch.

Parameters:top_k (int, optional) – If the ground truth class is within the top_k highest responses for a given example, the model is considered to have predicted correctly. Default: 1.

Notes

Ties for top_k-th place are broken pessimistically, i.e. in the (in practice, rare) case that there is a tie for top_k-th highest output for a given example, it is considered an incorrect prediction.
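
Examples

A small sketch (not from the reference documentation; probs and targets are illustrative Theano variables) showing how this brick and CategoricalCrossEntropy are typically applied to softmax outputs and integer targets:

>>> from theano import tensor
>>> from blocks.bricks.cost import CategoricalCrossEntropy, MisclassificationRate
>>> probs = tensor.matrix('probs')       # (batch, classes), e.g. softmax output
>>> targets = tensor.lvector('targets')  # (batch,) integer class labels
>>> cost = CategoricalCrossEntropy().apply(targets, probs)
>>> error_rate = MisclassificationRate(top_k=1).apply(targets, probs)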

apply
class blocks.bricks.cost.SquaredError(name=None, children=None)[source]

Bases: blocks.bricks.cost.CostMatrix

cost_matrix

Wrapper bricks

class blocks.bricks.wrappers.BrickWrapper[source]

Bases: object

Base class for wrapper metaclasses.

Sometimes one wants to extend a brick with the capability to handle inputs different from what it was designed to handle. A typical example is inputs with more dimensions than were foreseen at the development stage. One way to proceed in such a situation is to write a decorator that wraps all application methods of the brick class with some additional logic before and after the application call. BrickWrapper serves as a convenient base class for such decorators.

Note that since directly applying a decorator to a Brick subclass will only take place after __new__() is called, subclasses of BrickWrapper should be applied by setting the decorators attribute of the new brick class, like in the example below:

>>> from blocks.bricks.base import Brick
>>> from blocks.bricks.wrappers import WithExtraDims
>>> class WrappedBrick(Brick):
...     decorators = [WithExtraDims()]
wrap(wrapped, namespace)[source]

Wrap an application of the base brick.

This method should be overridden to write into its namespace argument all required changes.

Parameters:
  • mcs (type) – The metaclass.
  • wrapped (Application) – The application to be wrapped.
  • namespace (dict) – The namespace of the class being created.
class blocks.bricks.wrappers.WithExtraDims[source]

Bases: blocks.bricks.wrappers.BrickWrapper

Wraps a brick’s applications to handle inputs with extra dimensions.

A brick can often be reused even when data has more dimensions than in the default setting. An example is a situation when one wants to apply categorical_cross_entropy() to temporal data, that is, when an additional ‘time’ axis is prepended to both its x and y inputs.

This wrapper adds reshapes required to use application methods of a brick with such data by merging the extra dimensions with the first non-extra one. Two key assumptions are made: that all inputs and outputs have the same number of extra dimensions and that these extra dimensions are equal throughout all inputs and outputs.

While this might be inconvenient, the wrapped brick does not try to guess the number of extra dimensions, but demands it as an argument. Considerations of simplicity and reliability motivated this design choice. Upon availability in Blocks of a mechanism to request the expected number of dimensions for an input of a brick, this can be reconsidered.
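
Examples

A usage sketch (not from the reference documentation). It mirrors how a Softmax brick can be wrapped to accept an extra leading ‘time’ axis; the extra_ndim keyword is the argument mentioned above:

>>> from theano import tensor
>>> from blocks.bricks import Softmax
>>> from blocks.bricks.wrappers import WithExtraDims
>>> class NDimensionalSoftmax(Softmax):
...     decorators = [WithExtraDims()]
>>> x = tensor.tensor3('x')  # (time, batch, features)
>>> y = NDimensionalSoftmax().apply(x, extra_ndim=1)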

wrap(wrapped, namespace)[source]

Wrap an application of the base brick.

This method should be overridden to write into its namespace argument all required changes.

Parameters:
  • mcs (type) – The metaclass.
  • wrapped (Application) – The application to be wrapped.
  • namespace (dict) – The namespace of the class being created.