Bricks¶
 Convolutional bricks
 Routing bricks
 Recurrent bricks
 Attention bricks
 Sequence generators
 Cost bricks

blocks.bricks.
application
(*args, **kwargs)[source]¶ Decorator for methods that apply a brick to inputs.
Parameters:
 *args (optional) – The application method to wrap.
 **kwargs (optional) – Attributes to attach to this application.
Notes
This decorator replaces application methods with Application instances. It also sets the attributes given as keyword arguments to the decorator.
Note that this decorator purposely does not wrap the original method using e.g. wraps() or update_wrapper(), since that would make the class impossible to pickle (see notes at Application).
Examples
>>> class Foo(Brick):
...     @application(inputs=['x'], outputs=['y'])
...     def apply(self, x):
...         return x + 1
...     @application
...     def other_apply(self, x):
...         return x - 1
>>> foo = Foo()
>>> Foo.apply.inputs
['x']
>>> foo.apply.outputs
['y']
>>> Foo.other_apply
<blocks.bricks.base.Application object at ...>
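The example above uses the decorator both bare and with keyword arguments. A minimal sketch of how one decorator can support both forms is shown below. The names ToyApplication and toy_application are hypothetical and are not part of Blocks; the sketch wraps plain functions only and does not reproduce method binding or pickling behaviour.

```python
class ToyApplication(object):
    """Hypothetical stand-in for Application: a function plus attributes."""

    def __init__(self, function, **attributes):
        self.function = function
        for name, value in attributes.items():
            setattr(self, name, value)

    def __call__(self, *args, **kwargs):
        return self.function(*args, **kwargs)


def toy_application(*args, **kwargs):
    """Usable as @toy_application and as @toy_application(key=value)."""
    if args and callable(args[0]) and not kwargs:
        # Bare form: the decorated function is the only positional argument.
        return ToyApplication(args[0])
    # Parameterized form: return a decorator that closes over the attributes.
    return lambda function: ToyApplication(function, **kwargs)


@toy_application(inputs=['x'], outputs=['y'])
def add_one(x):
    return x + 1


@toy_application
def subtract_one(x):
    return x - 1
```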

class
blocks.bricks.
Brick
(name=None, children=None)[source]¶ Bases:
blocks.graph.annotations.Annotation
A brick encapsulates Theano operations with parameters.
A brick goes through the following stages:
 1. Construction: The call to __init__() constructs a Brick instance with a name and creates any child bricks as well.
 2. Allocation of parameters:
    a. Allocation configuration of children: The push_allocation_config() method configures any children of this brick.
    b. Allocation: The allocate() method allocates the shared Theano variables required for the parameters. Also allocates parameters for all children.
 3. The following can be done in either order:
    a. Application: By applying the brick to a set of Theano variables a part of the computational graph of the final model is constructed.
    b. The initialization of parameters:
       1. Initialization configuration of children: The push_initialization_config() method configures any children of this brick.
       2. Initialization: This sets the initial values of the parameters by a call to initialize(), which is needed to call the final compiled Theano function. Also initializes all children.
Not all stages need to be called explicitly. Step 3(a) will automatically allocate the parameters if needed. Similarly, steps 3(b.2) and 2(b) will automatically perform steps 3(b.1) and 2(a) if needed. They only need to be called separately if greater control is required. The only two methods which always need to be called are an application method to construct the computational graph, and the initialize() method in order to initialize the parameters.
At each different stage, a brick might need a certain set of configuration settings. All of these settings can be passed to the __init__() constructor. However, by default many bricks support lazy initialization. This means that the configuration settings can be set later.
Note
Some arguments to __init__() are always required, even when lazy initialization is enabled. Other arguments must be given before calling allocate(), while others yet only need to be given in order to call initialize(). Always read the documentation of each brick carefully.
Lazy initialization can be turned off by setting Brick.lazy = False. In this case, there is no need to call initialize() manually anymore, but all the configuration must be passed to the __init__() method.
Parameters: name (str, optional) – The name of this brick. This can be used to filter the application of certain modifications by brick names. By default, the brick receives the name of its class (lowercased).
name
¶ str – The name of this brick.

print_shapes
¶ bool – False by default. If True, it logs the shapes of all the input and output variables, which can be useful for debugging.

parameters
¶ list of TensorSharedVariable and None – After calling the allocate() method this attribute will be populated with the shared variables storing this brick’s parameters. Allows for None so that parameters can always be accessed at the same index, even if some parameters are only defined given a particular configuration.

children
¶ list of bricks – The children of this brick.

allocated
¶ bool – False if allocate() has not been called yet. True otherwise.

initialized
¶ bool – False if initialize() has not been called yet. True otherwise.

allocation_config_pushed
¶ bool – False if allocate() or push_allocation_config() hasn’t been called yet. True otherwise.

initialization_config_pushed
¶ bool – False if initialize() or push_initialization_config() hasn’t been called yet. True otherwise.
Notes
To provide support for lazy initialization, apply the lazy() decorator to the __init__() method.
Brick implementations must call the __init__() constructor of their parent using super(BlockImplementation, self).__init__(**kwargs) at the beginning of the overriding __init__.
The methods _allocate() and _initialize() need to be overridden if the brick needs to allocate shared variables and initialize their values in order to function.
A brick can have any number of methods which apply the brick on Theano variables. These methods should be decorated with the application() decorator.
If a brick has children, they must be listed in the children attribute. Moreover, if the brick wants to control the configuration of its children, the _push_allocation_config() and _push_initialization_config() methods need to be overridden.
Examples
Most bricks have lazy initialization enabled.
>>> import theano
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> from blocks.bricks import Linear
>>> linear = Linear(input_dim=5, output_dim=3,
...                 weights_init=IsotropicGaussian(),
...                 biases_init=Constant(0))
>>> x = theano.tensor.vector()
>>> linear.apply(x)  # Calls linear.allocate() automatically
linear_apply_output
>>> linear.initialize()  # Initializes the weight matrix
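The ordering rules of the stages (application allocates on demand, initialization allocates first if needed) can be illustrated without Theano. The following is only a sketch: ToyBrick is a hypothetical class, and the deferred function returned by apply is a loose analogy for a symbolic graph, not how Blocks is implemented.

```python
class ToyBrick(object):
    """Hypothetical brick-like object tracking the lifecycle stages."""

    def __init__(self, dim):
        self.dim = dim
        self.parameters = []
        self.allocated = False
        self.initialized = False

    def allocate(self):
        # Reset the parameter list so repeated calls start from scratch.
        self.parameters = [[None] * self.dim]
        self.allocated = True

    def initialize(self):
        # Initialization allocates first if that has not happened yet.
        if not self.allocated:
            self.allocate()
        # Fill the existing storage in place, like setting a shared variable.
        self.parameters[0][:] = [0.0] * self.dim
        self.initialized = True

    def apply(self, values):
        # Application triggers allocation on demand, but not initialization.
        if not self.allocated:
            self.allocate()
        bias = self.parameters[0]
        # Deferred computation: parameter values are read only when called.
        return lambda: [v + b for v, b in zip(values, bias)]


brick = ToyBrick(3)
graph = brick.apply([1.0, 2.0, 3.0])  # allocates, does not initialize
brick.initialize()                    # required before evaluating `graph`
result = graph()
```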

allocate
()[source]¶ Allocate shared variables for parameters.
Based on the current configuration of this Brick, create Theano shared variables to store the parameters. After allocation, parameters are accessible through the parameters attribute.
This method calls the allocate() method of all children first, allowing the _allocate() method to override the parameters of the children if needed.
Raises: ValueError – If the configuration of this brick is insufficient to determine the number of parameters or their dimensionality to be initialized.
Notes
This method sets the parameters attribute to an empty list. This is in order to ensure that calls to this method completely reset the parameters.

children

get_dim
(name)[source]¶ Get dimension of an input/output variable of a brick.
Parameters: name (str) – The name of the variable.

get_dims
(names)[source]¶ Get list of dimensions for a set of input/output variables.
Parameters: names (list) – The variable names. Returns: dims – The dimensions of the sources. Return type: list

get_hierarchical_name
(parameter, delimiter='/')[source]¶ Return hierarchical name for a parameter.
Returns a path of the form brick1/brick2/brick3.parameter1. The delimiter is configurable.
Parameters: delimiter (str) – The delimiter used to separate brick names in the path.
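A sketch of the path format described above, assuming the brick names along the path are already known. The helper hierarchical_name is hypothetical and only mirrors the documented output shape; the real method derives the path from the parameter itself.

```python
def hierarchical_name(brick_names, parameter_name, delimiter='/'):
    """Join brick names with the delimiter, then append '.parameter'."""
    return delimiter.join(brick_names) + '.' + parameter_name


path = hierarchical_name(['brick1', 'brick2', 'brick3'], 'parameter1')
```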

initialize
()[source]¶ Initialize parameters.
Initialize parameters, such as weight matrices and biases.
Notes
If the brick has not allocated its parameters yet, this method will call the
allocate()
method in order to do so.

parameters

print_shapes
= False

push_allocation_config
()[source]¶ Push the configuration for allocation to child bricks.
Bricks can configure their children, based on their own current configuration. This will be automatically done by a call to
allocate()
, but if you want to override the configuration of child bricks manually, then you can call this function manually.

push_initialization_config
()[source]¶ Push the configuration for initialization to child bricks.
Bricks can configure their children, based on their own current configuration. This will be automatically done by a call to
initialize()
, but if you want to override the configuration of child bricks manually, then you can call this function manually.

blocks.bricks.
lazy
(allocation=None, initialization=None)[source]¶ Makes the initialization lazy.
This decorator allows the user to define positional arguments which will not be needed until the allocation or initialization stage of the brick. If these arguments are not passed, it will automatically replace them with a custom None object. It is assumed that the missing arguments can be set after initialization by setting attributes with the same name.
Parameters:
 allocation (list) – A list of argument names that are needed for allocation.
 initialization (list) – A list of argument names that are needed for initialization.
Examples
>>> class SomeBrick(Brick):
...     @lazy(allocation=['a'], initialization=['b'])
...     def __init__(self, a, b, c='c', d=None):
...         print(a, b, c, d)
>>> brick = SomeBrick('a')
a NoneInitialization c None
>>> brick = SomeBrick(d='d', b='b')
NoneAllocation b c d
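The behaviour in the example above can be approximated with a small decorator that fills in missing arguments with sentinel values. This is a sketch under stated assumptions: toy_lazy, SomeToyBrick and the string sentinels are hypothetical, and the real decorator uses custom None objects and also honours the Brick.lazy setting.

```python
import functools
import inspect

# Hypothetical sentinels standing in for the custom None objects.
NoneAllocation = 'NoneAllocation'
NoneInitialization = 'NoneInitialization'


def toy_lazy(allocation=(), initialization=()):
    """Fill missing lazy arguments with sentinels (illustrative only)."""
    def decorator(init):
        # Parameter names of __init__, without `self`.
        names = list(inspect.signature(init).parameters)[1:]

        @functools.wraps(init)
        def wrapper(self, *args, **kwargs):
            # Map positional arguments onto their parameter names.
            kwargs.update(zip(names, args))
            for name in allocation:
                kwargs.setdefault(name, NoneAllocation)
            for name in initialization:
                kwargs.setdefault(name, NoneInitialization)
            return init(self, **kwargs)
        return wrapper
    return decorator


class SomeToyBrick(object):
    @toy_lazy(allocation=['a'], initialization=['b'])
    def __init__(self, a, b, c='c', d=None):
        self.config = (a, b, c, d)
```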

class
blocks.bricks.
BatchNormalization
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.RNGMixin
,blocks.bricks.interfaces.Feedforward
Normalizes activations, parameterizes a scale and shift.
Parameters:  input_dim (int or tuple) – Shape of a single input example. It is assumed that a batch axis will be prepended to this.
 broadcastable (tuple, optional) – Tuple of the same length as input_dim which specifies which of the per-example axes should be averaged over to compute means and standard deviations. For example, in order to normalize over all spatial locations in a (batch_index, channels, height, width) image, pass (False, True, True). The batch axis is always averaged out.
 conserve_memory (bool, optional) – Use an implementation that stores less intermediate state and therefore uses less memory, at the expense of 5-10% speed. Default is True.
 epsilon (float, optional) – The stabilizing constant for the minibatch standard deviation computation (when the brick is run in training mode). Added to the variance inside the square root, as in the batch normalization paper.
 scale_init (object, optional) – Initialization object to use for the learned scaling parameter ($\gamma$ in [BN]). By default, uses constant initialization of 1.
 shift_init (object, optional) – Initialization object to use for the learned shift parameter ($\beta$ in [BN]). By default, uses constant initialization of 0.
 mean_only (bool, optional) – Perform “mean-only” batch normalization as described in [SK2016].
 learn_scale (bool, optional) – Whether to include a learned scale parameter ($\gamma$ in [BN]) in this brick. Default is True. Has no effect if mean_only is True (i.e. a scale parameter is never learned in mean-only mode).
 learn_shift (bool, optional) – Whether to include a learned shift parameter ($\beta$ in [BN]) in this brick. Default is True.
Notes
In order for trained models to behave sensibly immediately upon deserialization, by default, this brick runs in inference mode, using a population mean and population standard deviation (initialized to zeros and ones respectively) to normalize activations. It is expected that the user will adapt these during training in some fashion, independently of the training objective, e.g. by taking a moving average of minibatch-wise statistics.
In order to train with batch normalization, one must obtain a training graph by transforming the original inference graph. See apply_batch_normalization() for a routine to transform graphs, and batch_normalization() for a context manager that may enable shorter compile times (every instance of BatchNormalization is itself a context manager, entry into which causes applications to be in minibatch “training” mode, however it is usually more convenient to use batch_normalization() to enable this behaviour for all of your graph’s BatchNormalization bricks at once).
Note that training in inference mode should be avoided, as this brick introduces scale and shift parameters (tagged with the PARAMETER role) that, in the absence of batch normalization, usually make things unstable. If you must do this, filter for and remove BATCH_NORM_SHIFT_PARAMETER and BATCH_NORM_SCALE_PARAMETER from the list of parameters you are training, and this brick should behave as a (somewhat expensive) no-op.
This Brick accepts scale_init and shift_init arguments but is not an instance of Initializable, and will therefore not receive pushed initialization config from any parent brick. In almost all cases, you will probably want to stick with the defaults (unit scale and zero offset), but you can explicitly pass one or both initializers to override this.
This has the necessary properties to be inserted into a blocks.bricks.conv.ConvolutionalSequence as-is, in which case the input_dim should be omitted at construction, to be inferred from the layer below.
[BN] Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. ICML (2015), pp. 448-456.
[SK2016] Tim Salimans and Diederik P. Kingma. Weight normalization: a simple reparameterization to accelerate training of deep neural networks. arXiv 1602.07868.
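For reference, the minibatch ("training" mode) computation described by the parameters above can be sketched in plain Python for a 2D batch. The function batch_normalize and its default epsilon are assumptions for illustration; the brick itself operates on Theano variables and, by default, uses population statistics instead.

```python
import math


def batch_normalize(batch, gamma, beta, epsilon=1e-4):
    """Normalize each feature over the batch axis, then scale and shift."""
    n = len(batch)
    num_features = len(batch[0])
    output = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        # epsilon is added to the variance inside the square root.
        stdev = math.sqrt(variance + epsilon)
        for i in range(n):
            output[i][j] = gamma[j] * (batch[i][j] - mean) / stdev + beta[j]
    return output


normalized = batch_normalize([[1.0, 10.0], [3.0, 10.0]],
                             gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

The second feature is constant across the batch, so its variance is zero; epsilon keeps the division well defined and the normalized values are zero.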
apply
¶

image_size
¶

normalization_axes
¶

num_channels
¶

num_output_channels
¶

output_dim
¶

class
blocks.bricks.
SpatialBatchNormalization
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.bn.BatchNormalization
Convenient subclass for batch normalization across spatial inputs.
Parameters: input_dim (int or tuple) – The input size of a single example. Must have length at least 2. It’s assumed that the first axis of this tuple is a “channels” axis, which should not be summed over, and all remaining dimensions are spatial dimensions.
Notes
See BatchNormalization for more details (and additional keyword arguments).

class
blocks.bricks.
BatchNormalizedMLP
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.sequences.MLP
Convenient subclass for building an MLP with batch normalization.
Parameters:
 conserve_memory (bool, optional, by keyword only) – See BatchNormalization.
 mean_only (bool, optional, by keyword only) – See BatchNormalization.
 learn_scale (bool, optional, by keyword only) – See BatchNormalization.
 learn_shift (bool, optional, by keyword only) – See BatchNormalization.
Notes
All other parameters are the same as MLP. Each activation brick is wrapped in a Sequence containing an appropriate BatchNormalization brick and the activation that follows it.
By default, the contained Linear bricks will not contain any biases, as they could be canceled out by the biases in the BatchNormalization bricks being added. Pass use_bias with a value of True if you really want this for some reason.
mean_only, learn_scale and learn_shift are pushed down to all created BatchNormalization bricks as allocation config.
conserve_memory
¶ Conserve memory.

class
blocks.bricks.
Feedforward
(name=None, children=None)[source]¶ Bases:
blocks.bricks.base.Brick
Declares an interface for bricks with one input and one output.
Many bricks have just one input and just one output (activations, Linear, MLP). To make such bricks interchangeable in most contexts they should share an interface for configuring their input and output dimensions. This brick declares such an interface.
input_dim
¶ int – The input dimension of the brick.

output_dim
¶ int – The output dimension of the brick.


class
blocks.bricks.
Initializable
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.RNGMixin
,blocks.bricks.base.Brick
Base class for bricks which push parameter initialization.
Many bricks will initialize children which perform a linear transformation, often with biases. This brick allows the weights and biases initialization to be configured in the parent brick and pushed down the hierarchy.
Parameters:
 weights_init (object) – A NdarrayInitialization instance which will be used to initialize the weight matrix. Required by initialize().
 biases_init (object, optional) – A NdarrayInitialization instance that will be used to initialize the biases. Required by initialize() when use_bias is True. Only supported by bricks for which has_biases is True.
 use_bias (bool, optional) – Whether to use a bias. Defaults to True. Required by initialize(). Only supported by bricks for which has_biases is True.
 rng (numpy.random.RandomState) –

has_biases
¶ bool – False if the brick does not support biases, and only has weights_init. For an example of this, see Bidirectional. If this is False, the brick does not support the arguments biases_init or use_bias.

has_biases
= True

class
blocks.bricks.
LinearLike
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Initializable
Initializable subclass with logic for Linear-like classes.
Notes
Provides W and b properties that can be overridden in subclasses to implement pre-application transformations on the weights and biases. Application methods should refer to self.W and self.b rather than accessing the parameters list directly.
This assumes a layout of the parameters list with the weights coming first and biases (if use_bias is True) coming second.
W
¶

b
¶


class
blocks.bricks.
Random
(theano_seed=None, **kwargs)[source]¶ Bases:
blocks.bricks.base.Brick
A mixin class for Bricks which need Theano RNGs.
Parameters: theano_seed (int or list, optional) – Seed to use for a MRG_RandomStreams
object.
seed_rng
= <mtrand.RandomState object>¶

theano_rng
¶ Returns Brick’s Theano RNG, or a default one.
The default seed can be set through
blocks.config
.

theano_seed
¶


class
blocks.bricks.
Linear
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.LinearLike
,blocks.bricks.interfaces.Feedforward
A linear transformation with optional bias.
Brick which applies a linear (affine) transformation by multiplying the input with a weight matrix. By default, a bias term is added (see Initializable for information on disabling this).
Parameters:
 input_dim (int) – The dimension of the input. Required by allocate().
 output_dim (int) – The dimension of the output. Required by allocate().
Notes
See Initializable for initialization parameters.
A linear transformation with bias is a matrix multiplication followed by a vector summation.
\[f(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}\]
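The formula above can be checked on small numbers with a plain-Python sketch. The function linear is hypothetical and follows the formula as written, with W stored as a list of rows; it does not reflect how the brick lays out its weight matrix internally.

```python
def linear(W, x, b=None):
    """Compute W x + b for a list-of-rows matrix W and a vector x."""
    output = [sum(w * v for w, v in zip(row, x)) for row in W]
    if b is not None:
        # The bias is optional, mirroring use_bias.
        output = [o + bias for o, bias in zip(output, b)]
    return output


with_bias = linear([[1, 2], [3, 4]], [1, 1], [10, 20])
without_bias = linear([[1, 2], [3, 4]], [1, 1])
```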
apply
¶ Apply the linear transformation.
Parameters: input_ (TensorVariable) – The input on which to apply the transformation.
Returns: output – The transformed input plus optional bias.
Return type: TensorVariable

class
blocks.bricks.
Bias
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Feedforward
,blocks.bricks.interfaces.Initializable
Add a bias (i.e. sum with a vector).

apply
¶ Apply the bias (add the bias vector to the input).
Parameters: input_ (TensorVariable) – The input on which to apply the transformation.
Returns: output – The input plus the bias.
Return type: TensorVariable

input_dim
¶

output_dim
¶


class
blocks.bricks.
Maxout
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.base.Brick
Maxout pooling transformation.
A brick that does max pooling over groups of input units. If you use this code in a research project, please cite [GWFM13].
[GWFM13] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, Maxout networks, ICML (2013), pp. 1319-1327.
Parameters: num_pieces (int) – The size of the groups the maximum is taken over.
Notes
Maxout applies a set of linear transformations to a vector and selects for each output dimension the result with the highest value.
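The pooling step can be sketched for a single vector as follows. The function maxout is hypothetical, and grouping consecutive units is an assumption about the layout made for illustration.

```python
def maxout(values, num_pieces):
    """Take the maximum over consecutive groups of `num_pieces` units."""
    if len(values) % num_pieces != 0:
        raise ValueError('input length must be divisible by num_pieces')
    return [max(values[i:i + num_pieces])
            for i in range(0, len(values), num_pieces)]


pooled = maxout([1, 5, 2, 0, -1, 3], num_pieces=2)
```

Six input units with num_pieces=2 give three output units, one per group.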

apply
¶ Apply the maxout transformation.
Parameters: input_ (TensorVariable) – The input on which to apply the transformation.
Returns: output – The transformed input.
Return type: TensorVariable


class
blocks.bricks.
LinearMaxout
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Initializable
,blocks.bricks.interfaces.Feedforward
Maxout pooling following a linear transformation.
This code combines the Linear brick with a Maxout brick.
Parameters:
 input_dim (int) – The dimension of the input. Required by allocate().
 output_dim (int) – The dimension of the output. Required by allocate().
 num_pieces (int) – The number of linear functions. Required by allocate().
Notes
See Initializable for initialization parameters.
apply
¶ Apply the linear transformation followed by maxout.
Parameters: input_ (TensorVariable) – The input on which to apply the transformations.
Returns: output – The transformed input.
Return type: TensorVariable

input_dim
¶

class
blocks.bricks.
Identity
(name=None, children=None)[source]¶ Bases:
blocks.bricks.interfaces.Activation
Elementwise application of identity function.

apply
¶ Apply the identity function elementwise.
Parameters: input_ (TensorVariable) – Theano variable to apply identity to, elementwise.
Returns: output – The input with the activation function applied.
Return type: TensorVariable


class
blocks.bricks.
Tanh
(name=None, children=None)[source]¶ Bases:
blocks.bricks.interfaces.Activation
Elementwise application of tanh function.

apply
¶ Apply the tanh function elementwise.
Parameters: input_ (TensorVariable) – Theano variable to apply tanh to, elementwise.
Returns: output – The input with the activation function applied.
Return type: TensorVariable


class
blocks.bricks.
Logistic
(name=None, children=None)[source]¶ Bases:
blocks.bricks.interfaces.Activation
Elementwise application of logistic function.

apply
¶ Apply the logistic function elementwise.
Parameters: input_ (TensorVariable) – Theano variable to apply logistic to, elementwise.
Returns: output – The input with the activation function applied.
Return type: TensorVariable


class
blocks.bricks.
Softplus
(name=None, children=None)[source]¶ Bases:
blocks.bricks.interfaces.Activation
Elementwise application of softplus function.

apply
¶ Apply the softplus function elementwise.
Parameters: input_ (TensorVariable) – Theano variable to apply softplus to, elementwise.
Returns: output – The input with the activation function applied.
Return type: TensorVariable


class
blocks.bricks.
Rectifier
(name=None, children=None)[source]¶ Bases:
blocks.bricks.interfaces.Activation
Elementwise application of rectifier function.

apply
¶ Apply the rectifier function elementwise.
Parameters: input_ (TensorVariable) – Theano variable to apply rectifier to, elementwise.
Returns: output – The input with the activation function applied.
Return type: TensorVariable


class
blocks.bricks.
LeakyRectifier
(leak=0.01, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Activation
Elementwise application of leaky rectifier function.

apply
¶ Apply the leaky rectifier function elementwise.
Parameters: input_ (TensorVariable) – Theano variable to apply leaky rectifier to, elementwise.
Returns: output – The input with the activation function applied.
Return type: TensorVariable
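As a point of reference, the usual definition of a leaky rectifier on a single scalar is sketched below, using the default leak=0.01 from the signature above. The function leaky_rectifier is hypothetical and assumes the common definition (identity for positive inputs, a small slope otherwise); the brick applies the operation elementwise to a Theano variable.

```python
def leaky_rectifier(x, leak=0.01):
    """Return x for positive inputs and leak * x otherwise."""
    return x if x > 0 else leak * x


positive = leaky_rectifier(2.0)
negative = leaky_rectifier(-2.0)
```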


class
blocks.bricks.
Softmax
(name=None, children=None)[source]¶ Bases:
blocks.bricks.base.Brick
A softmax brick.
Works with 2-dimensional inputs only. If you need more, see NDimensionalSoftmax.
apply
¶ Standard softmax.
Parameters: input_ (Variable) – A matrix, each row contains unnormalized log-probabilities of a distribution.
Returns: output_ – A matrix with probabilities in each row for each distribution from input_.
Return type: Variable

categorical_cross_entropy
¶ Computationally stable cross-entropy for pre-softmax values.
Parameters:
 y (TensorVariable) – In the case of a matrix argument, each row represents a probability distribution. In the vector case, each element represents a distribution by specifying the position of 1 in a 1-hot vector.
 x (TensorVariable) – A matrix, each row contains unnormalized log-probabilities (pre-softmax values) of a distribution.
Returns: cost – A vector of cross-entropies between respective distributions from y and x.
Return type: TensorVariable
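For the vector case (y gives the index of the correct class), a numerically stable cross-entropy on a single row of pre-softmax values can be sketched as below. The function is a hypothetical plain-Python illustration using the standard log-sum-exp shift, not the brick's Theano implementation.

```python
import math


def categorical_cross_entropy(y, x):
    """Cross-entropy between target index y and pre-softmax scores x."""
    shift = max(x)
    # Subtracting the maximum before exp() prevents overflow.
    log_normalizer = shift + math.log(sum(math.exp(v - shift) for v in x))
    return log_normalizer - x[y]


uniform_cost = categorical_cross_entropy(0, [0.0, 0.0])
large_score_cost = categorical_cross_entropy(0, [1000.0, 0.0])
```

A naive softmax followed by a logarithm would overflow on the second call, while the shifted form returns a cost close to zero.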

log_probabilities
¶ Normalize log-probabilities.
Converts unnormalized log-probabilities (exponents of which do not sum to one) into actual log-probabilities (exponents of which sum to one).
Parameters: input_ (Variable) – A matrix, each row contains unnormalized log-probabilities of a distribution.
Returns: output – A matrix with normalized log-probabilities in each row for each distribution from input_.
Return type: Variable
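The normalization described above amounts to subtracting the log of the sum of exponents from each entry. A minimal sketch for one row follows; the function names are hypothetical and the code is plain Python rather than Theano.

```python
import math


def log_probabilities(row):
    """Normalize one row of unnormalized log-probabilities."""
    shift = max(row)
    log_normalizer = shift + math.log(sum(math.exp(v - shift) for v in row))
    return [v - log_normalizer for v in row]


def softmax(row):
    """Exponentiate the normalized log-probabilities."""
    return [math.exp(v) for v in log_probabilities(row)]


probabilities = softmax([1.0, 2.0, 3.0])
```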


class
blocks.bricks.
NDimensionalSoftmax
(name=None, children=None)[source]¶ Bases:
blocks.bricks.simple.Softmax
A wrapped brick class.
This brick was automatically constructed by wrapping Softmax with WithExtraDims.
apply
¶ Wraps the application method with reshapes.
Parameters: extra_ndim (int, optional) – The number of extra dimensions. Default is zero. See also
Softmax.apply()
 For documentation of the wrapped application method.

categorical_cross_entropy
¶ Wraps the application method with reshapes.
Parameters: extra_ndim (int, optional) – The number of extra dimensions. Default is zero. See also
Softmax.categorical_cross_entropy()
 For documentation of the wrapped application method.

decorators
= [<blocks.bricks.wrappers.WithExtraDims object>]¶

log_probabilities
¶ Wraps the application method with reshapes.
Parameters: extra_ndim (int, optional) – The number of extra dimensions. Default is zero. See also
Softmax.log_probabilities()
 For documentation of the wrapped application method.


class
blocks.bricks.
Sequence
(application_methods, **kwargs)[source]¶ Bases:
blocks.bricks.base.Brick
A sequence of bricks.
This brick applies a sequence of bricks, assuming that their inputs and outputs are compatible.
Parameters: application_methods (list) – List of BoundApplication or Brick to apply. For Bricks, the .apply method is used.
apply
¶


class
blocks.bricks.
FeedforwardSequence
(application_methods, **kwargs)[source]¶ Bases:
blocks.bricks.sequences.Sequence
,blocks.bricks.interfaces.Feedforward
A sequence where the first and last bricks are feedforward.
Parameters: application_methods (list) – List of BoundApplication to apply. The first and last application method should belong to a Feedforward brick.
input_dim
¶

output_dim
¶


class
blocks.bricks.
MLP
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.sequences.FeedforwardSequence
,blocks.bricks.interfaces.Initializable
A simple multilayer perceptron.
Parameters:
 activations (list of Brick, BoundApplication, or None) – A list of activations to apply after each linear transformation. Give None to not apply any activation. It is assumed that the application method to use is apply. Required for __init__().
 dims (list of ints) – A list of input dimensions, as well as the output dimension of the last layer. Required for allocate().
 prototype (Brick, optional) – The transformation prototype. A copy will be created for every activation. If not provided, an instance of Linear will be used.
Notes
See Initializable for initialization parameters.
Note that the weights_init, biases_init (as well as use_bias if set to a value other than the default of None) configurations will overwrite those of the layers each time the MLP is re-initialized. For more fine-grained control, push the configuration to the child layers manually before initialization.
>>> from blocks.bricks import Tanh
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> mlp = MLP(activations=[Tanh(), None], dims=[30, 20, 10],
...           weights_init=IsotropicGaussian(),
...           biases_init=Constant(1))
>>> mlp.push_initialization_config()  # Configure children
>>> mlp.children[0].weights_init = IsotropicGaussian(0.1)
>>> mlp.initialize()
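The relationship between dims and the per-layer dimensions can be made explicit with a short sketch. The helper layer_shapes is hypothetical; it only shows how a dims list such as the one in the example above maps to (input, output) pairs, one per activation.

```python
def layer_shapes(dims):
    """Pair consecutive entries of dims into (input_dim, output_dim)."""
    return list(zip(dims[:-1], dims[1:]))


shapes = layer_shapes([30, 20, 10])
```

With dims=[30, 20, 10] there are two linear layers, 30 to 20 and 20 to 10, matching the two entries of activations.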

input_dim
¶

output_dim
¶

class
blocks.bricks.
WithExtraDims
[source]¶ Bases:
blocks.bricks.wrappers.BrickWrapper
Wraps a brick’s applications to handle inputs with extra dimensions.
A brick can often be reused even when data has more dimensions than in the default setting. An example is a situation when one wants to apply categorical_cross_entropy() to temporal data, that is when an additional ‘time’ axis is prepended to both its x and y inputs.
This wrapper adds reshapes required to use application methods of a brick with such data by merging the extra dimensions with the first non-extra one. Two key assumptions are made: that all inputs and outputs have the same number of extra dimensions and that these extra dimensions are equal throughout all inputs and outputs.
While this might be inconvenient, the wrapped brick does not try to guess the number of extra dimensions, but demands it as an argument. The considerations of simplicity and reliability motivated this design choice. Upon availability in Blocks of a mechanism to request the expected number of dimensions for an input of a brick, this can be reconsidered.
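The reshapes described above can be illustrated on shapes alone. In this sketch, merge_extra_dims and restore_extra_dims are hypothetical helpers that operate on shape tuples, assuming extra_ndim leading axes are merged into the first non-extra axis and restored afterwards.

```python
def merge_extra_dims(shape, extra_ndim):
    """Merge `extra_ndim` leading axes into the first non-extra axis."""
    merged = 1
    for size in shape[:extra_ndim + 1]:
        merged *= size
    return (merged,) + tuple(shape[extra_ndim + 1:])


def restore_extra_dims(merged_shape, original_shape, extra_ndim):
    """Undo the merge for an output sharing the same leading axes."""
    return tuple(original_shape[:extra_ndim + 1]) + tuple(merged_shape[1:])


# A (time, batch, features) input with one extra 'time' axis.
merged = merge_extra_dims((7, 4, 10), extra_ndim=1)
restored = restore_extra_dims(merged, (7, 4, 10), extra_ndim=1)
```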

class
blocks.bricks.lookup.
LookupTable
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Initializable
,blocks.bricks.interfaces.Feedforward
Encapsulates representations of a range of integers.
This brick can be used to embed integers, e.g. word indices, into a vector space.
Parameters: Notes
See
Initializable
for initialization parameters.
W
¶

apply
¶ Perform lookup.
Parameters: indices (TensorVariable) – The indices of interest. The dtype must be integer.
Returns: output – Representations for the indices of the query. Has \(k+1\) dimensions, where \(k\) is the number of dimensions of the indices parameter. The last dimension stands for the representation element.
Return type: TensorVariable
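The \(k+1\) dimension rule can be seen with nested lists: each integer in the query is replaced by its representation vector. The function lookup below is a hypothetical plain-Python sketch, not the brick's implementation.

```python
def lookup(table, indices):
    """Replace every integer in a nested index list by its table row."""
    if isinstance(indices, int):
        return list(table[indices])
    return [lookup(table, index) for index in indices]


table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
# A 2-dimensional query yields a 3-dimensional result.
embedded = lookup(table, [[0, 2], [1, 1]])
```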

has_bias
= False¶

input_dim
¶

output_dim
¶

Convolutional bricks¶

class
blocks.bricks.conv.
AveragePooling
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.conv.Pooling
Average pooling layer.
Parameters: include_padding (bool, optional) – When calculating an average, include zeros that are the result of zero padding added by the padding argument. A value of True is only accepted if ignore_border is also True. False by default.
Notes
For documentation on the remainder of the arguments to this class, see MaxPooling.

class
blocks.bricks.conv.
Convolutional
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.LinearLike
Performs a 2D convolution.
Parameters:
 filter_size (tuple) – The height and width of the filters (also called kernels).
 num_filters (int) – Number of filters per channel.
 num_channels (int) – Number of input channels in the image. For the first layer this is normally 1 for grayscale images and 3 for color (RGB) images. For subsequent layers this is equal to the number of filters output by the previous convolutional layer. The filters are pooled over the channels.
 batch_size (int, optional) – Number of examples per batch. If given, this will be passed to Theano convolution operator, possibly resulting in faster execution.
 image_size (tuple, optional) – The height and width of the input (image or feature map). If given, this will be passed to the Theano convolution operator, resulting in possibly faster execution times.
 step (tuple, optional) – The step (or stride) with which to slide the filters over the image. Defaults to (1, 1).
 border_mode ({'valid', 'full'}, optional) – The border mode to use, see scipy.signal.convolve2d() for details. Defaults to ‘valid’.
 tied_biases (bool) – Setting this to False will untie the biases, yielding a separate bias for every location at which the filter is applied. If True, it indicates that the biases of every filter in this layer should be shared amongst all applications of that filter. Defaults to True.

apply
¶ Perform the convolution.
Parameters: input_ (TensorVariable) – A 4D tensor with the axes representing batch size, number of channels, image height, and image width.
Returns: output – A 4D tensor of filtered images (feature maps) with dimensions representing batch size, number of filters, feature map height, and feature map width. The height and width of the feature map depend on the border mode. For ‘valid’ it is image_size - filter_size + 1 while for ‘full’ it is image_size + filter_size - 1.
Return type: TensorVariable

static
conv2d_impl
(input, filters, input_shape=None, filter_shape=None, border_mode='valid', subsample=(1, 1), filter_flip=True, image_shape=None, filter_dilation=(1, 1), **kwargs)[source]¶ This function will build the symbolic graph for convolving a minibatch of a stack of 2D inputs with a set of 2D filters. The implementation is modelled after Convolutional Neural Networks (CNN).
Parameters:  input (symbolic 4D tensor) – Minibatch of feature map stacks, of shape
(batch size, input channels, input rows, input columns).
See the optional parameter
input_shape
.  filters (symbolic 4D tensor) – Set of filters used in CNN layer of shape
(output channels, input channels, filter rows, filter columns).
See the optional parameter
filter_shape
.  input_shape (None, tuple/list of len 4 of int or Constant variable) – The shape of the input parameter.
Optional, possibly used to choose an optimal implementation.
You can give
None
for any element of the list to specify that this element is not known at compile time.  filter_shape (None, tuple/list of len 4 of int or Constant variable) – The shape of the filters parameter.
Optional, possibly used to choose an optimal implementation.
You can give
None
for any element of the list to specify that this element is not known at compile time.  border_mode (str, int or tuple of two int) –
Either of the following:
'valid'
: apply filter wherever it completely overlaps with the input. Generates output of shape: input shape - filter shape + 1
'full'
: apply filter wherever it partly overlaps with the input. Generates output of shape: input shape + filter shape - 1
'half'
: pad input with a symmetric border of
filter rows // 2
rows and
filter columns // 2
columns, then perform a valid convolution. For filters with an odd number of rows and columns, this leads to the output shape being equal to the input shape.
int
: pad input with a symmetric border of zeros of the given width, then perform a valid convolution.
(int1, int2)
: pad input with a symmetric border of
int1
rows and
int2
columns, then perform a valid convolution.
 subsample (tuple of len 2) – Factor by which to subsample the output. Also called strides elsewhere.
 filter_flip (bool) – If
True
, will flip the filter rows and columns before sliding them over the input. This operation is normally referred to as a convolution, and this is the default. If False, the filters are not flipped and the operation is referred to as a cross-correlation.
 image_shape (None, tuple/list of len 4 of int or Constant variable) – Deprecated alias for input_shape.
 filter_dilation (tuple of len 2) – Factor by which to subsample (stride) the input. Also called dilation elsewhere.
 kwargs – Any other keyword arguments are accepted for backwards compatibility, but will be ignored.
Returns: Set of feature maps generated by convolutional layer. Tensor is of shape (batch size, output channels, output rows, output columns)
Return type: Symbolic 4D tensor
Notes
If cuDNN is available, it will be used on the GPU. Otherwise, the CorrMM convolution (“caffe style convolution”) will be used.
This is only supported in Theano 0.8 or the development version until it is released.
The parameter filter_dilation is an implementation of dilated convolution.
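The effect of filter_flip can be illustrated without Theano. The following is a minimal pure-Python sketch for a single 2D image and a single 2D filter in 'valid' mode; the helper names flip, correlate2d_valid and convolve2d_valid are hypothetical and are not part of Theano or Blocks.

```python
def flip(kernel):
    """Reverse both the rows and the columns of a 2D kernel."""
    return [row[::-1] for row in kernel[::-1]]


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image as-is ('valid' cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def convolve2d_valid(image, kernel):
    """Convolution: flip the kernel first, then cross-correlate."""
    return correlate2d_valid(image, flip(kernel))


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

cross = correlate2d_valid(image, kernel)  # analogous to filter_flip=False
conv = convolve2d_valid(image, kernel)    # analogous to filter_flip=True
```

For this asymmetric kernel the two results differ in sign, which is why the flag matters when filters are imported from a framework that uses the other convention.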

static
get_output_shape
(image_shape, kernel_shape, border_mode, subsample, filter_dilation=None)[source]¶ This function computes the output shape of a convolution operation.
Parameters:  image_shape (tuple of int, symbolic or numeric) – The input image shape. Its four (or five) elements must correspond respectively to: batch size, number of input channels, height and width (and possibly depth) of the image. None where undefined.
 kernel_shape (tuple of int, symbolic or numeric) – The kernel shape. Its four (or five) elements must correspond respectively to: number of output channels, number of input channels, height and width (and possibly depth) of the kernel. None where undefined.
 border_mode (string, int (symbolic or numeric) or tuple of int (symbolic or numeric)) – If it is a string, it must be ‘valid’, ‘half’ or ‘full’. If it is a tuple, its two (or three) elements respectively correspond to the padding on height and width (and possibly depth) axis.
 subsample (tuple of int, symbolic or numeric) – Its two or three elements respectively correspond to the subsampling on height and width (and possibly depth) axis.
 filter_dilation (tuple of int, symbolic or numeric) – Its two or three elements correspond respectively to the dilation on height and width axis.
Returns: output_shape – The output image shape. Its four elements must correspond respectively to: batch size, number of output channels, height and width of the image. None where undefined.
Return type: tuple of int
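The shape arithmetic can be sketched in plain Python. This is an illustration only, not the library implementation: the function names are hypothetical, only numeric 4-element shapes are handled, and the treatment of subsample (floor division) and filter_dilation (an effective kernel size of (k - 1) * d + 1) follows the usual convention and is an assumption here.

```python
def conv_output_length(image, kernel, border_mode="valid", subsample=1, dilation=1):
    """Output length along one spatial axis."""
    dilated = (kernel - 1) * dilation + 1  # effective kernel size
    if border_mode == "valid":
        pad = 0
    elif border_mode == "full":
        pad = dilated - 1
    elif border_mode == "half":
        pad = dilated // 2
    elif isinstance(border_mode, int):
        pad = border_mode
    else:
        raise ValueError("unsupported border_mode: {!r}".format(border_mode))
    return (image + 2 * pad - dilated) // subsample + 1


def conv_output_shape(image_shape, kernel_shape, border_mode, subsample,
                      filter_dilation=None):
    """Shape (batch, output channels, rows, columns) of a 2D convolution."""
    if filter_dilation is None:
        filter_dilation = (1, 1)
    if isinstance(border_mode, tuple):
        modes = border_mode                 # per-axis padding
    else:
        modes = (border_mode, border_mode)  # same mode on both axes
    spatial = tuple(
        conv_output_length(image_shape[2 + i], kernel_shape[2 + i],
                           modes[i], subsample[i], filter_dilation[i])
        for i in range(2))
    return (image_shape[0], kernel_shape[0]) + spatial


image_shape = (8, 3, 28, 28)    # batch, input channels, rows, columns
kernel_shape = (16, 3, 5, 5)    # output channels, input channels, rows, columns

valid = conv_output_shape(image_shape, kernel_shape, "valid", (1, 1))
full = conv_output_shape(image_shape, kernel_shape, "full", (1, 1))
half = conv_output_shape(image_shape, kernel_shape, "half", (1, 1))
strided = conv_output_shape(image_shape, kernel_shape, "valid", (2, 2))
padded = conv_output_shape(image_shape, kernel_shape, (2, 1), (1, 1))
```

With a 5 x 5 filter on a 28 x 28 input, 'valid' shrinks each spatial axis by 4, 'full' grows it by 4, and 'half' leaves it unchanged.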

num_output_channels
¶

class
blocks.bricks.conv.
ConvolutionalSequence
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.sequences.Sequence
,blocks.bricks.interfaces.Initializable
,blocks.bricks.interfaces.Feedforward
A sequence of convolutional (or pooling) operations.
Parameters:  layers (list) – List of convolutional bricks (i.e.
Convolutional
, ConvolutionalActivation
, or Pooling
bricks), or application methods from such bricks. Activation
bricks that operate elementwise can also be included.  num_channels (int) – Number of input channels in the image. For the first layer this is normally 1 for grayscale images and 3 for color (RGB) images. For subsequent layers this is equal to the number of filters output by the previous convolutional layer.
 batch_size (int, optional) – Number of images in batch. If given, will be passed to theano’s convolution operator resulting in possibly faster execution.
 image_size (tuple, optional) – Width and height of the input (image/featuremap). If given, will be passed to theano’s convolution operator resulting in possibly faster execution.
 border_mode ('valid', 'full' or None, optional) – The border mode to use, see
scipy.signal.convolve2d()
for details. Unlike with Convolutional, this defaults to None, in which case no default value is pushed down to child bricks at allocation time. In this case, child bricks need to rely on either a default border mode (usually valid) or one provided at construction and/or after construction (but before allocation).
 tied_biases (bool, optional) – Same meaning as in
Convolutional
. Defaults to None, in which case no value is pushed to child Convolutional bricks.
Notes
The passed convolutional operators should be constructed lazily, that is, without specifying the batch_size, num_channels and image_size. The main feature of
ConvolutionalSequence
is that it will set the input dimensions of a layer to the output dimensions of the previous layer by the
push_allocation_config()
method.
The push behaviour of tied_biases mirrors that of use_bias or any initialization configuration: only an explicitly specified value is pushed down the hierarchy. border_mode also has this behaviour. The reason is that pushing a single default border_mode would make it very difficult to have child bricks with different border modes. Normally, such things would be overridden after push_allocation_config(), but this is a particular hassle here because the border mode affects the allocation parameters of every subsequent child brick in the sequence. Thus, only an explicitly specified border mode will be pushed down the hierarchy.
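The way dimensions flow through a sequence, and why the border mode of an early layer affects every later layer, can be sketched in plain Python. This is not how Blocks implements it: the layer dictionaries and the propagate_dims helper are hypothetical, and the sketch assumes unit convolution steps and non-overlapping pooling that ignores the border.

```python
def propagate_dims(layers, num_channels, image_size, border_mode=None):
    """Return (num_channels, image_size) after every layer in the sequence."""
    dims = []
    for layer in layers:
        if layer["kind"] == "conv":
            # Only an explicitly given sequence-level border mode is pushed
            # down; otherwise the child keeps its own mode (default 'valid').
            if border_mode is not None:
                mode = border_mode
            else:
                mode = layer.get("border_mode", "valid")
            if mode == "valid":
                image_size = tuple(s - k + 1 for s, k
                                   in zip(image_size, layer["filter_size"]))
            elif mode == "full":
                image_size = tuple(s + k - 1 for s, k
                                   in zip(image_size, layer["filter_size"]))
            else:
                raise ValueError("unsupported border mode in this sketch")
            num_channels = layer["num_filters"]
        elif layer["kind"] == "pool":
            image_size = tuple((s - p) // p + 1 for s, p
                               in zip(image_size, layer["pooling_size"]))
        else:
            raise ValueError("unknown layer kind")
        dims.append((num_channels, image_size))
    return dims


layers = [
    {"kind": "conv", "filter_size": (5, 5), "num_filters": 16},
    {"kind": "pool", "pooling_size": (2, 2)},
    {"kind": "conv", "filter_size": (3, 3), "num_filters": 32,
     "border_mode": "full"},
]

# No sequence-level border mode: every child uses its own setting.
own_modes = propagate_dims(layers, num_channels=3, image_size=(32, 32))
# An explicit sequence-level border mode is applied to every convolution.
pushed_full = propagate_dims(layers, 3, (32, 32), border_mode="full")
```

Changing the border mode of the first convolution changes the input size of every layer after it, which is the reason given above for not pushing a default.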

class
blocks.bricks.conv.
ConvolutionalTranspose
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.conv.Convolutional
Performs the transpose of a 2D convolution.
Parameters:  num_filters (int) – Number of filters at the output of the transposed convolution, i.e. the number of channels in the corresponding convolution.
 num_channels (int) – Number of channels at the input of the transposed convolution, i.e. the number of output filters in the corresponding convolution.
 step (tuple, optional) – The step (or stride) of the corresponding convolution. Defaults to (1, 1).
 image_size (tuple, optional) – Image size of the input to the transposed convolution, i.e.
the output of the corresponding convolution. Required for tied
biases. Defaults to
None
.  unused_edge (tuple, optional) – Tuple of pixels added to the inferred height and width of the
output image, whose values would be ignored in the corresponding
forward convolution. Must be such that 0 <=
unused_edge[i]
<= step[i]
. Note that this parameter is ignored if original_image_size is specified in the constructor or manually set as an attribute.
 original_image_size (tuple, optional) – The height and width of the image that forms the output of the transpose operation, which is the input of the original (non-transposed) convolution. By default, this is inferred from image_size to be the size that has each pixel of the original image touched by at least one filter application in the original convolution. Degenerate cases with dropped border pixels (in the original convolution) are possible, and can be manually specified via this argument. See notes below.
See also
Convolutional
 For the documentation of other parameters.
Notes
By default, original_image_size is inferred from image_size as being the minimum size of image that could have produced this output. Let
hanging[i] = original_image_size[i] - image_size[i] * step[i]
. Any value of hanging[i] greater than filter_size[i] - step[i] will result in border pixels that are ignored by the original convolution. With this brick, any original_image_size such that filter_size[i] - step[i] < hanging[i] < filter_size[i] for all i can be validly specified. However, no value will be output by the transposed convolution itself for these extra hanging border pixels, and they will be determined entirely by the bias.
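The relationship between image_size, step, filter_size and the admissible values of original_image_size can be checked numerically along one axis. The sketch below is plain Python with hypothetical helper names; it assumes a 'valid' forward convolution and unused_edge equal to 0.

```python
def forward_output_length(original, filter_size, step):
    """Output length of the original ('valid') convolution on one axis."""
    return (original - filter_size) // step + 1


def minimal_original_length(image, filter_size, step):
    """Smallest input length whose forward convolution yields `image`."""
    return step * (image - 1) + filter_size


def compatible_original_lengths(image, filter_size, step):
    """All input lengths that the forward convolution maps to `image`."""
    smallest = minimal_original_length(image, filter_size, step)
    return list(range(smallest, smallest + step))


def hanging(original, image, step):
    """The quantity called hanging[i] in the notes above."""
    return original - image * step


image, filter_size, step = 4, 3, 2
sizes = compatible_original_lengths(image, filter_size, step)
hangs = [hanging(n, image, step) for n in sizes]
```

Here the default (minimal) size is 9 with hanging equal to filter_size - step, while size 10 is the degenerate case in which the last pixel is never touched by the forward convolution.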
original_image_size
¶

class
blocks.bricks.conv.
Flattener
(name=None, children=None)[source]¶ Bases:
blocks.bricks.base.Brick
Flattens the input.
It may be used to pass multidimensional objects like images or feature maps of convolutional bricks into bricks which allow only two dimensional input (batch, features) like MLP.

apply
¶


class
blocks.bricks.conv.
MaxPooling
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.conv.Pooling
Max pooling layer.
Parameters:  pooling_size (tuple) – The height and width of the pooling region i.e. this is the factor by which your input’s last two dimensions will be downscaled.
 step (tuple, optional) – The vertical and horizontal shift (stride) between pooling regions. By default this is equal to pooling_size. Setting this to a lower number results in overlapping pooling regions.
 input_dim (tuple, optional) – A tuple of integers representing the shape of the input. The last two dimensions will be used to calculate the output dimension.
 padding (tuple, optional) – A tuple of integers representing the vertical and horizontal zero-padding to be applied to each of the top and bottom (vertical) and left and right (horizontal) edges. For example, an argument of (4, 3) will apply 4 pixels of padding to the top edge, 4 pixels of padding to the bottom edge, and 3 pixels each for the left and right edge. By default, no padding is performed.
 ignore_border (bool, optional) – Whether or not to do partial downsampling based on borders where the extent of the pooling region reaches beyond the edge of the image. If True, a (5, 5) image with (2, 2) pooling regions and (2, 2) step will be downsampled to shape (2, 2), otherwise it will be downsampled to (3, 3). True by default.
Notes
Warning
As of this writing, setting ignore_border to False with a step not equal to the pooling size will force Theano to perform pooling computations on CPU rather than GPU, even if you have specified a GPU as your computation device. Additionally, Theano will only use [cuDNN] (if available) for pooling computations with ignore_border set to True. You can ensure that the entire input is captured by at least one pool by using the padding argument to add zero padding prior to pooling being performed.
[cuDNN] NVIDIA cuDNN.
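The effect of ignore_border, step and padding on the output size can be sketched in plain Python. The helper names are hypothetical and the sketch is deliberately limited: it only covers steps no larger than the pooling size, and it does not model padding together with ignore_border=False.

```python
def pooled_length(size, pool, step=None, padding=0, ignore_border=True):
    """Number of pooling regions along one axis."""
    step = pool if step is None else step
    if step > pool:
        raise ValueError("this sketch assumes step <= pooling size")
    if ignore_border:
        # Only regions that fit entirely inside the (padded) input count.
        padded = size + 2 * padding
        return max(0, (padded - pool) // step + 1)
    if padding:
        raise ValueError("padding with ignore_border=False is not modelled")
    if size <= pool:
        return 1
    # One region at the start, plus enough shifted regions to reach the edge.
    return -(-(size - pool) // step) + 1


def pooled_shape(shape, pooling_size, step=None, padding=(0, 0),
                 ignore_border=True):
    """Downsample the last two dimensions of `shape`."""
    step = pooling_size if step is None else step
    last_two = tuple(
        pooled_length(shape[-2 + i], pooling_size[i], step[i], padding[i],
                      ignore_border)
        for i in range(2))
    return tuple(shape[:-2]) + last_two


ignored = pooled_shape((5, 5), (2, 2))                       # partial regions dropped
partial = pooled_shape((5, 5), (2, 2), ignore_border=False)  # partial regions kept
overlapping = pooled_shape((5, 5), (2, 2), step=(1, 1))
padded = pooled_shape((5, 5), (2, 2), padding=(1, 1))
batched = pooled_shape((8, 16, 5, 5), (2, 2))
```

The first two results reproduce the (5, 5) example given for ignore_border, and the padded case shows how zero padding lets every input pixel fall inside some pooling region while keeping ignore_border set to True.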

class
blocks.bricks.conv.
Pooling
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Initializable
,blocks.bricks.interfaces.Feedforward
Base Brick for pooling operations.
This should generally not be instantiated directly; see
MaxPooling
.
apply
¶ Apply the pooling (subsampling) transformation.
Parameters: input ( TensorVariable
) – A tensor with at least 2 dimensions. The last two dimensions will be downsampled. For example, with images this means that the last two dimensions should represent the height and width of your image.
Returns: output – A tensor with the same number of dimensions as input_, but with the last two dimensions downsampled.
Return type: TensorVariable

image_size
¶

num_channels
¶

num_output_channels
¶

Routing bricks¶

class
blocks.bricks.parallel.
Distribute
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.parallel.Fork
Transform an input and add it to other inputs.
This brick is designed for the following scenario: one has a group of variables and another separate variable, and one needs to somehow distribute information from the latter across the former. We call that “to distribute a variable across other variables”, and refer to the separate variable as “the source” and to the variables from the group as “the targets”.
Given a prototype brick, a
Parallel
brick makes several copies of it (each with its own parameters). At application time the copies are applied to the source and the transformation results are added to the targets (in the literal sense).
>>> from theano import tensor
>>> from blocks.initialization import Constant
>>> x = tensor.matrix('x')
>>> y = tensor.matrix('y')
>>> z = tensor.matrix('z')
>>> distribute = Distribute(target_names=['x', 'y'], source_name='z',
...                         target_dims=[2, 3], source_dim=3,
...                         weights_init=Constant(2))
>>> distribute.initialize()
>>> new_x, new_y = distribute.apply(x=x, y=y, z=z)
>>> new_x.eval({x: [[2, 2]], z: [[1, 1, 1]]})
array([[ 8., 8.]]...
>>> new_y.eval({y: [[1, 1, 1]], z: [[1, 1, 1]]})
array([[ 7., 7., 7.]]...
Parameters:  target_names (list) – The names of the targets.
 source_name (str) – The name of the source.
 target_dims (list) – A list of target dimensions, corresponding to target_names.
 source_dim (int) – The dimension of the source input.
 prototype (
Feedforward
, optional) – The transformation prototype. A copy will be created for every input. By default a linear transformation is used.

target_dims
¶ list

source_dim
¶ int
Notes
See
Initializable
for initialization parameters.
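The numbers in the Distribute example above can be reproduced by hand. The following plain-Python sketch uses hypothetical helpers (constant_matrix, linear, distribute) and single rows instead of Theano matrices; it assumes the default transformation is a bias-free linear map, as stated for the prototype parameter.

```python
def constant_matrix(rows, cols, value):
    """Weight matrix filled with one value, like Constant(value)."""
    return [[value] * cols for _ in range(rows)]


def linear(vector, weights):
    """Row vector times weight matrix (weights given as a list of rows)."""
    return [sum(v * w for v, w in zip(vector, column))
            for column in zip(*weights)]


def distribute(targets, source, weights):
    """Add a separately transformed copy of the source to every target."""
    return {name: [t + d for t, d in zip(target, linear(source, weights[name]))]
            for name, target in targets.items()}


source_dim = 3
targets = {"x": [2, 2], "y": [1, 1, 1]}
weights = {name: constant_matrix(source_dim, len(target), 2)
           for name, target in targets.items()}

result = distribute(targets, source=[1, 1, 1], weights=weights)
```

Each target receives its original value plus 1 * 2 + 1 * 2 + 1 * 2 = 6 from the transformed source.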

class
blocks.bricks.parallel.
Fork
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.parallel.Parallel
Several outputs from one input by applying similar transformations.
Given a prototype brick, a
Fork
brick makes several copies of it (each with its own parameters). At the application time the copies are applied to the input to produce different outputs.
A typical use case for this brick is to produce inputs for gates of gated recurrent bricks, such as
GatedRecurrent
.
>>> from theano import tensor
>>> from blocks.initialization import Constant
>>> x = tensor.matrix('x')
>>> fork = Fork(output_names=['y', 'z'],
...             input_dim=2, output_dims=[3, 4],
...             weights_init=Constant(2), biases_init=Constant(1))
>>> fork.initialize()
>>> y, z = fork.apply(x)
>>> y.eval({x: [[1, 1]]})
array([[ 5., 5., 5.]]...
>>> z.eval({x: [[1, 1]]})
array([[ 5., 5., 5., 5.]]...
Parameters:  output_names (list of str) – Names of the outputs to produce.
 input_dim (int) – The input dimension.
 prototype (
Feedforward
, optional) – The transformation prototype. A copy will be created for every input. By default an affine transformation is used.

input_dim
¶ int – The input dimension.

output_dims
¶ list – The output dimensions as a list of integers, corresponding to output_names.
See also

apply
¶

class
blocks.bricks.parallel.
Merge
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.parallel.Parallel
Merges several variables by applying a transformation and summing.
Parameters:  input_names (list) – The input names.
 input_dims (list) – List of input dimensions, corresponding to input_names.
 output_dim (int) – The output dimension of the merged variables.
 prototype (
Feedforward
, optional) – A transformation prototype. A copy will be created for every input. If None, a linear transformation is used.
 child_prefix (str, optional) – A prefix for children names. By default “transform” is used.
Warning
Note that if you want to have a bias you can pass a
Linear
brick as a prototype, but this will result in several redundant biases. It is a better idea to use
merge.children[0].use_bias = True
.

input_names
¶ list – The input names.

input_dims
¶ list – List of input dimensions corresponding to input_names.

output_dim
¶ int – The output dimension.
Examples
>>> from theano import tensor
>>> from blocks.initialization import Constant
>>> a = tensor.matrix('a')
>>> b = tensor.matrix('b')
>>> merge = Merge(input_names=['a', 'b'], input_dims=[3, 4],
...               output_dim=2, weights_init=Constant(1.))
>>> merge.initialize()
>>> c = merge.apply(a=a, b=b)
>>> c.eval({a: [[1, 1, 1]], b: [[2, 2, 2, 2]]})
array([[ 11., 11.]]...
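The arithmetic of the example above, and the reason why a bias on every child is redundant, can be shown in plain Python. The helpers affine and merge are hypothetical stand-ins operating on single rows; they are not the Blocks implementation.

```python
def affine(vector, weights, bias):
    """Row vector times weight matrix (list of rows), plus a bias vector."""
    return [sum(v * w for v, w in zip(vector, column)) + b
            for column, b in zip(zip(*weights), bias)]


def merge(inputs, weights, biases):
    """Transform every input separately and sum the results elementwise."""
    transformed = [affine(x, w, b) for x, w, b in zip(inputs, weights, biases)]
    return [sum(values) for values in zip(*transformed)]


a, b = [1, 1, 1], [2, 2, 2, 2]
ones_a = [[1, 1] for _ in a]  # 3 x 2 weight matrix of ones
ones_b = [[1, 1] for _ in b]  # 4 x 2 weight matrix of ones
zero = [0, 0]

# Same numbers as the doctest above: 3 * 1 + 4 * 2 = 11 per output unit.
no_bias = merge([a, b], [ones_a, ones_b], [zero, zero])

# A bias on every child only ever appears as the sum of all biases ...
two_biases = merge([a, b], [ones_a, ones_b], [[0.5, 0.5], [0.25, 0.25]])
# ... so a single bias on the first child expresses the same function.
one_bias = merge([a, b], [ones_a, ones_b], [[0.75, 0.75], zero])
```

Because the outputs are summed, the individual biases cannot be distinguished from one another, which is why enabling the bias on a single child is sufficient.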

apply
¶

class
blocks.bricks.parallel.
Parallel
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Initializable
Apply similar transformations to several inputs.
Given a prototype brick, a
Parallel
brick makes several copies of it (each with its own parameters). At the application time every copy is applied to the respective input.
>>> from theano import tensor
>>> from blocks.initialization import Constant
>>> x, y = tensor.matrix('x'), tensor.matrix('y')
>>> parallel = Parallel(
...     prototype=Linear(use_bias=False),
...     input_names=['x', 'y'], input_dims=[2, 3], output_dims=[4, 5],
...     weights_init=Constant(2))
>>> parallel.initialize()
>>> new_x, new_y = parallel.apply(x=x, y=y)
>>> new_x.eval({x: [[1, 1]]})
array([[ 4., 4., 4., 4.]]...
>>> new_y.eval({y: [[1, 1, 1]]})
array([[ 6., 6., 6., 6., 6.]]...
Parameters:  input_names (list) – The input names.
 input_dims (list) – List of input dimensions, given in the same order as input_names.
 output_dims (list) – List of output dimensions.
 prototype (
Feedforward
) – The transformation prototype. A copy will be created for every input.  child_prefix (str, optional) – The prefix for children names. By default “transform” is used.

input_names
¶ list – The input names.

input_dims
¶ list – Input dimensions.

output_dims
¶ list – Output dimensions.
Notes
See
Initializable
for initialization parameters.
apply
¶
Recurrent bricks¶
Recurrent architectures¶

class
blocks.bricks.recurrent.architectures.
GatedRecurrent
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.recurrent.base.BaseRecurrent
,blocks.bricks.interfaces.Initializable
Gated recurrent neural network.
Gated recurrent neural network (GRNN) as introduced in [CvMG14]. Every unit of a GRNN is equipped with update and reset gates that facilitate better gradient propagation.
Parameters: Notes
See
Initializable
for initialization parameters.
[CvMG14] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP (2014), pp. 1724-1734.
apply
¶ Apply the gated recurrent transition.
Parameters:  states (
TensorVariable
) – The 2 dimensional matrix of current states in the shape (batch_size, dim). Required for one_step usage.  inputs (
TensorVariable
) – The 2 dimensional matrix of inputs in the shape (batch_size, dim)  gate_inputs (
TensorVariable
) – The 2 dimensional matrix of inputs to the gates in the shape (batch_size, 2 * dim).  mask (
TensorVariable
) – A 1D binary array in the shape (batch,) which is 1 if there is data available, 0 if not. Assumed to be 1s only if not given.
Returns: output – Next states of the network.
Return type: TensorVariable

initial_states
¶

state_to_gates
¶

state_to_state
¶


class
blocks.bricks.recurrent.architectures.
LSTM
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.recurrent.base.BaseRecurrent
,blocks.bricks.interfaces.Initializable
Long Short Term Memory.
Every unit of an LSTM is equipped with input, forget and output gates. This implementation is based on code by Mohammad Pezeshki that implements the architecture used in [GSS03] and [Grav13]. It aims to do as many computations in parallel as possible and expects the last dimension of the input to be four times the output dimension.
Unlike a vanilla LSTM as described in [HS97], this model has peephole connections from the cells to the gates. The output gates receive information about the cells at the current time step, while the other gates only receive information about the cells at the previous time step. All ‘peephole’ weight matrices are diagonal.
[GSS03] Gers, Felix A., Nicol N. Schraudolph, and Jürgen Schmidhuber, Learning precise timing with LSTM recurrent networks, Journal of Machine Learning Research 3 (2003), pp. 115-143.
[Grav13] Graves, Alex, Generating sequences with recurrent neural networks, arXiv preprint arXiv:1308.0850 (2013).
[HS97] Sepp Hochreiter, and Jürgen Schmidhuber, Long Short-Term Memory, Neural Computation 9(8) (1997), pp. 1735-1780.
Parameters: Notes
See
Initializable
for initialization parameters.
apply
¶ Apply the Long Short Term Memory transition.
Parameters:  states (
TensorVariable
) – The 2 dimensional matrix of current states in the shape (batch_size, features). Required for one_step usage.  cells (
TensorVariable
) – The 2 dimensional matrix of current cells in the shape (batch_size, features). Required for one_step usage.  inputs (
TensorVariable
) – The 2 dimensional matrix of inputs in the shape (batch_size, features * 4). The last dimension of the inputs needs to be four times the dimension of the LSTM brick to ensure that each of the four components receives a different transformation of the input. See [Grav13] equations 7 to 10 for more details. The inputs are then split in this order: input gates, forget gates, cells and output gates.  mask (
TensorVariable
) – A 1D binary array in the shape (batch,) which is 1 if there is data available, 0 if not. Assumed to be 1s only if not given.
Returns:  states (
TensorVariable
) – Next states of the network.  cells (
TensorVariable
) – Next cell activations of the network.
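The split order described for the inputs argument can be illustrated on a single row. This is a plain-Python sketch with a hypothetical helper name; it only shows the slicing of the last dimension, not the LSTM transition itself.

```python
def split_lstm_inputs(inputs, dim):
    """Split one row of length 4 * dim into the four per-component slices."""
    if len(inputs) != 4 * dim:
        raise ValueError("expected {} values, got {}".format(4 * dim, len(inputs)))
    # Order taken from the documentation of the inputs argument.
    names = ("input_gates", "forget_gates", "cells", "output_gates")
    return {name: inputs[i * dim:(i + 1) * dim]
            for i, name in enumerate(names)}


dim = 2
row = [0, 1, 2, 3, 4, 5, 6, 7]  # one example with 4 * dim features
parts = split_lstm_inputs(row, dim)
```

In practice the row would come from a transformation of the external input whose output dimension is four times the dimension of the LSTM brick.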

initial_states
¶


class
blocks.bricks.recurrent.architectures.
SimpleRecurrent
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.recurrent.base.BaseRecurrent
,blocks.bricks.interfaces.Initializable
The traditional recurrent transition.
The most well-known recurrent transition: a matrix multiplication, optionally followed by a nonlinearity.
Parameters: Notes
See
Initializable
for initialization parameters.
W
¶

apply
¶ Apply the simple transition.
Parameters:  inputs (
TensorVariable
) – The 2D inputs, in the shape (batch, features).  states (
TensorVariable
) – The 2D states, in the shape (batch, features).  mask (
TensorVariable
) – A 1D binary array in the shape (batch,) which is 1 if there is data available, 0 if not. Assumed to be 1s only if not given.
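A single step of such a transition, including the role of the mask, can be sketched in plain Python. The exact formula is an assumption here: the sketch adds the inputs to the states multiplied by W, applies the activation, and keeps the previous state wherever the mask is 0. The function name is hypothetical and nested lists stand in for Theano matrices.

```python
import math


def simple_recurrent_step(inputs, states, W, mask=None, activation=math.tanh):
    """One step for a batch: activation(inputs + states . W), then masking."""
    dim = len(W)
    next_states = []
    for b, (x, s) in enumerate(zip(inputs, states)):
        pre = [x[j] + sum(s[i] * W[i][j] for i in range(dim))
               for j in range(dim)]
        new = [activation(p) for p in pre]
        m = 1 if mask is None else mask[b]
        # Where the mask is 0 there is no data, so the old state is kept.
        next_states.append([m * n + (1 - m) * old for n, old in zip(new, s)])
    return next_states


inputs = [[1, 2], [3, 4]]   # shape (batch, features)
states = [[1, 0], [0, 1]]   # shape (batch, features)
W = [[1, 2], [3, 4]]        # shape (features, features)


def identity(value):
    return value


unmasked = simple_recurrent_step(inputs, states, W, activation=identity)
masked = simple_recurrent_step(inputs, states, W, mask=[1, 0],
                               activation=identity)
```

The identity activation is used only so that the numbers are easy to follow; with the default tanh the same masking logic applies.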

initial_states
¶

Helper bricks for recurrent networks¶

class
blocks.bricks.recurrent.misc.
Bidirectional
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Initializable
Bidirectional network.
A bidirectional network is a combination of forward and backward recurrent networks which process inputs in different order.
Parameters: prototype (instance of BaseRecurrent
) – A prototype brick from which the forward and backward bricks are cloned.
Notes
See
Initializable
for initialization parameters.
apply
¶ Applies forward and backward networks and concatenates outputs.

has_bias
= False¶


class
blocks.bricks.recurrent.misc.
RecurrentStack
(transitions, fork_prototype=None, states_name='states', skip_connections=False, **kwargs)[source]¶ Bases:
blocks.bricks.recurrent.base.BaseRecurrent
,blocks.bricks.interfaces.Initializable
Stack of recurrent networks.
Builds a stack of recurrent layers from a supplied list of
BaseRecurrent
objects. Each object must have sequences, contexts, states and outputs parameters to its apply method, such as the ones required by the recurrent decorator from
blocks.bricks.recurrent
.
In Blocks in general each brick can have an apply method, and this method has attributes that list the names of the arguments that can be passed to the method and the names of the outputs returned by the method. The attributes of the apply method of this class are made by concatenating the attributes of the apply methods of each of the transitions from which the stack is made. In order to avoid conflicts, the names of the arguments appearing in the states and outputs attributes of the apply method of each layer are renamed. The names of the bottom layer are used as-is and a suffix of the form ‘#<n>’ is added to the names from other layers, where ‘<n>’ is the number of the layer, starting from 1 for the first layer above the bottom.
The contexts of all layers are merged into a single list of unique names, and no suffix is added. Different layers with the same context name will receive the same value.
The names that appear in sequences are treated in the same way as the names of states and outputs if skip_connections is “True”. The only exception is the “mask” element that may appear in the sequences attribute of all layers, no suffix is added to it and all layers will receive the same mask value. If you set skip_connections to False then only the arguments of the sequences from the bottom layer will appear in the sequences attribute of the apply method of this class. When using this class, with skip_connections set to “True”, you can supply all inputs to all layers using a single fork which is created with output_names set to the apply.sequences attribute of this class. For example,
SequenceGenerator
will create such a fork.
Whether or not skip_connections is set, each layer above the bottom also receives an input (values to its sequences arguments) from a fork of the state of the layer below it. This is not to be confused with the external fork discussed in the previous paragraph. It is assumed that all states attributes have a “states” argument name (this can be configured with the states_name parameter). The output argument with this name is forked and then added to all the elements appearing in the sequences of the next layer (except for “mask”). If skip_connections is False then this fork has a bias by default. This allows direct usage of this class with input supplied only to the first layer. But if you do supply inputs to all layers (by setting skip_connections to “True”) then by default there is no bias and the external fork you use to supply the inputs should have its own separate bias.
Parameters:  transitions (list) – List of recurrent units to use in each layer. Each derived from
BaseRecurrent
Note: A suffix with layer number is added to transitions’ names.  fork_prototype (
FeedForward
, optional) – A prototype for the transformation applied to states_name from the states of each layer. The transformation is used when the states_name argument from the outputs of one layer is used as input to the sequences of the next layer. By default a
Linear
transformation is used, with bias if skip_connections is “False”. If you supply your own prototype you have to enable/disable bias depending on the value of skip_connections.  states_name (string) – In a stack of RNN the state of each layer is used as input to the next. The states_name identify the argument of the states and outputs attributes of each layer that should be used for this task. By default the argument is called “states”. To be more precise, this is the name of the argument in the outputs attribute of the apply method of each transition (layer.) It is used, via fork, as the sequences (input) of the next layer. The same element should also appear in the states attribute of the apply method.
 skip_connections (bool) – By default False. When True, the sequences of all layers are added to the sequences of the apply method of this class. When False, only the sequences of the bottom layer appear in the sequences of the apply method of this class. In this case the default fork used internally between layers has a bias (see fork_prototype). External code can inspect the sequences attribute of the apply method of this class to decide which arguments it needs (and in what order). With skip_connections you can control what is exposed to the external code. If it is False, the external code is expected to supply inputs only to the bottom layer; if it is True, the external code is expected to supply inputs to all layers. One caveat: the external inputs to the layers above the bottom layer are added to a fork of the state of the layer below. As a result the outputs of two forks are added together, which is problematic if both have a bias. It is assumed that the external fork has a bias, and therefore by default the internal fork will not have a bias if skip_connections is True.
Notes
See
BaseRecurrent
for more initialization parameters.
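The naming rules described for this class can be sketched in plain Python. The helper names below are hypothetical and the ordering of the resulting lists is an assumption; only the suffix rule, the shared “mask” and the merging of contexts are taken from the description.

```python
def stacked_names(per_layer_names, shared=()):
    """Bottom-layer names as-is, '#<n>' suffix for layer n above the bottom."""
    result = []
    for level, names in enumerate(per_layer_names):
        for name in names:
            if name in shared:
                # Shared names (such as the mask) appear once, unsuffixed.
                if name not in result:
                    result.append(name)
            elif level == 0:
                result.append(name)
            else:
                result.append("{}#{}".format(name, level))
    return result


def stacked_sequences(per_layer_sequences, skip_connections):
    """Sequences exposed by the stack, depending on skip_connections."""
    if not skip_connections:
        return list(per_layer_sequences[0])
    return stacked_names(per_layer_sequences, shared=("mask",))


def stacked_contexts(per_layer_contexts):
    """Contexts are merged into one list of unique names, without suffixes."""
    merged = []
    for names in per_layer_contexts:
        for name in names:
            if name not in merged:
                merged.append(name)
    return merged


states = stacked_names([["states", "cells"]] * 3)
with_skip = stacked_sequences([["inputs", "mask"]] * 2, skip_connections=True)
without_skip = stacked_sequences([["inputs", "mask"]] * 2, skip_connections=False)
contexts = stacked_contexts([["attended"], ["attended", "speaker"]])
```

External code that builds a fork for the stack would use a list like with_skip as its output names, which is the role the description assigns to the apply.sequences attribute.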
apply
¶ Apply the stack of transitions.
Parameters:  low_memory (bool) – Use the slow, but also memory efficient, implementation of this code.
 *args (
TensorVariable
, optional) – Positional arguments in the order in which they appear in self.apply.sequences followed by self.apply.contexts.  **kwargs (
TensorVariable
) – Named arguments defined in self.apply.sequences, self.apply.states or self.apply.contexts
Returns: outputs – The outputs of all transitions as defined in self.apply.outputs
Return type: (list of)
TensorVariable
See also
See docstring of this class for arguments appearing in the lists self.apply.sequences, self.apply.states, self.apply.contexts. See
recurrent()
for all other parameters such as iterate and return_initial_states. Note that reverse is currently not implemented.

do_apply
(*args, **kwargs)[source]¶ Apply the stack of transitions.
This is the undecorated implementation of the apply method. A method with an @apply decoration should call this method with iterate=True to indicate that the iteration over all steps should be done internally by this method. A method with a @recurrent decoration should have iterate=False (or unset) to indicate that the iteration over all steps is done externally.

initial_states
¶

low_memory_apply
¶
Base definitions for recurrent bricks¶

class
blocks.bricks.recurrent.base.
BaseRecurrent
(name=None, children=None)[source]¶ Bases:
blocks.bricks.base.Brick
Base class for brick with recurrent application method.

has_bias
= False¶

initial_states
¶ Return initial states for an application call.
The default implementation assumes that the recurrent application method is called apply. It fetches the state names from apply.states and returns a zero matrix for each of them.
SimpleRecurrent
, LSTM
and GatedRecurrent
override this method with trainable initial states initialized with zeros.
Parameters:  batch_size (int) – The batch size.
 *args – The positional arguments of the application call.
 **kwargs – The keyword arguments of the application call.


blocks.bricks.recurrent.base.
recurrent
(*args, **kwargs)[source]¶ Wraps an apply method to allow its iterative application.
This decorator allows you to implement only one step of a recurrent network and enjoy applying it to sequences for free. The idea behind it is that, in its most general form, the information flow of an RNN can be described as follows: depending on the context and driven by input sequences, the RNN updates its states and produces output sequences.
Given a method describing one step of an RNN and a specification which of its inputs are the elements of the input sequence, which are the states and which are the contexts, this decorator returns an application method which implements the whole RNN loop. The returned application method also has additional parameters, see documentation of the recurrent_apply inner function below.
Parameters:  sequences (list of strs) – Specifies which of the arguments are elements of input sequences.
 states (list of strs) – Specifies which of the arguments are the states.
 contexts (list of strs) – Specifies which of the arguments are the contexts.
 outputs (list of strs) – Names of the outputs. The outputs whose names match those in the states parameter are interpreted as next step states.
Returns: recurrent_apply – The new application method that applies the RNN to sequences.
Return type: See also
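The looping behaviour described above can be pictured without Theano. The sketch below is a plain-Python analogue and not the Blocks implementation: recurrent_sketch, running_sum and the initial_states argument are hypothetical names, and the real decorator builds a symbolic loop rather than a Python one.

```python
def recurrent_sketch(sequences, states):
    """Turn a one-step function into one that loops over whole sequences.

    Hypothetical stand-in for the idea behind ``recurrent``. Contexts are
    passed through unchanged as extra keyword arguments.
    """
    def decorator(step):
        def recurrent_apply(initial_states, **kwargs):
            # Separate the sequence arguments from the contexts.
            seqs = {name: kwargs.pop(name) for name in sequences}
            current = dict(initial_states)
            history = {name: [] for name in states}
            # Iterate over time steps, feeding the states back into the step.
            for elements in zip(*(seqs[name] for name in sequences)):
                inputs = dict(zip(sequences, elements))
                new_states = step(**inputs, **current, **kwargs)
                if not isinstance(new_states, tuple):
                    new_states = (new_states,)
                current = dict(zip(states, new_states))
                for name in states:
                    history[name].append(current[name])
            return history
        return recurrent_apply
    return decorator


@recurrent_sketch(sequences=['x'], states=['total'])
def running_sum(x, total, scale=1.0):
    # One step: the new state depends on the input and the old state.
    return total + scale * x


result = running_sum({'total': 0.0}, x=[1.0, 2.0, 3.0], scale=2.0)
```

Here x plays the role of a sequence, total of a state and scale of a context; only the single step is written by hand.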
Attention bricks¶
This module defines the interface of attention mechanisms and a few concrete implementations. For a gentle introduction and usage examples see the tutorial TODO.
An attention mechanism decides which part of the input to pay attention to. It is typically used as a component of a recurrent network, though one can imagine it being used in other conditions as well. When the input is big and has a certain structure, for instance when it is a sequence or an image, an attention mechanism can be applied to extract only the information which is relevant for the network in its current state.
For the purpose of documentation clarity, we fix the following terminology in this file:
 The network is the network, typically a recurrent one, which uses the attention mechanism.
 The network has states. Using this word in plural might seem weird, but some recurrent networks like LSTM do have several states.
 The big structured input, to which the attention mechanism is applied, is called the attended. When it has variable structure, e.g. a sequence of variable length, there might be a mask associated with it.
 The information extracted by the attention from the attended is called a glimpse, more specifically glimpses because there might be a few pieces of this information.
Using this terminology, the attention mechanism computes glimpses given the states of the network and the attended.
An example: in the machine translation network from [BCB] the attended is a sequence of so-called annotations, that is states of a bidirectional network that was driven by word embeddings of the source sentence. The attention mechanism assigns weights to the annotations. The weighted sum of the annotations is further used by the translation network to predict the next word of the generated translation. The weights and the weighted sum are the glimpses. A generalized attention mechanism for this paper is represented here as SequenceContentAttention.

class
blocks.bricks.attention.
AbstractAttention
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.base.Brick
The common interface for attention bricks.
First, see the module-level docstring for terminology.
A generic attention mechanism functions as follows. Its inputs are the states of the network and the attended. Given these two it produces so-called glimpses, that is it extracts information from the attended which is necessary for the network in its current states.
For computational reasons we separate the process described above into two stages:
1. The preprocessing stage, preprocess(), includes computations that do not involve the state. Those can often be performed in advance. The outcome of this stage is called preprocessed_attended.
2. The main stage, take_glimpses(), includes all the rest.
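The benefit of separating the two stages can be sketched in plain Python. ToyAttention below is a hypothetical class operating on lists of numbers, not a Blocks brick; it only illustrates that the state-independent work is done once while glimpses are taken at every step.

```python
class ToyAttention:
    """Hypothetical two-stage attention over a list of numbers."""

    def __init__(self, scale):
        self.scale = scale
        self.preprocess_calls = 0

    def preprocess(self, attended):
        # Stage 1: state-independent work, done once per attended input.
        self.preprocess_calls += 1
        return [self.scale * a for a in attended]

    def take_glimpses(self, attended, preprocessed_attended=None, state=0.0):
        # Stage 2: trigger stage 1 only if its result was not supplied.
        if preprocessed_attended is None:
            preprocessed_attended = self.preprocess(attended)
        # Pick the element whose preprocessed value is closest to the state.
        distances = [abs(p - state) for p in preprocessed_attended]
        return attended[distances.index(min(distances))]


attention = ToyAttention(scale=2.0)
attended = [1.0, 2.0, 3.0]
preprocessed = attention.preprocess(attended)  # computed in advance
glimpses = [attention.take_glimpses(attended, preprocessed, state=s)
            for s in (2.1, 5.9, 4.2)]
```

Three glimpses are taken, yet preprocess runs only once because its result is passed in explicitly.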
When an attention mechanism is applied sequentially, some glimpses from the previous step might be necessary to compute the new ones. A typical example is when the focus position from the previous step is required. In such cases take_glimpses() should specify this need in its interface (its docstring explains how to do that). In addition, initial_glimpses() should specify some sensible initialization for the glimpses to be carried over.
Todo
Only a single attended is currently allowed.
preprocess() and initial_glimpses() might end up needing masks, which are currently not provided for them.
Parameters: 
state_names
¶ list

state_dims
¶ list

attended_dim
¶ int

initial_glimpses
(batch_size, attended)[source]¶ Return sensible initial values for carried over glimpses.
Parameters:  batch_size (int or Variable) – The batch size.
 attended (Variable) – The attended.
Returns: initial_glimpses – The initial values for the requested glimpses. These might simply consist of zeros or be somehow extracted from the attended.
Return type: list of Variable

preprocess
¶ Perform the preprocessing of the attended.
Stage 1 of the attention mechanism, see the AbstractAttention docstring for an explanation of stages. The default implementation simply returns attended.
Parameters: attended (Variable) – The attended.
Returns: preprocessed_attended – The preprocessed attended.
Return type: Variable

take_glimpses
(attended, preprocessed_attended=None, attended_mask=None, **kwargs)[source]¶ Extract glimpses from the attended given the current states.
Stage 2 of the attention mechanism, see AbstractAttention for an explanation of stages. If preprocessed_attended is not given, this method should trigger stage 1.
This application method must declare its inputs and outputs. The glimpses to be carried over are identified by their presence in both the inputs and the outputs lists. The attended must be the first input, the preprocessed attended must be the second one.
Parameters:  attended (Variable) – The attended.
 preprocessed_attended (Variable, optional) – The preprocessed attended computed by preprocess(). When not given, preprocess() should be called.
 attended_mask (Variable, optional) – The mask for the attended. This is required in the case of padded structured output, e.g. when a number of sequences are forced to be the same length. The mask identifies the positions of the attended that actually contain information.
 **kwargs (dict) – Includes the states and the glimpses to be carried over from the previous step in the case when the attention mechanism is applied sequentially.

class
blocks.bricks.attention.
AbstractAttentionRecurrent
(name=None, children=None)[source]¶ Bases:
blocks.bricks.recurrent.base.BaseRecurrent
The interface for attention-equipped recurrent transitions.
When a recurrent network is equipped with an attention mechanism its transition typically consists of two steps: (1) the glimpses are taken by the attention mechanism and (2) the next states are computed using the current states and the glimpses. It is required for certain use cases (such as a sequence generator) that, apart from a do-it-all recurrent application method, interfaces for the first and the second steps of the transition are provided.

class
blocks.bricks.attention.
AttentionRecurrent
(transition, attention, distribute=None, add_contexts=True, attended_name=None, attended_mask_name=None, **kwargs)[source]¶ Bases:
blocks.bricks.attention.AbstractAttentionRecurrent
,blocks.bricks.interfaces.Initializable
Combines an attention mechanism and a recurrent transition.
This brick equips a recurrent transition with an attention mechanism. In order to do this two more contexts are added: one to be attended and a mask for it. It is also possible to use the contexts of the given recurrent transition for these purposes and not add any new ones, see the add_contexts parameter.
At the beginning of each step the attention mechanism produces glimpses; these glimpses together with the current states are used to compute the next state and finish the transition. In some cases glimpses from the previous steps are also necessary for the attention mechanism, e.g. in order to focus on an area close to the one from the previous step. This is also supported: such glimpses become states of the new transition.
To let the user control the way glimpses are used, this brick also takes a “distribute” brick as parameter that distributes the information from glimpses across the sequential inputs of the wrapped recurrent transition.
Parameters:  transition (BaseRecurrent) – The recurrent transition.
 attention (Brick) – The attention mechanism.
 distribute (Brick, optional) – Distributes the information from glimpses across the input sequences of the transition. By default a Distribute is used, and those inputs containing the “mask” substring in their name are not affected.
 add_contexts (bool, optional) – If True, new contexts for the attended and the attended mask are added to this transition, otherwise existing contexts of the wrapped transition are used. True by default.
 attended_name (str) – The name of the attended context. If None, “attended” or the first context of the recurrent transition is used depending on the value of the add_contexts flag.
 attended_mask_name (str) – The name of the mask for the attended context. If None, “attended_mask” or the second context of the recurrent transition is used depending on the value of the add_contexts flag.
Notes
See Initializable for initialization parameters.
Wrapping your recurrent brick with this class makes all the states mandatory. If you feel this is a limitation for you, try to make it better! This restriction does not apply to sequences and contexts: those remain as optional as they were for your brick.
Those coming to Blocks from Groundhog might recognize that this is a RecurrentLayerWithSearch, but on steroids :)

apply
¶ Preprocess a sequence attending the attended context at every step.
Preprocesses the attended context and runs do_apply(). See the do_apply() documentation for further information.

compute_states
¶ Compute current states when glimpses have already been computed.
Combines an application of the distribute brick, which alters the sequential inputs of the wrapped transition, and an application of the wrapped transition. All unknown keyword arguments go to the wrapped transition.
Parameters: **kwargs – Should contain everything that self.transition needs and in addition the current glimpses.
Returns: current_states – Current states computed by self.transition.
Return type: list of TensorVariable

do_apply
¶ Process a sequence attending the attended context every step.
In addition to the original sequence this method also requires its preprocessed version, the one computed by the preprocess method of the attention mechanism. Unknown keyword arguments are passed to the wrapped transition.
Parameters: **kwargs – Should contain current inputs, previous step states, contexts, the preprocessed attended context, previous step glimpses.
Returns: outputs – The current step states and glimpses.
Return type: list of TensorVariable

initial_states
¶

take_glimpses
¶ Compute glimpses with the attention mechanism.
A thin wrapper over self.attention.take_glimpses: takes care of choosing and renaming the necessary arguments.
Parameters: **kwargs – Must contain the attended, previous step states and glimpses. Can optionally contain the attended mask and the preprocessed attended.
Returns: glimpses – Current step glimpses.
Return type: list of TensorVariable

class
blocks.bricks.attention.
GenericSequenceAttention
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.attention.AbstractAttention
Logic common for sequence attention mechanisms.

compute_weighted_averages
¶ Compute weighted averages of the attended sequence vectors.
Parameters:  weights (Variable) – The weights. The shape must be equal to the attended shape without the last dimension.
 attended (Variable) – The attended. The index in the sequence must be the first dimension.
Returns: weighted_averages – The weighted averages of the attended elements. The shape is equal to the attended shape with the first dimension dropped.
Return type: Variable
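The shape contract of compute_weighted_averages can be illustrated with nested lists. This is a minimal sketch assuming a (time, batch, features) layout for the attended and (time, batch) for the weights; weighted_averages is a hypothetical helper, not the Theano code used by the brick.

```python
def weighted_averages(weights, attended):
    """weights: [time][batch]; attended: [time][batch][features]."""
    n_steps = len(attended)
    batch_size = len(attended[0])
    n_features = len(attended[0][0])
    # Sum over the time axis, which is therefore dropped from the result.
    return [[sum(weights[t][b] * attended[t][b][f] for t in range(n_steps))
             for f in range(n_features)]
            for b in range(batch_size)]


# Two time steps, one batch element, two features.
attended = [[[1.0, 0.0]], [[3.0, 4.0]]]
weights = [[0.25], [0.75]]
averages = weighted_averages(weights, attended)
```

The result has shape (batch, features), i.e. the attended shape with the first dimension dropped.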

compute_weights
¶ Compute weights from energies in softmax-like fashion.
Todo
Use Softmax.
Parameters:  energies (Variable) – The energies. Must be of the same shape as the mask.
 attended_mask (Variable) – The mask for the attended. The index in the sequence must be the first dimension.
Returns: weights – Non-negative weights of the same shape as energies, summing to 1.
Return type: Variable
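A masked, softmax-like normalization over the sequence axis could look as follows. This is a sketch under stated assumptions: masked_softmax is a hypothetical helper, energies and mask use a (time, batch) layout, and every sequence is assumed to contain at least one unmasked position.

```python
import math


def masked_softmax(energies, mask):
    """energies, mask: [time][batch]; normalise over the time axis."""
    n_steps, batch_size = len(energies), len(energies[0])
    weights = [[0.0] * batch_size for _ in range(n_steps)]
    for b in range(batch_size):
        column = [energies[t][b] for t in range(n_steps)]
        peak = max(column)  # subtracted for numerical stability
        # Masked positions contribute nothing to the normalisation.
        unnormalised = [math.exp(e - peak) * mask[t][b]
                        for t, e in enumerate(column)]
        total = sum(unnormalised)
        for t in range(n_steps):
            weights[t][b] = unnormalised[t] / total
    return weights


energies = [[0.0, 1.0], [0.0, 5.0], [0.0, 1.0]]
mask = [[1, 1], [1, 0], [1, 1]]   # second sequence has a padded step
weights = masked_softmax(energies, mask)
```

The padded step of the second sequence receives zero weight even though its energy is the largest.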


class
blocks.bricks.attention.
SequenceContentAttention
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.attention.GenericSequenceAttention
,blocks.bricks.interfaces.Initializable
Attention mechanism that looks for relevant content in a sequence.
This is the attention mechanism used in [BCB]. The idea in a nutshell:
1. The states and the sequence are transformed independently.
2. The transformed states are summed with every transformed sequence element to obtain match vectors.
3. A match vector is transformed into a single number interpreted as energy.
4. Energies are normalized in softmax-like fashion. The resulting weights, which sum to one, are called attention weights.
5. A weighted average of the sequence elements with the attention weights is computed.
In terms of the AbstractAttention documentation, the sequence is the attended. The weighted averages from 5 and the attention weights from 4 form the set of glimpses produced by this attention mechanism.
Parameters:  state_names (list of str) – The names of the network states.
 attended_dim (int) – The dimension of the sequence elements.
 match_dim (int) – The dimension of the match vector.
 state_transformer (Brick) – A prototype for state transformations. If None, a linear transformation is used.
 attended_transformer (Feedforward) – The transformation to be applied to the sequence. If None, an affine transformation is used.
 energy_computer (Feedforward) – Computes energy from the match vector. If None, an affine transformation preceded by \(tanh\) is used.
Notes
See Initializable for initialization parameters.
[BCB] Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.
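The five steps listed in the class description can be traced for a single example in plain Python. The sketch below is an illustration under assumptions, not the brick's code: content_attention, matvec and the weight arguments W_state, W_seq and v are hypothetical names, biases are omitted, and there is no batch axis or mask.

```python
import math


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def content_attention(state, sequence, W_state, W_seq, v):
    """Single-example sketch: state is a vector, sequence a list of vectors."""
    transformed_state = matvec(W_state, state)             # step 1
    energies = []
    for element in sequence:
        transformed = matvec(W_seq, element)               # step 1
        match = [s + e for s, e in zip(transformed_state, transformed)]  # 2
        energies.append(
            sum(vi * math.tanh(m) for vi, m in zip(v, match)))           # 3
    peak = max(energies)
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    weights = [x / total for x in exps]                    # step 4
    dim = len(sequence[0])
    average = [sum(w * el[d] for w, el in zip(weights, sequence))
               for d in range(dim)]                        # step 5
    return average, weights


average, weights = content_attention(
    state=[0.0],
    sequence=[[1.0, 0.0], [0.0, 1.0]],
    W_state=[[1.0]],          # state_dim=1 -> match_dim=1
    W_seq=[[1.0, 1.0]],       # attended_dim=2 -> match_dim=1
    v=[1.0])
```

Both sequence elements map to the same match vector here, so they receive equal attention weights.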
compute_energies
¶

initial_glimpses
¶

preprocess
¶ Preprocess the sequence for computing attention weights.
Parameters: attended (TensorVariable) – The attended sequence, time is the 1st dimension.

take_glimpses
¶ Compute attention weights and produce glimpses.
Parameters:  attended (TensorVariable) – The sequence, time is the 1st dimension.
 preprocessed_attended (TensorVariable) – The preprocessed sequence. If None, it is computed by calling preprocess().
 attended_mask (TensorVariable) – A 0/1 mask specifying available data. 0 means that the corresponding sequence element is fake.
 **states – The states of the network.
Returns:  weighted_averages (Variable) – Linear combinations of sequence elements with the attention weights.
 weights (Variable) – The attention weights. The first dimension is batch, the second is time.

class
blocks.bricks.attention.
ShallowEnergyComputer
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.sequences.Sequence
,blocks.bricks.interfaces.Initializable
,blocks.bricks.interfaces.Feedforward
A simple energy computer: first tanh, then weighted sum.
Parameters: use_bias (bool, optional) – Whether a bias should be added to the energies. Does not change anything if softmax normalization is used to produce the attention weights, but might be useful when e.g. spherical softmax is used. 
input_dim
¶

output_dim
¶

Sequence generators¶
Recurrent networks are often used to generate/model sequences. Examples include language modelling, machine translation, handwriting synthesis, etc. A typical pattern in this context is that sequence elements are generated one after another, and every generated element is fed back into the recurrent network state. Sometimes an attention mechanism is also used to condition sequence generation on some structured input like another sequence or an image.
This module provides SequenceGenerator, which builds a sequence generating network from three main components:
 a core recurrent transition, e.g. LSTM or GatedRecurrent
 a readout component that can produce sequence elements using the network state and the information from the attention mechanism
 an attention mechanism (see attention for more information)
Implementation-wise, SequenceGenerator fully relies on BaseSequenceGenerator. At the level of the latter an attention is mandatory; moreover, it must be a part of the recurrent transition (see AttentionRecurrent).
To simulate optional attention, SequenceGenerator wraps the pure recurrent network in FakeAttentionRecurrent.

class
blocks.bricks.sequence_generators.
AbstractEmitter
(name=None, children=None)[source]¶ Bases:
blocks.bricks.base.Brick
The interface for the emitter component of a readout.

readout_dim
¶ int – The dimension of the readout. It is given by the Readout brick when allocation configuration is pushed.
Notes
An important detail about the emitter cost is that it will be evaluated with inputs of different dimensions so it has to be flexible enough to handle this. The two ways in which it can be applied are:
1. In BaseSequenceGenerator.cost_matrix(), where it will be applied to the whole sequence at once.
2. In BaseSequenceGenerator.generate(), where it will be applied to only one step of the sequence.


class
blocks.bricks.sequence_generators.
AbstractFeedback
(name=None, children=None)[source]¶ Bases:
blocks.bricks.base.Brick
The interface for the feedback component of a readout.
See also

class
blocks.bricks.sequence_generators.
AbstractReadout
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Initializable
The interface for the readout component of a sequence generator.
The readout component of a sequence generator is a bridge between the core recurrent network and the output sequence.
Parameters: 
source_names
¶ list

readout_dim
¶ int
See also
BaseSequenceGenerator
 see how exactly a readout is used
Readout
 the typically used readout brick

cost
(readouts, outputs)[source]¶ Compute generation cost of outputs given readouts.
Parameters:  readouts (Variable) – Readouts produced by the readout() method, of a (..., readout dim) shape.
 outputs (Variable) – Outputs whose cost should be computed. Should have as many or one less dimensions compared to readouts. If readouts has n dimensions, the first n - 1 dimensions of outputs should match those of readouts.

emit
(readouts)[source]¶ Produce outputs from readouts.
Parameters: readouts (Variable) – Readouts produced by the readout() method, of a (batch_size, readout_dim) shape.

initial_outputs
(batch_size)[source]¶ Compute initial outputs for the generator’s first step.
In the notation from the
BaseSequenceGenerator
documentation this method should compute \(y_0\).


class
blocks.bricks.sequence_generators.
BaseSequenceGenerator
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.interfaces.Initializable
A generic sequence generator.
This class combines two components, a readout network and an attention-equipped recurrent transition, into a context-dependent sequence generator. A third component must also be given, which forks feedback from the readout network to obtain inputs for the transition.
The class provides two methods: generate() and cost(). The former is to actually generate sequences and the latter is to compute the cost of generating given sequences.
The generation algorithm description follows.
Definitions and notation:
 States \(s_i\) of the generator are the states of the transition as specified in transition.state_names.
 Contexts of the generator are the contexts of the transition as specified in transition.context_names.
 Glimpses \(g_i\) are intermediate entities computed at every generation step from states, contexts and the previous step glimpses. They are computed in the transition’s apply method when not given or by explicitly calling the transition’s take_glimpses method. The set of glimpses considered is specified in transition.glimpse_names.
 Outputs \(y_i\) are produced at every step and form the output sequence. A generation cost \(c_i\) is assigned to each output.
Algorithm:
1. Initialization.
\[\begin{split}y_0 = readout.initial\_outputs(contexts)\\ s_0, g_0 = transition.initial\_states(contexts)\\ i = 1\\\end{split}\]
By default all recurrent bricks from recurrent have trainable initial states initialized with zeros. Subclass them or BaseRecurrent directly to get custom initial states.
2. New glimpses are computed:
\[g_i = transition.take\_glimpses(s_{i-1}, g_{i-1}, contexts)\]
3. A new output is generated by the readout and its cost is computed:
\[\begin{split}f_{i-1} = readout.feedback(y_{i-1}) \\ r_i = readout.readout(f_{i-1}, s_{i-1}, g_i, contexts) \\ y_i = readout.emit(r_i) \\ c_i = readout.cost(r_i, y_i)\end{split}\]
Note that the new glimpses and the old states are used at this step. The reason for not merging all readout methods into one is to make an efficient implementation of cost() possible.
4. New states are computed and iteration is done:
\[\begin{split}f_i = readout.feedback(y_i) \\ s_i = transition.compute\_states(s_{i-1}, g_i, fork.apply(f_i), contexts) \\ i = i + 1\end{split}\]
5. Back to step 2 if the desired sequence length has not yet been reached.
A scheme of the algorithm described above follows.
Parameters:  readout (instance of AbstractReadout) – The readout component of the sequence generator.
 transition (instance of AbstractAttentionRecurrent) – The transition component of the sequence generator.
 fork (Brick) – The brick to compute the transition’s inputs from the feedback.
See also
Initializable
 for initialization parameters
SequenceGenerator
 more user friendly interface to this brick
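The generation algorithm described in this class can be followed step by step with toy components. Everything in the sketch below is a hypothetical stand-in working on plain numbers: ToyReadout, ToyTransition and the fork lambda only mimic the method names used in the algorithm, and emit is deterministic here, whereas a real emitter would typically sample.

```python
class ToyReadout:
    def initial_outputs(self, contexts):
        return 0

    def feedback(self, output):
        return float(output)

    def readout(self, feedback, state, glimpse, contexts):
        return feedback + state + glimpse

    def emit(self, readout):
        return int(readout) % 5      # deterministic instead of sampled

    def cost(self, readout, output):
        return abs(readout - output)


class ToyTransition:
    def initial_states(self, contexts):
        return 1.0, 0.0              # s_0, g_0

    def take_glimpses(self, state, glimpse, contexts):
        return contexts[int(state) % len(contexts)]

    def compute_states(self, state, glimpse, inputs, contexts):
        return state + glimpse + inputs


def generate(readout, transition, fork, contexts, n_steps):
    y = readout.initial_outputs(contexts)                    # step 1
    s, g = transition.initial_states(contexts)
    outputs, costs = [], []
    for _ in range(n_steps):
        g = transition.take_glimpses(s, g, contexts)         # step 2
        r = readout.readout(readout.feedback(y), s, g, contexts)  # step 3
        y = readout.emit(r)
        costs.append(readout.cost(r, y))
        f = readout.feedback(y)                              # step 4
        s = transition.compute_states(s, g, fork(f), contexts)
        outputs.append(y)
    return outputs, costs


outputs, costs = generate(ToyReadout(), ToyTransition(), lambda f: 0.5 * f,
                          contexts=[1.0, 2.0], n_steps=3)
```

Note how the readout in step 3 sees the new glimpse but the old state, exactly as stated in the algorithm.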

cost
¶ Returns the average cost over the minibatch.
The cost is computed by averaging the sum of per token costs for each sequence over the minibatch.
Warning
Note that the computed cost can be problematic when batches consist of vastly different sequence lengths.
Parameters:  outputs (TensorVariable) – The 3(2) dimensional tensor containing output sequences. The axis 0 must stand for time, the axis 1 for the position in the batch.
 mask (TensorVariable) – The binary matrix identifying fake outputs.
Returns: cost – Theano variable for cost, computed by summing over timesteps and then averaging over the minibatch.
Return type: Variable
Notes
The contexts are expected as keyword arguments.
Adds an average cost per sequence element AUXILIARY variable to the computational graph with the name per_sequence_element.
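The two aggregates mentioned for cost() can be worked through on a small cost matrix. This is a sketch under assumptions: aggregate_costs is a hypothetical helper, the layout is (time, batch), and the mask is taken to be 1 for real outputs and 0 for padding.

```python
def aggregate_costs(cost_matrix, mask):
    """cost_matrix, mask: [time][batch]; mask is 1 for real outputs."""
    batch_size = len(cost_matrix[0])
    masked = [[c * m for c, m in zip(c_row, m_row)]
              for c_row, m_row in zip(cost_matrix, mask)]
    total = sum(sum(row) for row in masked)
    n_real = sum(sum(row) for row in mask)
    # Sum over time then average over the batch, and the per-element mean.
    return total / batch_size, total / n_real


cost_matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [[1, 1], [1, 1], [1, 0]]     # second sequence has length 2
cost, per_element = aggregate_costs(cost_matrix, mask)
```

Because the first aggregate sums over time, longer sequences contribute more to it, which is why batches with very different lengths deserve care.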

generate
¶ A sequence generation step.
Parameters: outputs (TensorVariable) – The outputs from the previous step.
Notes
The contexts, previous states and glimpses are expected as keyword arguments.

initial_states
¶

class
blocks.bricks.sequence_generators.
FakeAttentionRecurrent
(transition, **kwargs)[source]¶ Bases:
blocks.bricks.attention.AbstractAttentionRecurrent
,blocks.bricks.interfaces.Initializable
Adds fake attention interface to a transition.
BaseSequenceGenerator requires its transition brick to support the AbstractAttentionRecurrent interface, that is to have an embedded attention mechanism. For the cases when no attention is required (e.g. language modeling or encoder-decoder models), FakeAttentionRecurrent is used to wrap a usual recurrent brick. The resulting brick has no glimpses and simply passes all states and contexts to the wrapped one.
Todo
Get rid of this brick and support attention-less transitions in BaseSequenceGenerator.
apply
¶

compute_states
¶

initial_states
¶

take_glimpses
¶


class
blocks.bricks.sequence_generators.
LookupFeedback
(num_outputs=None, feedback_dim=None, **kwargs)[source]¶ Bases:
blocks.bricks.sequence_generators.AbstractFeedback
,blocks.bricks.interfaces.Initializable
A feedback brick for the case when readouts are integers.
Stores and retrieves distributed representations of integers.

feedback
¶


class
blocks.bricks.sequence_generators.
Readout
(emitter=None, feedback_brick=None, merge=None, merge_prototype=None, post_merge=None, merged_dim=None, **kwargs)[source]¶ Bases:
blocks.bricks.sequence_generators.AbstractReadout
Readout brick with separated emitter and feedback parts.
Readout combines a few bits and pieces into an object that can be used as the readout component in BaseSequenceGenerator. This includes an emitter brick, to which emit(), cost() and initial_outputs() calls are delegated, a feedback brick to which feedback() functionality is delegated, and a pipeline to actually compute readouts from all the sources (see the source_names attribute of AbstractReadout).
The readout computation pipeline is constructed from the merge and post_merge bricks, whose responsibilities are described in the respective docstrings.
Parameters:  emitter (an instance of AbstractEmitter) – The emitter component.
 feedback_brick (an instance of AbstractFeedback) – The feedback component.
 merge (Brick, optional) – A brick that takes the sources given in source_names as an input and combines them into a single output. If given, merge_prototype cannot be given.
 merge_prototype (Feedforward, optional) – If merge isn’t given, the transformation given by merge_prototype is applied to each input before being summed. By default a Linear transformation without biases is used. If given, merge cannot be given.
 post_merge (Feedforward, optional) – This transformation is applied to the merged inputs. By default Bias is used.
 merged_dim (int, optional) – The input dimension of post_merge i.e. the output dimension of merge (or merge_prototype). If not given, it is assumed to be the same as readout_dim (i.e. post_merge is assumed to not change dimensions).
 **kwargs (dict) – Passed to the parent’s constructor.
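The default pipeline described by the merge_prototype and post_merge parameters (a bias-free linear map per source, a sum, then a bias) can be sketched numerically. The helper names linear and readout_pipeline and the weight values are hypothetical and only illustrate the data flow.

```python
def linear(weights, vector):
    # Bias-free linear transformation: one output per weight row.
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def readout_pipeline(sources, merge_weights, bias):
    # merge: transform every source separately, then sum the results.
    transformed = [linear(merge_weights[name], value)
                   for name, value in sources.items()]
    merged = [sum(column) for column in zip(*transformed)]
    # post_merge: add a bias, leaving the dimension unchanged.
    return [m + b for m, b in zip(merged, bias)]


sources = {'states': [1.0, 2.0], 'feedback': [3.0]}
merge_weights = {'states': [[1.0, 0.0], [0.0, 1.0]],
                 'feedback': [[1.0], [2.0]]}
readouts = readout_pipeline(sources, merge_weights, bias=[0.5, -0.5])
```

Sources of different dimensions are all mapped to the merged dimension before being summed, which in this sketch equals the readout dimension.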

cost
¶

emit
¶

feedback
¶

initial_outputs
¶

readout
¶

class
blocks.bricks.sequence_generators.
SequenceGenerator
(readout, transition, attention=None, add_contexts=True, **kwargs)[source]¶ Bases:
blocks.bricks.sequence_generators.BaseSequenceGenerator
A more user-friendly interface for BaseSequenceGenerator.
Parameters:  readout (instance of AbstractReadout) – The readout component for the sequence generator.
 transition (instance of BaseRecurrent) – The recurrent transition to be used in the sequence generator. Will be combined with attention, if that one is given.
 attention (object, optional) – The attention mechanism to be added to transition, an instance of AbstractAttention.
 add_contexts (bool) – If True, the AttentionRecurrent wrapping the transition will add additional contexts for the attended and its mask.
 **kwargs (dict) – All keyword arguments are passed to the base class. If the fork keyword argument is not provided, a Fork is created that forks all transition sequential inputs without a “mask” substring in them.

class
blocks.bricks.sequence_generators.
SoftmaxEmitter
(initial_output=0, **kwargs)[source]¶ Bases:
blocks.bricks.sequence_generators.AbstractEmitter
,blocks.bricks.interfaces.Initializable
,blocks.bricks.interfaces.Random
A softmax emitter for the case of integer outputs.
Interprets readout elements as energies corresponding to their indices.
Parameters: initial_output (int or a scalar Variable) – The initial output.
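Interpreting readout elements as energies of their indices can be sketched as follows. The functions probs, emit and cost below are hypothetical plain-Python counterparts of the like-named methods and work on a single readout vector; the use of the negative log-probability as cost is an assumption of this sketch.

```python
import math
import random


def probs(readout):
    # Softmax over the readout elements, one probability per index.
    peak = max(readout)
    exps = [math.exp(r - peak) for r in readout]
    total = sum(exps)
    return [e / total for e in exps]


def emit(readout, rng):
    # Sample an index in proportion to its softmax probability.
    return rng.choices(range(len(readout)), weights=probs(readout))[0]


def cost(readout, output):
    # Negative log-probability of the chosen index.
    return -math.log(probs(readout)[output])


rng = random.Random(0)
readout = [0.0, 0.0, 10.0]
sample = emit(readout, rng)
uniform_cost = cost([0.0, 0.0], 1)
```

With two equal energies each index has probability one half, so the cost of either output is log 2.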
cost
¶

emit
¶

initial_outputs
¶

probs
¶


class
blocks.bricks.sequence_generators.
TrivialEmitter
(*args, **kwargs)[source]¶ Bases:
blocks.bricks.sequence_generators.AbstractEmitter
An emitter for the trivial case when readouts are outputs.
Parameters: readout_dim (int) – The dimension of the readout.
Notes
By default cost() always returns a zero tensor.
cost
¶

emit
¶

initial_outputs
¶

Cost bricks¶

class
blocks.bricks.cost.
AbsoluteError
(name=None, children=None)[source]¶ Bases:
blocks.bricks.cost.CostMatrix

cost_matrix
¶


class
blocks.bricks.cost.
BinaryCrossEntropy
(name=None, children=None)[source]¶ Bases:
blocks.bricks.cost.CostMatrix

cost_matrix
¶


class
blocks.bricks.cost.
CategoricalCrossEntropy
(name=None, children=None)[source]¶ Bases:
blocks.bricks.cost.Cost

apply
¶


class
blocks.bricks.cost.
Cost
(name=None, children=None)[source]¶ Bases:
blocks.bricks.base.Brick

apply
¶


class
blocks.bricks.cost.
CostMatrix
(name=None, children=None)[source]¶ Bases:
blocks.bricks.cost.Cost
Base class for costs which can be calculated elementwise.
Assumes that the data has format (batch, features).

apply
¶

cost_matrix
¶


class
blocks.bricks.cost.
MisclassificationRate
(top_k=1)[source]¶ Bases:
blocks.bricks.cost.Cost
Calculates the misclassification rate for a minibatch.
Parameters: top_k (int, optional) – If the ground truth class is within the top_k highest responses for a given example, the model is considered to have predicted correctly. Default: 1.
Notes
Ties for top_k-th place are broken pessimistically, i.e. in the (in practice, rare) case that there is a tie for top_k-th highest output for a given example, it is considered an incorrect prediction.
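One way to realize the pessimistic tie-breaking described in the notes is to count how many other classes score at least as high as the true class. The function below is a plain-Python sketch under that assumption, not the brick's Theano code.

```python
def misclassification_rate(targets, scores, top_k=1):
    """targets: class index per example; scores: [example][class]."""
    mistakes = 0
    for target, row in zip(targets, scores):
        truth = row[target]
        # Count the other classes scoring at least as high as the truth;
        # using >= makes ties count against the model.
        at_least_as_high = sum(1 for s in row if s >= truth) - 1
        if at_least_as_high >= top_k:
            mistakes += 1
    return mistakes / len(targets)


scores = [[0.1, 0.7, 0.2],    # target 1 has the single highest score
          [0.5, 0.5, 0.0],    # target 0 ties for first place
          [0.6, 0.3, 0.1]]    # target 1 is second best
targets = [1, 0, 1]
top1 = misclassification_rate(targets, scores, top_k=1)
top2 = misclassification_rate(targets, scores, top_k=2)
```

With top_k=1 the tied example counts as a mistake; with top_k=2 both it and the second-best example count as correct.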

apply
¶


class
blocks.bricks.cost.
SquaredError
(name=None, children=None)[source]¶ Bases:
blocks.bricks.cost.CostMatrix

cost_matrix
¶

Wrapper bricks¶

class
blocks.bricks.wrappers.
BrickWrapper
[source]¶ Bases:
object
Base class for wrapper metaclasses.
Sometimes one wants to extend a brick with the capability to handle inputs different from what it was designed to handle. A typical example is inputs with more dimensions than were foreseen at the development stage. One way to proceed in such a situation is to write a decorator that wraps all application methods of the brick class with some additional logic before and after the application call. BrickWrapper serves as a convenient base class for such decorators.
Note that since directly applying a decorator to a Brick subclass will only take place after __new__() is called, subclasses of BrickWrapper should be applied by setting the decorators attribute of the new brick class, like in the example below:
>>> from blocks.bricks.base import Brick
>>> from blocks.bricks.wrappers import WithExtraDims
>>> class WrappedBrick(Brick):
...     decorators = [WithExtraDims()]

wrap
(wrapped, namespace)[source]¶ Wrap an application of the base brick.
This method should be overridden to write into its namespace argument all required changes.
Parameters:  mcs (type) – The metaclass.
 wrapped (Application) – The application to be wrapped.
 namespace (dict) – The namespace of the class being created.


class
blocks.bricks.wrappers.
WithExtraDims
[source]¶ Bases:
blocks.bricks.wrappers.BrickWrapper
Wraps a brick’s applications to handle inputs with extra dimensions.
A brick can often be reused even when data has more dimensions than in the default setting. An example is a situation when one wants to apply categorical_cross_entropy() to temporal data, that is when an additional ‘time’ axis is prepended to both its x and y inputs.
This wrapper adds the reshapes required to use application methods of a brick with such data by merging the extra dimensions with the first non-extra one. Two key assumptions are made: that all inputs and outputs have the same number of extra dimensions and that these extra dimensions are equal throughout all inputs and outputs.
While this might be inconvenient, the wrapped brick does not try to guess the number of extra dimensions, but demands it as an argument. The considerations of simplicity and reliability motivated this design choice. Upon availability in Blocks of a mechanism to request the expected number of dimensions for an input of a brick, this can be reconsidered.
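The reshaping idea can be demonstrated with nested lists in place of tensors. The sketch below is an analogue under assumptions rather than the wrapper itself: flatten, unflatten, with_extra_dims and squared_error are hypothetical names, and the number of extra dimensions is passed explicitly through an extra_ndim keyword, mirroring the design choice explained above.

```python
def flatten(nested, extra_ndim):
    """Merge `extra_ndim` leading list levels into the next one."""
    sizes = []
    for _ in range(extra_ndim):
        sizes.append(len(nested))
        nested = [item for sub in nested for item in sub]
    return nested, sizes


def unflatten(flat, sizes):
    """Undo `flatten` by splitting back into the recorded sizes."""
    for size in reversed(sizes):
        chunk = len(flat) // size
        flat = [flat[i * chunk:(i + 1) * chunk] for i in range(size)]
    return flat


def with_extra_dims(apply):
    def wrapped(*inputs, extra_ndim=0):
        flat_inputs, sizes = [], []
        for value in inputs:
            # All inputs are assumed to share the same extra dimensions.
            flat, sizes = flatten(value, extra_ndim)
            flat_inputs.append(flat)
        return unflatten(apply(*flat_inputs), sizes)
    return wrapped


@with_extra_dims
def squared_error(x, y):
    # Designed for a plain batch: one number per example.
    return [(a - b) ** 2 for a, b in zip(x, y)]


# A 'time' axis is prepended to both inputs: shape (time=2, batch=3).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = [[1.0, 0.0, 3.0], [2.0, 5.0, 5.0]]
errors = squared_error(x, y, extra_ndim=1)
```

The wrapped function never sees the time axis; it is merged into the batch axis before the call and restored afterwards.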