Algorithms

class blocks.algorithms.AdaDelta(decay_rate=0.95, epsilon=1e-06)
Bases: blocks.algorithms.StepRule
Adapts the step size over time using only first-order information.
Parameters:
 decay_rate – Defaults to 0.95.
 epsilon – Defaults to 1e-06.
Notes
For more information, see [ADADELTA].
[ADADELTA] Matthew D. Zeiler, ADADELTA: An Adaptive Learning Rate Method, arXiv:1212.5701.
compute_step(parameter, previous_step)
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(), so subclasses do not have to write the loop over parameters themselves.
Parameters:
 parameter (TensorSharedVariable) – The parameter.
 previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:
 step (Variable) – Theano variable for the step to take.
 updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum, which need to update shared variables after iterations.
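The update described in [ADADELTA] can be sketched in plain Python for a single scalar parameter. This is an illustration of the rule from the paper and not the Blocks implementation: the function name adadelta_step and the explicit state tuple are assumptions made for the example, and no Theano is involved.

```python
import math


def adadelta_step(previous_step, state, decay_rate=0.95, epsilon=1e-6):
    """Return (step, new_state) for one scalar parameter.

    `state` is a pair (mean_square_grad, mean_square_step) of running
    averages. Hypothetical helper, for illustration only.
    """
    mean_square_grad, mean_square_step = state
    # Running average of the squared gradients.
    mean_square_grad = (decay_rate * mean_square_grad
                        + (1 - decay_rate) * previous_step ** 2)
    # The ratio of the two RMS values replaces a hand-tuned learning rate.
    rms_step = math.sqrt(mean_square_step + epsilon)
    rms_grad = math.sqrt(mean_square_grad + epsilon)
    step = rms_step / rms_grad * previous_step
    # Running average of the squared steps, including the one just computed.
    mean_square_step = (decay_rate * mean_square_step
                        + (1 - decay_rate) * step ** 2)
    return step, (mean_square_grad, mean_square_step)


# Minimise f(x) = x ** 2, whose gradient is 2 * x. The step is subtracted.
x, state = 1.0, (0.0, 0.0)
for _ in range(100):
    step, state = adadelta_step(2 * x, state)
    x -= step
```

Only the gradient and two running averages are needed, which is what "first-order information" refers to.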


class blocks.algorithms.AdaGrad(learning_rate=0.002, epsilon=1e-06)
Bases: blocks.algorithms.StepRule
Implements the AdaGrad learning rule.
Parameters:
 learning_rate – Defaults to 0.002.
 epsilon – Defaults to 1e-06.
Notes
For more information, see [ADAGRAD].
[ADAGRAD] Duchi J, Hazan E, Singer Y., Adaptive subgradient methods for online learning and stochastic optimization, http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
compute_step(parameter, previous_step)
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(), so subclasses do not have to write the loop over parameters themselves.
Parameters:
 parameter (TensorSharedVariable) – The parameter.
 previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:
 step (Variable) – Theano variable for the step to take.
 updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum, which need to update shared variables after iterations.
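A minimal plain-Python sketch of the AdaGrad idea from [ADAGRAD], for one scalar parameter. The helper name adagrad_step is hypothetical, and placing epsilon outside the square root is an assumption. Implementations differ on that detail, so this should not be read as the exact Blocks expression.

```python
import math


def adagrad_step(previous_step, sum_squares, learning_rate=0.002,
                 epsilon=1e-6):
    """Return (step, new_sum_squares) for one scalar parameter."""
    # Accumulate the squared gradients seen so far (never decayed).
    sum_squares = sum_squares + previous_step ** 2
    # Divide by the root of the accumulator, so frequently large
    # gradients lead to progressively smaller steps.
    step = learning_rate * previous_step / (math.sqrt(sum_squares) + epsilon)
    return step, sum_squares


# Feeding the same gradient three times yields shrinking steps.
sum_squares = 0.0
steps = []
for _ in range(3):
    step, sum_squares = adagrad_step(2.0, sum_squares)
    steps.append(step)
```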


class blocks.algorithms.Adam(learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1)
Bases: blocks.algorithms.StepRule
Adam optimizer as described in [King2014].
[King2014] Diederik Kingma, Jimmy Ba, Adam: A Method for Stochastic Optimization, http://arxiv.org/abs/1412.6980
Parameters:
 learning_rate (float, optional) – Step size. Default value is set to 0.002.
 beta1 (float, optional) – Exponential decay rate for the first moment estimates. Default value is set to 0.9.
 beta2 (float, optional) – Exponential decay rate for the second moment estimates. Default value is set to 0.999.
 epsilon (float, optional) – Default value is set to 1e-8.
 decay_factor (float, optional) – Default value is set to 1.

compute_step(parameter, previous_step)
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(), so subclasses do not have to write the loop over parameters themselves.
Parameters:
 parameter (TensorSharedVariable) – The parameter.
 previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:
 step (Variable) – Theano variable for the step to take.
 updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum, which need to update shared variables after iterations.
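The roles of beta1, beta2 and epsilon can be illustrated with the commonly cited form of the Adam update from [King2014], written here in plain Python for one scalar parameter. The helper adam_step and its state tuple are assumptions made for the example. The decay_factor argument is left out because this page does not describe its effect, so the sketch is not equivalent to the Blocks rule.

```python
import math


def adam_step(previous_step, state, learning_rate=0.002, beta1=0.9,
              beta2=0.999, epsilon=1e-8):
    """Return (step, new_state); `state` is (mean, variance, time)."""
    mean, variance, time = state
    time += 1
    # Exponential moving averages of the gradient and its square.
    mean = beta1 * mean + (1 - beta1) * previous_step
    variance = beta2 * variance + (1 - beta2) * previous_step ** 2
    # Bias correction compensates for the zero initialisation.
    mean_hat = mean / (1 - beta1 ** time)
    variance_hat = variance / (1 - beta2 ** time)
    step = learning_rate * mean_hat / (math.sqrt(variance_hat) + epsilon)
    return step, (mean, variance, time)


# On the first call the step magnitude is close to the learning rate.
step, state = adam_step(2.0, (0.0, 0.0, 0))
```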

class blocks.algorithms.BasicMomentum(momentum=0.0)
Bases: blocks.algorithms.StepRule
Accumulates step with exponential discount.
Parameters: momentum (float, optional) – The momentum coefficient. Defaults to 0.
Notes
This step rule is intended to be used in conjunction with another step rule, e.g. Scale. For an all-batteries-included experience, look at Momentum.
compute_step(parameter, previous_step)
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(), so subclasses do not have to write the loop over parameters themselves.
Parameters:
 parameter (TensorSharedVariable) – The parameter.
 previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:
 step (Variable) – Theano variable for the step to take.
 updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum, which need to update shared variables after iterations.
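One way to read "accumulates step with exponential discount" is sketched below in plain Python for a scalar step. The helper basic_momentum_step and the explicit velocity value are assumptions for illustration. In Blocks the velocity would live in a shared variable and be refreshed through the returned updates.

```python
def basic_momentum_step(previous_step, velocity, momentum=0.0):
    """Return (step, new_velocity) for one scalar parameter."""
    # The old velocity is discounted by `momentum`, then the incoming
    # step is added on top.
    step = momentum * velocity + previous_step
    # The step just computed becomes the velocity for the next call.
    return step, step


# A constant incoming step of 1.0 builds up over successive iterations.
velocity = 0.0
steps = []
for _ in range(3):
    step, velocity = basic_momentum_step(1.0, velocity, momentum=0.9)
    steps.append(step)
```

With momentum set to 0 the rule returns its input unchanged, which is why it is meant to be combined with another rule such as Scale.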


class blocks.algorithms.BasicRMSProp(decay_rate=0.9, max_scaling=100000.0)
Bases: blocks.algorithms.StepRule
Scales the step size by a running average of the recent step norms.
Parameters:
 decay_rate – Defaults to 0.9.
 max_scaling – Defaults to 100000.0.
Notes
This step rule is intended to be used in conjunction with another step rule, e.g. Scale. For an all-batteries-included experience, look at RMSProp.
In general, this step rule should be used before other step rules, because it has normalization properties that may undo their work. For instance, it should be applied first when used in conjunction with Scale.
For more information, see [Hint2014].

compute_step(parameter, previous_step)
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(), so subclasses do not have to write the loop over parameters themselves.
Parameters:
 parameter (TensorSharedVariable) – The parameter.
 previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:
 step (Variable) – Theano variable for the step to take.
 updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum, which need to update shared variables after iterations.
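The normalization performed by this rule can be sketched in plain Python for a scalar step. The helper basic_rmsprop_step is hypothetical. The sketch assumes that max_scaling limits the amplification by flooring the running RMS at 1 / max_scaling, which is one plausible reading of the parameter and is not confirmed by this page.

```python
import math


def basic_rmsprop_step(previous_step, mean_square, decay_rate=0.9,
                       max_scaling=1e5):
    """Return (step, new_mean_square) for one scalar parameter."""
    # Running average of the squared incoming steps.
    mean_square = (decay_rate * mean_square
                   + (1 - decay_rate) * previous_step ** 2)
    # Floor the RMS so the step is never scaled up by more than max_scaling.
    rms = max(math.sqrt(mean_square), 1.0 / max_scaling)
    return previous_step / rms, mean_square


# An ordinary step is divided by its running RMS ...
step, mean_square = basic_rmsprop_step(2.0, 0.0)
# ... while a tiny step is amplified by at most max_scaling.
tiny_step, _ = basic_rmsprop_step(1e-12, 0.0)
```

Because the output is normalized, a Scale applied before this rule would largely be undone, which matches the advice above to apply this rule first.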


class blocks.algorithms.CompositeRule(components)
Bases: blocks.algorithms.StepRule
Chains several step rules.
Parameters: components (list of StepRule) – The learning rules to be chained. The rules will be applied in the order given.
compute_steps(previous_steps)
Build a Theano expression for steps for all parameters.
Override this method if you want to process the steps with respect to all parameters as a whole, not parameter-wise.
Parameters: previous_steps (OrderedDict) – An OrderedDict of (TensorSharedVariable, TensorVariable) pairs. The keys are the parameters being trained, the values are the expressions for quantities related to gradients of the cost with respect to the parameters, either the gradients themselves or steps in related directions.
Returns:
 steps (OrderedDict) – A dictionary of the proposed steps in the same form as previous_steps.
 updates (list) – A list of tuples representing updates to be performed.
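Chaining can be pictured as feeding the steps returned by one rule into the next while collecting all updates. The sketch below uses plain functions and floats instead of StepRule objects and Theano variables. The names scale, clip_to and chain are hypothetical stand-ins and not part of Blocks.

```python
from collections import OrderedDict


def scale(learning_rate):
    """Rule that multiplies every step by `learning_rate`."""
    def rule(previous_steps):
        steps = OrderedDict((parameter, learning_rate * step)
                            for parameter, step in previous_steps.items())
        return steps, []
    return rule


def clip_to(limit):
    """Rule that clips every step to the interval [-limit, limit]."""
    def rule(previous_steps):
        steps = OrderedDict((parameter, max(-limit, min(limit, step)))
                            for parameter, step in previous_steps.items())
        return steps, []
    return rule


def chain(components):
    """Apply the component rules in the order given."""
    def rule(previous_steps):
        steps, updates = previous_steps, []
        for component in components:
            steps, more_updates = component(steps)
            updates.extend(more_updates)
        return steps, updates
    return rule


gradients = OrderedDict([("w", 10.0)])
clip_then_scale, _ = chain([clip_to(1.0), scale(0.1)])(gradients)
scale_then_clip, _ = chain([scale(0.1), clip_to(1.0)])(gradients)
```

The two chains give different results for the same input (roughly 0.1 versus 1.0), which is why the order of the components matters.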


class blocks.algorithms.GradientDescent(cost=None, parameters=None, step_rule=None, gradients=None, known_grads=None, consider_constant=None, **kwargs)
Bases: blocks.algorithms.UpdatesAlgorithm
A base class for all gradient descent algorithms.
By “gradient descent” we mean a training algorithm of the following form:
    for batch in data:
        steps = step_rule.compute_steps(parameters, gradients_wr_parameters)
        for parameter in parameters:
            parameter -= steps[parameter]
Note that the step is subtracted, not added! This is done in order to make step rule chaining possible.
Parameters:
 cost (TensorVariable, optional) – The objective to be minimized. Unused if gradients is specified.
 parameters (list of TensorSharedVariable, optional) – The parameters to be tuned. If not provided, inferred from the keys of gradients (in which case gradients must be an OrderedDict).
 step_rule (instance of StepRule, optional) – An object encapsulating most of the algorithm’s logic. Its compute_steps method is called to get Theano expressions for the steps. Note that the step rule might have a state, e.g. to remember a weighted sum of gradients from previous steps, as is done in gradient descent with momentum. If None, an instance of Scale is created.
 gradients (OrderedDict or list of 2-tuples, optional) – A dictionary mapping a parameter to an expression for the cost’s gradient with respect to the parameter, or equivalently, a list of (parameter, gradient) tuples. If None, the gradients are taken automatically using theano.gradient.grad().
 known_grads (dict, optional) – A pass-through to theano.tensor.grad’s known_grads argument. Useful when you know the [approximate] gradients of some sub-expressions and would like Theano to use that information to compute parameter gradients. Only makes sense when gradients is None.
 consider_constant (list, optional) – A pass-through to theano.tensor.grad’s consider_constant argument. A list of expressions through which gradients will not be backpropagated. Only makes sense when gradients is None.

gradients
OrderedDict – The gradient dictionary.
Notes
Changing the updates attribute or calling add_updates after the initialize method has been called will have no effect.
If a cost and parameters are provided, gradients are taken immediately upon construction, and changes to these attributes after construction will have no effect.
gradients must be an OrderedDict if parameters is unspecified because ordinary dictionaries have an unpredictable iteration order due to hash randomization (which is enabled by default since versions 2.7.3 and 3.2.3 of Python). This source of variability, when combined with Theano’s heuristic graph optimizations, can cause serious reproducibility issues.
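The training loop described for this class can be made concrete with a plain-Python sketch that needs neither Theano nor Blocks. The names scale_steps and gradients_of_cost, the toy cost (w - target) ** 2 and the use of strings as parameter keys are all assumptions made for the example.

```python
from collections import OrderedDict


def scale_steps(previous_steps, learning_rate=0.1):
    """Stand-in for a step rule: scale every gradient by the learning rate."""
    return OrderedDict((name, learning_rate * gradient)
                       for name, gradient in previous_steps.items())


def gradients_of_cost(parameters, target):
    """Gradients of the toy cost (w - target) ** 2 for each parameter."""
    return OrderedDict((name, 2 * (value - target))
                       for name, value in parameters.items())


parameters = OrderedDict([("w", 0.0)])
data = [3.0] * 100  # each "batch" is simply the target value

for batch in data:
    steps = scale_steps(gradients_of_cost(parameters, batch))
    for name in parameters:
        # The step is subtracted, not added.
        parameters[name] -= steps[name]
```

Because every rule returns a quantity that is eventually subtracted, the output of one rule can be passed to another without tracking signs, which is what makes chaining straightforward.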

class blocks.algorithms.Momentum(learning_rate=1.0, momentum=0.0)
Bases: blocks.algorithms.CompositeRule
Accumulates step with exponential discount.
Combines BasicMomentum and Scale to form the usual momentum step rule.
Parameters:
 learning_rate – Defaults to 1.0.
 momentum – Defaults to 0.0.
learning_rate
SharedVariable – A variable for learning rate.

momentum
SharedVariable – A variable for momentum.
See also
SharedVariableModifier


class blocks.algorithms.RMSProp(learning_rate=1.0, decay_rate=0.9, max_scaling=100000.0)
Bases: blocks.algorithms.CompositeRule
Scales the step size by a running average of the recent step norms.
Combines BasicRMSProp and Scale to form the step rule described in [Hint2014].
[Hint2014] Geoff Hinton, Neural Networks for Machine Learning, lecture 6a, http://cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Parameters:
 learning_rate (float, optional) – The learning rate by which the previous step is scaled. Defaults to 1.
 decay_rate (float, optional) – How fast the running average decays (lower is faster). Defaults to 0.9.
 max_scaling (float, optional) – Maximum scaling of the step size, in case the running average is really small. Defaults to 1e5.

learning_rate
SharedVariable – A variable for learning rate.

decay_rate
SharedVariable – A variable for decay rate.
See also
SharedVariableModifier

class blocks.algorithms.RemoveNotFinite(scaler=1)
Bases: blocks.algorithms.StepRule
A step rule that skips steps with non-finite elements.
Replaces a step (the parameter update of a single shared variable) which contains non-finite elements (such as inf or NaN) with a step rescaling the parameters.
Parameters: scaler (float, optional) – The scaling applied to the parameter in case the step contains non-finite elements. Defaults to 1, which means that parameters will not be changed.
Notes
This rule should be applied last!
This trick was originally used in the GroundHog framework.

compute_step(parameter, previous_step)
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(), so subclasses do not have to write the loop over parameters themselves.
Parameters:
 parameter (TensorSharedVariable) – The parameter.
 previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:
 step (Variable) – Theano variable for the step to take.
 updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum, which need to update shared variables after iterations.
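The replacement behaviour can be sketched in plain Python with lists of floats standing in for tensors. The helper remove_not_finite_step is hypothetical. The sketch assumes that "a step rescaling the parameters" means a step chosen so that parameter - step equals scaler * parameter.

```python
import math


def remove_not_finite_step(parameter, previous_step, scaler=1.0):
    """Return the step to subtract from `parameter` (both lists of floats)."""
    if all(math.isfinite(value) for value in previous_step):
        # Nothing wrong with the step: pass it through unchanged.
        return list(previous_step)
    # Otherwise pick the step for which parameter - step == scaler * parameter.
    return [(1 - scaler) * value for value in parameter]


parameter = [1.0, -2.0]
good = remove_not_finite_step(parameter, [0.1, 0.2])
skipped = remove_not_finite_step(parameter, [0.1, float("nan")])
shrunk = remove_not_finite_step(parameter, [float("inf"), 0.0], scaler=0.5)
```

With the default scaler of 1 the replacement step is all zeros, so the parameter is left unchanged for that iteration.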


class blocks.algorithms.Restrict(step_rule, variables)
Bases: blocks.algorithms.StepRule
Applies a given StepRule only to certain variables.
Example applications include clipping steps on only certain parameters, or scaling a certain kind of parameter’s updates (e.g. adding an additional scalar multiplier to the steps taken on convolutional filters).
Parameters:
 step_rule – The StepRule to apply.
 variables – The variables to which step_rule is applied.
compute_steps(previous_steps)
Build a Theano expression for steps for all parameters.
Override this method if you want to process the steps with respect to all parameters as a whole, not parameter-wise.
Parameters: previous_steps (OrderedDict) – An OrderedDict of (TensorSharedVariable, TensorVariable) pairs. The keys are the parameters being trained, the values are the expressions for quantities related to gradients of the cost with respect to the parameters, either the gradients themselves or steps in related directions.
Returns:
 steps (OrderedDict) – A dictionary of the proposed steps in the same form as previous_steps.
 updates (list) – A list of tuples representing updates to be performed.
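A plain-Python sketch of restricting a rule to a subset of the parameters. The names restrict and halve and the string parameter keys are assumptions for illustration. How Blocks treats variables that are missing from previous_steps is not covered here.

```python
from collections import OrderedDict


def halve(previous_steps):
    """Example rule: multiply every step it receives by 0.5."""
    steps = OrderedDict((parameter, 0.5 * step)
                        for parameter, step in previous_steps.items())
    return steps, []


def restrict(rule, variables):
    """Apply `rule` only to `variables`; other steps pass through unchanged."""
    selected = set(variables)

    def compute_steps(previous_steps):
        chosen = OrderedDict((parameter, step)
                             for parameter, step in previous_steps.items()
                             if parameter in selected)
        new_steps, updates = rule(chosen)
        # Keep the original ordering, swapping in the restricted results.
        steps = OrderedDict((parameter, new_steps.get(parameter, step))
                            for parameter, step in previous_steps.items())
        return steps, updates

    return compute_steps


previous_steps = OrderedDict([("conv_filter", 2.0), ("bias", 2.0)])
steps, updates = restrict(halve, ["conv_filter"])(previous_steps)
```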


class blocks.algorithms.Scale(learning_rate=1.0)
Bases: blocks.algorithms.StepRule
A step in the direction proportional to the previous step.
If used in GradientDescent alone, this step rule implements steepest descent.
Parameters: learning_rate (float) – The learning rate by which the previous step is multiplied to produce the step.
learning_rate
TensorSharedVariable – The shared variable storing the learning rate used.

compute_step(parameter, previous_step)
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(), so subclasses do not have to write the loop over parameters themselves.
Parameters:
 parameter (TensorSharedVariable) – The parameter.
 previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:
 step (Variable) – Theano variable for the step to take.
 updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum, which need to update shared variables after iterations.


class blocks.algorithms.StepClipping(threshold=None)
Bases: blocks.algorithms.StepRule
Rescales an entire step if its L2 norm exceeds a threshold.
When the previous steps are the gradients, this step rule performs gradient clipping.
Parameters: threshold (float, optional) – The maximum permitted L2 norm for the step. The step will be rescaled so that its norm does not exceed this quantity. If None, no rescaling will be applied.
threshold
tensor.TensorSharedVariable – The shared variable storing the clipping threshold used.

compute_steps(previous_steps)
Build a Theano expression for steps for all parameters.
Override this method if you want to process the steps with respect to all parameters as a whole, not parameter-wise.
Parameters: previous_steps (OrderedDict) – An OrderedDict of (TensorSharedVariable, TensorVariable) pairs. The keys are the parameters being trained, the values are the expressions for quantities related to gradients of the cost with respect to the parameters, either the gradients themselves or steps in related directions.
Returns:
 steps (OrderedDict) – A dictionary of the proposed steps in the same form as previous_steps.
 updates (list) – A list of tuples representing updates to be performed.
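Rescaling by the L2 norm can be sketched in plain Python with lists of floats in place of tensors. The helper clip_steps is hypothetical. The sketch assumes that "an entire step" refers to the steps for all parameters taken together, which is consistent with this rule overriding compute_steps rather than compute_step.

```python
import math
from collections import OrderedDict


def clip_steps(previous_steps, threshold=None):
    """Rescale all steps jointly so their overall L2 norm is <= threshold."""
    if threshold is None:
        return OrderedDict(previous_steps)
    # L2 norm over every element of every step.
    norm = math.sqrt(sum(value ** 2
                         for step in previous_steps.values()
                         for value in step))
    # Only shrink; steps that are already short enough are left alone.
    multiplier = threshold / norm if norm > threshold else 1.0
    return OrderedDict((parameter, [multiplier * value for value in step])
                       for parameter, step in previous_steps.items())


gradients = OrderedDict([("w", [3.0]), ("b", [4.0])])  # joint norm is 5
clipped = clip_steps(gradients, threshold=1.0)
untouched = clip_steps(gradients, threshold=10.0)
```

All steps are multiplied by the same factor, so the direction of the overall update is preserved and only its length changes.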


class blocks.algorithms.StepRule
Bases: object
A rule to compute steps for a gradient descent algorithm.

compute_step(parameter, previous_step)
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(), so subclasses do not have to write the loop over parameters themselves.
Parameters:
 parameter (TensorSharedVariable) – The parameter.
 previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:
 step (Variable) – Theano variable for the step to take.
 updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum, which need to update shared variables after iterations.

compute_steps(previous_steps)
Build a Theano expression for steps for all parameters.
Override this method if you want to process the steps with respect to all parameters as a whole, not parameter-wise.
Parameters: previous_steps (OrderedDict) – An OrderedDict of (TensorSharedVariable, TensorVariable) pairs. The keys are the parameters being trained, the values are the expressions for quantities related to gradients of the cost with respect to the parameters, either the gradients themselves or steps in related directions.
Returns:
 steps (OrderedDict) – A dictionary of the proposed steps in the same form as previous_steps.
 updates (list) – A list of tuples representing updates to be performed.
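The relationship between compute_step and compute_steps can be sketched with a small plain-Python class hierarchy. SketchStepRule and SketchScale are hypothetical stand-ins that work on floats. They mirror the interface documented above but are not the Blocks classes.

```python
from collections import OrderedDict


class SketchStepRule:
    """Base rule: the default per-parameter step is the identity."""

    def compute_step(self, parameter, previous_step):
        return previous_step, []

    def compute_steps(self, previous_steps):
        # Default implementation: loop over the parameters and delegate
        # to compute_step, so subclasses only override the per-parameter
        # method.
        steps = OrderedDict()
        updates = []
        for parameter, previous_step in previous_steps.items():
            step, parameter_updates = self.compute_step(parameter,
                                                        previous_step)
            steps[parameter] = step
            updates.extend(parameter_updates)
        return steps, updates


class SketchScale(SketchStepRule):
    """Multiply each incoming step by a fixed learning rate."""

    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate

    def compute_step(self, parameter, previous_step):
        return self.learning_rate * previous_step, []


previous_steps = OrderedDict([("w", 2.0), ("b", -4.0)])
steps, updates = SketchScale(0.5).compute_steps(previous_steps)
```

A rule that needs to look at all steps at once, such as a joint norm clip, would override compute_steps instead.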


class blocks.algorithms.TrainingAlgorithm
Bases: object
Base class for training algorithms.
A training algorithm object has a simple lifecycle. First it is initialized by calling its initialize() method. At this stage, for instance, Theano functions can be compiled. After that, the process_batch() method is repeatedly called with a batch of training data as a parameter.

class blocks.algorithms.UpdatesAlgorithm(updates=None, theano_func_kwargs=None, on_unused_sources='raise', **kwargs)
Bases: blocks.algorithms.TrainingAlgorithm
Base class for algorithms that use Theano functions with updates.
Parameters:
 updates (list of tuples or OrderedDict) – The updates that should be performed.
 theano_func_kwargs (dict, optional) – A pass-through to theano.function for additional arguments. Useful for passing profile or mode arguments to the Theano function that will be compiled for the algorithm.
 on_unused_sources (str, one of 'raise' (default), 'ignore', 'warn') – Controls behavior when not all sources in a batch are used (i.e. there is no variable with a matching name in the inputs of the computational graph of the updates).

updates
list of TensorSharedVariable updates – Updates to be done for every batch. It is required that the updates are done using the old values of optimized parameters.
Notes
Changing the updates attribute or calling add_updates after the initialize method has been called will have no effect.

add_updates(updates)
Add updates to the training process.
The updates will be done before the parameters are changed.
Parameters: updates (list of tuples or OrderedDict) – The updates to add.

process_batch(batch)
Process a batch of training data.

batch
dict – A dictionary of (source name, data) pairs.



class blocks.algorithms.VariableClipping(threshold, axis=None)
Bases: blocks.algorithms.StepRule
Clip the maximum norm of individual variables along certain axes.
This StepRule can be used to implement L2 norm constraints on e.g. the weight vectors of individual hidden units, convolutional filters or entire weight tensors. Combine with Restrict (and possibly CompositeRule) to apply such constraints only to certain variables and/or apply different norm constraints to different variables.
Parameters:
 threshold – The maximum norm permitted for a variable.
 axis – The axis or axes along which norms are computed. Defaults to None.
Notes
Because of the way the StepRule API works, this particular rule implements norm clipping of the value after update in the following way: it computes parameter - previous_step, scales it to have (possibly axes-wise) norm(s) of at most threshold, then subtracts that value from parameter to yield an ‘equivalent step’ that respects the desired norm constraints. This procedure implicitly assumes one is doing simple (stochastic) gradient descent, and so steps computed by this step rule may not make sense for use in other contexts.
Investigations into max-norm regularization date from [Srebro2005]. The first appearance of this technique as a regularization method for the weight vectors of individual hidden units in feed-forward neural networks may be [Hinton2012].
[Srebro2005] Nathan Srebro and Adi Shraibman. “Rank, Trace-Norm and Max-Norm”. 18th Annual Conference on Learning Theory (COLT), June 2005.
[Hinton2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov. “Improving neural networks by preventing co-adaptation of feature detectors”. arXiv:1207.0580.
compute_step(parameter, previous_step)
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(), so subclasses do not have to write the loop over parameters themselves.
Parameters:
 parameter (TensorSharedVariable) – The parameter.
 previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:
 step (Variable) – Theano variable for the step to take.
 updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum, which need to update shared variables after iterations.
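The "equivalent step" construction described in the notes can be sketched in plain Python for a single weight vector. The helper variable_clipping_step is hypothetical. It assumes the norm is taken over all elements of the variable and does not model the axis argument.

```python
import math


def variable_clipping_step(parameter, previous_step, threshold):
    """Return a step such that parameter - step has L2 norm <= threshold."""
    # Value the parameter would take after a plain gradient descent update.
    updated = [p - s for p, s in zip(parameter, previous_step)]
    norm = math.sqrt(sum(value ** 2 for value in updated))
    if norm > threshold:
        # Shrink the updated value onto the norm ball of radius threshold.
        updated = [value * threshold / norm for value in updated]
    # Express the result as a step to be subtracted from the parameter.
    return [p - u for p, u in zip(parameter, updated)]


# The plain update would give [6, 8] (norm 10), so it is scaled back to
# [3, 4] (norm 5) and the equivalent step becomes zero.
clipped = variable_clipping_step([3.0, 4.0], [-3.0, -4.0], threshold=5.0)
# An update that stays inside the norm ball is returned unchanged.
unchanged = variable_clipping_step([1.0, 0.0], [0.5, 0.0], threshold=5.0)
```

Since the constraint is placed on the value after subtracting the step, the result is only meaningful when this step is what is actually subtracted from the parameter, as the notes point out.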
