• Neural Computation · Mar 2019

    Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks.

    • Peter L Bartlett, David P Helmbold, and Philip M Long.
    • Department of Statistics, University of California, Berkeley, Berkeley, CA 94720-3860, U.S.A. bartlett@cs.berkeley.edu.
    • Neural Comput. 2019 Mar 1; 31 (3): 477-502.

    Abstract

    We analyze algorithms for approximating a function f(x) = Φx mapping ℝ^d to ℝ^d using deep linear neural networks, that is, algorithms that learn a function h parameterized by matrices Θ_1, …, Θ_L and defined by h(x) = Θ_L Θ_{L−1} ⋯ Θ_1 x. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least-squares matrix Φ, in the case where the initial hypothesis Θ_1 = ⋯ = Θ_L = I has excess loss bounded by a small enough constant. We also show that gradient descent fails to converge for Φ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If Φ is symmetric positive definite, we show that an algorithm that initializes Θ_i = I learns an ε-approximation of f using a number of updates polynomial in L, the condition number of Φ, and log(d/ε). In contrast, we show that if the least-squares matrix Φ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that Φ satisfies u^⊤Φu > 0 for all u but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant u^⊤ Θ_L Θ_{L−1} ⋯ Θ_1 u > 0 for all u, and the other that "balances" Θ_1, …, Θ_L so that they have the same singular values.
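
    A minimal sketch (not from the paper) of the setting described above: with isotropic inputs, the population quadratic loss of the deep linear network h(x) = Θ_L ⋯ Θ_1 x equals 0.5·‖Θ_L ⋯ Θ_1 − Φ‖_F^2 up to an additive constant, so gradient descent with identity initialization can be simulated directly on that Frobenius objective. The dimension d, depth L, step size, and the particular symmetric positive-definite Φ below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative choices (not from the paper): dimension, depth, step size.
d, L_depth, eta, steps = 5, 10, 0.01, 2000

rng = np.random.default_rng(0)
# A symmetric positive-definite target Phi reasonably close to the identity,
# i.e. the regime in which the paper's convergence result applies.
A = rng.standard_normal((d, d)) / np.sqrt(d)
Phi = np.eye(d) + 0.1 * (A + A.T)

Theta = [np.eye(d) for _ in range(L_depth)]   # identity initialization


def end_to_end(mats):
    """Return the product Theta_L ... Theta_1 (mats[0] is Theta_1)."""
    P = np.eye(d)
    for M in mats:
        P = M @ P
    return P


for _ in range(steps):
    E = end_to_end(Theta) - Phi                # residual of the end-to-end map
    grads = []
    for i in range(L_depth):
        above = end_to_end(Theta[i + 1:])      # Theta_L ... Theta_{i+2}
        below = end_to_end(Theta[:i])          # Theta_i ... Theta_1
        # Gradient of 0.5 * ||Theta_L ... Theta_1 - Phi||_F^2 w.r.t. Theta_{i+1}
        grads.append(above.T @ E @ below.T)
    Theta = [Th - eta * G for Th, G in zip(Theta, grads)]

excess = 0.5 * np.linalg.norm(end_to_end(Theta) - Phi, "fro") ** 2
print(f"excess loss after {steps} steps: {excess:.2e}")
```

    Replacing Φ with a symmetric matrix that has a negative eigenvalue puts the same sketch in the regime where, per the abstract, this class of identity-initialized gradient descent schemes fails to converge.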

