• Neural Computation · Mar 2019

    Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks.

    • Peter L Bartlett, David P Helmbold, and Philip M Long.
    • Department of Statistics, University of California, Berkeley, Berkeley, CA 94720-3860, U.S.A. bartlett@cs.berkeley.edu.
    • Neural Comput. 2019 Mar 1; 31 (3): 477-502.

    Abstract

    We analyze algorithms for approximating a function f(x) = Φx mapping ℝ^d to ℝ^d using deep linear neural networks, that is, algorithms that learn a function h parameterized by matrices Θ_1, …, Θ_L and defined by h(x) = Θ_L Θ_{L−1} ⋯ Θ_1 x. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least-squares matrix Φ, in the case where the initial hypothesis Θ_1 = ⋯ = Θ_L = I has excess loss bounded by a small enough constant. We also show that gradient descent fails to converge for Φ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If Φ is symmetric positive definite, we show that an algorithm that initializes Θ_i = I learns an ε-approximation of f using a number of updates polynomial in L, the condition number of Φ, and log(d/ε). In contrast, we show that if the least-squares matrix Φ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that Φ satisfies u^⊤Φu > 0 for all u but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant u^⊤ Θ_L Θ_{L−1} ⋯ Θ_1 u > 0 for all u, and the other that "balances" Θ_1, …, Θ_L so that they have the same singular values.
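
    A minimal sketch (not from the paper) of the setting described above: with isotropic inputs, the population quadratic loss of the deep linear network h(x) = Θ_L ⋯ Θ_1 x equals 0.5·‖Θ_L ⋯ Θ_1 − Φ‖_F^2 up to an additive constant, so gradient descent with identity initialization can be simulated directly on that Frobenius objective. The dimension d, depth L, step size, and the particular symmetric positive-definite Φ below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative choices (not from the paper): dimension, depth, step size.
d, L_depth, eta, steps = 5, 10, 0.01, 2000

rng = np.random.default_rng(0)
# A symmetric positive-definite target Phi reasonably close to the identity,
# i.e. the regime in which the paper's convergence result applies.
A = rng.standard_normal((d, d)) / np.sqrt(d)
Phi = np.eye(d) + 0.1 * (A + A.T)

Theta = [np.eye(d) for _ in range(L_depth)]   # identity initialization


def end_to_end(mats):
    """Return the product Theta_L ... Theta_1 (mats[0] is Theta_1)."""
    P = np.eye(d)
    for M in mats:
        P = M @ P
    return P


for _ in range(steps):
    E = end_to_end(Theta) - Phi                # residual of the end-to-end map
    grads = []
    for i in range(L_depth):
        above = end_to_end(Theta[i + 1:])      # Theta_L ... Theta_{i+2}
        below = end_to_end(Theta[:i])          # Theta_i ... Theta_1
        # Gradient of 0.5 * ||Theta_L ... Theta_1 - Phi||_F^2 w.r.t. Theta_{i+1}
        grads.append(above.T @ E @ below.T)
    Theta = [Th - eta * G for Th, G in zip(Theta, grads)]

excess = 0.5 * np.linalg.norm(end_to_end(Theta) - Phi, "fro") ** 2
print(f"excess loss after {steps} steps: {excess:.2e}")
```

    Replacing Φ with a symmetric matrix that has a negative eigenvalue puts the same sketch in the regime where, per the abstract, this class of identity-initialized gradient descent schemes fails to converge.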

