Getting to the gradient descent algorithmBatch vs incremental gradient descentA constrained gradient descent...

Examples of a statistic that is not independent of sample's distribution?

Are the terms "stab" and "staccato" synonyms?

How much attack damage does the AC boost from a shield prevent on average?

BitNot does not flip bits in the way I expected

Finding algorithms of QGIS commands?

Could you please stop shuffling the deck and play already?

Why doesn't this Google Translate ad use the word "Translation" instead of "Translate"?

How to pass a string to a command that expects a file?

Space in array system equations

Who deserves to be first and second author? PhD student who collected data, research associate who wrote the paper or supervisor?

How did Alan Turing break the enigma code using the hint given by the lady in the bar?

Algorithm to convert a fixed-length string to the smallest possible collision-free representation?

Do f-stop and exposure time perfectly cancel?

What wound would be of little consequence to a biped but terrible for a quadruped?

Should I tell my boss the work he did was worthless

A three room house but a three headED dog

How strictly should I take "Candidates must be local"?

Making a sword in the stone, in a medieval world without magic

Does "variables should live in the smallest scope as possible" include the case "variables should not exist if possible"?

Is Gradient Descent central to every optimizer?

What are some noteworthy "mic-drop" moments in math?

Fourth person (in Slavey language)

infinitive telling the purpose

How are such low op-amp input currents possible?



Getting to the gradient descent algorithm


Batch vs incremental gradient descentA constrained gradient descent algorithmHow does one rigorously prove that gradient descent indeed decreases the function in question locally i.e. show $f(x^{(t+1)}) leq f(x^{(t)})$?How do mirror prox and accelerated methods differ conceptually?(steepest) gradient descent for minimizing a quadratic function $langle x, Ax rangle$ with $A succeq 0$Taylor Series Remainder, gradient methodsAbout the feasibility of a subgradient stepUnderstanding Newton method when optimizing cost function depending on a triangle.Second-Order Taylor Series Terms In Gradient DescentHow to show convergence of Frank-Wolfe algorithm ( or conditional gradient method)?













0












$begingroup$


I understand that gradient descent comes from the (quite natural) idea that we might want to choose our next weight vector ($w^{t+1}$) as



$$w^{t+1} = arg min_w frac{1}{2} |w-w^t|^{2} + eta f(w^t) + langle w-w^t, nabla f(w^t) rangle$$



because the approximation $$f(w) approx f(w^t)~+ langle w-w^t,nabla f(w^t) rangle$$ gets worse as $|w-w^t|$ grows.



But, how do I get from the first expression above to the gradient descent algorithm:
$$w^{t+1} = w^t + nabla f(w^t)?$$
A step-by-step explanation would be much appreciated.










share|cite|improve this question











$endgroup$












  • $begingroup$
    Why is the term $eta f(w^t)$ in the objective function?
    $endgroup$
    – Rodrigo de Azevedo
    Mar 9 at 16:48
















0












$begingroup$


I understand that gradient descent comes from the (quite natural) idea that we might want to choose our next weight vector ($w^{t+1}$) as



$$w^{t+1} = arg min_w frac{1}{2} |w-w^t|^{2} + eta f(w^t) + langle w-w^t, nabla f(w^t) rangle$$



because the approximation $$f(w) approx f(w^t)~+ langle w-w^t,nabla f(w^t) rangle$$ gets worse as $|w-w^t|$ grows.



But, how do I get from the first expression above to the gradient descent algorithm:
$$w^{t+1} = w^t + nabla f(w^t)?$$
A step-by-step explanation would be much appreciated.










share|cite|improve this question











$endgroup$












  • $begingroup$
    Why is the term $eta f(w^t)$ in the objective function?
    $endgroup$
    – Rodrigo de Azevedo
    Mar 9 at 16:48














0












0








0





$begingroup$


I understand that gradient descent comes from the (quite natural) idea that we might want to choose our next weight vector ($w^{t+1}$) as



$$w^{t+1} = arg min_w frac{1}{2} |w-w^t|^{2} + eta f(w^t) + langle w-w^t, nabla f(w^t) rangle$$



because the approximation $$f(w) approx f(w^t)~+ langle w-w^t,nabla f(w^t) rangle$$ gets worse as $|w-w^t|$ grows.



But, how do I get from the first expression above to the gradient descent algorithm:
$$w^{t+1} = w^t + nabla f(w^t)?$$
A step-by-step explanation would be much appreciated.










share|cite|improve this question











$endgroup$




I understand that gradient descent comes from the (quite natural) idea that we might want to choose our next weight vector ($w^{t+1}$) as



$$w^{t+1} = arg min_w frac{1}{2} |w-w^t|^{2} + eta f(w^t) + langle w-w^t, nabla f(w^t) rangle$$



because the approximation $$f(w) approx f(w^t)~+ langle w-w^t,nabla f(w^t) rangle$$ gets worse as $|w-w^t|$ grows.



But, how do I get from the first expression above to the gradient descent algorithm:
$$w^{t+1} = w^t + nabla f(w^t)?$$
A step-by-step explanation would be much appreciated.







optimization numerical-optimization gradient-descent






share|cite|improve this question















share|cite|improve this question













share|cite|improve this question




share|cite|improve this question








edited Mar 9 at 16:47









Rodrigo de Azevedo

13k41960




13k41960










asked Mar 9 at 16:22









BartonBarton

154




154












  • $begingroup$
    Why is the term $eta f(w^t)$ in the objective function?
    $endgroup$
    – Rodrigo de Azevedo
    Mar 9 at 16:48


















  • $begingroup$
    Why is the term $eta f(w^t)$ in the objective function?
    $endgroup$
    – Rodrigo de Azevedo
    Mar 9 at 16:48
















$begingroup$
Why is the term $eta f(w^t)$ in the objective function?
$endgroup$
– Rodrigo de Azevedo
Mar 9 at 16:48




$begingroup$
Why is the term $eta f(w^t)$ in the objective function?
$endgroup$
– Rodrigo de Azevedo
Mar 9 at 16:48










1 Answer
1






active

oldest

votes


















0












$begingroup$

You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
$$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
$$x_{k+1} = x_k - tau nabla f(x_k)$$
The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.






share|cite|improve this answer











$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "69"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    noCode: true, onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3141295%2fgetting-to-the-gradient-descent-algorithm%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0












    $begingroup$

    You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
    $$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
    $$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
    $$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
    $$x_{k+1} = x_k - tau nabla f(x_k)$$
    The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.






    share|cite|improve this answer











    $endgroup$


















      0












      $begingroup$

      You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
      $$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
      $$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
      $$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
      $$x_{k+1} = x_k - tau nabla f(x_k)$$
      The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.






      share|cite|improve this answer











      $endgroup$
















        0












        0








        0





        $begingroup$

        You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
        $$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
        $$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
        $$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
        $$x_{k+1} = x_k - tau nabla f(x_k)$$
        The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.






        share|cite|improve this answer











        $endgroup$



        You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
        $$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
        $$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
        $$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
        $$x_{k+1} = x_k - tau nabla f(x_k)$$
        The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.







        share|cite|improve this answer














        share|cite|improve this answer



        share|cite|improve this answer








        edited Mar 9 at 17:07

























        answered Mar 9 at 16:46









        lightxbulblightxbulb

        1,125311




        1,125311






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Mathematics Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3141295%2fgetting-to-the-gradient-descent-algorithm%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            六本木駅

            Integral that is continuous and looks like it converges to a geometric seriesTesting if a geometric series converges by taking limit to infinitySummation of arithmetic-geometric series of higher orderGeometric series with polynomial exponentHow to Recognize a Geometric SeriesShowing an integral equality with series over the integersDiscontinuity of a series of continuous functionsReasons why a Series ConvergesSum of infinite geometric series with two terms in summationUsing geometric series for computing IntegralsLimit of geometric series sum when $r = 1$

            Joseph Lister