Getting to the gradient descent algorithmBatch vs incremental gradient descentA constrained gradient descent...
Examples of a statistic that is not independent of sample's distribution?
Are the terms "stab" and "staccato" synonyms?
How much attack damage does the AC boost from a shield prevent on average?
BitNot does not flip bits in the way I expected
Finding algorithms of QGIS commands?
Could you please stop shuffling the deck and play already?
Why doesn't this Google Translate ad use the word "Translation" instead of "Translate"?
How to pass a string to a command that expects a file?
Space in array system equations
Who deserves to be first and second author? PhD student who collected data, research associate who wrote the paper or supervisor?
How did Alan Turing break the enigma code using the hint given by the lady in the bar?
Algorithm to convert a fixed-length string to the smallest possible collision-free representation?
Do f-stop and exposure time perfectly cancel?
What wound would be of little consequence to a biped but terrible for a quadruped?
Should I tell my boss the work he did was worthless
A three room house but a three headED dog
How strictly should I take "Candidates must be local"?
Making a sword in the stone, in a medieval world without magic
Does "variables should live in the smallest scope as possible" include the case "variables should not exist if possible"?
Is Gradient Descent central to every optimizer?
What are some noteworthy "mic-drop" moments in math?
Fourth person (in Slavey language)
infinitive telling the purpose
How are such low op-amp input currents possible?
Getting to the gradient descent algorithm
Batch vs incremental gradient descentA constrained gradient descent algorithmHow does one rigorously prove that gradient descent indeed decreases the function in question locally i.e. show $f(x^{(t+1)}) leq f(x^{(t)})$?How do mirror prox and accelerated methods differ conceptually?(steepest) gradient descent for minimizing a quadratic function $langle x, Ax rangle$ with $A succeq 0$Taylor Series Remainder, gradient methodsAbout the feasibility of a subgradient stepUnderstanding Newton method when optimizing cost function depending on a triangle.Second-Order Taylor Series Terms In Gradient DescentHow to show convergence of Frank-Wolfe algorithm ( or conditional gradient method)?
$begingroup$
I understand that gradient descent comes from the (quite natural) idea that we might want to choose our next weight vector ($w^{t+1}$) as
$$w^{t+1} = arg min_w frac{1}{2} |w-w^t|^{2} + eta f(w^t) + langle w-w^t, nabla f(w^t) rangle$$
because the approximation $$f(w) approx f(w^t)~+ langle w-w^t,nabla f(w^t) rangle$$ gets worse as $|w-w^t|$ grows.
But, how do I get from the first expression above to the gradient descent algorithm:
$$w^{t+1} = w^t + nabla f(w^t)?$$
A step-by-step explanation would be much appreciated.
optimization numerical-optimization gradient-descent
$endgroup$
add a comment |
$begingroup$
I understand that gradient descent comes from the (quite natural) idea that we might want to choose our next weight vector ($w^{t+1}$) as
$$w^{t+1} = arg min_w frac{1}{2} |w-w^t|^{2} + eta f(w^t) + langle w-w^t, nabla f(w^t) rangle$$
because the approximation $$f(w) approx f(w^t)~+ langle w-w^t,nabla f(w^t) rangle$$ gets worse as $|w-w^t|$ grows.
But, how do I get from the first expression above to the gradient descent algorithm:
$$w^{t+1} = w^t + nabla f(w^t)?$$
A step-by-step explanation would be much appreciated.
optimization numerical-optimization gradient-descent
$endgroup$
$begingroup$
Why is the term $eta f(w^t)$ in the objective function?
$endgroup$
– Rodrigo de Azevedo
Mar 9 at 16:48
add a comment |
$begingroup$
I understand that gradient descent comes from the (quite natural) idea that we might want to choose our next weight vector ($w^{t+1}$) as
$$w^{t+1} = arg min_w frac{1}{2} |w-w^t|^{2} + eta f(w^t) + langle w-w^t, nabla f(w^t) rangle$$
because the approximation $$f(w) approx f(w^t)~+ langle w-w^t,nabla f(w^t) rangle$$ gets worse as $|w-w^t|$ grows.
But, how do I get from the first expression above to the gradient descent algorithm:
$$w^{t+1} = w^t + nabla f(w^t)?$$
A step-by-step explanation would be much appreciated.
optimization numerical-optimization gradient-descent
$endgroup$
I understand that gradient descent comes from the (quite natural) idea that we might want to choose our next weight vector ($w^{t+1}$) as
$$w^{t+1} = arg min_w frac{1}{2} |w-w^t|^{2} + eta f(w^t) + langle w-w^t, nabla f(w^t) rangle$$
because the approximation $$f(w) approx f(w^t)~+ langle w-w^t,nabla f(w^t) rangle$$ gets worse as $|w-w^t|$ grows.
But, how do I get from the first expression above to the gradient descent algorithm:
$$w^{t+1} = w^t + nabla f(w^t)?$$
A step-by-step explanation would be much appreciated.
optimization numerical-optimization gradient-descent
optimization numerical-optimization gradient-descent
edited Mar 9 at 16:47
Rodrigo de Azevedo
13k41960
13k41960
asked Mar 9 at 16:22
BartonBarton
154
154
$begingroup$
Why is the term $eta f(w^t)$ in the objective function?
$endgroup$
– Rodrigo de Azevedo
Mar 9 at 16:48
add a comment |
$begingroup$
Why is the term $eta f(w^t)$ in the objective function?
$endgroup$
– Rodrigo de Azevedo
Mar 9 at 16:48
$begingroup$
Why is the term $eta f(w^t)$ in the objective function?
$endgroup$
– Rodrigo de Azevedo
Mar 9 at 16:48
$begingroup$
Why is the term $eta f(w^t)$ in the objective function?
$endgroup$
– Rodrigo de Azevedo
Mar 9 at 16:48
add a comment |
1 Answer
1
active
oldest
votes
$begingroup$
You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
$$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
$$x_{k+1} = x_k - tau nabla f(x_k)$$
The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "69"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3141295%2fgetting-to-the-gradient-descent-algorithm%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
$$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
$$x_{k+1} = x_k - tau nabla f(x_k)$$
The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.
$endgroup$
add a comment |
$begingroup$
You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
$$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
$$x_{k+1} = x_k - tau nabla f(x_k)$$
The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.
$endgroup$
add a comment |
$begingroup$
You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
$$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
$$x_{k+1} = x_k - tau nabla f(x_k)$$
The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.
$endgroup$
You can motivate gradient descent as both minimization of a quadratic approximation of the objective or a linear function with a penalty for moving away from the current point. Here's what you are minimizing actually:
$$x_{k+1} = argmin_{xin mathbb{R}^n}{f(x_{k}) + nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$x_{k+1} = argmin_{xin mathbb{R}^n}{nabla f(x_k) cdot (x-x_k) + frac{1}{2tau}||x-x_k||^2}$$
$$nabla f(x_k) + frac{1}{tau}(x_{k+1}-x_k) = 0$$
$$x_{k+1} = x_k - tau nabla f(x_k)$$
The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + nabla f(x_k) cdot (x-x_k)$ which is obviously linear, and adding the penalty you get $f_{x_k}^{tau}(x) = f_{x_k}(x) + frac{1}{2tau}||x-x_k||^2$ which is a $tau^{-1}$ strongly convex function and thus has a unique minimizer $x_{k+1}$.
edited Mar 9 at 17:07
answered Mar 9 at 16:46
lightxbulblightxbulb
1,125311
1,125311
add a comment |
add a comment |
Thanks for contributing an answer to Mathematics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3141295%2fgetting-to-the-gradient-descent-algorithm%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
Why is the term $eta f(w^t)$ in the objective function?
$endgroup$
– Rodrigo de Azevedo
Mar 9 at 16:48