Finding the MLE of the weights maximizing the likelihood function of a linear regression
I am reading the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop. My question is about equation (3.13) on page 141. The book discusses the maximum likelihood estimate of the parameter vector $w$ for a non-linear regression as follows.
We have a feature vector $x \in \mathbb{R}^{k}$ and basis functions $\phi_{i}$, and we write $\phi = [\phi_{0},\phi_{1},\phi_{2},\ldots,\phi_{k-1}]$. The book then models a target variable $t$ as
$$t = y(x,w) + e,$$
where $y(x,w)$ is our model and $e$ is Gaussian noise with mean $0$ and variance $\beta^{-1}$.
So we express this uncertainty over $t$ as a p.d.f. over $t$ given by
$$\mathcal{N}(t \mid y(x,w), \beta^{-1}).$$
The book then writes the likelihood function for $N$ observations as
$$P = \prod_{n=1}^{N} \mathcal{N}(t_{n} \mid y(x_{n},w), \beta^{-1})
\;\Rightarrow\; \ln P = \sum_{n=1}^{N} \ln \mathcal{N}(t_{n} \mid y(x_{n},w), \beta^{-1})
= \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_{D}(w),$$
where $E_{D}(w) = \frac{1}{2}\sum_{n=1}^{N}\left[t_{n} - w^{T}\phi(x_{n})\right]^{2}$.
Maximizing the log-likelihood with respect to $w$ is therefore equivalent to minimizing $E_{D}(w)$.
Now the author calculates the gradient of the log-likelihood and writes
$$\nabla_{w}\ln P = \sum_{n=1}^{N}\left[t_{n} - w^{T}\phi(x_{n})\right]\phi(x_{n})^{T}.$$
My doubt is that the dimension of $\nabla_{w}\ln P$ does not seem to match the dimension of $w$. My reasoning: from the text above, $w$ and $\phi$ must have the same dimension for $y(x,w)$ to be real valued. Now, considering the R.H.S. of the gradient equation, the term $[t_{n}-w^{T}\phi(x_{n})]$ is real valued and $\phi(x_{n})^{T}$ has the same dimensions as $w^{T}$. So the gradient does not seem to match $w$ in dimensions; where am I going wrong?
Edit/Note:
After the comprehensive answer by @user3658307, I went back to the original problem of finding the optimal weights, taking the gradient to be
$$\nabla_{w}\ln P = \sum_{n=1}^{N}\left[t_{n} - w^{T}\phi(x_{n})\right]\phi(x_{n}),$$
i.e. without the transpose on $\phi(x_{n})$.
I found the optimal weights to be the same as the optimal weights found by the author using his version of the gradient. (A small numerical check along these lines is sketched below.)
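For anyone who wants to check this numerically, here is a minimal NumPy sketch (my own, not from the book; the polynomial basis, the sizes $N=50$, $M=3$, and the toy data are arbitrary choices for illustration). It solves the normal equations $\Phi^{T}\Phi\, w = \Phi^{T}t$ for the maximum-likelihood weights and verifies that the gradient above vanishes there; stored as 1-D arrays, the versions with and without the transpose on $\phi(x_{n})$ cannot differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N scalar inputs, M = 3 polynomial basis functions phi_j(x) = x**j.
N, M = 50, 3
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, M, increasing=True)          # design matrix, shape (N, M)
w_true = np.array([0.5, -1.0, 2.0])
beta = 25.0                                     # noise precision
t = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), size=N)

# Maximum-likelihood weights from the normal equations Phi^T Phi w = Phi^T t.
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Gradient sum_n [t_n - w^T phi(x_n)] phi(x_n); as a 1-D array the
# row/column (transpose) question does not change the result.
grad = np.sum((t - Phi @ w_ml)[:, None] * Phi, axis=0)

print(w_ml)                                # close to w_true
print(np.allclose(grad, 0.0, atol=1e-9))   # True: the gradient vanishes at w_ml
```

The `solve` call is just the closed-form solution $w_{ML} = (\Phi^{T}\Phi)^{-1}\Phi^{T}t$ from the book, written without forming the inverse explicitly.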
multivariable-calculus regression machine-learning pattern-recognition
asked Mar 14 at 12:46 by warrior_monk (edited Mar 16 at 9:00)
1 Answer
I think what's happening is this: $x_i \in \mathbb{R}^D \;\forall i$ are the data points, $t_i \in \mathbb{R}$ is the target for $x_i$, the basis functions are $\phi_j : \mathbb{R}^D \rightarrow \mathbb{R}$ (so that $\phi : \mathbb{R}^D \rightarrow \mathbb{R}^M$), and the weights are given by $w \in \mathbb{R}^M$. The data generating mechanism is
$$ t = y(x,w) + \epsilon = \epsilon + \sum_{\ell=0}^{M-1} w_{\ell}\,\phi_{\ell}(x) = \epsilon + w^{T}\phi(x), $$
meaning $t \sim \mathcal{N}(t \mid y(x,w), \beta^{-1})$. See equation 3.3 in the book.
Next the author looks at the log-likelihood of the data under the model (i.e., $p(t \mid w,\beta)$) and computes its gradient with respect to the weights. We should see that $\nabla \ln p(t \mid w,\beta) \in \mathbb{R}^M$, since that is how many weights there are. Our dataset is of size $N$; i.e., $X = \{x_1,\ldots,x_N\}$ and $t = \{t_1,\ldots,t_N\}$. We get
$$
\nabla\mathcal{E} := \nabla \ln p(t \mid w,\beta) = \sum_{n=1}^{N} \underbrace{\left[ t_n - w^{T}\phi(x_n) \right]}_{\in\,\mathbb{R}} \, \underbrace{\phi(x_n)^{T}}_{\in\,\mathbb{R}^M} \;\in \mathbb{R}^M,
$$
since summing the vectors doesn't change their dimensionality.
Indeed, $\nabla\mathcal{E} \in \mathbb{R}^M$, meaning the gradient for the $u$th weight is given by the $u$th component of $\nabla\mathcal{E}$.
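To make the dimension counting concrete, here is a small NumPy check (my own illustration, not from the book; the Gaussian basis functions, their random centres, and the sizes $D=2$, $M=4$, $N=100$ are arbitrary assumptions): each bracketed residual is a scalar, $\phi(x_n)$ is stored as a length-$M$ array, and the resulting gradient has exactly $M$ components, matching $w$.

```python
import numpy as np

rng = np.random.default_rng(1)

D, M, N = 2, 4, 100                  # input dim, number of basis functions, data points
X = rng.normal(size=(N, D))          # data points x_n in R^D
t = rng.normal(size=N)               # targets t_n in R
w = rng.normal(size=M)               # weights w in R^M
centers = rng.normal(size=(M, D))    # centres of the Gaussian basis functions (arbitrary)

def phi(x):
    """Map x in R^D to an M-dimensional vector of Gaussian basis functions."""
    return np.exp(-0.5 * np.sum((x - centers) ** 2, axis=1))

# grad = sum_n [t_n - w^T phi(x_n)] phi(x_n): each term is a scalar times an (M,)-array.
grad = sum((t_n - w @ phi(x_n)) * phi(x_n) for x_n, t_n in zip(X, t))

print(grad.shape)   # (4,) -- one entry per weight, the same dimensionality as w
```

Whether one pictures $\phi(x_n)$ as a row or a column, the array simply has $M$ entries, which is the whole point here.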
Your question:
"my understanding from the above text is that the dimension of $w$ and $\phi$ have to be the same for $y(x,w)$ to be real valued. Now considering the R.H.S. of the gradient equation, the term $[t_{n}-w^{T}\phi(x_{n})]$ is real valued and $\phi^{T}_{n}$ has dimensions equal to $w^{T}$"
Note that $\phi_n$ appears nowhere in the above equation. There is only $\phi(x_n)$, which is $M$-dimensional (it maps $x_n \in \mathbb{R}^D$ to an $M$-dimensional vector), and it has the same dimensionality as the weight vector.
Edit for comments:
In ML, the distinction between row and column vectors tends to be overlooked, partly because we operate in (what are assumed to be) Euclidean spaces, so it doesn't matter very much. I see papers switch between them or (more often) just treat the gradient as a vector, ignoring the row vs. column distinction altogether.
However, it is slightly more common, I feel, to consider it a row vector, because:
- The rows of the Jacobian matrix $J$ correspond to the gradient vectors of the component functions.
- It can be written (as a directional-derivative linear operator) simply as $\nabla f(x)\,v$, like how we write $J(x)\,v$ in the Taylor expansion (spelled out just below).
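For concreteness, the row-vector convention in the second point amounts to the standard first-order expansions (generic notation, nothing specific to the book): for $f:\mathbb{R}^M\rightarrow\mathbb{R}$ and $F:\mathbb{R}^M\rightarrow\mathbb{R}^K$,
$$ f(x+v) \approx f(x) + \nabla f(x)\,v, \qquad F(x+v) \approx F(x) + J(x)\,v, $$
so with $v$ taken as a column vector, $\nabla f(x)$ acts naturally as a row vector in $\mathbb{R}^{1\times M}$, and the rows of $J(x)$ are the gradients of the components of $F$.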
There is plenty of debate on this, though; e.g., [1], [2], [3], [4], [5], [6], [7], [8].
(Aside: mathematically, resolving this requires distinguishing between the differential and the gradient (contravariance vs covariance), and this is not something people care about in ML usually since the metric tensor is Euclidean. See the last few refs for more.)
I don't know that the author says this, but I guess he implicitly considers the gradient a row vector (at least in this case). I'm pretty sure in other places in the book that the gradient ended up being a column vector. Maybe it's a typo?
TL;DR: in ML, the literature is very lenient regarding $\mathbb{R}^{1\times M}$ vs. $\mathbb{R}^{M\times 1}$ vs. $\mathbb{R}^{M}$, so I wouldn't worry about it.
answered Mar 14 at 15:50 by user3658307 (edited Mar 15 at 1:52)
1.) My bad, I fixed the typo for $\phi(x_{n})$. – warrior_monk, Mar 14 at 23:07
2.) You mean $\phi(x_{n})^{T} \in \mathbb{R}^{M}$; I say that it is precisely in $\mathbb{R}^{1\times M}$, and I claim the gradient should be in $\mathbb{R}^{M\times 1}$. By default we take a vector to be a column vector, and hence $w \in \mathbb{R}^{M\times 1}$. – warrior_monk, Mar 14 at 23:16
@warrior_monk I see. I added a few details to the answer. I think it's either a mistake/typo or he considers $\phi$ or the gradient to be a row vector (I tend to do the latter). Either way, I don't think it's important; this kind of sloppiness is rampant in this area :) – user3658307, Mar 15 at 1:55