Finding the M.L.E. estimates of the weights maximizing the likelihood function of a linear regression


I am reading the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop. My question is about equation $3.13$ on page $141$, where he derives the maximum likelihood estimate of the parameter vector $w$ for a linear regression with nonlinear basis functions, as follows.

We have a feature vector $x \in \mathbb{R}^{k}$, basis functions $\phi_{i}$, and we write $\phi = [\phi_{0},\phi_{1},\phi_{2},\ldots,\phi_{k-1}]$. The book then models a target variable $t$ as
$$t = y(x,w) + e,$$
where $y(x,w)$ is our model and $e$ is Gaussian noise with mean $0$ and variance $\beta^{-1}$.
So we express this uncertainty over $t$ as a p.d.f. over $t$:
$$\mathcal{N}(t \mid y(x,w), \beta^{-1}).$$
The book then writes the likelihood function for $N$ observations as
$$P = \prod_{n=1}^{N} \mathcal{N}(t_{n} \mid y(x_{n},w), \beta^{-1})
\;\Rightarrow\; \ln P = \sum_{n=1}^{N} \ln \mathcal{N}(t_{n} \mid y(x_{n},w), \beta^{-1})
= \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_{D}(w),$$
where $E_{D}(w) = \frac{1}{2}\sum_{n=1}^{N}\bigl[t_{n} - w^{T}\phi(x_{n})\bigr]^{2}$.
Maximizing the log-likelihood with respect to $w$ is therefore equivalent to minimizing $E_{D}(w)$.
The author then computes the gradient of $E_{D}(w)$ and writes
$$\nabla_{w}\ln P = \sum_{n=1}^{N}\bigl[t_{n} - w^{T}\phi(x_{n})\bigr]\phi(x_{n})^{T}.$$

My doubt is that the dimension of $\nabla_{w}\ln P$ does not match the dimension of $w$. Reasoning: my understanding from the above is that the dimensions of $w$ and $\phi$ have to be the same for $y(x,w)$ to be real valued. Now, considering the R.H.S. of the gradient equation, the term $[t_{n} - w^{T}\phi(x_{n})]$ is real valued and $\phi_{n}^{T}$ has dimensions equal to $w^{T}$. So the gradient does not seem to match $w$ in dimensions. Where am I going wrong?

Edit / Note:

After the comprehensive answer by @user3658307, I went back to the original problem of finding the optimal weights, taking the gradient to be
$$\nabla_{w}\ln P = \sum_{n=1}^{N}\bigl[t_{n} - w^{T}\phi(x_{n})\bigr]\phi(x_{n}),$$
i.e. without the transpose on $\phi(x_{n})$.
I found the optimal weights to be the same as the ones the author obtains from his version of the gradient.
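For concreteness, here is a minimal numerical sketch of that check (the synthetic data, the polynomial basis `phi`, and all variable names are illustrative choices, not from the book): it builds the design matrix $\Phi$ whose $n$-th row is $\phi(x_n)^{T}$, solves the normal equations $w^{\star} = (\Phi^{T}\Phi)^{-1}\Phi^{T}t$ for the maximum likelihood weights, and confirms that the gradient written with the untransposed $\phi(x_n)$ has the same shape as $w$ and vanishes at $w^{\star}$.

    import numpy as np

    # Toy setup, chosen only to check shapes and the stationarity condition.
    rng = np.random.default_rng(0)
    N, M = 50, 4                                 # N observations, M basis functions
    x = rng.uniform(-1, 1, size=N)
    t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)

    def phi(x_scalar):
        # Hypothetical polynomial basis: phi(x) = [1, x, x^2, x^3] in R^M
        return np.array([x_scalar ** j for j in range(M)])

    Phi = np.stack([phi(xn) for xn in x])        # design matrix, shape (N, M)

    # Closed-form ML solution from the normal equations: w* = (Phi^T Phi)^{-1} Phi^T t
    w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

    # Gradient with the untransposed phi(x_n): sum_n [t_n - w^T phi(x_n)] phi(x_n)
    grad = sum((t[n] - w_star @ Phi[n]) * Phi[n] for n in range(N))

    print(grad.shape)                            # (M,) -- same shape as w
    print(np.allclose(grad, 0.0))                # True: the gradient vanishes at w*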










Tags: multivariable-calculus, regression, machine-learning, pattern-recognition






asked Mar 14 at 12:46 by warrior_monk, edited Mar 16 at 9:00
          1 Answer
I think what's happening is this: $x_i \in \mathbb{R}^D\ \forall i$ are the data points, $t_i \in \mathbb{R}$ is the target for $x_i$, the basis functions are $\phi_j : \mathbb{R}^D \rightarrow \mathbb{R}$ (so that $\phi : \mathbb{R}^D \rightarrow \mathbb{R}^M$), and the weights are $w \in \mathbb{R}^M$. The data-generating mechanism is
$$ t = y(x,w) + \epsilon = \epsilon + \sum_{\ell=0}^{M-1} w_{\ell}\,\phi_{\ell}(x) = \epsilon + w^{T}\phi(x), $$
meaning $t \sim \mathcal{N}(t \mid y(x,w), \beta^{-1})$. See equation 3.3 in the book.

Next the author looks at the log-likelihood of the data under the model (i.e., $p(t \mid w,\beta)$) and computes its gradient with respect to the weights. We should expect $\nabla \ln p(t \mid w,\beta) \in \mathbb{R}^M$, since that is how many weights there are. Our dataset is of size $N$, i.e., $X=\{x_1,\ldots,x_N\}$ and $t=\{t_1,\ldots,t_N\}$. We get
$$
\nabla\mathcal{E} := \nabla \ln p(t \mid w,\beta) = \sum_{n=1}^N \underbrace{\left[ t_n - w^{T}\phi(x_n) \right]}_{\mathbb{R}}\,\underbrace{\phi(x_n)^{T}}_{\mathbb{R}^M} \in \mathbb{R}^M,
$$
since summing the vectors does not change their dimensionality.
Indeed, $\nabla\mathcal{E} \in \mathbb{R}^M$, and the gradient for the $u$th weight is given by the $u$th component of $\nabla\mathcal{E}$.
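As a quick sanity check of this dimension count, here is a small numerical sketch (the random data and the particular basis map `phi` are hypothetical choices for illustration, not from the book or the question): it evaluates $\sum_n [t_n - w^{T}\phi(x_n)]\,\phi(x_n)$ for a toy problem, confirms it has one entry per weight, and verifies componentwise that the $u$th entry equals $-\partial E_D/\partial w_u$ by finite differences.

    import numpy as np

    # Toy problem: N points in R^D, M basis functions, arbitrary weights w.
    rng = np.random.default_rng(1)
    N, D, M = 30, 2, 5
    X = rng.standard_normal((N, D))
    t = rng.standard_normal(N)
    w = rng.standard_normal(M)

    def phi(x):
        # Hypothetical basis map phi: R^D -> R^M (cosine features, for illustration)
        A = np.arange(1, M * D + 1).reshape(M, D) / (M * D)
        return np.cos(A @ x)

    Phi = np.stack([phi(xn) for xn in X])            # (N, M)

    def E_D(wvec):
        # E_D(w) = 1/2 sum_n [t_n - w^T phi(x_n)]^2
        return 0.5 * np.sum((t - Phi @ wvec) ** 2)

    # sum_n [t_n - w^T phi(x_n)] phi(x_n): one entry per weight, shape (M,)
    analytic = np.sum((t - Phi @ w)[:, None] * Phi, axis=0)

    # Central finite differences of E_D along each coordinate direction
    eps = 1e-6
    numeric = np.array([(E_D(w + eps * np.eye(M)[u]) - E_D(w - eps * np.eye(M)[u])) / (2 * eps)
                        for u in range(M)])

    print(analytic.shape)                            # (M,)
    print(np.allclose(analytic, -numeric))           # True: u-th entry is -dE_D/dw_u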



Your question:

"my understanding from the above texts is that the dimensions of $w$ and $\phi$ have to be the same for $y(x,w)$ to be real valued. Now considering the R.H.S. of the gradient equation, the term $[t_{n}-w^{T}\phi(x_{n})]$ is real valued and $\phi^{T}_{n}$ has dimensions equal to $w^{T}$"

Note that $\phi_n$ appears nowhere in the above equation. There is only $\phi(x_n)$, which is $M$-dimensional (it maps $x_n \in \mathbb{R}^D$ to an $M$-dimensional vector), and it has the same dimensionality as the weight vector.





Edit for comments:

In ML, the distinction between row and column vectors tends to be overlooked, partly because we operate in (what are assumed to be) Euclidean spaces, so it doesn't matter very much. I see papers switch between them or (more often) just treat everything as a vector, ignoring the row vs. column distinction altogether.

However, it is slightly more common, I feel, to consider the gradient a row vector, because:

• The rows of the Jacobian matrix $J$ correspond to the gradient vectors of the component functions.

• It can be written (as a directional-derivative linear operator) simply as $\nabla f(x)\, v$ (like how we write $J(x)\, v$ in the Taylor expansion).

There is plenty of debate on this though; e.g., [1], [2], [3], [4], [5], [6], [7], [8].

(Aside: mathematically, resolving this requires distinguishing between the differential and the gradient (contravariance vs. covariance), which is not something people usually care about in ML, since the metric tensor is Euclidean. See the last few refs for more.)

I don't know that the author says this explicitly, but I guess he implicitly considers the gradient a row vector (at least in this case). I'm pretty sure that in other places in the book the gradient ends up being a column vector. Maybe it's a typo?

TL;DR: in ML, the literature is very lenient regarding $\mathbb{R}^{1\times M}$ vs. $\mathbb{R}^{M\times 1}$ vs. $\mathbb{R}^{M}$, so I wouldn't worry about it.
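To make the TL;DR concrete, here is a tiny illustrative snippet (my own toy numbers, not from the book): the "row vector", "column vector", and plain 1-D views of a gradient carry exactly the same $M$ numbers, and the directional derivative $\nabla f(x)\,v$ comes out the same however you pair the shapes up.

    import numpy as np

    # The same M numbers viewed as a flat vector, a row vector, and a column vector.
    M = 4
    g = np.array([0.5, -1.0, 2.0, 0.25])        # gradient components, shape (M,)
    v = np.array([1.0, 0.0, -1.0, 2.0])         # a direction, shape (M,)

    g_row = g.reshape(1, M)                     # "row vector" convention, R^{1 x M}
    g_col = g.reshape(M, 1)                     # "column vector" convention, R^{M x 1}

    # The directional derivative written three ways; all give the same scalar.
    d1 = g @ v                                  # flat vectors
    d2 = (g_row @ v.reshape(M, 1)).item()       # row gradient times column direction
    d3 = (v.reshape(1, M) @ g_col).item()       # column gradient paired with v^T
    print(np.isclose(d1, d2) and np.isclose(d2, d3))   # True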






answered Mar 14 at 15:50 by user3658307, edited Mar 15 at 1:52
Comments:

• 1.) My bad, I fixed the typo for $\phi(x_{n})$. – warrior_monk, Mar 14 at 23:07

• 2.) You mean $\phi(x_{n})^{T} \in \mathbb{R}^{M}$? I say that it is precisely in $\mathbb{R}^{1\times M}$, and I claim the gradient should be in $\mathbb{R}^{M \times 1}$. By default we take a vector to be a column vector, and hence $w \in \mathbb{R}^{M \times 1}$. – warrior_monk, Mar 14 at 23:16

• @warrior_monk I see. I added a few details to the answer. I think it's either a mistake/typo or he considers $\phi$ or the gradient to be a row vector (I tend to do the latter). Either way I don't think it's important. This kind of sloppiness is rampant in this area :) – user3658307, Mar 15 at 1:55










