Why use a Gaussian mixture model?
























I am learning about Gaussian mixture models (GMMs), but I am confused as to why anyone should ever use this algorithm.


  1. How is this algorithm better than other standard clustering algorithms such as $K$-means when it comes to clustering? The $K$-means algorithm partitions data into $K$ clusters with clear set memberships, whereas the Gaussian mixture model does not produce clear set membership for each data point. What is the metric to say that one data point is closer to another with a GMM?


  2. How can I make use of the final probability distribution that the GMM produces? Suppose I obtain my final probability distribution $f(x|w)$ where $w$ are the weights, so what? I have obtained a probability distribution that fits my data $x$. What can I do with it?


  3. To follow up on my previous point: for $K$-means, at the end we obtain a set of $K$ clusters, which we may denote as the set $\{S_1, \ldots, S_K\}$, which are $K$ things. But for a GMM, all I obtain is one distribution $f(x|w) = \sum\limits_{i=1}^N w_i \, \mathcal{N}(x|\mu_i, \Sigma_i)$, which is $1$ thing. How can this ever be used for clustering things into $K$ clusters?



















normal-distribution unsupervised-learning gaussian-mixture














edited yesterday by Nick Cox

asked yesterday by Olórin












  • GMM has other meanings, not least in econometrics. Abbreviation removed from title to reduce distraction. – Nick Cox, yesterday




























2 Answers






























I'll borrow the notation from (1), which describes GMMs quite nicely in my opinion. Suppose we have a feature $X \in \mathbb{R}^d$. To model the distribution of $X$ we can fit a GMM of the form

$$f(x)=\sum_{m=1}^{M} \alpha_m \, \phi(x;\mu_m,\Sigma_m)$$

with $M$ the number of components in the mixture, $\alpha_m$ the mixture weight of the $m$-th component, and $\phi(x;\mu_m,\Sigma_m)$ the Gaussian density function with mean $\mu_m$ and covariance matrix $\Sigma_m$. Using the EM algorithm (its connection to K-means is explained in this answer), we can acquire estimates of the model parameters, which I'll denote with a hat here: $\hat{\alpha}_m, \hat{\mu}_m, \hat{\Sigma}_m$. So, our GMM has now been fitted to $X$; let's use it!
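
As a concrete, hedged sketch of this fitting step (my addition, not part of the original answer), scikit-learn's GaussianMixture runs EM for us; the two-component synthetic data below are an assumption made purely for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic data with two Gaussian subpopulations (made up purely for illustration)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2)),
])

# Fit a GMM with M = 2 components; scikit-learn runs EM internally
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(gmm.weights_)       # estimated mixture weights, one alpha_hat_m per component
print(gmm.means_)         # estimated component means mu_hat_m
print(gmm.covariances_)   # estimated covariance matrices Sigma_hat_m
```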



This addresses your questions 1 and 3:


What is the metric to say that one data point is closer to another with a GMM?

[...]

How can this ever be used for clustering things into $K$ clusters?




As we now have a probabilistic model of the distribution, we can, among other things, calculate the posterior probability of a given instance $x_i$ belonging to component $m$, which is sometimes referred to as the 'responsibility' of component $m$ for (producing) $x_i$ (2), denoted $\hat{r}_{im}$:

$$ \hat{r}_{im} = \frac{\hat{\alpha}_m \, \phi(x_i;\hat{\mu}_m,\hat{\Sigma}_m)}{\sum_{k=1}^{M}\hat{\alpha}_k \, \phi(x_i;\hat{\mu}_k,\hat{\Sigma}_k)}$$

This gives us the probabilities of $x_i$ belonging to the different components. That is precisely how a GMM can be used to cluster your data.
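
To make the responsibility formula above concrete, here is a small sketch (again my own, with assumed synthetic data) that evaluates $\hat{r}_{im}$ directly and compares it with scikit-learn's predict_proba, then derives hard cluster labels from it.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (300, 2)),
               rng.normal([3.0, 3.0], 1.0, (200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# Evaluate alpha_hat_m * phi(x_i; mu_hat_m, Sigma_hat_m) for every point and component
dens = np.column_stack([
    multivariate_normal.pdf(X, mean=gmm.means_[m], cov=gmm.covariances_[m])
    for m in range(gmm.n_components)
])
num = gmm.weights_ * dens
r_hat = num / num.sum(axis=1, keepdims=True)   # normalise over components -> responsibilities

# scikit-learn's predict_proba returns the same posterior membership probabilities
print(np.allclose(r_hat, gmm.predict_proba(X), atol=1e-6))

# Cluster by assigning each point to its most 'responsible' component
labels = r_hat.argmax(axis=1)
```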



K-means can encounter problems when the choice of $K$ is not well suited to the data, or when the shapes of the subpopulations differ. The scikit-learn documentation contains an interesting illustration of such cases:



[Figure from the scikit-learn documentation: cases where k-means produces unintuitive clusterings.]
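
In place of the figure, here is a rough sketch (my own, not from the answer) of one such failure mode: blobs stretched by a linear map violate k-means' roughly-spherical assumption, while a full-covariance GMM tends to cope better.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Isotropic blobs, then stretched so the true clusters become elongated ellipses
X, y_true = make_blobs(n_samples=600, centers=3, random_state=170)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, covariance_type="full",
                             random_state=0).fit(X).predict(X)

# Agreement with the generating labels (1.0 = perfect); the GMM typically scores higher here
print("k-means ARI:", adjusted_rand_score(y_true, km_labels))
print("GMM ARI:    ", adjusted_rand_score(y_true, gmm_labels))
```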



The choice of the shape of the GMM's covariance matrices affects what shapes the components can take on; here again, the scikit-learn documentation provides an illustration:



[Figure from the scikit-learn documentation: GMM component shapes under different covariance-matrix constraints.]
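
For reference, scikit-learn exposes this choice through the covariance_type argument; the short sketch below (my addition, with toy data) fits the same data under each constraint and compares the fits by BIC.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# 'full': each component has its own general covariance matrix
# 'tied': all components share one covariance matrix
# 'diag': axis-aligned ellipsoids; 'spherical': equal variance in every direction
for cov_type in ["spherical", "diag", "tied", "full"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(X)
    print(f"{cov_type:9s}  BIC = {gmm.bic(X):.1f}")
```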



While a poorly chosen number of clusters/components can also affect an EM-fitted GMM, a GMM fitted in a Bayesian fashion can be somewhat resilient against the effects of this, allowing the mixture weights of some components to be (close to) zero. More on this can be found here.
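
As a hedged illustration of that last point (my addition), scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior lets superfluous components receive weights near zero, so deliberately over-specifying the number of components is less harmful than with plain EM.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Data with 3 true groups, but we allow up to 10 mixture components
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

bgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.01,   # small value favours sparse mixture weights
    max_iter=500,
    random_state=0,
).fit(X)

# Most of the 10 weights typically shrink towards zero, leaving roughly 3 active components
print(bgmm.weights_.round(3))
```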



References




(1) Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1, No. 10. New York: Springer Series in Statistics, 2001.

(2) Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.







answered yesterday by Rickyfox

































1. How is this algorithm better than other standard clustering algorithms such as $K$-means when it comes to clustering?





• k-means is well suited for roughly spherical clusters of equal size. It may fail if these conditions are violated (although it may still work if the clusters are very widely separated). GMMs can fit clusters with a greater variety of shapes and sizes. But neither algorithm is well suited for data with curved/non-convex clusters.


• GMMs give a probabilistic assignment of points to clusters. This lets us quantify uncertainty. For example, if a point is near the 'border' between two clusters, it's often better to know that it has near-equal membership probabilities for these clusters, rather than blindly assigning it to the nearest one.


• The probabilistic formulation of GMMs lets us incorporate prior knowledge, using Bayesian methods. For example, we might already know something about the shapes or locations of the clusters, or how many points they contain.


• The probabilistic formulation gives a way to handle missing data (e.g. using the expectation-maximization algorithm typically used to fit GMMs). We can still cluster a data point even if we haven't observed its value along some dimensions. And we can infer what those missing values might have been.





1. ...The $K$-means algorithm partitions data into $K$ clusters with clear set memberships, whereas the Gaussian mixture model does not produce clear set membership for each data point. What is the metric to say that one data point is closer to another with a GMM?


GMMs give a probability that each point belongs to each cluster (see below). These probabilities can be converted into 'hard assignments' using a decision rule. For example, the simplest choice is to assign each point to the most likely cluster (i.e. the one with the highest membership probability).
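
A small sketch of this (mine, not the answerer's code; the overlapping two-cluster data and the use of scikit-learn are assumptions for illustration): predict_proba gives the soft memberships, and an argmax decision rule turns them into hard labels.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two overlapping clusters, so some points genuinely sit near the 'border'
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [3, 0]], cluster_std=1.5, random_state=0)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

proba = gmm.predict_proba(X)   # soft memberships: one probability per cluster
hard = proba.argmax(axis=1)    # simplest decision rule: pick the most likely cluster
# equivalently: hard = gmm.predict(X)

# Flag points whose membership is genuinely uncertain (probabilities near 0.5/0.5)
uncertain = np.abs(proba[:, 0] - 0.5) < 0.1
print(f"{uncertain.sum()} of {len(X)} points have membership probabilities between 0.4 and 0.6")
```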





2. How can I make use of the final probability distribution that the GMM produces? Suppose I obtain my final probability distribution $f(x|w)$ where $w$ are the weights, so what? I have obtained a probability distribution that fits my data $x$. What can I do with it?


Here are just a few possibilities (a small sketch of two of them follows the list). You can:




• Perform clustering (including hard assignments, as above).


• Impute missing values (as above).


• Detect anomalies (i.e. points with low probability density).


• Learn something about the structure of the data.


• Sample from the model to generate new, synthetic data points.
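
As a brief sketch of two of these uses (my own illustration with assumed toy data, using scikit-learn): the fitted density flags low-probability points as anomalies, and the generative model can produce synthetic samples.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Anomaly detection: points with low density under the fitted mixture
log_density = gmm.score_samples(X)          # log f(x) for each point
threshold = np.percentile(log_density, 1)   # e.g. flag the lowest 1%
anomalies = X[log_density < threshold]

# Generative use: draw new, synthetic data points (and their source components)
X_new, components = gmm.sample(n_samples=100)
print(anomalies.shape, X_new.shape)
```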





3. To follow up on my previous point: for $K$-means, at the end we obtain a set of $K$ clusters, which we may denote as the set $\{S_1, \ldots, S_K\}$, which are $K$ things. But for a GMM, all I obtain is one distribution $f(x|w) = \sum\limits_{i=1}^N w_i \, \mathcal{N}(x|\mu_i, \Sigma_i)$, which is $1$ thing. How can this ever be used for clustering things into $K$ clusters?


The expression you wrote is the distribution for the observed data. However, a GMM can be thought of as a latent variable model. Each data point is associated with a latent variable that indicates which cluster it belongs to. When fitting a GMM, we learn a distribution over these latent variables. This gives a probability that each data point is a member of each cluster.
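
To spell this out in the notation of the question (my own rephrasing; I write $z_i$ for the assumed latent cluster label of point $x_i$ and index components by $k$): the generative model is

$$P(z_i = k) = w_k, \qquad x_i \mid z_i = k \sim \mathcal{N}(\mu_k, \Sigma_k),$$

and marginalising over $z_i$ recovers exactly the mixture density $f(x|w)$ above. Clustering then uses the posterior

$$P(z_i = k \mid x_i) = \frac{w_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j} w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},$$

which assigns each data point a probability of belonging to each component, i.e. to each cluster.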






answered yesterday by user20160

