Why do we choose principal components based on maximum variance explained?
I've seen many people choose the number of principal components for PCA based on the maximum variance explained. So my question is: do we always have to choose principal components based on maximum variance explained? Does that apply to all scenarios, e.g. text count vectors (bag-of-words, tf-idf, ...) where the number of dimensions is really high?
Does maximum variance mean that most of the information about my data in the higher dimension is captured in the lower dimension?
Usually I plot something like this to see the explained variance:
import numpy as np
import matplotlib.pyplot as plt

# `pca` is an already-fitted sklearn.decomposition.PCA instance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Principal components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()
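For instance, I would then pick the smallest number of components that reaches some threshold, roughly like this (a sketch on top of the fitted pca above):

# Sketch: keep the smallest number of components whose cumulative
# explained-variance ratio reaches a chosen threshold (here 95%).
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# scikit-learn can also do this directly: a float in (0, 1) keeps just
# enough components to explain that fraction of the variance.
# pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X)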
Tags: machine-learning, python, scikit-learn, pca
3 Answers
do we always have to choose principal components based on maximum variance explained?

Yes. "Maximum variance explained" is closely related to the main objective, as follows.

Our main objective is: given a limited budget of $K$ dimensions, what information $\mbox{a}=(a_1,...,a_K)$ should we keep from the original data $\mbox{x}=(x_1,...,x_D)$ ($D \gg K$) so that $\mbox{x}$ can be reconstructed from $\mbox{a}$ as closely as possible?

If we only allow rotation and scaling of the original data, i.e. $a_k := \mbox{x}\cdot\mbox{v}_k$ for an unknown set of vectors $V_K=\{\mbox{v}_k \mid \mbox{v}_k \in \mathbb{R}^D, 1 \leq k \leq K\}$, and define the reconstruction error as
$$loss(\mbox{x},V_K):=\left\| \mbox{x}-\underbrace{\sum_{k=1}^{K}\overbrace{(\mbox{x}\cdot\mbox{v}_k)}^{a_k}\mbox{v}_k}_{\hat{\mbox{x}}} \right\|^2,$$
then the solution $V^*_K$ that minimizes this error is PCA. For the first dimension, PCA keeps the projection of the data on the vector $\mbox{v}^*_1$ in the direction of largest data variance, namely $a^*_1$. For the second dimension, it keeps the projection on the vector $\mbox{v}^*_2$ in the direction of second-largest data variance, namely $a^*_2$, and so on.

In other words, when we try to find a $K$-vector set $V_K$ that minimizes $loss(X,V_K)=\frac{1}{N}\sum_{n=1}^{N}loss(\mbox{x}_n,V_K)$, the solution $V^*_K$ includes the $\mbox{v}^*_k$ that lies in the direction of the $k\mbox{-th}$ largest data variance.

Note that "ratio of variance explained" is a measure from statistics. Using the previous notation, it is defined as
$$\mbox{R}(X,V_K):=1 - \frac{loss(X,V_K)}{Var(X)}.$$
Since the variance of the original data, $Var(X)$, is independent of the solution, minimizing $loss(X,V_K)$ is equivalent to maximizing $\mbox{R}(X,V_K)$. For example, if $K=2$, then $V^*_2=\{\mbox{v}^*_1, \mbox{v}^*_2\}$ minimizes $loss(X,V_2)$ and equivalently maximizes $\mbox{R}(X,V_2)$. Ideally, if the original data $X$ can be perfectly reconstructed from $V_K$, then $R(X, V_K)$ is $1$.

Does maximum variance mean most information about my data in the higher dimension is captured in the lower dimension?

Yes. If we agree that "keep as much information as possible" is equivalent to "be able to reconstruct the data as closely as possible", then our objective $\min_{V_K}loss(X,V_K)$ formalizes "keep as much information as possible", and its solution is "maximum variance".
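As a small numerical sketch of this equivalence (on synthetic data, with scikit-learn), the cumulative explained_variance_ratio_ reported by PCA matches $1 - loss(X,V_K)/Var(X)$ computed from the reconstruction:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 10) @ rng.randn(10, 10)          # correlated toy data, D = 10

K = 2
pca = PCA(n_components=K).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))     # reconstruction from K components

loss = np.mean(np.sum((X - X_hat) ** 2, axis=1))    # mean squared reconstruction error
total_var = np.sum(np.var(X, axis=0))               # Var(X): total variance across all D axes

print(1 - loss / total_var)                         # R(X, V_K) from the reconstruction
print(pca.explained_variance_ratio_.sum())          # the same quantity reported by sklearn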
Principal Component Analysis is commonly used in machine learning as a preprocessing step for dimensionality reduction. You can imagine that this is useful for things like visualization or for reducing the size of your training set. We maximize the variance so that we preserve as much information about the original data as possible and lose only a small amount.

In answer to your question: yes, high variance in this case means preserving most of the information captured in the high-dimensional data in a lower dimension. The mathematical intuition is about reconstruction: once you project points onto a line (or a lower-dimensional subspace), can you recover the original points from the projections?

On that note, if someone would like to provide the mathematical intuition explicitly, I would welcome that answer.
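A rough numerical sketch of that intuition (projecting onto the first principal direction and mapping back, using scikit-learn):

import numpy as np
from sklearn.decomposition import PCA

# Project 2-D points onto the first principal direction and map them back.
# The reconstruction keeps the spread along that direction and loses only
# the spread orthogonal to it.
rng = np.random.RandomState(1)
X = rng.randn(200, 2) @ np.array([[3.0, 0.5], [0.5, 0.8]])  # elongated point cloud

pca = PCA(n_components=1).fit(X)
X_back = pca.inverse_transform(pca.transform(X))            # points squashed onto a line

print("variance kept:", pca.explained_variance_ratio_[0])
print("mean squared reconstruction error:", np.mean(np.sum((X - X_back) ** 2, axis=1)))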
In addition to what has been said:

Why do we choose principal components based on maximum variance explained?
- Because the variance left to the remaining components is precisely the residual you want to minimize when looking for the best representation of your data in fewer dimensions (the best mean-square linear representation, of course).

Do we always have to choose principal components based on maximum variance explained?
- Yes, if dimensionality reduction is what you want. However, there are applications in which the residual components are the ones that tell the story :-)
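One classic example is anomaly detection, where points that do not fit the bulk of the data show up mainly in the low-variance residual subspace. A rough sketch with scikit-learn (the data and threshold here are purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(2)
X = rng.randn(300, 5) @ rng.randn(5, 5)          # "normal" correlated data
outlier = X.mean(axis=0) + 6 * rng.randn(5)      # a point far off the main subspace

pca = PCA(n_components=2).fit(X)

def residual_score(points):
    # squared distance from each point to the span of the kept components
    recon = pca.inverse_transform(pca.transform(points))
    return np.sum((points - recon) ** 2, axis=1)

print(residual_score(X).mean())                  # typical residual for normal points
print(residual_score(outlier[None, :]))          # usually much larger for the outlier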