Algorithm suggestion for anomaly detection in multivariate time series data
$begingroup$
I have time series data containing user actions recorded at certain time intervals, e.g.:
Date              UserId  Directory  Operation     Result
01/01/2017 99:00  user1   dir1       created_file  success
01/01/2017 99:00  user3   dir10      deleted_file  permission_denied
The data has more than 10K unique user IDs, 10 distinct operations, and 4 distinct results.
I need to perform anomaly detection on user behavior in real time. Any suggestions on which method I should use?
The detector needs to flag whether a user's operations are outliers.
A very small subset of the input data will be labelled, but most of it will be unlabelled.
machine-learning time-series anomaly-detection outlier
$endgroup$
asked 22 hours ago by himadri, edited 14 hours ago by Alireza Zolanvari
1 Answer
$begingroup$
The main difficulty with your data set is that it consists of multiple categorical variables (as far as I can see). Another difficulty is that users may perform sequences of operations of different lengths and in different orders, which makes suspicious patterns hard to detect. I would start by creating a histogram for each variable to see which categories are common and which are rare. Once you have looked at the descriptive statistics of each variable, you should be able to see which variables allow you to discriminate.
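As a minimal sketch of the histogram step, the snippet below counts how often each category occurs in a column and converts the counts to relative frequencies. The `records` list and the helper name `category_frequencies` are hypothetical and only mirror the columns shown in the question; real data would be read from your log source.

```python
from collections import Counter

# Hypothetical sample rows mirroring the columns in the question.
records = [
    {"user": "user1", "directory": "dir1", "operation": "created_file", "result": "success"},
    {"user": "user1", "directory": "dir1", "operation": "created_file", "result": "success"},
    {"user": "user2", "directory": "dir2", "operation": "read_file", "result": "success"},
    {"user": "user3", "directory": "dir10", "operation": "deleted_file", "result": "permission_denied"},
]


def category_frequencies(rows, column):
    """Return a dict mapping each category in `column` to its relative frequency."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Print a text histogram per variable, rarest categories first.
for column in ("operation", "result"):
    print(column)
    freqs = category_frequencies(records, column)
    for label, p in sorted(freqs.items(), key=lambda kv: kv[1]):
        print(f"  {label:<20} {p:.2f} {'#' * round(p * 20)}")
```

Categories at the top of each listing are the rare ones and are natural candidates for a closer look.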
A good metric is the entropy (dispersion) $H = -\sum_{l=1}^{L} p_l \ln p_l$, which is 0 if all observations of the categorical variable are concentrated in one label and $\ln L$ if they are uniformly distributed across the labels. Another is the Gini index $G = 1 - \sum_{l=1}^{L} p_l^2$, which tends to zero if one label is very dominant, grows as the labels become more uniformly distributed, and is bounded above by $1 - 1/L$. Here $p_l$ is the relative frequency of the $l^{\text{th}}$ label of the categorical variable under investigation, and $L$ is the number of possible labels of that variable.
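The two measures can be computed directly from the observed labels. The following is a sketch under stated assumptions: the function names are illustrative, and labels that never occur are simply absent from the counts, so they contribute nothing to either sum (the usual convention $0 \ln 0 = 0$).

```python
import math
from collections import Counter


def relative_frequencies(values):
    """Relative frequency p_l of each label that occurs in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return [n / total for n in counts.values()]


def entropy(values):
    """Entropy H = -sum_l p_l * ln(p_l), using the natural logarithm."""
    return sum(-p * math.log(p) for p in relative_frequencies(values))


def gini_index(values):
    """Gini index G = 1 - sum_l p_l^2."""
    return 1.0 - sum(p * p for p in relative_frequencies(values))


# One dominant label: both measures are zero.
concentrated = ["success"] * 8
# Four labels used equally often: H = ln 4 and G = 1 - 1/4.
uniform = ["a", "b", "c", "d"] * 2

print(entropy(concentrated), gini_index(concentrated))
print(entropy(uniform), gini_index(uniform))
```

Computing these per user, or per user and time window, gives a number per variable that can be compared across users; how to threshold it is a separate choice that depends on the data.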
The limitation of this procedure is that it does not consider the interactions between your variables. But it is a first approach that you could try. If the variables are not strongly correlated, this might be sufficient.
Without labeled data, it will be very difficult to use machine learning methods to solve this problem.
$endgroup$
answered 19 hours ago by MachineLearner, edited 19 hours ago