Algorithm suggestion for anomaly detection in multivariate time series data












I have time series data containing user actions at certain time intervals, e.g.:

Date              UserId  Directory  Operation     Result
01/01/2017 99:00  user1   dir1       created_file  success
01/01/2017 99:00  user3   dir10      deleted_file  permission_denied

The data has more than 10K unique user IDs, 10 distinct operations, and 4 distinct results.

I need to perform anomaly detection on user behavior in real time. Any suggestions on which method I should use?

The detector needs to flag whether some user operations are outliers.

A very small subset of the input data will be labelled, but most of the data will be unlabelled.

















      machine-learning time-series anomaly-detection outlier






asked 22 hours ago by himadri
edited 14 hours ago by Alireza Zolanvari




          1 Answer
          $begingroup$

One problem with your data set is that it contains multiple categorical variables (as far as I can see). Another is that users may perform sequences of actions with different lengths and in different orders, which makes it very difficult to detect suspicious patterns. I would create a histogram for each variable and see which categories are common and which are rare. Once you have looked at the descriptive statistics of each variable, you should be able to see which variables allow you to discriminate between normal and unusual behavior.
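As a rough illustration of the per-variable histograms, here is a minimal Python sketch that counts how often each category occurs in each column. The sample rows, the `COLUMNS` tuple, and the `category_histograms` helper are hypothetical and only mirror the column layout shown in the question.

```python
from collections import Counter

# Hypothetical sample rows in the question's column order:
# (date, user_id, directory, operation, result)
COLUMNS = ("date", "user_id", "directory", "operation", "result")
events = [
    ("01/01/2017", "user1", "dir1", "created_file", "success"),
    ("01/01/2017", "user3", "dir10", "deleted_file", "permission_denied"),
    ("01/01/2017", "user1", "dir1", "created_file", "success"),
    ("02/01/2017", "user2", "dir1", "created_file", "success"),
]


def category_histograms(rows, columns=COLUMNS, skip=("date",)):
    """Return {column name: Counter mapping category -> count}."""
    histograms = {}
    for index, name in enumerate(columns):
        if name in skip:
            continue  # the timestamp is not treated as a categorical variable here
        histograms[name] = Counter(row[index] for row in rows)
    return histograms


histograms = category_histograms(events)
for name, counts in histograms.items():
    # most_common() sorts by frequency, so rare categories appear last
    print(name, counts.most_common())
```

Categories at the tail of each list are the rare ones worth a closer look.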



Two good metrics are the entropy (dispersion) $H = -\sum_{l=1}^{L} p_l \ln p_l$, which is $0$ if all manifestations of the categorical variable are concentrated at one label and $\ln L$ if they are uniformly distributed, and the Gini index $G = 1 - \sum_{l=1}^{L} p_l^2$, which tends to zero if one label is very dominant, becomes larger as the labels become more uniformly distributed, and is bounded above by $1 - 1/L$. Here $p_l$ is the relative frequency of the $l^{\text{th}}$ manifestation of the categorical variable under investigation, and $L$ is the number of all possible manifestations of that variable.
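Both metrics are easy to compute from the relative frequencies. The following is a minimal sketch, assuming the values of one categorical variable are available as a plain Python list; the function names and the result labels `not_found` and `timeout` are made up for illustration. Labels that never occur contribute nothing to either sum, so only observed labels are counted.

```python
import math
from collections import Counter


def relative_frequencies(values):
    """Relative frequency p_l of each observed label."""
    counts = Counter(values)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def entropy(values):
    """H = -sum_l p_l ln p_l, using the natural logarithm."""
    return sum(-p * math.log(p) for p in relative_frequencies(values))


def gini_index(values):
    """G = 1 - sum_l p_l^2."""
    return 1.0 - sum(p * p for p in relative_frequencies(values))


# One dominant label versus four equally frequent labels (hypothetical data)
concentrated = ["success"] * 8
uniform = ["success", "permission_denied", "not_found", "timeout"] * 2

print("concentrated:", entropy(concentrated), gini_index(concentrated))
print("uniform:     ", entropy(uniform), gini_index(uniform))
```

For the concentrated list both values are zero, while for the uniform list the entropy equals $\ln 4$ and the Gini index equals $1 - 1/4$, matching the bounds stated above.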



The drawback of this procedure is that it does not consider interactions between your variables, but it is a first approach that you could try. If the variables are not strongly correlated, this might be sufficient.
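One simple way to check whether two variables interact is to compare the joint relative frequency of each pair of categories with the product of the two marginal frequencies, which is what you would expect if the variables were independent. This is not part of the procedure described above, only a hedged sketch of such a check; the `joint_vs_independent` helper and the sample pairs are hypothetical.

```python
from collections import Counter

# Hypothetical (operation, result) pairs
pairs = (
    [("created_file", "success")] * 6
    + [("deleted_file", "success")] * 3
    + [("deleted_file", "permission_denied")] * 1
)


def joint_vs_independent(observations):
    """Map each observed (a, b) pair to (joint frequency, product of marginals)."""
    total = len(observations)
    joint = Counter(observations)
    first = Counter(a for a, _ in observations)
    second = Counter(b for _, b in observations)
    return {
        (a, b): (count / total, (first[a] / total) * (second[b] / total))
        for (a, b), count in joint.items()
    }


comparison = joint_vs_independent(pairs)
for pair, (observed, expected) in comparison.items():
    # A large gap between the two numbers hints that the variables interact
    print(pair, round(observed, 3), round(expected, 3))
```

If the two numbers are close for every pair, looking at each variable separately loses little information.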



Since most of your data is unlabeled, it will be very difficult to use supervised machine learning methods to solve this problem.






          $endgroup$













answered 19 hours ago by MachineLearner
edited 19 hours ago



