Should I use regex or machine learning?












I am thinking of two use cases:

  1. You submit a resume in PDF format to a website and it extracts your contact information, job titles, etc.

  2. You receive an email from a friend that says, "let's have lunch next Tuesday" and your email program detects it and asks if you want to save a new calendar entry for "lunch on Tuesday".

I can see how this could be done with either ML or (very sophisticated) regex. What approaches are typically used for these scenarios? I'm assuming the same approach can be applied to both of them?

















      machine-learning nlp automatic-summarization














asked Jan 18 '18 at 14:22 by I_Play_With_Data
edited Feb 1 '18 at 17:19 by Community






















          7 Answers


















If you plan to build something for a broad audience, you definitely cannot rely on regular expressions. There is no way to write a regular expression that covers the variance a class of documents (email or PDF) can have.

Even if you are happy with a regex that handles only a (small) percentage of the possible documents, and you amend it from time to time, that will take far more time than finding training data and training an ML algorithm to do it.

Only if you have to parse some kind of standardized document (PDF or email) can you consider a regex parser.

That is why, in general, you need some ML technique to reach your goal.













  • So which algorithmic approach would you use for this? – I_Play_With_Data, Jan 18 '18 at 16:28

  • I don't have much experience with that, but you can find some suggestions at: datascience.stackexchange.com/questions/2646/… textminingonline.com/… – Vincenzo Lavorini, Jan 25 '18 at 8:46





















At scale, unless you expect to receive only one particular format, the answer is machine learning. For the first task, you should first parse the text and then scan it, probably with a Named Entity Recognition (NER) system, to extract the information you are after. An NER system would work well, as you can manually code different types of features that will greatly improve performance. If you just want to perform candidate matching, then standard bag-of-words approaches would perform decently.

For the second case, things are similar. You can rely on some syntactic analysis of the sentence to obtain the invitation and the time/day of the proposal. This again can be coupled with NER systems. Lately, neural networks have yielded promising results for both tasks. In any case, you need labeled data, which can be cumbersome to obtain.

On the other hand, regexes can be a great way to go, especially if you can predict or adapt to the variability of the incoming data. In any case, they can be used to create your first training data.
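The last point of this answer, using regexes to create a first training set, could look roughly like the sketch below. It is only an illustration: the `DATE` pattern, the `weak_label` helper, and the BIO tag names are assumptions made for this example and do not come from any particular NER library. The idea is that a regex marks spans, and every whitespace token overlapping a span receives a label that a sequence model could later be trained on.

```python
import re

# Hypothetical pattern: an optional "next"/"this" followed by a weekday name.
DATE = re.compile(
    r"\b(?:(?:next|this)\s+)?"
    r"(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    re.IGNORECASE,
)


def weak_label(sentence):
    """Return (token, tag) pairs, tagging tokens that overlap a regex match."""
    spans = [m.span() for m in DATE.finditer(sentence)]
    labeled = []
    for tok in re.finditer(r"\S+", sentence):
        tag = "O"
        for start, end in spans:
            # A token is part of the entity if it overlaps the matched span.
            if tok.start() < end and tok.end() > start:
                tag = "B-DATE" if tok.start() <= start else "I-DATE"
                break
        labeled.append((tok.group(), tag))
    return labeled


if __name__ == "__main__":
    for token, tag in weak_label("let's have lunch next Tuesday"):
        print(token, tag)
```

Labels produced this way are noisy, so they are a starting point for manual correction rather than a finished training set.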





















If, in case 1, there were a single template for organizing information on CVs, you could go for regex, but to have a really helpful tool for real-world CVs you have to train an ML solution. Just search Google for "cv example" and you will see that people use different words, a different order of information, and different formatting (formatting could be eliminated from the problem scope). You can use regex, but it will be enormous, and I doubt it will outperform an NER solution.

Case 2 is a common chatbot case, so the question is whether or not to use ML to implement a chatbot (spoiler: use ML).





















Not every Tuesday is lunch day, even if you are talking about some lunch on some Tuesday. It should be an NLP system using NER to provide the best prediction.





















I've been able to use a set of regex rules that feed a scoring system to profile PubMed abstracts. For example, any instance of 'increased risk', 'increased association', etc., adds to an 'association' counter. Similarly, 'reduced risk' subtracts from the same counter.

I can also identify and extract specifics such as statistical measures (e.g. p-values), gene names, sample population/patient characteristics, etc.

Of course, this approach is not valid for complex grammar, but it lends itself to the predictable and formulaic format in which abstracts are written, particularly in the discipline I am interested in, i.e. molecular biology.

I was surprised at how useful such a simple methodology can be for profiling an abstract.

The key is in developing the regex rules. For my application, i.e. extracting data from PubMed abstracts, I was essentially modelling my internal process of what I look for when scanning abstracts to determine relevance.

Over time I found I was adding new phrases to the regex 'library' to capture instances where the rule system encountered text it did not profile correctly.

– haz (new contributor)
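The scoring idea described in this answer might be sketched as follows. This is not the author's actual rule set: the three patterns and the `profile_abstract` function are hypothetical and only cover the phrases mentioned in the answer ('increased risk', 'increased association', 'reduced risk') plus a simple p-value form.

```python
import re

# Hypothetical rules, limited to the phrases mentioned in the answer.
POSITIVE = re.compile(r"\bincreased\s+(?:risk|association)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\breduced\s+risk\b", re.IGNORECASE)
# Assumed p-value form: "p < 0.01", "P=.03", and similar.
P_VALUE = re.compile(r"\bp\s*([<=>])\s*(0?\.\d+)", re.IGNORECASE)


def profile_abstract(text):
    """Score an abstract with an 'association' counter and collect p-values."""
    score = len(POSITIVE.findall(text)) - len(NEGATIVE.findall(text))
    p_values = [(op, float(value)) for op, value in P_VALUE.findall(text)]
    return {"association": score, "p_values": p_values}


if __name__ == "__main__":
    abstract = (
        "Carriers showed an increased risk of disease (p < 0.01). "
        "A reduced risk was seen in controls, but an increased "
        "association persisted (P=.03)."
    )
    print(profile_abstract(abstract))
```

Growing the regex 'library' then amounts to adding alternatives to these patterns whenever an abstract is profiled incorrectly.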













                        You receive an email from a friend that says, "let's have lunch next Tuesday" and your email program detects it and asks if you want to save a new calendar entry for "lunch on Tuesday".




What you describe is called information extraction, a major field of NLP (Natural Language Processing). Specifically, you are looking for temporal expression identification. Have a look at the Stanford Temporal Tagger, SUTime, for a "live" demo. From what I can see, it is a regex-based rule system.
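To make the idea of a regex-based rule for temporal expressions concrete, here is a minimal sketch. It is not how SUTime is implemented; the `TEMPORAL` pattern and the `find_weekday_mentions` function are hypothetical, and the rule that a weekday mention always means its next upcoming occurrence is a simplifying assumption.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Hypothetical rule: an optional "next"/"this"/"on" followed by a weekday name.
TEMPORAL = re.compile(
    r"\b(?:(?:next|this|on)\s+)?(" + "|".join(WEEKDAYS) + r")\b",
    re.IGNORECASE,
)


def find_weekday_mentions(text, today):
    """Return (matched text, resolved date) pairs for weekday mentions."""
    results = []
    for match in TEMPORAL.finditer(text):
        target = WEEKDAYS.index(match.group(1).lower())
        # Days until the next occurrence of that weekday, strictly in the future.
        delta = (target - today.weekday() - 1) % 7 + 1
        results.append((match.group(0), today + timedelta(days=delta)))
    return results


if __name__ == "__main__":
    email = "let's have lunch next Tuesday"
    print(find_weekday_mentions(email, today=date(2018, 1, 18)))
```

With `today` set to Thursday, 18 January 2018, the phrase "next Tuesday" resolves to 23 January 2018. A real tagger needs many more rules (times of day, relative offsets, ambiguity of "next"), which is where rule sets grow quickly.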



To give you an impression of how powerful rule-based systems can be:




                        Weizenbaum’s own secretary reportedly asked Weizenbaum to leave the room so that she and ELIZA could have a real conversation. Weizenbaum was surprised by this, later writing, “I had not realized… that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people.”




                        Source: Wikipedia: Eliza



                        See also




                        • Apple Data Detectors

                        • TimeML: Robust specification of event and temporal expressions in text







answered Jan 24 '18 at 6:43 by Martin Thoma























I would say it depends on the requirements and how much effort you want to put into the task.

  • Using regex is definitely easier, but at the same time you will not be able to cover everything, especially if the text is not always structured as you expect. In that case you will be updating your regex patterns each time you miss something.

  • If you go for a machine learning approach, you will need data to train your models on (a lot of it), plus time for training and for improving the quality of your model. Hopefully you will get something good.

In conclusion, I think that if you can cover the requirements with regex, go for it. If regex will not be a good solution, then start thinking about machine learning solutions.







answered Jan 18 '18 at 14:33 by Hastu












  • So which algorithmic approach would you use for this? – I_Play_With_Data, Jan 18 '18 at 16:28

  • For the first, it looks like the data is structured, and even if it is not, contact information and job titles are things you can write regex patterns for. The second problem is, I think, more complicated, depending on how much you want your algorithm to understand. So maybe regex is not a good idea there. – Hastu, Jan 18 '18 at 17:44
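As a rough illustration of the comment above, contact details in resume text can be picked up with a couple of patterns. The `EMAIL` and `PHONE` expressions and the `extract_contacts` function are simplified assumptions for this sketch, not validated formats, and the sample text is invented.

```python
import re

# Simplified, hypothetical patterns; real-world formats vary far more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def extract_contacts(text):
    """Return the email addresses and phone-like strings found in text."""
    return {
        "emails": EMAIL.findall(text),
        "phones": PHONE.findall(text),
    }


if __name__ == "__main__":
    resume_line = "Jane Doe | jane.doe@example.com | +1 (555) 010-9999"
    print(extract_contacts(resume_line))
    # A date range is also reported as a "phone" by this pattern.
    print(extract_contacts("Data Analyst, 2015-2018"))
```

The second call shows the maintenance problem raised in the answer: the phone pattern also matches the date range "2015-2018", so each new document layout tends to require another rule or exception.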


















                            $begingroup$

                            If you foresee to build something for a big public, definitely you cannot use regular expressions. There is no way you can write a regular expression that can span the variance that a class of documents (email or PDF) can have.



                            Even if you are happy with a regex that can handle efficiently a (small) percentage of the possible documents, and then amend it from time to time, it will take so much more time with respect to find the training data and train a ML algorithm to do it.



                            Only if you have to parse some kind of standardized document (PDF or email), you can think of using some regex parser.



                            That's the reason why, in general, you need to use some ML technique to reach your goal.






                            share|improve this answer











                            $endgroup$



                            If you foresee to build something for a big public, definitely you cannot use regular expressions. There is no way you can write a regular expression that can span the variance that a class of documents (email or PDF) can have.



                            Even if you are happy with a regex that can handle efficiently a (small) percentage of the possible documents, and then amend it from time to time, it will take so much more time with respect to find the training data and train a ML algorithm to do it.



                            Only if you have to parse some kind of standardized document (PDF or email), you can think of using some regex parser.



                            That's the reason why, in general, you need to use some ML technique to reach your goal.







                            share|improve this answer














                            share|improve this answer



                            share|improve this answer








                            edited Jan 19 '18 at 16:02

























                            answered Jan 18 '18 at 15:53









                            Vincenzo Lavorini












                            2












                            $begingroup$

                            At scale, unless you expect to receive only one particular format, the answer is machine learning. For the first task, you should first parse the text and then scan it, probably with a Named Entity Recognition (NER) system, to extract the information you are after. An NER system would work well here, since you can hand-code different types of features that greatly improve performance. If you just want to perform candidate matching, then standard bag-of-words approaches would perform decently.
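As a rough illustration of the bag-of-words idea for candidate matching, the sketch below scores resumes against a job description with cosine similarity over raw word counts. It reads "candidate matching" as matching resumes to a job posting, which is an assumption; the texts and function names are invented for the example, and a real system would likely add stop-word removal and TF-IDF weighting.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical job description and resumes.
job = "Python developer with machine learning experience"
resume_a = "Experienced Python developer, machine learning and NLP projects"
resume_b = "Pastry chef with ten years of bakery experience"

score_a = cosine_similarity(bag_of_words(job), bag_of_words(resume_a))
score_b = cosine_similarity(bag_of_words(job), bag_of_words(resume_b))
print(f"resume_a: {score_a:.2f}, resume_b: {score_b:.2f}")
```

The second resume still gets a non-zero score because it shares generic words such as "with" and "experience" with the job text, which is why weighting schemes are usually layered on top of raw counts.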



                            For the second case, things are similar. You can rely on some syntactic analysis of the sentence to obtain the invitation and the time/day of the proposal. This again can be coupled with NER systems. Lately, neural networks have yielded promising results for both tasks. In any case, you need labeled data, which can be cumbersome to obtain.



                            On the other hand, regexes can be a great way to go, especially if you can predict or adapt to the variability of the incoming data. In any case, they can be used to create your first training data.
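One way to read "create your first training data" is weak labeling: a regex marks the spans it recognizes, and those noisy annotations seed a later NER model. The sketch below is only an assumption about how that could look; the `DATE` label, the weekday pattern, and the sample sentences are illustrative.

```python
import re

# Assumption: weekday mentions, optionally preceded by "next" or "this",
# are a good-enough first approximation of date expressions.
WEEKDAY = re.compile(
    r"\b(?:(?:next|this)\s+)?(?:mon|tues|wednes|thurs|fri|satur|sun)day\b",
    re.IGNORECASE,
)


def weak_label(sentence):
    """Return the sentence with (start, end, label) character spans found by regex."""
    spans = [(m.start(), m.end(), "DATE") for m in WEEKDAY.finditer(sentence)]
    return sentence, spans


emails = [
    "Let's have lunch next Tuesday",
    "The report is attached",
    "Can we meet on Friday or Monday?",
]

training_data = [weak_label(sentence) for sentence in emails]
for sentence, spans in training_data:
    print(sentence, [sentence[start:end] for start, end, _ in spans])
```

The labels produced this way are noisy and incomplete (they miss "tomorrow" or "the 14th", for example), so they would normally be reviewed or corrected before training.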















                            $endgroup$


















                                answered Jan 23 '18 at 15:59









                                geompalik























                                    2












                                    $begingroup$

                                    If, in case 1, there were a single template for organizing information on CVs, then you could go for regex; but to have a really helpful tool for real-world CVs, you have to train an ML solution. Just search Google for "cv example" and you will see that people use different words, order the information differently, and format it differently (formatting could be eliminated from the problem scope). You can use regex, but it will be enormous, and I doubt that you will outperform an NER solution.



                                    Case 2 is a common chatbot case, so the question is whether or not to use ML to implement a chatbot (spoiler: use ML).

















                                    $endgroup$


















                                        edited Jan 26 '18 at 17:38









                                        Stephen Rauch











                                        answered Jan 26 '18 at 16:35









                                        quester























                                            2












                                            $begingroup$

                                            Not every Tuesday is lunch day, even if you are talking about some lunch on some Tuesday. It should be an NLP system using NER to provide the best prediction.















                                            $endgroup$


















                                                answered Jan 27 '18 at 17:24









                                                Vivek Khetan























                                                    1












                                                    $begingroup$

                                                    I've been able to use a set of regex rules that feed a scoring system to profile PubMed abstracts. For example, any instance of 'increased risk', 'increased association', etc., adds to an 'association' counter. Similarly, 'reduced risk' subtracts from the same counter.
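A minimal sketch of such a rule-fed counter is shown below. It is not the actual rule set described above, only an illustration built from the three phrases mentioned; the weights, the `RULES` list, and the sample abstract are assumptions.

```python
import re

# Each rule pairs a pattern with the amount it adds to the 'association' counter.
# Only the phrases quoted in the text are covered; a real library would be larger.
RULES = [
    (re.compile(r"\bincreased\s+(?:risk|association)\b", re.IGNORECASE), +1),
    (re.compile(r"\breduced\s+risk\b", re.IGNORECASE), -1),
]


def association_score(abstract):
    """Sum the weights of every rule match found in the abstract."""
    return sum(weight * len(pattern.findall(abstract)) for pattern, weight in RULES)


# Hypothetical abstract text.
abstract = (
    "Carriers showed an increased risk of disease. "
    "An increased association was also seen in men, "
    "while exercise was linked to reduced risk."
)

print(association_score(abstract))
```

Keeping the rules in a plain list makes it cheap to append a new phrase whenever an abstract is scored incorrectly.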



                                                    I can also identify and extract specifics such as statistical measures (e.g. p-values), gene names, sample population/patient characteristics, etc.
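For instance, p-values tend to follow a small number of written forms, so a single pattern can pull most of them out. The pattern below is an assumption about those forms ("P = 0.03", "p<0.001", "p < .05") rather than a complete grammar, and the sample sentence is invented.

```python
import re

# Assumption: p-values are written as "p", a comparison operator, then a decimal
# number with an optional leading zero. Scientific notation is not handled.
P_VALUE = re.compile(r"\bp\s*(<=|>=|[<>=])\s*(0?\.\d+)", re.IGNORECASE)


def extract_p_values(text):
    """Return (operator, value) pairs for each p-value mention in the text."""
    return [(op, float(number)) for op, number in P_VALUE.findall(text)]


sentence = (
    "The variant was associated with disease (OR 1.8, P = 0.03); "
    "the effect replicated in a second cohort (p<0.001)."
)

print(extract_p_values(sentence))
```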



                                                    Of course, this approach is not valid for complex grammar, but it lends itself to the predictable and formulaic format in which abstracts are written, particularly in the discipline I am interested in, i.e. molecular biology.



                                                    I was surprised at how useful such a simple methodology can be for profiling an abstract.



                                                    The key is in developing the regex rules. For my application, i.e. extracting data from PubMed abstracts, I was essentially modelling my own internal process: what I look for when scanning abstracts to determine relevance.



                                                    Over time, I found myself adding new phrases to the regex 'library' to capture cases where the rule system encountered text it did not profile correctly.












                                                    $endgroup$


















                                                        answered 2 days ago









                                                        haz



