Should I use regex or machine learning?
I am thinking of two use cases:
- You submit a resume in PDF format to a website, and it extracts your contact information, job titles, etc.
- You receive an email from a friend that says, "let's have lunch next Tuesday", and your email program detects it and asks whether you want to save a new calendar entry for "lunch on Tuesday".
I can see how this could be done both with ML and with (very sophisticated) regex. What approaches are typically used for these scenarios? I'm assuming that the same approach can be applied to both scenarios?
machine-learning nlp automatic-summarization
edited Feb 1 '18 at 17:19 by Community♦
asked Jan 18 '18 at 14:22 by I_Play_With_Data
7 Answers
I've been able to use a set of regex rules that feed a scoring system to profile PubMed abstracts. For example, any instance of 'increased risk', 'increased association', etc., adds to an 'association' counter. Similarly, 'reduced risk' subtracts from the same counter.
I can also identify and extract specifics such as statistical measures (e.g. p-values), gene names, sample population/patient characteristics, etc.
Of course, this approach is not valid for complex grammar, but it lends itself to the predictable and formulaic format in which abstracts are written, particularly in the discipline I am interested in, i.e. molecular biology.
I was surprised at how useful such a simple methodology can be for profiling an abstract.
The key is in developing the regex rules. For my application, i.e. extracting data from PubMed abstracts, I was essentially modelling my internal process of what I look for when scanning abstracts to determine relevance.
Over time, I found myself adding new phrases to the regex 'library' to capture instances where the rule system encountered text it did not profile correctly.
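The regex-driven scoring described in the PubMed answer could look roughly like the sketch below. The phrase lists, the p-value pattern, and the function name `profile_abstract` are assumptions made for illustration; the answer only names a few of the phrases and does not show its actual rules.

```python
import re

# Hypothetical rule set: only 'increased risk', 'increased association' and
# 'reduced risk' are mentioned in the answer; a real library would be larger.
POSITIVE = re.compile(r"\bincreased (?:risk|association)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\breduced risk\b", re.IGNORECASE)
# Assumed p-value format: "p < 0.01", "P=0.03", and similar.
P_VALUE = re.compile(r"\bp\s*([<=>])\s*(0?\.\d+)", re.IGNORECASE)


def profile_abstract(text):
    """Score an abstract with an 'association' counter and pull out p-values."""
    # Each positive phrase adds one to the counter, each negative phrase subtracts one.
    score = len(POSITIVE.findall(text)) - len(NEGATIVE.findall(text))
    # findall returns (operator, number) tuples because the pattern has two groups.
    p_values = [(op, float(num)) for op, num in P_VALUE.findall(text)]
    return {"association": score, "p_values": p_values}


if __name__ == "__main__":
    abstract = (
        "Carriers had an increased risk of disease (p < 0.01). "
        "Treatment showed reduced risk, while smokers showed "
        "increased association (P=0.03)."
    )
    print(profile_abstract(abstract))
```

Whenever the profiler mislabels an abstract, the fix in this style of system is to add or adjust a pattern, which matches the maintenance loop the answer describes.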
You receive an email from a friend that says, "let's have lunch next Tuesday" and your email program detects it and asks if you want to save a new calendar entry for "lunch on Tuesday".
What you describe is called information extraction, a large field within NLP (Natural Language Processing). Specifically, you are looking for temporal expression identification. You can have a look at the Stanford Temporal Tagger, SUTime, for a "live" demo. From what I see here, it is a regex-based rule system.
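To make the idea of a rule-based temporal tagger concrete, here is a minimal sketch using only the Python standard library. It is not how SUTime works internally; the pattern, the function name `find_weekday_mentions`, and the choice to resolve every weekday mention to its next occurrence strictly after today are all assumptions made for illustration.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Matches phrases such as "next Tuesday", "on Friday", or a bare "monday".
PATTERN = re.compile(
    r"\b(?:(next|this|on)\s+)?(" + "|".join(WEEKDAYS) + r")\b",
    re.IGNORECASE,
)


def find_weekday_mentions(text, today):
    """Return (matched phrase, resolved date) pairs for weekday mentions."""
    results = []
    for match in PATTERN.finditer(text):
        target = WEEKDAYS.index(match.group(2).lower())
        # Assumption: resolve to the next occurrence strictly after today,
        # so the offset is always between 1 and 7 days.
        delta = (target - today.weekday() - 1) % 7 + 1
        results.append((match.group(0), today + timedelta(days=delta)))
    return results


if __name__ == "__main__":
    # Jan 18, 2018 was a Thursday.
    print(find_weekday_mentions("let's have lunch next Tuesday", date(2018, 1, 18)))
```

Even this toy rule shows where the effort goes: "next Tuesday" is ambiguous in everyday use, and phrases like "the Tuesday after next" or "tomorrow" each need their own rule.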
To give you an impression of how powerful rule-based systems can be:
Weizenbaum’s own secretary reportedly asked Weizenbaum to leave the room so that she and ELIZA could have a real conversation. Weizenbaum was surprised by this, later writing, “I had not realized… that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people.”
Source: Wikipedia: Eliza
See also
- Apple Data Detectors
- TimeML: Robust specification of event and temporal expressions in text
answered Jan 24 '18 at 6:43 by Martin Thoma
I would say it depends on the requirements and how much effort you want to put into the task.
Using regex is definitely easier, but at the same time you will not be able to cover everything, especially if the text is not always structured the way you expect. In that case you will be updating your regex patterns each time you miss something.
If you go for a machine learning approach, you will need data to train your models on (a lot of it), plus time for training and for improving the quality of your model. Hopefully you will get something good.
In conclusion, I think that if you can cover the requirements with regex, go for it. If regex turns out not to be a good solution, then start thinking about machine learning solutions.
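As a rough illustration of the "cover the requirement with regex" route for the resume case, the sketch below pulls email addresses and phone numbers out of text that has already been extracted from the PDF. The patterns and the function name `extract_contacts` are assumptions, deliberately loose, and not a complete or validated contact parser.

```python
import re

# Simplified email pattern; real-world addresses allow more forms than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional "+", then at least nine characters made of
# digits, spaces, dots, dashes, or parentheses, starting and ending in a digit.
# It will also match other long digit runs such as "2015-2017".
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def extract_contacts(text):
    """Return the email addresses and phone-like strings found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Jane Doe\nEmail: jane.doe@example.com\nPhone: +1 555-123-4567\n"
    print(extract_contacts(sample))
```

The false positive noted in the comment is exactly the kind of miss that forces the pattern updates mentioned above.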
answered Jan 18 '18 at 14:33 by Hastu
So which algorithmic approach would you use for this?
– I_Play_With_Data Jan 18 '18 at 16:28
For the first, it looks like the data is structured, and even if it is not, contact information and job titles are things you can write regex patterns for. The second problem is, I think, more complicated, depending on how much you want your algorithm to understand. So regex may not be a good idea there.
– Hastu Jan 18 '18 at 17:44
If you plan to build something for a large audience, you definitely cannot use regular expressions. There is no way to write a regular expression that spans the variance a class of documents (email or PDF) can have.
Even if you are happy with a regex that handles a (small) percentage of the possible documents well and you amend it from time to time, that will take much more time than finding training data and training an ML algorithm to do it.
Only if you have to parse some kind of standardized document (PDF or email) should you think of using a regex parser.
That is why, in general, you need some ML technique to reach your goal.
edited Jan 19 '18 at 16:02
answered Jan 18 '18 at 15:53 by Vincenzo Lavorini
So which algorithmic approach would you use for this?
– I_Play_With_Data Jan 18 '18 at 16:28
I don't have much experience with that, but you can find some suggestions at: datascience.stackexchange.com/questions/2646/… textminingonline.com/…
– Vincenzo Lavorini Jan 25 '18 at 8:46
At scale, unless you expect to receive only one particular format, the answer is machine learning. For the first task, you should first parse the text and then scan it, probably with a Named Entity Recognition (NER) system, to extract the information you are after. A NER system would work well, as you can manually code different types of features that will greatly improve performance. If you just want to perform candidate matching, then standard bag-of-words approaches would perform decently.
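A bag-of-words matcher can be sketched in a few lines. The version below assumes that "candidate matching" means ranking resumes against a job description by cosine similarity of raw word counts; the tokenizer and the function names are illustrative choices, and a practical system would likely add TF-IDF weighting and stop-word removal.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word-like tokens."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two Counter-based term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(job_description, resumes):
    """Return (score, name) pairs sorted from best to worst match."""
    job = bag_of_words(job_description)
    scored = [(cosine_similarity(job, bag_of_words(text)), name)
              for name, text in resumes.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    resumes = {
        "a": "python machine learning engineer",
        "b": "pastry chef and baker",
    }
    print(rank_candidates("machine learning engineer python", resumes))
```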
For the second case, things are similar. You can rely on some syntactic analysis of the sentence to obtain the invitation and the time/day of the proposal. This, again, can be coupled with NER systems. Lately, neural networks have yielded promising results for both tasks. In any case, though, you need labeled data, which can be cumbersome to obtain.
On the other hand, regexes can be a great way to go, especially if you can predict or adapt to the variability of the incoming data. In any case, they can be used to create your first training data.
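One way to read "use regex to create your first training data" is weak labeling: run a handful of patterns over raw text and turn the matches into token-level tags that an NER model could later be trained on. The sketch below assumes two hypothetical labels, `EMAIL` and `DAY`, and whitespace tokenization; the labels it produces are noisy and would need spot-checking before use.

```python
import re

# Hypothetical labeling rules: entity label -> pattern.
RULES = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "DAY": re.compile(
        r"\b(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
        re.IGNORECASE,
    ),
}


def weak_label(text):
    """Return (token, tag) pairs; tokens overlapping a rule match get its label."""
    spans = [(m.start(), m.end(), label)
             for label, pattern in RULES.items()
             for m in pattern.finditer(text)]
    labeled = []
    for token in re.finditer(r"\S+", text):
        tag = "O"  # "outside": no rule matched this token
        for start, end, label in spans:
            # Two character ranges overlap if each starts before the other ends.
            if token.start() < end and start < token.end():
                tag = label
                break
        labeled.append((token.group(0), tag))
    return labeled


if __name__ == "__main__":
    for token, tag in weak_label("Lunch next Tuesday mail bob@example.org"):
        print(token, tag)
```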
answered Jan 23 '18 at 15:59 by geompalik
If, in case 1, there were a single template for organizing information on CVs, you could go for regex. But to have a really helpful tool for real-world CVs, you have to train an ML solution. Just search Google for "cv example" and you will see that people use different words, a different order of information, and different formatting (formatting could be eliminated from the problem scope). You can use regex, but it will be enormous, and I doubt that you will outperform a NER solution.
Case 2 is a common chatbot case, so the question is whether or not to use ML to implement a chatbot (spoiler: use ML).
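A small sketch of the single-template situation in case 1 may help show why regex works there and breaks elsewhere. The `Label: value` layout, the field names, and the `parse_cv` helper are assumptions for illustration, not a real CV format.

```python
import re

# Assumed single template: one "Label: value" field per line.
FIELD = re.compile(
    r"^(Name|Email|Phone|Title):[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_cv(text):
    """Extract fields from a CV that follows the assumed template."""
    return {label.lower(): value for label, value in FIELD.findall(text)}


templated = "Name: Jane Doe\nEmail: jane@example.com\nTitle: Data Engineer\n"
free_form = (
    "Jane Doe | jane@example.com\n"
    "I have worked as a data engineer since 2015.\n"
)

print(parse_cv(templated))
print(parse_cv(free_form))  # nothing matches once the layout changes
```

Every new layout needs new patterns, which is how the rule set becomes enormous on real-world CVs.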
edited Jan 26 '18 at 17:38
Stephen Rauch
answered Jan 26 '18 at 16:35
quester
Not every Tuesday is a lunch day, even if you are talking about some lunch on some Tuesday. It should be an NLP system using NER to provide the best prediction.
answered Jan 27 '18 at 17:24
Vivek Khetan
I've been able to use a set of regex rules that feed a scoring system to profile PubMed abstracts. For example, any instance of 'increased risk', 'increased association', etc., adds to an 'association' counter. Similarly, 'reduced risk' subtracts from the same counter.
I can also identify and extract specifics such as statistical measures (e.g. p-values), gene names, sample population/patient characteristics, etc.
Of course, this approach is not valid for complex grammar, but it lends itself to the predictable and formulaic format that abstracts are written in, and in particular to the discipline that I am interested in, i.e. molecular biology.
I was surprised at how useful such a simple methodology can be for profiling an abstract.
The key is in developing the regex rules. For my application, i.e. extracting data from PubMed abstracts, I was essentially modelling my internal process of what I look for when scanning abstracts to determine relevance.
Over time, I found I was adding new phrases to the regex 'library' to capture instances where the rule system encountered text it did not profile correctly.
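A rule-and-counter system of this kind could be sketched roughly as follows. The rule list, the weights, the p-value pattern, and the sample abstract are invented for illustration and are not the actual rules described above.

```python
import re

# Hypothetical rule library: (pattern, weight) pairs feeding one counter.
ASSOCIATION_RULES = [
    (re.compile(r"\bincreased\s+(?:risk|association)\b", re.IGNORECASE), +1),
    (re.compile(r"\breduced\s+risk\b", re.IGNORECASE), -1),
]

# Illustrative p-value pattern, e.g. "p < 0.05" or "P=0.001".
P_VALUE = re.compile(r"\bp\s*([<=>])\s*(0?\.\d+)", re.IGNORECASE)


def profile(abstract):
    """Score an abstract and pull out any p-values the pattern finds."""
    score = sum(
        weight * len(pattern.findall(abstract))
        for pattern, weight in ASSOCIATION_RULES
    )
    p_values = [(op, float(val)) for op, val in P_VALUE.findall(abstract)]
    return {"association": score, "p_values": p_values}


abstract = (
    "Carriers showed an increased risk of disease (p < 0.01). "
    "No reduced risk was seen in controls, but an increased "
    "association was found (P=0.03)."
)
print(profile(abstract))
```

Note that "No reduced risk" still subtracts from the counter here, which is the sort of misprofiled text that leads to adding new phrases to the rule library.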
answered 2 days ago
haz
New contributor