Should I use regex or machine learning?

I am thinking of two use cases:

You submit a resume in PDF format to a web site and it extracts your contact information, job titles, etc.

You receive an email from a friend that says, "let's have lunch next Tuesday" and your email program detects it and asks if you want to save a new calendar entry for "lunch on Tuesday".

I can see how this can be done with both ML and with (very sophisticated) regex. What approaches are typically used for these scenarios? I'm assuming that the same approach can be applied to both of the scenarios above?

edited Feb 1 '18 at 17:19

Community♦

asked Jan 18 '18 at 14:22

I_Play_With_Data


machine-learning nlp automatic-summarization


7 Answers

You receive an email from a friend that says, "let's have lunch next Tuesday" and your email program detects it and asks if you want to save a new calendar entry for "lunch on Tuesday".

What you describe is called information extraction, a large subfield of NLP (natural language processing). More specifically, you are looking for temporal expression identification. Have a look at the Stanford Temporal Tagger, SUTime, which offers a "live" demo. From what I can see, it is a regex-based rule system.

To give you an impression of how powerful rule-based systems can be:

Weizenbaum’s own secretary reportedly asked Weizenbaum to leave the room so that she and ELIZA could have a real conversation. Weizenbaum was surprised by this, later writing, “I had not realized… that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people.”

Source: Wikipedia: Eliza
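To make the rule-based idea concrete, here is a minimal Python sketch of temporal expression detection for the "lunch next Tuesday" case. It is not how SUTime works internally; the names `PATTERN` and `find_weekday_mentions` are made up for illustration, and the rule that any weekday mention means the next upcoming occurrence of that day is a simplifying assumption.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Optional modifier ("next", "this", "on") followed by a weekday name.
PATTERN = re.compile(
    r"\b(?:(?:next|this|on)\s+)?(?P<weekday>" + "|".join(WEEKDAYS) + r")\b",
    re.IGNORECASE,
)


def find_weekday_mentions(text, today):
    """Return (matched_text, resolved_date) pairs for weekday mentions.

    Assumption: every mention refers to the next occurrence of that
    weekday strictly after `today`.
    """
    results = []
    for match in PATTERN.finditer(text):
        target = WEEKDAYS.index(match.group("weekday").lower())
        # Number of days (1..7) until the target weekday comes around again.
        delta = (target - today.weekday() - 1) % 7 + 1
        results.append((match.group(0), today + timedelta(days=delta)))
    return results


# The question was asked on Thursday, 18 January 2018.
mentions = find_weekday_mentions("let's have lunch next Tuesday", date(2018, 1, 18))
```

A real system would also have to decide whether "next Tuesday" means the coming Tuesday or the one after, and would need many more patterns ("tomorrow", "the 23rd", "in two weeks"), which is where a maintained rule library such as SUTime comes in.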


I would say it depends on the requirements and how much effort you want to put into the task.

Using regex is definitely easier, but at the same time you will not be able to cover everything, especially if the text is not always structured the way you expect. In that case you will be updating your regex patterns each time you miss something.

If you go for a machine learning approach, you will need data (a lot of it) to train your models, plus time for training and for improving the quality of your model. Hopefully you will get something good.

In conclusion, I think that if you can cover the requirements with regex, go for it. If regex turns out not to be a good solution, then start thinking about machine learning solutions.
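As a rough illustration of the "try regex first" route for the resume case, here is a minimal Python sketch. It assumes the PDF has already been converted to plain text by some other tool, and the patterns, the sample text and the function name `extract_contacts` are hypothetical, chosen only for this example.

```python
import re

# Simple email pattern; it will not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Loose phone pattern: optional "+", a digit, then at least seven more
# digits/spaces/dots/dashes/parentheses, ending in a digit.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def extract_contacts(text):
    """Return the emails and phone-like strings found in plain resume text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


sample = """Jane Doe
Email: jane.doe@example.com
Phone: +1 (555) 010-9999
Senior Data Analyst, Example Corp
"""

contacts = extract_contacts(sample)
```

The loose phone pattern also matches a year range such as "2015-2018", which is exactly the kind of miss that forces you to keep updating the patterns.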

answered Jan 18 '18 at 14:33

Hastu

So which algorithmic approach would you use for this?
– I_Play_With_Data
Jan 18 '18 at 16:28

For the first case, the data looks structured, and even if it is not, contact information and job titles are things you can write regex patterns for. The second problem is, I think, more complicated, depending on how much you want your algorithm to understand. So regex may not be a good idea there.
– Hastu
Jan 18 '18 at 17:44


A regex parser is only worth considering if you have to parse some kind of standardized document (PDF or email).

Since resumes and emails generally do not follow one fixed layout, you will in general need some ML technique to reach your goal.

edited Jan 19 '18 at 16:02

answered Jan 18 '18 at 15:53

Vincenzo Lavorini

So which algorithmic approach would you use for this?
– I_Play_With_Data
Jan 18 '18 at 16:28

I don't have much experience with that, but you can find some suggestions at: datascience.stackexchange.com/questions/2646/… and textminingonline.com/…
– Vincenzo Lavorini
Jan 25 '18 at 8:46


On the other hand, regex can be a great way to go, especially if you can predict or adapt to the variability of the incoming data. In any case, regex patterns can be used to create your first training data.
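The idea of using regex to bootstrap training data could look roughly like the Python sketch below. The pattern, the example emails and the name `weak_label` are assumptions made up for illustration; the labels it produces are noisy and would normally be reviewed before training a model on them.

```python
import re

# Hypothetical rule: an event word followed later by a day word.
MEETING_RE = re.compile(
    r"\b(?:lunch|dinner|coffee|meeting)\b.*"
    r"\b(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday|tomorrow)\b",
    re.IGNORECASE,
)


def weak_label(sentences):
    """Attach a noisy label: 1 = looks like a scheduling request, 0 = otherwise."""
    return [(s, 1 if MEETING_RE.search(s) else 0) for s in sentences]


emails = [
    "let's have lunch next Tuesday",
    "Please find the report attached.",
    "Coffee tomorrow?",
]

labeled = weak_label(emails)
```

The resulting (text, label) pairs can then serve as a first training set for a classifier, which may generalize to phrasings the regex never anticipated.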

answered Jan 23 '18 at 15:59

geompalik


Case 2 is a common chatbot scenario, so the question is whether or not to use ML to implement a chatbot (spoiler: use ML).

edited Jan 26 '18 at 17:38

Stephen Rauch


answered Jan 26 '18 at 16:35

quester


Not every Tuesday is a lunch day, even if you are talking about some lunch on some Tuesday. It should be an NLP system using NER (named entity recognition) to provide the best prediction.

answered Jan 27 '18 at 17:24

Vivek Khetan


I can also identify and extract specifics such as statistical measures (e.g. p-values), gene names, sample population/patient characteristics, etc.

I was surprised at how useful such a simple methodology can be for profiling an abstract.

Over time I found I was adding new phrases to the regex 'library' to capture instances where the rule system encountered text it did not profile correctly.
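A regex "library" of this kind might be organized roughly as in the Python sketch below. The two patterns, the sample abstract and the name `profile_abstract` are hypothetical and are not the actual patterns I use; they only show the shape of the approach.

```python
import re

# A small named "library" of patterns; new entries get added as misses turn up.
PATTERNS = {
    # p-values such as "P < 0.001" or "p=.03"
    "p_value": re.compile(r"\bp\s*[<>=]\s*0?\.\d+", re.IGNORECASE),
    # sample sizes such as "n = 1,204"
    "sample_size": re.compile(r"\bn\s*=\s*\d+(?:,\d{3})*", re.IGNORECASE),
}


def profile_abstract(text):
    """Return every match of every named pattern found in the abstract."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


abstract = (
    "We enrolled patients (n = 1,204). "
    "Treatment reduced risk (P < 0.001; p=.03)."
)

profile = profile_abstract(abstract)
```

Keeping the patterns in one dictionary makes it easy to add a new phrase whenever the rule system encounters text it did not profile correctly.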

answered 2 days ago

haz
