Python: Fast indexing of strings in nested list without loop














I have an array dataset from which I need to build D, where each element of D is a list of strings (taken from the third column of dataset). The dataset looks like:



600,900,3418309
600,900,3418309
600,900,3418314
600,900,3418314
600,900,3418319
600,900,3418319
610,800,3418324
610,700,3418324
600,900,3418329
620,900,3418329
600,900,3418329
600,900,3418334
610,900,3418334
600,900,3418339
600,900,3418339
600,900,3418339
660,700,3418339
610,800,3418339
660,700,3418339
600,900,3418339
600,900,3418339



For every new string, I want to check whether it is already in a given element of D, and append it to that element only if it is not. Since the number of new strings to be added is large, I want to avoid an explicit loop. Is there a fast way to do this? I have to use Python.



Right now I am using this code, which is very slow:



for i in range(len(dataset)):
    for j in range(int(dataset[i, 0]) - 600, int(dataset[i, 1]) - 600 + 1):
        if str(dataset[i, 2]) not in D[j]:
            D[j].append(str(dataset[i, 2]))









python

asked 2 days ago by shaifali Gupta (719), edited 13 hours ago
  • Welcome to the site! You are certainly welcome to leave your question here, but there are some serious Python wizards on Stack Overflow and I'm sure that posting your question there can get you an answer in a matter of minutes. – I_Play_With_Data, 2 days ago

  • Why do you use if, then pass, then else? You could negate the if and drop the pass/else. – kbrose, 15 hours ago

  • @kbrose I have updated the code in the question based on your suggestion, but it is still very slow. Can it be improved further? – shaifali Gupta, 14 hours ago

  • I am not an expert, but couldn't pandas work in this case? It has highly optimized performance for some tasks. – Victor Oliveira, 14 hours ago

  • Have you tried making it a set instead of a list? Sets are optimized for containment checking. – kbrose, 14 hours ago
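(To illustrate kbrose's set suggestion on the code in the question — a minimal sketch, assuming dataset is a NumPy array shaped like the sample above; the parallel list of sets, seen, is a hypothetical helper that makes each membership test O(1) while D keeps its insertion order:)

import numpy as np

# A few rows from the sample data (col1, col2, col3).
dataset = np.array([
    [600, 900, 3418309],
    [600, 900, 3418309],
    [610, 800, 3418324],
])

n_buckets = int(dataset[:, 1].max()) - 600 + 1
D = [[] for _ in range(n_buckets)]
seen = [set() for _ in range(n_buckets)]  # mirrors D, but with O(1) lookups

for row in dataset:
    s = str(int(row[2]))
    for j in range(int(row[0]) - 600, int(row[1]) - 600 + 1):
        if s not in seen[j]:  # a list would scan all of D[j] here
            seen[j].add(s)
            D[j].append(s)

This still loops over the rows, but it replaces the O(len(D[j])) list scan with a constant-time set lookup, which is usually where most of the time goes.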

















1 Answer
$begingroup$

Assuming I have understood your question... I might alter my answer if OP updates the question with more details



Using your example data, you can use Pandas to easily drop all duplicates.



Setup



First dump your data above into a dataframe with three columns (one for each of the items in each row):



Import pandas:



import pandas as pd


Import your data, assuming it is a list of lists (each of your rows is a list of three items):



df = pd.DataFrame.from_records(your_list_of_lists, columns=["col1", "col2", "col3"])
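(If your data instead lives in a comma-separated text file like the sample above, pd.read_csv is an alternative — a sketch, where the filename data.csv is hypothetical:)

df = pd.read_csv("data.csv", header=None, names=["col1", "col2", "col3"])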


Have a look at the first 5 rows:



df.head()

   col1  col2     col3
0   600   900  3418309
1   600   900  3418309
2   600   900  3418314
3   600   900  3418314
4   600   900  3418319


The values will be integers by default, not strings (assuming they all parse as integers).
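(If you need them as strings, to match the str() comparisons in your original code, you could convert that column up front — a minimal sketch:)

df["col3"] = df["col3"].astype(str)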



Solutions



If you want to get all unique values of col3, you can do one of the following:



uniques1 = set(df.col3)               # returns a Python set
uniques2 = df.col3.unique()           # returns a NumPy ndarray
uniques3 = df.col3.drop_duplicates()  # returns a pandas Series object




Performance



There are many other ways to achieve the same result. Of the above, the first method is the fastest (on your small example dataset). Here are the benchmarks:



In [23]: %timeit df.col3.drop_duplicates()
263 µs ± 883 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [24]: %timeit df.col3.unique()
37.2 µs ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [25]: %timeit set(df.col3)
10.5 µs ± 45.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
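(And if you ever need to drop duplicate rows across all three columns rather than only col3 — a sketch using the same df; drop_duplicates keeps the first occurrence by default:)

df_unique = df.drop_duplicates()  # one row per distinct (col1, col2, col3) triple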





answered 10 hours ago by n1k31t4
