memory problem due to large file












I'm new to Python, so I'm sure I'm making some mistake somewhere.
Here is my problem, and thanks in advance for your help.



I have two files, one from Hive and the other a CSV, and I'm merging them. I have 64 GB of memory, and I believe the CSV file I create is around 25+ GB.



My problem is that when I connect remotely, I see memory usage hit 100%, and then I cannot even reach my workstation remotely anymore; it needs a hard reboot.



What I'm thinking is this: while merging these two tables, I'd like to write, say, 100,000 rows to the CSV, free them from memory, continue with the next 100,000 rows, append those, and so on.



I'm not sure how to do this. What I found through Google is mostly about reading large files, not how, after reading (or while merging, in my case), to write every 100K-row chunk to a CSV and free it from memory.
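
Roughly, something like the sketch below is what I'm imagining, assuming both tables can be read as CSV and the smaller one fits in memory; the file names, the join key id, and the chunk size are placeholders for my real data:

    import pandas as pd

    # Placeholder paths and join key -- these stand in for my real files.
    SMALL_FILE = "hive_table.csv"   # assumed small enough to hold in memory
    LARGE_FILE = "big_table.csv"    # the larger input
    OUT_FILE   = "merged.csv"
    JOIN_KEY   = "id"
    CHUNK_ROWS = 100_000

    # Keep the smaller table in memory the whole time.
    small_df = pd.read_csv(SMALL_FILE)

    first_chunk = True
    # Stream the large file 100,000 rows at a time instead of loading it all.
    for chunk in pd.read_csv(LARGE_FILE, chunksize=CHUNK_ROWS):
        merged = chunk.merge(small_df, on=JOIN_KEY, how="left")
        # Write the header only once, then append; each chunk can be
        # garbage-collected as soon as the loop moves on.
        merged.to_csv(OUT_FILE, mode="w" if first_chunk else "a",
                      header=first_chunk, index=False)
        first_chunk = False

I don't know whether this is the right or safe way to do it, though.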



Any suggestions will help.
Thanks
Avi










python pandas






asked Nov 11 '18 at 20:01









Avi

112





  • 1




    I think this is an X-Y problem (meta.stackexchange.com/questions/66377/what-is-the-xy-problem): you aren't really trying to read a lot of data; you need to read a lot of data in order to do some task. What is that task? Maybe we can find a way to do it without reading all the data at once.
    – Mohammad Athar
    Dec 12 '18 at 18:48
















1 Answer
I guess you are trying to use pandas; if so, don't, because it can't do what you want here. Your operation presumably needs all the data loaded into memory, which is not possible. Try Dask or some kind of SQL instead. I'm not sure whether you want to do operations such as group-by, which need all the data loaded simultaneously, or not. If you can do some extra coding and your operations don't require all the data at once, such as finding the minimum of a column, you can use generators and specify the chunksize parameter of pandas' read_csv method, though that does take extra coding. It is also worth reading about why SQL and Dask handle large operations better than pandas.
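
For example, a minimal Dask sketch of the merge might look like the following; the file names and the join column id are placeholders, and it assumes both inputs are available as CSV files:

    import dask.dataframe as dd

    # Placeholder file names and join key -- adjust to your data.
    big = dd.read_csv("big_table.csv")      # the large file, read lazily in partitions
    small = dd.read_csv("small_table.csv")  # the smaller table

    # The merge is lazy: nothing is pulled into memory yet.
    merged = big.merge(small, on="id", how="left")

    # Writing to a "*" pattern produces one CSV per partition, so the
    # full result never has to fit in memory at once.
    merged.to_csv("merged-*.csv", index=False)

Dask processes the data partition by partition, so the whole 25+ GB result does not have to be materialised in memory at once.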






        answered Nov 11 '18 at 22:10









Vaalizaadeh

7,575





























