memory problem due to large file












I'm new to Python, so I'm sure I'm making some mistake somewhere.
Here is my problem, and thanks in advance for your help.



I have two files, one from Hive and the other a CSV, and I'm merging them. I have 64 GB of memory, and I believe the CSV file I create is around 25+ GB.



My problem is that when I connect remotely, I see memory usage hit 100%, and then I cannot even reach my workstation remotely anymore; it needs a hard reboot.



What I'm thinking is this: while merging these two tables, I'd like to write, say, 100,000 rows to the CSV, free them from memory, continue with the next 100,000 rows, append those, and so on.



I'm not sure how to do this. What I found through Google is mostly about reading large files, not how, after reading (or while merging, in my case), to write every 100K-row chunk to a CSV and free it from memory.
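
Roughly, something like the sketch below is what I'm imagining, assuming both tables can be read as CSV and the smaller one fits in memory; the file names, the join key id, and the chunk size are placeholders for my real data:

    import pandas as pd

    # Placeholder paths and join key -- these stand in for my real files.
    SMALL_FILE = "hive_table.csv"   # assumed small enough to hold in memory
    LARGE_FILE = "big_table.csv"    # the larger input
    OUT_FILE   = "merged.csv"
    JOIN_KEY   = "id"
    CHUNK_ROWS = 100_000

    # Keep the smaller table in memory the whole time.
    small_df = pd.read_csv(SMALL_FILE)

    first_chunk = True
    # Stream the large file 100,000 rows at a time instead of loading it all.
    for chunk in pd.read_csv(LARGE_FILE, chunksize=CHUNK_ROWS):
        merged = chunk.merge(small_df, on=JOIN_KEY, how="left")
        # Write the header only once, then append; each chunk can be
        # garbage-collected as soon as the loop moves on.
        merged.to_csv(OUT_FILE, mode="w" if first_chunk else "a",
                      header=first_chunk, index=False)
        first_chunk = False

I don't know whether this is the right or safe way to do it, though.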



Any suggestions will help.
Thanks
Avi










python pandas






asked Nov 11 '18 at 20:01









Avi

112





  • 1




    I think this is an X-Y problem (meta.stackexchange.com/questions/66377/what-is-the-xy-problem): you aren't really trying to read a lot of data; you need to read a lot of data in order to do some task. What is that task? Maybe we can find a way to do it without reading all the data at once.
    – Mohammad Athar
    Dec 12 '18 at 18:48
















1 Answer
I guess you are trying to use pandas; if so, don't, because it can't do what you want here. Your operation presumably needs all the data loaded into memory, which is not possible. Try Dask or some kind of SQL instead. I'm not sure whether you want to do operations such as group-by, which need all the data loaded simultaneously, or not. If you can do some extra coding and your operations don't require all the data at once, such as finding the minimum of a column, you can use generators and specify the chunksize parameter of pandas' read_csv method, though that does take extra coding. It is also worth reading about why SQL and Dask handle large operations better than pandas.
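
For example, a minimal Dask sketch of the merge might look like the following; the file names and the join column id are placeholders, and it assumes both inputs are available as CSV files:

    import dask.dataframe as dd

    # Placeholder file names and join key -- adjust to your data.
    big = dd.read_csv("big_table.csv")      # the large file, read lazily in partitions
    small = dd.read_csv("small_table.csv")  # the smaller table

    # The merge is lazy: nothing is pulled into memory yet.
    merged = big.merge(small, on="id", how="left")

    # Writing to a "*" pattern produces one CSV per partition, so the
    # full result never has to fit in memory at once.
    merged.to_csv("merged-*.csv", index=False)

Dask processes the data partition by partition, so the whole 25+ GB result does not have to be materialised in memory at once.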






        answered Nov 11 '18 at 22:10









Vaalizaadeh

7,575





























