Is the R language suitable for Big Data


45

R has many libraries aimed at data analysis (e.g. JAGS, BUGS, arules), and it is mentioned in popular textbooks such as J. Kruschke, "Doing Bayesian Data Analysis", and B. Lantz, "Machine Learning with R".

I've seen a guideline of 5 TB for a dataset to be considered Big Data.

My question is: is R suitable for the amount of data typically seen in Big Data problems? Are there strategies to employ when using R with datasets of this size?

bigdata r

edited May 14 '14 at 13:06 by Konstantin V. Salikhov
asked May 14 '14 at 11:15 by akellyirl








  • 4
    In addition to the answers below, a good thing to remember is that most of what you need from R regarding Big Data can be done with summary data sets that are very small in comparison to the raw logs. Sampling from the raw log also provides a seamless way to use R for analysis without the headache of parsing line after line of a raw log. For example, for a common modelling task at work I routinely use MapReduce to summarize 32 GB of raw logs down to 28 MB of user data for modelling.
    – cwharland
    May 14 '14 at 17:45


















9 Answers


















40

Actually, this is coming around. In the book "R in a Nutshell" there is even a section on using R with Hadoop for big data processing. There are some workarounds that need to be done, because R does all its work in memory, so you are basically limited by the amount of RAM you have available.



A mature project for R and Hadoop is RHadoop



RHadoop has been divided into several sub-projects, rhdfs, rhbase, rmr2, plyrmr, and quickcheck (wiki).
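
Below is a minimal sketch of what the rmr2 style looks like, adapted from the kind of examples in the project's tutorials; it assumes rmr2 is installed and a Hadoop backend is configured (the local backend shown here is handy for testing without a cluster):

library(rmr2)

# For experimenting without a cluster; drop this line to run on Hadoop.
rmr.options(backend = "local")

# Write a vector into DFS-backed storage, returning a handle to it.
small.ints <- to.dfs(1:1000)

# A MapReduce job: the map function receives (key, value) pairs and
# emits new pairs via keyval(); here it squares each input value.
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)
)

# Collect the (key, value) results back into the R session.
from.dfs(result)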






edited Jan 31 '15 at 11:34 by chenrui333
answered May 14 '14 at 11:24 by MCP_infiltrator













  • But does using R with Hadoop overcome this limitation (having to do computations in memory)?
    – Felipe Almeida
    Jun 9 '14 at 23:07

  • RHadoop does overcome this limitation. The tutorial here: github.com/RevolutionAnalytics/rmr2/blob/master/docs/… spells it out clearly. You need to shift into a MapReduce mindset, but it does provide the power of R in the Hadoop environment.
    – Steve Kallestad
    Jun 11 '14 at 6:34






  • 2
    Two new alternatives that are worth mentioning are SparkR (databricks.com/blog/2015/06/09/…) and h2o.ai (h2o.ai/product), both well suited for big data.
    – wacax
    Dec 5 '15 at 6:03



















30

The main problem with using R for large data sets is the RAM constraint. The reason for keeping all the data in RAM is that it provides much faster access and data manipulation than storing it on HDDs would. If you are willing to take a hit on performance, then yes, it is quite practical to work with large datasets in R:




  • The RODBC package allows connecting to an external DB from R to retrieve and handle data. Hence, only the data currently being manipulated is restricted to your RAM; the overall data set can be much larger.

  • The ff package allows using larger-than-RAM data sets by utilising memory-mapped pages.

  • biglm builds generalized linear models on big data by loading the data into memory in chunks (see the sketch after this list).

  • bigmemory is an R package which allows powerful and memory-efficient parallel analyses and data mining of massive data sets. It permits storing large objects (matrices etc.) in RAM, using external pointer objects to refer to them.
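
As a minimal sketch of the chunked approach, here is how biglm can fit a linear model over a CSV that is too large to load at once (the file name, column names, and chunk size are placeholders):

library(biglm)

con <- file("big_data.csv", open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]

# Read the next chunk of rows, or NULL once the file is exhausted.
read_chunk <- function(n = 100000) {
  tryCatch(read.csv(con, header = FALSE, col.names = col_names, nrows = n),
           error = function(e) NULL)
}

fit <- biglm(y ~ x1 + x2, data = read_chunk())  # first chunk defines the model
while (!is.null(chunk <- read_chunk())) {
  fit <- update(fit, chunk)                     # fold each further chunk into the fit
}
close(con)
summary(fit)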






answered May 14 '14 at 12:39 by asheeshr









  • 1
    Another package is distributedR, which allows you to work with distributed files in RAM.
    – adesantos
    Jun 25 '14 at 7:03



















17

Some good answers here. I would like to join the discussion by adding the following three notes:

  1. The question's emphasis on the volume of data when referring to Big Data is certainly understandable and valid, especially considering that the growth of data volume is outpacing the exponential growth of technological capacities per Moore's Law (http://en.wikipedia.org/wiki/Moore%27s_law).

  2. Having said that, it is important to remember the other aspects of the big data concept. Based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (usually referred to as the "3Vs model"). I mention this because it forces data scientists and other analysts to look for and use R packages that focus on aspects of big data other than volume (enabled by the richness of the enormous R ecosystem).

  3. While existing answers mention some R packages related to big data, for more comprehensive coverage I'd recommend referring to the CRAN Task View "High-Performance and Parallel Computing with R" (http://cran.r-project.org/web/views/HighPerformanceComputing.html), in particular the sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".







edited yesterday by Kevin Bowen
answered Jul 19 '14 at 2:19 by Aleksandr Blekh





















12

R is great for "big data"! However, you need a workflow, since R is limited (with some simplification) by the amount of RAM the operating system provides. The approach I take is to interact with a relational database (see the RSQLite package for creating and interacting with an SQLite database), run SQL-style queries to understand the structure of the data, and then extract particular subsets of the data for computationally intensive statistical analysis.

This is just one approach, however: there are packages that allow you to interact with other databases (e.g., Monet) or run analyses in R with fewer memory limitations (e.g., see pbdR).
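
As a minimal sketch of that workflow (the database file, table, and column names here are hypothetical):

library(DBI)

# Connect to an SQLite file; RSQLite supplies the driver.
con <- dbConnect(RSQLite::SQLite(), "data.sqlite")

# SQL-style queries to understand the structure without loading everything.
dbGetQuery(con, "SELECT COUNT(*) AS n FROM logs")
daily <- dbGetQuery(con, "
  SELECT user_id, DATE(timestamp) AS day, COUNT(*) AS events
  FROM logs
  GROUP BY user_id, day")

# Pull only a manageable subset into RAM for the heavy statistical work.
one_user <- dbGetQuery(con, "SELECT * FROM logs WHERE user_id = 42")

dbDisconnect(con)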






answered May 18 '14 at 19:22 by statsRus





















9

Considering other criteria, I think that in some cases using Python may be much superior to R for Big Data. I know of the widespread use of R in data science educational materials and the good data analysis libraries available for it, but sometimes it just depends on the team.

In my experience, for people already familiar with programming, using Python provides much more flexibility and a bigger productivity boost compared to a language like R, which is not as well designed or powerful as Python purely as a programming language. As evidence, in a data mining course at my university, the best final project was written in Python, although the other students had access to R's rich data analysis libraries. That is, sometimes the overall productivity (considering learning materials, documentation, etc.) of Python may be better than that of R, even in the absence of special-purpose data analysis libraries for Python. Also, there are some good articles explaining the fast pace of Python in data science: Python Displacing R and Rich Scientific Data Structures in Python, which may soon fill the gap in libraries available for R.

Another important reason for not using R is that when working with real-world Big Data problems, as opposed to purely academic problems, there is much need for other tools and techniques, like data parsing, cleaning, visualization, web scraping, and a lot of others that are much easier with a general-purpose programming language. This may be why the default language used in many Hadoop courses (including Udacity's online course) is Python.

Edit:

Recently DARPA has also invested $3 million to help fund Python's data processing and visualization capabilities for big data jobs, which is clearly a sign of Python's future in Big Data. (details)






edited May 19 '14 at 8:13
answered May 18 '14 at 12:30 by Amir Ali Akbari









  • 3
    R is a pleasure to work with for data manipulation (reshape2, plyr, and now dplyr), and I don't think you can do better than ggplot2/ggvis for visualization.
    – organic agave
    May 18 '14 at 21:52

  • @pearpies As said at the beginning of my answer, I acknowledge the good libraries available for R, but as a whole, when considering all the areas needed for big data (a few of which I mentioned in the answer), R is no match for the mature and huge libraries available for Python.
    – Amir Ali Akbari
    May 19 '14 at 8:08

  • 1
    Peter from Continuum Analytics (one of the companies on the DARPA project referenced above) is working on some very impressive open-source code for data visualization that simply does things that other sets of code are not able to do.
    – blunders
    May 20 '14 at 18:46

  • 5
    This answer seems to be wholly anecdotal and hardly shows anywhere where R is weak relative to Python.
    – stanekam
    Jun 10 '14 at 20:31

  • Oh my goodness! "As an evidence, in a data mining course in my university, the best final project was written in Python, although the others has access to R's rich data analysis library." And you want to have readers respect your analysis? wow. Could there be any other factors involved in the best project being a python project other than the language it was written in? really....
    – Shawn Mehan
    Dec 5 '15 at 18:35



















7

R is great for a lot of analysis. As mentioned above, there are newer adaptations for big data like MapR, RHadoop, and scalable versions of RStudio.

However, if your concern is libraries, keep your eye on Spark. Spark was created for big data and is MUCH faster than Hadoop alone. It has fast-growing machine learning, SQL, streaming, and graph libraries, thus allowing much if not all of the analysis to be done within the framework (with multiple language APIs; I prefer Scala) without having to shuffle between languages/tools.
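
Since this question is about R, one way to drive Spark from R is the SparkR API. A minimal sketch, assuming a local Spark installation (the file and column names are placeholders):

library(SparkR)

# Start a Spark session; "local[*]" uses all local cores, handy for testing.
sparkR.session(master = "local[*]")

# The data stays in Spark; only results you explicitly fetch enter R's RAM.
df <- read.df("logs.parquet", source = "parquet")

# The aggregation runs distributed inside Spark.
events_per_user <- count(groupBy(df, "user_id"))
head(events_per_user)

sparkR.session.stop()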



























4

As other answers have noted, R can be used along with Hadoop and other distributed computing platforms to scale it up to the "Big Data" level. However, if you're not wedded to R specifically but are willing to use an "R-like" environment, Incanter is a project that might work well for you, as it is native to the JVM (based on Clojure) and doesn't have the "impedance mismatch" between itself and Hadoop that R has. That is to say, from Incanter you can invoke the Java-native Hadoop/HDFS APIs without needing to go through a JNI bridge or anything.



























2

I am far from an expert, but my understanding of the subject tells me that R (superb in statistics) and e.g. Python (superb in several of those areas where R is lacking) complement each other quite well (as pointed out by previous posts).



























0

I think there is actually a plethora of tools for working with big data in R. sparklyr will be a great player in that field. sparklyr is an R interface to Apache Spark that allows connecting to local and remote clusters, providing a dplyr back-end. One can also rely on Apache Spark's machine learning libraries. Furthermore, parallel processing is possible with several packages, such as rmpi and snow (user controlled) or doMC/foreach (system based).
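
A minimal sketch of the sparklyr approach, assuming a local Spark installation managed by sparklyr (R's built-in mtcars stands in for real cluster-scale data):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a small demo table into Spark; real workloads would instead use
# spark_read_csv()/spark_read_parquet() against cluster storage.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed on the cluster;
# collect() brings only the small aggregated result back into R.
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()

# Spark MLlib models are available through sparklyr's ml_* interface.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)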






            share|improve this answer









            $endgroup$













              Your Answer





              StackExchange.ifUsing("editor", function () {
              return StackExchange.using("mathjaxEditing", function () {
              StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
              StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
              });
              });
              }, "mathjax-editing");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "557"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: false,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: null,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f41%2fis-the-r-language-suitable-for-big-data%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              9 Answers
              9






              active

              oldest

              votes








              9 Answers
              9






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              40












              $begingroup$

              Actually this is coming around. In the book R in a Nutshell there is even a section on using R with Hadoop for big data processing. There are some work arounds that need to be done because R does all it's work in memory, so you are basically limited to the amount of RAM you have available to you.



              A mature project for R and Hadoop is RHadoop



              RHadoop has been divided into several sub-projects, rhdfs, rhbase, rmr2, plyrmr, and quickcheck (wiki).






              share|improve this answer











              $endgroup$













              • $begingroup$
                But does using R with Hadoop overcome this limitation (having to do computations in memory)?
                $endgroup$
                – Felipe Almeida
                Jun 9 '14 at 23:07










              • $begingroup$
                RHadoop does overcome this limitation. The tutorial here: github.com/RevolutionAnalytics/rmr2/blob/master/docs/… spells it out clearly. You need to shift into a mapreduce mindset, but it does provide the power of R to the hadoop environment.
                $endgroup$
                – Steve Kallestad
                Jun 11 '14 at 6:34






              • 2




                $begingroup$
                Two new alternatives that are worth mentioning are: SparkR databricks.com/blog/2015/06/09/… and h2o.ai h2o.ai/product both well suited for big data.
                $endgroup$
                – wacax
                Dec 5 '15 at 6:03
















              40












              $begingroup$

              Actually this is coming around. In the book R in a Nutshell there is even a section on using R with Hadoop for big data processing. There are some work arounds that need to be done because R does all it's work in memory, so you are basically limited to the amount of RAM you have available to you.



              A mature project for R and Hadoop is RHadoop



              RHadoop has been divided into several sub-projects, rhdfs, rhbase, rmr2, plyrmr, and quickcheck (wiki).






              share|improve this answer











              $endgroup$













              • $begingroup$
                But does using R with Hadoop overcome this limitation (having to do computations in memory)?
                $endgroup$
                – Felipe Almeida
                Jun 9 '14 at 23:07










              • $begingroup$
                RHadoop does overcome this limitation. The tutorial here: github.com/RevolutionAnalytics/rmr2/blob/master/docs/… spells it out clearly. You need to shift into a mapreduce mindset, but it does provide the power of R to the hadoop environment.
                $endgroup$
                – Steve Kallestad
                Jun 11 '14 at 6:34






              • 2




                $begingroup$
                Two new alternatives that are worth mentioning are: SparkR databricks.com/blog/2015/06/09/… and h2o.ai h2o.ai/product both well suited for big data.
                $endgroup$
                – wacax
                Dec 5 '15 at 6:03














              40












              40








              40





              $begingroup$

              Actually this is coming around. In the book R in a Nutshell there is even a section on using R with Hadoop for big data processing. There are some work arounds that need to be done because R does all it's work in memory, so you are basically limited to the amount of RAM you have available to you.



              A mature project for R and Hadoop is RHadoop



              RHadoop has been divided into several sub-projects, rhdfs, rhbase, rmr2, plyrmr, and quickcheck (wiki).






              share|improve this answer











              $endgroup$



              Actually this is coming around. In the book R in a Nutshell there is even a section on using R with Hadoop for big data processing. There are some work arounds that need to be done because R does all it's work in memory, so you are basically limited to the amount of RAM you have available to you.



              A mature project for R and Hadoop is RHadoop



              RHadoop has been divided into several sub-projects, rhdfs, rhbase, rmr2, plyrmr, and quickcheck (wiki).







              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Jan 31 '15 at 11:34









              chenrui333

              20325




              20325










              answered May 14 '14 at 11:24









              MCP_infiltratorMCP_infiltrator

              96697




              96697












              • $begingroup$
                But does using R with Hadoop overcome this limitation (having to do computations in memory)?
                $endgroup$
                – Felipe Almeida
                Jun 9 '14 at 23:07










              • $begingroup$
                RHadoop does overcome this limitation. The tutorial here: github.com/RevolutionAnalytics/rmr2/blob/master/docs/… spells it out clearly. You need to shift into a mapreduce mindset, but it does provide the power of R to the hadoop environment.
                $endgroup$
                – Steve Kallestad
                Jun 11 '14 at 6:34






              • 2




                $begingroup$
                Two new alternatives that are worth mentioning are: SparkR databricks.com/blog/2015/06/09/… and h2o.ai h2o.ai/product both well suited for big data.
                $endgroup$
                – wacax
                Dec 5 '15 at 6:03


















              • $begingroup$
                But does using R with Hadoop overcome this limitation (having to do computations in memory)?
                $endgroup$
                – Felipe Almeida
                Jun 9 '14 at 23:07










              • $begingroup$
                RHadoop does overcome this limitation. The tutorial here: github.com/RevolutionAnalytics/rmr2/blob/master/docs/… spells it out clearly. You need to shift into a mapreduce mindset, but it does provide the power of R to the hadoop environment.
                $endgroup$
                – Steve Kallestad
                Jun 11 '14 at 6:34






              • 2




                $begingroup$
                Two new alternatives that are worth mentioning are: SparkR databricks.com/blog/2015/06/09/… and h2o.ai h2o.ai/product both well suited for big data.
                $endgroup$
                – wacax
                Dec 5 '15 at 6:03
















              $begingroup$
              But does using R with Hadoop overcome this limitation (having to do computations in memory)?
              $endgroup$
              – Felipe Almeida
              Jun 9 '14 at 23:07




              $begingroup$
              But does using R with Hadoop overcome this limitation (having to do computations in memory)?
              $endgroup$
              – Felipe Almeida
              Jun 9 '14 at 23:07












              $begingroup$
              RHadoop does overcome this limitation. The tutorial here: github.com/RevolutionAnalytics/rmr2/blob/master/docs/… spells it out clearly. You need to shift into a mapreduce mindset, but it does provide the power of R to the hadoop environment.
              $endgroup$
              – Steve Kallestad
              Jun 11 '14 at 6:34




              $begingroup$
              RHadoop does overcome this limitation. The tutorial here: github.com/RevolutionAnalytics/rmr2/blob/master/docs/… spells it out clearly. You need to shift into a mapreduce mindset, but it does provide the power of R to the hadoop environment.
              $endgroup$
              – Steve Kallestad
              Jun 11 '14 at 6:34




              2




              2




              $begingroup$
              Two new alternatives that are worth mentioning are: SparkR databricks.com/blog/2015/06/09/… and h2o.ai h2o.ai/product both well suited for big data.
              $endgroup$
              – wacax
              Dec 5 '15 at 6:03




              $begingroup$
              Two new alternatives that are worth mentioning are: SparkR databricks.com/blog/2015/06/09/… and h2o.ai h2o.ai/product both well suited for big data.
              $endgroup$
              – wacax
              Dec 5 '15 at 6:03











              30












              $begingroup$

              The main problem with using R for large data sets is the RAM constraint. The reason behind keeping all the data in RAM is that it provides much faster access and data manipulations than would storing on HDDs. If you are willing to take a hit on performance, then yes, it is quite practical to work with large datasets in R.




              • RODBC Package: Allows connecting to external DB from R to retrieve and handle data. Hence, the data being manipulated is restricted to your RAM. The overall data set can go much larger.

              • The ff package allows using larger than RAM data sets by utilising memory-mapped pages.

              • BigLM: It builds generalized linear models on big data. It loads data into memory in chunks.

              • bigmemory : An R package which allows powerful and memory-efficient parallel
                analyses and data mining of massive data sets. It permits storing large objects (matrices etc.) in memory (on the RAM) using external pointer objects to refer to them.






              share|improve this answer









              $endgroup$









              • 1




                $begingroup$
                Another package is distributedR which allows you to work with distributed files in RAM.
                $endgroup$
                – adesantos
                Jun 25 '14 at 7:03
















              30












              $begingroup$

              The main problem with using R for large data sets is the RAM constraint. The reason behind keeping all the data in RAM is that it provides much faster access and data manipulations than would storing on HDDs. If you are willing to take a hit on performance, then yes, it is quite practical to work with large datasets in R.




              • RODBC Package: Allows connecting to external DB from R to retrieve and handle data. Hence, the data being manipulated is restricted to your RAM. The overall data set can go much larger.

              • The ff package allows using larger than RAM data sets by utilising memory-mapped pages.

              • BigLM: It builds generalized linear models on big data. It loads data into memory in chunks.

              • bigmemory : An R package which allows powerful and memory-efficient parallel
                analyses and data mining of massive data sets. It permits storing large objects (matrices etc.) in memory (on the RAM) using external pointer objects to refer to them.






              share|improve this answer









              $endgroup$









              • 1




                $begingroup$
                Another package is distributedR which allows you to work with distributed files in RAM.
                $endgroup$
                – adesantos
                Jun 25 '14 at 7:03














              30












              30








              30





              $begingroup$

              The main problem with using R for large data sets is the RAM constraint. The reason behind keeping all the data in RAM is that it provides much faster access and data manipulations than would storing on HDDs. If you are willing to take a hit on performance, then yes, it is quite practical to work with large datasets in R.




              • RODBC Package: Allows connecting to external DB from R to retrieve and handle data. Hence, the data being manipulated is restricted to your RAM. The overall data set can go much larger.

              • The ff package allows using larger than RAM data sets by utilising memory-mapped pages.

              • BigLM: It builds generalized linear models on big data. It loads data into memory in chunks.

              • bigmemory : An R package which allows powerful and memory-efficient parallel
                analyses and data mining of massive data sets. It permits storing large objects (matrices etc.) in memory (on the RAM) using external pointer objects to refer to them.






              share|improve this answer









              $endgroup$



              The main problem with using R for large data sets is the RAM constraint. The reason behind keeping all the data in RAM is that it provides much faster access and data manipulations than would storing on HDDs. If you are willing to take a hit on performance, then yes, it is quite practical to work with large datasets in R.




              • RODBC Package: Allows connecting to external DB from R to retrieve and handle data. Hence, the data being manipulated is restricted to your RAM. The overall data set can go much larger.

              • The ff package allows using larger than RAM data sets by utilising memory-mapped pages.

              • BigLM: It builds generalized linear models on big data. It loads data into memory in chunks.

              • bigmemory : An R package which allows powerful and memory-efficient parallel
                analyses and data mining of massive data sets. It permits storing large objects (matrices etc.) in memory (on the RAM) using external pointer objects to refer to them.







              share|improve this answer












              share|improve this answer



              share|improve this answer










              answered May 14 '14 at 12:39









              asheeshrasheeshr

              5761413




              5761413








              • 1




                $begingroup$
                Another package is distributedR which allows you to work with distributed files in RAM.
                $endgroup$
                – adesantos
                Jun 25 '14 at 7:03














              • 1




                $begingroup$
                Another package is distributedR which allows you to work with distributed files in RAM.
                $endgroup$
                – adesantos
                Jun 25 '14 at 7:03








              1




              1




              $begingroup$
              Another package is distributedR which allows you to work with distributed files in RAM.
              $endgroup$
              – adesantos
              Jun 25 '14 at 7:03




              $begingroup$
              Another package is distributedR which allows you to work with distributed files in RAM.
              $endgroup$
              – adesantos
              Jun 25 '14 at 7:03











              17












              $begingroup$

              Some good answers here. I would like to join the discussion by adding the following three notes:




              1. The question's emphasis on the volume of data while referring to Big Data is certainly understandable and valid, especially considering the problem of data volume growth outpacing technological capacities' exponential growth per Moore's Law (http://en.wikipedia.org/wiki/Moore%27s_law).


              2. Having said that, it is important to remember about other aspects of big data concept. Based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (usually referred to as the "3Vs model"). I mention this, because it forces data scientists and other analysts to look for and use R packages that focus on other than volume aspects of big data (enabled by the richness of enormous R ecosystem).


              3. While existing answers mention some R packages, related to big data, for a more comprehensive coverage, I'd recommend to refer to CRAN Task View "High-Performance and Parallel Computing with R" (http://cran.r-project.org/web/views/HighPerformanceComputing.html), in particular, sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".







              share|improve this answer











              $endgroup$


















                17












                $begingroup$

                Some good answers here. I would like to join the discussion by adding the following three notes:




                1. The question's emphasis on the volume of data while referring to Big Data is certainly understandable and valid, especially considering the problem of data volume growth outpacing technological capacities' exponential growth per Moore's Law (http://en.wikipedia.org/wiki/Moore%27s_law).


                2. Having said that, it is important to remember about other aspects of big data concept. Based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (usually referred to as the "3Vs model"). I mention this, because it forces data scientists and other analysts to look for and use R packages that focus on other than volume aspects of big data (enabled by the richness of enormous R ecosystem).


                3. While existing answers mention some R packages, related to big data, for a more comprehensive coverage, I'd recommend to refer to CRAN Task View "High-Performance and Parallel Computing with R" (http://cran.r-project.org/web/views/HighPerformanceComputing.html), in particular, sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".







                share|improve this answer











                $endgroup$
















                  17












                  17








                  17





                  $begingroup$

                  Some good answers here. I would like to join the discussion by adding the following three notes:




                  1. The question's emphasis on the volume of data while referring to Big Data is certainly understandable and valid, especially considering the problem of data volume growth outpacing technological capacities' exponential growth per Moore's Law (http://en.wikipedia.org/wiki/Moore%27s_law).


                  2. Having said that, it is important to remember about other aspects of big data concept. Based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (usually referred to as the "3Vs model"). I mention this, because it forces data scientists and other analysts to look for and use R packages that focus on other than volume aspects of big data (enabled by the richness of enormous R ecosystem).


                  3. While existing answers mention some R packages, related to big data, for a more comprehensive coverage, I'd recommend to refer to CRAN Task View "High-Performance and Parallel Computing with R" (http://cran.r-project.org/web/views/HighPerformanceComputing.html), in particular, sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".







                  share|improve this answer











                  $endgroup$



                  Some good answers here. I would like to join the discussion by adding the following three notes:




                  1. The question's emphasis on the volume of data while referring to Big Data is certainly understandable and valid, especially considering the problem of data volume growth outpacing technological capacities' exponential growth per Moore's Law (http://en.wikipedia.org/wiki/Moore%27s_law).


                  2. Having said that, it is important to remember about other aspects of big data concept. Based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (usually referred to as the "3Vs model"). I mention this, because it forces data scientists and other analysts to look for and use R packages that focus on other than volume aspects of big data (enabled by the richness of enormous R ecosystem).


                  3. While existing answers mention some R packages, related to big data, for a more comprehensive coverage, I'd recommend to refer to CRAN Task View "High-Performance and Parallel Computing with R" (http://cran.r-project.org/web/views/HighPerformanceComputing.html), in particular, sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".








                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited yesterday









                  Kevin Bowen

                  10915




                  10915










                  answered Jul 19 '14 at 2:19









                  Aleksandr BlekhAleksandr Blekh

                  5,94811747




                  5,94811747























                      12












                      $begingroup$

                      R is great for "big data"! However, you need a workflow since R is limited (with some simplification) by the amount of RAM in the operating system. The approach I take is to interact with a relational database (see the RSQLite package for creating and interacting with a SQLite databse), run SQL-style queries to understand the structure of the data, and then extract particular subsets of the data for computationally-intensive statistical analysis.



                      This just one approach, however: there are packages that allow you to interact with other databases (e.g., Monet) or run analyses in R with fewer memory limitations (e.g., see pbdR).






                      share|improve this answer









                      $endgroup$


















                        12












                        $begingroup$

                        R is great for "big data"! However, you need a workflow since R is limited (with some simplification) by the amount of RAM in the operating system. The approach I take is to interact with a relational database (see the RSQLite package for creating and interacting with a SQLite databse), run SQL-style queries to understand the structure of the data, and then extract particular subsets of the data for computationally-intensive statistical analysis.



                        This just one approach, however: there are packages that allow you to interact with other databases (e.g., Monet) or run analyses in R with fewer memory limitations (e.g., see pbdR).






                        share|improve this answer









                        $endgroup$
















                          12












                          12








                          12





                          $begingroup$

                          R is great for "big data"! However, you need a workflow since R is limited (with some simplification) by the amount of RAM in the operating system. The approach I take is to interact with a relational database (see the RSQLite package for creating and interacting with a SQLite databse), run SQL-style queries to understand the structure of the data, and then extract particular subsets of the data for computationally-intensive statistical analysis.



                          This just one approach, however: there are packages that allow you to interact with other databases (e.g., Monet) or run analyses in R with fewer memory limitations (e.g., see pbdR).






                          share|improve this answer









                          $endgroup$



                          R is great for "big data"! However, you need a workflow since R is limited (with some simplification) by the amount of RAM in the operating system. The approach I take is to interact with a relational database (see the RSQLite package for creating and interacting with a SQLite databse), run SQL-style queries to understand the structure of the data, and then extract particular subsets of the data for computationally-intensive statistical analysis.



                          This just one approach, however: there are packages that allow you to interact with other databases (e.g., Monet) or run analyses in R with fewer memory limitations (e.g., see pbdR).







                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered May 18 '14 at 19:22









                          statsRusstatsRus

                          267110




                          267110























                              9












                              $begingroup$

                              Considering another criteria, I think that in some cases using Python may be much superior to R for Big Data. I know the wide-spread use of R in data science educational materials and the good data analysis libraries available for it, but sometimes it just depend on the team.



                              In my experience, for people already familiar with programming, using Python provides much more flexibility and productivity boost compared to a language like R, which is not as well-designed and powerful compared to Python in terms of a programming language. As an evidence, in a data mining course in my university, the best final project was written in Python, although the others has access to R's rich data analysis library. That is, sometimes the overall productivity (considering learning materials, documentation, etc.) for Python may be better than R even in the lack of special-purpose data analysis libraries for Python. Also, there are some good articles explaining the fast pace of Python in data science: Python Displacing R and Rich Scientific Data Structures in Python that may soon fill the gap of available libraries for R.



                              Another important reason for not using R is when working with real world Big Data problems, contrary to academical only problems, there is much need for other tools and techniques, like data parsing, cleaning, visualization, web scrapping, and a lot of others that are much easier using a general purpose programming language. This may be why the default language used in many Hadoop courses (including the Udacity's online course) is Python.



                              Edit:



                              Recently DARPA has also invested $3 million to help fund Python's data processing and visualization capabilities for big data jobs, which is clearly a sign of Python's future in Big Data. (details)






                              share|improve this answer











                              $endgroup$









                              • 3




                                $begingroup$
                                R is a pleasure to work with for data manipulation (reshape2, plyr, and now dplyr) and I don't think you can do better than ggplot2/ggvis for visualization
                                $endgroup$
                                – organic agave
                                May 18 '14 at 21:52










                              • $begingroup$
                                @pearpies As said in the beginning of my answer, I admit the good libraries available for R, but as a whole, when considering all areas needed for big data (which I as said a few of them in the answer), R is no match for the mature and huge libraries available for Python.
                                $endgroup$
                                – Amir Ali Akbari
                                May 19 '14 at 8:08






                              • 1




                                $begingroup$
                                Peter from Continuum Analytics (one of the companies on the DARPA project referenced above) is working on some very impressive opensource code for data visualization that simply do things that other sets of code are not able to do.
                                $endgroup$
                                – blunders
                                May 20 '14 at 18:46








                              • 5




                                $begingroup$
                                This answer seems to be wholly anecdotal and hardly shows anywhere where R is weak relative to Python.
                                $endgroup$
                                – stanekam
                                Jun 10 '14 at 20:31










                              • $begingroup$
                                Oh my goodness! "As an evidence, in a data mining course in my university, the best final project was written in Python, although the others has access to R's rich data analysis library." And you want to have readers respect your analysis? wow. Could there be any other factors involved in the best project being a python project other than the language it was written in? really....
                                $endgroup$
                                – Shawn Mehan
                                Dec 5 '15 at 18:35
















                              9












                              $begingroup$

                              Considering another criteria, I think that in some cases using Python may be much superior to R for Big Data. I know the wide-spread use of R in data science educational materials and the good data analysis libraries available for it, but sometimes it just depend on the team.



                              In my experience, for people already familiar with programming, using Python provides much more flexibility and productivity boost compared to a language like R, which is not as well-designed and powerful compared to Python in terms of a programming language. As an evidence, in a data mining course in my university, the best final project was written in Python, although the others has access to R's rich data analysis library. That is, sometimes the overall productivity (considering learning materials, documentation, etc.) for Python may be better than R even in the lack of special-purpose data analysis libraries for Python. Also, there are some good articles explaining the fast pace of Python in data science: Python Displacing R and Rich Scientific Data Structures in Python that may soon fill the gap of available libraries for R.



                              Another important reason for not using R is when working with real world Big Data problems, contrary to academical only problems, there is much need for other tools and techniques, like data parsing, cleaning, visualization, web scrapping, and a lot of others that are much easier using a general purpose programming language. This may be why the default language used in many Hadoop courses (including the Udacity's online course) is Python.



                              Edit:



                              Recently DARPA has also invested $3 million to help fund Python's data processing and visualization capabilities for big data jobs, which is clearly a sign of Python's future in Big Data. (details)






                              share|improve this answer











                              $endgroup$









                              • 3




                                $begingroup$
                                R is a pleasure to work with for data manipulation (reshape2, plyr, and now dplyr) and I don't think you can do better than ggplot2/ggvis for visualization
                                $endgroup$
                                – organic agave
                                May 18 '14 at 21:52










                              • $begingroup$
                                @pearpies As said in the beginning of my answer, I admit the good libraries available for R, but as a whole, when considering all areas needed for big data (which I as said a few of them in the answer), R is no match for the mature and huge libraries available for Python.
                                $endgroup$
                                – Amir Ali Akbari
                                May 19 '14 at 8:08






                              • 1




                                $begingroup$
                                Peter from Continuum Analytics (one of the companies on the DARPA project referenced above) is working on some very impressive opensource code for data visualization that simply do things that other sets of code are not able to do.
                                $endgroup$
                                – blunders
                                May 20 '14 at 18:46








                              • 5




                                $begingroup$
                                This answer seems to be wholly anecdotal and hardly shows anywhere where R is weak relative to Python.
                                $endgroup$
                                – stanekam
                                Jun 10 '14 at 20:31










                              • $begingroup$
                                Oh my goodness! "As an evidence, in a data mining course in my university, the best final project was written in Python, although the others has access to R's rich data analysis library." And you want to have readers respect your analysis? wow. Could there be any other factors involved in the best project being a python project other than the language it was written in? really....
                                $endgroup$
                                – Shawn Mehan
                                Dec 5 '15 at 18:35














9

$begingroup$

Considering another criterion, I think that in some cases using Python may be much better than R for Big Data. I know how widespread R is in data science educational materials and how good the data analysis libraries available for it are, but sometimes it just depends on the team.

In my experience, for people already familiar with programming, Python provides much more flexibility and a bigger productivity boost than a language like R, which is not as well designed or as powerful as Python purely as a programming language. As evidence, in a data mining course at my university the best final project was written in Python, even though the other students had access to R's rich data analysis libraries. That is, sometimes the overall productivity (considering learning materials, documentation, etc.) of Python can be better than R's even where Python lacks special-purpose data analysis libraries. Also, there are some good articles explaining Python's fast pace in data science, such as Python Displacing R and Rich Scientific Data Structures in Python, which suggest Python may soon close the gap with the libraries available for R.

Another important reason for not using R is that real-world Big Data problems, unlike purely academic ones, demand many other tools and techniques, such as data parsing, cleaning, visualization, web scraping, and many more that are much easier in a general-purpose programming language. This may be why the default language used in many Hadoop courses (including Udacity's online course) is Python.

Edit:

Recently DARPA also invested $3 million to help fund Python's data processing and visualization capabilities for big data jobs, which is clearly a sign of Python's future in Big Data. (details)

$endgroup$

share|improve this answer

edited May 19 '14 at 8:13

answered May 18 '14 at 12:30

Amir Ali Akbari

• 3

  $begingroup$
  R is a pleasure to work with for data manipulation (reshape2, plyr, and now dplyr), and I don't think you can do better than ggplot2/ggvis for visualization.
  $endgroup$
  – organic agave
  May 18 '14 at 21:52

• $begingroup$
  @pearpies As I said at the beginning of my answer, I acknowledge the good libraries available for R, but as a whole, when considering all the areas needed for big data (a few of which I mentioned in the answer), R is no match for the mature and huge set of libraries available for Python.
  $endgroup$
  – Amir Ali Akbari
  May 19 '14 at 8:08

• 1

  $begingroup$
  Peter from Continuum Analytics (one of the companies on the DARPA project referenced above) is working on some very impressive open-source code for data visualization that simply does things other codebases cannot.
  $endgroup$
  – blunders
  May 20 '14 at 18:46

• 5

  $begingroup$
  This answer seems to be wholly anecdotal and hardly shows anywhere where R is weak relative to Python.
  $endgroup$
  – stanekam
  Jun 10 '14 at 20:31

• $begingroup$
  Oh my goodness! "As evidence, in a data mining course at my university the best final project was written in Python, even though the other students had access to R's rich data analysis libraries." And you want readers to respect your analysis? Wow. Could there be any other factors involved in the best project being a Python project besides the language it was written in? Really....
  $endgroup$
  – Shawn Mehan
  Dec 5 '15 at 18:35
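
To ground the first comment above, here is a minimal R sketch of the dplyr/ggplot2 workflow it praises, using the built-in mtcars data set (the grouping and aesthetics are arbitrary choices made purely for illustration):

    library(dplyr)    # data manipulation verbs: filter, group_by, summarise, ...
    library(ggplot2)  # grammar-of-graphics plotting

    # Summarise fuel efficiency by cylinder count with dplyr's pipe syntax
    summary_tbl <- mtcars %>%
      group_by(cyl) %>%
      summarise(mean_mpg = mean(mpg), n = n())

    print(summary_tbl)

    # Visualize the raw data and a per-group linear trend with ggplot2
    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")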











7

$begingroup$

R is great for a lot of analysis. As mentioned above, there are newer adaptations for big data like MapR, RHadoop, and scalable versions of RStudio.

However, if your concern is libraries, keep your eye on Spark. Spark was created for big data and is MUCH faster than Hadoop alone. It has rapidly growing machine learning, SQL, streaming, and graph libraries, which allows much if not all of the analysis to be done within the framework (with APIs in multiple languages; I prefer Scala) without having to shuffle between languages/tools.

$endgroup$

share|improve this answer

answered Jan 29 '15 at 12:58

Climbs_lika_Spyder






















4

$begingroup$

As other answers have noted, R can be used along with Hadoop and other distributed computing platforms to scale it up to the "Big Data" level. However, if you're not wedded to R specifically, but are willing to use an "R-like" environment, Incanter is a project that might work well for you, as it is native to the JVM (based on Clojure) and doesn't have the "impedance mismatch" between itself and Hadoop that R has. That is to say, from Incanter you can invoke the Java-native Hadoop/HDFS APIs without needing to go through a JNI bridge or anything similar.

$endgroup$

share|improve this answer

answered Jan 29 '15 at 21:03

mindcrime























2

$begingroup$

I am far from an expert, but my understanding of the subject tells me that R (superb for statistics) and e.g. Python (superb at several of the things where R is lacking) complement each other quite well (as pointed out by previous posts).

$endgroup$

share|improve this answer

answered Jul 18 '14 at 15:24

Stenemo
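
One concrete, if later, illustration of that complementarity is the reticulate package, which embeds a Python interpreter inside an R session; the sketch below assumes a Python installation with NumPy available:

    library(reticulate)  # R interface to Python

    # Import a Python module and call it from R; R values are converted
    # to Python objects automatically
    np <- import("numpy")
    x <- c(1.5, 2.5, 3.5)
    np$mean(x)         # computed by NumPy, returned to R as a numeric

    # Run arbitrary Python source and pull results back into R
    py_run_string("squares = [i * i for i in range(5)]")
    py$squares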























0

$begingroup$

I think that there is actually a plethora of tools for working with big data in R.
sparklyr will be a great player in that field. sparklyr is an R interface to Apache Spark that allows connecting to local and remote clusters and provides a dplyr back-end. One can also rely on Apache Spark's machine learning libraries.
Furthermore, parallel processing is possible with several packages such as Rmpi and snow (user controlled) or doMC/foreach (system based), as sketched below.

$endgroup$

share|improve this answer

answered Nov 8 '18 at 15:35

paoloeusebi
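
sparklyr itself is sketched under the Spark answer above; to illustrate the parallel-processing route, here is a minimal doMC/foreach sketch. The worker count and the toy bootstrap task are arbitrary choices for the example, and doMC is fork-based, so it only parallelizes on Unix-like systems:

    library(foreach)  # looping construct with pluggable parallel back-ends
    library(doMC)     # multicore back-end (fork-based, Unix-like systems only)

    registerDoMC(cores = 4)  # register 4 worker processes

    # Bootstrap the median of a sample in parallel: each iteration runs on
    # a worker, and .combine = c collects the results into one vector
    x <- rnorm(1000)
    boot_medians <- foreach(i = 1:200, .combine = c) %dopar% {
      median(sample(x, replace = TRUE))
    }

    sd(boot_medians)  # bootstrap standard error of the median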





























