Is the R language suitable for Big Data?
R has many libraries aimed at data analysis (e.g. JAGS, BUGS, arules), and it is mentioned in popular textbooks such as J. Kruschke, "Doing Bayesian Data Analysis" and B. Lantz, "Machine Learning with R".
I've seen a guideline of 5 TB for a dataset to be considered Big Data.
My question is: Is R suitable for the amount of data typically seen in Big Data problems?
Are there strategies to be employed when using R with datasets of this size?
Tags: bigdata, r
asked May 14 '14 at 11:15 by akellyirl; edited May 14 '14 at 13:06 by Konstantin V. Salikhov
In addition to the answers below, a good thing to remember is that most of what you need from R for Big Data can be done with summary datasets that are very small compared to the raw logs. Sampling from the raw log also provides a seamless way to use R for analysis without the headache of parsing line after line of a raw log. For example, for a common modelling task at work I routinely use MapReduce to summarize 32 GB of raw logs into 28 MB of user data for modelling.
– cwharland
May 14 '14 at 17:45
9 Answers
Actually, this is coming around. The book "R in a Nutshell" even has a section on using R with Hadoop for big data processing. Some workarounds are needed because R does all its work in memory, so you are basically limited to the amount of RAM you have available.
A mature project for R and Hadoop is RHadoop. RHadoop has been divided into several sub-projects: rhdfs, rhbase, rmr2, plyrmr, and quickcheck (wiki).
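As a rough sketch of what this looks like in practice, here is a minimal word-count job written with rmr2, assuming a working Hadoop installation and the RHadoop packages; the HDFS input path is hypothetical:

    library(rmr2)

    # Minimal word count; the HDFS input path below is hypothetical.
    counts <- mapreduce(
      input        = "/data/raw/lines.txt",
      input.format = "text",
      map = function(k, lines) {
        words <- unlist(strsplit(lines, "\\s+"))
        keyval(words, 1L)                      # emit (word, 1) pairs
      },
      reduce = function(word, ones) keyval(word, sum(ones))
    )

    out <- from.dfs(counts)                    # pull the (small) result back into R
    head(values(out))

The point is that the heavy lifting happens on the Hadoop cluster; R only sees the keys and values handed to each map and reduce call, plus the final aggregated output.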
answered May 14 '14 at 11:24 by MCP_infiltrator
But does using R with Hadoop overcome this limitation (having to do computations in memory)?
– Felipe Almeida
Jun 9 '14 at 23:07
RHadoop does overcome this limitation. The tutorial here, github.com/RevolutionAnalytics/rmr2/blob/master/docs/…, spells it out clearly. You need to shift into a MapReduce mindset, but it does bring the power of R to the Hadoop environment.
– Steve Kallestad♦
Jun 11 '14 at 6:34
Two newer alternatives that are worth mentioning are SparkR (databricks.com/blog/2015/06/09/…) and H2O (h2o.ai/product), both well suited for big data.
– wacax
Dec 5 '15 at 6:03
The main problem with using R for large datasets is the RAM constraint. The reason for keeping all the data in RAM is that it provides much faster access and manipulation than storage on disk would. If you are willing to take a hit on performance, then yes, it is quite practical to work with large datasets in R:

- RODBC: allows connecting to an external database from R to retrieve and handle data. Hence, only the data currently being manipulated needs to fit in RAM; the overall dataset can be much larger.
- ff: allows using larger-than-RAM datasets by utilising memory-mapped pages.
- biglm: builds generalized linear models on big data by loading the data into memory in chunks.
- bigmemory: an R package which allows powerful and memory-efficient parallel analyses and data mining of massive datasets. It permits storing large objects (matrices etc.) in memory (RAM) using external pointer objects to refer to them.
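To make the chunked approach concrete, here is a sketch of fitting a linear model incrementally with biglm; the file name and the columns y, x1, x2 are hypothetical:

    library(biglm)

    # Hypothetical CSV with columns y, x1, x2; fit incrementally in 100k-row chunks.
    chunk_size <- 1e5
    con <- file("large_data.csv", open = "r")
    col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume the header
    fit <- NULL
    repeat {
      chunk <- tryCatch(
        read.csv(con, header = FALSE, col.names = col_names, nrows = chunk_size),
        error = function(e) NULL)   # read.csv errors once the connection is drained
      if (is.null(chunk) || nrow(chunk) == 0) break
      fit <- if (is.null(fit)) biglm(y ~ x1 + x2, data = chunk)
             else update(fit, moredata = chunk)
    }
    close(con)
    summary(fit)  # coefficients estimated without ever holding the full file in RAM

Only one chunk is resident at a time; biglm keeps sufficient statistics between updates, so memory use stays flat regardless of file size.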
answered May 14 '14 at 12:39 by asheeshr
Another package is distributedR, which allows you to work with distributed files in RAM.
– adesantos
Jun 25 '14 at 7:03
Some good answers here. I would like to join the discussion by adding the following three notes:

The question's emphasis on the volume of data when referring to Big Data is certainly understandable and valid, especially considering that data volume growth is outpacing the exponential growth of technological capacities per Moore's Law (http://en.wikipedia.org/wiki/Moore%27s_law).

Having said that, it is important to remember the other aspects of the big data concept. Based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (usually referred to as the "3Vs model"). I mention this because it forces data scientists and other analysts to look for and use R packages that focus on aspects of big data other than volume (enabled by the richness of the enormous R ecosystem).

While existing answers mention some R packages related to big data, for more comprehensive coverage I'd recommend referring to the CRAN Task View "High-Performance and Parallel Computing with R" (http://cran.r-project.org/web/views/HighPerformanceComputing.html), in particular the sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".
answered Jul 19 '14 at 2:19 by Aleksandr Blekh
R is great for "big data"! However, you need a workflow, since R is limited (with some simplification) by the amount of RAM in the operating system. The approach I take is to interact with a relational database (see the RSQLite package for creating and interacting with a SQLite database), run SQL-style queries to understand the structure of the data, and then extract particular subsets of the data for computationally intensive statistical analysis.
This is just one approach, however: there are packages that allow you to interact with other databases (e.g., MonetDB) or run analyses in R with fewer memory limitations (e.g., see pbdR).
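A minimal sketch of that workflow with DBI and RSQLite; the database file, table, and column names are all hypothetical:

    library(DBI)

    # Hypothetical on-disk database with a large 'events' table.
    con <- dbConnect(RSQLite::SQLite(), "events.sqlite")

    # Explore structure and size without pulling the data into R.
    dbListTables(con)
    dbGetQuery(con, "SELECT COUNT(*) AS n FROM events")

    # Extract only the subset needed for the memory-intensive analysis.
    recent <- dbGetQuery(con, "SELECT value, group_id FROM events WHERE year = 2014")
    fit <- lm(value ~ factor(group_id), data = recent)

    dbDisconnect(con)

The database does the filtering on disk; R only ever holds the subset that the statistical step actually needs.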
answered May 18 '14 at 19:22 by statsRus
Considering other criteria, I think that in some cases using Python may be much superior to R for Big Data. I know R is widely used in data science educational materials and good data analysis libraries are available for it, but sometimes it just depends on the team.

In my experience, for people already familiar with programming, using Python provides much more flexibility and a productivity boost compared to a language like R, which is not as well designed and powerful as Python as a programming language. As evidence, in a data mining course at my university, the best final project was written in Python, although the others had access to R's rich data analysis libraries. That is, sometimes the overall productivity (considering learning materials, documentation, etc.) with Python may be better than with R, even in the absence of special-purpose data analysis libraries for Python. Also, there are some good articles explaining the fast pace of Python in data science: Python Displacing R and Rich Scientific Data Structures in Python, which may soon fill the gap in libraries available for R.

Another important reason for not using R is that, when working with real-world Big Data problems (as opposed to purely academic problems), there is much need for other tools and techniques, like data parsing, cleaning, visualization, web scraping, and a lot of others that are much easier in a general-purpose programming language. This may be why the default language used in many Hadoop courses (including Udacity's online course) is Python.

Edit:

Recently DARPA also invested $3 million to help fund Python's data processing and visualization capabilities for big data jobs, which is clearly a sign of Python's future in Big Data. (details)
answered May 18 '14 at 12:30 by Amir Ali Akbari; edited May 19 '14 at 8:13
R is a pleasure to work with for data manipulation (reshape2, plyr, and now dplyr) and I don't think you can do better than ggplot2/ggvis for visualization.
– organic agave
May 18 '14 at 21:52
@pearpies As I said at the beginning of my answer, I acknowledge the good libraries available for R, but as a whole, when considering all the areas needed for big data (a few of which I mentioned in the answer), R is no match for the mature and huge libraries available for Python.
– Amir Ali Akbari
May 19 '14 at 8:08
Peter from Continuum Analytics (one of the companies on the DARPA project referenced above) is working on some very impressive open-source code for data visualization that simply does things that other sets of code are not able to do.
– blunders
May 20 '14 at 18:46
This answer seems to be wholly anecdotal and hardly shows anywhere where R is weak relative to Python.
– stanekam
Jun 10 '14 at 20:31
Oh my goodness! "As an evidence, in a data mining course in my university, the best final project was written in Python, although the others has access to R's rich data analysis library." And you want to have readers respect your analysis? wow. Could there be any other factors involved in the best project being a python project other than the language it was written in? really....
– Shawn Mehan
Dec 5 '15 at 18:35
R is great for a lot of analysis. As mentioned above, there are newer adaptations for big data like MapR, RHadoop, and scalable versions of RStudio.
However, if your concern is libraries, keep your eye on Spark. Spark was created for big data and is MUCH faster than Hadoop alone. It has rapidly growing machine learning, SQL, streaming, and graph libraries, allowing much if not all of the analysis to be done within the framework (with multiple language APIs; I prefer Scala) without having to shuffle between languages/tools.
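For staying in R specifically, here is a minimal SparkR sketch of this idea, assuming Spark 2.x with its bundled SparkR package; the file path and column names are hypothetical:

    library(SparkR)
    sparkR.session(master = "local[*]")   # or a cluster master URL

    # Hypothetical CSV too large for a single in-memory data.frame.
    df <- read.df("hdfs:///data/events.csv", source = "csv", header = "true")

    # The aggregation runs on the Spark executors; only the small result is collected.
    by_user <- summarize(groupBy(df, df$user_id), n = count(df$user_id))
    head(collect(by_user))

    sparkR.session.stop()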
answered Jan 29 '15 at 12:58 by Climbs_lika_Spyder
As other answers have noted, R can be used along with Hadoop and other distributed computing platforms to scale it up to the "Big Data" level. However, if you're not wedded to R specifically but are willing to use an "R-like" environment, Incanter is a project that might work well for you, as it is native to the JVM (based on Clojure) and doesn't have the "impedance mismatch" between itself and Hadoop that R has. That is to say, from Incanter you can invoke the native Java Hadoop/HDFS APIs without needing to go through a JNI bridge or anything.
answered Jan 29 '15 at 21:03 by mindcrime
I am far from an expert, but my understanding of the subject tells me that R (superb for statistics) and e.g. Python (superb at several of those things where R is lacking) complement each other quite well (as pointed out by previous posts).
answered Jul 18 '14 at 15:24 by Stenemo
I think there is actually a plethora of tools for working with big data in R.
sparklyr will be a great player in that field. sparklyr is an R interface to Apache Spark that allows connecting to local and remote clusters and provides a dplyr back-end. One can also rely on Apache Spark's machine learning libraries.
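A minimal sketch of the sparklyr workflow; nycflights13::flights merely stands in for a genuinely large table here:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")   # swap in a cluster master for real use

    # Copy (or in practice, read) a table into Spark.
    flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

    # dplyr verbs are translated to Spark SQL and executed on the cluster.
    flights_tbl %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
      collect()

    spark_disconnect(sc)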
Furthermore, parallel processing is possible with several packages, such as Rmpi and snow (user controlled) or doMC/foreach (system based).
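For example, a sketch of an embarrassingly parallel loop with foreach; the chunk files and the columns y, x1, x2 are hypothetical (doParallel is used here since doMC is Unix-only):

    library(doParallel)   # attaches foreach; doMC offers the same on Unix-alikes
    registerDoParallel(cores = 4)

    # Hypothetical per-chunk CSV files; fit one model per chunk in parallel.
    files <- list.files("chunks", pattern = "\\.csv$", full.names = TRUE)
    results <- foreach(f = files, .combine = rbind) %dopar% {
      chunk <- read.csv(f)
      coef(lm(y ~ x1 + x2, data = chunk))   # one coefficient row per chunk
    }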
answered Nov 8 '18 at 15:35 by paoloeusebi