Import CSV file contents into PySpark DataFrames
How can I import a .csv file into a PySpark DataFrame? I tried reading the CSV into pandas and then converting it to a Spark DataFrame with createDataFrame, but it still shows an error. Can someone guide me through this? Also, how can I import an .xlsx file?
I'm trying to import the CSV contents into a pandas DataFrame and then convert it into a Spark DataFrame, but it shows the error:
Py4JJavaError: An error occurred while calling o28.applySchemaToPythonRDD. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
My code was:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sqlc = SQLContext(sc)  # sc is the SparkContext the pyspark shell provides
df = pd.read_csv(r'D:\BestBuy\train.csv')
sdf = sqlc.createDataFrame(df)
pyspark
asked Aug 1 '16 at 11:21 by neha, edited Aug 2 '16 at 18:04 by Emre
If you have an error message, you should post it; it most likely has important info for debugging the situation. – j.a.gartner, Aug 1 '16 at 15:55
I'm trying to import the CSV contents into a pandas DataFrame and then convert it into a Spark DataFrame, but it shows an error like "Py4JJavaError: An error occurred while calling o28.applySchemaToPythonRDD. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient" – neha, Aug 2 '16 at 4:59
And my code was: from pyspark import SparkContext; from pyspark.sql import SQLContext; import pandas as pd; sqlc=SQLContext(sc); df=pd.read_csv(r'D:\BestBuy\train.csv'); sdf=sqlc.createDataFrame(df) --> Error – neha, Aug 2 '16 at 5:01
Welcome to DataScience.SE! Please edit your original post instead of adding comments. – Emre, Aug 2 '16 at 7:54
The file path must be in HDFS; only then can you read the data. – Prakash Reddy, 2 days ago
3 Answers
"How can I import a .csv file into pyspark dataframes?"
-- There are many ways to do this; the simplest is to start PySpark with Databricks' spark-csv module loaded. You can do this by starting pyspark with
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
Then you can read the file like this:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
The other method is to read the text file in as an RDD:
myrdd = sc.textFile("yourfile.csv").map(lambda line: line.split(","))
Then transform your data so that every item is the correct type for the schema (int, string, float, etc.). You'll then want to use:
>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> # 'rdd' below is your RDD of correctly typed tuples, e.g. sc.parallelize([('Alice', 1)])
>>> person = rdd.map(lambda r: Person(*r))
>>> df2 = sqlContext.createDataFrame(person)
>>> df2.collect()
[Row(name=u'Alice', age=1)]
>>> from pyspark.sql.types import *
>>> schema = StructType([
... StructField("name", StringType(), True),
... StructField("age", IntegerType(), True)])
>>> df3 = sqlContext.createDataFrame(rdd, schema)
>>> df3.collect()
[Row(name=u'Alice', age=1)]
Reference: http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.Row
"Also, please tell me how can I import xlsx file?" -- Excel files are not used in "Big Data"; Spark is meant to be used with large files or databases. If you have an Excel file that is 50GB in size, then you're doing things wrong. Excel wouldn't even be able to open a file that size; from my experience, anything above 20MB and Excel dies.
answered Aug 8 '16 at 17:27 by Jon
I have a file 'temp.csv' in my local directory. From there, using a local instance, I do the following:
>>> from pyspark.sql import SQLContext
>>> from pyspark.sql import Row
>>> sql_c = SQLContext(sc)
>>> d0 = sc.textFile('./temp.csv')
>>> d0.collect()
[u'a,1,.2390', u'b,2,.4390', u'c,3,.2323']
>>> d1 = d0.map(lambda x: x.split(',')).map(lambda x: Row(label = x[0], number = int(x[1]), value = float(x[2])))
>>> d1.take(1)
[Row(label=u'a', number=1, value=0.239)]
>>> df = sql_c.createDataFrame(d1)
>>> df_cut = df[df.number>1]
>>> df_cut.select('label', 'value').collect()
[Row(label=u'b', value=0.439), Row(label=u'c', value=0.2323)]
So d0 is the raw text file that we send off to a Spark RDD. To make a DataFrame, you want to break the CSV apart and make every entry a Row type, as I do when creating d1. The last step is to make the DataFrame from the RDD.
answered Aug 1 '16 at 16:24 by j.a.gartner
You can use the package spark-csv by Databricks, which does a lot of things for you automatically, like taking care of the header, using escape characters, and automatic schema inference. Starting from Spark 2.0 there is a built-in function for dealing with CSVs.
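A minimal sketch of the Spark 2.0+ built-in reader, assuming the SparkSession named spark that the pyspark shell provides and a hypothetical file 'train.csv':
# Spark 2.0+: no external package needed.
# header=True uses the first line as column names;
# inferSchema=True samples the data to choose column types.
df = spark.read.csv('train.csv', header=True, inferSchema=True)
df.printSchema()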
answered Aug 2 '16 at 20:39 by Jan van der Vegt