
SOLVED: Comparison of Dataframe and RDD count execution time in Apache Spark [duplicate]

Rhyzx:

This question already has an answer here:

  • How to know which count query is the fastest? 2 answers

I have a short question regarding DataFrames and RDDs in Spark (2.1.0).
I load a table from a CSV file and then execute a very simple count operation, like this:


// DataFrame
spark.table("lineitem").count();

// RDD
spark.table("lineitem").rdd().count();

The execution time for the DataFrame version is 7.991649 s,
and the RDD API needs 19.384267 s.

I get the same difference if I execute a reduce or an aggregate operation to sum a column.

Is there any way to execute a count operation on RDDs (using only the RDD API) so that it is on par with, or faster than, the DataFrame implementation?
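For reference, wall-clock timings like the ones quoted above could be taken with a small helper along these lines. This is only a sketch: the class name `CountTimer` and its `time` method are hypothetical, not part of Spark, and the workload in `main` is a plain-Java stand-in so that the snippet runs without Spark on the classpath. A single run also includes JVM warm-up, so repeated runs would give steadier numbers.

```java
import java.util.function.LongSupplier;
import java.util.stream.LongStream;

// Hypothetical helper (not a Spark API): runs an action once and
// records how long it took in wall-clock seconds.
final class CountTimer {

    // Result of one timed run: the value the action returned and the elapsed time.
    static final class Timed {
        final long value;
        final double seconds;

        Timed(long value, double seconds) {
            this.value = value;
            this.seconds = seconds;
        }
    }

    // Executes the action once and measures it with the monotonic System.nanoTime clock.
    static Timed time(LongSupplier action) {
        long start = System.nanoTime();
        long value = action.getAsLong();
        long elapsedNanos = System.nanoTime() - start;
        return new Timed(value, elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        // Stand-in workload. With a SparkSession named `spark` available (an assumption),
        // the two variants from the question would be passed instead:
        //   time(() -> spark.table("lineitem").count())
        //   time(() -> spark.table("lineitem").rdd().count())
        Timed t = time(() -> LongStream.range(0, 1_000_000).count());
        System.out.println(t.value + " elements counted in " + t.seconds + " s");
    }
}
```

Timing both variants through the same helper keeps the measurement overhead identical, so any remaining gap comes from the two code paths themselves.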


