Rhyzx:
This question already has an answer here:
- How to know which count query is the fastest? (2 answers)
I have a short question regarding DataFrames and RDDs in Spark (2.1.0).
When I load a table from a CSV file and then execute a very simple count operation like this:
// DataFrame
spark.table("lineitem").count();
// RDD
spark.table("lineitem").rdd().count();
the execution time for the DataFrame is 7.991649 s, while the RDD API needs 19.384267 s.
I get the same difference if I execute a reduce or an aggregate operation to sum a column.
Is there any way to execute a count operation on RDDs (using just the RDD API) so that it is on par with or faster than the DataFrame implementation?
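For reference, a minimal sketch of how the two counts could be timed side by side in a spark-shell session. This assumes the "lineitem" table is already registered, as in the question; the `time` helper is a hypothetical utility, not part of the Spark API.

```scala
// Hypothetical timing helper; assumes a spark-shell session where
// `spark` (SparkSession) is in scope and "lineitem" is a registered table.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

// DataFrame count: planned and executed by Catalyst/Tungsten.
time("DataFrame count") { spark.table("lineitem").count() }

// RDD count: .rdd converts the Dataset to an RDD[Row], which forces
// deserialization of every row before counting.
time("RDD count") { spark.table("lineitem").rdd.count() }
```

Note that `.rdd` materializes each row as a `Row` object, which is a plausible source of the overhead the timings show.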