Alg_D:
I am wondering how to choose the best settings to tune my Spark job. Basically I am just reading a big CSV
file into a DataFrame
and counting some string occurrences.
The input file is over 500 GB, and the Spark job seems too slow.
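For context, the job is essentially of this shape (a minimal sketch; the input path, header option, "text" column name, and search string are placeholders, since the real code is not shown):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CountOccurrences {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CountOccurrences").getOrCreate()

    // Read the big CSV into a DataFrame (path and header option are placeholders).
    val df = spark.read.option("header", "true").csv("/data/big_file.csv")

    // Count rows whose (hypothetical) "text" column contains the target string.
    val occurrences = df.filter(col("text").contains("someString")).count()
    println(s"Occurrences: $occurrences")

    spark.stop()
  }
}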
Terminal progress bar:
[Stage 1:=======> (4174 + 50) / 18500]

Number of completed tasks (4174): completing these took around one hour.
Number of active tasks (50): I believe I can control this with --conf spark.dynamicAllocation.maxExecutors=50 (I tried different values).
Total number of tasks (18500): why is this fixed? What does it mean? Is it only related to the file size? Since I am just reading a CSV with very little logic, how can I optimize the Spark job?
I also tried changing:
--executor-memory 10g
--driver-memory 12g
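Put together, the full spark-submit invocation is roughly of this shape (the class name, jar, and input path are placeholders, and I am assuming dynamic allocation is enabled on the cluster):

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --executor-memory 10g \
  --driver-memory 12g \
  --class CountOccurrences \
  count-occurrences.jar /data/big_file.csv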