Spark-Most-Frequent-Word-Counter

This is built using the following technologies:

  • Spark 2.3.1
  • JDK 1.8.0_181

To run it:

  1. Download Spark from https://spark.apache.org/downloads.html

  2. Run the program using the following command:

$ ./bin/spark-submit --class MostFreqWord --master local[4] most-freq-word-counter-1.0-SNAPSHOT.jar {path to a file}
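
If the jar is not already built: its name follows Maven conventions, so, assuming the project includes a pom.xml, it can presumably be produced with a standard Maven build before running spark-submit:

$ mvn package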

How Spark executes this program:

  1. The DAG gets created without anything actually being executed. The snippet below computes nothing; it only builds a dependency graph (DAG) that lays out the sequence in which the transformations will be applied, returning an RDD for each transformation.
// Transformations only: each call below extends the DAG; nothing is read or computed yet.
JavaRDD<String> textFile = sc.textFile(args[0]);               // RDD of the file's lines
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split("[ ,]")).iterator())   // split each line into words
    .mapToPair(word -> new Tuple2<>(word, 1))                   // pair each word with a count of 1
    .reduceByKey((a, b) -> a + b);                               // sum the counts per word
JavaPairRDD<Integer, String> swapped = counts
    .mapToPair((PairFunction<Tuple2<String, Integer>, Integer, String>) Tuple2::swap) // (word, count) -> (count, word)
    .sortByKey(false);                                           // sort by count, descending
  2. It is only when we call List<Tuple2<Integer, String>> output = swapped.collect(); that Spark actually starts executing, in the sequence laid out by the DAG created in step 1. A sketch of the full driver is shown below.
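
For reference, here is a minimal sketch of what the complete driver might look like. This is an assumption based on the snippets and the spark-submit command above (class name MostFreqWord, args[0] as the input path); the actual source in this repository may differ, for example in how the result is printed.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class MostFreqWord {
    public static void main(String[] args) {
        // The master (local[4], yarn, ...) is supplied by spark-submit, so it is not hard-coded here.
        SparkConf conf = new SparkConf().setAppName("MostFreqWord");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Transformations: these only build the DAG.
            JavaRDD<String> textFile = sc.textFile(args[0]);
            JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(s.split("[ ,]")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
            JavaPairRDD<Integer, String> swapped = counts
                .mapToPair((PairFunction<Tuple2<String, Integer>, Integer, String>) Tuple2::swap)
                .sortByKey(false);

            // Action: collect() triggers execution of the whole DAG.
            List<Tuple2<Integer, String>> output = swapped.collect();

            // After sorting counts in descending order, the first element is the most frequent word.
            if (!output.isEmpty()) {
                Tuple2<Integer, String> top = output.get(0);
                System.out.println("Most frequent word: " + top._2() + " (" + top._1() + " occurrences)");
            }
        }
    }
}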

Which tasks get distributed over the cluster?

  1. .flatMap(s -> Arrays.asList(s.split("[ ,]")).iterator())

  2. .mapToPair(word -> new Tuple2<>(word, 1))

  3. .reduceByKey((a, b) -> a + b);

  4. .mapToPair((PairFunction<Tuple2<String, Integer>, Integer, String>) Tuple2::swap)

  5. .sortByKey(false);

The functions (lambda expressions) in all of the above statements get distributed over the cluster and execute in parallel. When running locally on a laptop, however, they run in a single JVM; we can set the number of worker threads, and the functions then run in different threads.
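
As a side note, the number of local worker threads can also be set programmatically through SparkConf rather than with the --master flag. The snippet below is purely illustrative (the class name LocalMasterExample is hypothetical; this program relies on spark-submit to supply the master):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalMasterExample {
    public static void main(String[] args) {
        // Equivalent to passing --master local[4] on the command line:
        SparkConf conf = new SparkConf()
                .setAppName("MostFreqWord")
                .setMaster("local[4]");   // local[k] = run Spark locally with k worker threads
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // ... build and run the same DAG as above ...
        }
    }
}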

When running the Spark program, we can choose YARN as the cluster manager, which will in turn distribute tasks over the cluster depending on the available resources.

$ ./bin/spark-submit --class MostFreqWord --master yarn most-freq-word-counter-1.0-SNAPSHOT.jar {path to a file}

This command lets YARN distribute the tasks over the cluster.

$ ./bin/spark-submit --class MostFreqWord --master local[k] most-freq-word-counter-1.0-SNAPSHOT.jar {path to a file}

This command runs Spark locally with k worker threads to which the tasks are distributed.

About

This Java program counts the most frequent word in a given file using Apache Spark.
