Search for question
Question

3. (Note: From this question onwards you CANNOT use websites but can only use course

material. Also, for all Mapreduce questions please only use the class definition/template

for Mapreduce jobs, i.e., Map/Reduce tasks/functions, and not other sources from the

Web, since there are many different variants on the Web!) One of the candidates

always imitates and emulates a previous president. This candidate believes they

will win by becoming the most popular person on social media. An intern in

their campaign wants to write a Mapreduce program. In MapReduce, one writes

a program for Map that processes one input line at a time and outputs zero or

more (key, value) pairs; and one writes a program for Reduce that processes an

input of (key, all values for key). The iteration over input lines is done

automatically by the MapReduce framework. The intern would like to know who

are the influential Twitter users most similar to their candidate, and would like

to use Hadoop for this. The intern uses an input file containing information from

Twitter (which is an asymmetrical social network) about which users "follow"

which other users. If user a follows b, the entry line is (a, b) - you can assume this

data is already sharded in HDFS and can be loaded from there. Can you help the

intern? Write a MapReduce program (Map and Reduce separately) that outputs

the list of all users U who satisfy the following three conditions simultaneously:

U has at least 100 million followers, and U herself/himself follows fewer than 10

users, and U follows at least one user V who in turn has at least 10 million

followers (e.g., @Barack Obama would be such a U). You can chain Mapreduces if

you want (but only if you must, and even then, only the least number). Your

program must be as highly parallelizable as possible. Correctness is paramount,

though you should try to make it as fast as possible.

Fig: 1