Question

Submit a ZIP file that includes a Word document with a cover page

listing the names of your team members, followed by each of the steps

outlined below, each clearly identified with a title. Also include your data

sources in the ZIP file.

Please provide thorough comments on your steps and work.

Failure to comply with the submission guidelines will result in penalties.

1. Identify a data source of your choice (see https://donnees.montreal.ca/) and

provide the link to your data source in your Word document.

Describe your data source in your Word document. Verify the data and assess

its quality. Identify and perform any necessary preprocessing.

(20 points)

2. Add your data source to HDFS in your Hadoop environment. Include your steps in

your Word document. (20 points)

3. Identify a first processing task for this data source. Create and test your

MapReduce code in your Hadoop environment. Use comments to clearly identify

each step of your MapReduce code. Describe your processing task in one to two

sentences and include it in your Word document. (30 points)

4. Identify a second processing task (different from the first processing task in step

3) for this data source. Create and test your Spark SQL code in your Hadoop

environment. The use of temporary tables is not allowed in your project. Use

comments to clearly identify each step of your Spark SQL code. Describe your

processing task in one to two sentences and include it in your Word document.

(30 points)
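For the data verification in step 1, a small profiling pass can give concrete numbers to report in the Word document. The following is only a sketch, assuming the chosen dataset is a CSV file with a header row; the sample columns `borough`, `year`, and `count` are hypothetical and stand in for whatever the real dataset contains.

```python
import csv
import io
from collections import Counter


def profile_csv(text, delimiter=","):
    """Return basic quality indicators for CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = reader.fieldnames or []
    missing = Counter()  # number of blank cells per column
    seen = set()         # rows already encountered, used to spot duplicates
    total = 0
    duplicates = 0

    for row in reader:
        total += 1
        # A cell counts as missing when it is absent or contains only spaces.
        for col in columns:
            if not (row.get(col) or "").strip():
                missing[col] += 1
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)

    return {
        "rows": total,
        "columns": columns,
        "missing": {col: missing[col] for col in columns},
        "duplicates": duplicates,
    }


# Hypothetical sample standing in for a downloaded dataset.
SAMPLE = """borough,year,count
Verdun,2021,12
Verdun,2021,12
Outremont,,7
"""

report = profile_csv(SAMPLE)
print(report)
```

On a real file, the same function could be called with the file's contents; the counts of missing values and duplicate rows then suggest which preprocessing (dropping duplicates, filling or discarding blanks) may be needed.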
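For step 2, loading a file into HDFS usually comes down to three HDFS shell commands: create the target directory, upload the file, and list the directory to confirm. The sketch below wraps them in Python so the steps can be logged and pasted into the Word document. It assumes the `hdfs` command is on the PATH of the Hadoop environment; the paths in the usage note are hypothetical.

```python
import subprocess
import sys
from pathlib import PurePosixPath


def hdfs_upload_commands(local_file, hdfs_dir):
    """Build the HDFS shell commands that create a directory, upload, and list."""
    target = str(PurePosixPath(hdfs_dir) / PurePosixPath(local_file).name)
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],          # create the target directory
        ["hdfs", "dfs", "-put", "-f", local_file, target],  # upload, overwriting if present
        ["hdfs", "dfs", "-ls", hdfs_dir],                   # confirm the file is there
    ]


def upload(local_file, hdfs_dir):
    """Run each command in order, stopping at the first failure."""
    for cmd in hdfs_upload_commands(local_file, hdfs_dir):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Runs only when called with two arguments: <local_file> <hdfs_dir>
if __name__ == "__main__" and len(sys.argv) == 3:
    upload(sys.argv[1], sys.argv[2])
```

A call such as `python3 put_to_hdfs.py data/trees.csv /user/student/project` (script name and paths are placeholders) would print each command before running it.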
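For step 3, one possible first processing task is counting records per value of a column. The sketch below assumes the job is run with Hadoop Streaming and Python; the course may expect Java MapReduce instead, in which case only the map and reduce logic carries over. `KEY_COLUMN` and `HEADER_NAME` are hypothetical and must match the real dataset.

```python
import csv
import sys
from itertools import groupby

KEY_COLUMN = 0           # hypothetical: index of the column to count by
HEADER_NAME = "borough"  # hypothetical: header text of that column


def mapper(lines, key_column=KEY_COLUMN, header_name=HEADER_NAME):
    """Map step: emit 'key<TAB>1' for every data record."""
    for fields in csv.reader(lines):
        # Skip malformed rows that are too short to contain the key.
        if len(fields) <= key_column:
            continue
        key = fields[key_column].strip()
        # Skip blank keys and the header row (any split may contain it).
        if key and key != header_name:
            yield f"{key}\t1"


def reducer(lines):
    """Reduce step: sum the counts per key. Input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


# Runs only when called with one argument: "map" or "reduce".
if __name__ == "__main__" and len(sys.argv) == 2:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Before submitting the job, the logic can be checked locally by piping the file through the map step, `sort`, and the reduce step, which imitates the shuffle phase. On the cluster, the same script would be passed to Hadoop Streaming through its `-mapper` and `-reducer` options.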
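For step 4, one way to respect the ban on temporary tables is to stay within the Spark DataFrame API, which is part of the Spark SQL module and never registers a temporary view. This reading of "temporary tables" is an assumption; confirm it with the instructor. The sketch assumes PySpark is available in the Hadoop environment and that the data is a CSV file with a header; the column names `borough` and `value` are hypothetical.

```python
import sys


def average_by_group(df, group_col, value_col):
    """Return one row per group with its record count and average value."""
    from pyspark.sql import functions as F  # lazy import: needs PySpark installed

    return (
        df.where(F.col(value_col).isNotNull())  # ignore rows without a value
        .groupBy(group_col)                     # one group per distinct key
        .agg(
            F.count("*").alias("records"),
            F.avg(value_col).alias("average"),
        )
        .orderBy(F.desc("average"))             # highest average first
    )


def main(input_path, output_path):
    from pyspark.sql import SparkSession

    # Step 1: start (or reuse) a Spark session.
    spark = SparkSession.builder.appName("second-task").getOrCreate()
    # Step 2: read the CSV from HDFS, letting Spark infer column types.
    df = spark.read.csv(input_path, header=True, inferSchema=True)
    # Step 3: aggregate without registering any temporary view.
    result = average_by_group(df, "borough", "value")
    # Step 4: write the result back to HDFS as CSV.
    result.write.mode("overwrite").csv(output_path, header=True)
    spark.stop()


# Runs only when called with two arguments: <input_path> <output_path>
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Because the second task must differ from the first, an average per group is used here rather than a plain count; any other aggregation or join would follow the same pattern.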

You will be evaluated on the consistency of your processing tasks, on the

completeness and level of detail of your Word document relative to the

specifications, and on the optimality and quality of your code. To

produce consistent work, draw inspiration from the practice exercises

done in class, but do not simply replicate the examples covered in

those exercises.
