
Self-Learn Yourself Apache Spark in 21 Blogs – #4

Guest blog post by Kumar Chinnakali

In Blog 4, we will look at Apache Spark Core, its ecosystem, and running Apache Spark on the AWS cloud. Click through for a quick read of Blogs 1-3 in this learning series.

Apache Spark has many components, including Spark Core, which is responsible for task scheduling, memory management, fault recovery, and interacting with storage systems. The main libraries built on top of Spark Core are:

Spark SQL > Structured data > Querying with SQL/HQL

Spark Streaming > Processing of live streams > Micro-batching

MLlib > Machine learning > Multiple types of ML algorithms

GraphX > Graph processing > Graph-parallel computations
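
To make the ecosystem concrete, here is a minimal PySpark sketch touching Spark Core (RDDs) and Spark SQL from the table above. It assumes a local Spark installation and uses the Spark 1.x-era API current when this series was written; the sample data and table name are made up for illustration.

# A minimal sketch, assuming a local PySpark (Spark 1.x) installation;
# the sample data and the "tools" table name are illustrative only.
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Spark Core handles task scheduling, memory management, and fault recovery.
sc = SparkContext("local[*]", "Blog4Demo")
sqlContext = SQLContext(sc)

# Spark Core: create and transform an RDD.
rdd = sc.parallelize([("spark", 10), ("hadoop", 2)])
print(rdd.mapValues(lambda v: v * 2).collect())

# Spark SQL: query the same data as structured rows.
df = sqlContext.createDataFrame(rdd, ["name", "count"])
df.registerTempTable("tools")
sqlContext.sql("SELECT name FROM tools WHERE count > 5").show()

sc.stop()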

Now that we have a fair idea of the theory behind Apache Spark, let's get our hands dirty. The first step is to get an Apache Spark environment up and running.

  1. Log in to https://aws.amazon.com/
  2. Create an account with your ID
  3. Choose AWS Management Console
  4. Under Services, choose EMR
  5. Choose Create Cluster
  6. Provide the cluster name, software configuration, and number of instances
  7. Choose the EC2 key pair that you created using steps 17-23 below
  8. Click Create cluster
  9. Under Services, choose EC2
  10. The EC2 Dashboard lists the details of all your instances
  11. Copy the master node's access path (its public DNS name) and paste it into PuTTY
  12. Connect as hadoop@<master-public-DNS>; hadoop is the default login user on EMR instances
  13. Under Connection > SSH > Auth in PuTTY, choose the .ppk key created from the .pem file with PuTTYgen (see the key pair steps below)
  14. Click Open and the session to the instance will start
  15. S3 buckets need to be created to hold the input and output files; this is also scripted in the sketch after this list
  16. e.g. s3://myawsbucket/input
  17. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
  18. From the navigation bar, select a region for the key pair. You can select any region that’s available to you, regardless of your location. This choice is important because some Amazon EC2 resources can be shared between regions, but key pairs can’t. For example, if you create a key pair in the US West (Oregon) region, you can’t see or use the key pair in another region.
  19. In the navigation pane, under NETWORK & SECURITY, choose Key Pairs.
  20. Choose Create Key Pair.
  21. Enter a name for the new key pair in the Key pair name field of the Create Key Pair dialog box, and then choose Create.
  22. The private key file is automatically downloaded by your browser. The base file name is the name you specified as the name of your key pair, and the file name extension is .pem. Save the private key file in a safe place.
  23. If you will use an SSH client on a Mac or Linux computer to connect to your Linux instance, use the following command to set the permissions of your private key file so that only you can read it: $ chmod 400 my-key-pair.pem
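
If you prefer scripting the setup, here is a minimal sketch of steps 15-23 using boto3, the AWS SDK for Python. It assumes your AWS credentials are already configured; the region, key pair name, and bucket name are placeholders, not values from this series.

# A minimal sketch, assuming boto3 is installed and AWS credentials
# are configured; "my-key-pair" and "myawsbucket" are placeholders.
import os
import boto3

# Key pairs are per-region (step 18); us-west-2 is an assumed example.
ec2 = boto3.client("ec2", region_name="us-west-2")

# Create the key pair (steps 19-21) and save the private key (step 22).
key = ec2.create_key_pair(KeyName="my-key-pair")
with open("my-key-pair.pem", "w") as f:
    f.write(key["KeyMaterial"])

# Restrict permissions so only you can read the file (step 23).
os.chmod("my-key-pair.pem", 0o400)

# Create an S3 bucket for the input and output files (steps 15-16).
s3 = boto3.client("s3", region_name="us-west-2")
s3.create_bucket(
    Bucket="myawsbucket",  # bucket names are global; pick a unique one
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)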

To launch a cluster with Spark installed using the console

The following procedure creates a cluster with Spark installed. For more information about launching clusters with the console, see the Amazon EMR documentation.

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
  2. Choose Create cluster.
  3. For the Software Configuration field, choose Amazon AMI Version 3.9.0 or later.
  4. For the Applications to be installed field, choose Spark from the list, then choose Configure and add.
  5. You can add arguments to change the Spark configuration. For more information, see Configure Spark. Choose Add.
  6. Select other options as necessary and then choose Create cluster.
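
The same launch can be scripted. Below is a minimal boto3 sketch, assuming the default EMR IAM roles already exist in your account; the cluster name, region, instance types, and key pair name are placeholders. Note that the console steps above reference AMI version 3.9.0, whereas newer EMR releases are selected with a release label instead.

# A minimal sketch, assuming boto3 and default EMR roles; names and
# sizes below are placeholders, not values prescribed by this series.
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region

response = emr.run_job_flow(
    Name="spark-learning-cluster",        # hypothetical cluster name
    ReleaseLabel="emr-5.36.0",            # newer EMR uses release labels, not AMI versions
    Applications=[{"Name": "Spark"}],     # install Spark on the cluster
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",      # the key pair created earlier
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",    # default EMR instance role
    ServiceRole="EMR_DefaultRole",        # default EMR service role
)
print("Cluster id:", response["JobFlowId"])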

In Blog 5, we will cover the Apache Spark languages with basic hands-on examples.

