Guest blog post by Kumar Chinnakali
Apache Spark has several components. Spark Core is responsible for task scheduling, memory management, fault recovery, and interacting with storage systems. The other components are:
Spark SQL > Structured data > Querying with SQL/HQL
Spark Streaming > Processing of live streams > Micro-batching
MLlib > Machine learning > Multiple types of ML algorithms
GraphX > Graph processing > Graph-parallel computations
Now that we have a fair idea of the theory behind Apache Spark, let’s get our hands dirty. The first step is to get an Apache Spark environment up and running.
- Log in at https://aws.amazon.com/
- Create an account with your ID
- Choose AWS Management Console
- Under Services choose EMR
- Choose Create Cluster
- Provide the cluster name, software configuration, and number of instances
- Choose the EC2 key pair that you created using the steps below
- Click Create cluster
- Under Services choose EC2
- The EC2 Dashboard lists the details of all your instances
- Copy the master node instance's access path and paste it into PuTTY
- [email protected] instance
- Under SSH, choose the .ppk key that you created in PuTTYgen using the steps below
- Click Open to connect to the instance
- Add S3 buckets to hold the input and output files
- e.g., s3://myawsbucket/input
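Since the input and output locations are written as S3 URIs, it can help to see how such a URI breaks down into a bucket and a path. The following is a minimal Python sketch, not part of the original walkthrough. The helper name split_s3_uri and the output location are hypothetical; only s3://myawsbucket/input comes from the example above.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket name, path inside the bucket)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc is the bucket; the path keeps a leading "/" that we drop.
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    input_uri = "s3://myawsbucket/input"    # from the example above
    output_uri = "s3://myawsbucket/output"  # hypothetical output location
    print(split_s3_uri(input_uri))
    print(split_s3_uri(output_uri))
```

Keeping input and output under separate paths of the same bucket is only one possible layout; separate buckets work just as well.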
To create a key pair using the Amazon EC2 console
- Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
- From the navigation bar, select a region for the key pair. You can select any region that’s available to you, regardless of your location. This choice is important because some Amazon EC2 resources can be shared between regions, but key pairs can’t. For example, if you create a key pair in the US West (Oregon) region, you can’t see or use the key pair in another region.
- In the navigation pane, under NETWORK & SECURITY, choose Key Pairs.
- Choose Create Key Pair.
- Enter a name for the new key pair in the Key pair name field of the Create Key Pair dialog box, and then choose Create.
- The private key file is automatically downloaded by your browser. The base file name is the name you specified as the name of your key pair, and the file name extension is .pem. Save the private key file in a safe place.
- If you will use an SSH client on a Mac or Linux computer to connect to your Linux instance, use the following command to set the permissions of your private key file so that only you can read it:
  $ chmod 400 my-key-pair.pem
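If you prefer to script the permission step, the same effect as chmod 400 can be had from Python. This is a minimal sketch, assuming a Mac or Linux machine. The helper name restrict_key_permissions is hypothetical, and the demo runs on a throwaway file rather than a real key.

```python
import os
import stat
import tempfile


def restrict_key_permissions(path: str) -> int:
    """Make the file readable by its owner only (same as `chmod 400`)."""
    os.chmod(path, stat.S_IRUSR)  # owner read bit only, i.e. mode 400
    # Return the permission bits now set on the file, for checking.
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Demonstrate on a placeholder file, not on a real private key.
    with tempfile.TemporaryDirectory() as tmp:
        key_path = os.path.join(tmp, "my-key-pair.pem")
        with open(key_path, "w") as f:
            f.write("placeholder, not a real key\n")
        mode = restrict_key_permissions(key_path)
        print(oct(mode))
```

SSH clients typically refuse a private key that other users can read, which is why this step matters before connecting.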
To launch a cluster with Spark installed using the console
The following procedure creates a cluster with Spark installed. For more information about launching clusters with the console, see the Amazon EMR documentation.
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- Choose Create cluster.
- For the Software Configuration field, choose Amazon AMI Version 3.9.0 or later.
- For the Applications to be installed field, choose Spark from the list, then choose Configure and add.
- You can add arguments to change the Spark configuration. For more information, see Configure Spark. Choose Add.
- Select other options as necessary and then choose Create cluster.
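The console steps above have a rough command-line counterpart in the AWS CLI. The sketch below only builds the aws emr create-cluster argument list and prints it; it does not contact AWS. The helper name build_create_cluster_command, the cluster name, the instance type, and the instance count are assumptions for illustration, and AMI version 3.9.0 is an old release, so it may no longer be launchable in your account.

```python
import shlex
from typing import List


def build_create_cluster_command(
    name: str,
    key_name: str,
    instance_type: str,
    instance_count: int,
    ami_version: str = "3.9.0",  # AMI version named in the steps above
) -> List[str]:
    """Build (but do not run) an `aws emr create-cluster` argument list."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--ami-version", ami_version,
        "--applications", "Name=Spark",             # install Spark
        "--ec2-attributes", f"KeyName={key_name}",  # EC2 key pair for SSH
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
    ]


if __name__ == "__main__":
    cmd = build_create_cluster_command(
        name="Spark cluster",       # hypothetical cluster name
        key_name="my-key-pair",     # key pair created in the earlier steps
        instance_type="m3.xlarge",  # assumed instance type
        instance_count=3,           # assumed cluster size
    )
    # Print a shell-safe command line for review before running it by hand.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run; review the output and adjust the values before pasting it into a terminal with the AWS CLI configured.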
In Blog 5, we will cover the Apache Spark languages with basic hands-on examples.
Originally posted here.