Persisting Application History from Ephemeral Clusters on Google Cloud Dataproc

By Jacob Ferriero, Apr 2

So you want to use ephemeral Dataproc clusters, but don't want to lose your valuable YARN and Spark history once your cluster is torn down.
Enter the single node Dataproc cluster.
It will serve as your persistent history server for MapReduce and Spark job history as well as a window into aggregated YARN logs.
Background

Hadoop engineers who have been around for a while are probably used to the YARN and Spark UIs for application history.
Some may even dig through the YARN container logs for a deeper layer of debugging.
These UIs and container logs provide valuable information for diagnosing what went wrong during a certain job.
In the cloud era (no pun intended) of cluster ephemerality, you’d like to have a persistent way of accessing these familiar interfaces.
Dataproc is the Google Cloud Platform service for managing Hadoop clusters, making it simple to spin-up and tear-down resources at will.
This post outlines a pattern to have a single-node long-running Dataproc cluster that acts as an Application History Server for a shorter lived cluster (or several such clusters).
This post follows best practices for creating long-running Dataproc environments as described in Google's blog post: "10 tips for building long-running clusters using Cloud Dataproc".

What We're Building

- A GCS bucket to house cluster templates, initialization actions, aggregated YARN logs, Spark events, and the MapReduce done-dir and intermediate-done-dir.
- A VPC network for your clusters. (Note: ssh access to private IP addressed machines could be provided by a VPN paired with this VPC, but this is out of scope for this example.)
- A single-node Dataproc cluster for viewing the Spark and MapReduce job history UIs. This will have a public IP to allow simple SSH access via OS Login.
- A Dataproc cluster configured to persist its YARN logs in GCS. This will not have public IP addresses.
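Under the hood, persisting history to GCS comes down to pointing a handful of YARN, Spark, and MapReduce properties at that bucket. The fragment below is a hedged sketch of the softwareConfig properties such an ephemeral cluster template might set; the bucket name and paths are placeholders, not the repo's actual values.

```yaml
# Hypothetical sketch -- see the repo's cluster templates for the real values.
config:
  softwareConfig:
    properties:
      # Aggregate YARN container logs to GCS instead of local disk.
      yarn:yarn.log-aggregation-enable: 'true'
      yarn:yarn.nodemanager.remote-app-log-dir: 'gs://my_history_bucket/yarn-logs'
      # Write Spark event logs where the history server will read them.
      spark:spark.eventLog.enabled: 'true'
      spark:spark.eventLog.dir: 'gs://my_history_bucket/spark-events'
      # MapReduce job history locations.
      mapred:mapreduce.jobhistory.done-dir: 'gs://my_history_bucket/done-dir'
      mapred:mapreduce.jobhistory.intermediate-done-dir: 'gs://my_history_bucket/intermediate-done-dir'
```

The history server's own template would point spark.history.fs.logDirectory and the job history directories at these same paths, so it reads what the ephemeral clusters write.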
Let's Get Building

First things first, we need to enable a few services in the project in which you wish to spin up this example.

gcloud config set project <your-project-id>
gcloud services enable compute.googleapis.com dataproc.googleapis.com

Next, pull down the Google Cloud professional services repository and navigate to the source for this example.

git clone https://github.com/GoogleCloudPlatform/professional-services.git
cd professional-services/examples/dataproc-persistent-history-server

Using the Right Tool: gcloud vs Terraform

Capturing and version controlling the configuration of your cloud environment has become best practice.
However, engineers are faced with different ways of doing this.
Terraform provides a lot of value when you are spinning up several permanent components of your infrastructure, with various dependencies.
In other words, Terraform is great when you want something to authoritatively capture the state of your environment. The methods provided by the Google Cloud SDK (i.e. the gcloud CLI) are really convenient for spinning up a single resource during prototyping, and for managing ephemeral resources.
When it comes to Dataproc clusters, if you plan to regularly tear down your cluster (as would be the case for a job scoped cluster), it is best to capture the configuration as a cluster template yaml (this can be exported from an existing cluster or written by hand).
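For instance, a cluster's configuration can be round-tripped through yaml with gcloud. This is a sketch; the cluster and file names here are placeholders.

```shell
# Export an existing cluster's configuration to a reusable template
# (cluster and file names are placeholders).
gcloud beta dataproc clusters export my-prototype-cluster \
    --destination=cluster_templates/my-cluster.yaml \
    --region=us-central1

# Later, recreate a congruent cluster from that template.
gcloud beta dataproc clusters import my-recreated-cluster \
    --source=cluster_templates/my-cluster.yaml \
    --region=us-central1
```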
It doesn't make sense to reapply a Terraform plan at the beginning and end of every job to manage cluster spin-up and tear-down, as this could be prone to breaking. The code for this example provides both Terraform and yaml for both the history server and a cluster configured to persist its logs on GCS.
The thought behind this is:

- If you are treating your history server as a rather permanent part of your environment, it should be in Terraform.
- If you are treating the history server as something you spin up and tear down when you're using it, then it makes more sense to create it using gcloud beta dataproc clusters import.

Similarly, the ephemeral-cluster.yaml and long-running-cluster.tf files will spin up congruent clusters. Choose which cluster configuration tool makes more sense for your use of that cluster.
If you are just trying to run this example, it’s recommended you read the Google Cloud SDK Environment Setup without running the gcloud blocks.
Just check out the professional services repo, read through the files in the terraform directory, set the appropriate values in terraform.tfvars, and apply the plan for this example. Note: you'll need to have Terraform installed.
This will greatly simplify tearing down all the resources for this example when you’re done.
Terraform Environment Set Up

From the dataproc-persistent-history-server directory, run the following commands.

cd terraform
terraform init
terraform plan
terraform apply

You can skip the gcloud blocks in the next section, but I'd recommend skimming them for an explanation of why certain decisions were made in configuring the environment for this example.
Google Cloud SDK Environment Setup

This walkthrough will spin up an environment using the gcloud CLI while explaining each choice.

Storage and Service Account Setup

Create a Google Cloud Storage bucket in which you'll store your logs.
It’s a clean practice to keep logs in a separate bucket from your data to make the intention of the bucket clear and to simplify privileges.
(Note: You must stage an empty object in this GCS bucket so that the spark-events path exists before running a job, otherwise the job will fail.)

gsutil mb -c regional -l us-central1 gs://my_history_bucket
touch .keep
gsutil cp .keep gs://my_history_bucket/spark-events/.keep

Create a service account for your history server.

gcloud iam service-accounts create history-server-account

Give this service account permissions to run a Dataproc cluster:

gcloud projects add-iam-policy-binding project_id --member=serviceAccount:history-server-account@project_id.iam.gserviceaccount.com --role=roles/dataproc.worker

Grant read access to the history bucket:

gsutil iam ch serviceAccount:history-server-account@project_id.iam.gserviceaccount.com:objectViewer gs://my_history_bucket

Setting up your Network

You want to take extra care in setting up the firewall rules to avoid the common security pitfall of exposing the YARN web interface (port 8088) to the public internet, allowing bad actors to submit jobs on your cluster.
The recommended approach for engineers connecting to a cluster is to connect via a VPN that is peered with your VPC network. This way, engineers can ssh into clusters by their private IP addresses using gcloud compute ssh --internal-ip <target-instance>.
To automate Linux account creation you can enable OS Login.
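For example, OS Login can be turned on project-wide with a single metadata flag. This is shown here as a sketch; you may prefer to set it per-instance instead.

```shell
# Enable OS Login for all instances in the project so Linux accounts
# are managed through IAM rather than per-instance SSH keys.
gcloud compute project-info add-metadata --metadata enable-oslogin=TRUE
```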
To keep this example simple, you will assign the history server a public IP address and a firewall rule to allow any ingress traffic on port 22.
Note: The firewall rule will expose an entry point to your Dataproc cluster from the public internet on port 22.
First things first, get your networking prerequisites in order. The following command creates a network with regional routing and specifies that you will create your own subnetworks.

gcloud compute networks create example-net --bgp-routing-mode regional --subnet-mode custom

Add a subnetwork for Dataproc.
The following command creates a subnet for Dataproc in the us-central1 region. It specifies a CIDR range of 10.0.0.0/16, giving the machines on this network internal IPs from 10.0.0.0 to 10.0.255.255. It also enables Private Google Access, so that machines without public IP addresses on this network can reach GCP services.

gcloud compute networks subnets create example-hadoop-subnet-central --range 10.0.0.0/16 --network example-net --region us-central1 --enable-private-ip-google-access

Next, create a firewall rule to allow ssh for the hadoop-history-ui-access tag.
For the source range, you're following suit with the default network and allowing ssh from all IPs. You should consider restricting this CIDR range to only those of the data engineering teams that need to ssh to the cluster.

gcloud compute firewall-rules create example-hadoop-subnet-central-hadoop-history-ui-ssh --allow tcp:22 --direction INGRESS --network example-net --target-tags hadoop-history-ui-access,hadoop-admin-ui-access --source-ranges 0.0.0.0/0

Finally, because Hadoop uses many ports for communication between nodes, create this firewall rule as well to allow all tcp, udp, and icmp traffic between nodes that have any of the specified tags.

gcloud compute firewall-rules create example-hadoop-subnet-central --allow tcp,udp,icmp --source-tags hadoop-history-ui-access,hadoop-admin-ui-access --target-tags hadoop-history-ui-access,hadoop-admin-ui-access

Spinning up a History Server

Think of this history server as a window into the logs you've persisted on GCS.
All of the state is captured on GCS so you can choose to spin this history server up and down at will or keep it running for convenience.
However, keep in mind:

"…the MapReduce job history server only reads history from Cloud Storage when it first starts up. From that point forward, the only new job history you will see in the UI is for the jobs that were run on that cluster; in other words, these are jobs whose history was moved by that cluster's job history server. By contrast, the Spark job history server is completely stateless, so the previous caveats do not apply."
— Mikayla Konst & Chris Crosbie, "10 tips for building long-running clusters using Cloud Dataproc"

Next, create a single-node Dataproc cluster on the sub-network created above.
This will serve as your long-running Application History Server.
You can use the history_server.yaml file to create the history server. First, replace the placeholders in the yaml files by running the following sed commands.

cd workflow_templates
sed -i 's/PROJECT/your-gcp-project-id/g' *
sed -i 's/HISTORY_BUCKET/your-history-bucket/g' *
sed -i 's/HISTORY_SERVER/your-history-server/g' *
sed -i 's/REGION/us-central1/g' *
sed -i 's/ZONE/us-central1-f/g' *
sed -i 's/SUBNET/your-subnet-id/g' *
cd ../cluster_templates
sed -i 's/PROJECT/your-gcp-project-id/g' *
sed -i 's/HISTORY_BUCKET/your-history-bucket/g' *
sed -i 's/HISTORY_SERVER/your-history-server/g' *
sed -i 's/REGION/us-central1/g' *
sed -i 's/ZONE/us-central1-f/g' *
sed -i 's/SUBNET/your-subnet-id/g' *

Now you're ready to spin up the history server.

gcloud beta dataproc clusters import history-server --source=cluster_templates/history-server.yaml --region=us-central1

Creating a Cluster and Submitting Jobs

Finally, spin up an ephemeral cluster that's configured to write logs to Google Cloud Storage, so they can live long after your cluster is torn down.
It’s best practice to capture information about the clusters you are creating in a yaml file for consistent cluster creation.
Note: The cluster yamls refer to an initialization action on GCS that disables the local history servers on the ephemeral clusters. This can be found in disable_history_servers.sh. You must stage this script in your history bucket.

gsutil cp init_actions/disable_history_servers.sh gs://my_history_bucket/init_actions/

You can manage cluster spin-up, submission of a Spark job, submission of a Hadoop job, and finally cluster tear-down with a workflow template.
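For intuition, an init action that disables the per-cluster history servers might look roughly like the sketch below. This is an assumption for illustration (the systemd service names and the metadata helper vary by Dataproc image version); refer to the script in the repo for the real implementation.

```shell
#!/bin/bash
# Hypothetical sketch -- the real script is init_actions/disable_history_servers.sh.
set -euo pipefail

# Only the master runs the history server daemons.
ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
if [[ "${ROLE}" == 'Master' ]]; then
  # Stop and disable the local MapReduce and Spark history servers so that
  # the long-running single-node cluster is the only place history is served.
  systemctl stop hadoop-mapreduce-historyserver spark-history-server || true
  systemctl disable hadoop-mapreduce-historyserver spark-history-server || true
fi
```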
This is the most straightforward way to run this example. The job dependencies and cluster definition can be found in spark_mr_workflow_template.yaml.

gcloud dataproc workflow-templates instantiate-from-file spark-mr-example --region=us-central1 --source=workflow_templates/spark_mr_workflow_template.yaml
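For orientation, a workflow template of this shape couples a managed (ephemeral) cluster to a sequence of jobs. The sketch below is hypothetical and heavily abridged; the real definition, including the log-persistence properties, is in workflow_templates/spark_mr_workflow_template.yaml.

```yaml
# Hypothetical, abridged sketch -- not the repo's actual template.
placement:
  managedCluster:
    clusterName: ephemeral-cluster
    config:
      gceClusterConfig:
        subnetworkUri: your-subnet-id
        internalIpOnly: true
      initializationActions:
        - executableFile: gs://my_history_bucket/init_actions/disable_history_servers.sh
jobs:
  - stepId: spark-example
    sparkJob:
      mainClass: org.apache.spark.examples.SparkPi
      jarFileUris:
        - file:///usr/lib/spark/examples/jars/spark-examples.jar
      args: ['1000']
  - stepId: hadoop-example
    prerequisiteStepIds:
      - spark-example
    hadoopJob:
      mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
      args: ['wordcount', 'gs://my_history_bucket/input', 'gs://my_history_bucket/output']
```

Instantiating the template creates the cluster, runs the jobs in dependency order, and tears the cluster down when they finish.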
Analyze Your Spark Application History Long After Your Cluster Is Gone

Now that the cluster where you ran the job has been deleted, use an SSH tunnel to your history server to display the Spark history in your Chrome browser.
The following commands are for a Linux OS; you can follow the instructions for your OS in the Dataproc docs.

gcloud compute ssh history-server-m --project=your-project-id --zone=us-central1-f -- -D 1080 -N

Connect Chrome through the proxy.

/usr/bin/google-chrome --proxy-server="socks5://localhost:1080" --user-data-dir="/tmp/history-server-m" http://history-server-m:18080

This should launch the Chrome browser, and you should be able to see the Spark history UI. Take a look at the MapReduce job history server UI as well.

/usr/bin/google-chrome --proxy-server="socks5://localhost:1080" --user-data-dir="/tmp/history-server-m" http://history-server-m:19888

Closing Thoughts

While this is just an example, hopefully it paints a picture of how you can retain the logs that are important for diagnosing misbehaving jobs and still confidently destroy clusters when jobs complete.
Depending on your team structure, you could build on top of this:

- If the engineers who author your pipelines do not also make infrastructure decisions: create an API on top of workflow templates that abstracts these cluster configuration details away from your ETL engineers and lets them worry only about writing their pipelines.
- If your ETL pipeline engineers want control over the infrastructure their pipelines run on: implement a practice where each job definition includes a workflow template that captures the job and the appropriate cluster configuration for it to run on.

The configuration in this example can serve as boilerplate for the logging pattern discussed in this post.
Clean Up

Don't forget to tear down the resources unless you want to pay for them :). Listen to the "Threat Level Midnight" hero, and clean up the artifacts from this example. If you spun up with Terraform, just run terraform destroy. If not, delete your infrastructure in the Cloud Console or with gcloud in the following order:

1. Dataproc clusters: gcloud dataproc clusters delete <cluster-id>
2. GCS bucket: gsutil rb <bucket-name>
3. Service account: gcloud iam service-accounts delete <service-account-id>
4. VPC subnet: gcloud compute networks subnets delete <subnet-name>
5. VPC network: gcloud compute networks delete <network-name>