<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>databricks Archives - Big Data Processing</title>
	<atom:link href="https://bigdataproc.com/category/databricks/feed/" rel="self" type="application/rss+xml" />
	<link>https://bigdataproc.com/category/databricks/</link>
	<description>Big Data Solution for GCP, AWS, Azure and on-prem</description>
	<lastBuildDate>Wed, 10 May 2023 15:32:02 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.3.1</generator>
	<item>
		<title>GCP &#8211; Execute Jar on Databricks from Airflow</title>
		<link>https://bigdataproc.com/gcp-execute-jar-on-databricks-from-airflow/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=gcp-execute-jar-on-databricks-from-airflow</link>
					<comments>https://bigdataproc.com/gcp-execute-jar-on-databricks-from-airflow/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Thu, 10 Feb 2022 14:21:54 +0000</pubDate>
				<category><![CDATA[Airflow]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[gcp]]></category>
		<guid isPermaLink="false">https://bigdataproc.com/?p=316</guid>

					<description><![CDATA[<p>We have a framework written using the Spark Scala API for file ingestion. We use Cloud Composer, also known as Airflow, for our job orchestration.&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/gcp-execute-jar-on-databricks-from-airflow/">Continue reading<span class="screen-reader-text">GCP &#8211; Execute Jar on Databricks from Airflow</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/gcp-execute-jar-on-databricks-from-airflow/">GCP &#8211; Execute Jar on Databricks from Airflow</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>We have a framework written using the Spark Scala API for file ingestion, and we use Cloud Composer, also known as Airflow, for our job orchestration. We wanted Airflow (Composer) to perform the following tasks: </p>



<ol id="block-4454f759-5e9e-4746-83f3-9454d9e4383b"><li>Create a cluster with the provided configuration</li><li>Install the jar while creating the cluster</li><li>Create a job and execute it with the given parameters</li></ol>



<p>The good thing is that Airflow has an operator to execute a jar file. However, the example available on the Airflow website is very specific to the AWS environment, so it took me some time to figure out how to create a DAG for Databricks on GCP. </p>



<p>Let&#8217;s understand how to do this. </p>



<h2 class="wp-block-heading">Setup Databricks Connection </h2>



<p>To set up the connection you need two things: a Databricks API token and the Databricks workspace URL. </p>



<h3 class="wp-block-heading">Generate API Token</h3>



<p>To generate a Databricks API token, log in to your workspace and go to <strong>Settings &#8211;&gt; User Settings</strong>. Then click on <strong>Generate New Token</strong>. Save this token somewhere safe, as you won&#8217;t be able to retrieve it again. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" fetchpriority="high" width="571" height="283" src="https://bigdataproc.com/wp-content/uploads/2022/02/image-3.png" alt="" class="wp-image-319" srcset="https://bigdataproc.com/wp-content/uploads/2022/02/image-3.png 571w, https://bigdataproc.com/wp-content/uploads/2022/02/image-3-300x149.png 300w" sizes="(max-width: 571px) 100vw, 571px" /></figure></div>
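

<p>As a side note (my addition, not from the post): if you already hold a valid token, you can also mint further tokens programmatically through the Databricks Token API (<strong>/api/2.0/token/create</strong>). The workspace URL and token below are placeholders, and the actual network call is left commented out.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import json
import urllib.request

def build_token_request(workspace_url, existing_token, comment, lifetime_seconds=3600):
    """Build a POST request for the Databricks Token API."""
    payload = json.dumps({"comment": comment,
                          "lifetime_seconds": lifetime_seconds}).encode()
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.0/token/create",
        data=payload,
        headers={"Authorization": f"Bearer {existing_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical values -- substitute your own workspace URL and token.
req = build_token_request("https://xxx.gcp.databricks.com", "dapiEXAMPLE", "airflow")
# urllib.request.urlopen(req) would return JSON containing "token_value".
print(req.full_url)</pre>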



<p>The following tasks need to be performed on Airflow (Cloud Composer). You will need <strong>admin </strong>access for this. </p>



<h3 class="wp-block-heading">Install Databricks Library </h3>



<ul><li>From the Google Cloud console, navigate to your Cloud Composer instance and click on it. </li><li>Click on PYPI PACKAGES and then click on EDIT. </li><li>Add the <strong>apache-airflow-providers-databricks</strong> package.</li><li>It takes some time to apply these changes, so wait a while and revisit this page to confirm the package installed properly. (Restricted internet connectivity from the cloud environment can make this installation fail.)<br></li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" width="1024" height="321" src="https://bigdataproc.com/wp-content/uploads/2022/02/image-4-1024x321.png" alt="" class="wp-image-320" srcset="https://bigdataproc.com/wp-content/uploads/2022/02/image-4-1024x321.png 1024w, https://bigdataproc.com/wp-content/uploads/2022/02/image-4-300x94.png 300w, https://bigdataproc.com/wp-content/uploads/2022/02/image-4-768x241.png 768w, https://bigdataproc.com/wp-content/uploads/2022/02/image-4.png 1347w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure></div>
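

<p>A quick way to confirm the install succeeded (a sketch I&#8217;m adding, not from the post) is to probe for the provider module from a one-off PythonOperator or a scratch DAG:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from importlib import util

def provider_installed(dotted="airflow.providers.databricks"):
    """Return True if the dotted module path is importable."""
    try:
        return util.find_spec(dotted) is not None
    except ModuleNotFoundError:
        # A parent package (e.g. airflow itself) is missing entirely.
        return False

print("databricks provider available:", provider_installed())</pre>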



<h3 class="wp-block-heading">Create Connection </h3>



<p>Log in to Airflow and navigate to <strong>Admin &#8211;&gt; Connections</strong>.  We are just modifying the default connection, so if the Databricks package installed successfully you should see a <strong>databricks_default</strong> connection. Click on edit.  You only need to fill in the following fields. </p>



<ul><li><strong>Host</strong>: your host should be the Databricks workspace URL; it should look something like this: <br><strong>https://xxx.gcp.databricks.com/?o=xxx</strong></li><li><strong>Password</strong>: in the password field, paste the API token we created in the first step. </li></ul>
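

<p>Alternatively (an approach not shown in this post), Airflow can also read connections from <strong>AIRFLOW_CONN_&lt;CONN_ID&gt;</strong> environment variables in URI form, with the token riding in the password slot. The host and token below are placeholders:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from urllib.parse import quote

def databricks_conn_uri(host, token):
    """Compose a connection URI; the API token must be URL-encoded."""
    return f"databricks://:{quote(token, safe='')}@{host}"

# Placeholder host and token -- substitute your own values, then e.g.:
#   export AIRFLOW_CONN_DATABRICKS_DEFAULT="databricks://:dapi1234%2Fabc@xxx.gcp.databricks.com"
print(databricks_conn_uri("xxx.gcp.databricks.com", "dapi1234/abc"))</pre>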



<p>Now let&#8217;s write a DAG to create the cluster and execute the jar file. </p>



<h2 class="wp-block-heading">DAG to Execute JAR on Databricks</h2>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.models import Variable

with DAG(
    dag_id='ingest_csv_file_to_bigqury',
    schedule_interval='@daily',
    start_date=datetime(2021, 1, 1),
    tags=['Gaurang'],
    catchup=False,
) as dag:
    new_cluster = {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "n1-standard-4",
        "autoscale": {
            "min_workers": 2,
            "max_workers": 8
        },
        "gcp_attributes": {
            "use_preemptible_executors": False,
            "google_service_account": "sa-databricks@testproject.iam.gserviceaccount.com"

        },
    }

    spark_jar_task = DatabricksSubmitRunOperator(
        task_id='run_ingestion_jar',
        new_cluster=new_cluster,
        spark_jar_task={'main_class_name': 'com.test.Main', "parameters": ["xxx", "yyy"]},
        libraries=[{'jar': 'gs://deploymet/test-1.0.jar'}],
    )

    spark_jar_task</pre>



<p> <strong>DatabricksSubmitRunOperator </strong>takes three parameters. </p>



<ul><li>new_cluster: a JSON spec for creating the new cluster. </li><li>spark_jar_task: your main class and the parameters your jar expects. </li><li>libraries: the location of your jar file. </li></ul>
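<p>For intuition, these three arguments end up merged into a single JSON payload for the Databricks Runs Submit REST endpoint. Below is a rough sketch of that shape (the values reuse the DAG above; the <strong>run_name</strong> default is my assumption, not something the operator requires you to set):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import json

def submit_run_payload(new_cluster, spark_jar_task, libraries, run_name="airflow-run"):
    """Approximate the JSON body sent to the Databricks runs/submit endpoint."""
    return {
        "new_cluster": new_cluster,
        "spark_jar_task": spark_jar_task,
        "libraries": libraries,
        "run_name": run_name,
    }

payload = submit_run_payload(
    new_cluster={"spark_version": "7.3.x-scala2.12", "node_type_id": "n1-standard-4"},
    spark_jar_task={"main_class_name": "com.test.Main", "parameters": ["xxx", "yyy"]},
    libraries=[{"jar": "gs://deploymet/test-1.0.jar"}],
)
print(json.dumps(payload, indent=2))</pre>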
<p>The post <a rel="nofollow" href="https://bigdataproc.com/gcp-execute-jar-on-databricks-from-airflow/">GCP &#8211; Execute Jar on Databricks from Airflow</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/gcp-execute-jar-on-databricks-from-airflow/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
