<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>linage Archives - Big Data Processing</title>
	<atom:link href="https://bigdataproc.com/tag/linage/feed/" rel="self" type="application/rss+xml" />
	<link>https://bigdataproc.com/tag/linage/</link>
	<description>Big Data Solution for GCP, AWS, Azure and on-prem</description>
	<lastBuildDate>Mon, 31 Jul 2023 19:39:05 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>GCP &#8211; Create Custom BigQuery Lineage using DataCatalog Python API</title>
		<link>https://bigdataproc.com/gcp-create-custom-bigquery-linage-using-datacatalog-python-api/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=gcp-create-custom-bigquery-linage-using-datacatalog-python-api</link>
					<comments>https://bigdataproc.com/gcp-create-custom-bigquery-linage-using-datacatalog-python-api/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Mon, 31 Jul 2023 15:33:46 +0000</pubDate>
				<category><![CDATA[bigquery]]></category>
		<category><![CDATA[GCP]]></category>
		<category><![CDATA[dataplex]]></category>
		<category><![CDATA[gcp]]></category>
		<category><![CDATA[linage]]></category>
		<guid isPermaLink="false">https://bigdataproc.com/?p=402</guid>

					<description><![CDATA[<p>This post shows how to create custom lineage for your BigQuery tables with the Dataplex custom lineage Python client when those tables are ingested or modified by an external system.</p>
<p>The post <a href="https://bigdataproc.com/gcp-create-custom-bigquery-linage-using-datacatalog-python-api/">GCP &#8211; Create Custom Bigquery Linage using DataCatalog Python API</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In our Google Cloud Platform (GCP) data warehousing workflow, we rely on BigQuery for storing and analyzing data. However, ingestion is handled by a separate service, so BigQuery does not show lineage for those tables automatically. To close this gap, I developed a Python utility that creates custom lineage for ingestion jobs using the Dataplex custom lineage Python client.</p>



<p>Custom lineage creation involves three key tasks, each serving an essential purpose:</p>



<ol class="wp-block-list">
<li><strong>Create a Lineage Process: </strong>This step defines a name for the lineage process. Because we run on GCP Cloud Composer, I often use the DAG name as the process name, which makes it easy to link ingested tables to the process that loads them.</li>



<li><strong>Create the Run:</strong> For every execution of the above process, we should create a new run. I assign the task ID as the run name, ensuring a unique identifier for each run.</li>



<li><strong>Create a Lineage Event:</strong> In the final task, I specify the source and target mapping along with associated details, effectively establishing the lineage relationship between the datasets.</li>
</ol>
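<p>The three tasks map onto nested resources: a run lives under a process, and a lineage event lives under a run. The sketch below only models that nesting in memory to show how the resource names compose. It does not call the API, and the project name and the IDs p1, r1, and e1 are made up for illustration; the real IDs are generated by the service.</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from dataclasses import dataclass


@dataclass(frozen=True)
class LineageNames:
    """Builds lineage resource names for one project and location."""
    project_id: str
    location: str

    def parent(self) -> str:
        return f"projects/{self.project_id}/locations/{self.location}"

    def process(self, process_id: str) -> str:
        return f"{self.parent()}/processes/{process_id}"

    def run(self, process_id: str, run_id: str) -> str:
        return f"{self.process(process_id)}/runs/{run_id}"

    def event(self, process_id: str, run_id: str, event_id: str) -> str:
        return f"{self.run(process_id, run_id)}/lineageEvents/{event_id}"


# Hypothetical project and IDs, used only to show the nesting.
names = LineageNames("my-project", "northamerica-northeast1")
print(names.event("p1", "r1", "e1"))</pre>

<p>Each create call returns the full resource name of what it created, and that name is passed as the parent of the next level.</p>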





<div class="wp-block-image">
<figure class="aligncenter size-large"><img fetchpriority="high" decoding="async" width="1024" height="313" src="https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Process-1024x313.png" alt="Image depicting the GCP BigQuery Custom Lineage Process." class="wp-image-412" srcset="https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Process-1024x313.png 1024w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Process-300x92.png 300w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Process-768x235.png 768w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Process-1536x470.png 1536w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Process-2048x627.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Exploring the data lineage process using GCP BigQuery and dataplex custom lineage python client.</figcaption></figure></div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="395" src="https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Runs-1024x395.png" alt="BigQuery Lineage Runs" class="wp-image-411" srcset="https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Runs-1024x395.png 1024w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Runs-300x116.png 300w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Runs-768x296.png 768w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Runs.png 1202w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">BigQuery custom lineage runs created with the Dataplex custom lineage Python client</figcaption></figure></div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="429" src="https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Run-Details-1024x429.png" alt="Bigquery Custom Run Details" class="wp-image-410" srcset="https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Run-Details-1024x429.png 1024w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Run-Details-300x126.png 300w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Run-Details-768x322.png 768w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Run-Details-1536x644.png 1536w, https://bigdataproc.com/wp-content/uploads/2023/07/GCP-Bigquery-Custom-Linage-Run-Details.png 1550w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure></div>


<p>The complete code is available on GitHub:</p>



<p><a href="https://gist.github.com/Gaurang033/01ab9d4cedfb1049dd23dd30cd88cdad" target="_blank" rel="noreferrer noopener">https://gist.github.com/Gaurang033/01ab9d4cedfb1049dd23dd30cd88cdad</a></p>



<h2 class="wp-block-heading">Install Dependencies </h2>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">google-cloud-datacatalog-lineage==0.2.3</pre>
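<p>The pinned requirement above can be installed with pip (pip install google-cloud-datacatalog-lineage==0.2.3). As a quick sanity check before running the rest of the code, the following sketch reports which version, if any, is installed. The helper name installed_version is my own and is not part of the client library.</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None when the package is absent.
print(installed_version("google-cloud-datacatalog-lineage"))</pre>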



<h2 class="wp-block-heading">Create Custom Lineage Process</h2>



<p>You can also attach custom attributes to a process. The example below sets owner, framework, and service.</p>



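<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Import path assumed for google-cloud-datacatalog-lineage 0.2.x.
from google.cloud import datacatalog_lineage_v1 as lineage_v1

# Shared client used by all the functions in this post.
client = lineage_v1.LineageClient()


def create_linage_process(project_id, process_display_name):
    # Processes are created under a project and location.
    parent = f"projects/{project_id}/locations/northamerica-northeast1"
    process = lineage_v1.Process()
    process.display_name = process_display_name
    # Optional custom attributes stored on the process.
    process.attributes = {
        "owner": "gaurangnshah@gmail.com",
        "framework": "file_ingestion_framework",
        "service": "databricks"
    }

    response = client.create_process(parent=parent, process=process)
    # The returned resource name is the parent for the runs of this process.
    return response.name</pre>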



<h2 class="wp-block-heading">Create Custom Lineage Run</h2>



<p>The following code creates a run for the lineage process created above.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import logging

logger = logging.getLogger(__name__)


def create_run(process_id, start_time, end_time, state, run_display_name):
    run = lineage_v1.Run()
    run.start_time = start_time
    run.end_time = end_time
    run.state = state
    run.display_name = run_display_name
    run.attributes = {
        "owner": "gaurang",
        "purpose": "Testing Lineage"
    }

    # process_id is the full resource name returned when the process was created.
    request = lineage_v1.CreateRunRequest(parent=process_id, run=run)
    response = client.create_run(request=request)
    logger.info("New run created: %s", response.name)
    return response.name</pre>
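<p>The state argument takes one of the run states defined by the API: UNKNOWN, STARTED, COMPLETED, FAILED, or ABORTED. If the run is recorded from a Composer DAG, the task state has to be translated first. The mapping below is only one possible choice and is an assumption on my part, not something the lineage API prescribes. The function name lineage_state is also hypothetical.</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Values are lineage run state names; keys are Airflow task states.
# The pairing itself is an illustrative assumption.
AIRFLOW_TO_LINEAGE_STATE = {
    "running": "STARTED",
    "success": "COMPLETED",
    "failed": "FAILED",
    "upstream_failed": "ABORTED",
    "skipped": "ABORTED",
}


def lineage_state(airflow_state):
    """Map an Airflow task state to a lineage run state, defaulting to UNKNOWN."""
    return AIRFLOW_TO_LINEAGE_STATE.get(airflow_state, "UNKNOWN")


print(lineage_state("success"))</pre>

<p>Falling back to UNKNOWN keeps unexpected task states from raising an error while still recording the run.</p>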



<h2 class="wp-block-heading">Create Custom Lineage Event</h2>



<p>Once the run exists, attach a lineage event to it. An event is simply a source-to-target mapping. Both the source and the target must be given as fully qualified names (FQNs) with the correct prefix for the system they live in. The following page lists the supported FQN formats:</p>



<p><a href="https://cloud.google.com//data-catalog/docs/fully-qualified-names" target="_blank" rel="noreferrer noopener">https://cloud.google.com//data-catalog/docs/fully-qualified-names</a></p>
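<p>To make the format concrete, here is a small sketch that builds the two FQN forms used in this post's example: path:gs://... for a file in Cloud Storage and bigquery:project.dataset.table for a BigQuery table. The helper names and the bucket and project values are hypothetical. Check the page above for the exact format of any other system.</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def bigquery_fqn(project_id, dataset, table):
    """FQN for a BigQuery table, in the bigquery:project.dataset.table form."""
    return f"bigquery:{project_id}.{dataset}.{table}"


def gcs_path_fqn(bucket, object_path):
    """FQN for a Cloud Storage object, in the path:gs://bucket/object form."""
    # Drop a leading slash so the bucket and object are joined by exactly one.
    return f"path:gs://{bucket}/{object_path.lstrip('/')}"


# Placeholder bucket and project names.
source = gcs_path_fqn("my-bucket", "test_schema/test_20230604.csv")
target = bigquery_fqn("my-project", "gaurang", "test_custom_linage")
print(source)
print(target)</pre>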






<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def create_lineage_event(run_id, source_fqdn, target_fqdn, start_time, end_time):
    source = lineage_v1.EntityReference()
    target = lineage_v1.EntityReference()
    source.fully_qualified_name = source_fqdn
    target.fully_qualified_name = target_fqdn
    # One link per source-to-target pair.
    links = [lineage_v1.EventLink(source=source, target=target)]
    lineage_event = lineage_v1.LineageEvent(links=links, start_time=start_time, end_time=end_time)

    # run_id is the full resource name returned when the run was created.
    request = lineage_v1.CreateLineageEventRequest(parent=run_id, lineage_event=lineage_event)
    response = client.create_lineage_event(request=request)
    print(f"Lineage event created: {response.name}")
    return response.name</pre>



<h2 class="wp-block-heading">Update Custom Lineage Process</h2>



<p>For us, the same process ingests each new file into the table. Rather than creating a new process every time, I reuse the existing process and add a new run and lineage event to it.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def create_or_update_custom_linage_for_ingestion(project_id, process_display_name, source, target, start_time,
                                                 end_time, state, run_display_name):
    # Reuse the process if one with this display name already exists; otherwise create it.
    process_id = _get_process_id(project_id, process_display_name)
    if process_id is None:
        process_id = create_linage_process(project_id, process_display_name=process_display_name)
    run_id = create_run(process_id=process_id, start_time=start_time, end_time=end_time, state=state,
                        run_display_name=run_display_name)
    create_lineage_event(run_id=run_id, start_time=start_time, end_time=end_time, source_fqdn=source,
                         target_fqdn=target)


def _get_process_id(project_id, process_display_name):
    # Look up an existing process by display name; returns None when there is no match.
    parent = f"projects/{project_id}/locations/northamerica-northeast1"
    processes = client.list_processes(parent=parent)
    for process in processes:
        if process.display_name == process_display_name:
            return process.name
    return None


def _convert_to_proto_timestamp(timestamp):
    # Millisecond precision with a "Z" suffix, so the datetime passed in must already be in UTC.
    return timestamp.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + "Z"</pre>
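<p>One thing to watch with the timestamp helper: the Z suffix means UTC, so a local datetime formatted this way would be off by the local offset. The sketch below is a stricter variant, under the assumption that callers pass timezone-aware datetimes. It converts to UTC before formatting and rejects naive values. The name to_utc_timestamp is hypothetical and not part of the original utility.</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from datetime import datetime, timedelta, timezone


def to_utc_timestamp(dt):
    """Format an aware datetime as a UTC timestamp with millisecond precision."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    utc = dt.astimezone(timezone.utc)
    # %f yields microseconds; trimming three digits leaves milliseconds.
    return utc.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


# 17:33 at UTC+2 is 15:33 in UTC.
local = datetime(2023, 7, 31, 17, 33, 46, tzinfo=timezone(timedelta(hours=2)))
print(to_utc_timestamp(local))  # 2023-07-31T15:33:46.000Z</pre>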



<h2 class="wp-block-heading">How To Run? </h2>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from datetime import datetime, timedelta, timezone

if __name__ == '__main__':
    project_id = "&lt;your_project_id>"
    process_display_name = "INGESTION_DAG_NAME"  # DAG name
    source = "path:gs://&lt;your_bucket_name>/test_schema/test_20230604.csv"
    target = "bigquery:&lt;project_id>.gaurang.test_custom_linage"

    # Use UTC, because _convert_to_proto_timestamp appends "Z" to the formatted value.
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=3)
    process_start_time = _convert_to_proto_timestamp(start_time)  # DAG start time
    process_end_time = _convert_to_proto_timestamp(end_time)  # DAG end time

    state = "COMPLETED"
    run_display_name = "TASK_RUN_ID"
    create_or_update_custom_linage_for_ingestion(project_id, process_display_name, source, target, process_start_time,
                                                 process_end_time, state, run_display_name)</pre>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/gcp-create-custom-bigquery-linage-using-datacatalog-python-api/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
	</channel>
</rss>
