<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Uncategorized Archives - Big Data Processing</title>
	<atom:link href="https://bigdataproc.com/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>https://bigdataproc.com/category/uncategorized/</link>
	<description>Big Data Solution for GCP, AWS, Azure and on-prem</description>
	<lastBuildDate>Sun, 15 Jan 2023 04:54:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.3.2</generator>
	<item>
		<title>Airflow SMS Notification for Failed Jobs</title>
		<link>https://bigdataproc.com/airflow-sms-notification-for-failed-jobs/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=airflow-sms-notification-for-failed-jobs</link>
					<comments>https://bigdataproc.com/airflow-sms-notification-for-failed-jobs/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Mon, 11 Jan 2021 18:17:40 +0000</pubDate>
				<category><![CDATA[Airflow]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[airflow]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=232</guid>

					<description><![CDATA[<p>Airflow only provides email notification for job failure. However, this is not enough for production jobs as not everyone has access to email on their&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/airflow-sms-notification-for-failed-jobs/">Continue reading<span class="screen-reader-text">Airflow SMS Notification for Failed Jobs</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/airflow-sms-notification-for-failed-jobs/">Airflow SMS Notification for Failed Jobs</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Airflow only provides email notifications for job failures out of the box. However, that is not enough for production jobs, as not everyone has access to email on their phone. At least, that&#8217;s true for our team. So I had to figure out a way to send an SMS notification on job failure so that swift action can be taken.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" fetchpriority="high" width="731" height="411" src="https://bigdataproc.com/wp-content/uploads/2021/01/airflow_sms_notification-1.png" alt="Airflow SMS Notification" class="wp-image-234" srcset="https://bigdataproc.com/wp-content/uploads/2021/01/airflow_sms_notification-1.png 731w, https://bigdataproc.com/wp-content/uploads/2021/01/airflow_sms_notification-1-300x169.png 300w" sizes="(max-width: 731px) 100vw, 731px" /><figcaption>Airflow SMS Notification</figcaption></figure></div>



<h2 class="wp-block-heading">Send Logs to AWS S3 Bucket</h2>



<p>Modify the following settings in the <strong>airflow.cfg</strong> file to send job logs to an S3 bucket. <br>In my case the EC2 instance has access to the S3 bucket through an AWS IAM role, so I don&#8217;t need to provide a connection ID. In your case you might need to create a connection ID and provide it here.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">remote_log_conn_id =
remote_base_log_folder = s3://&lt;your bucket name>/airflow-logs/
encrypt_s3_logs = False</pre>



<h2 class="wp-block-heading">Create an AWS Lambda Function to Parse Logs</h2>



<p>The AWS Lambda function will parse the Airflow job logs to check whether any task has failed and, if so, publish a message to an SNS topic. Parsing the log file is not complicated: we just need to look for the line in the log file that marks a task as FAILED.</p>
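<p>For example, a failed task produces a log line like the one below. The sample line here is made up to match the pattern the Lambda uses; the exact format can vary between Airflow versions. A quick regex check detects it:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

# Regex for the "Marking task as FAILED" line that Airflow writes to the task log
error_regex = re.compile(r'(\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\]) ({.*\.py:\d+}) (INFO) - Marking task as FAILED.\s*dag_id=(.*), task_id=(.*), execution_date=(.*), start_date=(.*), end_date=(.*)')

# Sample (made-up) log line in the format Airflow produces on task failure
line = "[2021-01-11 18:17:40,123] {taskinstance.py:1150} INFO - Marking task as FAILED. dag_id=my_dag, task_id=my_task, execution_date=20210111T180000, start_date=20210111T181700, end_date=20210111T181740"

match = error_regex.search(line)
if match:
    print("task failed: dag_id={}, task_id={}".format(match.group(4), match.group(5)))</pre>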



<p>Make sure the S3 bucket where you are storing the Airflow job logs is configured to send put events to the AWS Lambda function. </p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import json
import boto3
import re
import os 
from urllib.parse import unquote_plus

# Publish a simple message to the specified SNS topic
def send_sns_notification(message):
    topicArn = os.environ.get("TopicArn")
    sns = boto3.client('sns')
    response = sns.publish(
        TopicArn=topicArn,    
        Message=message, 
        MessageAttributes={
            'AWS.SNS.SMS.SenderID': {
              'DataType': 'String',
              'StringValue': 'Airflow'   
            },
            'AWS.SNS.SMS.SMSType': {
              'DataType': 'String',
              'StringValue': 'Transactional'   
            }
        }   
    )
    
    print(response)

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    # Matches the "Marking task as FAILED" line that Airflow writes to the task log
    error_regex = re.compile(r'(\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\]) ({.*\.py:\d+}) (INFO) - Marking task as FAILED.\s*dag_id=(.*), task_id=(.*), execution_date=(.*), start_date=(.*), end_date=(.*)')
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])
        filename="{}/{}".format(bucket, key)
        print("processing {} file".format(filename))
        # Fetch the log file from S3 and scan it for failed-task markers
        data = s3.get_object(Bucket=bucket, Key=key)
        logfile_content = data['Body'].read().decode('utf-8')
        errors = error_regex.findall(logfile_content)
        if errors:
            # Log keys look like .../{dag_id}/{task_id}/{execution_date}/{try_number}.log
            dag_name = key.split("/")[-4]
            task_name = key.split("/")[-3]
            message="job {} with task {} failed".format(dag_name, task_name)
            print(message)
            send_sns_notification(message)
            print("notification sent")
        else:
            print("file {} does not have any error".format(filename))</pre>
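<p>Note that the Lambda derives the DAG and task names purely from the position of the path segments in the log key. A minimal sketch of that step, assuming the default remote log layout of <code>{dag_id}/{task_id}/{execution_date}/{try_number}.log</code> under the log prefix (the key below is made up):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from urllib.parse import unquote_plus

# Hypothetical S3 key, assuming Airflow's default remote log layout:
# airflow-logs/{dag_id}/{task_id}/{execution_date}/{try_number}.log
key = unquote_plus("airflow-logs/my_dag/my_task/2021-01-11T18%3A00%3A00%2B00%3A00/1.log")

parts = key.split("/")
dag_name = parts[-4]   # "my_dag"
task_name = parts[-3]  # "my_task"
print("job {} with task {} failed".format(dag_name, task_name))</pre>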
<p>The post <a rel="nofollow" href="https://bigdataproc.com/airflow-sms-notification-for-failed-jobs/">Airflow SMS Notification for Failed Jobs</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/airflow-sms-notification-for-failed-jobs/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Computing Total Storage Size of a Folder in Azure Data Lake Storage</title>
		<link>https://bigdataproc.com/computing-total-storage-size-of-a-folder-in-azure-data-lake-storage/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=computing-total-storage-size-of-a-folder-in-azure-data-lake-storage</link>
					<comments>https://bigdataproc.com/computing-total-storage-size-of-a-folder-in-azure-data-lake-storage/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Sat, 08 Feb 2020 15:33:48 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=123</guid>

					<description><![CDATA[<p>A few days back we needed to calculate how much data each project had ingested into our data lake. And that&#8217;s when I&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/computing-total-storage-size-of-a-folder-in-azure-data-lake-storage/">Continue reading<span class="screen-reader-text">Computing Total Storage Size of a Folder in Azure Data Lake Storage</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/computing-total-storage-size-of-a-folder-in-azure-data-lake-storage/">Computing Total Storage Size of a Folder in Azure Data Lake Storage</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>A few days back we needed to calculate how much data each project had ingested into our data lake. That&#8217;s when I realized there is no direct way to get the size of a directory in Azure Data Lake Storage. Storage Explorer lets you view the statistics of a folder, which include its size; however, imagine doing that for 100 folders. So I decided to write a script. </p>



<p>The following PowerShell script will give you the size of all the folders under the given path. </p>



<pre class="wp-block-preformatted">$path="/infomart"
$account="azueus2dev"
# List the immediate child folders of the root path
$ChildPaths=(Get-AzureRmDataLakeStoreChildItem -Account $account -Path $path).Name
foreach($ChildPath in $ChildPaths){
	# Recursively compute the total size of each child folder
	$length=(Get-AzureRmDataLakeStoreChildItemSummary -Account $account -Path "$path/$ChildPath" -Concurrency 128).Length
	"$path/$ChildPath, $length" | Out-File -FilePath "folder_sizes.txt" -Append
}</pre>
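<p>If you prefer Python, the per-folder roll-up the script performs can be sketched as below. The file listing here is hard-coded for illustration; in practice the paths and sizes would come from a recursive Data Lake listing.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Hypothetical flat listing of (file path, size in bytes), standing in
# for what a recursive Data Lake listing would return
files = [
    ("/infomart/projectA/part-0000.csv", 1024),
    ("/infomart/projectA/part-0001.csv", 2048),
    ("/infomart/projectB/data.parquet", 4096),
]

# Roll file sizes up to the first folder level under the root path
totals = {}
for path, size in files:
    top = "/".join(path.split("/")[:3])  # e.g. "/infomart/projectA"
    totals[top] = totals.get(top, 0) + size

for folder, length in sorted(totals.items()):
    print("{}, {}".format(folder, length))</pre>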



<p>You will need to install the <code>AzureRM.DataLakeStore</code> module to run the above script. </p>



<pre class="wp-block-code"><code>Install-Module -Name AzureRM.DataLakeStore</code></pre>



<p>The post <a rel="nofollow" href="https://bigdataproc.com/computing-total-storage-size-of-a-folder-in-azure-data-lake-storage/">Computing Total Storage Size of a Folder in Azure Data Lake Storage</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/computing-total-storage-size-of-a-folder-in-azure-data-lake-storage/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
