
Distcp to Copy your HDFS data to GCP Cloud Storage

A while back, I found myself deeply immersed in a Hadoop migration project where our cloud platform of choice was Google Cloud Platform (GCP). Our mission? To move data from on-premises infrastructure to the cloud. Due to various constraints, a hardware-based transfer wasn't a viable option, so I set out to explore software solutions to tackle this challenge.

For one-off migrations, Spark emerged as a favorable choice. It facilitated direct data migration to BigQuery, bypassing the intermediary step of storing it in cloud storage. However, there was a caveat: Spark lacked the ability to detect changes, necessitating a full refresh each time. This approach proved less than ideal, especially when dealing with substantial datasets.

My gaze then turned to Cloudera BDR, but alas, it didn’t support integration with Google Cloud. Left with no alternative, I delved into Distcp. In this blog post, I’ll guide you through the setup process for Distcp, enabling seamless data transfer from an on-prem HDFS cluster to Google Cloud Storage.

Service Account Setup

To begin, create a GCP service account with read/write permissions on the designated Google Cloud Storage bucket and download its JSON key. This key needs to be distributed to all nodes involved in the migration. For instance, I've opted to store it at /tmp/sa-datamigonpremtobigquery.json. Also make sure that the user who will run the distcp command has read access to this path on every node.
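
If you prefer to script this step, here is a rough sketch using the gcloud and gsutil CLIs. The project ID (my-project), the service account name, and the worker host names are placeholders; substitute your own, and adjust the IAM role if your organization grants bucket access differently.

# Create the service account (name and project are placeholders).
gcloud iam service-accounts create sa-datamigonpremtobigquery --project=my-project

# Download its JSON key to the path used throughout this post.
gcloud iam service-accounts keys create /tmp/sa-datamigonpremtobigquery.json \
  --iam-account=sa-datamigonpremtobigquery@my-project.iam.gserviceaccount.com

# Grant read/write on the target bucket (objectAdmin covers both).
gsutil iam ch \
  serviceAccount:sa-datamigonpremtobigquery@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://raw-bucket

# Push the key to every node that will run distcp map tasks and make sure
# the user launching distcp can read it.
for host in worker1 worker2 worker3; do
  scp /tmp/sa-datamigonpremtobigquery.json "${host}:/tmp/"
  ssh "${host}" "chmod 644 /tmp/sa-datamigonpremtobigquery.json"
done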

hdfs.conf

Store the following file as hdfs.conf in your home directory on the edge node, and replace the value of fs.gs.project.id with your own project ID.

<configuration>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>The AbstractFileSystem for 'gs:' URIs.</description>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>your-project-id</value>
    <description>
      Optional. Google Cloud Project ID with access to GCS buckets.
      Required only for list buckets and create bucket operations.
    </description>
  </property>
  <property>
    <name>google.cloud.auth.type</name>
    <value>SERVICE_ACCOUNT_JSON_KEYFILE</value>
    <description>
      Authentication type to use for GCS access.
    </description>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/tmp/sa-datamigonpremtobigquery.json</value>
    <description>
      The JSON keyfile of the service account used for GCS
      access when google.cloud.auth.type is SERVICE_ACCOUNT_JSON_KEYFILE.
    </description>
  </property>

  <property>
    <name>fs.gs.checksum.type</name>
    <value>CRC32C</value>
    <description>
      https://cloud.google.com/architecture/hadoop/validating-data-transfers
    </description>
  </property>

  <property>
    <name>dfs.checksum.combine.mode</name>
    <value>COMPOSITE_CRC</value>
    <description>
      https://cloud.google.com/architecture/hadoop/validating-data-transfers
    </description>
  </property>
</configuration>
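
Before launching a full copy, it's worth confirming that the edge node can actually reach the bucket with this configuration. A minimal smoke test, assuming the GCS connector jar is already on the Hadoop classpath and the bucket is named raw-bucket as in the examples here:

# -conf is Hadoop's generic option for layering an extra config file on top
# of the cluster defaults; it picks up the GCS settings from hdfs.conf.
hadoop fs -conf hdfs.conf -ls gs://raw-bucket/

# Optional: round-trip a small file to verify write access.
echo "connectivity test" > /tmp/gcs_test.txt
hadoop fs -conf hdfs.conf -put /tmp/gcs_test.txt gs://raw-bucket/gcs_test.txt
hadoop fs -conf hdfs.conf -cat gs://raw-bucket/gcs_test.txt
hadoop fs -conf hdfs.conf -rm gs://raw-bucket/gcs_test.txt

If the gs: scheme is not recognized, you may also need to set fs.gs.impl to com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem in the same file (or in your cluster's core-site.xml), since the configuration above only registers the AbstractFileSystem implementation.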

Executing the Transfer

hadoop --debug distcp --conf hdfs.conf -pc -update -v -log hdfs:///tmp/distcp_log hdfs:///tmp/ gs://raw-bucket/ 
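
A few notes on the flags: -update copies only files that are missing at the destination or differ from the source, -pc preserves the checksum type, -v writes per-file details into the -log directory, and --debug enables verbose shell output. With the CRC32C and COMPOSITE_CRC settings from hdfs.conf, you can also spot-check individual files after the copy; the file path below is just an illustrative placeholder.

# Composite CRC of the source file on HDFS...
hadoop fs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum hdfs:///tmp/some_file

# ...should match the checksum reported for the copied object in Cloud Storage.
hadoop fs -conf hdfs.conf -checksum gs://raw-bucket/some_file

Matching values indicate the file arrived intact; this validation approach is described in the Google Cloud article linked from the configuration above.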