HDFS S3 Sync App
Ingest and back up Hadoop HDFS data to Amazon S3 in a fault-tolerant way, for use cases such as disaster recovery. This application copies files from the configured source path in HDFS to the configured Amazon S3 storage. The source code is available at: https://github.com/DataTorrent/app-templates/tree/master/hdfs-to-s3-sync.
Please send feedback or feature requests to: email@example.com
This document has a step-by-step guide to configure, customize, and launch this application.
Click on the AppHub tab from the top navigation bar.
A page listing the applications available on AppHub is displayed. Search for HDFS to see all applications related to HDFS.
Click on the import button for HDFS S3 Sync App.
A notification is displayed in the top right corner after the application package is successfully imported.
Click on the link in the notification, which navigates to the page for this application package. Detailed information about the application package, such as version, last modified time, and short description, is available on this page. Click on the launch button for the HDFS S3 Sync App application.
Select the Use saved configuration option. This displays a list of pre-saved configurations. Please select sandbox-memory-conf.xml or cluster-memory-conf.xml depending on whether your environment is the DataTorrent sandbox or another cluster.
Select the Specify custom properties option. Click on the add default properties button.
This expands a key-value editor pre-populated with the mandatory properties for this application. Change values as needed. For example, suppose we wish to copy all files in source-node to Amazon S3 storage at OUTPUT-DIRECTORY in the S3 object store. Properties should be set as follows:
| name | value |
| --- | --- |
| dt.operator.HDFSInputModule.prop.files | /user/appuser/input |
| dt.operator.S3OutputModule.prop.outputDirectoryPath | archive |
| dt.operator.S3OutputModule.prop.accessKey | ACCESS_KEY_ID |
| dt.operator.S3OutputModule.prop.secretAccessKey | SECRET_KEY |
| dt.operator.S3OutputModule.prop.bucketName | BUCKET_NAME |
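For reference, the same key-value pairs could also be kept in a properties file rather than typed into the editor. The fragment below is a minimal sketch, assuming the file follows the Hadoop-style `<configuration>`/`<property>` layout; ACCESS_KEY_ID, SECRET_KEY, and BUCKET_NAME are the placeholders from the example above and must be replaced with real values.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Source path in HDFS to copy from -->
  <property>
    <name>dt.operator.HDFSInputModule.prop.files</name>
    <value>/user/appuser/input</value>
  </property>
  <!-- Output directory path in S3 -->
  <property>
    <name>dt.operator.S3OutputModule.prop.outputDirectoryPath</name>
    <value>archive</value>
  </property>
  <!-- S3 credentials (placeholders, not real values) -->
  <property>
    <name>dt.operator.S3OutputModule.prop.accessKey</name>
    <value>ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>dt.operator.S3OutputModule.prop.secretAccessKey</name>
    <value>SECRET_KEY</value>
  </property>
  <!-- Target S3 bucket (placeholder) -->
  <property>
    <name>dt.operator.S3OutputModule.prop.bucketName</name>
    <value>BUCKET_NAME</value>
  </property>
</configuration>
```

Because this file would contain the secret access key, it is best kept out of version control.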
This application is tuned for better performance when reading data from a remote cluster into the host cluster. Details about configuration options are available in the Configuration options section.
Click on the Launch button at the bottom right corner to launch the application. A notification is displayed in the top right corner after the application is launched successfully; it includes the Application ID, which can be used to monitor this instance and find its logs.
Click on the Monitor tab from the top navigation bar.
A page listing all running applications is displayed. Search for the current application by name, application ID, or any other relevant field. Click on the application name or ID to navigate to the application instance details page.
The application instance details page shows key metrics for monitoring the application status. The logical tab shows the application DAG, Stram events, operator status based on logical operators, stream status, and a chart with key metrics.
Click on the physical tab to look at the status of the physical instances of the operators, containers, etc.
Configuration options
The end user must specify values for these properties (these properties are all strings and are HDFS paths: the first is the destination and the second the source).
There are pre-saved configurations based on the application environment. Recommended settings for the DataTorrent sandbox edition are in sandbox-memory-conf.xml and for a cluster environment in cluster-memory-conf.xml.
| Property | Description | Type | Value for cluster-memory-conf.xml | Value for sandbox-memory-conf.xml |
| --- | --- | --- | --- | --- |
| dt.operator.HDFSInputModule.prop.minReaders | Minimum number of BlockReader partitions for parallel reading. | int | 4 | 1 |
| dt.operator.HDFSInputModule.prop.maxReaders | Maximum number of BlockReader partitions for parallel reading. | int | 16 | 1 |
| dt.operator.HDFSInputModule.prop.blocksThreshold | Rate at which block metadata is emitted per second. | int | 16 | 1 |
| dt.operator.S3OutputModule.prop.mergerCount | Number of instances of the S3FileMerger operator. | int | 1 | 1 |
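As an illustration, the tuning properties listed above might be overridden in a configuration file as shown below. This is only a sketch: it assumes the Hadoop-style `<configuration>`/`<property>` layout, and it uses the first of the two values listed for each property (4, 16, 16, 1), which appear to correspond to the cluster environment.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Lower and upper bounds on BlockReader partitions for parallel reading -->
  <property>
    <name>dt.operator.HDFSInputModule.prop.minReaders</name>
    <value>4</value>
  </property>
  <property>
    <name>dt.operator.HDFSInputModule.prop.maxReaders</name>
    <value>16</value>
  </property>
  <!-- Rate at which block metadata is emitted per second -->
  <property>
    <name>dt.operator.HDFSInputModule.prop.blocksThreshold</name>
    <value>16</value>
  </property>
  <!-- Number of S3FileMerger operator instances -->
  <property>
    <name>dt.operator.S3OutputModule.prop.mergerCount</name>
    <value>1</value>
  </property>
</configuration>
```

On a sandbox, the second value listed for each property (1 in every row) would be the natural starting point instead.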
Steps to customize the application
Make sure you have the following utilities installed on your machine and available on PATH in environment variables
Use the following command to clone the examples repository:
git clone git@github.com:DataTorrent/app-templates.git
Change directory to
Import this Maven project into your favorite IDE (e.g. Eclipse).
Change the source code as per your requirements. This application only copies files from source to destination, so Application.java does not involve any processing operator in between.
Make the corresponding changes in the test case and properties.xml based on your environment.
Compile this project using Maven:
mvn clean package
This will generate the application package with the .apa extension inside the
Go to the DataTorrent UI Management console in a web browser. Click on the Develop tab from the top navigation bar.
Click on the upload package button and upload the generated .apa file.
The application package page is shown with a listing of all packages. Click on the Launch button for the uploaded application package. Follow the steps for launching an application.