HDFS Line Copy Application

Summary

This application demonstrates a data preparation pipeline that reads lines from files on a source HDFS, applies custom filtering and transformations to each line, and writes the results to a destination HDFS.

Required Properties

The end user must specify values for these properties.

| Property | Type | Example | Notes |
| -------- | ---- | ------- | ----- |
| Input Directory Or File Path | String | `/user/appuser/input/directory1`, `/user/appuser/input/file2.log`, `hdfs://node1.corp1.com/user/user1/input` | HDFS path for the input file or directory |
| Output Directory Path | String | `/user/appuser/output` | HDFS path for the output directory. This is generally a path on the Hadoop cluster on which the app is running. |
| Output File Name | String | `output.txt` | Name of the output file. A part suffix is appended to this name for each output part. |
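Apex applications typically read such values from a `properties.xml` configuration file. The fragment below is a minimal sketch of that pattern; the operator names (`recordReader`, `fileOutput`) and property names are illustrative assumptions, not taken from this application's source.

```xml
<!-- Sketch only: operator and property names below are hypothetical. -->
<configuration>
  <!-- Input Directory Or File Path -->
  <property>
    <name>dt.operator.recordReader.prop.files</name>
    <value>/user/appuser/input/directory1</value>
  </property>
  <!-- Output Directory Path -->
  <property>
    <name>dt.operator.fileOutput.prop.filePath</name>
    <value>/user/appuser/output</value>
  </property>
  <!-- Output File Name -->
  <property>
    <name>dt.operator.fileOutput.prop.outputFileName</name>
    <value>output.txt</value>
  </property>
</configuration>
```

Check the application's own source or packaged configuration for the actual operator names to use in the property keys.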

Advanced Properties (optional)

| Property | Default | Type | Notes |
| -------- | ------- | ---- | ----- |
| Block Size For Hdfs Splitter | 1048576 (1 MB) | long | Number of bytes the record reader operator considers at a time when splitting records. Larger block sizes may add latency; the suggested value is 1 to 10 MB. |
| Csv Formatter Schema | `{ "separator": "\|", "quoteChar": "\"", "lineDelimiter": "", "fields": [ { "name": "accountNumber", "type": "Integer" }, { "name": "name", "type": "String" }, { "name": "amount", "type": "Integer" } ] }` | String | JSON string defining the schema for the CSV formatter |
| Csv Parser Schema | `{ "separator": "\|", "quoteChar": "\"", "fields": [ { "name": "accountNumber", "type": "Integer" }, { "name": "name", "type": "String" }, { "name": "amount", "type": "Integer" } ] }` | String | JSON string defining the schema for the CSV parser |
| Number Of Blocks Per Window | 1 | int | The file splitter emits this many blocks per window to downstream operators. |
| Number Of Readers For Partitioning | 2 | int | The block reader operator is partitioned into this many partitions. |
| Tuple Class Name For Csv Parser Output | com.datatorrent.apps.PojoEvent | String | FQCN of the tuple object emitted by the CSV parser |
| Tuple Class Name For Formatter Input | com.datatorrent.apps.PojoEvent | String | FQCN of the tuple object consumed by the CSV formatter |
| Tuple Class Name For Transform Input | com.datatorrent.apps.PojoEvent | String | FQCN of the tuple object consumed by the Transform operator |
| Tuple Class Name For Transform Output | com.datatorrent.apps.PojoEvent | String | FQCN of the tuple object emitted by the Transform operator |
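For readability, the default Csv Parser Schema shown above expands to the JSON below; the Csv Formatter Schema is identical except that it also carries a `lineDelimiter` field.

```json
{
  "separator": "|",
  "quoteChar": "\"",
  "fields": [
    { "name": "accountNumber", "type": "Integer" },
    { "name": "name", "type": "String" },
    { "name": "amount", "type": "Integer" }
  ]
}
```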

Notes

  • The application is pre-configured for the pre-defined schema shown above. To use it with custom schema objects, the configuration must be changed accordingly.

  • PojoEvent is the POJO (Plain Old Java Object) used to represent each record. You can define a custom class to represent a custom schema and include it on the classpath; the configuration package can be used for this purpose.
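As a minimal sketch, a schema POJO matching the default schema above might look like the following. The field names and types must match the `fields` entries in the parser and formatter schemas, and JavaBean-style getters and setters are required so the operators can access the fields. (The actual class is `com.datatorrent.apps.PojoEvent`; the package declaration and `public` modifier are omitted here for brevity.)

```java
// Sketch of a schema POJO for the default schema (accountNumber, name, amount).
// Getter/setter names follow JavaBean conventions so CSV parser, transform,
// and formatter operators can read and write the fields.
class PojoEvent {
    private int accountNumber;
    private String name;
    private int amount;

    public int getAccountNumber() { return accountNumber; }
    public void setAccountNumber(int accountNumber) { this.accountNumber = accountNumber; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAmount() { return amount; }
    public void setAmount(int amount) { this.amount = amount; }
}
```

A custom schema class follows the same shape: one private field plus getter and setter per schema field, compiled and placed on the application's classpath.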