sequencefile-utility

Introduction

This is a utility for creating a Hadoop SequenceFile. The utility can be used either as a local command line application that collects files recursively by traversing a root directory. Or it can be used as a Hadoop job where data is collected in the Map task (requires that the file system is mounted on all TaskTracker nodes).

Download

The executable jar with all dependencies is available on Bintray:

sequencefile-utility version 1.0

Usage

{hadoop jar|java -jar} target/sequencefile-utility-1.0-jar-with-dependencies.jar 
[-c <arg>] [-d <arg>] [-e <arg>] [-h] [-m] [-n <arg>] [-p <arg>] [-t]

-c,--compr <arg>
- Compression, one of NONE,RECORD,BLOCK, default: BLOCK (Optional).
-d,--dir <arg>
- Local directory (or directories - commaseparated) containing files to be added. Note that in hadoop map mode (parameter -m) each tasktracker must be able to access the files at the same path. (Required)
-e,--ext <arg>
- Extension filter(s) - commaseparated (Optional).
-h,--help
- Print this message.
-m,--mapmode
- Hadoop map mode: A text file containing all absolute paths of the input directory (parameter -d) is created and the sequence file is created using a map/reduce job. The sequence file is direcly stored in HDFS. If this parameter is omitted, the sequence file creation runs as a batch process, adding all files of a directory to a sequence file (one per directory) in a separate thread. (Optional)
-n,--name <arg>
- Hadoop map mode: Hadoop job name (Optional)
-p,--path <arg>
- Hadoop map mode: HDFS input path where the text files containing input paths is available. If this parameter is provided, the -d parameter is not required (Optional)
-t,--textline
- Text line mode, input files aretext files and each line of the text files should be added as a record in the sequence file (Optional).

Examples

Read all files from a local directory into one sequence file:

java -jar target/sequencefile-utility-1.0-jar-with-dependencies.jar -d /path/to/dir/ -c NONE

Use a Hadoop job to read all files from a file system that is mounted on all Task Tracker nodes:

hadoop jar target/sequencefile-utility-1.0-jar-with-dependencies.jar -m -n hadoopjob -p /path/to/hdfs/dir -c NONE

The -p parameter indicates a HDFS input path where text file(s) are located that contain absolute paths to the files that should be packaged in the SequenceFile.

Installation requirements

In order to build the application, the Java SE Development Kit (JDK) version 1.6.0 (or higher) and Apache Maven 2.2.1 (or higher) are required.

Installation

Compile project using maven:

mvn install

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
src/main		src/main
.gitignore		.gitignore
.travis.yml		.travis.yml
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sequencefile-utility

Introduction

Download

Usage

Examples

Installation requirements

Installation

About

Releases

Packages

Languages

shsdev/sequencefile-utility

Folders and files

Latest commit

History

Repository files navigation

sequencefile-utility

Introduction

Download

Usage

Examples

Installation requirements

Installation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages