An on-premises job scheduler like Grid Engine typically uses a shared file system. Your Grid Engine scripts can reference files by their paths on the shared file system.
With dsub, your input files reside in a Google Cloud Storage bucket and your output files will also be copied out to Cloud Storage.
When you submit a job with dsub:
- your input files will be automatically copied from bucket paths to local disk
- your code will work on the local file system inside the Docker container
- your output files will be automatically copied from local disk back to bucket paths.
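For example, a minimal job submission might look like the following (a sketch only: the project, bucket paths, and script name are placeholders, and flags such as --project and --zones depend on your provider and setup):
dsub \
  --project my-cloud-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --input INPUT_FILE=gs://my-bucket/path/file.bam \
  --output OUTPUT_FILE=gs://my-bucket/path/file.md5 \
  --script my-script.sh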
Rather than giving many options for which disks to allocate and where to put input and output files, dsub is prescriptive:
- All input and output is written to a single data disk mounted at /mnt/data.
- All input and output paths mirror the remote storage location, with a local path of the form /mnt/data/gs/bucket/path.
Environment variables indicating the Docker container input and output paths are made available to your script.
There are several common use cases for both input and output, each described below.
To copy a single file from Cloud Storage, specify the full URL to the file on the dsub command-line:
--input INPUT_FILE=gs://bucket/path/file.bam
The object at the Cloud Storage path will be copied and made available at the path /mnt/data/input/gs/bucket/path/file.bam.
The Docker container will receive the environment variable:
INPUT_FILE=/mnt/data/input/gs/bucket/path/file.bam
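Your script can then reference the localized file directly through the environment variable. A minimal sketch of a script (the md5sum command is just a placeholder for your real processing):
#!/bin/bash
set -o errexit
# INPUT_FILE is set by dsub to the local copy under /mnt/data/input/...
echo "Processing: ${INPUT_FILE}"
md5sum "${INPUT_FILE}"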
To copy a set of files from Cloud Storage, specify the full URL pattern on the dsub command-line:
--input INPUT_FILES=gs://bucket/path/*.bam
The object(s) at the Cloud Storage path will be copied and made available at the path /mnt/data/input/gs/bucket/path.
The Docker container will receive the environment variable:
INPUT_FILES=/mnt/data/input/gs/bucket/path/*.bam
You will likely want your script code to tokenize the environment variable
into its constituent path and pattern components. To tokenize the INPUT_FILES
variable, the following code:
INPUT_FILES_PATH="$(dirname "${INPUT_FILES}")"
INPUT_FILES_PATTERN="$(basename "${INPUT_FILES}")"
will set:
INPUT_FILES_PATH=/mnt/data/input/gs/bucket/path
INPUT_FILES_PATTERN=*.bam
To process a list of files from a path + wildcard pattern in Bash, a typical coding pattern is to create an array and iterate over the array.
If you know you don't have spaces in your paths, this can simply be:
readonly INPUT_FILE_LIST=( $(ls "${INPUT_FILES_PATH}"/${INPUT_FILES_PATTERN}) )
If you might have spaces in your file paths, then you need to take a bit more care. The following will create a list of files and force Bash to tokenize the list by newlines (instead of by whitespace):
declare INPUT_FILE_LIST="$(ls -1 "${INPUT_FILES_PATH}"/${INPUT_FILES_PATTERN})"
IFS=$'\n' INPUT_FILE_LIST=(${INPUT_FILE_LIST})
readonly INPUT_FILE_LIST
Note: in both cases above, do not quote ${INPUT_FILES_PATTERN}, as that will suppress wildcard expansion.
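For example, with INPUT_FILES_PATTERN set to *.bam:
ls "${INPUT_FILES_PATH}/${INPUT_FILES_PATTERN}"   # WRONG: looks for a file literally named "*.bam"
ls "${INPUT_FILES_PATH}"/${INPUT_FILES_PATTERN}   # RIGHT: the wildcard expands to the matching .bam files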
The following code shows how to iterate over the list of files in the array:
for INPUT_FILE in "${INPUT_FILE_LIST[@]}"; do
  # INPUT_FILE will be the full path including the filename.
  # If you need the filename alone, use basename:
  INPUT_FILE_NAME="$(basename "${INPUT_FILE}")"
  # If you further want to trim off the file extension, perhaps to construct
  # a new output file name, then use Bash suffix substitution:
  INPUT_FILE_ROOTNAME="${INPUT_FILE_NAME%.*}"
  # Do stuff with the INPUT_FILE* variables you now have
  ...
done
To recursively copy a directory from Cloud Storage, use the dsub command-line flag --input-recursive:
--input-recursive INPUT_PATH=gs://bucket/path
The object(s) at the Cloud Storage path will be recursively copied and made available at the path /mnt/data/input/gs/bucket/path.
The Docker container will receive the environment variable:
INPUT_PATH=/mnt/data/input/gs/bucket/path
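Inside the container, the recursively copied directory can be walked like any local directory tree. A minimal sketch (the per-file action is just a placeholder):
# INPUT_PATH is set by dsub to the localized directory.
find "${INPUT_PATH}" -type f | while read -r FILE; do
  echo "Found input: ${FILE}"
done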
To copy a single file to Cloud Storage, specify the full URL to the file on the dsub command-line:
--output OUTPUT_FILE=gs://bucket/path/file.bam
Then have your script write the output file to ${OUTPUT_FILE} within the Docker container.
The file will be automatically copied to Cloud Storage when your script or
command exits with success.
The Docker container will receive the environment variable:
OUTPUT_FILE=/mnt/data/output/gs/bucket/path/file.bam
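For example, a script might produce its result like this (a minimal sketch; the echo stands in for your real tool, and the mkdir -p is harmless if the directory already exists):
#!/bin/bash
set -o errexit
# OUTPUT_FILE is a local path under /mnt/data/output/...; anything written
# there is copied to gs://bucket/path/file.bam on successful exit.
mkdir -p "$(dirname "${OUTPUT_FILE}")"
echo "example output" > "${OUTPUT_FILE}"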
To copy a set of files to Cloud Storage, specify the full URL pattern on the dsub command-line:
--output OUTPUT_FILES=gs://bucket/path/*.bam
Then have your job write output files matching ${OUTPUT_FILES} within the Docker container.
All files matching the pattern /mnt/data/output/gs/bucket/path/*.bam will be automatically copied to Cloud Storage when your script or command exits with success.
The Docker container will receive the environment variable:
OUTPUT_FILES=/mnt/data/output/gs/bucket/path/*.bam
Typically a job script will have the output file extension hard-coded, but if needed it can be parsed from the environment variable. More commonly, the job script will need the output directory.
To get the output directory and file extension in Bash:
# This will set OUTPUT_DIR to "/mnt/data/output/gs/bucket/path"
OUTPUT_DIR="$(dirname "${OUTPUT_FILES}")"
# This will set OUTPUT_FILE_PATTERN to "*.bam"
OUTPUT_FILE_PATTERN="$(basename "${OUTPUT_FILES}")"
# This will set OUTPUT_EXTENSION to "bam" using the Bash prefix removal
# operator "##", matching the longest pattern up to and including the period.
OUTPUT_EXTENSION="${OUTPUT_FILE_PATTERN##*.}"
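Your script can then write its results into ${OUTPUT_DIR} with the parsed extension. A minimal sketch (the sample names and touch command are placeholders for real processing):
for SAMPLE in sample1 sample2; do
  # Each file written here matches the --output pattern and will be
  # copied back to gs://bucket/path/ on successful exit.
  touch "${OUTPUT_DIR}/${SAMPLE}.${OUTPUT_EXTENSION}"
done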
To recursively copy a directory of output to Cloud Storage, use the dsub command-line flag --output-recursive:
--output-recursive OUTPUT_PATH=gs://bucket/path
Then have your job write output files and subdirectories to ${OUTPUT_PATH} within the Docker container.
All files and directories under the path /mnt/data/output/gs/bucket/path will be automatically copied to Cloud Storage when your script or command exits with success.
The Docker container will receive the environment variable:
OUTPUT_PATH=/mnt/data/output/gs/bucket/path
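A minimal sketch of writing a tree of results (the file and directory names are placeholders):
# OUTPUT_PATH is set by dsub; everything below it is copied back to
# gs://bucket/path when the script exits with success.
mkdir -p "${OUTPUT_PATH}/logs" "${OUTPUT_PATH}/results"
echo "run complete" > "${OUTPUT_PATH}/logs/run.log"
echo "42" > "${OUTPUT_PATH}/results/summary.txt"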
As a getting started convenience, if --input-recursive or --output-recursive are used with the google provider, dsub will automatically check for and, if needed, install the Google Cloud SDK in the Docker container at runtime (before your script executes).
If you use the recursive copy features, install the Cloud SDK in your Docker image when you build it to avoid the installation at runtime.
If you use a Debian or Ubuntu Docker image, you are encouraged to use the Cloud SDK's apt-get package installation instructions.
If you use a Red Hat or CentOS Docker image, you are encouraged to use the yum package installation instructions.
Note the following limitations on input and output paths:
- GCS recursive wildcards (**) are not supported
- Wildcards in the middle of a path are not supported
- Output parameters to a directory are not supported; instead:
  - use an explicit wildcard on the filename (such as gs://mybucket/mypath/*), or
  - use the recursive copy feature