This repository has been archived by the owner on Apr 22, 2022. It is now read-only.

No documentation on S3 sink setup #272

Open

tatianafrank opened this issue Jul 17, 2019 · 10 comments

Comments

@tatianafrank

The documentation says you can use S3 as a file sink but gives no details on how to do so. There is one line linking elsewhere, but that link is broken.
These are the docs: http://divolte-releases.s3-website-eu-west-1.amazonaws.com/divolte-collector/0.9.0/userdoc/html/configuration.html
and this is the broken link: https://wiki.apache.org/hadoop/AmazonS3

@friso
Collaborator

friso commented Jul 18, 2019

Divolte doesn't treat S3 any differently than HDFS. This means you can use the built-in support of the HDFS client to access S3 buckets with a particular layout.

Divolte currently ships with Hadoop 3.2.0, so the relevant updated documentation on AWS integration (including using S3 filesystems) is here: https://hadoop.apache.org/docs/r2.9.2/hadoop-aws/tools/hadoop-aws/index.html

Note that there are now three different S3 client implementations in Hadoop, which all use different layouts on S3. If your aim is to use Divolte just for collection and then use the Avro files on S3 with tools other than Hadoop, s3n or s3a is probably what you want. s3n has been available for a while, whereas s3a is still under development but is intended to be a drop-in replacement for s3n down the line. s3a is mostly aimed at use cases of substantial scale, involving large files that can become a performance problem for s3n.
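For orientation, a minimal sketch of what such a setup could look like in divolte-collector.conf, following the sink layout from the user documentation linked above. The bucket name, sink name, and credential placeholders are illustrative, and mappings/sources are omitted:

divolte {
  global {
    hdfs {
      enabled = true
      client {
        // Hadoop client properties, e.g. s3a credentials
        fs.s3a.access.key = "YOUR_ACCESS_KEY"
        fs.s3a.secret.key = "YOUR_SECRET_KEY"
      }
    }
  }
  sinks {
    s3 {
      type = hdfs
      file_strategy {
        // full s3a:// URIs, so the default local /tmp paths are not used
        working_dir = "s3a://my-bucket/divolte/working"
        publish_dir = "s3a://my-bucket/divolte/published"
      }
    }
  }
}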

@tatianafrank
Author

OK, I'm using s3a with the following config:
client {
fs.DEFAULT_FS = "https://s3.us.cloud-object-storage.appdomain.cloud"
fs.defaultFS = "https://s3.us.cloud-object-storage.appdomain.cloud"
fs.s3a.bucket.BUCKET_NAME.access.key = ""
fs.s3a.bucket.BUCKET_NAME.secret.key = ""
fs.s3a.bucket.BUCKET_NAME.endpoint = "https://s3.us.cloud-object-storage.appdomain.cloud"
}

But I'm getting the following error even though I do have a tmp/working directory:
Path for in-flight AVRO records is not a directory: /tmp/working
So I'm guessing it's not properly connecting to S3, since the directory DOES exist. Is something wrong with my config? My S3 provider is not AWS but another cloud provider, so the URL structure is a little different. Am I supposed to set fs.defaultFS to the S3 URL? Where do I set the bucket?

@tatianafrank
Author

tatianafrank commented Jul 18, 2019

I changed my settings to the below and tried s3a, s3n, and s3, and got the same error: "org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a"" (or "s3n" or "s3")

client {
fs.DEFAULT_FS = "s3a://BUCKET-NAME"
fs.defaultFS = "s3a://BUCKET-NAME"
fs.s3a.access.key = ""
fs.s3a.secret.key = ""
fs.s3a.endpoint = "https://s3.us.cloud-object-storage.appdomain.cloud"
}

@krisgeus
Contributor

The libraries might not be shipped with Divolte, and you need some additional settings:

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

and, depending on your version of Hadoop, the matching jars:
http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar

@krisgeus
Contributor

krisgeus commented Jul 19, 2019

I did a quick check with the Divolte Docker image. This is what was needed:

libraries downloaded and put into /opt/divolte/divolte-collector/lib
http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.1/hadoop-aws-3.1.1.jar

config:
client {
fs.DEFAULT_FS = "s3a://avro-bucket"
fs.defaultFS = "s3a://avro-bucket"
fs.s3a.access.key = foo
fs.s3a.secret.key = bar
fs.s3a.endpoint = "s3-server:4563"
fs.s3a.path.style.access = true
fs.s3a.connection.ssl.enabled = false
fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
}

enable hdfs through env vars in docker-compose:
DIVOLTE_HDFS_ENABLED: "true"
DIVOLTE_HDFS_SINK_WORKING_DIR: "s3a://avro-bucket/tmp/s3working"
DIVOLTE_HDFS_SINK_PUBLISH_DIR: "s3a://avro-bucket/tmp/s3publish"

s3-server is a localstack Docker container which mimics S3.
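For completeness, a sketch of how this could be wired together in docker-compose, assuming the two jars above were downloaded into ./extra-jars on the host; the image name, port mapping, and localstack details are illustrative and depend on your setup:

services:
  divolte:
    image: divolte/divolte-collector   # illustrative image name
    ports:
      - "8290:8290"                    # Divolte's default HTTP port
    environment:
      DIVOLTE_HDFS_ENABLED: "true"
      DIVOLTE_HDFS_SINK_WORKING_DIR: "s3a://avro-bucket/tmp/s3working"
      DIVOLTE_HDFS_SINK_PUBLISH_DIR: "s3a://avro-bucket/tmp/s3publish"
    volumes:
      # put the extra jars on the collector's classpath
      - ./extra-jars/aws-java-sdk-bundle-1.11.271.jar:/opt/divolte/divolte-collector/lib/aws-java-sdk-bundle-1.11.271.jar
      - ./extra-jars/hadoop-aws-3.1.1.jar:/opt/divolte/divolte-collector/lib/hadoop-aws-3.1.1.jar
  s3-server:
    image: localstack/localstack       # mimics S3, as described above
    environment:
      SERVICES: s3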

@krisgeus
Contributor

Oh, and make sure the bucket is available and the tmp/s3working and tmp/s3publish keys are present. (A directory-exists check is done, so adding a file to the bucket with the correct key prefix fools the HDFS client.)
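For example, with the AWS CLI pointed at the same endpoint, putting empty objects under those prefixes is enough to satisfy the directory-exists check (bucket, keys, and endpoint mirror the example above and are otherwise illustrative):

aws --endpoint-url http://s3-server:4563 s3api put-object --bucket avro-bucket --key tmp/s3working/
aws --endpoint-url http://s3-server:4563 s3api put-object --bucket avro-bucket --key tmp/s3publish/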

@tatianafrank
Author

Thanks for looking into this @krisgeus. I'm just a little confused about something: I'm trying to use S3 instead of HDFS, so why do I need HDFS to be running for this to work?

@tatianafrank
Author

I did everything you listed above and it's not working. I got an error about a missing hadoop.tmp.dir var, so I added that; now there's no error, but there are no files being added to S3 either. Since there's no error, I'm not sure what the issue is.

@krisgeus
Contributor

Sorry for the late response (Holiday season). Without an error I cannot help you out either. With the steps provided above I managed to create a working example based on the divolte docker image.

@rakzcs

rakzcs commented Jan 24, 2022

(Quotes @krisgeus's earlier comment above: the extra libraries in /opt/divolte/divolte-collector/lib, the client config, and the docker-compose env vars.)

Been trying this, but I keep getting an "Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" error. Do I need to install the complete Hadoop application as well, or am I missing something else?

Edit: it seems the libraries are very particular about the versions you use.
Solution: https://hadoop.apache.org/docs/r3.3.1/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html
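One way to see which Hadoop version a Divolte distribution bundles (the path assumes the Docker image layout mentioned above), so that matching hadoop-aws and aws-java-sdk-bundle versions can be picked:

ls /opt/divolte/divolte-collector/lib/ | grep hadoop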
