This repository has been archived by the owner on Apr 22, 2022. It is now read-only.

No documentation on S3 sink setup #272

Open

tatianafrank opened this issue Jul 17, 2019 · 10 comments

Comments

@tatianafrank

The documentation says you can use S3 as a file sink but gives no details on how to do so. There is one line linking elsewhere, but that link is broken.
These are the docs: http://divolte-releases.s3-website-eu-west-1.amazonaws.com/divolte-collector/0.9.0/userdoc/html/configuration.html
and this is the broken link: https://wiki.apache.org/hadoop/AmazonS3

@friso
Collaborator

friso commented Jul 18, 2019

Divolte doesn't treat S3 any differently than HDFS. This means you can use the built-in support of the HDFS client to access S3 buckets with a particular layout.

Divolte currently ships with Hadoop 3.2.0, so the relevant updated documentation on AWS integration (including using S3 filesystems) is here: https://hadoop.apache.org/docs/r2.9.2/hadoop-aws/tools/hadoop-aws/index.html

Note that there are now three different S3 client implementations in Hadoop, which all use different layouts on S3. If your aim is to use Divolte just for collection and then use the Avro files on S3 with tools other than Hadoop, s3n or s3a is probably what you want. s3n has been available for a while, whereas s3a is still under development but is intended to be a drop-in replacement for s3n down the line. s3a is mostly aimed at use cases of substantial scale, involving large files that can become a performance problem for s3n.
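For orientation, a minimal sketch of what such a setup could look like in divolte-collector.conf, following the sink layout from the user documentation linked above. The bucket name, sink name, and credential placeholders are illustrative, and mappings/sources are omitted:

divolte {
  global {
    hdfs {
      enabled = true
      client {
        // Hadoop client properties, e.g. s3a credentials
        fs.s3a.access.key = "YOUR_ACCESS_KEY"
        fs.s3a.secret.key = "YOUR_SECRET_KEY"
      }
    }
  }
  sinks {
    s3 {
      type = hdfs
      file_strategy {
        // full s3a:// URIs, so the default local /tmp paths are not used
        working_dir = "s3a://my-bucket/divolte/working"
        publish_dir = "s3a://my-bucket/divolte/published"
      }
    }
  }
}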

@tatianafrank
Author

OK, I'm using s3a with the following config:
client {
fs.DEFAULT_FS = "https://s3.us.cloud-object-storage.appdomain.cloud"
fs.defaultFS = "https://s3.us.cloud-object-storage.appdomain.cloud"
fs.s3a.bucket.BUCKET_NAME.access.key = ""
fs.s3a.bucket.BUCKET_NAME.secret.key = ""
fs.s3a.bucket.BUCKET_NAME.endpoint = "https://s3.us.cloud-object-storage.appdomain.cloud"
}

But I'm getting the following error even though I do have a tmp/working directory:
Path for in-flight AVRO records is not a directory: /tmp/working
So I'm guessing it's not properly connecting to S3, since the directory DOES exist. Is something wrong with my config? My S3 provider is not AWS but another cloud provider, so the URL structure is a little different. Am I supposed to set fs.defaultFS to the S3 URL? Where do I set the bucket?

@tatianafrank
Author

tatianafrank commented Jul 18, 2019

I changed my settings to the below and tried s3a, s3n, and s3, and got the same error: "org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a"" (or "s3n" or "s3")

client {
fs.DEFAULT_FS = "s3a://BUCKET-NAME"
fs.defaultFS = "s3a://BUCKET-NAME"
fs.s3a.access.key = ""
fs.s3a.secret.key = ""
fs.s3a.endpoint = "https://s3.us.cloud-object-storage.appdomain.cloud"
}

@krisgeus
Contributor

The libraries might not be shipped with Divolte, and you need some additional settings:

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

and, depending on your version of Hadoop, the matching jars:
http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar

@krisgeus
Contributor

krisgeus commented Jul 19, 2019

I did a quick check with the Divolte Docker image. This is what was needed:

libraries downloaded and put into /opt/divolte/divolte-collector/lib
http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.1/hadoop-aws-3.1.1.jar

config:
client {
fs.DEFAULT_FS = "s3a://avro-bucket"
fs.defaultFS = "s3a://avro-bucket"
fs.s3a.access.key = foo
fs.s3a.secret.key = bar
fs.s3a.endpoint = "s3-server:4563"
fs.s3a.path.style.access = true
fs.s3a.connection.ssl.enabled = false
fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
}

enable hdfs through env vars in docker-compose:
DIVOLTE_HDFS_ENABLED: "true"
DIVOLTE_HDFS_SINK_WORKING_DIR: "s3a://avro-bucket/tmp/s3working"
DIVOLTE_HDFS_SINK_PUBLISH_DIR: "s3a://avro-bucket/tmp/s3publish"

s3-server is a localstack Docker container which mimics S3.
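For completeness, a sketch of how this could be wired together in docker-compose, assuming the two jars above were downloaded into ./extra-jars on the host; the image name, port mapping, and localstack details are illustrative and depend on your setup:

services:
  divolte:
    image: divolte/divolte-collector   # illustrative image name
    ports:
      - "8290:8290"                    # Divolte's default HTTP port
    environment:
      DIVOLTE_HDFS_ENABLED: "true"
      DIVOLTE_HDFS_SINK_WORKING_DIR: "s3a://avro-bucket/tmp/s3working"
      DIVOLTE_HDFS_SINK_PUBLISH_DIR: "s3a://avro-bucket/tmp/s3publish"
    volumes:
      # put the extra jars on the collector's classpath
      - ./extra-jars/aws-java-sdk-bundle-1.11.271.jar:/opt/divolte/divolte-collector/lib/aws-java-sdk-bundle-1.11.271.jar
      - ./extra-jars/hadoop-aws-3.1.1.jar:/opt/divolte/divolte-collector/lib/hadoop-aws-3.1.1.jar
  s3-server:
    image: localstack/localstack       # mimics S3, as described above
    environment:
      SERVICES: s3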

@krisgeus
Contributor

Oh, and make sure the bucket is available and the tmp/s3working and tmp/s3publish keys are present. (A directory-exists check is done, so adding a file to the bucket with the correct key prefix fools the HDFS client.)
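For example, with the AWS CLI pointed at the same endpoint, putting empty objects under those prefixes is enough to satisfy the directory-exists check (bucket, keys, and endpoint mirror the example above and are otherwise illustrative):

aws --endpoint-url http://s3-server:4563 s3api put-object --bucket avro-bucket --key tmp/s3working/
aws --endpoint-url http://s3-server:4563 s3api put-object --bucket avro-bucket --key tmp/s3publish/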

@tatianafrank
Author

Thanks for looking into this @krisgeus. I'm just a little confused about something: I'm trying to use S3 instead of HDFS, so why do I need HDFS to be running for this to work?

@tatianafrank
Author

I did everything you listed above and it's not working. I got an error about a missing hadoop.tmp.dir var, so I added that; now there's no error, but there are no files being added to S3 either. Since there's no error, I'm not sure what the issue is.

@krisgeus
Contributor

Sorry for the late response (Holiday season). Without an error I cannot help you out either. With the steps provided above I managed to create a working example based on the divolte docker image.

@rakzcs

rakzcs commented Jan 24, 2022

(Quotes @krisgeus's earlier comment above: the extra libraries in /opt/divolte/divolte-collector/lib, the client config, and the docker-compose env vars.)

Been trying this, but I keep getting an "Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" error. Do I need to install the complete Hadoop application as well, or am I missing something else?

Edit: it seems the libraries are very particular about the versions you use.
Solution: https://hadoop.apache.org/docs/r3.3.1/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html
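One way to see which Hadoop version a Divolte distribution bundles (the path assumes the Docker image layout mentioned above), so that matching hadoop-aws and aws-java-sdk-bundle versions can be picked:

ls /opt/divolte/divolte-collector/lib/ | grep hadoop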
