No documentation on S3 sink setup #272
Comments
Divolte doesn't treat S3 any differently than HDFS. This means you can use the HDFS client's built-in support for accessing S3 buckets in a particular layout. Divolte currently ships with Hadoop 2.9.2, so the relevant up-to-date link on AWS integration (including the S3 filesystems) is here: https://hadoop.apache.org/docs/r2.9.2/hadoop-aws/tools/hadoop-aws/index.html

Note that there are now three different S3 client implementations in Hadoop, each of which uses a different layout on S3. If your aim is to use Divolte just for collection and to subsequently read the Avro files on S3 with tools other than Hadoop, s3n or s3a is probably what you want. s3n has been available for a while; s3a is still under development but is intended to become a drop-in replacement for s3n down the line. s3a is mostly aimed at use cases of substantial scale, involving large files that can become a performance issue for s3n.
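For illustration, here is a minimal sketch of what an S3-backed sink could look like in divolte-collector.conf, assuming a hypothetical bucket named `my-divolte-bucket`. The sink type stays `hdfs`, because Divolte reaches S3 through the HDFS client:

```hocon
divolte {
  sinks {
    // An "S3 sink" is just an HDFS sink whose directories use an s3a:// URI.
    s3 {
      type = hdfs
      file_strategy {
        working_dir = "s3a://my-divolte-bucket/tmp/s3working"
        publish_dir = "s3a://my-divolte-bucket/tmp/s3publish"
      }
    }
  }
}
```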
OK, I'm using s3a with the following config: But I'm getting the following error, even though I do have a tmp/working directory:
I changed my settings to the below and tried s3a, s3n, and s3, and got the same error each time: `org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a"` (or "s3n"/"s3", respectively).
The libraries might not be shipped with Divolte, and you may need some additional settings, such as fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem, depending on your version of Hadoop.
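A hedged sketch of where such a setting could live in the Divolte configuration; the property names are standard s3a client options, and the credentials are placeholders:

```hocon
divolte.global.hdfs {
  enabled = true
  client {
    // Map the s3a:// scheme explicitly to its implementation class.
    fs.s3a.impl = "org.apache.hadoop.fs.s3a.S3AFileSystem"
    fs.s3a.access.key = "YOUR_ACCESS_KEY"  // placeholder
    fs.s3a.secret.key = "YOUR_SECRET_KEY"  // placeholder
  }
}
```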
I did a quick check with the Divolte Docker image. This is what was needed:

- the extra libraries downloaded and put into /opt/divolte/divolte-collector/lib
- config: HDFS enabled through env vars in docker-compose (a sketch follows below)

Here `s3-server` is a localstack Docker container which mimics S3.
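A sketch of what such a docker-compose setup could look like. The service names, mounted paths, and jar versions are assumptions to adapt; and since the exact env var names depend on the image's stock configuration, this sketch mounts a custom config directory instead of relying on env vars:

```yaml
version: "3"
services:
  divolte:
    image: divolte/divolte-collector
    ports:
      - "8290:8290"
    volumes:
      # Custom config with the hdfs sink enabled and the s3a client settings.
      - ./conf:/opt/divolte/divolte-collector/conf
      # Extra jars that Divolte does not ship with (versions are examples).
      - ./jars/hadoop-aws-2.9.2.jar:/opt/divolte/divolte-collector/lib/hadoop-aws-2.9.2.jar
      - ./jars/aws-java-sdk-bundle-1.11.199.jar:/opt/divolte/divolte-collector/lib/aws-java-sdk-bundle-1.11.199.jar
    depends_on:
      - s3-server
  s3-server:
    image: localstack/localstack
    environment:
      - SERVICES=s3
    ports:
      - "4566:4566"
```

When pointing s3a at localstack rather than real S3, the client also needs an endpoint override, e.g. fs.s3a.endpoint = "http://s3-server:4566" and fs.s3a.path.style.access = true (both standard s3a properties; the port depends on your localstack version).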
Oh, and make sure the bucket is available and the tmp/s3working and tmp/s3publish keys are present. (A directory-exists check is done, so adding a file to the bucket with the correct key prefix fools the HDFS client.)
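A hedged example of pre-creating those keys with the AWS CLI; the bucket name and endpoint are hypothetical:

```sh
# Create the bucket and empty "directory" keys so the HDFS client's
# existence checks pass before Divolte writes its first file.
aws --endpoint-url=http://localhost:4566 s3 mb s3://my-divolte-bucket
aws --endpoint-url=http://localhost:4566 s3api put-object \
    --bucket my-divolte-bucket --key tmp/s3working/
aws --endpoint-url=http://localhost:4566 s3api put-object \
    --bucket my-divolte-bucket --key tmp/s3publish/
```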
Thanks for looking into this @krisgeus. I'm just a little confused about something: I'm trying to use S3 instead of HDFS, so why do I need HDFS to be running for this to work?
I did everything you listed above and it's not working. I got an error about a missing hadoop.tmp.dir variable, so I added that; now there's no error, but there are also no files being added to S3. Since there's no error, I'm not sure what the issue is.
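For reference, a hedged sketch of that setting; hadoop.tmp.dir is a standard Hadoop property, and the s3a client buffers uploads on local disk under it by default:

```hocon
divolte.global.hdfs.client {
  // Local scratch space; must be writable by the Divolte process.
  hadoop.tmp.dir = "/tmp/hadoop"
}
```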
Sorry for the late response (holiday season). Without an error message I can't help you out either. With the steps provided above I managed to create a working example based on the Divolte Docker image.
Been trying this but I keep getting an `Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found` error. Do I need to install the complete Hadoop application as well, or am I missing something else?

Edit: it seems the libraries are very particular about the version you use.
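A sketch of fetching a matched pair of jars from Maven Central; the versions shown are assumptions, and the correct aws-java-sdk-bundle version is the one declared as a dependency in the hadoop-aws POM for your Hadoop release:

```sh
# Match hadoop-aws to the Hadoop version bundled with Divolte, and
# aws-java-sdk-bundle to the dependency declared in that hadoop-aws POM.
HADOOP_VERSION=2.9.2        # example; check Divolte's bundled Hadoop version
AWS_SDK_VERSION=1.11.199    # example; check the hadoop-aws POM
wget "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar"
wget "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar"
```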
The documentation says you can use S3 as a file sink but gives no details on how to do so. There is one line linking somewhere else, but the link is broken.
These are the docs: http://divolte-releases.s3-website-eu-west-1.amazonaws.com/divolte-collector/0.9.0/userdoc/html/configuration.html
and this is the broken link: https://wiki.apache.org/hadoop/AmazonS3