Getting-started guides? #238

Open
bhtucker opened this issue Aug 27, 2020 · 4 comments

Comments

@bhtucker
Contributor

Summary

I'm trying to set up a fresh project and wonder if there are any templates for the 'sibling' repo. (I have the fortunate position of vaguely remembering how this should work, and still I'm stuck!)

By banging my head against the validator, I eventually came up with a dummy warehouse config (uselessly passes the validator):

{
  "arthur_settings": {},
  "data_warehouse": {},
  "type_maps": {},
  "object_store": {
    "s3": {
      "bucket_name": "load-bucket",
      "iam_role": "arn:aws:iam::123:role/NotARole"
    }
  },
  "resources": {
    "key_name": "my-fake-ssh-key",
    "VPC": {
      "region": "us-east-1",
      "account": "123",
      "name": "MyVPC",
      "public_subnet": "PublicSubnet",
      "whitelist_security_group": "sg-123"
    },
    "DataPipeline": {
      "role": "NotARole"
    },
    "EC2": {
      "instance_type": "m5.4xlarge",
      "image_id": "",
      "public_security_group": "foobar",
      "iam_instance_profile": "instanceprofile"
    },
    "EMR": {
      "master": {
        "instance_type": "m5.4xlarge",
        "managed_security_group": "foobar"
      },
      "core": {
        "instance_type": "m5.4xlarge",
        "managed_security_group": "foobar"
      },
      "release_label": "emr-5.29.0"
    }
  },
  "etl_events": {}
}

Now I need to set up my prefix, with e.g. bootstrapping scripts as well as sync output. I guess this is upload_env.sh?

Anyway, if I'm missing existing assets, I'd love to use them -- and if not, it would be good to know, so I can write down what I do!

Details

At the moment I'm just trying to use extract.

Labels

Please set the label on the issue so that

  • you pick bug fix, feature, or enhancement
  • you pick one of the components of Arthur, such as component: extract or component: load

I don't think I have 'edit' rights on the labels

@tvogels01
Contributor

tvogels01 commented Aug 28, 2020

Just for testing, I created a config directory inside the arthur-redshift-etl directory. Then I built a minimal set of config files.

Here's a PR that makes this easier: #241

mkdir config
export DATA_WAREHOUSE_CONFIG=`pwd`/config

cp etc/aws_template.yaml config/aws.yaml
cp etc/warehouse_template.yaml config/warehouse.yaml
cp etc/credentials.sh.template config/credentials.sh
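If you rerun the setup after editing the files, the plain cp commands above will overwrite your changes. Here is a small sketch of a guard against that. The init_config function is a hypothetical helper, not part of Arthur, and it assumes you run it from the root of the arthur-redshift-etl checkout where the etc/ templates live.

```shell
# Hypothetical helper (not part of Arthur): copy the templates named above
# into a config directory, skipping any file that already exists.
# Assumption: the current directory is the arthur-redshift-etl checkout.
init_config() {
    local config_dir="${1:-config}"
    local pair template target
    mkdir -p "$config_dir" || return 1
    # Each entry is "template:target", taken from the cp commands above.
    for pair in \
        "etc/aws_template.yaml:aws.yaml" \
        "etc/warehouse_template.yaml:warehouse.yaml" \
        "etc/credentials.sh.template:credentials.sh"
    do
        template="${pair%%:*}"
        target="$config_dir/${pair##*:}"
        if [ -e "$target" ]; then
            echo "keeping existing $target"
        else
            cp "$template" "$target" || return 1
            echo "created $target"
        fi
    done
}

# Example:
#   init_config config && export DATA_WAREHOUSE_CONFIG="$(pwd)/config"
```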

Take a look at the templates. The aws.yaml config file needs to be updated based on outputs from the CloudFormation stack. Looking at the config file posted, it just might be easier than you remember. We've made some improvements.
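One way to see the CloudFormation outputs you need for aws.yaml is the AWS CLI. This is only a sketch: it assumes the AWS CLI is installed and configured for the right account, show_stack_outputs is a hypothetical helper, and the stack name is whatever you called your stack (it isn't stated in this thread).

```shell
# Hypothetical helper: print the outputs of a CloudFormation stack as a table
# so the values can be copied into config/aws.yaml.
# Assumption: the AWS CLI is installed and has credentials for the account.
show_stack_outputs() {
    local stack_name="$1"
    if [ -z "$stack_name" ]; then
        echo "usage: show_stack_outputs STACK_NAME" >&2
        return 2
    fi
    aws cloudformation describe-stacks \
        --stack-name "$stack_name" \
        --query 'Stacks[0].Outputs[].[OutputKey,OutputValue]' \
        --output table
}

# Example (the stack name is a placeholder):
#   show_stack_outputs my-arthur-stack
```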

Starting Arthur now:

bin/run_arthur.sh

This will show some settings. Take a look at the rest:

arthur.py settings

And now make sure that S3 has the ETL code:

upload_env.sh 

Without changes to the template this fails, of course, but you'll update aws.yaml.

Finding bucket name and prefix in configuration...

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
Check whether the bucket "object-store" exists and you have access to it!
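If you want to confirm the bucket setting before rerunning upload_env.sh, a quick check with the AWS CLI might look like the sketch below. check_bucket_access is a hypothetical helper; "object-store" in the error above appears to be the template's placeholder, so pass the bucket name you put into aws.yaml instead.

```shell
# Hypothetical helper: check that a bucket exists and is reachable with the
# current AWS credentials. head-bucket fails on both "not found" and
# "access denied", so either problem ends up in the error branch.
check_bucket_access() {
    local bucket="$1"
    if aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1; then
        echo "ok: can reach s3://$bucket"
    else
        echo "cannot access s3://$bucket (missing bucket or missing permissions)" >&2
        return 1
    fi
}

# Example (the bucket name is a placeholder):
#   check_bucket_access my-etl-bucket && upload_env.sh
```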

Then create a table design file:

arthur.py bootstrap_sources webapp

This also fails because you need a credentials file with the connections, see prompts in config/credentials.sh.

After you've set up the credentials (with connection strings), don't forget to copy the file to S3.

Once you have a design file, upload the local schemas to S3:

arthur.py sync --deploy-config

And now run one of:

arthur.py extract
install_extraction_pipeline.sh

Let me know which hurdles you encounter and I'll try to get them resolved.

@bhtucker
Contributor Author

Thank you for the guidance!

@bhtucker
Contributor Author

This worked great. The one hiccup was: credentials.sh seems to be uploaded out-of-band, right? Neither upload_env.sh nor arthur.py sync ends up copying it?

@tvogels01
Contributor

Yes, unfortunately. You'll have to create and upload the credentials.sh file manually.
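For the manual upload, something along these lines should work with the AWS CLI. Treat it as a sketch: upload_credentials is a hypothetical helper, and the destination is a full s3:// URI that you have to supply yourself, since this thread doesn't say which key under your prefix Arthur expects.

```shell
# Hypothetical helper: copy a local credentials file to an S3 location.
# Assumption: the destination URI (bucket, prefix, and key) comes from you;
# the layout Arthur expects is not stated in this thread.
upload_credentials() {
    local source_file="$1"
    local destination="$2"
    if [ ! -f "$source_file" ]; then
        echo "no such file: $source_file" >&2
        return 1
    fi
    case "$destination" in
        s3://*) ;;
        *) echo "destination must be an s3:// URI" >&2; return 2 ;;
    esac
    aws s3 cp "$source_file" "$destination"
}

# Example (bucket and key are placeholders):
#   upload_credentials config/credentials.sh s3://my-etl-bucket/my-prefix/config/credentials.sh
```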
