SRCH-5469-spidermon #50

Merged
merged 29 commits into from
Nov 15, 2024

Conversation

IsabelLaurenceau
Collaborator

@IsabelLaurenceau IsabelLaurenceau commented Nov 6, 2024

Summary

Testing Instructions:

Prep:

  • You'll need to set the following environment variables:
    • export SPIDERMON_EMAIL_SENDER
    • export SPIDERMON_EMAIL_TO
    • export SPIDERMON_SMTP_HOST
    • export SPIDERMON_SMTP_PORT
    • export SPIDERMON_SMTP_USER
    • export SPIDERMON_SMTP_PASSWORD
  • You'll also need to pip install the new libraries (spidermon & premailer)
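
A small pre-flight check can catch a missing variable before a crawl starts. This is a minimal sketch and not part of the PR: the helper name `missing_spidermon_env` is hypothetical, and only the six variable names come from the list above.

```python
import os

# Variable names taken from the list above.
REQUIRED_SPIDERMON_ENV = (
    "SPIDERMON_EMAIL_SENDER",
    "SPIDERMON_EMAIL_TO",
    "SPIDERMON_SMTP_HOST",
    "SPIDERMON_SMTP_PORT",
    "SPIDERMON_SMTP_USER",
    "SPIDERMON_SMTP_PASSWORD",
)


def missing_spidermon_env(environ=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_SPIDERMON_ENV if not env.get(name)]


if __name__ == "__main__":
    missing = missing_spidermon_env()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All Spidermon email variables are set.")
```

Passing a plain dict instead of relying on `os.environ` keeps the check easy to exercise without touching the real environment.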

Testing:

  • Once the libraries and environment variables are set up, you should be able to run the test scrape from /searchgov-spider/search_gov_crawler with: scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com/tag -a start_urls=https://quotes.toscrape.com
  • Feel free to experiment with the settings, specifically SPIDERMON_MIN_ITEMS, SPIDERMON_TIME_INTERVAL, SPIDERMON_ITEM_COUNT_INCREASE, and SPIDERMON_MAX_EXECUTION_TIME.

Post Merge Steps:

  • We will need to set the environment variables for the email action

Future Work:

There are a few things I would like to do that I don't think really fit into the scope of this ticket:

  • Optimize the parameters.
    • Running the entire domain_spider locally, I observed ~3,000 items/second. Keep in mind that some blocks will return fewer items. I set the current parameters as MVP metrics, but they need to be updated. My local machine should (hopefully) be slower than the EC2 instance, but these numbers give us a better baseline until we have data from running it in prod.
  • Format the email notification to send an attachment instead of just appending the report to the body of the email
  • Upload the report to CloudWatch
  • Upload the report to New Relic
  • Slack notifications (depending on GSA/TTS policy)

Checklist

Please ensure you have addressed all concerns below before marking a PR "ready for review" or before requesting a re-review. If you cannot complete an item below, replace the checkbox with the ⚠️ :warning: emoji and explain why the step was not completed.

Functionality Checks

  • You have merged the latest changes from the target branch (usually main) into your branch.

  • Your primary commit message is of the format SRCH-#### <description> matching the associated Jira ticket.

  • PR title is either of the format SRCH-#### <description> matching the associated Jira ticket (i.e. "SRCH-123 implement feature X"), or Release - SRCH-####, SRCH-####, SRCH-#### matching the Jira ticket numbers in the release.

  • Automated checks pass. If Code Climate checks do not pass, explain reason for failures:

Process Checks

  • You have specified at least one "Reviewer".

@selfdanielj
Contributor

@IsabelLaurenceau what do you think about a way for us to easily disable or enable the email notifications based on environment? For instance I don't think we would usually want to send emails if there was an error locally or even from a future dev or staging environment. A blunt tool could be to set the value of SPIDERMON_ENABLED to an environment variable, perhaps default it to off and something has to be set to turn it on?

@selfdanielj
Contributor

@IsabelLaurenceau what did you end up using for email? did you configure your personal gmail to send these or something else? Were there any steps you followed? I know you said it took a while to figure it out so I'm hoping you can point us in the right direction at least.

@IsabelLaurenceau
Collaborator Author

@IsabelLaurenceau what do you think about a way for us to easily disable or enable the email notifications based on environment? For instance I don't think we would usually want to send emails if there was an error locally or even from a future dev or staging environment. A blunt tool could be to set the value of SPIDERMON_ENABLED to an environment variable, perhaps default it to off and something has to be set to turn it on?

That makes sense to me. I think we could just set it from an environment variable like you said. I'm not sure if this is what you meant by having something set to turn it on, but we could set that variable to false in every environment other than Prod.

@IsabelLaurenceau
Collaborator Author

IsabelLaurenceau commented Nov 7, 2024

@IsabelLaurenceau what did you end up using for email? did you configure your personal gmail to send these or something else? Were there any steps you followed? I know you said it took a while to figure it out so I'm hoping you can point us in the right direction at least.

I personally use an Apple email account, so I used my Apple email credentials for SPIDERMON_EMAIL_SENDER, SPIDERMON_SMTP_HOST, SPIDERMON_SMTP_PORT, SPIDERMON_SMTP_USER, and SPIDERMON_SMTP_PASSWORD. It depends on which email server you want to use. The tricky part with my Apple email was that the password has to be an app password and not just your iCloud password (which is what I had assumed), and the smtp_user has to be the entire email address, including the domain. I also had to use port 587 with TLS.

@selfdanielj
Contributor

In case it helps anyone else, I got SMTP set up by adding an app password (https://myaccount.google.com/apppasswords) to my Fearless Gmail account and then using these environment variables:

SPIDERMON_EMAIL_SENDER = "dself@fearless.tech" 
SPIDERMON_EMAIL_TO = "daniel.self@gsa.gov" 
SPIDERMON_SMTP_HOST = "smtp.gmail.com"
SPIDERMON_SMTP_PORT = "587"
SPIDERMON_SMTP_USER = "dself@fearless.tech"
SPIDERMON_SMTP_PASSWORD = <password provided by the app passwords functionality>
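
To check SMTP credentials outside of a crawl, something like the following could work. It is a sketch using Python's standard smtplib with STARTTLS, matching the port 587 setup described above. The function names are hypothetical, and it does not go through Spidermon or Scrapy's MailSender, so it only confirms that the account settings themselves are accepted.

```python
import os
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a short plain-text message for checking SMTP settings (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Spidermon SMTP settings check"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_test_message(environ=None):
    """Send the test message using the SPIDERMON_* variables; raises on failure."""
    env = os.environ if environ is None else environ
    msg = build_test_message(env["SPIDERMON_EMAIL_SENDER"], env["SPIDERMON_EMAIL_TO"])
    with smtplib.SMTP(env["SPIDERMON_SMTP_HOST"], int(env["SPIDERMON_SMTP_PORT"])) as smtp:
        smtp.starttls()  # upgrade the connection to TLS before logging in
        smtp.login(env["SPIDERMON_SMTP_USER"], env["SPIDERMON_SMTP_PASSWORD"])
        smtp.send_message(msg)
```

Calling `send_test_message()` with the variables exported should either deliver one email or raise an smtplib exception that points at the failing step (connection, TLS, or login).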

@selfdanielj
Contributor

@IsabelLaurenceau Did you ever get this error when sending emails? The emails are sent but this error occurs. This looks like an open issue with spidermon: scrapinghub/spidermon#412

Traceback (most recent call last):
  File "/Users/dself/Projects/searchgov-spider/.venv/lib/python3.12/site-packages/twisted/internet/asyncioreactor.py", line 138, in _readOrWrite
    why = method()
          ^^^^^^^^
  File "/Users/dself/Projects/searchgov-spider/.venv/lib/python3.12/site-packages/twisted/internet/tcp.py", line 250, in doRead
    return self._dataReceived(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dself/Projects/searchgov-spider/.venv/lib/python3.12/site-packages/twisted/internet/tcp.py", line 255, in _dataReceived
    rval = self.protocol.dataReceived(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dself/Projects/searchgov-spider/.venv/lib/python3.12/site-packages/twisted/protocols/tls.py", line 339, in dataReceived
    self._flushReceiveBIO()
  File "/Users/dself/Projects/searchgov-spider/.venv/lib/python3.12/site-packages/twisted/protocols/tls.py", line 310, in _flushReceiveBIO
    self._flushSendBIO()
  File "/Users/dself/Projects/searchgov-spider/.venv/lib/python3.12/site-packages/twisted/protocols/tls.py", line 263, in _flushSendBIO
    bytes = self._tlsConnection.bio_read(2**15)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'bio_read'

@selfdanielj
Contributor

@IsabelLaurenceau scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com/
So I get 13 emails but only one report file is written out. Is this expected, or should we get 13 files? If I had to guess, I would say it is probably writing 13 files, but the name is the same so each one overwrites the last.

@IsabelLaurenceau
Collaborator Author

@IsabelLaurenceau scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com/ So i get 13 emails but only one report file is written out.... is this expected? or should we get 13 files? If I had to guess I would say it probably is writing 13 files but the name is the same so they are overwritten.

I don't usually get more than one email. That is not expected behavior. Did you change any of the spidermon variables when this happened?

Contributor

@igoristic igoristic left a comment

Looks good 🥇

Do we really need email support? I know it's really easy to be classified as a spam bot if we implement it wrong or send more emails than the "allowed" threshold.

I understand slack bot/notifications are hard to get approved, but I think it's so much more valuable than emails. *But, only because my email is already spammed out by datadog and newrelic 🙃

@@ -7,6 +7,10 @@
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import os
from pathlib import Path
Contributor

this import isn't used and can be removed

@@ -0,0 +1,18 @@
import os
Contributor

this import isn't used and can be removed

dirname= os.path.dirname(__file__)
body_html_template = os.path.join(dirname, 'actions', 'results.jinja')

SPIDERMON_ENABLED = os.environ.get('SPIDERMON_ENABLED')
Contributor

Even though Scrapy's getbool function https://github.com/scrapy/scrapy/blob/212e848402a63b43fe8b7204e19d47fa7c4f0cd9/scrapy/settings/__init__.py#L126 allows this to be really flexible, I think it would be clearer to add a default so that someone can understand that this is meant to be a boolean: os.environ.get('SPIDERMON_ENABLED', 'False') for example. What do you think?
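
A sketch of what an explicit, off-by-default flag could look like. The `env_flag` helper is hypothetical and deliberately stricter than a bare os.environ.get; it is not Scrapy's implementation, and the accepted spellings (1/0/true/false) are an assumption chosen for illustration.

```python
import os

_TRUE_VALUES = {"1", "true"}
_FALSE_VALUES = {"0", "false"}


def env_flag(name, default=False, environ=None):
    """Read an environment variable as a boolean, falling back to `default` when unset."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    raise ValueError(f"{name} must be one of 1/0/true/false, got {raw!r}")


# Off unless something explicitly turns it on.
SPIDERMON_ENABLED = env_flag("SPIDERMON_ENABLED", default=False)
```

Raising on an unrecognized value, instead of silently treating it as false, makes a typo in a deployment config visible at startup.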

@selfdanielj selfdanielj self-requested a review November 15, 2024 18:48
Contributor

@selfdanielj selfdanielj left a comment

Looks good, generally works as expected. If you wanted to fix the few small things I put in that would be great otherwise no problem, seems like we will have some more work to do on this before enabling it in prod anyway.

I'm also going to add in the comments, just for tracking purposes, three things that I think we all agree are out of scope for this PR:

  • The snippet I slacked you about changing the email mime type
  • The error I'm seeing when I send emails (although emails still get sent)
  • Some more "proof" that multiple emails are getting sent but only one report file is present at the end of the run (perhaps being overwritten again and again).

@selfdanielj
Contributor

For future use in making the emails more pretty:

Spidermon uses a scrapy MailSender class: https://github.com/scrapinghub/spidermon/blob/de6921541f38613f62384efc682ed2a0282b08fa/spidermon/contrib/actions/email/smtp.py#L79
but the default mimetype value in the scrapy class is "text/plain"
https://github.com/scrapy/scrapy/blob/261c4b61dc48353346c1e0387d0783ac15ab459d/scrapy/mail.py#L92
so I provided a new send_message function for the CreateCustomEmailReport class. The only change is the mimetype="text/html" argument:

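import os

from scrapy.mail import MailSender
from spidermon.contrib.actions.email.smtp import SendSmtpEmail


class CreateCustomEmailReport(SendSmtpEmail):
    # Template used to render the HTML body of the report email.
    dirname = os.path.dirname(__file__)
    body_html_template = os.path.join(dirname, "actions", "results.jinja")

    def send_message(self, message, **kwargs):
        # Build the Scrapy mail sender from the SMTP settings on this action.
        mail_sender = MailSender(
            smtphost=self.smtp_host,
            mailfrom=self.sender,
            smtpuser=self.smtp_user,
            smtppass=self.smtp_password,
            smtpport=self.smtp_port,
            smtptls=self.smtp_enforce_tls,
            smtpssl=self.smtp_enforce_ssl,
        )

        # The only change from the parent's send_message is mimetype="text/html".
        mail_sender.send(
            to=self.to,
            subject=message["Subject"],
            body=message.as_string(),
            cc=self.cc,
            mimetype="text/html",
            _callback=kwargs.get("_callback"),
        )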

@selfdanielj
Contributor

For future use in investigating discrepancy in file report output and number of emails sent:

For this test I used the built in email mock setting https://spidermon.readthedocs.io/en/latest/actions/email-action.html#spidermon-email-fake so that I wasn't actually sending emails but I could log when they are being sent. I redirected logs to a file and then grep'ed for the number of times it sent an email... which was 11 here. At the end of this run I only saw one report file.

(.venv) ➜  search_gov_crawler git:(SRCH-5469-Spidermon) ✗ scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com/ > scrapy_log.txt 2>&1
(.venv) ➜  search_gov_crawler git:(SRCH-5469-Spidermon) ✗ grep -c "SendSmtpEmail... OK" scrapy_log.txt 
11
(.venv) ➜  search_gov_crawler git:(SRCH-5469-Spidermon) ✗ ls -1 *spidermon_file_report.html | wc -l
1
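
If the report really is being overwritten each time the monitor suite runs (still only a guess at this point), one possible fix is to give every report a unique name. A minimal sketch, assuming the file name can be computed before the report is written: `unique_report_name` is a hypothetical helper, and how it would be wired into Spidermon's file report action is not shown here.

```python
from datetime import datetime, timezone
from itertools import count

_sequence = count(1)  # distinguishes reports written within the same second


def unique_report_name(spider_name, now=None):
    """Return a report file name that differs on every call (hypothetical helper)."""
    moment = datetime.now(timezone.utc) if now is None else now
    stamp = moment.strftime("%Y%m%dT%H%M%S")
    return f"{spider_name}_{stamp}_{next(_sequence):03d}_spidermon_file_report.html"
```

The names still end in spidermon_file_report.html, so the `ls -1 *spidermon_file_report.html | wc -l` check above would then count one file per report written.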

@selfdanielj
Contributor

For future use investigating bio_read errors while sending email as mentioned in #50 (comment)

With default settings and email enabled from dself@fearless.com to daniel.self@gsa.gov:

(.venv) ➜  search_gov_crawler git:(SRCH-5469-Spidermon) ✗ scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com/ > scrapy_log.txt 2>&1
(.venv) ➜  search_gov_crawler git:(SRCH-5469-Spidermon) ✗ grep -c "Mail sent OK: To=\[\'daniel.self@gsa.gov\']" scrapy_log.txt
11
(.venv) ➜  search_gov_crawler git:(SRCH-5469-Spidermon) ✗ grep -c "builtins.AttributeError: \'NoneType\' object has no attribute \'bio_read\'" scrapy_log.txt 
1

@IsabelLaurenceau IsabelLaurenceau merged commit e90fd0a into main Nov 15, 2024
2 checks passed
@selfdanielj selfdanielj deleted the SRCH-5469-Spidermon branch November 25, 2024 19:44
3 participants