
add /verifyrun cog #36

Open: wants to merge 1 commit into base: main


@b9r5 b9r5 commented Nov 24, 2024

Description

Adds a /verifyrun command that does the same work as the smoke test but is easier to use.


Checklist

Before submitting this PR, ensure the following steps have been completed:

  • Run the smoke test on your own server.
    • Run the cluster bot on your server:
      python discord-bot.py
    • Start a training run with the slash command /run.
      You may need to exercise some judgement about the script and GPU type.
    • Wait for the training run to complete.
    • Copy the URL for the thread started by the cluster bot in response to
      your /run message ("Cluster Bot started a thread: ..."):
      • Click on the 3 dots (...) next to the cluster bot's message.
      • Select Copy Message Link.
    • Using the copied URL, run the smoke test:
      python discord-bot-smoke-test.py copied_url
    • Verify that the smoke test script responds with:
      All tests passed!
      
    For more information on running a cluster bot on your own server, see
    README.md.


@msaroufim msaroufim left a comment


We could save some time by having a single command verify Modal, NVIDIA, and AMD one after the other and report how many of those 3 succeeded. Also, as far as the UX goes, why not have verify actually trigger the bot 3 times instead of having a human manually verify in a thread?
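
A rough sketch of the single-command idea, assuming each target can be triggered and checked through one awaitable callback. The target names, `verify_all`, `summarize`, and the `trigger_and_check` callback are all hypothetical and not taken from the PR:

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical target identifiers; the real cog may name these differently.
TARGETS = ("modal", "nvidia", "amd")


async def verify_all(
    trigger_and_check: Callable[[str], Awaitable[bool]],
) -> Dict[str, bool]:
    """Run the check for each target, one after the other, recording pass/fail."""
    results: Dict[str, bool] = {}
    for target in TARGETS:
        try:
            results[target] = await trigger_and_check(target)
        except Exception:
            # An error in one target should not stop the remaining checks.
            results[target] = False
    return results


def summarize(results: Dict[str, bool]) -> str:
    """Build a one-line report of how many targets succeeded."""
    passed = sum(results.values())
    failed = [name for name, ok in results.items() if not ok]
    summary = f"{passed}/{len(results)} runs succeeded"
    if failed:
        summary += " (failed: " + ", ".join(failed) + ")"
    return summary
```

The callback is where the bot would actually start a run and inspect the resulting thread, so no human has to copy a thread link by hand.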

This cog provides functionality to verify that either a GitHub Actions or
Modal run completed successfully by checking for specific message patterns
in a Discord thread. It supports verification of two types of runs:
1. GitHub Actions runs - Identified by "GitHub Action triggered!" message
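
For illustration, the pattern check described in this docstring might look roughly like the sketch below. It works on plain message strings instead of Discord objects, and the only marker it knows is the one quoted above. The helper names are hypothetical, and the Modal marker is left out because the excerpt does not show it:

```python
from typing import Iterable, Optional

# The only marker quoted in the docstring excerpt. A Modal marker would have
# to come from the bot's actual message text, which is not shown here.
GITHUB_TRIGGER_MARKER = "GitHub Action triggered!"


def find_message(messages: Iterable[str], marker: str) -> Optional[str]:
    """Return the first message that contains the marker, or None."""
    for content in messages:
        if marker in content:
            return content
    return None


def github_run_was_triggered(messages: Iterable[str]) -> bool:
    """True if any message in the thread carries the GitHub trigger marker."""
    return find_message(messages, GITHUB_TRIGGER_MARKER) is not None
```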

So this seems to test a trigger, but we've had issues where this wouldn't have helped, for example when we had timeout issues with both the NVIDIA and AMD runners.
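
One way to surface timeouts like these, sketched under the assumption that completion can be awaited through some coroutine: bound the wait with `asyncio.wait_for` and report a timeout as its own outcome, separate from a plain failure. `Outcome`, `verify_with_timeout`, and the `wait_for_completion` callback are hypothetical names, not part of the PR:

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable


class Outcome(Enum):
    PASSED = "passed"
    FAILED = "failed"
    TIMED_OUT = "timed out"


async def verify_with_timeout(
    wait_for_completion: Callable[[], Awaitable[bool]],
    timeout_s: float,
) -> Outcome:
    """Wait for a run to finish, treating an overlong wait as a distinct outcome."""
    try:
        ok = await asyncio.wait_for(wait_for_completion(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The run never reported completion within the allowed time.
        return Outcome.TIMED_OUT
    return Outcome.PASSED if ok else Outcome.FAILED
```

A check that only looks for the trigger message would pass in the timeout case, whereas this returns `Outcome.TIMED_OUT`.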
