Provide DocEvent Webhook #1002

Open
hackerwins opened this issue Sep 9, 2024 · 6 comments
Labels
enhancement 🌟 New feature or request

@hackerwins
Member

What would you like to be added:

We are currently implementing LLM-based document search in CodePair. As part of this, we need to maintain a vector representation of each document's content in a Vector Store. It's crucial that the Vector Store keeps reflecting document updates as the content is continually edited.

To achieve this, we require a mechanism that notifies external services like CodePair when documents are modified in Yorkie. We propose the introduction of a Webhook system that triggers when a document event occurs.

Specifically, we suggest that when handling PushPullChanges requests, the server should check whether a Webhook for the DocEvent is registered for the project. If one is, the server would call that Webhook during the background routine of the PushPullChanges API execution, right before publishing the DocEvent.

I think it will have a structure similar to the Auth Webhook. If changes occur frequently, an event control mechanism such as debouncing will also be needed.

Why is this needed:

This enhancement would enable seamless integration with external services. It would allow real-time updates to a Search Engine or Vector Store based on document changes in Yorkie, enhancing the document management and search capabilities of our application.

@window9u
Contributor

Hello! Could I work on this issue?

@window9u
Contributor

window9u commented Nov 23, 2024

What Events Should We Send?

To keep external services informed about the state of documents, we should send events corresponding to the CRUD (Create, Read, Update, Delete) operations.

Common Webhook Specifications
  • Webhook Request Type: HTTP POST
  • Content Type: application/json
  • Expected Response:
    • Status Code: 200 OK
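
A minimal sketch of the request shape listed above, written in Python purely for illustration (the Yorkie server itself is not written in Python). The names `build_webhook_request` and `is_delivered` are hypothetical, and treating only 200 as success is an assumption taken literally from the "Expected Response" item; whether other 2xx codes should count is not settled in this thread.

```python
import json
import urllib.request


def build_webhook_request(endpoint: str, event: dict) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST that carries the event as JSON."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_delivered(status_code: int) -> bool:
    """Assumption: only 200 OK counts as a successful delivery."""
    return status_code == 200
```

The request object can be handed to `urllib.request.urlopen` with a `timeout` argument when an actual call is needed; the sketch stops short of sending so it has no network dependency.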

Event Types and Payloads

a. Document Created

  • Event Type: documentCreated

  • Payload Schema:

    {
      "type": "documentCreated",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      }
    }
  • Example:

    {
      "type": "documentCreated",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T05:43:52.318Z"
      }
    }

b. Document Watched

  • Event Type: documentWatched

  • Payload Schema:

    {
      "type": "documentWatched",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      },
      "data": {
        "watchingNum": "integer"
      }
    }
  • Example:

    {
      "type": "documentWatched",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T05:45:00.000Z"
      },
      "data": {
        "watchingNum": 3
      }
    }

c. Document Unwatched

  • Event Type: documentUnwatched

  • Payload Schema:

    {
      "type": "documentUnwatched",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      },
      "data": {
        "watchingNum": "integer"
      }
    }
  • Example:

    {
      "type": "documentUnwatched",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T05:50:00.000Z"
      },
      "data": {
        "watchingNum": 2
      }
    }

d. Document Changed

1. Change Event
  • Event Type: documentChanged

  • Payload Schema:

    {
      "type": "documentChanged",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      }
    }
  • Example:

    {
      "type": "documentChanged",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T06:00:00.000Z"
      }
    }
2. Snapshot Event
  • Event Type: snapshotStored

  • Payload Schema:

    {
      "type": "snapshotStored",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      },
      "data": {
        "snapshot": "string" // marshaled snapshot data
      }
    }
  • Example:

    {
      "type": "snapshotStored",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T06:05:00.000Z"
      },
      "data": {
        "snapshot": ""
      }
    }

e. Document Deleted

  • Event Type: documentDeleted

  • Payload Schema:

    {
      "type": "documentDeleted",
      "attributes": {
        "documentKey": "string",
        "clientKey": "string",
        "issuedAt": "string" // ISO 8601 timestamp
      }
    }
  • Example:

    {
      "type": "documentDeleted",
      "attributes": {
        "documentKey": "document-key",
        "clientKey": "client-key",
        "issuedAt": "2024-10-06T06:10:00.000Z"
      }
    }
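
To make the shared payload shape concrete, here is a small Python sketch that builds any of the events above. The helper name `make_event` and its parameters are hypothetical; the field names and the millisecond `Z`-suffixed timestamp format are copied from the examples in this comment.

```python
from datetime import datetime, timezone


def make_event(event_type, document_key, client_key, data=None, now=None):
    """Build a payload dict in the proposed shape.

    `now` must be a timezone-aware datetime; it defaults to the current UTC time.
    `data` is attached only for events that carry one (e.g. watchingNum, snapshot).
    """
    issued = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    event = {
        "type": event_type,
        "attributes": {
            "documentKey": document_key,
            "clientKey": client_key,
            # ISO 8601 with millisecond precision and a trailing "Z".
            "issuedAt": issued.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        },
    }
    if data is not None:
        event["data"] = data
    return event
```

For example, `make_event("documentWatched", "document-key", "client-key", data={"watchingNum": 3})` produces the same structure as the documentWatched example, with `issuedAt` set to the current time.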

@window9u
Contributor

Explanation of Events

Common Parts

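  • Event Types: documentCreated, documentWatched, documentUnwatched, documentChanged, snapshotStored, documentDeleted.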
  • Attributes:
    • To maintain consistency, all events share the same set of attributes:
      • documentKey: Unique identifier for the document.
      • clientKey: Unique identifier for the client. If our user (CodePair) sets the clientKey to their user_id, they can identify who triggered the event.
      • issuedAt: Timestamp when the event was issued (ISO 8601 format).

Document Watched / Unwatched

  • Purpose: Indicates that a client has started or stopped watching (reading) a document.
  • Watching Number (watchingNum):
    • Yorkie tracks the watch state of documents and handles their watch/unwatch events, so it can send watchingNum to indicate the number of clients currently watching the document.
    • Potential Uses:
      • Real-time ranking of documents to identify which ones are popular or actively being edited.
      • Analytics on user engagement and collaboration intensity.
    • Considerations:
      • The usefulness of watchingNum is currently uncertain, and its implementation might not be immediately necessary.
      • It has a lower priority and could be omitted or deferred if it complicates the implementation.
      • If there are difficulties in implementing this feature, we can consider adding it later.

Document Changed

Change Event
  • Purpose: Notifies that changes have occurred in a document. Because changes are very frequent, we need a rate-limiting mechanism.
  • Rate Limiting Strategies:
    • Debouncing:
      • Collect events over a short period (e.g., 5 seconds).
      • Send a single aggregated event after the period ends.
      • Reduces the number of webhook calls and prevents overwhelming external services.
    • Throttling:
      • Limit the maximum number of webhook calls within a specific time frame.
      • Ensures that webhook calls are spaced out over time.
  • Implementation Choice:
    • I have chosen throttling for this event type.
      • Within each defined time period, the event signals only that changes occurred, not how many changes were made.
      • This simplifies the implementation and reduces the load on external services.
  • Reasons:
    • The number of changes (changeNum) does not accurately represent the amount of data changed.
      • A single change could involve a large amount of data (e.g., copy and paste operations).
      • Minor edits like adding a word or a space also count as a change.
    • Therefore, the emphasis is on the occurrence of changes rather than the quantity.
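
The throttling choice can be sketched as follows. This is an illustrative Python sketch, not the Yorkie implementation; `DocChangeThrottler`, the per-document keying, and the injectable `clock` are all assumptions made for the example.

```python
import time


class DocChangeThrottler:
    """Allow at most one documentChanged webhook per document per interval."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self._interval = interval_seconds
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._last_sent = {}  # documentKey -> clock reading of the last allowed event

    def should_send(self, document_key):
        """Return True if an event may be sent now, and record the send time."""
        now = self._clock()
        last = self._last_sent.get(document_key)
        if last is not None and now - last < self._interval:
            return False  # still inside the window: acknowledge nothing further
        self._last_sent[document_key] = now
        return True
```

Note that this leading-edge form drops every change inside the window, so the last change of a burst may never be signaled on its own. If external services must always see a final notification, a trailing send at the end of the window would be needed in addition.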
Snapshot Stored
  • Purpose: In certain situations, sending a snapshot of the entire document can be useful.
  • Use Case Examples:
    • Direct Snapshot Transmission:
      • If an external service needs the entire document periodically, it might be more efficient to send the snapshot directly rather than have the service receive a change event and then pull the document to get the snapshot. This is especially beneficial when automatically creating up-to-date document thumbnails.
    • CodePair Integration:
      • In CodePair, when changes occur, it retrieves a snapshot from Yorkie to store in a vector database (e.g., for search indexing or machine learning models).
      • Instead of updating the vector database incrementally with each change, it might be more efficient to send the complete data whenever a snapshot is stored.
    • Thumbnail Generation:
      • We could use snapshots as thumbnails for documents. Saving one snapshot per document for thumbnail purposes can be efficient.
  • Considerations:
    1. Size of Snapshots:
      • If the snapshot is large, sending it can be burdensome on network resources and processing time.
    2. Frequency of Snapshots:
      • We might consider sending snapshots after a certain number of changes instead of after every single change.
      • Similar to debouncing, we can aggregate changes and send snapshots periodically.
  • Possible Approach:
    • For example, we could implement a mechanism to send a snapshot every third time it is generated.
    • This balances the need for up-to-date data with the overhead of transmitting large snapshots.
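
The "every third snapshot" idea could look roughly like this. It is a Python sketch under assumptions: `SnapshotSampler` is a hypothetical name, counts are kept per document in memory, and the period of 3 is just the example value from above.

```python
from collections import defaultdict


class SnapshotSampler:
    """Decide whether a newly stored snapshot should be sent to the webhook."""

    def __init__(self, every=3):
        if every < 1:
            raise ValueError("every must be at least 1")
        self._every = every
        self._counts = defaultdict(int)  # documentKey -> snapshots stored so far

    def should_send(self, document_key):
        """Count this snapshot and return True on every `every`-th one."""
        self._counts[document_key] += 1
        return self._counts[document_key] % self._every == 0
```

With `every=3`, the third, sixth, ninth, and so on snapshots of a given document are sent, and the rest are skipped.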

@window9u
Contributor

If the above data types are finalized, we should consider the following:

  1. Where to Send the Data
    • Adding an Endpoint Attribute to the Project: We need to include an endpoint property in the project configuration to specify where the data should be sent.
    • Defining Endpoint Properties: We should define various properties of the endpoint, such as the debouncing period, snapshot period, or how frequently to send data (e.g., after a certain number of changes).
    • Batching Events: It might be possible to send events in batches (if CodePair processes them in batches). Therefore, we need to discuss batching strategies.
    • Security Considerations: Determine how to handle security, such as how users can verify that the Yorkie server is the one sending the data.
  2. How to Handle Exceptions
    • Timeout Settings for Requests: Decide on a timeout setting for individual requests.
    • Handling Unresponsive Endpoints: Determine what to do if the endpoint continuously fails to receive requests.
    • Storing Unsent Events: Should we store events that the endpoint failed to receive? Or should we allow users to choose whether or not to store them? If we decide to store them, where should we store them?
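
As one possible starting point for the security and exception-handling questions above, here is a Python sketch. Everything in it is an assumption for discussion rather than a decided design: a shared secret per project, an HMAC-SHA256 signature over the request body that the receiver recomputes, and a bounded retry with exponential backoff. The `send` callable is hypothetical; it stands for whatever performs the POST with a timeout and returns the status code.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the body; the sender would attach this to the request."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(secret, body), signature)


def deliver_with_retry(send, body, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(body) until it returns 200 or attempts run out.

    send may raise OSError (which includes timeouts and connection errors).
    Returns True on success, False if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            if send(body) == 200:
                return True
        except OSError:
            pass  # treat network failures like a non-200 response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return False
```

What to do when `deliver_with_retry` returns False (drop the event, store it, or disable the endpoint) is exactly the open question in item 2.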

@krapie
Member

krapie commented Nov 27, 2024

@hackerwins @devleejb @sejongk Any thoughts on this proposed schema?

@devleejb
Member

@window9u Sorry for the late review.

Overall looks good. I have a few questions.

  1. If documentChanged is throttled, I believe clientKey cannot be included in the payload. What are your thoughts on this?
  2. Could we include clientKey in the snapshotStored event? It seems this event is not directly related to clientKey.
  3. As you mentioned, how about evaluating the priority of these events? While the events cover many cases, implementing and discussing them all at once might be burdensome. I think the documentChanged event should have a higher priority than the others.

Projects
Status: Backlog

4 participants