diff --git a/bydbctl/internal/cmd/rest.go b/bydbctl/internal/cmd/rest.go
index 924f93c13..e74ede208 100644
--- a/bydbctl/internal/cmd/rest.go
+++ b/bydbctl/internal/cmd/rest.go
@@ -141,7 +141,7 @@ func parseFromYAML(tryParseGroup bool, reader io.Reader) (requests []reqBody, er
 			data["groups"] = []string{group}
 		}
 	} else {
-		return nil, errors.WithMessage(errMalformedInput, "absent node: metadata or name&group")
+		return nil, errors.WithMessage(errMalformedInput, "absent node: name or groups")
 	}
 	j, err = json.Marshal(data)
 	if err != nil {
diff --git a/docs/README.md b/docs/README.md
index 9d44786b9..ef3d1c13b 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -9,11 +9,11 @@ There’s room to improve the performance and resource usage based on the nature
 
 Here you can learn all you need to know about BanyanDB. Let's get started with it.
 
+- **Guides**. Learn how to install, configure, and use BanyanDB through real-world examples.
 - **Installation**. Instruments about how to download and onboard BanyanDB server, Banyand.
-- **Clients**. Some native clients to access Banyand.
-- **Observability**. Learn how to effectively monitor, diagnose and optimize Banyand.
+- **Interacting**. Learn how to interact with Banyand, including schema management, data ingestion, data retrieval, and so on.
+- **Operation**. Learn how to operate Banyand, including observability, troubleshooting, and so on.
 - **Concept**. Learn the concepts of Banyand. Includes the architecture, data model, and so on.
-- **CRUD Operations**. To create, read, update, and delete data points or entities on resources in the schema.
 
 ### Useful Links
 
diff --git a/docs/concept/rotation.md b/docs/concept/rotation.md
new file mode 100644
index 000000000..9d4e4e0bf
--- /dev/null
+++ b/docs/concept/rotation.md
@@ -0,0 +1,144 @@
+# Data Rotation
+
+Data rotation is the process of managing the size of data stored in BanyanDB by removing old data and keeping only the most recent data. Data rotation is essential to prevent the database from running out of disk space and to maintain query performance.
+
+## Overview
+
+BanyanDB partitions its data into multiple [**segments**](tsdb.md#segment). These segments are time-based, allowing efficient management of data retention and querying. The `segment_interval` and retention policy (`ttl`) of each [group](../interacting/data-lifecycle.md#measures-and-streams) determine how data is segmented and retained in the database.
+
+## Formulation
+
+To express the relationship between the **number of segments**, the **segment interval**, and the **time-to-live (TTL)** in BanyanDB, we can derive a simple formula.
+
+### General Formula for Number of Segments
+
+The relationship between the number of segments, segment interval, and TTL can be expressed as:
+
+```
+S = (T / I) rounded up + 1
+```
+
+Where:
+
+- `S` is the **number of segments**.
+- `I` is the **segment interval** (in the same unit as the TTL).
+- `T` is the **TTL** (time-to-live, in the same unit as the segment interval).
+
+### Explanation
+
+1. **T / I**: This represents the number of full segments needed to cover the TTL. For example, if the TTL is 7 days and the segment interval is 3 days, you need at least 2.33 segments to cover the 7-day period.
+
+2. **Rounded up**: We round up the result of `T / I` because a partial segment still requires a full segment to store its data.
+
+3. **+ 1 segment**: We add 1 additional segment to account for the next segment being created to store incoming data as the current period closes.
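+
+To make the arithmetic above concrete, here is a small, illustrative Go snippet (a sketch for this document only, not part of BanyanDB's code base) that evaluates `S = ceil(T / I) + 1` with integer math:
+
+```go
+package main
+
+import "fmt"
+
+// segments returns the number of segments needed to retain data for a
+// TTL of t, given a segment interval of i. Both values must use the
+// same time unit. It computes S = ceil(T / I) + 1.
+func segments(t, i int) int {
+	s := t / i
+	if t%i != 0 {
+		s++ // a partial segment still occupies a full segment
+	}
+	return s + 1 // plus one segment for incoming data
+}
+
+func main() {
+	fmt.Println(segments(7, 3))   // 7-day TTL, 3-day interval: 4
+	fmt.Println(segments(7, 1))   // 7-day TTL, 1-day interval: 8
+	fmt.Println(segments(72, 12)) // 72-hour TTL, 12-hour interval: 7
+}
+```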
+
+### General Insights
+
+- **Smaller segment intervals** (e.g., 1 day) lead to a larger number of segments because more segments are needed to cover the TTL.
+- **Larger segment intervals** (e.g., 3 days) reduce the number of segments, but you still need 1 additional segment to handle data as it transitions between periods.
+
+Thus, the formula balances data retention against the number of segments for the chosen segment interval.
+
+### Example 1: Segment Interval = 3 Days, TTL = 7 Days
+
+```
+S = (7 / 3) rounded up + 1
+S = 2.33 rounded up + 1
+S = 3 + 1
+S = 4
+```
+
+| Time (Day)     | Action                                 | Number of Segments |
+|----------------|----------------------------------------|--------------------|
+| Day 1 (00:00)  | Segment for Days 1–3 is created        | 1                  |
+| Day 3 (23:00)  | New segment for Days 4–6 is created    | 2                  |
+| Day 6 (23:00)  | New segment for Days 7–9 is created    | 3                  |
+| Day 9 (23:00)  | New segment for Days 10–12 is created  | 4                  |
+| Day 10 (00:00) | Oldest segment for Days 1–3 is removed | 3                  |
+| Day 12 (23:00) | New segment for Days 13–15 is created  | 4                  |
+| Day 13 (00:00) | Oldest segment for Days 4–6 is removed | 3                  |
+
+So, **4 segments** are required to retain data for 7 days with a 3-day segment interval.
+
+### Example 2: Segment Interval = 1 Day, TTL = 7 Days
+
+```
+S = (7 / 1) rounded up + 1
+S = 7 + 1
+S = 8
+```
+
+| Time (Day)    | Action                           | Number of Segments |
+|---------------|----------------------------------|--------------------|
+| Day 1 (23:00) | New segment for Day 2 is created | 1                  |
+| Day 2 (00:00) | Oldest segment (if any) removed  | 1                  |
+| Day 2 (23:00) | New segment for Day 3 is created | 2                  |
+| Day 3 (00:00) | Oldest segment (if any) removed  | 2                  |
+| ...           | ...                              | ...                |
+| Day 7 (23:00) | New segment for Day 8 is created | 7                  |
+| Day 8 (00:00) | Oldest segment for Day 1 removed | 7                  |
+| Day 8 (23:00) | New segment for Day 9 is created | 8                  |
+| Day 9 (00:00) | Oldest segment for Day 2 removed | 7                  |
+
+At any given time, there will be a maximum of **8 segments**: 1 for the new day and 7 for the last 7 days of data.
+
+### Example 3: Segment Interval = 2 Days, TTL = 7 Days
+
+```
+S = (7 / 2) rounded up + 1
+S = 3.5 rounded up + 1
+S = 4 + 1
+S = 5
+```
+
+So, **5 segments** are required to retain data for 7 days with a 2-day segment interval.
+
+### Generalization for Any Time Unit
+
+To use this formula with time units like hours and days, make sure **both the segment interval (I)** and the **TTL (T)** use the same unit of time. If they don’t, convert one of them so that they match.
+
+#### Steps
+
+1. **Convert both the segment interval and TTL to the same time unit**, if necessary.
+   - For example, if the TTL is in days but the segment interval is in hours, convert the TTL to hours (e.g., 3 days = 72 hours).
+
+2. **Apply the formula** to get the number of segments.
+
+### Example 4: Mixed Units (Segment Interval in Hours, TTL in Days)
+
+- **Segment Interval** = 12 hours
+- **TTL** = 3 days
+
+First, convert the TTL to hours:
+
+```
+3 days = 3 * 24 = 72 hours
+```
+
+Now, apply the formula:
+
+```
+S = (72 / 12) rounded up + 1
+S = 6 + 1
+S = 7
+```
+
+So, **7 segments** are required to retain data for 3 days with a 12-hour segment interval.
+
+### Example 5: Minimum Number of Segments
+
+When the segment interval (8 days) is longer than the TTL (7 days):
+
+```
+S = (7 / 8) rounded up + 1
+S = 0.875 rounded up + 1
+S = 1 + 1
+S = 2
+```
+
+So, **2 segments** are required to retain data for 7 days with an 8-day segment interval.
+2 segments are the minimum, whatever the TTL and segment interval are: whenever the TTL is less than the segment interval, you end up with this minimum number of segments.
+
+## Conclusion
+
+Data rotation is a critical aspect of managing data in BanyanDB. By understanding the relationship between the number of segments, the segment interval, and the TTL, you can effectively manage data retention and query performance in the database. The formula provided here offers a simple way to calculate the number of segments required for the chosen segment interval and TTL.
+
+For more information on data management and lifecycle in BanyanDB, refer to the [Data Lifecycle](../interacting/data-lifecycle.md) documentation.
diff --git a/docs/interacting/data-lifecycle.md b/docs/interacting/data-lifecycle.md
index 6163b7329..37c120a9c 100644
--- a/docs/interacting/data-lifecycle.md
+++ b/docs/interacting/data-lifecycle.md
@@ -48,6 +48,8 @@ More ttl units can be found in the [IntervalRule.Unit](../api-reference.md#inter
 You can also manage the Group by other clients such as [Web-UI](./web-ui/schema/group.md) or [Java-Client](java-client.md).
 
+For more details about how this works, please refer to [data rotation](../concept/rotation.md).
+
 ## [Property](../concept/data-model.md#properties)
 
 `Property` data provides both [CRUD](./bydbctl/property.md) operations and TTL mechanism.
 
diff --git a/docs/menu.yml b/docs/menu.yml
index 97cab235c..2b1fe0d9b 100644
--- a/docs/menu.yml
+++ b/docs/menu.yml
@@ -108,7 +108,19 @@ catalog:
       - name: "Cluster Management"
         path: "/operation/cluster"
       - name: "Troubleshooting"
-        path: "/operation/troubleshooting"
+        catalog:
+          - name: "Error Checklist"
+            path: "/operation/troubleshooting/error-checklist"
+          - name: "Troubleshooting Installation"
+            path: "/operation/troubleshooting/install"
+          - name: "Troubleshooting Crash"
+            path: "/operation/troubleshooting/crash"
+          - name: "Troubleshooting No Data"
+            path: "/operation/troubleshooting/no-data"
+          - name: "Troubleshooting Overhead"
+            path: "/operation/troubleshooting/overhead"
+          - name: "Troubleshooting Query"
+            path: "/operation/troubleshooting/query"
       - name: "Security"
         path: "/operation/security"
       - name: "File Format"
diff --git a/docs/operation/configuration.md b/docs/operation/configuration.md
index 449725d48..3f19fd6a0 100644
--- a/docs/operation/configuration.md
+++ b/docs/operation/configuration.md
@@ -21,6 +21,24 @@ There are three bootstrap commands: `data`, `liaison`, and `standalone`. You cou
 
 Below are the available flags for configuring BanyanDB:
 
+### Service Discovery
+
+BanyanDB Liaison reads the endpoints of the data servers from the etcd server. The following flag configures how each node registers its host:
+
+`node-host-provider`: the node host provider; it can be "hostname", "ip", or "flag". The default is "hostname".
+
+If `node-host-provider` is "flag", use `node-host` to set the node host explicitly:
+
+```sh
+./banyand liaison --node-host=foo.bar.com --node-host-provider=flag
+```
+
+If `node-host-provider` is "hostname", BanyanDB uses the server's hostname as the node host. The hostname is obtained via the Go standard library's `os.Hostname()`.
+
+If `node-host-provider` is "ip", BanyanDB uses the server's IP address as the node host. The addresses are enumerated via the Go standard library's `net.Interfaces()`, and BanyanDB uses the first non-loopback IPv4 address as the node host.
+
+The official Helm chart sets `node-host-provider` to "ip" by default.
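+
+For illustration, the following minimal Go sketch shows how a first non-loopback IPv4 address can be picked with `net.Interfaces()`. It is an approximation for this document, not BanyanDB's actual implementation:
+
+```go
+package main
+
+import (
+	"fmt"
+	"net"
+)
+
+// firstNonLoopbackIPv4 walks the host's interfaces and returns the
+// first non-loopback IPv4 address it finds.
+func firstNonLoopbackIPv4() (string, error) {
+	ifaces, err := net.Interfaces()
+	if err != nil {
+		return "", err
+	}
+	for _, iface := range ifaces {
+		addrs, err := iface.Addrs()
+		if err != nil {
+			continue
+		}
+		for _, addr := range addrs {
+			ipNet, ok := addr.(*net.IPNet)
+			if ok && !ipNet.IP.IsLoopback() && ipNet.IP.To4() != nil {
+				return ipNet.IP.String(), nil
+			}
+		}
+	}
+	return "", fmt.Errorf("no non-loopback IPv4 address found")
+}
+
+func main() {
+	host, err := firstNonLoopbackIPv4()
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(host)
+}
+```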
+ ### Liaison & Network BanyanDB uses gRPC for communication between the servers. The following flags are used to configure the network settings. diff --git a/docs/operation/observability.md b/docs/operation/observability.md index 2c6b330e4..b3a928120 100644 --- a/docs/operation/observability.md +++ b/docs/operation/observability.md @@ -294,6 +294,8 @@ The read flow is the same as reading data from `measure`, with each metric being Banyand, the server of BanyanDB, supports profiling automatically. The profiling data is collected by the `pprof` package and can be accessed through the `/debug/pprof` endpoint. The port of the profiling server is `2122` by default. +Refer to the [pprof documentation](https://golang.org/pkg/net/http/pprof/) for more information on how to use the profiling data. + ## Query Tracing BanyanDB supports query tracing, which allows you to trace the execution of a query. The tracing data includes the query plan, execution time, and other useful information. You can enable query tracing by setting the `QueryRequest.trace` field to `true` when sending a query request. diff --git a/docs/operation/troubleshooting.md b/docs/operation/troubleshooting.md deleted file mode 100644 index 99a1f1b8f..000000000 --- a/docs/operation/troubleshooting.md +++ /dev/null @@ -1,134 +0,0 @@ -# Troubleshooting - -## Error Checklist - -When facing issues with BanyanDB, follow this checklist to effectively troubleshoot and resolve errors. - -### 1. Collect Information - -Gather detailed information about the error to assist in diagnosing the issue: - -- **Error Message**: Note the exact error message displayed. -- **Logs**: Collect relevant log files from the BanyanDB system. See the [Logging](./observability.md#logging) section for more information. -- **Query Tracing**: If the error is related to a query, enable query tracing to capture detailed information about the query execution. See the [Query Tracing](./observability.md#query-tracing) section for more information. -- **Environment**: Document the environment details, including OS, BanyanDB version, and hardware specifications. -- **Database Schema**: Provide the schema details related to the error. See the [Schema Management](../interacting/bydbctl/schema/) section for more information. -- **Data Sample**: If applicable, include a sample of the data causing the error. -- **Configuration Settings**: Share the relevant configuration settings. -- **Data Files**: Attach any relevant data files. See the [Configuration](./configuration.md) section to find where the date files is stored. -- **Reproduction Steps**: Describe the steps to reproduce the error. - -### 2. Define Error Type - -Classify the error to streamline the troubleshooting process: - -- **Configuration Error**: Issues related to incorrect configuration settings. -- **Network Error**: Problems caused by network connectivity. -- **Performance Error**: Slowdowns or high resource usage. -- **Data Error**: Inconsistencies or corruption in stored data. - -### 3. Error Support Procedure - -Follow this procedure to address the identified error type: - -- Identify the error type based on the collected information. -- Refer to the relevant sections in the documentation to troubleshoot the error. -- Refer to the issues section in the SkyWalking repository for known issues and solutions. -- If the issue persists, submit a discussion in the [SkyWalking Discussion](https://github.com/apache/skywalking/discussions) for assistance. 
-- You can also raise a bug report in the [SkyWalking Issue Tracker](https://github.com/apache/skywalking/issues) if the issue is not resolved. -- Finally, As a OpenSource project, you could try to fix the issue by yourself and submit a pull request. - -Here's an expanded section on common issues for your BanyanDB troubleshooting document: - -## Common Issues - -### Write Failures - -**Troubleshooting Steps:** - -1. **Check Disk Space**: Ensure there is enough disk space available for writes. -2. **Review Permissions**: Verify that BanyanDB has the necessary permissions to write to the target directories. -3. **Examine Logs**: Look for error messages in the logs that might indicate the cause of the failure. -4. **Network Issues**: Ensure network stability if the write operation involves remote nodes. -5. **Configuration Settings**: Double-check configuration settings related to write operations. - -### Incorrect Query Results - -**Diagnose and Fix:** - -1. **Verify Query Syntax**: Ensure that the query syntax is correct and aligns with BanyanDB's query API. -2. **Inspect Indexes**: Ensure indexes are correctly defined and up-to-date. -3. **Review Recent Changes**: Consider any recent changes to data or schema that might affect query results. -4. ** - -### Performance Issues - -#### Slow Write - -**Steps to Address:** - -1. **Resource Allocation**: Ensure adequate resources (CPU, memory) are allocated to BanyanDB. -2. **Batch Writes**: Increasing batch sizes can improve write performance. -3. **Monitor Disk I/O**: Check for disk I/O bottlenecks. - -#### Slow Query - -**Optimize Slow Queries:** - -1. **Analyze Query Plans**: Use query tracing to understand execution plans and identify bottlenecks. -2. **Index Usage**: Ensure appropriate indexes are used to speed up queries. -3. **Refactor Queries**: Simplify complex queries where possible. For example, using `IN` instead of multiple `OR` conditions. -4. **Sharding**: Consider adding more shards to distribute query load. - -#### Out of Memory (OOM) - -**Handle OOM Errors:** - -1. **Increase Memory Limits**: Adjust memory allocation settings for BanyanDB. -2. **Optimize Queries**: Ensure queries are efficient and not causing excessive memory usage. -3. **Data Sharding**: Distribute data across shards to reduce memory pressure. -4. **Monitor Memory Usage**: Use monitoring tools to track memory usage patterns. - -### Process Crashes and File Corruption - -**Steps to Diagnose and Recover:** - -1. **Examine Log Files**: Check logs for error messages leading up to the crash. -2. **Check File System**: Ensure the file system is not corrupted and is functioning properly. -3. **Update Software**: Ensure BanyanDB and its dependencies are up-to-date with the latest patches. -4. **Remove Corrupted Data**: If data corruption is detected, remove the corrupted data and restore from backups if necessary. - -### How to Remove Corrupted Data - -Corrupted data can cause BanyanDB to malfunction or produce incorrect results. Follow these steps to safely remove corrupted data from BanyanDB: - -1. **Identify the Corrupted Data**: - Monitor the BanyanDB logs for any error messages indicating data corruption. The file is located in a part directory. You have to remove the whole part directory instead of a single file. - -2. **Shutdown BanyanDB**: - - Before making any changes to the data files, ensure that BanyanDB is not running. This prevents further corruption and ensures data integrity. 
-   - Send `SIGTERM` or `SIGINT` signals to the BanyanDB process to gracefully shut it down
-
-3. **Locate the Snapshot File**:
-   - In each shard of the TSDB (Time Series Database), there is a [snapshot file](../concept/tsdb.md#shard) that contains all alive parts directories.
-   - Navigate to the directory where BanyanDB stores its data. This is typically specified in the [flags](./configuration.md)
-
-4. **Remove the Corrupted File**:
-   - Identify the corrupted part within the snapshot directory.
-   - Remove the part's record from the snapshot file.
-
-5. **Clean Up Part**:
-   - Remove the corrupted part directory from the disk.
-
-6. **Restart BanyanDB**:
-   - Once the corrupted part is removed and the metadata is cleaned up, restart BanyanDB to apply the changes
-
-7. **Verify the Integrity**:
-   - After restarting, monitor the BanyanDB logs to ensure that the corruption issues have been resolved.
-   - Run any necessary integrity checks or queries to verify that the database is functioning correctly.
-
-8. **Prevent Future Corruptions**:
-   - Monitor system resources and ensure that the hardware and storage systems are functioning correctly.
-   - Keep BanyanDB and its dependencies updated to the latest versions to benefit from bug fixes and improvements.
-
-By following these steps, you can safely remove corrupted data from BanyanDB and ensure the continued integrity and performance of your database.
diff --git a/docs/operation/troubleshooting/crash.md b/docs/operation/troubleshooting/crash.md
new file mode 100644
index 000000000..cbc0e4ea6
--- /dev/null
+++ b/docs/operation/troubleshooting/crash.md
@@ -0,0 +1,49 @@
+# Troubleshooting Crash Issues
+
+If BanyanDB processes crash or encounter file corruption, follow these steps to diagnose and recover from the issue.
+
+## Remove Corrupted Standalone Metadata
+
+If the BanyanDB standalone process crashes due to corrupted metadata, you should remove the corrupted metadata:
+
+1. **Shutdown BanyanDB**:
+   - Before making any changes to the data files, ensure that BanyanDB is not running. This prevents further corruption and ensures data integrity.
+   - Send `SIGTERM` or `SIGINT` signals to the BanyanDB process to gracefully shut it down.
+2. **Locate the Metadata File**:
+   - The metadata file is located in the standalone directory.
+   - Navigate to the directory where BanyanDB stores its standalone data. This is typically specified in the [metadata-root-path](../configuration.md#data--storage).
+3. **Remove the Metadata File**:
+   - Remove the corrupted metadata file, restart BanyanDB, and recreate the schema if needed via the [metadata management](../../interacting/bydbctl/schema/) clients.
+
+## Remove Corrupted Stream or Measure Data
+
+The logs may indicate that the crash was caused by corrupted data. In such cases, it is essential to remove the corrupted data to restore the integrity of the database. Follow these steps to safely remove corrupted data from BanyanDB:
+
+1. **Identify the Corrupted Data**:
+   Monitor the BanyanDB logs for any error messages indicating data corruption. The corrupted file is located in a part directory. You have to remove the whole part directory instead of a single file.
+
+2. **Shutdown BanyanDB**:
+   - Before making any changes to the data files, ensure that BanyanDB is not running. This prevents further corruption and ensures data integrity.
+   - Send `SIGTERM` or `SIGINT` signals to the BanyanDB process to gracefully shut it down.
+
+3. **Locate the Snapshot File**:
+   - In each shard of the TSDB (Time Series Database), there is a [snapshot file](../../concept/tsdb.md#shard) that contains all alive parts directories.
+   - Navigate to the directory where BanyanDB stores its data. This is typically specified in the [flags](../configuration.md).
+
+4. **Remove the Corrupted File**:
+   - Identify the corrupted part within the snapshot directory.
+   - Remove the part's record from the snapshot file.
+
+5. **Clean Up Part**:
+   - Remove the corrupted part directory from the disk.
+
+6. **Restart BanyanDB**:
+   - Once the corrupted part is removed and the metadata is cleaned up, restart BanyanDB to apply the changes.
+
+7. **Verify the Integrity**:
+   - After restarting, monitor the BanyanDB logs to ensure that the corruption issues have been resolved.
+   - Run any necessary integrity checks or queries to verify that the database is functioning correctly.
+
+8. **Prevent Future Corruptions**:
+   - Monitor system resources and ensure that the hardware and storage systems are functioning correctly.
+   - Keep BanyanDB and its dependencies updated to the latest versions to benefit from bug fixes and improvements.
+
+By following these steps, you can safely remove corrupted data from BanyanDB and ensure the continued integrity and performance of your database.
\ No newline at end of file
diff --git a/docs/operation/troubleshooting/error-checklist.md b/docs/operation/troubleshooting/error-checklist.md
new file mode 100644
index 000000000..99fd173b5
--- /dev/null
+++ b/docs/operation/troubleshooting/error-checklist.md
@@ -0,0 +1,44 @@
+# Error Checklist
+
+When facing issues with BanyanDB, follow this checklist to effectively troubleshoot and resolve errors.
+
+## 1. Collect Information
+
+Gather detailed information about the error to assist in diagnosing the issue:
+
+- **Logs**: Collect relevant log files from the BanyanDB system. See the [Logging](../observability.md#logging) section for more information.
+- **Query Tracing**: If the error is related to a query, enable query tracing to capture detailed information about the query execution. See the [Query Tracing](../observability.md#query-tracing) section for more information.
+- **Environment**: Document the environment details, including OS, BanyanDB version, and hardware specifications.
+- **Database Schema**: Provide the schema details related to the error. See the [Schema Management](../../interacting/bydbctl/schema/) section for more information.
+- **Data Sample**: If applicable, include a sample of the data causing the error.
+- **Configuration Settings**: Share the relevant configuration settings.
+- **Data Files**: Attach any relevant data files. See the [Configuration](../configuration.md) section to find where the data files are stored.
+- **Reproduction Steps**: Describe the steps to reproduce the error.
+
+## 2. Define Error Type
+
+Classify the error to streamline the troubleshooting process:
+
+- **Configuration Error**: Issues related to incorrect configuration settings.
+- **Network Error**: Problems caused by network connectivity.
+- **Performance Error**: Slowdowns or high resource usage.
+- **Data Error**: Inconsistencies or corruption in stored data.
+
+## 3. Error Support Procedure
+
+Follow this procedure to address the identified error type:
+
+- Identify the error type based on the collected information.
+- Refer to the relevant sections in the documentation to troubleshoot the error.
+- Refer to the issues section in the SkyWalking repository for known issues and solutions.
+- If the issue persists, submit a discussion in the [SkyWalking Discussion](https://github.com/apache/skywalking/discussions) for assistance.
+- You can also raise a bug report in the [SkyWalking Issue Tracker](https://github.com/apache/skywalking/issues) if the issue is not resolved.
+- Finally, as an open-source project, you can try to fix the issue yourself and submit a pull request.
+
+For common issues, refer to the following troubleshooting guides:
+
+- [Troubleshooting Crash Issues](./crash.md)
+- [Troubleshooting Overhead Issues](./overhead.md)
+- [Troubleshooting No Data Issues](./no-data.md)
+- [Troubleshooting Query Issues](./query.md)
+- [Troubleshooting Installation Issues](./install.md)
diff --git a/docs/operation/troubleshooting/install.md b/docs/operation/troubleshooting/install.md
new file mode 100644
index 000000000..abf1aba84
--- /dev/null
+++ b/docs/operation/troubleshooting/install.md
@@ -0,0 +1,76 @@
+# Troubleshooting Installation Issues
+
+If you encounter issues during the installation of BanyanDB, follow these troubleshooting steps to resolve common problems.
+
+## Version Identification
+
+Before troubleshooting, ensure you are using the correct version of BanyanDB. The installation instructions are specific to each version, so it's essential to verify the version you are installing.
+
+```sh
+banyand-server -v
+
+version vx.y.z
+```
+
+It's recommended to use the latest stable version of BanyanDB to benefit from the latest features and bug fixes. Keep all servers in the cluster at the same version to avoid compatibility issues.
+
+## Permission Denied
+
+If you encounter a "Permission Denied" error during installation, check the file permissions and ownership of the installation directory. Ensure that the user running the installation has the necessary permissions to read, write, and execute files in the installation directory.
+
+```sh
+ls -l /path/to/installation/directory
+```
+
+If you deployed BanyanDB to OpenShift or Kubernetes, ensure that the service account has the required permissions to access the installation directory and run the BanyanDB server. The Docker image we published on Docker Hub runs as the **root** user, so you may need to adjust the permissions accordingly:
+
+```sh
+## check the service account of the data node's pod
+kubectl get -n <namespace> pod <pod-name> -o=jsonpath='{.spec.serviceAccountName}'
+```
+
+Assuming the service account is **banyand**, you can grant the necessary permissions to the service account:
+
+```sh
+oc adm policy add-scc-to-user anyuid -z banyand -n <namespace>
+```
+
+## Liaison and Data Nodes Stuck in a Pending State
+
+If the liaison and data nodes remain in a pending state after installation, check the logs for any error messages that may indicate the cause of the issue. The logs can provide valuable information to troubleshoot the problem.
+
+```sh
+kubectl logs <pod-name> -n <namespace>
+```
+
+If you see `the schema registry init timeout, retrying...`, the schema registry (etcd) is not ready yet. Check the status of the etcd cluster.
+
+## Liaison and Data Nodes Keep Restarting
+
+If the liaison and data nodes keep restarting after installation, review the logs to identify the root cause of the issue:
+
+```sh
+kubectl logs <pod-name> -n <namespace>
+```
+
+Common reasons for nodes restarting include insufficient resources, configuration errors, or network connectivity issues. Ensure that the nodes have enough resources to run BanyanDB and that the configuration settings are correct.
+
+## Liaison and Data Node Connection Issues
+
+If the liaison and data nodes are unable to connect to each other, verify the network configuration and connectivity between the nodes. Ensure that the nodes can communicate with each other over the network and that there are no firewall rules blocking the connections.
+
+Check the registered endpoints of the data nodes in the etcd cluster:
+
+```sh
+etcdctl get --prefix /banyandb/nodes
+```
+
+`banyandb` is the namespace of the BanyanDB cluster. It can be changed by the flag `namespace`. You should ensure this namespace is consistent across all nodes.
+
+If the addresses are incorrect or the nodes are not registered, check the configuration setting [service discovery](../configuration.md#service-discovery).
+
+## Failed to Connect to Liaison Node
+
+If the client application fails to connect to the liaison node, verify the network configuration and connectivity between the client and the liaison node. Ensure that the client can reach the liaison node over the network and that there are no firewall rules blocking the connection.
+
+The SkyWalking OAP uses gRPC to communicate with the liaison node. Ensure that the gRPC port is open and accessible from the client application. If you are using the Web UI (liaison) or bydbctl to connect to the liaison node, ensure that the correct HTTP port is used for the connection. The default HTTP port is `17913`. Refer to the [network](../configuration.md#liaison--network) settings for more details.
diff --git a/docs/operation/troubleshooting/no-data.md b/docs/operation/troubleshooting/no-data.md
new file mode 100644
index 000000000..51e5e1581
--- /dev/null
+++ b/docs/operation/troubleshooting/no-data.md
@@ -0,0 +1,26 @@
+# Troubleshooting No Data Issues
+
+If you encounter issues with missing data in BanyanDB, follow these troubleshooting steps to identify and resolve the problem.
+
+## Check Data Ingestion
+
+1. **Monitor Write Rate**: Use the BanyanDB [write rate](../observability.md#write-rate) metric to monitor the write rate and ensure that data is being ingested into the database.
+2. **Monitor Write Errors**: Monitor the [write errors](../observability.md#write-and-query-errors-rate) metric to identify any issues with data ingestion. High write errors can indicate problems with data ingestion.
+3. **Review Ingestion Logs**: Check the BanyanDB logs for any errors or warnings related to data ingestion. Look for messages indicating failed writes or data loss.
+
+## Verify the Query Time Range
+
+Ensure that the query time range is correct and includes the data you expect to see. Incorrect time ranges can result in missing data in query results.
+
+1. **Check Available Data**: Check the folders and files in the BanyanDB data directory to verify that the data files exist for the specified time range. Refer to the [tsdb](../../concept/tsdb.md) documentation for more information on data storage.
+2. **Time Zone Settings**: Verify that the time zone settings in the query are correct and align with the data stored in BanyanDB. The BanyanDB server and bydbctl use the system time zone by default.
+
+## Check Data Retention Policy
+
+Verify that the data retention policy is not deleting data prematurely. If the data retention policy is set to delete data after a certain period, it may result in missing data in query results. Please check the [Data Lifecycle](../../interacting/data-lifecycle.md) documentation for more information on data retention policies.
+
+## Metadata Missing
+
+If the metadata for a group, measure, or stream is missing, it can result in missing data in query results. Ensure that the metadata for the group, measure, or stream is correctly defined and available in the BanyanDB metadata registry.
+
+If only the metadata is missing, you can recreate it using the SkyWalking OAP, bydbctl, or the Web UI. Refer to the [metadata management](../../interacting/bydbctl/schema/) documentation for more information.
diff --git a/docs/operation/troubleshooting/overhead.md b/docs/operation/troubleshooting/overhead.md
new file mode 100644
index 000000000..2f99cf1b2
--- /dev/null
+++ b/docs/operation/troubleshooting/overhead.md
@@ -0,0 +1,29 @@
+# Troubleshooting Overhead Issues
+
+If you encounter issues with high overhead in BanyanDB, follow these troubleshooting steps to identify and resolve the problem.
+
+## High CPU and Memory Usage
+
+If you notice high CPU and memory usage on the BanyanDB server, follow these steps to troubleshoot the issue:
+
+1. **Check Write and Query Rate**: Monitor the write and query rates to identify any spikes in traffic that may be causing high CPU and memory usage. Refer to the [metrics](../observability.md#metrics) documentation for more information on monitoring BanyanDB metrics.
+2. **Check Merge Operation Rate**: Monitor the merge operation rate to identify any issues with data compaction that may be causing high CPU and memory usage. Refer to the [metrics](../observability.md#merge-file-rate) documentation for more information on monitoring BanyanDB metrics.
+
+## High Disk Usage
+
+If you notice high disk usage on the BanyanDB server, follow these steps to troubleshoot the issue:
+
+1. **Check Group TTL**: Verify that the TTL policy for groups is not causing excessive data storage. If the TTL for a group is set too high, it may result in high disk usage. Use the `bydbctl` command to [update the group schema](../../interacting/bydbctl/schema/group.md#update-operation) and adjust the TTL as needed.
+2. **Check Segment Interval**: Check the segment interval for groups to ensure that data is being compacted and stored efficiently. For example, if the TTL is 7 days and the segment interval is 3 days, the first segment is only deleted on the morning of day 10, so the database can hold up to 9 days of data at most, which is more than the TTL.
+
+## Too Many Open Files
+
+BanyanDB uses an LSM-tree-based storage engine, which may open many files. If you encounter issues with too many open files, follow these steps to troubleshoot the issue:
+
+1. **Check File Descriptor Limit**: Verify that the file descriptor limit is set high enough to accommodate the number of open files required by BanyanDB. Use the `ulimit` command to increase the file descriptor limit if needed. Refer to the [remove system limits](../system.md#remove-system-limits) documentation for more information on setting system limits.
+2. **Check Write Rate**: Monitor the write rate to identify any spikes in traffic that may be causing too many open files. High write rates can result in a large number of open files on the BanyanDB server.
+3. **Check Merge Operation Rate**: Monitor the merge operation rate to identify any issues with data compaction that may be causing too many open files. Low merge operation rates can result in a large number of open files on the BanyanDB server.
+
+## Profile BanyanDB Server
+
+If you are unable to identify the cause of high overhead, you can profile the BanyanDB server to identify performance bottlenecks. Use the `pprof` tool to generate a CPU profile and analyze the performance of the BanyanDB server. Refer to the [profiling](../observability.md#profiling) documentation for more information on profiling BanyanDB.
diff --git a/docs/operation/troubleshooting/query.md b/docs/operation/troubleshooting/query.md
new file mode 100644
index 000000000..e231dca12
--- /dev/null
+++ b/docs/operation/troubleshooting/query.md
@@ -0,0 +1,127 @@
+# Troubleshooting Query Issues
+
+If you encounter issues with query results in BanyanDB, follow these troubleshooting steps to identify and resolve the problem.
+
+## Query Syntax Errors
+
+### Mandatory Fields Missing
+
+The query's mandatory fields are `time_range`, `name`, and `groups`. If any of these fields are missing, the query will fail. Ensure that the query includes all the required fields and that the syntax is correct.
+
+For `Stream`, the `tag_projection` field is mandatory. For `Measure`, either `tag_projection` or `field_projection` is required.
+
+### Ignore Unknown Fields
+
+If the query includes unknown fields, they won't affect the query results: BanyanDB ignores any unknown fields in the query. This behavior allows you to add custom fields to the query without affecting the query results, but it also brings the risk of typos in the query going unnoticed.
+
+For example, the following query includes an unknown field `lmit`:
+
+```yaml
+name: "network_address_alias_minute"
+groups: ["measure-default"]
+tagProjection:
+  tagFamilies:
+    - name: "default"
+      tags: ["last_update_time_bucket", "represent_service_id", "represent_service_instance_id"]
+    - name: "storage_only"
+      tags: ["address"]
+# The correct field name should be "limit", not "lmit".
+lmit: 5
+```
+
+In this case, the query will still execute successfully, but the `lmit` field will be ignored.
+
+### Invalid Time Range
+
+The valid time range is:
+
+- `start_time` < `end_time`
+- The minimum time is `1677-09-21 00:12:43.145224192 +0000 UTC`, which is the minimum value of `time.Time` in Go.
+- The maximum time is `2262-04-11 23:47:16.854775807 +0000 UTC`, which is the maximum value of `time.Time` in Go.
+
+If the query includes an invalid time range, the query will fail. Ensure that the time range is correct and within the valid range.
+
+## Unexpected Query Results
+
+### No Data or Partial Data
+
+Please refer to the [Troubleshooting No Data Issues](./no-data.md) guide to identify and resolve the issue.
+
+**Valid Old Data Missing**: If you see that valid old data (whose TTL has not been reached) is missing, some of the data servers may be down. New data is still ingested into the database, but the query results may be incomplete. Please check the [Active Data Servers](../observability.md#active-instances) to ensure that all data servers are running. The old data will be available once the data servers are back online.
+
+### Duplicate Data
+
+`Stream` and `Measure` handle duplicate data differently:
+
+- `Stream`: If the same data is ingested multiple times, the `Stream` will store all the data points. The query results will include all the duplicate data points with the same entity and timestamp.
+- `Measure`: If the same data is ingested multiple times, the `Measure` will store only the latest data point. It uses an internal `version` field to determine the latest data point.
+
+`Measure` data version is determined by:
+
+1. If [`DataPointValue.version`](../../api-reference.md#datapointvalue) is set, use it as the version.
+2. If [WriteRequest.message_id](../../api-reference.md#writerequest) is set, use it as the version.
+3. If neither of the above fields is set, leave the version as 0.
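+
+As an illustration, the following Go sketch mirrors that documented fallback order (illustrative only, not the engine's actual code):
+
+```go
+package main
+
+import "fmt"
+
+// resolveVersion mirrors the documented fallback order for a Measure
+// data point: DataPointValue.version first, then WriteRequest.message_id,
+// otherwise 0.
+func resolveVersion(dataPointVersion, messageID int64) int64 {
+	if dataPointVersion != 0 {
+		return dataPointVersion
+	}
+	if messageID != 0 {
+		return messageID
+	}
+	return 0
+}
+
+func main() {
+	// No explicit version is set, so the message ID is used as the version.
+	fmt.Println(resolveVersion(0, 1727085600000))
+}
+```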
+
+## Query Performance Issues
+
+Use query tracing to understand execution plans and identify bottlenecks. To enable query tracing, set the `trace` field to `true` in the [MeasureQueryRequest](../../api-reference.md#queryrequest) and [StreamQueryRequest](../../api-reference.md#queryrequest-1). The query results will include detailed tracing information to help you identify performance issues.
+
+There are some important nodes in the trace result:
+
+- `measure-grpc` or `stream-grpc`: It represents the overall time spent on the gRPC call.
+- `data-xxx`: It represents the time spent on reading data from the data server, where `xxx` is the data server's address. Because all data servers are queried in parallel, the total time (`xxxx-grpc`) spent on reading data is the maximum time spent on reading data from all data servers.
+
+In each data server, there are some important nodes:
+
+- `indexScan`: Using the index to fetch data.
+  - `seriesIndex.Search`: It represents the time spent on searching the series index for the specified time range.
+    Tags:
+    1. `query`: The query expression used to search the series index.
+    1. `matched`: The number of series matched by the query.
+    1. `field_length`: The number of fields read from the series index. For `Stream`, it's always 1. For `Measure`, it's the number of indexed tags.
+  - `scan-blocks`: It represents the time spent on scanning the data blocks for the matched series.
+    1. `series_num`: The number of series to scan. It should be identical to `matched` in `seriesIndex.Search`.
+    1. `part_num`: The number of data parts to scan.
+    1. `part_header`: The header of the value list in `part_xxxx`.
+    1. `part_xxx`: The data part to scan.
+    1. `block_header`: The header of the value list in `block_xxx`.
+    1. `block_xxx`: The data block to scan.
+- `iterator`: It represents the time spent on iterating the rows in the data block for filtering, sorting, and aggregation.
+
+### Part and Block Information
+
+If the `part_header` is:
+
+```yaml
+- key: part_header
+  value: MinTimestamp, MaxTimestamp, CompressionSize, UncompressedSize,
+    TotalCount, BlocksCount
+```
+
+and `part_xxxx` is:
+
+```yaml
+- key: part_377403_/tmp/measure/measure-default/seg-20240923/shard-0/000000000005c23b
+  value: Sep 23 00:00:00, Sep 23 22:50:00, 37 MB, 61 MB, 736,674, 420,920
+```
+
+then `377403` is the `PartID`, which means this data part is in the directory `part_377403_/tmp/measure/measure-default/seg-20240923/shard-0/000000000005c23b`. `000000000005c23b` is the hexadecimal representation of the `PartID`.
+
+The `MinTimestamp` and `MaxTimestamp` are `Sep 23 00:00:00` and `Sep 23 22:50:00`, respectively. The `TotalCount` is 736,674, which means there are 736,674 data points in this data part. The `BlocksCount` is 420,920, which means there are 420,920 blocks in this data part. The `CompressionSize` and `UncompressedSize` are the sizes of the compressed and uncompressed data part, respectively.
+
+If the `block_header` is:
+
+```yaml
+- key: block_header
+  value: PartID, SeriesID, MinTimestamp, MaxTimestamp, Count, UncompressedSize
+```
+
+and `block_xxx` is:
+
+```yaml
+- key: block_0
+  value: 377403, 4570144289778100188, Jun 16 23:08:08, Sep 24 23:08:08,
+    1, 16 B
+```
+
+then the `PartID` is 377403, which means this block is in the data part `part_377403_/tmp/measure/measure-default/seg-20240923/shard-0/000000000005c23b`. The `SeriesID` is 4570144289778100188, which means this block belongs to the series with the ID `4570144289778100188`. The `MinTimestamp` and `MaxTimestamp` are `Jun 16 23:08:08` and `Sep 24 23:08:08`, respectively. The `Count` is 1, which means there is only one data point in this block. The `UncompressedSize` is 16 B, which means the uncompressed size of this block is 16 bytes.
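+
+To cross-check a part directory name against a `PartID` from the trace output, the hexadecimal encoding can be reproduced with a few lines of Go (an illustrative helper for this document, not part of BanyanDB):
+
+```go
+package main
+
+import "fmt"
+
+func main() {
+	// A part directory name ends with the PartID rendered as 16 hex
+	// digits: 377403 -> 000000000005c23b.
+	const partID uint64 = 377403
+	fmt.Printf("%016x\n", partID) // prints 000000000005c23b
+}
+```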