deployment monitoring and epic progress dashboard #4999

synctext · 2019-12-09T17:14:37Z

To better organise ourselves we need more critical information in 1 place.

In the coming period we aim to finally close #1. Our progress towards this goal and our stability should be captured in a Tribler-at-a-Glance dashboard. Examples from Jenkins:

https://medium.com/kj187/jenkins-job-dashing-widget-cc72feeed654

https://www.level-up.one/6-of-my-favorite-jenkins-plugins/

https://www.datadoghq.com/blog/monitor-jenkins-datadog/

Tribler critical information candidates:

Stability issues:
- crash reports from the wild (24h, last week, last month) latest devel, latest stable version and all versions
- Application tester with random clicker number of faults (24h, last week, last month) latest devel, latest stable version
- burn-in testing of running Tribler for 1 week: total CPU cycle, peak memory, total disk IO (crash or run-away resource usage)
- issues pending

Performance monitoring:
1. Anonymous end-to-end download performance (latest devel, latest stable version)
2. Exit node based download speed
3. Start of download delay, (non-)anonymous mode
4. First time startup time

Deployment monitoring:
- explorer Trustchain, growth of blocks in (last minute ! ; last hour ; last day; last week ; all time)
- exit node status (CPU, connections, idle slots, memory?)
- traffic stats
- metadata status: keyword searches, channel gossip community

devos50 · 2019-12-10T19:08:34Z

Interesting visualisations! Somewhat related to #3508 (at least the TrustChain deployment monitoring).

e2e anonymous download is an excellent candidate for performance monitoring and should not take long to set up. I think @ichorid addressed this a while ago, but it has not been actively monitored since then. In fact, making us (more) aware of failing tests/validation experiments is becoming a necessity as the number of different tests that run at fixed time intervals grows.

I think we have to address this issue sooner rather than later. If we do not, we will end up with a proliferation of different tools. Currently, we have the TrustChain explorer, Tribler user statistics, the error reporter and all tests/monitors on Jenkins. There might be some opportunity to merge some tools, which would ease maintenance.

> metadata status: keyword searches, channel gossip community

This might be a dangerous one to monitor and could violate users' privacy expectations of Tribler.

synctext · 2020-03-31T09:20:43Z

Please look at FileCoin's slipped roadmap. After Release 7.5 I'm considering having us work together on the first Jenkins dashboard for 2 weeks:

- arrange hardware monitors with obscene awesomeness, due to size (@synctext)
- Anonymous end-to-end download performance (latest devel, latest stable version) (@egbertbouman)
- crash reports from the wild (24h, last week, last month) latest devel, latest stable version and all versions (@ichorid)
- Application tester with random clicker number of faults (24h, last week, last month) latest devel, latest stable version (@devos50)
- IPv8 traffic stats with total of unique number of public keys in last (24h, last week, last month) within discovery community (heard about only, responsive) (@qstokkink)
- explorer Trustchain, growth of blocks in (last minute ! ; last hour ; last day; last week ; all time) (@grimadas)

qstokkink · 2020-03-31T09:40:59Z

Can we decide on some software/library to use (or to make) to graph all of this data? All sorts of dashboard creation tools exist.

For example: https://dzone.com/articles/build-beautiful-console-dashboards-with-sampler

devos50 · 2020-03-31T10:09:15Z

Most of this data can be extracted either from our existing Jenkins jobs using the API, or from our running Trustchain explorer backend, also with API requests. One of the questions we should also answer is whether we want a dedicated website for this. Jenkins unfortunately does not provide the tools for such real-time data, and integrating this dashboard into Jenkins would just be a new job with a succeed/fail status.
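As a rough illustration of the Jenkins side, here is a minimal sketch that reads the status of a job's last build through Jenkins' JSON API (`/job/<name>/lastBuild/api/json`). The base URL is taken from the Jenkins links elsewhere in this thread; the helper names are hypothetical, and a real deployment may additionally need authentication.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as used in the Jenkins links in this thread.
JENKINS_BASE = "https://jenkins-ci.tribler.org"


def last_build_url(job_name: str, base: str = JENKINS_BASE) -> str:
    """Build the JSON API URL for the most recent build of a job."""
    return f"{base}/job/{quote(job_name)}/lastBuild/api/json"


def summarize_build(payload: dict) -> dict:
    """Reduce a Jenkins build record to the few fields a dashboard needs."""
    return {
        "number": payload.get("number"),
        # Jenkins reports "result" as null while a build is still running.
        "result": payload.get("result") or "RUNNING",
        # Jenkins reports the duration in milliseconds.
        "duration_s": payload.get("duration", 0) / 1000.0,
    }


def fetch_last_build(job_name: str) -> dict:
    """Fetch and summarize the last build (performs a network request)."""
    with urlopen(last_build_url(job_name), timeout=10) as response:
        return summarize_build(json.load(response))
```

Keeping `summarize_build` separate from the network call means the same function could be reused for data coming from another backend that exposes a similar record.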

> arrange hardware monitors with obscene awesomeness, due to size

We should secure a prominent spot at the coffee machine ☕️

qstokkink · 2020-04-20T11:24:27Z

I propose starting with something "easy". Exposing GitHub events through tribler.org:

- Add a webhook to GitHub for the Tribler repository (sending POST requests to the tribler.org domain).
- Add a new page (tribler.org/githubevents?) which renders all GitHub events (possibly with websockets for live updates).
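A minimal sketch of the receiving side of such a webhook, assuming the webhook is configured with a shared secret so that GitHub signs each POST body with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header. The function names and the rendered text format are hypothetical; the web framework that would call them is left out.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def render_event(event_type: str, body: bytes) -> str:
    """Turn one webhook delivery into a single line for the events page.

    `event_type` is the value of the X-GitHub-Event header.
    """
    payload = json.loads(body)
    repo = payload.get("repository", {}).get("full_name", "unknown repository")
    actor = payload.get("sender", {}).get("login", "unknown user")
    return f"{event_type} by {actor} on {repo}"
```

Rejecting unsigned or wrongly signed requests before parsing them keeps the public endpoint from rendering events that did not come from GitHub.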

The idea is that we can reuse the resulting backend for another (bigger and better) dashboard and we'll have something to look at in the meantime.

devos50 · 2020-07-15T12:35:53Z

One way to get more insight into our user count is by analysing the crawled TrustChain data. The plot below is generated based on our current dataset, with over 80,000 users and 123 million records. The (major) releases of Tribler are annotated. Note how our 7.5.0 release resulted in an increase in new user count.

Parsing this 97 GB database is computationally intensive, however, so it could be run on a daily basis, for example. A dashboard could include this static image.

synctext · 2020-07-16T14:16:45Z

In 2006-2009 we had initial deployment monitoring. It is included in Zeilemaker's master thesis.

xoriole · 2020-09-04T08:31:19Z

Based on data we already have

kozlovsky · 2020-09-11T07:53:23Z

Yesterday I did a little research on this topic, and now I want to suggest a way to show anonymized performance statistics. It may be the following set of technologies:

- Custom client-side code to prepare anonymized statistics
- Dedicated server with custom API as an entry point
- InfluxDB for storing anonymized data
- Grafana for displaying beautiful graphs

The most popular tool for gathering and processing metrics is Prometheus. It has a big community and is widely used for gathering server metrics. Prometheus is often compared with InfluxDB (see the comparison in the official Prometheus docs). While Prometheus is more popular, in my opinion InfluxDB is better suited to our needs for the following reasons:

- Prometheus pulls metrics from a known number of server instances. In our case, we cannot pull statistics from client machines and want to push instead. While it is possible to use Prometheus with additional tools like Prometheus Aggregation Gateway, this in some way goes against the Prometheus philosophy. InfluxDB, on the other hand, expects data to be pushed, which is better suited to our needs.
- Prometheus data storage is ephemeral; the data are not intended to be stored for a long time. InfluxDB data are persistent and can be used to compare changes in gathered statistics over long time intervals.
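To make the push model concrete, here is a minimal sketch that formats one data point in InfluxDB's line protocol (`measurement,tags fields timestamp`), which is the text format InfluxDB accepts on its write endpoint. The measurement, tag and field names in the usage example are hypothetical, and the escaping only covers the common cases (commas, spaces and equals signs).

```python
def _escape_key(value: str) -> str:
    """Escape characters that are special in tag keys, tag values and field keys."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _escape_measurement(value: str) -> str:
    """Escape characters that are special in a measurement name."""
    return value.replace(",", "\\,").replace(" ", "\\ ")


def _format_field(value) -> str:
    """Format a field value; bool must be checked before int."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Serialize one point; tags are sorted by key, fields keep insertion order."""
    if not fields:
        raise ValueError("a point needs at least one field")
    tag_part = "".join(
        f",{_escape_key(k)}={_escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_key(k)}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{_escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"
```

For example, `to_line_protocol("download_speed", {"version": "7.5.0"}, {"bytes_per_s": 1250000}, 1600000000000000000)` yields a single line that a collecting server could forward to InfluxDB after aggregation.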

Grafana is a very popular open-source tool for graph visualization, which can be used with Prometheus, InfluxDB, and multiple other data sources. It allows constructing powerful dashboards with different types of graphs and charts.

If we decide to use this set of tools, I think I can take on this task. I see the following sub-tasks here to be implemented:

- client-side code for preparing anonymized statistics
- client-side code to send gathered statistics to our dedicated server
- a custom server API to collect anonymized statistics
- server code which implements the API mentioned above, aggregates the collected data and puts it into an InfluxDB instance
- deploy a dedicated server with the statistics gathering API, deploy an InfluxDB instance (probably on a different machine)
- deploy a Grafana instance
- make a Grafana dashboard

Later we can use Grafana to display all graphs, not only user statistics but also server builds, etc.

What do you think?

synctext · 2020-09-11T08:02:00Z

- Health monitoring of client state
- Health monitoring of our website, Github, statistics servers
- Health monitoring of bootstrap servers and crawler servers

Pitfall: everything we want with our self-organising research project is easier to do in a central server... Primarily use our crawlers as early warning infrastructure! (IPv8 is designed for network health monitoring) Then we need to emphasise crawler intelligence and stats aggregation.

Are we not re-creating this from scratch? https://jenkins-ci.tribler.org/job/Test_BootstrapServers/lastSuccessfulBuild/artifact/walk_rtts.png

First, anonymity is our existential feature. How to do this? (True anonymity might be impossible, OFF switch by default)
Could we show the user, inside the debug panel, the exact history and records that would optionally be shared in private with our debug servers? Can we protect against Internet address leakage? I guess it will take many future steps before we can re-use our Tor-like stuff while debugging our Tor-like stuff :-)

This needs to be opt-in for production releases and can hopefully be opt-out for nightly builds and Beta versions. What about Release Candidates?

InfluxDB: 34,082 commits, 19.5k stars on GitHub. This is a general time-series database solution; do we still need to write custom code for deployment monitoring?

This seems like quite complex tooling. I am afraid of overengineering for the user community we currently have. However, deployment monitoring is something we really need to do more of and get right.

xoriole · 2020-09-11T08:17:38Z

InfluxDB and Grafana are indeed good choices.

1. Custom client-side code to prepare anonymized statistics
2. Dedicated server with custom API as an entry point
3. InfluxDB for storing anonymized data
4. Grafana for displaying beautiful graphs

I have done some work on 1 and 2. I'm extending https://release.tribler.org/docs to receive anonymized data from the client. That can be the entry point for further processing using InfluxDB and visualizing in Grafana.

kozlovsky · 2020-09-11T09:06:16Z

We probably can use InfluxDB Jenkins plugin to put deployment statistics into the InfluxDB:
https://wiki.jenkins.io/display/JENKINS//InfluxDB+Plugin

synctext · 2020-09-11T09:18:22Z

Change of plans :-)
By 25 September we aim to have plots in Jenkins. The PopularityCommunity is crawled and health statistics are refreshed every few minutes or half an hour. After this test project we determine what we need and the roadmap. The next step could be a fix of the PopularityCommunity code plus algorithm, then deploy, monitor, etc.

Our current methodology:

- undocumented algorithm
- exclusively rely on unit tests
- end-to-end test manually if the desired feature works
- no health monitoring of protocol deployment

Tribler is a bottomless pit of problems. (stolen quote)
Our work methodology should become relentlessly data-driven: there is direct evidence we need better crawling, no evidence of client monitoring beyond debug screen and crash reporting (might change; agile)

devos50 · 2020-09-16T07:50:04Z

> exclusively rely on unit tests

I think a key metric is the stability of our unit tests. Currently, unstable unit tests (both on devel and our release branches) are delaying the development process. Converting the test suite to pytest, which should make debugging errors in the tests easier, is much more work than I anticipated.

My suggestion would be to continuously run all unit tests on a dedicated machine and include in the upcoming dashboard how stable they are (e.g., % of runs failing during the last day).
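The suggested metric could be computed along these lines. This is a minimal sketch, assuming each test run is recorded as a hypothetical `(finished_at, passed)` pair; how the runs are collected from the dedicated machine is left open.

```python
from datetime import datetime, timedelta, timezone


def failure_rate(runs, now, window=timedelta(days=1)):
    """Return the percentage of runs that failed within `window` before `now`.

    `runs` is an iterable of (finished_at: datetime, passed: bool) pairs.
    Returns None when no run finished inside the window, so that a missing
    measurement is not mistaken for a perfect score.
    """
    recent = [passed for finished_at, passed in runs if now - window <= finished_at <= now]
    if not recent:
        return None
    failures = sum(1 for passed in recent if not passed)
    return 100.0 * failures / len(recent)
```

Returning `None` for an empty window is a deliberate choice: a test machine that stopped running should show up on the dashboard as missing data rather than as 0% failures.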

synctext · 2020-10-19T13:59:36Z

Related work: https://stats.goerli.net/

synctext · 2021-01-13T11:20:06Z

Impressive progress! Our .yml and servers are getting in much better shape. We can even see in real time the upgrade speed. Learned something new: they upgrade quite fast. Previous years we never had this.

synctext · 2021-03-10T14:18:00Z

Yeah! More pretty graphs, exit node peak: 121 GiB per second

one-two-my-gad · 2021-04-19T08:23:44Z

cool

synctext · 2021-05-14T10:03:43Z

Example: https://data.syncthing.net/
File sync with central servers discovery and no spam measures. Great deployment monitoring!

synctext · 2021-09-23T10:21:17Z

@kozlovsky Could you please duplicate this specific graph from https://data.syncthing.net/ and wrap up the Grafana work?
This is quite a useful and simple graph to have.
Users Joining and Leaving per Day: this is the total number of unique users joining and leaving per day. A user is counted as "joined" on the first day their unique ID is seen, and as "left" on the last day the unique ID was seen before an absence of two weeks or longer. "Bounced" refers to users who joined and left on the same day.
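The definition above can be sketched as follows. This is one possible reading, not the Syncthing implementation: sightings are hypothetical `(user_id, day)` pairs, a user joins only once (on the first day ever seen), and a gap of at least 14 days between two sightings, or between the last sighting and the `as_of` date, counts as having left.

```python
from collections import Counter
from datetime import date, timedelta

ABSENCE = timedelta(days=14)  # "two weeks or longer"


def churn_per_day(sightings, as_of):
    """Return (joined, left, bounced) Counters keyed by day.

    `sightings` is an iterable of (user_id, day) pairs; `as_of` is the day
    the report is generated, used to decide whether the last sighting of a
    user is already followed by a long enough absence.
    """
    days_by_user = {}
    for user_id, day in sightings:
        days_by_user.setdefault(user_id, set()).add(day)

    joined, left, bounced = Counter(), Counter(), Counter()
    for days in days_by_user.values():
        ordered = sorted(days)
        first_day = ordered[0]
        joined[first_day] += 1
        # Pair each sighting with the next one; the final sighting is
        # compared with `as_of` instead.
        for current, following in zip(ordered, ordered[1:] + [as_of]):
            if following - current >= ABSENCE:
                left[current] += 1
                if current == first_day:
                    bounced[current] += 1
    return joined, left, bounced
```

Because the last two weeks before `as_of` cannot yet show an absence, the "left" counts for the most recent days are always incomplete and should be read with that in mind.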

synctext · 2021-10-05T12:12:42Z

> To better organise ourselves we need more critical information in 1 place.

Mature network alerts and deployment monitoring. The mission is to put everything in one place. The big danger is to partially put everything together, but actually create the n+1 place called Grafana where data is fragmented. Full user experience pipeline:

- Keyword search performance for Tribler (locked somewhat into Google)
- Website visits to Tribler.org (Github hosting statistics export)
- Crawling of our .exe download stats from https://githubdownloads.com/?username=tribler&repository=tribler, e.g. "Tribler-7.10.0.dmg (60.20 MiB) - downloaded 3,241 times. Last updated on 2021-07-14"
- How many active users are our network health crawlers chatting to?
- How many incoming introduction-requests are our network health crawlers getting? (27Sep 6AM event)
- What are the various exit nodes self-reporting in total traffic?
- How many daily users are using our various initial bootstrap nodes?

A single page having graphs for the health of each step in our user journey would help to identify faults. We learned a lot from our recent "unknown user drop" incident. Like:

It took the team 5 days to figure out we had a suspicious memory dip at 06:00AM daily.

synctext · 2022-09-06T15:09:49Z

When we have hired more developers we can re-visit this issue. We need to focus on putting everything inside application-tester and existing code. Example of IPFS people on DHT health.

synctext · 2022-10-20T07:51:10Z

The IPFS people have a nice uptime monitoring script (DHT level only):

Epic 2015 ticket with monitoring with Niels statistics. User community insight using an improved crawler

synctext · 2022-10-27T08:58:13Z

We take screenshots, but they take a few clicks to find (application tester on Jenkins)

Plus smooth Github actions: https://github.com/Tribler/tribler/actions/runs/3330189428/jobs/5508351618

synctext · 2023-07-10T07:32:29Z

Complex monitoring. Numerous statistics systems, all connected together, and almost all down now 😿

The network has not been functioning optimally these days. The Tor-like network was running out of capacity. The root cause of the failure was a memory leak which went unnoticed. Grafana did not alert. There was no Slack alarm post. Testers did not alert. InfluxDB is not recording anymore. The Prometheus-Grafana data feed is down. Our dream of a single health dashboard should have caught this. Another system was brought live in a few hours:

This duplicates Jenkins monitoring: https://jenkins-ci.tribler.org/job/Test_BootstrapServers/lastSuccessfulBuild/artifact/summary.png
We lack a single vision and minimal maintenance platform for alerts. ToDo after big release.

drew2a · 2023-07-10T09:57:07Z

This is yet another indication that choosing Grafana+Prometheus may not have been the best decision for our "new" dashboard. We already have ample sources of information, so adding another unique source doesn't seem optimal. What we really need is a singular place to integrate all existing information.

From my perspective, here's what we should do (with a rough time estimation):

- Select a tool capable of integrating data from all existing information sources (1w).
- Install the tool (1d).
- Ensure the selected tool can analyze the entirety of this information and present it using a simple traffic-light-style indicator: 🟢 🟡 🔴 (3d).
- Consolidate all these sources into a single dashboard (1m-2m).
- Make this tool easily accessible for all developers:
  1. Provide easy access to a web page (1d).
  2. Dispatch daily summary notifications (1d).
  3. Send out alerts concerning critical incidents (1d).
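The traffic-light idea from the steps above can be sketched as follows. This is a minimal illustration, not a proposal for a specific tool: the thresholds, the source names and the rule that a silent source counts as red are all assumptions.

```python
from enum import IntEnum


class Light(IntEnum):
    GREEN = 0
    YELLOW = 1
    RED = 2


def classify(value: float, warn_at: float, fail_at: float) -> Light:
    """Map a 'higher is worse' metric (e.g. % failing runs) onto a light."""
    if value >= fail_at:
        return Light.RED
    if value >= warn_at:
        return Light.YELLOW
    return Light.GREEN


def overall(sources: dict) -> Light:
    """Combine per-source lights into one indicator: the worst one wins.

    A value of None means the source did not report at all. That is treated
    as RED (an assumption), so a monitoring system that is itself down
    cannot hide behind a green dashboard.
    """
    if not sources:
        return Light.RED
    worst = Light.GREEN
    for light in sources.values():
        worst = max(worst, Light.RED if light is None else light)
    return worst
```

For example, `overall({"jenkins": Light.GREEN, "influxdb": None})` evaluates to `Light.RED`, which could then drive the daily summary and the critical-incident alerts.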

Our information sources:

- Jenkins (Experiments, Application Tester, Release Builds)
- Download statistics: https://release.tribler.org/dashboard/
- Grafana: https://dashboard.tribler.org
- Infrastructure monitoring tools
- Sentry
- Metabase (crawlers)

(did I miss something?)

kozlovsky · 2023-07-10T10:18:30Z

I think that of all the services we use for dashboards and monitoring (Prometheus, InfluxDB, Grafana), Prometheus is the most reliable (and can display monitoring graphs without Grafana), while the most problematic was InfluxDB; most dashboard outages were caused by it.

It may be worth spending time to set up Prometheus alerts, as it should cover most of the current problems.

For persistent time series data, the most convenient data storage may be TimescaleDB, which can replace InfluxDB and fix most problems.

But trying something simpler like Graphite is also possible.

xoriole · 2024-01-25T13:21:22Z

Since Grafana is currently used for deployment monitoring and as far as I understand there is no immediate priority to work on an alternative, I'm unassigning myself from this ticket.

qstokkink · 2024-08-29T12:07:17Z

Indeed, we have a solution in place. This issue is - at the very least for now - resolved. If we have specific alternatives that we want to explore in the future, another issue can be opened.

devos50 added the long-term label Dec 10, 2019

devos50 added this to the Backlog milestone Dec 10, 2019

synctext mentioned this issue Dec 16, 2019

Operational credit mining branch, Gumby scenario and multichain #1842

Closed

synctext mentioned this issue Mar 21, 2020

Tribler crashes after 4-10 minutes #5220

Closed

synctext mentioned this issue May 15, 2020

attack-resilient micro-economy for media #1

Open

15 tasks

synctext mentioned this issue Jun 14, 2020

Keyword search performance and RemoteQuery community #5208

Closed

devos50 mentioned this issue Jun 18, 2020

Improvements/fixes of the Tribler statistics page #5387

Closed

synctext added the infrastructure label Jul 16, 2020

synctext assigned xoriole Aug 31, 2020

drew2a added the component: monitoring label Jan 15, 2021

synctext mentioned this issue Apr 21, 2021

end-to-end anonymous seeding and download performance test #2548

Open

xoriole removed their assignment Jan 25, 2024

qstokkink removed the component: monitoring label Aug 19, 2024

qstokkink removed this from the Backlog milestone Aug 23, 2024

qstokkink closed this as completed Aug 29, 2024
