phd placeholder: "Decentralized Machine Learning Systems for Information Retrieval" #7290
Ideas...
|
Brainstorm: We present a decentralised search engine, based on deep learning of URLs, URIs, magnet links, and IPFS links. Deep learning has been successfully used in the past to identify trusted and malicious websites. We go beyond this prior work and present an experimental search engine based on fully decentralised unsupervised learning. Our fuzzy search algorithm is based on unsupervised online learning-to-rank (OL2R). Literature, Dataset and code:
|
Spent the last week doing basic courses on neural networks again. Trying to get a linear regression model running to predict a sine function (using SGD). As basic as this is, implementing it is not as easy as I would have expected 🥲 I will continue to try to get it to work. I need to learn it at some point, I think. However, I'm inclined to first-publication ideas that do not directly employ NN/learning, as kind of a soft start. Talked to @kozlovsky again about semantic search based on simple embeddings. We could use the crowdsourced information and metadata to compute embeddings for every torrent and build a distributed (dec.) search algorithm based on vector distance to have a YouTube-like search experience.
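To make the embedding-plus-vector-distance idea a bit more concrete, here is a minimal local sketch (the model choice, library, and example torrent titles are my assumptions for illustration, not anything decided here):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")

# crowdsourced torrent metadata (titles/tags), embedded once
torrent_titles = [
    "Big Buck Bunny 1080p open movie",
    "Ubuntu 24.04 LTS desktop amd64 ISO",
    "Creative Commons jazz compilation 2021",
]
doc_vecs = model.encode(torrent_titles, normalize_embeddings=True)

def search(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                     # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [(torrent_titles[i], float(scores[i])) for i in top]

print(search("linux install image"))
```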
Use this as a basis for semantic search and improve through OL2R in the next step? This week's ToDos:
|
Please chat with all people in the lab and understand their specialities. Btw, document your lessons learned; ML blogs make it look easy, yet none of them worked. Please review the Meerkat system Github repo from RedPajama.
Suggested sprint: you successfully encoded
|
Got a sine function prediction working using a NN regression model. It's not much... but feels good to have succeeded at this task. Learned about activation functions and the challenge of parametrization in ML. Also did some reading on tagging, recommender systems, and collaborative filtering, which opens another broad area of research, e.g., the issue of trust in collaborative tagging (see research topic 4 in science#42) - which I do find interesting. This (and perhaps also the next) week, I want to play around with Meerkat, learn about autoencoders, and see if I can get something up and running, i.e., another ML model. I hope to further evolve my understanding of AI/ML and learn about yet more new concepts. |
Rich metadata embedding of "artist passport" Cullah. |
Update on Autoencoder: The hardships of the last weeks seem to start paying off. I was able to create some functional autoencoders within just one day. I trained a model on eight pictures displaying tulips (dimensions: 240x180px), i.e., an input layer of 3x240x180=130k neurons, and reduced that to 1000 neurons in a single hidden layer (the encoding). If I'm not mistaken, this equates to a data reduction from 130 KB to 4 KB (the original JPEGs had 50-80 KB). This might not be impressive yet, and with the right parametrization we might be able to get more out of it. But for now, I'm just happy that it works in principle. |
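For reference, a minimal sketch of an autoencoder with the shape described above (3x240x180 input, a single 1000-neuron hidden layer as the encoding); the training loop and hyperparameters are illustrative guesses:

```python
import torch
import torch.nn as nn

IN_DIM = 3 * 240 * 180   # = 129,600, i.e. the ~130k input neurons mentioned above
CODE_DIM = 1000          # single hidden layer = the encoding

class TulipAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(IN_DIM, CODE_DIM), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(CODE_DIM, IN_DIM), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x)).view(-1, 3, 240, 180)

model = TulipAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

images = torch.rand(8, 3, 240, 180)   # stand-in for the eight tulip JPEGs
for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(model(images), images)
    loss.backward()
    opt.step()
```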
Motivated by my recent success with autoencoders, I spent the last week trying again to get a pairwise LTR model for a sample of test data running. By doing that, I learned a lot more about the details of this approach. However, I had to pause this project because I would like to move this outside of a notebook and run it locally. I'm waiting for my work MacBook for that (my machine has difficulties) - it should arrive next week. So now I turned to the idea of NN-based file compression, which apparently is not only successful at the task of lossless compression but can actually compete with traditional algorithms like GZIP or 7Z (see, e.g., DeepZIP or its successor DZip). |
Using LLM for ranking. Background https://blog.vespa.ai/improving-text-ranking-with-few-shot-prompting/ and also https://blog.reachsumit.com/posts/2023/03/llm-for-text-ranking/ Did a silly OpenAI experiment, might be useful building block, but small part:
Found this dataset with 7000+ pet images (+ code) for scientific research, so it sort of works 🐶 |
So I ended up not learning about RNNs, NN-compression, etc. last week. Instead, I investigated an idea proposed by @grimadas, which is to leverage ML as a means to classify nodes as Sybil or not-Sybil and use the resulting score as a weight factor in the reputation algorithms of MeritRank.
Back to LTR
Got my new MacBook yesterday 🔥 so I was able to continue my work on the OL2R project. My goal was to just get anything working, and I specifically stuck with the pairwise LTR approach for this. To this end, I only followed the basic idea, which is to train a model based on query-document-pair triples, and followed my intuition for the rest.
Algorithm
Remarks
|
Sprint: get a solid production-level metadata dataset (Creative Commons, https://github.com/MTG/mtg-jamendo-dataset ?) |
Update:
Love this roadmap!! ❤️ I have started getting my hands on the Tribler code to gain a better understanding of its inner workings. Will try to move forward with the practical preparatory work for the next thesis chapter, such as getting a dataset. |
Please write 3 next paper ideas around the topic of "Web3 crowdsourcing". A 6-pager for the DICG 4th workshop would indeed be a great learning experience. One of the 3: crowdsourcing of metadata using MeritRank; everybody can tag and describe work done of
Great news, hoping the chapter will be ready for final reading & submission Aug/Sep. As a starting point for new lab people: a short history and essential reading. |
New ACM Journal on Collective Intelligence seems like a solid venue to target. Your second chapter could then simply be decentralised collective intelligence. Using Tribler to share info, use tagging, and trust. You can re-publish a lot of the Tribler work done by others in the past 18 years and 3 months. |
To give a little update...
|
I'm focusing on the DICG'23 now. @grimadas
MovieLens Dataset
I was able to find a really nice dataset. MovieLens is a non-commercial project run by people from U. Minnesota since 1995. It's a database of movies that users are allowed to rate and tag. Any user can add new tags, and existing tags can be up- and downvoted in a way (see screenshot). Tags also have the attribute of being positive, neutral, or negative. I am not sure how complete their dataset is in that regard, but they are responsive to my emails and seem highly cooperative with the provision of data. We can use this dataset to get an idea of the quantity and quality when it comes to crowdsourcing of tags, and base our simulations on it.
Idea
Perhaps, for this workshop, I could come up with some subjective tag scoring algorithm, a bit related to the "Justin Bieber is gay" problem. Playing with the idea that for a group of similar users a tag might be agreed upon, but for another group of users the same tag might not be, etc.
Approach
Will further investigate this idea and the dataset and make updates here. Comments welcome. |
Just noticed this line of work, very interesting! I worked on something similar (trust, tags + MovieLens dataset) more than a year ago, see this issue (note that this is a private repo with many of our research ideas so you might have to request access). The overall goal of that issue was to work on the foundation of tag-centric crowdsourcing in Tribler. I tried out a few algorithms/approaches and I remember I identified some shortcomings of Credence, which is related to what you're trying to achieve. But as reputation/trust was not really my domain, I decided to focus on the fundamental data structures instead (using Skip Graphs and Knowledge Graphs for Web3 data management). The paper with these ideas is currently under review. Nonetheless, extending that work with a trust layer would be a valuable contribution! |
Hi Martijn :) thanks for your input! I was knocked out by COVID over the last two weeks, and still am a bit, but here is the continuation of what I was trying to do: I have calculated user similarity based on the Pearson correlation of common sets of rated movies (as suggested here), and, based on that, computed subjective tags on movies (indeed similar to Credence in that I weight based on peer correlation). I based this solely on the interactions of the 200 most active users (for performance reasons).
Example of a sample of users and their subjective tags on the movie "The Shawshank Redemption"
From there on, I tried to find extreme results, i.e., movie tags for users of "opposite" groups. To this end, I looked up controversial movies and their tags for users with minimum/negative correlation, hoping for something like a clear political or gender split. And it wasn't easy, perhaps due to the lack of data. But I still found an interesting disparity for Disney's Star Wars remake. While one user has funny, good action, and great action among his top tags,
Full list of tags for two negatively correlated users on "Star Wars: The Last Jedi"
That was fun to explore but it still lacks a scientific methodology in order to really evaluate the effectiveness of the subjective tags I computed. Previously, I proposed that
Maybe that gives us something. Maybe for all tags that have been up- and down-voted, I can compare the subjective with the objective reality and derive a success metric. And this would allow me to experiment with more sophisticated scoring algorithms and see their effect on this metric.
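To sketch what such a scoring algorithm could look like in code (the data layout and vote format are assumptions, not MovieLens' actual schema):

```python
import numpy as np

def pearson(u: dict, v: dict) -> float:
    """Correlation over the movies both users rated."""
    common = set(u) & set(v)
    if len(common) < 2:
        return 0.0
    a = np.array([u[m] for m in common], dtype=float)
    b = np.array([v[m] for m in common], dtype=float)
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def subjective_tag_score(me: dict, others: dict, tag_votes: list) -> dict:
    """tag_votes: (user_id, tag, +1/-1) tuples. Votes are weighted by similarity to 'me'."""
    scores = {}
    for user_id, tag, vote in tag_votes:
        w = max(pearson(me, others[user_id]), 0.0)  # ignore negatively correlated peers
        scores[tag] = scores.get(tag, 0.0) + w * vote
    return scores

# toy data (made up)
me = {"Shawshank": 5.0, "Inception": 4.0, "Cars": 2.0}
others = {"u1": {"Shawshank": 5.0, "Inception": 4.5, "Cars": 1.0},
          "u2": {"Shawshank": 1.0, "Inception": 2.0, "Cars": 5.0}}
votes = [("u1", "masterpiece", +1), ("u1", "prison", +1), ("u2", "overrated", +1)]
print(subjective_tag_score(me, others, votes))
```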
Good stuff. I don't know if trust should be my scope either. I'll talk to Johan today, will know more then. Status update on
After almost half a year, I still don't have enough of a grasp of the field to come up with my own ideas for publications. |
No worries about your progress in 6 months of a 48-month phd. Getting into the field of distributed systems and doing something novel is hard. Having a draft publication within the first 12 months is already a solid achievement. Goal: April 2024 == 1 thesis chapter under review + 1 finished draft thesis chapter. Non-linear productivity 📈 Task for September 2023: come up with ideas for a scientific paper and select one (or get inspiration)
SwarmLLM: collective LLM intelligence (with new AI phd expert??)
We present the first proof-of-principle of collective intelligence for transformers. Intelligence emerges from the interaction between numerous elements [REFS]. We use a transformer as the basic building block for an interconnected network of unique transformers. Instead of the classical transformer approach with billions of parameters, we connect thousands of specialised transformers into a network. This is a generalisation of the mixture-of-experts approach with the highly desired new property of unbounded scalability. There is a cost to pay in our approach: in typical divide-and-conquer style, the challenge of finding the correct expert becomes harder.
LLM as a key/value store
key: any youtube URL in the Youtube-8M dataset.
Rich metadata inside an LLM
Tulip picture embedding in generic form.
Tribler: a public semantic search engine
We shamelessly claim to have a proof-of-principle for public Internet infrastructure after 20 years of effort. We argue that critical societal infrastructure should be developed as a non-profit endeavour. Similar to Bittorrent and Bitcoin, we present a self-organising system for semantic search. Our work is based on learn-to-rank and clicklog gossip with privacy-enhancing technology using a Tor-derived protocol.
Web3Search: Online Pairwise Learning to Rank by Divide-and-Conquer with full decentralisation
Embedding nodes using algorithms like node2vec. Embedding of any item using a 50-ish dimensional vector.
Foundations of Trust and Tags
Use the @grimadas work with theoretical grounding: emergence of trust networks with repeated successful interactions. Use tags based on crowdsourcing if you have successfully collaborated.
Next steps: learn-by-doing methodology. Work for 2 weeks further on the Tulip stuff. Autoencoder of 1000-ish Youtube-8M thumbnails. Work for 2 weeks on another of the above brainfarts. Commit to the best chapter idea. |
Seeing how far I can get autoencoding YouTube thumbnails. Time for some quick coding. Using YouTube's API, I got the thumbnails of 577 search results with "starship" as the query.
Note: Using YouTube's search API instead of its 8M dataset (can't run that on my machine!) is different in that I collect the thumbnails of videos which match the search query, whereas in the 8M dataset they sort of match the query (a selected set of visual entities) with what is actually found displayed in the video. I still went with it, trained the network on 576 thumbnails, and then ran the 577th search result's thumbnail through the autoencoder.
What might work is the labeling we get at frame level (or ~1-second-interval video segments). We have that in the 8M dataset. Getting an actual image entails downloading the original YouTube video and then extracting the corresponding frame. That's costly but doable on a small to mid scale. We have been thinking about doing text-to-image basically, using auto-encoders? I think that was the plan... |
wow, as impressive as I hoped it would be!! 4k! |
There was an error in my code, and a bit in the approach. What I did was train on only 50 thumbnails and then use a thumbnail that was part of the training data for testing. I updated my last comment; the result is very different.
Actually, PyTorch does not support CUDA (GPU acceleration) on Mac :( Google Colab with GPU runs faster for me. |
|
I have been digging into some papers in the context of the upcoming Queries-Is-All-You-Need chapter and potential future work. Dumping my learnings here.
|
update: (more paper ideas than finished chapters, simply documenting) |
"learn to tokenize documents"I got the script to work (GenRet). However, I was rethinking this idea and I don't see how we could sell it. Learning semantic docids from the queries themselves obviously requires you to have the queries beforehand, and even then it implies some fluidity (cannot just "improve" the ID in the continuum). I'm dropping this! roadmap on life-long decentralised learning [...] what is the first realistic next step?Yeah as you listed: overfitting, pollution, spam, those are also things that come to my mind. While there are some ideas how it could work conceptually (e.g., DSI++, IncDSI), the datasets they use to validate them are a bit weak (in those papers, NQ and MS MARCO). For a real evaluation, we (1) need real search queries, including spam, but also (2) we should care about the chronological order that the queries come in, and that the model learns on. Waiting for the Tribler Clicklog. Would it be possible to filter hallucination of our magnet link generative AI (Queries Is All You Need)?Not in the way that is described in this body of research (referring to your survey link), I think. What we can do in the next step is to assume knowledge of, let's say, healthy torrents. Using this knowledge, the model will be configured to predict the most likely token, with which the resulting output continues to match a prefix found within the set of healthy torrents. Will do! |
Representation of Targets Matters
In our last paper, we saw significant differences in performance when representing our targets as ORCAS docids (e.g., D1234567) vs. as magnet links (40-character hex strings). The model would generally have a harder time predicting magnet links. We blamed this on their length; more tokens to generate, more chances to trip along the way. When thinking about how to optimize the performance of our model, I therefore thought the number of tokens in a docid should be minimized. Why not use the entire ASCII space, for example? Or hell, the T5-small has 32k tokens, why not encode docids as, for instance, "car 2001": two tokens, 1 billion possible combinations. It turns out this confuses the model more than it helps 😅. This beeeeeegs the question.... 🤔
Using an LLM to predict arbitrary identifiers, what kind of identifiers come naturally to it?
Is it a question of length? Or consistency? Or the employed vocabulary? What tokens should you use? How many? I ran a lot of experiments to get closer to an answer about all these things. In order to enhance the performance of our model, I initially thought that the number of tokens used to represent a docid should be minimized. My rationale was: fewer tokens, fewer chances to mispredict. And while that might be true, or maybe only true to some extent, it definitely seems to be the case that the employed vocabulary matters too! I have been experimenting with different representations (or rather encodings) of the targets (i.e., the docids) -- and, spoiler, the results are actually quite impressive. Here is exactly what I did
I repeat this experiment with different encodings. A full list, including some result metrics, is shown below.
🚀 In this experiment, we made a 7% performance increase over our original results just by choosing a different encoding for the docids. The model seems to have an easier time with numbers. Maybe it is because there exist many tokens for compounded digits (69, 420, 2001, 1944), thus resulting in fewer tokens needed to represent a docid. Another theory I have is that, having predicted a number token, the model is more likely to predict another number token based on this context, and that this might help performance a little. It is perhaps also interesting to acknowledge that number tokens are semantically very similar to each other. That goes to say the tokens for
We might already be very close to what the perfect (or perfect-enough) representation is. But it might be interesting, not just for this application, but also for the broader ML community, to investigate what representations an LLM works best with. @pneague and I were thinking of using ML (genetic algorithms in particular) to learn an optimal vocabulary for representing arbitrary string identifiers. 🌈 Edit: Looking at the results again, it might just be that the model favors a low but consistent token length. But more experiments need to be conducted. |
Encoding of Targets
Threw away the idea of using ML/genetic algorithms to determine the best tokenization. It's not that complicated after all! As the prior analysis already indicated (cf. boxplot in prev. comment), the LLM works best when the targets are represented in an encoding that
The tokenization of strings like "a3bf01..." is unreliable in the number of tokens it produces.
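A quick way to check this claim yourself (assuming the T5-small tokenizer used in the experiments above):

```python
from transformers import T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")

hex_ids = ["a3bf01c9", "0f0f0f0f", "deadbeef"]        # hex-style docids
num_ids = ["01234567", "00000042", "99999999"]        # zero-padded numeric docids

for docid in hex_ids + num_ids:
    pieces = tok.tokenize(docid)
    print(f"{docid!r:12} -> {len(pieces)} tokens: {pieces}")
```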
This makes encoding later easy as it is just a mapping of
This approach yielded the best accuracy we ever measured. For the experiment whose results I listed in the table above, this encoding yielded 42% (if I remember correctly). Further experiments could alter the initial embedding of the poop-tokens (currently random), or compound bytes such that the number of tokens per docid could be further reduced. In this case, halved:
Practical Implications of Our Results
As we have uncovered in our last meeting, we are actually only determining the accuracy on the next unseen query, whereas most queries are likely to have been used before. In other words, we don't even know how much unseen queries are a problem, and what we win in real-world scenarios. Therefore, I would like to include another experiment that suggests the real-world implications of our results. Queries targeting a document follow a power-law distribution (there are dozens of papers on that). What I would like to do is assume some probability distribution and map it to our dataset. That is, instead of doing a distinct train/val/test split of the queries, I want for each set of length
This approach is still flawed, however, as we ignore the degree of similarity that is correlated with the frequency of query-usage. For instance, the 2nd most used query could just be the 1st query with a typo. |
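A possible sketch of that experiment (the Zipf exponent, number of usages, and split sizes are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_query_log(queries, n_usages=10_000, exponent=1.2):
    """Simulate a clicklog where query popularity follows a power law."""
    ranks = np.arange(1, len(queries) + 1)
    probs = ranks ** -float(exponent)
    probs /= probs.sum()
    idx = rng.choice(len(queries), size=n_usages, p=probs)
    return [queries[i] for i in idx]           # duplicates included, like a real log

queries = [f"query_{i}" for i in range(1000)]  # placeholder for the per-document queries
log = sample_query_log(queries)
train, test = log[:8000], log[8000:]           # split over usages, not distinct queries
```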
|
In the last three weeks I was occupied with redoing the stochastic calculations of the chunking algorithm AE and its "cousin" RAM. The purpose of this analysis is deriving a formula in order to understand the relationship between their parameter
We're approaching 40 pages now, and thinking about how to sell this work with @grimadas. Ideas include breaking this work into two papers: one survey/SoK theoretical paper and one about the empirical study. Possible target: JSys, 1st August.
Draft thesis title for forms: Decentralized Machine Learning Systems for Information Retrieval |
Progress after this summer: The problem is not that we failed to decentralise BM25 for 30 years. The metadata is simply not there to implement anything. So we need trustworthy metadata before we're able to realise any search. Idea: Use LTR as a means to collect metadata from implicit cues (solves the incentive problem). We propose two strategies:
|
Comparing money and search...:
There is a lack of incentives for transaction processing with double-spending prevention in peer-to-peer payments. Then Bitcoin came along; it introduced mining to make it costly to participate in a lottery. A single person, selected at random from the participants, is trusted to execute money transfers without fraud. Vulnerable to a 51% attack, it requires an honest majority. We don't have that in the metadata and decentral search world. Making a Decentral Google is harder than printing your own money 🤑.
ToDo: polished 4 pages of article text. Only journal submission-ready sections please. No experimental setup. No experiment description. No early results section. Those are left for future sprints.
First principles approach: I took some steps back to learn about the evolution of information retrieval techniques and what place learning-to-rank has in there. To that end, I found the following resources incredibly valuable:
Another interesting resource I want to share (did not read it very seriously yet, keep it for future work)
Update of LTR chapter draft (improved intro and worked on background/related work): |
Bit shocking to read the available papers. The state-of-the-art in decentral search and learn-to-rank makes you cry 😭 Both CASearch from 2020 and MAAY from 2006 are evidence that the Stanford-class scientists do not touch a scientific topic that has no startup potential and high engineering cost. Anything decentralised is easily 100x more effort, or: 99% of central approaches don't work and you need to invent that missing 1%. Peer-to-peer is so 2001, everybody left the field. Background reading on paper publishing versus projects in AI. Thinking two steps ahead is rather interesting advice. Top-level analysis on decentral search from 2024. Discussing the Go-NoGo moment for the Learn-to-Rank paper. Is there a need for a distraction 🤡 Please note that a phd thesis is highly specialised. It can be entirely about decentralised Learn-to-rank 💥 🤔 So, strategic thinking is indeed: which peers can we trust with privacy-preserving ClickLog info? Have a noble science goal in mind, such as "scalable models of intelligence". For a great storyline we need stuff like superhuman performance of AI (their code). Road to publishable results! Future paper ideas:
Sprint focus: De-NeuralNDCG
|
✨ Grand scheme thinking late at night (loose understanding of literature, take everything with heaps of salt) ✨ Inspired by my latest reads and chats in the lab, there is a vision for decentralized IR that is growing ever stronger in me.
The Future of Decentralized IR
State of the Global Brain
My Vision
My Vision, but being concrete
I think this vision opens the door to so many research papers, e.g.,
|
Did my coding homework for the upcoming LTR paper. That includes
We plan to implement LTR using the allRank framework, which is based on Context-Aware Learning to Rank with Self-Attention (2020).
Current concerns:
|
What to focus on? Set deadline of https://euromlsys.eu/ upcoming Feb!
|
I'm still working on getting a p2p simulation for my ranking algorithms on the DAS6. To that end, I have also started getting familiar with
During our co-writing exercise this Tuesday, I developed a spontaneous idea in an attempt to make collaborative LTR work in a network with untrusted peers. Moreover, this idea could also generalize to other ML tasks. The idea is very simple! Every user has their local model, which they train on locally generated data (i.e., search behavior). It is also used for inference (in this case, the reranking of search results). Blind model averaging is suboptimal because (1) peers' models do not always align with personal search interests and (2) there is a risk of byzantine nodes. I propose a very simple protocol.
Hope that was clear enough. More formally: |
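A rough sketch of how one round of this protocol could look (reconstructed from the description above and the feedback further down; the sample size, nudge factor, and exact update rule are my assumptions):

```python
import torch
import torch.nn as nn

def local_round(model, loss_fn, batch, peer_models, opt, nudge=0.1):
    x, y = batch
    # 1. ordinary local training step on your own data
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

    with torch.no_grad():
        # 2. evaluate the k sampled peer models on *local* data
        peer_losses = [loss_fn(m(x), y).item() for m in peer_models]
        best = min(range(len(peer_losses)), key=lambda i: peer_losses[i])
        # 3. nudge parameters toward the best peer only if it beats the local model
        if peer_losses[best] < loss_fn(model(x), y).item():
            for p, q in zip(model.parameters(), peer_models[best].parameters()):
                p.mul_(1 - nudge).add_(q, alpha=nudge)

# toy usage
model = nn.Linear(10, 1)
peers = [nn.Linear(10, 1) for _ in range(3)]          # models pulled from random peers
opt = torch.optim.SGD(model.parameters(), lr=0.01)
batch = (torch.randn(32, 10), torch.randn(32, 1))
local_round(model, nn.MSELoss(), batch, peers, opt)
```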
🦄 🦄 🦄 🦄 Seriously great idea 🦄 🦄 🦄 🦄 Never seen an
{storyline example disclaimer: as systems people we do not know which exact ML terms to use. The whole little story below could be incorrect. Please check the papers}
Critical element is replacing vertical/horizontal federated learning with
Bittorrent and Bitcoin are fully decentralised. No permission is needed to join such systems. Our gossip learning approach is also permissionless. By using a Pagerank-inspired algorithm we address the security concerns. ToDo: add details. We now present a remarkably simple gossip learning algorithm based on the game-theoretical concept of win-stay, lose-shift; a rough sketch follows below. This game theory strategy has been shown to outperform tit-for-tat in the classic prisoner's dilemma game.
dataset
Please use ORCAS. Web10k is not human readable. Systems people love finding and fixing bugs. It is essential to get a feel for the performance of your code with actual queries and actual URLs. {query embeddings, no content analysis or feature vectors.} Compare to the (overly 😁) simple approach: pair-wise online learn-to-rank https://github.com/mg98/p2p-ol2r
related work
Gossip learning at scale has some resemblance to Mixture of Experts (MoE). MoE uses multiple learners to divide a problem space into homogeneous regions [REF]. These learners are located in the same machine. For instance, the Mixtral system uses "8 sparse mixture of Experts" [REF] with a gating function and a confusingly named "router". The critical difference is that gossip learning uses network-based routing to other machines. Thus millions of "experts" can be selected in large-scale gossip networks.
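{hedged sketch: this concrete mapping of win-stay, lose-shift onto gossiped models is only a guess; the toy nodes, data, and loss function are placeholders}

```python
import copy
import random
import torch
import torch.nn as nn

class Node:
    def __init__(self):
        self.model = nn.Linear(10, 1)
        self.opt = torch.optim.SGD(self.model.parameters(), lr=0.01)
        self.x, self.y = torch.randn(64, 10), torch.randn(64, 1)   # local data
        self.last_loss = float("inf")

    def evaluate(self) -> float:
        with torch.no_grad():
            return nn.functional.mse_loss(self.model(self.x), self.y).item()

    def train_step(self):
        self.opt.zero_grad()
        nn.functional.mse_loss(self.model(self.x), self.y).backward()
        self.opt.step()

def wsls_round(node: Node, peers: list):
    node.train_step()
    new_loss = node.evaluate()
    if new_loss <= node.last_loss:            # win: stay with your own model
        node.last_loss = new_loss
    else:                                     # lose: shift to a random peer's model
        node.model = copy.deepcopy(random.choice(peers).model)
        node.opt = torch.optim.SGD(node.model.parameters(), lr=0.01)
        node.last_loss = node.evaluate()

nodes = [Node() for _ in range(5)]
for _ in range(20):
    for n in nodes:
        wsls_round(n, [p for p in nodes if p is not n])
```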
ToDo: make a plot for next meeting |
Some feedback on the above stuff (also sent in a private message on Slack). So the idea is to sample models from random nodes in the network, take the model with the lowest loss (on your training data?), and nudge your model update towards the model with the lowest loss? I believe in essence you're modifying the loss function and adding an additional nudging-factor term to it. There also seems to be no model aggregation going on. I just talked with the other DL experts in the lab, and we analyzed the algorithm a bit. We understand the insight behind it, but we're not sure what you're trying to solve. Are you aiming towards personalization? Or Byzantine robustness? Each of these objectives has different needs, and I also wonder how this would work in non-IID settings where the optimization trajectory is not as smooth as in an IID setting. The idea of evaluating different neighboring models is closely related to another paper we recently worked on. Nevertheless, it's difficult for us to make predictions on whether it works or not - it depends on many factors. Also, for now I assume it's a synchronous algorithm. Asynchrony and stale updates are going to mess significantly with your loss function. It would be interesting to theoretically analyse if it converges (remark: this is a must-have to have a chance in ML/systems conferences nowadays!!). My feeling says it does, since the differences between local models become smaller, so doing an infinite number of rounds should make sure all nodes end up with the same model? But I'm not really sure, since the else statement can also trigger, meaning that models are diverging (different nodes doing different model updates lead to divergence between these models). Sometimes, a good way to see if it works or not is to just implement it and see what happens. It's an approach we often take as we have a lot of random ideas on how to improve DL. Line 2 seems to suggest that you're doing a training step per sample? Usually one takes a mini-batch to ensure more accurate gradient estimations. Small note: in line 7, you can replace L(y, f_local(x)) with L_local (computed in line 3). Also, the gradient computation step seems to be missing from the algorithm.
Small remark that the domain of distributed optimization has been actively researched since the early 2000s already. The main difference is that these works all focus on simple problems with a convex optimization landscape. Using deep neural networks in a P2P setting is a newer area of research indeed.
Update
I was thinking a bit more about your algorithm. On second thought, it is more resembling of Gossip Learning (GL) than D-PSGD actually, because ultimately you are only "aggregating" a single model if it exhibits a lower loss (although it has elements of both algorithms). So, GL should be a baseline as well, I guess. To fairly compare these algorithms, you might want to implement a synchronous version of GL (I assume your algorithm is also synchronous?) and measure the performance of all these algorithms in terms of rounds (and not absolute time). Also, I was thinking that the extent of collaboration and learning from other nodes might be rather low, of course depending on many different parameters (sample size, nudging factor, etc.). This is good for Byzantine resilience, since (as you also pointed out) you only trust yourself. Nevertheless, this only brings you so far in terms of achieved accuracy. While I am confident that the accuracy of your approach compared to a no-collaboration scenario will be better (because models still influence each other), it might not outperform SOTA personalization methods that more aggressively/strategically integrate the knowledge of others. Given that you want to focus on personalization, these algorithms should also be a baseline. BTW, both DL and FL have this interesting trade-off between doing more local training and model aggregation. Your algorithm might be more on the side of local training while having an occasional nudge by the models of other nodes. Quantifying how much you learn from others would be a very interesting experiment to conduct. Finally, your algorithm is heavy on communication costs since each node has to download k models (potentially large ones!) every round. Then, you only integrate the knowledge of at most a single received model while discarding the knowledge in all other models, which seems wasteful. It might very well be that the parameter updates in slightly sub-optimal models in the long run will lead to better performance of your own model since they help you to generalize knowledge. Also, note that your compute costs also increase since each node needs to do a forward pass on each received model each round. This is an important factor to keep in mind, so unless your paper is purely theoretical, this is something you might want to address. Nevertheless, this is an interesting starting point to explore different dynamics! |
Thnx Martijn!
Aim is more like 1963, permissionless communication (e.g. the Internets). 😁 It offers permissionless intelligence (with my lower communication cost modifications in the comment above). (Context: Global Brain issue) |
Decentralized learning in permissionless settings is certainly novel. DL is mostly researched in enterprise settings (hospitals, IoT, military, etc - also see this article) and the research we do also assumes such settings. That said, Byzantine behaviour is a serious threat to convergence, especially in non-IID settings. Thus, most research in this area assumes a maximum fraction of Byzantine nodes, like 1/3, similar to research on consensus. In permissionless settings, such a threshold is not really applicable as potentially any node can send arbitrary model updates to other nodes. It also depends on the threat model one considers - passive attacks (e.g., honest-but-curious attackers) are much easier to deal with than active attacks such as the Sybil Attack. In fact, the above algorithm that conservatively integrates model updates from others might work in permissionless settings. But in very adversarial settings, I wonder if it's worth collaborating and consuming the bandwidth for model exchanges in the first place. Perhaps training a model from scratch/fine-tuning a pre-trained model locally might yield sufficient utility for the envisioned use case. |
Update from my side regarding the LTR project: I've been trying to get familiar with EPFL's decentralizepy framework. I think it's the right choice for this project. Moreover, I decided it will be a good investment, be it for the Trusted Decentralized Learning idea discussed here with Martijn or other future projects. Unfortunately, doing that and trying to get allRank to work within decentralizepy and the WEB10K dataset ate up 2-3 weeks of my time 😩 This weekend, however, I finally succeeded. What that means is that nodes can train WEB10K collaboratively and decentralized, via dataset sharding and model parameter gossip and aggregation, using the transformer-based LTR model and neuralNDCG loss function from the allRank framework. This setup would now allow me to run simulations on, e.g., byzantine attacks, enabling me to do experiments for my Algorithm 1. @synctext criticised the MSLR WEB10K/WEB30K datasets for not containing raw queries or documents. Indeed, it is standard for LTR datasets to come with nothing but relevance labels, query ids, and an esoteric query-document relationship vector, since this is enough to benchmark LTR models. People from Heidelberg University shared the sentiment in 2016:
While it's hard to find a dataset that does (perhaps for privacy reasons), the Heidelberg people did something clever. They scraped NutritionFacts.org, which contains articles, blog posts, Q&A threads, and videos. If one page links to another page, the metadata of the referring page (blog post/video title, video description, keywords) is used as a query to the linked page. There's more to it, but this is abstract enough for now. Anyhow, this leaves us with the only suitable dataset in existence (that I found) until we have more Tribler data. Even though it will require some annoying post-processing from my side for it to work with our setup 😫. I ran our crawler yesterday for <1 hour to see how much we can fetch. Only got 30 distinct queries. |
2 week sprint goal: performance graph from IPv8+Allrank please, explore datasets (simple protocol, keep 20 neighbours, most similar ever seen) |
< Placeholder >
timeline: April 2023 - April 2027.
Key historical 2016 issue of thesis topic
ToDo: 6 weeks hands-on Python onboarding project. Learn a lot and plan to throw it away. Next step is then to start working on your first scientific article. You need 4 thesis chapters, and then you have completed your phd.
One idea: towards trustworthy and perfect metadata for search (e.g. perfect memory retrieval for the global brain #7064 ).
Another idea: a gradient descent model takes any keyword search query as input. Output is a limited set of vectors. Only valid content is recommended. Learning goal is to provide semantic matching between input query and output vector.
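A very loose sketch of this idea (architecture, vocabulary size, and the in-batch contrastive objective are placeholder assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)   # simple bag-of-words query encoder
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return F.normalize(self.proj(self.emb(token_ids)), dim=-1)

enc = QueryEncoder()
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)

# toy batch: query token ids plus fixed vectors of the matching (valid) content items
queries = torch.randint(0, 10_000, (32, 8))
content_vecs = F.normalize(torch.randn(32, 64), dim=-1)

for step in range(100):
    q = enc(queries)
    logits = q @ content_vecs.T                        # similarity to all content in batch
    loss = F.cross_entropy(logits, torch.arange(32))   # pull queries toward their content
    opt.zero_grad()
    loss.backward()
    opt.step()
```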
General background and Spotify linkage; possible dataset sharing
Max. usage of expertise: product/market fit thinking
Background reading:
Pointwise approach: broadly speaking, each historic impression with a click is a positive training example, and each impression without a click is a negative training example.
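A minimal illustration of this pointwise framing (feature vectors and clicklog data are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# (query-document feature vector, clicked?) pairs from a hypothetical clicklog
X = rng.random((1000, 10))            # e.g. BM25 score, title match, freshness, ...
y = rng.integers(0, 2, 1000)          # 1 = impression was clicked, 0 = not clicked

ranker = LogisticRegression().fit(X, y)

# at query time, rank candidate documents by predicted click probability
candidates = rng.random((5, 10))
scores = ranker.predict_proba(candidates)[:, 1]
ranking = np.argsort(-scores)
print(ranking)
```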
https://towardsdatascience.com/learning-to-rank-a-primer-40d2ff9960af
We will implement a character-level sequence-to-sequence model, processing the input character-by-character and generating the output character-by-character. Another option would be a word-level model, which tends to be more common for machine translation.
https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
Venues:
Note: no prior MSc courses on machine learning. We are a systems lab and might know how to apply machine learning in a permissionless, byzantine, unsupervised, decentralised, adversarial, continuous-learning context.
Possible scientific storyline: SearchZero, a decentralised, self-supervised search engine with continuous learning.