SPY VS. Reddit

SPY VS. Reddit

Process: Scrape redditor profiles for stock picks. Stock performance is compared to SPY according to redditors earliest mention date.

106,708 users parsed, 112:01:57 total time, 2024-11-18

57k mentioned at least 1 stock, 16k mentioned at least 5 stocks.

Of the 16k, 3.2k were able to beat SPY at least 50% of the time.

Price Increase Post Mention, PIPM, checks if the stock hit a higher price at least once after the date of mention.

Users that beat SPY at least 50% of the time

+---------+-------+----------+-------+-------+
| Tickers | Users | Beat SPY | %     | PIPM  |
+---------+-------+----------+-------+-------+
| 5       | 16266 | 3182     | 19.56 | 89.65 |
| 10      | 8153  | 1445     | 17.72 | 89.96 |
| 15      | 4669  | 739      | 15.83 | 90.14 |
| 20      | 2847  | 411      | 14.44 | 90.34 |
| 25      | 1803  | 217      | 12.04 | 90.61 |
| 30      | 1182  | 122      | 10.32 | 90.68 |
| 35      | 815   | 82       | 10.06 | 90.7  |
| 40      | 564   | 51       | 9.04  | 90.68 |
| 45      | 401   | 29       | 7.23  | 90.88 |
| 50      | 296   | 24       | 8.11  | 91.01 |
+---------+-------+----------+-------+-------+

55%

+---------+-------+----------+-------+-------+
| Tickers | Users | Beat SPY | %     | PIPM  |
+---------+-------+----------+-------+-------+
| 5       | 16266 | 2692     | 16.55 | 89.65 |
| 10      | 8153  | 955      | 11.71 | 89.96 |
| 15      | 4669  | 488      | 10.45 | 90.14 |
| 20      | 2847  | 238      | 8.36  | 90.34 |
| 25      | 1803  | 116      | 6.43  | 90.61 |
| 30      | 1182  | 61       | 5.16  | 90.68 |
| 35      | 815   | 40       | 4.91  | 90.7  |
| 40      | 564   | 24       | 4.26  | 90.68 |
| 45      | 401   | 11       | 2.74  | 90.88 |
| 50      | 296   | 8        | 2.7   | 91.01 |
+---------+-------+----------+-------+-------+

60%

+---------+-------+----------+------+-------+
| Tickers | Users | Beat SPY | %    | PIPM  |
+---------+-------+----------+------+-------+
| 5       | 16266 | 1609     | 9.89 | 89.65 |
| 10      | 8153  | 533      | 6.54 | 89.96 |
| 15      | 4669  | 237      | 5.08 | 90.14 |
| 20      | 2847  | 109      | 3.83 | 90.34 |
| 25      | 1803  | 53       | 2.94 | 90.61 |
| 30      | 1182  | 30       | 2.54 | 90.68 |
| 35      | 815   | 15       | 1.84 | 90.7  |
| 40      | 564   | 11       | 1.95 | 90.68 |
| 45      | 401   | 3        | 0.75 | 90.88 |
| 50      | 296   | 1        | 0.34 | 91.01 |
+---------+-------+----------+------+-------+

65%

+---------+-------+----------+------+-------+
| Tickers | Users | Beat SPY | %    | PIPM  |
+---------+-------+----------+------+-------+
| 5       | 16266 | 1157     | 7.11 | 89.65 |
| 10      | 8153  | 270      | 3.31 | 89.96 |
| 15      | 4669  | 103      | 2.21 | 90.14 |
| 20      | 2847  | 56       | 1.97 | 90.34 |
| 25      | 1803  | 18       | 1.0  | 90.61 |
| 30      | 1182  | 9        | 0.76 | 90.68 |
| 35      | 815   | 3        | 0.37 | 90.7  |
| 40      | 564   | 1        | 0.18 | 90.68 |
| 45      | 401   | 0        | 0.0  | 90.88 |
| 50      | 296   | 0        | 0.0  | 91.01 |
+---------+-------+----------+------+-------+

OMITING FAVORITES

nofavs.csv excludes top 10 most mentioned - ['GME', 'TSLA', 'AAPL', 'GOOGL', 'AMZN', 'NVDA', 'AMC', 'V', 'META', 'MSFT']. nospy.csv omits any stock already included in SPY. Returns degrade substantially when favorites are omitted. See Results folder for data.

Without Favorites, Users that beat SPY at least 50% of the time

+---------+-------+----------+-------+-------+
| Tickers | Users | Beat SPY | %     | PIPM  |
+---------+-------+----------+-------+-------+
| 5       | 12108 | 1737     | 14.35 | 88.25 |
| 10      | 5653  | 702      | 12.42 | 88.7  |
| 15      | 3103  | 351      | 11.31 | 89.14 |
| 20      | 1849  | 180      | 9.73  | 89.48 |
| 25      | 1157  | 96       | 8.3   | 89.79 |
| 30      | 769   | 66       | 8.58  | 89.97 |
| 35      | 528   | 41       | 7.77  | 90.03 |
| 40      | 374   | 28       | 7.49  | 90.32 |
| 45      | 282   | 19       | 6.74  | 90.64 |
| 50      | 214   | 16       | 7.48  | 90.75 |
+---------+-------+----------+-------+-------+

Without SPY, Users that beat SPY at least 50% of the time

+---------+-------+----------+-------+-------+
| Tickers | Users | Beat SPY | %     | PIPM  |
+---------+-------+----------+-------+-------+
| 5       | 7512  | 1062     | 14.14 | 84.44 |
| 10      | 2559  | 298      | 11.65 | 84.31 |
| 15      | 1121  | 104      | 9.28  | 84.25 |
| 20      | 551   | 49       | 8.89  | 84.38 |
| 25      | 306   | 20       | 6.54  | 84.43 |
| 30      | 180   | 10       | 5.56  | 84.13 |
| 35      | 122   | 8        | 6.56  | 83.83 |
| 40      | 82    | 5        | 6.1   | 83.47 |
| 45      | 66    | 5        | 7.58  | 83.36 |
| 50      | 52    | 3        | 5.77  | 84.03 |
+---------+-------+----------+-------+-------+

CODE

Alias.json is converted into a Trie in order to parse out text.

eg.

"MSFT":  [
           "$MSFT",
           "MICRO SOFT",
           "MICROSOFT",
           "MSFT",
           "WINDOWS"
         ]

getUserTickers("example_user", 10)

Fetches and parses a redditors submission and comments. The second parameter specifies the search depth. Returns the earliest mentioned date of each ticker.

eg.
```
[('FORD', '07/18/2022'),
("KO", '02/27/2023'),
("AAPL", '06/29/2019',
("GME", '04/01/2021')]
```

bench_vs_user("SPY", data)

Compares the performance of the ticker relative to SPY since the date of mention, and since the tickers inception. Returns a list:

[0] - [Ticker, 
[1] - Date Mentioned, 
[2] - SPY > Ticker since mention?, 
[3] - Date of inception, 
[4] - SPY > Ticker since Inception?, 
[5] - Price Inc After Mention?,
[6] - True/False]

eg.

 [['FORD', 'Mention: 07/18/2022', True, 'Inception: 11/17/1994', True, 'Price Inc After Mention', True],
 ['KO', 'Mention: 02/27/2023', True, 'Inception: 01/29/1993', True, 'Price Inc After Mention', True],
 ['AAPL', 'Mention: 06/28/2019', False, 'Inception: 01/29/1993', False, 'Price Inc After Mention', True],
 ['GME', 'Mention: 04/01/2021', True, 'Inception: 02/13/2002', False, 'Price Inc After Mention', True]]

DISCUSSION

Project does not take into account sentiment and context when scraping tickers. "I think SO-SO Corp is terrible, but I think THIS-AND-THAT Inc. is a gem!" In this case, both tickers would be picked up by the Trie. Although negative sentiment mentions are rare, and most likely don't have a tangible effect on the data, their existence should still be acknowledged.

Proposed remedy - Incorporate LLM to interpret sentiment and context in order to omit negative sentiment tickers. Might need new graphics card?
This project relies on the Alias.json to match words and phrases to their respective tickers. While the list has been reasonably well refined, false positives do occur from time to time. This is most promiment when factoring in non-english users and non-english subreddits.

Proposed remedy - Further refine Alias.json, or omit problematic subreddits/languages altogether.
The data base of this project relies on YAHOO FINANCE, and by extension, the yfinance module. As of this week, 11/18/2024, YAHOO FINANCE seems to have silently implemented limits on API calls. The current itteration of the database includes about ~7000 CSVs which must be updated daily. The updating process seems to hit the API call limit around 2000-2500. Furthermore, the nature of CSVs makes them cumbersome and slow to work with.

Temporary remedy - Keep SPY updated daily. When comparing SPY to the users TICKER, trim the length of SPY to match the length of TICKER. At worst, this results the omission of a day or two of returns. Overall inconsequential, but nonetheless very annoying.

Proposed remedy 1 - Update the database in a staggered/asynch formation throughout the day to avoid API call limits. Look into AlphaVantage, or consider paying for data. Reach out to testfol.io creator for second opinion.

Proposed remedy 2 - Convert CSV database to SQL for faster access and compatibility with multiprocess. Duck DB looks promising.
The exact returns generated by each redditor isn't calculated. It's plausible, though tenuous, that a redditor may pick one or two stocks that generate enough alpha to create net profit despite losses on all other stock picks. 2.4% of firms generated the bulk of market returns, Bessembinder et al. Took a preliminary look into this, specifically in the pennystock and short-squeeze focused subreddits. These redditors, in aggregate, were consistently and overwhelmingly net negative even when accounting for squeezes. Results to be posted later.

Proposed remedy - Incorporate exact returns into bench_vs_user function. Allocate one single dollar to each stock starting at its earliest mention? Reconcile when redditor choose to "yolo" on a stock.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
Results		Results
README.md		README.md
alias.json		alias.json
aliasFunctions.py		aliasFunctions.py
compareReturns.py		compareReturns.py
config.yaml		config.yaml
csvFunctions.py		csvFunctions.py
getUserTickers.py		getUserTickers.py
main.py		main.py
stockFunctions.py		stockFunctions.py
timeFunctions.py		timeFunctions.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SPY VS. Reddit

Process: Scrape redditor profiles for stock picks. Stock performance is compared to SPY according to redditors earliest mention date.

Users that beat SPY at least 50% of the time

55%

60%

65%

OMITING FAVORITES

Without Favorites, Users that beat SPY at least 50% of the time

Without SPY, Users that beat SPY at least 50% of the time

CODE

Alias.json is converted into a Trie in order to parse out text.

getUserTickers("example_user", 10)

bench_vs_user("SPY", data)

DISCUSSION

Always bet on SPY.

About

Releases

Packages

Languages

miramo1/Spy-Vs-Reddit

Folders and files

Latest commit

History

Repository files navigation

SPY VS. Reddit

Process: Scrape redditor profiles for stock picks. Stock performance is compared to SPY according to redditors earliest mention date.

Users that beat SPY at least 50% of the time

55%

60%

65%

OMITING FAVORITES

Without Favorites, Users that beat SPY at least 50% of the time

Without SPY, Users that beat SPY at least 50% of the time

CODE

Alias.json is converted into a Trie in order to parse out text.

getUserTickers("example_user", 10)

bench_vs_user("SPY", data)

DISCUSSION

Always bet on SPY.

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages