Reshape if single value when having several values #29
Comments
Could you check whether the column names in the data frame you want to predict on are the same as in the one used for training? Could you also check the shape of your data frame ('df.shape')?
By the way, if there was something wrong with your data frame, please also let me know. Then I can raise a warning with some explanation, which might be useful for future users.
The shape is (10934, 2), so I don't understand the problem. I'm trying to see what could be wrong, but it's a really large data frame.
So I discovered a workaround, but I think it's better if I explain my case. I have a CSV with two columns, "id" and "name"; the file has 500000 rows and weighs 15MB. My problem was that when trying to predict on the entire file, memory usage increased up to 70GB and the process was killed instantly. To work around this I made a script that divides the data frame into N pieces, then joins the first piece with the second and predicts, then the first with the third, and so on, starting again with the second afterwards (e.g. dividing into 3 gives the pairs 1-2, 1-3, 2-3). That started working, but suddenly in the third "principal" iteration the error from my first message appeared. My current workaround is to join the pieces, save them to a CSV and open it again, but now each iteration takes about 20 seconds instead of 3. Is there any way to predict the entire 15MB file without dividing it and without consuming more than 60GB of memory? Because right now I'm doing this, and I don't know if there's something I can improve:

import pandas as pd
import pickle

# Load the trained Deduplicator that was pickled after fitting
with open("./scripts/model.pkl", 'rb') as f:
    myDedupliPy = pickle.load(f)

# Read the full tab-separated file and predict on it in one go
df = pd.read_csv("./scripts/file.csv", sep='\t')
res = myDedupliPy.predict(df, score_threshold=0.1)
print(res.sort_values('deduplication_id').head(60))
I think the size shouldn't be a problem. Most likely the selected blocking rules result in blocks that are too large, so too many comparisons are made. Could you give some more details on what you're comparing? I can probably guide you in selecting useful blocking rules that won't blow up memory usage.
Yes, these are some examples from the dataset. They are mostly author names, sometimes "surname, name" or "name surname", and some abbreviations of the name.
You could try to use blocking rules like the ones below:
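A minimal sketch of what such blocking rules could look like for the "name" column: in DedupliPy a blocking rule is simply a Python function that maps a string to a blocking key, and only records sharing a key are compared against each other, which is what keeps the number of comparisons and the memory usage bounded. The function names below and the commented-out rules= keyword are illustrative assumptions based on the advanced example notebook, not code confirmed in this thread.

def first_word(name: str) -> str:
    # Block on the first token, e.g. "Smith, John" -> "Smith,"
    return name.split()[0] if name else ''

def first_three_characters(name: str) -> str:
    # Block on the first three characters, e.g. "Smith, John" -> "Smi"
    return name[:3]

# How the rules are registered depends on the Deduplicator constructor shown
# in the advanced example notebook; the keyword argument below is an assumption.
# myDedupliPy = Deduplicator(col_names=['name'],
#                            rules={'name': [first_word, first_three_characters]})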
The advanced example notebook shows how to use custom blocking rules. Moreover, I would create your own string similarity metric for the id column, as you don't want to do fuzzy matching on this field. Two IDs either match or they don't; there is no approximate match. The similarity metric for this field should look like this:
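A minimal sketch of such an exact-match metric, assuming the usual convention that a similarity function takes two strings and returns a score between 0 and 1; the function name is illustrative and not taken from the original comment.

def id_exact_match(x: str, y: str) -> float:
    # IDs either match or they don't: return 1.0 for identical values
    # and 0.0 otherwise, instead of a fuzzy string-distance score.
    return 1.0 if x == y else 0.0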
The advanced example notebook also shows how to use custom similarity metrics for specific fields. Good luck!
Oh perfect, thank you so much :D
Hi, sometimes when trying to predict some values it gives me the following error:
It says my array is 1D, even though I read a .csv with pandas that contains more than 5000 values.
I don't know what could be wrong. My data are author names and there's nothing strange about them. I fit with more than 60000 rows and then tried to predict; it works with some values and not with others. Maybe it's some kind of bug?