Using Examples with a dataset like ML-10M #111
Comments
Hi André, we are aware of this issue, and of the fact that the current data models are not very generic (see issues #83 and #103). Probably the best solution at the moment is to implement your own DataModel, or to rely on another framework already optimised for large datasets, such as RankSys (http://ranksys.org/). Thank you for your interest, and let us know what works for you so we can improve this in the future.
Hi all, thank you @abellogin for mentioning our framework. Indeed, by generalising your DataModel it would be easy to add more efficient implementations (or even to plug in RankSys' ones). Just to give you some perspective, I have measured the memory footprint of various implementations of RankSys' PreferenceData interface (equivalent to RiVal's DataModel) using the ML10M dataset. First, I created an implementation equivalent to the current DataModel (RiValPreferenceData). Then, I used the two publicly available implementations in RankSys 0.3 (SimplePreferenceData and SimpleFastPreferenceData). Finally, I am including the results of our RecSys'15 poster, which applies state-of-the-art compression techniques and whose implementations will be published (hopefully soon) in RankSys 0.4. The results are the following:
As you can see, there is ample room for improvement over the current approach of two nested maps. I am planning to publish these and other observations in a blog post once RankSys 0.4 has been released. One more thing: RankSys 0.4 will be released under a much more permissive license, which should allow its use in other projects without requiring them to be licensed under the GPL (as is currently the case).
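To make the nested-map overhead concrete: in a structure like Map&lt;Long, Map&lt;Long, Double&gt;&gt;, each rating typically costs on the order of 100 bytes on a 64-bit JVM (map entry plus boxed Long and Double), whereas primitive arrays need roughly 16 to 24 bytes per rating. The sketch below is purely illustrative and is not the actual RiVal DataModel or RankSys PreferenceData code; it only shows the kind of array-backed, index-mapped representation that more compact implementations build on.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only (not RiVal or RankSys code): external user/item IDs
 * are mapped once to dense int indices, and ratings are stored in parallel
 * primitive arrays, avoiding the per-entry overhead of nested maps.
 */
public class CompactRatingStore {

    private final Map<Long, Integer> userIndex = new HashMap<>();
    private final Map<Long, Integer> itemIndex = new HashMap<>();

    // One slot per rating, no boxing.
    private final int[] users;
    private final int[] items;
    private final double[] ratings;
    private int size = 0;

    public CompactRatingStore(int capacity) {
        users = new int[capacity];
        items = new int[capacity];
        ratings = new double[capacity];
    }

    /** Adds a rating, assigning dense indices to previously unseen users/items. */
    public void add(long userId, long itemId, double rating) {
        users[size] = userIndex.computeIfAbsent(userId, k -> userIndex.size());
        items[size] = itemIndex.computeIfAbsent(itemId, k -> itemIndex.size());
        ratings[size] = rating;
        size++;
    }

    public int numRatings() {
        return size;
    }
}
```

Sorting the triples by user and keeping per-user offsets (CSR-style) would then give fast per-user access without any per-rating objects at all.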
Thank you both, @abellogin and @saulvargas! I'll have a look at RankSys; I think it fits my needs. I will also take a look at your work "Analyzing Compression Techniques for In-Memory Collaborative Filtering" and hopefully make use of it in mine.
Hi @abellogin, if you want, I can send you an adaptation of the CrossValidationSplitter&lt;U, I&gt; that I developed. I named it CrossValidationSplitterIterative&lt;U, I&gt;: instead of caching all 5 folds in memory, I compute one fold at a time and write it straight to the corresponding file (test or train, respectively). Let me know if you want it.
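For reference, here is a minimal, hypothetical sketch of that idea (the class name, file layout, and CSV handling are illustrative, not André's actual CrossValidationSplitterIterative): each input line is assigned one held-out fold and is appended directly to that fold's test file and to the train files of all other folds, so no fold is ever kept in memory.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;
import java.util.stream.Stream;

/**
 * Hypothetical sketch of an "iterative" n-fold splitter: ratings are streamed
 * from the input file and written straight to per-fold train/test files,
 * instead of building all folds in memory first.
 */
public class IterativeSplitterSketch {

    public static void split(Path ratingsFile, int nFolds, long seed) throws IOException {
        Random rnd = new Random(seed);
        PrintWriter[] train = new PrintWriter[nFolds];
        PrintWriter[] test = new PrintWriter[nFolds];
        for (int f = 0; f < nFolds; f++) {
            train[f] = new PrintWriter(Files.newBufferedWriter(Paths.get("train_" + f + ".csv")));
            test[f] = new PrintWriter(Files.newBufferedWriter(Paths.get("test_" + f + ".csv")));
        }
        try (Stream<String> lines = Files.lines(ratingsFile)) {
            lines.forEach(line -> {
                // Each rating is held out in exactly one randomly chosen fold.
                int heldOut = rnd.nextInt(nFolds);
                for (int f = 0; f < nFolds; f++) {
                    (f == heldOut ? test[f] : train[f]).println(line);
                }
            });
        } finally {
            for (int f = 0; f < nFolds; f++) {
                train[f].close();
                test[f].close();
            }
        }
    }
}
```

PrintWriter swallows I/O errors, so a production version would want checkError() or a plain BufferedWriter, but it keeps the sketch short.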
Sure @afcarvalho1991, you can open a pull request, or upload it somewhere and paste the URL here. I think it would be useful to have an intermediate class that does not keep everything in memory for the n-fold case. Thank you!
Related to #60 since strategies take a lot of memory when loading from file. |
@afcarvalho1991, can you please open a pull request with your code? We'll see if we can merge it.
@alansaid Within the next two days, no problem. Edit:
That is great news! I assume it is because of the latest changes, which make use of the RankSys data representation. /cc @saulvargas
Hello, I pulled the latest version from the master branch, and I now have a local branch with my modifications. How can I open a pull request? Can you help me? My contribution is the implementation of an iterative CrossValidationSplitter and a working test, modified from CrossValidatedMahoutKNNRecommenderEvaluator to create CrossValidatedIterativeMahoutKNNRecommenderEvaluator. Also, I would like to let you know that I was unable to execute CrossValidatedMahoutKNNRecommenderEvaluator; perhaps you need to review this class, as there seems to be some sort of problem with the timestamps table. Thank you,
Hi André, I guess it depends on whether you forked the repository and made your changes there (the preferred case) or simply cloned the repository and made your changes on the base branch. Alex PS: I will check CrossValidatedMahoutKNNRecommenderEvaluator ASAP, thanks for noticing.
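For completeness, one common way to get from "local branch on a clone of the main repository" to a pull request (the repository URL and branch names below are placeholders):

```sh
# 1. Fork the repository on GitHub (web UI), then register your fork as a remote:
git remote add myfork https://github.com/<your-username>/rival.git

# 2. Push the local branch that contains your changes to your fork:
git push myfork my-feature-branch

# 3. On GitHub, open a pull request from <your-username>:my-feature-branch
#    against the original repository's master branch.
```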
Hello,
I'm using a large-scale dataset such as ML-10M or Netflix, and I find that the DataModel&lt;Long,Long&gt; object takes up too much space; in fact, I run out of memory even before everything is loaded into the DataModel&lt;Long,Long&gt; structure. I removed the timestamp variable from all samples, but it didn't do the trick.
Is it just me, or is this expected to happen? I have 16 GB of RAM, which should be more than enough to load a sparse matrix into memory, even for the Netflix "problem", which is a ~3 GB training set.
Thanks,
André