Replies: 2 comments 4 replies
-
I meant min_support, not min_confidence, in most cases, I think. I wrote this from memory, sorry.
-
Wow, thanks for sharing your experiences in such a comprehensive post!
Wow, this is huge! Definitely something we should look at in terms of improving the mlxtend functions here. Thanks for sharing.
Yeah, I can potentially see a "set size" parameter to facilitate that.
That's another neat thing to keep in mind. We could potentially have a parameter for that.
Yeah, I think there are some algorithmic restrictions that would require some very, very careful engineering to make this work. Definitely not trivial. Anyways, thanks so much for sharing! Personally, I am very interested in improving the current functions, but at the same time, I am also a bit swamped. I will definitely put them down on the issue list, though, and hope we can tackle this some time!
-
Dear Sebastian Raschka,
Thanks so much for your excellent mlxtend lib and its refreshingly defect-free code in fpgrowth et al. I recently finished developing a web-scale recommender web app using mlxtend, for discovering interesting subreddits at reddit.com. I thought the community might like to know what I discovered that made my web app run really fast. Fast is important because it enables interactive queries and cheaper Amazon EC2 instances!
The data pipeline: the raw inputs are JSON files from reddit.com, each holding one month of comments, including the subreddit name associated with each comment. Bash commands transform two such monthly JSON files into one giant text file of lines, where each line is the list of all subreddits one unique reddit user commented in during that month (contiguous duplicate subreddits are elided). That list of subreddits by one user comprises a "market basket transaction" for this recommender system, which uses association rule mining to recommend interesting new subreddits to anyone who types in the name of a subreddit they already know.

There seem to be a few weird users, maybe robots or admins, posting to 50 or 500 or more different subreddits in a month, but most users post to a handful of subreddits. Keeping them in the pattern mining does not yet seem to mess up the patterns I am finding, but I could eliminate them if needed to make the patterns closer to what normal humans find interesting.
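For the curious, here is a minimal sketch of the hand-off from the bash stage into mlxtend. The file name and the assumption that each line is a whitespace-separated list of subreddits are illustrative, not my exact pipeline code:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# One line per user; each line lists the subreddits that user commented in that month.
with open("user_baskets.txt") as f:           # illustrative file name
    transactions = [line.split() for line in f if line.strip()]

# One-hot encode the transactions into the True/False dataframe mlxtend expects.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
```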
In post-processing on the dataframe of the trained model output by association_rules(), deleting rows where the antecedent or consequent set size was greater than 3 dramatically improved query performance of the model dataframe, for obvious reasons: fewer rows!
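The filter itself is only a couple of lines; a minimal sketch, assuming rules is the dataframe returned by association_rules() (the cutoff of 3 is the value I used):

```python
# Keep only rules whose antecedent and consequent each contain at most 3 items.
max_items = 3
small_enough = (rules["antecedents"].apply(len) <= max_items) & \
               (rules["consequents"].apply(len) <= max_items)
rules = rules[small_enough].reset_index(drop=True)
```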
If such code were instead moved out of app post-processing and into the mlxtend lib, then training time would also be sped up for apps that do not benefit from longer antecedent or consequent sets.
Memory allocation during training would also be reduced, enabling even deeper mining and exploration for additional interesting patterns that may live at lower min_confidence thresholds. This is indeed the case in my app, because that is exactly where the newer, quirkier, and more niche subreddits live, the ones people hope to find beyond the usual familiar subreddits already shown on the main menu of news, worldnews, popular, AskReddit, and pics. In experiments I got as low as min_confidence = 0.0003, which used nearly all of my RAM and swap space but, critically, still managed to complete and save the trained model!
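For context, here is a sketch of the training step showing where those thresholds enter (the numeric values are illustrative, and onehot is the encoded dataframe from the sketch above). As far as I know, fpgrowth already accepts a max_len argument that caps the total itemset length, which indirectly bounds antecedent plus consequent sizes, although it is not the same as the separate per-side limits proposed here:

```python
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Mine frequent itemsets; max_len caps total itemset length (illustrative value).
frequent = fpgrowth(onehot, min_support=0.0003, use_colnames=True, max_len=6)

# Generate rules; min_threshold applies to the chosen metric (confidence here).
rules = association_rules(frequent, metric="confidence", min_threshold=0.0003)
```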
Alternatively, bigger datasets could be used as input while the confidence filter stays fixed at whatever level is best for the app, still training within the same amount of RAM, which would yield higher-quality pattern statistics.
The result is a very wide model dataframe, but it showed excellent net positive consequences in all respects except code complexity. Disk space is no problem at all. I actually divided the trained model into two dataframes, so the app queries just the 0/1 dataframe and then uses the indexes of the results to retrieve the same-indexed rows of the original model dataframe. It is blazingly fast to query this way. Pyarrow is apparently optimized to work well with 0/1 integer columns in queries; this was my hypothesis going in, and my experimental observations confirm it.
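A minimal sketch of the dual-dataframe idea, assuming rules comes from association_rules() as above; the column layout here is a simplified illustration, not my production code:

```python
import pandas as pd

# Wide 0/1 indicator frame: one column per subreddit, flagging membership
# in each rule's antecedent set.
items = sorted({i for s in rules["antecedents"] for i in s})
indicator = pd.DataFrame(
    {item: rules["antecedents"].apply(lambda s, item=item: int(item in s))
     for item in items},
    index=rules.index,
)

# Example query: rules whose antecedent contains "AskReddit"; the matching
# indexes then pull the full rows from the original rules frame.
hits = indicator.index[indicator["AskReddit"] == 1]
recommendations = rules.loc[hits, ["consequents", "confidence", "lift"]]
```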
My app's next post-processing step then deletes both frozenset columns, for faster loading of the trained model from disk.
My app uses the pyarrow/feather API within pandas to get the fastest possible queries on the trained model. I just load the dataframe using pd.read_feather(), and this alone seems to automatically invoke the pyarrow-backed versions of the pandas query functions during subsequent queries on the trained model. I can see multiple cores being utilized during queries now.
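A sketch of that feather round trip (file name illustrative). Feather cannot store frozenset columns, so this version keeps the consequent information as a plain string column before dropping the frozensets; that particular workaround is just one way to do it, not necessarily what my app does:

```python
import pandas as pd

slim = rules.copy()
# Keep consequent information in a feather-friendly string column.
slim["consequents_str"] = slim["consequents"].apply(lambda s: " ".join(sorted(s)))
slim = slim.drop(columns=["antecedents", "consequents"]).reset_index(drop=True)
slim.to_feather("rules_model.feather")

model = pd.read_feather("rules_model.feather")  # fast load at web-app startup
```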
I added a bit of interesting code so that query conditionals include and exclude exactly the correct columns when querying my very wide, customized, dual-dataframe trained model.
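In spirit it looks something like this sketch, building include/exclude masks over the indicator columns from the sketch above (the subreddit names are illustrative):

```python
include = ["AskReddit"]   # subreddits that must appear in the antecedent
exclude = ["pics"]        # subreddits that must not appear in the antecedent

mask = indicator[include].all(axis=1) & ~indicator[exclude].any(axis=1)
matched_rules = rules.loc[indicator.index[mask]]
```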
The sparse matrix option is of course being used here, because there are roughly 10^7 comments, 10^6 unique users, and 10^4 subreddits, if I recall correctly. No problems at all with the sparse versions; rock-solid code in the mlxtend lib all the way. :)
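For anyone at a similar scale, a sketch of the sparse path (transactions as in the first sketch; the support value is illustrative): TransactionEncoder can emit a SciPy sparse matrix, pandas can wrap it as a sparse dataframe, and fpgrowth accepts that directly.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

te = TransactionEncoder()
sparse_matrix = te.fit(transactions).transform(transactions, sparse=True)
onehot_sparse = pd.DataFrame.sparse.from_spmatrix(sparse_matrix,
                                                  columns=te.columns_)

frequent = fpgrowth(onehot_sparse, min_support=0.0003, use_colnames=True)
```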
The pyarrow/feather API has some kind of incompatibility with, or lack of support (yet) for, saving frozensets to disk, as I observed. I don't need that anymore because the design described above is working very well. In my early attempts it was horribly slow to run queries involving eval() of frozenset strings, and frozenset columns prepared as actual frozenset objects in advance of any queries, during post-processing on the whole trained model, simply could not be saved to disk by the feather/pyarrow subsystem in my experience (the text of the thrown error message confirmed the lack of support). On big models the multicore queries of pyarrow that I observed in my System Monitor are certainly desirable over plain pandas, and I would not want to give up this speed.
Ultimately this combination of software tricks and choices yielded an observed query speedup from 1.1 hours down to milliseconds (sub-second). I did not know in advance that it was possible, but I plowed ahead and am super happy. (It also reduced the on-disk size of the trained model to one-tenth of the original, helping the web app load from disk and use less RAM.)
Awesome mlxtend lib! Rock solid code you have here. I hope to read more of your code and learn more.
Hope this helps.
Geoffrey Anderson
P.S. I will add this web app to the Show and Tell category after I fix the permissions and virtual directory problems for the web app on the AWS EC2 instance.
tl;dr: Anyway, association_rules() is where the combinatorial growth of the search space happens (all combinations of items in a frequent pattern set), and that is exactly where, happily, the proposed optional upper limits on the lengths of antecedents and consequents would be super helpful on their own: they speed things up and reduce memory usage, and they also enable deeper mining or bigger input datasets, potentially offering higher-quality pattern results.