Replies: 2 comments 4 replies
-
I meant min_support, not min_confidence, in most cases, I think. I wrote this from memory, sorry.
-
Wow, thanks for sharing your experiences in such a comprehensive post!
Wow, this is huge! Definitely something we should look at in terms of improving the mlxtend functions here. Thanks for sharing.
Yeah, I can potentially see a "set size" parameter to facilitate that.
That's another neat thing to keep in mind. We could potentially have a parameter for that.
Yeah, I think there are some algorithmic restrictions that would require some very, very careful engineering to make this work. Definitely not trivial. Anyways, thanks so much for sharing! Personally, I am very interested in improving the current functions, but at the same time, I am also a bit swamped. I will definitely put them down on the issue list, though, and hope we can tackle this some time!
-
Dear Sebastian Raschka,
Thanks so much for your excellent mlxtend lib and its refreshingly defect-free code in fpgrowth et al. I recently finished developing a web-scale recommender web app using mlxtend, for discovering interesting subreddits at reddit.com. I thought the community might like to know what I discovered that made my web app run really fast. Fast is important because it enables interactive queries and cheaper Amazon EC2 instances!
The data pipeline: the raw inputs are JSON files from reddit.com, each holding one month of comments, including the subreddit name associated with each comment. Bash commands transform two such monthly JSON files into one giant text file of lines, where each line is the list of all subreddits one unique reddit user commented in during that month (contiguous duplicate subreddits are elided). That list of subreddits by one user comprises a "market basket transaction" for this recommender system, which uses association rule mining to recommend interesting new subreddits to anyone who types in the name of a subreddit they already know.

There seem to be a few weird users, maybe robots or admins, posting to 50 or 500 or more different subreddits in a month, but most users post to a handful of subreddits. Keeping them in the pattern mining does not yet seem to mess up the patterns I am finding, but I could eliminate them if needed to make the patterns closer to what normal humans find interesting.
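For the curious, here is a minimal sketch of the hand-off from the bash stage into mlxtend. The file name and the assumption that each line is a whitespace-separated list of subreddits are illustrative, not my exact pipeline code:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# One line per user; each line lists the subreddits that user commented in that month.
with open("user_baskets.txt") as f:           # illustrative file name
    transactions = [line.split() for line in f if line.strip()]

# One-hot encode the transactions into the True/False dataframe mlxtend expects.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
```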
In post-processing on the dataframe of the trained model output by association_rules(), deleting rows where the antecedent or consequent set size was greater than 3 dramatically improved query performance of the model dataframe, for obvious reasons: fewer rows!
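The filter itself is only a couple of lines; a minimal sketch, assuming rules is the dataframe returned by association_rules() (the cutoff of 3 is the value I used):

```python
# Keep only rules whose antecedent and consequent each contain at most 3 items.
max_items = 3
small_enough = (rules["antecedents"].apply(len) <= max_items) & \
               (rules["consequents"].apply(len) <= max_items)
rules = rules[small_enough].reset_index(drop=True)
```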
If such code were instead moved out of app post-processing and into the mlxtend lib, then training time would also be sped up for apps that do not benefit from longer antecedent or consequent sets.
Memory allocation during training would also be reduced, enabling even deeper mining and exploration for additional interesting patterns that may live at lower min_confidence thresholds. This is indeed the case in my app, because that is exactly where the newer, quirkier, and more niche subreddits live, the ones people hope to find beyond the usual familiar subreddits already shown on the main menu of news, worldnews, popular, AskReddit, and pics. In experiments I got as low as min_confidence = 0.0003, which used nearly all of my RAM and swap space but, critically, still managed to complete and save the trained model!
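For context, here is a sketch of the training step showing where those thresholds enter (the numeric values are illustrative, and onehot is the encoded dataframe from the sketch above). As far as I know, fpgrowth already accepts a max_len argument that caps the total itemset length, which indirectly bounds antecedent plus consequent sizes, although it is not the same as the separate per-side limits proposed here:

```python
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Mine frequent itemsets; max_len caps total itemset length (illustrative value).
frequent = fpgrowth(onehot, min_support=0.0003, use_colnames=True, max_len=6)

# Generate rules; min_threshold applies to the chosen metric (confidence here).
rules = association_rules(frequent, metric="confidence", min_threshold=0.0003)
```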
Alternatively, bigger datasets could be used as input while the confidence filter stays fixed at whatever level is best for the app, still training within the same amount of RAM, which would yield higher-quality pattern statistics.
The result is a very wide model dataframe, but it showed excellent net positive consequences in all respects except code complexity. Disk space is no problem at all. I actually divided the trained model into two dataframes, so the app queries just the 0/1 dataframe and then uses the indexes of the results to retrieve the same-indexed rows of the original model dataframe. It is blazingly fast to query this way. Pyarrow is apparently optimized to work well with 0/1 integer columns in queries; this was my hypothesis going in, and my experimental observations confirm it.
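A minimal sketch of the dual-dataframe idea, assuming rules comes from association_rules() as above; the column layout here is a simplified illustration, not my production code:

```python
import pandas as pd

# Wide 0/1 indicator frame: one column per subreddit, flagging membership
# in each rule's antecedent set.
items = sorted({i for s in rules["antecedents"] for i in s})
indicator = pd.DataFrame(
    {item: rules["antecedents"].apply(lambda s, item=item: int(item in s))
     for item in items},
    index=rules.index,
)

# Example query: rules whose antecedent contains "AskReddit"; the matching
# indexes then pull the full rows from the original rules frame.
hits = indicator.index[indicator["AskReddit"] == 1]
recommendations = rules.loc[hits, ["consequents", "confidence", "lift"]]
```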
My app's next post-processing step then deletes both frozenset columns, for faster loading of the trained model from disk.
My app uses the pyarrow/feather API within pandas to get the fastest possible queries on the trained model. I just load the dataframe using pd.read_feather(), and this alone seems to automatically invoke the pyarrow-backed versions of the pandas query functions during subsequent queries on the trained model. I can see multiple cores being utilized during queries now.
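A sketch of that feather round trip (file name illustrative). Feather cannot store frozenset columns, so this version keeps the consequent information as a plain string column before dropping the frozensets; that particular workaround is just one way to do it, not necessarily what my app does:

```python
import pandas as pd

slim = rules.copy()
# Keep consequent information in a feather-friendly string column.
slim["consequents_str"] = slim["consequents"].apply(lambda s: " ".join(sorted(s)))
slim = slim.drop(columns=["antecedents", "consequents"]).reset_index(drop=True)
slim.to_feather("rules_model.feather")

model = pd.read_feather("rules_model.feather")  # fast load at web-app startup
```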
I added a bit of interesting code so that query conditionals include and exclude exactly the correct columns when querying my very wide, customized, dual-dataframe trained model.
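In spirit it looks something like this sketch, building include/exclude masks over the indicator columns from the sketch above (the subreddit names are illustrative):

```python
include = ["AskReddit"]   # subreddits that must appear in the antecedent
exclude = ["pics"]        # subreddits that must not appear in the antecedent

mask = indicator[include].all(axis=1) & ~indicator[exclude].any(axis=1)
matched_rules = rules.loc[indicator.index[mask]]
```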
The sparse matrix option is of course being used here, because there are roughly 10^7 comments, 10^6 unique users, and 10^4 subreddits, if I recall correctly. No problems at all with the sparse versions; rock-solid code in the mlxtend lib all the way. :)
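For anyone at a similar scale, a sketch of the sparse path (transactions as in the first sketch; the support value is illustrative): TransactionEncoder can emit a SciPy sparse matrix, pandas can wrap it as a sparse dataframe, and fpgrowth accepts that directly.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

te = TransactionEncoder()
sparse_matrix = te.fit(transactions).transform(transactions, sparse=True)
onehot_sparse = pd.DataFrame.sparse.from_spmatrix(sparse_matrix,
                                                  columns=te.columns_)

frequent = fpgrowth(onehot_sparse, min_support=0.0003, use_colnames=True)
```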
The pyarrow/feather API has some kind of incompatibility with, or lack of support (yet) for, saving frozensets to disk, as I observed. I don't need that anymore because the design described above is working very well. In my early attempts it was horribly slow to run queries involving eval() of frozenset strings, and frozenset columns prepared as actual frozenset objects in advance of any queries, during post-processing on the whole trained model, simply could not be saved to disk by the feather/pyarrow subsystem in my experience (the text of the thrown error message confirmed the lack of support). On big models the multicore queries of pyarrow that I observed in my System Monitor are certainly desirable over plain pandas, and I would not want to give up this speed.
Ultimately this combination of software tricks and choices yielded an observed query speedup from 1.1 hours down to milliseconds (sub-second). I did not know in advance that it was possible, but I plowed ahead and am super happy. (It also reduced the on-disk size of the trained model to one-tenth of the original, helping the web app load from disk and use less RAM.)
Awesome mlxtend lib! Rock solid code you have here. I hope to read more of your code and learn more.
Hope this helps.
Geoffrey Anderson
P.S. I will add this web app to the Show and Tell category after I fix the permissions and virtual directory problems for the web app on the AWS EC2 instance.
tl;dr: Anyway, association_rules() is where the combinatorial growth of the search space happens (all combinations of items in a frequent pattern set), and that is exactly where, happily, the proposed optional upper limits on the lengths of antecedents and consequents would be super helpful on their own: they speed things up and reduce memory usage, and they also enable deeper mining or bigger input datasets, potentially offering higher-quality pattern results.