Match on module aliases for auto import suggestions #730

Merged
merged 14 commits into from
Jan 30, 2024

Conversation

MrBago
@MrBago MrBago commented Nov 7, 2023

Description

This PR adds a table of aliases for AutoImport to use. It joins the alias with the names table to find available modules with matching aliases.
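
As a rough illustration of the approach described above, here is a minimal sketch using Python's built-in sqlite3 module. The table layouts, the sample rows, and the helper names `build_demo_db` and `find_aliased_modules` are assumptions made for this example; they are not rope's actual schema or API.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE names "
        "(name TEXT, module TEXT, package TEXT, source INTEGER, type INTEGER)"
    )
    conn.execute("CREATE INDEX names_module ON names(module)")
    conn.execute("CREATE TABLE aliases (alias TEXT, module TEXT)")
    conn.executemany(
        "INSERT INTO names VALUES (?, ?, ?, ?, ?)",
        [
            ("array", "numpy", "numpy", 1, 0),
            ("zeros", "numpy", "numpy", 1, 0),
            ("ABC", "abc", "abc", 5, 7),
        ],
    )
    # "pd" points at a module that is NOT present in names.
    conn.executemany(
        "INSERT INTO aliases VALUES (?, ?)",
        [("np", "numpy"), ("pd", "pandas")],
    )
    return conn


def find_aliased_modules(conn, prefix):
    """Return aliases starting with `prefix` whose module also appears in names."""
    query = """
        SELECT DISTINCT aliases.alias, aliases.module, names.package
        FROM aliases INNER JOIN names ON aliases.module = names.module
        WHERE aliases.alias LIKE ? || '%'
        ORDER BY aliases.alias
    """
    return conn.execute(query, (prefix,)).fetchall()
```

In this sketch the inner join is what filters out aliases for modules that are not installed: an alias only shows up if its module has at least one row in names.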

This PR is in the initial draft to verify the approach and get feedback. It is still missing:

  • A way to specify a list of import aliases in a config file
  • Documentation updates

Fixes #712

Checklist (delete if not relevant):

  • I have added tests that prove my fix is effective or that my feature works
  • I have updated CHANGELOG.md
  • I have made corresponding changes to user documentation for new features

@tkrabel-db tkrabel-db left a comment

Overall, I like the direction! I have a minor concern about performance.

connection.execute("CREATE INDEX IF NOT EXISTS alias ON aliases(alias)")

modules = Query(
"(SELECT DISTINCT aliases.*, package, source, type FROM aliases INNER JOIN names on aliases.module = names.module)",

I'm not a DB expert, but the names table can hold 10,000 to 100,000 rows, so I wonder whether we should run this inner join on every autoimport request (which can happen on every keystroke when rope runs inside a language server).
Can we quickly test how much adding alias support slows down search?
Alternatively, I'd make sure aliases only contains aliases for modules that exist in names.

Author

The joins are pretty fast, I made a notebook to test it out.

The join time should be dominated by the Alias table, not the Names table, because the Names table has an index on the module column. Also, here we're including a where clause which makes the left side of the join even smaller. Most DB engines are pretty good about pushing down the filter past the join and sqlite3 seems to handle it well.
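
One way to check claims like this is to ask SQLite for its query plan. The following is a minimal sketch, assuming a simplified two-column layout for both tables; the helper name `explain` and the index name `names_module` are made up for this example, and the exact wording of the plan output differs between SQLite versions.

```python
import sqlite3


def explain(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of each plan row.
    return [row[-1] for row in rows]


# Assumption: simplified tables, not rope's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT, module TEXT)")
conn.execute("CREATE INDEX names_module ON names(module)")
conn.execute("CREATE TABLE aliases (alias TEXT, module TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS alias ON aliases(alias)")

plan = explain(
    conn,
    "SELECT aliases.alias, names.name "
    "FROM aliases INNER JOIN names ON aliases.module = names.module "
    "WHERE aliases.alias = ?",
    ("np",),
)
for step in plan:
    print(step)
```

A step that mentions SEARCH together with an index name suggests an index lookup, while a bare SCAN of a large table suggests a full table scan.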

I thought about this a little before testing out this implementation. I see 3 main paths forward:

  • The join approach
  • Materialize the availability information in the Aliases table as a column. We'd need to be careful to always update the Aliases table whenever the cache is updated. This would probably be the fastest approach, but more work.
  • Keep the aliases in memory as a list or dict. We'd basically be implementing the join logic manually, but it might be really fast if the # of Aliases is very small. Then again, if the # of Aliases is very small, the join should also be very fast.
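
For comparison, the third option (doing the join by hand in memory) could look roughly like the sketch below. The function name `match_aliases` and the idea of passing the available modules in as a set are assumptions for illustration; in practice the set would have to be loaded from the names table.

```python
from typing import Dict, List, Set, Tuple


def match_aliases(
    aliases: Dict[str, str], available_modules: Set[str], prefix: str
) -> List[Tuple[str, str]]:
    """Manual 'join': keep aliases that match the prefix and whose module is available."""
    return sorted(
        (alias, module)
        for alias, module in aliases.items()
        if alias.startswith(prefix) and module in available_modules
    )
```

This is a linear pass over the aliases, which is only attractive while the alias list stays small.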

@tkrabel what do you think?

The current approach has the benefit that we never have to do any updates on the aliases table. The names table is the source of truth for what is available to the user.
If you're happy with the performance, then let's go with the current approach.

Member

@MrBago thanks for testing the performance in a notebook. The notebook brings up something that is interesting/surprising to me: the module search_by_name_like query is much slower than I was expecting. A prefix search using an index should not have been that slow.

That is unrelated to this PR, so I've created a separate ticket for it, #736. With the fixed index, the Alias query should hopefully become faster as well. 883ms for an inner join between a large table and a very small table doesn't smell right to me; that seems to indicate a full table scan as well.

I'll see if I can fix this tomorrow. In the meantime, apologies, but I'll be holding off on merging this PR until that is fixed, and then we can see the new performance impact.

@lieryan
Member

lieryan commented Nov 9, 2023

@MrBago thanks for making this PR, from a quick look this looks great to me, I'm on a trip right now, so my availability to review this is quite limited for the next few weeks. Once I'm back, I'll look into this properly, but please continue the conversation for now.

@MrBago
Author

MrBago commented Dec 1, 2023

@lieryan I added the aliases to prefs so that they can be configured. When you have a minute, can you take a look at the PR? Also, I think I need to be given some kind of permission so that CI will run tests on my PRs.

Member

@lieryan lieryan left a comment

Thanks @MrBago, this looks good to me on a first pass. I am going to have to hold off on merging for now, for the reasons given in my inline comment on the join query, but once I've resolved that issue I'll take a second look at this.

models.Package.create_table(self.connection)
models.Metadata.create_table(self.connection)
self.add_aliases(self.project.prefs.import_aliases)
data = (
Member

So if I understand this correctly, this will add the aliases to the database only when the database is created. IIUC, picking up a preference change would then depend on the database being re-created.

Author

You're right. I didn't look at the different ways that prefs can change. We could add a method that clears and repopulates the aliases table, and invoke it when the prefs are updated.
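
A clear-and-repopulate method along those lines might look like the following minimal sketch. The function name `reset_aliases`, the two-column table layout, and the (alias, module) tuple order are assumptions for illustration, not the code that was merged.

```python
import sqlite3
from typing import Iterable, Tuple


def reset_aliases(conn: sqlite3.Connection, aliases: Iterable[Tuple[str, str]]) -> None:
    """Replace the whole contents of the aliases table in one transaction."""
    # Assumption: each tuple is (alias, module), e.g. ("np", "numpy").
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM aliases")
        conn.executemany(
            "INSERT INTO aliases (alias, module) VALUES (?, ?)", aliases
        )
```

Running the delete and the inserts in one transaction means a reader never sees a half-populated table.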

@MrBago
Author

MrBago commented Dec 20, 2023

@lieryan let me know if I can help with looking at the timing. One key thing to notice in my notebook is that I intentionally bloated the database to get the timing for an extreme case.

When doing the timing, one thing that I found odd was that including DISTINCT in the query seemed to make a bigger difference than I expected. I expected the database to be more efficient than Python at removing duplicates, but I found that removing the "DISTINCT" and using a set in Python was more efficient than pushing it down to the database :/.
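
The comparison described here can be reproduced in miniature with the sketch below. The table layout, the synthetic rows, and the function names are assumptions for this example; timings will vary by machine and SQLite version, so no particular outcome is implied.

```python
import sqlite3
import timeit


def modules_distinct_sql(conn):
    """Let SQLite remove duplicates with DISTINCT."""
    return {row[0] for row in conn.execute("SELECT DISTINCT module FROM names")}


def modules_distinct_python(conn):
    """Fetch every row and let a Python set remove duplicates."""
    return {row[0] for row in conn.execute("SELECT module FROM names")}


# Assumption: synthetic data, 10,000 rows spread over 50 module names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT, module TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(f"name{i}", f"module{i % 50}") for i in range(10_000)],
)

sql_time = timeit.timeit(lambda: modules_distinct_sql(conn), number=20)
py_time = timeit.timeit(lambda: modules_distinct_python(conn), number=20)
print(f"DISTINCT in SQL: {sql_time:.4f}s, set in Python: {py_time:.4f}s")
```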

@lieryan
Member

lieryan commented Jan 6, 2024

Hi @tkrabel-db, apologies for the delay. I got sidetracked because I didn't reach the performance goal I was expecting a few weeks ago when I initially experimented with fixing the index. But I tried again today with a fresh pair of eyes, and it now works as I expected, after I found a couple of silly errors in how I created the index in my original attempt. Now, after applying PR #739, this is more in line with the performance that I was expecting for these operations:

In [2]: from rope.base.project import Project
   ...: from rope.contrib.autoimport.defs import SearchResult
   ...: from rope.contrib.autoimport.sqlite import AutoImport
   ...: 
   ...: import os; os.makedirs('/tmp/bagoD/rope', exist_ok=True)
   ...: project = Project('/tmp/bagoD/rope')
   ...: autoimport = AutoImport(project, memory=False)
   ...: 
   ...: autoimport.generate_cache()  # Generates a cache of the local modules, from the project you're working on
   ...: autoimport.generate_modules_cache()  # Generates a cache of external modules

In [4]: import rope.contrib.autoimport.models as m

In [5]: %time aa = list(autoimport._execute(m.FinalQuery("SELECT * FROM names"), ()))
CPU times: user 23.2 ms, sys: 2.33 ms, total: 25.5 ms
Wall time: 25.1 ms

In [6]: autoimport._executemany(m.Name.objects.insert_into(), aa * 300)
Out[6]: <sqlite3.Cursor at 0x105251110>

In [7]: !du -sh /tmp/bagoD/
1.4G	/tmp/bagoD/

In [8]: %time set(autoimport._execute(m.Name.search_module_like.select_star(), ('abc',)))
    ...: 
CPU times: user 3.26 ms, sys: 2.43 ms, total: 5.7 ms
Wall time: 4.62 ms
Out[8]: 
{('ABC', 'abc', 'abc', 5, 7),
 ('abstractclassmethod', 'abc', 'abc', 5, 7),
 ('abstractmethod', 'abc', 'abc', 5, 3),
 ('abstractproperty', 'abc', 'abc', 5, 7),
 ('abstractstaticmethod', 'abc', 'abc', 5, 7)}

With just the old case-sensitive index, the LIKE operation would've taken more like 400ms.

If you update your PR to include the new index as well, I can review it again. I think you may need to add an index that looks like this for the alias table (untested):

connection.execute("CREATE INDEX IF NOT EXISTS aliases_alias_nocase ON aliases(alias COLLATE NOCASE)")
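
To see how such an index interacts with a prefix search, one could try something like the sketch below. The two-column table layout and the sample rows are assumptions for illustration. SQLite's LIKE is case-insensitive for ASCII by default, which is why a NOCASE index is the kind that can serve it; whether the planner actually picks the index depends on the SQLite build and settings, so the plan is printed rather than assumed.

```python
import sqlite3

# Assumption: simplified aliases table, not rope's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aliases (alias TEXT, module TEXT)")
conn.execute(
    "CREATE INDEX IF NOT EXISTS aliases_alias_nocase "
    "ON aliases(alias COLLATE NOCASE)"
)
conn.executemany(
    "INSERT INTO aliases VALUES (?, ?)",
    [("np", "numpy"), ("NP", "numpy"), ("pd", "pandas")],
)

sql = "SELECT alias, module FROM aliases WHERE alias LIKE 'np%'"

# Both "np" and "NP" match, because LIKE ignores ASCII case by default.
matches = sorted(conn.execute(sql).fetchall())

# The last column of each plan row holds the human-readable description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
for step in plan:
    print(step)
```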

@tkrabel-db

@lieryan thanks!
@MrBago this is unblocked

@@ -140,6 +140,22 @@ class Prefs:
"""),
)

import_aliases: List[Tuple[str, str]] = field(
Contributor

I'm adding an autoimport prefs table in #516 , can we move this there?

Author

@lieryan do you have a preference for where the "import_aliases" option goes? I tried moving it, but I was having trouble with the nested Prefs. Specifically, I wasn't able to set the prefs for testing here.

@MrBago MrBago changed the title [WIP] Add aliases to AutoImport search Match on module aliases for autoimport suggestions Jan 26, 2024
@MrBago MrBago changed the title Match on module aliases for autoimport suggestions Match on module aliases for auto import suggestions Jan 26, 2024
Bago Amirbekian added 2 commits January 25, 2024 18:10
@MrBago
Author

MrBago commented Jan 26, 2024

@lieryan updated my PR and added the new index. The aliases table should be small so I'm not sure how much of an impact this index will have, but it shouldn't hurt. I tried to optimize the alias query a bit, but I wasn't really able to move the needle. Let me know if you think using an alias table and join like this might be an issue.

@MrBago MrBago requested review from lieryan and tkrabel-db January 26, 2024 21:13
@MrBago
Author

MrBago commented Jan 29, 2024

@lieryan When you have a few minutes, can you take a look at this PR? I have some time and would love to move this across the finish line.

@tkrabel-db

@lieryan can you prioritize this work so that we have closure? :)

@lieryan
Member

lieryan commented Jan 30, 2024

Thanks @MrBago for implementing this PR and @tkrabel-db, @bagel897 for contributing to the discussions.

I've made some changes to the preferences to align the autoimport preferences changes with #516.

@all-contributors add @MrBago for code

Contributor

@lieryan

@MrBago already contributed before to code

@lieryan lieryan merged commit e264c6f into python-rope:master Jan 30, 2024
18 checks passed
@lieryan lieryan added this to the 1.13.0 milestone Mar 24, 2024
Development

Successfully merging this pull request may close these issues.

Support aliases in rope_autoimport
4 participants