DuckDB extension for parsing, extracting, and analyzing domains, URIs, and paths with ease.

hatamiarash7/duckdb-netquack

DuckDB Netquack Extension

This extension is designed to simplify working with domains, URIs, and web paths directly within your database queries. Whether you're extracting top-level domains (TLDs), parsing URI components, or analyzing web paths, Netquack provides a suite of intuitive functions to handle all your network tasks efficiently. Built for data engineers, analysts, and developers.

With Netquack, you can unlock deeper insights from your web-related datasets without the need for external tools or complex workflows.

Installation 🚀

netquack is distributed as a DuckDB Community Extension and can be installed using SQL:

INSTALL netquack FROM community;
LOAD netquack;

If you previously installed the netquack extension, upgrade it with FORCE INSTALL:

FORCE INSTALL netquack FROM community;
LOAD netquack;
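
To confirm that the extension is installed and loaded, one option is to query DuckDB's built-in duckdb_extensions() table function. This is a minimal sketch that relies only on DuckDB itself, not on anything Netquack-specific:

```sql
-- Look up the netquack entry in DuckDB's extension catalog.
-- After INSTALL and LOAD, both flags are expected to be true.
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'netquack';
```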

Usage Examples 📚

Once installed, the macro functions provided by the extension can be used just like built-in functions.

Extracting The Main Domain

This function extracts the main domain from a URL. To do so, the extension fetches the public suffix list from publicsuffix.org and uses it to decide where the main domain begins.

The public suffix list is downloaded automatically the first time the function is called. After that, the list is stored in the public_suffix_list table, so it is not downloaded again.

D SELECT extract_domain('a.example.com') as domain;
┌─────────────┐
│   domain    │
│   varchar   │
├─────────────┤
│ example.com │
└─────────────┘

D SELECT extract_domain('https://b.a.example.com/path') as domain;
┌─────────────┐
│   domain    │
│   varchar   │
├─────────────┤
│ example.com │
└─────────────┘
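
The same function can be applied to a column rather than a literal, for example to count URLs per main domain. The sketch below uses a hypothetical inline list of URLs named visits, and assumes extract_domain accepts a column argument the same way it accepts a string literal:

```sql
-- Hypothetical sample data; replace the VALUES list with your own table.
SELECT extract_domain(url) AS domain, count(*) AS hits
FROM (VALUES
    ('https://b.a.example.com/path'),
    ('a.example.com'),
    ('https://example.com/other')
) AS visits(url)
GROUP BY domain
ORDER BY hits DESC;
```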

You can use the update_suffixes function to update the public suffix list manually.

D SELECT update_suffixes();
┌───────────────────┐
│ update_suffixes() │
│      varchar      │
├───────────────────┤
│ updated           │
└───────────────────┘

Extracting The Path

This function extracts the path from a URL.

D SELECT extract_path('https://b.a.example.com/path/path') as path;
┌────────────┐
│    path    │
│  varchar   │
├────────────┤
│ /path/path │
└────────────┘

D SELECT extract_path('example.com/path/path/image.png') as path;
┌──────────────────────┐
│         path         │
│       varchar        │
├──────────────────────┤
│ /path/path/image.png │
└──────────────────────┘

Extracting The Host

This function extracts the host from a URL.

D SELECT extract_host('https://b.a.example.com/path/path') as host;
┌─────────────────┐
│      host       │
│     varchar     │
├─────────────────┤
│ b.a.example.com │
└─────────────────┘

D SELECT extract_host('example.com:443/path/image.png') as host;
┌─────────────┐
│    host     │
│   varchar   │
├─────────────┤
│ example.com │
└─────────────┘

Extracting The Schema

This function extracts the schema (the URI scheme) from a URL. Currently supported schemas:

  • http | https
  • ftp
  • mailto
  • tel | sms

D SELECT extract_schema('https://b.a.example.com/path/path') as schema;
┌─────────┐
│ schema  │
│ varchar │
├─────────┤
│ https   │
└─────────┘

D SELECT extract_schema('mailto:[email protected]') as schema;
┌─────────┐
│ schema  │
│ varchar │
├─────────┤
│ mailto  │
└─────────┘

D SELECT extract_schema('tel:+123456789') as schema;
┌─────────┐
│ schema  │
│ varchar │
├─────────┤
│ tel     │
└─────────┘
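
The schema, host, and path functions can be combined in one query to break each URL into its parts. This is a minimal sketch over a hypothetical inline list of URLs; it assumes each function accepts a column argument in the same way as a literal:

```sql
-- Hypothetical sample data; t(url) stands in for a real table of URLs.
SELECT
    url,
    extract_schema(url) AS schema,
    extract_host(url)   AS host,
    extract_path(url)   AS path
FROM (VALUES
    ('https://b.a.example.com/path/path'),
    ('example.com:443/path/image.png')
) AS t(url);
```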

Extracting The Query

This function extracts the query string from a URL.

D SELECT extract_query_string('example.com?key=value') as query;
┌───────────┐
│   query   │
│  varchar  │
├───────────┤
│ key=value │
└───────────┘

D SELECT extract_query_string('http://example.com.ac/path/?a=1&b=2&') as query;
┌──────────┐
│  query   │
│ varchar  │
├──────────┤
│ a=1&b=2& │
└──────────┘
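
The roadmap lists a dedicated table function for query parameters as future work. In the meantime, a query string can be split into key-value rows with DuckDB's built-in string_split, unnest, and split_part functions. The sketch below is a workaround, not part of the extension; it assumes extract_query_string accepts a column argument, and it keeps only the text before a second = in a value:

```sql
-- Split each query string on '&', then split each pair on '='.
WITH pairs AS (
    SELECT
        url,
        unnest(string_split(extract_query_string(url), '&')) AS pair
    FROM (VALUES
        ('http://example.com.ac/path/?a=1&b=2&')
    ) AS t(url)
)
SELECT
    url,
    split_part(pair, '=', 1) AS key,
    split_part(pair, '=', 2) AS value
FROM pairs
WHERE pair <> '';  -- drop the empty piece left by a trailing '&'
```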

Extracting The TLD (Top-Level Domain)

This function extracts the top-level domain from a URL. This function will use the public suffix list to extract the TLD. Check the Extracting The Main Domain section for more information about the public suffix list.

D SELECT extract_tld('https://example.com.ac/path/path') as tld;
┌─────────┐
│   tld   │
│ varchar │
├─────────┤
│ com.ac  │
└─────────┘

D SELECT extract_tld('a.example.com') as tld;
┌─────────┐
│   tld   │
│ varchar │
├─────────┤
│ com     │
└─────────┘

Extracting The Sub Domain

This function extracts the subdomain from a URL. It relies on the public suffix list to identify the TLD. Check the Extracting The Main Domain section for more information about the public suffix list.

D SELECT extract_subdomain('http://a.b.example.com/path') as dns_record;
┌────────────┐
│ dns_record │
│  varchar   │
├────────────┤
│ a.b        │
└────────────┘

D SELECT extract_subdomain('test.example.com.ac') as dns_record;
┌────────────┐
│ dns_record │
│  varchar   │
├────────────┤
│ test       │
└────────────┘
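
Because the subdomain, main domain, and TLD functions all rely on the same public suffix list, they can be used side by side to decompose a host name. A minimal sketch over hypothetical sample URLs, assuming the functions accept a column argument:

```sql
-- Hypothetical sample data; t(url) stands in for a real table of URLs.
SELECT
    url,
    extract_subdomain(url) AS subdomain,
    extract_domain(url)    AS domain,
    extract_tld(url)       AS tld
FROM (VALUES
    ('http://a.b.example.com/path'),
    ('test.example.com.ac')
) AS t(url);
```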

Get Tranco Rank

Update Tranco List

The extension can return the Tranco rank of a domain. Use the update_tranco function to update the Tranco list manually.

D SELECT update_tranco(true);
┌─────────────────────────────────────┐
│ update_tranco(CAST('f' AS BOOLEAN)) │
│               varchar               │
├─────────────────────────────────────┤
│ Tranco list updated                 │
└─────────────────────────────────────┘

This function will get the latest Tranco list and save it into the tranco_list table. There will be a tranco_lit_%Y-%m-%d.csv file in the current directory after the function is called. The extension will use this file to prevent downloading the list again.

You can ignore the file and force the extension to download the list again by calling the function with true as a parameter. If you don't want to download the list again, you can call the function with false as a parameter.

D SELECT update_tranco(false);

Because the latest Tranco list is dated the previous day, you can also download a list manually and rename it to tranco_lit_%Y-%m-%d.csv so that the extension uses it.

Get Tranco Ranking

You can use this function to get the ranking of a domain:

D SELECT get_tranco_rank('microsoft.com') as rank;
┌───────┐
│ rank  │
│ int32 │
├───────┤
│     2 │
└───────┘

D SELECT get_tranco_rank('cloudflare.com') as rank;
┌───────┐
│ rank  │
│ int32 │
├───────┤
│    13 │
└───────┘
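
Ranks can be grouped into coarse buckets with a plain CASE expression. The sketch below reuses the category names from the roadmap and is not a built-in feature. It assumes the Tranco list has already been loaded with update_tranco, that get_tranco_rank accepts a column argument, and that a domain missing from the list yields NULL (the README does not state this):

```sql
-- Compute the rank once, then bucket it.
WITH ranked AS (
    SELECT domain, get_tranco_rank(domain) AS tranco_rank
    FROM (VALUES ('microsoft.com'), ('cloudflare.com')) AS d(domain)
)
SELECT
    domain,
    tranco_rank,
    CASE
        WHEN tranco_rank IS NULL     THEN 'unranked'  -- assumed behavior
        WHEN tranco_rank <= 1000     THEN 'top1k'
        WHEN tranco_rank <= 5000     THEN 'top5k'
        WHEN tranco_rank <= 10000    THEN 'top10k'
        WHEN tranco_rank <= 100000   THEN 'top100k'
        WHEN tranco_rank <= 500000   THEN 'top500k'
        ELSE 'top1m'
    END AS rank_category
FROM ranked;
```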

Get Extension Version

You can use the netquack_version function to get the version of the extension.

D select * from netquack_version();
┌─────────┐
│ version │
│ varchar │
├─────────┤
│ v1.1.0  │
└─────────┘

Roadmap 🗺️

  • Create a TableFunction for extract_query_parameters that returns each key-value pair as a row.
  • Save Tranco data as Parquet
  • Create Rank category for Tranco (top1k, top5k, top10k, top100k, top500k, top1m)
  • Implement GeoIP functionality
  • Add new functions to work with IPs
  • Return default value for get_tranco_rank

Contributing 🤝

Don't be shy; reach out to us if you want to contribute 😉

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request

Issues 🐛

Every project may have problems. You can contribute to the development of this project by reporting them. 👍