thescalaguy/detectpii

🔍 Detect PII

Detect PII is a library, inspired by piicatcher and CommonRegex, that detects table columns likely to contain PII. It runs regex matches against column names and column values and flags the columns that match.
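
The idea of matching on both column names and column values can be illustrated with a minimal, standalone sketch. This is not the library's implementation: the patterns, the `flag_columns` helper, and the sample table are all hypothetical and exist only to show the technique.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the patterns
# that detectpii ships with.
COLUMN_NAME_PATTERNS = {
    "email": re.compile(r"e?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
}


def match_column_name(name):
    """Return the first PII type whose pattern matches the column name, else None."""
    for pii_type, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(name):
            return pii_type
    return None


def match_column_values(values):
    """Return the first PII type whose pattern matches any sampled string value, else None."""
    for pii_type, pattern in VALUE_PATTERNS.items():
        if any(isinstance(v, str) and pattern.match(v) for v in values):
            return pii_type
    return None


def flag_columns(table):
    """Map each flagged column to a PII type.

    `table` is a dict of column name -> list of sampled values.
    The column name is checked first, then the sampled values.
    """
    flagged = {}
    for column, values in table.items():
        pii_type = match_column_name(column) or match_column_values(values)
        if pii_type:
            flagged[column] = pii_type
    return flagged


# Hypothetical sample data standing in for rows read from a table.
sample = {
    "id": [1, 2],
    "contact": ["alice@example.com", "bob@example.org"],
    "user_email": ["x", "y"],
}
result = flag_columns(sample)
```

Here `user_email` is flagged because of its name and `contact` because of its values, which is why scanning both metadata and data can catch columns that either check alone would miss.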

Usage

Installation

Install the package with the extra for your database or warehouse, e.g.:

pip install "detectpii[postgres]"

See the "Supported databases / warehouses" section for the full list of extras.

Scan tables for PII

from detectpii.catalog import PostgresCatalog
from detectpii.pipeline import PiiDetectionPipeline
from detectpii.scanner import DataScanner, MetadataScanner
from detectpii.util import print_columns

# -- Create a catalog to connect to a database / warehouse
pg_catalog = PostgresCatalog(
    host="localhost",
    user="postgres",
    password="my-secret-pw",
    database="postgres",
    port=5432,
    schema="public"
)

# -- Create a pipeline to detect PII in tables using an English dictionary
pipeline = PiiDetectionPipeline(
    catalog=pg_catalog,
    scanners=[
        MetadataScanner(),
        DataScanner(),
    ],
    times=1,
    percentage=20,
)

# -- Scan for PII columns.
pii_columns = pipeline.scan()

# -- Print them to the console
print_columns(pii_columns)

Persist the pipeline

import json
from detectpii.pipeline import pipeline_to_dict

# -- Create a pipeline
pipeline = ...

# -- Convert it into a dictionary
dictionary = pipeline_to_dict(pipeline)

# -- Print it
print(json.dumps(dictionary, indent=4))

# {
#     "catalog": {
#         "tables": [],
#         "resolver": {
#             "name": "PlaintextResolver",
#             "_type": "PlaintextResolver"
#         },
#         "user": "postgres",
#         "password": "my-secret-pw",
#         "host": "localhost",
#         "port": 5432,
#         "database": "postgres",
#         "schema": "public",
#         "_type": "PostgresCatalog"
#     },
#     "scanners": [
#         {
#             "_type": "MetadataScanner"
#         },
#         {
#             "_type": "DataScanner"
#         }
#     ],
#     "times": 1,
#     "percentage": 10
# }

Load the pipeline

from detectpii.pipeline import dict_to_pipeline

# -- Load the persisted pipeline as a dictionary
dictionary: dict = ...

# -- Convert it back to a pipeline object
pipeline = dict_to_pipeline(dictionary=dictionary)
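
Since the persisted form is a plain dictionary, one way to store it between runs is a JSON file. The sketch below uses only the standard library; `save_pipeline_dict`, `load_pipeline_dict`, and the file name are hypothetical helpers, and the `example` dictionary is a stand-in for the output of `pipeline_to_dict(pipeline)`.

```python
import json
import tempfile
from pathlib import Path


def save_pipeline_dict(dictionary, path):
    """Write the dictionary to `path` as indented JSON."""
    Path(path).write_text(json.dumps(dictionary, indent=4), encoding="utf-8")


def load_pipeline_dict(path):
    """Read the JSON file at `path` back into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Stand-in for the result of pipeline_to_dict(pipeline); trimmed for brevity.
example = {
    "scanners": [{"_type": "MetadataScanner"}, {"_type": "DataScanner"}],
    "times": 1,
    "percentage": 10,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline.json"  # hypothetical file name
    save_pipeline_dict(example, path)
    restored = load_pipeline_dict(path)
```

The restored dictionary can then be passed to `dict_to_pipeline(dictionary=restored)`. Note that the example output in the previous section includes the database password in plaintext, so a file written this way should be stored with appropriate access restrictions.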

For more detailed documentation, please see the docs folder.

Supported databases / warehouses

| Database / Warehouse | Package              |
|----------------------|----------------------|
| Hive                 | detectpii[hive]      |
| Postgres             | detectpii[postgres]  |
| Snowflake            | detectpii[snowflake] |
| Trino                | detectpii[trino]     |
| Yugabyte             | detectpii[yugabyte]  |
| BigQuery             | detectpii[bigquery]  |

Available languages

The following languages are available for metadata detection:

| Language | Detector                       |
|----------|--------------------------------|
| English  | EnglishColumnNameRegexDetector |
| Spanish  | SpanishColumnNameRegexDetector |
