feat(heuristics): add Fake Email analyzer to validate maintainer email domain #1106

AmineRaouane · 2025-06-16T21:21:18Z

Summary

This PR adds a new heuristic analyzer called FakeEmailAnalyzer. It verifies the validity of maintainer email addresses listed in a PyPI package by checking both the format and the existence of MX records for their domains. This helps detect packages with fake or throwaway emails, which are often indicative of malicious intent.

Description of changes

Implemented FakeEmailAnalyzer that:
- Validates email format using a regex.
- Verifies the existence of MX records for the email domain via DNS resolution.
Updated detect_malicious_metadata_check.py to include and invoke this new analyzer.
The analyzer handles DNS errors and skips analysis if no email is present.
The rationale for combining quickUndetailed with failed(Heuristics.FAKE_EMAIL.value) is that a package rushed onto a platform by someone using a fake email address suggests an actor who is trying to distribute it quickly while obscuring their identity and avoiding investigation.
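The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `HeuristicResult` enum here is a stand-in, `EMAIL_PATTERN`, `extract_domain`, `has_mx_records`, and `analyze_email` are hypothetical names, and treating any DNS error as "no MX records" is an assumption. The MX lookup is passed in as a callable so the sketch runs without network access.

```python
import re
from enum import Enum
from typing import Callable, Optional, Sequence


class HeuristicResult(Enum):
    """Stand-in for the project's result type (hypothetical definition)."""

    PASS = "PASS"
    FAIL = "FAIL"
    SKIP = "SKIP"


# Loose format check only: something@domain.tld with no whitespace.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@([^@\s]+\.[^@\s]+)$")

# A lookup takes a domain and returns its MX hosts; it may raise on DNS errors.
MxLookup = Callable[[str], Sequence[str]]


def extract_domain(email: str) -> Optional[str]:
    """Return the domain part if the address passes the format check."""
    match = EMAIL_PATTERN.match(email.strip())
    return match.group(1) if match else None


def has_mx_records(domain: str, lookup: MxLookup) -> bool:
    """Return True if the lookup yields at least one MX host."""
    try:
        return len(lookup(domain)) > 0
    except Exception:  # assumption: any DNS error counts as "no MX records"
        return False


def analyze_email(email: Optional[str], lookup: MxLookup) -> HeuristicResult:
    """Skip when no email is present, fail on bad format or missing MX."""
    if not email:
        return HeuristicResult.SKIP
    domain = extract_domain(email)
    if domain is None or not has_mx_records(domain, lookup):
        return HeuristicResult.FAIL
    return HeuristicResult.PASS
```

With dnspython installed, `lookup` could be something like `lambda d: [str(r.exchange) for r in dns.resolver.resolve(d, "MX")]`, while tests can pass a plain function that returns a fixed list.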

Related issues

None

Checklist

I have reviewed the contribution guide.
My PR title and commits follow the Conventional Commits convention.
My commits include the "Signed-off-by" line.
I have signed my commits following the instructions provided by GitHub. Note that we run GitHub's commit verification tool to check the commit signatures. A green verified label should appear next to all of your commits on GitHub.
I have updated the relevant documentation, if applicable.
I have tested my changes and verified they work as expected.

benmss · 2025-07-09T01:50:10Z

src/macaron/malware_analyzer/README.md

@@ -56,6 +56,11 @@ When a heuristic fails, with `HeuristicResult.FAIL`, then that is an indicator b
    - **Description**:  Checks if the package name is suspiciously similar to any package name in a predefined list of popular packages. The similarity check incorporates the Jaro-Winkler distance and considers keyboard layout proximity to identify potential typosquatting.
    - **Rule**: Return `HeuristicResult.FAIL` if the similarity ratio between the package name and any popular package name meets or exceeds a defined threshold; otherwise, return `HeuristicResult.PASS`.
    - **Dependency**: None.
+
+11. **Fake Email**
+    - **Description**:  Checks if the package maintainer or author has a suspicious or invalid email .


Suggested change

- **Description**: Checks if the package maintainer or author has a suspicious or invalid email .

- **Description**: Checks if the package maintainer or author has a suspicious or invalid email.

benmss · 2025-07-09T01:50:38Z

src/macaron/malware_analyzer/README.md

+
+11. **Fake Email**
+    - **Description**:  Checks if the package maintainer or author has a suspicious or invalid email .
+    - **Rule**: Return `HeuristicResult.FAIL` if the email format is invalid or the email domain has no MX records ; otherwise, return `HeuristicResult.PASS`.


Suggested change

- **Rule**: Return `HeuristicResult.FAIL` if the email format is invalid or the email domain has no MX records ; otherwise, return `HeuristicResult.PASS`.

- **Rule**: Return `HeuristicResult.FAIL` if the email format is invalid or the email domain has no MX records; otherwise, return `HeuristicResult.PASS`.

benmss · 2025-07-09T02:13:59Z

src/macaron/malware_analyzer/pypi_heuristics/metadata/fake_email.py

+            depends_on=None,
+        )
+
+    def is_valid_email(self, email: str) -> bool:


Validating email addresses is a complex task. What we have here is more like a sanity check, verifying that the address is vaguely of the right format. That may be enough for the purpose of this check, in which case this method should be renamed and re-documented to make that clear. Alternatively, if we do really want to ensure that email addresses are valid, this method will need to be expanded considerably. @behnazh-w
See the top two answers on this Stack Overflow thread for more information: https://stackoverflow.com/questions/2049502/what-characters-are-allowed-in-an-email-address
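To illustrate why a simple pattern is only a sanity check, here is a small sketch. The regex below is a typical "right general shape" pattern chosen for illustration, not the one in this PR, and `looks_like_email` is a hypothetical name. It rejects some addresses that the RFC grammar allows and accepts some that it forbids.

```python
import re

# A typical "vaguely the right shape" pattern (illustrative, not the PR's regex).
SIMPLE_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def looks_like_email(address: str) -> bool:
    """Sanity check only: True if the address has a plausible shape."""
    return SIMPLE_EMAIL.match(address) is not None


# Syntactically valid addresses that this pattern rejects:
# a quoted local part, a domain literal, and an apostrophe in the local part.
valid_but_rejected = [
    '"john doe"@example.com',
    "user@[192.0.2.1]",
    "o'brien@example.com",
]

# Invalid addresses that this pattern accepts:
# consecutive dots and a leading dot in an unquoted local part.
invalid_but_accepted = [
    "a..b@example.com",
    ".user@example.com",
]
```

Naming the method along the lines of `looks_like_email` makes it clear that it does not claim full validation.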

As part of the above, I think the regex check and the DNS resolution step should be split into separate functions. This could also simplify the related tests.

Another alternative is to use the Python library email-validator to more formally validate emails. It also uses dnspython, but handles some of the more complicated validation aspects.
https://pypi.org/project/email-validator/
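A rough sketch of what the email-validator route could look like is below. `is_valid_maintainer_email` is a hypothetical helper name, and the fallback to `None` when the library is missing is an assumption made so the snippet stays importable anywhere. `validate_email` and `EmailNotValidError` are part of the email-validator package; `check_deliverability=True` asks it to perform DNS checks on the domain.

```python
from typing import Optional

try:
    from email_validator import EmailNotValidError, validate_email
except ImportError:  # library not installed; the helper then returns None
    validate_email = None
    EmailNotValidError = ValueError


def is_valid_maintainer_email(
    email: str, check_deliverability: bool = False
) -> Optional[bool]:
    """Return True or False, or None when email-validator is unavailable.

    With check_deliverability=True the library also queries DNS for the
    domain, which needs network access.
    """
    if validate_email is None:
        return None
    try:
        validate_email(email, check_deliverability=check_deliverability)
    except EmailNotValidError:
        # Covers both syntax errors and failed deliverability checks.
        return False
    return True
```

Keeping `check_deliverability` as a parameter lets unit tests exercise the syntax path offline and leave the DNS path to integration tests.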

I used the email-validator approach.


AmineRaouane requested review from behnazh-w and tromai as code owners June 16, 2025 21:21

oracle-contributor-agreement bot added the "OCA Verified" label (all contributors have signed the Oracle Contributor Agreement) on Jun 16, 2025

AmineRaouane force-pushed the fake-emails-heuristic branch from c6f35f7 to 8d29103 on June 16, 2025 21:26

AmineRaouane force-pushed the fake-emails-heuristic branch 5 times, most recently from f945882 to a7103e4, on July 5, 2025 18:59

benmss reviewed Jul 9, 2025


AmineRaouane added 4 commits July 12, 2025 16:51

feat(heuristics): add Fake Email analyzer to validate maintainer email domains (34a69a0)
Signed-off-by: Amine <[email protected]>

refactor: remove redundant package download (18715d1)
Signed-off-by: Amine <[email protected]>

docs: update documentation (59f4c61)
Signed-off-by: Amine <[email protected]>

refactor(fake-email): replace dns_resolver with email-validator for email domain validation (d99495c)
Signed-off-by: Amine <[email protected]>

AmineRaouane force-pushed the fake-emails-heuristic branch from 759ab97 to d99495c on July 12, 2025 15:55
