[Issue-435] Job Argument filter for ZyteJobsComparisonMonitor #461

shafiq-muhammad · 2025-01-09T11:04:28Z

Job argument filter to get the past jobs with desired arguments is added for ZyteJobsComparisonMonitor.

SPIDERMON_JOBS_COMPARISON_ARGUMENTS setting will be used to provide the list of desired arguments.

curita · 2025-01-31T19:48:45Z

Hi @shafiq-muhammad! The current tests are failing because we need to mock new things or mock them in a different way.

Take for example the test test_jobs_comparison_monitor_get_jobs(). The traceback of the error we get looks like this:

============================================================================================ FAILURES ============================================================================================
_____________________________________________________________________________ test_jobs_comparison_monitor_get_jobs ______________________________________________________________________________

    def test_jobs_comparison_monitor_get_jobs():
        mock_client = Mock()
        with patch(
            "spidermon.contrib.scrapy.monitors.monitors.Client"
        ) as mock_client_class:
            mock_client_class.return_value = mock_client
            monitor = TestZyteJobsComparisonMonitor()
            monitor._get_tags_to_filter = Mock(side_effect=lambda: None)
            monitor.data = Mock()
            monitor.crawler.settings.getlist.return_value = None
            mock_client.spider.jobs.list = Mock(side_effect=get_paginated_jobs)
    
            # Return exact number of jobs
>           jobs = monitor._get_jobs(states=None, number_of_jobs=50)

tests/contrib/scrapy/monitors/test_jobs_comparison_monitor.py:166: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
spidermon/contrib/scrapy/monitors/monitors.py:594: in _get_jobs
    if not self._has_desired_args(job, args):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <MONITOR:(TestZyteJobsComparisonMonitor/runTest) at 0x108fe9ea0>, job = <Mock id='4445866160'>, args = []

    def _has_desired_args(self, job, args):
        if not args and not job.get("spider_args"):
            return True
    
>       job_args = job["spider_args"].keys()
E       TypeError: 'Mock' object is not subscriptable

As seen at the end, it says, "'Mock' object is not subscribable." The act of "subscribing" an object is when we do [] or __getitem__() over that object. So that means that the job that was passed to that function is a Mock() object, which doesn't support calling ["spider_args"] over it.

Investigating where we are mocking the job, we can see that happens when calling client.spider.jobs.list inside ZyteJobsComparisonMonitor._get_jobs(). That call is mocked inside this test when doing: mock_client.spider.jobs.list = Mock(side_effect=get_paginated_jobs). So when calling that function, we call and get the result of its side_effect: get_paginated_jobs().

This is defined like this earlier in the file:

def get_paginated_jobs(**kwargs):
    return [Mock() for _ in range(kwargs["count"])]

So essentially, client.spider.jobs.list() ends up returning a list of Mock() objects.

Now, we need to prevent job["spider_args"] from breaking. One way to do it is to make Mock() support it. That can be accomplished by using a MagicMock() instead, and you can even define the result you expect out of job["spider_args"] this way:

def get_paginated_jobs(**kwargs):
    mocked_job_meta = []
    for _ in range(kwargs["count"]):
        jmock = MagicMock()
        def mock_getitem(key):
            if key == "spider_args":
                return {"arg1": "value1"}
        jmock.__getitem__.side_effect = mock_getitem
        mocked_job_meta.append(jmock)
    return mocked_job_meta

But since the actual result of client.spider.jobs.list() when it's not mocked is a list of dictionaries, we don't really need to use Mock objects, we can replace that with:

def get_paginated_jobs(**kwargs):
    mocked_job_meta = []
    for _ in range(kwargs["count"]):
        mocked_job_meta.append({"spider_args": {"arg1": "value1"}})
    return mocked_job_meta

That should fix the first part of the test. The rest doesn't use get_paginated_jobs(), so you'll have to update the corresponding place accordingly.

I hope this gives you a hint on how to fix the rest of the tests and add new ones. Let me know how it goes!

codecov · 2025-02-20T09:13:44Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.69%. Comparing base (d7d553a) to head (fc6cf2c).
Report is 7 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #461      +/-   ##
==========================================
+ Coverage   79.54%   79.69%   +0.14%     
==========================================
  Files          76       76              
  Lines        3242     3260      +18     
  Branches      539      325     -214     
==========================================
+ Hits         2579     2598      +19     
  Misses        593      593              
+ Partials       70       69       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

curita

Thanks for the changes @shafiq-muhammad!

I realized the filtering logic is not quite what is expected, but I think it can be adapted 👍

curita · 2025-02-24T14:40:59Z

spidermon/contrib/scrapy/monitors/monitors.py

+    The job that will have all the desired arguments will be processed.
+    Example ["debug_url"] or ["is_full_crawl"]


issue?: This is not exactly the intention of the original ticket, sorry I didn't realize this before 🙇‍♀️

We not only want to check that the arguments listed in SPIDERMON_JOBS_COMPARISON_ARGUMENTS appear in the jobs we'll use as comparison, but that the values are the same as well.

Following this example, if the job where we are running the monitor was called with the argument is_full_crawl=1 (and "is_full_crawl" is in SPIDERMON_JOBS_COMPARISON_ARGUMENTS), we want to only consider the past jobs that have the argument "is_full_crawl" with a value of "1". If some past job has the spider argument "is_full_crawl" with a value of "0" instead we'll want to skip it.

In the case that the current job doesn't have that argument (even if it's listed in SPIDERMON_JOBS_COMPARISON_ARGUMENTS), we might want to compare them against past jobs that don't have that argument as well. We could consider that case as having that spider argument set to None in a sense.

shafiq-muhammad added 2 commits January 9, 2025 15:53

added filter on job comparison monitor based on job arguments

486cd03

added doc string

bbe5d02

added test cases

b835366

shafiq-muhammad added 3 commits February 20, 2025 14:13

linting

1ea95f8

added new test cases

a8412bf

removed unwanted code

fc6cf2c

curita requested changes Feb 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Issue-435] Job Argument filter for ZyteJobsComparisonMonitor #461

[Issue-435] Job Argument filter for ZyteJobsComparisonMonitor #461

shafiq-muhammad commented Jan 9, 2025

curita commented Jan 31, 2025

codecov bot commented Feb 20, 2025 •

edited

Loading

curita left a comment

curita Feb 24, 2025

		The job that will have all the desired arguments will be processed.
		Example ["debug_url"] or ["is_full_crawl"]

[Issue-435] Job Argument filter for ZyteJobsComparisonMonitor #461

Are you sure you want to change the base?

[Issue-435] Job Argument filter for ZyteJobsComparisonMonitor #461

Conversation

shafiq-muhammad commented Jan 9, 2025

curita commented Jan 31, 2025

codecov bot commented Feb 20, 2025 • edited Loading

Codecov Report

curita left a comment

Choose a reason for hiding this comment

curita Feb 24, 2025

Choose a reason for hiding this comment

codecov bot commented Feb 20, 2025 •

edited

Loading