Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue-435] Job Argument filter for ZyteJobsComparisonMonitor #461

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

shafiq-muhammad
Copy link
Contributor

Job argument filter to get the past jobs with desired arguments is added for ZyteJobsComparisonMonitor.

SPIDERMON_JOBS_COMPARISON_ARGUMENTS setting will be used to provide the list of desired arguments.

@curita
Copy link
Member

curita commented Jan 31, 2025

Hi @shafiq-muhammad! The current tests are failing because we need to mock new things or mock them in a different way.

Take for example the test test_jobs_comparison_monitor_get_jobs(). The traceback of the error we get looks like this:

============================================================================================ FAILURES ============================================================================================
_____________________________________________________________________________ test_jobs_comparison_monitor_get_jobs ______________________________________________________________________________

    def test_jobs_comparison_monitor_get_jobs():
        mock_client = Mock()
        with patch(
            "spidermon.contrib.scrapy.monitors.monitors.Client"
        ) as mock_client_class:
            mock_client_class.return_value = mock_client
            monitor = TestZyteJobsComparisonMonitor()
            monitor._get_tags_to_filter = Mock(side_effect=lambda: None)
            monitor.data = Mock()
            monitor.crawler.settings.getlist.return_value = None
            mock_client.spider.jobs.list = Mock(side_effect=get_paginated_jobs)
    
            # Return exact number of jobs
>           jobs = monitor._get_jobs(states=None, number_of_jobs=50)

tests/contrib/scrapy/monitors/test_jobs_comparison_monitor.py:166: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
spidermon/contrib/scrapy/monitors/monitors.py:594: in _get_jobs
    if not self._has_desired_args(job, args):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <MONITOR:(TestZyteJobsComparisonMonitor/runTest) at 0x108fe9ea0>, job = <Mock id='4445866160'>, args = []

    def _has_desired_args(self, job, args):
        if not args and not job.get("spider_args"):
            return True
    
>       job_args = job["spider_args"].keys()
E       TypeError: 'Mock' object is not subscriptable

As seen at the end, it says, "'Mock' object is not subscribable." The act of "subscribing" an object is when we do [] or __getitem__() over that object. So that means that the job that was passed to that function is a Mock() object, which doesn't support calling ["spider_args"] over it.

Investigating where we are mocking the job, we can see that happens when calling client.spider.jobs.list inside ZyteJobsComparisonMonitor._get_jobs(). That call is mocked inside this test when doing: mock_client.spider.jobs.list = Mock(side_effect=get_paginated_jobs). So when calling that function, we call and get the result of its side_effect: get_paginated_jobs().

This is defined like this earlier in the file:

def get_paginated_jobs(**kwargs):
    return [Mock() for _ in range(kwargs["count"])]

So essentially, client.spider.jobs.list() ends up returning a list of Mock() objects.

Now, we need to prevent job["spider_args"] from breaking. One way to do it is to make Mock() support it. That can be accomplished by using a MagicMock() instead, and you can even define the result you expect out of job["spider_args"] this way:

def get_paginated_jobs(**kwargs):
    mocked_job_meta = []
    for _ in range(kwargs["count"]):
        jmock = MagicMock()
        def mock_getitem(key):
            if key == "spider_args":
                return {"arg1": "value1"}
        jmock.__getitem__.side_effect = mock_getitem
        mocked_job_meta.append(jmock)
    return mocked_job_meta

But since the actual result of client.spider.jobs.list() when it's not mocked is a list of dictionaries, we don't really need to use Mock objects, we can replace that with:

def get_paginated_jobs(**kwargs):
    mocked_job_meta = []
    for _ in range(kwargs["count"]):
        mocked_job_meta.append({"spider_args": {"arg1": "value1"}})
    return mocked_job_meta

That should fix the first part of the test. The rest doesn't use get_paginated_jobs(), so you'll have to update the corresponding place accordingly.

I hope this gives you a hint on how to fix the rest of the tests and add new ones. Let me know how it goes!

Copy link

codecov bot commented Feb 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.69%. Comparing base (d7d553a) to head (fc6cf2c).
Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #461      +/-   ##
==========================================
+ Coverage   79.54%   79.69%   +0.14%     
==========================================
  Files          76       76              
  Lines        3242     3260      +18     
  Branches      539      325     -214     
==========================================
+ Hits         2579     2598      +19     
  Misses        593      593              
+ Partials       70       69       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@curita curita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes @shafiq-muhammad!

I realized the filtering logic is not quite what is expected, but I think it can be adapted 👍

Comment on lines +544 to +545
The job that will have all the desired arguments will be processed.
Example ["debug_url"] or ["is_full_crawl"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue?: This is not exactly the intention of the original ticket, sorry I didn't realize this before 🙇‍♀️

We not only want to check that the arguments listed in SPIDERMON_JOBS_COMPARISON_ARGUMENTS appear in the jobs we'll use as comparison, but that the values are the same as well.

Following this example, if the job where we are running the monitor was called with the argument is_full_crawl=1 (and "is_full_crawl" is in SPIDERMON_JOBS_COMPARISON_ARGUMENTS), we want to only consider the past jobs that have the argument "is_full_crawl" with a value of "1". If some past job has the spider argument "is_full_crawl" with a value of "0" instead we'll want to skip it.

In the case that the current job doesn't have that argument (even if it's listed in SPIDERMON_JOBS_COMPARISON_ARGUMENTS), we might want to compare them against past jobs that don't have that argument as well. We could consider that case as having that spider argument set to None in a sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants