
Optimized dropout process - Technical Exercise Part 1 Product Engineer Achmad Ardani Prasha #1

Open: achmadardanip wants to merge 3 commits into main
Benchmark Comparison

Before Optimization

[screenshot: benchmark output before optimization]

After Optimization
[screenshot: benchmark output after optimization]

Comparison Table

| Metric | Before Optimization | After Optimization | Improvement |
| --- | --- | --- | --- |
| Enrollments to be dropped | 500,000 | 500,000 | No change |
| Excluded from dropout | 39,243 | 39,243 | No change |
| Final dropped out | 460,757 | 460,757 | No change |
| Memory Usage (MiB) | 624.00 | 36.00 | ~94% reduction |
| Execution Time (ms) | 1,323,102 | 204,808 | ~84% faster |

Summary of Improvements:

  • Memory Usage: Dropped from 624 MiB to 36 MiB, which is roughly a 94% reduction.

  • Execution Time: Reduced from 1,323,102 ms (about 22 minutes) to 204,808 ms (about 3.4 minutes), an 84% faster runtime.

  • Correctness: The number of enrollments processed and dropped out remained the same, ensuring the process logic is preserved.

Optimization Steps

Current Problem: The DropOutEnrollments Artisan command, which processes ~500k enrollments (with 500k related exams and 300k submissions), is extremely slow (1,323,102 ms, roughly 22 minutes) and memory-intensive (624 MiB). This points to inefficient querying and to loading too much data into memory at once.

We can improve execution speed and reduce memory usage by applying several optimizations:

1. Select Only Required Columns
I begin by selecting only the columns I need (id, course_id, and student_id) instead of retrieving full rows. The query does not use SELECT *; it explicitly lists the required fields. This reduces the data transferred from the database and the memory the application needs to hydrate models [1]. Fetching only the necessary columns is a standard SQL best practice, as it lowers CPU load and memory usage on both the database and the application side [2]. In my code, I implement this with Eloquent’s select method:

Enrollment::select('id', 'course_id', 'student_id')
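To confirm the narrowed projection, the builder's toSql() method can be used to inspect the generated query; a minimal sketch, assuming the same $deadline variable used in the full snippet further below:

// Illustrative check only (not part of the command): inspect the SQL Eloquent will run.
$sql = Enrollment::select('id', 'course_id', 'student_id')
    ->where('deadline_at', '<=', $deadline)
    ->toSql();
// Roughly: select `id`, `course_id`, `student_id` from `enrollments` where `deadline_at` <= ?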

2. Bulk Fetch Related Records with Composite Keys
Next, I need to determine which enrollments have related exam or submission records without loading those records entirely. I do this by using a composite key (a combination of course_id and student_id) for Exams and Submissions. By selecting a raw concatenation of these two fields and applying distinct, I get a unique list of course_id-student_id pairs. This minimizes the data pulled into memory: I retrieve only a list of keys rather than full objects. Using Eloquent’s pluck on this raw selection yields a plain array of composite keys without hydrating a model for each row, which saves significant memory and overhead, as noted in Laravel performance tips [3]. Here’s how I do it in code:

$activeExamKeys = Exam::selectRaw("CONCAT(course_id, '-', student_id) as composite_key")
                    ->whereIn('course_id', $courseIds)
                    ->whereIn('student_id', $studentIds)
                    ->where('status', 'IN_PROGRESS')
                    ->distinct()
                    ->pluck('composite_key')
                    ->toArray();
$waitingSubmissionKeys = Submission::selectRaw("CONCAT(course_id, '-', student_id) as composite_key")
                    ->whereIn('course_id', $courseIds)
                    ->whereIn('student_id', $studentIds)
                    ->where('status', 'WAITING_REVIEW')
                    ->distinct()
                    ->pluck('composite_key')
                    ->toArray();

By querying only these composite keys, I dramatically reduce the amount of data loaded, ensuring I work only with the identifiers I need (course–student pairs) instead of entire records.
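In the full command (step 4 below), these key arrays are flipped into hash maps so that each enrollment in a chunk can be checked in constant time:

// Flip the arrays so each composite key becomes an array key, giving O(1) isset() lookups.
$activeExamLookup        = array_flip($activeExamKeys);
$waitingSubmissionLookup = array_flip($waitingSubmissionKeys);

// For a given enrollment, skip the dropout when either lookup contains its key.
$key = $enrollment->course_id . '-' . $enrollment->student_id;
$shouldSkip = isset($activeExamLookup[$key]) || isset($waitingSubmissionLookup[$key]);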

3. Cache Timestamp per Chunk
When updating records, I use timestamps (for example, setting an updated_at field). Instead of calling now() for every single record, I call it once per chunk and reuse the value. A single now() call is cheap, but repeated across thousands of records in a loop it becomes avoidable overhead. Caching the timestamp in a $now variable removes the redundant calls and keeps the timestamp consistent within the batch. This follows the general optimization principle of moving repeated computations outside of loops [4]. In practice, I capture the current time at the start of each chunk and use that value for all updates and inserts in that chunk, as shown below:

Enrollment::chunkById(1000, function ($enrollments) {
    // Cache the current timestamp once per chunk
    $now = now(); 

    foreach ($enrollments as $enrollment) {
        // ... use $now for any time-stamps needed ...
        // e.g., preparing an update or insert with $now
    }
});

4. Use Chunking, and Bulk Update & Insert
Finally, I process the enrollments in chunks and perform bulk updates/inserts rather than handling one record at a time. Chunking the query (e.g., 1,000 records at a time) ensures that I never load too many records into memory at once, which keeps memory usage low and makes it feasible to work through millions of records without crashing [5]. Within each chunk, I collect the IDs that need updating and prepare any new records that need inserting, then perform a single bulk update for the whole set and a single bulk insert for all new records. Using a set-based update with WHERE IN (...) on the collected IDs lets the database update many rows in one operation, greatly reducing the number of round trips compared to updating each row individually [6]. Likewise, inserting multiple rows in one query is much faster than one-by-one inserts; the MySQL documentation notes that batching many values into a single INSERT can be many times more efficient than single-row inserts [7]. Below is a snippet illustrating this approach:

        Enrollment::select('id', 'course_id', 'student_id')
            ->where('deadline_at', '<=', $deadline)
            ->chunkById(1000, function ($enrollments) use (&$totalDropped, &$totalChecked) {
                // Extract unique course_ids and student_ids from the current chunk.
                $courseIds = $enrollments->pluck('course_id')->unique()->toArray();
                $studentIds = $enrollments->pluck('student_id')->unique()->toArray();

                // Build lookup for active exams using composite keys.
                $activeExamKeys = Exam::selectRaw("CONCAT(course_id, '-', student_id) as composite_key")
                    ->whereIn('course_id', $courseIds)
                    ->whereIn('student_id', $studentIds)
                    ->where('status', 'IN_PROGRESS')
                    ->distinct()
                    ->pluck('composite_key')
                    ->toArray();
                $activeExamLookup = array_flip($activeExamKeys);

                // Build lookup for waiting submissions using composite keys.
                $waitingSubmissionKeys = Submission::selectRaw("CONCAT(course_id, '-', student_id) as composite_key")
                    ->whereIn('course_id', $courseIds)
                    ->whereIn('student_id', $studentIds)
                    ->where('status', 'WAITING_REVIEW')
                    ->distinct()
                    ->pluck('composite_key')
                    ->toArray();
                $waitingSubmissionLookup = array_flip($waitingSubmissionKeys);

                // Prepare arrays for bulk update and bulk insert.
                $enrollmentIdsToDrop = [];
                $activityLogs = [];
                $now = now(); // Cache current timestamp for the entire chunk

                foreach ($enrollments as $enrollment) {
                    $totalChecked++;
                    $key = $enrollment->course_id . '-' . $enrollment->student_id;

                    // Skip enrollment if it has an active exam or waiting submission.
                    if (isset($activeExamLookup[$key]) || isset($waitingSubmissionLookup[$key])) {
                        continue;
                    }

                    $enrollmentIdsToDrop[] = $enrollment->id;
                    $activityLogs[] = [
                        'resource_id' => $enrollment->id,
                        'user_id'     => $enrollment->student_id,
                        'description' => 'COURSE_DROPOUT',
                        'created_at'  => $now,
                        'updated_at'  => $now,
                    ];
                    $totalDropped++;
                }

                // Bulk update enrollments that qualify for dropout.
                if (!empty($enrollmentIdsToDrop)) {
                    DB::table('enrollments')
                        ->whereIn('id', $enrollmentIdsToDrop)
                        ->update([
                            'status'     => 'DROPOUT',
                            'updated_at' => $now,
                        ]);
                }

                // Bulk insert all the activity log records.
                if (!empty($activityLogs)) {
                    DB::table('activities')->insert($activityLogs);
                }
            });

By processing in chunks, I keep memory usage stable, and by doing bulk database operations, I minimize query overhead. This set-based processing leverages the database’s efficiency at handling multiple records in one go, rather than making thousands of individual calls, thereby dramatically improving performance. Each of these optimizations – selecting minimal columns, fetching keys in bulk, caching timestamps, and using chunked batch updates/inserts – contributes to a more efficient, scalable process.
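For contrast, here is a minimal sketch of the row-by-row pattern this approach replaces (illustrative only, not necessarily the original implementation): issuing one UPDATE and one INSERT per enrollment means roughly a million round trips for ~500k records, whereas the chunked version above issues only a few queries per 1,000-record chunk.

use Illuminate\Support\Facades\DB;

// Row-by-row anti-pattern (illustrative): two queries per enrollment.
foreach ($enrollments as $enrollment) {
    DB::table('enrollments')
        ->where('id', $enrollment->id)
        ->update(['status' => 'DROPOUT', 'updated_at' => now()]); // one UPDATE per row

    DB::table('activities')->insert([
        'resource_id' => $enrollment->id,
        'user_id'     => $enrollment->student_id,
        'description' => 'COURSE_DROPOUT',
        'created_at'  => now(),
        'updated_at'  => now(),
    ]); // one INSERT per row
}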

References

  1. Dudi. (n.d.). 18 tips to optimize Laravel database queries. Dudi.dev. https://dudi.dev/optimize-laravel-database-queries/#:~:text=As%20you%20can%20see%2C%20the,the%20columns%20from%20the%20table
  2. IBM. (n.d.). Ways to select data from columns. IBM Documentation. https://www.ibm.com/docs/en/db2-for-zos/13?topic=data-ways-select-from-columns
  3. Dudi. (n.d.). 18 tips to optimize Laravel database queries. Dudi.dev. https://dudi.dev/optimize-laravel-database-queries/#:~:text=The%20above%20approach%20eliminates%20the,on%20processing%20the%20query%20results
  4. Squash Labs. (2023, September 12). How to use loops in PHP. Squash.io. https://www.squash.io/how-to-use-loops-in-php/#:~:text=3,loop%20to%20avoid%20redundant%20computations
  5. Aiman, A. (2024, November 15). How to handle large datasets in Laravel without running out of memory. DEV Community. https://dev.to/asfiaaiman/how-to-handle-large-datasets-in-laravel-without-running-out-of-memory-nak#:~:text=,app%20crashing%20or%20slowing%20down
  6. Oracle Corporation. (n.d.). Optimizing INSERT statements. MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/insert-optimization.html
  7. Oracle Corporation. (n.d.). Optimizing INSERT statements. MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/insert-optimization.html#:~:text=You%20can%20use%20the%20following,methods%20to%20speed%20up%20inserts
