
Optimized dropout process - Technical Exercise Part 1 Product Engineer Achmad Ardani Prasha #1

Open: achmadardanip wants to merge 3 commits into main
Benchmark Comparison

Before Optimization

[screenshot: benchmark output before optimization]

After Optimization
[screenshot: benchmark output after optimization]

Comparison Table

| Metric | Before Optimization | After Optimization | Improvement |
| --- | --- | --- | --- |
| Enrollments to be dropped | 500,000 | 500,000 | No change |
| Excluded from dropout | 39,243 | 39,243 | No change |
| Final dropped out | 460,757 | 460,757 | No change |
| Memory Usage (MiB) | 624.00 | 36.00 | ~94% reduction |
| Execution Time (ms) | 1,323,102 | 204,808 | ~84% faster |

Summary of Improvements:

  • Memory Usage: Dropped from 624 MiB to 36 MiB, which is roughly a 94% reduction.

  • Execution Time: Reduced from 1,323,102 ms (about 22 minutes) to 204,808 ms (about 3.4 minutes), an 84% faster runtime.

  • Correctness: The number of enrollments processed and dropped out remained the same, ensuring the process logic is preserved.

Optimization Steps

Current Problem: The DropOutEnrollments Artisan command, which processes ~500k enrollments (with 500k related exams and 300k submissions), is extremely slow (1,323,102 ms, roughly 22 minutes) and memory-intensive (624 MiB). This points to inefficient querying and to loading too much data into memory at once.

We can improve execution speed and reduce memory usage by applying several optimizations:

1. Select Only Required Columns
I begin by selecting only the columns I need (id, course_id, and student_id) instead of retrieving full rows. The query does not use SELECT *; it explicitly lists the required fields. This reduces the data transferred from the database and the memory the application needs to hydrate models [1]. Fetching only the necessary columns is a standard SQL best practice, as it lowers CPU load and memory usage on both the database and the application side [2]. In my code, I implement this with Eloquent’s select method:

Enrollment::select('id', 'course_id', 'student_id')
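To confirm the narrowed projection, the builder's toSql() method can be used to inspect the generated query; a minimal sketch, assuming the same $deadline variable used in the full snippet further below:

// Illustrative check only (not part of the command): inspect the SQL Eloquent will run.
$sql = Enrollment::select('id', 'course_id', 'student_id')
    ->where('deadline_at', '<=', $deadline)
    ->toSql();
// Roughly: select `id`, `course_id`, `student_id` from `enrollments` where `deadline_at` <= ?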

2. Bulk Fetch Related Records with Composite Keys
Next, I need to determine which enrollments have related exam or submission records without loading those records entirely. I do this by using a composite key (a combination of course_id and student_id) for Exams and Submissions. By selecting a raw concatenation of these two fields and applying distinct, I get a unique list of course_id-student_id pairs. This minimizes the data pulled into memory: I retrieve only a list of keys rather than full objects. Using Eloquent’s pluck on this raw selection yields a plain array of composite keys without hydrating a model for each row, which saves significant memory and overhead, as noted in Laravel performance tips [3]. Here’s how I do it in code:

$activeExamKeys = Exam::selectRaw("CONCAT(course_id, '-', student_id) as composite_key")
                    ->whereIn('course_id', $courseIds)
                    ->whereIn('student_id', $studentIds)
                    ->where('status', 'IN_PROGRESS')
                    ->distinct()
                    ->pluck('composite_key')
                    ->toArray();
$waitingSubmissionKeys = Submission::selectRaw("CONCAT(course_id, '-', student_id) as composite_key")
                    ->whereIn('course_id', $courseIds)
                    ->whereIn('student_id', $studentIds)
                    ->where('status', 'WAITING_REVIEW')
                    ->distinct()
                    ->pluck('composite_key')
                    ->toArray();

By querying only these composite keys, I dramatically reduce the amount of data loaded, ensuring I work only with the identifiers I need (course–student pairs) instead of entire records.
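In the full command (step 4 below), these key arrays are flipped into hash maps so that each enrollment in a chunk can be checked in constant time:

// Flip the arrays so each composite key becomes an array key, giving O(1) isset() lookups.
$activeExamLookup        = array_flip($activeExamKeys);
$waitingSubmissionLookup = array_flip($waitingSubmissionKeys);

// For a given enrollment, skip the dropout when either lookup contains its key.
$key = $enrollment->course_id . '-' . $enrollment->student_id;
$shouldSkip = isset($activeExamLookup[$key]) || isset($waitingSubmissionLookup[$key]);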

3. Cache Timestamp per Chunk
When updating records, I use timestamps (for example, setting an updated_at field). Instead of calling now() for every single record, I call it once per chunk and reuse the value. A single now() call is cheap, but repeated across thousands of records in a loop it becomes avoidable overhead. Caching the timestamp in a $now variable removes the redundant calls and keeps the timestamp consistent within the batch. This follows the general optimization principle of moving repeated computations outside of loops [4]. In practice, I capture the current time at the start of each chunk and use that value for all updates and inserts in that chunk, as shown below:

Enrollment::chunkById(1000, function ($enrollments) {
    // Cache the current timestamp once per chunk
    $now = now(); 

    foreach ($enrollments as $enrollment) {
        // ... use $now for any time-stamps needed ...
        // e.g., preparing an update or insert with $now
    }
});

4. Use Chunking, and Bulk Update & Insert
Finally, I process the enrollments in chunks and perform bulk updates/inserts rather than handling one record at a time. Chunking the query (e.g., 1,000 records at a time) ensures that I never load too many records into memory at once, which keeps memory usage low and makes it feasible to work through millions of records without crashing [5]. Within each chunk, I collect the IDs that need updating and prepare any new records that need inserting, then perform a single bulk update for the whole set and a single bulk insert for all new records. Using a set-based update with WHERE IN (...) on the collected IDs lets the database update many rows in one operation, greatly reducing the number of round trips compared to updating each row individually [6]. Likewise, inserting multiple rows in one query is much faster than one-by-one inserts; the MySQL documentation notes that batching many values into a single INSERT can be many times more efficient than single-row inserts [7]. Below is a snippet illustrating this approach:

        Enrollment::select('id', 'course_id', 'student_id')
            ->where('deadline_at', '<=', $deadline)
            ->chunkById(1000, function ($enrollments) use (&$totalDropped, &$totalChecked) {
                // Extract unique course_ids and student_ids from the current chunk.
                $courseIds = $enrollments->pluck('course_id')->unique()->toArray();
                $studentIds = $enrollments->pluck('student_id')->unique()->toArray();

                // Build lookup for active exams using composite keys.
                $activeExamKeys = Exam::selectRaw("CONCAT(course_id, '-', student_id) as composite_key")
                    ->whereIn('course_id', $courseIds)
                    ->whereIn('student_id', $studentIds)
                    ->where('status', 'IN_PROGRESS')
                    ->distinct()
                    ->pluck('composite_key')
                    ->toArray();
                $activeExamLookup = array_flip($activeExamKeys);

                // Build lookup for waiting submissions using composite keys.
                $waitingSubmissionKeys = Submission::selectRaw("CONCAT(course_id, '-', student_id) as composite_key")
                    ->whereIn('course_id', $courseIds)
                    ->whereIn('student_id', $studentIds)
                    ->where('status', 'WAITING_REVIEW')
                    ->distinct()
                    ->pluck('composite_key')
                    ->toArray();
                $waitingSubmissionLookup = array_flip($waitingSubmissionKeys);

                // Prepare arrays for bulk update and bulk insert.
                $enrollmentIdsToDrop = [];
                $activityLogs = [];
                $now = now(); // Cache current timestamp for the entire chunk

                foreach ($enrollments as $enrollment) {
                    $totalChecked++;
                    $key = $enrollment->course_id . '-' . $enrollment->student_id;

                    // Skip enrollment if it has an active exam or waiting submission.
                    if (isset($activeExamLookup[$key]) || isset($waitingSubmissionLookup[$key])) {
                        continue;
                    }

                    $enrollmentIdsToDrop[] = $enrollment->id;
                    $activityLogs[] = [
                        'resource_id' => $enrollment->id,
                        'user_id'     => $enrollment->student_id,
                        'description' => 'COURSE_DROPOUT',
                        'created_at'  => $now,
                        'updated_at'  => $now,
                    ];
                    $totalDropped++;
                }

                // Bulk update enrollments that qualify for dropout.
                if (!empty($enrollmentIdsToDrop)) {
                    DB::table('enrollments')
                        ->whereIn('id', $enrollmentIdsToDrop)
                        ->update([
                            'status'     => 'DROPOUT',
                            'updated_at' => $now,
                        ]);
                }

                // Bulk insert all the activity log records.
                if (!empty($activityLogs)) {
                    DB::table('activities')->insert($activityLogs);
                }
            });

By processing in chunks, I keep memory usage stable, and by doing bulk database operations, I minimize query overhead. This set-based processing leverages the database’s efficiency at handling multiple records in one go, rather than making thousands of individual calls, thereby dramatically improving performance. Each of these optimizations – selecting minimal columns, fetching keys in bulk, caching timestamps, and using chunked batch updates/inserts – contributes to a more efficient, scalable process.
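For contrast, here is a minimal sketch of the row-by-row pattern this approach replaces (illustrative only, not necessarily the original implementation): issuing one UPDATE and one INSERT per enrollment means roughly a million round trips for ~500k records, whereas the chunked version above issues only a few queries per 1,000-record chunk.

use Illuminate\Support\Facades\DB;

// Row-by-row anti-pattern (illustrative): two queries per enrollment.
foreach ($enrollments as $enrollment) {
    DB::table('enrollments')
        ->where('id', $enrollment->id)
        ->update(['status' => 'DROPOUT', 'updated_at' => now()]); // one UPDATE per row

    DB::table('activities')->insert([
        'resource_id' => $enrollment->id,
        'user_id'     => $enrollment->student_id,
        'description' => 'COURSE_DROPOUT',
        'created_at'  => now(),
        'updated_at'  => now(),
    ]); // one INSERT per row
}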

References

  1. Dudi. (n.d.). 18 tips to optimize Laravel database queries. Dudi.dev. https://dudi.dev/optimize-laravel-database-queries/#:~:text=As%20you%20can%20see%2C%20the,the%20columns%20from%20the%20table
  2. IBM. (n.d.). Ways to select data from columns. IBM Documentation. https://www.ibm.com/docs/en/db2-for-zos/13?topic=data-ways-select-from-columns
  3. Dudi. (n.d.). 18 tips to optimize Laravel database queries. Dudi.dev. https://dudi.dev/optimize-laravel-database-queries/#:~:text=The%20above%20approach%20eliminates%20the,on%20processing%20the%20query%20results
  4. Squash Labs. (2023, September 12). How to use loops in PHP. Squash.io. https://www.squash.io/how-to-use-loops-in-php/#:~:text=3,loop%20to%20avoid%20redundant%20computations
  5. Aiman, A. (2024, November 15). How to handle large datasets in Laravel without running out of memory. DEV Community. https://dev.to/asfiaaiman/how-to-handle-large-datasets-in-laravel-without-running-out-of-memory-nak#:~:text=,app%20crashing%20or%20slowing%20down
  6. Oracle Corporation. (n.d.). Optimizing INSERT statements. MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/insert-optimization.html
  7. Oracle Corporation. (n.d.). Optimizing INSERT statements. MySQL 8.4 Reference Manual. https://dev.mysql.com/doc/refman/8.4/en/insert-optimization.html#:~:text=You%20can%20use%20the%20following,methods%20to%20speed%20up%20inserts
