Need help. Tasks spawned by Dagger.@spawn (or Dagger.spawn()) do not get the data. #563
Replies: 2 comments
-
Hey there, sorry about the long delay! The code you have here is incorrectly writing to `ra` and `ca`; in this code:

```julia
Dagger.spawn() do
    dfr_v = [DataFrame() for i in 1:nF]
    rfn = PrqDir * "ratings_" * string(i, pad = 2) * ".parquet"
    println( rfn )
    dfr_v[i] = DataFrame(read_parquet( rfn ))
    ra[:,i] , ca[:,i] = dg_FindRatingsWorker( i, ng, kg, dfm, dfr_v[i])
end
```

The line `ra[:,i] , ca[:,i] = dg_FindRatingsWorker( i, ng, kg, dfm, dfr_v[i])` is invalid because the function passed to `Dagger.spawn` may run on Distributed workers, in which case they won't be able to access the right `ra` and `ca`; instead, those will be implicitly copied by Distributed, and your changes will end up being lost.

One way to solve this is to do the write to `ra`/`ca` outside of Dagger:

```julia
t = Dagger.spawn() do
    dfr_v = [DataFrame() for i in 1:nF]
    rfn = PrqDir * "ratings_" * string(i, pad = 2) * ".parquet"
    println( rfn )
    dfr_v[i] = DataFrame(read_parquet( rfn ))
    return dg_FindRatingsWorker( i, ng, kg, dfm, dfr_v[i])
end
ra[:,i] , ca[:,i] = fetch(t)
```

Of course, as this is written, there is no parallelism here. If you want parallelism, you'll have to launch multiple `Dagger.spawn` calls, and then fetch them all once they've been launched.
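A minimal, runnable sketch of that "launch all tasks first, then fetch" pattern, using Base tasks as a stand-in for `Dagger.spawn` (both return a handle that `fetch` waits on); `work` here is a hypothetical placeholder for `dg_FindRatingsWorker`:

```julia
# Sketch of "launch every task, then fetch" using Threads.@spawn as a
# stand-in for Dagger.spawn; `work` is a hypothetical placeholder for
# dg_FindRatingsWorker, returning one (ra-column, ca-column) pair.
nF = 4
ra = zeros(Int, 3, nF)
ca = zeros(Int, 3, nF)

work(i) = (fill(i, 3), fill(10i, 3))   # hypothetical per-file computation

# Launch every task before fetching any of them, so they can overlap.
ts = [Threads.@spawn work(i) for i in 1:nF]

# Fetch results and write to ra/ca on the caller side, never inside the task.
for i in 1:nF
    ra[:, i], ca[:, i] = fetch(ts[i])
end
```

The key point is that the mutation of `ra`/`ca` happens on the process that owns them; the tasks only return values.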
-
Thank you. I will examine your response and go from there.
-
I am using the MovieLens data to experiment with Dagger. I wrote an app using Base.Threads in which a FindRatingsMaster() function partitions the data into N shards (as DataFrames) and gives a shard to each of 10 FindRatingsWorker() functions, where the processing is done. I had no problem getting this done with Base.Threads on a single desktop with 26 cores. Then I tried doing the same using Julia's Distributed, but could not make it work and gave up. Then I saw that Dagger has been suggested for things like this, so I went through Dagger's documentation, saw and ran the example code, and then started coding. Unfortunately I am having similar issues with Dagger as I had with Distributed. Any help and suggestions would be greatly appreciated.
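For reference, the master/worker split described above can be sketched with Base.Threads like this; `score_shard` and the integer data are hypothetical stand-ins for the per-shard worker and the MovieLens ratings table:

```julia
# Sketch of the master/worker pattern: the master partitions the data
# into N contiguous shards, workers process one shard each, and the
# master fetches the results. `score_shard` and `ratings` are
# hypothetical stand-ins for the real worker and data.
ratings = collect(1:100)              # stand-in for the ratings table
nshards = 10

# Master: compute shard boundaries and slice the data.
bounds = round.(Int, range(0, length(ratings); length = nshards + 1))
shards = [ratings[bounds[k]+1:bounds[k+1]] for k in 1:nshards]

score_shard(shard) = sum(shard)       # hypothetical worker computation

# Workers: one task per shard; the master fetches all results.
tasks = [Threads.@spawn score_shard(s) for s in shards]
totals = fetch.(tasks)
```

The same shape carries over to Dagger by replacing `Threads.@spawn` with `Dagger.spawn`, with the caveat from the answer above: all writes to shared arrays belong after the `fetch`, on the calling process.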