Data size in GP / MB issue #5

Open · jetschny opened this issue Nov 8, 2023 · 27 comments
@jetschny (Contributor) commented Nov 8, 2023

I have been running my resource-monitoring validation for the script

https://github.com/FAIRiCUBE/uc3-drosophola-genetics/blob/main/projects/gap_filling/src/data/load_tsv_kmeans_elbow_sli.py

for which I had a previous "manual" estimate.

While the monitoring itself works smoothly (after solving the typical issue of installing libraries), I am missing some information:

0 Data size in grid points: no value provided. This could, for example, be the size of the largest array, or ideally the sum of the largest arrays (if possible).

1 Data size (MB): 0.1484375. The original idea was to describe the size of the input data, but that might have to be determined manually. What is actually measured here? In any case, less than 1 MB is rather small...

8 Network traffic (MB): 0.23931884765625. This was a local job with nothing loaded from the network, but the value is small, so maybe just "noise".

@cozzolinoac11 (Member)

Hi @jetschny

  • Data size in grid points currently takes a user-specified value. This mode is used because the format of the data can differ and, in addition, the user has the option of specifying the size in whatever way is most appropriate for the use case. The size is also easy to retrieve with simple calls, for example my_array.shape (see the snippet after this list).

  • Data size (MB) measures data consumed or freed up on disk (by creation or deletion of files).

  • Network traffic (MB) has a small value because the 'codecarbon' library uses remote calls to calculate emissions.
    From codecarbon's official page:
    "An offline version is available to support restricted environments without internet access. The internal computations remain unchanged; however, a country_iso_code parameter, which corresponds to the 3-letter alphabet ISO Code of the country where the compute infrastructure is hosted, is required to fetch Carbon Intensity details of the regional electricity used."
    Because the network usage is small and the required country code cannot always be easily determined, I preferred to use the "online version".
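
For reference, this is the kind of call meant above, shown as a minimal NumPy snippet (the array and its dimensions are just an example):

```python
import numpy as np

my_array = np.zeros((50024, 334))  # example array

print(my_array.shape)  # dimensions in grid points: (50024, 334)
print(my_array.size)   # total number of grid points: 16708016
```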

What do you think about it?

@jetschny (Contributor, Author)

Ah, now I understand. Is there a chance that you add a short routine that checks for the largest allocated array and takes that as the dimension? In that case the parameter should be called "largest allocated array (in grid points)". If possible at all, we could also sum up the allocated variable sizes of the whole workspace; that is somewhat similar to the allocated main memory but more specific to the script.

I have tested the "data size" further: while running my script

https://github.com/FAIRiCUBE/uc3-drosophola-genetics/blob/main/projects/gap_filling/src/data/load_csv_apply_GapFil_write_csv.py

and exporting 82 MB of data, the Measurer reports 0.0 MB. Can you have a look at that?

@jetschny (Contributor, Author)

@cozzolinoac11, any update here? We need some clarification to provide input for deliverable D3_3.

@cozzolinoac11 (Member)

Hi @jetschny

I'm working on a procedure to obtain:

  • the sum of the sizes of all variables allocated by the script (in GB)
  • the largest allocated array, in grid points. This is the variable, an instance of np.ndarray, with the highest size value; in NumPy, size is the number of elements in the array (the product of the array's dimensions). A rough sketch of such a scan follows below.
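
For illustration only, such a scan over the script's workspace might look roughly like this (this is not the actual code in measurer.py; the function name and the use of globals() are made up for the example):

```python
import sys
import numpy as np

def workspace_stats(variables):
    """Return (total size of all variables in GB, shape of the largest ndarray)."""
    total_bytes = 0
    largest_shape = None
    largest_size = 0
    for value in variables.values():
        total_bytes += sys.getsizeof(value)
        if isinstance(value, np.ndarray) and value.size > largest_size:
            largest_size = value.size
            largest_shape = value.shape
    return total_bytes / 1024**3, largest_shape

# e.g. total_gb, largest = workspace_stats(dict(globals()))
```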

The procedure is already available and working in the measurer.py updated a few minutes ago on GitHub (please also have a look at the updated example.py). I'm currently doing further tests on the procedure which seems to work well.
What do you think about it?

Regarding the test mentioned above, was the file already present in the folder? If so, an overwrite occurs and the measurer returns 0 (this also happened to me during one of my tests 😄).
Otherwise, can you provide me with a few screenshots of how the measurer is used in the script?

@jetschny (Contributor, Author)

Hi @cozzolinoac11,
thanks for the update. I have pulled the latest version but I see no change: even if I delete the csv file before execution, the data size in MB is reported as 0.0 but should be around 80 MB. Data size in GP is as specified, and the largest allocated array is empty as well...

The example above,

https://github.com/FAIRiCUBE/uc3-drosophola-genetics/blob/main/projects/gap_filling/src/data/load_csv_apply_GapFil_write_csv.py

should run within a cloned repo (only the directory name needs to be modified), so you could reproduce my exact situation.
cheers
Stefan

@cozzolinoac11 (Member) commented Nov 27, 2023

Hi @jetschny,

I copied and pasted the file https://github.com/FAIRiCUBE/uc3-drosophola-genetics/blob/main/projects/gap_filling/src/data/load_csv_apply_GapFil_write_csv.py, updating the paths and adding the measurer's start and end calls.

In my test, the value of "Data size" is around 81 MB and the "Largest allocated array in grid points" is [50024, 334].

The problem in your test is probably related to the paths used by the measurer (maybe the value of the 'data_path' parameter in the start/end methods differs from the path where the file is saved).

The folder https://github.com/FAIRiCUBE/common-code/tree/main/record-computational-demands-automatically/test/uc3_test contains:

  • the Python script
  • the benchmark csv file
  • a screenshot of the output in the terminal

cheers

@jetschny (Contributor, Author) commented Dec 5, 2023

Hi @cozzolinoac11 ,
many thanks for your example. I have compared your script with mine and in the end simply copied and pasted your uc3_test.py into my UC3 repo, changing the paths to relative ones (so it runs completely within the repo):

https://github.com/FAIRiCUBE/uc3-drosophola-genetics/blob/main/projects/gap_filling/src/data/load_csv_apply_GapFil_write_csv_test.py

Still, the same issue occurs: the data size is reported as 0.0 MB. What can I test now to get to your "working environment"?
cheers
Stefan

@BachirNILU (Contributor) commented Dec 5, 2023

Hi,

Thank you @cozzolinoac11 for this nice work.
I have created a general testing Python script (you can find it here: https://github.com/FAIRiCUBE/common-code/blob/main/record-computational-demands-automatically/test/General_Test.py).
'Data size' seemed to work for me (it reported 31.5 MB == the size of the csv I wrote to disk).
@jetschny can you try this general test from your side?
Make sure that the csv file is not present when you run the code, otherwise it will report a size of 0.0 (overwriting a file with the same data = same size = no change in the hard disk state).

Best regards,

-Bachir.

@BachirNILU (Contributor)

Hi,

@jetschny I have also tested the code provided in https://github.com/FAIRiCUBE/uc3-drosophola-genetics/blob/main/projects/gap_filling/src/data/load_csv_apply_GapFil_write_csv_test.py
I have changed the out_file_name to "Test.csv" and I got a data size of 78.42 MB.

Best regards,

-Bachir.

@jetschny (Contributor, Author) commented Dec 7, 2023

Hi @BachirNILU and @cozzolinoac11 ,
I tested the script General_Test.py and that works for me as well, reporting 31.5 MB in data size. When I run my example

https://github.com/FAIRiCUBE/uc3-drosophola-genetics/blob/main/projects/gap_filling/src/data/load_csv_apply_GapFil_write_csv_test.py

I get different results for every run: sometimes negative numbers, sometimes values closer to 0, but never anything around 80 MB...

@jetschny (Contributor, Author) commented Dec 7, 2023

@cozzolinoac11 and @BachirNILU
Ah well, I believe I see the problem now.

At first I thought the Measurer "listens" to I/O access on a specific path (not data_path?): if I change the destination of my csv output to the same directory where I write my measurer statistics, then I see the 81 MB data size; if both files are written to different folders, the Measurer does not report the correct data size...

Now I have tested a bit more and it is actually a matter of whether the program's output file already exists or not. If I remove my output, the Measurer works fine; if my program "overwrites" the output with the same size and data, the Measurer does not detect it! I see the same behavior with General_Test.py: once you run it more than once, it gives wrong results.

@BachirNILU (Contributor)

Hi,

@jetschny, yes. If I am not mistaken, this is what @cozzolinoac11 had in mind for 'Data Size'.
'Data Size' answers the following question: how much data does my program add to or delete from the hard disk?
So if your program does not change the data on the hard disk (it just overwrites an existing file with the same content), it contributed 0 MB (hence a Data size of 0). If it deletes a file, a negative value is returned, indicating removed data.
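
Conceptually, this boils down to a before/after delta of the used disk space, sketched below (this is only an illustration of the idea, not the measurer's actual code, and the path is an example):

```python
import shutil

path = "/"  # example: the partition containing the monitored folder

used_before = shutil.disk_usage(path).used
# ... run the program that writes / overwrites / deletes files ...
used_after = shutil.disk_usage(path).used

data_size_mb = (used_after - used_before) / 1024**2
# ~0 when an existing file is overwritten with data of the same size,
# negative when files are deleted
```
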
@cozzolinoac11 please correct me if I am wrong.

Best regards,

-Bachir.

@cozzolinoac11 (Member)

Hi @BachirNILU
yes, this is my idea for the "Data size" field.

cheers

@jetschny (Contributor, Author)

Many thanks for the clarification. I see the point and I believe it adds value to our table. However, "I/O stream volume" (the amount of data being written, and actually being read as well) is also important, and it was my original thought when we "requested the table". We can keep data_size but have to explain it properly. In addition, can you, @cozzolinoac11, think of a metric to determine the size of the output regardless of whether the file already exists?

@cozzolinoac11 (Member)

Hi @jetschny

I can add two fields:

  • Data written: calculated as [(system-wide number of bytes written at end of script) - (system-wide number of bytes written at start of script)]
  • Data read: calculated as [(system-wide number of bytes read at end of script) - (system-wide number of bytes read at start of script)]

For these fields I can use the psutil.disk_io_counters function, which returns system-wide disk I/O statistics.

This also counts overwrites but, on the other hand, because it is a system-wide calculation, any writes/reads that happen concurrently with the script are counted as well.
The quality of these measures depends on how many and which processes are using the disk during the execution of the script. Therefore, if during runtime the system only (or nearly only) executes the script that reads/writes to disk (as may frequently be the case), the measurements should be quite accurate.
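
A minimal sketch of that calculation (psutil.disk_io_counters and its read_bytes/write_bytes fields are as documented; the before/after wrapper is only illustrative):

```python
import psutil

counters_start = psutil.disk_io_counters()
# ... run the script ...
counters_end = psutil.disk_io_counters()

data_written_mb = (counters_end.write_bytes - counters_start.write_bytes) / 1024**2
data_read_mb = (counters_end.read_bytes - counters_start.read_bytes) / 1024**2
```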

Regarding 'Data size' we can also consider renaming it and making it more explanatory.

What do you think about it?

@jetschny (Contributor, Author)

@cozzolinoac11 that sounds very good!

@BachirNILU (Contributor)

Hi @cozzolinoac11,

Thanks again for your work.
For the 'Data size' field, I propose the following names: 'Disk I/O' or 'Disk Activity'.

Best regards,

-Bachir.

@jetschny (Contributor, Author)

Disk I/O or disk activity actually means something different to me: the amount of data read from and written to disk, regardless of whether files are being overwritten.
For "data size" I would rather think of something like "data size added to storage" or "created data size".

@BachirNILU (Contributor)

Hi @cozzolinoac11,

Thank you again for your work.
I have recently tested the measurer on EOXHub.
I had some issues with the compute resources metrics (see the screenshots attached):

1- Main memory available (GB) was reported as 61 GB, whereas 32 GB are available.
2- Processor frequency was reported as 0.0 GHz.
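
For context, a common way to read these two values is via psutil, roughly as sketched below (whether measurer.py uses exactly these calls is an assumption); on virtualized or containerized hosts, virtual_memory() often reports the physical machine's RAM rather than the allocated quota, and cpu_freq() can return nothing or 0.0:

```python
import psutil

mem_gb = psutil.virtual_memory().total / 1024**3  # may be the host's RAM, not the container quota
freq = psutil.cpu_freq()                          # can be None, or report 0.0, in some VMs/containers
freq_ghz = freq.current / 1000 if freq else 0.0   # psutil reports MHz
print(mem_gb, freq_ghz)
```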

Can you investigate this?

Thanks in advance,

-Bachir.

[Screenshot: IssueMemoryEOX]

[Screenshot: IssueMemoryEOX2]

@jetschny (Contributor, Author)

@cozzolinoac11: while I see that the issue with data I/O and disk space added to disk can be resolved by proper labeling of the values, what about the problems Bachir reported on EOXHub? Can these be worked on in the near future?

@cozzolinoac11 (Member)

Hi, I have performed several tests and in none of them did I encounter this issue. For the I/O values, the fields have been renamed. Regarding main memory, could it be that less memory (32 GB) is allocated to the development environment than is available on the entire machine (61 GB)?

@jetschny (Contributor, Author)

@BachirNILU can you please have a look? If the issue has indeed been resolved, we can close it here...

@BachirNILU (Contributor)

Hi,

@cozzolinoac11 thank you for checking this.
While the reported main memory might be that of the entire machine (instead of the allocated one), I still get 0.0 GHz reported as the frequency.
I am not sure whether you have access to one of the FiC EOX profiles?
It might be good to test it on the server from your side.

PS: I took the liberty of correcting a typo in measurer.py.

Thanks in advance.

Best regards,

-Bachir.

@cozzolinoac11 (Member)

Hi @BachirNILU
I can test the measurer on the server.
I probably don't have access to a FiC EOX profile; I'll request one.

PS: thanks for the correction in measurer.py.

@jetschny (Contributor, Author)

@cozzolinoac11 I have added you to the UC4 EOX group... you can start logging into https://eoxhub.fairicube.eu/

@cozzolinoac11 (Member) commented May 3, 2024

Hi @jetschny @BachirNILU,
thanks for adding me to the UC4 EOX group.

I have just made changes to the measurer, using a different library for the CPU frequency.

As a test on EOXHub, I trained a CNN on the CIFAR10 example dataset and the results returned by the measurer are all consistent. The files are in the EOX_HUB_test folder.

Thanks for your feedback

@mari-s4e (Contributor) commented Nov 21, 2024

Hi, I have recently used the measurer in the EOX Lab and I get incorrect results for the data size metrics:

Measure              Value
Data size (MB)       0.0
Data read (MB)       0.0
Data written (MB)    0.15

But two new files (7.5 MB and 350 bytes) are created. This was the first run of the script, so the files were not overwritten.
From the above discussion I gather that Data written should be ~7.6 MB, right?
I'd also expect the other two metrics to be non-zero, as I am loading several tiff files into memory. I am using the AWS s3fs API for reading the files, and I have set up the measurer as described in aws_example.py. Any idea what is going wrong?
