This repository has been archived by the owner on Dec 3, 2019. It is now read-only.

there has to be way to use a newer version of the license list #39

Open
zvr opened this issue Mar 17, 2019 · 26 comments

Comments

@zvr
Collaborator

zvr commented Mar 17, 2019

The license list should be updated to the latest list published by SPDX, v3.4.
https://spdx.org/licenses/

@Chinmay-Gurjar

I have created a Python script that extracts data from the spdx.org website and stores it in a .csv file. Should I include the script in my pull request too, so that whenever the license list gets updated we can just run the script and update our license list?

@gopuvenkat
Collaborator

I have maintained my script as a public gist.

@zvr
Collaborator Author

zvr commented Mar 19, 2019

The upstream data to be used are in https://github.com/spdx/license-list-data

@Chinmay-Gurjar

Can we add https://github.com/spdx/license-list-data to our project and extract the data from the json file instead of a .csv file? @zvr

@zvr
Collaborator Author

zvr commented Mar 21, 2019

That would only solve the issue of initial import of licenses.
What would be a solution for updating the list when a new version is published?

@Chinmay-Gurjar

Then we should extract the data directly from https://spdx.org/licenses/, not from the repository https://github.com/spdx/license-list-data.

@zvr
Collaborator Author

zvr commented Mar 21, 2019

?!?
The website is generated from the data; the information is the same.
My question is: you get the data and you use them to populate the database. What do you do when a new version of the license list is published (in both the license-list-data repo and the spdx.org website)?

@Chinmay-Gurjar

The script by @gopuvenkat at https://gist.github.com/gopuvenkat/1c8b9f75d366c191f1ec4afffb84696f would do the job; only the license-text attribute needs amending.
@zvr when I tried it, though, there were some formatting issues that I will have to solve.

@zvr
Collaborator Author

zvr commented Mar 21, 2019

No, you don't understand. Ignore Gopu's script (which incorrectly uses the website instead of the data).
You have some license data (from the repo or the website), and you populate the database with this info.
Then you start using clio, add your data about components, etc.
Then a new SPDX license list is published. What do you do?

You cannot re-run populate_license() again, since most of the licenses are already in the database... and you definitely do not want to delete everything and start from scratch again.

@Chinmay-Gurjar

This solution may sound lame, but we can write a script that checks for new licenses in https://github.com/spdx/license-list-data.
@zvr please share your thoughts if you have other ideas.

@shivanshuraj1333
Contributor

shivanshuraj1333 commented Mar 25, 2019

@zvr and @gopuvenkat, one possible solution I can think of uses hashing. Please refer to the following steps.
1). On initial clio startup, populate the database using a csv file created from the json file at https://github.com/spdx/license-list-data/tree/master/json (currently the csv file is generated from https://spdx.org/licenses/).
2). Keep our csv file in sync with the json file whenever a new commit is made to the GitHub repo that maintains it (https://github.com/spdx/license-list-data/tree/master/json). The GitHub API can be used to track commits.
3). Use hashing to update the csv file (append new entries and modify existing ones).
4). Load the updated csv into the database using an update button or a time-based job scheduler (cron job).
In this way, the update process is reduced to O(n) time complexity and unnecessary changes to our database are avoided.
@zvr, @gopuvenkat please share your views.
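
For illustration, the hashing step could look roughly like the sketch below. It is not the script discussed in this thread: the function names, the idea of storing one digest per license from the previous import, and the sample records are all assumptions. The `licenseId` and `name` keys are meant to mirror the entries in the SPDX JSON list.

```python
import hashlib
import json


def fingerprint(entry: dict) -> str:
    """Return a stable SHA-256 digest of one license entry."""
    # sort_keys makes the digest independent of key order in the JSON file.
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_ids(stored: dict, incoming: list) -> list:
    """List identifiers that are new or whose digest differs.

    `stored` maps licenseId -> digest saved at the previous import
    (a hypothetical bookkeeping structure, not part of clio).
    """
    result = []
    for entry in incoming:
        lic_id = entry["licenseId"]
        # One dict lookup per entry keeps the whole pass linear.
        if stored.get(lic_id) != fingerprint(entry):
            result.append(lic_id)
    return result


# Hypothetical sample data, not real SPDX records.
old = [
    {"licenseId": "A-1.0", "name": "License A"},
    {"licenseId": "B-2.0", "name": "License B"},
]
new = [
    {"licenseId": "A-1.0", "name": "License A"},
    {"licenseId": "B-2.0", "name": "License B (revised)"},
    {"licenseId": "C-3.0", "name": "License C"},
]

stored_digests = {e["licenseId"]: fingerprint(e) for e in old}
print(changed_ids(stored_digests, new))
```

Note that a pass like this only finds new and modified entries; identifiers that disappear from the incoming list need a separate check.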

@Chinmay-Gurjar

There is one more easy and efficient way out.
We could write a script that clones the repository from https://github.com/spdx/license-list-data and uses the "git diff" command to find added and updated files, then appends those entries to the csv file.
This would be more efficient than the method proposed above, because that method compares each entry for hashing, which will eventually be O(n*n), and only the writing part will be O(n).
Please share your thoughts @zvr @gopuvenkat @shivanshu1333
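
A rough sketch of the "git diff" idea, assuming a local clone of license-list-data and two refs (for example release tags) to compare. `git diff --name-status` is a real git option; the function names, the `json/details` path filter, and the sample output are assumptions made for illustration.

```python
import subprocess


def parse_name_status(output: str) -> dict:
    """Group paths from `git diff --name-status` output by change type."""
    changes = {"added": [], "modified": [], "deleted": []}
    kinds = {"A": "added", "M": "modified", "D": "deleted"}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line is "<status>\t<path>"; only A, M and D are handled here.
        status, _, path = line.partition("\t")
        kind = kinds.get(status[:1])
        if kind:  # renames, copies, etc. are ignored in this sketch
            changes[kind].append(path)
    return changes


def diff_between(repo_dir: str, old_ref: str, new_ref: str) -> dict:
    """Run git in a local clone and classify the changed files.

    The path filter "json/details" is an assumption about the repo layout.
    """
    out = subprocess.run(
        ["git", "diff", "--name-status", old_ref, new_ref, "--", "json/details"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return parse_name_status(out)


# Hypothetical git output, used here instead of a real clone.
sample = (
    "A\tjson/details/C-3.0.json\n"
    "M\tjson/details/B-2.0.json\n"
    "D\tjson/details/Old-1.0.json\n"
)
print(parse_name_status(sample))
```

Unlike a plain append, this also reports deleted files, which matters for licenses that are removed or deprecated.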

@shivanshuraj1333
Contributor

@Chinmay-Gurjar Thanks for your efforts!

  1. There is no need to clone the complete repository; the GitHub API can easily be used to track any new commit to the json file.
  2. There might be cases where a few entries are removed and a few are modified, so just appending changes will not keep the csv file in sync with the repository's json file.
  3. Reading will be O(n) if we use hashing and writing will be O(k), where n is the total number of entries and k is the number of modified entries.

@Chinmay-Gurjar For further questions you can reach us at https://clio.zulipchat.com/#narrow/stream/121073-general/topic/GSoC2019
@zvr @gopuvenkat please review this possible solution, so that I can work on this issue and make a pull request before the GSoC 2019 application period.
Thanks!

@zvr
Collaborator Author

zvr commented Mar 26, 2019

A couple of points:

  1. There is no need to track commits; SPDX license list releases happen every quarter or so.
  2. Yes, the usual case is that licenses get added. However, we also have modifications and deletions (deprecations).
  3. No need to prematurely optimize something that will be run once every 3 months.
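
As a sketch of what comparing two releases might involve, the snippet below classifies entries as added, modified, deprecated, or removed. It assumes each release is loaded as a dict with a `licenses` list whose entries carry `licenseId` and `isDeprecatedLicenseId`, as in the SPDX JSON list; the function name and the sample data are hypothetical.

```python
def compare_releases(old_doc: dict, new_doc: dict) -> dict:
    """Classify license identifiers by how they changed between releases."""
    old = {lic["licenseId"]: lic for lic in old_doc["licenses"]}
    new = {lic["licenseId"]: lic for lic in new_doc["licenses"]}
    report = {"added": [], "modified": [], "deprecated": [], "removed": []}
    for lic_id, entry in new.items():
        before = old.get(lic_id)
        if before is None:
            report["added"].append(lic_id)
        elif entry.get("isDeprecatedLicenseId") and not before.get("isDeprecatedLicenseId"):
            # Newly flagged as deprecated in this release.
            report["deprecated"].append(lic_id)
        elif entry != before:
            report["modified"].append(lic_id)
    # Identifiers that vanished from the list entirely.
    report["removed"] = sorted(set(old) - set(new))
    return report


# Hypothetical sample releases, not real SPDX data.
old_doc = {"licenses": [
    {"licenseId": "A-1.0", "name": "License A", "isDeprecatedLicenseId": False},
    {"licenseId": "B-2.0", "name": "License B", "isDeprecatedLicenseId": False},
    {"licenseId": "D-4.0", "name": "License D", "isDeprecatedLicenseId": False},
]}
new_doc = {"licenses": [
    {"licenseId": "A-1.0", "name": "License A", "isDeprecatedLicenseId": False},
    {"licenseId": "B-2.0", "name": "License B (revised)", "isDeprecatedLicenseId": False},
    {"licenseId": "C-3.0", "name": "License C", "isDeprecatedLicenseId": False},
    {"licenseId": "D-4.0", "name": "License D", "isDeprecatedLicenseId": True},
]}

report = compare_releases(old_doc, new_doc)
print(report)
```

With a few hundred licenses and a quarterly cadence, a straightforward comparison like this is cheap enough that no further optimization is needed.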

I think this ticket has evolved into how do we keep adding new license list versions; I'll edit the title to reflect that.

@zvr zvr changed the title from "update license list" to "there has to be way to use a newer version of the license list" Mar 26, 2019
@shivanshuraj1333
Contributor

@zvr and @gopuvenkat I am working on this issue. As a temporary solution I will add an Update button to the existing clio page which, when clicked, will fetch and update the data in the background.
Later on I will add a job scheduler to run this job automatically.

@zvr
Collaborator Author

zvr commented Mar 27, 2019

Forget the job scheduler; no one will ever want this to run automatically.

But it remains to be decided what to do with the modified license data... what do you propose?

@shivanshuraj1333
Contributor

@zvr
Each license has a unique identity (say, the license name). I will use it as a key and check all the other parameters. If any parameter has been modified, I will update it in clio's database.
If there is no such existing key (i.e. a new entry), then I will simply add it to clio's database.
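
A minimal sketch of this key-based update, using an in-memory SQLite database as a stand-in for clio's real database. The table layout, column names, and `sync_license` function are assumptions for illustration; the SPDX identifier is used as the key here rather than the license name.

```python
import sqlite3


def sync_license(conn, identifier, name, osi_approved):
    """Insert a new license or update a changed one; report what happened."""
    row = conn.execute(
        "SELECT name, osi_approved FROM license WHERE identifier = ?",
        (identifier,),
    ).fetchone()
    if row is None:
        # No such key yet: this is a new entry.
        conn.execute(
            "INSERT INTO license (identifier, name, osi_approved) VALUES (?, ?, ?)",
            (identifier, name, int(osi_approved)),
        )
        return "added"
    if row != (name, int(osi_approved)):
        # Key exists but some other field differs: update in place.
        conn.execute(
            "UPDATE license SET name = ?, osi_approved = ? WHERE identifier = ?",
            (name, int(osi_approved), identifier),
        )
        return "updated"
    return "unchanged"


# Hypothetical schema and data, not clio's actual model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE license ("
    "identifier TEXT PRIMARY KEY, name TEXT, osi_approved INTEGER)"
)
first = sync_license(conn, "A-1.0", "License A", True)
second = sync_license(conn, "A-1.0", "License A", True)
third = sync_license(conn, "A-1.0", "License A (revised)", True)
conn.commit()
print(first, second, third)
```

Because existing rows are updated in place, anything else in the database that refers to a license keeps pointing at the same row.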

@shivanshuraj1333
Contributor

Is an update button on clio's license page fine for accomplishing this?

@zvr
Collaborator Author

zvr commented Mar 27, 2019

@shivanshu1333 and what about deleted license identifiers?

@shivanshuraj1333
Contributor

@zvr
I will maintain a dict in my script that holds the current data from the database, keyed by license, along with two additional fields: "Boolean" and "PK" (the primary key of the database table entry). While reading the json file, for every hit (i.e. the license is found in the dict) the script will update the other fields if necessary and set the Boolean field to "True". Whenever a license has been deleted, its Boolean field will remain "False" and we will delete it from our dict. The PK field will be used to trace back and update the entries in our database.
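
The tracking dict described here might look roughly like the following sketch. The row layout, the `seen` flag standing in for the "Boolean" field, and the `plan_update` function are all assumptions; it only plans the changes and does not touch a database.

```python
def plan_update(db_rows, incoming):
    """Return (identifiers to insert, PKs to update, PKs missing upstream)."""
    # Build the temporary tracker from the current database rows.
    tracker = {
        row["identifier"]: {"pk": row["pk"], "name": row["name"], "seen": False}
        for row in db_rows
    }
    to_insert, to_update = [], []
    for entry in incoming:
        slot = tracker.get(entry["licenseId"])
        if slot is None:
            to_insert.append(entry["licenseId"])
            continue
        slot["seen"] = True  # a "hit": the license is still upstream
        if slot["name"] != entry["name"]:
            to_update.append(slot["pk"])
    # Rows never marked as seen no longer appear in the new list.
    missing = [slot["pk"] for slot in tracker.values() if not slot["seen"]]
    return to_insert, to_update, missing


# Hypothetical database rows and incoming list.
db_rows = [
    {"pk": 1, "identifier": "A-1.0", "name": "License A"},
    {"pk": 2, "identifier": "B-2.0", "name": "License B"},
    {"pk": 3, "identifier": "Old-1.0", "name": "Old License"},
]
incoming = [
    {"licenseId": "A-1.0", "name": "License A"},
    {"licenseId": "B-2.0", "name": "License B (revised)"},
    {"licenseId": "C-3.0", "name": "License C"},
]

plan = plan_update(db_rows, incoming)
print(plan)
```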

@zvr
Collaborator Author

zvr commented Mar 29, 2019

So basically you will be using your own "database" (in the form of a dict) to store this information.
This is not correct; what will happen when the program stops and starts again? (this info will be lost)

@shivanshuraj1333
Contributor

shivanshuraj1333 commented Mar 29, 2019

@zvr No,

  • The dict will be used to avoid direct interaction with the database.
  • It will be used temporarily to track changes.
  • All the data from the dict will be written to the MySQL database according to the pk (primary key of the database) and Boolean fields.
    NOTE: T indicates modified/new entries and F indicates deleted entries.

Please refer to this rough block diagram for better understanding.
The dict will be created when the database update script runs and destroyed when it finishes.
[Image: block diagram of the update process]

@shivanshuraj1333
Contributor

  • @zvr I have almost completed the base script to accomplish this.
  • The only thing left is optimisation. What do you think about the method described above?
  • With the proposed method, the website will never be down.

@zvr
Collaborator Author

zvr commented Mar 30, 2019

I still don't see why you need the dict...
Since you are going to be processing the licenses one by one, why don't you update the database for each one? It seems you want to process the data, keep the "results" in a dict and then apply them to the database. Why?

The most important issue, though, is what to do with deleted licenses...
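
One possible way to update the database license by license without a side dict is sketched below, again with in-memory SQLite as a stand-in. It stamps every processed row with the release being imported, then flags rows that were not stamped. The `list_version` and `deprecated` columns, the version labels, and the choice to flag rather than delete missing licenses are assumptions, not a decision made in this thread.

```python
import sqlite3


def apply_release(conn, version, licenses):
    """Update or insert each license, then flag rows absent from this release."""
    for lic in licenses:
        deprecated = int(lic.get("isDeprecatedLicenseId", False))
        cur = conn.execute(
            "UPDATE license SET name = ?, list_version = ?, deprecated = ? "
            "WHERE identifier = ?",
            (lic["name"], version, deprecated, lic["licenseId"]),
        )
        if cur.rowcount == 0:  # nothing matched, so this license is new
            conn.execute(
                "INSERT INTO license (identifier, name, list_version, deprecated) "
                "VALUES (?, ?, ?, ?)",
                (lic["licenseId"], lic["name"], version, deprecated),
            )
    # Rows still carrying an older version were not in this release.
    cur = conn.execute(
        "UPDATE license SET deprecated = 1 WHERE list_version <> ?", (version,)
    )
    conn.commit()
    return cur.rowcount


# Hypothetical schema, version labels and data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE license (identifier TEXT PRIMARY KEY, name TEXT, "
    "list_version TEXT, deprecated INTEGER)"
)
apply_release(conn, "v1", [
    {"licenseId": "A-1.0", "name": "License A"},
    {"licenseId": "B-2.0", "name": "License B"},
])
flagged = apply_release(conn, "v2", [
    {"licenseId": "A-1.0", "name": "License A (revised)"},
    {"licenseId": "C-3.0", "name": "License C"},
])
rows = conn.execute(
    "SELECT identifier, name, list_version, deprecated "
    "FROM license ORDER BY identifier"
).fetchall()
print(flagged, rows)
```

Since the bookkeeping lives in the table itself, nothing is lost if the program stops between runs, and rows for withdrawn licenses stay available to any components that still reference them.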

@shivanshuraj1333
Contributor

It seems you want to process the data, keep the "results" in a dict and then apply them to the database. Why?

Because it will help to track deleted/modified licenses.
The dict will contain an additional Boolean field.
When processing licenses from https://github.com/spdx/license-list-data, if a license has been deleted its Boolean field will be FALSE, else TRUE. Only licenses with a TRUE Boolean field will be populated in the database.
Also, it will avoid direct interaction with the database, which is a good practice.

@shivanshuraj1333
Contributor

Okay, let's skip using the dict. After discussion, I found a more efficient way (fewer resources used and higher throughput).
