there has to be way to use a newer version of the license list #39

zvr · 2019-03-17T17:13:59Z

The license list should be updated to the latest list published by SPDX, v3.4.
https://spdx.org/licenses/

Chinmay-Gurjar · 2019-03-19T10:08:15Z

I have created a Python script that extracts data from the spdx.org website and stores it in a .csv file. Should I include the script in my pull request too, so that whenever the license list gets updated we can just run the script and update our license list?

gopuvenkat · 2019-03-19T17:03:02Z

I have maintained my script as a public gist.

zvr · 2019-03-19T22:24:14Z

The upstream data to be used are in https://github.com/spdx/license-list-data

Chinmay-Gurjar · 2019-03-21T20:08:47Z

Can we add https://github.com/spdx/license-list-data to our project and extract the data from the JSON file instead of a .csv file?
@zvr
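As a rough illustration of reading the upstream JSON rather than a csv, here is a minimal sketch. It assumes the index file in the repository (`json/licenses.json`) has a top-level `licenseListVersion` and a `licenses` array whose entries carry `licenseId`, `name` and `isDeprecatedLicenseId`; those field names should be checked against the actual file. The helper name `load_license_list` and the inline sample data are made up for the example.

```python
import json


def load_license_list(text):
    """Parse licenses.json content into (version, {licenseId: entry})."""
    data = json.loads(text)
    # Index entries by their SPDX identifier for easy lookup later.
    licenses = {entry["licenseId"]: entry for entry in data["licenses"]}
    return data.get("licenseListVersion"), licenses


# Tiny hand-written sample in the assumed shape of licenses.json.
SAMPLE = """
{
  "licenseListVersion": "3.4",
  "licenses": [
    {"licenseId": "MIT", "name": "MIT License",
     "isDeprecatedLicenseId": false},
    {"licenseId": "GPL-2.0", "name": "GNU General Public License v2.0 only",
     "isDeprecatedLicenseId": true}
  ]
}
"""

version, licenses = load_license_list(SAMPLE)
print(version, sorted(licenses))
```

In real use the text would come from a downloaded or checked-out copy of the file instead of the inline sample.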

zvr · 2019-03-21T20:29:45Z

That would only solve the issue of initial import of licenses.
What would be a solution for updating the list when a new version is published?

Chinmay-Gurjar · 2019-03-21T20:36:32Z

Then we should extract data directly from https://spdx.org/licenses/, not from the repository https://github.com/spdx/license-list-data.

zvr · 2019-03-21T20:39:42Z

?!?
The website is generated from the data; the information is the same.
My question is: you get the data and you use them to populate the database. What do you do when a new version of the license list is published (in both the license-list-data repo and the spdx.org website)?

Chinmay-Gurjar · 2019-03-21T20:53:41Z

The script by @gopuvenkat at https://gist.github.com/gopuvenkat/1c8b9f75d366c191f1ec4afffb84696f would do the thing with just amending the license-text attribute.
@zvr but when I tried it, there were some formatting issues, which I will have to solve.

zvr · 2019-03-21T21:06:24Z

No, you don't understand. Ignore Gopu's script (which incorrectly uses the website instead of the data).
You have some license data (from the repo or the website), and you populate the database with this info.
Then you start using clio, add your data about components, etc.
Then a new SPDX license list is published. What do you do?

You cannot re-run populate_license() again, since most of the licenses are already in the database... and you definitely do not want to delete everything and start from scratch again.

Chinmay-Gurjar · 2019-03-21T21:15:54Z

This solution may sound lame, but we can write a script that checks for new licenses in https://github.com/spdx/license-list-data.
@zvr please share your thoughts if you have some other ideas.

shivanshuraj1333 · 2019-03-25T08:19:57Z

@zvr and @gopuvenkat The possible solution I can think of uses hashing; please refer to the following steps.
1). On initial clio startup, populate the database using a csv file created from the json file at https://github.com/spdx/license-list-data/tree/master/json (currently the csv file is generated from https://spdx.org/licenses/).
2). Sync our csv file with the json file whenever a new commit is made in the GitHub repo maintaining the json file (https://github.com/spdx/license-list-data/tree/master/json). The GitHub API can be used to track commits.
3). Use hashing to update the csv file (append new entries and modify existing entries).
4). Populate the database from the updated csv using a button (update button) or a time-based job scheduler (cron job).
In this way, the update process is reduced to O(n) time complexity and unnecessary changes to our database are avoided.
@zvr, @gopuvenkat please share your views.
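The hashing idea in step 3 could look roughly like the sketch below: fingerprint each entry and compare fingerprints keyed by license identifier. This is only one reading of the proposal; the function names `fingerprint` and `changed_ids` and the sample entries are hypothetical.

```python
import hashlib
import json


def fingerprint(entry):
    """Return a stable SHA-256 hex digest of a license entry (a dict)."""
    # sort_keys makes the digest independent of key order.
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_ids(old, new):
    """Return sorted ids that are new or whose entry differs in `new`."""
    old_hashes = {key: fingerprint(value) for key, value in old.items()}
    return sorted(
        key for key, value in new.items()
        if old_hashes.get(key) != fingerprint(value)
    )


old = {"MIT": {"name": "MIT License"}, "X-1.0": {"name": "X"}}
new = {"MIT": {"name": "MIT License"}, "X-1.0": {"name": "X v1"},
       "Y-1.0": {"name": "Y"}}
print(changed_ids(old, new))
```

Each mapping is scanned once, so the comparison is linear in the number of entries. Note that this sketch only reports additions and modifications, not entries that disappeared upstream.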

Chinmay-Gurjar · 2019-03-26T17:24:41Z

There is one more easy and efficient way out.
We could write a script that clones the repository from https://github.com/spdx/license-list-data and uses the "git diff" command to find added and updated files, then appends those to the csv file.
This would be more efficient than the method proposed above, because there we would be comparing each entry for hashing, which would eventually be O(n*n), and only the writing part would be O(n).
Please share your thoughts @zvr @gopuvenkat @shivanshu1333
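For the "git diff" idea, a small sketch of the parsing side follows. It assumes a local clone and that the script captures the output of `git diff --name-status <old> <new>`, which prints a status letter and a path separated by a tab. The function name `parse_name_status` and the sample paths are invented for illustration; renames and other statuses are ignored here.

```python
def parse_name_status(output):
    """Group paths from `git diff --name-status` output by change kind."""
    changes = {"added": [], "modified": [], "deleted": []}
    kinds = {"A": "added", "M": "modified", "D": "deleted"}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line is "<status>\t<path>".
        status, _, path = line.partition("\t")
        kind = kinds.get(status[:1])
        if kind:  # skip renames, copies, etc. in this sketch
            changes[kind].append(path)
    return changes


sample_output = (
    "A\tjson/details/New-1.0.json\n"
    "M\tjson/licenses.json\n"
    "D\tjson/details/Old-1.0.json\n"
)
print(parse_name_status(sample_output))
```

The deleted bucket matters: appending only added and modified files would leave removed entries behind in the csv.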

shivanshuraj1333 · 2019-03-26T17:49:08Z

@Chinmay-Gurjar Thanks for your efforts!

There is no need to clone the complete repository; the GitHub API can easily be used to track any new commit to the json file.
There might be cases where a few entries are removed and a few are modified, so just appending changes will not sync the csv file with the json file in the repository.
Reading will be O(n) if we use hashing and writing will be O(k), where n is the total number of entries and k is the number of modified entries.
@Chinmay-Gurjar For further questions you can reach us at https://clio.zulipchat.com/#narrow/stream/121073-general/topic/GSoC2019
@zvr @gopuvenkat please review this possible solution, so that I can work on this issue and make a pull request before GSoC 2019 application period.
Thanks!

zvr · 2019-03-26T22:33:31Z

A couple of points:

- There is no need to track commits; SPDX license list releases happen every quarter or so.
- Yes, the usual case is that licenses get added. However, we also have modifications and deletions (deprecations).
- No need to prematurely optimize something that will be run once every 3 months.

I think this ticket has evolved into how do we keep adding new license list versions; I'll edit the title to reflect that.

shivanshuraj1333 · 2019-03-27T07:06:46Z

@zvr and @gopuvenkat I am working on this issue. As a temporary solution I will add an Update button to the existing clio page which, when clicked, fetches and updates the data in the background.
Later on I will add a job scheduler to run this job automatically.

zvr · 2019-03-27T07:56:55Z

Forget the job scheduler; no one will ever want this to run automatically.

But it remains to be decided what to do with the modified license data... what do you propose?

shivanshuraj1333 · 2019-03-27T08:24:48Z

@zvr
Each license has a unique identity (say, the license name); I will use it as a key and check all the other parameters. If any parameter is modified, I will update it in clio's database.
If there is no such existing key (i.e. a new entry), I will simply add it to clio's database.
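The add-or-update step described here can be sketched as follows. This is not clio's code: it uses an in-memory SQLite database as a stand-in, and the table `license`, its columns, and the helper `upsert_license` are all hypothetical. It also keys on the license identifier rather than the name, which is an assumption on my part.

```python
import sqlite3


def upsert_license(conn, license_id, name):
    """Insert the license if its key is unknown, update it if it changed."""
    row = conn.execute(
        "SELECT name FROM license WHERE license_id = ?", (license_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO license (license_id, name) VALUES (?, ?)",
            (license_id, name),
        )
        return "added"
    if row[0] != name:
        conn.execute(
            "UPDATE license SET name = ? WHERE license_id = ?",
            (name, license_id),
        )
        return "updated"
    return "unchanged"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE license (license_id TEXT PRIMARY KEY, name TEXT)")
print(upsert_license(conn, "MIT", "MIT"))          # first time: new key
print(upsert_license(conn, "MIT", "MIT License"))  # same key, changed name
print(upsert_license(conn, "MIT", "MIT License"))  # nothing to do
```

A real table would have more columns to compare; the pattern stays the same.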

shivanshuraj1333 · 2019-03-27T08:54:45Z

Is an update button on the license page of clio fine for accomplishing this?

zvr · 2019-03-27T20:04:52Z

@shivanshu1333 and what about deleted license identifiers?

shivanshuraj1333 · 2019-03-29T08:44:48Z

@zvr
I will maintain a dict in my script which hashes the current data from the database, along with two additional keys, "Boolean" and "PK" (the primary key of the database table entries). While reading the json file, for every hit (i.e. the license is found in the dict) it will update the other fields if necessary and set the Boolean field to "True". So whenever there is a deletion, the Boolean field will remain "False" and we will delete the entry from our dict. The PK field will be used to track back and update entries in our database.

zvr · 2019-03-29T10:52:47Z

So basically you will be using your own "database" (in the form of a dict) to store this information.
This is not correct; what will happen when the program stops and starts again? (this info will be lost)

shivanshuraj1333 · 2019-03-29T11:37:16Z

@zvr No,

The dict will be used to avoid direct interaction with the database.
It will be used temporarily to track changes.
All the data from the dict will be applied to the MySQL database according to the PK (primary key of the database table) and Boolean fields.
NOTE: T indicates modified/new entries and F indicates deleted entries.

Please refer to this rough block diagram for a better understanding.
The dict will be created when the script to update the database runs, and destroyed when it finishes.

shivanshuraj1333 · 2019-03-30T11:26:21Z

@zvr I have almost completed the base script to accomplish this.
The only thing left is optimisation. What do you think about the method described above?
In the proposed method the website will never be down.

zvr · 2019-03-30T22:32:37Z

I still don't see why you need the dict...
Since you are going to be processing the licenses one by one, why don't you update the database for each one? It seems you want to process the data, keep the "results" in a dict and then apply them to the database. Why?

The most important issue, though, is what to do with deleted licenses...
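To make the question concrete, here is one possible shape for a direct, dict-free update, offered only as a sketch and not as the decision reached in this thread. It assumes that licenses missing from the new list are flagged as deprecated rather than deleted, so existing component data that references them stays valid. SQLite stands in for the real database; the table, columns, and `sync_licenses` are hypothetical.

```python
import sqlite3


def sync_licenses(conn, upstream):
    """Apply upstream {license_id: name}; return ids flagged as deprecated."""
    for license_id, name in upstream.items():
        # Update in place; if no row matched, the license is new.
        cur = conn.execute(
            "UPDATE license SET name = ?, deprecated = 0 WHERE license_id = ?",
            (name, license_id),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO license (license_id, name, deprecated) "
                "VALUES (?, ?, 0)",
                (license_id, name),
            )
    # Anything in the database but absent upstream is flagged, not deleted.
    existing = {row[0] for row in conn.execute("SELECT license_id FROM license")}
    missing = sorted(existing - set(upstream))
    conn.executemany(
        "UPDATE license SET deprecated = 1 WHERE license_id = ?",
        [(license_id,) for license_id in missing],
    )
    conn.commit()
    return missing


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE license "
    "(license_id TEXT PRIMARY KEY, name TEXT, deprecated INTEGER)"
)
conn.execute("INSERT INTO license VALUES ('Old-1.0', 'Old', 0)")
conn.execute("INSERT INTO license VALUES ('MIT', 'MIT', 0)")
flagged = sync_licenses(conn, {"MIT": "MIT License", "New-1.0": "New"})
print(flagged)
```

Since the state lives in the database, nothing is lost if the program stops and starts again between runs.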

shivanshuraj1333 · 2019-03-30T23:03:54Z

> It seems you want to process the data, keep the "results" in a dict and then apply them to the database. Why?

Because it will help to track deleted/modified licenses.
The dict will contain an additional Boolean field.
When processing licenses from https://github.com/spdx/license-list-data, if a license is deleted its Boolean field will be FALSE, else TRUE. Only licenses with a TRUE Boolean field will be populated in the database.
Also, it avoids direct interaction with the database, which is good practice.

shivanshuraj1333 · 2019-03-31T08:49:50Z

Okay, let's skip using the dict. After discussion I found a more efficient way (fewer resources used and higher throughput).

zvr changed the title ~~update license list~~ there has to be way to use a newer version of the license list Mar 26, 2019

zvr mentioned this issue Mar 27, 2019: Add update Button on License Page #44 (Open)

zvr mentioned this issue Mar 29, 2019: Added complete license text #48 (Open)
