Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Facing issues in finetuning JSON data using qwen 2.5 1.5b coder instruct #229

Open
salmankhh8 opened this issue Dec 25, 2024 · 1 comment

Comments

@salmankhh8
Copy link

salmankhh8 commented Dec 25, 2024

Hi Team

Currently I am fine tuning Qwen2.5-1.5b-coder-instruct model to generate text and JSON responses

Objective : model should be able to understand our low-code JSON architecture, answer and generate JSON responses.

I have a dataset size of 12,000 rows, in which 2,500 rows explanation of each keys and its structure in JSON, and rest 10,000 are user instruction query and with complete JSON response.
below example similar to our low-code json =>

{
    "pageName": "UserManagementDashboard",
    "properties": {
      "id": "userDashboard",
      "title": "User Management",
      "theme": "light"
    },
    "sections": [
      {
        "sectionName": "Header",
        "properties": {
          "id": "headerSection",
          "title": "User Management Dashboard",
          "style": {
            "fontSize": "24px",
            }
          }
          "table":{
          "columns": [
                    { "key": "id", "label": "ID", "width": "10%" },
                    { "key": "name", "label": "Name", "width": "30%" },
                    { "key": "email", "label": "Email", "width": "30%" },
                    { "key": "role", "label": "Role", "width": "20%" },
               ],
          "rows": [
          {
              "id": "1",
              "name": "Alice Johnson",
              "email": "[email protected]",
              "role": "Admin"
            },
            {
              "id": "2",
              "name": "Bob Smith",
              "email": "[email protected]",
              "role": "Editor"
            },
         }
        ]
      }

here are few samples of my datasets (only for understanding the requirement)=>

 {
       "prompt":"what does page name signify in low-code json architecture?",//similar 2500 rows.
       "completion":"In low-code JSON 'page name' define the name of page here is the json structure to define page name=>  ```json {
           "pageName": "UserManagementDashboard",
            "properties": {
              "id": "userDashboard",
              "title": "User Management",
              "theme": "light"
            }
       }```"
},
{ // 10000 rows
       "prompt":"generate a page json for low-code architecture with page name 'procurementDetails', set page title as 'Procurement 
       Information' with 'dark' theme, add table section with sectionName 'Details section' and table title as 'shipmentData', with 
       columns shipmentId, shipment type, shimpmentPartner, and delivery date."
       
       "completions":" here is generated json for you low-code json woth 'pageName' 'UserManagementDashboard' ,section name as  
        'Details section' , table tile 'shipmentData' and column names as follows 'shipmentId', 'shipment type', 'shimpmentPartner', and 
         'delivery date' generated JSON response \n\n 
         ```json 
         {
    "pageName": "UserManagementDashboard",
    "properties": {
      "id": "userDashboard",
      "title": "User Management",
      "theme": "light"
    },
    "sections": [
      {
        "sectionName": "Header",
        "properties": {
          "id": "headerSection",
          "title": "User Management Dashboard",
          "style": {
            "fontSize": "24px",
            }
          }
          "table":{
          "columns": [
                    { "key": "id", "label": "ID", "width": "10%" },
                    { "key": "name", "label": "Name", "width": "30%" },
                    { "key": "email", "label": "Email", "width": "30%" },
                    { "key": "role", "label": "Role", "width": "20%" },
               ],
          "rows": []
      }
}

Issues faced tried multiple ways to finetune but looks like model is considering JSON as plain text only and giving gibberish reponses multiple times.
example=> here is generated json for you low-code json woth 'pageName' 'UserManagementDashboard' ,section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section.

required support
1- what is the correct format of dataset to fine tune LLM model with our own JSON architecture.
2- does model understand string json or minified string json or parsed json (json mentioned in "completions" is parsed json)
3-if necessary any code or documentation explaining about finetuning with sample dataset.
4- is 12500 size datasets enough to finetune for 1.5b model?
if no pls tell me what should be the total size of datasets,
if yes pls tell me correct format approach to fine tune the model, so far looks like model is considering my json example as plain text only.

@cyente
Copy link
Collaborator

cyente commented Jan 3, 2025

1、 here are our finetuning scripts https://github.com/QwenLM/Qwen2.5-Coder/tree/main/finetuning/sft
2、our model understand string ,
3、refer to 1;
4、i am not sure, but this size should suffice for a basic trial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants