Facing issues in finetuning JSON data using qwen 2.5 1.5b coder instruct #229

salmankhh8 · 2024-12-25T23:54:50Z

Hi Team

Currently I am fine tuning Qwen2.5-1.5b-coder-instruct model to generate text and JSON responses

Objective : model should be able to understand our low-code JSON architecture, answer and generate JSON responses.

I have a dataset size of 12,000 rows, in which 2,500 rows explanation of each keys and its structure in JSON, and rest 10,000 are user instruction query and with complete JSON response.
below example similar to our low-code json =>

{
    "pageName": "UserManagementDashboard",
    "properties": {
      "id": "userDashboard",
      "title": "User Management",
      "theme": "light"
    },
    "sections": [
      {
        "sectionName": "Header",
        "properties": {
          "id": "headerSection",
          "title": "User Management Dashboard",
          "style": {
            "fontSize": "24px",
            }
          }
          "table":{
          "columns": [
                    { "key": "id", "label": "ID", "width": "10%" },
                    { "key": "name", "label": "Name", "width": "30%" },
                    { "key": "email", "label": "Email", "width": "30%" },
                    { "key": "role", "label": "Role", "width": "20%" },
               ],
          "rows": [
          {
              "id": "1",
              "name": "Alice Johnson",
              "email": "[email protected]",
              "role": "Admin"
            },
            {
              "id": "2",
              "name": "Bob Smith",
              "email": "[email protected]",
              "role": "Editor"
            },
         }
        ]
      }

here are few samples of my datasets (only for understanding the requirement)=>

 {
       "prompt":"what does page name signify in low-code json architecture?",//similar 2500 rows.
       "completion":"In low-code JSON 'page name' define the name of page here is the json structure to define page name=>  ```json {
           "pageName": "UserManagementDashboard",
            "properties": {
              "id": "userDashboard",
              "title": "User Management",
              "theme": "light"
            }
       }```"
},
{ // 10000 rows
       "prompt":"generate a page json for low-code architecture with page name 'procurementDetails', set page title as 'Procurement 
       Information' with 'dark' theme, add table section with sectionName 'Details section' and table title as 'shipmentData', with 
       columns shipmentId, shipment type, shimpmentPartner, and delivery date."
       
       "completions":" here is generated json for you low-code json woth 'pageName' 'UserManagementDashboard' ,section name as  
        'Details section' , table tile 'shipmentData' and column names as follows 'shipmentId', 'shipment type', 'shimpmentPartner', and 
         'delivery date' generated JSON response \n\n 
         ```json 
         {
    "pageName": "UserManagementDashboard",
    "properties": {
      "id": "userDashboard",
      "title": "User Management",
      "theme": "light"
    },
    "sections": [
      {
        "sectionName": "Header",
        "properties": {
          "id": "headerSection",
          "title": "User Management Dashboard",
          "style": {
            "fontSize": "24px",
            }
          }
          "table":{
          "columns": [
                    { "key": "id", "label": "ID", "width": "10%" },
                    { "key": "name", "label": "Name", "width": "30%" },
                    { "key": "email", "label": "Email", "width": "30%" },
                    { "key": "role", "label": "Role", "width": "20%" },
               ],
          "rows": []
      }
}

Issues faced tried multiple ways to finetune but looks like model is considering JSON as plain text only and giving gibberish reponses multiple times.
example=> here is generated json for you low-code json woth 'pageName' 'UserManagementDashboard' ,section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section"section":"section.

required support
1- what is the correct format of dataset to fine tune LLM model with our own JSON architecture.
2- does model understand string json or minified string json or parsed json (json mentioned in "completions" is parsed json)
3-if necessary any code or documentation explaining about finetuning with sample dataset.
4- is 12500 size datasets enough to finetune for 1.5b model?
if no pls tell me what should be the total size of datasets,
if yes pls tell me correct format approach to fine tune the model, so far looks like model is considering my json example as plain text only.

The text was updated successfully, but these errors were encountered:

cyente · 2025-01-03T08:32:16Z

1、 here are our finetuning scripts https://github.com/QwenLM/Qwen2.5-Coder/tree/main/finetuning/sft
2、our model understand string ，
3、refer to 1；
4、i am not sure, but this size should suffice for a basic trial.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Facing issues in finetuning JSON data using qwen 2.5 1.5b coder instruct #229

Facing issues in finetuning JSON data using qwen 2.5 1.5b coder instruct #229

salmankhh8 commented Dec 25, 2024 •

edited

Loading

cyente commented Jan 3, 2025

Facing issues in finetuning JSON data using qwen 2.5 1.5b coder instruct #229

Facing issues in finetuning JSON data using qwen 2.5 1.5b coder instruct #229

Comments

salmankhh8 commented Dec 25, 2024 • edited Loading

cyente commented Jan 3, 2025

salmankhh8 commented Dec 25, 2024 •

edited

Loading