CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments

Kung-Hsiang Huang, Akshara Prabhakar, Sidharth Dhawan, Yixin Mao, Huan Wang, Silvio Savarese, Caiming Xiong, Philippe Laban, Chien-Sheng Wu

Salesforce AI Research

Abstract

Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.

Note: This repository is for research purposes only and not for commercial use.

Quickstart

pip install -e .

To access our org, use the following credentials.

SALESFORCE_USERNAME=[email protected]
SALESFORCE_PASSWORD=crmarenatest
SALESFORCE_SECURITY_TOKEN=ugvBSBv0ArI7dayfqUY0wMGu

Accessing the Org via GUI

To access the GUI of our Org, follow the steps below:

Head to https://login.salesforce.com/.
Enter the username and password from the credentials above.

You can now see the GUI of our Org.

Accessing the Org via API

First, store your Salesforce org credentials and your OpenAI / AWS Bedrock / TogetherAI API keys in a .env file:

OPENAI_API_KEY=...
...
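
The connection snippet in this section reads credentials with os.getenv, so the values in .env must be present in the process environment first. A dedicated package such as python-dotenv is the usual choice for this. The following is only a minimal standard-library sketch, assuming a plain KEY=VALUE file. The helper name load_env_file is hypothetical and not part of this repository.

```python
import os


def load_env_file(path=".env"):
    """Load KEY=VALUE lines from a file into os.environ; return the parsed dict.

    Hypothetical helper for illustration; not part of CRMArena.
    """
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blanks, comments, and lines without a KEY=VALUE shape.
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first "=" only, so values may contain "=".
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            loaded[key] = value
            # Do not overwrite variables already set in the real environment.
            os.environ.setdefault(key, value)
    return loaded
```

Calling load_env_file() once at the start of a script makes the stored keys visible to later os.getenv calls.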

Then, you can use Simple Salesforce to connect to our Org.

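from simple_salesforce import Salesforce
import os

# Credentials are read from environment variables (e.g., loaded from .env).
sf = Salesforce(
    username=os.getenv("SALESFORCE_USERNAME"),
    password=os.getenv("SALESFORCE_PASSWORD"),
    security_token=os.getenv("SALESFORCE_SECURITY_TOKEN"),
)

# Run a SOQL query and fetch all result pages.
sf.query_all(...)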
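
Simple Salesforce returns query results as a dictionary whose "records" list holds one entry per row, each carrying an extra "attributes" metadata key. The sketch below shows one way to reduce such a result to plain field dictionaries. The helper name records_to_rows and the example SOQL query in the comments are illustrative assumptions, not taken from this repository.

```python
def records_to_rows(result):
    """Flatten a Salesforce query result into a list of plain field dicts.

    Hypothetical helper for illustration; drops the per-record
    "attributes" metadata that the REST API attaches to each row.
    """
    rows = []
    for record in result.get("records", []):
        rows.append({k: v for k, v in record.items() if k != "attributes"})
    return rows


# Usage (needs a live connection `sf`; the query is an assumed example
# against the standard Case object):
# result = sf.query_all("SELECT Id, Subject, Status FROM Case LIMIT 5")
# for row in records_to_rows(result):
#     print(row)
```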

Running experiments

To run experiments, first download the CRMArena queries and schema from Hugging Face:

from datasets import load_dataset

queries = load_dataset("Salesforce/CRMArena", "CRMArena")
schema = load_dataset("Salesforce/CRMArena", "schema")

Please refer to crm_sandbox/data/assets.py for more details.
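
The split names and field names of the two configurations are not listed here, so it can help to inspect what load_dataset returns before writing any task code. The following is a small sketch under that assumption. The helper name summarize_splits is hypothetical, and it relies only on the loaded object behaving like a mapping from split name to an indexable sequence of dictionaries.

```python
def summarize_splits(dataset_dict):
    """Return {split_name: {"rows": count, "fields": [names]}} for each split.

    Hypothetical helper for illustration; field names are read from the
    first record of each split, so nothing about the schema is assumed.
    """
    summary = {}
    for split_name, split in dataset_dict.items():
        n = len(split)
        fields = sorted(split[0].keys()) if n else []
        summary[split_name] = {"rows": n, "fields": fields}
    return summary


# Usage with the objects loaded above:
# print(summarize_splits(queries))
# print(summarize_splits(schema))
```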

Alternatively, we have prepared the evaluation scripts. Configure your setup in run_tasks.sh and launch experiments:

bash run_tasks.sh

Citation

If you find this work useful, please consider citing:

@misc{huang-etal-2024-crmarena,
    title = "CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments",
    author = "Huang, Kung-Hsiang  and
      Prabhakar, Akshara  and
      Dhawan, Sidharth  and
      Mao, Yixin  and
      Wang, Huan  and
      Savarese, Silvio  and
      Xiong, Caiming  and
      Laban, Philippe  and
      Wu, Chien-Sheng",
    year = "2024",
    archivePrefix = "arXiv",
    eprint={2411.02305},
    primaryClass={cs.CL}
}

Ethical Considerations

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
