This is the repository for the paper:
CPLS: Optimizing the Assignment of LLM Queries
Abstract: Large Language Models (LLMs) like ChatGPT have gained significant attention because of their impressive capabilities, leading to a dramatic increase in their integration into intelligent software engineering. However, their usage as a service with varying performance and price options presents a challenging trade-off between desired performance and the associated cost. To address this challenge, we propose CPLS, a framework that utilizes transfer learning and local search techniques for assigning intelligent software engineering jobs to LLM-based services. CPLS aims to minimize the total cost of LLM invocations while maximizing the overall accuracy. The framework first leverages knowledge from historical data across different projects to predict the probability of an LLM processing a query correctly. Then, CPLS incorporates problem-specific rules into a local search algorithm to effectively generate Pareto optimal solutions based on the predicted accuracy and cost.
CPLS consists of two main components: cross-project prediction and local-search based optimization. The prediction component utilizes knowledge from historical data across different projects to estimate the probability of an LLM correctly processing a query. With the cost and predicted accuracy, the optimization component selects the most suitable LLM for each query through problem-specific rules.
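As a rough illustration of the pipeline, the sketch below uses hypothetical names (`predict_accuracy`, `history_model`, `local_search_assignment`) and scalarizes the two objectives with a single weight for brevity; the actual implementation, which maintains a Pareto set of assignments, is under the CPLS/CPLS directory.

```python
import numpy as np

def predict_accuracy(queries, llms, history_model):
    """Cross-project prediction stage (placeholder): returns an (n_queries, n_llms)
    matrix of predicted probabilities that each LLM answers each query correctly."""
    return np.array([[history_model(q, m) for m in llms] for q in queries])

def local_search_assignment(pred_acc, cost, weight=0.5, iters=1000, seed=0):
    """Optimization stage (simplified): start from the most accurate LLM per query,
    then accept random single-query reassignments that improve a weighted
    accuracy/cost score. The real CPLS keeps a set of Pareto-optimal assignments."""
    rng = np.random.default_rng(seed)
    n, k = pred_acc.shape
    assign = pred_acc.argmax(axis=1)
    rows = np.arange(n)

    def score(a):
        return weight * pred_acc[rows, a].mean() - (1 - weight) * cost[rows, a].sum()

    best = score(assign)
    for _ in range(iters):
        q, m = rng.integers(n), rng.integers(k)
        if m == assign[q]:
            continue
        cand = assign.copy()
        cand[q] = m
        s = score(cand)
        if s > best:
            assign, best = cand, s
    return assign

# Example with random stand-in data: 100 queries, 4 candidate LLMs.
pred_acc = np.random.rand(100, 4)
cost = np.random.rand(100, 4) * 0.002
assignment = local_search_assignment(pred_acc, cost)
```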
To demonstrate the generalizability of CPLS, we conducted an additional experiment on text classification. For this experiment, we used the Overruling dataset as the training data to optimize the assignment of LLMs for the Headlines dataset. The Headlines dataset consists of news headlines about gold prices, and the task is to classify each headline into one of four categories: up, down, neutral, or none, based on the price movement mentioned in the headline. An example is given as follows:
Please determine the price direction (up, down, neutral, or none) in the following news headlines. Question: Feb. gold gains $10.30, or 0.9%, to settle at $1,162/oz
(Answer: up)
An example of a job from the Overruling dataset is:
Context: As such, La. Civ. Code art. 2365 is not applicable to the matter before us, and we specifically overrule this holding in Nash. Question: Is it overruling?
(Answer: yes)
By using the Overruling dataset to train CPLS and then applying it to the Headlines classification task, we aim to showcase the adaptability of our approach to different domains. The results of this experiment are presented below:
The solution with the highest accuracy by all algorithms for LLMs allocation
Cost ($) and Savings by CPLS to match the baseline's highest accuracy
Comparisons of Solution Sets from All Algorithms in terms of IGD, $\Delta$, and Computation Time
We present the solution with the highest accuracy and the solution with the lowest cost. Compared to the baselines, CPLS achieves a 2.91-7.21% improvement in accuracy or a 90-98% reduction in cost. Moreover, we also compare the quality of the whole solution sets, considering both cost and accuracy, using the Inverted Generational Distance (IGD) and the delta measure ($\Delta$).
The prompt template used in the log parsing experiment is as follows:
You will be provided with a log message. You must identify and abstract all the dynamic variables in logs with {{placeholders}} and output a static log template. For example:
The template of [EXAMPLE LOG MESSAGE] is [EXAMPLE template].
The template of [LOG MESSAGE] is
To investigate the influence of the queries, we also evaluate a different prompt template that doesn't contain any example (0-shot). The prompt template is as follows:
You will be provided with a log message. You must identify and abstract all the dynamic variables in logs with {placeholders} and output a static log template. Print the input log's template delimited by backticks.
Log message: [LOG MESSAGE]
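For reference, the helpers below show one way to fill the two templates programmatically; the function names are illustrative and may differ from the prompt-assembly code in the repository.

```python
def build_one_shot_prompt(log, example_log, example_template):
    """1-shot variant: includes a worked example before the query log."""
    return (
        "You will be provided with a log message. You must identify and abstract all "
        "the dynamic variables in logs with {{placeholders}} and output a static log "
        "template. For example:\n"
        f"The template of {example_log} is {example_template}.\n"
        f"The template of {log} is"
    )

def build_zero_shot_prompt(log):
    """0-shot variant: no example, asks for the template delimited by backticks."""
    return (
        "You will be provided with a log message. You must identify and abstract all "
        "the dynamic variables in logs with {placeholders} and output a static log "
        "template. Print the input log's template delimited by backticks.\n"
        f"Log message: {log}"
    )

print(build_zero_shot_prompt("Connection from 10.0.0.1 closed"))
```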
We conduct experiments on a random selection of three instances due to time constraints. The experimental results demonstrate the effectiveness of CPLS. On Apache, CPLS improves accuracy by 1.33%-50.29% or reduces cost by 2.22%-86.00%. For Linux, accuracy improvements range from 3.96%-43.84%, and cost savings span 1.01%-76.93%. Proxifier sees the most substantial gains, with accuracy improvements of 28.43%-144.82% and cost savings of 58.09%-82.42%.
CPLS still achieves a competitive improvement on Proxifier. The possible reason is that the current similarity analysis relies on log messages, and the performance of LLMs on this dataset is comparatively lower. In such cases, the prediction model's effectiveness may be limited. As previously discussed, inaccurate predictions can mislead even high-performance optimization algorithms, as they depend on predicted accuracies to guide the search process. Without problem-specific information to guide the search, baseline algorithms may encounter difficulties in escaping local optima and produce a narrow range of solutions, resulting in suboptimal outcomes.
The solution with the highest accuracy by all algorithms for LLMs allocation
Cost ($) and Savings by CPLS to match the baseline's highest accuracy
Comparisons of Solution Sets from All Algorithms in terms of IGD, $\Delta$, and Computation Time
- An Effective Approach for Parsing Large Log Files (2022 ICSME)
- How to Configure Anomaly Event Detector on Software Logs? (2022 ICSME)
- Improving Log-Based Anomaly Detection with Component-Aware Analysis (2020 ICSME)
- An Exploratory Study of Logging Configuration Practice in Java (2019 ICSME)
- Comparing Constraints Mined From Execution Logs to Understand Software Evolution (2019 ICSME)
- An Approach to Recommendation of Verbosity Log Levels Based on Logging Intention (2019 ICSME)
- KnowLog: Knowledge Enhanced Pre-trained Language Model for Log Understanding (2024 ICSE)
- DivLog: Log Parsing with Prompt Enhanced In-Context Learning (2024 ICSE)
- LogShrink: Effective Log Compression by Leveraging Commonality and Variability of Log Data (2024 ICSE)
- MetaLog: Generalizable Cross-System Anomaly Detection from Logs with Meta-Learning (2024 ICSE)
- LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing (2024 ICSE)
- An Exploratory Investigation of Log Anomalies in Unmanned Aerial Vehicles (2024 ICSE)
- A Semantic-aware Parsing Approach for Log Analytics (2023 ICSE)
- Did We Miss Something Important? Studying and Exploring Variable-Aware Log Abstraction (2023 ICSE)
- How Do Developers' Profiles and Experiences Influence their Logging Practices? An Empirical Study of Industrial Practitioners (2023 ICSE)
- Log Parsing with Prompt-based Few-shot Learning (2023 ICSE)
- LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly (2023 ICSE)
- PILAR: Studying and Mitigating the Influence of Configurations on Log Parsing (2022 ICSE)
- Guidelines for Assessing the Accuracy of Log Message Template Identification Techniques (2022 ICSE)
- Log-based Anomaly Detection with Deep Learning: How Far Are We? (2022 ICSE)
- DeepTraLog: Trace-Log Combined Microservice Anomaly Detection through Graph-based Deep Learning (2022 ICSE)
- Using Deep Learning to Generate Complete Log Statements (2022 ICSE)
- Automated Query Reformulation for Efficient Search Based on Query Logs from Stack Overflow (2021 ICSE)
- DeepLV: Suggesting Log Levels Using Ordinal Based Neural Networks (2021 ICSE)
- Semi-supervised Log-based Anomaly Detection via Probabilistic Label Estimation (2021 ICSE)
- Studying the Use of Java Logging Utilities in the Wild (2020 ICSE)
CPLS utilizes a heuristic search-based algorithm for optimization. We compare the effectiveness of this algorithm with well-known multi-objective optimization algorithms, including the Non-dominated Sorting Genetic Algorithm (NSGA-II)^[8], Multi-objective Particle Swarm Optimisation (MOPSO)^[9], and the Multi-objective Evolutionary Algorithm with Decomposition (MOEA/D)^[10]. These three algorithms have been extensively studied and have proven effective on a wide range of multi-objective optimization problems. In addition, three variants of the classic algorithms are also compared: R-NSGA-II^[11], SMS-EMOA^[12], and MOEA/D-GEN^[13]. Note that all the evaluated multi-objective optimization algorithms are integrated with the same prediction component as CPLS, to enable a fair comparison of the optimization strategies.
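As an illustration of how such a baseline is set up, the sketch below encodes an assignment as one integer-valued variable per query and optimizes total cost and (negated) predicted accuracy with Pymoo's NSGA-II. The encoding, operators, and random stand-in data are simplified assumptions, not the exact experimental configuration.

```python
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize

class LLMAssignment(ElementwiseProblem):
    """Toy encoding: x[i] selects the LLM index for query i."""
    def __init__(self, pred_acc, cost):
        self.pred_acc, self.cost = pred_acc, cost
        n, k = pred_acc.shape
        super().__init__(n_var=n, n_obj=2, xl=0, xu=k - 1)

    def _evaluate(self, x, out, *args, **kwargs):
        sel = np.clip(np.round(x).astype(int), 0, self.pred_acc.shape[1] - 1)
        rows = np.arange(len(sel))
        f_cost = self.cost[rows, sel].sum()        # total cost (minimize)
        f_acc = self.pred_acc[rows, sel].mean()    # expected accuracy (maximize)
        out["F"] = [f_cost, -f_acc]                # pymoo minimizes both objectives

pred_acc = np.random.rand(50, 4)    # stand-in for the prediction component's output
cost = np.random.rand(50, 4) * 0.01
res = minimize(LLMAssignment(pred_acc, cost), NSGA2(pop_size=100), ("n_gen", 200), seed=1)
print(res.F)                        # non-dominated (cost, -accuracy) pairs
```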
Optuna is a widely used hyperparameter optimization package. To ensure the effectiveness and efficiency of all algorithms, we conduct parameter tuning using Optuna to choose optimal parameter settings. Based on the experiments, the parameters of algorithms are set as follows:
Algorithm | Parameter Settings |
---|---|
CPLS | initial_solution_size: 87, sampling_sum: 29 |
NSGA-II | crossover_prob: 0.5169, crossover_eta: 3, mutation_prob: 0.4258, mutation_eta: 27, sampling: 'LHS', selection: 'RandomSelection' |
R-NSGA-II | epsilon: 0.6912 |
SMS-EMOA | crossover_prob: 0.8244, crossover_eta: 20, mutation_prob: 0.3149, mutation_eta: 19, sampling: 'FloatRandomSampling' |
MOEA/D | weight_generation: 'grid', decomposition: 'weighted', neighbours: 29 |
MOEA/D-GEN | weight_generation: 'low discrepancy', decomposition: 'tchebycheff', neighbours: 16 |
MOPSO | omega: 0.3634, c1: 0.8446, c2: 0.2482, v_coeff: 0.9121 |
The record of the tuning process is available under the CPLS/parameter_setting/res directory.
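A minimal sketch of the tuning loop is shown below; the search ranges and the `run_and_score` evaluation function are illustrative placeholders (in the actual setup, the objective runs an algorithm on validation instances and scores its solution sets).

```python
import optuna

def run_and_score(params):
    # Hypothetical placeholder: in the real setup this runs the algorithm with
    # `params` on validation instances and returns e.g. the mean IGD of its solution sets.
    return abs(params["crossover_prob"] - 0.8) + abs(params["mutation_prob"] - 0.3)

def objective(trial):
    params = {
        "crossover_prob": trial.suggest_float("crossover_prob", 0.1, 1.0),
        "crossover_eta": trial.suggest_int("crossover_eta", 1, 30),
        "mutation_prob": trial.suggest_float("mutation_prob", 0.0, 1.0),
        "mutation_eta": trial.suggest_int("mutation_eta", 1, 30),
    }
    return run_and_score(params)

study = optuna.create_study(direction="minimize")   # lower IGD is better
study.optimize(objective, n_trials=100)
print(study.best_params)
```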
When assessing the performance of a single solution, such as submitting all jobs to an individual LLM, a direct comparison of the optimization objectives is feasible. The evaluation metrics are:
- $f_{cost}$: the total cost of invoking LLM APIs
- $f_{acc}$: the percentage of jobs processed accurately
- Inverted Generational Distance (IGD): measures the distance between the obtained solution set and the Pareto front (reference point set); a lower IGD value indicates better performance.
- $\Delta$ metric: assesses the diversity and distribution of solutions across the Pareto front by measuring Euclidean distances between consecutive solutions and the two extreme solutions (a computation sketch for IGD and $\Delta$ follows this list).
- Computation time: the time taken to obtain the solution set, measured in minutes.
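The snippet below sketches how IGD (via Pymoo's indicator) and Deb's $\Delta$ spread can be computed from objective values; the reference front and obtained solution set here are placeholder arrays.

```python
import numpy as np
from pymoo.indicators.igd import IGD

pf = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])   # reference (Pareto) front
F = np.array([[0.1, 0.9], [0.6, 0.5], [1.1, 0.1]])    # objectives of an obtained set

igd_value = IGD(pf)(F)                                 # lower is better

def delta_metric(F, extremes):
    """Deb's spread (Delta): diversity of F relative to the two extreme solutions."""
    F = F[np.argsort(F[:, 0])]
    d = np.linalg.norm(np.diff(F, axis=0), axis=1)     # consecutive distances
    d_f = np.linalg.norm(F[0] - extremes[0])           # gap to the first extreme
    d_l = np.linalg.norm(F[-1] - extremes[1])          # gap to the last extreme
    return (d_f + d_l + np.abs(d - d.mean()).sum()) / (d_f + d_l + len(d) * d.mean())

delta_value = delta_metric(F, extremes=(pf[0], pf[-1]))
print(igd_value, delta_value)
```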
To verify the significance of these comparisons, we conduct statistical tests on the performance of CPLS and the baselines. We use the following statistical tests:
Friedman Test: The Friedman test is a non-parametric statistical test that ranks the algorithms for each dataset separately. It tests the null hypothesis that all algorithms perform equally well. If the null hypothesis is rejected, it means that there are significant differences among the algorithms' performances.
Nemenyi Test: The Nemenyi test is a post-hoc test that is performed after the Friedman test if the null hypothesis is rejected. It is used to determine which specific pairs of algorithms have significant differences in their performance.
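The tests can be run as sketched below, assuming a `scores` matrix with one row per instance and one column per algorithm; SciPy provides the Friedman test, and the Nemenyi post-hoc here uses the scikit-posthocs package, which may differ from the exact tooling used.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

scores = np.random.rand(16, 7)            # e.g. IGD of 7 algorithms on 16 instances

stat, p = friedmanchisquare(*scores.T)    # one sample per algorithm (column)
print(f"Friedman statistic={stat:.3f}, p-value={p:.4f}")

if p < 0.05:                              # run the post-hoc test only if Friedman rejects H0
    nemenyi = sp.posthoc_nemenyi_friedman(scores)   # pairwise p-values between algorithms
    print(nemenyi)
```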
The Friedman test is a non-parametric statistical test used to compare multiple paired samples. The test is based on ranking the data within each block (i.e., each sample) and comparing the average ranks between the different groups. The following table shows the p-values of the Friedman test for the 16 instances on IGD, $\Delta$, and computation time.
Overall, the Friedman test results for all five datasets show extremely small p-values, indicating strong evidence against the null hypothesis. This suggests that there are significant differences between the groups being compared for each dataset. The results provide compelling evidence to reject the null hypothesis and accept the alternative hypothesis that at least one group differs from the others.
All the code is available under the CPLS directory.
- Python 3.11
- Pymoo
- tiktoken
- ...
To install all libraries: $ pip install -r requirements.txt
To run the experiments: $ python exp.py
All source code is available under the CPLS directory.
We used the standard versions of NSGA-II, R-NSGA-II, and SMS-EMOA implemented in the Pymoo library^[14], and the MOPSO and MOEA/D implementations from the Pygmo library.
The source code of the baselines is available under the CPLS/baselines directory.
The code of the proposed CPLS is available under the CPLS/CPLS directory.
| Script | Description |
|---|---|
| nsga2.py | Non-dominated Sorting Genetic Algorithm (NSGA-II) |
| rnsga2.py | Reference point based Non-dominated Sorting Genetic Algorithm (R-NSGA-II) |
| smsemoa.py | SMS-EMOA |
| moead.py | Multi-objective Evolutionary Algorithm with Decomposition (MOEA/D) |
| moeadgen.py | MOEA/D-GEN |
| mopso.py | Multi-objective Particle Swarm Optimization (MOPSO) |