Computation and Language 68
☆ Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Complex 3D scene understanding has gained increasing attention, with scene
encoding strategies playing a crucial role in this success. However, the
optimal scene encoding strategies for various scenarios remain unclear,
particularly compared to their image-based counterparts. To address this issue,
we present a comprehensive study that probes various visual encoding models for
3D scene understanding, identifying the strengths and limitations of each model
across different scenarios. Our evaluation spans seven vision foundation
encoders, including image-based, video-based, and 3D foundation models. We
evaluate these models on four tasks: Vision-Language Scene Reasoning, Visual
Grounding, Segmentation, and Registration, each focusing on different aspects
of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates
superior performance, video models excel in object-level tasks, diffusion
models benefit geometric tasks, and language-pretrained models show unexpected
limitations in language-related tasks. These insights challenge some
conventional understandings, provide novel perspectives on leveraging visual
foundation models, and highlight the need for more flexible encoder selection
in future vision-language and scene-understanding tasks.
comment: Project page: https://yunzeman.github.io/lexicon3d , Github:
https://github.com/YunzeMan/Lexicon3D
☆ WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild
The increasing availability of real-world conversation data offers exciting
opportunities for researchers to study user-chatbot interactions. However, the
sheer volume of this data makes manually examining individual conversations
impractical. To overcome this challenge, we introduce WildVis, an interactive
tool that enables fast, versatile, and large-scale conversation analysis.
WildVis provides search and visualization capabilities in the text and
embedding spaces based on a list of criteria. To manage million-scale datasets,
we implemented optimizations including search index construction, embedding
precomputation and compression, and caching to ensure responsive user
interactions within seconds. We demonstrate WildVis's utility through three
case studies: facilitating chatbot misuse research, visualizing and comparing
topic distributions across datasets, and characterizing user-specific
conversation patterns. WildVis is open-source and designed to be extendable,
supporting additional datasets and customized search and visualization
functionalities.
☆ Attention Heads of Large Language Models: A Survey
Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in
various tasks but remain largely black-box systems. Consequently, their
development relies heavily on data-driven approaches, limiting performance
enhancement through changes in internal architecture and reasoning pathways. As
a result, many researchers have begun exploring the potential internal
mechanisms of LLMs, aiming to identify the essence of their reasoning
bottlenecks, with most studies focusing on attention heads. Our survey aims to
shed light on the internal reasoning processes of LLMs by concentrating on the
interpretability and underlying mechanisms of attention heads. We first distill
the human thought process into a four-stage framework: Knowledge Recalling,
In-Context Identification, Latent Reasoning, and Expression Preparation. Using
this framework, we systematically review existing research to identify and
categorize the functions of specific attention heads. Furthermore, we summarize
the experimental methodologies used to discover these special heads, dividing
them into two categories: Modeling-Free methods and Modeling-Required methods.
Also, we outline relevant evaluation methods and benchmarks. Finally, we
discuss the limitations of current research and propose several potential
future directions. Our reference list is open-sourced at
\url{https://github.com/IAAR-Shanghai/Awesome-Attention-Heads}.
comment: 20 pages, 11 figures, 4 tables
☆ Planning In Natural Language Improves LLM Search For Code Generation
Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang
While scaling training compute has led to remarkable improvements in large
language models (LLMs), scaling inference compute has not yet yielded analogous
gains. We hypothesize that a core missing component is a lack of diverse LLM
outputs, leading to inefficient search due to models repeatedly sampling highly
similar, yet incorrect generations. We empirically demonstrate that this lack
of diversity can be mitigated by searching over candidate plans for solving a
problem in natural language. Based on this insight, we propose PLANSEARCH, a
novel search algorithm which shows strong results across HumanEval+, MBPP+, and
LiveCodeBench (a contamination-free benchmark for competitive coding).
PLANSEARCH generates a diverse set of observations about the problem and then
uses these observations to construct plans for solving the problem. By
searching over plans in natural language rather than directly over code
solutions, PLANSEARCH explores a significantly more diverse range of potential
solutions compared to baseline search methods. Using PLANSEARCH on top of
Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on
LiveCodeBench, outperforming both the best score achieved without search
(pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%).
Finally, we show that, across all models, search algorithms, and benchmarks
analyzed, we can accurately predict performance gains due to search as a direct
function of the diversity over generated ideas.
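As described, PLANSEARCH first generates observations about the problem, combines subsets of them into natural-language plans, and only then produces code from each plan. A minimal sketch of that control flow, with stubbed helpers standing in for the LLM calls (all function names here are illustrative, not the paper's API):

```python
from itertools import combinations

def generate_observations(problem):
    # Stub for an LLM call that lists diverse observations about the problem.
    return [f"observation {i} about {problem}" for i in range(3)]

def make_plan(problem, observation_subset):
    # Stub: an LLM would merge the chosen observations into a solution plan.
    return f"plan({problem}; " + "; ".join(observation_subset) + ")"

def plan_to_code(plan):
    # Stub: an LLM would translate the natural-language plan into code.
    return f"# solution derived from {plan}"

def plan_search(problem, subset_size=2):
    """Search over natural-language plans instead of raw code samples."""
    obs = generate_observations(problem)
    plans = [make_plan(problem, s) for s in combinations(obs, subset_size)]
    return [plan_to_code(p) for p in plans]

candidates = plan_search("two-sum")
print(len(candidates))  # 3 candidates, one per C(3, 2) observation subset
```

Because each candidate descends from a different observation subset, the generated solutions are forced apart, which is the diversity mechanism the abstract credits for the pass@200 gains.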
☆ RAG based Question-Answering for Contextual Response Prediction System CIKM'24
Large Language Models (LLMs) have shown versatility in various Natural
Language Processing (NLP) tasks, including their potential as effective
question-answering systems. However, to provide precise and relevant
information in response to specific customer queries in industry settings, LLMs
require access to a comprehensive knowledge base to avoid hallucinations.
Retrieval Augmented Generation (RAG) emerges as a promising technique to
address this challenge. Yet, developing an accurate question-answering
framework for real-world applications using RAG entails several challenges: 1)
data availability issues, 2) evaluating the quality of generated content, and
3) the costly nature of human evaluation. In this paper, we introduce an
end-to-end framework that employs LLMs with RAG capabilities for industry use
cases. Given a customer query, the proposed system retrieves relevant knowledge
documents and leverages them, along with previous chat history, to generate
response suggestions for customer service agents in the contact centers of a
major retail company. Through comprehensive automated and human evaluations, we
show that this solution outperforms the current BERT-based algorithms in
accuracy and relevance. Our findings suggest that RAG-based LLMs can be an
excellent support to human customer service representatives by lightening their
workload.
comment: Accepted at the 1st Workshop on GenAI and RAG Systems for Enterprise,
CIKM'24. 6 pages
☆ A Different Level Text Protection Mechanism With Differential Privacy
The article introduces a method for extracting words of different degrees of
importance based on the BERT pre-training model and proves the effectiveness of
this method. The article also discusses the impact of maintaining the same
perturbation results for words of different importance on the overall text
utility. This method can be applied to long text protection.
☆ LAST: Language Model Aware Speech Tokenization
Speech tokenization serves as the foundation of speech language models (LMs),
enabling them to perform various tasks such as spoken language modeling,
text-to-speech, speech-to-text, etc. Most speech tokenizers are trained
independently of the LM training process, relying on separate acoustic models
and quantization methods. Following such an approach may create a mismatch
between the tokenization process and its usage afterward. In this study, we
propose a novel approach to training a speech tokenizer by leveraging
objectives from pre-trained textual LMs. We advocate for the integration of
this objective into the process of learning discrete speech representations.
Our aim is to transform features from a pre-trained speech model into a new
feature space that enables better clustering for speech LMs. We empirically
investigate the impact of various model design choices, including speech
vocabulary size and text LM size. Our results demonstrate that the proposed
tokenization method outperforms the evaluated baselines on both spoken
language modeling and speech-to-text. More importantly, unlike prior work, the
proposed method allows the utilization of a single pre-trained LM for
processing both speech and text inputs, setting it apart from conventional
tokenization approaches.
☆ A Fused Large Language Model for Predicting Startup Success
Investors are continuously seeking profitable investment opportunities in
startups and, hence, for effective decision-making, need to predict a startup's
probability of success. Nowadays, investors can use not only various
fundamental information about a startup (e.g., the age of the startup, the
number of founders, and the business sector) but also textual description of a
startup's innovation and business model, which is widely available through
online venture capital (VC) platforms such as Crunchbase. To support the
decision-making of investors, we develop a machine learning approach with the
aim of locating successful startups on VC platforms. Specifically, we develop,
train, and evaluate a tailored, fused large language model to predict startup
success. Thereby, we assess to what extent self-descriptions on VC platforms
are predictive of startup success. Using 20,172 online profiles from
Crunchbase, we find that our fused large language model can predict startup
success, with textual self-descriptions being responsible for a significant
part of the predictive power. Our work provides a decision support tool for
investors to find profitable investment opportunities.
☆ The representation landscape of few-shot learning and fine-tuning in large language models
In-context learning (ICL) and supervised fine-tuning (SFT) are two common
strategies for improving the performance of modern large language models (LLMs)
on specific tasks. Despite their different natures, these strategies often lead
to comparable performance gains. However, little is known about whether they
induce similar representations inside LLMs. We approach this problem by
analyzing the probability landscape of their hidden representations in the two
cases. More specifically, we compare how LLMs solve the same question-answering
task, finding that ICL and SFT create very different internal structures, in
both cases undergoing a sharp transition in the middle of the network. In the
first half of the network, ICL shapes interpretable representations
hierarchically organized according to their semantic content. In contrast, the
probability landscape obtained with SFT is fuzzier and semantically mixed. In
the second half of the model, the fine-tuned representations develop
probability modes that better encode the identity of answers, while the
landscape of ICL representations is characterized by less defined peaks. Our
approach reveals the diverse computational strategies developed inside LLMs to
solve the same task across different conditions, allowing us to make a step
towards designing optimal methods to extract information from language models.
☆ LLM-based multi-agent poetry generation in non-cooperative environments
Despite substantial progress of large language models (LLMs) for automatic
poetry generation, the generated poetry lacks diversity while the training
process differs greatly from human learning. Under the rationale that the
learning process of the poetry generation systems should be more human-like and
their output more diverse and novel, we introduce a framework based on social
learning where we emphasize non-cooperative interactions besides cooperative
interactions to encourage diversity. Our experiments are the first attempt at
LLM-based multi-agent systems in non-cooperative environments for poetry
generation employing both TRAINING-BASED agents (GPT-2) and PROMPTING-BASED
agents (GPT-3 and GPT-4). Our evaluation based on 96k generated poems shows
that our framework benefits the poetry generation process for TRAINING-BASED
agents, resulting in a 3.0-3.7 percentage point (pp) increase in diversity
and a 5.6-11.3 pp increase in novelty according to distinct and novel n-grams.
The generated poetry from TRAINING-BASED agents also exhibits group divergence
in terms of lexicons, styles and semantics. PROMPTING-BASED agents in our
framework also benefit from non-cooperative environments and a more diverse
ensemble of models with non-homogeneous agents has the potential to further
enhance diversity, with an increase of 7.0-17.5 pp according to our
experiments. However, PROMPTING-BASED agents show a decrease in lexical
diversity over time and do not exhibit the group-based divergence intended in
the social network. Our paper argues for a paradigm shift in creative tasks
such as automatic poetry generation to include social learning processes (via
LLM-based agent modeling) similar to human interaction.
comment: preprint
☆ On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
Yong Lin, Skyler Seto, Maartje ter Hoeve, Katherine Metcalf, Barry-John Theobald, Xuan Wang, Yizhe Zhang, Chen Huang, Tong Zhang
Reinforcement Learning from Human Feedback (RLHF) is an effective approach
for aligning language models to human preferences. Central to RLHF is learning
a reward function for scoring human preferences. Two main approaches for
learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in
RLHF, and 2) using an implicit reward learned from preference data through
methods such as Direct Preference Optimization (DPO). Prior work has shown that
the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM in
the limit. DPORM's effectiveness directly implies the optimality of the learned
policy, and also has practical implications for LLM alignment methods including
iterative DPO. However, it is unclear how well DPORM empirically matches the
performance of EXRM. This work studies the accuracy at distinguishing preferred
and rejected answers for both DPORM and EXRM. Our findings indicate that even
though DPORM fits the training dataset comparably, it generalizes less
effectively than EXRM, especially when the validation datasets contain
distribution shifts. Across five out-of-distribution settings, DPORM has a mean
drop in accuracy of 3% and a maximum drop of 7%. These findings highlight that
DPORM has limited generalization ability and substantiates the integration of
an explicit reward model in iterative DPO approaches.
comment: 12 pages, 8 tables, 2 figures
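For context, the implicit reward that DPO induces is the scaled log-ratio between the policy and the reference model, r(x, y) = β·log(π(y|x)/π_ref(y|x)). A toy sketch of scoring a preference pair with it, as a DPORM-style classifier would (the log-probabilities below are made-up numbers):

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO's implicit reward: beta * log(pi(y|x) / pi_ref(y|x)),
    # i.e., the scaled difference of log-probabilities.
    return beta * (logp_policy - logp_ref)

def prefers_chosen(lp_pol_chosen, lp_ref_chosen, lp_pol_rej, lp_ref_rej):
    """DPORM-style classification: the pair is scored correctly when the
    chosen answer's implicit reward exceeds the rejected answer's."""
    return (implicit_reward(lp_pol_chosen, lp_ref_chosen)
            > implicit_reward(lp_pol_rej, lp_ref_rej))

# Hypothetical log-probabilities: the policy has shifted probability mass
# toward the chosen answer relative to the reference model.
print(prefers_chosen(-2.0, -3.0, -4.0, -3.5))  # True
```

The paper's question is how often this comparison stays correct on held-out pairs, especially under distribution shift, relative to an explicitly trained reward model.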
☆ CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
Formula recognition presents significant challenges due to the complicated
structure and varied notation of mathematical expressions. Despite continuous
advancements in formula recognition models, the evaluation metrics employed by
these models, such as BLEU and Edit Distance, still exhibit notable
limitations. They overlook the fact that the same formula has diverse
representations and are highly sensitive to the distribution of training data,
thereby causing unfairness in formula recognition evaluation. To this end,
we propose a Character Detection Matching (CDM) metric that ensures evaluation
objectivity by computing an image-level rather than LaTeX-level score.
Specifically, CDM renders both the model-predicted LaTeX and the ground-truth
LaTeX formulas into image-formatted formulas, then employs visual feature
extraction and localization techniques for precise character-level matching,
incorporating spatial position information. Such a spatially-aware and
character-matching method offers a more accurate and equitable evaluation
compared with previous BLEU and Edit Distance metrics that rely solely on
text-based character matching. Experimentally, we evaluated various formula
recognition models using CDM, BLEU, and ExpRate metrics. Their results
demonstrate that the CDM aligns more closely with human evaluation standards
and provides a fairer comparison across different models by eliminating
discrepancies caused by diverse formula representations.
comment: Project Website:
https://github.com/opendatalab/UniMERNet/tree/main/cdm
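The unfairness the abstract points at is easy to reproduce: two LaTeX sources can render to identical glyphs yet differ substantially under a text-level metric. A self-contained illustration with a standard Levenshtein distance (this is the baseline being criticized, not the paper's CDM implementation):

```python
def edit_distance(a, b):
    # Classic one-row dynamic-programming Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

# Both sources render to the same glyphs (x sub i, squared), yet a
# text-level metric penalizes the prediction heavily.
gt   = r"x_i^2"
pred = r"x^2_i"
print(edit_distance(gt, pred))  # 4, despite identical rendered output
```

An image-level metric like CDM would score this pair as a perfect match, since the rendered formulas are indistinguishable.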
☆ Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers
In decoder-based LLMs, the representation of a given layer serves two
purposes: as input to the next layer during the computation of the current
token; and as input to the attention mechanism of future tokens. In this work,
we show that the importance of the latter role might be overestimated. To show
that, we start by manipulating the representations of previous tokens; e.g. by
replacing the hidden states at some layer k with random vectors. Our
experiments with four LLMs and four tasks show that this operation often
leads to a small to negligible drop in performance. Importantly, this happens
when the manipulation occurs in the top part of the model, i.e., when k is in
the final 30-50% of the layers. In contrast, doing the same manipulation in
earlier layers might lead to chance-level performance. We continue by switching
the hidden state of certain tokens with hidden states of other tokens from
another prompt; e.g., replacing the word "Italy" with "France" in "What is the
capital of Italy?". We find that when applying this switch in the top 1/3 of
the model, the model ignores it (answering "Rome"). However, if we apply it
earlier, the model conforms to the switch ("Paris"). Our results hint at a
two-stage process in
transformer-based LLMs: the first part gathers input from previous tokens,
while the second mainly processes that information internally.
☆ 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances KDD
Predicting the performance of LLMs on individual task instances is essential
to ensure their reliability in high-stakes applications. To do so, a
possibility is to evaluate the considered LLM on a set of task instances and
train an assessor to predict its performance based on features of the
instances. However, this approach requires evaluating each new LLM on a
sufficiently large set of task instances to train an assessor specific to it.
In this work, we leverage the evaluation results of previously tested LLMs to
reduce the number of evaluations required to predict the performance of a new
LLM. In practice, we propose to test the new LLM on a small set of reference
instances and train a generic assessor which predicts the performance of the
LLM on an instance based on the performance of the former on the reference set
and features of the instance of interest. We conduct empirical studies on
HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets
that we introduce, where we evaluate all instruction-fine-tuned OpenAI models
up to the January 2024 version of GPT-4. When predicting performance on
instances with the same distribution as those used to train the generic
assessor, we find this achieves performance comparable to the LLM-specific
assessors trained on the full set of instances. Additionally, we find that
randomly selecting the reference instances performs as well as some advanced
selection methods we tested. On out-of-distribution instances, however, no
clear winner emerges and the overall performance is worse, suggesting that the
inherent predictability of LLMs is low.
comment: Presented at the 2024 KDD workshop on Evaluation and Trustworthiness
of Generative AI Models
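One way to picture the generic assessor: represent the new LLM by its results on the small reference set, then borrow outcomes from previously tested LLMs that behaved similarly there. The similarity-weighted vote below is only a stand-in for whatever assessor model the paper actually trains; all names and numbers are illustrative:

```python
def predict_success(new_llm_ref_results, past_llms, instance):
    """Predict a new LLM's success probability on `instance` from (i) its
    results on a small reference set and (ii) the full evaluation records
    of previously tested LLMs."""
    weights, votes = [], []
    for llm in past_llms:
        # Similarity = agreement rate with the new LLM on the reference set.
        agree = sum(a == b for a, b in zip(new_llm_ref_results, llm["ref"]))
        weights.append(agree / len(new_llm_ref_results))
        votes.append(llm["per_instance"][instance])
    # Similarity-weighted vote over the past LLMs' outcomes on this instance.
    return sum(w * v for w, v in zip(weights, votes)) / sum(weights)

past = [
    {"ref": [1, 1, 0], "per_instance": {"q42": 1}},  # behaves like the new LLM
    {"ref": [0, 0, 1], "per_instance": {"q42": 0}},  # behaves unlike it
]
print(predict_success([1, 1, 0], past, "q42"))  # 1.0
```

The point of this construction is that only the reference-set results must be collected for each new LLM; the per-instance records come from models that were already evaluated.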
☆ From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents
Jifan Yu, Zheyuan Zhang, Daniel Zhang-li, Shangqing Tu, Zhanxin Hao, Rui Miao Li, Haoxuan Li, Yuanchun Wang, Hanming Li, Linlu Gong, Jie Cao, Jiayin Lin, Jinchang Zhou, Fei Qin, Haohua Wang, Jianxiao Jiang, Lijun Deng, Yisi Zhan, Chaojun Xiao, Xusheng Dai, Xuan Yan, Nianyi Lin, Nan Zhang, Ruixin Ni, Yang Dang, Lei Hou, Yu Zhang, Xu Han, Manli Li, Juanzi Li, Zhiyuan Liu, Huiqin Liu, Maosong Sun
Since the first instances of online education, where courses were uploaded to
accessible and shared online platforms, this form of scaling the dissemination
of human knowledge to reach a broader audience has sparked extensive discussion
and widespread adoption. Recognizing that personalized learning still holds
significant potential for improvement, new AI technologies have been
continuously integrated into this learning format, resulting in a variety of
educational AI applications such as educational recommendation and intelligent
tutoring. The emergence of intelligence in large language models (LLMs) has
allowed for these educational enhancements to be built upon a unified
foundational model, enabling deeper integration. In this context, we propose
MAIC (Massive AI-empowered Course), a new form of online education that
leverages LLM-driven multi-agent systems to construct an AI-augmented
classroom, balancing scalability with adaptivity. Beyond exploring the
conceptual framework and technical innovations, we conduct preliminary
experiments at Tsinghua University, one of China's leading universities.
Drawing from over 100,000 learning records of more than 500 students, we obtain
a series of valuable observations and initial analyses. This project will
continue to evolve, ultimately aiming to establish a comprehensive open
platform that supports and unifies research, technology, and applications in
exploring the possibilities of online education in the era of large model AI.
We envision this platform as a collaborative hub, bringing together educators,
researchers, and innovators to collectively explore the future of AI-driven
online education.
☆ How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes
Decoder-only LLMs have shown impressive performance in MT due to their
ability to learn from extensive datasets and generate high-quality
translations. However, LLMs often struggle with the nuances and style required
for organisation-specific translation. In this study, we explore the
effectiveness of fine-tuning Large Language Models (LLMs), particularly Llama 3
8B Instruct, leveraging translation memories (TMs), as a valuable resource to
enhance accuracy and efficiency. We investigate the impact of fine-tuning the
Llama 3 model using TMs from a specific organisation in the software sector.
Our experiments cover five translation directions across languages of varying
resource levels (English to Brazilian Portuguese, Czech, German, Finnish, and
Korean). We analyse diverse sizes of training datasets (1k to 207k segments) to
evaluate their influence on translation quality. We fine-tune separate models
for each training set and evaluate their performance based on the automatic
metrics BLEU, chrF++, TER, and COMET. Our findings reveal improvement in
translation performance with larger datasets across all metrics. On average,
BLEU and COMET scores increase by 13 and 25 points, respectively, on the
largest training set against the baseline model. Notably, there is a
performance deterioration in comparison with the baseline model when
fine-tuning on only 1k and 2k examples; however, we observe a substantial
improvement as the training dataset size increases. The study highlights the
potential of integrating TMs with LLMs to create bespoke translation models
tailored to the specific needs of businesses, thus enhancing translation
quality and reducing turn-around times. This approach offers a valuable insight
for organisations seeking to leverage TMs and LLMs for optimal translation
outcomes, especially in narrower domains.
☆ Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities
The advancement of Large Language Models (LLMs) for domain applications in
fields such as materials science and engineering depends on the development of
fine-tuning strategies that adapt models for specialized, technical
capabilities. In this work, we explore the effects of Continued Pretraining
(CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization
approaches, including Direct Preference Optimization (DPO) and Odds Ratio
Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis
shows how these strategies influence model outcomes and reveals that the
merging of multiple fine-tuned models can lead to the emergence of capabilities
that surpass the individual contributions of the parent models. We find that
model merging leads to new functionalities that neither parent model could
achieve alone, leading to improved performance in domain-specific assessments.
Experiments with different model architectures are presented, including Llama
3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring
whether the results hold also for much smaller models, we use a tiny LLM with
1.7 billion parameters and show that very small LLMs do not necessarily feature
emergent capabilities under model merging, suggesting that model scaling may be
a key component. In open-ended yet consistent chat conversations between a
human and AI models, our assessment reveals detailed insights into how
different model variants perform and show that the smallest model achieves a
high intelligence score across key criteria including reasoning depth,
creativity, clarity, and quantitative precision. Other experiments include the
development of image generation prompts based on disparate biological material
design concepts, to create new microstructures, architectural concepts, and
urban design based on biological materials-inspired construction principles.
☆ Rx Strategist: Prescription Verification using LLM Agents System
The complexity of modern pharmaceuticals demands strict prescription
verification to protect patient safety. We offer a new approach - Rx
Strategist - that makes
use of knowledge graphs and different search strategies to enhance the power of
Large Language Models (LLMs) inside an agentic framework. This multifaceted
technique allows for a multi-stage LLM pipeline and reliable information
retrieval from a custom-built active ingredient database. Different facets of
prescription verification, such as indication, dose, and possible drug
interactions, are covered in each stage of the pipeline. We alleviate the
drawbacks of monolithic LLM techniques by spreading reasoning over these
stages, improving correctness and reliability while reducing memory demands.
Our findings demonstrate that Rx Strategist surpasses many current LLMs,
achieving performance comparable to that of a highly experienced clinical
pharmacist. In the complicated world of modern medications, this combination of
LLMs with organized knowledge and sophisticated search methods presents a
viable avenue for reducing prescription errors and enhancing patient outcomes.
comment: 17 Pages, 6 Figures, Under Review
☆ CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks
Cognitive psychology investigates perception, attention, memory, language,
problem-solving, decision-making, and reasoning. Kahneman's dual-system theory
elucidates the human decision-making process, distinguishing between the rapid,
intuitive System 1 and the deliberative, rational System 2. Recent advancements
have positioned Large Language Models (LLMs) as formidable tools nearing
human-level proficiency in various cognitive tasks. Nonetheless, the presence
of a dual-system framework analogous to human cognition in LLMs remains
unexplored. This study introduces the \textbf{CogniDual Framework for LLMs}
(CFLLMs), designed to assess whether LLMs can, through self-training, evolve
from deliberate deduction to intuitive responses, thereby emulating the human
process of acquiring and mastering new information. Our findings reveal the
cognitive mechanisms behind LLMs' response generation, enhancing our
understanding of their capabilities in cognitive psychology. Practically,
self-trained models can provide faster responses to certain queries, reducing
computational demands during inference.
☆ Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time
Based on official estimates, 50 million people worldwide are affected by
dementia, and this number increases by 10 million new patients every year.
Without a cure, clinical prognostication and early intervention represent the
most effective ways to delay its progression. To this end, Artificial
Intelligence and computational linguistics can be exploited for natural
language analysis, personalized assessment, monitoring, and treatment. However,
traditional approaches lack sufficient semantic knowledge management and
explainability capabilities. Moreover, the use of Large Language Models (LLMs)
for cognitive decline diagnosis is still scarce, even though these models
represent the most advanced means of clinician-patient communication using
intelligent systems. Consequently, we leverage an LLM using the latest Natural
Language
Processing (NLP) techniques in a chatbot solution to provide interpretable
Machine Learning prediction of cognitive decline in real-time.
Linguistic-conceptual features are exploited for appropriate natural language
analysis. Through explainability, we aim to fight potential biases of the
models and improve their potential to help clinical workers in their diagnosis
decisions. More in detail, the proposed pipeline is composed of (i) data
extraction employing NLP-based prompt engineering; (ii) stream-based data
processing including feature engineering, analysis, and selection; (iii)
real-time classification; and (iv) the explainability dashboard to provide
visual and natural language descriptions of the prediction outcome.
Classification results exceed 80% on all evaluation metrics, with a recall of
about 85% for the mental deterioration class. To sum up, this work contributes
an affordable, flexible, non-invasive, personalized diagnostic system.
☆ Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding
The training data in large language models is key to their success, but it
also presents privacy and security risks, as it may contain sensitive
information. Detecting pre-training data is crucial for mitigating these
concerns. Existing methods typically analyze target text in isolation or solely
with non-member contexts, overlooking potential insights from simultaneously
considering both member and non-member contexts. While previous work suggested
that member contexts provide little information due to the minor distributional
shift they induce, our analysis reveals that these subtle shifts can be
effectively leveraged when contrasted with non-member contexts. In this paper,
we propose Con-ReCall, a novel approach that leverages the asymmetric
distributional shifts induced by member and non-member contexts through
contrastive decoding, amplifying subtle differences to enhance membership
inference. Extensive empirical evaluations demonstrate that Con-ReCall achieves
state-of-the-art performance on the WikiMIA benchmark and is robust against
various text manipulation techniques.
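The contrastive idea can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the scoring rule, the stand-in log-likelihood numbers, and the function name `contrastive_score` are assumptions made here for exposition only.

```python
# Hypothetical sketch of a contrastive membership score in the spirit of
# Con-ReCall. The exact formula used by the paper is not reproduced here;
# this illustrates only the idea of contrasting the two distributional shifts.

def contrastive_score(ll_target, ll_with_member, ll_with_nonmember):
    """Contrast the log-likelihood shifts induced by prefixing the target
    text with known member vs. non-member contexts.

    ll_target:         log-likelihood of the target text alone
    ll_with_member:    log-likelihood conditioned on a member context
    ll_with_nonmember: log-likelihood conditioned on a non-member context
    """
    shift_member = ll_with_member - ll_target        # small for members
    shift_nonmember = ll_with_nonmember - ll_target  # larger drop for members
    # Member texts are perturbed less by member context than by non-member
    # context, so the contrast of the two shifts separates the populations.
    return shift_member - shift_nonmember

# Toy numbers: a member text barely moves under member context but drops
# sharply under non-member context, yielding the higher score.
member_score = contrastive_score(-50.0, -50.5, -58.0)      # 7.5
nonmember_score = contrastive_score(-50.0, -57.0, -51.0)   # -6.0
assert member_score > nonmember_score
```

In practice the log-likelihoods would come from the target LLM itself; the amplification of subtle shifts is what makes the contrast informative.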
☆ Sketch: A Toolkit for Streamlining LLM Operations
Xin Jiang, Xiang Li, Wenjia Ma, Xuezhi Fang, Yiqun Yao, Naitong Yu, Xuying Meng, Peng Han, Jing Li, Aixin Sun, Yequan Wang
Large language models (LLMs), represented by the GPT family, have achieved
remarkable success. The characteristics of LLMs lie in their ability to
accommodate a wide range of tasks through a generative approach. However, the
flexibility of their output format poses challenges in controlling and
harnessing the model's outputs, thereby constraining the application of LLMs in
various domains. In this work, we present Sketch, an innovative toolkit
designed to streamline LLM operations across diverse fields. Sketch comprises
the following components: (1) a suite of task description schemas and prompt
templates encompassing various NLP tasks; (2) a user-friendly, interactive
process for building structured output LLM services tailored to various NLP
tasks; (3) an open-source dataset for output format control, along with tools
for dataset construction; and (4) an open-source model based on
LLaMA3-8B-Instruct that adeptly comprehends and adheres to output formatting
instructions. We anticipate that this initiative will bring considerable
convenience to LLM users, achieving the goal of ''plug-and-play'' for various
applications.
The components of Sketch will be progressively open-sourced at
https://github.com/cofe-ai/Sketch.
☆ Normal forms in Virus Machines
In the present work, we further study the computational power of virus
machines (VMs in short). VMs provide a computing paradigm inspired by the
transmission and replication networks of viruses. VMs consist of process units
(called hosts) structured by a directed graph whose arcs are called channels
and an instruction graph that controls the transmissions of virus objects among
hosts. The present work complements our understanding of the computing power of
VMs by introducing normal forms; these expressions restrict the features in a
given computing model. Some of the features that we restrict in our normal
forms include (a) the number of hosts, (b) the number of instructions, and (c)
the number of virus objects in each host. After recalling some known results
on the computing power of VMs, we give our normal forms, including
restrictions on the size of the loops in the network, and prove new
characterisations of families of sets, such as the finite sets, semilinear
sets, and NRE.
☆ N-gram Prediction and Word Difference Representations for Language Modeling
Causal language modeling (CLM) serves as the foundational framework
underpinning remarkable successes of recent large language models (LLMs).
Despite its success, the training approach for next word prediction poses a
potential risk of causing the model to overly focus on local dependencies
within a sentence. While prior studies have proposed predicting the future N
words simultaneously, they were primarily applied to tasks such as masked
language modeling (MLM) and neural machine translation (NMT). In this study,
we introduce a simple N-gram prediction framework for the CLM task. Moreover,
building on this framework, we introduce the word difference representation
(WDR) as a surrogate, contextualized target representation during model
training. To further enhance the quality of next word
prediction, we propose an ensemble method that incorporates the future N words'
prediction results. Empirical evaluations across multiple benchmark datasets
encompassing CLM and NMT tasks demonstrate the significant advantages of our
proposed methods over the conventional CLM.
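One way to picture a word difference representation is as the difference between embeddings of consecutive target words, used as a surrogate target for each of the next N positions. The sketch below is an assumption for illustration, not the paper's implementation; the toy vocabulary, embedding table, and function `wdr_targets` are invented here.

```python
import numpy as np

# Illustrative sketch of WDR-style surrogate targets for N-gram prediction.
# Everything below (vocabulary, embedding dimension, exact definition of the
# difference) is a toy assumption, not the paper's actual training setup.

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
emb = rng.normal(size=(len(vocab), 8))  # toy embedding table, dim 8

def wdr_targets(token_ids, n=2):
    """For each position t, build surrogate targets for the next n words as
    embedding differences e[w_{t+k+1}] - e[w_{t+k}], k = 0..n-1."""
    targets = []
    for t in range(len(token_ids) - n):
        diffs = [emb[token_ids[t + k + 1]] - emb[token_ids[t + k]]
                 for k in range(n)]
        targets.append(np.stack(diffs))
    return np.array(targets)

ids = [vocab[w] for w in ["the", "cat", "sat", "down"]]
t = wdr_targets(ids, n=2)
print(t.shape)  # (2, 2, 8): two positions, two future-word diffs, dim 8
```

Because each target depends on the preceding word, the representation is contextualized: the same word yields different targets in different bigram contexts.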
☆ LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts EMNLP
With the emergence of widely available powerful large language models (LLMs),
disinformation generated by LLMs has become a major concern.
Historically, LLM detectors have been touted as a solution, but their
effectiveness in the real world is still to be proven. In this paper, we focus
on an important setting in information operations -- short news-like posts
generated by moderately sophisticated attackers.
We demonstrate that existing LLM detectors, whether zero-shot or
purpose-trained, are not ready for real-world use in that setting. All tested
zero-shot detectors perform inconsistently with prior benchmarks and are highly
vulnerable to sampling temperature increase, a trivial attack absent from
recent benchmarks. A purpose-trained detector generalizing across LLMs and
unseen attacks can be developed, but it fails to generalize to new
human-written texts.
We argue that the former indicates domain-specific benchmarking is needed,
while the latter suggests a trade-off between adversarial evasion resilience
and overfitting to the reference human text, both of which need to be
evaluated in benchmarks but currently are not. We believe these findings call
for a reconsideration of current LLM detector benchmarking approaches, and we
provide a dynamically extensible benchmark to support it
(https://github.com/Reliable-Information-Lab-HEVS/dynamic_llm_detector_benchmark).
comment: 20 pages, 7 tables, 13 figures, under consideration for EMNLP
☆ iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models
Most available data is unstructured, making it challenging to access valuable
information. Automatically building Knowledge Graphs (KGs) is crucial for
structuring data and making it accessible, allowing users to search for
information effectively. KGs also facilitate insights, inference, and
reasoning. Traditional NLP methods, such as named entity recognition and
relation extraction, are key in information retrieval but face limitations,
including the use of predefined entity types and the need for supervised
learning. Current research leverages large language models' capabilities, such
as zero- or few-shot learning. However, unresolved and semantically duplicated
entities and relations still pose challenges, leading to inconsistent graphs
and requiring extensive post-processing. Additionally, most approaches are
topic-dependent. In this paper, we propose iText2KG, a method for incremental,
topic-independent KG construction without post-processing. This plug-and-play,
zero-shot method is applicable across a wide range of KG construction scenarios
and comprises four modules: Document Distiller, Incremental Entity Extractor,
Incremental Relation Extractor, and Graph Integrator and Visualization. Our
method demonstrates superior performance compared to baseline methods across
three scenarios: converting scientific papers to graphs, websites to graphs,
and CVs to graphs.
comment: Accepted at The International Web Information Systems Engineering
conference (the WISE conference) 2024
☆ ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding
Automatic chart understanding is crucial for content comprehension and
document parsing. Multimodal large language models (MLLMs) have demonstrated
remarkable capabilities in chart understanding through domain-specific
alignment and fine-tuning. However, the application of alignment training
within the chart domain is still underexplored. To address this, we propose
ChartMoE, which employs the mixture of expert (MoE) architecture to replace the
traditional linear projector to bridge the modality gap. Specifically, we train
multiple linear connectors through distinct alignment tasks, which are utilized
as the foundational initialization parameters for different experts.
Additionally, we introduce ChartMoE-Align, a dataset with over 900K
chart-table-JSON-code quadruples to conduct three alignment tasks
(chart-table/JSON/code). Combined with the vanilla connector, we initialize
different experts in four distinct ways and adopt high-quality knowledge
learning to further refine the MoE connector and LLM parameters. Extensive
experiments demonstrate the effectiveness of the MoE connector and our
initialization strategy, e.g., ChartMoE improves the accuracy of the previous
state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.
☆ Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation
Yu Wang, Shiwan Zhao, Zhihu Wang, Heyuan Huang, Ming Fan, Yubo Zhang, Zhixing Wang, Haijun Wang, Ting Liu
The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for
enhancing the reasoning capabilities of large language models (LLMs). However,
despite their widespread adoption and success, CoT methods often exhibit
instability due to their inability to consistently ensure the quality of
generated reasoning paths, leading to sub-optimal reasoning performance. To
address this challenge, we propose the \textbf{Strategic Chain-of-Thought}
(SCoT), a novel methodology designed to refine LLM performance by integrating
strategic knowledge prior to generating intermediate reasoning steps. SCoT
employs a two-stage approach within a single prompt: first eliciting an
effective problem-solving strategy, which is then used to guide the generation
of high-quality CoT paths and final answers. Our experiments across eight
challenging reasoning datasets demonstrate significant improvements: 21.05\%
and 24.13\% gains on the GSM8K and Tracking\_Objects datasets, respectively,
using the Llama3-8b model. Additionally, we extend the
SCoT framework to develop a few-shot method with automatically matched
demonstrations, yielding even stronger results. These findings underscore the
efficacy of SCoT, highlighting its potential to substantially enhance LLM
performance in complex reasoning tasks.
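The two-stage, single-prompt design can be sketched as a template: elicit a strategy first, then solve under that strategy. The wording below is an assumption for illustration, not the authors' actual prompt.

```python
# Hypothetical single-prompt template in the spirit of SCoT's two-stage
# design: stage 1 elicits a problem-solving strategy, stage 2 uses it to
# guide CoT generation. The exact phrasing is an assumption, not the paper's.

SCOT_TEMPLATE = """\
Question: {question}

Step 1 - Strategy: Before solving, state the most effective general
problem-solving strategy for this kind of question in one or two sentences.

Step 2 - Solution: Following the strategy above, work through the problem
step by step and give the final answer on the last line as 'Answer: ...'.
"""

def build_scot_prompt(question: str) -> str:
    """Fill the template with a concrete question."""
    return SCOT_TEMPLATE.format(question=question)

prompt = build_scot_prompt(
    "A train travels 120 km in 2 hours. What is its average speed?")
assert "Step 1 - Strategy" in prompt and "Step 2 - Solution" in prompt
```

The key point is that both stages live in one prompt, so the model's own elicited strategy conditions the subsequent reasoning path.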
☆ GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding
Although Large Language Models (LLMs) have demonstrated potential in
processing graphs, they struggle with comprehending graphical structure
information through prompts of graph description sequences, especially as the
graph size increases. We attribute this challenge to the uneven memory
performance of LLMs across different positions in graph description sequences,
known as ''positional biases''. To address this, we propose GraphInsight, a
novel framework aimed at improving LLMs' comprehension of both macro- and
micro-level graphical information. GraphInsight is grounded in two key
strategies: 1) placing critical graphical information in positions where LLMs
exhibit stronger memory performance, and 2) investigating a lightweight
external knowledge base for regions with weaker memory performance, inspired by
retrieval-augmented generation (RAG). Moreover, GraphInsight explores
integrating these two strategies into LLM agent processes for composite graph
tasks that require multi-step reasoning. Extensive empirical studies on
benchmarks with a wide range of evaluation tasks show that GraphInsight
significantly outperforms all other graph description methods (e.g., prompting
techniques and reordering strategies) in understanding graph structures of
varying sizes.
☆ Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard
This paper conducts a longitudinal study over eleven months to address the
limitations of prior research on the Open Ko-LLM Leaderboard, which have relied
on empirical studies with restricted observation periods of only five months.
By extending the analysis duration, we aim to provide a more comprehensive
understanding of the progression in developing Korean large language models
(LLMs). Our study is guided by three primary research questions: (1) What are
the specific challenges in improving LLM performance across diverse tasks on
the Open Ko-LLM Leaderboard over time? (2) How does model size impact task
performance correlations across various benchmarks? (3) How have the patterns
in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By
analyzing 1,769 models over this period, our research offers a comprehensive
examination of the ongoing advancements in LLMs and the evolving nature of
evaluation frameworks.
☆ E2CL: Exploration-based Error Correction Learning for Embodied Agents
Language models are exhibiting increasing capability in knowledge utilization
and reasoning. However, when applied as agents in embodied environments, they
often suffer from misalignment between their intrinsic knowledge and
environmental knowledge, leading to infeasible actions. Traditional environment
alignment methods, such as supervised learning on expert trajectories and
reinforcement learning, face limitations in covering environmental knowledge
and achieving efficient convergence, respectively. Inspired by human learning,
we propose Exploration-based Error Correction Learning (E2CL), a novel
framework that leverages exploration-induced errors and environmental feedback
to enhance environment alignment for LM-based agents. E2CL incorporates
teacher-guided and teacher-free exploration to gather environmental feedback
and correct erroneous actions. The agent learns to provide feedback and
self-correct, thereby enhancing its adaptability to target environments.
Evaluations in the Virtualhome environment demonstrate that E2CL-trained agents
outperform those trained by baseline methods and exhibit superior
self-correction capabilities.
☆ Preserving Empirical Probabilities in BERT for Small-sample Clinical Entity Recognition
Named Entity Recognition (NER) encounters the challenge of unbalanced labels,
where certain entity types are overrepresented while others are
underrepresented in real-world datasets. This imbalance can lead to biased
models that perform poorly on minority entity classes, impeding accurate and
equitable entity recognition. This paper explores the effects of unbalanced
entity labels on the BERT-based pre-trained model. We analyze the different
mechanisms of loss calculation and loss propagation for the task of token
classification on randomized datasets. Then we propose ways to improve the
token classification for the highly imbalanced task of clinical entity
recognition.
comment: 8 pages, 8 figures
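One standard remedy for this kind of imbalance, given here only as an illustrative baseline and not necessarily the paper's proposed method, is to weight the token-classification loss inversely to empirical label frequency so rare entity classes are not drowned out.

```python
import numpy as np

# Illustrative sketch: inverse-frequency class weighting for an imbalanced
# token-classification loss. This is a common remedy, shown as background;
# it is an assumption, not the paper's exact proposal.

def class_weights(label_counts):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = np.asarray(label_counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

def weighted_ce(logits, labels, weights):
    """Mean class-weighted cross-entropy over a batch of tokens."""
    z = logits - logits.max(axis=1, keepdims=True)      # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_token = -log_probs[np.arange(len(labels)), labels]
    return float((weights[labels] * per_token).mean())

# Toy label distribution: class 2 is rare, so it gets the largest weight,
# and its misclassifications dominate the loss.
w = class_weights([1000, 500, 20])
assert w[2] > w[1] > w[0]
```

How the weighted loss then propagates through a BERT-style encoder is exactly the kind of mechanism the abstract says the paper analyzes.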
☆ Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration
Black-box large language models (LLMs) are increasingly deployed in various
environments, making it essential for these models to effectively convey their
confidence and uncertainty, especially in high-stakes settings. However, these
models often exhibit overconfidence, leading to potential risks and
misjudgments. Existing techniques for eliciting and calibrating LLM confidence
have primarily focused on general reasoning datasets, yielding only modest
improvements. Accurate calibration is crucial for informed decision-making and
preventing adverse outcomes but remains challenging due to the complexity and
variability of tasks these models perform. In this work, we investigate the
miscalibration behavior of black-box LLMs within the healthcare setting. We
propose a novel method, \textit{Atypical Presentations Recalibration}, which
leverages atypical presentations to adjust the model's confidence estimates.
Our approach significantly improves calibration, reducing calibration errors by
approximately 60\% on three medical question answering datasets and
outperforming existing methods such as vanilla verbalized confidence, CoT
verbalized confidence and others. Additionally, we provide an in-depth analysis
of the role of atypicality within the recalibration framework.
☆ xLAM: A Family of Large Action Models to Empower AI Agent Systems
Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Autonomous agents powered by large language models (LLMs) have attracted
significant research interest. However, the open-source community faces many
challenges in developing specialized models for agent tasks, driven by the
scarcity of high-quality agent datasets and the absence of standard protocols
in this area. We introduce and publicly release xLAM, a series of large action
models designed for AI agent tasks. The xLAM series includes five models with
both dense and mixture-of-expert architectures, ranging from 1B to 8x22B
parameters, trained using a scalable, flexible pipeline that unifies, augments,
and synthesizes diverse datasets to enhance AI agents' generalizability and
performance across varied environments. Our experimental results demonstrate
that xLAM consistently delivers exceptional performance across multiple agent
ability benchmarks, notably securing the 1st position on the Berkeley
Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other
models in terms of tool use. By releasing the xLAM series, we aim to advance
the performance of open-source LLMs for autonomous AI agents, potentially
accelerating progress and democratizing access to high-performance models for
agent tasks. Models are available at
https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4
comment: Technical report for the Salesforce xLAM model series
☆ An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification
Sentiment classification (SC) often suffers from low-resource challenges such
as domain-specific contexts, imbalanced label distributions, and few-shot
scenarios. The potential of the diffusion language model (LM) for textual data
augmentation (DA) remains unexplored; moreover, textual DA methods struggle to
balance the diversity and consistency of new samples. Most DA methods either
perform logical modifications or rephrase less important tokens in the original
sequence with the language model. In the context of SC, strong emotional
tokens can critically determine the sentiment of the whole sequence.
Therefore, instead of rephrasing less important context, we propose
DiffusionCLS to
leverage a diffusion LM to capture in-domain knowledge and generate pseudo
samples by reconstructing strong label-related tokens. This approach ensures a
balance between consistency and diversity, avoiding the introduction of noise
and augmenting crucial features of datasets. DiffusionCLS also comprises a
Noise-Resistant Training objective to help the model generalize. Experiments
demonstrate the effectiveness of our method in various low-resource scenarios
including domain-specific and domain-general problems. Ablation studies confirm
the effectiveness of our framework's modules, and visualization studies
highlight optimal deployment conditions, reinforcing our conclusions.
☆ Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
Neural networks (NN) classification models for Natural Language Processing
(NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack that
triggers a model to produce a specific prediction for any input. DARCY borrows
the "honeypot" concept to bait multiple trapdoors, effectively detecting the
adversarial examples generated by UAT. However, we devise a new UAT
generation method, called IndisUAT, which produces triggers (i.e., tokens) and
uses them to craft adversarial examples whose feature distribution is
indistinguishable from that of the benign examples in a randomly-chosen
category at the detection layer of DARCY. The produced adversarial examples
incur the maximal loss in prediction results for the DARCY-protected models.
Meanwhile, the produced triggers are effective in black-box models for text
generation, text inference, and reading comprehension. Finally, the evaluation
results under NN models for NLP tasks indicate that the IndisUAT method can
effectively circumvent DARCY and penetrate other defenses. For example,
IndisUAT can reduce the true positive rate of DARCY's detection by at least
40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN
and CNN models, respectively. IndisUAT reduces the accuracy of the BERT's
adversarial defense model by at least 34.0%, and makes the GPT-2 language model
spew racist outputs even when conditioned on non-racial context.
comment: 13 pages, 5 figures
☆ MARAGS: A Multi-Adapter System for Multi-Task Retrieval Augmented Generation Question Answering KDD
In this paper we present a multi-adapter retrieval augmented generation
system (MARAGS) for Meta's Comprehensive RAG (CRAG) competition at KDD Cup
2024. CRAG is a question answering dataset containing 3 different subtasks
aimed at realistic RAG-based question answering tasks, with a diverse set of
question topics, question types, time-dynamic answers, and questions featuring
entities of varying popularity.
Our system follows a standard setup for web based RAG, which uses processed
web pages to provide context for an LLM to produce generations, while also
querying API endpoints for additional information. MARAGS also utilizes
multiple different adapters to solve the various requirements for these tasks
with a standard cross-encoder model for ranking candidate passages relevant for
answering the question. Our system achieved 2nd place for Task 1 as well as 3rd
place on Task 2.
comment: Accepted to CRAG KDD Cup 24 Workshop
☆ Continual Skill and Task Learning via Dialogue
Continual and interactive robot learning is a challenging problem, as the
robot interacts with human users who expect it to perpetually learn novel
skills to solve novel tasks with sample efficiency. In this work we present a
framework for robots to query and learn visuo-motor robot skills and task
relevant information via natural language dialog interactions with human users.
Previous approaches either focus on improving the performance of instruction
following agents, or passively learn novel skills or concepts. Instead, we use
dialog combined with a language-skill grounding embedding to query or confirm
skills and/or tasks requested by a user. To achieve this goal, we develop and
integrated three different components for our agent. Firstly, we propose a
novel visual-motor control policy ACT with Low Rank Adaptation (ACT-LoRA),
which enables the existing SoTA ACT model to perform few-shot continual
learning. Secondly, we develop an alignment model that projects demonstrations
across skill embodiments into a shared embedding, allowing us to know when to
ask users questions and/or request demonstrations. Finally, we integrate an
existing LLM to interact with a human user to perform grounded interactive
continual skill learning to solve a task. Our ACT-LoRA model learns novel
fine-tuned skills with 100% accuracy when trained with only five
demonstrations for a novel skill while still maintaining a 74.75% accuracy on
pre-trained skills in the RLBench dataset where other models fall significantly
short. We also performed a human-subjects study with 8 subjects to demonstrate
the continual learning capabilities of our combined framework. We achieve a
success rate of 75% in the task of sandwich making with the real robot learning
from participant data demonstrating that robots can learn novel skills or task
knowledge from dialogue with non-expert users using our approach.
☆ MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models
A college-level benchmark dataset for large language models (LLMs) in the
materials science field, MaterialBENCH, is constructed. This dataset consists
of problem-answer pairs, based on university textbooks. There are two types of
problems: one is the free-response answer type, and the other is the
multiple-choice type. Multiple-choice problems are constructed by adding three
incorrect answers as choices to a correct answer, so that LLMs can choose one
of the four as a response. Most of the problems for free-response answer and
multiple-choice types overlap except for the format of the answers. We also
conduct experiments using the MaterialBENCH on LLMs, including ChatGPT-3.5,
ChatGPT-4, Bard (at the time of the experiments), and GPT-3.5 and GPT-4 with
the OpenAI API. The differences and similarities in the performance of LLMs
measured by the MaterialBENCH are analyzed and discussed. Performance
differences between the free-response type and multiple-choice type in the same
models and the influence of using system messages on multiple-choice problems
are also studied. We anticipate that MaterialBENCH will encourage further
developments of LLMs in reasoning abilities to solve more complicated problems
and eventually contribute to materials research and discovery.
☆ Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models
Jie Ma, Zhitao Gao, Qi Chai, Wangchun Sun, Pinghui Wang, Hongbin Pei, Jing Tao, Lingyun Song, Jun Liu, Chen Zhang, Lizhen Cui
Large Language Models (LLMs) may suffer from hallucinations in real-world
applications due to the lack of relevant knowledge. In contrast, knowledge
graphs encompass extensive, multi-relational structures that store a vast array
of symbolic facts. Consequently, integrating LLMs with knowledge graphs has
been extensively explored, with Knowledge Graph Question Answering (KGQA)
serving as a critical touchstone for the integration. This task requires LLMs
to answer natural language questions by retrieving relevant triples from
knowledge graphs. However, existing methods face two significant challenges:
\textit{excessively long reasoning paths distracting from the answer
generation}, and \textit{false-positive relations hindering the path
refinement}. In this paper, we propose an iterative interactive KGQA framework
that leverages the interactive learning capabilities of LLMs to perform
reasoning and Debating over Graphs (DoG). Specifically, DoG employs a
subgraph-focusing mechanism, allowing LLMs to perform answer trying after each
reasoning step, thereby mitigating the impact of lengthy reasoning paths. On
the other hand, DoG utilizes a multi-role debate team to gradually simplify
complex questions, reducing the influence of false-positive relations. This
debate mechanism ensures the reliability of the reasoning process. Experimental
results on five public datasets demonstrate the effectiveness and superiority
of our architecture. Notably, DoG outperforms the state-of-the-art method ToG
by 23.7\% and 9.1\% in accuracy on WebQuestions and GrailQA, respectively.
Furthermore, the integration experiments with various LLMs on the mentioned
datasets highlight the flexibility of DoG. Code is available at
\url{https://github.com/reml-group/DoG}.
comment: 12 pages
☆ GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation
Online sellers and advertisers are recommended keyphrases for their listed
products, which they bid on to enhance their sales. One popular paradigm that
generates such recommendations is Extreme Multi-Label Classification (XMC),
which involves tagging/mapping keyphrases to items. We outline the limitations
of using traditional item-query based tagging or mapping techniques for
keyphrase recommendations on E-Commerce platforms. We introduce GraphEx, an
innovative graph-based approach that recommends keyphrases to sellers using
extraction of token permutations from item titles. Additionally, we demonstrate
that relying on traditional metrics such as precision/recall can be misleading
in practical applications, thereby necessitating a combination of metrics to
evaluate performance in real-world scenarios. These metrics are designed to
assess the relevance of keyphrases to items and the potential for buyer
outreach. GraphEx outperforms production models at eBay, achieving the
objectives mentioned above. It supports near real-time inferencing in
resource-constrained production environments and scales effectively for
billions of items.
♻ ☆ PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity
One of the components of natural language processing that has received a lot
of investigation recently is semantic textual similarity. In computational
linguistics and natural language processing, assessing the semantic similarity
of words, phrases, paragraphs, and texts is crucial. Calculating the degree of
semantic resemblance between two textual pieces, paragraphs, or phrases
provided in both monolingual and cross-lingual versions is known as semantic
similarity. Cross lingual semantic similarity requires corpora in which there
are sentence pairs in both the source and target languages with a degree of
semantic similarity between them. Many existing cross-lingual semantic
similarity models rely on machine translation due to the unavailability of
cross-lingual semantic similarity datasets, and the propagation of machine
translation errors reduces model accuracy. On the other hand, when semantic
similarity features are to be used for machine translation, the same machine
translation should not also be used to build the similarity model. For
Persian, one of the low-resource languages, no effort has been made in this
regard, and the need for a model that can understand the context of both
languages is felt more than ever. In this article, a corpus of semantic
textual similarity between Persian and English sentences has been produced
for the first time with the help of linguistic experts. We named this dataset
PESTS (Persian English Semantic Textual Similarity). This corpus contains 5375
sentence pairs. Also, different models based on transformers have been
fine-tuned using this dataset. The results show that using the PESTS dataset,
the Pearson correlation of the XLM ROBERTa model increases from 85.87% to
95.62%.
♻ ☆ Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation
Tianyu Zheng, Shuyue Guo, Xingwei Qu, Jiawei Guo, Xinrun Du, Qi Jia, Chenghua Lin, Wenhao Huang, Jie Fu, Ge Zhang
In this paper, we introduce Kun, a novel approach for creating high-quality
instruction-tuning datasets for large language models (LLMs) without relying on
manual annotations. Adapting a self-training algorithm based on instruction
back-translation and answer polishment, Kun leverages unlabelled data from
diverse sources such as Wudao, Wanjuan, and SkyPile to generate a substantial
dataset of over a million Chinese instructional data points. This approach
significantly deviates from traditional methods by using a self-curation
process to refine and select the most effective instruction-output pairs. Our
experiments with the 6B-parameter Yi model across various benchmarks
demonstrate Kun's robustness and scalability. Our method's core contributions
lie in its algorithmic advancement, which enhances data retention and clarity,
and its innovative data generation approach that substantially reduces the
reliance on costly and time-consuming manual annotations. This methodology
presents a scalable and efficient solution for improving the
instruction-following capabilities of LLMs, with significant implications for
their application across diverse fields. The code and dataset can be found at
https://github.com/Zheng0428/COIG-Kun
comment: 12 pages, 12 figures
♻ ☆ Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
The pretraining data of today's strongest language models is opaque; in
particular, little is known about the proportions of various domains or
languages represented. In this work, we tackle a task which we call data
mixture inference, which aims to uncover the distributional make-up of training
data. We introduce a novel attack based on a previously overlooked source of
information: byte-pair encoding (BPE) tokenizers, used by the vast majority of
modern language models. Our key insight is that the ordered list of merge rules
learned by a BPE tokenizer naturally reveals information about the token
frequencies in its training data. Given a tokenizer's merge list along with
example data for each category of interest, we formulate a linear program that
solves for the proportion of each category in the tokenizer's training set. In
controlled experiments, we show that our attack recovers mixture ratios with
high precision for tokenizers trained on known mixtures of natural languages,
programming languages, and data sources. We then apply our approach to
off-the-shelf tokenizers released with recent LMs. We confirm much publicly
disclosed information about these models, and also make several new inferences:
GPT-4o and Mistral NeMo's tokenizers are much more multilingual than their
predecessors, training on 39% and 47% non-English language data, respectively;
Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use;
GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We
hope our work sheds light on current design practices for pretraining data, and
inspires continued research into data mixture inference for LMs.
comment: new robustness experiments; new baselines; include Mistral,
Mistral-Nemo and GPT-NeoX; link to code
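The merge-list attack above can be caricatured in a few lines. This is a toy sketch, not the paper's exact linear program: the per-category merge-frequency profiles and the 70/30 ground-truth mixture below are invented, and least squares with a simplex projection stands in for the LP.

```python
import numpy as np

# Toy setup: for each candidate data category we can measure how often each
# BPE merge rule would fire on sample text from that category. The
# tokenizer's observed merge statistics are then (approximately) a convex
# combination of per-category profiles, weighted by the unknown proportions.
profiles = np.array([
    [0.60, 0.10],   # a merge that fires often in English, rarely in code
    [0.05, 0.55],   # a merge that fires often in code
    [0.35, 0.35],   # a merge common in both
])

true_mix = np.array([0.7, 0.3])    # hidden ground truth (invented)
observed = profiles @ true_mix     # what the merge list "reveals"

# Recover the mixture by least squares, then project onto the simplex
# (non-negative, sums to 1) -- a stand-in for the paper's linear program.
est, *_ = np.linalg.lstsq(profiles, observed, rcond=None)
est = np.clip(est, 0, None)
est = est / est.sum()

print(np.round(est, 2))  # recovers the 70/30 mixture on this noiseless toy
```

With noisy real statistics the recovery is only approximate, which is why the paper formulates a proper constrained program rather than plain least squares.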
♻ ☆ Cost-Efficient Subjective Task Annotation and Modeling through Few-Shot Annotator Adaptation
In subjective NLP tasks, where a single ground truth does not exist, the
inclusion of diverse annotators becomes crucial as their unique perspectives
significantly influence the annotations. In realistic scenarios, the annotation
budget often becomes the main determinant of the number of perspectives (i.e.,
annotators) included in the data and subsequent modeling. We introduce a novel
framework for annotation collection and modeling in subjective tasks that aims
to minimize the annotation budget while maximizing the predictive performance
for each annotator. Our framework has a two-stage design: first, we rely on a
small set of annotators to build a multitask model, and second, we augment the
model for a new perspective by strategically annotating a few samples per
annotator. To test our framework at scale, we introduce and release a unique
dataset, Moral Foundations Subjective Corpus, of 2000 Reddit posts annotated by
24 annotators for moral sentiment. We demonstrate that our framework surpasses
the previous SOTA in capturing the annotators' individual perspectives with as
little as 25% of the original annotation budget on two datasets. Furthermore,
our framework results in more equitable models, reducing the performance
disparity among annotators.
♻ ☆ Exploring Group and Symmetry Principles in Large Language Models
Large Language Models (LLMs) have demonstrated impressive performance across
a wide range of applications; however, assessing their reasoning capabilities
remains a significant challenge. In this paper, we introduce a framework
grounded in group and symmetry principles, which have played a crucial role in
fields such as physics and mathematics, and offer another way to evaluate the
capabilities of LLMs. While the proposed framework is general, to showcase the benefits
of employing these properties, we focus on arithmetic reasoning and investigate
the performance of these models on four group properties: closure, identity,
inverse, and associativity. Our findings reveal that LLMs studied in this work
struggle to preserve group properties across different test regimes. In the
closure test, we observe biases towards specific outputs and an abrupt
degradation in their performance from 100% to 0% after a specific sequence
length. They also perform poorly in the identity test, which involves adding
irrelevant information to the context, and show sensitivity when subjected to
the inverse test, which examines the robustness of the model with respect to
negation. In addition, we demonstrate that breaking problems down into smaller
steps helps LLMs in the associativity test that we have conducted. To support
these tests, we have developed a synthetic dataset, which will be released.
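The four group-property probes can be illustrated with a toy harness. This is a sketch under invented assumptions: the `solve` stub below simply sums its inputs, standing in for querying an LLM with an addition problem.

```python
import random

def solve(terms):
    """Stand-in for a model answering 'what is a1 + a2 + ... ?'."""
    return sum(terms)

random.seed(0)
terms = [random.randint(1, 9) for _ in range(5)]

# Closure: the answer to a sum of integers should itself be an integer.
assert isinstance(solve(terms), int)

# Identity: appending irrelevant zeros must not change the answer.
assert solve(terms + [0, 0]) == solve(terms)

# Inverse: adding a term and its negation must cancel out.
assert solve(terms + [7, -7]) == solve(terms)

# Associativity: solving in smaller grouped steps gives the same result.
left = solve(terms[:2])
assert solve([left] + terms[2:]) == solve(terms)

print("all four group-property probes passed")
```

A real evaluation would replace `solve` with an LLM call and, per the paper's findings, would see the identity and inverse probes fail far more often than this exact-arithmetic stub does.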
♻ ☆ Positioning Political Texts with Large Language Models by Asking and Averaging
We use instruction-tuned Large Language Models (LLMs) like GPT-4, Llama 3,
Mixtral, or Aya to position political texts within policy and ideological
spaces. We ask an LLM where a tweet or a sentence of a political text stands on
the focal dimension and take the average of the LLM responses to position
political actors such as US Senators, or longer texts such as UK party
manifestos or EU policy speeches given in 10 different languages. The
correlations between the position estimates obtained with the best LLMs and
benchmarks based on text coding by experts, crowdworkers, or roll call votes
exceed 0.90. This approach is generally more accurate than the positions
obtained with supervised classifiers trained on large amounts of research data.
Using instruction-tuned LLMs to position texts in policy and ideological spaces
is fast, cost-efficient, reliable, and reproducible (in the case of open LLMs)
even if the texts are short and written in different languages. We conclude
with cautionary notes about the need for empirical validation.
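The ask-and-average procedure is simple enough to sketch. The scores and the `fake_llm` stub below are hypothetical; a real run would prompt an instruction-tuned LLM for each text.

```python
from statistics import mean

def position(texts, ask):
    """Average the LLM's numeric placements over all texts by one actor."""
    return mean(ask(t) for t in texts)

# Pretend placements on a -1 (left) .. +1 (right) scale for one actor's
# tweets; averaging smooths out per-response noise.
fake_llm = {"tweet a": 0.4, "tweet b": 0.6, "tweet c": 0.5}.get
print(position(["tweet a", "tweet b", "tweet c"], fake_llm))  # 0.5
```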
♻ ☆ Towards Evaluating and Building Versatile Large Language Models for Medicine
In this study, we present MedS-Bench, a comprehensive benchmark designed to
evaluate the performance of large language models (LLMs) in clinical contexts.
Unlike existing benchmarks that focus on multiple-choice question answering,
MedS-Bench spans 11 high-level clinical tasks, including clinical report
summarization, treatment recommendations, diagnosis, named entity recognition,
and medical concept explanation, among others. We evaluated six leading LLMs,
namely MEDITRON, Mistral, InternLM 2, Llama 3, GPT-4, and Claude-3.5, using
few-shot prompting, and found that even the most sophisticated models struggle
with these complex tasks. To address these limitations, we developed MedS-Ins,
a large-scale instruction tuning dataset for medicine. MedS-Ins comprises 58
medically oriented language corpora, totaling 13.5 million samples across 122
tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept
experiment by performing instruction tuning on a lightweight, open-source
medical language model. The resulting model, MMedIns-Llama 3, significantly
outperformed existing models across nearly all clinical tasks. To promote
further advancements in the application of LLMs to clinical challenges, we have
made the MedS-Ins dataset fully accessible and invite the research community to
contribute to its expansion. Additionally, we have launched a dynamic
leaderboard for MedS-Bench, whose test set we plan to update regularly to
track progress and enhance the adaptation of general LLMs to the medical
domain. Leaderboard: https://henrychur.github.io/MedS-Bench/. Github:
https://github.com/MAGIC-AI4Med/MedS-Ins.
♻ ☆ Legilimens: Practical and Unified Content Moderation for Large Language Model Services CCS
Given the societal impact of unsafe content generated by large language
models (LLMs), ensuring that LLM services comply with safety standards is a
crucial concern for LLM service providers. Common content moderation methods
are limited by an effectiveness-and-efficiency dilemma, where simple models are
fragile while sophisticated models consume excessive computational resources.
In this paper, we reveal for the first time that effective and efficient
content moderation can be achieved by extracting conceptual features from
chat-oriented LLMs, despite their initial fine-tuning for conversation rather
than content moderation. We propose a practical and unified content moderation
framework for LLM services, named Legilimens, which features both effectiveness
and efficiency. Our red-team model-based data augmentation enhances the
robustness of Legilimens against state-of-the-art jailbreaking. Additionally,
we develop a framework to theoretically analyze the cost-effectiveness of
Legilimens compared to other methods. We have conducted extensive experiments
on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify
the effectiveness, efficiency, and robustness of Legilimens against normal and
adaptive adversaries. A comparison of Legilimens with both commercial and
academic baselines demonstrates the superior performance of Legilimens.
Furthermore, we confirm that Legilimens can be applied to few-shot scenarios
and extended to multi-label classification tasks.
comment: Accepted by ACM Conference on Computer and Communications Security
(CCS) 2024
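The core claim, that moderation can ride on features the chat LLM already computes, can be caricatured with a tiny linear probe on synthetic "conceptual features". All data below is invented; a real system would extract features from the host LLM's hidden states and face far less separable inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend 2-D conceptual features: safe prompts centered at -1, unsafe at +1.
safe = rng.normal(-1.0, 0.3, size=(50, 2))
unsafe = rng.normal(+1.0, 0.3, size=(50, 2))
X = np.vstack([safe, unsafe])
y = np.array([0] * 50 + [1] * 50)

# Tiny logistic-regression probe trained by gradient descent -- the point is
# that the classifier on top of extracted features can be this lightweight.
w, b = np.zeros(2), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)
print(acc)  # near-perfect on this separable toy data
```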
♻ ☆ Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions
The responses generated by Large Language Models (LLMs) can include sensitive
information from individuals and organizations, leading to potential privacy
leakage. This work implements Influence Functions (IFs) to trace privacy
leakage back to the training data, thereby mitigating privacy concerns of
Language Models (LMs). However, we notice that current IFs struggle to
accurately estimate the influence of tokens with large gradient norms,
potentially overestimating their influence. When tracing the most influential
samples, this leads to frequently tracing back to samples with large gradient
norm tokens, overshadowing the actual most influential samples even if their
influences are well estimated. To address this issue, we propose Heuristically
Adjusted IF (HAIF), which reduces the weight of tokens with large gradient
norms, thereby significantly improving the accuracy of tracing the most
influential samples. To establish easily obtained groundtruth for tracing
privacy leakage, we construct two datasets, PII-E and PII-CR, representing two
distinct scenarios: one with identical text in the model outputs and
pre-training data, and the other where models leverage their reasoning
abilities to generate text divergent from pre-training data. HAIF significantly
improves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E
dataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA
IFs against various GPT-2 and Qwen-1.5 models. HAIF also outperforms SOTA IFs
on the real-world pretraining corpus CLUECorpus2020, demonstrating strong
robustness regardless of prompt and response lengths.
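The adjustment idea can be sketched numerically. The down-weighting function and all numbers below are illustrative, not the paper's exact HAIF formula: the point is only that dividing token influence by a function of its gradient norm stops one outlier token from dominating sample-level influence.

```python
import numpy as np

def adjusted_influence(token_influences, grad_norms):
    # Hypothetical reweighting: shrink tokens with large gradient norms.
    weights = 1.0 / (1.0 + np.asarray(grad_norms))
    return float(np.sum(weights * np.asarray(token_influences)))

# Sample B has one token with an enormous gradient norm; a plain sum would
# (wrongly) rank B as more influential than A.
infl_a, norms_a = [2.0, 2.0, 2.0], [1.0, 1.0, 1.0]
infl_b, norms_b = [9.0, 0.1, 0.1], [50.0, 1.0, 1.0]

print(sum(infl_a) < sum(infl_b))          # plain sum prefers B: True
print(adjusted_influence(infl_a, norms_a)
      > adjusted_influence(infl_b, norms_b))  # adjusted prefers A: True
```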
♻ ☆ Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
Model merging is an efficient empowerment technique in the machine learning
community that requires neither the collection of raw training data nor
expensive computation. As model merging becomes increasingly
prevalent across various fields, it is crucial to understand the available
model merging techniques comprehensively. However, there is a significant gap
in the literature regarding a systematic and thorough review of these
techniques. This survey provides a comprehensive overview of model merging
methods and theories, their applications in various domains and settings, and
future research directions. Specifically, we first propose a new taxonomic
approach that exhaustively discusses existing model merging methods. Secondly,
we discuss the application of model merging techniques in large language
models, multimodal large language models, and 10+ machine learning subfields,
including continual learning, multi-task learning, few-shot learning, etc.
Finally, we highlight the remaining challenges of model merging and discuss
future research directions. A comprehensive list of papers about model merging
is available at
\url{https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications}.
♻ ☆ Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review
This comprehensive review delves into the pivotal role of prompt engineering
in unleashing the capabilities of Large Language Models (LLMs). The development
of Artificial Intelligence (AI), from its inception in the 1950s to the
emergence of advanced neural networks and deep learning architectures, has
led to breakthroughs in LLMs, with models such as GPT-4o and Claude-3, and in
Vision-Language Models (VLMs), with models such as CLIP and ALIGN. Prompt
engineering is the process of structuring inputs, which has emerged as a
crucial technique to maximize the utility and accuracy of these models. This
paper explores both foundational and advanced methodologies of prompt
engineering, including techniques such as self-consistency, chain-of-thought,
and generated knowledge, which significantly enhance model performance.
Additionally, it examines prompting methods for VLMs through innovative
approaches such as Context Optimization (CoOp), Conditional Context
Optimization (CoCoOp), and Multimodal Prompt Learning (MaPLe). Critical to this
discussion is the aspect of AI security, particularly adversarial attacks that
exploit vulnerabilities in prompt engineering. Strategies to mitigate these
risks and enhance model robustness are thoroughly reviewed. The evaluation of
prompt methods is also addressed, through both subjective and objective
metrics, ensuring a robust analysis of their efficacy. This review also
reflects the essential role of prompt engineering in advancing AI capabilities,
providing a structured framework for future research and application.
♻ ☆ Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model
Due to the inherent difficulty in modeling phonetic similarities across
different languages, code-switching speech recognition presents a formidable
challenge. This study proposes Collaborative-MoE, a Mixture of Experts (MoE)
model that leverages a collaborative mechanism among expert groups. Initially,
a preceding routing network explicitly learns Language Identification (LID)
tasks and selects experts based on acquired LID weights. This process ensures
robust routing information to the MoE layer, mitigating interference from
diverse language domains on expert network parameter updates. The LID weights
are also employed to facilitate inter-group collaboration, enabling the
integration of language-specific representations. Furthermore, within each
language expert group, a gating network operates unsupervised to foster
collaboration on attributes beyond language. Extensive experiments demonstrate
the efficacy of our approach, achieving significant performance enhancements
compared to alternative methods. Importantly, our method preserves the
efficient inference capabilities characteristic of MoE models without
necessitating additional pre-training.
comment: Accepted by IEEE SLT 2024
♻ ☆ Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
Transferring linguistic knowledge from a pretrained language model (PLM) to
an acoustic model has been shown to greatly improve the performance of
automatic speech recognition (ASR). However, due to the heterogeneous feature
distributions in cross-modalities, designing an effective model for feature
alignment and knowledge transfer between linguistic and acoustic sequences
remains a challenging task. Optimal transport (OT), which efficiently measures
probability distribution discrepancies, holds great potential for aligning and
transferring knowledge between acoustic and linguistic modalities. Nonetheless,
the original OT treats acoustic and linguistic feature sequences as two
unordered sets in alignment and neglects temporal order information during OT
coupling estimation. Consequently, a time-consuming pretraining stage is
required to learn a good alignment between the acoustic and linguistic
representations. In this paper, we propose a Temporal Order Preserved OT
(TOT)-based Cross-modal Alignment and Knowledge Transfer (TOT-CAKT) method for
ASR. In TOT-CAKT, local neighboring frames of acoustic sequences are
smoothly mapped to neighboring regions of linguistic sequences, preserving
their temporal order relationship in feature alignment and matching. With the
TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained
Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the
proposed TOT-CAKT significantly improves ASR performance compared to several
state-of-the-art models employing linguistic knowledge transfer, and addresses
the weaknesses of the original OT-based method in sequential feature alignment
for ASR.
comment: Accepted to IEEE SLT 2024
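The effect of a temporal-order prior on an OT coupling can be sketched with entropic (Sinkhorn) OT on toy sequences. This is illustrative, not the exact TOT-CAKT formulation: feature costs are zeroed to isolate the ordering effect, and the penalty weight and entropic temperature are invented.

```python
import numpy as np

# Penalize couplings between acoustic frame i and linguistic token j whose
# relative positions disagree, so the transport plan stays near the diagonal.
n_frames, n_tokens = 6, 3
feat_cost = np.zeros((n_frames, n_tokens))  # zeros isolate the order prior
pos_f = np.arange(n_frames) / (n_frames - 1)
pos_t = np.arange(n_tokens) / (n_tokens - 1)
cost = feat_cost + 5.0 * np.abs(pos_f[:, None] - pos_t[None, :])

# Entropic OT with uniform marginals via Sinkhorn iterations.
K = np.exp(-cost / 0.1)
u = np.ones(n_frames) / n_frames
for _ in range(100):
    v = (np.ones(n_tokens) / n_tokens) / (K.T @ u)
    u = (np.ones(n_frames) / n_frames) / (K @ v)
plan = u[:, None] * K * v[None, :]

# Each frame's strongest coupling follows temporal order: early frames map
# to early tokens, late frames to late tokens.
print(plan.argmax(axis=1))
```

Dropping the order penalty (setting its weight to 0) leaves the plan uniform here, which mirrors the paper's observation that plain OT treats the sequences as unordered sets.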
♻ ☆ LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Jiayi Gui, Yiming Liu, Jiale Cheng, Xiaotao Gu, Xiao Liu, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang
Large Language Models (LLMs) have demonstrated notable capabilities across
various tasks, showcasing complex problem-solving abilities. Understanding and
executing complex rules, along with multi-step planning, are fundamental to
logical reasoning and critical for practical LLM agents and decision-making
systems. However, evaluating LLMs as effective rule-based executors and
planners remains underexplored. In this paper, we introduce LogicGame, a novel
benchmark designed to evaluate the comprehensive rule understanding, execution,
and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame
provides diverse games that contain a series of rules with an initial state,
requiring models to comprehend and apply predefined regulations to solve
problems. We create simulated scenarios in which models execute or plan
operations to achieve specific outcomes. These game scenarios are specifically
designed to distinguish logical reasoning from mere knowledge by relying
exclusively on predefined rules. This separation allows for a pure assessment
of rule-based reasoning capabilities. The evaluation considers not only final
outcomes but also intermediate steps, providing a comprehensive assessment of
model performance. Moreover, these intermediate steps are deterministic and can
be automatically verified. LogicGame defines game scenarios with varying
difficulty levels, from simple rule applications to complex reasoning chains,
in order to offer a precise evaluation of model performance on rule
understanding and multi-step execution. Utilizing LogicGame, we test various
LLMs and identify notable shortcomings in their rule-based logical reasoning
abilities.
♻ ☆ Exposing and Explaining Fake News On-the-Fly
Francisco de Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan Carlos Burguillo
Social media platforms enable the rapid dissemination and consumption of
information. However, users instantly consume such content regardless of the
reliability of the shared data. Consequently, this crowdsourcing model is
exposed to manipulation. This work contributes an explainable and online
classification method to recognize fake news in real time. The proposed method
combines both unsupervised and supervised Machine Learning approaches with
online created lexica. The profiling is built using creator-, content- and
context-based features using Natural Language Processing techniques. The
explainable classification mechanism displays in a dashboard the features
selected for classification and the prediction confidence. The performance of
the proposed solution has been validated with real data sets from Twitter, and
the results attain 80% accuracy and macro F-measure. This proposal is the
first to jointly provide data stream processing, profiling, classification and
explainability. Ultimately, the proposed early detection, isolation and
explanation of fake news contribute to increasing the quality and
trustworthiness of social media content.
♻ ☆ A review on the use of large language models as virtual tutors
Transformer architectures contribute to managing long-term dependencies for
Natural Language Processing, representing one of the most recent changes in the
field. These architectures are the basis of the innovative, cutting-edge Large
Language Models (LLMs) that have produced a huge buzz in several fields and
industrial sectors, among which education stands out. Accordingly, these
generative Artificial Intelligence-based solutions have directed the change in
techniques and the evolution in educational methods and contents, along with
network infrastructure, towards high-quality learning. Given the popularity of
LLMs, this review seeks to provide a comprehensive overview of those solutions
designed specifically to generate and evaluate educational materials and which
involve students and teachers in their design or experimental plan. To the best
of our knowledge, this is the first review of educational applications (e.g.,
student assessment) of LLMs. As expected, the most common role of these systems
is as virtual tutors for automatic question generation. Moreover, the most
popular models are GPT-3 and BERT. However, due to the continuous launch of new
generative models, new works are expected to be published shortly.
♻ ☆ Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications
Past studies on end-to-end meeting transcription have focused on model
architecture and have mostly been evaluated on simulated meeting data. We
present a novel study aiming to optimize the use of a Speaker-Attributed ASR
(SA-ASR) system in real-life scenarios, such as the AMI meeting corpus, for
improved speaker assignment of speech segments. First, we propose a pipeline
tailored to real-life applications involving Voice Activity Detection (VAD),
Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output
segments to fine-tune the SA-ASR model, considering that it is also applied to
VAD segments during test, and show that this results in a relative reduction of
Speaker Error Rate (SER) up to 28%. Finally, we explore strategies to enhance
the extraction of the speaker embedding templates used as inputs by the SA-ASR
system. We show that extracting them from SD output rather than annotated
speaker segments results in a relative SER reduction up to 20%.
comment: Submitted to Odyssey 2024
♻ ☆ Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?
The significant advancements of Large Language Models (LLMs) in generative
tasks have led to a growing body of work exploring LLM-based embedding models.
While these models, employing different pooling and attention strategies, have
achieved state-of-the-art performance on public embedding benchmarks, questions
still arise about what constitutes an effective design for LLM-based embedding
models. However, these models are often trained on different datasets, using
different LLM base models or training settings. Moreover, evaluations on public
embedding benchmarks often fail to report statistical significance, making it
difficult to determine which designs truly contribute to final performance.
This complicates the process for practitioners seeking optimal training recipes
for LLM-based embedding models. In this study, we conduct a large-scale
experiment by training a series of LLM-based embedding models using the same
training data and base model but differing in their pooling and attention
strategies. The results show that there is no one-size-fits-all solution: while
bidirectional attention and an additional trainable pooling layer outperform in
text similarity and information retrieval tasks, they do not significantly
surpass simpler designs like EOS-last token pooling and default causal
attention in clustering and classification tasks. Furthermore, we propose a new
pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs
of all hidden layers, rather than just the last layer, using a cross-attention
network. This method proves to be statistically superior in text similarity and
retrieval tasks compared to existing pooling methods. Overall, this paper sheds
light on effective training strategies for LLM-based embedding models.
comment: https://github.com/yixuantt/PoolingAndAttn
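The pooling strategies under comparison are easy to sketch on dummy hidden states. Shapes and the all-ones "trainable" query below are invented; real inputs would be an LLM's final-layer activations, and the query would be learned.

```python
import numpy as np

# Dummy hidden states for a 4-token sequence with hidden size 3.
hidden = np.arange(12, dtype=float).reshape(4, 3)

# EOS-/last-token pooling: keep the final token's representation.
eos_last = hidden[-1]

# Mean pooling: average over all token positions.
mean_pool = hidden.mean(axis=0)

# Attention-style trainable pooling (stand-in query): softmax-weighted sum.
query = np.ones(3)
scores = hidden @ query
attn = np.exp(scores - scores.max())
attn /= attn.sum()
attn_pool = attn @ hidden

print(eos_last)    # [ 9. 10. 11.]
print(mean_pool)   # [4.5 5.5 6.5]
```

The paper's Multi-Layers Trainable Pooling goes a step further by cross-attending over the hidden states of all layers rather than only the last one, but the weighted-sum mechanics are the same as in `attn_pool`.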
♻ ☆ CAVE: Controllable Authorship Verification Explanations
Authorship Verification (AV) (do two documents have the same author?) is
essential in many sensitive real-life applications. AV is often used in
proprietary domains that require a private, offline model, making SOTA online
models like ChatGPT undesirable. Current offline models however have lower
downstream utility due to low accuracy/scalability (e.g., traditional stylometry
AV systems) and lack of accessible post-hoc explanations. In this work, we take
the first step to address the above challenges with our trained, offline
Llama-3-8B model CAVE (Controllable Authorship Verification Explanations): CAVE
generates free-text AV explanations that are controlled to be (1) structured
(can be decomposed into sub-explanations in terms of relevant linguistic
features), and (2) easily verified for explanation-label consistency (via
intermediate labels in sub-explanations). We first engineer a prompt that can
generate silver training data from a SOTA teacher model in the desired CAVE
output format. We then filter and distill this data into a pretrained
Llama-3-8B, our carefully selected student model. Results on three difficult AV
datasets IMDb62, Blog-Auth, and Fanfiction show that CAVE generates high
quality explanations (as measured by automatic and human evaluation) as well as
competitive task accuracies.
♻ ☆ Aligning Large Language Models to a Domain-specific Graph Database for NL2GQL
Graph Databases (Graph DB) find extensive application across diverse domains
such as finance, social networks, and medicine. Yet, the translation of Natural
Language (NL) into the Graph Query Language (GQL), referred to as NL2GQL, poses
significant challenges owing to its intricate and specialized nature. Some
approaches have sought to utilize Large Language Models (LLMs) to address
analogous tasks like text2SQL. Nonetheless, in the realm of NL2GQL tasks
tailored to a particular domain, the absence of domain-specific NL-GQL data
pairs adds complexity to aligning LLMs with the graph DB. To tackle this
challenge, we present a well-defined pipeline. Initially, we utilize ChatGPT to
generate NL-GQL data pairs, leveraging the provided graph DB with
self-instruction. Subsequently, we employ the generated data to fine-tune LLMs,
ensuring alignment between LLMs and the graph DB. Moreover, we find that the
relevant schema is important for efficiently generating accurate GQLs. Thus, we
introduce a method to extract relevant schema as the input context. We evaluate
our method using two carefully constructed datasets derived from graph DBs in
the finance and medicine domains, named FinGQL and MediGQL. Experimental
results reveal that our approach significantly outperforms a set of baseline
methods, with improvements of 5.90 and 6.36 absolute points on EM, and 6.00 and
7.09 absolute points on EX for FinGQL and MediGQL, respectively.
comment: 13 pages,2 figures
♻ ☆ More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding
Enabling Large Language Models (LLMs) to comprehend the 3D physical world
remains a significant challenge. Due to the lack of large-scale 3D-text pair
datasets, the success of LLMs has yet to be replicated in 3D understanding. In
this paper, we rethink this issue and propose a new task: 3D Data-Efficient
Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D
object understanding with minimal 3D point cloud and text data pairs. To
address this task, we introduce GreenPLM, which leverages more text data to
compensate for the lack of 3D data. First, inspired by using CLIP to align
images and text, we utilize a pre-trained point cloud-text encoder to map the
3D point cloud space to the text space. This mapping allows us to seamlessly
connect the text space with LLMs. Once the point-text-LLM connection is
established, we further enhance text-LLM alignment by expanding the
intermediate text space, thereby reducing the reliance on 3D point cloud data.
Specifically, we generate 6M free-text descriptions of 3D objects, and design a
three-stage training strategy to help LLMs better explore the intrinsic
connections between different modalities. To achieve efficient modality
alignment, we design a zero-parameter cross-attention module for token pooling.
Extensive experimental results show that GreenPLM requires only 12% of the 3D
training data used by existing state-of-the-art models to achieve superior 3D
understanding. Remarkably, GreenPLM also achieves competitive performance using
text-only data. The code and weights are available at:
https://github.com/TangYuan96/GreenPLM.
♻ ☆ OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation
Włodzimierz Lewoniewski, Piotr Stolarski, Milena Stróżyna, Elzbieta Lewańska, Aleksandra Wojewoda, Ewelina Księżniak, Marcin Sawiński
This paper presents the experiments and results for the CheckThat! Lab at
CLEF 2024 Task 6: Robustness of Credibility Assessment with Adversarial
Examples (InCrediblAE). The primary objective of this task was to generate
adversarial examples in five problem domains in order to evaluate the
robustness of widely used text classification methods (fine-tuned BERT, BiLSTM,
and RoBERTa) when applied to credibility assessment issues.
This study explores the application of ensemble learning to enhance
adversarial attacks on natural language processing (NLP) models. We
systematically tested and refined several adversarial attack methods, including
BERT-Attack, Genetic algorithms, TextFooler, and CLARE, on five datasets across
various misinformation tasks. By developing modified versions of BERT-Attack
and hybrid methods, we achieved significant improvements in attack
effectiveness. Our results demonstrate the potential of modification and
combining multiple methods to create more sophisticated and effective
adversarial attack strategies, contributing to the development of more robust
and secure systems.
comment: CLEF 2024 - Conference and Labs of the Evaluation Forum
♻ ☆ Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models
(LLMs) and cognitive science, examining similarities and differences between
LLMs and human cognitive processes. We analyze methods for evaluating LLMs'
cognitive abilities and discuss their potential as cognitive models. The review
covers applications of LLMs in various cognitive fields, highlighting insights
gained for cognitive science research. We assess cognitive biases and
limitations of LLMs, along with proposed methods for improving their
performance. The integration of LLMs with cognitive architectures is examined,
revealing promising avenues for enhancing artificial intelligence (AI)
capabilities. Key challenges and future research directions are identified,
emphasizing the need for continued refinement of LLMs to better align with
human cognition. This review provides a balanced perspective on the current
state and future potential of LLMs in advancing our understanding of both
artificial and human intelligence.
comment: 10 pages, 1 figure
♻ ☆ LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li
Though current long-context large language models (LLMs) have demonstrated
impressive capacities in answering user questions based on extensive text, the
lack of citations in their responses makes user verification difficult, leading
to concerns about their trustworthiness due to their potential hallucinations.
In this work, we aim to enable long-context LLMs to generate responses with
fine-grained sentence-level citations, improving their faithfulness and
verifiability. We first introduce LongBench-Cite, an automated benchmark for
assessing current LLMs' performance in Long-Context Question Answering with
Citations (LQAC), revealing considerable room for improvement. To this end, we
propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs
to automatically generate long-context QA instances with precise sentence-level
citations, and leverage this pipeline to construct LongCite-45k, a large-scale
SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the
LongCite-45k dataset, successfully enabling their generation of accurate
responses and fine-grained sentence-level citations in a single output. The
evaluation results on LongBench-Cite show that our trained models achieve
state-of-the-art citation quality, surpassing advanced proprietary models
including GPT-4o.
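The fine-grained citations described above can also be consumed programmatically for verification. As a minimal sketch, assuming a hypothetical inline format in which each statement is followed by a bracketed range of source-sentence indices (the actual LongCite output format may differ):

```python
import re


def parse_cited_response(response):
    """Split a cited response into (statement, cited_sentence_ids) pairs.

    Assumes a hypothetical inline format where each statement is followed
    by a bracketed span of source-sentence indices, e.g. "...text.[3-5]".
    """
    pairs = []
    for m in re.finditer(r"(.+?)\[(\d+)-(\d+)\]", response):
        text = m.group(1).strip()
        start, end = int(m.group(2)), int(m.group(3))
        pairs.append((text, list(range(start, end + 1))))
    return pairs


resp = "The treaty ended the war.[2-4] It was ratified a year later.[7-7]"
print(parse_cited_response(resp))
```

Each index range can then be mapped back to the numbered sentences of the long context, so a reader can check every statement against its cited evidence.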
♻ ☆ Resolving Knowledge Conflicts in Large Language Models
Large language models (LLMs) often encounter knowledge conflicts: scenarios in
which a discrepancy arises between the internal parametric knowledge of an LLM
and the non-parametric information provided in the prompt context. In this
work, we ask what the desiderata are for LLMs when a knowledge conflict arises
and whether existing LLMs fulfill them. We posit that LLMs should 1) identify
knowledge
conflicts, 2) pinpoint conflicting information segments, and 3) provide
distinct answers or viewpoints in conflicting scenarios. To this end, we
introduce KNOWLEDGE CONFLICT, an evaluation framework for simulating contextual
knowledge conflicts and quantitatively evaluating to what extent LLMs achieve
these goals. KNOWLEDGE CONFLICT includes diverse and complex situations of
knowledge conflict, knowledge from diverse entities and domains, two synthetic
conflict creation methods, and settings with progressively increasing
difficulty to reflect realistic knowledge conflicts. Extensive experiments with
the KNOWLEDGE CONFLICT framework reveal that while LLMs perform well in
identifying the existence of knowledge conflicts, they struggle to determine
the specific conflicting knowledge and produce a response with distinct answers
amidst conflicting information. To address these challenges, we propose new
instruction-based approaches that augment LLMs to better achieve the three
goals. Further analysis shows that abilities to tackle knowledge conflicts are
greatly impacted by factors such as knowledge domain and prompt text, while
generating robust responses to knowledge conflict scenarios remains an open
research question.
comment: Published at COLM 2024
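To make the notion of a synthetic contextual conflict concrete, here is a minimal sketch of one plausible conflict-creation method, entity substitution; the function name and construction below are illustrative assumptions, not the paper's KNOWLEDGE CONFLICT implementation:

```python
def make_conflict_instance(question, fact, entity, substitute):
    """Synthesize a contextual knowledge conflict by entity substitution:
    replace an entity in a well-known fact so the prompt context contradicts
    the model's likely parametric knowledge. (Illustrative sketch only.)
    """
    if entity not in fact:
        raise ValueError("entity must occur in the fact")
    return {
        "question": question,
        "parametric_answer": entity,                   # what the model likely knows
        "context": fact.replace(entity, substitute),   # contradicting evidence
        "context_answer": substitute,
    }


inst = make_conflict_instance(
    "Where is the Eiffel Tower located?",
    "The Eiffel Tower is located in Paris.",
    "Paris",
    "Rome",
)
print(inst["context"])
```

An evaluator can then probe whether a model flags the contradiction, localizes it, and reports both the parametric and the contextual answer.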
♻ ☆ Prediction of COPD Using Machine Learning, Clinical Summary Notes, and Vital Signs
Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory lung
disease that causes obstructed airflow from the lungs. In the United States,
more than 15.7 million Americans have been diagnosed with COPD, with 96% of
individuals living with at least one other chronic health condition. It is the
4th leading cause of death in the country. Over 2.2 million patients are
admitted to hospitals annually due to COPD exacerbations. Timely monitoring
and prediction of patient exacerbations could save lives. This paper presents
two predictive models for COPD exacerbation that use AI
and natural language processing (NLP) approaches. These models use respiration
summary notes, symptoms, and vital signs. To train and test these models, data
records containing physiologic signals and vital signs time series were used.
These records were captured from patient monitors and comprehensive clinical
data obtained from hospital medical information systems for tens of thousands
of Intensive Care Unit (ICU) patients. We achieved an area under the receiver
operating characteristic (ROC) curve of 0.82 in detection and prediction of
COPD exacerbation.
comment: 11 pages, 5 figures
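The reported area under the ROC curve can be computed directly from model scores via its rank-statistic interpretation: the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A self-contained sketch on toy data (not the paper's model or dataset):

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U formulation:
    the fraction of positive/negative pairs ranked correctly,
    counting ties as half-correct.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy exacerbation labels and risk scores (illustrative only)
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.3, 0.5, 0.6, 0.4]
print(roc_auc(labels, scores))  # ≈ 0.83
```

An AUC of 0.82, as reported above, means the model ranks a true exacerbation case above a non-case about 82% of the time.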
♻ ☆ Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation
Large language models (LLMs) have achieved state-of-the-art performance in
various language processing tasks, motivating their adoption in simultaneous
translation. Current fine-tuning methods to adapt LLMs for simultaneous
translation focus on prompting optimization strategies using either data
augmentation or prompt structure modifications. However, these methods suffer
from several issues, such as unnecessarily expanded training sets,
computational inefficiency from dumping the key and value cache, increased
prompt sizes, or restriction to a single decision policy. To eliminate these
issues, in this work, we propose SimulMask, a new paradigm for fine-tuning LLMs
for simultaneous translation. It utilizes a novel attention mask approach that
models simultaneous translation during fine-tuning by masking attention for a
desired decision policy. Applying the proposed SimulMask to a Falcon LLM on
the IWSLT 2017 dataset, we observe a significant translation quality
improvement over state-of-the-art prompting optimization strategies on
five language pairs while reducing the computational cost.
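The core idea, masking attention so that fine-tuning mimics a simultaneous decision policy, can be illustrated with a wait-k-style mask over a concatenated source-target sequence. This is a sketch of the general mechanism under assumed conventions, not SimulMask's exact construction:

```python
import numpy as np


def simultaneous_attention_mask(src_len, tgt_len, wait_k):
    """Boolean attention mask (True = may attend) for a wait-k policy:
    target position i sees only the first wait_k + i source tokens,
    plus itself and earlier target tokens (standard causal masking).
    """
    n = src_len + tgt_len
    mask = np.tril(np.ones((n, n), dtype=bool))         # causal baseline
    for i in range(tgt_len):                            # row = target token i
        visible_src = min(wait_k + i, src_len)
        mask[src_len + i, visible_src:src_len] = False  # hide unread source
    return mask


m = simultaneous_attention_mask(src_len=4, tgt_len=3, wait_k=2)
print(m.astype(int))
```

Because the policy is encoded in the mask rather than in the prompt, no extra prompt tokens or duplicated training examples are needed, which is consistent with the computational savings the abstract reports.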