MyArxiv
Computation and Language 68
☆ Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.
comment: Project page: https://yunzeman.github.io/lexicon3d , Github: https://github.com/YunzeMan/Lexicon3D
☆ WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild
The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical. To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis's utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.
☆ Attention Heads of Large Language Models: A Survey
Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in various tasks but remain largely as black-box systems. Consequently, their development relies heavily on data-driven approaches, limiting performance enhancement through changes in internal architecture and reasoning pathways. As a result, many researchers have begun exploring the potential internal mechanisms of LLMs, aiming to identify the essence of their reasoning bottlenecks, with most studies focusing on attention heads. Our survey aims to shed light on the internal reasoning processes of LLMs by concentrating on the interpretability and underlying mechanisms of attention heads. We first distill the human thought process into a four-stage framework: Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation. Using this framework, we systematically review existing research to identify and categorize the functions of specific attention heads. Furthermore, we summarize the experimental methodologies used to discover these special heads, dividing them into two categories: Modeling-Free methods and Modeling-Required methods. Also, we outline relevant evaluation methods and benchmarks. Finally, we discuss the limitations of current research and propose several potential future directions. Our reference list is open-sourced at \url{https://github.com/IAAR-Shanghai/Awesome-Attention-Heads}.
comment: 20 pages, 11 figures, 4 tables
☆ Planning In Natural Language Improves LLM Search For Code Generation
While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PLANSEARCH generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas.
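To make the search loop described above concrete, here is a hedged, schematic reconstruction (not the released PlanSearch code): `llm` stands in for any prompt-to-completion function, and the observation/plan prompts and subset sizes are illustrative assumptions.

```python
from itertools import combinations

def plan_search(problem: str, llm, n_obs: int = 5, subset_size: int = 2):
    # 1) elicit diverse natural-language observations about the problem
    obs_prompt = (f"List {n_obs} distinct, non-obvious observations about this "
                  f"competitive programming problem:\n{problem}")
    observations = [line.strip("-* ").strip()
                    for line in llm(obs_prompt).splitlines() if line.strip()]
    candidates = []
    # 2) combine subsets of observations into plans, 3) translate each plan into code
    for subset in combinations(observations[:n_obs], subset_size):
        plan = llm(f"Problem: {problem}\nObservations: {'; '.join(subset)}\n"
                   "Write a concise step-by-step plan for solving the problem.")
        code = llm(f"Problem: {problem}\nPlan:\n{plan}\n"
                   "Implement this plan as a self-contained Python function.")
        candidates.append(code)
    return candidates  # downstream, candidates are filtered or ranked, e.g., with public tests
```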
RAG based Question-Answering for Contextual Response Prediction System CIKM'24
Large Language Models (LLMs) have shown versatility in various Natural Language Processing (NLP) tasks, including their potential as effective question-answering systems. However, to provide precise and relevant information in response to specific customer queries in industry settings, LLMs require access to a comprehensive knowledge base to avoid hallucinations. Retrieval Augmented Generation (RAG) emerges as a promising technique to address this challenge. Yet, developing an accurate question-answering framework for real-world applications using RAG entails several challenges: 1) data availability issues, 2) evaluating the quality of generated content, and 3) the costly nature of human evaluation. In this paper, we introduce an end-to-end framework that employs LLMs with RAG capabilities for industry use cases. Given a customer query, the proposed system retrieves relevant knowledge documents and leverages them, along with previous chat history, to generate response suggestions for customer service agents in the contact centers of a major retail company. Through comprehensive automated and human evaluations, we show that this solution outperforms the current BERT-based algorithms in accuracy and relevance. Our findings suggest that RAG-based LLMs can be an excellent support to human customer service representatives by lightening their workload.
comment: Accepted at the 1st Workshop on GenAI and RAG Systems for Enterprise, CIKM'24. 6 pages
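As a rough illustration of the retrieve-then-generate setup described in this abstract, the sketch below uses hypothetical `retriever` and `llm` interfaces (not the paper's components); the production system additionally formats chat history and agent-facing response suggestions.

```python
def answer_customer_query(query: str, chat_history: str, retriever, llm, top_k: int = 3):
    # retrieve knowledge-base passages relevant to the customer's question
    docs = retriever.search(query, top_k=top_k)
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "You are assisting a customer service agent. Answer using only the context.\n"
        f"Context:\n{context}\n\n"
        f"Chat history:\n{chat_history}\n\n"
        f"Customer question: {query}\n"
        "Suggested agent response:"
    )
    return llm(prompt)
```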
☆ A Different Level Text Protection Mechanism With Differential Privacy
The article introduces a method, based on the BERT pre-trained model, for extracting words of differing degrees of importance, and demonstrates the effectiveness of this method. The article also discusses how applying the same perturbation to words of different importance affects overall text utility. The method can be applied to long-text protection.
☆ LAST: Language Model Aware Speech Tokenization
Speech tokenization serves as the foundation of speech language models (LMs), enabling them to perform various tasks such as spoken language modeling, text-to-speech, speech-to-text, etc. Most speech tokenizers are trained independently of the LM training process, relying on separate acoustic models and quantization methods. Following such an approach may create a mismatch between the tokenization process and its usage afterward. In this study, we propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs. We advocate for the integration of this objective into the process of learning discrete speech representations. Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs. We empirically investigate the impact of various model design choices, including speech vocabulary size and text LM size. Our results demonstrate that the proposed tokenization method outperforms the evaluated baselines on both spoken language modeling and speech-to-text. More importantly, unlike prior work, the proposed method allows the utilization of a single pre-trained LM for processing both speech and text inputs, setting it apart from conventional tokenization approaches.
☆ A Fused Large Language Model for Predicting Startup Success
Investors are continuously seeking profitable investment opportunities in startups and, hence, for effective decision-making, need to predict a startup's probability of success. Nowadays, investors can use not only various fundamental information about a startup (e.g., the age of the startup, the number of founders, and the business sector) but also textual description of a startup's innovation and business model, which is widely available through online venture capital (VC) platforms such as Crunchbase. To support the decision-making of investors, we develop a machine learning approach with the aim of locating successful startups on VC platforms. Specifically, we develop, train, and evaluate a tailored, fused large language model to predict startup success. Thereby, we assess to what extent self-descriptions on VC platforms are predictive of startup success. Using 20,172 online profiles from Crunchbase, we find that our fused large language model can predict startup success, with textual self-descriptions being responsible for a significant part of the predictive power. Our work provides a decision support tool for investors to find profitable investment opportunities.
☆ The representation landscape of few-shot learning and fine-tuning in large language models
In-context learning (ICL) and supervised fine-tuning (SFT) are two common strategies for improving the performance of modern large language models (LLMs) on specific tasks. Despite their different natures, these strategies often lead to comparable performance gains. However, little is known about whether they induce similar representations inside LLMs. We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. In the first half of the network, ICL shapes interpretable representations hierarchically organized according to their semantic content. In contrast, the probability landscape obtained with SFT is fuzzier and semantically mixed. In the second half of the model, the fine-tuned representations develop probability modes that better encode the identity of answers, while the landscape of ICL representations is characterized by less defined peaks. Our approach reveals the diverse computational strategies developed inside LLMs to solve the same task across different conditions, allowing us to make a step towards designing optimal methods to extract information from language models.
LLM-based multi-agent poetry generation in non-cooperative environments
Despite substantial progress of large language models (LLMs) for automatic poetry generation, the generated poetry lacks diversity while the training process differs greatly from human learning. Under the rationale that the learning process of the poetry generation systems should be more human-like and their output more diverse and novel, we introduce a framework based on social learning where we emphasize non-cooperative interactions besides cooperative interactions to encourage diversity. Our experiments are the first attempt at LLM-based multi-agent systems in non-cooperative environments for poetry generation employing both TRAINING-BASED agents (GPT-2) and PROMPTING-BASED agents (GPT-3 and GPT-4). Our evaluation based on 96k generated poems shows that our framework benefits the poetry generation process for TRAINING-BASED agents resulting in 1) a 3.0-3.7 percentage point (pp) increase in diversity and a 5.6-11.3 pp increase in novelty according to distinct and novel n-grams. The generated poetry from TRAINING-BASED agents also exhibits group divergence in terms of lexicons, styles and semantics. PROMPTING-BASED agents in our framework also benefit from non-cooperative environments and a more diverse ensemble of models with non-homogeneous agents has the potential to further enhance diversity, with an increase of 7.0-17.5 pp according to our experiments. However, PROMPTING-BASED agents show a decrease in lexical diversity over time and do not exhibit the group-based divergence intended in the social network. Our paper argues for a paradigm shift in creative tasks such as automatic poetry generation to include social learning processes (via LLM-based agent modeling) similar to human interaction.
comment: preprint
☆ On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM in the limit. DPORM's effectiveness directly implies the optimality of the learned policy, and also has practical implication for LLM alignment methods including iterative DPO. However, it is unclear how well DPORM empirically matches the performance of EXRM. This work studies the accuracy at distinguishing preferred and rejected answers for both DPORM and EXRM. Our findings indicate that even though DPORM fits the training dataset comparably, it generalizes less effectively than EXRM, especially when the validation datasets contain distribution shifts. Across five out-of-distribution settings, DPORM has a mean drop in accuracy of 3% and a maximum drop of 7%. These findings highlight that DPORM has limited generalization ability and substantiates the integration of an explicit reward model in iterative DPO approaches.
comment: 12 pages, 8 tables, 2 figures
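For reference, the implicit reward induced by DPO (the DPORM discussed above) has the standard closed form from the DPO derivation, where $\pi_\theta$ is the fine-tuned policy, $\pi_{\mathrm{ref}}$ the reference model, and $Z(x)$ a prompt-dependent constant that cancels in pairwise comparisons:

```latex
r_{\mathrm{DPO}}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```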
☆ CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and are highly sensitive to the distribution of training data, thereby causing unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring evaluation objectivity by designing an image-level rather than LaTeX-level metric. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. The results demonstrate that CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations.
comment: Project Website: https://github.com/opendatalab/UniMERNet/tree/main/cdm
☆ Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers
In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the importance of the latter role might be overestimated. To show that, we start by manipulating the representations of previous tokens, e.g., by replacing the hidden states at some layer k with random vectors. Our experiments with four LLMs and four tasks show that this operation often leads to a small to negligible drop in performance. Importantly, this happens if the manipulation occurs in the top part of the model, that is, when k is in the final 30-50% of the layers. In contrast, doing the same manipulation in earlier layers might lead to chance-level performance. We continue by switching the hidden state of certain tokens with hidden states of other tokens from another prompt, e.g., replacing the word "Italy" with "France" in "What is the capital of Italy?". We find that when applying this switch in the top 1/3 of the model, the model ignores it (answering "Rome"). However, if we apply it earlier, the model conforms to the switch ("Paris"). Our results hint at a two-stage process in transformer-based LLMs: the first part gathers input from previous tokens, while the second mainly processes that information internally.
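A minimal sketch of this kind of intervention, assuming a GPT-2-style HuggingFace model (the layer attribute path, the choice of "gpt2" as a placeholder, and the scaling of the random vectors are assumptions, not the authors' code): a forward hook on block k replaces the hidden states of all previous tokens with random vectors during the prompt's forward pass.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; other decoder-only models expose their layers differently
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

K = 8  # manipulate representations leaving layer K

def randomize_previous_tokens(module, inputs, output):
    hidden = output[0].clone()                      # (batch, seq_len, dim)
    if hidden.shape[1] > 1:                         # only acts on the prompt pass, not cached decoding steps
        scale = hidden[:, :-1, :].std()
        hidden[:, :-1, :] = torch.randn_like(hidden[:, :-1, :]) * scale
    return (hidden,) + tuple(output[1:])

handle = model.transformer.h[K].register_forward_hook(randomize_previous_tokens)
ids = tok("What is the capital of Italy?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=5, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```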
☆ 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances KDD
Predicting the performance of LLMs on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, a possibility is to evaluate the considered LLM on a set of task instances and train an assessor to predict its performance based on features of the instances. However, this approach requires evaluating each new LLM on a sufficiently large set of task instances to train an assessor specific to it. In this work, we leverage the evaluation results of previously tested LLMs to reduce the number of evaluations required to predict the performance of a new LLM. In practice, we propose to test the new LLM on a small set of reference instances and train a generic assessor which predicts the performance of the LLM on an instance based on the performance of the former on the reference set and features of the instance of interest. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models up to the January 2024 version of GPT-4. When predicting performance on instances with the same distribution as those used to train the generic assessor, we find this achieves performance comparable to the LLM-specific assessors trained on the full set of instances. Additionally, we find that randomly selecting the reference instances performs as well as some advanced selection methods we tested. For out-of-distribution instances, however, no clear winner emerges and the overall performance is worse, suggesting that the inherent predictability of LLMs is low.
comment: Presented at the 2024 KDD workshop on Evaluation and Trustworthiness of Generative AI Models
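The "generic assessor" idea above can be pictured with a toy sketch: train one classifier on pairs of (instance features, reference-set profile of the evaluated LLM) from previously tested models, then score a new LLM after evaluating it only on the reference instances. Everything below (feature dimensions, random toy data, the logistic regression choice) is an illustrative assumption, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_llms, n_ref, n_inst, n_feat = 20, 100, 500, 32          # toy sizes; ~100 reference instances as in the title

X_inst = rng.normal(size=(n_inst, n_feat))                 # features of every task instance
ref_profiles = rng.integers(0, 2, size=(n_llms, n_ref))    # each past LLM's 0/1 results on the reference set
labels = rng.integers(0, 2, size=(n_llms, n_inst))         # each past LLM's success on every instance (toy data)

# a training row pairs an instance's features with the evaluated LLM's reference profile
X_train = np.vstack([np.hstack([X_inst, np.tile(p, (n_inst, 1))]) for p in ref_profiles])
y_train = labels.reshape(-1)
assessor = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# for a new LLM: evaluate it only on the reference set, then predict success on all other instances
new_profile = rng.integers(0, 2, size=n_ref)
pred = assessor.predict_proba(np.hstack([X_inst, np.tile(new_profile, (n_inst, 1))]))[:, 1]
print(pred[:5])
```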
☆ From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents
Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integrated into this learning format, resulting in a variety of educational AI applications such as educational recommendation and intelligent tutoring. The emergence of intelligence in large language models (LLMs) has allowed for these educational enhancements to be built upon a unified foundational model, enabling deeper integration. In this context, we propose MAIC (Massive AI-empowered Course), a new form of online education that leverages LLM-driven multi-agent systems to construct an AI-augmented classroom, balancing scalability with adaptivity. Beyond exploring the conceptual framework and technical innovations, we conduct preliminary experiments at Tsinghua University, one of China's leading universities. Drawing from over 100,000 learning records of more than 500 students, we obtain a series of valuable observations and initial analyses. This project will continue to evolve, ultimately aiming to establish a comprehensive open platform that supports and unifies research, technology, and applications in exploring the possibilities of online education in the era of large model AI. We envision this platform as a collaborative hub, bringing together educators, researchers, and innovators to collectively explore the future of AI-driven online education.
☆ How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes
Decoder-only LLMs have shown impressive performance in MT due to their ability to learn from extensive datasets and generate high-quality translations. However, LLMs often struggle with the nuances and style required for organisation-specific translation. In this study, we explore the effectiveness of fine-tuning Large Language Models (LLMs), particularly Llama 3 8B Instruct, leveraging translation memories (TMs), as a valuable resource to enhance accuracy and efficiency. We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector. Our experiments cover five translation directions across languages of varying resource levels (English to Brazilian Portuguese, Czech, German, Finnish, and Korean). We analyse diverse sizes of training datasets (1k to 207k segments) to evaluate their influence on translation quality. We fine-tune separate models for each training set and evaluate their performance based on automatic metrics, BLEU, chrF++, TER, and COMET. Our findings reveal improvement in translation performance with larger datasets across all metrics. On average, BLEU and COMET scores increase by 13 and 25 points, respectively, on the largest training set against the baseline model. Notably, there is a performance deterioration in comparison with the baseline model when fine-tuning on only 1k and 2k examples; however, we observe a substantial improvement as the training dataset size increases. The study highlights the potential of integrating TMs with LLMs to create bespoke translation models tailored to the specific needs of businesses, thus enhancing translation quality and reducing turn-around times. This approach offers a valuable insight for organisations seeking to leverage TMs and LLMs for optimal translation outcomes, especially in narrower domains.
☆ Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities
The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, leading to improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring whether the results hold also for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and show that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.
☆ Rx Strategist: Prescription Verification using LLM Agents System
The complexity of modern pharmaceuticals demands strict prescription verification to protect patient safety. We offer a new approach - Rx Strategist - that makes use of knowledge graphs and different search strategies to enhance the power of Large Language Models (LLMs) inside an agentic framework. This multifaceted technique allows for a multi-stage LLM pipeline and reliable information retrieval from a custom-built active ingredient database. Different facets of prescription verification, such as indication, dose, and possible drug interactions, are covered in each stage of the pipeline. We alleviate the drawbacks of monolithic LLM techniques by spreading reasoning over these stages, improving correctness and reliability while reducing memory demands. Our findings demonstrate that Rx Strategist surpasses many current LLMs, achieving performance comparable to that of a highly experienced clinical pharmacist. In the complicated world of modern medications, this combination of LLMs with organized knowledge and sophisticated search methods presents a viable avenue for reducing prescription errors and enhancing patient outcomes.
comment: 17 Pages, 6 Figures, Under Review
☆ CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks
Cognitive psychology investigates perception, attention, memory, language, problem-solving, decision-making, and reasoning. Kahneman's dual-system theory elucidates the human decision-making process, distinguishing between the rapid, intuitive System 1 and the deliberative, rational System 2. Recent advancements have positioned large language models (LLMs) as formidable tools nearing human-level proficiency in various cognitive tasks. Nonetheless, the presence of a dual-system framework analogous to human cognition in LLMs remains unexplored. This study introduces the \textbf{CogniDual Framework for LLMs} (CFLLMs), designed to assess whether LLMs can, through self-training, evolve from deliberate deduction to intuitive responses, thereby emulating the human process of acquiring and mastering new information. Our findings reveal the cognitive mechanisms behind LLMs' response generation, enhancing our understanding of their capabilities in cognitive psychology. Practically, self-trained models can provide faster responses to certain queries, reducing computational demands during inference.
☆ Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time
Based on official estimates, 50 million people worldwide are affected by dementia, and this number increases by 10 million new patients every year. Without a cure, clinical prognostication and early intervention represent the most effective ways to delay its progression. To this end, Artificial Intelligence and computational linguistics can be exploited for natural language analysis, personalized assessment, monitoring, and treatment. However, traditional approaches lack sufficient semantic knowledge management and explainability capabilities. Moreover, work on using Large Language Models (LLMs) for cognitive decline diagnosis is still scarce, even though these models represent the most advanced way for clinical-patient communication using intelligent systems. Consequently, we leverage an LLM using the latest Natural Language Processing (NLP) techniques in a chatbot solution to provide interpretable Machine Learning predictions of cognitive decline in real time. Linguistic-conceptual features are exploited for appropriate natural language analysis. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. In more detail, the proposed pipeline is composed of (i) data extraction employing NLP-based prompt engineering; (ii) stream-based data processing including feature engineering, analysis, and selection; (iii) real-time classification; and (iv) an explainability dashboard that provides visual and natural language descriptions of the prediction outcome. Classification results exceed 80% on all evaluation metrics, with a recall of about 85% for the mental deterioration class. To sum up, with this work we contribute an affordable, flexible, non-invasive, personalized diagnostic system.
☆ Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding
The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member and non-member contexts. While previous work suggested that member contexts provide little information due to the minor distributional shift they induce, our analysis reveals that these subtle shifts can be effectively leveraged when contrasted with non-member contexts. In this paper, we propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts through contrastive decoding, amplifying subtle differences to enhance membership inference. Extensive empirical evaluations demonstrate that Con-ReCall achieves state-of-the-art performance on the WikiMIA benchmark and is robust against various text manipulation techniques.
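A hedged illustration of the contrastive idea (the exact scoring function, weighting, and prefixes below are assumptions, not the authors' released implementation): compare how much prepending member versus non-member context shifts the target text's average log-likelihood under the model being audited.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # placeholder target model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_loglik(text: str, prefix: str = "") -> float:
    """Average per-token log-likelihood of `text`, optionally conditioned on `prefix`."""
    ids = tok(prefix + text, return_tensors="pt").input_ids
    start = tok(prefix, return_tensors="pt").input_ids.shape[1] if prefix else 0
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, max(start - 1, 0):].mean().item()    # score only the target span

def con_recall_score(target: str, member_prefix: str, nonmember_prefix: str, gamma: float = 1.0) -> float:
    base = avg_loglik(target)
    shift_non = avg_loglik(target, nonmember_prefix) - base
    shift_mem = avg_loglik(target, member_prefix) - base
    # intuition: training members are perturbed asymmetrically by member vs non-member context
    return shift_non - gamma * shift_mem
```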
☆ Sketch: A Toolkit for Streamlining LLM Operations
Large language models (LLMs) represented by GPT family have achieved remarkable success. The characteristics of LLMs lie in their ability to accommodate a wide range of tasks through a generative approach. However, the flexibility of their output format poses challenges in controlling and harnessing the model's outputs, thereby constraining the application of LLMs in various domains. In this work, we present Sketch, an innovative toolkit designed to streamline LLM operations across diverse fields. Sketch comprises the following components: (1) a suite of task description schemas and prompt templates encompassing various NLP tasks; (2) a user-friendly, interactive process for building structured output LLM services tailored to various NLP tasks; (3) an open-source dataset for output format control, along with tools for dataset construction; and (4) an open-source model based on LLaMA3-8B-Instruct that adeptly comprehends and adheres to output formatting instructions. We anticipate this initiative to bring considerable convenience to LLM users, achieving the goal of ''plug-and-play'' for various applications. The components of Sketch will be progressively open-sourced at https://github.com/cofe-ai/Sketch.
☆ Normal forms in Virus Machines
In the present work, we further study the computational power of virus machines (VMs in short). VMs provide a computing paradigm inspired by the transmission and replication networks of viruses. VMs consist of process units (called hosts) structured by a directed graph whose arcs are called channels and an instruction graph that controls the transmissions of virus objects among hosts. The present work complements our understanding of the computing power of VMs by introducing normal forms; these expressions restrict the features in a given computing model. Some of the features that we restrict in our normal forms include (a) the number of hosts, (b) the number of instructions, and (c) the number of virus objects in each host. After we recall some known results on the computing power of VMs, we give our normal forms, such as the size of the loops in the network, proving new characterisations of families of sets, such as the finite sets, semilinear sets, or NRE.
☆ N-gram Prediction and Word Difference Representations for Language Modeling
Causal language modeling (CLM) serves as the foundational framework underpinning remarkable successes of recent large language models (LLMs). Despite its success, the training approach for next word prediction poses a potential risk of causing the model to overly focus on local dependencies within a sentence. While prior studies have been introduced to predict future N words simultaneously, they were primarily applied to tasks such as masked language modeling (MLM) and neural machine translation (NMT). In this study, we introduce a simple N-gram prediction framework for the CLM task. Moreover, we introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training on the basis of N-gram prediction framework. To further enhance the quality of next word prediction, we propose an ensemble method that incorporates the future N words' prediction results. Empirical evaluations across multiple benchmark datasets encompassing CLM and NMT tasks demonstrate the significant advantages of our proposed methods over the conventional CLM.
LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts EMNLP
With the emergence of widely available, powerful large language models (LLMs), LLM-generated disinformation has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world is still to be proven. In this paper, we focus on an important setting in information operations -- short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increases, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates that domain-specific benchmarking is needed, while the latter suggests a trade-off between adversarial evasion resilience and overfitting to the reference human text, with both needing evaluation in benchmarks and currently absent from them. We believe this calls for a reconsideration of current LLM detector benchmarking approaches, and we provide a dynamically extensible benchmark to enable it (https://github.com/Reliable-Information-Lab-HEVS/dynamic_llm_detector_benchmark).
comment: 20 pages, 7 tables, 13 figures, under consideration for EMNLP
☆ iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models
Most available data is unstructured, making it challenging to access valuable information. Automatically building Knowledge Graphs (KGs) is crucial for structuring data and making it accessible, allowing users to search for information effectively. KGs also facilitate insights, inference, and reasoning. Traditional NLP methods, such as named entity recognition and relation extraction, are key in information retrieval but face limitations, including the use of predefined entity types and the need for supervised learning. Current research leverages large language models' capabilities, such as zero- or few-shot learning. However, unresolved and semantically duplicated entities and relations still pose challenges, leading to inconsistent graphs and requiring extensive post-processing. Additionally, most approaches are topic-dependent. In this paper, we propose iText2KG, a method for incremental, topic-independent KG construction without post-processing. This plug-and-play, zero-shot method is applicable across a wide range of KG construction scenarios and comprises four modules: Document Distiller, Incremental Entity Extractor, Incremental Relation Extractor, and Graph Integrator and Visualization. Our method demonstrates superior performance compared to baseline methods across three scenarios: converting scientific papers to graphs, websites to graphs, and CVs to graphs.
comment: Accepted at The International Web Information Systems Engineering conference (the WISE conference) 2024
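To make the four-module pipeline above concrete, here is a hypothetical skeleton; the module names follow the abstract, but all internals are placeholder stand-ins for the LLM calls the paper describes, and this is not the released iText2KG code.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)   # (head, label, tail) triples

class DocumentDistiller:
    def distill(self, document: str) -> str:
        # in the paper this is an LLM call that rewrites the document into semantic blocks;
        # here we simply pass the text through
        return document

class IncrementalEntityExtractor:
    def extract(self, text: str, known_entities: set) -> set:
        # LLM-based extraction that resolves new mentions against already-known entities
        # to avoid semantic duplicates (placeholder heuristic: title-cased tokens)
        return {tok for tok in text.split() if tok.istitle()} - known_entities

class IncrementalRelationExtractor:
    def extract(self, text: str, entities: set) -> set:
        # LLM-based relation extraction restricted to the resolved entity set (placeholder)
        ents = sorted(entities)
        return {(a, "related_to", b) for a, b in zip(ents, ents[1:])}

def build_graph(documents):
    kg = KnowledgeGraph()
    distiller, e_ex, r_ex = DocumentDistiller(), IncrementalEntityExtractor(), IncrementalRelationExtractor()
    for doc in documents:
        blocks = distiller.distill(doc)
        kg.entities |= e_ex.extract(blocks, kg.entities)
        kg.relations |= r_ex.extract(blocks, kg.entities)   # Graph Integrator step: merge incrementally
    return kg

print(build_graph(["Alice founded Acme Corp in Paris."]).relations)
```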
☆ ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding
Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, the application of alignment training within the chart domain is still underexplored. To address this, we propose ChartMoE, which employs the mixture of expert (MoE) architecture to replace the traditional linear projector to bridge the modality gap. Specifically, we train multiple linear connectors through distinct alignment tasks, which are utilized as the foundational initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with over 900K chart-table-JSON-code quadruples to conduct three alignment tasks (chart-table/JSON/code). Combined with the vanilla connector, we initialize different experts in four distinct ways and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy, e.g., ChartMoE improves the accuracy of the previous state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.
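A minimal PyTorch sketch of what replacing a single linear vision-language projector with a mixture-of-experts connector could look like; the dimensions, top-k routing, and four-expert setup are assumptions based on the abstract, not the released ChartMoE code. The experts would be initialized from connectors pre-aligned on the chart-table, JSON, and code tasks plus the vanilla connector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEConnector(nn.Module):
    def __init__(self, pretrained_linears, vis_dim, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(pretrained_linears)    # each: nn.Linear(vis_dim, llm_dim)
        self.router = nn.Linear(vis_dim, len(pretrained_linears))
        self.top_k = top_k

    def forward(self, vis_tokens):                           # (batch, n_tokens, vis_dim)
        logits = self.router(vis_tokens)                     # (batch, n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros(*vis_tokens.shape[:2], self.experts[0].out_features)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1).float()   # tokens routed to expert e
                out = out + mask * weights[..., k:k + 1] * expert(vis_tokens)
        return out

# experts initialized from separately aligned linear connectors (toy random weights here)
experts = [nn.Linear(1024, 4096) for _ in range(4)]
connector = MoEConnector(experts, vis_dim=1024)
print(connector(torch.randn(2, 16, 1024)).shape)             # torch.Size([2, 16, 4096])
```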
☆ Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation
The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs). However, despite their widespread adoption and success, CoT methods often exhibit instability due to their inability to consistently ensure the quality of generated reasoning paths, leading to sub-optimal reasoning performance. To address this challenge, we propose the \textbf{Strategic Chain-of-Thought} (SCoT), a novel methodology designed to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps. SCoT employs a two-stage approach within a single prompt: first eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers. Our experiments across eight challenging reasoning datasets demonstrate significant improvements, including a 21.05\% increase on the GSM8K dataset and 24.13\% on the Tracking\_Objects dataset, respectively, using the Llama3-8b model. Additionally, we extend the SCoT framework to develop a few-shot method with automatically matched demonstrations, yielding even stronger results. These findings underscore the efficacy of SCoT, highlighting its potential to substantially enhance LLM performance in complex reasoning tasks.
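The two-stage, single-prompt structure described above can be sketched as a simple template; the wording below is an illustrative assumption, not the paper's exact prompt.

```python
SCOT_TEMPLATE = """You are an expert problem solver.
Step 1 - Strategy: Before solving, state the most effective general strategy
(method, theorem, or algorithm) for the problem below. Do not solve it yet.
Step 2 - Solution: Apply that strategy step by step and finish with
"Final answer: <answer>".

Problem: {problem}"""

def strategic_cot(problem: str, llm_call) -> str:
    """llm_call is any function mapping a prompt string to a completion string."""
    return llm_call(SCOT_TEMPLATE.format(problem=problem))

if __name__ == "__main__":
    fake_llm = lambda prompt: "Strategy: work backwards from the target...\nFinal answer: 42"
    print(strategic_cot("A train leaves the station at ...", fake_llm))
```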
GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding
Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as ''positional biases''. To address this, we propose GraphInsight, a novel framework aimed at improving LLMs' comprehension of both macro- and micro-level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval-augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi-step reasoning. Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.
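A hedged sketch of the first strategy above (placing critical information where the model's recall is strongest); the importance scores and head/tail budgets are assumptions for illustration, not the paper's exact serialization.

```python
def serialize_graph(edges, importance, head_budget=5, tail_budget=5):
    """edges: list of (u, v) pairs; importance: dict mapping an edge to a salience score."""
    ranked = sorted(edges, key=lambda e: importance[e], reverse=True)
    head = ranked[:head_budget]
    tail = ranked[head_budget:head_budget + tail_budget]
    middle = ranked[head_budget + tail_budget:]
    # critical edges go to the start and end of the prompt, where recall tends to be strongest
    ordered = head + middle + tail[::-1]
    return "\n".join(f"{u} -> {v}" for u, v in ordered)

edges = [("A", "B"), ("B", "C"), ("C", "D")]
print(serialize_graph(edges, {e: i for i, e in enumerate(edges)}))
```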
☆ Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard
This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which have relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.
☆ E2CL: Exploration-based Error Correction Learning for Embodied Agents
Language models are exhibiting increasing capability in knowledge utilization and reasoning. However, when applied as agents in embodied environments, they often suffer from misalignment between their intrinsic knowledge and environmental knowledge, leading to infeasible actions. Traditional environment alignment methods, such as supervised learning on expert trajectories and reinforcement learning, face limitations in covering environmental knowledge and achieving efficient convergence, respectively. Inspired by human learning, we propose Exploration-based Error Correction Learning (E2CL), a novel framework that leverages exploration-induced errors and environmental feedback to enhance environment alignment for LM-based agents. E2CL incorporates teacher-guided and teacher-free exploration to gather environmental feedback and correct erroneous actions. The agent learns to provide feedback and self-correct, thereby enhancing its adaptability to target environments. Evaluations in the Virtualhome environment demonstrate that E2CL-trained agents outperform those trained by baseline methods and exhibit superior self-correction capabilities.
☆ Preserving Empirical Probabilities in BERT for Small-sample Clinical Entity Recognition
Named Entity Recognition (NER) encounters the challenge of unbalanced labels, where certain entity types are overrepresented while others are underrepresented in real-world datasets. This imbalance can lead to biased models that perform poorly on minority entity classes, impeding accurate and equitable entity recognition. This paper explores the effects of unbalanced entity labels on the BERT-based pre-trained model. We analyze the different mechanisms of loss calculation and loss propagation for the task of token classification on randomized datasets. Then we propose ways to improve token classification for the highly imbalanced task of clinical entity recognition.
comment: 8 pages, 8 figures
☆ Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration
Black-box large language models (LLMs) are increasingly deployed in various environments, making it essential for these models to effectively convey their confidence and uncertainty, especially in high-stakes settings. However, these models often exhibit overconfidence, leading to potential risks and misjudgments. Existing techniques for eliciting and calibrating LLM confidence have primarily focused on general reasoning datasets, yielding only modest improvements. Accurate calibration is crucial for informed decision-making and preventing adverse outcomes but remains challenging due to the complexity and variability of tasks these models perform. In this work, we investigate the miscalibration behavior of black-box LLMs within the healthcare setting. We propose a novel method, \textit{Atypical Presentations Recalibration}, which leverages atypical presentations to adjust the model's confidence estimates. Our approach significantly improves calibration, reducing calibration errors by approximately 60\% on three medical question answering datasets and outperforming existing methods such as vanilla verbalized confidence, CoT verbalized confidence and others. Additionally, we provide an in-depth analysis of the role of atypicality within the recalibration framework.
☆ xLAM: A Family of Large Action Models to Empower AI Agent Systems
Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents' generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4
comment: Technical report for the Salesforce xLAM model series
☆ An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification
Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with the language model. In the context of SC, strongly emotional tokens can critically affect the sentiment of the whole sequence. Therefore, rather than rephrasing less important context, we propose DiffusionCLS, which leverages a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
☆ Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
Neural network (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack, which triggers a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we devise a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly-chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal prediction loss in DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
comment: 13 pages, 5 figures
☆ MARAGS: A Multi-Adapter System for Multi-Task Retrieval Augmented Generation Question Answering KDD
In this paper, we present a multi-adapter retrieval augmented generation system (MARAGS) for Meta's Comprehensive RAG (CRAG) competition for KDD CUP 2024. CRAG is a question answering dataset that contains 3 different subtasks aimed at realistic RAG-related question answering, with a diverse set of question topics, question types, time-dynamic answers, and questions featuring entities of varying popularity. Our system follows a standard setup for web-based RAG, which uses processed web pages to provide context for an LLM to produce generations, while also querying API endpoints for additional information. MARAGS also utilizes multiple different adapters to solve the various requirements for these tasks with a standard cross-encoder model for ranking candidate passages relevant for answering the question. Our system achieved 2nd place for Task 1 as well as 3rd place on Task 2.
comment: Accepted to CRAG KDD Cup 24 Workshop
☆ Continual Skill and Task Learning via Dialogue
Continual and interactive robot learning is a challenging problem because the robot is deployed with human users who expect it to perpetually learn novel skills to solve novel tasks with sample efficiency. In this work, we present a framework for robots to query and learn visuo-motor robot skills and task-relevant information via natural language dialog interactions with human users. Previous approaches either focus on improving the performance of instruction-following agents, or passively learn novel skills or concepts. Instead, we use dialog combined with a language-skill grounding embedding to query or confirm skills and/or tasks requested by a user. To achieve this goal, we developed and integrated three different components for our agent. Firstly, we propose a novel visual-motor control policy, ACT with Low Rank Adaptation (ACT-LoRA), which enables the existing SoTA ACT model to perform few-shot continual learning. Secondly, we develop an alignment model that projects demonstrations across skill embodiments into a shared embedding, allowing us to know when to ask questions and/or request demonstrations from users. Finally, we integrated an existing LLM to interact with a human user to perform grounded interactive continual skill learning to solve a task. Our ACT-LoRA model learns novel fine-tuned skills with 100% accuracy when trained with only five demonstrations for a novel skill, while still maintaining 74.75% accuracy on pre-trained skills in the RLBench dataset, where other models fall significantly short. We also performed a human-subjects study with 8 subjects to demonstrate the continual learning capabilities of our combined framework. We achieve a success rate of 75% in the task of sandwich making with the real robot learning from participant data, demonstrating that robots can learn novel skills or task knowledge from dialogue with non-expert users using our approach.
☆ MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models
A college-level benchmark dataset for large language models (LLMs) in the materials science field, MaterialBENCH, is constructed. This dataset consists of problem-answer pairs, based on university textbooks. There are two types of problems: one is the free-response answer type, and the other is the multiple-choice type. Multiple-choice problems are constructed by adding three incorrect answers as choices to a correct answer, so that LLMs can choose one of the four as a response. Most of the problems for free-response answer and multiple-choice types overlap except for the format of the answers. We also conduct experiments using the MaterialBENCH on LLMs, including ChatGPT-3.5, ChatGPT-4, Bard (at the time of the experiments), and GPT-3.5 and GPT-4 with the OpenAI API. The differences and similarities in the performance of LLMs measured by the MaterialBENCH are analyzed and discussed. Performance differences between the free-response type and multiple-choice type in the same models and the influence of using system messages on multiple-choice problems are also studied. We anticipate that MaterialBENCH will encourage further development of LLMs' reasoning abilities to solve more complicated problems and eventually contribute to materials research and discovery.
☆ Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models
Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However, existing methods face two significant challenges: \textit{excessively long reasoning paths distracting from the answer generation}, and \textit{false-positive relations hindering the path refinement}. In this paper, we propose an iterative interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG). Specifically, DoG employs a subgraph-focusing mechanism, allowing LLMs to perform answer trying after each reasoning step, thereby mitigating the impact of lengthy reasoning paths. On the other hand, DoG utilizes a multi-role debate team to gradually simplify complex questions, reducing the influence of false-positive relations. This debate mechanism ensures the reliability of the reasoning process. Experimental results on five public datasets demonstrate the effectiveness and superiority of our architecture. Notably, DoG outperforms the state-of-the-art method ToG by 23.7\% and 9.1\% in accuracy on WebQuestions and GrailQA, respectively. Furthermore, the integration experiments with various LLMs on the mentioned datasets highlight the flexibility of DoG. Code is available at \url{https://github.com/reml-group/DoG}.
comment: 12 pages
☆ GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation
Online sellers and advertisers are recommended keyphrases for their listed products, which they bid on to enhance their sales. One popular paradigm that generates such recommendations is Extreme Multi-Label Classification (XMC), which involves tagging/mapping keyphrases to items. We outline the limitations of using traditional item-query based tagging or mapping techniques for keyphrase recommendations on E-Commerce platforms. We introduce GraphEx, an innovative graph-based approach that recommends keyphrases to sellers using extraction of token permutations from item titles. Additionally, we demonstrate that relying on traditional metrics such as precision/recall can be misleading in practical applications, thereby necessitating a combination of metrics to evaluate performance in real-world scenarios. These metrics are designed to assess the relevance of keyphrases to items and the potential for buyer outreach. GraphEx outperforms production models at eBay, achieving the objectives mentioned above. It supports near real-time inferencing in resource-constrained production environments and scales effectively for billions of items.
♻ ☆ PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity
One of the components of natural language processing that has received much investigation recently is semantic textual similarity. In computational linguistics and natural language processing, assessing the semantic similarity of words, phrases, paragraphs, and texts is crucial. Calculating the degree of semantic resemblance between two textual pieces, paragraphs, or phrases, in both monolingual and cross-lingual settings, is known as semantic similarity. Cross-lingual semantic similarity requires corpora in which sentence pairs in the source and target languages are annotated with a degree of semantic similarity. Many existing cross-lingual semantic similarity models rely on machine translation because cross-lingual semantic similarity datasets are unavailable, and the propagation of machine translation errors reduces the accuracy of these models. On the other hand, when semantic similarity features are to be used for machine translation, the same machine translations should not also be used to compute the similarity. For Persian, a low-resource language, no effort has been made in this regard, and the need for a model that can understand the context of both languages is felt more than ever. In this article, a corpus of semantic textual similarity between Persian and English sentences is produced for the first time with the help of linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). This corpus contains 5375 sentence pairs. Also, different transformer-based models have been fine-tuned using this dataset. The results show that, using the PESTS dataset, the Pearson correlation of the XLM ROBERTa model increases from 85.87% to 95.62%.
♻ ☆ Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation
In this paper, we introduce Kun, a novel approach for creating high-quality instruction-tuning datasets for large language models (LLMs) without relying on manual annotations. Adapting a self-training algorithm based on instruction back-translation and answer polishment, Kun leverages unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a substantial dataset of over a million Chinese instructional data points. This approach significantly deviates from traditional methods by using a self-curation process to refine and select the most effective instruction-output pairs. Our experiments with the 6B-parameter Yi model across various benchmarks demonstrate Kun's robustness and scalability. Our method's core contributions lie in its algorithmic advancement, which enhances data retention and clarity, and its innovative data generation approach that substantially reduces the reliance on costly and time-consuming manual annotations. This methodology presents a scalable and efficient solution for improving the instruction-following capabilities of LLMs, with significant implications for their application across diverse fields. The code and dataset can be found at https://github.com/Zheng0428/COIG-Kun
comment: 12 pages, 12 figures
♻ ☆ Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o and Mistral NeMo's tokenizers are much more multilingual than their predecessors, training on 39% and 47% non-English language data, respectively; Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
comment: new robustness experiments; new baselines; include Mistral, Mistral-Nemo and GPT-NeoX; link to code
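To make the linear-program idea concrete, here is a heavily simplified sketch under stated assumptions: merges is the tokenizer's ordered merge list, step_pair_freqs is a hypothetical precomputation of pair frequencies per category after each merge step, and each constraint encodes that the chosen pair was at least as frequent under the unknown mixture as every competing pair. The paper's actual formulation is more elaborate.

```python
# Simplified data mixture inference sketch (illustrative assumptions, not the paper's code).
import numpy as np
from scipy.optimize import linprog

def infer_mixture(merges, step_pair_freqs, categories):
    # merges[t]: the pair merged at step t by the target tokenizer.
    # step_pair_freqs[t][c][pair]: frequency of `pair` in category c's sample data
    # after applying the first t merges (hypothetical precomputation).
    n = len(categories)
    A_ub, b_ub = [], []
    for t, chosen in enumerate(merges):
        freqs = step_pair_freqs[t]
        competitors = set().union(*(freqs[c].keys() for c in categories))
        for other in competitors - {chosen}:
            # Constraint: mixture-weighted freq(other) - freq(chosen) <= 0.
            A_ub.append([freqs[c].get(other, 0) - freqs[c].get(chosen, 0)
                         for c in categories])
            b_ub.append(0.0)
    res = linprog(c=np.zeros(n),                          # feasibility problem, dummy objective
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.ones((1, n)), b_eq=np.array([1.0]),  # mixture weights sum to one
                  bounds=[(0, 1)] * n)
    return dict(zip(categories, res.x)) if res.success else None
```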
♻ ☆ Cost-Efficient Subjective Task Annotation and Modeling through Few-Shot Annotator Adaptation
In subjective NLP tasks, where a single ground truth does not exist, the inclusion of diverse annotators becomes crucial as their unique perspectives significantly influence the annotations. In realistic scenarios, the annotation budget often becomes the main determinant of the number of perspectives (i.e., annotators) included in the data and subsequent modeling. We introduce a novel framework for annotation collection and modeling in subjective tasks that aims to minimize the annotation budget while maximizing the predictive performance for each annotator. Our framework has a two-stage design: first, we rely on a small set of annotators to build a multitask model, and second, we augment the model for a new perspective by strategically annotating a few samples per annotator. To test our framework at scale, we introduce and release a unique dataset, Moral Foundations Subjective Corpus, of 2000 Reddit posts annotated by 24 annotators for moral sentiment. We demonstrate that our framework surpasses the previous SOTA in capturing the annotators' individual perspectives with as little as 25% of the original annotation budget on two datasets. Furthermore, our framework results in more equitable models, reducing the performance disparity among annotators.
♻ ☆ Exploring Group and Symmetry Principles in Large Language Models
Large Language Models (LLMs) have demonstrated impressive performance across a wide range of applications; however, assessing their reasoning capabilities remains a significant challenge. In this paper, we introduce a framework grounded in group and symmetry principles, which have played a crucial role in fields such as physics and mathematics, and offer another way to evaluate their capabilities. While the proposed framework is general, to showcase the benefits of employing these properties, we focus on arithmetic reasoning and investigate the performance of these models on four group properties: closure, identity, inverse, and associativity. Our findings reveal that the LLMs studied in this work struggle to preserve group properties across different test regimes. In the closure test, we observe biases towards specific outputs and an abrupt degradation in their performance from 100% to 0% after a specific sequence length. They also perform poorly in the identity test, which corresponds to adding irrelevant information to the context, and show sensitivity when subjected to the inverse test, which examines the robustness of the model with respect to negation. In addition, we demonstrate that breaking down problems into smaller steps helps LLMs in the associativity test that we have conducted. To support these tests, we have developed a synthetic dataset, which will be released.
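For illustration only (this is not the paper's released dataset), the four arithmetic probes could be instantiated on integer addition roughly as follows:

```python
# Illustrative generators for the four group-property probes (assumed formats).
import random

def closure_case(n_terms):
    nums = [random.randint(1, 99) for _ in range(n_terms)]
    return f"Compute: {' + '.join(map(str, nums))} =", sum(nums)

def identity_case(n_terms, n_irrelevant=3):
    prompt, answer = closure_case(n_terms)
    filler = " ".join("Note: the weather was sunny." for _ in range(n_irrelevant))
    return f"{filler} {prompt}", answer           # irrelevant context, same answer

def inverse_case(n_terms):
    nums = [random.randint(1, 99) for _ in range(n_terms)]
    terms = nums + [-x for x in nums]             # each term paired with its negation
    random.shuffle(terms)
    expr = " + ".join(f"({x})" for x in terms)
    return f"Compute: {expr} =", 0                # inverses cancel to the identity

def associativity_case(n_terms, group_size=3):
    nums = [random.randint(1, 99) for _ in range(n_terms)]
    groups = [nums[i:i + group_size] for i in range(0, len(nums), group_size)]
    expr = " + ".join("(" + " + ".join(map(str, g)) + ")" for g in groups)
    return f"Compute: {expr} =", sum(nums)        # regrouping must not change the sum
```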
♻ ☆ Positioning Political Texts with Large Language Models by Asking and Averaging
We use instruction-tuned Large Language Models (LLMs) like GPT-4, Llama 3, MiXtral, or Aya to position political texts within policy and ideological spaces. We ask an LLM where a tweet or a sentence of a political text stands on the focal dimension and take the average of the LLM responses to position political actors such as US Senators, or longer texts such as UK party manifestos or EU policy speeches given in 10 different languages. The correlations between the position estimates obtained with the best LLMs and benchmarks based on text coding by experts, crowdworkers, or roll call votes exceed .90. This approach is generally more accurate than the positions obtained with supervised classifiers trained on large amounts of research data. Using instruction-tuned LLMs to position texts in policy and ideological spaces is fast, cost-efficient, reliable, and reproducible (in the case of open LLMs) even if the texts are short and written in different languages. We conclude with cautionary notes about the need for empirical validation.
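A minimal sketch of the ask-and-average procedure, assuming a 0-10 scale, an illustrative prompt wording, and a caller-supplied ask_llm callable that wraps whichever instruction-tuned LLM is queried:

```python
# Ask-and-average sketch; `ask_llm` is a hypothetical prompt -> reply callable.
import re
from statistics import mean

def position_texts(texts, ask_llm, dimension="economic left (0) to right (10)"):
    scores = []
    for text in texts:
        reply = ask_llm(
            f"On a scale from 0 to 10 measuring {dimension}, where does the "
            f"following text stand? Answer with a number only.\n\n{text}"
        )
        match = re.search(r"\d+(\.\d+)?", reply)
        if match:
            scores.append(float(match.group()))
    return mean(scores) if scores else None   # averaged position estimate for the actor
```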
♻ ☆ Towards Evaluating and Building Versatile Large Language Models for Medicine
In this study, we present MedS-Bench, a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. Unlike existing benchmarks that focus on multiple-choice question answering, MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation, among others. We evaluated six leading LLMs, e.g., MEDITRON, Mistral, InternLM 2, Llama 3, GPT-4, and Claude-3.5, using few-shot prompting, and found that even the most sophisticated models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment by performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models across nearly all clinical tasks. To promote further advancements in the application of LLMs to clinical challenges, we have made the MedS-Ins dataset fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench, for which we plan to regularly update the test set to track progress and enhance the adaptation of general LLMs to the medical domain. Leaderboard: https://henrychur.github.io/MedS-Bench/. Github: https://github.com/MAGIC-AI4Med/MedS-Ins.
♻ ☆ Legilimens: Practical and Unified Content Moderation for Large Language Model Services CCS
Given the societal impact of unsafe content generated by large language models (LLMs), ensuring that LLM services comply with safety standards is a crucial concern for LLM service providers. Common content moderation methods are limited by an effectiveness-and-efficiency dilemma, where simple models are fragile while sophisticated models consume excessive computational resources. In this paper, we reveal for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation. We propose a practical and unified content moderation framework for LLM services, named Legilimens, which features both effectiveness and efficiency. Our red-team model-based data augmentation enhances the robustness of Legilimens against state-of-the-art jailbreaking. Additionally, we develop a framework to theoretically analyze the cost-effectiveness of Legilimens compared to other methods. We have conducted extensive experiments on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify the effectiveness, efficiency, and robustness of Legilimens against normal and adaptive adversaries. A comparison of Legilimens with both commercial and academic baselines demonstrates the superior performance of Legilimens. Furthermore, we confirm that Legilimens can be applied to few-shot scenarios and extended to multi-label classification tasks.
comment: Accepted by ACM Conference on Computer and Communications Security (CCS) 2024
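A schematic sketch of the extract-conceptual-features-then-classify idea follows; reading the final-layer hidden state of the last token from a HuggingFace-style host LLM and attaching a small MLP head are illustrative assumptions, not Legilimens' exact design.

```python
# Feature probe sketch on a chat LLM (assumed HuggingFace-style API, right padding).
import torch
import torch.nn as nn

class LightModerationHead(nn.Module):
    def __init__(self, d_model, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(),
                                  nn.Linear(256, n_classes))

    def forward(self, features):
        return self.head(features)               # logits over moderation classes

@torch.no_grad()
def extract_features(llm, tokenizer, texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    hidden = llm(**batch, output_hidden_states=True).hidden_states[-1]  # [B, T, D]
    last = batch["attention_mask"].sum(dim=1) - 1                       # last real token index
    return hidden[torch.arange(hidden.size(0)), last]                   # [B, D] features
```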
♻ ☆ Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions
The responses generated by Large Language Models (LLMs) can include sensitive information from individuals and organizations, leading to potential privacy leakage. This work implements Influence Functions (IFs) to trace privacy leakage back to the training data, thereby mitigating privacy concerns of Language Models (LMs). However, we notice that current IFs struggle to accurately estimate the influence of tokens with large gradient norms, potentially overestimating their influence. When tracing the most influential samples, this leads to frequently tracing back to samples with large gradient norm tokens, overshadowing the actual most influential samples even if their influences are well estimated. To address this issue, we propose Heuristically Adjusted IF (HAIF), which reduces the weight of tokens with large gradient norms, thereby significantly improving the accuracy of tracing the most influential samples. To establish easily obtainable ground truth for tracing privacy leakage, we construct two datasets, PII-E and PII-CR, representing two distinct scenarios: one with identical text in the model outputs and pre-training data, and the other where models leverage their reasoning abilities to generate text divergent from pre-training data. HAIF significantly improves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E dataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA IFs against various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs on real-world pretraining data CLUECorpus2020, demonstrating strong robustness regardless of prompt and response lengths.
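The sketch below illustrates, under assumptions, one way to down-weight large-gradient-norm tokens before aggregating a sequence-level loss whose gradients feed an influence computation; the inverse-norm weighting and the use of input-embedding gradients as the norm proxy are illustrative choices, not the paper's exact adjustment.

```python
# Token down-weighting sketch (assumed HuggingFace-style causal LM; illustrative weighting).
import torch
import torch.nn.functional as F

def reweighted_sequence_loss(model, input_ids, labels, eps=1e-6):
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits[:, :-1, :]
    targets = labels[:, 1:]
    token_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                 targets.reshape(-1), reduction="none").view(targets.shape)
    # Per-position gradient norms w.r.t. the input embeddings serve as a cheap proxy.
    grads = torch.autograd.grad(token_loss.sum(), embeds, retain_graph=True)[0]  # [B, T, D]
    token_norms = grads.norm(dim=-1)[:, 1:]          # align norms with target positions
    weights = 1.0 / (token_norms + eps)              # shrink the weight of large-norm tokens
    weights = (weights / weights.sum(dim=-1, keepdim=True)).detach()
    return (weights * token_loss).sum()              # reweighted loss for influence gradients
```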
♻ ☆ Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
Model merging is an efficient empowerment technique in the machine learning community that requires neither the collection of raw training data nor expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Second, we discuss the application of model merging techniques in large language models, multimodal large language models, and 10+ machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at \url{https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications}.
♻ ☆ Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review
This comprehensive review delves into the pivotal role of prompt engineering in unleashing the capabilities of Large Language Models (LLMs). The development of Artificial Intelligence (AI), from its inception in the 1950s to the emergence of advanced neural networks and deep learning architectures, has led to breakthroughs in LLMs, with models such as GPT-4o and Claude-3, and in Vision-Language Models (VLMs), with models such as CLIP and ALIGN. Prompt engineering is the process of structuring inputs, which has emerged as a crucial technique to maximize the utility and accuracy of these models. This paper explores both foundational and advanced methodologies of prompt engineering, including techniques such as self-consistency, chain-of-thought, and generated knowledge, which significantly enhance model performance. Additionally, it examines prompting methods for VLMs through innovative approaches such as Context Optimization (CoOp), Conditional Context Optimization (CoCoOp), and Multimodal Prompt Learning (MaPLe). Critical to this discussion is the aspect of AI security, particularly adversarial attacks that exploit vulnerabilities in prompt engineering. Strategies to mitigate these risks and enhance model robustness are thoroughly reviewed. The evaluation of prompt methods is also addressed, through both subjective and objective metrics, ensuring a robust analysis of their efficacy. This review also reflects the essential role of prompt engineering in advancing AI capabilities, providing a structured framework for future research and application.
♻ ☆ Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model
Due to the inherent difficulty in modeling phonetic similarities across different languages, code-switching speech recognition presents a formidable challenge. This study proposes a Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups. Initially, a preceding routing network explicitly learns Language Identification (LID) tasks and selects experts based on acquired LID weights. This process ensures robust routing information to the MoE layer, mitigating interference from diverse language domains on expert network parameter updates. The LID weights are also employed to facilitate inter-group collaboration, enabling the integration of language-specific representations. Furthermore, within each language expert group, a gating network operates unsupervised to foster collaboration on attributes beyond language. Extensive experiments demonstrate the efficacy of our approach, achieving significant performance enhancements compared to alternative methods. Importantly, our method preserves the efficient inference capabilities characteristic of MoE models without necessitating additional pre-training.
comment: Accepted by IEEE SLT 2024
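A compact PyTorch sketch of the LID-guided routing idea appears below; the layer sizes, two-language setup, and the reuse of softmaxed LID posteriors as expert weights are illustrative assumptions rather than the paper's exact architecture.

```python
# LID-guided MoE layer sketch (illustrative shapes and two-language setup).
import torch
import torch.nn as nn

class LIDGuidedMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_langs=2):
        super().__init__()
        self.lid_router = nn.Linear(d_model, n_langs)     # trained with LID labels
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_langs)                        # one expert group per language
        )

    def forward(self, x, lid_labels=None):
        lid_logits = self.lid_router(x)                    # [B, T, n_langs]
        weights = lid_logits.softmax(dim=-1)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)   # [B, T, n_langs, D]
        y = (weights.unsqueeze(-1) * expert_out).sum(dim=-2)             # LID-weighted mixture
        aux = None
        if lid_labels is not None:                          # auxiliary supervised LID loss
            aux = nn.functional.cross_entropy(
                lid_logits.reshape(-1, lid_logits.size(-1)), lid_labels.reshape(-1))
        return y, aux
```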
♻ ☆ Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) (TOT-CAKT) for ASR. In the TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
comment: Accepted to IEEE SLT 2024
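To illustrate the temporal-order prior, here is a simplified Sinkhorn sketch in which the transport cost mixes a feature distance with a penalty on normalized-position differences; the exact penalty used in TOT-CAKT may differ.

```python
# Sinkhorn OT with a temporal-order penalty in the cost (illustrative penalty form).
import torch

def temporal_ot_plan(acoustic, linguistic, lam=1.0, eps=0.05, n_iters=50):
    # acoustic: [Ta, D] frame features; linguistic: [Tl, D] token features.
    Ta, Tl = acoustic.size(0), linguistic.size(0)
    feat_cost = torch.cdist(acoustic, linguistic)                 # [Ta, Tl]
    pos_a = torch.linspace(0, 1, Ta).unsqueeze(1)
    pos_l = torch.linspace(0, 1, Tl).unsqueeze(0)
    cost = feat_cost + lam * (pos_a - pos_l).abs()                # temporal-order prior
    K = torch.exp(-cost / eps)
    u = torch.full((Ta,), 1.0 / Ta)                               # uniform marginals
    v = torch.full((Tl,), 1.0 / Tl)
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):                                      # Sinkhorn iterations
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)                    # coupling matrix [Ta, Tl]
```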
♻ ☆ LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
♻ ☆ Exposing and Explaining Fake News On-the-Fly
Social media platforms enable the rapid dissemination and consumption of information. However, users instantly consume such content regardless of the reliability of the shared data. Consequently, this crowdsourcing model is exposed to manipulation. This work contributes an explainable and online classification method to recognize fake news in real time. The proposed method combines both unsupervised and supervised Machine Learning approaches with online-created lexica. The profiling is built using creator-, content- and context-based features using Natural Language Processing techniques. The explainable classification mechanism displays in a dashboard the features selected for classification and the prediction confidence. The performance of the proposed solution has been validated with real data sets from Twitter, and the results attain 80% accuracy and macro F-measure. This proposal is the first to jointly provide data stream processing, profiling, classification and explainability. Ultimately, the proposed early detection, isolation and explanation of fake news contribute to increasing the quality and trustworthiness of social media content.
♻ ☆ A review on the use of large language models as virtual tutors
Transformer architectures contribute to managing long-term dependencies in Natural Language Processing, representing one of the most recent changes in the field. These architectures are the basis of the innovative, cutting-edge Large Language Models (LLMs) that have produced a huge buzz in several fields and industrial sectors, among which education stands out. Accordingly, these generative Artificial Intelligence-based solutions have directed the change in techniques and the evolution in educational methods and contents, along with network infrastructure, towards high-quality learning. Given the popularity of LLMs, this review seeks to provide a comprehensive overview of those solutions designed specifically to generate and evaluate educational materials and which involve students and teachers in their design or experimental plan. To the best of our knowledge, this is the first review of educational applications (e.g., student assessment) of LLMs. As expected, the most common role of these systems is as virtual tutors for automatic question generation. Moreover, the most popular models are GPT-3 and BERT. However, due to the continuous launch of new generative models, new works are expected to be published shortly.
♻ ☆ Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications
Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, such as the AMI meeting corpus, for improved speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output segments to fine-tune the SA-ASR model, considering that it is also applied to VAD segments during test, and show that this results in a relative reduction of Speaker Error Rate (SER) up to 28%. Finally, we explore strategies to enhance the extraction of the speaker embedding templates used as inputs by the SA-ASR system. We show that extracting them from SD output rather than annotated speaker segments results in a relative SER reduction up to 20%.
comment: Submitted to Odyssey 2024
♻ ☆ Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?
The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
comment: https://github.com/yixuantt/PoolingAndAttn
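In the spirit of the multi-layer pooling described above, the sketch below attends over a per-layer summary of every hidden layer with a learnable query; the per-layer token mean and the single-query design are illustrative assumptions, not the paper's exact module.

```python
# Cross-attention pooling over all hidden layers (illustrative design choices).
import torch
import torch.nn as nn

class MultiLayerCrossAttnPooling(nn.Module):
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden_states):
        # hidden_states: tuple of [B, T, D] tensors, one per transformer layer.
        layer_reprs = torch.stack([h.mean(dim=1) for h in hidden_states], dim=1)  # [B, L, D]
        q = self.query.expand(layer_reprs.size(0), -1, -1)                        # [B, 1, D]
        pooled, _ = self.attn(q, layer_reprs, layer_reprs)                         # attend over layers
        return pooled.squeeze(1)                                                   # [B, D] embedding
```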
♻ ☆ CAVE: Controllable Authorship Verification Explanations
Authorship Verification (AV) (do two documents have the same author?) is essential in many sensitive real-life applications. AV is often used in proprietary domains that require a private, offline model, making SOTA online models like ChatGPT undesirable. Current offline models, however, have lower downstream utility due to low accuracy/scalability (e.g., traditional stylometry AV systems) and a lack of accessible post-hoc explanations. In this work, we take the first step to address the above challenges with our trained, offline Llama-3-8B model CAVE (Controllable Authorship Verification Explanations): CAVE generates free-text AV explanations that are controlled to be (1) structured (can be decomposed into sub-explanations in terms of relevant linguistic features), and (2) easily verified for explanation-label consistency (via intermediate labels in sub-explanations). We first engineer a prompt that can generate silver training data from a SOTA teacher model in the desired CAVE output format. We then filter and distill this data into a pretrained Llama-3-8B, our carefully selected student model. Results on three difficult AV datasets (IMDb62, Blog-Auth, and Fanfiction) show that CAVE generates high-quality explanations (as measured by automatic and human evaluation) as well as competitive task accuracies.
♻ ☆ Aligning Large Language Models to a Domain-specific Graph Database for NL2GQL
Graph Databases (Graph DB) find extensive application across diverse domains such as finance, social networks, and medicine. Yet, the translation of Natural Language (NL) into the Graph Query Language (GQL), referred to as NL2GQL, poses significant challenges owing to its intricate and specialized nature. Some approaches have sought to utilize Large Language Models (LLMs) to address analogous tasks like text2SQL. Nonetheless, in the realm of NL2GQL tasks tailored to a particular domain, the absence of domain-specific NL-GQL data pairs adds complexity to aligning LLMs with the graph DB. To tackle this challenge, we present a well-defined pipeline. Initially, we utilize ChatGPT to generate NL-GQL data pairs, leveraging the provided graph DB with self-instruction. Subsequently, we employ the generated data to fine-tune LLMs, ensuring alignment between LLMs and the graph DB. Moreover, we find that the relevant schema is important for efficiently generating accurate GQLs. Thus, we introduce a method to extract the relevant schema as the input context. We evaluate our method using two carefully constructed datasets derived from graph DBs in the finance and medicine domains, named FinGQL and MediGQL. Experimental results reveal that our approach significantly outperforms a set of baseline methods, with improvements of 5.90 and 6.36 absolute points on EM, and 6.00 and 7.09 absolute points on EX for FinGQL and MediGQL, respectively.
comment: 13 pages,2 figures
♻ ☆ More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding
Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping allows us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: https://github.com/TangYuan96/GreenPLM.
♻ ☆ OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation
This paper presents the experiments and results for the CheckThat! Lab at CLEF 2024 Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE). The primary objective of this task was to generate adversarial examples in five problem domains in order to evaluate the robustness of widely used text classification methods (fine-tuned BERT, BiLSTM, and RoBERTa) when applied to credibility assessment issues. This study explores the application of ensemble learning to enhance adversarial attacks on natural language processing (NLP) models. We systematically tested and refined several adversarial attack methods, including BERT-Attack, Genetic algorithms, TextFooler, and CLARE, on five datasets across various misinformation tasks. By developing modified versions of BERT-Attack and hybrid methods, we achieved significant improvements in attack effectiveness. Our results demonstrate the potential of modifying and combining multiple methods to create more sophisticated and effective adversarial attack strategies, contributing to the development of more robust and secure systems.
comment: CLEF 2024 - Conference and Labs of the Evaluation Forum
♻ ☆ Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science, examining similarities and differences between LLMs and human cognitive processes. We analyze methods for evaluating LLMs' cognitive abilities and discuss their potential as cognitive models. The review covers applications of LLMs in various cognitive fields, highlighting insights gained for cognitive science research. We assess cognitive biases and limitations of LLMs, along with proposed methods for improving their performance. The integration of LLMs with cognitive architectures is examined, revealing promising avenues for enhancing artificial intelligence (AI) capabilities. Key challenges and future research directions are identified, emphasizing the need for continued refinement of LLMs to better align with human cognition. This review provides a balanced perspective on the current state and future potential of LLMs in advancing our understanding of both artificial and human intelligence.
comment: 10 pages, 1 figure
♻ ☆ LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling their generation of accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.
♻ ☆ Resolving Knowledge Conflicts in Large Language Models
Large language models (LLMs) often encounter knowledge conflicts, scenarios where discrepancy arises between the internal parametric knowledge of LLMs and non-parametric information provided in the prompt context. In this work we ask what are the desiderata for LLMs when a knowledge conflict arises and whether existing LLMs fulfill them. We posit that LLMs should 1) identify knowledge conflicts, 2) pinpoint conflicting information segments, and 3) provide distinct answers or viewpoints in conflicting scenarios. To this end, we introduce KNOWLEDGE CONFLICT, an evaluation framework for simulating contextual knowledge conflicts and quantitatively evaluating to what extent LLMs achieve these goals. KNOWLEDGE CONFLICT includes diverse and complex situations of knowledge conflict, knowledge from diverse entities and domains, two synthetic conflict creation methods, and settings with progressively increasing difficulty to reflect realistic knowledge conflicts. Extensive experiments with the KNOWLEDGE CONFLICT framework reveal that while LLMs perform well in identifying the existence of knowledge conflicts, they struggle to determine the specific conflicting knowledge and produce a response with distinct answers amidst conflicting information. To address these challenges, we propose new instruction-based approaches that augment LLMs to better achieve the three goals. Further analysis shows that abilities to tackle knowledge conflicts are greatly impacted by factors such as knowledge domain and prompt text, while generating robust responses to knowledge conflict scenarios remains an open research question.
comment: Published at COLM 2024
♻ ☆ Prediction of COPD Using Machine Learning, Clinical Summary Notes, and Vital Signs
Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory lung disease that causes obstructed airflow from the lungs. In the United States, more than 15.7 million Americans have been diagnosed with COPD, with 96% of individuals living with at least one other chronic health condition. It is the 4th leading cause of death in the country. Over 2.2 million patients are admitted to hospitals annually due to COPD exacerbations. Monitoring and predicting patient exacerbations in time could save lives. This paper presents two different predictive models to predict COPD exacerbation using AI and natural language processing (NLP) approaches. These models use respiration summary notes, symptoms, and vital signs. To train and test these models, data records containing physiologic signals and vital signs time series were used. These records were captured from patient monitors and comprehensive clinical data obtained from hospital medical information systems for tens of thousands of Intensive Care Unit (ICU) patients. We achieved an area under the Receiver Operating Characteristic (ROC) curve of 0.82 in the detection and prediction of COPD exacerbations.
comment: 11 pages, 5 figures
♻ ☆ Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation
Large language models (LLMs) have achieved state-of-the-art performance in various language processing tasks, motivating their adoption in simultaneous translation. Current fine-tuning methods to adapt LLMs for simultaneous translation focus on prompting optimization strategies using either data augmentation or prompt structure modifications. However, these methods suffer from several issues, such as unnecessarily expanded training sets, computational inefficiency from dumping the key and value cache, increased prompt sizes, or restriction to a single decision policy. To eliminate these issues, in this work, we propose SimulMask, a new paradigm for fine-tuning LLMs for simultaneous translation. It utilizes a novel attention mask approach that models simultaneous translation during fine-tuning by masking attention for a desired decision policy. Applying the proposed SimulMask on a Falcon LLM for the IWSLT 2017 dataset, we have observed a significant translation quality improvement compared to state-of-the-art prompting optimization strategies on five language pairs while reducing the computational cost.
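As a concrete illustration of masking attention for a decision policy, the sketch below builds a mask for a concatenated [source; target] sequence under a wait-k policy; both the layout and the choice of wait-k are assumptions for illustration, not necessarily SimulMask's exact construction.

```python
# Policy-aware attention mask sketch (assumed [source; target] layout, wait-k policy).
import torch

def simultaneous_attention_mask(src_len, tgt_len, k=3):
    n = src_len + tgt_len
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))    # causal base mask
    for j in range(tgt_len):
        visible_src = min(src_len, k + j)                     # wait-k: k + j source tokens visible
        mask[src_len + j, visible_src:src_len] = False        # hide not-yet-read source tokens
    return mask                                               # True = attention allowed
```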
Computer Vision and Pattern Recognition 104
☆ Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.
comment: Project page: https://yunzeman.github.io/lexicon3d , Github: https://github.com/YunzeMan/Lexicon3D
☆ DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation ECCV 2024
Diffusion probabilistic models (DPMs) have shown remarkable performance in visual synthesis but are computationally expensive due to the need for multiple evaluations during the sampling. Recent predictor-corrector diffusion samplers have significantly reduced the required number of function evaluations (NFE), but inherently suffer from a misalignment issue caused by the extra corrector step, especially with a large classifier-free guidance scale (CFG). In this paper, we introduce a new fast DPM sampler called DC-Solver, which leverages dynamic compensation (DC) to mitigate the misalignment of the predictor-corrector samplers. The dynamic compensation is controlled by compensation ratios that are adaptive to the sampling steps and can be optimized on only 10 datapoints by pushing the sampling trajectory toward a ground truth trajectory. We further propose a cascade polynomial regression (CPR) which can instantly predict the compensation ratios on unseen sampling configurations. Additionally, we find that the proposed dynamic compensation can also serve as a plug-and-play module to boost the performance of predictor-only samplers. Extensive experiments on both unconditional sampling and conditional sampling demonstrate that our DC-Solver can consistently improve the sampling quality over previous methods on different DPMs with a wide range of resolutions up to 1024$\times$1024. Notably, we achieve 10.38 FID (NFE=5) on unconditional FFHQ and 0.394 MSE (NFE=5, CFG=7.5) on Stable-Diffusion-2.1. Code is available at https://github.com/wl-zhao/DC-Solver
comment: Accepted by ECCV 2024
☆ Foundation Model or Finetune? Evaluation of few-shot semantic segmentation for river pollution ECCV 2024
Foundation models (FMs) are a popular topic of research in AI. Their ability to generalize to new tasks and datasets without retraining or needing an abundance of data makes them an appealing candidate for applications on specialist datasets. In this work, we compare the performance of FMs to finetuned pre-trained supervised models in the task of semantic segmentation on an entirely new dataset. We see that finetuned models consistently outperform the FMs tested, even in cases where data is scarce. We release the code and dataset for this work on GitHub.
comment: Accepted at ECCV 2024 Green Foundation Models workshop
☆ ArtiFade: Learning to Generate High-quality Subject from Blemished Images
Subject-driven text-to-image generation has witnessed remarkable advancements in its ability to learn and capture characteristics of a subject using only a limited number of images. However, existing methods commonly rely on high-quality images for training and may struggle to generate reasonable images when the input images are blemished by artifacts. This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. Specifically, ArtiFade exploits fine-tuning of a pre-trained text-to-image model, aiming to remove artifacts. The elimination of artifacts is achieved by utilizing a specialized dataset that encompasses both unblemished images and their corresponding blemished counterparts during fine-tuning. ArtiFade also ensures the preservation of the original generative capabilities inherent within the diffusion model, thereby enhancing the overall performance of subject-driven methods in generating high-quality and artifact-free images. We further devise evaluation benchmarks tailored for this task. Through extensive qualitative and quantitative experiments, we demonstrate the generalizability of ArtiFade in effective artifact removal under both in-distribution and out-of-distribution scenarios.
☆ Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation
Generating high-quality 3D objects from textual descriptions remains a challenging problem due to computational cost, the scarcity of 3D data, and complex 3D representations. We introduce Geometry Image Diffusion (GIMDiffusion), a novel Text-to-3D model that utilizes geometry images to efficiently represent 3D shapes using 2D images, thereby avoiding the need for complex 3D-aware architectures. By integrating a Collaborative Control mechanism, we exploit the rich 2D priors of existing Text-to-Image models such as Stable Diffusion. This enables strong generalization even with limited 3D training data (allowing us to use only high-quality training data) as well as retaining compatibility with guidance techniques such as IPAdapter. In short, GIMDiffusion enables the generation of 3D assets at speeds comparable to current Text-to-Image models. The generated objects consist of semantically meaningful, separate parts and include internal structures, enhancing both usability and versatility.
comment: 11 pages, 9 figures, Project page: https://unity-research.github.io/Geometry-Image-Diffusion.github.io/
☆ View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista.
comment: Accepted to CoRL 2024
☆ RealisHuman: A Two-Stage Approach for Refining Malformed Human Parts in Generated Images
In recent years, diffusion models have revolutionized visual generation, outperforming traditional frameworks like Generative Adversarial Networks (GANs). However, generating images of humans with realistic semantic parts, such as hands and faces, remains a significant challenge due to their intricate structural complexity. To address this issue, we propose a novel post-processing solution named RealisHuman. The RealisHuman framework operates in two stages. First, it generates realistic human parts, such as hands or faces, using the original malformed parts as references, ensuring consistent details with the original image. Second, it seamlessly integrates the rectified human parts back into their corresponding positions by repainting the surrounding areas to ensure smooth and realistic blending. The RealisHuman framework significantly enhances the realism of human generation, as demonstrated by notable improvements in both qualitative and quantitative metrics. Code is available at https://github.com/Wangbenzhi/RealisHuman.
☆ CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and are highly sensitive to the distribution of training data, thereby causing unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring evaluation objectivity by designing an image-level rather than LaTeX-level metric score. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. The results demonstrate that CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations.
comment: Project Website: https://github.com/opendatalab/UniMERNet/tree/main/cdm
☆ Surface-Centric Modeling for High-Fidelity Generalizable Neural Surface Reconstruction ECCV 2024
Reconstructing the high-fidelity surface from multi-view images, especially sparse images, is a critical and practical task that has attracted widespread attention in recent years. However, existing methods are impeded by the memory constraint or the requirement of ground-truth depths and cannot recover satisfactory geometric details. To this end, we propose SuRF, a new Surface-centric framework that incorporates a new Region sparsification based on a matching Field, achieving good trade-offs between performance, efficiency and scalability. To our knowledge, this is the first unsupervised method achieving end-to-end sparsification powered by the introduced matching field, which leverages the weight distribution to efficiently locate the boundary regions containing the surface. Instead of predicting an SDF value for each voxel, we present a new region sparsification approach to sparsify the volume by judging whether each voxel lies inside the surface region. In this way, our model can exploit higher frequency features around the surface with less memory and computational consumption. Extensive experiments on multiple benchmarks containing complex large-scale scenes show that our reconstructions exhibit high-quality details and achieve new state-of-the-art performance, i.e., 46% improvements with 80% less memory consumption. Code is available at https://github.com/prstrive/SuRF.
comment: ECCV 2024 Accepted
☆ SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing
Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address the aforementioned challenges, we propose a novel framework called SegTalker to decouple lip movements and image textures by introducing segmentation as an intermediate representation. Specifically, given the image mask produced by a parsing network, we first leverage the speech to drive the mask and generate the talking segmentation. Then we disentangle semantic regions of the image into style codes using a mask-guided encoder. Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize the video frame. In this way, most textures are fully preserved. Moreover, our approach can inherently achieve background separation and facilitate mask-guided facial local editing. In particular, by editing the mask and swapping the region textures from a given reference image (e.g., hair, lips, eyebrows), our approach enables seamless facial editing when generating talking face videos. Experiments demonstrate that our proposed approach can effectively preserve texture details and generate temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.
comment: 10 pages, 7 figures, 3 tables
☆ TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces
A robust face recognition model must be trained using datasets that include a large number of subjects and numerous samples per subject under varying conditions (such as pose, expression, age, noise, and occlusion). Due to ethical and privacy concerns, large-scale real face datasets such as MS1MV3 have been discontinued, and synthetic face generators utilizing GANs and Diffusion Models, such as SYNFace, SFace, DigiFace-1M, IDiff-Face, DCFace, and GANDiffFace, have been proposed to meet this demand. Some of these methods can produce high-fidelity realistic faces, but with low intra-class variance, while others generate high-variance faces with low identity consistency. In this paper, we propose a Triple Condition Diffusion Model (TCDiff) to improve face style transfer from real to synthetic faces through 2D and 3D facial constraints, enhancing face identity consistency while keeping the necessary high intra-class variance. Face recognition experiments using 1k, 2k, and 5k classes of our new dataset for training outperform state-of-the-art synthetic datasets in real face benchmarks such as LFW, CFP-FP, AgeDB, and BUPT. Our source code is available at: https://github.com/BOVIFOCR/tcdiff.
comment: SIBGRAPI 2024
☆ A practical approach to evaluating the adversarial distance for machine learning classifiers
Robustness is critical for machine learning (ML) classifiers to ensure consistent performance in real-world applications where models may encounter corrupted or adversarial inputs. In particular, assessing the robustness of classifiers to adversarial inputs is essential to protect systems from vulnerabilities and thus ensure safety in use. However, accurately computing adversarial robustness remains challenging for complex ML models and high-dimensional data. Furthermore, evaluations typically measure adversarial accuracy at specific attack budgets, limiting the informative value of the resulting metrics. This paper investigates the estimation of the more informative adversarial distance using iterative adversarial attacks and a certification approach. Combined, the methods provide a comprehensive evaluation of adversarial robustness by computing estimates for the upper and lower bounds of the adversarial distance. We present visualisations and ablation studies that provide insights into how this evaluation method should be applied and parameterised. We find that our adversarial attack approach is effective compared to related implementations, while the certification method falls short of expectations. The approach in this paper should encourage a more informative way of evaluating the adversarial robustness of ML classifiers.
comment: Accepted manuscript at International Mechanical Engineering Congress and Exposition IMECE2024
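A minimal sketch of the kind of iterative estimate described above: an untargeted gradient attack is run for a fixed number of steps and the smallest perturbation norm that flips the prediction is recorded as an upper bound on the adversarial distance. The toy linear model, step size, and step count are illustrative placeholders, not the paper's configuration.

```python
# Illustrative sketch only; the attack schedule and model are assumptions.
import torch
import torch.nn as nn

def adversarial_distance_upper_bound(model, x, y, steps=100, step_size=0.01):
    """Return, per sample, the smallest L2 perturbation norm found that changes the label."""
    x_adv = x.clone().detach().requires_grad_(True)
    best_dist = torch.full((x.shape[0],), float("inf"))
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            g = grad.flatten(1)
            g = g / (g.norm(dim=1, keepdim=True) + 1e-12)   # normalized ascent direction
            x_adv += step_size * g.view_as(x_adv)
            flipped = model(x_adv).argmax(dim=1) != y
            dist = (x_adv - x).flatten(1).norm(dim=1)
            best_dist = torch.where(flipped, torch.minimum(best_dist, dist), best_dist)
    return best_dist  # inf entries mean no label flip was found within the budget

# Toy usage: random linear classifier on 2D inputs.
model = nn.Linear(2, 3)
x, y = torch.randn(8, 2), torch.randint(0, 3, (8,))
print(adversarial_distance_upper_bound(model, x, y))
```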
☆ Text-Guided Mixup Towards Long-Tailed Image Categorization BMVC'24
In many real-world applications, the frequency distribution of class labels for training data can exhibit a long-tailed distribution, which challenges traditional approaches of training deep neural networks that require large amounts of balanced data. Gathering and labeling data to balance out the class label distribution can be both costly and time-consuming. Many existing solutions that enable ensemble learning, re-balancing strategies, or fine-tuning applied to deep neural networks are limited by the inherent problem of having few samples for a subset of classes. Recently, vision-language models like CLIP have been observed to be effective for zero-shot or few-shot learning by capturing the similarity between vision and language features of image and text pairs. Considering that large pre-trained vision-language models may contain valuable side textual information for minor classes, we propose to leverage text supervision to tackle the challenge of long-tailed learning. Concretely, we propose a novel text-guided mixup technique that takes advantage of the semantic relations between classes recognized by the pre-trained text encoder to help alleviate the long-tailed problem. Our empirical study on benchmark long-tailed tasks demonstrates the effectiveness of our proposal with a theoretical guarantee. Our code is available at https://github.com/rsamf/text-guided-mixup.
comment: Accepted by BMVC'24, code is available at https://github.com/rsamf/text-guided-mixup
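A hedged sketch of what a text-guided mixup step could look like: the mixing partner for each sample is drawn with probability derived from the cosine similarity between class-name embeddings from a pretrained text encoder. The random `text_features` below stand in for real CLIP text embeddings, and the exact weighting scheme used in the paper may differ.

```python
# Illustrative sketch; the partner-sampling rule is an assumption, not the paper's exact method.
import torch

def text_guided_mixup(images, labels, text_features, alpha=0.4):
    """Mix each sample with a partner whose class is semantically close in text space."""
    t = torch.nn.functional.normalize(text_features, dim=1)
    class_sim = t @ t.t()                                    # (C, C) cosine similarity
    weights = class_sim[labels][:, labels]                   # (B, B) partner affinities
    weights.fill_diagonal_(float("-inf"))                    # never mix a sample with itself
    partner_idx = torch.multinomial(torch.softmax(weights, dim=1), 1).squeeze(1)
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * images + (1 - lam) * images[partner_idx]
    return mixed, labels, labels[partner_idx], lam

# Toy usage: 16 samples, 4 classes, 8-dim stand-in text embeddings.
images, labels = torch.randn(16, 3, 32, 32), torch.randint(0, 4, (16,))
mixed, y_a, y_b, lam = text_guided_mixup(images, labels, torch.randn(4, 8))
print(mixed.shape, float(lam))
```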
☆ MaskVal: Simple but Effective Uncertainty Quantification for 6D Pose Estimation
For the use of 6D pose estimation in robotic applications, reliable poses are of utmost importance to ensure safe, reliable and predictable operational performance. Despite these requirements, state-of-the-art 6D pose estimators often do not provide any uncertainty quantification for their pose estimates at all, or if they do, the provided uncertainty has been shown to be only weakly correlated with the actual true error. To address this issue, we investigate a simple but effective uncertainty quantification method, which we call MaskVal, that compares the pose estimates with their corresponding instance segmentations by rendering and does not require any modification of the pose estimator itself. Despite its simplicity, MaskVal significantly outperforms a state-of-the-art ensemble method on both a dataset and a robotic setup. We show that by using MaskVal, the performance of a state-of-the-art 6D pose estimator is significantly improved towards a safe and reliable operation. In addition, we propose a new and specific approach to compare and evaluate uncertainty quantification methods for 6D pose estimation in the context of robotic manipulation.
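A hedged sketch of the rendering-based comparison MaskVal describes: the silhouette rendered at the estimated 6D pose is compared with the instance segmentation mask, and their IoU serves as a confidence score. `render_silhouette` is a placeholder for whichever renderer is available; only the mask comparison itself is shown concretely.

```python
# Illustrative sketch; the renderer interface is a placeholder assumption.
import numpy as np

def mask_iou(rendered_mask: np.ndarray, segmentation_mask: np.ndarray) -> float:
    """IoU between the rendered silhouette and the detected instance mask."""
    inter = np.logical_and(rendered_mask, segmentation_mask).sum()
    union = np.logical_or(rendered_mask, segmentation_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def pose_confidence(pose, mesh, camera, segmentation_mask, render_silhouette):
    """Score a pose estimate by how well its rendered mask matches the segmentation."""
    rendered = render_silhouette(mesh, pose, camera, segmentation_mask.shape)
    return mask_iou(rendered, segmentation_mask)

# Toy usage with synthetic masks standing in for a real render and segmentation.
a = np.zeros((64, 64), dtype=bool); a[10:40, 10:40] = True
b = np.zeros((64, 64), dtype=bool); b[15:45, 15:45] = True
print(mask_iou(a, b))
```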
☆ Unified Framework for Neural Network Compression via Decomposition and Optimal Rank Selection
Despite their high accuracy, complex neural networks demand significant computational resources, posing challenges for deployment on resource-constrained devices such as mobile phones and embedded systems. Compression algorithms have been developed to address these challenges by reducing model size and computational demands while maintaining accuracy. Among these approaches, factorization methods based on tensor decomposition are theoretically sound and effective. However, they face difficulties in selecting the appropriate rank for decomposition. This paper tackles this issue by presenting a unified framework that simultaneously applies decomposition and optimal rank selection, employing a composite compression loss within defined rank constraints. Our approach includes an automatic rank search in a continuous space, efficiently identifying optimal rank configurations without the use of training data, making it computationally efficient. Combined with a subsequent fine-tuning step, our approach maintains the performance of highly compressed models on par with their original counterparts. Using various benchmark datasets, we demonstrate the efficacy of our method through a comprehensive analysis.
☆ Organized Grouped Discrete Representation for Object-Centric Learning
Object-Centric Learning (OCL) represents dense image or video pixels as sparse object features. Representative methods utilize discrete representations composed of Variational Autoencoder (VAE) template features to suppress pixel-level information redundancy and guide object-level feature aggregation. The most recent advancement, Grouped Discrete Representation (GDR), further decomposes these template features into attributes. However, its naive channel grouping as decomposition may erroneously group channels belonging to different attributes together and discretize them as sub-optimal template attributes, which loses information and harms expressivity. We propose Organized GDR (OGDR) to organize channels belonging to the same attributes together for correct decomposition from features into attributes. In unsupervised segmentation experiments, OGDR is fully superior to GDR in augmenting classical transformer-based OCL methods; it even improves state-of-the-art diffusion-based ones. Codebook PCA and representation similarity analyses show that, compared with GDR, our OGDR eliminates redundancy and preserves information better for guiding object representation learning. The source code is available in the supplementary material.
☆ DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
Diffusion models (DMs) have demonstrated exceptional generative capabilities across various areas, but they are hindered by slow inference speeds and high computational demands during deployment. The most common way to accelerate DMs involves reducing the number of denoising steps during generation, achieved through faster sampling solvers or knowledge distillation (KD). In contrast to prior approaches, we propose a novel method that transfers the capability of large pretrained DMs to faster architectures. Specifically, we employ KD in a distinct manner to compress DMs by distilling their generative ability into more rapid variants. Furthermore, considering that the source data is either inaccessible or too large to store for current generative models, we introduce a new paradigm for their distillation without source data, termed Data-Free Knowledge Distillation for Diffusion Models (DKDM). Generally, our established DKDM framework comprises two main components: 1) a DKDM objective that uses synthetic denoising data produced by pretrained DMs to optimize faster DMs without source data, and 2) a dynamic iterative distillation method that flexibly organizes the synthesis of denoising data, preventing the slow generation process from becoming a bottleneck during optimization. To our knowledge, this is the first attempt at using KD to distill DMs into any architecture in a data-free manner. Importantly, our DKDM is orthogonal to most existing acceleration methods, such as denoising step reduction, quantization and pruning. Experiments show that our DKDM is capable of deriving 2x faster DMs with performance remaining on par with the baseline. Notably, our DKDM enables pretrained DMs to function as "datasets" for training new DMs.
☆ Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift
Natural distribution shift causes a deterioration in the perception performance of convolutional neural networks (CNNs). This comprehensive analysis on real-world traffic data addresses: 1) investigating the effect of natural distribution shift and weather augmentations on both detection quality and confidence estimation, 2) evaluating model performance for both classification and object localization, and 3) benchmarking two common uncertainty quantification methods - Ensembles and different variants of Monte-Carlo (MC) Dropout - under natural and close-to-natural distribution shift. For this purpose, a novel dataset has been curated from publicly available autonomous driving datasets. The in-distribution (ID) data is based on cutouts of a single object, for which both class and bounding box annotations are available. The six distribution-shift datasets cover adverse weather scenarios, simulated rain and fog, corner cases, and out-of-distribution data. A granular analysis of CNNs under distribution shift allows us to quantify the impact of different types of shift on both task performance and confidence estimation: ConvNeXt-Tiny is more robust than EfficientNet-B0; heavy rain degrades classification more strongly than localization, whereas the opposite holds for heavy fog; integrating MC-Dropout into selected layers only has the potential to enhance task performance and confidence estimation, whereby the identification of these layers depends on the type of distribution shift and the considered task.
comment: This preprint has not undergone any post-submission improvements or corrections
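For reference, a minimal sketch of Monte-Carlo Dropout inference, one of the uncertainty quantification baselines benchmarked above: dropout stays active at test time, several stochastic forward passes are averaged, and predictive entropy serves as the confidence signal. The tiny classifier and the number of passes are illustrative only.

```python
# Illustrative sketch; model and pass count are assumptions, not the paper's setup.
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, num_passes=20):
    """Return (mean softmax probabilities, predictive entropy) over stochastic passes."""
    model.train()  # keeps dropout stochastic; assumes no batch-norm layers in the model
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(num_passes)])
    mean_probs = probs.mean(dim=0)
    entropy = -(mean_probs * (mean_probs + 1e-12).log()).sum(dim=1)
    return mean_probs, entropy

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
                      nn.Dropout(p=0.3), nn.Linear(128, 10))
mean_probs, entropy = mc_dropout_predict(model, torch.randn(4, 3, 32, 32))
print(mean_probs.shape, entropy)
```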
☆ Use of triplet loss for facial restoration in low-resolution images
In recent years, facial recognition (FR) models have become the most widely used biometric tool, achieving impressive results on numerous datasets. However, inherent hardware challenges or shooting distances often result in low-resolution images, which significantly impact the performance of FR models. To address this issue, several solutions have been proposed, including super-resolution (SR) models that generate highly realistic faces. Despite these efforts, significant improvements in FR algorithms have not been achieved. We propose FTLGAN, a novel SR model that focuses on generating high-resolution images that preserve individual identities rather than merely improving image quality, thereby maximizing the performance of FR models. The results are compelling, demonstrating a mean d' value 21% above the best current state-of-the-art models, specifically d' = 1.099 and AUC = 0.78 for 14x14 pixels, d' = 2.112 and AUC = 0.92 for 28x28 pixels, and d' = 3.049 and AUC = 0.98 for 56x56 pixels. The contributions of this study are significant in several key areas. Firstly, a notable improvement in facial recognition performance has been achieved in low-resolution images, specifically at resolutions of 14x14, 28x28, and 56x56 pixels. Secondly, the enhancements demonstrated by FTLGAN show a consistent response across all resolutions, delivering outstanding performance uniformly, unlike other comparative models. Thirdly, an innovative approach has been implemented using triplet loss logic, enabling the training of the super-resolution model solely with real images, in contrast to current models, and expanding potential real-world applications. Lastly, this study introduces a novel model that specifically addresses the challenge of improving classification performance in facial recognition systems by integrating facial recognition quality as a loss during model training.
comment: 10 pages, 8 figures
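A hedged sketch of a triplet objective in the spirit of FTLGAN's training: embeddings of the super-resolved face (anchor), a real image of the same identity (positive), and a different identity (negative) are compared with a margin. The embedding network and margin value are stand-ins, not the paper's exact setup.

```python
# Illustrative sketch; the embedding network and margin are assumptions.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.5)

def identity_triplet_loss(embed_net, sr_face, same_id_face, other_id_face):
    """Triplet loss in face-embedding space, encouraging identity preservation."""
    return triplet(embed_net(sr_face), embed_net(same_id_face), embed_net(other_id_face))

# Toy usage with a random embedding network on 56x56 crops.
embed_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 56 * 56, 128))
sr, pos, neg = (torch.randn(2, 3, 56, 56) for _ in range(3))
print(identity_triplet_loss(embed_net, sr, pos, neg).item())
```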
☆ FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation
Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models, such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still encounter the critical issue of generating precise mask proposals for unseen categories and scenarios, ultimately resulting in inferior segmentation performance. To address this challenge, we introduce a novel approach, FrozenSeg, designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP), in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we inject space-aware features into the learnable queries and CLIP features within the transformer decoder. In addition, we devise a mask proposal ensemble strategy to further improve the recall rate and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models, focusing optimization efforts solely on a lightweight transformer decoder for mask proposal generation, the performance bottleneck. Extensive experiments demonstrate that FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data and tested in a zero-shot manner. Code is available at https://github.com/chenxi52/FrozenSeg.
comment: 14 pages, 9 figures
☆ Have Large Vision-Language Models Mastered Art History?
The emergence of large Vision-Language Models (VLMs) has recently established new baselines in image classification across multiple domains. However, the performance of VLMs in the specific task of artwork classification, particularly art style classification of paintings - a domain traditionally mastered by art historians - has not been explored yet. Artworks pose a unique challenge compared to natural images due to their inherently complex and diverse structures, characterized by variable compositions and styles. Art historians have long studied the unique aspects of artworks, with style prediction being a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively predict the art historical attributes of paintings. We conduct an in-depth analysis of four VLMs, namely CLIP, LLaVA, OpenFlamingo, and GPT-4o, focusing on zero-shot classification of art style, author and time period using two public benchmarks of artworks. Additionally, we present ArTest, a well-curated test set of artworks, including pivotal paintings studied by art historians.
☆ Tissue Concepts: supervised foundation models in computational pathology
Due to the increasing workload of pathologists, the need for automation to support diagnostic tasks and quantitative biomarker evaluation is becoming more and more apparent. Foundation models have the potential to improve generalizability within and across centers and serve as starting points for data-efficient development of specialized yet robust AI models. However, training foundation models is usually very expensive in terms of data, computation, and time. This paper proposes a supervised training method that drastically reduces these expenses. The proposed method is based on multi-task learning to train a joint encoder, by combining 16 different classification, segmentation, and detection tasks on a total of 912,000 patches. Since the encoder is capable of capturing the properties of the samples, we term it the Tissue Concepts encoder. To evaluate the performance and generalizability of the Tissue Concepts encoder across centers, classification of whole slide images from four of the most prevalent solid cancers - breast, colon, lung, and prostate - was used. The experiments show that the Tissue Concepts model achieves comparable performance to models trained with self-supervision, while requiring only 6% of the amount of training patches. Furthermore, the Tissue Concepts encoder outperforms an ImageNet pre-trained encoder on both in-domain and out-of-domain data.
comment: 22 Pages, 3 Figures, submitted to and under revision at Computers in Biology and Medicine
☆ LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution
Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces the spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our code is available at https://github.com/jwgdmkj/LMLT.
☆ Blended Latent Diffusion under Attention Control for Real-World Video Editing
Due to the lack of fully publicly available text-to-video models, current video editing methods tend to build on pre-trained text-to-image generation models; however, they still face significant challenges in local editing of videos with temporal information. First, although existing methods attempt to focus on local-area editing via a pre-defined mask, the preservation of the outside-area background is non-ideal because each frame is generated in its spatial entirety. In addition, requiring the user to provide a mask is an additional costly undertaking, so an autonomous masking strategy integrated into the editing process is desirable. Last but not least, image-level pretrained models have not learned temporal information across the frames of a video, which is vital for expressing motion and dynamics. In this paper, we propose to adapt an image-level blended latent diffusion model to perform local video editing tasks. Specifically, we leverage DDIM inversion to acquire the latents as background latents instead of randomly noised ones to better preserve the background information of the input video. We further introduce an autonomous mask manufacture mechanism derived from cross-attention maps in diffusion steps. Finally, we enhance the temporal consistency across video frames by transforming the self-attention blocks of the U-Net into temporal-spatial blocks. Through extensive experiments, our proposed approach demonstrates effectiveness in different real-world video editing tasks.
☆ ScreenMark: Watermarking Arbitrary Visual Content on Screen
Digital watermarking has demonstrated its effectiveness in protecting multimedia content. However, existing watermarking methods are predominantly tailored for specific media types, rendering them less effective for protecting content displayed on computer screens, which is often multimodal and dynamic. Visual Screen Content (VSC) is particularly susceptible to theft and leakage via screenshots, a vulnerability that current watermarking methods fail to adequately address. To tackle these challenges, we propose ScreenMark, a robust and practical watermarking method designed specifically for arbitrary VSC protection. ScreenMark utilizes a three-stage progressive watermarking framework. Initially, inspired by diffusion principles, we initialize the mutual transformation between regular watermark information and irregular watermark patterns. Subsequently, these patterns are integrated with screen content using a pre-multiplication alpha blending technique, supported by a pre-trained screen decoder for accurate watermark retrieval. The progressively complex distorter enhances the robustness of the watermark in real-world screenshot scenarios. Finally, the model undergoes fine-tuning guided by a joint-level distorter to ensure optimal performance. To validate the effectiveness of ScreenMark, we compiled a dataset comprising 100,000 screenshots from various devices and resolutions. Extensive experiments across different datasets confirm the method's superior robustness, imperceptibility, and practical applicability.
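A hedged sketch of the pre-multiplied alpha blending step mentioned above: a watermark pattern already scaled by its alpha channel is composited over the screen content. Pattern generation, the distorters, and the screen decoder are out of scope here; the arrays below are illustrative.

```python
# Illustrative sketch; shapes and the alpha strength are assumptions.
import numpy as np

def premultiplied_alpha_blend(screen: np.ndarray, pattern_premultiplied: np.ndarray,
                              alpha: np.ndarray) -> np.ndarray:
    """Composite a pre-multiplied watermark pattern onto HxWx3 screen content in [0, 1]."""
    return pattern_premultiplied + (1.0 - alpha[..., None]) * screen

screen = np.random.rand(64, 64, 3)
alpha = np.full((64, 64), 0.02)                          # near-invisible watermark strength
pattern = np.random.rand(64, 64, 3) * alpha[..., None]   # pattern pre-multiplied by alpha
marked = premultiplied_alpha_blend(screen, pattern, alpha)
print(float(np.abs(marked - screen).max()))
```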
☆ Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation
Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours, which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error; however, no work has been done on improving the "utility" of Bayesian uncertainty maps such that uncertainty is only present in inaccurate regions and not in accurate ones. Our work trains the FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss, which promotes uncertainty to be present only in inaccurate regions. We apply this method to datasets of two radiotherapy body sites, namely head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that, when compared to the Bayesian baseline, the proposed method successfully suppresses uncertainty for accurate voxels, with a similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at https://github.com/prerakmody/bayesuncertainty-error-correspondence
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:018
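A hedged sketch of an Accuracy-vs-Uncertainty style objective: the model is rewarded for being certain on accurate voxels and uncertain on inaccurate ones, with soft indicators keeping the counts differentiable. The threshold and exact formulation here are illustrative and may differ from the loss used in the paper.

```python
# Illustrative sketch; the soft indicators and threshold are assumptions.
import torch

def avu_loss(probs, targets, uncertainty, u_threshold=0.5, eps=1e-8):
    """probs: (N, C) softmax outputs; targets: (N,) labels; uncertainty: (N,), e.g. entropy."""
    accurate = (probs.argmax(dim=1) == targets).float()
    conf = probs.max(dim=1).values
    certain = torch.tanh((u_threshold - uncertainty).clamp(min=0))    # soft "certain" indicator
    uncertain = torch.tanh((uncertainty - u_threshold).clamp(min=0))  # soft "uncertain" indicator
    n_ac = (accurate * conf * certain).sum()                 # accurate and certain (desired)
    n_au = (accurate * conf * uncertain).sum()               # accurate but uncertain
    n_ic = ((1 - accurate) * (1 - conf) * certain).sum()     # inaccurate but certain
    n_iu = ((1 - accurate) * (1 - conf) * uncertain).sum()   # inaccurate and uncertain (desired)
    avu = (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu + eps)
    return -torch.log(avu + eps)   # maximize AvU by minimizing its negative log

probs = torch.softmax(torch.randn(10, 4), dim=1)
targets = torch.randint(0, 4, (10,))
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
print(avu_loss(probs, targets, entropy).item())
```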
☆ LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones WACV 2025
Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture-wise and component-wise, is mandatory to excel in the speed-accuracy trade-off. Most publications focus on maximizing accuracy and utilize MACs (multiply accumulate operations) as an efficiency metric. The latter, however, often does not measure accurately how fast a model actually is, due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions taken from that analysis to create a recipe for increasing hardware efficiency in macro design. Additionally, we introduce a simple slimmed-down version of MultiHead Self-Attention that aligns with our analysis. We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state-of-the-art efficient backbones. In order to prove the generalizability of our hardware-efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks of object detection and semantic segmentation profit from our hardware-efficient architecture. Code and models are available at https://github.com/altair199797/LowFormer.
comment: Accepted at WACV 2025. Features 11 pages in total
☆ Non-Uniform Illumination Attack for Fooling Convolutional Neural Networks
Convolutional Neural Networks (CNNs) have made remarkable strides; however, they remain susceptible to vulnerabilities, particularly in the face of minor image perturbations that humans can easily recognize. This weakness, commonly exploited by so-called 'attacks', underscores the limited robustness of CNNs and the need for research into fortifying their resistance against such manipulations. This study introduces a novel Non-Uniform Illumination (NUI) attack technique, where images are subtly altered using varying NUI masks. Extensive experiments are conducted on widely accepted datasets including CIFAR10, TinyImageNet, and CalTech256, focusing on image classification with 12 different NUI attack models. The resilience of VGG, ResNet, MobilenetV3-small and InceptionV3 models against NUI attacks is evaluated. Our results show a substantial decline in the CNN models' classification accuracy when subjected to NUI attacks, indicating their vulnerability under non-uniform illumination. To mitigate this, a defense strategy is proposed that includes NUI-attacked images, generated through the new NUI transformation, in the training set. The results demonstrate a significant enhancement in CNN model performance when confronted with perturbed images affected by NUI attacks. This strategy seeks to bolster CNN models' resilience against NUI attacks.
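A hedged sketch of what a non-uniform illumination perturbation could look like: a smooth spatial gain mask (here a simple horizontal gradient) rescales pixel intensities, producing a subtle but spatially varying change. The actual family of NUI masks used in the paper may differ.

```python
# Illustrative sketch; the gradient mask and strength are assumptions.
import numpy as np

def nui_perturb(image: np.ndarray, strength: float = 0.15) -> np.ndarray:
    """Apply a horizontal illumination gradient to an HxWxC image in [0, 1]."""
    w = image.shape[1]
    gain = 1.0 + strength * np.linspace(-1.0, 1.0, w)[None, :, None]
    return np.clip(image * gain, 0.0, 1.0)

img = np.random.rand(32, 32, 3)
attacked = nui_perturb(img)
print(attacked.shape, float(np.abs(attacked - img).mean()))
```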
☆ LM-Gaussian: Boost Sparse-view 3D Gaussian Splatting with Large Model Priors
We aim to address sparse-view reconstruction of a 3D scene by leveraging priors from large-scale vision models. While recent advancements such as 3D Gaussian Splatting (3DGS) have demonstrated remarkable success in 3D reconstruction, these methods typically necessitate hundreds of input images that densely capture the underlying scene, making them time-consuming and impractical for real-world applications. However, sparse-view reconstruction is inherently ill-posed and under-constrained, often resulting in inferior and incomplete outcomes. This is due to issues such as failed initialization, overfitting on input images, and a lack of details. To mitigate these challenges, we introduce LM-Gaussian, a method capable of generating high-quality reconstructions from a limited number of images. Specifically, we propose a robust initialization module that leverages stereo priors to aid in the recovery of camera poses and reliable point clouds. Additionally, a diffusion-based refinement is iteratively applied to incorporate image diffusion priors into the Gaussian optimization process to preserve intricate scene details. Finally, we utilize video diffusion priors to further enhance the rendered images for realistic visual effects. Overall, our approach significantly reduces the data acquisition requirements compared to previous 3DGS methods. We validate the effectiveness of our framework through experiments on various public datasets, demonstrating its potential for high-quality 360-degree scene reconstruction. Visual results are available on our website.
comment: Project page: https://hanyangyu1021.github.io/lm-gaussian.github.io/
☆ Data-free Distillation with Degradation-prompt Diffusion for Multi-weather Image Restoration
Multi-weather image restoration has witnessed incredible progress, yet the increasing model capacity and expensive data acquisition impair its application on memory-limited devices. Data-free distillation provides an alternative that allows learning a lightweight student model from a pre-trained teacher model without relying on the original training data. Existing data-free learning methods mainly optimize the models with pseudo data generated by GANs or with real data collected from the Internet. However, they inevitably suffer from problems of unstable training or domain shifts relative to the original data. In this paper, we propose a novel Data-free Distillation with Degradation-prompt Diffusion framework for multi-weather Image Restoration (D4IR). It replaces GANs with pre-trained diffusion models to avoid model collapse and incorporates a degradation-aware prompt adapter to facilitate content-driven conditional diffusion for generating domain-related images. Specifically, a contrast-based degradation prompt adapter is first designed to capture degradation-aware prompts from web-collected degraded images. Then, the collected unpaired clean images are perturbed into latent features of Stable Diffusion and conditioned on the degradation-aware prompts to synthesize new domain-related degraded images for knowledge distillation. Experiments illustrate that our proposal achieves performance comparable to the model distilled with original training data, and is even superior to other mainstream unsupervised methods.
☆ Automatic occlusion removal from 3D maps for maritime situational awareness SP
We introduce a novel method for updating 3D geospatial models, specifically targeting occlusion removal in large-scale maritime environments. Traditional 3D reconstruction techniques often face problems with dynamic objects, like cars or vessels, that obscure the true environment, leading to inaccurate models or requiring extensive manual editing. Our approach leverages deep learning techniques, including instance segmentation and generative inpainting, to directly modify both the texture and geometry of 3D meshes without the need for costly reprocessing. By selectively targeting occluding objects and preserving static elements, the method enhances both geometric and visual accuracy. This approach not only preserves structural and textural details of map data but also maintains compatibility with current geospatial standards, ensuring robust performance across diverse datasets. The results demonstrate significant improvements in 3D model fidelity, making this method highly applicable for maritime situational awareness and the dynamic display of auxiliary information.
comment: Preprint of SPIE Sensor + Imaging 2024 conference paper
☆ Shuffle Vision Transformer: Lightweight, Fast and Efficient Recognition of Driver Facial Expression
Existing methods for driver facial expression recognition (DFER) are often computationally intensive, rendering them unsuitable for real-time applications. In this work, we introduce a novel transfer learning-based dual architecture, named ShuffViT-DFER, which elegantly combines computational efficiency and accuracy. This is achieved by harnessing the strengths of two lightweight and efficient models based on a convolutional neural network (CNN) and vision transformers (ViT). We efficiently fuse the extracted features to enhance the performance of the model in accurately recognizing the facial expressions of the driver. Our experimental results on two public benchmark datasets, KMU-FED and KDEF, highlight the validity of our proposed method for real-time applications, with superior performance compared to state-of-the-art methods.
comment: Accepted for publication in The 6th IEEE International Conference on Artificial Intelligence Circuits and Systems (IEEE AICAS 2024), 5 pages, 3 figures
☆ A Key-Driven Framework for Identity-Preserving Face Anonymization NDSS
Virtual faces are crucial content in the metaverse. Recently, attempts have been made to generate virtual faces for privacy protection. Nevertheless, these virtual faces either permanently remove the identifiable information or map the original identity into a virtual one, which loses the original identity forever. In this study, we first attempt to address the conflict between privacy and identifiability in virtual faces, where a key-driven face anonymization and authentication recognition (KFAAR) framework is proposed. Concretely, the KFAAR framework consists of a head posture-preserving virtual face generation (HPVFG) module and a key-controllable virtual face authentication (KVFA) module. The HPVFG module uses a user key to project the latent vector of the original face into a virtual one. Then it maps the virtual vectors to obtain an extended encoding, based on which the virtual face is generated. By simultaneously adding a head posture and facial expression correction module, the virtual face has the same head posture and facial expression as the original face. During the authentication, we propose a KVFA module to directly recognize the virtual faces using the correct user key, which can obtain the original identity without exposing the original face image. We also propose a multi-task learning objective to train HPVFG and KVFA. Extensive experiments demonstrate the advantages of the proposed HPVFG and KVFA modules, which effectively achieve both facial anonymity and identifiability.
comment: Accepted by NDSS Symposium 2025. Please cite this paper as "Miaomiao Wang, Guang Hua, Sheng Li, and Guorui Feng. A Key-Driven Framework for Identity-Preserving Face Anonymization. In the 32nd Annual Network and Distributed System Security Symposium (NDSS 2025)."
☆ UV-Mamba: A DCN-Enhanced State Space Model for Urban Village Boundary Identification in High-Resolution Remote Sensing Images
Owing to the diverse geographical environments, intricate landscapes, and high-density settlements, the automatic identification of urban village boundaries using remote sensing images is a highly challenging task. This paper proposes a novel and efficient neural network model called UV-Mamba for accurate boundary detection in high-resolution remote sensing images. UV-Mamba mitigates the memory loss problem in long sequence modeling, which arises in state space models (SSMs) with increasing image size, by incorporating deformable convolutions (DCN). Its architecture utilizes an encoder-decoder framework, comprising an encoder with four deformable state space augmentation (DSSA) blocks for efficient multi-level semantic extraction and a decoder to integrate the extracted semantic information. We conducted experiments on the Beijing and Xi'an datasets, and the results show that UV-Mamba achieves state-of-the-art performance. Specifically, our model achieves 73.3% and 78.1% IoU on the Beijing and Xi'an datasets, respectively, representing improvements of 1.2% and 3.4% IoU over the previous best model, while also being 6x faster in inference speed and 40x smaller in parameter count. Source code and pre-trained models are available in the supplementary material.
comment: 5 pages, 4 figures, 2 tables
☆ Weight Conditioning for Smooth Optimization of Neural Networks ECCV 2024
In this article, we introduce a novel normalization technique for neural network weight matrices, which we term weight conditioning. This approach aims to narrow the gap between the smallest and largest singular values of the weight matrices, resulting in better-conditioned matrices. The inspiration for this technique partially derives from numerical linear algebra, where well-conditioned matrices are known to facilitate stronger convergence results for iterative solvers. We provide a theoretical foundation demonstrating that our normalization technique smoothens the loss landscape, thereby enhancing convergence of stochastic gradient descent algorithms. Empirically, we validate our normalization across various neural network architectures, including Convolutional Neural Networks (CNNs), Vision Transformers (ViT), Neural Radiance Fields (NeRF), and 3D shape modeling. Our findings indicate that our normalization method is not only competitive but also outperforms existing weight normalization techniques from the literature.
comment: ECCV 2024
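A hedged sketch of one way to condition a weight matrix in the spirit described above: its singular values are pulled toward their mean, shrinking the gap between the largest and smallest ones. The interpolation factor and the operator itself are illustrative, not the paper's exact normalization.

```python
# Illustrative sketch; the interpolation toward the mean spectrum is an assumption.
import torch

def condition_weight(w: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Interpolate singular values toward their mean, reducing the condition number."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    s_conditioned = (1 - gamma) * s + gamma * s.mean()
    return u @ torch.diag(s_conditioned) @ vh

def condition_number(w: torch.Tensor) -> float:
    s = torch.linalg.svdvals(w)
    return float(s.max() / s.min())

w = torch.randn(64, 32)
print(condition_number(w), condition_number(condition_weight(w)))
```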
☆ mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
comment: 15 pages, 7 figures
☆ TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model
We propose TG-LMM (Text-Guided Large Multi-Modal Model), a novel approach that leverages textual descriptions of organs to enhance segmentation accuracy in medical images. Existing medical image segmentation methods face several challenges: current medical automatic segmentation models do not effectively utilize prior knowledge, such as descriptions of organ locations; previous text-visual models focus on identifying the target rather than improving the segmentation accuracy; prior models attempt to use prior knowledge to enhance accuracy but do not incorporate pre-trained models. To address these issues, TG-LMM integrates prior knowledge, specifically expert descriptions of the spatial locations of organs, into the segmentation process. Our model utilizes pre-trained image and text encoders to reduce the number of training parameters and accelerate the training process. Additionally, we designed a comprehensive image-text information fusion structure to ensure thorough integration of the two modalities of data. We evaluated TG-LMM on three authoritative medical image datasets, encompassing the segmentation of various parts of the human body. Our method demonstrated superior performance compared to existing approaches, such as MedSAM, SAM and nnUnet.
comment: 11 pages, 2 figures
☆ KAN See In the Dark
Existing low-light image enhancement methods struggle to fit the complex nonlinear relationship between normal and low-light images due to uneven illumination and noise effects. The recently proposed Kolmogorov-Arnold networks (KANs) feature spline-based convolutional layers and learnable activation functions, which can effectively capture nonlinear dependencies. In this paper, we design a KAN-Block based on KANs and innovatively apply it to low-light image enhancement. This method effectively alleviates the limitations of current methods constrained by linear network structures and lack of interpretability, further demonstrating the potential of KANs in low-level vision tasks. Given the poor perception of current low-light image enhancement methods and the stochastic nature of the inverse diffusion process, we further introduce frequency-domain perception for visually oriented enhancement. Extensive experiments demonstrate the competitive performance of our method on benchmark datasets. The code will be available at https://github.com/AXNing/KSID.
☆ Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression
One common belief is that, with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) the presence of significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributed to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performance of various graph-based REC methods. Without any pretraining, the proposed graph-based method achieves better performance than the state-of-the-art (SOTA) transformer-based methods.
comment: 12 pages to appear in IEEE Transactions on Multimedia
☆ TBConvL-Net: A Hybrid Deep Learning Architecture for Robust Medical Image Segmentation
Deep learning has shown great potential for automated medical image segmentation to improve the precision and speed of disease diagnostics. However, the task presents significant difficulties due to variations in the scale, shape, texture, and contrast of the pathologies. Traditional convolutional neural network (CNN) models have certain limitations when it comes to effectively modelling multiscale context information and facilitating information interaction between skip connections across levels. To overcome these limitations, a novel deep learning architecture is introduced for medical image segmentation, taking advantage of CNNs and vision transformers. Our proposed model, named TBConvL-Net, involves a hybrid network that combines the local features of a CNN encoder-decoder architecture with long-range and temporal dependencies using biconvolutional long-short-term memory (LSTM) networks and vision transformers (ViT). This enables the model to capture contextual channel relationships in the data and account for the uncertainty of segmentation over time. Additionally, we introduce a novel composite loss function that considers both the segmentation robustness and the boundary agreement of the predicted output with the gold standard. Our proposed model shows consistent improvement over the state of the art on ten publicly available datasets of seven different medical imaging modalities.
☆ MouseSIS: A Frames-and-Events Dataset for Space-Time Instance Segmentation of Mice ECCV
Enabled by large annotated datasets, tracking and segmentation of objects in videos has made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotated data for developing learning-based mask-level tracking algorithms with events is not available. To this end, we introduce: (i) a new task termed space-time instance segmentation, similar to video instance segmentation, whose goal is to segment instances throughout the entire duration of the sensor input (here, the input is quasi-continuous events and optionally aligned frames); and (ii) MouseSIS, a dataset for the new task, containing aligned grayscale frames and events. It includes annotated ground-truth labels (pixel-level instance segmentation masks) of a group of up to seven freely moving and interacting mice. We also provide two reference methods, which show that leveraging event data can consistently improve tracking performance, especially when used in combination with conventional cameras. The results highlight the potential of event-aided tracking in difficult scenarios. We hope our dataset opens the field of event-based video instance segmentation and enables the development of robust tracking algorithms for challenging conditions. Code: https://github.com/tub-rip/MouseSIS
comment: 18 pages, 5 figures, ECCV Workshops
☆ Few-Shot Continual Learning for Activity Recognition in Classroom Surveillance Images
The application of activity recognition in the "AI + Education" field is gaining increasing attention. However, current work mainly focuses on the recognition of activities in manually captured videos and a limited number of activity types, with little attention given to recognizing activities in surveillance images from real classrooms. In real classroom settings, normal teaching activities, such as reading, account for a large proportion of samples, while rare non-teaching activities, such as eating, continue to appear. This requires a model that can learn non-teaching activities from few samples without forgetting the normal teaching activities, which necessitates few-shot continual learning (FSCL) capability. To address this gap, we constructed a continual learning dataset focused on classroom surveillance image activity recognition called ARIC (Activity Recognition in Classroom). The dataset has advantages such as multiple perspectives, a wide variety of activities, and real-world scenarios, but it also presents challenges like similar activities and imbalanced sample distribution. To overcome these challenges, we designed a few-shot continual learning method that combines supervised contrastive learning (SCL) and an adaptive covariance classifier (ACC). During the base phase, we propose an SCL approach based on feature augmentation to enhance the model's generalization ability. In the incremental phase, we employ an ACC to more accurately describe the distribution of new classes. Experimental results demonstrate that our method outperforms other existing methods on the ARIC dataset.
☆ Estimating Indoor Scene Depth Maps from Ultrasonic Echoes ICIP 2024
Measuring 3D geometric structures of indoor scenes requires dedicated depth sensors, which are not always available. Echo-based depth estimation has recently been studied as a promising alternative solution. All previous studies have assumed the use of echoes in the audible range. However, one major problem is that audible echoes cannot be used in quiet spaces or other situations where producing audible sounds is prohibited. In this paper, we consider echo-based depth estimation using inaudible ultrasonic echoes. While ultrasonic waves provide high measurement accuracy in theory, the actual depth estimation accuracy when ultrasonic echoes are used has remained unclear, due to their disadvantage of being sensitive to noise and susceptible to attenuation. We first investigate the depth estimation accuracy when the frequency of the sound source is restricted to the high-frequency band, and find that the accuracy decreases when the frequency is limited to ultrasonic ranges. Based on this observation, we propose a novel deep learning method to improve the accuracy of ultrasonic echo-based depth estimation by using audible echoes as auxiliary data only during training. Experimental results with a public dataset demonstrate that our method improves the estimation accuracy.
comment: ICIP 2024
☆ Enhancing User-Centric Privacy Protection: An Interactive Framework through Diffusion Models and Machine Unlearning
In the realm of multimedia data analysis, the extensive use of image datasets has escalated concerns over privacy protection within such data. Current research predominantly focuses on privacy protection either in data sharing or upon the release of trained machine learning models. Our study pioneers a comprehensive privacy protection framework that safeguards image data privacy concurrently during data sharing and model publication. We propose an interactive image privacy protection framework that utilizes generative machine learning models to modify image information at the attribute level and employs machine unlearning algorithms for the privacy preservation of model parameters. This user-interactive framework allows for adjustments in privacy protection intensity based on user feedback on generated images, striking a balance between maximal privacy safeguarding and maintaining model performance. Within this framework, we instantiate two modules: a differential privacy diffusion model for protecting attribute information in images and a feature unlearning algorithm for efficient updates of the trained model on the revised image dataset. Our approach demonstrated superiority over existing methods on facial datasets across various attribute classifications.
☆ YOLO-PPA based Efficient Traffic Sign Detection for Cruise Control in Autonomous Driving
It is very important to detect traffic signs efficiently and accurately in autonomous driving systems. However, the farther the distance, the smaller the traffic signs appear, and existing object detection algorithms can hardly detect these small-scale signs. In addition, the performance of embedded devices on vehicles limits the scale of detection models. To address these challenges, a YOLO-PPA based traffic sign detection algorithm is proposed in this paper. The experimental results on the GTSDB dataset show that, compared to the original YOLO, the proposed method improves inference efficiency by 11.2%. The mAP 50 is also improved by 93.2%, which demonstrates the effectiveness of the proposed YOLO-PPA.
☆ Improving Robustness to Multiple Spurious Correlations by Multi-Objective Optimization
We study the problem of training an unbiased and accurate model given a dataset with multiple biases. This problem is challenging since the multiple biases cause multiple undesirable shortcuts during training, and even worse, mitigating one may exacerbate the other. We propose a novel training method to tackle this challenge. Our method first groups training data so that different groups induce different shortcuts, and then optimizes a linear combination of group-wise losses while adjusting their weights dynamically to alleviate conflicts between the groups in performance; this approach, rooted in multi-objective optimization theory, encourages achieving the minimax Pareto solution. We also present a new benchmark with multiple biases, dubbed MultiCelebA, for evaluating debiased training methods under realistic and challenging scenarios. Our method achieved the best performance on three datasets with multiple biases, and also showed superior performance on conventional single-bias datasets.
comment: International Conference on Machine Learning 2024
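A hedged sketch of dynamically weighting group-wise losses: groups whose current loss is high receive larger weights, nudging training toward a balanced trade-off across groups. The softmax weighting shown here only illustrates the idea; the paper's multi-objective update rule may differ.

```python
# Illustrative sketch; the softmax reweighting is an assumption, not the paper's exact rule.
import torch

def weighted_group_loss(per_sample_loss, group_ids, num_groups, temperature=1.0):
    """Combine group-mean losses with weights that grow with the current group loss."""
    group_losses = torch.stack([
        per_sample_loss[group_ids == g].mean() if (group_ids == g).any()
        else per_sample_loss.new_zeros(())
        for g in range(num_groups)
    ])
    weights = torch.softmax(group_losses.detach() / temperature, dim=0)  # no gradient through weights
    return (weights * group_losses).sum(), group_losses

per_sample_loss = torch.rand(32)           # would come from the task loss in practice
group_ids = torch.randint(0, 4, (32,))     # group assignment inducing different shortcuts
total, group_losses = weighted_group_loss(per_sample_loss, group_ids, 4)
print(float(total), group_losses)
```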
☆ ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding
Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, the application of alignment training within the chart domain is still underexplored. To address this, we propose ChartMoE, which employs a mixture-of-experts (MoE) architecture to replace the traditional linear projector used to bridge the modality gap. Specifically, we train multiple linear connectors through distinct alignment tasks, which are utilized as the foundational initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with over 900K chart-table-JSON-code quadruples to conduct three alignment tasks (chart-table/JSON/code). Combined with the vanilla connector, we initialize different experts in four distinct ways and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy, e.g., ChartMoE improves the accuracy of the previous state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.
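A hedged sketch of a mixture-of-experts connector of the kind described above: several linear projectors map visual features into the language model's embedding space and a lightweight router mixes their outputs per token. The dimensions, the number of experts, and the soft routing are assumptions, not ChartMoE's exact design.

```python
# Illustrative sketch; dimensions, expert count, and soft routing are assumptions.
import torch
import torch.nn as nn

class MoEConnector(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(vis_dim, llm_dim) for _ in range(num_experts))
        self.router = nn.Linear(vis_dim, num_experts)

    def forward(self, visual_tokens):                                   # (B, T, vis_dim)
        gates = torch.softmax(self.router(visual_tokens), dim=-1)       # (B, T, E)
        expert_out = torch.stack([e(visual_tokens) for e in self.experts], dim=-2)  # (B, T, E, llm_dim)
        return (gates.unsqueeze(-1) * expert_out).sum(dim=-2)           # (B, T, llm_dim)

connector = MoEConnector()
tokens = torch.randn(2, 16, 1024)
print(connector(tokens).shape)   # torch.Size([2, 16, 4096])
```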
☆ OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving
The rise of multi-modal large language models (MLLMs) has spurred their applications in autonomous driving. Recent MLLM-based methods perform actions by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess a world model that enables them to simulate future states based on 3D internal visual representations and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy-language-action generative world model, which uses semantic occupancy as a general visual representation and unifies vision-language-action (VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE-like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, considering their sparsity and class imbalance. Then, we build a unified multi-modal vocabulary for vision, language and action. Furthermore, we enhance an LLM, specifically LLaMA, to perform next token/scene prediction on the unified vocabulary to complete multiple tasks in autonomous driving. Extensive experiments demonstrate that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, showcasing its potential as a foundation model in autonomous driving.
☆ SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model
Talking Head Generation (THG), typically driven by audio, is an important and challenging task with broad application prospects in various fields such as digital humans, film production, and virtual reality. While diffusion model-based THG methods present high quality and stable content generation, they often overlook the intrinsic style, which encompasses personalized features such as speaking habits and facial expressions of a video. As a consequence, the generated video content lacks diversity and vividness, thus being limited in real-life scenarios. To address these issues, we propose a novel framework named Style-Enhanced Vivid Portrait (SVP), which fully leverages style-related information in THG. Specifically, we first introduce novel probabilistic style prior learning to model the intrinsic style as a Gaussian distribution using facial expressions and audio embeddings. The distribution is learned through a bespoke contrastive objective, effectively capturing the dynamic style information in each video. Then we finetune a pretrained Stable Diffusion (SD) model to inject the learned intrinsic style as a controlling signal via cross attention. Experiments show that our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
☆ Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision ECCV 2024
Recent advances in interactive keypoint estimation methods have enhanced accuracy while minimizing user intervention. However, these methods require user input for error correction, which can be costly in vertebrae keypoint estimation where inaccurate keypoints are densely clustered or overlap. We introduce a novel approach, KeyBot, specifically designed to identify and correct significant and typical errors in existing models, akin to user revision. By characterizing typical error types and using simulated errors for training, KeyBot effectively corrects these errors and significantly reduces user workload. Comprehensive quantitative and qualitative evaluations on three public datasets confirm that KeyBot significantly outperforms existing methods, achieving state-of-the-art performance in interactive vertebrae keypoint estimation. The source code and demo video are available at: https://ts-kim.github.io/KeyBot/
comment: 33 pages, ECCV 2024, Project Page: https://ts-kim.github.io/KeyBot/
☆ Granular-ball Representation Learning for Deep CNN on Learning with Label Noise
In practical scenarios, whether data are manually or automatically annotated, label noise is inevitably present in the training data, which can degrade the performance of deep CNN models. Popular solutions require data cleaning or additional optimization objectives that penalize mislabeled samples, thereby enhancing model robustness. However, these methods come at the cost of weakening or even discarding some data during training. Content is an inherent attribute of an image that does not change with its annotation. In this study, we propose a general granular-ball computing (GBC) module that can be embedded into a CNN model, where the classifier predicts the label of granular-ball ($gb$) samples instead of each individual sample. Specifically, for the classification task: (1) in the forward pass, we split the input samples into $gb$ samples at the feature level, each of which can correspond to a varying number of samples sharing a single label; (2) during backpropagation, we modify the gradient allocation strategy of the GBC module so that gradients propagate normally; and (3) we develop an experience replay policy to ensure the stability of the training process. Experiments demonstrate that the proposed method improves the robustness of CNN models without additional data or optimization.
☆ Gr-IoU: Ground-Intersection over Union for Robust Multi-Object Tracking with 3D Geometric Constraints ECCV 2024
We propose a Ground IoU (Gr-IoU) to address the data association problem in multi-object tracking. When tracking objects detected by a camera, it often occurs that the same object is assigned different IDs in consecutive frames, especially when objects are close to each other or overlapping. To address this issue, we introduce Gr-IoU, which takes into account the 3D structure of the scene. Gr-IoU transforms traditional bounding boxes from the image space to the ground plane using the vanishing point geometry. The IoU calculated with these transformed bounding boxes is more sensitive to the front-to-back relationships of objects, thereby improving data association accuracy and reducing ID switches. We evaluated our Gr-IoU method on the MOT17 and MOT20 datasets, which contain diverse tracking scenarios including crowded scenes and sequences with frequent occlusions. Experimental results demonstrated that Gr-IoU outperforms conventional real-time methods without appearance features.
comment: Accepted for the ECCV 2024 Workshop on Affective Behavior Analysis in-the-Wild (ABAW)
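To make the Gr-IoU idea concrete, the sketch below projects image-space boxes onto the ground plane with a 3x3 homography and measures IoU between the resulting quadrilaterals using Shapely. The homography (in the paper, derived from vanishing-point geometry) and the use of all four box corners are assumptions of this illustration.

```python
# Hedged sketch: IoU of two detections after projecting their image-space boxes
# onto the ground plane with a known image-to-ground homography H (3x3).
import numpy as np
from shapely.geometry import Polygon

def project(points_xy, H):
    """Apply a homography to Nx2 image points, returning Nx2 ground points."""
    pts = np.hstack([points_xy, np.ones((len(points_xy), 1))])  # homogeneous coords
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def ground_iou(box_a, box_b, H):
    """box_* = (x1, y1, x2, y2) in image coordinates."""
    def corners(b):
        x1, y1, x2, y2 = b
        return np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=float)

    poly_a = Polygon(project(corners(box_a), H))
    poly_b = Polygon(project(corners(box_b), H))
    inter = poly_a.intersection(poly_b).area
    union = poly_a.union(poly_b).area
    return inter / union if union > 0 else 0.0

H = np.eye(3)  # identity homography just for the toy example
print(ground_iou((0, 0, 10, 10), (5, 5, 15, 15), H))  # ~0.143
```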
☆ Multiple weather images restoration using the task transformer and adaptive mixup strategy
The current state-of-the-art in severe weather removal predominantly focuses on single-task applications, such as rain removal, haze removal, and snow removal. However, real-world weather conditions often consist of a mixture of several weather types, and the degree of weather mixing in autonomous driving scenarios remains unknown. In the presence of complex and diverse weather conditions, a single weather removal model often encounters challenges in producing clear images from severe weather images. Therefore, there is a need for the development of multi-task severe weather removal models that can effectively handle mixed weather conditions and improve image quality in autonomous driving scenarios. In this paper, we introduce a novel multi-task severe weather removal model that can effectively handle complex weather conditions in an adaptive manner. Our model incorporates a weather task sequence generator, enabling the self-attention mechanism to selectively focus on features specific to different weather types. To tackle the challenge of repairing large areas of weather degradation, we introduce Fast Fourier Convolution (FFC) to increase the receptive field. Additionally, we propose an adaptive upsampling technique that effectively processes both the weather task information and underlying image features by selectively retaining relevant information. Our proposed model has achieved state-of-the-art performance on the publicly available dataset.
comment: 10 pages, 5 figures and 2 tables
☆ UAV (Unmanned Aerial Vehicles): Diverse Applications of UAV Datasets in Segmentation, Classification, Detection, and Tracking
Unmanned Aerial Vehicles (UAVs) have greatly revolutionized the process of gathering and analyzing data in diverse research domains, providing unmatched adaptability and effectiveness. This paper presents a thorough examination of UAV datasets, emphasizing their wide range of applications and progress. UAV datasets consist of various types of data, such as satellite imagery, images captured by drones, and videos. These datasets can be categorized as either unimodal or multimodal, offering a wide range of detailed and comprehensive information. These datasets play a crucial role in disaster damage assessment, aerial surveillance, object recognition, and tracking. They facilitate the development of sophisticated models for tasks like semantic segmentation, pose estimation, vehicle re-identification, and gesture recognition. By leveraging UAV datasets, researchers can significantly enhance the capabilities of computer vision models, thereby advancing technology and improving our understanding of complex, dynamic environments from an aerial perspective. This review aims to encapsulate the multifaceted utility of UAV datasets, emphasizing their pivotal role in driving innovation and practical applications in multiple domains.
☆ Unveiling Context-Related Anomalies: Knowledge Graph Empowered Decoupling of Scene and Action for Human-Related Video Anomaly Detection
Detecting anomalies in human-related videos is crucial for surveillance applications. Current methods primarily include appearance-based and action-based techniques. Appearance-based methods rely on low-level visual features such as color, texture, and shape. They learn a large number of pixel patterns and features related to known scenes during training, making them effective in detecting anomalies within these familiar contexts. However, when encountering new or significantly changed scenes, i.e., unknown scenes, they often fail because existing SOTA methods do not effectively capture the relationship between actions and their surrounding scenes, resulting in low generalization. In contrast, action-based methods focus on detecting anomalies in human actions but are usually less informative because they tend to overlook the relationship between actions and their scenes, leading to incorrect detection. For instance, the normal event of running on the beach and the abnormal event of running on the street might both be considered normal due to the lack of scene information. In short, current methods struggle to integrate low-level visual and high-level action features, leading to poor anomaly detection in varied and complex scenes. To address this challenge, we propose a novel decoupling-based architecture for human-related video anomaly detection (DecoAD). DecoAD significantly improves the integration of visual and action features through the decoupling and interweaving of scenes and actions, thereby enabling a more intuitive and accurate understanding of complex behaviors and scenes. DecoAD supports fully supervised, weakly supervised, and unsupervised settings.
comment: 13 pages, 9 figures
☆ Labeled-to-Unlabeled Distribution Alignment for Partially-Supervised Multi-Organ Medical Image Segmentation
Partially-supervised multi-organ medical image segmentation aims to develop a unified semantic segmentation model by utilizing multiple partially-labeled datasets, with each dataset providing labels for a single class of organs. However, the limited availability of labeled foreground organs and the absence of supervision to distinguish unlabeled foreground organs from the background pose a significant challenge, which leads to a distribution mismatch between labeled and unlabeled pixels. Although existing pseudo-labeling methods can be employed to learn from both labeled and unlabeled pixels, they are prone to performance degradation in this task, as they rely on the assumption that labeled and unlabeled pixels have the same distribution. In this paper, to address the problem of distribution mismatch, we propose a labeled-to-unlabeled distribution alignment (LTUDA) framework that aligns feature distributions and enhances discriminative capability. Specifically, we introduce a cross-set data augmentation strategy, which performs region-level mixing between labeled and unlabeled organs to reduce distribution discrepancy and enrich the training set. Besides, we propose a prototype-based distribution alignment method that implicitly reduces intra-class variation and increases the separation between the unlabeled foreground and background. This can be achieved by encouraging consistency between the outputs of two prototype classifiers and a linear classifier. Extensive experimental results on the AbdomenCT-1K dataset and a union of four benchmark datasets (including LiTS, MSD-Spleen, KiTS, and NIH82) demonstrate that our method outperforms the state-of-the-art partially-supervised methods by a considerable margin, and even surpasses the fully-supervised methods. The source code is publicly available at https://github.com/xjiangmed/LTUDA.
comment: Accepted by Medical Image Analysis
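A minimal sketch of the cross-set, region-level mixing idea (CutMix-style) between a labeled and an unlabeled scan is given below; LTUDA's actual region selection and augmentation schedule may differ.

```python
# Hedged sketch of cross-set region-level mixing between a labeled and an
# unlabeled image; the exact mixing rule in LTUDA may differ from this.
import torch

def cross_set_mix(x_labeled, y_labeled, x_unlabeled, y_pseudo, ratio=0.5):
    """Paste a random rectangular region of the labeled scan into the unlabeled one.
    x_*: (C, H, W) tensors, y_*: (H, W) label / pseudo-label maps."""
    _, H, W = x_labeled.shape
    h, w = int(H * ratio), int(W * ratio)
    top = torch.randint(0, H - h + 1, (1,)).item()
    left = torch.randint(0, W - w + 1, (1,)).item()

    x_mixed = x_unlabeled.clone()
    y_mixed = y_pseudo.clone()
    x_mixed[:, top:top + h, left:left + w] = x_labeled[:, top:top + h, left:left + w]
    y_mixed[top:top + h, left:left + w] = y_labeled[top:top + h, left:left + w]
    return x_mixed, y_mixed
```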
☆ Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion
Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias and static parameters during inference (CNN) or limited by quadratic computational complexity (Transformers), and cannot effectively extract and fuse features. To solve this problem, we propose a dual-branch image fusion network called Tmamba. It consists of linear Transformer and Mamba, which has global modeling capabilities while maintaining linear complexity. Due to the difference between the Transformer and Mamba structures, the features extracted by the two branches carry channel and position information respectively. T-M interaction structure is designed between the two branches, using global learnable parameters and convolutional layers to transfer position and channel information respectively. We further propose cross-modal interaction at the attention level to obtain cross-modal attention. Experiments show that our Tmamba achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. Code with checkpoints will be available after the peer-review process.
☆ Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction
3D Gaussian Splatting (3DGS) has emerged as a promising approach for 3D scene representation, offering a reduction in computational overhead compared to Neural Radiance Fields (NeRF). However, 3DGS is susceptible to high-frequency artifacts and demonstrates suboptimal performance under sparse viewpoint conditions, thereby limiting its applicability in robotics and computer vision. To address these limitations, we introduce SVS-GS, a novel framework for Sparse Viewpoint Scene reconstruction that integrates a 3D Gaussian smoothing filter to suppress artifacts. Furthermore, our approach incorporates a Depth Gradient Profile Prior (DGPP) loss with a dynamic depth mask to sharpen edges and 2D diffusion with Score Distillation Sampling (SDS) loss to enhance geometric consistency in novel view synthesis. Experimental evaluations on the MipNeRF-360 and SeaThru-NeRF datasets demonstrate that SVS-GS markedly improves 3D reconstruction from sparse viewpoints, offering a robust and efficient solution for scene understanding in robotics and computer vision applications.
☆ Bi-capacity Choquet Integral for Sensor Fusion with Label Uncertainty
Sensor fusion combines data from multiple sensor sources to improve reliability, robustness, and accuracy of data interpretation. The Fuzzy Integral (FI), in particular, the Choquet integral (ChI), is often used as a powerful nonlinear aggregator for fusion across multiple sensors. However, existing supervised ChI learning algorithms typically require precise training labels for each input data point, which can be difficult or impossible to obtain. Additionally, prior work on ChI fusion is often based only on the normalized fuzzy measures, which bounds the fuzzy measure values between [0, 1]. This can be limiting in cases where the underlying scales of input data sources are bipolar (i.e., between [-1, 1]). To address these challenges, this paper proposes a novel Choquet integral-based fusion framework, named Bi-MIChI (pronounced "bi-mi-kee"), which uses bi-capacities to represent the interactions between pairs of subsets of the input sensor sources on a bi-polar scale. This allows for extended non-linear interactions between the sensor sources and can lead to interesting fusion results. Bi-MIChI also addresses label uncertainty through Multiple Instance Learning, where training labels are applied to "bags" (sets) of data instead of per-instance. Our proposed Bi-MIChI framework shows effective classification and detection performance on both synthetic and real-world experiments for sensor fusion with label uncertainty. We also provide detailed analyses on the behavior of the fuzzy measures to demonstrate our fusion process.
comment: 10 pages, 7 figures, 7 tables; Accepted to 2024 FUZZ-IEEE and presented at 2024 IEEE WCCI; Code available at https://github.com/hvak/Bi-MIChI
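For readers unfamiliar with Choquet-integral fusion, the following sketch computes a standard discrete Choquet integral over per-source confidences with a normalized fuzzy measure. Bi-MIChI generalizes this to bi-capacities on a bipolar [-1, 1] scale and to bag-level (MIL) labels, neither of which this toy example covers.

```python
# Hedged sketch of a standard discrete Choquet integral for fusing sensor
# confidences with a fuzzy measure g defined on subsets of the sources.
def choquet_integral(h, g):
    """h: dict source -> confidence in [0, 1];
    g: dict frozenset -> measure, with g[empty set] = 0 and g[all sources] = 1."""
    sources = sorted(h, key=h.get)               # sort sources by ascending confidence
    result, prev = 0.0, 0.0
    for i, s in enumerate(sources):
        remaining = frozenset(sources[i:])       # sources with confidence >= h[s]
        result += (h[s] - prev) * g[remaining]
        prev = h[s]
    return result

# Example with two sources and a measure that rewards their agreement:
h = {"radar": 0.7, "lidar": 0.4}
g = {frozenset(): 0.0, frozenset({"radar"}): 0.6,
     frozenset({"lidar"}): 0.3, frozenset({"radar", "lidar"}): 1.0}
print(choquet_integral(h, g))  # 0.4 * 1.0 + (0.7 - 0.4) * 0.6 = 0.58
```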
☆ iSeg: An Iterative Refinement-based Framework for Training-free Segmentation
Stable diffusion has demonstrated strong image synthesis ability given text descriptions, suggesting it contains strong semantic cues for grouping objects. Inspired by this, researchers have explored employing stable diffusion for training-free segmentation. Most existing approaches either simply employ the cross-attention map or refine it with the self-attention map to generate segmentation masks. We believe that iterative refinement with the self-attention map would lead to better results. However, we empirically demonstrate that such refinement is sub-optimal, likely because the self-attention map contains irrelevant global information that hampers accurately refining the cross-attention map over multiple iterations. To address this, we propose an iterative refinement framework for training-free segmentation, named iSeg, with an entropy-reduced self-attention module that uses a gradient descent scheme to reduce the entropy of the self-attention map, thereby suppressing the weak responses corresponding to irrelevant global information. Leveraging the entropy-reduced self-attention module, our iSeg stably improves the refined cross-attention map through iterative refinement. Further, we design a category-enhanced cross-attention module to generate an accurate cross-attention map, providing a better initial input for iterative refinement. Extensive experiments across different datasets and diverse segmentation tasks reveal the merits of the proposed contributions, leading to promising performance. For unsupervised semantic segmentation on Cityscapes, our iSeg achieves an absolute gain of 3.8% mIoU over the best existing training-free approach in the literature. Moreover, iSeg supports segmentation with different kinds of images and interactions.
☆ TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
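The Frame-wise Block Causal Attention Mask can be pictured with the short sketch below: visual tokens within the same frame attend to each other bidirectionally, while attention across frames remains causal. TC-LLaVA's full mask, including how text tokens are handled, may differ in detail.

```python
# Hedged sketch of a frame-wise block causal attention mask for visual tokens:
# tokens of the same frame attend to each other bidirectionally, and tokens may
# otherwise attend only to earlier frames.
import torch

def frame_block_causal_mask(num_frames, tokens_per_frame):
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame   # frame index of each token
    # allowed[i, j] is True when token i may attend to token j
    allowed = frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)
    return allowed                                    # (n, n) boolean mask

mask = frame_block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())   # 2x2 blocks of ones along and below the block diagonal
```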
☆ Active Fake: DeepFake Camouflage
DeepFake technology has gained significant attention due to its ability to manipulate facial attributes with high realism, raising serious societal concerns. Face-Swap DeepFake is the most harmful among these techniques, fabricating behaviors by swapping original faces with synthesized ones. Existing forensic methods, primarily based on Deep Neural Networks (DNNs), effectively expose these manipulations and have become important authenticity indicators. However, these methods mainly concentrate on capturing the blending inconsistency in DeepFake faces. This raises a new security issue, termed Active Fake, which emerges when individuals intentionally create blending inconsistencies in their authentic videos to evade responsibility; we call this tactic DeepFake Camouflage. To realize it, we introduce a new framework for creating DeepFake camouflage that generates blending inconsistencies while ensuring imperceptibility, effectiveness, and transferability. This framework, optimized via an adversarial learning strategy, crafts imperceptible yet effective inconsistencies to mislead forensic detectors. Extensive experiments demonstrate the effectiveness and robustness of our method, highlighting the need for further research in active fake detection.
☆ RoomDiffusion: A Specialized Diffusion Model in the Interior Design Industry
Recent advancements in text-to-image diffusion models have significantly transformed visual content generation, yet their application in specialized fields such as interior design remains underexplored. In this paper, we present RoomDiffusion, a pioneering diffusion model meticulously tailored for the interior design industry. To begin with, we build from scratch a complete data pipeline to update and evaluate data for iterative model optimization. Subsequently, techniques such as multi-aspect training, multi-stage fine-tuning, and model fusion are applied to enhance both the visual appeal and precision of the generated results. Lastly, leveraging Latent Consistency Distillation, we distill and accelerate the model for optimal efficiency. Unlike existing models optimized for general scenarios, RoomDiffusion addresses specific challenges in interior design, such as a lack of fashionable styles, a high furniture duplication rate, and inaccurate style. Through our holistic human evaluation protocol with more than 20 professional human evaluators, RoomDiffusion demonstrates industry-leading performance in terms of aesthetics, accuracy, and efficiency, surpassing all existing open-source models such as Stable Diffusion and SDXL.
☆ PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning
Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce a Precision-Enhanced Pseudo-Labeling (PEPL) approach specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness. Our code has been open-sourced at https://github.com/TianSuya/SemiFG.
comment: Under review
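As a rough illustration of the CAM-based ingredient, the sketch below computes a class activation map from the last convolutional feature map and the classifier weights, the kind of semantic evidence used to refine pseudo-labels; PEPL's semantic-mixed second phase is not shown.

```python
# Hedged sketch: class activation map (CAM) for an unlabeled image, used to
# gauge which regions support the predicted pseudo-label.
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """features: (C, H, W) last conv feature map; fc_weight: (num_classes, C)."""
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    cam = F.relu(cam)
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)          # normalized to [0, 1]

features = torch.randn(512, 14, 14)
fc_weight = torch.randn(200, 512)            # e.g., 200 fine-grained classes
print(class_activation_map(features, fc_weight, class_idx=3).shape)  # (14, 14)
```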
☆ Perceptual-Distortion Balanced Image Super-Resolution is a Multi-Objective Optimization Problem
Training Single-Image Super-Resolution (SISR) models using pixel-based regression losses can achieve high distortion metrics scores (e.g., PSNR and SSIM), but often results in blurry images due to insufficient recovery of high-frequency details. Conversely, using GAN or perceptual losses can produce sharp images with high perceptual metric scores (e.g., LPIPS), but may introduce artifacts and incorrect textures. Balancing these two types of losses can help achieve a trade-off between distortion and perception, but the challenge lies in tuning the loss function weights. To address this issue, we propose a novel method that incorporates Multi-Objective Optimization (MOO) into the training process of SISR models to balance perceptual quality and distortion. We conceptualize the relationship between loss weights and image quality assessment (IQA) metrics as black-box objective functions to be optimized within our Multi-Objective Bayesian Optimization Super-Resolution (MOBOSR) framework. This approach automates the hyperparameter tuning process, reduces overall computational cost, and enables the use of numerous loss functions simultaneously. Extensive experiments demonstrate that MOBOSR outperforms state-of-the-art methods in terms of both perceptual quality and distortion, significantly advancing the perception-distortion Pareto frontier. Our work points towards a new direction for future research on balancing perceptual quality and fidelity in nearly all image restoration tasks. The source code and pretrained models are available at: https://github.com/ZhuKeven/MOBOSR.
♻ ☆ Segment Beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation AAAI-24
Augmented Reality (AR) devices, emerging as prominent mobile interaction platforms, face challenges in user safety, particularly concerning oncoming vehicles. While some solutions leverage onboard camera arrays, these cameras often have a limited field-of-view (FoV) with front or downward perspectives. Addressing this, we propose a new out-of-view semantic segmentation task and Segment Beyond View (SBV), a novel audio-visual semantic segmentation method. SBV supplements the visual modality, which misses information beyond the FoV, with auditory information using a teacher-student distillation model (Omni2Ego). The model consists of a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes views with a limited FoV and binaural audio as input and produces semantic segmentation for objects outside the FoV. SBV outperforms existing models in comparative evaluations and shows consistent performance across varying FoV ranges and in monaural audio settings.
comment: AAAI-24 (fixed some errors)
♻ ☆ Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation ECCV 2024
We present Mesh2NeRF, an approach to derive ground-truth radiance fields from textured meshes for 3D generation tasks. Many 3D generative approaches represent 3D scenes as radiance fields for training. Their ground-truth radiance fields are usually fitted from multi-view renderings of a large-scale synthetic 3D dataset, which often results in artifacts due to occlusions or under-fitting issues. In Mesh2NeRF, we propose an analytic solution to directly obtain ground-truth radiance fields from 3D meshes, characterizing the density field with an occupancy function featuring a defined surface thickness, and determining view-dependent color through a reflection function that considers both the mesh and environment lighting. Mesh2NeRF extracts accurate radiance fields, which provide direct supervision for training generative NeRFs and single-scene representation. We validate the effectiveness of Mesh2NeRF across various tasks, achieving a noteworthy 3.12 dB improvement in PSNR for view synthesis in single-scene representation on the ABO dataset, a 0.69 PSNR enhancement in the single-view conditional generation of ShapeNet Cars, and notably improved mesh extraction from NeRF in the unconditional generation of Objaverse Mugs.
comment: Accepted to ECCV 2024, Project page: https://terencecyj.github.io/projects/Mesh2NeRF/ Video: https://youtu.be/SsFkhSuQYGM
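A minimal sketch of an occupancy-style density with a defined surface thickness is shown below: query points within a thin shell around the mesh surface receive a constant density, and zero otherwise. Mesh2NeRF's actual analytic formulation, including its reflection-based view-dependent color, is richer than this toy.

```python
# Hedged sketch of an occupancy-style density field with a defined surface
# thickness: a point gets a constant density when it lies within tau/2 of the
# mesh surface and zero otherwise. The constants here are placeholders.
import numpy as np

def shell_density(signed_distance, tau=0.01, sigma_max=1000.0):
    """signed_distance: (N,) distances of query points to the mesh surface."""
    inside_shell = np.abs(signed_distance) <= tau / 2.0
    return np.where(inside_shell, sigma_max, 0.0)

d = np.array([-0.02, -0.004, 0.0, 0.003, 0.05])
print(shell_density(d))   # only the points within the thin shell get density
```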
♻ ☆ UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
The paper introduces the UniMER dataset, marking the first study on Mathematical Expression Recognition (MER) targeting complex real-world scenarios. The UniMER dataset includes a large-scale training set, UniMER-1M, which offers unprecedented scale and diversity with one million training instances to train high-quality, robust models. Additionally, UniMER features a meticulously designed, diverse test set, UniMER-Test, which covers a variety of formula distributions found in real-world scenarios, providing a more comprehensive and fair evaluation. To better utilize the UniMER dataset, the paper proposes a Universal Mathematical Expression Recognition Network (UniMERNet), tailored to the characteristics of formula recognition. UniMERNet consists of a carefully designed encoder that incorporates detail-aware and local context features, and an optimized decoder for accelerated performance. Extensive experiments conducted using the UniMER-1M dataset and UniMERNet demonstrate that training on the large-scale UniMER-1M dataset can produce a more generalizable formula recognition model, significantly outperforming all previous datasets. Furthermore, the introduction of UniMERNet enhances the model's performance in formula recognition, achieving higher accuracy and speeds. All data, models, and code are available at https://github.com/opendatalab/UniMERNet.
comment: Project Website: https://github.com/opendatalab/UniMERNet
♻ ☆ FDNet: Feature Decoupled Segmentation Network for Tooth CBCT Image
Precise Tooth Cone Beam Computed Tomography (CBCT) image segmentation is crucial for orthodontic treatment planning. In this paper, we propose FDNet, a Feature Decoupled Segmentation Network, to excel in the face of the variable dental conditions encountered in CBCT scans, such as complex artifacts and indistinct tooth boundaries. The Low-Frequency Wavelet Transform (LF-Wavelet) is employed to enrich the semantic content by emphasizing the global structural integrity of the teeth, while the SAM encoder is leveraged to refine the boundary delineation, thus improving the contrast between adjacent dental structures. By integrating these dual aspects, FDNet adeptly addresses the semantic gap, providing a detailed and accurate segmentation. The framework's effectiveness is validated through rigorous benchmarks, achieving the top Dice and IoU scores of 85.28% and 75.23%, respectively. This innovative decoupling of semantic and boundary features capitalizes on the unique strengths of each element to elevate the quality of segmentation performance.
comment: IEEE ISBI 2024, Oral
♻ ☆ Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and 10+ machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at \url{https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications}.
♻ ☆ HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding ACM MM 2024
Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works have utilized uni-modal pre-trained models to transfer visual or linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge addresses the inconsistency between visual features and those required for grounding, and establishes a connection between multi-level visual and text features. HiLoRA prevents the accumulation of perceptual errors by adapting cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG.
comment: Accepted by ACM MM 2024. The project page: https://github.com/linhuixiao/HiVG
♻ ☆ Triple-domain Feature Learning with Frequency-aware Memory Enhancement for Moving Infrared Small Target Detection
As a sub-field of object detection, moving infrared small target detection presents significant challenges due to tiny target sizes and low contrast against backgrounds. Existing methods primarily rely on features extracted from the spatio-temporal domain; the frequency domain has hardly been explored, although it is widely applied in image processing. To extend the feature source domains and enhance feature representation, we propose a new Triple-domain Strategy (Tridos) with frequency-aware memory enhancement in the spatio-temporal domain for infrared small target detection. The scheme effectively detaches and enhances frequency features through a local-global frequency-aware module with Fourier transforms. Inspired by the human visual system, our memory enhancement is designed to capture the spatial relations of infrared targets among video frames. Furthermore, it encodes temporal dynamic motion features via differential learning and residual enhancement. Additionally, we design a residual compensation to reconcile possible cross-domain feature mismatches. To the best of our knowledge, the proposed Tridos is the first work to comprehensively explore infrared target feature learning across the spatio-temporal-frequency domains. Extensive experiments on three datasets (i.e., DAUB, ITSDT-15K, and IRDST) validate that our triple-domain infrared feature learning scheme is often clearly superior to state-of-the-art ones. Source code is available at https://github.com/UESTC-nnLab/Tridos.
comment: This paper has been accepted by IEEE TGRS
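For intuition about the frequency-domain features, the following sketch extracts amplitude and phase spectra from infrared frames with a 2D FFT; Tridos's local-global frequency-aware module builds considerably more structure on top of such features.

```python
# Hedged sketch: amplitude / phase spectra of infrared frames via a 2D FFT,
# the kind of frequency-domain features detached and enhanced in Tridos.
import torch

def frequency_features(frames):
    """frames: (B, C, H, W) tensor."""
    spectrum = torch.fft.fft2(frames, norm="ortho")
    amplitude = torch.abs(spectrum)
    phase = torch.angle(spectrum)
    return amplitude, phase

amp, pha = frequency_features(torch.randn(2, 1, 64, 64))
print(amp.shape, pha.shape)   # torch.Size([2, 1, 64, 64]) each
```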
♻ ☆ GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns ECCV 2024
Recent research interest in the learning-based processing of garments, from virtual fitting to generation and reconstruction, stumbles on a scarcity of high-quality public data in the domain. We contribute to resolving this need by presenting the first large-scale synthetic dataset of 3D made-to-measure garments with sewing patterns, as well as its generation pipeline. GarmentCodeData contains 115,000 data points that cover a variety of designs in many common garment categories: tops, shirts, dresses, jumpsuits, skirts, pants, etc., fitted to a variety of body shapes sampled from a custom statistical body model based on CAESAR, as well as a standard reference body shape, applying three different textile materials. To enable the creation of datasets of such complexity, we introduce a set of algorithms for automatically taking tailor's measures on sampled body shapes, sampling strategies for sewing pattern design, and propose an automatic, open-source 3D garment draping pipeline based on a fast XPBD simulator, while contributing several solutions for collision resolution and drape correctness to enable scalability. Project Page: https://igl.ethz.ch/projects/GarmentCodeData/
comment: Accepted to ECCV 2024. Sept 4th, 2024: release of GarmentCodeData(v2)
♻ ☆ EaDeblur-GS: Event assisted 3D Deblur Reconstruction with Gaussian Splatting
3D deblurring reconstruction techniques have recently seen significant advancements with the development of Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Although these techniques can recover relatively clear 3D reconstructions from blurry image inputs, they still face limitations in handling severe blurring and complex camera motion. To address these issues, we propose Event-assisted 3D Deblur Reconstruction with Gaussian Splatting (EaDeblur-GS), which integrates event camera data to enhance the robustness of 3DGS against motion blur. By employing an Adaptive Deviation Estimator (ADE) network to estimate Gaussian center deviations and using novel loss functions, EaDeblur-GS achieves sharp 3D reconstructions in real-time, demonstrating performance comparable to state-of-the-art methods.
♻ ☆ Shapley Values-enabled Progressive Pseudo Bag Augmentation for Whole Slide Image Classification
In computational pathology, whole-slide image (WSI) classification presents a formidable challenge due to its gigapixel resolution and limited fine-grained annotations. Multiple-instance learning (MIL) offers a weakly supervised solution, yet refining instance-level information from bag-level labels remains challenging. While most of the conventional MIL methods use attention scores to estimate instance importance scores (IIS) which contribute to the prediction of the slide labels, these often lead to skewed attention distributions and inaccuracies in identifying crucial instances. To address these issues, we propose a new approach inspired by cooperative game theory: employing Shapley values to assess each instance's contribution, thereby improving IIS estimation. The computation of the Shapley value is then accelerated using attention, meanwhile retaining the enhanced instance identification and prioritization. We further introduce a framework for the progressive assignment of pseudo bags based on estimated IIS, encouraging more balanced attention distributions in MIL models. Our extensive experiments on CAMELYON-16, BRACS, TCGA-LUNG, and TCGA-BRCA datasets show our method's superiority over existing state-of-the-art approaches, offering enhanced interpretability and class-wise insights. Our source code is available at https://github.com/RenaoYan/PMIL.
comment: IEEE TRANSACTIONS ON MEDICAL IMAGING 2024
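As a rough sketch of the underlying idea, the snippet below estimates per-instance Shapley values for a bag-level scorer via Monte Carlo permutation sampling. The paper's attention-based acceleration and progressive pseudo-bag assignment are not reproduced, and the scorer here is a toy stand-in.

```python
# Hedged sketch: Monte Carlo (permutation) estimate of per-instance Shapley
# values for a bag-level scoring function.
import random

def shapley_instance_scores(instances, bag_score, num_permutations=100):
    """instances: list of feature vectors; bag_score: callable(list) -> float."""
    n = len(instances)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        random.shuffle(order)
        coalition, prev = [], bag_score([])
        for idx in order:
            coalition.append(instances[idx])
            score = bag_score(coalition)
            values[idx] += score - prev           # marginal contribution
            prev = score
    return [v / num_permutations for v in values]

# Toy scorer: fraction of instances whose first feature exceeds a threshold.
bag = [[0.9], [0.1], [0.8]]
print(shapley_instance_scores(bag, lambda c: sum(x[0] > 0.5 for x in c) / max(len(c), 1)))
```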
♻ ☆ EgoHDM: An Online Egocentric-Inertial Human Motion Capture, Localization, and Dense Mapping System
We present EgoHDM, an online egocentric-inertial human motion capture (mocap), localization, and dense mapping system. Our system uses 6 inertial measurement units (IMUs) and a commodity head-mounted RGB camera. EgoHDM is the first human mocap system that offers dense scene mapping in near real-time. Further, it is fast and robust to initialize and fully closes the loop between physically plausible map-aware global human motion estimation and mocap-aware 3D scene reconstruction. Our key idea is integrating camera localization and mapping information with inertial human motion capture bidirectionally in our system. To achieve this, we design a tightly coupled mocap-aware dense bundle adjustment and physics-based body pose correction module leveraging a local body-centric elevation map. The latter introduces a novel terrain-aware contact PD controller, which enables characters to physically contact the given local elevation map thereby reducing human floating or penetration. We demonstrate the performance of our system on established synthetic and real-world benchmarks. The results show that our method reduces human localization, camera pose, and mapping accuracy error by 41%, 71%, 46%, respectively, compared to the state of the art. Our qualitative evaluations on newly captured data further demonstrate that EgoHDM can cover challenging scenarios in non-flat terrain including stepping over stairs and outdoor scenes in the wild.
comment: Project Page: https://handiyin.github.io/EgoHDM/
♻ ☆ Enhancing Facial Expression Recognition through Dual-Direction Attention Mixed Feature Networks: Application to 7th ABAW Challenge
We present our contribution to the 7th ABAW challenge at ECCV 2024. By utilizing a Dual-Direction Attention Mixed Feature Network (DDAMFN) for multitask facial expression recognition, we achieve results far beyond the proposed baseline for the Multi-Task ABAW challenge. Our proposal uses the well-known DDAMFN architecture as a base to effectively predict valence-arousal, emotion recognition, and facial action units. We demonstrate the architecture's ability to handle these tasks simultaneously, providing insights into its design and the rationale behind it. Additionally, we compare the results of our multitask solution with independent single-task performance.
comment: 11 pages
♻ ☆ Human-AI Collaborative Multi-modal Multi-rater Learning for Endometriosis Diagnosis
Endometriosis, affecting about 10\% of individuals assigned female at birth, is challenging to diagnose and manage. Diagnosis typically involves the identification of various signs of the disease using either laparoscopic surgery or the analysis of T1/T2 MRI images, with the latter being quicker and cheaper but less accurate. A key diagnostic sign of endometriosis is the obliteration of the Pouch of Douglas (POD). However, even experienced clinicians struggle with accurately classifying POD obliteration from MRI images, which complicates the training of reliable AI models. In this paper, we introduce the \underline{H}uman-\underline{AI} \underline{Co}llaborative \underline{M}ulti-modal \underline{M}ulti-rater Learning (HAICOMM) methodology to address the challenge above. HAICOMM is the first method that explores three important aspects of this problem: 1) multi-rater learning to extract a cleaner label from the multiple ``noisy'' labels available per training sample; 2) multi-modal learning to leverage the presence of T1/T2 MRI images for training and testing; and 3) human-AI collaboration to build a system that leverages the predictions from clinicians and the AI model to provide more accurate classification than standalone clinicians and AI models. Presenting results on the multi-rater T1/T2 MRI endometriosis dataset that we collected to validate our methodology, the proposed HAICOMM model outperforms an ensemble of clinicians, noisy-label learning models, and multi-rater learning methods.
♻ ☆ Giving each task what it needs -- leveraging structured sparsity for tailored multi-task learning ECCV 2024
In the Multi-task Learning (MTL) framework, every task demands distinct feature representations, ranging from low-level to high-level attributes. It is vital to address the specific (feature/parameter) needs of each task, especially in computationally constrained environments. This work, therefore, introduces Layer-Optimized Multi-Task (LOMT) models that utilize structured sparsity to refine feature selection for individual tasks and enhance the performance of all tasks in a multi-task scenario. Structured or group sparsity systematically eliminates parameters from trivial channels and, sometimes, eventually, entire layers within a convolution neural network during training. Consequently, the remaining layers provide the most optimal features for a given task. In this two-step approach, we subsequently leverage this sparsity-induced optimal layer information to build the LOMT models by connecting task-specific decoders to these strategically identified layers, deviating from conventional approaches that uniformly connect decoders at the end of the network. This tailored architecture optimizes the network, focusing on essential features while reducing redundancy. We validate the efficacy of the proposed approach on two datasets, i.e., NYU-v2 and CelebAMask-HD datasets, for multiple heterogeneous tasks. A detailed performance analysis of the LOMT models, in contrast to the conventional MTL models, reveals that the LOMT models outperform for most task combinations. The excellent qualitative and quantitative outcomes highlight the effectiveness of employing structured sparsity for optimal layer (or feature) selection.
comment: Accepted at ECCV 2024 workshop - Computational Aspects of Deep Learning
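The structured-sparsity ingredient can be illustrated with the channel-wise group-lasso penalty below, which drives whole convolution channels toward zero during training; LOMT's subsequent step of attaching task-specific decoders to the surviving layers is not shown.

```python
# Hedged sketch: group-lasso penalty over output channels of convolution layers,
# the kind of structured-sparsity regularizer that removes trivial channels
# (and eventually entire layers) during training.
import torch
import torch.nn as nn

def channel_group_lasso(model, weight=1e-4):
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # one group per output channel: (out_channels, in_channels * k * k)
            groups = module.weight.flatten(start_dim=1)
            penalty = penalty + groups.norm(p=2, dim=1).sum()
    return weight * penalty

net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 8, 3))
loss = channel_group_lasso(net)   # added to the task losses during training
print(loss.item())
```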
♻ ☆ Kernel Adversarial Learning for Real-world Image Super-resolution
Current deep image super-resolution (SR) approaches aim to restore high-resolution images from down-sampled images or by assuming degradation from simple Gaussian kernels and additive noises. However, these techniques only assume crude approximations of the real-world image degradation process, which should involve complex kernels and noise patterns that are difficult to model using simple assumptions. In this paper, we propose a more realistic process to synthesise low-resolution images for real-world image SR by introducing a new Kernel Adversarial Learning Super-resolution (KASR) framework. In the proposed framework, degradation kernels and noises are adaptively modelled rather than explicitly specified. Moreover, we also propose a high-frequency selective objective and an iterative supervision process to further boost the model SR reconstruction accuracy. Extensive experiments validate the effectiveness of the proposed framework on real-world datasets.
♻ ☆ Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and improving audio-portrait movement correlation. This method removes the need for manually specified spatial motion templates used in existing methods to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.
comment: Homepage: https://loopyavatar.github.io/
♻ ☆ G-Style: Stylized Gaussian Splatting
We introduce G-Style, a novel algorithm designed to transfer the style of an image onto a 3D scene represented using Gaussian Splatting. Gaussian Splatting is a powerful 3D representation for novel view synthesis, as -- compared to other approaches based on Neural Radiance Fields -- it provides fast scene renderings and user control over the scene. Recent pre-prints have demonstrated that the style of Gaussian Splatting scenes can be modified using an image exemplar. However, since the scene geometry remains fixed during the stylization process, current solutions fall short of producing satisfactory results. Our algorithm aims to address these limitations by following a three-step process: In a pre-processing step, we remove undesirable Gaussians with large projection areas or highly elongated shapes. Subsequently, we combine several losses carefully designed to preserve different scales of the style in the image, while maintaining as much as possible the integrity of the original scene content. During the stylization process and following the original design of Gaussian Splatting, we split Gaussians where additional detail is necessary within our scene by tracking the gradient of the stylized color. Our experiments demonstrate that G-Style generates high-quality stylizations within just a few minutes, outperforming existing methods both qualitatively and quantitatively.
♻ ☆ MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by Filtering with Self-Supervised Geometry and Motion BMVC 2024
Autonomous systems, such as self-driving cars, rely on reliable semantic environment perception for decision making. Despite great advances in video semantic segmentation, existing approaches ignore important inductive biases and lack structured and interpretable internal representations. In this work, we propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate scene geometry and ego-motion of the camera, while also estimating the motion of external objects. Our model leverages these representations to improve the temporal consistency of semantic segmentation without sacrificing segmentation accuracy. MCDS-VSS follows a prediction-fusion approach in which scene geometry and camera motion are first used to compensate for ego-motion, then residual flow is used to compensate motion of dynamic objects, and finally the predicted scene features are fused with the current features to obtain a temporally consistent scene segmentation. Our model parses automotive scenes into multiple decoupled interpretable representations such as scene geometry, ego-motion, and object motion. Quantitative evaluation shows that MCDS-VSS achieves superior temporal consistency on video sequences while retaining competitive segmentation performance.
comment: Accepted for publication at BMVC 2024
♻ ☆ BEVal: A Cross-dataset Evaluation Study of BEV Segmentation Models for Autononomous Driving
Current research in semantic bird's-eye view segmentation for autonomous driving focuses solely on optimizing neural network models using a single dataset, typically nuScenes. This practice leads to the development of highly specialized models that may fail when faced with different environments or sensor setups, a problem known as domain shift. In this paper, we conduct a comprehensive cross-dataset evaluation of state-of-the-art BEV segmentation models to assess their performance across different training and testing datasets and setups, as well as different semantic categories. We investigate the influence of different sensors, such as cameras and LiDAR, on the models' ability to generalize to diverse conditions and scenarios. Additionally, we conduct multi-dataset training experiments that improve models' BEV segmentation performance compared to single-dataset training. Our work addresses the gap in evaluating BEV segmentation models under cross-dataset validation. And our findings underscore the importance of enhancing model generalizability and adaptability to ensure more robust and reliable BEV segmentation approaches for autonomous driving applications. The code for this paper available at https://github.com/manueldiaz96/beval .
♻ ☆ Adaptive Explicit Knowledge Transfer for Knowledge Distillation
Logit-based knowledge distillation (KD) for classification is cost-efficient compared to feature-based KD but often subject to inferior performance. Recently, it was shown that the performance of logit-based KD can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model, which is known as `implicit (dark) knowledge', to the student model. Through gradient analysis, we first show that this actually has an effect of adaptively controlling the learning of implicit knowledge. Then, we propose a new loss that enables the student to learn explicit knowledge (i.e., the teacher's confidence about the target class) along with implicit knowledge in an adaptive manner. Furthermore, we propose to separate the classification and distillation tasks for effective distillation and inter-class relationship modeling. Experimental results demonstrate that the proposed method, called adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to the state-of-the-art KD methods on the CIFAR-100 and ImageNet datasets.
comment: 19 pages, 5 figures
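To make the explicit/implicit split concrete, the sketch below separates the teacher distribution into the target-class confidence (explicit knowledge) and the renormalized non-target distribution (implicit knowledge), in the spirit of decoupled logit distillation. AEKT's adaptive weighting and its separation of the classification and distillation tasks are not reproduced here.

```python
# Hedged sketch of a logit-distillation loss split into an "explicit" part
# (teacher confidence on the target class) and an "implicit" part (the
# renormalized distribution over non-target classes). Weights are placeholders.
import torch
import torch.nn.functional as F

def split_kd_loss(student_logits, teacher_logits, target, T=4.0, alpha=1.0, beta=1.0):
    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)

    tgt = target.unsqueeze(1)
    s_tgt, t_tgt = p_s.gather(1, tgt), p_t.gather(1, tgt)
    # explicit knowledge: match the teacher's target-class confidence
    explicit = F.binary_cross_entropy(s_tgt, t_tgt)

    # implicit (dark) knowledge: match the renormalized non-target distribution
    mask = torch.ones_like(p_s).scatter_(1, tgt, 0.0)
    s_nt = (p_s * mask) / (1.0 - s_tgt + 1e-8)
    t_nt = (p_t * mask) / (1.0 - t_tgt + 1e-8)
    implicit = (t_nt * ((t_nt + 1e-8).log() - (s_nt + 1e-8).log())).sum(dim=1).mean()

    return alpha * explicit + beta * implicit
```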
♻ ☆ Boosting Adversarial Transferability for Skeleton-based Action Recognition via Exploring the Model Posterior Space
Skeletal motion plays a pivotal role in human activity recognition (HAR). Recently, attack methods have been proposed to identify the universal vulnerability of skeleton-based HAR (S-HAR). However, research on adversarial transferability for S-HAR is largely missing. More importantly, existing attacks all struggle to transfer across unknown S-HAR models. We observed that the key reason is that the loss landscape of the action recognizers is rugged and sharp. Given the established correlation in prior studies~\cite{qin2022boosting,wu2020towards} between the loss landscape and adversarial transferability, we assume and empirically validate that smoothing the loss landscape could potentially improve adversarial transferability on S-HAR. This is achieved by proposing a new post-train Dual Bayesian strategy, which can effectively explore the model posterior space for a collection of surrogates without the need for re-training. Furthermore, to craft adversarial examples along the motion manifold, we incorporate the attack gradient with information of the motion dynamics in a Bayesian manner. Evaluated on benchmark datasets, e.g. HDM05 and NTU 60, the average transfer success rate can reach as high as 35.9\% and 45.5\%, respectively. In comparison, current state-of-the-art skeletal attacks achieve only 3.6\% and 9.8\%. The high adversarial transferability remains consistent across various surrogate, victim, and even defense models. Through a comprehensive analysis of the results, we provide insights on which surrogates are more likely to exhibit transferability, to shed light on future research.
comment: We have submitted a new version of our work at arXiv:2409.02483. This version, arXiv:2407.08572, is no longer valid. Any update for this work will be conducted in arXiv:2409.02483
♻ ☆ 3D Single-object Tracking in Point Clouds with High Temporal Variation ECCV24
The high temporal variation of the point clouds is the key challenge of 3D single-object tracking (3D SOT). Existing approaches rely on the assumption that the shape variation of the point clouds and the motion of the objects across neighboring frames are smooth, failing to cope with high temporal variation data. In this paper, we present a novel framework for 3D SOT in point clouds with high temporal variation, called HVTrack. HVTrack proposes three novel components to tackle the challenges in the high temporal variation scenario: 1) A Relative-Pose-Aware Memory module to handle temporal point cloud shape variations; 2) a Base-Expansion Feature Cross-Attention module to deal with similar object distractions in expanded search areas; 3) a Contextual Point Guided Self-Attention module for suppressing heavy background noise. We construct a dataset with high temporal variation (KITTI-HV) by setting different frame intervals for sampling in the KITTI dataset. On the KITTI-HV with 5 frame intervals, our HVTrack surpasses the state-of-the-art tracker CXTracker by 11.3%/15.7% in Success/Precision.
comment: Accepted by ECCV24
♻ ☆ More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding
Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping then allows us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: https://github.com/TangYuan96/GreenPLM.
♻ ☆ Prediction of soil fertility parameters using USB-microscope imagery and portable X-ray fluorescence spectrometry
This study investigated the use of portable X-ray fluorescence (PXRF) spectrometry and soil image analysis for rapid soil fertility assessment, with a focus on key indicators such as available boron (B), organic carbon (OC), available manganese (Mn), available sulfur (S), and the sulfur availability index (SAI). A total of 1,133 soil samples from diverse agro-climatic zones in Eastern India were analyzed. The research integrated color and texture features from microscopic soil images, PXRF data, and auxiliary soil variables (AVs) using a Random Forest model. Results showed that combining image features (IFs) with AVs significantly improved prediction accuracy for available B (R2 = 0.80) and OC (R2 = 0.88). A data fusion approach, incorporating IFs, AVs, and PXRF data, further enhanced predictions for available Mn and SAI, with R2 values of 0.72 and 0.70, respectively. The study highlights the potential of integrating these technologies to offer rapid, cost-effective soil testing methods, paving the way for more advanced predictive models and a deeper understanding of soil fertility. Future work should explore the application of deep learning models on a larger dataset, incorporating soils from a wider range of agro-climatic zones under field conditions.
comment: Published in 'Soil Advances'
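A minimal sketch of the data-fusion setup is shown below: image features, auxiliary variables, and PXRF readings are concatenated and fed to a random forest regressor. The synthetic features and hyperparameters are placeholders, not those of the study.

```python
# Hedged sketch of the data-fusion idea: concatenate image features (IFs),
# auxiliary soil variables (AVs) and PXRF readings, then fit a random forest
# regressor per fertility target. All data below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
image_features = rng.random((n, 12))    # e.g., color/texture statistics
aux_vars = rng.random((n, 4))           # e.g., auxiliary soil variables
pxrf = rng.random((n, 8))               # elemental intensities
y = rng.random(n)                       # e.g., available B

X = np.hstack([image_features, aux_vars, pxrf])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_tr, y_tr)
print("R2 on held-out split:", model.score(X_te, y_te))
```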
♻ ☆ D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction ECCV 2024
Reconstructing hand-held objects from a single RGB image is a challenging task in computer vision. In contrast to prior works that utilize deterministic modeling paradigms, we employ a point cloud denoising diffusion model to account for the probabilistic nature of this problem. At its core, we introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction (D-SCo), tackling two predominant challenges. First, to prevent the object centroid from deviating, we utilize a novel hand-constrained centroid fixing paradigm, enhancing the stability of the diffusion and reverse processes and the precision of feature projection. Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions with a novel unified hand-object semantic embedding, enhancing the reconstruction performance of the hand-occluded region of the object. Experiments on the synthetic ObMan dataset and three real-world datasets, HO3D, MOW, and DexYCB, demonstrate that our approach surpasses all other state-of-the-art methods.
comment: ECCV 2024
♻ ☆ AICAttack: Adversarial Image Captioning Attack with Attention-Based Optimization
Recent advances in deep learning research have shown remarkable achievements across many tasks in computer vision (CV) and natural language processing (NLP). At the intersection of CV and NLP is the problem of image captioning, where the related models' robustness against adversarial attacks has not been well studied. This paper presents a novel adversarial attack strategy, AICAttack (Attention-based Image Captioning Attack), designed to attack image captioning models through subtle perturbations on images. Operating within a black-box attack scenario, our algorithm requires no access to the target model's architecture, parameters, or gradient information. We introduce an attention-based candidate selection mechanism that identifies the optimal pixels to attack, followed by a customised differential evolution method to optimise the perturbations of pixels' RGB values. We demonstrate AICAttack's effectiveness through extensive experiments on benchmark datasets against multiple victim models. The experimental results demonstrate that our method outperforms current leading-edge techniques by achieving consistently higher attack success rates.
♻ ☆ LinFusion: 1 GPU, 1 Minute, 16K Image
Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, in this paper we propose a novel linear attention mechanism as an alternative. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, generating high-resolution images at up to 16K resolution. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation efforts. Codes are available at https://github.com/Huage001/LinFusion.
comment: Work in Progress. Codes are available at https://github.com/Huage001/LinFusion
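To make the complexity argument above concrete, here is a toy non-causal, normalized linear attention layer that never materializes the N x N attention matrix. It only illustrates the class of token mixers discussed; it is not the generalized low-rank formulation used by LinFusion.

```python
# Toy sketch of non-causal, normalized linear attention (O(N) in token count),
# only to illustrate why such mixers scale better than softmax attention at
# high resolution. It is NOT the exact LinFusion formulation.
import torch

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (B, N, D). Kernel feature map phi(x) = elu(x) + 1."""
    phi_q = torch.nn.functional.elu(q) + 1.0
    phi_k = torch.nn.functional.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", phi_k, v)            # (B, D, D), cost O(N * D^2)
    z = phi_k.sum(dim=1)                                   # (B, D) normalizer
    out = torch.einsum("bnd,bde->bne", phi_q, kv)
    return out / (torch.einsum("bnd,bd->bn", phi_q, z).unsqueeze(-1) + eps)

q = k = v = torch.randn(2, 4096, 64)    # 4096 spatial tokens, no 4096x4096 matrix is ever formed
print(linear_attention(q, k, v).shape)  # torch.Size([2, 4096, 64])
```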
♻ ☆ UVL2: A Unified Framework for Video Tampering Localization
With the advancement of deep learning-driven video editing technology, security risks have emerged. Malicious video tampering can lead to public misunderstanding, property losses, and legal disputes. Current detection methods are mostly limited to specific datasets, offer limited detection performance for unknown forgeries, and lack robustness to processed data. This paper proposes an effective video tampering localization network that significantly improves the detection performance of video inpainting and splicing by extracting more generalized features of forgery traces. Considering the inherent differences between tampered videos and original videos, such as edge artifacts, pixel distribution, texture features, and compression information, we have specifically designed four modules to independently extract these features. Furthermore, to seamlessly integrate these features, we employ a two-stage approach utilizing both a Convolutional Neural Network and a Vision Transformer, enabling us to learn these features in a local-to-global manner. Experimental results demonstrate that the method significantly outperforms the existing state-of-the-art methods and exhibits robustness.
♻ ☆ LuSNAR: A Lunar Segmentation, Navigation and Reconstruction Dataset based on Multi-sensor for Autonomous Exploration
With the increasing complexity of lunar exploration missions, lunar rovers need a higher level of autonomy. Environmental perception and navigation algorithms are the foundation for lunar rovers to achieve autonomous exploration. The development and verification of algorithms require highly reliable data support. Most of the existing lunar datasets are targeted at a single task, lacking diverse scenes and high-precision ground truth labels. To address this issue, we propose LuSNAR, a multi-task, multi-scene, and multi-label lunar benchmark dataset. It can be used for comprehensive evaluation of autonomous perception and navigation systems, and includes high-resolution stereo image pairs, panoramic semantic labels, dense depth maps, LiDAR point clouds, and rover positions. In order to provide richer scene data, we built 9 lunar simulation scenes based on Unreal Engine. Each scene is divided according to topographic relief and the density of objects. To verify the usability of the dataset, we evaluated and analyzed algorithms for semantic segmentation, 3D reconstruction, and autonomous navigation. The experimental results show that the dataset proposed in this paper can be used for ground verification of tasks such as autonomous environment perception and navigation, and provides a lunar benchmark for evaluating algorithm performance. We make LuSNAR publicly available at: https://github.com/autumn999999/LuSNAR-dataset.
comment: 19 pages, 13 figures, 11 tables
♻ ☆ A Survey for Foundation Models in Autonomous Driving
The advent of foundation models has revolutionized the fields of natural language processing and computer vision, paving the way for their application in autonomous driving (AD). This survey presents a comprehensive review of more than 40 research papers, demonstrating the role of foundation models in enhancing AD. Large language models contribute to planning and simulation in AD, particularly through their proficiency in reasoning, code generation and translation. In parallel, vision foundation models are increasingly adapted for critical tasks such as 3D object detection and tracking, as well as creating realistic driving scenarios for simulation and testing. Multi-modal foundation models, integrating diverse inputs, exhibit exceptional visual understanding and spatial reasoning, crucial for end-to-end AD. This survey not only provides a structured taxonomy, categorizing foundation models based on their modalities and functionalities within the AD domain but also delves into the methods employed in current research. It identifies the gaps between existing foundation models and cutting-edge AD approaches, thereby charting future research directions and proposing a roadmap for bridging these gaps.
♻ ☆ CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention
Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including a body movement map, a hand clarity score, pose-aligned reference features, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of facilitating zero-shot video generation within the scope of the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.
comment: Homepage: https://cyberhost.github.io/
♻ ☆ PointCloud-Text Matching: Benchmark Datasets and a Baseline
In this paper, we present and study a new instance-level retrieval task: PointCloud-Text Matching (PTM), which aims to find the exact cross-modal instance that matches a given point-cloud query or text query. PTM could be applied to various scenarios, such as indoor/urban-canyon localization and scene retrieval. However, there exists no suitable and targeted dataset for PTM in practice. Therefore, we construct three new PTM benchmark datasets, namely 3D2T-SR, 3D2T-NR, and 3D2T-QA. We observe that the data are challenging and exhibit noisy correspondence due to the sparsity, noise, or disorder of point clouds and the ambiguity, vagueness, or incompleteness of texts, which make existing cross-modal matching methods ineffective for PTM. To tackle these challenges, we propose a PTM baseline, named Robust PointCloud-Text Matching method (RoMa). RoMa consists of two modules: a Dual Attention Perception module (DAP) and a Robust Negative Contrastive Learning module (RNCL). Specifically, DAP leverages token-level and feature-level attention to adaptively focus on useful local and global features, and aggregate them into common representations, thereby reducing the adverse impact of noise and ambiguity. To handle noisy correspondence, RNCL divides negative pairs, which are much less error-prone than positive pairs, into clean and noisy subsets, and assigns them forward and reverse optimization directions respectively, thus enhancing robustness against noisy correspondence. We conduct extensive experiments on our benchmarks and demonstrate the superiority of our RoMa.
comment: Upon further consideration, we have concluded that the current version requires significant revision and may not yet be ready for publication. We plan to conduct additional experiments and make the necessary improvements to ensure the paper meets the standards for future submission
♻ ☆ Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion
Recognizing characters and predicting speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches, such as training character classifiers that require specific annotations for each comic title, are infeasible. This motivates us to propose a novel zero-shot approach, allowing machines to identify characters and predict speaker names based solely on unannotated comic images. In spite of their importance in real-world applications, these tasks have largely remained unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction tasks. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.
comment: Accepted to ACM Multimedia 2024. Project page: https://liyingxuan1012.github.io/zeroshot-speaker-prediction ; Github repo: https://github.com/liyingxuan1012/zeroshot-speaker-prediction
♻ ☆ Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across different ages, experiences, and genders. There is currently a lack of research in this area, and most existing benchmarks suffer from several drawbacks: 1) a limited number of modalities and answers with restrictive length; 2) the content and scenarios within the videos are excessively monotonous, transmitting allegories and emotions that are overly simplistic. To bridge the gap to real-world applications, we introduce a large-scale Subjective Response Indicators for Advertisement Videos dataset, namely SRI-ADV. Specifically, we collected real changes in Electroencephalographic (EEG) and eye-tracking regions from different demographics while they viewed identical video content. Utilizing this multi-modal dataset, we developed tasks and protocols to analyze and evaluate the extent of cognitive understanding of video content among different users. Along with the dataset, we designed a Hypergraph Multi-modal Large Language Model (HMLLM) to explore the associations among different demographics, video elements, EEG, and eye-tracking indicators. HMLLM could bridge semantic gaps across rich modalities and integrate information beyond different modalities to perform logical reasoning. Extensive experimental evaluations on SRI-ADV and other additional video-based generative performance benchmarks demonstrate the effectiveness of our method. The codes and dataset will be released at https://github.com/mininglamp-MLLM/HMLLM.
comment: Accepted by ACM MULTIMEDIA 2024
♻ ☆ Q-Seg: Quantum Annealing-Based Unsupervised Image Segmentation
We present Q-Seg, a novel unsupervised image segmentation method based on quantum annealing, tailored for existing quantum hardware. We formulate the pixel-wise segmentation problem, which assimilates spectral and spatial information of the image, as a graph-cut optimization task. Our method efficiently leverages the interconnected qubit topology of the D-Wave Advantage device, offering superior scalability over existing quantum approaches and outperforming several tested state-of-the-art classical methods. Empirical evaluations on synthetic datasets have shown that Q-Seg has better runtime performance than the state-of-the-art classical optimizer Gurobi. The method has also been tested on earth observation image segmentation, a critical area with noisy and unreliable annotations. In the noisy intermediate-scale quantum (NISQ) era, Q-Seg emerges as a reliable contender for real-world applications in comparison to advanced techniques like Segment Anything. Consequently, Q-Seg offers a promising solution using available quantum hardware, especially in situations constrained by limited labeled data and the need for efficient computational runtime.
comment: 12 pages, 9 figures, 1 table
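The graph-cut formulation mentioned above can be cast as a QUBO, which is the form a quantum annealer (or a classical solver) consumes. The sketch below builds such a QUBO for a tiny pixel-similarity graph and solves it by brute force; the edge weights and thresholding are illustrative assumptions, not Q-Seg's exact construction.

```python
# Minimal sketch of formulating a graph cut over pixel-similarity edges as a
# QUBO, the kind of objective an annealer (or classical solver) can minimize.
# Edge weights and the similarity threshold are illustrative, not Q-Seg's.
import itertools
import numpy as np

def cut_qubo(W):
    """W: symmetric (n, n) edge-weight matrix with zero diagonal.
    Returns Q such that x^T Q x equals the cut value of binary labels x."""
    Q = -W.copy()
    np.fill_diagonal(Q, W.sum(axis=1))
    return Q

# Tiny 4-node graph: similar pixels get positive weights, dissimilar pixels get
# negative ones (similarity minus a threshold), so the minimum cut is non-trivial.
W = np.array([[ 0.0,  0.8, -0.3, -0.4],
              [ 0.8,  0.0, -0.5, -0.2],
              [-0.3, -0.5,  0.0,  0.9],
              [-0.4, -0.2,  0.9,  0.0]])
Q = cut_qubo(W)

best = min((np.array(x) @ Q @ np.array(x), x) for x in itertools.product([0, 1], repeat=4))
print("segmentation labels:", best[1], "cut value:", round(float(best[0]), 2))
```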
♻ ☆ Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding
Open-vocabulary 3D scene understanding presents a significant challenge in the field. Recent works have sought to transfer knowledge embedded in vision-language models from 2D to 3D domains. However, these approaches often require prior knowledge from specific 3D scene datasets, limiting their applicability in open-world scenarios. The Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities, prompting us to investigate its potential for comprehending 3D scenes without training. In this paper, we introduce OV-SAM3D, a training-free method that contains a universal framework for understanding open-vocabulary 3D scenes. This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene. Specifically, our method is composed of two key sub-modules: First, we initiate the process by generating superpoints as the initial 3D prompts and refine these prompts using segment masks derived from SAM. Second, we integrate a specially designed overlapping score table with open tags from the Recognize Anything Model (RAM) to produce final 3D instances with open-world labels. Empirical evaluations on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments.
comment: Project page: https://hithqd.github.io/projects/OV-SAM3D
♻ ☆ An Efficient Instance Segmentation Framework Using Segmentation Foundation Models with Oriented Bounding Box Prompts
Instance segmentation in unmanned aerial vehicle measurement is a long-standing challenge. Since horizontal bounding boxes introduce many interference objects, oriented bounding boxes (OBBs) are usually used for instance identification. However, based on ``segmentation within bounding box'' paradigm, current instance segmentation methods using OBBs are overly dependent on bounding box detection performance. To tackle this, this paper proposes OBSeg, an efficient instance segmentation framework using OBBs. OBSeg is based on box prompt-based segmentation foundation models (BSMs), e.g., Segment Anything Model. Specifically, OBSeg first detects OBBs to distinguish instances and provide coarse localization information. Then, it predicts OBB prompt-related masks for fine segmentation. Since OBBs only serve as prompts, OBSeg alleviates the over-dependence on bounding box detection performance of current instance segmentation methods using OBBs. In addition, to enable BSMs to handle OBB prompts, we propose a novel OBB prompt encoder. To make OBSeg more lightweight and further improve the performance of lightweight distilled BSMs, a Gaussian smoothing-based knowledge distillation method is introduced. Experiments demonstrate that OBSeg outperforms current instance segmentation methods on multiple public datasets. The code is available at https://github.com/zhen6618/OBBInstanceSegmentation.
♻ ☆ Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos
Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works have been proposed to simultaneously forecast hand trajectories and object affordances on human egocentric videos. The joint prediction serves as a comprehensive representation of future hand-object interactions in 2D space, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, these works basically overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. We transform the sequential 2D images into latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to enable Diff-IP2D aware of the camera wearer's dynamics for more accurate interaction prediction. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our newly proposed evaluation protocol. This highlights the efficacy of leveraging a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D will be released at https://github.com/IRMVLab/Diff-IP2D.
♻ ☆ TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation
Semantic segmentation benchmarks in the realm of autonomous driving are dominated by large pre-trained transformers, yet their widespread adoption is impeded by substantial computational costs and prolonged training durations. To lift this constraint, we look at efficient semantic segmentation from a perspective of comprehensive knowledge distillation and aim to bridge the gap between multi-source knowledge extractions and transformer-specific patch embeddings. We put forward the Transformer-based Knowledge Distillation (TransKD) framework which learns compact student transformers by distilling both feature maps and patch embeddings of large teacher transformers, bypassing the long pre-training process and reducing the FLOPs by >85.0%. Specifically, we propose two fundamental modules to realize feature map distillation and patch embedding distillation, respectively: (1) Cross Selective Fusion (CSF) enables knowledge transfer between cross-stage features via channel attention and feature map distillation within hierarchical transformers; (2) Patch Embedding Alignment (PEA) performs dimensional transformation within the patchifying process to facilitate the patch embedding distillation. Furthermore, we introduce two optimization modules to enhance the patch embedding distillation from different perspectives: (1) Global-Local Context Mixer (GL-Mixer) extracts both global and local information of a representative embedding; (2) Embedding Assistant (EA) acts as an embedding method to seamlessly bridge teacher and student models with the teacher's number of channels. Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals the time-consuming pre-training method. The source code is publicly available at https://github.com/RuipingL/TransKD.
comment: Accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS). The source code is publicly available at https://github.com/RuipingL/TransKD
Artificial Intelligence 92
☆ Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.
comment: Project page: https://yunzeman.github.io/lexicon3d , Github: https://github.com/YunzeMan/Lexicon3D
☆ WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild
The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical. To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis's utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.
☆ LLM-CI: Assessing Contextual Integrity Norms in Language Models
Large language models (LLMs), while memorizing parts of their training data scraped from the Internet, may also inadvertently encode societal preferences and norms. As these models are integrated into sociotechnical systems, it is crucial that the norms they encode align with societal expectations. These norms could vary across models, hyperparameters, optimization techniques, and datasets. This is especially challenging due to prompt sensitivity: small variations in prompts yield different responses, rendering existing assessment methodologies unreliable. There is a need for a comprehensive framework covering various models, optimization, and datasets, along with a reliable methodology to assess encoded norms. We present LLM-CI, the first open-sourced framework to assess privacy norms encoded in LLMs. LLM-CI uses a Contextual Integrity-based factorial vignette methodology to assess the encoded norms across different contexts and LLMs. We propose the multi-prompt assessment methodology to address prompt sensitivity by assessing the norms from only the prompts that yield consistent responses across multiple variants. Using LLM-CI and our proposed methodology, we comprehensively evaluate LLMs using IoT and COPPA vignettes datasets from prior work, examining the impact of model properties (e.g., hyperparameters, capacity) and optimization strategies (e.g., alignment, quantization).
comment: 20 pages, 8 Figures, 4 Tables
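A rough sketch of the multi-prompt assessment idea follows: each vignette is posed through several prompt paraphrases, and a norm is only recorded when the answers agree. The `query_model` stub and the prompts are placeholders, not LLM-CI's actual interface.

```python
# Sketch of the multi-prompt assessment idea: a vignette's encoded norm is only
# counted when the model answers consistently across prompt paraphrases.
# `query_model` is a stand-in; LLM-CI's actual prompts and parsing differ.
from collections import Counter

def query_model(prompt: str) -> str:
    # Placeholder: in practice this would call the LLM under assessment.
    return "acceptable" if "doctor" in prompt else "unacceptable"

def assess_vignette(prompt_variants, agreement=1.0):
    answers = [query_model(p) for p in prompt_variants]
    label, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= agreement:      # consistent across variants
        return label
    return None                                # discard prompt-sensitive vignettes

variants = [
    "Is it acceptable for a doctor to share patient data with a nurse?",
    "A doctor shares patient data with a nurse. Acceptable or unacceptable?",
    "Rate the information flow: physician -> nurse, patient data.",
]
print(assess_vignette(variants))  # None here, since the toy model is prompt-sensitive
```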
☆ Planning In Natural Language Improves LLM Search For Code Generation
While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PLANSEARCH generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas.
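The following schematic loop illustrates the plan-level search described above: sample diverse observations, combine them into natural-language plans, and only then generate code that is filtered by public tests. The `llm` and `run_tests` functions are stubs, and the prompts are not the paper's.

```python
# Schematic of a plan-level search loop in the spirit of PLANSEARCH: sample
# diverse natural-language observations, turn their combinations into plans,
# and only then generate code. `llm` and `run_tests` are placeholders, not the
# paper's actual implementation or prompts.
import itertools

def llm(prompt: str, n: int = 1) -> list[str]:
    return [f"response {i} to: {prompt[:40]}..." for i in range(n)]  # stub

def run_tests(code: str) -> bool:
    return False  # stub: would execute the candidate against public tests

def plan_search(problem: str, n_obs: int = 4, max_candidates: int = 8):
    observations = llm(f"List distinct observations about solving:\n{problem}", n=n_obs)
    candidates = []
    for subset in itertools.combinations(observations, 2):      # combine observations
        plan = llm("Write a step-by-step plan using: " + " | ".join(subset))[0]
        code = llm("Implement this plan in Python:\n" + plan)[0]
        candidates.append(code)
        if len(candidates) >= max_candidates:
            break
    passing = [c for c in candidates if run_tests(c)]
    return passing[0] if passing else None

print(plan_search("Given an array, find the longest increasing subsequence."))
```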
☆ A Different Level Text Protection Mechanism With Differential Privacy
This article introduces a method for extracting words of different degrees of importance based on the BERT pre-trained model and demonstrates its effectiveness. It also discusses how applying the same perturbation to words of different importance affects overall text utility. The method can be applied to long-text protection.
☆ View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista.
comment: Accepted to CoRL 2024
☆ TRACE-cs: Trustworthy Reasoning for Contrastive Explanations in Course Scheduling Problems
We present TRACE-cs, a novel hybrid system that combines symbolic reasoning with large language models (LLMs) to address contrastive queries in scheduling problems. TRACE-cs leverages SAT solving techniques to encode scheduling constraints and generate explanations for user queries, while utilizing an LLM to process the user queries into logical clauses as well as refine the explanations generated by the symbolic solver to natural language sentences. By integrating these components, our approach demonstrates the potential of combining symbolic methods with LLMs to create explainable AI agents with correctness guarantees.
☆ A method to benchmark high-dimensional process drift detection
Process curves are multivariate finite time series data coming from manufacturing processes. This paper studies machine learning methods for detecting drifts in process curves. A theoretical framework to synthetically generate process curves in a controlled way is introduced in order to benchmark machine learning algorithms for process drift detection. An evaluation score, called the temporal area under the curve, is introduced, which quantifies how well machine learning models unveil curves belonging to drift segments. Finally, a benchmark study comparing popular machine learning approaches on synthetic data generated with the introduced framework is presented.
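As an illustration of the controlled synthetic generation such a benchmark requires, the sketch below produces multivariate process curves with a labeled drift segment. The curve shape, noise level, and drift model are assumptions for demonstration only, not the paper's framework.

```python
# Illustrative sketch (not the paper's framework) of generating synthetic
# multivariate process curves with a controlled drift segment, which is the
# kind of setup needed to benchmark drift-detection methods.
import numpy as np

def make_curves(n_curves=200, n_steps=100, n_channels=3, drift_start=120,
                drift_magnitude=0.5, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, n_steps)
    base = np.sin(2 * np.pi * t)[None, :, None]                    # shared nominal shape
    curves = base + 0.1 * rng.normal(size=(n_curves, n_steps, n_channels))
    labels = np.zeros(n_curves, dtype=int)
    labels[drift_start:] = 1                                        # curves in the drift segment
    curves[drift_start:] += drift_magnitude * t[None, :, None]      # slowly growing offset
    return curves, labels

curves, labels = make_curves()
print(curves.shape, "drift curves:", labels.sum())  # (200, 100, 3) drift curves: 80
```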
☆ Limited but consistent gains in adversarial robustness by co-training object recognition models with human EEG
In contrast to human vision, artificial neural networks (ANNs) remain relatively susceptible to adversarial attacks. To address this vulnerability, efforts have been made to transfer inductive bias from human brains to ANNs, often by training the ANN representations to match their biological counterparts. Previous works relied on brain data acquired in rodents or primates using invasive techniques, from specific regions of the brain, under non-natural conditions (anesthetized animals), and with stimulus datasets lacking diversity and naturalness. In this work, we explored whether aligning model representations to human EEG responses to a rich set of real-world images increases the adversarial robustness of ANNs. Specifically, we trained ResNet50-backbone models on a dual task of classification and EEG prediction; and evaluated their EEG prediction accuracy and robustness to adversarial attacks. We observed significant correlation between the networks' EEG prediction accuracy, often highest around 100 ms post stimulus onset, and their gains in adversarial robustness. Although the effect size was limited, the effects were consistent across different random initializations and robust across architectural variants. We further teased apart the data from individual EEG channels and observed the strongest contribution from electrodes in the parieto-occipital regions. The demonstrated utility of human EEG for such tasks opens up avenues for future efforts that scale to larger datasets under diverse stimuli conditions with the promise of stronger effects.
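A minimal sketch of the dual-task co-training setup is shown below: a shared ResNet50 backbone feeds a classification head and an EEG-regression head, and the two losses are combined with a weight. The head dimensions, EEG layout, and loss weight are assumptions, not the paper's configuration.

```python
# Sketch of a dual-task setup: a shared ResNet50 backbone with one head for
# object classification and one for EEG prediction, trained with a weighted
# sum of both losses. Sizes and the loss weight are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DualTaskNet(nn.Module):
    def __init__(self, n_classes=10, eeg_dim=17 * 100):  # e.g. 17 channels x 100 time points
        super().__init__()
        backbone = resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.cls_head = nn.Linear(2048, n_classes)
        self.eeg_head = nn.Linear(2048, eeg_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.cls_head(h), self.eeg_head(h)

model = DualTaskNet(n_classes=10, eeg_dim=1700)
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 10, (4,))
eeg = torch.randn(4, 1700)

logits, eeg_pred = model(images)
loss = nn.functional.cross_entropy(logits, labels) + 0.5 * nn.functional.mse_loss(eeg_pred, eeg)
loss.backward()
print(float(loss))
```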
☆ Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Cord Paralysis
This paper presents the Multimodal Analyzing System for Laryngoscope (MASL), a system that combines audio and video data to automatically extract key segments and metrics from laryngeal videostroboscopic videos for clinical assessment. MASL integrates glottis detection with keyword spotting to analyze patient vocalizations and refine video highlights for better inspection of vocal cord movements. The system includes a strobing video extraction module that identifies frames by analyzing hue, saturation, and value fluctuations. MASL also provides effective metrics for vocal cord paralysis detection, employing a two-stage glottis segmentation process using U-Net followed by diffusion-based refinement to reduce false positives. Instead of glottal area waveforms, MASL estimates anterior glottic angle waveforms (AGAW) from glottis masks, evaluating both left and right vocal cords to detect unilateral vocal cord paralysis (UVFP). By comparing AGAW variances, MASL distinguishes between left and right paralysis. Ablation studies and experiments on public and real-world datasets validate MASL's segmentation module and demonstrate its ability to provide reliable metrics for UVFP diagnosis.
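The left-versus-right comparison of AGAW variance described above might look like the sketch below, where the side whose angle waveform barely varies is flagged. The variance-ratio threshold and signals are illustrative assumptions, not MASL's actual values.

```python
# Sketch of an AGAW variance comparison: the vocal cord whose anterior glottic
# angle waveform barely varies is flagged as the paralyzed side. The ratio
# threshold is an illustrative assumption, not MASL's value.
import numpy as np

def detect_uvfp_side(agaw_left, agaw_right, ratio_threshold=3.0):
    var_l, var_r = np.var(agaw_left), np.var(agaw_right)
    if var_l * ratio_threshold < var_r:
        return "left paralysis suspected"
    if var_r * ratio_threshold < var_l:
        return "right paralysis suspected"
    return "no clear unilateral paralysis"

t = np.linspace(0, 1, 200)
healthy = 10 + 8 * np.abs(np.sin(2 * np.pi * 5 * t))                # normally oscillating angle (degrees)
paralyzed = 10 + 0.5 * np.random.default_rng(0).normal(size=200)    # nearly static angle

print(detect_uvfp_side(paralyzed, healthy))  # left paralysis suspected
```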
☆ 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances KDD
Predicting the performance of LLMs on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, a possibility is to evaluate the considered LLM on a set of task instances and train an assessor to predict its performance based on features of the instances. However, this approach requires evaluating each new LLM on a sufficiently large set of task instances to train an assessor specific to it. In this work, we leverage the evaluation results of previously tested LLMs to reduce the number of evaluations required to predict the performance of a new LLM. In practice, we propose to test the new LLM on a small set of reference instances and train a generic assessor which predicts the performance of the LLM on an instance based on the performance of the former on the reference set and features of the instance of interest. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models up to the January 2024 version of GPT4. When predicting performance on instances with the same distribution as those used to train the generic assessor, we find this achieves performance comparable to the LLM-specific assessors trained on the full set of instances. Additionally, we find that randomly selecting the reference instances performs as well as some advanced selection methods we tested. For out-of-distribution instances, however, no clear winner emerges and the overall performance is worse, suggesting that the inherent predictability of LLMs is low.
comment: Presented at the 2024 KDD workshop on Evaluation and Trustworthiness of Generative AI Models
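The generic-assessor idea can be sketched as follows: each training row pairs an instance's features with the candidate LLM's success profile on a shared reference set, and the assessor is fit on previously evaluated models. The data below is random; the study itself uses HELM-Lite and KindsOfReasoning results.

```python
# Toy sketch of the generic-assessor idea: predict whether a new LLM succeeds
# on an instance from (a) features of that instance and (b) the LLM's results
# on a small shared reference set. Data here is random placeholder material.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_llms, n_instances, n_ref, n_feat = 12, 400, 100, 16

inst_feats = rng.normal(size=(n_instances, n_feat))                    # instance features
results = (rng.random(size=(n_llms, n_instances)) < 0.6).astype(int)   # success/failure per LLM
ref_idx = rng.choice(n_instances, size=n_ref, replace=False)           # shared reference instances

def rows_for(llm):  # one training row per (llm, instance) pair
    ref_profile = results[llm, ref_idx]                                 # how this LLM did on the reference set
    X = np.hstack([inst_feats, np.tile(ref_profile, (n_instances, 1))])
    return X, results[llm]

X_train = np.vstack([rows_for(m)[0] for m in range(n_llms - 1)])
y_train = np.concatenate([rows_for(m)[1] for m in range(n_llms - 1)])
assessor = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_new, y_new = rows_for(n_llms - 1)                                     # the "new", held-out LLM
print("accuracy on new LLM:", round(assessor.score(X_new, y_new), 2))
```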
☆ DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
Diffusion models (DMs) have demonstrated exceptional generative capabilities across various areas, while they are hindered by slow inference speeds and high computational demands during deployment. The most common way to accelerate DMs involves reducing the number of denoising steps during generation, achieved through faster sampling solvers or knowledge distillation (KD). In contrast to prior approaches, we propose a novel method that transfers the capability of large pretrained DMs to faster architectures. Specifically, we employ KD in a distinct manner to compress DMs by distilling their generative ability into more rapid variants. Furthermore, considering that the source data is either inaccessible or too large to store for current generative models, we introduce a new paradigm for their distillation without source data, termed Data-Free Knowledge Distillation for Diffusion Models (DKDM). Generally, our established DKDM framework comprises two main components: 1) a DKDM objective that uses synthetic denoising data produced by pretrained DMs to optimize faster DMs without source data, and 2) a dynamic iterative distillation method that flexibly organizes the synthesis of denoising data, preventing the slow generation process from becoming a bottleneck during optimization. To our knowledge, this is the first attempt at using KD to distill DMs into any architecture in a data-free manner. Importantly, our DKDM is orthogonal to most existing acceleration methods, such as denoising step reduction, quantization and pruning. Experiments show that our DKDM is capable of deriving 2x faster DMs with performance remaining on par with the baseline. Notably, our DKDM enables pretrained DMs to function as "datasets" for training new DMs.
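A heavily simplified sketch of the data-free distillation idea follows: a small student denoiser is trained to match a large teacher's predictions on synthetically generated noisy inputs, with no access to the original training set. The tiny models and the plain MSE objective only stand in for, and do not reproduce, the actual DKDM objective and its dynamic iterative scheduling.

```python
# Highly simplified sketch of distilling a denoiser without source data: the
# student learns to match the teacher's prediction on synthetically generated
# noisy inputs. Models here are tiny stand-ins, not real diffusion backbones.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(width, 3, 3, padding=1))

    def forward(self, x, t):
        return self.net(x + t.view(-1, 1, 1, 1))   # crude timestep conditioning

teacher = TinyDenoiser(width=64).eval()            # stands in for the large pretrained DM
student = TinyDenoiser(width=16)                   # faster architecture to be distilled
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for step in range(3):                              # a few illustrative steps
    x_t = torch.randn(8, 3, 32, 32)                # synthetic noisy samples, no source data
    t = torch.randint(0, 1000, (8,)).float() / 1000
    with torch.no_grad():
        target = teacher(x_t, t)                   # teacher's denoising prediction
    loss = nn.functional.mse_loss(student(x_t, t), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: distillation loss {loss.item():.4f}")
```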
☆ Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift
Natural distribution shift causes a deterioration in the perception performance of convolutional neural networks (CNNs). This comprehensive analysis for real-world traffic data addresses: 1) investigating the effect of natural distribution shift and weather augmentations on both detection quality and confidence estimation, 2) evaluating model performance for both classification and object localization, and 3) benchmarking two common uncertainty quantification methods - Ensembles and different variants of Monte-Carlo (MC) Dropout - under natural and close-to-natural distribution shift. For this purpose, a novel dataset has been curated from publicly available autonomous driving datasets. The in-distribution (ID) data is based on cutouts of a single object, for which both class and bounding box annotations are available. The six distribution-shift datasets cover adverse weather scenarios, simulated rain and fog, corner cases, and out-of-distribution data. A granular analysis of CNNs under distribution shift makes it possible to quantify the impact of different types of shifts on both task performance and confidence estimation: ConvNeXt-Tiny is more robust than EfficientNet-B0; heavy rain degrades classification more strongly than localization, contrary to heavy fog; integrating MC-Dropout into selected layers only has the potential to enhance task performance and confidence estimation, whereby the identification of these layers depends on the type of distribution shift and the considered task.
comment: This preprint has not undergone any post-submission improvements or corrections
☆ LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution
Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our codes are available at https://github.com/jwgdmkj/LMLT.
☆ Disclosure of AI-Generated News Increases Engagement but Does Not Reduce Aversion, Despite Positive Quality Ratings
The advancement of artificial intelligence (AI) has led to its application in many areas, including journalism. One key issue is the public's perception of AI-generated content. This preregistered study investigates (i) the perceived quality of AI-assisted and AI-generated versus human-generated news articles, (ii) whether disclosure of AI's involvement in generating these news articles influences engagement with them, and (iii) whether such awareness affects the willingness to read AI-generated articles in the future. We employed a between-subjects survey experiment with 599 participants from the German-speaking part of Switzerland, who evaluated the credibility, readability, and expertise of news articles. These articles were either written by journalists (control group), rewritten by AI (AI-assisted group), or entirely generated by AI (AI-generated group). Our results indicate that all news articles, regardless of whether they were written by journalists or AI, were perceived to be of equal quality. When participants in the treatment groups were subsequently made aware of AI's involvement in generating the articles, they expressed a higher willingness to engage with (i.e., continue reading) the articles than participants in the control group. However, they were not more willing to read AI-generated news in the future. These results suggest that aversion to AI usage in news media is not primarily rooted in a perceived lack of quality, and that by disclosing using AI, journalists could attract more immediate engagement with their content, at least in the short term.
☆ Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation
Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours, which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error; however, no work has been done on improving the "utility" of Bayesian uncertainty maps such that uncertainty is only present in inaccurate regions and not in accurate ones. Our work trains the FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss, which promotes uncertainty to be present only in inaccurate regions. We apply this method to datasets of two radiotherapy body sites, namely head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that, compared to the Bayesian baseline, the proposed method successfully suppresses uncertainty for accurate voxels while retaining a similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at https://github.com/prerakmody/bayesuncertainty-error-correspondence
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:018
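For context on the AvU loss mentioned above, the count-based Accuracy-vs-Uncertainty score it is built around can be sketched as below: uncertainty should coincide with inaccurate voxels. The training loss itself is a differentiable (soft) variant, and the uncertainty threshold here is an illustrative assumption.

```python
# Sketch of the count-based Accuracy-vs-Uncertainty (AvU) score: it rewards
# uncertainty that coincides with inaccurate voxels and certainty that
# coincides with accurate ones. Threshold and data are illustrative only.
import numpy as np

def avu(correct: np.ndarray, uncertainty: np.ndarray, u_thresh: float) -> float:
    certain = uncertainty < u_thresh
    n_ac = np.sum(correct & certain)        # accurate and certain   (good)
    n_au = np.sum(correct & ~certain)       # accurate but uncertain (wasted review effort)
    n_ic = np.sum(~correct & certain)       # inaccurate but certain (missed error)
    n_iu = np.sum(~correct & ~certain)      # inaccurate and uncertain (good)
    return (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu)

rng = np.random.default_rng(0)
correct = rng.random(10_000) < 0.9                      # voxel-wise correctness of a contour
entropy = np.where(correct, rng.random(10_000) * 0.3,   # low uncertainty on accurate voxels
                   rng.random(10_000))                  # mixed uncertainty on erroneous voxels
print("AvU =", round(avu(correct, entropy, u_thresh=0.3), 3))
```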
☆ Characterizing Massive Activations of Attention Mechanism in Graph Neural Networks
Graph Neural Networks (GNNs) have become increasingly popular for effectively modeling data with graph structures. Recently, attention mechanisms have been integrated into GNNs to improve their ability to capture complex patterns. This paper presents the first comprehensive study revealing a critical, unexplored consequence of this integration: the emergence of Massive Activations (MAs) within attention layers. We introduce a novel method for detecting and analyzing MAs, focusing on edge features in different graph transformer architectures. Our study assesses various GNN models using benchmark datasets, including ZINC, TOX21, and PROTEINS. Key contributions include (1) establishing the direct link between attention mechanisms and MAs generation in GNNs, (2) developing a robust definition and detection method for MAs based on activation ratio distributions, (3) introducing the Explicit Bias Term (EBT) as a potential countermeasure and exploring it as an adversarial framework to assess models robustness based on the presence or absence of MAs. Our findings highlight the prevalence and impact of attention-induced MAs across different architectures, such as GraphTransformer, GraphiT, and SAN. The study reveals the complex interplay between attention mechanisms, model architecture, dataset characteristics, and MAs emergence, providing crucial insights for developing more robust and reliable graph models.
☆ How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes
Decoder-only LLMs have shown impressive performance in MT due to their ability to learn from extensive datasets and generate high-quality translations. However, LLMs often struggle with the nuances and style required for organisation-specific translation. In this study, we explore the effectiveness of fine-tuning Large Language Models (LLMs), particularly Llama 3 8B Instruct, leveraging translation memories (TMs), as a valuable resource to enhance accuracy and efficiency. We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector. Our experiments cover five translation directions across languages of varying resource levels (English to Brazilian Portuguese, Czech, German, Finnish, and Korean). We analyse diverse sizes of training datasets (1k to 207k segments) to evaluate their influence on translation quality. We fine-tune separate models for each training set and evaluate their performance based on automatic metrics, BLEU, chrF++, TER, and COMET. Our findings reveal improvement in translation performance with larger datasets across all metrics. On average, BLEU and COMET scores increase by 13 and 25 points, respectively, on the largest training set against the baseline model. Notably, there is a performance deterioration in comparison with the baseline model when fine-tuning on only 1k and 2k examples; however, we observe a substantial improvement as the training dataset size increases. The study highlights the potential of integrating TMs with LLMs to create bespoke translation models tailored to the specific needs of businesses, thus enhancing translation quality and reducing turn-around times. This approach offers a valuable insight for organisations seeking to leverage TMs and LLMs for optimal translation outcomes, especially in narrower domains.
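Two of the automatic metrics reported above, BLEU and chrF++ (plus TER), can be computed with the sacrebleu package (assuming version 2.x) as sketched below; COMET requires the separate unbabel-comet package and a downloaded model, so it is omitted. The sentences are placeholders, not the study's test data.

```python
# Small sketch of computing BLEU, chrF++, and TER with sacrebleu (2.x assumed).
# Hypotheses and references are placeholder sentences, not the study's data.
import sacrebleu

hypotheses = ["Das System legt die Datei automatisch im Projektordner ab."]
references = [["Das System speichert die Datei automatisch im Projektordner."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU {bleu.score:.1f}  chrF++ {chrf.score:.1f}  TER {ter.score:.1f}")
```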
☆ Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities
The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, leading to improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring whether the results hold also for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and show that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.
☆ KiloBot: A Programming Language for Deploying Perception-Guided Industrial Manipulators at Scale
We would like industrial robots to handle unstructured environments with cameras and perception pipelines. In contrast to traditional industrial robots that replay offline-crafted trajectories, online behavior planning is required for these perception-guided industrial applications. Aside from perception and planning algorithms, deploying perception-guided manipulators also requires substantial effort in integration. One approach is writing scripts in a traditional language (such as Python) to construct the planning problem and perform integration with other algorithmic modules & external devices. While scripting in Python is feasible for a handful of robots and applications, deploying perception-guided manipulation at scale (e.g., more than 10000 robot workstations in over 2000 customer sites) becomes intractable. To resolve this challenge, we propose a Domain-Specific Language (DSL) for perception-guided manipulation applications. To scale up the deployment, our DSL provides: 1) an easily accessible interface to construct & solve a sub-class of Task and Motion Planning (TAMP) problems that are important in practical applications; and 2) a mechanism to implement flexible control flow to perform integration and address customized requirements of distinct industrial applications. Combined with an intuitive graphical programming frontend, our DSL is mainly used by machine operators without coding experience in traditional programming languages. Within hours of training, operators are capable of orchestrating sophisticated manipulation behaviors with our DSL. Extensive practical deployments demonstrate the efficacy of our method.
☆ Reinforcement Learning Approach to Optimizing Profilometric Sensor Trajectories for Surface Inspection
High-precision surface defect detection in manufacturing is essential for ensuring quality control. Laser triangulation profilometric sensors are key to this process, providing detailed and accurate surface measurements over a line. To achieve a complete and precise surface scan, accurate relative motion between the sensor and the workpiece is required. It is crucial to control the sensor pose to maintain optimal distance and relative orientation to the surface. It is also important to ensure uniform profile distribution throughout the scanning process. This paper presents a novel Reinforcement Learning (RL) based approach to optimize robot inspection trajectories for profilometric sensors. Building upon the Boustrophedon scanning method, our technique dynamically adjusts the sensor position and tilt to maintain optimal orientation and distance from the surface, while also ensuring a consistent profile distance for uniform and high-quality scanning. Utilizing a simulated environment based on the CAD model of the part, we replicate real-world scanning conditions, including sensor noise and surface irregularities. This simulation-based approach enables offline trajectory planning based on CAD models. Key contributions include the modeling of the state space, action space, and reward function, specifically designed for inspection applications using profilometric sensors. We use Proximal Policy Optimization (PPO) algorithm to efficiently train the RL agent, demonstrating its capability to optimize inspection trajectories with profilometric sensors. To validate our approach, we conducted several experiments where a model trained on a specific training piece was tested on various parts in simulation. Also, we conducted a real-world experiment by executing the optimized trajectory, generated offline from a CAD model, to inspect a part using a UR3e robotic arm model.
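A schematic of how such a scan-pose policy could be trained with PPO is given below, assuming gymnasium and stable-baselines3 (v2 or later). The toy environment's state (distance and tilt errors), actions, and reward are placeholders, not the paper's CAD-based simulator or reward design.

```python
# Schematic of training a scan-pose policy with PPO via stable-baselines3.
# The environment is a toy stand-in: state = (distance error, tilt error),
# action = incremental corrections, reward = staying near the optimal pose.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class ScanPoseEnv(gym.Env):
    """Toy stand-in for a profilometric scanning simulator."""
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Box(-0.05, 0.05, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(-0.5, 0.5, size=2).astype(np.float32)
        self.steps = 0
        return self.state, {}

    def step(self, action):
        self.state = np.clip(self.state + action, -1.0, 1.0).astype(np.float32)
        dist_err, tilt_err = self.state
        reward = -abs(float(dist_err)) - abs(float(tilt_err))  # keep optimal stand-off and orientation
        self.steps += 1
        return self.state, reward, False, self.steps >= 100, {}

model = PPO("MlpPolicy", ScanPoseEnv(), verbose=0)
model.learn(total_timesteps=2_000)                              # short illustrative run
```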
☆ KAN See In the Dark
Existing low-light image enhancement methods struggle to fit the complex nonlinear relationship between normal and low-light images due to uneven illumination and noise effects. The recently proposed Kolmogorov-Arnold networks (KANs) feature spline-based convolutional layers and learnable activation functions, which can effectively capture nonlinear dependencies. In this paper, we design a KAN-Block based on KANs and innovatively apply it to low-light image enhancement. This method effectively alleviates the limitations of current methods constrained by linear network structures and lack of interpretability, further demonstrating the potential of KANs in low-level vision tasks. Given the poor perception of current low-light image enhancement methods and the stochastic nature of the inverse diffusion process, we further introduce frequency-domain perception for visually oriented enhancement. Extensive experiments demonstrate the competitive performance of our method on benchmark datasets. The code will be available at: https://github.com/AXNing/KSID.
☆ Game On: Towards Language Models as RL Experimenters
We propose an agent architecture that automates parts of the common reinforcement learning experiment workflow, to enable automated mastery of control domains for embodied agents. To do so, it leverages a VLM to perform some of the capabilities normally required of a human experimenter, including the monitoring and analysis of experiment progress, the proposition of new tasks based on past successes and failures of the agent, decomposing tasks into a sequence of subtasks (skills), and retrieval of the skill to execute - enabling our system to build automated curricula for learning. We believe this is one of the first proposals for a system that leverages a VLM throughout the full experiment cycle of reinforcement learning. We provide a first prototype of this system, and examine the feasibility of current models and techniques for the desired level of automation. For this, we use a standard Gemini model, without additional fine-tuning, to provide a curriculum of skills to a language-conditioned Actor-Critic algorithm, in order to steer data collection so as to aid learning new skills. Data collected in this way is shown to be useful for learning and iteratively improving control policies in a robotics domain. Additional examination of the ability of the system to build a growing library of skills, and to judge the progress of the training of those skills, also shows promising results, suggesting that the proposed architecture provides a potential recipe for fully automated mastery of tasks and domains for embodied agents.
☆ Hardware Acceleration of LLMs: A comprehensive survey and comparison
Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. In this paper, we present a comprehensive survey of the several research efforts that have been presented for the acceleration of transformer networks for Large Language Models using hardware accelerators. The survey presents the frameworks that have been proposed and then performs a qualitative and quantitative comparison regarding the technology, the processing platform (FPGA, ASIC, In-Memory, GPU), the speedup, the performance (GOPs), and the energy efficiency (GOPs/W) of each framework. The main challenge in this comparison is that every proposed scheme is implemented on a different process technology, making a fair comparison difficult. The main contribution of this paper is that we extrapolate the performance and energy-efficiency results to the same process technology to enable a fair comparison: one theoretical and one more practical. We implement part of the LLMs on several FPGA chips to extrapolate the results to the same process technology and then we make a fair comparison of the performance.
comment: https://airtable.com/appC2VwR6X4EeZ50s/shrKwchys0iktvDwk
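The normalization step described above can be illustrated with a toy example; the accelerator entries and the quadratic energy-scaling assumption below are placeholders, not values or rules taken from the survey.

    # Toy illustration of normalizing reported accelerator results to a common
    # process node. The entries and the scaling assumption (energy per operation
    # roughly proportional to feature size squared) are placeholders, not survey data.
    REFERENCE_NODE_NM = 16

    accelerators = [
        {"name": "fpga_design_a", "node_nm": 28, "gops": 400, "gops_per_w": 50},
        {"name": "asic_design_b", "node_nm": 7,  "gops": 900, "gops_per_w": 300},
    ]

    def normalize(entry, ref_nm=REFERENCE_NODE_NM, energy_exponent=2.0):
        scale = entry["node_nm"] / ref_nm
        return {
            "name": entry["name"],
            # assume energy per operation shrinks roughly quadratically with feature size
            "gops_per_w_at_ref": entry["gops_per_w"] * scale ** energy_exponent,
            "gops": entry["gops"],
        }

    for acc in accelerators:
        print(normalize(acc))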
☆ CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks
Cognitive psychology investigates perception, attention, memory, language, problem-solving, decision-making, and reasoning. Kahneman's dual-system theory elucidates the human decision-making process, distinguishing between the rapid, intuitive System 1 and the deliberative, rational System 2. Recent advancements have positioned large language models (LLMs) as formidable tools nearing human-level proficiency in various cognitive tasks. Nonetheless, the presence of a dual-system framework analogous to human cognition in LLMs remains unexplored. This study introduces the \textbf{CogniDual Framework for LLMs} (CFLLMs), designed to assess whether LLMs can, through self-training, evolve from deliberate deduction to intuitive responses, thereby emulating the human process of acquiring and mastering new information. Our findings reveal the cognitive mechanisms behind LLMs' response generation, enhancing our understanding of their capabilities in cognitive psychology. Practically, self-trained models can provide faster responses to certain queries, reducing computational demands during inference.
☆ Raw Speech Enhancement with Deep State Space Modeling
We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement in an end-to-end fashion. The network's performance is primarily evaluated on raw speech denoising, with additional assessments on tasks such as super-resolution and de-quantization. We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets. The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency. Even as a raw waveform processing model, the model maintains high fidelity to the clean signal with minimal audible artifacts. In addition, the model remains performant even when the noisy input is compressed down to 4000Hz and 4 bits, suggesting general speech enhancement capabilities in low-resource environments.
comment: 7 pages, 2 figures
☆ Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time
Based on official estimates, 50 million people worldwide are affected by dementia, and this number increases by 10 million new patients every year. Without a cure, clinical prognostication and early intervention represent the most effective ways to delay its progression. To this end, Artificial Intelligence and computational linguistics can be exploited for natural language analysis, personalized assessment, monitoring, and treatment. However, traditional approaches lack sufficient semantic knowledge management and explainability capabilities. Moreover, the use of Large Language Models (LLMs) for cognitive decline diagnosis is still scarce, even though these models represent the most advanced way for clinical-patient communication using intelligent systems. Consequently, we leverage an LLM using the latest Natural Language Processing (NLP) techniques in a chatbot solution to provide interpretable Machine Learning predictions of cognitive decline in real time. Linguistic-conceptual features are exploited for appropriate natural language analysis. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. In more detail, the proposed pipeline is composed of (i) data extraction employing NLP-based prompt engineering; (ii) stream-based data processing including feature engineering, analysis, and selection; (iii) real-time classification; and (iv) an explainability dashboard that provides visual and natural language descriptions of the prediction outcome. Classification results exceed 80% in all evaluation metrics, with a recall of about 85% for the mental deterioration class. To sum up, this work contributes an affordable, flexible, non-invasive, personalized diagnostic system.
☆ Sketch: A Toolkit for Streamlining LLM Operations
Large language models (LLMs) represented by GPT family have achieved remarkable success. The characteristics of LLMs lie in their ability to accommodate a wide range of tasks through a generative approach. However, the flexibility of their output format poses challenges in controlling and harnessing the model's outputs, thereby constraining the application of LLMs in various domains. In this work, we present Sketch, an innovative toolkit designed to streamline LLM operations across diverse fields. Sketch comprises the following components: (1) a suite of task description schemas and prompt templates encompassing various NLP tasks; (2) a user-friendly, interactive process for building structured output LLM services tailored to various NLP tasks; (3) an open-source dataset for output format control, along with tools for dataset construction; and (4) an open-source model based on LLaMA3-8B-Instruct that adeptly comprehends and adheres to output formatting instructions. We anticipate this initiative to bring considerable convenience to LLM users, achieving the goal of ''plug-and-play'' for various applications. The components of Sketch will be progressively open-sourced at https://github.com/cofe-ai/Sketch.
☆ YOLO-PPA based Efficient Traffic Sign Detection for Cruise Control in Autonomous Driving
It is very important to detect traffic signs efficiently and accurately in autonomous driving systems. However, the farther the distance, the smaller the traffic signs appear, and existing object detection algorithms can hardly detect these small-scale signs. In addition, the performance of embedded devices on vehicles limits the scale of detection models. To address these challenges, a YOLO-PPA based traffic sign detection algorithm is proposed in this paper. The experimental results on the GTSDB dataset show that, compared to the original YOLO, the proposed method improves inference efficiency by 11.2%. The mAP50 is also improved by 93.2%, which demonstrates the effectiveness of the proposed YOLO-PPA.
☆ N-gram Prediction and Word Difference Representations for Language Modeling
Causal language modeling (CLM) serves as the foundational framework underpinning the remarkable successes of recent large language models (LLMs). Despite this success, the next-word-prediction training objective poses a potential risk of causing the model to focus overly on local dependencies within a sentence. While prior studies have proposed predicting the future N words simultaneously, they were primarily applied to tasks such as masked language modeling (MLM) and neural machine translation (NMT). In this study, we introduce a simple N-gram prediction framework for the CLM task. Moreover, we introduce word difference representation (WDR) as a surrogate, contextualized target representation during model training on the basis of the N-gram prediction framework. To further enhance the quality of next-word prediction, we propose an ensemble method that incorporates the future N words' prediction results. Empirical evaluations across multiple benchmark datasets encompassing CLM and NMT tasks demonstrate the significant advantages of our proposed methods over the conventional CLM.
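A minimal sketch of the general multi-word prediction idea (the paper's WDR targets and ensemble details are not reproduced): N light-weight heads are attached to a causal LM backbone so that position t predicts tokens t+1 through t+N.

    # Sketch of N-gram prediction heads on top of a causal LM backbone
    # (illustrative only; WDR targets and the ensemble method are omitted).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NGramHeads(nn.Module):
        def __init__(self, hidden_dim, vocab_size, n=3):
            super().__init__()
            self.heads = nn.ModuleList([nn.Linear(hidden_dim, vocab_size) for _ in range(n)])

        def loss(self, hidden, targets):
            # hidden: (batch, seq, hidden_dim); targets: (batch, seq) token ids
            total = 0.0
            for k, head in enumerate(self.heads, start=1):
                logits = head(hidden[:, :-k])          # position t predicts token t+k
                gold = targets[:, k:]
                total = total + F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)), gold.reshape(-1))
            return total / len(self.heads)

    heads = NGramHeads(hidden_dim=64, vocab_size=100, n=3)
    h = torch.randn(2, 10, 64)
    y = torch.randint(0, 100, (2, 10))
    print(heads.loss(h, y))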
LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts EMNLP
With the emergence of widely available powerful large language models (LLMs), disinformation generated by LLMs has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world is still to be proven. In this paper, we focus on an important setting in information operations: short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increases, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates a need for domain-specific benchmarking, while the latter suggests a trade-off between adversarial evasion resilience and overfitting to the reference human text, both of which need to be evaluated in benchmarks but are currently absent. We believe this calls for a reconsideration of current LLM detector benchmarking approaches, and we provide a dynamically extensible benchmark to enable it (https://github.com/Reliable-Information-Lab-HEVS/dynamic_llm_detector_benchmark).
comment: 20 pages, 7 tables, 13 figures, under consideration for EMNLP
☆ iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models
Most available data is unstructured, making it challenging to access valuable information. Automatically building Knowledge Graphs (KGs) is crucial for structuring data and making it accessible, allowing users to search for information effectively. KGs also facilitate insights, inference, and reasoning. Traditional NLP methods, such as named entity recognition and relation extraction, are key in information retrieval but face limitations, including the use of predefined entity types and the need for supervised learning. Current research leverages large language models' capabilities, such as zero- or few-shot learning. However, unresolved and semantically duplicated entities and relations still pose challenges, leading to inconsistent graphs and requiring extensive post-processing. Additionally, most approaches are topic-dependent. In this paper, we propose iText2KG, a method for incremental, topic-independent KG construction without post-processing. This plug-and-play, zero-shot method is applicable across a wide range of KG construction scenarios and comprises four modules: Document Distiller, Incremental Entity Extractor, Incremental Relation Extractor, and Graph Integrator and Visualization. Our method demonstrates superior performance compared to baseline methods across three scenarios: converting scientific papers to graphs, websites to graphs, and CVs to graphs.
comment: Accepted at The International Web Information Systems Engineering conference (the WISE conference) 2024
☆ ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding
Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, the application of alignment training within the chart domain is still underexplored. To address this, we propose ChartMoE, which employs the mixture of expert (MoE) architecture to replace the traditional linear projector to bridge the modality gap. Specifically, we train multiple linear connectors through distinct alignment tasks, which are utilized as the foundational initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with over 900K chart-table-JSON-code quadruples to conduct three alignment tasks (chart-table/JSON/code). Combined with the vanilla connector, we initialize different experts in four distinct ways and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy, e.g., ChartMoE improves the accuracy of the previous state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.
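A rough sketch of the general idea of replacing a single linear projector with a routed mixture of linear experts; in ChartMoE the experts are initialized from separately aligned connectors, which is only indicated in a comment here.

    # Sketch of a mixture-of-experts connector between a vision encoder and an LLM
    # (illustrative only). In ChartMoE the experts are initialized from connectors
    # trained on different alignment tasks; here they are randomly initialized.
    import torch
    import torch.nn as nn

    class MoEConnector(nn.Module):
        def __init__(self, vision_dim, llm_dim, num_experts=4, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList([nn.Linear(vision_dim, llm_dim) for _ in range(num_experts)])
            self.router = nn.Linear(vision_dim, num_experts)
            self.top_k = top_k

        def forward(self, visual_tokens):  # (batch, tokens, vision_dim)
            gate = self.router(visual_tokens).softmax(dim=-1)          # (B, T, E)
            weights, idx = gate.topk(self.top_k, dim=-1)               # keep top-k experts per token
            weights = weights / weights.sum(dim=-1, keepdim=True)
            out = torch.zeros(*visual_tokens.shape[:2], self.experts[0].out_features,
                              device=visual_tokens.device)
            for e, expert in enumerate(self.experts):
                w = (weights * (idx == e)).sum(dim=-1, keepdim=True)   # gating weight of expert e per token
                out = out + w * expert(visual_tokens)
            return out

    proj = MoEConnector(vision_dim=1024, llm_dim=4096)
    print(proj(torch.randn(1, 8, 1024)).shape)  # torch.Size([1, 8, 4096])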
☆ Recent Advances in Attack and Defense Approaches of Large Language Models
Large Language Models (LLMs) have revolutionized artificial intelligence and machine learning through their advanced text processing and generating capabilities. However, their widespread deployment has raised significant safety and reliability concerns. Established vulnerabilities in deep neural networks, coupled with emerging threat models, may compromise security evaluations and create a false sense of security. Given the extensive research in the field of LLM security, we believe that summarizing the current state of affairs will help the research community better understand the present landscape and inform future developments. This paper reviews current research on LLM vulnerabilities and threats, and evaluates the effectiveness of contemporary defense mechanisms. We analyze recent studies on attack vectors and model weaknesses, providing insights into attack mechanisms and the evolving threat landscape. We also examine current defense strategies, highlighting their strengths and limitations. By contrasting advancements in attack and defense methodologies, we identify research gaps and propose future directions to enhance LLM security. Our goal is to advance the understanding of LLM safety challenges and guide the development of more robust security measures.
☆ Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation
The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs). However, despite their widespread adoption and success, CoT methods often exhibit instability due to their inability to consistently ensure the quality of generated reasoning paths, leading to sub-optimal reasoning performance. To address this challenge, we propose the \textbf{Strategic Chain-of-Thought} (SCoT), a novel methodology designed to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps. SCoT employs a two-stage approach within a single prompt: first eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers. Our experiments across eight challenging reasoning datasets demonstrate significant improvements, including a 21.05\% increase on the GSM8K dataset and a 24.13\% increase on the Tracking\_Objects dataset with the Llama3-8b model. Additionally, we extend the SCoT framework to develop a few-shot method with automatically matched demonstrations, yielding even stronger results. These findings underscore the efficacy of SCoT, highlighting its potential to substantially enhance LLM performance in complex reasoning tasks.
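The exact prompts are not given in the abstract; the template below is a hypothetical reconstruction of the two-stage, single-prompt idea (elicit a strategy, then apply it).

    # Hypothetical two-stage, single-prompt template in the spirit of SCoT.
    # The wording is illustrative, not the paper's actual prompt.
    SCOT_TEMPLATE = """You are an expert problem solver.

    Problem:
    {question}

    Step 1 - Strategy: Before solving, state the most effective strategy
    (e.g., which known method, equation, or decomposition applies) in 1-3 sentences.

    Step 2 - Solution: Apply the strategy above step by step, then give the
    final answer on a new line starting with "Answer:".
    """

    def build_scot_prompt(question: str) -> str:
        return SCOT_TEMPLATE.format(question=question)

    print(build_scot_prompt("A train travels 120 km in 1.5 hours. What is its average speed?"))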
☆ Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision ECCV 2024
Recent advances in interactive keypoint estimation methods have enhanced accuracy while minimizing user intervention. However, these methods require user input for error correction, which can be costly in vertebrae keypoint estimation where inaccurate keypoints are densely clustered or overlap. We introduce a novel approach, KeyBot, specifically designed to identify and correct significant and typical errors in existing models, akin to user revision. By characterizing typical error types and using simulated errors for training, KeyBot effectively corrects these errors and significantly reduces user workload. Comprehensive quantitative and qualitative evaluations on three public datasets confirm that KeyBot significantly outperforms existing methods, achieving state-of-the-art performance in interactive vertebrae keypoint estimation. The source code and demo video are available at: https://ts-kim.github.io/KeyBot/
comment: 33 pages, ECCV 2024, Project Page: https://ts-kim.github.io/KeyBot/
☆ In Search of Trees: Decision-Tree Policy Synthesis for Black-Box Systems via Search
Decision trees, owing to their interpretability, are attractive as control policies for (dynamical) systems. Unfortunately, constructing, or synthesising, such policies is a challenging task. Previous approaches do so by imitating a neural-network policy, approximating a tabular policy obtained via formal synthesis, employing reinforcement learning, or modelling the problem as a mixed-integer linear program. However, these works may require access to a hard-to-obtain accurate policy or a formal model of the environment (within reach of formal synthesis), and may not provide guarantees on the quality or size of the final tree policy. In contrast, we present an approach to synthesise optimal decision-tree policies given a black-box environment and specification, and a discretisation of the tree predicates, where optimality is defined with respect to the number of steps to achieve the goal. Our approach is a specialised search algorithm which systematically explores the (exponentially large) space of decision trees under the given discretisation. The key component is a novel pruning mechanism that significantly reduces the search space. Our approach represents a conceptually novel way of synthesising small decision-tree policies with optimality guarantees even for black-box environments with black-box specifications.
comment: 8 pages main text incl. references, 1 page appendix
☆ Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard
This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which has relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.
☆ E2CL: Exploration-based Error Correction Learning for Embodied Agents
Language models are exhibiting increasing capability in knowledge utilization and reasoning. However, when applied as agents in embodied environments, they often suffer from misalignment between their intrinsic knowledge and environmental knowledge, leading to infeasible actions. Traditional environment alignment methods, such as supervised learning on expert trajectories and reinforcement learning, face limitations in covering environmental knowledge and achieving efficient convergence, respectively. Inspired by human learning, we propose Exploration-based Error Correction Learning (E2CL), a novel framework that leverages exploration-induced errors and environmental feedback to enhance environment alignment for LM-based agents. E2CL incorporates teacher-guided and teacher-free exploration to gather environmental feedback and correct erroneous actions. The agent learns to provide feedback and self-correct, thereby enhancing its adaptability to target environments. Evaluations in the Virtualhome environment demonstrate that E2CL-trained agents outperform those trained by baseline methods and exhibit superior self-correction capabilities.
☆ Granular-ball Representation Learning for Deep CNN on Learning with Label Noise
In real scenarios, whether data is annotated manually or automatically, label noise is inevitably introduced into the training data, which can affect the effectiveness of deep CNN models. Popular solutions require data cleaning or additional optimization objectives that penalize mislabeled data, thereby enhancing the robustness of models. However, these methods come at the cost of weakening or even discarding some data during the training process. As we know, content is the inherent attribute of an image that does not change with changes in annotations. In this study, we propose a general granular-ball computing (GBC) module that can be embedded into a CNN model, where the classifier predicts the label of granular-ball ($gb$) samples instead of each individual sample. Specifically, for the classification task: (1) in the forward process, we split the input samples into $gb$ samples at the feature level, each of which can correspond to a varying number of individual samples sharing a single label; (2) during backpropagation, we modify the gradient allocation strategy of the GBC module so that gradients propagate normally; and (3) we develop an experience replay policy to ensure the stability of the training process. Experiments demonstrate that the proposed method can improve the robustness of CNN models with no additional data or optimization.
☆ DiffGrad for Physics-Informed Neural Networks
Physics-Informed Neural Networks (PINNs) are regarded as state-of-the-art tools for addressing highly nonlinear problems based on partial differential equations. Despite their broad range of applications, PINNs encounter several performance challenges, including issues related to efficiency, minimization of computational cost, and enhancement of accuracy. Burgers' equation, a fundamental equation in fluid dynamics that is extensively used in PINNs, yields inconsistent results with the Adam optimizer, which does not account for the change between successive gradients. This paper introduces a novel strategy for solving Burgers' equation by incorporating DiffGrad with PINNs, a method that leverages the difference between current and immediately preceding gradients to enhance performance. A comprehensive computational analysis is conducted using the Adam, Adamax, RMSprop, and DiffGrad optimizers to evaluate and compare their effectiveness. Our approach includes visualizing the solutions over space at various time intervals to demonstrate the accuracy of the network. The results show that DiffGrad not only improves the accuracy of the solution but also reduces training time compared to the other optimizers.
comment: 20 pages, 14 figures
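A stand-alone sketch of a single DiffGrad update as commonly described in the optimizer literature: an Adam step scaled by a friction coefficient computed from the change between consecutive gradients. It is a simplified illustration, not the authors' PINN code.

    # Simplified single-parameter-vector DiffGrad step (NumPy): an Adam update
    # scaled by a "friction" coefficient, the sigmoid of the absolute change
    # between the previous and current gradient. Illustrative only.
    import numpy as np

    def diffgrad_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        state["t"] += 1
        t = state["t"]
        state["m"] = beta1 * state["m"] + (1 - beta1) * grad
        state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
        m_hat = state["m"] / (1 - beta1 ** t)
        v_hat = state["v"] / (1 - beta2 ** t)
        xi = 1.0 / (1.0 + np.exp(-np.abs(state["prev_grad"] - grad)))  # friction coefficient
        state["prev_grad"] = grad
        return theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)

    theta = np.zeros(3)
    state = {"t": 0, "m": np.zeros(3), "v": np.zeros(3), "prev_grad": np.zeros(3)}
    for _ in range(5):
        grad = 2.0 * (theta - np.array([1.0, -2.0, 0.5]))  # gradient of a toy quadratic
        theta = diffgrad_step(theta, grad, state)
    print(theta)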
☆ Content Moderation by LLM: From Accuracy to Legitimacy
One trending application of LLM (large language model) is to use it for content moderation in online platforms. Most current studies on this application have focused on the metric of accuracy - the extent to which LLM makes correct decisions about content. This article argues that accuracy is insufficient and misleading, because it fails to grasp the distinction between easy cases and hard cases as well as the inevitable trade-offs in achieving higher accuracy. Closer examination reveals that content moderation is a constitutive part of platform governance, the key of which is to gain and enhance legitimacy. Instead of making moderation decisions correct, the chief goal of LLM is to make them legitimate. In this regard, this article proposes a paradigm shift from the single benchmark of accuracy towards a legitimacy-based framework of evaluating the performance of LLM moderators. The framework suggests that for easy cases, the key is to ensure accuracy, speed and transparency, while for hard cases, what matters is reasoned justification and user participation. Examined under this framework, LLM's real potential in moderation is not accuracy improvement. Rather, LLM can better contribute in four other aspects: to conduct screening of hard cases from easy cases, to provide quality explanations for moderation decisions, to assist human reviewers in getting more contextual information, and to facilitate user participation in a more interactive way. Using normative theories from law and social sciences to critically assess the new technological application, this article seeks to redefine LLM's role in content moderation and redirect relevant research in this field.
☆ xLAM: A Family of Large Action Models to Empower AI Agent Systems
Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents' generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4
comment: Technical report for the Salesforce xLAM model series
☆ TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
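A minimal sketch of the frame-wise block causal masking pattern for visual tokens: a token may attend to every token in its own frame and in earlier frames. It illustrates the mask only; TC-LLaVA also interleaves text tokens and modifies RoPE.

    # Frame-wise block causal attention mask for visual tokens: a token attends
    # to all tokens in its own frame and in earlier frames. Illustrative only.
    import torch

    def frame_block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
        n = num_frames * tokens_per_frame
        frame_id = torch.arange(n) // tokens_per_frame
        # allowed[i, j] is True when token j belongs to the same or an earlier frame than token i
        return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

    mask = frame_block_causal_mask(num_frames=3, tokens_per_frame=2)
    print(mask.int())
    # tensor([[1, 1, 0, 0, 0, 0],
    #         [1, 1, 0, 0, 0, 0],
    #         [1, 1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 1, 1],
    #         [1, 1, 1, 1, 1, 1]])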
☆ An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification
Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with the language model. In the context of SC, strongly emotional tokens can be critical to the sentiment of the whole sequence. Therefore, rather than rephrasing less important context, we propose DiffusionCLS to leverage a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
☆ Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
Neural networks (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack that triggers a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly-chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal loss of predicting results in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of the BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
comment: 13 pages, 5 figures
☆ InfraLib: Enabling Reinforcement Learning and Decision Making for Large Scale Infrastructure Management
Efficient management of infrastructure systems is crucial for economic stability, sustainability, and public safety. However, infrastructure management is challenging due to the vast scale of systems, stochastic deterioration of components, partial observability, and resource constraints. While data-driven approaches like reinforcement learning (RL) offer a promising avenue for optimizing management policies, their application to infrastructure has been limited by the lack of suitable simulation environments. We introduce InfraLib, a comprehensive framework for modeling and analyzing infrastructure management problems. InfraLib employs a hierarchical, stochastic approach to realistically model infrastructure systems and their deterioration. It supports practical functionality such as modeling component unavailability, cyclical budgets, and catastrophic failures. To facilitate research, InfraLib provides tools for expert data collection, simulation-driven analysis, and visualization. We demonstrate InfraLib's capabilities through case studies on a real-world road network and a synthetic benchmark with 100,000 components.
☆ Continual Skill and Task Learning via Dialogue
Continual and interactive robot learning is a challenging problem as the robot is deployed with human users who expect it to learn novel skills to solve novel tasks perpetually and with sample efficiency. In this work we present a framework for robots to query and learn visuo-motor robot skills and task-relevant information via natural language dialog interactions with human users. Previous approaches either focus on improving the performance of instruction-following agents, or passively learn novel skills or concepts. Instead, we use dialog combined with a language-skill grounding embedding to query or confirm skills and/or tasks requested by a user. To achieve this goal, we developed and integrated three different components for our agent. First, we propose a novel visuo-motor control policy, ACT with Low Rank Adaptation (ACT-LoRA), which enables the existing SoTA ACT model to perform few-shot continual learning. Second, we develop an alignment model that projects demonstrations across skill embodiments into a shared embedding, allowing us to know when to ask questions and/or request demonstrations from users. Finally, we integrate an existing LLM to interact with a human user to perform grounded interactive continual skill learning to solve a task. Our ACT-LoRA model learns novel fine-tuned skills with 100% accuracy when trained with only five demonstrations for a novel skill, while still maintaining 74.75% accuracy on pre-trained skills in the RLBench dataset, where other models fall significantly short. We also performed a human-subjects study with 8 subjects to demonstrate the continual learning capabilities of our combined framework. We achieve a success rate of 75% on the sandwich-making task with the real robot learning from participant data, demonstrating that robots can learn novel skills or task knowledge from dialogue with non-expert users using our approach.
☆ Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models
Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However, existing methods face two significant challenges: \textit{excessively long reasoning paths distracting from the answer generation}, and \textit{false-positive relations hindering the path refinement}. In this paper, we propose an iterative interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG). Specifically, DoG employs a subgraph-focusing mechanism, allowing LLMs to attempt an answer after each reasoning step, thereby mitigating the impact of lengthy reasoning paths. On the other hand, DoG utilizes a multi-role debate team to gradually simplify complex questions, reducing the influence of false-positive relations. This debate mechanism ensures the reliability of the reasoning process. Experimental results on five public datasets demonstrate the effectiveness and superiority of our architecture. Notably, DoG outperforms the state-of-the-art method ToG by 23.7\% and 9.1\% in accuracy on WebQuestions and GrailQA, respectively. Furthermore, the integration experiments with various LLMs on the mentioned datasets highlight the flexibility of DoG. Code is available at \url{https://github.com/reml-group/DoG}.
comment: 12 pages
☆ Addressing the Gaps in Early Dementia Detection: A Path Towards Enhanced Diagnostic Models through Machine Learning
The rapid global aging trend has led to an increase in dementia cases, including Alzheimer's disease, underscoring the urgent need for early and accurate diagnostic methods. Traditional diagnostic techniques, such as cognitive tests, neuroimaging, and biomarker analysis, face significant limitations in sensitivity, accessibility, and cost, particularly in the early stages. This study explores the potential of machine learning (ML) as a transformative approach to enhance early dementia detection by leveraging ML models to analyze and integrate complex multimodal datasets, including cognitive assessments, neuroimaging, and genetic information. A comprehensive review of existing literature was conducted to evaluate various ML models, including supervised learning, deep learning, and advanced techniques such as ensemble learning and transformer models, assessing their accuracy, interpretability, and potential for clinical integration. The findings indicate that while ML models show significant promise in improving diagnostic precision and enabling earlier interventions, challenges remain in their generalizability, interpretability, and ethical deployment. This research concludes by outlining future directions aimed at enhancing the clinical utility of ML models in dementia detection, emphasizing interdisciplinary collaboration and ethically sound frameworks to improve early detection and intervention strategies for Alzheimer's disease and other forms of dementia.
♻ ☆ PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity
One of the components of natural language processing that has received much investigation recently is semantic textual similarity. In computational linguistics and natural language processing, assessing the semantic similarity of words, phrases, paragraphs, and texts is crucial. Semantic similarity is the task of calculating the degree of semantic resemblance between two textual pieces, paragraphs, or phrases, in both monolingual and cross-lingual settings. Cross-lingual semantic similarity requires corpora in which there are sentence pairs in both the source and target languages with a degree of semantic similarity between them. Many existing cross-lingual semantic similarity models resort to machine translation due to the unavailability of cross-lingual semantic similarity datasets, and the propagation of machine translation errors reduces the accuracy of these models. On the other hand, when semantic similarity features are to be used for machine translation, the same machine translations should not be used to compute semantic similarity. For Persian, which is one of the low-resource languages, no effort has been made in this regard, and the need for a model that can understand the context of the two languages is felt more than ever. In this article, a corpus of semantic textual similarity between Persian and English sentences has been produced for the first time with the help of linguistic experts. We name this dataset PESTS (Persian English Semantic Textual Similarity). This corpus contains 5,375 sentence pairs. Also, different transformer-based models have been fine-tuned using this dataset. The results show that using the PESTS dataset, the Pearson correlation of the XLM-RoBERTa model increases from 85.87% to 95.62%.
♻ ☆ SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into multiple categories: human-based, optimization-based, generation-based, and the recent indirect and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it needs to not only handle all the above jailbreak attacks but also incur negligible delays to user prompts, as well as be compatible with both open-source and closed-source LLMs. Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM as a defense instance to concurrently protect the target LLM instance in the normal stack and collaborate with it for checkpoint-based access control. The effectiveness of SelfDefend builds upon our observation that existing LLMs (both target and defense LLMs) have the capability to identify harmful prompts or intentions in user queries, which we empirically validate using the commonly used GPT-3.5/4 models across all major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models. These models outperform six state-of-the-art defenses and match the performance of GPT-4-based SelfDefend, with significantly lower extra delays. We also empirically show that the tuned models are robust to adaptive jailbreaks and prompt injections.
comment: This paper completes its earlier vision paper, available at arXiv:2402.15727. Updated to the latest analysis and results
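A hypothetical sketch of the shadow-LLM pattern: a defense instance screens the user query in parallel with the target model. The llm_call function and the checking prompt are placeholders, not the paper's.

    # Hypothetical sketch of the shadow-LLM defense pattern: a defense instance
    # screens the user query while the target model answers it. `llm_call` is a
    # placeholder for any chat-completion client; the prompts are illustrative only.
    from concurrent.futures import ThreadPoolExecutor

    CHECK_PROMPT = (
        "You are a safety checker. Does the following user request contain a harmful "
        "or policy-violating intention? Reply with exactly 'yes' or 'no'.\n\nRequest: {query}"
    )

    def llm_call(prompt: str) -> str:
        raise NotImplementedError("plug in your own LLM client here")

    def guarded_answer(query: str) -> str:
        with ThreadPoolExecutor(max_workers=2) as pool:
            answer_future = pool.submit(llm_call, query)                              # target LLM
            verdict_future = pool.submit(llm_call, CHECK_PROMPT.format(query=query))  # shadow LLM
            verdict = verdict_future.result().strip().lower()
            if verdict.startswith("yes"):
                return "Sorry, I cannot help with that request."
            return answer_future.result()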
♻ ☆ Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation
In this paper, we introduce Kun, a novel approach for creating high-quality instruction-tuning datasets for large language models (LLMs) without relying on manual annotations. Adapting a self-training algorithm based on instruction back-translation and answer polishment, Kun leverages unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a substantial dataset of over a million Chinese instructional data points. This approach significantly deviates from traditional methods by using a self-curation process to refine and select the most effective instruction-output pairs. Our experiments with the 6B-parameter Yi model across various benchmarks demonstrate Kun's robustness and scalability. Our method's core contributions lie in its algorithmic advancement, which enhances data retention and clarity, and its innovative data generation approach that substantially reduces the reliance on costly and time-consuming manual annotations. This methodology presents a scalable and efficient solution for improving the instruction-following capabilities of LLMs, with significant implications for their application across diverse fields. The code and dataset can be found at https://github.com/Zheng0428/COIG-Kun
comment: 12 pages, 12 figures
♻ ☆ How to Train your Antivirus: RL-based Hardening through the Problem-Space
ML-based malware detection on dynamic analysis reports is vulnerable to both evasion and spurious correlations. In this work, we investigate a specific ML architecture employed in the pipeline of a widely-known commercial antivirus company, with the goal of hardening it against adversarial malware. Adversarial training, the sole defensive technique that can confer empirical robustness, is not applicable out of the box in this domain, for the principal reason that gradient-based perturbations rarely map back to feasible problem-space programs. We introduce a novel Reinforcement Learning approach for constructing adversarial examples, a constituent part of adversarially training a model against evasion. Our approach comes with multiple advantages. It performs modifications that are feasible in the problem-space, and only those; thus it circumvents the inverse mapping problem. It also makes it possible to provide theoretical guarantees on the robustness of the model against a particular set of adversarial capabilities. Our empirical exploration validates our theoretical insights, where we can consistently reach 0% Attack Success Rate after a few adversarial retraining iterations.
comment: 20 pages,4 figures
♻ ☆ A Graph-based Adversarial Imitation Learning Framework for Reliable & Realtime Fleet Scheduling in Urban Air Mobility
The advent of Urban Air Mobility (UAM) presents the scope for a transformative shift in the domain of urban transportation. However, its widespread adoption and economic viability depend in part on the ability to optimally schedule the fleet of aircraft across vertiports in a UAM network, under uncertainties attributed to airspace congestion, changing weather conditions, and varying demands. This paper presents a comprehensive optimization formulation of the fleet scheduling problem, while also identifying the need for alternate solution approaches, since directly solving the resulting integer nonlinear programming problem is computationally prohibitive for daily fleet scheduling. Previous work has shown the effectiveness of using (graph) reinforcement learning (RL) approaches to train real-time executable policy models for fleet scheduling. However, such policies can often be brittle on out-of-distribution scenarios or edge cases. Moreover, training performance also deteriorates as the complexity (e.g., number of constraints) of the problem increases. To address these issues, this paper presents an imitation learning approach where the RL-based policy exploits expert demonstrations yielded by solving the exact optimization using a Genetic Algorithm. The policy model comprises Graph Neural Network (GNN) based encoders that embed the space of vertiports and aircraft, Transformer networks to encode demand, passenger fare, and transport cost profiles, and a Multi-head attention (MHA) based decoder. Expert demonstrations are used through the Generative Adversarial Imitation Learning (GAIL) algorithm. Interfaced with a UAM simulation environment involving 8 vertiports and 40 aircraft, the new imitative approach achieves better mean performance in terms of the daily-profit reward and a remarkable improvement on unseen worst-case scenarios, compared to pure RL results.
comment: Presented at the AIAA Aviation Forum 2024
♻ ☆ Decision Theoretic Foundations for Experiments Evaluating Human Decisions
How well people use information displays to make decisions is of primary interest in human-centered AI, model explainability, data visualization, and related areas. However, what constitutes a decision problem, and what is required for a study to establish that human decisions could be improved remain open to speculation. We propose a widely applicable definition of a decision problem synthesized from statistical decision theory and information economics as a standard for establishing when human decisions can be improved in HCI. We argue that to attribute loss in human performance to forms of bias, an experiment must provide participants with the information that a rational agent would need to identify the utility-maximizing decision. As a demonstration, we evaluate the extent to which recent evaluations of decision-making from the literature on AI-assisted decisions achieve these criteria. We find that only 10 (26\%) of 39 studies that claim to identify biased behavior present participants with sufficient information to characterize their behavior as deviating from good decision-making in at least one treatment condition. We motivate the value of studying well-defined decision problems by describing a characterization of performance losses they allow us to conceive. In contrast, the ambiguities of a poorly communicated decision problem preclude normative interpretation. We conclude with recommendations for practice.
♻ ☆ Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition
Serialized output training (SOT) attracts increasing attention due to its convenience and flexibility for multi-speaker automatic speech recognition (ASR). However, it is not easy to train with attention loss only. In this paper, we propose the overlapped encoding separation (EncSep) to fully utilize the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. This additional separator is inserted after the encoder to extract the multi-speaker information with CTC losses. Furthermore, we propose the serialized speech information guidance SOT (GEncSep) to further utilize the separated encodings. The separated streams are concatenated to provide single-speaker information to guide attention during decoding. The experimental results on LibriMix show that the single-speaker encoding can be separated from the overlapped encoding. The CTC loss helps to improve the encoder representation under complex scenarios. GEncSep further improved performance.
comment: something wrong with input data
♻ ☆ Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions
This paper considers a scenario in city navigation: an AI agent is provided with language descriptions of the goal location with respect to some well-known landmarks; by only observing the scene around it, including recognizing landmarks and road network connections, the agent has to make decisions to navigate to the goal location without instructions. This problem is very challenging, because it requires the agent to establish its own position and acquire a spatial representation of a complex urban environment, where landmarks are often invisible. In the absence of navigation instructions, such abilities are vital for the agent to make high-quality decisions in long-range city navigation. With the emergent reasoning ability of large language models (LLMs), a tempting baseline is to prompt LLMs to "react" on each observation and make decisions accordingly. However, this baseline performs very poorly: the agent often repeatedly visits the same locations and makes short-sighted, inconsistent decisions. To address these issues, this paper introduces a novel agentic workflow featuring the abilities to perceive, reflect, and plan. Specifically, we find LLaVA-7B can be fine-tuned to perceive the direction and distance of landmarks with sufficient accuracy for city navigation. Moreover, reflection is achieved through a memory mechanism, where past experiences are stored and can be retrieved with current perception for effective decision argumentation. Planning uses reflection results to produce long-term plans, which can avoid short-sighted decisions in long-range navigation. We show the designed workflow significantly improves the navigation ability of the LLM agent compared with state-of-the-art baselines.
comment: The experiment and dataset are not enough, and we need more experiments to verify our model
♻ ☆ Hierarchical Generative Adversarial Imitation Learning with Mid-level Input Generation for Autonomous Driving on Urban Environments
Deriving robust control policies for realistic urban navigation scenarios is not a trivial task. In an end-to-end approach, these policies must map high-dimensional images from the vehicle's cameras to low-level actions such as steering and throttle. While pure Reinforcement Learning (RL) approaches are based exclusively on engineered rewards, Generative Adversarial Imitation Learning (GAIL) agents learn from expert demonstrations while interacting with the environment, which favors GAIL on tasks for which a reward signal is difficult to derive, such as autonomous driving. However, training deep networks directly from raw images on RL tasks is known to be unstable and troublesome. To deal with that, this work proposes a hierarchical GAIL-based architecture (hGAIL) which decouples representation learning from the driving task to solve the autonomous navigation of a vehicle. The proposed architecture consists of two modules: a GAN (Generative Adversarial Net) which generates an abstract mid-level input representation, which is the Bird's-Eye View (BEV) from the surroundings of the vehicle; and the GAIL which learns to control the vehicle based on the BEV predictions from the GAN as input. hGAIL is able to learn both the policy and the mid-level representation simultaneously as the agent interacts with the environment. Our experiments made in the CARLA simulation environment have shown that GAIL exclusively from cameras (without BEV) fails to even learn the task, while hGAIL, after training exclusively on one city, was able to autonomously navigate successfully in 98% of the intersections of a new city not used in training phase. Videos and code available at: https://sites.google.com/view/hgail
♻ ☆ Implementation of The Future of Drug Discovery: QuantumBased Machine Learning Simulation (QMLS)
The Research & Development (R&D) phase of drug development is a lengthy and costly process. To revolutionize this process, we introduce our new concept QMLS to shorten the whole R&D phase to three to six months and decrease the cost to merely fifty to eighty thousand USD. For Hit Generation, Machine Learning Molecule Generation (MLMG) generates possible hits according to the molecular structure of the target protein, while Quantum Simulation (QS) filters molecules from the primary assay based on their reaction and binding effectiveness with the target protein. Then, for Lead Optimization, the resultant molecules generated and filtered by MLMG and QS are compared, and molecules that appear as a result of both processes are made into dozens of molecular variations through Machine Learning Molecule Variation (MLMV), while others are only made into a few variations. Lastly, all optimized molecules undergo multiple rounds of QS filtering with a high standard for reaction effectiveness and safety, creating a few dozen pre-clinical-trial-ready drugs. This paper is based on our first paper, where we pitched the concept of machine learning combined with quantum simulations. In this paper we go over the detailed design and framework of QMLS, including MLMG, MLMV, and QS.
comment: 13 pages, 6 figures
♻ ☆ LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning
Large language models (LLMs) have demonstrated significant potential in various tasks, including vulnerability detection. However, current efforts in this area are preliminary, lacking clarity on whether LLMs' vulnerability reasoning capabilities stem from the models themselves or external aids such as knowledge retrieval and tooling support. This paper aims to isolate LLMs' vulnerability reasoning from other capabilities, such as vulnerability knowledge adoption, context information retrieval, and structured output generation. We introduce LLM4Vuln, a unified evaluation framework that separates and assesses LLMs' vulnerability reasoning capabilities and examines improvements when combined with other enhancements. We conducted controlled experiments with 97 ground-truth vulnerabilities and 97 non-vulnerable cases in Solidity and Java, testing them in a total of 9,312 scenarios across four LLMs (GPT-4, GPT-3.5, Mixtral, and Llama 3). Our findings reveal the varying impacts of knowledge enhancement, context supplementation, prompt schemes, and models. Additionally, we identified 14 zero-day vulnerabilities in four pilot bug bounty programs, resulting in \$3,576 in bounties.
comment: This is a technical report by Nanyang Technological University. Updated to support both Solidity and Java
♻ ☆ Large-Batch, Iteration-Efficient Neural Bayesian Design Optimization
Bayesian optimization (BO) provides a powerful framework for optimizing black-box, expensive-to-evaluate functions. It is therefore an attractive tool for engineering design problems, typically involving multiple objectives. Thanks to the rapid advances in fabrication and measurement methods as well as parallel computing infrastructure, querying many design problems can be heavily parallelized. This class of problems challenges BO with an unprecedented setup where it has to deal with very large batches, shifting its focus from sample efficiency to iteration efficiency. We present a novel Bayesian optimization framework specifically tailored to address these limitations. Our key contribution is a highly scalable, sample-based acquisition function that performs a non-dominated sorting of not only the objectives but also their associated uncertainty. We show that our acquisition function in combination with different Bayesian neural network surrogates is effective in data-intensive environments with a minimal number of iterations. We demonstrate the superiority of our method by comparing it with state-of-the-art multi-objective optimizations. We perform our evaluation on two real-world problems -- airfoil design and 3D printing -- showcasing the applicability and efficiency of our approach. Our code is available at: https://github.com/an-on-ym-ous/lbn_mobo
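A minimal sketch of plain non-dominated (Pareto) sorting over candidate points, the building block behind the acquisition described above; the sorting over predictive uncertainty is omitted.

    # Minimal non-dominated (Pareto) sorting of candidate points; every objective
    # is assumed to be minimized. Uncertainty handling is omitted in this sketch.
    import numpy as np

    def non_dominated_sort(points: np.ndarray) -> list:
        """Return a list of index arrays, one per Pareto front (front 0 is best)."""
        remaining = np.arange(len(points))
        fronts = []
        while len(remaining):
            pts = points[remaining]
            dominated = np.zeros(len(remaining), dtype=bool)
            for i, p in enumerate(pts):
                # p is dominated if another point is <= in all objectives and < in at least one
                better_eq = np.all(pts <= p, axis=1)
                strictly = np.any(pts < p, axis=1)
                dominated[i] = np.any(better_eq & strictly)
            fronts.append(remaining[~dominated])
            remaining = remaining[dominated]
        return fronts

    pts = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [0.5, 5.0]])
    print(non_dominated_sort(pts))  # [array([0, 1, 3]), array([2])]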
♻ ☆ Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and 10+ machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at \url{https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications}.
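A minimal sketch of the simplest merging recipe such surveys cover, uniform parameter averaging of models fine-tuned from the same base; task arithmetic, TIES-style trimming, and other methods build on variants of this.

    # Minimal sketch of uniform parameter averaging of models fine-tuned from the
    # same base architecture. Illustrative only; more elaborate merging methods
    # (task arithmetic, TIES, etc.) refine how parameters are combined.
    import torch

    def average_state_dicts(state_dicts):
        merged = {}
        for key in state_dicts[0]:
            merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        return merged

    # usage (hypothetical models sharing one architecture):
    # merged = average_state_dicts([model_a.state_dict(), model_b.state_dict()])
    # base_model.load_state_dict(merged)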
♻ ☆ Triple-domain Feature Learning with Frequency-aware Memory Enhancement for Moving Infrared Small Target Detection
As a sub-field of object detection, moving infrared small target detection presents significant challenges due to tiny target sizes and low contrast against backgrounds. Existing methods rely primarily on features extracted from the spatio-temporal domain; the frequency domain has received little attention, even though it is widely used in image processing. To extend the feature source domains and enhance feature representation, we propose a new Triple-domain Strategy (Tridos) with frequency-aware memory enhancement on top of the spatio-temporal domain for infrared small target detection. The scheme detaches and enhances frequency features through a local-global frequency-aware module built on the Fourier transform. Inspired by the human visual system, our memory enhancement is designed to capture the spatial relations of infrared targets across video frames, and it encodes temporal motion dynamics via differential learning and residual enhancement. Additionally, we design a residual compensation to reconcile possible cross-domain feature mismatches. To the best of our knowledge, Tridos is the first work to explore infrared target feature learning comprehensively across the spatio-temporal-frequency domains. Extensive experiments on three datasets (i.e., DAUB, ITSDT-15K and IRDST) validate that our triple-domain infrared feature learning scheme is clearly superior to state-of-the-art methods. Source codes are available at https://github.com/UESTC-nnLab/Tridos.
comment: This paper has been accepted by IEEE TGRS
♻ ☆ AI-Driven Intrusion Detection Systems (IDS) on the ROAD Dataset: A Comparative Analysis for Automotive Controller Area Network (CAN)
The integration of digital devices in modern vehicles has revolutionized automotive technology, enhancing safety and the overall driving experience. The Controller Area Network (CAN) bus is a central system for managing in-vehicle communication between the electronic control units (ECUs). However, the CAN protocol poses security challenges due to inherent vulnerabilities, lacking encryption and authentication, which, combined with an expanding attack surface, necessitates robust security measures. In response to this challenge, numerous Intrusion Detection Systems (IDS) have been developed and deployed. Nonetheless, an open, comprehensive, and realistic dataset to test the effectiveness of such IDSs remains absent in the existing literature. This paper addresses this gap by considering the latest ROAD dataset, containing stealthy and sophisticated injections. The methodology involves dataset labelling and the implementation of both state-of-the-art deep learning models and traditional machine learning models to show the discrepancy in performance between the datasets most commonly used in the literature and the ROAD dataset, a more realistic alternative.
♻ ☆ On the design space between molecular mechanics and machine learning force fields
A force field as accurate as quantum mechanics (QM) and as fast as molecular mechanics (MM), with which one can simulate a biomolecular system efficiently enough and meaningfully enough to get quantitative insights, is among the most ardent dreams of biophysicists -- a dream, nevertheless, not to be fulfilled any time soon. Machine learning force fields (MLFFs) represent a meaningful endeavor in this direction, where differentiable neural functions are parametrized to fit ab initio energies and, through automatic differentiation, forces. We argue that, as of now, the utility of MLFF models is no longer bottlenecked by accuracy but primarily by their speed (as well as stability and generalizability): many recent variants, on limited chemical spaces, have long surpassed the chemical accuracy of $1$ kcal/mol -- the empirical threshold beyond which realistic chemical predictions are possible -- yet remain orders of magnitude slower than MM. Hoping to kindle explorations and designs of faster, albeit perhaps slightly less accurate MLFFs, in this review we focus our attention on the design space (the speed-accuracy tradeoff) between MM and ML force fields. After a brief review of the building blocks of force fields of either kind, we discuss the desired properties and challenges now faced by the force field development community, survey the efforts to make MM force fields more accurate and ML force fields faster, and envision what the next generation of MLFF might look like.
♻ ☆ On the Impact of Data Heterogeneity in Federated Learning Environments with Application to Healthcare Networks
Federated Learning (FL) allows multiple privacy-sensitive applications to leverage their dataset for a global model construction without any disclosure of the information. One of those domains is healthcare, where groups of silos collaborate in order to generate a global predictor with improved accuracy and generalization. However, the inherent challenge lies in the high heterogeneity of medical data, necessitating sophisticated techniques for assessment and compensation. This paper presents a comprehensive exploration of the mathematical formalization and taxonomy of heterogeneity within FL environments, focusing on the intricacies of medical data. In particular, we address the evaluation and comparison of the most popular FL algorithms with respect to their ability to cope with quantity-based, feature and label distribution-based heterogeneity. The goal is to provide a quantitative evaluation of the impact of data heterogeneity in FL systems for healthcare networks as well as a guideline on FL algorithm selection. Our research extends beyond existing studies by benchmarking seven of the most common FL algorithms against the unique challenges posed by medical data use cases. The paper targets the prediction of the risk of stroke recurrence through a set of tabular clinical reports collected by different federated hospital silos: data heterogeneity frequently encountered in this scenario and its impact on FL performance are discussed.
♻ ☆ CyclicFL: A Cyclic Model Pre-Training Approach to Efficient Federated Learning
Federated learning (FL) has been proposed to enable distributed learning on Artificial Intelligence Internet of Things (AIoT) devices with guarantees of high-level data privacy. Since random initial models in FL can easily result in unregulated Stochastic Gradient Descent (SGD) processes, existing FL methods greatly suffer from both slow convergence and poor accuracy, especially in non-IID scenarios. To address this problem, we propose a novel method named CyclicFL, which can quickly derive effective initial models to guide the SGD processes, thus improving the overall FL training performance. We formally analyze the significance of data consistency between the pre-training and training stages of CyclicFL, showing the limited Lipschitzness of the loss for models pre-trained by CyclicFL. Moreover, we systematically prove that our method can achieve faster convergence under various convexity assumptions. Unlike traditional centralized pre-training methods that require public proxy data, CyclicFL pre-trains initial models on selected AIoT devices cyclically without exposing their local data; it can therefore be easily integrated into any security-critical FL method. Comprehensive experimental results show that CyclicFL can not only improve the maximum classification accuracy by up to $14.11\%$ but also significantly accelerate the overall FL training process.
♻ ☆ Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review
This comprehensive review delves into the pivotal role of prompt engineering in unleashing the capabilities of Large Language Models (LLMs). The development of Artificial Intelligence (AI), from its inception in the 1950s to the emergence of advanced neural networks and deep learning architectures, has led to breakthroughs in LLMs, with models such as GPT-4o and Claude-3, and in Vision-Language Models (VLMs), with models such as CLIP and ALIGN. Prompt engineering, the process of structuring inputs, has emerged as a crucial technique for maximizing the utility and accuracy of these models. This paper explores both foundational and advanced methodologies of prompt engineering, including techniques such as self-consistency, chain-of-thought, and generated knowledge, which significantly enhance model performance. Additionally, it examines prompting methods for VLMs through innovative approaches such as Context Optimization (CoOp), Conditional Context Optimization (CoCoOp), and Multimodal Prompt Learning (MaPLe). Critical to this discussion is the aspect of AI security, particularly adversarial attacks that exploit vulnerabilities in prompt engineering. Strategies to mitigate these risks and enhance model robustness are thoroughly reviewed. The evaluation of prompt methods is also addressed through both subjective and objective metrics, ensuring a robust analysis of their efficacy. This review also reflects the essential role of prompt engineering in advancing AI capabilities, providing a structured framework for future research and application.
♻ ☆ Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) (TOT-CAKT) for ASR. In the TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
comment: Accepted to IEEE SLT 2024
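For readers who want a concrete picture of order-preserving alignment, the sketch below builds an OT coupling between toy acoustic and linguistic sequences whose cost adds a positional penalty for order-violating pairs; the regularized Sinkhorn solver, the specific penalty, and all hyperparameters are illustrative assumptions, not the paper's exact TOT formulation.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropic-regularized OT coupling with uniform marginals."""
    cost = cost / cost.max()                      # scale for numerical stability
    K = np.exp(-cost / reg)
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

def temporal_ot_cost(acoustic, linguistic, gamma=1.0):
    """Feature cost plus a penalty on couplings far from the normalized diagonal,
    discouraging alignments that violate temporal order."""
    d = ((acoustic[:, None, :] - linguistic[None, :, :]) ** 2).sum(-1)
    n, m = d.shape
    pos = np.abs(np.arange(n)[:, None] / n - np.arange(m)[None, :] / m)
    return d + gamma * pos

# toy example: 20 acoustic frames vs. 8 linguistic tokens, 16-dim features
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(20, 16))
linguistic = rng.normal(size=(8, 16))
coupling = sinkhorn(temporal_ot_cost(acoustic, linguistic))
print(coupling.shape, coupling.sum())             # (20, 8), mass sums to ~1
```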
♻ ☆ The Role of Transformer Models in Advancing Blockchain Technology: A Systematic Survey
As blockchain technology rapidly evolves, the demand for enhanced efficiency, security, and scalability grows. Transformer models, as powerful deep learning architectures, have shown unprecedented potential in addressing various blockchain challenges. However, a systematic review of Transformer applications in blockchain is lacking. This paper aims to fill this research gap by surveying over 200 relevant papers, comprehensively reviewing practical cases and research progress of Transformers in blockchain applications. Our survey covers key areas including anomaly detection, smart contract security analysis, cryptocurrency prediction and trend analysis, and code summary generation. To clearly articulate the advancements of Transformers across various blockchain domains, we adopt a domain-oriented classification system, organizing and introducing representative methods based on major challenges in current blockchain research. For each research domain, we first introduce its background and objectives, then review previous representative methods and analyze their limitations, and finally introduce the advancements brought by Transformer models. Furthermore, we explore the challenges of utilizing Transformer models, such as data privacy, model complexity, and real-time processing requirements. Finally, this article proposes future research directions, emphasizing the importance of exploring the Transformer architecture in depth to adapt it to specific blockchain applications, and discusses its potential role in promoting the development of blockchain technology. This review aims to provide new perspectives and a research foundation for the integrated development of blockchain technology and machine learning, supporting further innovation and application expansion of blockchain technology.
♻ ☆ LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
♻ ☆ Exposing and Explaining Fake News On-the-Fly
Social media platforms enable the rapid dissemination and consumption of information. However, users instantly consume such content regardless of the reliability of the shared data, leaving this crowdsourcing model exposed to manipulation. This work contributes an explainable, online classification method to recognize fake news in real time. The proposed method combines unsupervised and supervised Machine Learning approaches with online-created lexica. The profiling is built from creator-, content- and context-based features using Natural Language Processing techniques. The explainable classification mechanism displays in a dashboard the features selected for classification and the prediction confidence. The performance of the proposed solution has been validated with real data sets from Twitter, and the results attain 80% accuracy and macro F-measure. This proposal is the first to jointly provide data stream processing, profiling, classification and explainability. Ultimately, the proposed early detection, isolation and explanation of fake news contribute to increasing the quality and trustworthiness of social media contents.
♻ ☆ A review on the use of large language models as virtual tutors
Transformer architectures contribute to managing long-term dependencies in Natural Language Processing, representing one of the most recent changes in the field. These architectures are the basis of the innovative, cutting-edge Large Language Models (LLMs) that have produced a huge buzz in several fields and industrial sectors, among which education stands out. Accordingly, these generative Artificial Intelligence-based solutions have directed the change in techniques and the evolution in educational methods and contents, along with network infrastructure, towards high-quality learning. Given the popularity of LLMs, this review seeks to provide a comprehensive overview of those solutions designed specifically to generate and evaluate educational materials and which involve students and teachers in their design or experimental plan. To the best of our knowledge, this is the first review of educational applications (e.g., student assessment) of LLMs. As expected, the most common role of these systems is as virtual tutors for automatic question generation. Moreover, the most popular models are GPT-3 and BERT. However, due to the continuous launch of new generative models, new works are expected to be published shortly.
♻ ☆ G-Style: Stylized Gaussian Splatting
We introduce G-Style, a novel algorithm designed to transfer the style of an image onto a 3D scene represented using Gaussian Splatting. Gaussian Splatting is a powerful 3D representation for novel view synthesis, as -- compared to other approaches based on Neural Radiance Fields -- it provides fast scene renderings and user control over the scene. Recent pre-prints have demonstrated that the style of Gaussian Splatting scenes can be modified using an image exemplar. However, since the scene geometry remains fixed during the stylization process, current solutions fall short of producing satisfactory results. Our algorithm aims to address these limitations by following a three-step process: In a pre-processing step, we remove undesirable Gaussians with large projection areas or highly elongated shapes. Subsequently, we combine several losses carefully designed to preserve different scales of the style in the image, while maintaining as much as possible the integrity of the original scene content. During the stylization process and following the original design of Gaussian Splatting, we split Gaussians where additional detail is necessary within our scene by tracking the gradient of the stylized color. Our experiments demonstrate that G-Style generates high-quality stylizations within just a few minutes, outperforming existing methods both qualitatively and quantitatively.
♻ ☆ Tightly-Coupled LiDAR-IMU-Wheel Odometry with Online Calibration of a Kinematic Model for Skid-Steering Robots
Tunnels and long corridors are challenging environments for mobile robots because a LiDAR point cloud tends to degenerate in these environments. To tackle point cloud degeneration, this study presents a tightly-coupled LiDAR-IMU-wheel odometry algorithm with online calibration for skid-steering robots. We propose a full linear wheel odometry factor, which not only serves as a motion constraint but also performs online calibration of the kinematic model for skid-steering robots. Despite dynamically changing kinematic models (e.g., wheel radii changes caused by tire pressures) and terrain conditions, our method can address the model error via online calibration. Moreover, our method enables accurate localization in degenerate environments, such as long straight corridors, by relying on the calibration performed while the LiDAR-IMU fusion operates reliably. Furthermore, we estimate the uncertainty (i.e., covariance matrix) of the wheel odometry online to create a reasonable constraint. The proposed method is validated through three experiments. The first indoor experiment shows that the proposed method is robust against severe degeneracy (long corridors) and changes in the wheel radii. The second outdoor experiment demonstrates that our method accurately estimates the sensor trajectory despite rough outdoor terrain, owing to the online uncertainty estimation of wheel odometry. The third experiment shows that the proposed online calibration enables robust odometry estimation in changing terrains.
comment: open-source: https://github.com/TakuOkawara/full_linear_wheel_odometry_factor
♻ ☆ FREA: Feasibility-Guided Generation of Safety-Critical Scenarios with Reasonable Adversariality
Generating safety-critical scenarios, which are essential yet difficult to collect at scale, offers an effective method to evaluate the robustness of autonomous vehicles (AVs). Existing methods focus on optimizing adversariality while preserving the naturalness of scenarios, aiming to achieve a balance through data-driven approaches. However, without an appropriate upper bound for adversariality, the scenarios might exhibit excessive adversariality, potentially leading to unavoidable collisions. In this paper, we introduce FREA, a novel safety-critical scenario generation method that incorporates the Largest Feasible Region (LFR) of the AV as guidance to ensure the reasonableness of the adversarial scenarios. Concretely, FREA initially pre-calculates the LFR of the AV from offline datasets. Subsequently, it learns a reasonable adversarial policy that controls the scene's critical background vehicles (CBVs) to generate adversarial yet AV-feasible scenarios by maximizing a novel feasibility-dependent adversarial objective function. Extensive experiments illustrate that FREA can effectively generate safety-critical scenarios, yielding a considerable number of near-miss events while ensuring the AV's feasibility. Generalization analysis also confirms the robustness of FREA in AV testing across various surrogate AV methods and traffic environments.
comment: Accepted by CoRL 2024
♻ ☆ SynthAI: A Multi Agent Generative AI Framework for Automated Modular HLS Design Generation
In this paper, we introduce SynthAI, a new method for the automated creation of High-Level Synthesis (HLS) designs. SynthAI integrates ReAct agents, Chain-of-Thought (CoT) prompting, web search technologies, and the Retrieval-Augmented Generation (RAG) framework within a structured decision graph. This innovative approach enables the systematic decomposition of complex hardware design tasks into multiple stages and smaller, manageable modules. As a result, SynthAI produces synthesizable designs that closely adhere to user-specified design objectives and functional requirements. We further validate the capabilities of SynthAI through several case studies, highlighting its proficiency in generating complex, multi-module logic designs from a single initial prompt. The SynthAI code is provided via the following repo: \url{https://github.com/sarashs/FPGA_AGI}
comment: This work is in progress and we will be updating it
♻ ☆ Trustworthy Human-AI Collaboration: Reinforcement Learning with Human Feedback and Physics Knowledge for Safe Autonomous Driving
In the field of autonomous driving, developing safe and trustworthy autonomous driving policies remains a significant challenge. Recently, Reinforcement Learning with Human Feedback (RLHF) has attracted substantial attention due to its potential to enhance training safety and sampling efficiency. Nevertheless, existing RLHF-enabled methods often falter when faced with imperfect human demonstrations, potentially leading to training oscillations or even worse performance than rule-based approaches. Inspired by the human learning process, we propose Physics-enhanced Reinforcement Learning with Human Feedback (PE-RLHF). This novel framework synergistically integrates human feedback (e.g., human intervention and demonstration) and physics knowledge (e.g., traffic flow model) into the training loop of reinforcement learning. The key advantage of PE-RLHF is its guarantee that the learned policy will perform at least as well as the given physics-based policy, even when human feedback quality deteriorates, thus ensuring trustworthy safety improvements. PE-RLHF introduces a Physics-enhanced Human-AI (PE-HAI) collaborative paradigm for dynamic action selection between human and physics-based actions, employs a reward-free approach with a proxy value function to capture human preferences, and incorporates a minimal intervention mechanism to reduce the cognitive load on human mentors. Extensive experiments across diverse driving scenarios demonstrate that PE-RLHF significantly outperforms traditional methods, achieving state-of-the-art (SOTA) performance in safety, efficiency, and generalizability, even with varying quality of human feedback. The philosophy behind PE-RLHF not only advances autonomous driving technology but can also offer valuable insights for other safety-critical domains. Demo video and code are available at: https://zilin-huang.github.io/PE-RLHF-website/
comment: 33 pages, 20 figures
♻ ☆ Adaptive Explicit Knowledge Transfer for Knowledge Distillation
Logit-based knowledge distillation (KD) for classification is cost-efficient compared to feature-based KD but often subject to inferior performance. Recently, it was shown that the performance of logit-based KD can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model, which is known as `implicit (dark) knowledge', to the student model. Through gradient analysis, we first show that this actually has an effect of adaptively controlling the learning of implicit knowledge. Then, we propose a new loss that enables the student to learn explicit knowledge (i.e., the teacher's confidence about the target class) along with implicit knowledge in an adaptive manner. Furthermore, we propose to separate the classification and distillation tasks for effective distillation and inter-class relationship modeling. Experimental results demonstrate that the proposed method, called adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to the state-of-the-art KD methods on the CIFAR-100 and ImageNet datasets.
comment: 19 pages, 5 figures
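To make the explicit/implicit split concrete, here is a minimal PyTorch sketch that distills the teacher's target-class confidence (explicit knowledge) and its non-target distribution (implicit, dark knowledge) as two separate KL terms; the weighting, temperature, and the absence of the paper's adaptive scheme are simplifying assumptions rather than the exact AEKT loss.

```python
import torch
import torch.nn.functional as F

def split_kd_loss(student_logits, teacher_logits, targets, T=4.0,
                  w_explicit=1.0, w_implicit=1.0):
    """Logit-based distillation split into explicit knowledge (target vs. non-target
    mass) and implicit knowledge (distribution over non-target classes).
    A simplified sketch, not the exact AEKT formulation."""
    s = F.softmax(student_logits / T, dim=1)
    t = F.softmax(teacher_logits / T, dim=1)
    tgt = F.one_hot(targets, s.size(1)).bool()

    # explicit knowledge: teacher's confidence about the target class
    s_bin = torch.stack([s[tgt], 1 - s[tgt]], dim=1)
    t_bin = torch.stack([t[tgt], 1 - t[tgt]], dim=1)
    explicit = F.kl_div(s_bin.log(), t_bin, reduction="batchmean")

    # implicit knowledge: renormalized distribution over non-target classes
    s_nt = s.masked_fill(tgt, 0)
    t_nt = t.masked_fill(tgt, 0)
    s_nt = s_nt / s_nt.sum(dim=1, keepdim=True)
    t_nt = t_nt / t_nt.sum(dim=1, keepdim=True)
    implicit = F.kl_div(s_nt.clamp_min(1e-8).log(), t_nt, reduction="batchmean")

    return (T ** 2) * (w_explicit * explicit + w_implicit * implicit)

# toy usage
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
split_kd_loss(student_logits, teacher_logits, labels).backward()
```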
♻ ☆ Painful intelligence: What AI can tell us about human suffering
This book uses the modern theory of artificial intelligence (AI) to understand human suffering or mental pain. Both humans and sophisticated AI agents process information about the world in order to achieve goals and obtain rewards, which is why AI can be used as a model of the human brain and mind. This book intends to make the theory accessible to a relatively general audience, requiring only some relevant scientific background. The book starts with the assumption that suffering is mainly caused by frustration. Frustration means the failure of an agent (whether AI or human) to achieve a goal or a reward it wanted or expected. Frustration is inevitable because of the overwhelming complexity of the world, limited computational resources, and scarcity of good data. In particular, such limitations imply that an agent acting in the real world must cope with uncontrollability, unpredictability, and uncertainty, which all lead to frustration. Fundamental in such modelling is the idea of learning, or adaptation to the environment. While AI uses machine learning, humans and animals adapt by a combination of evolutionary mechanisms and ordinary learning. Even frustration is fundamentally an error signal that the system uses for learning. This book explores various aspects and limitations of learning algorithms and their implications regarding suffering. At the end of the book, the computational theory is used to derive various interventions or training methods that will reduce suffering in humans. The amount of frustration is expressed by a simple equation which indicates how it can be reduced. The ensuing interventions are very similar to those proposed by Buddhist and Stoic philosophy, and include mindfulness meditation. Therefore, this book can be interpreted as an exposition of a computational theory justifying why such philosophies and meditation reduce human suffering.
comment: Second Edition of this book with 258 pages
♻ ☆ A Sequential Decision-Making Model for Perimeter Identification
Perimeter identification involves ascertaining the boundaries of a designated area or zone, requiring traffic flow monitoring, control, or optimization. Various methodologies and technologies exist for accurately defining these perimeters; however, they often necessitate specialized equipment, precise mapping, or comprehensive data for effective problem delineation. In this study, we propose a sequential decision-making framework for perimeter search, designed to operate efficiently in real-time and require only publicly accessible information. We conceptualize the perimeter search as a game between a playing agent and an artificial environment, where the agent's objective is to identify the optimal perimeter by sequentially improving the current perimeter. We detail the model for the game and discuss its adaptability in determining the definition of an optimal perimeter. Ultimately, we showcase the model's efficacy through a real-world scenario, highlighting the identification of corresponding optimal perimeters.
♻ ☆ 3D Single-object Tracking in Point Clouds with High Temporal Variation ECCV24
The high temporal variation of the point clouds is the key challenge of 3D single-object tracking (3D SOT). Existing approaches rely on the assumption that the shape variation of the point clouds and the motion of the objects across neighboring frames are smooth, failing to cope with high temporal variation data. In this paper, we present a novel framework for 3D SOT in point clouds with high temporal variation, called HVTrack. HVTrack proposes three novel components to tackle the challenges in the high temporal variation scenario: 1) A Relative-Pose-Aware Memory module to handle temporal point cloud shape variations; 2) a Base-Expansion Feature Cross-Attention module to deal with similar object distractions in expanded search areas; 3) a Contextual Point Guided Self-Attention module for suppressing heavy background noise. We construct a dataset with high temporal variation (KITTI-HV) by setting different frame intervals for sampling in the KITTI dataset. On the KITTI-HV with 5 frame intervals, our HVTrack surpasses the state-of-the-art tracker CXTracker by 11.3%/15.7% in Success/Precision.
comment: Accepted by ECCV24
♻ ☆ CAVE: Controllable Authorship Verification Explanations
Authorship Verification (AV) (do two documents have the same author?) is essential in many sensitive real-life applications. AV is often used in proprietary domains that require a private, offline model, making SOTA online models like ChatGPT undesirable. Current offline models, however, have lower downstream utility due to low accuracy/scalability (e.g., traditional stylometry AV systems) and lack of accessible post-hoc explanations. In this work, we take the first step to address the above challenges with our trained, offline Llama-3-8B model CAVE (Controllable Authorship Verification Explanations): CAVE generates free-text AV explanations that are controlled to be (1) structured (can be decomposed into sub-explanations in terms of relevant linguistic features), and (2) easily verified for explanation-label consistency (via intermediate labels in sub-explanations). We first engineer a prompt that can generate silver training data from a SOTA teacher model in the desired CAVE output format. We then filter and distill this data into a pretrained Llama-3-8B, our carefully selected student model. Results on three difficult AV datasets (IMDb62, Blog-Auth, and Fanfiction) show that CAVE generates high quality explanations (as measured by automatic and human evaluation) as well as competitive task accuracies.
♻ ☆ Aligning Large Language Models to a Domain-specific Graph Database for NL2GQL
Graph Databases (Graph DB) find extensive application across diverse domains such as finance, social networks, and medicine. Yet, the translation of Natural Language (NL) into the Graph Query Language (GQL), referred to as NL2GQL, poses significant challenges owing to its intricate and specialized nature. Some approaches have sought to utilize Large Language Models (LLMs) to address analogous tasks like text2SQL. Nonetheless, in the realm of NL2GQL tasks tailored to a particular domain, the absence of domain-specific NL-GQL data pairs adds complexity to aligning LLMs with the graph DB. To tackle this challenge, we present a well-defined pipeline. Initially, we utilize ChatGPT to generate NL-GQL data pairs, leveraging the provided graph DB with self-instruction. Subsequently, we employ the generated data to fine-tune LLMs, ensuring alignment between LLMs and the graph DB. Moreover, we find that the relevant schema is important for efficiently generating accurate GQLs, and thus introduce a method to extract it as the input context. We evaluate our method using two carefully constructed datasets derived from graph DBs in the finance and medicine domains, named FinGQL and MediGQL. Experimental results reveal that our approach significantly outperforms a set of baseline methods, with improvements of 5.90 and 6.36 absolute points on EM, and 6.00 and 7.09 absolute points on EX for FinGQL and MediGQL, respectively.
comment: 13 pages,2 figures
♻ ☆ More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding
Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping allows us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: https://github.com/TangYuan96/GreenPLM.
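As an illustration of what a "zero-parameter cross-attention module for token pooling" can look like, the sketch below pools point-cloud tokens with a softmax attention whose query is a text-space embedding and which introduces no learnable weights; the tensor shapes and the single query vector are assumptions for illustration, not GreenPLM's actual module.

```python
import torch
import torch.nn.functional as F

def parameter_free_cross_attention_pool(tokens, query):
    """Pool a token sequence into one vector via cross-attention whose query comes
    from another modality, with no learnable parameters (illustrative sketch)."""
    # tokens: (N, D) point-cloud tokens; query: (D,) e.g. a text-space embedding
    scale = tokens.size(-1) ** 0.5
    attn = F.softmax(tokens @ query / scale, dim=0)   # (N,) attention weights
    return attn @ tokens                              # (D,) pooled representation

point_tokens = torch.randn(256, 512)
text_query = torch.randn(512)
pooled = parameter_free_cross_attention_pool(point_tokens, text_query)
print(pooled.shape)  # torch.Size([512])
```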
♻ ☆ OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation
This paper presents the experiments and results for the CheckThat! Lab at CLEF 2024 Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE). The primary objective of this task was to generate adversarial examples in five problem domains in order to evaluate the robustness of widely used text classification methods (fine-tuned BERT, BiLSTM, and RoBERTa) when applied to credibility assessment issues. This study explores the application of ensemble learning to enhance adversarial attacks on natural language processing (NLP) models. We systematically tested and refined several adversarial attack methods, including BERT-Attack, Genetic algorithms, TextFooler, and CLARE, on five datasets across various misinformation tasks. By developing modified versions of BERT-Attack and hybrid methods, we achieved significant improvements in attack effectiveness. Our results demonstrate the potential of modification and combining multiple methods to create more sophisticated and effective adversarial attack strategies, contributing to the development of more robust and secure systems.
comment: CLEF 2024 - Conference and Labs of the Evaluation Forum
♻ ☆ Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science, examining similarities and differences between LLMs and human cognitive processes. We analyze methods for evaluating LLMs' cognitive abilities and discuss their potential as cognitive models. The review covers applications of LLMs in various cognitive fields, highlighting insights gained for cognitive science research. We assess cognitive biases and limitations of LLMs, along with proposed methods for improving their performance. The integration of LLMs with cognitive architectures is examined, revealing promising avenues for enhancing artificial intelligence (AI) capabilities. Key challenges and future research directions are identified, emphasizing the need for continued refinement of LLMs to better align with human cognition. This review provides a balanced perspective on the current state and future potential of LLMs in advancing our understanding of both artificial and human intelligence.
comment: 10 pages, 1 figure
♻ ☆ LuSNAR: A Lunar Segmentation, Navigation and Reconstruction Dataset based on Multi-sensor for Autonomous Exploration
With the increasing complexity of lunar exploration missions, lunar rovers need a higher level of autonomy. Environmental perception and navigation algorithms are the foundation for lunar rovers to achieve autonomous exploration. The development and verification of algorithms require highly reliable data support. Most of the existing lunar datasets are targeted at a single task, lacking diverse scenes and high-precision ground truth labels. To address this issue, we propose a multi-task, multi-scene, and multi-label lunar benchmark dataset LuSNAR. This dataset can be used for comprehensive evaluation of autonomous perception and navigation systems, including high-resolution stereo image pairs, panoramic semantic labels, dense depth maps, LiDAR point clouds, and the rover's position. In order to provide richer scene data, we built 9 lunar simulation scenes based on Unreal Engine. Each scene is divided according to topographic relief and the density of objects. To verify the usability of the dataset, we evaluated and analyzed the algorithms of semantic segmentation, 3D reconstruction, and autonomous navigation. The experiment results prove that the dataset proposed in this paper can be used for ground verification of tasks such as autonomous environment perception and navigation, and provides a lunar benchmark dataset for testing the accessibility of algorithm metrics. We make LuSNAR publicly available at: https://github.com/autumn999999/LuSNAR-dataset.
comment: 19 pages, 13 figures, 11 tables
♻ ☆ CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention
Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including body movement map, hand clarity score, pose-aligned reference feature, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of zero-shot video generation for the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.
comment: Homepage: https://cyberhost.github.io/
♻ ☆ PointCloud-Text Matching: Benchmark Datasets and a Baseline
In this paper, we present and study a new instance-level retrieval task: PointCloud-Text Matching~(PTM), which aims to find the exact cross-modal instance that matches a given point-cloud query or text query. PTM could be applied to various scenarios, such as indoor/urban-canyon localization and scene retrieval. However, there exists no suitable and targeted dataset for PTM in practice. Therefore, we construct three new PTM benchmark datasets, namely 3D2T-SR, 3D2T-NR, and 3D2T-QA. We observe that the data is challenging and with noisy correspondence due to the sparsity, noise, or disorder of point clouds and the ambiguity, vagueness, or incompleteness of texts, which make existing cross-modal matching methods ineffective for PTM. To tackle these challenges, we propose a PTM baseline, named Robust PointCloud-Text Matching method (RoMa). RoMa consists of two modules: a Dual Attention Perception module (DAP) and a Robust Negative Contrastive Learning module (RNCL). Specifically, DAP leverages token-level and feature-level attention to adaptively focus on useful local and global features, and aggregate them into common representations, thereby reducing the adverse impact of noise and ambiguity. To handle noisy correspondence, RNCL divides negative pairs, which are much less error-prone than positive pairs, into clean and noisy subsets, and assigns them forward and reverse optimization directions respectively, thus enhancing robustness against noisy correspondence. We conduct extensive experiments on our benchmarks and demonstrate the superiority of our RoMa.
comment: Upon further consideration, we have concluded that the current version requires significant revision and may not yet be ready for publication. We plan to conduct additional experiments and make the necessary improvements to ensure the paper meets the standards for future submission
♻ ☆ Explanation Space: A New Perspective into Time Series Interpretability
Human understandable explanation of deep learning models is necessary for many critical and sensitive applications. Unlike image or tabular data, where the importance of each input feature (for the classifier's decision) can be directly projected into the input, the distinguishing features of time series (e.g., dominant frequency) are often hard to discern in the time domain, making them difficult for a user to understand. Moreover, most explanation methods require a baseline value as an indication of the absence of any feature. However, the notion of a missing feature, often defined as black pixels for vision tasks or zero/mean values for tabular data, is not well-defined in time series. Despite the adoption of explainable AI (XAI) methods from the tabular and vision domains into the time series domain, these differences limit their application in practice. In this paper, we propose a simple yet effective method that allows a model originally trained in the time domain to be interpreted in other explanation spaces using existing methods. We suggest four explanation spaces, each of which can potentially alleviate these issues for certain types of time series. Our method can be readily adopted in existing platforms without any change to trained models or XAI methods. The code is available at https://github.com/shrezaei/TS-X-spaces.
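A minimal sketch of the core idea of re-expressing a time-domain model in another explanation space: wrap the model so it accepts a frequency-domain input (one candidate space), letting existing attribution methods operate there; the toy model, sampling rate, and the choice of the frequency space are assumptions for illustration, not the paper's specific spaces.

```python
import numpy as np

def to_frequency_space(model_fn):
    """Wrap a time-domain scorer so attribution methods can operate on a
    frequency-domain representation of the input. model_fn maps (T,) signals
    to a scalar score."""
    def freq_model(x_freq):
        # x_freq: one-sided complex spectrum; invert back before calling the model
        x_time = np.fft.irfft(x_freq, n=(len(x_freq) - 1) * 2)
        return model_fn(x_time)
    return freq_model

# toy model: score = magnitude of the 5 Hz bin of a 1 s signal sampled at 128 Hz
def toy_model(x):
    return np.abs(np.fft.rfft(x))[5]

signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 128, endpoint=False))
wrapped = to_frequency_space(toy_model)
print(toy_model(signal), wrapped(np.fft.rfft(signal)))  # same score, two spaces
```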
Machine Learning 135
☆ Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.
comment: Project page: https://yunzeman.github.io/lexicon3d , Github: https://github.com/YunzeMan/Lexicon3D
☆ WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild
The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical. To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis's utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.
☆ Dynamics of Supervised and Reinforcement Learning in the Non-Linear Perceptron
The ability of a brain or a neural network to efficiently learn depends crucially on both the task structure and the learning rule. Previous works have analyzed the dynamical equations describing learning in the relatively simplified context of the perceptron under assumptions of a student-teacher framework or a linearized output. While these assumptions have facilitated theoretical understanding, they have precluded a detailed understanding of the roles of the nonlinearity and input-data distribution in determining the learning dynamics, limiting the applicability of the theories to real biological or artificial neural networks. Here, we use a stochastic-process approach to derive flow equations describing learning, applying this framework to the case of a nonlinear perceptron performing binary classification. We characterize the effects of the learning rule (supervised or reinforcement learning, SL/RL) and input-data distribution on the perceptron's learning curve and the forgetting curve as subsequent tasks are learned. In particular, we find that the input-data noise differently affects the learning speed under SL vs. RL, as well as determines how quickly learning of a task is overwritten by subsequent learning. Additionally, we verify our approach with real data using the MNIST dataset. This approach points a way toward analyzing learning dynamics for more-complex circuit architectures.
☆ Understanding Data Importance in Machine Learning Attacks: Does Valuable Data Pose Greater Harm? NDSS
Machine learning has revolutionized numerous domains, playing a crucial role in driving advancements and enabling data-centric processes. The significance of data in training models and shaping their performance cannot be overstated. Recent research has highlighted the heterogeneous impact of individual data samples, particularly the presence of valuable data that significantly contributes to the utility and effectiveness of machine learning models. However, a critical question remains unanswered: are these valuable data samples more vulnerable to machine learning attacks? In this work, we investigate the relationship between data importance and machine learning attacks by analyzing five distinct attack types. Our findings reveal notable insights. For example, we observe that high importance data samples exhibit increased vulnerability in certain attacks, such as membership inference and model stealing. By analyzing the linkage between membership inference vulnerability and data importance, we demonstrate that sample characteristics can be integrated into membership metrics by introducing sample-specific criteria, therefore enhancing the membership inference performance. These findings emphasize the urgent need for innovative defense mechanisms that strike a balance between maximizing utility and safeguarding valuable data against potential exploitation.
comment: To Appear in Network and Distributed System Security (NDSS) Symposium 2025
☆ Differentiable Discrete Event Simulation for Queuing Network Control
Queuing network control is essential for managing congestion in job-processing systems such as service systems, communication networks, and manufacturing processes. Despite growing interest in applying reinforcement learning (RL) techniques, queueing network control poses distinct challenges, including high stochasticity, large state and action spaces, and lack of stability. To tackle these challenges, we propose a scalable framework for policy optimization based on differentiable discrete event simulation. Our main insight is that by implementing a well-designed smoothing technique for discrete event dynamics, we can compute pathwise policy gradients for large-scale queueing networks using auto-differentiation software (e.g., Tensorflow, PyTorch) and GPU parallelization. Through extensive empirical experiments, we observe that our policy gradient estimators are several orders of magnitude more accurate than typical REINFORCE-based estimators. In addition, we propose a new policy architecture, which drastically improves stability while maintaining the flexibility of neural-network policies. In a wide variety of scheduling and admission control tasks, we demonstrate that training control policies with pathwise gradients leads to a 50-1000x improvement in sample efficiency over state-of-the-art RL methods. Unlike prior tailored approaches to queueing, our methods can flexibly handle realistic scenarios, including systems operating in non-stationary environments and those with non-exponential interarrival/service times.
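The sketch below illustrates the pathwise-gradient idea on a toy single-server queue: service times are reparameterized so they depend differentiably on a rate parameter, and the hard max in the Lindley waiting-time recursion is replaced by a temperature-controlled softplus so gradients flow through the event dynamics; the smoothing choice and queue model are illustrative assumptions, not the paper's full framework.

```python
import torch

def smoothed_lindley_waits(theta, arrivals, uniforms, tau=0.1):
    """Pathwise-differentiable toy queue simulation. Service times are exponential
    with rate theta (inverse-CDF reparameterization); the max(., 0) in the Lindley
    recursion is smoothed by a temperature-tau softplus so d(wait)/d(theta) exists."""
    services = -torch.log(1 - uniforms) / theta          # reparameterized draws
    w = torch.zeros((), dtype=theta.dtype)
    waits = []
    for s, a in zip(services, arrivals):
        w = tau * torch.nn.functional.softplus((w + s - a) / tau)  # soft max(., 0)
        waits.append(w)
    return torch.stack(waits)

torch.manual_seed(0)
theta = torch.tensor(1.2, requires_grad=True)                      # service rate
arrivals = torch.distributions.Exponential(1.0).sample((500,))     # interarrival times
uniforms = torch.rand(500)
mean_wait = smoothed_lindley_waits(theta, arrivals, uniforms).mean()
mean_wait.backward()
print(float(mean_wait), float(theta.grad))  # gradient of mean wait w.r.t. service rate
```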
LLM-CI: Assessing Contextual Integrity Norms in Language Models
Large language models (LLMs), while memorizing parts of their training data scraped from the Internet, may also inadvertently encode societal preferences and norms. As these models are integrated into sociotechnical systems, it is crucial that the norms they encode align with societal expectations. These norms could vary across models, hyperparameters, optimization techniques, and datasets. This is especially challenging due to prompt sensitivity: small variations in prompts yield different responses, rendering existing assessment methodologies unreliable. There is a need for a comprehensive framework covering various models, optimization, and datasets, along with a reliable methodology to assess encoded norms. We present LLM-CI, the first open-sourced framework to assess privacy norms encoded in LLMs. LLM-CI uses a Contextual Integrity-based factorial vignette methodology to assess the encoded norms across different contexts and LLMs. We propose the multi-prompt assessment methodology to address prompt sensitivity by assessing the norms from only the prompts that yield consistent responses across multiple variants. Using LLM-CI and our proposed methodology, we comprehensively evaluate LLMs using IoT and COPPA vignettes datasets from prior work, examining the impact of model properties (e.g., hyperparameters, capacity) and optimization strategies (e.g., alignment, quantization).
comment: 20 pages, 8 Figures, 4 Tables
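A minimal sketch of the multi-prompt assessment idea: query several paraphrased prompt variants per vignette and keep only the vignettes whose answers agree above a threshold. The `query_model` callable, the prompt templates, and the agreement threshold are hypothetical stand-ins, not LLM-CI's actual interface.

```python
from collections import Counter

def multi_prompt_assessment(vignettes, prompt_variants, query_model, min_agreement=0.8):
    """Keep a vignette's encoded norm only if the model answers consistently across
    paraphrased prompt variants. `query_model(prompt)` returns e.g. 'acceptable'
    or 'unacceptable' and is a hypothetical stand-in for an actual LLM call."""
    consistent = {}
    for vignette in vignettes:
        answers = [query_model(template.format(vignette=vignette))
                   for template in prompt_variants]
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= min_agreement:
            consistent[vignette] = top          # norm judged reliably encoded
    return consistent

# toy usage with a deterministic fake model
variants = ["Is it acceptable that {vignette}?",
            "Rate whether {vignette} respects privacy norms.",
            "{vignette}: acceptable or unacceptable?"]
fake_model = lambda prompt: "unacceptable" if "sells" in prompt else "acceptable"
print(multi_prompt_assessment(
    ["a smart speaker sells audio to advertisers",
     "a thermostat stores temperature locally"], variants, fake_model))
```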
☆ Safety vs. Performance: How Multi-Objective Learning Reduces Barriers to Market Entry
Emerging marketplaces for large language models and other large-scale machine learning (ML) models appear to exhibit market concentration, which has raised concerns about whether there are insurmountable barriers to entry in such markets. In this work, we study this issue from both an economic and an algorithmic point of view, focusing on a phenomenon that reduces barriers to entry. Specifically, an incumbent company risks reputational damage unless its model is sufficiently aligned with safety objectives, whereas a new company can more easily avoid reputational damage. To study this issue formally, we define a multi-objective high-dimensional regression framework that captures reputational damage, and we characterize the number of data points that a new company needs to enter the market. Our results demonstrate how multi-objective considerations can fundamentally reduce barriers to entry -- the required number of data points can be significantly smaller than the incumbent company's dataset size. En route to proving these results, we develop scaling laws for high-dimensional linear regression in multi-objective environments, showing that the scaling rate becomes slower when the dataset size is large, which could be of independent interest.
☆ Planning In Natural Language Improves LLM Search For Code Generation
While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PLANSEARCH generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas.
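The following sketch captures the high-level loop described in the abstract: elicit diverse natural-language observations, combine subsets of them into plans, and only then sample code for each plan. The `llm` callable, prompt wording, and subset sizes are hypothetical placeholders rather than the actual PLANSEARCH prompts.

```python
from itertools import combinations

def plan_search(problem, llm, n_observations=4, subset_size=2, samples_per_plan=2):
    """High-level sketch of searching over natural-language plans before code.
    `llm(prompt) -> str` is a hypothetical text-completion call."""
    # 1) elicit diverse observations about the problem
    observations = [llm(f"Observation {i + 1} about solving: {problem}")
                    for i in range(n_observations)]
    # 2) combine subsets of observations into candidate plans
    plans = [llm("Combine these observations into a step-by-step plan:\n"
                 + "\n".join(subset) + f"\nProblem: {problem}")
             for subset in combinations(observations, subset_size)]
    # 3) translate each plan into code candidates and return them for testing
    return [llm(f"Implement this plan in Python:\n{plan}\nProblem: {problem}")
            for plan in plans for _ in range(samples_per_plan)]

# toy usage with a stub "LLM"
candidates = plan_search("reverse a linked list",
                         lambda p: f"<completion for: {p[:30]}...>")
print(len(candidates))  # C(4,2) = 6 plans x 2 samples = 12 candidates
```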
☆ A Deep Generative Learning Approach for Two-stage Adaptive Robust Optimization
Two-stage adaptive robust optimization is a powerful approach for planning under uncertainty that aims to balance costs of "here-and-now" first-stage decisions with those of "wait-and-see" recourse decisions made after uncertainty is realized. To embed robustness against uncertainty, modelers typically assume a simple polyhedral or ellipsoidal set over which contingencies may be realized. However, these simple uncertainty sets tend to yield highly conservative decision-making when uncertainties are high-dimensional. In this work, we introduce AGRO, a column-and-constraint generation algorithm that performs adversarial generation for two-stage adaptive robust optimization using a variational autoencoder. AGRO identifies realistic and cost-maximizing contingencies by optimizing over spherical uncertainty sets in a latent space using a projected gradient ascent approach that differentiates the optimal recourse cost with respect to the latent variable. To demonstrate the cost- and time-efficiency of our approach experimentally, we apply AGRO to an adaptive robust capacity expansion problem for a regional power system and show that AGRO is able to reduce costs by up to 7.8% and runtimes by up to 77% in comparison to the conventional column-and-constraint generation algorithm.
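To illustrate the adversarial-generation step, the sketch below runs projected gradient ascent over a latent vector constrained to a norm ball, maximizing a differentiable recourse cost through a decoder; the toy linear decoder, quadratic cost, and step sizes are assumptions standing in for AGRO's variational autoencoder and the differentiated optimal recourse cost.

```python
import torch

def adversarial_latent_ascent(decoder, recourse_cost, z_dim, radius=3.0,
                              steps=50, lr=0.1):
    """Maximize recourse_cost(decoder(z)) over a spherical latent set by gradient
    ascent, projecting z back onto the ball after each step (illustrative sketch)."""
    z = (0.1 * torch.randn(z_dim)).requires_grad_(True)   # random, non-zero start
    for _ in range(steps):
        cost = recourse_cost(decoder(z))
        grad, = torch.autograd.grad(cost, z)
        with torch.no_grad():
            z += lr * grad                                 # gradient ascent
            norm = z.norm()
            if norm > radius:                              # project onto the ball
                z *= radius / norm
    return z.detach(), float(recourse_cost(decoder(z)).detach())

# toy example: fixed linear decoder, quadratic recourse cost
torch.manual_seed(0)
W = torch.randn(8, 4)
decoder = lambda z: W @ z
recourse_cost = lambda xi: (xi ** 2).sum()
z_star, worst_cost = adversarial_latent_ascent(decoder, recourse_cost, z_dim=4)
print(z_star.norm().item(), worst_cost)  # norm <= radius, maximized cost
```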
☆ Iterative thresholding for non-linear learning in the strong $\varepsilon$-contamination model
We derive approximation bounds for learning single neuron models using thresholded gradient descent when both the labels and the covariates are possibly corrupted adversarially. We assume the data follows the model $y = \sigma(\mathbf{w}^{*} \cdot \mathbf{x}) + \xi,$ where $\sigma$ is a nonlinear activation function, the noise $\xi$ is Gaussian, and the covariate vector $\mathbf{x}$ is sampled from a sub-Gaussian distribution. We study sigmoidal, leaky-ReLU, and ReLU activation functions and derive a $O(\nu\sqrt{\epsilon\log(1/\epsilon)})$ approximation bound in $\ell_{2}$-norm, with sample complexity $O(d/\epsilon)$ and failure probability $e^{-\Omega(d)}$. We also study the linear regression problem, where $\sigma(\mathbf{x}) = \mathbf{x}$. We derive a $O(\nu\epsilon\log(1/\epsilon))$ approximation bound, improving upon the previous $O(\nu)$ approximation bounds for the gradient-descent based iterative thresholding algorithms of Bhatia et al. (NeurIPS 2015) and Shen and Sanghavi (ICML 2019). Our algorithm has a $O(\textrm{polylog}(N,d)\log(R/\epsilon))$ runtime complexity when $\|\mathbf{w}^{*}\|_2 \leq R$, improving upon the $O(\text{polylog}(N,d)/\epsilon^2)$ runtime complexity of Awasthi et al. (NeurIPS 2022).
comment: 35 pages
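For intuition, the linear-regression special case of the thresholding idea can be sketched as follows: at each step, the $\epsilon$-fraction of points with the largest residuals is treated as corrupted and excluded before the gradient update. The hyperparameters are illustrative and this is not the paper's algorithm verbatim.

```python
# Rough numpy sketch of iterative hard thresholding for robust linear regression.
import numpy as np


def thresholded_gd(X, y, eps=0.1, lr=0.1, iters=200):
    n, d = X.shape
    w = np.zeros(d)
    keep = n - int(np.ceil(eps * n))                    # number of points treated as clean
    for _ in range(iters):
        residuals = X @ w - y
        clean = np.argsort(np.abs(residuals))[:keep]    # threshold out the largest residuals
        grad = X[clean].T @ residuals[clean] / keep
        w -= lr * grad
    return w
```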
☆ Classification and Prediction of Heart Diseases using Machine Learning Algorithms
Heart disease is a serious worldwide health issue because it claims the lives of many people who might have been treated if the disease had been identified earlier. Cardiovascular disease, usually referred to as heart disease, is the leading cause of death in the world. Creating reliable, effective, and precise predictions for these diseases is one of the biggest challenges facing the medical world today. Although there are tools for predicting heart diseases, they are either expensive or challenging to apply when determining a patient's risk. The aim of this research was to identify the best classifier for predicting and detecting heart disease. The experiment examined a range of machine learning approaches, including Logistic Regression, K-Nearest Neighbor, Support Vector Machine, and Artificial Neural Networks, to determine which algorithm was most effective at predicting heart disease. The data set came from the UCI heart disease repository, one of the most widely used data sets for this purpose. The K-Nearest Neighbor technique was shown to be the most effective machine learning algorithm for determining whether a patient has heart disease. Further studies on the application of additional machine learning algorithms for heart disease prediction would be beneficial.
comment: 10 pages, 8 figures
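A hedged sketch of the kind of classifier comparison described above, using scikit-learn with 5-fold cross-validation; the file name `heart.csv` and the `target` column name are assumptions about a local copy of the UCI data.

```python
# Illustrative classifier comparison on a tabular heart-disease dataset.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("heart.csv")                       # assumed local copy of the UCI dataset
X, y = df.drop(columns=["target"]), df["target"]

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "SVM": SVC(),
    "ANN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000),
}
for name, model in models.items():
    scores = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```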
☆ View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista.
comment: Accepted to CoRL 2024
☆ Predicting quantum channels over general product distributions
We investigate the problem of predicting the output behavior of unknown quantum channels. Given query access to an $n$-qubit channel $E$ and an observable $O$, we aim to learn the mapping \begin{equation*} \rho \mapsto \mathrm{Tr}(O E[\rho]) \end{equation*} to within a small error for most $\rho$ sampled from a distribution $D$. Previously, Huang, Chen, and Preskill proved a surprising result that even if $E$ is arbitrary, this task can be solved in time roughly $n^{O(\log(1/\epsilon))}$, where $\epsilon$ is the target prediction error. However, their guarantee applied only to input distributions $D$ invariant under all single-qubit Clifford gates, and their algorithm fails for important cases such as general product distributions over product states $\rho$. In this work, we propose a new approach that achieves accurate prediction over essentially any product distribution $D$, provided it is not "classical" in which case there is a trivial exponential lower bound. Our method employs a "biased Pauli analysis," analogous to classical biased Fourier analysis. Implementing this approach requires overcoming several challenges unique to the quantum setting, including the lack of a basis with appropriate orthogonality properties. The techniques we develop to address these issues may have broader applications in quantum information.
comment: 20 pages, comments welcome
☆ A New First-Order Meta-Learning Algorithm with Convergence Guarantees
Learning new tasks by drawing on prior experience gathered from other (related) tasks is a core property of any intelligent system. Gradient-based meta-learning, especially MAML and its variants, has emerged as a viable solution to accomplish this goal. One problem with MAML is the computational and memory burden of computing the meta-gradients. We propose a new first-order variant of MAML that we prove converges to a stationary point of the MAML objective, unlike other first-order variants. We also show that the MAML objective does not satisfy the smoothness assumption made in previous works; instead, its smoothness constant grows with the norm of the meta-gradient, which theoretically suggests using normalized or clipped-gradient methods rather than the plain gradient method used in previous works. We validate our theory on a synthetic experiment.
☆ Practical Forecasting of Cryptocoins Timeseries using Correlation Patterns
Cryptocoins (e.g., Bitcoin, Ether, Litecoin) are tradable digital assets. Ownership of cryptocoins is registered on distributed ledgers (i.e., blockchains). Secure encryption techniques guarantee the security of the transactions (transfers of coins among owners) registered in the ledger. Cryptocoins are exchanged at specific trading prices. The extreme volatility of such trading prices across all different sets of crypto-assets remains undisputed. However, the relations between the trading prices of different cryptocoins remain largely unexplored. Major coin exchanges report trend correlations to advise sells or buys, yet these price correlations have received little systematic study. We shed some light on the trend correlations across a large variety of cryptocoins by investigating their coin/price correlation trends over the past two years. We study the causality between the trends and exploit the derived correlations to assess the accuracy of state-of-the-art forecasting techniques for time series modeling (e.g., GBMs, LSTM and GRU) of correlated cryptocoins. Our evaluation shows (i) strong correlation patterns between the most traded coins (e.g., Bitcoin and Ether) and other types of cryptocurrencies, and (ii) that state-of-the-art time series forecasting algorithms can be used to forecast cryptocoin price trends. We released datasets and code so the research community can reproduce our analysis.
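As a small illustration of the correlation analysis described above, the sketch below computes static and rolling return correlations from a price table; the `BTC` column name and the data loading are assumptions.

```python
# Illustrative correlation analysis over a DataFrame of daily closing prices (one column per coin).
import pandas as pd


def trend_correlations(prices: pd.DataFrame, window: int = 30) -> pd.DataFrame:
    returns = prices.pct_change().dropna()
    # Static correlation matrix over the whole period.
    print(returns.corr().round(2))
    # Rolling correlation of every coin against Bitcoin (assumed column name "BTC").
    return returns.rolling(window).corr(returns["BTC"])
```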
☆ Wind turbine condition monitoring based on intra- and inter-farm federated learning
As wind energy adoption is growing, ensuring the efficient operation and maintenance of wind turbines becomes essential for maximizing energy production and minimizing costs and downtime. Many AI applications in wind energy, such as in condition monitoring and power forecasting, may benefit from using operational data not only from individual wind turbines but from multiple turbines and multiple wind farms. Collaborative distributed AI which preserves data privacy holds a strong potential for these applications. Federated learning has emerged as a privacy-preserving distributed machine learning approach in this context. We explore federated learning in wind turbine condition monitoring, specifically for fault detection using normal behaviour models. We investigate various federated learning strategies, including collaboration across different wind farms and turbine models, as well as collaboration restricted to the same wind farm and turbine model. Our case study results indicate that federated learning across multiple wind turbines consistently outperforms models trained on a single turbine, especially when training data is scarce. Moreover, the amount of historical data necessary to train an effective model can be significantly reduced by employing a collaborative federated learning strategy. Finally, our findings show that extending the collaboration to multiple wind farms may result in inferior performance compared to restricting learning within a farm, specifically when faced with statistical heterogeneity and imbalanced datasets.
☆ A method to benchmark high-dimensional process drift detection
Process curves are multivariate finite time series data coming from manufacturing processes. This paper studies machine learning methods for detecting drifts in process curves. A theoretical framework to synthetically generate process curves in a controlled way is introduced in order to benchmark machine learning algorithms for process drift detection. An evaluation score, called the temporal area under the curve, is introduced, which quantifies how well machine learning models identify curves belonging to drift segments. Finally, a benchmark study comparing popular machine learning approaches on synthetic data generated with the introduced framework is presented.
☆ A Fused Large Language Model for Predicting Startup Success
Investors are continuously seeking profitable investment opportunities in startups and, hence, for effective decision-making, need to predict a startup's probability of success. Nowadays, investors can use not only various fundamental information about a startup (e.g., the age of the startup, the number of founders, and the business sector) but also textual description of a startup's innovation and business model, which is widely available through online venture capital (VC) platforms such as Crunchbase. To support the decision-making of investors, we develop a machine learning approach with the aim of locating successful startups on VC platforms. Specifically, we develop, train, and evaluate a tailored, fused large language model to predict startup success. Thereby, we assess to what extent self-descriptions on VC platforms are predictive of startup success. Using 20,172 online profiles from Crunchbase, we find that our fused large language model can predict startup success, with textual self-descriptions being responsible for a significant part of the predictive power. Our work provides a decision support tool for investors to find profitable investment opportunities.
☆ Threat Classification on Deployed Optical Networks Using MIMO Digital Fiber Sensing, Wavelets, and Machine Learning
We demonstrate mechanical threats classification including jackhammers and excavators, leveraging wavelet transform of MIMO-DFS output data across a 57-km operational network link. Our machine learning framework incorporates transfer learning and shows 93% classification accuracy from field data, with benefits for optical network supervision.
☆ Weather-Adaptive Multi-Step Forecasting of State of Polarization Changes in Aerial Fibers Using Wavelet Neural Networks
We introduce a novel weather-adaptive approach for multi-step forecasting of multi-scale SOP changes in aerial fiber links. By harnessing the discrete wavelet transform and incorporating weather data, our approach improves forecasting accuracy by over 65% in RMSE and 63% in MAPE compared to baselines.
comment: ECOC 2024
☆ The representation landscape of few-shot learning and fine-tuning in large language models
In-context learning (ICL) and supervised fine-tuning (SFT) are two common strategies for improving the performance of modern large language models (LLMs) on specific tasks. Despite their different natures, these strategies often lead to comparable performance gains. However, little is known about whether they induce similar representations inside LLMs. We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. In the first half of the network, ICL shapes interpretable representations hierarchically organized according to their semantic content. In contrast, the probability landscape obtained with SFT is fuzzier and semantically mixed. In the second half of the model, the fine-tuned representations develop probability modes that better encode the identity of answers, while the landscape of ICL representations is characterized by less defined peaks. Our approach reveals the diverse computational strategies developed inside LLMs to solve the same task across different conditions, allowing us to make a step towards designing optimal methods to extract information from language models.
☆ A DNN Biophysics Model with Topological and Electrostatic Features
In this project, we provide a deep neural network (DNN) based biophysics model to predict protein properties. The model uses multi-scale and uniform topological and electrostatic features generated from protein structural information and the force field that governs the molecular mechanics. The topological features are generated using element-specific persistent homology (ESPH), while the electrostatic features are computed quickly using a Cartesian treecode. These features are uniform in number for proteins of various sizes, so the broadly available protein structure databases can be used to train the network. The features are also multi-scale, so users can balance resolution against computational cost. Machine learning experiments on over 4000 protein structures show the efficiency and fidelity of these features in representing the protein structure and force field for the prediction of biophysical properties such as electrostatic solvation energy. Tests on topological or electrostatic features alone and on their combination showed optimal performance when both feature types are used. The model shows potential as a general tool for assisting biophysical property and function prediction for a broad range of biomolecules using data from both theoretical computation and experiments.
☆ Unsupervised Anomaly Detection and Localization with Generative Adversarial Networks
We propose a novel unsupervised anomaly detection approach using generative adversarial networks and SOP-derived spectrograms. Demonstrating remarkable efficacy, our method achieves over 97% accuracy on SOP datasets from both submarine and terrestrial fiber links, all achieved without the need for labelled data.
comment: ECOC 2024
☆ Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization
Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech while preserving its utility, including linguistic and paralinguistic aspects. However, anonymizing speech while maintaining emotional state remains challenging. We explore this problem in the context of the VoicePrivacy 2024 challenge. Specifically, we develop various speaker anonymization pipelines and find that approaches excel either at anonymization or at preserving emotional state, but not both simultaneously. Achieving both would require an in-domain emotion recognizer. Additionally, we find that it is feasible to train a semi-effective speaker verification system using only emotion representations, demonstrating the challenge of separating these two modalities.
comment: accepted by 2024 IEEE Spoken Language Technology Workshop
☆ On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM in the limit. DPORM's effectiveness directly implies the optimality of the learned policy, and also has practical implications for LLM alignment methods, including iterative DPO. However, it is unclear how well DPORM empirically matches the performance of EXRM. This work studies the accuracy of both DPORM and EXRM at distinguishing preferred from rejected answers. Our findings indicate that even though DPORM fits the training dataset comparably, it generalizes less effectively than EXRM, especially when the validation datasets contain distribution shifts. Across five out-of-distribution settings, DPORM has a mean drop in accuracy of 3% and a maximum drop of 7%. These findings highlight that DPORM has limited generalization ability and substantiate the integration of an explicit reward model in iterative DPO approaches.
comment: 12 pages, 8 tables, 2 figures
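For readers unfamiliar with the implicit reward in question: under the DPO derivation, the reward of a response is proportional to the log-probability ratio between the trained policy and the reference policy. The sketch below scores a preference pair with this implicit reward; `beta` and the summed token log-probability inputs are placeholders supplied by the caller.

```python
# Sketch of scoring a preference pair with the implicit DPO reward.
def implicit_reward(logprob_policy: float, logprob_ref: float, beta: float = 0.1) -> float:
    # r(x, y) proportional to beta * (log pi_theta(y|x) - log pi_ref(y|x))
    return beta * (logprob_policy - logprob_ref)


def prefers_chosen(lp_pol_chosen, lp_ref_chosen, lp_pol_rejected, lp_ref_rejected, beta=0.1):
    # DPORM "accuracy" on a pair: does the implicit reward rank chosen above rejected?
    return (implicit_reward(lp_pol_chosen, lp_ref_chosen, beta)
            > implicit_reward(lp_pol_rejected, lp_ref_rejected, beta))
```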
☆ Limited but consistent gains in adversarial robustness by co-training object recognition models with human EEG
In contrast to human vision, artificial neural networks (ANNs) remain relatively susceptible to adversarial attacks. To address this vulnerability, efforts have been made to transfer inductive bias from human brains to ANNs, often by training the ANN representations to match their biological counterparts. Previous works relied on brain data acquired in rodents or primates using invasive techniques, from specific regions of the brain, under non-natural conditions (anesthetized animals), and with stimulus datasets lacking diversity and naturalness. In this work, we explored whether aligning model representations to human EEG responses to a rich set of real-world images increases the adversarial robustness of ANNs. Specifically, we trained ResNet50-backbone models on a dual task of classification and EEG prediction, and evaluated their EEG prediction accuracy and robustness to adversarial attacks. We observed significant correlation between the networks' EEG prediction accuracy, often highest around 100 ms post stimulus onset, and their gains in adversarial robustness. Although the effect size was limited, effects were consistent across different random initializations and robust across architectural variants. We further teased apart the data from individual EEG channels and observed the strongest contribution from electrodes in the parieto-occipital regions. The demonstrated utility of human EEG for such tasks opens up avenues for future efforts that scale to larger datasets under diverse stimulus conditions, with the promise of stronger effects.
☆ Beyond Model Interpretability: Socio-Structural Explanations in Machine Learning
What does it mean to interpret the outputs of an opaque machine learning model? One approach is to develop interpretable machine learning techniques. These techniques aim to show how machine learning models function by providing either model-centric local or global explanations, which can be based on mechanistic interpretations that reveal the inner workings of models or on non-mechanistic approximations that show relationships between input features and output data. In this paper, we draw on social philosophy to argue that interpreting machine learning outputs in certain normatively salient domains could require appealing to a third type of explanation that we call socio-structural explanation. The relevance of this explanation type is motivated by the fact that machine learning models are not isolated entities but are embedded within and shaped by social structures. Socio-structural explanations aim to illustrate how social structures contribute to and partially explain the outputs of machine learning models. We demonstrate the importance of socio-structural explanations by examining a racially biased healthcare allocation algorithm. Our proposal highlights the need for transparency beyond model interpretability: understanding the outputs of machine learning systems could require a broader analysis that extends beyond understanding the machine learning model itself.
☆ DART2: a robust multiple testing method to smartly leverage helpful or misleading ancillary information
In many applications of multiple testing, ancillary information is available that reflects whether each hypothesis is null or alternative. Several methods have been developed to leverage this ancillary information to enhance testing power, typically requiring that the ancillary information be helpful enough to ensure favorable performance. In this paper, we develop a robust and effective distance-assisted multiple testing procedure named DART2, designed to be powerful and robust regardless of the quality of the ancillary information. When the ancillary information is helpful, DART2 can asymptotically control the FDR while improving power; otherwise, DART2 still controls the FDR and maintains power at least as high as when the ancillary information is ignored. We demonstrate DART2's superior performance compared to existing methods through numerical studies under various settings. In addition, DART2 has been applied to a gene association study, where we show its superior accuracy and robustness under two different types of ancillary information.
comment: 26 pages, 6 figures
☆ Modular Parallel Manipulator for Long-Term Soft Robotic Data Collection
Performing long-term experimentation or large-scale data collection for machine learning in the field of soft robotics is challenging, due to the hardware robustness and experimental flexibility required. In this work, we propose a modular parallel robotic manipulation platform suitable for such large-scale data collection and compatible with various soft-robotic fabrication methods. Considering the computational and theoretical difficulty of replicating the high-fidelity, faster-than-real-time simulations that enable large-scale data collection in rigid robotic systems, a robust soft-robotic hardware platform becomes a high-priority development task for the field. The platform's modules consist of a pair of off-the-shelf electrical motors that actuate a customizable finger consisting of a compliant parallel structure. The parallel mechanism of the finger can be as simple as a single 3D-printed urethane or molded silicone bulk structure, because the motors can fully actuate a passive structure. This design flexibility allows experimentation with soft mechanisms of varied geometries, bulk properties, and surface properties. Additionally, while the parallel mechanism does not require separate electronics or additional parts, these can be included, and it can be constructed using multi-functional soft materials to study compatible soft sensors and actuators in the learning process. In this work, we validate the platform's ability to support policy-gradient reinforcement learning directly on hardware in a benchmark 2D manipulation task. We additionally demonstrate compatibility with multiple fingers and characterize the design constraints for compatible extensions.
☆ VFLGAN-TS: Vertical Federated Learning-based Generative Adversarial Networks for Publication of Vertically Partitioned Time-Series Data
In the current artificial intelligence (AI) era, the scale and quality of the dataset play a crucial role in training a high-quality AI model. However, the original data often cannot be shared due to privacy concerns and regulations. A potential solution is to release a synthetic dataset with a similar distribution to the private dataset. Nevertheless, in some scenarios, the attributes required to train an AI model are distributed among different parties, and the parties cannot share the local data for synthetic data construction due to privacy regulations. In PETS 2024, we recently introduced the first Vertical Federated Learning-based Generative Adversarial Network (VFLGAN) for publishing vertically partitioned static data. However, VFLGAN cannot effectively handle time-series data, which presents both temporal and attribute dimensions. In this article, we propose VFLGAN-TS, which combines the ideas of an attribute discriminator and vertical federated learning to generate synthetic time-series data in the vertically partitioned scenario. The performance of VFLGAN-TS is close to that of its counterpart, which is trained in a centralized manner and represents the upper limit for VFLGAN-TS. To further protect privacy, we apply a Gaussian mechanism to make VFLGAN-TS satisfy $(\epsilon,\delta)$-differential privacy. In addition, we develop an enhanced privacy auditing scheme to evaluate the potential privacy breaches through the framework of VFLGAN-TS and the synthetic datasets.
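The Gaussian mechanism mentioned above can be illustrated in a few lines: clip a quantity to bound its sensitivity, then add calibrated Gaussian noise. The clipping norm and noise multiplier below are illustrative defaults, not the values used for VFLGAN-TS.

```python
# Minimal sketch of the Gaussian mechanism applied to a clipped gradient.
import numpy as np


def gaussian_mechanism(grad: np.ndarray, clip_norm: float = 1.0,
                       noise_multiplier: float = 1.1) -> np.ndarray:
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))    # bound the sensitivity
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise
```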
☆ A practical approach to evaluating the adversarial distance for machine learning classifiers
Robustness is critical for machine learning (ML) classifiers to ensure consistent performance in real-world applications where models may encounter corrupted or adversarial inputs. In particular, assessing the robustness of classifiers to adversarial inputs is essential to protect systems from vulnerabilities and thus ensure safety in use. However, methods to accurately compute adversarial robustness have been challenging for complex ML models and high-dimensional data. Furthermore, evaluations typically measure adversarial accuracy on specific attack budgets, limiting the informative value of the resulting metrics. This paper investigates the estimation of the more informative adversarial distance using iterative adversarial attacks and a certification approach. Combined, the methods provide a comprehensive evaluation of adversarial robustness by computing estimates for the upper and lower bounds of the adversarial distance. We present visualisations and ablation studies that provide insights into how this evaluation method should be applied and parameterised. We find that our adversarial attack approach is effective compared to related implementations, while the certification method falls short of expectations. The approach in this paper should encourage a more informative way of evaluating the adversarial robustness of ML classifiers.
comment: Accepted manuscript at International Mechanical Engineering Congress and Exposition IMECE2024
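One simple way to realize the iterative-attack part of such an evaluation is to run a PGD-style attack at several shrinking budgets and record the smallest successful perturbation as an upper bound on the adversarial distance. The sketch below assumes a PyTorch classifier and a single example with a batch dimension; it is not the paper's exact procedure, and the budgets and step sizes are placeholders.

```python
# Illustrative upper bound on the L-inf adversarial distance via repeated PGD attacks.
import torch


def pgd_attack(model, x, y, eps, steps=40):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += (eps / steps) * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return delta.detach()


def adversarial_distance_upper_bound(model, x, y, budgets=(0.5, 0.25, 0.1, 0.05, 0.02)):
    # x: single input with batch dimension 1; y: its label as a length-1 tensor.
    best = float("inf")
    for eps in sorted(budgets, reverse=True):
        delta = pgd_attack(model, x, y, eps)
        if model(x + delta).argmax(dim=1) != y:            # attack flipped the prediction
            best = min(best, delta.abs().max().item())     # size of the successful perturbation
    return best
```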
☆ Costs Estimation in Unit Commitment Problems using Simulation-Based Inference
The Unit Commitment (UC) problem is a key optimization task in power systems to forecast the generation schedules of power units over a finite time period by minimizing costs while meeting demand and technical constraints. However, many parameters required by the UC problem are unknown, such as the costs. In this work, we estimate these unknown costs using simulation-based inference on an illustrative UC problem, which provides an approximated posterior distribution of the parameters given observed generation schedules and demands. Our results highlight that the learned posterior distribution effectively captures the underlying distribution of the data, providing a range of possible values for the unknown parameters given a past observation. This posterior allows for the estimation of past costs using observed past generation schedules, enabling operators to better forecast future costs and make more robust generation scheduling forecasts. We present avenues for future research to address overconfidence in posterior estimation, enhance the scalability of the methodology and apply it to more complex UC problems modeling the network constraints and renewable energy sources.
☆ CHIRPs: Change-Induced Regret Proxy metrics for Lifelong Reinforcement Learning
Reinforcement learning agents can achieve superhuman performance in static tasks but are costly to train and fragile to task changes. This limits their deployment in real-world scenarios where training experience is expensive or the context changes through factors like sensor degradation, environmental processes or changing mission priorities. Lifelong reinforcement learning aims to improve sample efficiency and adaptability by studying how agents perform in evolving problems. The difficulty that these changes pose to an agent is rarely measured directly, however. Agent performances can be compared across a change, but this is often prohibitively expensive. We propose Change-Induced Regret Proxy (CHIRP) metrics, a class of metrics for approximating a change's difficulty while avoiding the high costs of using trained agents. A relationship between a CHIRP metric and agent performance is identified in two environments, a simple grid world and MetaWorld's suite of robotic arm tasks. We demonstrate two uses for these metrics: for learning, an agent that clusters MDPs based on a CHIRP metric achieves $17\%$ higher average returns than three existing agents in a sequence of MetaWorld tasks. We also show how a CHIRP can be calibrated to compare the difficulty of changes across distinctly different environments.
comment: 8 pages, 9 figures
☆ 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances KDD
Predicting the performance of LLMs on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, a possibility is to evaluate the considered LLM on a set of task instances and train an assessor to predict its performance based on features of the instances. However, this approach requires evaluating each new LLM on a sufficiently large set of task instances to train an assessor specific to it. In this work, we leverage the evaluation results of previously tested LLMs to reduce the number of evaluations required to predict the performance of a new LLM. In practice, we propose to test the new LLM on a small set of reference instances and train a generic assessor which predicts the performance of the LLM on an instance based on the performance of the former on the reference set and features of the instance of interest. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models until the January 2024 version of GPT4. When predicting performance on instances with the same distribution as those used to train the generic assessor, we find this achieves performance comparable to the LLM-specific assessors trained on the full set of instances. Additionally, we find that randomly selecting the reference instances performs as well as some advanced selection methods we tested. For out of distribution, however, no clear winner emerges and the overall performance is worse, suggesting that the inherent predictability of LLMs is low.
comment: Presented at the 2024 KDD workshop on Evaluation and Trustworthiness of Generative AI Models
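A rough sketch of the generic-assessor idea: represent the new LLM by its success pattern on a small reference set, concatenate that signature with per-instance features, and fit one shared predictor across all previously evaluated LLMs. The feature construction and the choice of logistic regression here are illustrative, not the paper's exact setup.

```python
# Illustrative generic assessor: predict per-instance success from instance features
# plus the model's 0/1 results on a small reference set.
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_generic_assessor(instance_features, per_llm_results, reference_idx):
    # instance_features: list of 1-D numpy feature vectors, one per task instance
    # per_llm_results: dict name -> numpy array of 0/1 successes per instance
    X, y = [], []
    for successes in per_llm_results.values():
        ref_signature = successes[reference_idx]        # performance on the reference items
        for i, feat in enumerate(instance_features):
            X.append(np.concatenate([feat, ref_signature]))
            y.append(successes[i])
    return LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
```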
☆ MaskVal: Simple but Effective Uncertainty Quantification for 6D Pose Estimation
For the use of 6D pose estimation in robotic applications, reliable poses are of utmost importance to ensure a safe, reliable and predictable operational performance. Despite these requirements, state-of-the-art 6D pose estimators often do not provide any uncertainty quantification for their pose estimates at all, or if they do, it has been shown that the uncertainty provided is only weakly correlated with the actual true error. To address this issue, we investigate a simple but effective uncertainty quantification, that we call MaskVal, which compares the pose estimates with their corresponding instance segmentations by rendering and does not require any modification of the pose estimator itself. Despite its simplicity, MaskVal significantly outperforms a state-of-the-art ensemble method on both a dataset and a robotic setup. We show that by using MaskVal, the performance of a state-of-the-art 6D pose estimator is significantly improved towards a safe and reliable operation. In addition, we propose a new and specific approach to compare and evaluate uncertainty quantification methods for 6D pose estimation in the context of robotic manipulation.
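The comparison underlying this kind of check can be sketched as a mask-overlap score: render the object at the estimated pose and compute the IoU against the detector's instance segmentation. The `render_mask` callable stands in for an external renderer and is an assumption; this is a sketch of the idea, not the paper's code.

```python
# Illustrative mask-overlap score for a 6D pose estimate.
import numpy as np


def mask_overlap_score(pose, mesh, camera_K, instance_mask, render_mask) -> float:
    rendered = render_mask(mesh, pose, camera_K)          # boolean HxW mask of the posed object
    inter = np.logical_and(rendered, instance_mask).sum()
    union = np.logical_or(rendered, instance_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0   # high IoU -> low uncertainty
```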
☆ Unified Framework for Neural Network Compression via Decomposition and Optimal Rank Selection
Despite their high accuracy, complex neural networks demand significant computational resources, posing challenges for deployment on resource-constrained devices such as mobile phones and embedded systems. Compression algorithms have been developed to address these challenges by reducing model size and computational demands while maintaining accuracy. Among these approaches, factorization methods based on tensor decomposition are theoretically sound and effective. However, they face difficulties in selecting the appropriate rank for decomposition. This paper tackles this issue by presenting a unified framework that simultaneously applies decomposition and optimal rank selection, employing a composite compression loss within defined rank constraints. Our approach includes an automatic rank search in a continuous space, efficiently identifying optimal rank configurations without the use of training data, making it computationally efficient. Combined with a subsequent fine-tuning step, our approach maintains the performance of highly compressed models on par with their original counterparts. Using various benchmark datasets, we demonstrate the efficacy of our method through a comprehensive analysis.
☆ DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
Diffusion models (DMs) have demonstrated exceptional generative capabilities across various areas, but they are hindered by slow inference speeds and high computational demands during deployment. The most common way to accelerate DMs involves reducing the number of denoising steps during generation, achieved through faster sampling solvers or knowledge distillation (KD). In contrast to prior approaches, we propose a novel method that transfers the capability of large pretrained DMs to faster architectures. Specifically, we employ KD in a distinct manner to compress DMs by distilling their generative ability into more rapid variants. Furthermore, considering that the source data is either inaccessible or too large to store for current generative models, we introduce a new paradigm for their distillation without source data, termed Data-Free Knowledge Distillation for Diffusion Models (DKDM). Our DKDM framework comprises two main components: 1) a DKDM objective that uses synthetic denoising data produced by pretrained DMs to optimize faster DMs without source data, and 2) a dynamic iterative distillation method that flexibly organizes the synthesis of denoising data, preventing the slow generation from bottlenecking the optimization process. To our knowledge, this is the first attempt at using KD to distill DMs into any architecture in a data-free manner. Importantly, our DKDM is orthogonal to most existing acceleration methods, such as denoising step reduction, quantization and pruning. Experiments show that our DKDM is capable of deriving 2x faster DMs with performance remaining on par with the baseline. Notably, our DKDM enables pretrained DMs to function as "datasets" for training new DMs.
☆ The Power of Second Chance: Personalized Submodular Maximization with Two Candidates
Most existing studies on submodular maximization focus on selecting a subset of items that maximizes a \emph{single} submodular function. However, in many real-world scenarios, we might have multiple user-specific functions, each of which models the utility of a particular type of user. In these settings, our goal would be to choose a set of items that performs well across all the user-specific functions. One way to tackle this problem is to select a single subset that maximizes the sum of all of the user-specific functions. Although this aggregate approach is efficient in the sense that it avoids computing sets for individual functions, it misses the power of personalization, since it does not allow different sets to be chosen for different functions. In this paper, we introduce the problem of personalized submodular maximization with two candidate solutions. For any two candidate solutions, the utility of each user-specific function is defined as the better of these two candidates. Our objective is, therefore, to select the best pair of candidates that maximizes the sum of utilities of all the user-specific functions. We design effective algorithms for this problem. We also discuss how our approach generalizes to multiple candidate solutions, increasing flexibility and personalization in our solution.
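To make the two-candidate objective concrete, here is a toy greedy sketch: at each step, the item-and-slot choice with the largest gain in $\sum_u \max(f_u(A), f_u(B))$ is taken. This only illustrates the objective; it is not the algorithm analyzed in the paper, and `user_functions` is assumed to be a list of monotone set functions (e.g., coverage functions).

```python
# Toy greedy illustration of the two-candidate personalized objective.
def greedy_two_candidates(items, user_functions, k):
    A, B = set(), set()

    def value(SA, SB):
        # Each user gets the better of the two candidate sets.
        return sum(max(f(SA), f(SB)) for f in user_functions)

    for _ in range(2 * k):                                  # fill both candidates up to size k
        best_val, best_set, best_item = value(A, B), None, None
        for item in items:
            if item not in A and len(A) < k:
                v = value(A | {item}, B)
                if v > best_val:
                    best_val, best_set, best_item = v, A, item
            if item not in B and len(B) < k:
                v = value(A, B | {item})
                if v > best_val:
                    best_val, best_set, best_item = v, B, item
        if best_set is None:                                # no improving move left
            break
        best_set.add(best_item)
    return A, B
```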
☆ Risk-based Calibration for Probabilistic Classifiers
We introduce a general iterative procedure called risk-based calibration (RC) designed to minimize the empirical risk under the 0-1 loss (empirical error) for probabilistic classifiers. These classifiers are based on modeling probability distributions, including those constructed from the joint distribution (generative) and those based on the class conditional distribution (conditional). RC can be particularized to any probabilistic classifier provided a specific learning algorithm that computes the classifier's parameters in closed form using data statistics. RC reinforces the statistics aligned with the true class while penalizing those associated with other classes, guided by the 0-1 loss. The proposed method has been empirically tested on 30 datasets using na\"ive Bayes, quadratic discriminant analysis, and logistic regression classifiers. RC improves the empirical error of the original closed-form learning algorithms and, more notably, consistently outperforms the gradient descent approach with the three classifiers.
☆ Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift
Natural distribution shift causes a deterioration in the perception performance of convolutional neural networks (CNNs). This comprehensive analysis for real-world traffic data addresses: 1) investigating the effect of natural distribution shift and weather augmentations on both detection quality and confidence estimation, 2) evaluating model performance for both classification and object localization, and 3) benchmarking two common uncertainty quantification methods - Ensembles and different variants of Monte-Carlo (MC) Dropout - under natural and close-to-natural distribution shift. For this purpose, a novel dataset has been curated from publicly available autonomous driving datasets. The in-distribution (ID) data is based on cutouts of a single object, for which both class and bounding box annotations are available. The six distribution-shift datasets cover adverse weather scenarios, simulated rain and fog, corner cases, and out-of-distribution data. A granular analysis of CNNs under distribution shift makes it possible to quantify the impact of different types of shift on both task performance and confidence estimation: ConvNeXt-Tiny is more robust than EfficientNet-B0; heavy rain degrades classification more strongly than localization, whereas heavy fog shows the opposite pattern; and integrating MC-Dropout into selected layers only has the potential to enhance task performance and confidence estimation, although the identification of these layers depends on the type of distribution shift and the considered task.
comment: This preprint has not undergone any post-submission improvements or corrections
☆ A Physics-Informed Machine Learning Approach for Solving Distributed Order Fractional Differential Equations
This paper introduces a novel methodology for solving distributed-order fractional differential equations using a physics-informed machine learning framework. The core of this approach involves extending the support vector regression (SVR) algorithm to approximate the unknown solutions of the governing equations during the training phase. By embedding the distributed-order functional equation into the SVR framework, we incorporate physical laws directly into the learning process. To further enhance computational efficiency, Gegenbauer orthogonal polynomials are employed as the kernel function, capitalizing on their fractional differentiation properties to streamline the problem formulation. Finally, the resulting optimization problem of SVR is addressed either as a quadratic programming problem or as a positive definite system in its dual form. The effectiveness of the proposed approach is validated through a series of numerical experiments on Caputo-based distributed-order fractional differential equations, encompassing both ordinary and partial derivatives.
Survey of Data-driven Newsvendor: Unified Analysis and Spectrum of Achievable Regrets
In the Newsvendor problem, the goal is to guess the number that will be drawn from some distribution, with asymmetric consequences for guessing too high vs. too low. In the data-driven version, the distribution is unknown, and one must work with samples from the distribution. Data-driven Newsvendor has been studied under many variants: additive vs. multiplicative regret, high probability vs. expectation bounds, and different distribution classes. This paper studies all combinations of these variants, filling in many gaps in the literature and simplifying many proofs. In particular, we provide a unified analysis based on the notion of clustered distributions, which in conjunction with our new lower bounds, shows that the entire spectrum of regrets between $1/\sqrt{n}$ and $1/n$ can be possible.
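For context, the standard sample-average baseline in this literature orders the empirical quantile of past demand at the critical ratio $c_u/(c_u+c_o)$, where $c_u$ and $c_o$ are the underage and overage costs. The snippet below is that textbook rule, not this paper's contribution.

```python
# Worked illustration of the empirical-quantile (SAA) newsvendor rule.
import numpy as np


def saa_newsvendor(demand_samples, underage_cost=4.0, overage_cost=1.0):
    critical_ratio = underage_cost / (underage_cost + overage_cost)
    return float(np.quantile(demand_samples, critical_ratio))


# Example: with c_u = 4 and c_o = 1, order roughly the 80th percentile of observed demand.
print(saa_newsvendor(np.random.default_rng(0).poisson(50, size=200)))
```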
☆ Maximum likelihood inference for high-dimensional problems with multiaffine variable relations
Maximum Likelihood Estimation of continuous variable models can be very challenging in high dimensions, due to potentially complex probability distributions. The existence of multiple interdependencies among variables can make it very difficult to establish convergence guarantees. This leads to a wide use of brute-force methods, such as grid searching and Monte-Carlo sampling and, when applicable, complex and problem-specific algorithms. In this paper, we consider inference problems where the variables are related by multiaffine expressions. We propose a novel Alternating and Iteratively-Reweighted Least Squares (AIRLS) algorithm, and prove its convergence for problems with Generalized Normal Distributions. We also provide an efficient method to compute the variance of the estimates obtained using AIRLS. Finally, we show how the method can be applied to graphical statistical models. We perform numerical experiments on several inference problems, showing significantly better performance than state-of-the-art approaches in terms of scalability, robustness to noise, and convergence speed due to an empirically observed super-linear convergence rate.
☆ Distributionally Robust Optimisation with Bayesian Ambiguity Sets
Decision making under uncertainty is challenging since the data-generating process (DGP) is often unknown. Bayesian inference proceeds by estimating the DGP through posterior beliefs about the model's parameters. However, minimising the expected risk under these posterior beliefs can lead to sub-optimal decisions due to model uncertainty or limited, noisy observations. To address this, we introduce Distributionally Robust Optimisation with Bayesian Ambiguity Sets (DRO-BAS) which hedges against uncertainty in the model by optimising the worst-case risk over a posterior-informed ambiguity set. We show that our method admits a closed-form dual representation for many exponential family members and showcase its improved out-of-sample robustness against existing Bayesian DRO methodology in the Newsvendor problem.
comment: 13 pages, 3 figures. Under review
☆ Sparsifying Parametric Models with L0 Regularization
This document contains an educational introduction to the problem of sparsifying parametric models with L0 regularization. We utilize this approach together with dictionary learning to learn sparse polynomial policies for deep reinforcement learning to control parametric partial differential equations. The code and a tutorial are provided here: https://github.com/nicob15/Sparsifying-Parametric-Models-with-L0.
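As a hedged illustration of the technique named in the title, the sketch below implements hard-concrete gates in PyTorch (a common parameterization of L0 regularization, following Louizos et al.), using the usual default constants; it may differ from the code in the linked tutorial.

```python
# Illustrative hard-concrete gate module for L0 regularization.
import torch


class L0Gate(torch.nn.Module):
    def __init__(self, n_params, beta=2 / 3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = torch.nn.Parameter(torch.zeros(n_params))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        # Sample stretched hard-concrete gates in [0, 1]; many land exactly at 0.
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + self.log_alpha) / self.beta)
        s = s * (self.zeta - self.gamma) + self.gamma
        return s.clamp(0, 1)

    def expected_l0(self):
        # Differentiable penalty: expected number of non-zero gates.
        return torch.sigmoid(
            self.log_alpha - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()
```

In training, the gates multiply the model parameters (or dictionary coefficients) and `expected_l0()` is added to the task loss with a sparsity weight.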
LLM-based event abstraction and integration for IoT-sourced logs
The continuous flow of data collected by Internet of Things (IoT) devices has revolutionised our ability to understand and interact with the world across various applications. However, this data must be prepared and transformed into event data before analysis can begin. In this paper, we shed light on the potential of leveraging Large Language Models (LLMs) for event abstraction and integration. Our approach aims to create event records from raw sensor readings and to merge the logs from multiple IoT sources into a single event log suitable for further Process Mining applications. We demonstrate the capabilities of LLMs in event abstraction on a case study of an IoT application in elderly care and longitudinal health monitoring. The results show an average accuracy of 90% in detecting high-level activities, highlighting LLMs' promising potential for addressing event abstraction and integration challenges and effectively bridging the existing gap.
comment: 12 pages
☆ Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation
Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours, which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error; however, no work has been done on improving the "utility" of Bayesian uncertainty maps such that uncertainty is present only in inaccurate regions and not in accurate ones. Our work trains a FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss, which promotes uncertainty only in inaccurate regions. We apply this method to datasets from two radiotherapy body sites, namely head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that, when compared to the Bayesian baseline, the proposed method successfully suppresses uncertainty for accurate voxels while retaining a similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at https://github.com/prerakmody/bayesuncertainty-error-correspondence
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:018
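A simplified sketch of an Accuracy-vs-Uncertainty style penalty follows: it rewards low uncertainty on accurately predicted voxels and high uncertainty on inaccurate ones using soft counts. The exact functional form and any thresholds may differ from the published AvU loss; this only conveys the idea.

```python
# Illustrative AvU-style penalty over per-voxel softmax outputs.
import torch


def avu_penalty(probs, targets, eps=1e-8):
    # probs: (N, C) softmax outputs over voxels; targets: (N,) integer labels
    entropy = -(probs * (probs + eps).log()).sum(dim=1)
    u = entropy / torch.log(torch.tensor(float(probs.shape[1])))    # normalised uncertainty in [0, 1]
    correct = (probs.argmax(dim=1) == targets).float()
    n_ac = (correct * (1 - u)).sum()          # accurate & certain      (desirable)
    n_au = (correct * u).sum()                # accurate & uncertain    (undesirable)
    n_iu = ((1 - correct) * u).sum()          # inaccurate & uncertain  (desirable)
    n_ic = ((1 - correct) * (1 - u)).sum()    # inaccurate & certain    (undesirable)
    avu = (n_ac + n_iu) / (n_ac + n_au + n_iu + n_ic + eps)
    return -torch.log(avu + eps)              # minimising this pushes AvU toward 1
```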
☆ Panopticon: a novel deep learning model to detect single transit events with no prior data filtering in PLATO light curves
To prepare for the analyses of the future PLATO light curves, we develop a deep learning model, Panopticon, to detect transits in high precision photometric light curves. Since PLATO's main objective is the detection of temperate Earth-size planets around solar-type stars, the code is designed to detect individual transit events. The filtering step, required by conventional detection methods, can affect the transit, which could be an issue for long and shallow transits. To protect transit shape and depth, the code is also designed to work on unfiltered light curves. We trained the model on a set of simulated PLATO light curves in which we injected, at pixel level, either planetary, eclipsing binary, or background eclipsing binary signals. We also include a variety of noises in our data, such as granulation, stellar spots or cosmic rays. The approach is able to recover 90% of our test population, including more than 25% of the Earth-analogs, even in the unfiltered light curves. The model also recovers the transits irrespective of the orbital period, and is able to retrieve transits on a unique event basis. These figures are obtained when accepting a false alarm rate of 1%. When keeping the false alarm rate low (<0.01%), it is still able to recover more than 85% of the transit signals. Any transit deeper than 180ppm is essentially guaranteed to be recovered. This method is able to recover transits on a unique event basis, and does so with a low false alarm rate. Thanks to light curves being one-dimensional, model training is fast, on the order of a few hours per model. This speed in training and inference, coupled to the recovery effectiveness and precision of the model make it an ideal tool to complement, or be used ahead of, classical approaches.
comment: Submitted to A&A
☆ Characterizing Massive Activations of Attention Mechanism in Graph Neural Networks
Graph Neural Networks (GNNs) have become increasingly popular for effectively modeling data with graph structures. Recently, attention mechanisms have been integrated into GNNs to improve their ability to capture complex patterns. This paper presents the first comprehensive study revealing a critical, unexplored consequence of this integration: the emergence of Massive Activations (MAs) within attention layers. We introduce a novel method for detecting and analyzing MAs, focusing on edge features in different graph transformer architectures. Our study assesses various GNN models using benchmark datasets, including ZINC, TOX21, and PROTEINS. Key contributions include (1) establishing the direct link between attention mechanisms and MAs generation in GNNs, (2) developing a robust definition and detection method for MAs based on activation ratio distributions, (3) introducing the Explicit Bias Term (EBT) as a potential countermeasure and exploring it as an adversarial framework to assess models robustness based on the presence or absence of MAs. Our findings highlight the prevalence and impact of attention-induced MAs across different architectures, such as GraphTransformer, GraphiT, and SAN. The study reveals the complex interplay between attention mechanisms, model architecture, dataset characteristics, and MAs emergence, providing crucial insights for developing more robust and reliable graph models.
☆ Raw Speech Enhancement with Deep State Space Modeling
We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement in an end-to-end fashion. The network's performance is primarily evaluated on raw speech denoising, with additional assessments on tasks such as super-resolution and de-quantization. We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets. The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency. Even as a raw waveform processing model, the model maintains high fidelity to the clean signal with minimal audible artifacts. In addition, the model remains performant even when the noisy input is compressed down to 4000Hz and 4 bits, suggesting general speech enhancement capabilities in low-resource environments.
comment: 7 pages, 2 figures
☆ Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time
Based on official estimates, 50 million people worldwide are affected by dementia, and this number increases by 10 million new patients every year. Without a cure, clinical prognostication and early intervention represent the most effective ways to delay its progression. To this end, Artificial Intelligence and computational linguistics can be exploited for natural language analysis, personalized assessment, monitoring, and treatment. However, traditional approaches lack sufficient semantic knowledge management and explainability capabilities. Moreover, the use of Large Language Models (LLMs) for cognitive decline diagnosis is still scarce, even though these models represent the most advanced way for clinician-patient communication using intelligent systems. Consequently, we leverage an LLM with the latest Natural Language Processing (NLP) techniques in a chatbot solution to provide interpretable Machine Learning predictions of cognitive decline in real time. Linguistic-conceptual features are exploited for appropriate natural language analysis. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. In more detail, the proposed pipeline is composed of (i) data extraction employing NLP-based prompt engineering; (ii) stream-based data processing including feature engineering, analysis, and selection; (iii) real-time classification; and (iv) an explainability dashboard that provides visual and natural language descriptions of the prediction outcome. Classification results exceed 80% in all evaluation metrics, with a recall of about 85% for the mental deterioration class. To sum up, this work contributes an affordable, flexible, non-invasive, personalized diagnostic system.
☆ Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management
Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with the unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the heterogeneous workloads of different ML tasks and data modalities, training these models usually requires massive GPU resources and suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of large-scale MT MM models through data heterogeneity-aware model management optimization. The key idea is to decompose the model execution into stages and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. Based on this, we build a prototype system and evaluate it on various large MT MM models. Experiments demonstrate the superior performance and efficiency of our system, with speedup ratio up to 71% compared to state-of-the-art training systems.
☆ MouseSIS: A Frames-and-Events Dataset for Space-Time Instance Segmentation of Mice ECCV
Enabled by large annotated datasets, tracking and segmentation of objects in videos have made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotated data for developing learning-based mask-level tracking algorithms with events is not available. To this end, we introduce: ($i$) a new task termed \emph{space-time instance segmentation}, similar to video instance segmentation, whose goal is to segment instances throughout the entire duration of the sensor input (here, the inputs are quasi-continuous events and optionally aligned frames); and ($ii$) \emph{MouseSIS}, a dataset for the new task, containing aligned grayscale frames and events. It includes annotated ground-truth labels (pixel-level instance segmentation masks) of a group of up to seven freely moving and interacting mice. We also provide two reference methods, which show that leveraging event data can consistently improve tracking performance, especially when used in combination with conventional cameras. The results highlight the potential of event-aided tracking in difficult scenarios. We hope our dataset opens the field of event-based video instance segmentation and enables the development of robust tracking algorithms for challenging conditions. \url{https://github.com/tub-rip/MouseSIS}
comment: 18 pages, 5 figures, ECCV Workshops
☆ Semi-Supervised Sparse Gaussian Classification: Provable Benefits of Unlabeled Data
The premise of semi-supervised learning (SSL) is that combining labeled and unlabeled data yields significantly more accurate models. Despite empirical successes, the theoretical understanding of SSL is still far from complete. In this work, we study SSL for high dimensional sparse Gaussian classification. To construct an accurate classifier, a key task is feature selection, i.e., detecting the few variables that separate the two classes. For this SSL setting, we analyze information theoretic lower bounds for accurate feature selection as well as computational lower bounds, assuming the low-degree likelihood hardness conjecture. Our key contribution is the identification of a regime in the problem parameters (dimension, sparsity, number of labeled and unlabeled samples) where SSL is guaranteed to be advantageous for classification. Specifically, there is a regime where it is possible to construct in polynomial time an accurate SSL classifier, whereas any computationally efficient supervised or unsupervised learning scheme that separately uses only the labeled or the unlabeled data would fail. Our work highlights the provable benefits of combining labeled and unlabeled data for classification and feature selection in high dimensions. We present simulations that complement our theoretical analysis.
☆ Towards training digitally-tied analog blocks via hybrid gradient computation
Power efficiency is plateauing in the standard digital electronics realm such that novel hardware, models, and algorithms are needed to reduce the costs of AI training. The combination of energy-based analog circuits and the Equilibrium Propagation (EP) algorithm constitutes one compelling alternative compute paradigm for gradient-based optimization of neural nets. Existing analog hardware accelerators, however, typically incorporate digital circuitry to sustain auxiliary non-weight-stationary operations, mitigate analog device imperfections, and leverage existing digital accelerators. This heterogeneous hardware approach calls for a new theoretical building block. In this work, we introduce Feedforward-tied Energy-based Models (ff-EBMs), a hybrid model comprising feedforward and energy-based blocks accounting for digital and analog circuits. We derive a novel algorithm to compute gradients end-to-end in ff-EBMs by backpropagating and "eq-propagating" through feedforward and energy-based parts respectively, enabling EP to be applied to much more flexible and realistic architectures. We experimentally demonstrate the effectiveness of the proposed approach on ff-EBMs where Deep Hopfield Networks (DHNs) are used as energy-based blocks. We first show that a standard DHN can be arbitrarily split into uniform blocks of any size while maintaining performance. We then train ff-EBMs on ImageNet32, where we establish new SOTA performance in the EP literature (46% top-1 accuracy). Our approach offers a principled, scalable, and incremental roadmap to gradually integrate self-trainable analog computational primitives into existing digital accelerators.
☆ Improving Robustness to Multiple Spurious Correlations by Multi-Objective Optimization
We study the problem of training an unbiased and accurate model given a dataset with multiple biases. This problem is challenging since the multiple biases cause multiple undesirable shortcuts during training, and even worse, mitigating one may exacerbate the other. We propose a novel training method to tackle this challenge. Our method first groups training data so that different groups induce different shortcuts, and then optimizes a linear combination of group-wise losses while adjusting their weights dynamically to alleviate conflicts between the groups in performance; this approach, rooted in multi-objective optimization theory, encourages convergence to the minimax Pareto solution. We also present a new benchmark with multiple biases, dubbed MultiCelebA, for evaluating debiased training methods under realistic and challenging scenarios. Our method achieved the best performance on three datasets with multiple biases, and also showed superior performance on conventional single-bias datasets.
comment: International Conference on Machine Learning 2024
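The snippet below is a minimal sketch of the idea of dynamically re-weighted group-wise losses described in the abstract above; it uses an exponentiated-gradient update toward the currently worst-performing groups rather than the paper's exact multi-objective procedure, and all names and hyperparameters are assumptions.

```python
# Sketch (assumption-laden, not the paper's exact algorithm): a linear
# combination of group-wise losses whose weights are pushed toward the
# currently worst-performing groups, in the spirit of minimax optimization.
import torch

def grouped_step(model, optimizer, criterion, batches, weights, eta=0.1):
    """batches: dict group_id -> (inputs, targets); weights: 1-D tensor over groups."""
    losses = torch.stack([
        criterion(model(x), y) for x, y in batches.values()
    ])
    # Exponentiated-gradient update: harder groups receive larger weights.
    with torch.no_grad():
        weights *= torch.exp(eta * losses)
        weights /= weights.sum()
    total = (weights * losses).sum()   # weighted combination of group losses
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return losses.detach(), weights
```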
☆ Fourier Neural Operators for Learning Dynamics in Quantum Spin Systems
Fourier Neural Operators (FNOs) excel on tasks using functional data, such as those originating from partial differential equations. Such characteristics render them an effective approach for simulating the time evolution of quantum wavefunctions, which is a computationally challenging, yet coveted task for understanding quantum systems. In this manuscript, we use FNOs to model the evolution of random quantum spin systems, so chosen due to their representative quantum dynamics and minimal symmetry. We explore two distinct FNO architectures and examine their performance for learning and predicting time evolution using both random and low-energy input states. Additionally, we apply FNOs to a compact set of Hamiltonian observables ($\sim\text{poly}(n)$) instead of the entire $2^n$ quantum wavefunction, which greatly reduces the size of our inputs and outputs and, consequently, the requisite dimensions of the resulting FNOs. Moreover, this Hamiltonian observable-based method demonstrates that FNOs can effectively distill information from high-dimensional spaces into lower-dimensional spaces. The extrapolation of Hamiltonian observables to times later than those used in training is of particular interest, as this stands to fundamentally increase the simulatability of quantum systems past both the coherence times of contemporary quantum architectures and the circuit-depths of tractable tensor networks.
comment: 9 pages, 4 figures
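As a reference point for readers unfamiliar with FNOs, the snippet below sketches a generic 1D spectral convolution layer of the kind FNO architectures stack; it is not the specific architecture or observable encoding used in the paper above.

```python
# A minimal 1D Fourier layer in the style of FNOs (generic sketch): transform
# to Fourier space, apply a learned linear map on the lowest `modes`
# frequencies, and transform back to physical space.
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (in_ch * out_ch)
        self.weight = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes, dtype=torch.cfloat)
        )

    def forward(self, x):                       # x: (batch, in_ch, n_grid)
        x_ft = torch.fft.rfft(x)                # (batch, in_ch, n_grid//2 + 1)
        out_ft = torch.zeros(
            x.size(0), self.weight.size(1), x_ft.size(-1),
            dtype=torch.cfloat, device=x.device
        )
        out_ft[..., :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[..., :self.modes], self.weight
        )
        return torch.fft.irfft(out_ft, n=x.size(-1))

layer = SpectralConv1d(in_ch=2, out_ch=2, modes=8)
y = layer(torch.randn(4, 2, 64))                # e.g. observables on 64 sites
print(y.shape)                                  # torch.Size([4, 2, 64])
```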
☆ ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models
Reinforcement Learning (RL) is highly dependent on the meticulous design of the reward function. However, accurately assigning rewards to each state-action pair in Long-Term RL (LTRL) challenges is formidable. Consequently, RL agents are predominantly trained with expert guidance. Drawing on the principles of ordinal utility theory from economics, we propose a novel reward estimation algorithm: ELO-Rating based RL (ERRL). This approach is distinguished by two main features. Firstly, it leverages expert preferences over trajectories instead of cardinal rewards (utilities) to compute the ELO rating of each trajectory as its reward. Secondly, a new reward redistribution algorithm is introduced to mitigate training volatility in the absence of a fixed anchor reward. Our method demonstrates superior performance over several leading baselines in long-term scenarios (extending up to 5000 steps), where conventional RL algorithms falter. Furthermore, we conduct a thorough analysis of how expert preferences affect the outcomes.
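The following sketch shows the basic Elo mechanism this abstract builds on: pairwise trajectory preferences from an expert are converted into per-trajectory ratings that can serve as scalar rewards. It omits ERRL's reward redistribution step and is purely illustrative.

```python
# Sketch of the Elo idea (illustrative, not ERRL itself): expert pairwise
# preferences over trajectories update per-trajectory ratings, which can then
# be used as trajectory-level rewards.
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    ea = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - ea)
    ratings[loser]  -= k * (1.0 - ea)

# Trajectories start at the same rating; preferences come from an expert.
ratings = {"traj_A": 1000.0, "traj_B": 1000.0, "traj_C": 1000.0}
preferences = [("traj_A", "traj_B"), ("traj_A", "traj_C"), ("traj_B", "traj_C")]
for winner, loser in preferences:
    update_elo(ratings, winner, loser)

print(ratings)   # higher rating -> larger trajectory-level reward
```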
☆ Bringing the RT-1-X Foundation Model to a SCARA robot
Traditional robotic systems require specific training data for each task, environment, and robot form. While recent advancements in machine learning have enabled models to generalize across new tasks and environments, the challenge of adapting these models to entirely new settings remains largely unexplored. This study addresses this by investigating the generalization capabilities of the RT-1-X robotic foundation model to a type of robot unseen during its training: a SCARA robot from UMI-RTX. Initial experiments reveal that RT-1-X does not generalize zero-shot to the unseen type of robot. However, fine-tuning of the RT-1-X model by demonstration allows the robot to learn a pickup task which was part of the foundation model (but learned for another type of robot). When the robot is presented with an object that is included in the foundation model but not in the fine-tuning dataset, it demonstrates that only the skill, but not the object-specific knowledge, has been transferred.
comment: 14 pages, submitted to the joint Artificial Intelligence & Machine Learning conference for Belgium, Netherlands & Luxembourg (BNAIC/BeNeLearn)
LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts EMNLP
With the emergence of widely available powerful Large Language Models (LLMs), disinformation generated by LLMs has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world remains to be proven. In this paper, we focus on an important setting in information operations -- short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increases, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates domain-specific benchmarking is needed, while the latter suggests a trade-off between adversarial evasion resilience and overfitting to the reference human text, both of which need to be evaluated in benchmarks but are currently absent. We believe these findings call for a reconsideration of current LLM detector benchmarking approaches, and we provide a dynamically extensible benchmark to support this (https://github.com/Reliable-Information-Lab-HEVS/dynamic_llm_detector_benchmark).
comment: 20 pages, 7 tables, 13 figures, under consideration for EMNLP
☆ Interpretable mixture of experts for time series prediction under recurrent and non-recurrent conditions
Non-recurrent conditions caused by incidents are different from recurrent conditions that follow periodic patterns. Existing traffic speed prediction studies are incident-agnostic and use one single model to learn all possible patterns from these drastically diverse conditions. This study proposes a novel Mixture of Experts (MoE) model to improve traffic speed prediction under two separate conditions, recurrent and non-recurrent (i.e., with and without incidents). The MoE leverages separate recurrent and non-recurrent expert models (Temporal Fusion Transformers) to capture the distinct patterns of each traffic condition. Additionally, we propose a training pipeline for non-recurrent models to remedy the limited-data issue. To train our model, multi-source datasets, including traffic speed, incident reports, and weather data, are integrated and processed into informative features. Evaluations on a real road network demonstrate that the MoE achieves lower errors compared to other benchmark algorithms. The model predictions are interpreted in terms of temporal dependencies and variable importance in each condition separately to shed light on the differences between recurrent and non-recurrent conditions.
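A compact sketch of a two-expert mixture in this spirit is shown below; the experts are plain GRUs rather than Temporal Fusion Transformers, and all dimensions and feature choices are illustrative assumptions rather than the paper's configuration.

```python
# Sketch of a two-expert mixture for traffic speed prediction: a gate driven by
# context features (e.g. incident and weather cues) mixes a "recurrent" and a
# "non-recurrent" expert.
import torch
import torch.nn as nn

class TwoExpertMoE(nn.Module):
    def __init__(self, n_features, horizon):
        super().__init__()
        self.recurrent_expert = nn.GRU(n_features, 64, batch_first=True)
        self.nonrecurrent_expert = nn.GRU(n_features, 64, batch_first=True)
        self.head_r = nn.Linear(64, horizon)
        self.head_n = nn.Linear(64, horizon)
        self.gate = nn.Linear(n_features, 2)    # fed with incident/weather cues

    def forward(self, history, context):
        # history: (batch, time, n_features); context: (batch, n_features)
        _, h_r = self.recurrent_expert(history)
        _, h_n = self.nonrecurrent_expert(history)
        pred_r = self.head_r(h_r[-1])
        pred_n = self.head_n(h_n[-1])
        w = torch.softmax(self.gate(context), dim=-1)   # (batch, 2)
        return w[:, :1] * pred_r + w[:, 1:] * pred_n

model = TwoExpertMoE(n_features=6, horizon=12)
speeds = model(torch.randn(8, 48, 6), torch.randn(8, 6))
print(speeds.shape)   # torch.Size([8, 12])
```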
☆ Tensor network square root Kalman filter for online Gaussian process regression
The state-of-the-art tensor network Kalman filter lifts the curse of dimensionality for high-dimensional recursive estimation problems. However, the required rounding operation can cause filter divergence due to the loss of positive definiteness of covariance matrices. We solve this issue by developing, for the first time, a tensor network square root Kalman filter, and apply it to high-dimensional online Gaussian process regression. In our experiments, we demonstrate that our method is equivalent to the conventional Kalman filter when choosing a full-rank tensor network. Furthermore, we apply our method to a real-life system identification problem where we estimate $4^{14}$ parameters on a standard laptop. The estimated model outperforms the state-of-the-art tensor network Kalman filter in terms of prediction accuracy and uncertainty quantification.
☆ In Search of Trees: Decision-Tree Policy Synthesis for Black-Box Systems via Search
Decision trees, owing to their interpretability, are attractive as control policies for (dynamical) systems. Unfortunately, constructing, or synthesising, such policies is a challenging task. Previous approaches do so by imitating a neural-network policy, approximating a tabular policy obtained via formal synthesis, employing reinforcement learning, or modelling the problem as a mixed-integer linear program. However, these works may require access to a hard-to-obtain accurate policy or a formal model of the environment (within reach of formal synthesis), and may not provide guarantees on the quality or size of the final tree policy. In contrast, we present an approach to synthesise optimal decision-tree policies given a black-box environment and specification, and a discretisation of the tree predicates, where optimality is defined with respect to the number of steps to achieve the goal. Our approach is a specialised search algorithm which systematically explores the (exponentially large) space of decision trees under the given discretisation. The key component is a novel pruning mechanism that significantly reduces the search space. Our approach represents a conceptually novel way of synthesising small decision-tree policies with optimality guarantees even for black-box environments with black-box specifications.
comment: 8 pages main text incl. references, 1 page appendix
☆ SpinMultiNet: Neural Network Potential Incorporating Spin Degrees of Freedom with Multi-Task Learning
Neural Network Potentials (NNPs) have attracted significant attention as a method for accelerating density functional theory (DFT) calculations. However, conventional NNP models typically do not incorporate spin degrees of freedom, limiting their applicability to systems where spin states critically influence material properties, such as transition metal oxides. This study introduces SpinMultiNet, a novel NNP model that integrates spin degrees of freedom through multi-task learning. SpinMultiNet achieves accurate predictions without relying on correct spin values obtained from DFT calculations. Instead, it utilizes initial spin estimates as input and leverages multi-task learning to optimize the spin latent representation while maintaining both $E(3)$ and time-reversal equivariance. Validation on a dataset of transition metal oxides demonstrates the high predictive accuracy of SpinMultiNet. The model successfully reproduces the energy ordering of stable spin configurations originating from superexchange interactions and accurately captures the rhombohedral distortion of the rocksalt structure. These results pave the way for new possibilities in materials simulations that consider spin degrees of freedom, promising future applications in large-scale simulations of various material systems, including magnetic materials.
comment: 12 pages, 4 figures
☆ Dual-TSST: A Dual-Branch Temporal-Spectral-Spatial Transformer Model for EEG Decoding
The decoding of electroencephalography (EEG) signals provides convenient access to user intentions and plays an important role in the field of human-machine interaction. To effectively extract sufficient characteristics of multichannel EEG, a novel decoding network with a dual-branch temporal-spectral-spatial transformer (Dual-TSST) is proposed in this study. Specifically, by utilizing convolutional neural networks (CNNs) on different branches, the proposed network first extracts the temporal-spatial features of the original EEG and the temporal-spectral-spatial features of time-frequency domain data obtained by wavelet transformation, respectively. These perceived features are then integrated by a feature fusion block, serving as the input of the transformer to capture the global long-range dependencies entailed in the non-stationary EEG, and are classified via global average pooling and multi-layer perceptron blocks. To evaluate the efficacy of the proposed approach, comparative experiments are conducted on three publicly available datasets (BCI IV 2a, BCI IV 2b, and SEED), with head-to-head comparisons against more than ten state-of-the-art methods. As a result, the proposed Dual-TSST performs superiorly across tasks, achieving promising average classification accuracies of 80.67% on BCI IV 2a, 88.64% on BCI IV 2b, and 96.65% on SEED. Extensive ablation experiments against baseline models also confirm the contribution of each module of the proposed method to decoding performance. This study provides a new approach to high-performance EEG decoding and has great potential for future CNN-Transformer based applications.
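The sketch below outlines the overall dual-branch shape described above (two CNN branches, feature fusion, a transformer encoder, global average pooling, and an MLP head); layer sizes, kernel choices, and input shapes are placeholders rather than the reported Dual-TSST configuration.

```python
# Schematic sketch of a dual-branch CNN + transformer EEG classifier
# (placeholder dimensions, not the reported Dual-TSST architecture).
import torch
import torch.nn as nn

class DualBranchEEG(nn.Module):
    def __init__(self, n_channels, n_freqs, n_classes, d_model=64):
        super().__init__()
        self.branch_raw = nn.Sequential(             # temporal-spatial branch
            nn.Conv1d(n_channels, d_model, kernel_size=25, padding=12),
            nn.BatchNorm1d(d_model), nn.ELU(),
        )
        self.branch_tf = nn.Sequential(              # temporal-spectral-spatial branch
            nn.Conv1d(n_channels * n_freqs, d_model, kernel_size=25, padding=12),
            nn.BatchNorm1d(d_model), nn.ELU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=4,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.classifier = nn.Sequential(nn.Linear(2 * d_model, 64), nn.ELU(),
                                        nn.Linear(64, n_classes))

    def forward(self, eeg, eeg_tf):
        # eeg: (batch, channels, time); eeg_tf: (batch, channels*freqs, time)
        f = torch.cat([self.branch_raw(eeg), self.branch_tf(eeg_tf)], dim=1)
        f = self.transformer(f.transpose(1, 2))      # (batch, time, 2*d_model)
        return self.classifier(f.mean(dim=1))        # global average pooling

model = DualBranchEEG(n_channels=22, n_freqs=5, n_classes=4)
logits = model(torch.randn(2, 22, 250), torch.randn(2, 110, 250))
print(logits.shape)   # torch.Size([2, 4])
```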
☆ DiffGrad for Physics-Informed Neural Networks
Physics-Informed Neural Networks (PINNs) are regarded as state-of-the-art tools for addressing highly nonlinear problems based on partial differential equations. Despite their broad range of applications, PINNs encounter several performance challenges, including issues related to efficiency, computational cost, and accuracy. Burgers' equation, a fundamental equation in fluid dynamics that is extensively used in PINNs, yields variable results with the Adam optimizer, which does not account for the difference between successive gradients. This paper introduces a novel strategy for solving Burgers' equation by incorporating DiffGrad with PINNs, a method that leverages the difference between current and immediately preceding gradients to enhance performance. A comprehensive computational analysis is conducted using optimizers such as Adam, Adamax, RMSprop, and DiffGrad to evaluate and compare their effectiveness. Our approach includes visualizing the solutions over space at various time intervals to demonstrate the accuracy of the network. The results show that DiffGrad not only improves the accuracy of the solution but also reduces training time compared to the other optimizers.
comment: 20 pages, 14 figures
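For reference, the snippet below sketches the DiffGrad update rule as it is commonly described in the literature: Adam-style moment estimates with the step scaled by a friction coefficient derived from the change between successive gradients. The hyperparameters and the surrounding loss are illustrative, not the paper's PINN setup.

```python
# Sketch of a DiffGrad-style update: Adam moments, with the step scaled by a
# friction coefficient xi = sigmoid(|g_prev - g|) (smaller when the gradient
# barely changes, approaching 1 when it changes a lot).
import torch

def diffgrad_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    state["t"] += 1
    m, v, g_prev, t = state["m"], state["v"], state["g_prev"], state["t"]
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])            # first moment
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])  # second moment
    xi = torch.sigmoid((g_prev - grad).abs())                  # friction coefficient
    state["g_prev"] = grad.clone()
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    param.data.addcdiv_(xi * m_hat, v_hat.sqrt() + eps, value=-lr)

# Usage with one parameter tensor (the loss is a stand-in for a PINN residual).
w = torch.randn(10, requires_grad=True)
state = {"m": torch.zeros_like(w), "v": torch.zeros_like(w),
         "g_prev": torch.zeros_like(w), "t": 0}
loss = (w ** 2).sum()
loss.backward()
diffgrad_step(w, w.grad, state)
```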
☆ Preserving Empirical Probabilities in BERT for Small-sample Clinical Entity Recognition
Named Entity Recognition (NER) encounters the challenge of unbalanced labels, where certain entity types are overrepresented while others are underrepresented in real-world datasets. This imbalance can lead to biased models that perform poorly on minority entity classes, impeding accurate and equitable entity recognition. This paper explores the effects of unbalanced entity labels on BERT-based pre-trained models. We analyze the different mechanisms of loss calculation and loss propagation for the task of token classification on randomized datasets. We then propose ways to improve token classification for the highly imbalanced task of clinical entity recognition.
comment: 8 pages, 8 figures
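One common remedy in this setting, shown below as an illustrative sketch rather than the paper's proposal, is to weight the token-level cross-entropy loss by inverse class frequency so that rare clinical entity types contribute more to the gradient.

```python
# Sketch: inverse-frequency class weights in a token-classification loss.
import torch
import torch.nn as nn

def inverse_frequency_weights(label_counts):
    counts = torch.tensor(label_counts, dtype=torch.float)
    return counts.sum() / (len(counts) * counts)   # rarer class -> larger weight

# e.g. counts for O, B-PROBLEM, B-TREATMENT, B-TEST in a small clinical corpus
weights = inverse_frequency_weights([50_000, 1_200, 800, 400])
loss_fn = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

logits = torch.randn(2, 16, 4)          # (batch, seq_len, num_labels) from BERT
labels = torch.randint(0, 4, (2, 16))   # token-level gold labels
loss = loss_fn(logits.view(-1, 4), labels.view(-1))
print(loss.item())
```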
☆ Robust Q-Learning under Corrupted Rewards
Recently, there has been a surge of interest in analyzing the non-asymptotic behavior of model-free reinforcement learning algorithms. However, the performance of such algorithms in non-ideal environments, such as in the presence of corrupted rewards, is poorly understood. Motivated by this gap, we investigate the robustness of the celebrated Q-learning algorithm to a strong-contamination attack model, where an adversary can arbitrarily perturb a small fraction of the observed rewards. We start by proving that such an attack can cause the vanilla Q-learning algorithm to incur arbitrarily large errors. We then develop a novel robust synchronous Q-learning algorithm that uses historical reward data to construct robust empirical Bellman operators at each time step. Finally, we prove a finite-time convergence rate for our algorithm that matches known state-of-the-art bounds (in the absence of attacks) up to a small inevitable $O(\varepsilon)$ error term that scales with the adversarial corruption fraction $\varepsilon$. Notably, our results continue to hold even when the true reward distributions have infinite support, provided they admit bounded second moments.
comment: Accepted to the Decision and Control Conference (CDC) 2024
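The sketch below illustrates the flavor of the robustification described above: the raw observed reward is replaced by a trimmed mean over the reward history of each state-action pair before the Bellman update. It is a simplification of the paper's robust empirical Bellman operator, with illustrative hyperparameters.

```python
# Simplified sketch of reward-robust Q-learning: trim a fraction eps of the
# smallest and largest historical rewards per (state, action) before updating.
from collections import defaultdict
import numpy as np

def trimmed_mean(xs, eps=0.1):
    xs = np.sort(np.asarray(xs, dtype=float))
    k = int(np.floor(eps * len(xs)))
    trimmed = xs[k: len(xs) - k] if len(xs) > 2 * k else xs
    return trimmed.mean()

class RobustQLearning:
    def __init__(self, n_states, n_actions, gamma=0.99, alpha=0.1, eps=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.history = defaultdict(list)        # (s, a) -> observed rewards
        self.gamma, self.alpha, self.eps = gamma, alpha, eps

    def update(self, s, a, reward, s_next):
        self.history[(s, a)].append(reward)
        r_robust = trimmed_mean(self.history[(s, a)], self.eps)
        target = r_robust + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])
```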
☆ State-space models are accurate and efficient neural operators for dynamical systems
Physics-informed machine learning (PIML) has emerged as a promising alternative to classical methods for predicting dynamical systems, offering faster and more generalizable solutions. However, existing models, including recurrent neural networks (RNNs), transformers, and neural operators, face challenges such as long-time integration, long-range dependencies, chaotic dynamics, and extrapolation, to name a few. To this end, this paper introduces state-space models implemented in Mamba for accurate and efficient dynamical system operator learning. Mamba addresses the limitations of existing architectures by dynamically capturing long-range dependencies and enhancing computational efficiency through reparameterization techniques. To extensively test Mamba and compare against another 11 baselines, we introduce several strict extrapolation testbeds that go beyond the standard interpolation benchmarks. We demonstrate Mamba's superior performance in both interpolation and challenging extrapolation tasks. Mamba consistently ranks among the top models while maintaining the lowest computational cost and exceptional extrapolation capabilities. Moreover, we demonstrate the good performance of Mamba for a real-world application in quantitative systems pharmacology for assessing the efficacy of drugs in tumor growth under limited data scenarios. Taken together, our findings highlight Mamba's potential as a powerful tool for advancing scientific machine learning in dynamical systems modeling. (The code will be available at https://github.com/zheyuanhu01/State_Space_Model_Neural_Operator upon acceptance.)
comment: 34 pages
☆ FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks ICSE 2025
We propose a method for formally certifying and quantifying individual fairness of deep neural networks (DNN). Individual fairness guarantees that any two individuals who are identical except for a legally protected attribute (e.g., gender or race) receive the same treatment. While there are existing techniques that provide such a guarantee, they tend to suffer from lack of scalability or accuracy as the size and input dimension of the DNN increase. Our method overcomes this limitation by applying abstraction to a symbolic interval based analysis of the DNN followed by iterative refinement guided by the fairness property. Furthermore, our method lifts the symbolic interval based analysis from conventional qualitative certification to quantitative certification, by computing the percentage of individuals whose classification outputs are provably fair, instead of merely deciding if the DNN is fair. We have implemented our method and evaluated it on deep neural networks trained on four popular fairness research datasets. The experimental results show that our method is not only more accurate than state-of-the-art techniques but also several orders-of-magnitude faster.
comment: To Appear In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE 2025)
☆ Content Moderation by LLM: From Accuracy to Legitimacy
One trending application of LLM (large language model) is to use it for content moderation in online platforms. Most current studies on this application have focused on the metric of accuracy - the extent to which LLM makes correct decisions about content. This article argues that accuracy is insufficient and misleading, because it fails to grasp the distinction between easy cases and hard cases as well as the inevitable trade-offs in achieving higher accuracy. Closer examination reveals that content moderation is a constitutive part of platform governance, the key of which is to gain and enhance legitimacy. Instead of making moderation decisions correct, the chief goal of LLM is to make them legitimate. In this regard, this article proposes a paradigm shift from the single benchmark of accuracy towards a legitimacy-based framework of evaluating the performance of LLM moderators. The framework suggests that for easy cases, the key is to ensure accuracy, speed and transparency, while for hard cases, what matters is reasoned justification and user participation. Examined under this framework, LLM's real potential in moderation is not accuracy improvement. Rather, LLM can better contribute in four other aspects: to conduct screening of hard cases from easy cases, to provide quality explanations for moderation decisions, to assist human reviewers in getting more contextual information, and to facilitate user participation in a more interactive way. Using normative theories from law and social sciences to critically assess the new technological application, this article seeks to redefine LLM's role in content moderation and redirect relevant research in this field.
☆ Application Research On Real-Time Perception Of Device Performance Status
To accurately identify the performance status of mobile devices and finely adjust the user experience, this study proposes a real-time performance perception and evaluation method based on TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) combined with entropy weighting and time-series modeling. After collecting performance characteristics from a variety of mobile devices, device performance profiles were fitted using PCA (principal component analysis) dimensionality reduction and feature engineering methods such as descriptive time-series analysis. The ability of performance features and profiles to describe the real-time performance status of devices was studied by applying the TOPSIS method with multi-level weighting. A time-series model was then constructed for the objectively weighted feature set, providing performance status perception at multiple sensitivities (real-time, short-term, and long-term) to obtain both real-time evaluation data and stable long-term prediction data. Finally, the usability of the method was verified by configuring dynamic A/B experiments overlaid with fine-grained power-reduction strategies, and the accuracy of performance status identification and prediction was compared against alternative configurations of the profile features, including dimensionality-reduction time-series modeling, the TOPSIS method with entropy weighting, subjective weighting, and the HMA method. The results show that accurate real-time performance perception can greatly enhance business value, demonstrating the practical effectiveness and forward-looking significance of this research.
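To make the evaluation core concrete, the snippet below sketches entropy weighting followed by TOPSIS ranking on a small, made-up feature matrix; the feature names, data, and benefit/cost designations are assumptions for illustration only.

```python
# Sketch of entropy-weighted TOPSIS scoring of device performance features.
import numpy as np

def entropy_weights(X):
    P = X / X.sum(axis=0)                               # column-wise proportions
    E = -np.sum(P * np.log(P + 1e-12), axis=0) / np.log(len(X))
    d = 1 - E                                           # degree of diversification
    return d / d.sum()

def topsis_scores(X, weights, benefit):
    """benefit[j] is True if larger values of feature j mean better performance."""
    Z = X / np.linalg.norm(X, axis=0)                   # vector normalization
    V = Z * weights
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti  = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)
    return d_neg / (d_pos + d_neg)                      # closer to 1 = better status

# rows: devices/time windows; columns: frame rate, free memory (GB), CPU load, temperature
X = np.array([[58.0, 2.1, 0.45, 38.0],
              [31.0, 0.8, 0.92, 47.0],
              [55.0, 1.6, 0.60, 41.0]])
benefit = np.array([True, True, False, False])
print(topsis_scores(X, entropy_weights(X), benefit))
```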
☆ xLAM: A Family of Large Action Models to Empower AI Agent Systems
Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents' generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4
comment: Technical report for the Salesforce xLAM model series
☆ Bi-capacity Choquet Integral for Sensor Fusion with Label Uncertainty
Sensor fusion combines data from multiple sensor sources to improve the reliability, robustness, and accuracy of data interpretation. The Fuzzy Integral (FI), in particular the Choquet integral (ChI), is often used as a powerful nonlinear aggregator for fusion across multiple sensors. However, existing supervised ChI learning algorithms typically require precise training labels for each input data point, which can be difficult or impossible to obtain. Additionally, prior work on ChI fusion is often based only on normalized fuzzy measures, which bound the fuzzy measure values within [0, 1]. This can be limiting in cases where the underlying scales of the input data sources are bipolar (i.e., on [-1, 1]). To address these challenges, this paper proposes a novel Choquet integral-based fusion framework, named Bi-MIChI (pronounced "bi-mi-kee"), which uses bi-capacities to represent the interactions between pairs of subsets of the input sensor sources on a bipolar scale. This allows for extended non-linear interactions between the sensor sources and can lead to interesting fusion results. Bi-MIChI also addresses label uncertainty through Multiple Instance Learning, where training labels are applied to "bags" (sets) of data instead of per-instance. Our proposed Bi-MIChI framework shows effective classification and detection performance on both synthetic and real-world experiments for sensor fusion with label uncertainty. We also provide detailed analyses of the behavior of the fuzzy measures to demonstrate our fusion process.
comment: 10 pages, 7 figures, 7 tables; Accepted to 2024 FUZZ-IEEE and presented at 2024 IEEE WCCI; Code available at https://github.com/hvak/Bi-MIChI
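For readers new to the Choquet integral, the snippet below computes a standard ChI with respect to a normalized fuzzy measure on three sources; the bi-capacity extension on a [-1, 1] scale that Bi-MIChI introduces is not shown, and the measure values are illustrative.

```python
# A minimal Choquet integral over a normalized fuzzy measure.
import numpy as np

def choquet_integral(h, measure):
    """h: confidence values from each source; measure: dict mapping frozensets
    of source indices to fuzzy-measure values, with g(emptyset)=0, g(all)=1."""
    order = np.argsort(h)[::-1]                 # sort sources by value, descending
    total, prev_set = 0.0, frozenset()
    for i, idx in enumerate(order):
        cur_set = prev_set | {idx}
        h_next = h[order[i + 1]] if i + 1 < len(order) else 0.0
        total += (h[idx] - h_next) * measure[cur_set]
        prev_set = cur_set
    return total

# Three sensor sources; the measure encodes how subsets of sources are trusted.
g = {frozenset(): 0.0, frozenset({0}): 0.4, frozenset({1}): 0.3, frozenset({2}): 0.2,
     frozenset({0, 1}): 0.8, frozenset({0, 2}): 0.6, frozenset({1, 2}): 0.5,
     frozenset({0, 1, 2}): 1.0}
print(choquet_integral(np.array([0.9, 0.4, 0.7]), g))
```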
☆ Pricing American Options using Machine Learning Algorithms
This study investigates the application of machine learning algorithms, particularly in the context of pricing American options using Monte Carlo simulations. Traditional models, such as the Black-Scholes-Merton framework, often fail to adequately address the complexities of American options, which include the ability for early exercise and non-linear payoff structures. Machine learning was applied by leveraging Monte Carlo methods in conjunction with the Least Squares Method (LSM). This research aims to improve the accuracy and efficiency of option pricing. The study evaluates several machine learning models, including neural networks and decision trees, highlighting their potential to outperform traditional approaches. The results from applying machine learning algorithms within LSM indicate that integrating machine learning with Monte Carlo simulations can enhance pricing accuracy and provide more robust predictions, offering significant insights into quantitative finance by merging classical financial theories with modern computational techniques. The dataset was split into features and the target variable representing bid prices, with an 80-20 train-validation split. LSTM and GRU models were constructed using TensorFlow's Keras API, each with four hidden layers of 200 neurons and an output layer for bid price prediction, optimized with the Adam optimizer and MSE loss function. The GRU model outperformed the LSTM model across all evaluated metrics, demonstrating lower mean absolute error, mean squared error, and root mean squared error, along with greater stability and efficiency in training.
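The sketch below implements the classical least-squares Monte Carlo (Longstaff-Schwartz) baseline for an American put, which ML variants such as those in the study build on by replacing the polynomial regression with learned models; all parameters are illustrative.

```python
# Sketch of Longstaff-Schwartz least-squares Monte Carlo for an American put.
import numpy as np

def american_put_lsm(S0=100, K=100, r=0.05, sigma=0.2, T=1.0, steps=50,
                     n_paths=100_000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / steps
    # Simulate risk-neutral geometric Brownian motion paths.
    z = rng.standard_normal((n_paths, steps))
    S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1))
    S = np.hstack([np.full((n_paths, 1), S0), S])

    cashflow = np.maximum(K - S[:, -1], 0.0)             # exercise at maturity
    for t in range(steps - 1, 0, -1):
        itm = K - S[:, t] > 0                             # in-the-money paths only
        X = S[itm, t]
        Y = cashflow[itm] * np.exp(-r * dt)               # discounted continuation
        coeffs = np.polyfit(X, Y, deg=2)                  # regression basis: 1, S, S^2
        continuation = np.polyval(coeffs, X)
        exercise = K - X
        ex_now = exercise > continuation
        cashflow[itm] = np.where(ex_now, exercise, cashflow[itm] * np.exp(-r * dt))
        cashflow[~itm] *= np.exp(-r * dt)
    return np.exp(-r * dt) * cashflow.mean()

print(round(american_put_lsm(), 3))
```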
☆ How noise affects memory in linear recurrent networks
The effects of noise on memory in a linear recurrent network are theoretically investigated. Memory is characterized by the network's ability to store previous inputs in its instantaneous state while receiving correlated or uncorrelated noise. Two major properties are revealed: first, the reduction in memory caused by noise is uniquely determined by the noise's power spectral density (PSD); second, the memory will not decrease regardless of noise intensity if the PSD belongs to a certain class of distributions (including power laws). The results are verified using human brain signals, showing good agreement.
☆ Machine learning-based algorithms for at-home respiratory disease monitoring and respiratory assessment
Respiratory diseases impose a significant burden on global health, with current diagnostic and management practices primarily reliant on specialist clinical testing. This work aims to develop machine learning-based algorithms to facilitate at-home respiratory disease monitoring and assessment for patients undergoing continuous positive airway pressure (CPAP) therapy. Data were collected from 30 healthy adults, encompassing respiratory pressure, flow, and dynamic thoraco-abdominal circumferential measurements under three breathing conditions: normal, panting, and deep breathing. Various machine learning models, including the random forest classifier, logistic regression, and support vector machine (SVM), were trained to predict breathing types. The random forest classifier demonstrated the highest accuracy, particularly when incorporating breathing rate as a feature. These findings support the potential of AI-driven respiratory monitoring systems to transition respiratory assessments from clinical settings to home environments, enhancing accessibility and patient autonomy. Future work involves validating these models with larger, more diverse populations and exploring additional machine learning techniques.
comment: 10 pages, 2 figures
☆ InfraLib: Enabling Reinforcement Learning and Decision Making for Large Scale Infrastructure Management
Efficient management of infrastructure systems is crucial for economic stability, sustainability, and public safety. However, infrastructure management is challenging due to the vast scale of systems, stochastic deterioration of components, partial observability, and resource constraints. While data-driven approaches like reinforcement learning (RL) offer a promising avenue for optimizing management policies, their application to infrastructure has been limited by the lack of suitable simulation environments. We introduce InfraLib, a comprehensive framework for modeling and analyzing infrastructure management problems. InfraLib employs a hierarchical, stochastic approach to realistically model infrastructure systems and their deterioration. It supports practical functionality such as modeling component unavailability, cyclical budgets, and catastrophic failures. To facilitate research, InfraLib provides tools for expert data collection, simulation-driven analysis, and visualization. We demonstrate InfraLib's capabilities through case studies on a real-world road network and a synthetic benchmark with 100,000 components.
☆ A Scalable Matrix Visualization for Understanding Tree Ensemble Classifiers
The high performance of tree ensemble classifiers benefits from a large set of rules, which, in turn, makes the models hard to understand. To improve interpretability, existing methods extract a subset of rules for approximation using model reduction techniques. However, by focusing on the reduced rule set, these methods often lose fidelity and ignore anomalous rules that, despite their infrequency, play crucial roles in real-world applications. This paper introduces a scalable visual analysis method to explain tree ensemble classifiers that contain tens of thousands of rules. The key idea is to address the issue of losing fidelity by adaptively organizing the rules as a hierarchy rather than reducing them. To ensure the inclusion of anomalous rules, we develop an anomaly-biased model reduction method to prioritize these rules at each hierarchical level. Synergized with this hierarchical organization of rules, we develop a matrix-based hierarchical visualization to support exploration at different levels of detail. Our quantitative experiments and case studies demonstrate how our method fosters a deeper understanding of both common and anomalous rules, thereby enhancing interpretability without sacrificing comprehensiveness.
comment: 15 pages, 10 figures
☆ Standing on the shoulders of giants
Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and F1, are limited. Such metrics only offer a quantitative view of the models' performance, without considering the complexity of the data or the quality of each hit. To overcome these limitations, recent research has introduced the use of psychometric metrics such as Item Response Theory (IRT), which allows an assessment at the level of latent characteristics of instances. This work investigates how IRT concepts can enrich a confusion matrix in order to identify which model is the most appropriate among options with similar performance. In the study carried out, IRT does not replace but rather complements classical metrics, offering a new layer of evaluation for observing the fine-grained behavior of models on specific instances. It was also observed, with 97% confidence, that the IRT-based score provides a contribution distinct from 66% of the classical metrics analyzed.
comment: 15 pages, 8 figures, 3 tables, submitted for the BRACIS'24 conference
☆ Non-stationary and Sparsely-correlated Multi-output Gaussian Process with Spike-and-Slab Prior
Multi-output Gaussian process (MGP) is commonly used as a transfer learning method to leverage information among multiple outputs. A key advantage of MGP is providing uncertainty quantification for prediction, which is highly important for subsequent decision-making tasks. However, traditional MGP may not be sufficiently flexible to handle multivariate data with dynamic characteristics, particularly when dealing with complex temporal correlations. Additionally, since some outputs may lack correlation, transferring information among them may lead to negative transfer. To address these issues, this study proposes a non-stationary MGP model that can capture both the dynamic and sparse correlation among outputs. Specifically, the covariance functions of MGP are constructed using convolutions of time-varying kernel functions. Then a dynamic spike-and-slab prior is placed on correlation parameters to automatically decide which sources are informative to the target output in the training process. An expectation-maximization (EM) algorithm is proposed for efficient model fitting. Both numerical studies and a real case demonstrate its efficacy in capturing dynamic and sparse correlation structure and mitigating negative transfer for high-dimensional time-series data. Finally, a mountain-car reinforcement learning case highlights its potential application in decision making problems.
☆ Discovering Cyclists' Street Visual Preferences Through Multi-Source Big Data Using Deep Inverse Reinforcement Learning
Cycling has gained global popularity for its health benefits and positive urban impacts. To effectively promote cycling, early studies have extensively investigated the relationship between cycling behaviors and environmental factors, especially cyclists' preferences when making route decisions. However, these studies often struggle to comprehensively describe detailed cycling procedures at a large scale due to data limitations, and they tend to overlook the complex nature of cyclists' preferences. To address these issues, we propose a novel framework aimed at quantifying and interpreting cyclists' complicated street visual preferences from cycling records by leveraging maximum entropy deep inverse reinforcement learning (MEDIRL) and explainable artificial intelligence (XAI). Implemented in Bantian Sub-district, Shenzhen, we adapt the MEDIRL model for efficient estimation of the cycling reward function by integrating dockless-bike-sharing (DBS) trajectories and street view images (SVIs), which serve as a representation of cyclists' preferences for street visual environments during routing. In addition, we demonstrate the feasibility and reliability of MEDIRL in discovering cyclists' street visual preferences. Further analysis reveals the nonlinear and interactive effects of street visual elements on cyclists' preferences, offering a holistic perspective on streetscape design. Our proposed framework advances the understanding of individual cycling behaviors and provides actionable insights for urban planners to design bicycle-friendly streetscapes that prioritize cyclists' preferences.
comment: 38 pages, 16 figures
☆ Addressing the Gaps in Early Dementia Detection: A Path Towards Enhanced Diagnostic Models through Machine Learning
The rapid global aging trend has led to an increase in dementia cases, including Alzheimer's disease, underscoring the urgent need for early and accurate diagnostic methods. Traditional diagnostic techniques, such as cognitive tests, neuroimaging, and biomarker analysis, face significant limitations in sensitivity, accessibility, and cost, particularly in the early stages. This study explores the potential of machine learning (ML) as a transformative approach to enhance early dementia detection by leveraging ML models to analyze and integrate complex multimodal datasets, including cognitive assessments, neuroimaging, and genetic information. A comprehensive review of existing literature was conducted to evaluate various ML models, including supervised learning, deep learning, and advanced techniques such as ensemble learning and transformer models, assessing their accuracy, interpretability, and potential for clinical integration. The findings indicate that while ML models show significant promise in improving diagnostic precision and enabling earlier interventions, challenges remain in their generalizability, interpretability, and ethical deployment. This research concludes by outlining future directions aimed at enhancing the clinical utility of ML models in dementia detection, emphasizing interdisciplinary collaboration and ethically sound frameworks to improve early detection and intervention strategies for Alzheimer's disease and other forms of dementia.
☆ Causal Temporal Representation Learning with Nonstationary Sparse Transition
Causal Temporal Representation Learning (Ctrl) methods aim to identify the temporal causal dynamics of complex nonstationary temporal sequences. Despite the success of existing Ctrl methods, they require either directly observing the domain variables or assuming a Markov prior on them. Such requirements limit the application of these methods in real-world scenarios when we do not have such prior knowledge of the domain variables. To address this problem, this work adopts a sparse transition assumption, aligned with intuitive human understanding, and presents identifiability results from a theoretical perspective. In particular, we explore under what conditions on the significance of the variability of the transitions we can build a model to identify the distribution shifts. Based on the theoretical result, we introduce a novel framework, Causal Temporal Representation Learning with Nonstationary Sparse Transition (CtrlNS), designed to leverage the constraints on transition sparsity and conditional independence to reliably identify both distribution shifts and latent factors. Our experimental evaluations on synthetic and real-world datasets demonstrate significant improvements over existing baselines, highlighting the effectiveness of our approach.
☆ Towards Autonomous Cybersecurity: An Intelligent AutoML Framework for Autonomous Intrusion Detection CCS 2024
The rapid evolution of mobile networks from 5G to 6G has necessitated the development of autonomous network management systems, such as Zero-Touch Networks (ZTNs). However, the increased complexity and automation of these networks have also escalated cybersecurity risks. Existing Intrusion Detection Systems (IDSs) leveraging traditional Machine Learning (ML) techniques have shown effectiveness in mitigating these risks, but they often require extensive manual effort and expert knowledge. To address these challenges, this paper proposes an Automated Machine Learning (AutoML)-based autonomous IDS framework towards achieving autonomous cybersecurity for next-generation networks. To achieve autonomous intrusion detection, the proposed AutoML framework automates all critical procedures of the data analytics pipeline, including data pre-processing, feature engineering, model selection, hyperparameter tuning, and model ensemble. Specifically, it utilizes a Tabular Variational Auto-Encoder (TVAE) method for automated data balancing, tree-based ML models for automated feature selection and base model learning, Bayesian Optimization (BO) for hyperparameter optimization, and a novel Optimized Confidence-based Stacking Ensemble (OCSE) method for automated model ensemble. The proposed AutoML-based IDS was evaluated on two public benchmark network security datasets, CICIDS2017 and 5G-NIDD, and demonstrated improved performance compared to state-of-the-art cybersecurity methods. This research marks a significant step towards fully autonomous cybersecurity in next-generation networks, potentially revolutionizing network security applications.
comment: Accepted to the Workshop on Autonomous Cybersecurity, ACM CCS 2024; Code is available at Github link: https://github.com/Western-OC2-Lab/AutonomousCyber-AutoML-based-Autonomous-Intrusion-Detection-System
GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation
Online sellers and advertisers are recommended keyphrases for their listed products, which they bid on to enhance their sales. One popular paradigm that generates such recommendations is Extreme Multi-Label Classification (XMC), which involves tagging/mapping keyphrases to items. We outline the limitations of using traditional item-query based tagging or mapping techniques for keyphrase recommendations on E-Commerce platforms. We introduce GraphEx, an innovative graph-based approach that recommends keyphrases to sellers using extraction of token permutations from item titles. Additionally, we demonstrate that relying on traditional metrics such as precision/recall can be misleading in practical applications, thereby necessitating a combination of metrics to evaluate performance in real-world scenarios. These metrics are designed to assess the relevance of keyphrases to items and the potential for buyer outreach. GraphEx outperforms production models at eBay, achieving the objectives mentioned above. It supports near real-time inferencing in resource-constrained production environments and scales effectively for billions of items.
☆ The AdEMAMix Optimizer: Better, Faster, Older
Momentum based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients. Our experiments on language modeling and image classification show -- quite surprisingly -- that gradients can stay relevant for tens of thousands of steps. They help to converge faster, and often to lower minima: e.g., a $1.3$B parameter AdEMAMix LLM trained on $101$B tokens performs comparably to an AdamW model trained on $197$B tokens ($+95\%$). Moreover, our method significantly slows down model forgetting during training. Our work motivates further exploration of different types of functions to leverage past gradients, beyond EMAs.
comment: 38 pages, 27 figures
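The snippet below sketches the core idea described in the abstract above: mixing a fast and a slow gradient EMA in an Adam-style step. The paper's schedules for the mixing coefficient and the slow decay rate are omitted, and the hyperparameters shown are assumptions rather than the recommended settings.

```python
# Sketch of an AdEMAMix-style step: an Adam-like update driven by a mixture of
# a fast EMA and a slow EMA of past gradients.
import torch

def ademamix_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8):
    beta1, beta2, beta3 = betas
    state["t"] += 1
    t = state["t"]
    state["m_fast"].mul_(beta1).add_(grad, alpha=1 - beta1)      # fast EMA
    state["m_slow"].mul_(beta3).add_(grad, alpha=1 - beta3)      # slow EMA of old gradients
    state["v"].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_fast_hat = state["m_fast"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    update = (m_fast_hat + alpha * state["m_slow"]) / (v_hat.sqrt() + eps)
    param.data.add_(update, alpha=-lr)

w = torch.zeros(3)
state = {"m_fast": torch.zeros_like(w), "m_slow": torch.zeros_like(w),
         "v": torch.zeros_like(w), "t": 0}
ademamix_step(w, torch.tensor([0.1, -0.2, 0.3]), state)
print(w)
```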
♻ ☆ A Graph-based Adversarial Imitation Learning Framework for Reliable & Realtime Fleet Scheduling in Urban Air Mobility
The advent of Urban Air Mobility (UAM) presents the scope for a transformative shift in the domain of urban transportation. However, its widespread adoption and economic viability depends in part on the ability to optimally schedule the fleet of aircraft across vertiports in a UAM network, under uncertainties attributed to airspace congestion, changing weather conditions, and varying demands. This paper presents a comprehensive optimization formulation of the fleet scheduling problem, while also identifying the need for alternate solution approaches, since directly solving the resulting integer nonlinear programming problem is computationally prohibitive for daily fleet scheduling. Previous work has shown the effectiveness of using (graph) reinforcement learning (RL) approaches to train real-time executable policy models for fleet scheduling. However, such policies can often be brittle on out-of-distribution scenarios or edge cases. Moreover, training performance also deteriorates as the complexity (e.g., number of constraints) of the problem increases. To address these issues, this paper presents an imitation learning approach where the RL-based policy exploits expert demonstrations yielded by solving the exact optimization using a Genetic Algorithm. The policy model comprises Graph Neural Network (GNN) based encoders that embed the space of vertiports and aircraft, Transformer networks to encode demand, passenger fare, and transport cost profiles, and a Multi-head attention (MHA) based decoder. Expert demonstrations are used through the Generative Adversarial Imitation Learning (GAIL) algorithm. Interfaced with a UAM simulation environment involving 8 vertiports and 40 aircraft, the new imitative approach achieves better mean performance in terms of the daily-profit reward, and remarkable improvement on unseen worst-case scenarios, compared to pure RL results.
comment: Presented at the AIAA Aviation Forum 2024
♻ ☆ Physics-Informed Machine Learning Towards A Real-Time Spacecraft Thermal Simulator
Modeling thermal states for complex space missions, such as the surface exploration of airless bodies, requires high computation, whether used in ground-based analysis for spacecraft design or during onboard reasoning for autonomous operations. For example, a finite-element thermal model with hundreds of elements can take significant time to simulate, which makes it unsuitable for onboard reasoning during time-sensitive scenarios such as descent and landing, proximity operations, or in-space assembly. Further, the lack of fast and accurate thermal modeling drives thermal designs to be more conservative and leads to spacecraft with larger mass and higher power budgets. The emerging paradigm of physics-informed machine learning (PIML) presents a class of hybrid modeling architectures that address this challenge by combining simplified physics models with machine learning (ML) models resulting in models which maintain both interpretability and robustness. Such techniques enable designs with reduced mass and power through onboard thermal-state estimation and control and may lead to improved onboard handling of off-nominal states, including unplanned down-time. The PIML model or hybrid model presented here consists of a neural network which predicts reduced nodalizations (distribution and size of coarse mesh) given on-orbit thermal load conditions, and subsequently a (relatively coarse) finite-difference model operates on this mesh to predict thermal states. We compare the computational performance and accuracy of the hybrid model to a data-driven neural net model, and a high-fidelity finite-difference model of a prototype Earth-orbiting small spacecraft. The PIML based active nodalization approach provides significantly better generalization than the neural net model and coarse mesh model, while reducing computing cost by up to 1.7x compared to the high-fidelity model.
comment: Presented at the AIAA Aviation 2024 Forum
♻ ☆ Deep Neural Implicit Representation of Accessibility for Multi-Axis Manufacturing SP
One of the main concerns in design and process planning for multi-axis additive and subtractive manufacturing is collision avoidance between moving objects (e.g., tool assemblies) and stationary objects (e.g., a part unified with fixtures). The collision measure for various pairs of relative rigid translations and rotations between the two pointsets can be conceptualized by a compactly supported scalar field over the 6D non-Euclidean configuration space. Explicit representation and computation of this field is costly in both time and space. If we fix $O(m)$ sparsely sampled rotations (e.g., tool orientations), computation of the collision measure field as a convolution of indicator functions of the 3D pointsets over a uniform grid (i.e., voxelized geometry) of resolution $O(n^3)$ via fast Fourier transforms (FFTs) scales as in $O(mn^3 \log n)$ in time and $O(mn^3)$ in space. In this paper, we develop an implicit representation of the collision measure field via deep neural networks (DNNs). We show that our approach is able to accurately interpolate the collision measure from a sparse sampling of rotations, and can represent the collision measure field with a small memory footprint. Moreover, we show that this representation can be efficiently updated through fine-tuning to more efficiently train the network on multi-resolution data, as well as accommodate incremental changes to the geometry (such as might occur in iterative processes such as topology optimization of the part subject to CNC tool accessibility constraints).
comment: Special Issue on symposium on Solid and Physical Modeling (SPM 2023)
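As context for the cost comparison in the abstract above, the snippet below sketches the FFT-based baseline for a single fixed relative rotation (not the proposed DNN representation): the collision measure over all relative translations is the cross-correlation of the two voxelized indicator functions, computed with FFTs.

```python
# Sketch: FFT-based collision measure over translations for one fixed rotation.
import numpy as np
from scipy.signal import fftconvolve

def collision_measure_translations(part_voxels, tool_voxels):
    """Both inputs are 3D 0/1 arrays; the output gives, for every relative
    translation, the overlap volume (in voxels) between part and tool."""
    # Cross-correlation = convolution with the flipped kernel.
    flipped = tool_voxels[::-1, ::-1, ::-1]
    overlap = fftconvolve(part_voxels.astype(float), flipped, mode="full")
    return np.maximum(overlap, 0.0)   # clip tiny negative FFT round-off

n = 32
part = np.zeros((n, n, n)); part[10:20, 10:20, 10:20] = 1.0
tool = np.zeros((n, n, n)); tool[:5, :5, :5] = 1.0
measure = collision_measure_translations(part, tool)
print(measure.shape, measure.max())   # (63, 63, 63), max overlap in voxels
```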
♻ ☆ Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o and Mistral NeMo's tokenizers are much more multilingual than their predecessors, training on 39% and 47% non-English language data, respectively; Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
comment: new robustness experiments; new baselines; include Mistral, Mistral-Nemo and GPT-NeoX; link to code
♻ ☆ Evaluations of Machine Learning Privacy Defenses are Misleading CCS 2024
Empirical defenses for machine learning privacy forgo the provable guarantees of differential privacy in the hope of achieving higher utility while resisting realistic adversaries. We identify severe pitfalls in existing empirical privacy evaluations (based on membership inference attacks) that result in misleading conclusions. In particular, we show that prior evaluations fail to characterize the privacy leakage of the most vulnerable samples, use weak attacks, and avoid comparisons with practical differential privacy baselines. In 5 case studies of empirical privacy defenses, we find that prior evaluations underestimate privacy leakage by an order of magnitude. Under our stronger evaluation, none of the empirical defenses we study are competitive with a properly tuned, high-utility DP-SGD baseline (with vacuous provable guarantees).
comment: Accepted at ACM CCS 2024
♻ ☆ Towards Neural Network based Cognitive Models of Dynamic Decision-Making by Humans
Modeling human cognitive processes in dynamic decision-making tasks has long been an endeavor in AI because such models can help make AI systems more intuitive and personalized, mitigate human biases, and enhance training in simulation. Some initial work has attempted to utilize neural networks (and large language models) but often assumes one common model for all humans and aims to emulate human behavior in aggregate. However, the behavior of each human is distinct, heterogeneous, and relies on specific past experiences in certain tasks. For instance, consider two individuals responding to a phishing email: one who has previously encountered and identified similar threats may recognize it quickly, while another without such experience might fall for the scam. In this work, we build on Instance-Based Learning (IBL), which posits that human decisions are based on similar situations encountered in the past. However, IBL relies on simple fixed-form functions to capture the mapping from past situations to current decisions. To address this, we propose two new attention-based neural network models with open-form non-linear functions to model distinct and heterogeneous human decision-making in dynamic settings. We experiment with two distinct datasets gathered from human subject experiment data, one focusing on detection of phishing email by humans and another where humans act as attackers in a cybersecurity setting and decide on an attack option. We conducted extensive experiments with our two neural network models, IBL, and GPT-3.5, and demonstrate that the neural network models outperform IBL significantly in representing human decision-making, while providing similar interpretability of human decisions as IBL. Overall, our work yields promising results for further use of neural networks in cognitive modeling of human decision making.
comment: Our code is available at https://github.com/shshnkreddy/NCM-HDM
♻ ☆ Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions
The responses generated by Large Language Models (LLMs) can include sensitive information from individuals and organizations, leading to potential privacy leakage. This work implements Influence Functions (IFs) to trace privacy leakage back to the training data, thereby mitigating privacy concerns of Language Models (LMs). However, we notice that current IFs struggle to accurately estimate the influence of tokens with large gradient norms, potentially overestimating their influence. When tracing the most influential samples, this leads to frequently tracing back to samples with large gradient norm tokens, overshadowing the actual most influential samples even if their influences are well estimated. To address this issue, we propose Heuristically Adjusted IF (HAIF), which reduces the weight of tokens with large gradient norms, thereby significantly improving the accuracy of tracing the most influential samples. To establish easily obtainable ground truth for tracing privacy leakage, we construct two datasets, PII-E and PII-CR, representing two distinct scenarios: one with identical text in the model outputs and pre-training data, and the other where models leverage their reasoning abilities to generate text divergent from pre-training data. HAIF significantly improves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E dataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA IFs against various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs on real-world pretraining data CLUECorpus2020, demonstrating strong robustness regardless of prompt and response lengths.
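An illustrative sketch, not the authors' exact HAIF rule: the core idea the abstract describes is to down-weight tokens whose gradient norms are large before aggregating per-token influence scores, so that a few large-norm tokens cannot dominate the traced influence. The weighting function below is an assumption.

```python
import numpy as np

def adjusted_influence(train_token_grads, test_grad):
    """Heuristically down-weighted influence score (illustrative only).

    train_token_grads : (T, d) per-token gradients of one training sample
    test_grad         : (d,)   gradient of the test query
    Plain influence sums dot products over tokens; here each token is
    down-weighted by its gradient norm.
    """
    norms = np.linalg.norm(train_token_grads, axis=1)
    weights = 1.0 / (1.0 + norms)          # assumed heuristic; the paper's exact rule may differ
    per_token = train_token_grads @ test_grad
    return float(np.sum(weights * per_token))

rng = np.random.default_rng(2)
grads = rng.normal(size=(8, 16))
grads[0] *= 50.0                            # one token with a huge gradient norm
q = rng.normal(size=16)
print(adjusted_influence(grads, q), float(np.sum(grads @ q)))
```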
♻ ☆ Robust Clustering on High-Dimensional Data with Stochastic Quantization
This paper addresses the limitations of traditional vector quantization (clustering) algorithms, particularly K-Means and its variant K-Means++, and explores the Stochastic Quantization (SQ) algorithm as a scalable alternative for high-dimensional unsupervised and semi-supervised learning problems. Some traditional clustering algorithms suffer from inefficient memory utilization during computation, necessitating the loading of all data samples into memory, which becomes impractical for large-scale datasets. While variants such as Mini-Batch K-Means partially mitigate this issue by reducing memory usage, they lack robust theoretical convergence guarantees due to the non-convex nature of clustering problems. In contrast, the Stochastic Quantization algorithm provides strong theoretical convergence guarantees, making it a robust alternative for clustering tasks. We demonstrate the computational efficiency and rapid convergence of the algorithm on an image classification problem with partially labeled data, comparing model accuracy across various ratios of labeled to unlabeled data. To address the challenge of high dimensionality, we trained a Triplet Network to encode images into low-dimensional representations in a latent space, which serve as a basis for comparing the efficiency of both the Stochastic Quantization algorithm and traditional quantization algorithms. Furthermore, we enhance the algorithm's convergence speed by introducing modifications with an adaptive learning rate.
comment: 20 pages, 5 figures, to be published in the International Scientific Technical Journal "Problems of Control and Informatics"
♻ ☆ Rethinking Molecular Design: Integrating Latent Variable and Auto-Regressive Models for Goal Directed Generation
De novo molecule design has become a highly active research area, advanced significantly through the use of state-of-the-art generative models. Despite these advances, several fundamental questions remain unanswered as the field increasingly focuses on more complex generative models and sophisticated molecular representations as an answer to the challenges of drug design. In this paper, we return to the simplest representation of molecules, and investigate overlooked limitations of classical generative approaches, particularly Variational Autoencoders (VAEs) and auto-regressive models. We propose a hybrid model in the form of a novel regularizer that leverages the strengths of both to improve validity, conditional generation, and style transfer of molecular sequences. Additionally, we provide an in-depth discussion of overlooked assumptions of these models' behaviour.
♻ ☆ Hierarchical Generative Adversarial Imitation Learning with Mid-level Input Generation for Autonomous Driving on Urban Environments
Deriving robust control policies for realistic urban navigation scenarios is not a trivial task. In an end-to-end approach, these policies must map high-dimensional images from the vehicle's cameras to low-level actions such as steering and throttle. While pure Reinforcement Learning (RL) approaches are based exclusively on engineered rewards, Generative Adversarial Imitation Learning (GAIL) agents learn from expert demonstrations while interacting with the environment, which favors GAIL on tasks for which a reward signal is difficult to derive, such as autonomous driving. However, training deep networks directly from raw images on RL tasks is known to be unstable and troublesome. To deal with that, this work proposes a hierarchical GAIL-based architecture (hGAIL) which decouples representation learning from the driving task to solve the autonomous navigation of a vehicle. The proposed architecture consists of two modules: a GAN (Generative Adversarial Net) which generates an abstract mid-level input representation, which is the Bird's-Eye View (BEV) from the surroundings of the vehicle; and the GAIL which learns to control the vehicle based on the BEV predictions from the GAN as input. hGAIL is able to learn both the policy and the mid-level representation simultaneously as the agent interacts with the environment. Our experiments in the CARLA simulation environment show that GAIL trained exclusively from cameras (without BEV) fails to even learn the task, while hGAIL, after training exclusively on one city, was able to autonomously navigate successfully through 98% of the intersections of a new city not used in the training phase. Videos and code available at: https://sites.google.com/view/hgail
♻ ☆ MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation NeurIPS 2023
Tactile sensing is critical to fine-grained, contact-rich manipulation tasks, such as insertion and assembly. Prior research has shown the possibility of learning tactile-guided policy from teleoperated demonstration data. However, to provide the demonstration, human users often rely on visual feedback to control the robot. This creates a gap between the sensing modality used for controlling the robot (visual) and the modality of interest (tactile). To bridge this gap, we introduce "MimicTouch", a novel framework for learning policies directly from demonstrations provided by human users with their hands. The key innovations are i) a human tactile data collection system which collects a multi-modal tactile dataset for learning the human's tactile-guided control strategy, ii) an imitation learning-based framework for learning the human's tactile-guided control strategy from such data, and iii) an online residual RL framework to bridge the embodiment gap between the human hand and the robot gripper. Through comprehensive experiments, we highlight the efficacy of utilizing the human's tactile-guided control strategy to resolve contact-rich manipulation tasks. The project website is at https://sites.google.com/view/MimicTouch.
comment: Accepted by CoRL 2024, Best Paper Award at NeurIPS 2023 Touch Processing Workshop
♻ ☆ CW-CNN & CW-AN: Convolutional Networks and Attention Networks for CW-Complexes
We present a novel framework for learning on CW-complex structured data points. Recent advances have discussed CW-complexes as ideal learning representations for problems in cheminformatics. However, there is a lack of available machine learning methods suitable for learning on CW-complexes. In this paper we develop notions of convolution and attention that are well defined for CW-complexes. These notions enable us to create the first Hodge informed neural network that can receive a CW-complex as input. We illustrate and interpret this framework in the context of supervised prediction.
♻ ☆ Implementation of The Future of Drug Discovery: Quantum-Based Machine Learning Simulation (QMLS)
The Research & Development (R&D) phase of drug development is a lengthy and costly process. To revolutionize this process, we introduce our new concept QMLS to shorten the whole R&D phase to three to six months and decrease the cost to merely fifty to eighty thousand USD. For Hit Generation, Machine Learning Molecule Generation (MLMG) generates possible hits according to the molecular structure of the target protein while the Quantum Simulation (QS) filters molecules from the primary assay based on the reaction and binding effectiveness with the target protein. Then, for Lead Optimization, the resultant molecules generated and filtered from MLMG and QS are compared, and molecules that appear as a result of both processes will be made into dozens of molecular variations through Machine Learning Molecule Variation (MLMV), while others will only be made into a few variations. Lastly, all optimized molecules would undergo multiple rounds of QS filtering with a high standard for reaction effectiveness and safety, creating a few dozen pre-clinical-trial-ready drugs. This paper is based on our first paper, where we pitched the concept of machine learning combined with quantum simulations. In this paper, we go over the detailed design and framework of QMLS, including MLMG, MLMV, and QS.
comment: 13 pages, 6 figures
♻ ☆ Large-Batch, Iteration-Efficient Neural Bayesian Design Optimization
Bayesian optimization (BO) provides a powerful framework for optimizing black-box, expensive-to-evaluate functions. It is therefore an attractive tool for engineering design problems, typically involving multiple objectives. Thanks to the rapid advances in fabrication and measurement methods as well as parallel computing infrastructure, querying many design problems can be heavily parallelized. This class of problems challenges BO with an unprecedented setup where it has to deal with very large batches, shifting its focus from sample efficiency to iteration efficiency. We present a novel Bayesian optimization framework specifically tailored to address these limitations. Our key contribution is a highly scalable, sample-based acquisition function that performs a non-dominated sorting of not only the objectives but also their associated uncertainty. We show that our acquisition function in combination with different Bayesian neural network surrogates is effective in data-intensive environments with a minimal number of iterations. We demonstrate the superiority of our method by comparing it with state-of-the-art multi-objective optimization methods. We perform our evaluation on two real-world problems -- airfoil design and 3D printing -- showcasing the applicability and efficiency of our approach. Our code is available at: https://github.com/an-on-ym-ous/lbn_mobo
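A minimal sketch of the core ingredient the abstract describes: non-dominated sorting applied jointly to predicted objective values and their uncertainties to rank a large candidate pool. The scoring rule (maximize both mean and predictive standard deviation) and batching logic are assumptions, not the paper's exact acquisition function.

```python
import numpy as np

def non_dominated_fronts(points):
    """Return Pareto-front index arrays (rank 0, 1, ...) for a maximization problem.

    points : (N, K) array; a point dominates another if it is >= in every
    column and > in at least one.
    """
    remaining = np.arange(len(points))
    fronts = []
    while len(remaining):
        P = points[remaining]
        dominated = np.zeros(len(remaining), dtype=bool)
        for i in range(len(remaining)):
            better_eq = np.all(P >= P[i], axis=1)
            strictly = np.any(P > P[i], axis=1)
            dominated[i] = np.any(better_eq & strictly)
        fronts.append(remaining[~dominated])
        remaining = remaining[dominated]
    return fronts

# candidates scored by (surrogate mean, predictive std); both maximized so that
# uncertain, high-value designs land on early fronts (an assumed acquisition rule)
rng = np.random.default_rng(3)
mean, std = rng.random(200), rng.random(200)
fronts = non_dominated_fronts(np.column_stack([mean, std]))
batch = np.concatenate(fronts)[:32]        # pick the next large batch to evaluate
print(len(fronts), batch[:5])
```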
♻ ☆ Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
Model merging is an efficient model-empowerment technique in the machine learning community that requires neither the collection of raw training data nor expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and 10+ machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at \url{https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications}.
♻ ☆ Reducing Spatial Discretization Error on Coarse CFD Simulations Using an OpenFOAM-Embedded Deep Learning Framework
We propose a method for reducing the spatial discretization error of coarse computational fluid dynamics (CFD) problems by enhancing the quality of low-resolution simulations using deep learning. We feed the model with fine-grid data after projecting it to the coarse-grid discretization. We substitute the default differencing scheme for the convection term with a feed-forward neural network that interpolates velocities from cell centers to face values to produce velocities that approximate the down-sampled fine-grid data well. The deep learning framework incorporates the open-source CFD code OpenFOAM, resulting in an end-to-end differentiable model. We automatically differentiate the CFD physics using a discrete adjoint version of the code. We present a fast communication method between TensorFlow (Python) and OpenFOAM (C++) that accelerates the training process. We applied the model to the flow past a square cylinder problem, reducing the velocity error from 120% to 25% for simulations inside the training distribution, compared to the traditional solver on an 8x coarser mesh. For simulations outside the training distribution, the error reduction in the velocities was about 50%. The training is affordable in terms of time and data samples since the architecture exploits the local features of the physics.
♻ ☆ AI-Driven Intrusion Detection Systems (IDS) on the ROAD Dataset: A Comparative Analysis for Automotive Controller Area Network (CAN)
The integration of digital devices in modern vehicles has revolutionized automotive technology, enhancing safety and the overall driving experience. The Controller Area Network (CAN) bus is a central system for managing in-vehicle communication between the electronic control units (ECUs). However, the CAN protocol poses security challenges due to inherent vulnerabilities, lacking encryption and authentication, which, combined with an expanding attack surface, necessitates robust security measures. In response to this challenge, numerous Intrusion Detection Systems (IDS) have been developed and deployed. Nonetheless, an open, comprehensive, and realistic dataset to test the effectiveness of such IDSs remains absent in the existing literature. This paper addresses this gap by considering the latest ROAD dataset, containing stealthy and sophisticated injections. The methodology involves dataset labelling and the implementation of both state-of-the-art deep learning models and traditional machine learning models to show the discrepancy in performance between the datasets most commonly used in the literature and the ROAD dataset, a more realistic alternative.
♻ ☆ What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering
Large Language Models (LLMs) have changed the way we design and interact with software systems. Their ability to process and extract information from text has drastically improved productivity in a number of routine tasks. Developers who want to include these models in their software stack, however, face a daunting challenge: debugging LLMs' inconsistent behavior across minor variations of the prompt. We therefore introduce two metrics for classification tasks, namely sensitivity and consistency, which are complementary to task performance. Sensitivity measures changes in predictions across rephrasings of the prompt and does not require access to ground-truth labels. Consistency, in turn, measures how predictions vary across rephrasings for elements of the same class. We perform an empirical comparison of these metrics on text classification tasks, using them as a guideline for understanding failure modes of the LLM. Our hope is that sensitivity and consistency will be helpful to guide prompt engineering and obtain LLMs that balance robustness with performance.
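A minimal sketch of how such metrics could be computed from classifier outputs collected across prompt rephrasings; the exact definitions and normalizations in the paper may differ.

```python
from collections import Counter

def sensitivity(preds_per_input):
    """Fraction of inputs whose prediction changes across rephrasings
    (no ground-truth labels needed)."""
    changed = [len(set(p)) > 1 for p in preds_per_input]
    return sum(changed) / len(changed)

def consistency(preds_per_input, labels):
    """Average agreement of rephrased predictions, computed per true class."""
    per_class = {}
    for preds, y in zip(preds_per_input, labels):
        modal_share = Counter(preds).most_common(1)[0][1] / len(preds)
        per_class.setdefault(y, []).append(modal_share)
    return {y: sum(v) / len(v) for y, v in per_class.items()}

# 3 inputs, 4 prompt rephrasings each
preds = [["pos", "pos", "pos", "neg"],
         ["neg", "neg", "neg", "neg"],
         ["pos", "neg", "pos", "pos"]]
labels = ["pos", "neg", "pos"]
print(sensitivity(preds))          # 2/3 of inputs flip under rephrasing
print(consistency(preds, labels))  # per-class modal agreement
```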
♻ ☆ Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods
We study stochastic Cubic Newton methods for solving general, possibly non-convex minimization problems. We propose a new framework, which we call the helper framework, that provides a unified view of the stochastic and variance-reduced second-order algorithms equipped with global complexity guarantees. It can also be applied to learning with auxiliary information. Our helper framework offers the algorithm designer high flexibility for constructing and analyzing the stochastic Cubic Newton methods, allowing arbitrary batch sizes, and the use of noisy and possibly biased estimates of the gradients and Hessians, incorporating both the variance reduction and the lazy Hessian updates. We recover the best-known complexities for the stochastic and variance-reduced Cubic Newton, under weak assumptions on the noise. A direct consequence of our theory is the new lazy stochastic second-order method, which significantly improves the arithmetic complexity for large dimension problems. We also establish complexity bounds for the classes of gradient-dominated objectives, which include convex and strongly convex problems. For Auxiliary Learning, we show that using a helper (auxiliary function) can outperform training alone if a given similarity measure is small.
comment: Published in Transactions on Machine Learning Research
♻ ☆ Multifidelity Covariance Estimation via Regression on the Manifold of Symmetric Positive Definite Matrices
We introduce a multifidelity estimator of covariance matrices formulated as the solution to a regression problem on the manifold of symmetric positive definite matrices. The estimator is positive definite by construction, and the Mahalanobis distance minimized to obtain it possesses properties enabling practical computation. We show that our manifold regression multifidelity (MRMF) covariance estimator is a maximum likelihood estimator under a certain error model on manifold tangent space. More broadly, we show that our Riemannian regression framework encompasses existing multifidelity covariance estimators constructed from control variates. We demonstrate via numerical examples that the MRMF estimator can provide significant decreases, up to one order of magnitude, in squared estimation error relative to both single-fidelity and other multifidelity covariance estimators. Furthermore, preservation of positive definiteness ensures that our estimator is compatible with downstream tasks, such as data assimilation and metric learning, in which this property is essential.
comment: To appear in the SIAM Journal on Mathematics of Data Science (SIMODS)
♻ ☆ Ontology-driven Reinforcement Learning for Personalized Student Support
In the search for more effective education, there is a widespread effort to develop better approaches to personalize student education. Unassisted, educators often do not have time or resources to personally support every student in a given classroom. Motivated by this issue, and by recent advancements in artificial intelligence, this paper presents a general-purpose framework for personalized student support, applicable to any virtual educational system such as a serious game or an intelligent tutoring system. To fit any educational situation, we apply ontologies for their semantic organization, combining them with data collection considerations and multi-agent reinforcement learning. The result is a modular system that can be adapted to any virtual educational software to provide useful personalized assistance to students.
comment: 6 pages, 3 figures, in press for IEEE Systems, Man, and Cybernetics 2024 Conference
♻ ☆ On the design space between molecular mechanics and machine learning force fields
A force field as accurate as quantum mechanics (QM) and as fast as molecular mechanics (MM), with which one can simulate a biomolecular system efficiently enough and meaningfully enough to get quantitative insights, is among the most ardent dreams of biophysicists -- a dream, nevertheless, not to be fulfilled any time soon. Machine learning force fields (MLFFs) represent a meaningful endeavor towards this direction, where differentiable neural functions are parametrized to fit ab initio energies, and furthermore forces through automatic differentiation. We argue that, as of now, the utility of the MLFF models is no longer bottlenecked by accuracy but primarily by their speed (as well as stability and generalizability), as many recent variants, on limited chemical spaces, have long surpassed the chemical accuracy of $1$ kcal/mol -- the empirical threshold beyond which realistic chemical predictions are possible -- though they remain orders of magnitude slower than MM. Hoping to kindle explorations and designs of faster, albeit perhaps slightly less accurate MLFFs, in this review, we focus our attention on the design space (the speed-accuracy tradeoff) between MM and ML force fields. After a brief review of the building blocks of force fields of either kind, we discuss the desired properties and challenges now faced by the force field development community, survey the efforts to make MM force fields more accurate and ML force fields faster, and envision what the next generation of MLFFs might look like.
♻ ☆ Developing A Multi-Agent and Self-Adaptive Framework with Deep Reinforcement Learning for Dynamic Portfolio Risk Management
In recent years, deep and reinforcement learning (RL) approaches have been adapted as reactive agents to quickly learn and respond with new investment strategies for portfolio management under highly turbulent financial market environments. In many cases, due to the very complex correlations among various financial sectors, and the fluctuating trends in different financial markets, a deep or reinforcement learning based agent can be biased in maximising the total returns of the newly formulated investment portfolio while neglecting its potential risks under the turmoil of various market conditions in the global or regional sectors. Accordingly, a multi-agent and self-adaptive framework, namely MASA, is proposed in which a sophisticated multi-agent reinforcement learning (RL) approach is adopted through two cooperating and reactive agents to carefully and dynamically balance the trade-off between the overall portfolio returns and their potential risks. Besides, a very flexible and proactive agent as the market observer is integrated into the MASA framework to provide some additional information on the estimated market trends as valuable feedback for the multi-agent RL approach to quickly adapt to the ever-changing market conditions. The obtained empirical results clearly reveal the potential strengths of our proposed MASA framework based on the multi-agent RL approach against many well-known RL-based approaches on the challenging data sets of the CSI 300, Dow Jones Industrial Average and S&P 500 indexes over the past 10 years. More importantly, our proposed MASA framework sheds light on many possible directions for future investigation.
comment: In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems
♻ ☆ Large Scale Training of Graph Neural Networks for Optimal Markov-Chain Partitioning Using the Kemeny Constant
Traditional clustering algorithms often struggle to capture the complex relationships within graphs and generalise to arbitrary clustering criteria. The emergence of graph neural networks (GNNs) as a powerful framework for learning representations of graph data provides new approaches to solving the problem. Previous work has shown GNNs to be capable of proposing partitionings using a variety of criteria, however, these approaches have not yet been extended to work on Markov chains or kinetic networks. These arise frequently in the study of molecular systems and are of particular interest to the biochemical modelling community. In this work, we propose several GNN-based architectures to tackle the graph partitioning problem for Markov Chains described as kinetic networks. This approach aims to minimize how much a proposed partitioning changes the Kemeny constant. We propose using an encoder-decoder architecture and show how simple GraphSAGE-based GNNs with linear layers can outperform much larger and more expressive attention-based models in this context. As a proof of concept, we first demonstrate the method's ability to cluster randomly connected graphs. We also use a linear chain architecture corresponding to a 1D free energy profile as our kinetic network. Subsequently, we demonstrate the effectiveness of our method through experiments on a data set derived from molecular dynamics. We compare the performance of our method to other partitioning techniques such as PCCA+. We explore the importance of feature and hyperparameter selection and propose a general strategy for large-scale parallel training of GNNs for discovering optimal graph partitionings.
♻ ☆ On the Impact of Data Heterogeneity in Federated Learning Environments with Application to Healthcare Networks
Federated Learning (FL) allows multiple privacy-sensitive applications to leverage their dataset for a global model construction without any disclosure of the information. One of those domains is healthcare, where groups of silos collaborate in order to generate a global predictor with improved accuracy and generalization. However, the inherent challenge lies in the high heterogeneity of medical data, necessitating sophisticated techniques for assessment and compensation. This paper presents a comprehensive exploration of the mathematical formalization and taxonomy of heterogeneity within FL environments, focusing on the intricacies of medical data. In particular, we address the evaluation and comparison of the most popular FL algorithms with respect to their ability to cope with quantity-based, feature and label distribution-based heterogeneity. The goal is to provide a quantitative evaluation of the impact of data heterogeneity in FL systems for healthcare networks as well as a guideline on FL algorithm selection. Our research extends beyond existing studies by benchmarking seven of the most common FL algorithms against the unique challenges posed by medical data use cases. The paper targets the prediction of the risk of stroke recurrence through a set of tabular clinical reports collected by different federated hospital silos: data heterogeneity frequently encountered in this scenario and its impact on FL performance are discussed.
♻ ☆ Finite-Time Error Analysis of Soft Q-Learning: Switching System Approach
Soft Q-learning is a variation of Q-learning designed to solve entropy regularized Markov decision problems where an agent aims to maximize the entropy regularized value function. Despite its empirical success, there have been limited theoretical studies of soft Q-learning to date. This paper aims to offer a novel and unified finite-time, control-theoretic analysis of soft Q-learning algorithms. We focus on two types of soft Q-learning algorithms: one utilizing the log-sum-exp operator and the other employing the Boltzmann operator. By using dynamical switching system models, we derive novel finite-time error bounds for both soft Q-learning algorithms. We hope that our analysis will deepen the current understanding of soft Q-learning by establishing connections with switching system models and may even pave the way for new frameworks in the finite-time analysis of other reinforcement learning algorithms.
comment: 18 pages
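For concreteness, a small sketch of the two soft backup operators the analysis covers, plugged into a tabular soft Q-learning update. The step sizes, temperature, and toy MDP are illustrative only.

```python
import numpy as np

def logsumexp_op(q_row, tau):
    # soft maximum: tau * log sum_a exp(Q(s,a)/tau)
    z = q_row / tau
    return tau * (np.max(z) + np.log(np.sum(np.exp(z - np.max(z)))))

def boltzmann_op(q_row, tau):
    # Boltzmann-weighted average of the Q-values
    p = np.exp((q_row - np.max(q_row)) / tau)
    p /= p.sum()
    return float(p @ q_row)

def soft_q_update(Q, s, a, r, s_next, alpha, gamma, tau, operator):
    target = r + gamma * operator(Q[s_next], tau)
    Q[s, a] += alpha * (target - Q[s, a])

# toy 2-state, 2-action example with the log-sum-exp operator
Q = np.zeros((2, 2))
soft_q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9, tau=0.1,
              operator=logsumexp_op)
print(Q)
```

The paper's switching-system analysis treats the iterates generated by updates of this form; the code above only illustrates the two operator choices.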
♻ ☆ CyclicFL: A Cyclic Model Pre-Training Approach to Efficient Federated Learning
Federated learning (FL) has been proposed to enable distributed learning on Artificial Intelligence Internet of Things (AIoT) devices with guarantees of high-level data privacy. Since random initial models in FL can easily result in unregulated Stochastic Gradient Descent (SGD) processes, existing FL methods greatly suffer from both slow convergence and poor accuracy, especially in non-IID scenarios. To address this problem, we propose a novel method named CyclicFL, which can quickly derive effective initial models to guide the SGD processes, thus improving the overall FL training performance. We formally analyze the significance of data consistency between the pre-training and training stages of CyclicFL, showing the limited Lipschitzness of the loss for models pre-trained by CyclicFL. Moreover, we systematically prove that our method can achieve faster convergence speed under various convexity assumptions. Unlike traditional centralized pre-training methods that require public proxy data, CyclicFL pre-trains initial models on selected AIoT devices cyclically without exposing their local data. Therefore, they can be easily integrated into any security-critical FL methods. Comprehensive experimental results show that CyclicFL can not only improve the maximum classification accuracy by up to $14.11\%$ but also significantly accelerate the overall FL training process.
♻ ☆ Commute-Time-Optimised Graphs for GNNs
We explore graph rewiring methods that optimise commute time. Recent graph rewiring approaches facilitate long-range interactions in sparse graphs, making such rewirings commute-time-optimal on average. However, when an expert prior exists on which node pairs should or should not interact, a superior rewiring would favour short commute times between these privileged node pairs. We construct two synthetic datasets with known priors reflecting realistic settings, and use these to motivate two bespoke rewiring methods that incorporate the known prior. We investigate the regimes where our rewiring improves test performance on the synthetic datasets. Finally, we perform a case study on a real-world citation graph to investigate the practical implications of our work.
♻ ☆ Finite Sample Frequency Domain Identification
We study non-parametric frequency-domain system identification from a finite-sample perspective. We assume an open loop scenario where the excitation input is periodic and consider the Empirical Transfer Function Estimate (ETFE), where the goal is to estimate the frequency response at certain desired (evenly-spaced) frequencies, given input-output samples. We show that under sub-Gaussian colored noise (in time-domain) and stability assumptions, the ETFE estimates are concentrated around the true values. The error rate is of the order of $\mathcal{O}((d_{\mathrm{u}}+\sqrt{d_{\mathrm{u}}d_{\mathrm{y}}})\sqrt{M/N_{\mathrm{tot}}})$, where $N_{\mathrm{tot}}$ is the total number of samples, $M$ is the number of desired frequencies, and $d_{\mathrm{u}},\,d_{\mathrm{y}}$ are the dimensions of the input and output signals respectively. This rate remains valid for general irrational transfer functions and does not require a finite order state-space representation. By tuning $M$, we obtain a $N_{\mathrm{tot}}^{-1/3}$ finite-sample rate for learning the frequency response over all frequencies in the $ \mathcal{H}_{\infty}$ norm. Our result draws upon an extension of the Hanson-Wright inequality to semi-infinite matrices. We study the finite-sample behavior of ETFE in simulations.
comment: Version 2 changes: several typos were fixed and some proof steps were expanded
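A minimal sketch of the ETFE the abstract studies: with a periodic excitation, the frequency response at the excited DFT bins is estimated as the ratio of the period-averaged output and input spectra. The example system, signal lengths, and noise level are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def etfe(u, y, period, num_periods):
    """Empirical Transfer Function Estimate at the excited DFT bins.

    u, y : input/output samples of length period * num_periods.
    Averaging the per-period DFTs before taking the ratio reduces the
    noise variance roughly by 1/num_periods.
    """
    U = np.fft.rfft(u.reshape(num_periods, period), axis=1).mean(axis=0)
    Y = np.fft.rfft(y.reshape(num_periods, period), axis=1).mean(axis=0)
    keep = np.abs(U) > 1e-8            # only frequencies actually excited
    return np.where(keep, Y / np.where(keep, U, 1.0), np.nan)

# multisine input through a simple FIR system plus measurement noise
rng = np.random.default_rng(4)
period, num_periods = 128, 20
t = np.arange(period * num_periods)
u = sum(np.sin(2 * np.pi * k * t / period) for k in (3, 7, 15))
y = lfilter([0.5, 0.3, 0.2], [1.0], u) + 0.05 * rng.normal(size=t.size)
G_hat = etfe(u, y, period, num_periods)
print(np.round(np.abs(G_hat[[3, 7, 15]]), 3))
```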
♻ ☆ TSFool: Crafting Highly-Imperceptible Adversarial Time Series through Multi-Objective Attack ECAI'24
Recent years have witnessed the success of recurrent neural network (RNN) models in time series classification (TSC). However, neural networks (NNs) are vulnerable to adversarial samples, which enable real-life adversarial attacks that undermine the robustness of AI models. To date, most existing attacks target feed-forward NNs and image recognition tasks, but they do not perform well on RNN-based TSC. This is due to the cyclical computation of RNNs, which prevents direct model differentiation. In addition, the high visual sensitivity of time series to perturbations also poses challenges to local objective optimization of adversarial samples. In this paper, we propose an efficient method called TSFool to craft highly-imperceptible adversarial time series for RNN-based TSC. The core idea is a new global optimization objective known as "Camouflage Coefficient" that captures the imperceptibility of adversarial samples from the class distribution. Based on this, we reduce the adversarial attack problem to a multi-objective optimization problem that enhances the perturbation quality. Furthermore, to speed up the optimization process, we propose to use a representation model for RNN to capture deeply embedded vulnerable samples whose features deviate from the latent manifold. Experiments on 11 UCR and UEA datasets showcase that TSFool significantly outperforms six white-box and three black-box benchmark attacks in terms of effectiveness, efficiency and imperceptibility from various perspectives including standard measure, human study and real-world defense.
comment: 27th European Conference on Artificial Intelligence (ECAI'24)
♻ ☆ On Bits and Bandits: Quantifying the Regret-Information Trade-off
In interactive decision-making tasks, information can be acquired by direct interactions, through receiving indirect feedback, and from external knowledgeable sources. We examine the trade-off between the information an agent accumulates and the regret it suffers. We show that information from external sources, measured in bits, can be traded off for regret, measured in reward. We invoke information-theoretic methods for obtaining regret lower bounds, that also allow us to easily re-derive several known lower bounds. We then generalize a variety of interactive decision-making tasks with external information to a new setting. Using this setting, we introduce the first Bayesian regret lower bounds that depend on the information an agent accumulates. These lower bounds also prove the near-optimality of Thompson sampling for Bayesian problems. Finally, we demonstrate the utility of these bounds in improving the performance of a question-answering task with large language models, allowing us to obtain valuable insights.
♻ ☆ The Role of Transformer Models in Advancing Blockchain Technology: A Systematic Survey
As blockchain technology rapidly evolves, the demand for enhanced efficiency, security, and scalability grows. Transformer models, as powerful deep learning architectures, have shown unprecedented potential in addressing various blockchain challenges. However, a systematic review of Transformer applications in blockchain is lacking. This paper aims to fill this research gap by surveying over 200 relevant papers, comprehensively reviewing practical cases and research progress of Transformers in blockchain applications. Our survey covers key areas including anomaly detection, smart contract security analysis, cryptocurrency prediction and trend analysis, and code summary generation. To clearly articulate the advancements of Transformers across various blockchain domains, we adopt a domain-oriented classification system, organizing and introducing representative methods based on major challenges in current blockchain research. For each research domain, we first introduce its background and objectives, then review previous representative methods and analyze their limitations, and finally introduce the advancements brought by Transformer models. Furthermore, we explore the challenges of utilizing Transformer, such as data privacy, model complexity, and real-time processing requirements. Finally, this article proposes future research directions, emphasizing the importance of exploring the Transformer architecture in depth to adapt it to specific blockchain applications, and discusses its potential role in promoting the development of blockchain technology. This review aims to provide new perspectives and a research foundation for the integrated development of blockchain technology and machine learning, supporting further innovation and application expansion of blockchain technology.
♻ ☆ Cross-Validated Off-Policy Evaluation
In this paper, we study the problem of estimator selection and hyper-parameter tuning in off-policy evaluation. Although cross-validation is the most popular method for model selection in supervised learning, off-policy evaluation relies mostly on theory-based approaches, which provide only limited guidance to practitioners. We show how to use cross-validation for off-policy evaluation. This challenges a popular belief that cross-validation in off-policy evaluation is not feasible. We evaluate our method empirically and show that it addresses a variety of use cases.
comment: 13 pages, 7 figures
♻ ☆ A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictions
Prediction models inform important clinical decisions, aiding in diagnosis, prognosis, and treatment planning. The predictive performance of these models is typically assessed through discrimination and calibration. However, changes in the distribution of the data impact model performance. In health-care, a typical change is a shift in case-mix: for example, for cardiovascular risk management, a general practitioner sees a different mix of patients than a specialist in a tertiary hospital. This work introduces a novel framework that differentiates the effects of case-mix shifts on discrimination and calibration based on the causal direction of the prediction task. When prediction is in the causal direction (often the case for prognosis predictions), calibration remains stable under case-mix shifts, while discrimination does not. Conversely, when predicting in the anti-causal direction (often with diagnosis predictions), discrimination remains stable, but calibration does not. A simulation study and empirical validation using cardiovascular disease prediction models demonstrate the implications of this framework. This framework provides critical insights for evaluating and deploying prediction models across different clinical settings, emphasizing the importance of understanding the causal structure of the prediction task.
♻ ☆ Explainable Hierarchical Urban Representation Learning for Commuting Flow Prediction
Commuting flow prediction is an essential task for municipal operations in the real world. Previous studies have revealed that it is feasible to estimate the commuting origin-destination (OD) demand within a city using multiple auxiliary data. However, most existing methods are not suitable to deal with a similar task at a large scale, namely within a prefecture or the whole nation, owing to the increased number of geographical units that need to be maintained. In addition, region representation learning is a universal approach for gaining urban knowledge for diverse metropolitan downstream tasks. Although many researchers have developed comprehensive frameworks to describe urban units from multi-source data, they have not clarified the relationship between the selected geographical elements. Furthermore, metropolitan areas naturally preserve ranked structures, like cities and their inclusive districts, which makes elucidating relations between cross-level urban units necessary. Therefore, we develop a heterogeneous graph-based model to generate meaningful region embeddings at multiple spatial resolutions for predicting different types of inter-level OD flows. To demonstrate the effectiveness of the proposed method, extensive experiments were conducted using real-world aggregated mobile phone datasets collected from Shizuoka Prefecture, Japan. The results indicate that our proposed model outperforms existing models in terms of a uniform urban structure. We extend the understanding of predicted results using reasonable explanations to enhance the credibility of the model.
♻ ☆ Trustworthy Human-AI Collaboration: Reinforcement Learning with Human Feedback and Physics Knowledge for Safe Autonomous Driving
In the field of autonomous driving, developing safe and trustworthy autonomous driving policies remains a significant challenge. Recently, Reinforcement Learning with Human Feedback (RLHF) has attracted substantial attention due to its potential to enhance training safety and sampling efficiency. Nevertheless, existing RLHF-enabled methods often falter when faced with imperfect human demonstrations, potentially leading to training oscillations or even worse performance than rule-based approaches. Inspired by the human learning process, we propose Physics-enhanced Reinforcement Learning with Human Feedback (PE-RLHF). This novel framework synergistically integrates human feedback (e.g., human intervention and demonstration) and physics knowledge (e.g., traffic flow model) into the training loop of reinforcement learning. The key advantage of PE-RLHF is its guarantee that the learned policy will perform at least as well as the given physics-based policy, even when human feedback quality deteriorates, thus ensuring trustworthy safety improvements. PE-RLHF introduces a Physics-enhanced Human-AI (PE-HAI) collaborative paradigm for dynamic action selection between human and physics-based actions, employs a reward-free approach with a proxy value function to capture human preferences, and incorporates a minimal intervention mechanism to reduce the cognitive load on human mentors. Extensive experiments across diverse driving scenarios demonstrate that PE-RLHF significantly outperforms traditional methods, achieving state-of-the-art (SOTA) performance in safety, efficiency, and generalizability, even with varying quality of human feedback. The philosophy behind PE-RLHF not only advances autonomous driving technology but can also offer valuable insights for other safety-critical domains. Demo video and code are available at: https://zilin-huang.github.io/PE-RLHF-website/
comment: 33 pages, 20 figures
♻ ☆ FRAC-Q-Learning: A Reinforcement Learning with Boredom Avoidance Processes for Social Robots
Reinforcement learning algorithms have often been applied to social robots. However, most reinforcement learning algorithms were not optimized for use with social robots, and consequently they may bore users. We propose a new reinforcement learning method specialized for social robots, FRAC-Q-learning, that can avoid user boredom. The proposed algorithm consists of a forgetting process in addition to randomizing and categorizing processes. This study evaluated interest and boredom-hardness scores for FRAC-Q-learning in comparison with traditional Q-learning. FRAC-Q-learning showed a significantly higher trend in interest scores and made it significantly harder to bore users than traditional Q-learning. Therefore, FRAC-Q-learning can contribute to developing a social robot that will not bore users. The proposed algorithm also has potential applications in Web-based communication and educational systems. This paper presents the entire process, detailed implementation, and a detailed evaluation method of FRAC-Q-learning for the first time.
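The abstract names forgetting, randomizing, and categorizing processes without giving equations, so the following is only an assumed illustration of how a forgetting step and randomized action selection could sit on top of tabular Q-learning; the categorizing process and the paper's actual update rules are not reproduced here.

```python
import numpy as np

class ForgettingQLearner:
    """Tabular Q-learning with a simple forgetting step (illustrative only;
    the actual FRAC-Q-learning also categorizes actions)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9,
                 epsilon=0.2, forget=0.01, rng=None):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma = alpha, gamma
        self.epsilon, self.forget = epsilon, forget
        self.rng = rng or np.random.default_rng()

    def act(self, s):
        if self.rng.random() < self.epsilon:   # randomizing process keeps behaviour varied
            return int(self.rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next):
        target = r + self.gamma * np.max(self.Q[s_next])
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])
        self.Q *= (1.0 - self.forget)          # forgetting: decay all values toward zero

agent = ForgettingQLearner(n_states=4, n_actions=3)
a = agent.act(0)
agent.update(0, a, r=1.0, s_next=1)
print(agent.Q[0])
```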
♻ ☆ Painful intelligence: What AI can tell us about human suffering
This book uses the modern theory of artificial intelligence (AI) to understand human suffering or mental pain. Both humans and sophisticated AI agents process information about the world in order to achieve goals and obtain rewards, which is why AI can be used as a model of the human brain and mind. This book intends to make the theory accessible to a relatively general audience, requiring only some relevant scientific background. The book starts with the assumption that suffering is mainly caused by frustration. Frustration means the failure of an agent (whether AI or human) to achieve a goal or a reward it wanted or expected. Frustration is inevitable because of the overwhelming complexity of the world, limited computational resources, and scarcity of good data. In particular, such limitations imply that an agent acting in the real world must cope with uncontrollability, unpredictability, and uncertainty, which all lead to frustration. Fundamental in such modelling is the idea of learning, or adaptation to the environment. While AI uses machine learning, humans and animals adapt by a combination of evolutionary mechanisms and ordinary learning. Even frustration is fundamentally an error signal that the system uses for learning. This book explores various aspects and limitations of learning algorithms and their implications regarding suffering. At the end of the book, the computational theory is used to derive various interventions or training methods that will reduce suffering in humans. The amount of frustration is expressed by a simple equation which indicates how it can be reduced. The ensuing interventions are very similar to those proposed by Buddhist and Stoic philosophy, and include mindfulness meditation. Therefore, this book can be interpreted as an exposition of a computational theory justifying why such philosophies and meditation reduce human suffering.
comment: Second Edition of this book with 258 pages
♻ ☆ Prediction of soil fertility parameters using USB-microscope imagery and portable X-ray fluorescence spectrometry
This study investigated the use of portable X-ray fluorescence (PXRF) spectrometry and soil image analysis for rapid soil fertility assessment, with a focus on key indicators such as available boron (B), organic carbon (OC), available manganese (Mn), available sulfur (S), and the sulfur availability index (SAI). A total of 1,133 soil samples from diverse agro-climatic zones in Eastern India were analyzed. The research integrated color and texture features from microscopic soil images, PXRF data, and auxiliary soil variables (AVs) using a Random Forest model. Results showed that combining image features (IFs) with AVs significantly improved prediction accuracy for available B (R2 = 0.80) and OC (R2 = 0.88). A data fusion approach, incorporating IFs, AVs, and PXRF data, further enhanced predictions for available Mn and SAI, with R2 values of 0.72 and 0.70, respectively. The study highlights the potential of integrating these technologies to offer rapid, cost-effective soil testing methods, paving the way for more advanced predictive models and a deeper understanding of soil fertility. Future work should explore the application of deep learning models on a larger dataset, incorporating soils from a wider range of agro-climatic zones under field conditions.
comment: Published in 'Soil Advances'
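A minimal sketch of the data-fusion setup the abstract describes: concatenating image features, auxiliary soil variables, and PXRF readings as inputs to a Random Forest regressor and comparing feature-set combinations by cross-validated R2. Feature names, dimensions, and the synthetic target are placeholders, not the study's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n = 300
image_feats = rng.random((n, 12))   # colour/texture features from soil micrographs (placeholder)
aux_vars    = rng.random((n, 4))    # auxiliary soil variables, e.g. pH, EC (placeholder)
pxrf        = rng.random((n, 10))   # PXRF elemental intensities (placeholder)
target      = image_feats[:, 0] + pxrf[:, 1] + 0.1 * rng.normal(size=n)  # e.g. available Mn

for name, X in {"IFs+AVs": np.hstack([image_feats, aux_vars]),
                "IFs+AVs+PXRF": np.hstack([image_feats, aux_vars, pxrf])}.items():
    r2 = cross_val_score(RandomForestRegressor(n_estimators=300, random_state=0),
                         X, target, cv=5, scoring="r2")
    print(name, round(r2.mean(), 2))
```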
♻ ☆ AICAttack: Adversarial Image Captioning Attack with Attention-Based Optimization
Recent advances in deep learning research have shown remarkable achievements across many tasks in computer vision (CV) and natural language processing (NLP). At the intersection of CV and NLP is the problem of image captioning, where the related models' robustness against adversarial attacks has not been well studied. This paper presents a novel adversarial attack strategy, AICAttack (Attention-based Image Captioning Attack), designed to attack image captioning models through subtle perturbations on images. Operating within a black-box attack scenario, our algorithm requires no access to the target model's architecture, parameters, or gradient information. We introduce an attention-based candidate selection mechanism that identifies the optimal pixels to attack, followed by a customised differential evolution method to optimise the perturbations of pixels' RGB values. We demonstrate AICAttack's effectiveness through extensive experiments on benchmark datasets against multiple victim models. The experimental results demonstrate that our method outperforms current leading-edge techniques by achieving consistently higher attack success rates.
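A hedged sketch of the black-box loop the abstract outlines: select candidate pixels (here chosen at random rather than via the paper's attention mechanism) and let differential evolution optimize their RGB offsets against a caption-quality score. The `caption_score` function is a hypothetical stand-in for querying the victim captioning model.

```python
import numpy as np
from scipy.optimize import differential_evolution

def caption_score(image):
    """Hypothetical black-box score of how well the victim model's caption
    matches the ground truth (higher = better caption). Stand-in only."""
    return float(image.mean())

def attack(image, n_pixels=5, eps=0.1, maxiter=20, seed=0):
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    ys, xs = rng.integers(h, size=n_pixels), rng.integers(w, size=n_pixels)

    def objective(delta):                      # minimize caption quality
        perturbed = image.copy()
        perturbed[ys, xs] = np.clip(perturbed[ys, xs] + delta.reshape(n_pixels, 3), 0, 1)
        return caption_score(perturbed)

    bounds = [(-eps, eps)] * (n_pixels * 3)    # keep the perturbation small
    res = differential_evolution(objective, bounds, maxiter=maxiter, seed=seed, polish=False)
    adv = image.copy()
    adv[ys, xs] = np.clip(adv[ys, xs] + res.x.reshape(n_pixels, 3), 0, 1)
    return adv, res.fun

img = np.random.default_rng(1).random((32, 32, 3))
adv, score = attack(img)
print(round(caption_score(img), 4), round(score, 4))
```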
♻ ☆ LinFusion: 1 GPU, 1 Minute, 16K Image
Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we propose a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration with recently introduced models with linear complexity, e.g., Mamba2, RWKV6, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, generating high-resolution images such as 16K resolution. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation efforts. Codes are available at https://github.com/Huage001/LinFusion.
comment: Work in Progress. Codes are available at https://github.com/Huage001/LinFusion
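To make the complexity argument concrete, here is a sketch contrasting quadratic softmax attention with a non-causal linear attention that computes the K-V summary first. The positive feature map and normalization below are one common choice in the linear-attention literature, not necessarily LinFusion's exact formulation.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # O(N^2 d): materializes an N x N attention map
    A = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def linear_attention(Q, K, V, eps=1e-6):
    # O(N d^2): non-causal, with a positive feature map and explicit normalization
    phi = lambda x: np.maximum(x, 0) + eps        # assumed feature map
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                                 # (d, d_v) summary shared by every query
    z = Qp @ Kp.sum(axis=0)                       # per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(6)
N, d = 1024, 64
Q, K, V = rng.normal(size=(3, N, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

The linear variant never forms the N x N map, which is what allows token counts corresponding to very high resolutions to fit in memory.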
♻ ☆ A Survey for Foundation Models in Autonomous Driving
The advent of foundation models has revolutionized the fields of natural language processing and computer vision, paving the way for their application in autonomous driving (AD). This survey presents a comprehensive review of more than 40 research papers, demonstrating the role of foundation models in enhancing AD. Large language models contribute to planning and simulation in AD, particularly through their proficiency in reasoning, code generation and translation. In parallel, vision foundation models are increasingly adapted for critical tasks such as 3D object detection and tracking, as well as creating realistic driving scenarios for simulation and testing. Multi-modal foundation models, integrating diverse inputs, exhibit exceptional visual understanding and spatial reasoning, crucial for end-to-end AD. This survey not only provides a structured taxonomy, categorizing foundation models based on their modalities and functionalities within the AD domain but also delves into the methods employed in current research. It identifies the gaps between existing foundation models and cutting-edge AD approaches, thereby charting future research directions and proposing a roadmap for bridging these gaps.
♻ ☆ Transfer-based Adversarial Poisoning Attacks for Online (MIMO-)Deep Receivers
Recently, the design of wireless receivers using deep neural networks (DNNs), known as deep receivers, has attracted extensive attention for ensuring reliable communication in complex channel environments. To adapt quickly to dynamic channels, online learning has been adopted to update the weights of deep receivers with over-the-air data (e.g., pilots). However, the fragility of neural models and the openness of wireless channels expose these systems to malicious attacks. To this end, understanding these attack methods is essential for robust receiver design. In this paper, we propose a transfer-based adversarial poisoning attack method for online receivers. Without knowledge of the attack target, adversarial perturbations are injected into the pilots, poisoning the online deep receiver and impairing its ability to adapt to dynamic channels and nonlinear effects. In particular, our attack method targets Deep Soft Interference Cancellation (DeepSIC)[1] using online meta-learning. As a classical model-driven deep receiver, DeepSIC incorporates wireless domain knowledge into its architecture. This integration allows it to adapt efficiently to time-varying channels with only a small number of pilots, achieving optimal performance in a multi-input and multi-output (MIMO) scenario. The deep receiver in this scenario has a number of applications in the field of wireless communication, which motivates our study of the attack methods targeting it. Specifically, we demonstrate the effectiveness of our attack in simulations on synthetic linear, synthetic nonlinear, static, and COST 2100 channels. Simulation results indicate that the proposed poisoning attack significantly reduces the performance of online receivers in rapidly changing scenarios.
comment: 15 pages, 14 figures
♻ ☆ Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games NeurIPS 2023
In this paper, we consider two-player zero-sum matrix and stochastic games and develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. Specifically, the learning dynamics for matrix games are based on the smoothed best-response dynamics, while the learning dynamics for stochastic games build upon those for matrix games, with additional incorporation of the minimax value iteration. To our knowledge, our theoretical results present the first finite-sample analysis of such learning dynamics with last-iterate guarantees. In the matrix game setting, the results imply a sample complexity of $O(\epsilon^{-1})$ to find the Nash distribution and a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. In the stochastic game setting, the results also imply a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. To establish these results, the main challenge is to handle stochastic approximation algorithms with multiple sets of coupled and stochastic iterates that evolve on (possibly) different time scales. To overcome this challenge, we developed a coupled Lyapunov-based approach, which may be of independent interest to the broader community studying the convergence behavior of stochastic approximation algorithms.
comment: A preliminary version [arXiv:2303.03100] of this paper, with a subset of the results that are presented here, was presented at NeurIPS 2023
♻ ☆ Prediction of COPD Using Machine Learning, Clinical Summary Notes, and Vital Signs
Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory lung disease that causes obstructed airflow from the lungs. In the United States, more than 15.7 million Americans have been diagnosed with COPD, with 96% of individuals living with at least one other chronic health condition. It is the 4th leading cause of death in the country. Over 2.2 million patients are admitted to hospitals annually due to COPD exacerbations. Monitoring and predicting patient exacerbations in a timely manner could save lives. This paper presents two different predictive models to predict COPD exacerbation using AI and natural language processing (NLP) approaches. These models use respiration summary notes, symptoms, and vital signs. To train and test these models, data records containing physiologic signals and vital signs time series were used. These records were captured from patient monitors and comprehensive clinical data obtained from hospital medical information systems for tens of thousands of Intensive Care Unit (ICU) patients. We achieved an area under the receiver operating characteristic (ROC) curve of 0.82 in the detection and prediction of COPD exacerbation.
comment: 11 pages, 5 figures
♻ ☆ Explanation Space: A New Perspective into Time Series Interpretability
Human-understandable explanation of deep learning models is necessary for many critical and sensitive applications. Unlike image or tabular data, where the importance of each input feature (for the classifier's decision) can be directly projected into the input, the distinguishing features of time series (e.g., dominant frequency) are often hard to make apparent in the time domain for a user to easily understand. Moreover, most explanation methods require a baseline value as an indication of the absence of any feature. However, the notion of a missing feature, which is often defined as black pixels for vision tasks or zero/mean values for tabular data, is not well defined for time series. Despite the adoption of explainable AI (XAI) methods from the tabular and vision domains into the time series domain, these differences limit the application of such XAI methods in practice. In this paper, we propose a simple yet effective method that allows a model originally trained in the time domain to be interpreted in other explanation spaces using existing methods. We suggest four explanation spaces, each of which can potentially alleviate these issues for certain types of time series. Our method can be readily adopted in existing platforms without any change to trained models or XAI methods. The code is available at https://github.com/shrezaei/TS-X-spaces.
♻ ☆ Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation
Large language models (LLMs) have achieved state-of-the-art performance in various language processing tasks, motivating their adoption in simultaneous translation. Current fine-tuning methods to adapt LLMs for simultaneous translation focus on prompting optimization strategies using either data augmentation or prompt structure modifications. However, these methods suffer from several issues, such as unnecessarily expanded training sets, computational inefficiency from dumping the key and value cache, increased prompt sizes, or restriction to a single decision policy. To eliminate these issues, in this work, we propose SimulMask, a new paradigm for fine-tuning LLMs for simultaneous translation. It utilizes a novel attention mask approach that models simultaneous translation during fine-tuning by masking attention for a desired decision policy. Applying the proposed SimulMask on a Falcon LLM for the IWSLT 2017 dataset, we have observed a significant translation quality improvement compared to state-of-the-art prompting optimization strategies on five language pairs while reducing the computational cost.
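As an illustration of the general idea of encoding a decision policy in an attention mask (not the paper's actual SimulMask implementation, which is not reproduced here), the following sketch builds a hypothetical wait-k visibility mask for a decoder-only LM whose input concatenates source and target tokens; the function name and the choice of a wait-k policy are assumptions made for this example.

```python
import torch

def wait_k_attention_mask(src_len: int, tgt_len: int, k: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for an input laid out as
    [src_1..src_S, tgt_1..tgt_T]. Under a hypothetical wait-k policy, the
    i-th target token may attend to at most the first k+i source tokens,
    in addition to the usual causal order over previously seen tokens."""
    total = src_len + tgt_len
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal base
    for i in range(tgt_len):
        row = src_len + i
        visible_src = min(k + i, src_len)
        mask[row, visible_src:src_len] = False  # hide not-yet-read source tokens
    return mask

if __name__ == "__main__":
    print(wait_k_attention_mask(src_len=5, tgt_len=4, k=2).int())
```

Masking attention this way lets a standard causal fine-tuning run simulate streaming input without modifying the training data or dumping the key-value cache, which is the efficiency argument the abstract makes.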
♻ ☆ Stacked ensemble-based mutagenicity prediction model using multiple modalities with graph attention network
Mutagenicity is a concern due to its association with genetic mutations, which can result in a variety of negative consequences, including the development of cancer. Early identification of mutagenic compounds in the drug development process is therefore crucial for preventing the progression of unsafe candidates and reducing development costs. While computational techniques, especially machine learning models, have become increasingly prevalent for this endpoint, they rely on a single modality. In this work, we introduce a novel stacked ensemble-based mutagenicity prediction model which incorporates multiple modalities, such as the simplified molecular input line entry system (SMILES) and the molecular graph. These modalities capture diverse information about molecules, such as substructural, physicochemical, geometrical, and topological properties. To derive substructural, geometrical, and physicochemical information, we use SMILES, while topological information is extracted through a graph attention network (GAT) applied to the molecular graph. Our model uses a stacked ensemble of machine learning classifiers to make predictions using these multiple features. We employ the explainable artificial intelligence (XAI) technique SHAP (Shapley Additive Explanations) to determine the significance of each classifier and the most relevant features in the prediction. We demonstrate that our method surpasses SOTA methods on two standard datasets across various metrics. Notably, we achieve an area under the curve of 95.21% on the Hansen benchmark dataset, affirming the efficacy of our method in predicting mutagenicity. We believe that this research will captivate the interest of both clinicians and computational biologists engaged in translational research.
comment: Submitted to a journal
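To make the stacking idea concrete, here is a minimal sketch of a stacked ensemble over two feature blocks. The SMILES-derived descriptors and GAT graph embeddings are replaced by random placeholder arrays, and the base learners, labels, and sizes are illustrative choices, not the configuration used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 500
# Placeholders: in the paper's setting these would be SMILES-derived descriptors
# (substructural/physicochemical/geometrical) and GAT embeddings of the molecular graph.
smiles_features = rng.normal(size=(n, 64))
gat_embeddings = rng.normal(size=(n, 32))
X = np.hstack([smiles_features, gat_embeddings])
y = rng.integers(0, 2, size=n)  # synthetic mutagenic / non-mutagenic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_tr, y_tr)
print("held-out accuracy:", stack.score(X_te, y_te))
```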
Information Retrieval 8
☆ WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild
The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical. To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis's utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.
RAG based Question-Answering for Contextual Response Prediction System CIKM'24
Large Language Models (LLMs) have shown versatility in various Natural Language Processing (NLP) tasks, including their potential as effective question-answering systems. However, to provide precise and relevant information in response to specific customer queries in industry settings, LLMs require access to a comprehensive knowledge base to avoid hallucinations. Retrieval Augmented Generation (RAG) emerges as a promising technique to address this challenge. Yet, developing an accurate question-answering framework for real-world applications using RAG entails several challenges: 1) data availability issues, 2) evaluating the quality of generated content, and 3) the costly nature of human evaluation. In this paper, we introduce an end-to-end framework that employs LLMs with RAG capabilities for industry use cases. Given a customer query, the proposed system retrieves relevant knowledge documents and leverages them, along with previous chat history, to generate response suggestions for customer service agents in the contact centers of a major retail company. Through comprehensive automated and human evaluations, we show that this solution outperforms the current BERT-based algorithms in accuracy and relevance. Our findings suggest that RAG-based LLMs can be an excellent support to human customer service representatives by lightening their workload.
comment: Accepted at the 1st Workshop on GenAI and RAG Systems for Enterprise, CIKM'24. 6 pages
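The abstract outlines a retrieve-then-generate pipeline without implementation details; the sketch below shows the general shape of such a system, using a TF-IDF retriever over a toy knowledge base and a prompt-assembly step. The document contents, prompt template, and helper names are invented for illustration, and the call to an actual LLM is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base; in the paper's setting these would be the company's
# knowledge documents indexed for the contact center.
docs = [
    "Returns are accepted within 30 days with a receipt.",
    "Store hours are 9am to 9pm on weekdays.",
    "Gift cards cannot be redeemed for cash.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    vec = TfidfVectorizer().fit(docs + [query])
    doc_m, q_m = vec.transform(docs), vec.transform([query])
    scores = cosine_similarity(q_m, doc_m)[0]
    top = scores.argsort()[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(query: str, history: list[str]) -> str:
    # The assembled prompt would be sent to an LLM; the exact template
    # used in the paper is not public, so this wording is illustrative.
    context = "\n".join(retrieve(query))
    chat = "\n".join(history)
    return (f"Context:\n{context}\n\nChat history:\n{chat}\n\n"
            f"Customer: {query}\nSuggested agent response:")

print(build_prompt("Can I return an item after three weeks?",
                   ["Customer: Hi", "Agent: Hello, how can I help?"]))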
☆ HGAMN: Heterogeneous Graph Attention Matching Network for Multilingual POI Retrieval at Baidu Maps KDD'21
The increasing interest in international travel has raised the demand for retrieving points of interest (POIs) in multiple languages. This is especially true when looking for local venues such as restaurants and scenic spots in unfamiliar languages while traveling abroad. Multilingual POI retrieval, enabling users to find desired POIs in a demanded language using queries in numerous languages, has become an indispensable feature of today's global map applications such as Baidu Maps. This task is non-trivial because of two key challenges: (1) visiting sparsity and (2) multilingual query-POI matching. To this end, we propose a Heterogeneous Graph Attention Matching Network (HGAMN) to concurrently address both challenges. Specifically, we construct a heterogeneous graph that contains two types of nodes: POI nodes and query nodes, using the search logs of Baidu Maps. To alleviate challenge #1, we construct edges between different POI nodes to link the low-frequency POIs with the high-frequency ones, which enables the transfer of knowledge from the latter to the former. To mitigate challenge #2, we construct edges between POI and query nodes based on the co-occurrences between queries and POIs, where queries in different languages and formulations can be aggregated for individual POIs. Moreover, we develop an attention-based network to jointly learn node representations of the heterogeneous graph and further design a cross-attention module to fuse the representations of both types of nodes for query-POI relevance scoring. Extensive experiments conducted on large-scale real-world datasets from Baidu Maps demonstrate the superiority and effectiveness of HGAMN. In addition, HGAMN has already been deployed in production at Baidu Maps, and it successfully keeps serving hundreds of millions of requests every day.
comment: Accepted by KDD'21
☆ MOBIUS: Towards the Next Generation of Query-Ad Matching in Baidu's Sponsored Search KDD'19
Baidu runs the largest commercial web search engine in China, serving hundreds of millions of online users every day in response to a great variety of queries. In order to build a high-efficiency sponsored search engine, we used to adopt a three-layer funnel-shaped structure to screen and sort hundreds of ads from billions of ad candidates, subject to the requirement of low response latency and the constraints of computing resources. Given a user query, the top matching layer is responsible for providing semantically relevant ad candidates to the next layer, while the ranking layer at the bottom is more concerned with business indicators (e.g., CPM, ROI, etc.) of those ads. The clear separation between the matching and ranking objectives results in a lower commercial return. The Mobius project has been established to address this serious issue. It is our first attempt to train the matching layer to consider CPM as an additional optimization objective besides the query-ad relevance, via directly predicting CTR (click-through rate) from billions of query-ad pairs. Specifically, this paper will elaborate on how we adopt active learning to overcome the insufficiency of click history at the matching layer when training our neural click networks offline, and how we use the SOTA ANN search technique for retrieving ads more efficiently (here "ANN" stands for approximate nearest neighbor search). We contribute the solutions to Mobius-V1 as the first version of our next-generation query-ad matching system.
comment: Accepted by KDD'19
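As a rough sketch of retrieval followed by CTR-aware re-scoring in the spirit of the abstract (not Baidu's production system), the snippet below uses exact cosine k-NN from scikit-learn as a stand-in for a true ANN index, and a toy sigmoid click model; the embeddings and the click model are synthetic.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
ad_embeddings = rng.normal(size=(10_000, 32))   # toy ad index
query_embedding = rng.normal(size=(1, 32))

# Exact k-NN here stands in for the ANN search used in production, where
# brute-force search over billions of ads would be infeasible.
index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(ad_embeddings)
_, ad_ids = index.kneighbors(query_embedding)

def toy_ctr(q: np.ndarray, a: np.ndarray) -> float:
    # A (toy) click model re-scores retrieved candidates so that CTR is
    # already an objective at the matching layer, as Mobius advocates.
    return float(1.0 / (1.0 + np.exp(-q @ a)))

scores = [toy_ctr(query_embedding[0], ad_embeddings[i]) for i in ad_ids[0]]
print(list(zip(ad_ids[0].tolist(), np.round(scores, 3))))
```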
☆ Federated Prototype-based Contrastive Learning for Privacy-Preserving Cross-domain Recommendation
Cross-domain recommendation (CDR) aims to improve recommendation accuracy in sparse domains by transferring knowledge from data-rich domains. However, existing CDR methods often assume the availability of user-item interaction data across domains, overlooking user privacy concerns. Furthermore, these methods suffer from performance degradation in scenarios with sparse overlapping users, as they typically depend on a large number of fully shared users for effective knowledge transfer. To address these challenges, we propose a Federated Prototype-based Contrastive Learning (CL) method for Privacy-Preserving CDR, named FedPCL-CDR. This approach utilizes non-overlapping user information and prototypes to improve multi-domain performance while protecting user privacy. FedPCL-CDR comprises two modules: local domain (client) learning and global server aggregation. In the local domain, FedPCL-CDR clusters all user data to learn representative prototypes, effectively utilizing non-overlapping user information and addressing the sparse overlapping user issue. It then facilitates knowledge transfer by employing both local and global prototypes returned from the server in a CL manner. Simultaneously, the global server aggregates representative prototypes from local domains to learn both local and global prototypes. The combination of prototypes and federated learning (FL) ensures that sensitive user data remains decentralized, with only prototypes being shared across domains, thereby protecting user privacy. Extensive experiments on four CDR tasks using two real-world datasets demonstrate that FedPCL-CDR outperforms the state-of-the-art baselines.
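A minimal sketch of the prototype idea follows, under the assumption that each domain clusters its local user embeddings and shares only the centroids with the server; the embeddings, cluster counts, and server-side re-clustering are illustrative and not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def local_prototypes(user_embeddings: np.ndarray, n_prototypes: int = 4) -> np.ndarray:
    """Each client (domain) clusters its own user embeddings and shares only
    the cluster centroids (prototypes), never raw interaction data."""
    km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=0)
    km.fit(user_embeddings)
    return km.cluster_centers_

# Two domains with non-overlapping users (synthetic embeddings).
domain_a = rng.normal(size=(200, 16))
domain_b = rng.normal(loc=0.5, size=(150, 16))
local_protos = [local_prototypes(domain_a), local_prototypes(domain_b)]

# Server-side aggregation: pooling the local prototypes and re-clustering
# them into a global prototype set that is sent back to the clients.
stacked = np.vstack(local_protos)
global_protos = KMeans(n_clusters=4, n_init=10, random_state=0).fit(stacked).cluster_centers_
print(global_protos.shape)
```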
☆ iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models
Most available data is unstructured, making it challenging to access valuable information. Automatically building Knowledge Graphs (KGs) is crucial for structuring data and making it accessible, allowing users to search for information effectively. KGs also facilitate insights, inference, and reasoning. Traditional NLP methods, such as named entity recognition and relation extraction, are key in information retrieval but face limitations, including the use of predefined entity types and the need for supervised learning. Current research leverages large language models' capabilities, such as zero- or few-shot learning. However, unresolved and semantically duplicated entities and relations still pose challenges, leading to inconsistent graphs and requiring extensive post-processing. Additionally, most approaches are topic-dependent. In this paper, we propose iText2KG, a method for incremental, topic-independent KG construction without post-processing. This plug-and-play, zero-shot method is applicable across a wide range of KG construction scenarios and comprises four modules: Document Distiller, Incremental Entity Extractor, Incremental Relation Extractor, and Graph Integrator and Visualization. Our method demonstrates superior performance compared to baseline methods across three scenarios: converting scientific papers to graphs, websites to graphs, and CVs to graphs.
comment: Accepted at The International Web Information Systems Engineering conference (the WISE conference) 2024
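The sketch below illustrates the incremental, post-processing-free flavor of KG construction described above with a toy entity-resolution step based on character n-gram similarity; the class, threshold, and example triples are hypothetical, and the LLM-based extraction modules themselves are not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class IncrementalGraph:
    """Toy illustration of incremental KG growth: a newly extracted entity is
    merged with an existing one when its name similarity exceeds a threshold,
    otherwise it is added as a new node, avoiding post-hoc deduplication."""
    def __init__(self, threshold: float = 0.5):
        self.entities: list[str] = []
        self.triples: list[tuple[str, str, str]] = []
        self.threshold = threshold

    def _resolve(self, name: str) -> str:
        if not self.entities:
            return name
        vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(self.entities + [name])
        sims = cosine_similarity(vec.transform([name]), vec.transform(self.entities))[0]
        best = int(sims.argmax())
        return self.entities[best] if sims[best] >= self.threshold else name

    def add_triple(self, head: str, relation: str, tail: str) -> None:
        head, tail = self._resolve(head), self._resolve(tail)
        for e in (head, tail):
            if e not in self.entities:
                self.entities.append(e)
        self.triples.append((head, relation, tail))

g = IncrementalGraph()
g.add_triple("Marie Curie", "won", "Nobel Prize")
g.add_triple("M. Curie", "born_in", "Warsaw")  # may merge with "Marie Curie", depending on threshold
print(g.triples)
```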
GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation
Online sellers and advertisers are recommended keyphrases for their listed products, which they bid on to enhance their sales. One popular paradigm that generates such recommendations is Extreme Multi-Label Classification (XMC), which involves tagging/mapping keyphrases to items. We outline the limitations of using traditional item-query based tagging or mapping techniques for keyphrase recommendations on E-Commerce platforms. We introduce GraphEx, an innovative graph-based approach that recommends keyphrases to sellers using extraction of token permutations from item titles. Additionally, we demonstrate that relying on traditional metrics such as precision/recall can be misleading in practical applications, thereby necessitating a combination of metrics to evaluate performance in real-world scenarios. These metrics are designed to assess the relevance of keyphrases to items and the potential for buyer outreach. GraphEx outperforms production models at eBay, achieving the objectives mentioned above. It supports near real-time inferencing in resource-constrained production environments and scales effectively for billions of items.
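To illustrate the token-permutation step that GraphEx's candidate generation is described as using (the graph construction and scoring stages are not shown), here is a small sketch; the function name and the length cap are assumptions.

```python
from itertools import permutations

def candidate_keyphrases(title: str, max_len: int = 3) -> set[str]:
    """Enumerate token permutations of an item title as keyphrase candidates.
    The full GraphEx system builds a graph over items and tokens and scores
    candidates for buyer reach; this only illustrates the permutation step."""
    tokens = [t.lower() for t in title.split()]
    cands = set()
    for n in range(1, min(max_len, len(tokens)) + 1):
        for perm in permutations(tokens, n):
            cands.add(" ".join(perm))
    return cands

print(sorted(candidate_keyphrases("Wireless Noise Cancelling Headphones"))[:10])
```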
♻ ☆ Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?
The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
comment: https://github.com/yixuantt/PoolingAndAttn
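For readers unfamiliar with the pooling strategies compared in the abstract, the following sketch contrasts EOS/last-token pooling with mean pooling over last-layer hidden states; the tensors are random stand-ins for actual LLM outputs, and the proposed Multi-Layers Trainable Pooling is not reproduced here.

```python
import torch

def eos_last_token_pooling(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """hidden: (batch, seq, dim) last-layer states; attention_mask: (batch, seq).
    Takes the state of the last non-padding token, where the EOS typically sits."""
    last_idx = attention_mask.sum(dim=1) - 1                      # (batch,)
    return hidden[torch.arange(hidden.size(0)), last_idx]

def mean_pooling(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Length-normalized average over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
print(eos_last_token_pooling(hidden, mask).shape, mean_pooling(hidden, mask).shape)
```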
Computation and Language 80
☆ RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)
Effective collaboration of dual-arm robots and their tool use capabilities are increasingly important areas in the advancement of robotics. These skills play a significant role in expanding robots' ability to operate in diverse real-world environments. However, progress is impeded by the scarcity of specialized training data. This paper introduces RoboTwin, a novel benchmark dataset combining real-world teleoperated data with synthetic data from digital twins, designed for dual-arm robotic scenarios. Using the COBOT Magic platform, we have collected diverse data on tool usage and human-robot interaction. We present an innovative approach to creating digital twins using AI-generated content, transforming 2D images into detailed 3D models. Furthermore, we utilize large language models to generate expert-level training data and task-specific pose sequences oriented toward functionality. Our key contributions are: 1) the RoboTwin benchmark dataset, 2) an efficient real-to-simulation pipeline, and 3) the use of language models for automatic expert-level data generation. These advancements are designed to address the shortage of robotic training data, potentially accelerating the development of more capable and versatile robotic systems for a wide range of real-world applications. The project page is available at https://robotwin-benchmark.github.io/early-version/
comment: Project page: https://robotwin-benchmark.github.io/early-version/
☆ Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling
Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling the auto-regressive models (ARMs) for language modeling tasks. The recent effort in simplifying the masked diffusion framework further leads to alignment with continuous-space diffusion models and more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling aspect is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs' original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20$\times$ speedup. In addition, our investigation challenges previous claims that MDMs can surpass ARMs in generative perplexity. We identify, for the first time, an underlying numerical issue, even with the 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, leading to unfair assessments of MDMs' generation results in the previous literature.
comment: 40 pages
☆ LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling their generation of accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.
☆ LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Expanding the long-context capabilities of Multi-modal Large Language Models~(MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as \textit{degraded performance with more images} and \textit{high computational costs}. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model \textbf{LongLLaVA}~(\textbf{Long}-Context \textbf{L}arge \textbf{L}anguage \textbf{a}nd \textbf{V}ision \textbf{A}ssistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
comment: 19 pages, 7 figures, 6 tables
☆ Configurable Foundation Models: Building LLMs from a Modular Perspective
Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with part of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitation of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.
☆ Historical German Text Normalization Using Type- and Token-Based Language Modeling
Historic variations in spelling pose a challenge for full-text search or natural language processing on historical digitized texts. To minimize the gap between the historic orthography and contemporary spelling, an automatic orthographic normalization of the historical source material is usually pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system that fine-tunes a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge due to difficulties for models to generalize and the lack of extensive high-quality parallel data.
comment: 27 pages, 3 figures
☆ R2GQA: Retriever-Reader-Generator Question Answering System to Support Students Understanding Legal Regulations in Higher Education
In this article, we propose the R2GQA system, a Retriever-Reader-Generator Question Answering system, consisting of three main components: Document Retriever, Machine Reader, and Answer Generator. The Retriever module employs advanced information retrieval techniques to extract the context of articles from a dataset of legal regulation documents. The Machine Reader module utilizes state-of-the-art natural language understanding algorithms to comprehend the retrieved documents and extract answers. Finally, the Generator module synthesizes the extracted answers into concise and informative responses to students' questions regarding legal regulations. Furthermore, we built the ViRHE4QA dataset in the domain of university training regulations, comprising 9,758 question-answer pairs constructed through a rigorous process. This is the first Vietnamese dataset in the higher education regulations domain with various types of answers, both extractive and abstractive. In addition, the R2GQA system is the first system to offer abstractive answers in Vietnamese. This paper discusses the design and implementation of each module within the R2GQA system on the ViRHE4QA dataset, highlighting their functionalities and interactions. Furthermore, we present experimental results demonstrating the effectiveness and utility of the proposed system in supporting students' comprehension of legal regulations in higher education settings. In general, the R2GQA system and the ViRHE4QA dataset promise to contribute significantly to related research and to help students navigate complex legal documents and regulations, empowering them to make informed decisions and adhere to institutional policies effectively. Our dataset is available for research purposes.
Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models
This study performs analysis of Predictive statements, Hope speech, and Regret Detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named "Prediction statements," categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering nuanced interplay between these emotions and predictive behaviors. Despite encountering limitations related to data volume and resource availability, our study reports valuable discoveries concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.
☆ CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models
Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or options, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.
☆ MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
☆ Towards a Unified View of Preference Learning for Large Language Models: A Survey
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of the preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences.
comment: Initial Commit, 21 pages
☆ A Comparative Study of Pre-training and Self-training
Pre-training and self-training are two approaches to semi-supervised learning. The comparison between pre-training and self-training has been explored before. However, previous works have led to confusing findings: self-training outperforms pre-training on some tasks in computer vision, while, contrarily, pre-training outperforms self-training on some tasks in natural language processing, under incomparable experimental settings. We propose an ensemble method to comparatively and exhaustively conduct an empirical study of all feasible training paradigms combining pre-training, self-training, and fine-tuning within consistent foundational settings comparable to data augmentation. We conduct experiments on six datasets, four data augmentation methods, and imbalanced data for sentiment analysis and natural language inference tasks. Our findings confirm that the pre-training and fine-tuning paradigm yields the best overall performance. Moreover, self-training offers no additional benefits when combined with semi-supervised pre-training.
comment: 19 pages, 2 figures, 9 tables
☆ Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?
The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
comment: https://github.com/yixuantt/PoolingAndAttn
Pre-training data selection for biomedical domain adaptation using journal impact metrics
Domain adaptation is a widely used method in natural language processing (NLP) to improve the performance of a language model within a specific domain. This method is particularly common in the biomedical domain, which sees regular publication of numerous scientific articles. PubMed, a significant corpus of text, is frequently used in the biomedical domain. The primary objective of this study is to explore whether refining a pre-training dataset using specific quality metrics for scientific papers can enhance the performance of the resulting model. To accomplish this, we employ two straightforward journal impact metrics and conduct experiments by continually pre-training BERT on various subsets of the complete PubMed training set; we then evaluate the resulting models on biomedical language understanding tasks from the BLURB benchmark. Our results show that pruning using journal impact metrics is not effective. However, we also show that pre-training using fewer abstracts (but with the same number of training steps) does not necessarily decrease the resulting model's performance.
☆ Alignment-Aware Model Extraction Attacks on Large Language Models
Model extraction attacks (MEAs) on large language models (LLMs) have received increasing research attention lately. Existing attack methods on LLMs inherit the extraction strategies from those designed for deep neural networks (DNNs) yet neglect the inconsistency of training tasks between MEA and LLMs' alignments. As such, they result in poor attack performances. To tackle this issue, we present Locality Reinforced Distillation (LoRD), a novel model extraction attack algorithm specifically for LLMs. In particular, we design a policy-gradient-style training task, which utilizes victim models' responses as a signal to guide the crafting of preference for the local model. Theoretical analysis has shown that i) LoRD's convergence procedure in MEAs is consistent with the alignments of LLMs, and ii) LoRD can reduce query complexity while mitigating watermark protection through exploration-based stealing. Extensive experiments on domain-specific extractions demonstrate the superiority of our method by examining the extraction of various state-of-the-art commercial LLMs.
comment: Source code: https://github.com/liangzid/alignmentExtraction
☆ A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations
Machine translation in low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy, impeding the performance of machine translation models. To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations. Our methodology leverages a multilingual SBERT model to filter out problematic translations in the training data. Specifically, we employ an IndicSBERT similarity model to assess the semantic equivalence between original and translated sentences, allowing us to retain linguistically correct translations while discarding instances with substantial deviations. The results demonstrate a significant improvement in translation quality over the baseline post-filtering with IndicSBERT. This illustrates how cross-lingual sentence representations can reduce errors in machine translation scenarios with limited resources. By integrating multilingual sentence BERT models into the translation pipeline, this research contributes to advancing machine translation techniques in low-resource environments. The proposed method not only addresses the challenges in English-Marathi language pairs but also provides a valuable framework for enhancing translation quality in other low-resource language translation tasks.
comment: Accepted at I2CT 2024
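A minimal sketch of similarity-based filtering of parallel data, in the spirit of the approach above: a publicly available multilingual sentence-embedding checkpoint stands in for the IndicSBERT model used in the paper, and the threshold and example pairs are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder checkpoint; the paper uses an IndicSBERT similarity model,
# but any multilingual sentence-embedding model can stand in for this sketch.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def filter_pairs(pairs, threshold=0.7):
    """Keep (English, Marathi) pairs whose cross-lingual cosine similarity
    meets the threshold; noisy or mistranslated pairs tend to fall below it."""
    src = model.encode([s for s, _ in pairs], convert_to_tensor=True)
    tgt = model.encode([t for _, t in pairs], convert_to_tensor=True)
    sims = util.cos_sim(src, tgt).diagonal()
    return [p for p, s in zip(pairs, sims) if s.item() >= threshold]

pairs = [
    ("The weather is nice today.", "आज हवामान छान आहे."),
    ("The weather is nice today.", "मला सफरचंद आवडतात."),  # deliberately mismatched pair
]
print(len(filter_pairs(pairs)))
```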
☆ Detecting Calls to Action in Multimodal Content: Analysis of the 2021 German Federal Election Campaign on Instagram
This study investigates the automated classification of Calls to Action (CTAs) within the 2021 German Instagram election campaign to advance the understanding of mobilization in social media contexts. We analyzed over 2,208 Instagram stories and 712 posts using fine-tuned BERT models and OpenAI's GPT-4 models. The fine-tuned BERT model incorporating synthetic training data achieved a macro F1 score of 0.93, demonstrating a robust classification performance. Our analysis revealed that 49.58% of Instagram posts and 10.64% of stories contained CTAs, highlighting significant differences in mobilization strategies between these content types. Additionally, we found that FDP and the Greens had the highest prevalence of CTAs in posts, whereas CDU and CSU led in story CTAs.
comment: Accepted Archival Paper for the CPSS Workshop at KONVENS 2024. Camera Ready Submission
☆ Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs
Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions, but recent studies reveal that these models often fail to achieve satisfactory results on questions involving reasoning, such as mathematics or physics questions. This phenomenon is usually attributed to the uncertainty regarding whether these models could genuinely comprehend the knowledge embedded in the text or merely learn to replicate the token distribution without a true understanding of the content. In this paper, we delve into this problem and aim to enhance the reasoning capabilities of LLMs. First, we investigate if the model has genuine reasoning capabilities by visualizing the text generation process at the attention and representation level. Then, we formulate the reasoning process of LLMs into a causal framework, which provides a formal explanation of the problems we observe in the visualization. Finally, building upon this causal framework, we propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities by encouraging the model to extract the general problem-solving skills and apply these skills to different questions. Experiments show that our method outperforms the baseline consistently across multiple benchmarks, and with only 1.2M tunable parameters, we achieve better or comparable results to other fine-tuning methods. This demonstrates the effectiveness and efficiency of our method in improving the overall accuracy and reliability of LLMs.
☆ Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus
This article investigates how translation memories (TM) can be created by translators or other language professionals in order to compile domain-specific parallel corpora, which can then be used in different scenarios, such as machine translation training and fine-tuning, TM leveraging, and/or large language model fine-tuning. The article introduces a semi-automatic TM preparation methodology that primarily leverages the translation tools used by translators, in favor of data quality and translator control. This semi-automatic methodology is then used to build a cardiology-based Turkish -> English corpus from bilingual abstracts of Turkish cardiology journals. The resulting corpus, called the TRENCARD Corpus, has approximately 800,000 source words and 50,000 sentences. Using this methodology, translators can build their custom TMs in a reasonable time and use them in tasks that require bilingual data.
☆ OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation
This paper presents the experiments and results for the CheckThat! Lab at CLEF 2024 Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE). The primary objective of this task was to generate adversarial examples in five problem domains in order to evaluate the robustness of widely used text classification methods (fine-tuned BERT, BiLSTM, and RoBERTa) when applied to credibility assessment issues. This study explores the application of ensemble learning to enhance adversarial attacks on natural language processing (NLP) models. We systematically tested and refined several adversarial attack methods, including BERT-Attack, genetic algorithms, TextFooler, and CLARE, on five datasets across various misinformation tasks. By developing modified versions of BERT-Attack and hybrid methods, we achieved significant improvements in attack effectiveness. Our results demonstrate the potential of modifying and combining multiple methods to create more sophisticated and effective adversarial attack strategies, contributing to the development of more robust and secure systems.
comment: CLEF 2024 - Conference and Labs of the Evaluation Forum
☆ A Survey on Emergent Language
The field of emergent language represents a novel area of research within the domain of artificial intelligence, particularly within the context of multi-agent reinforcement learning. Although the concept of studying language emergence is not new, early approaches were primarily concerned with explaining human language formation, with little consideration given to its potential utility for artificial agents. In contrast, studies based on reinforcement learning aim to develop communicative capabilities in agents that are comparable to or even superior to human language. Thus, they extend beyond the learned statistical representations that are common in natural language processing research. This gives rise to a number of fundamental questions, from the prerequisites for language emergence to the criteria for measuring its success. This paper addresses these questions by providing a comprehensive review of 181 scientific publications on emergent language in artificial intelligence. Its objective is to serve as a reference for researchers interested in or proficient in the field. Consequently, the main contributions are the definition and overview of the prevailing terminology, the analysis of existing evaluation methods and metrics, and the description of the identified research gaps.
☆ PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation
The ability of large language models (LLMs) to interpret visual representations of data is crucial for advancing their application in data analysis and decision-making processes. This paper presents a novel synthetic dataset designed to evaluate the proficiency of LLMs in interpreting various forms of data visualizations, including plots like time series, histograms, violins, boxplots, and clusters. Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios. We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models like ChatGPT or Gemini, assessing their understanding and interpretative accuracy. To ensure data integrity, our benchmark dataset is generated automatically, making it entirely new and free from prior exposure to the models being tested. This strategy allows us to evaluate the models' ability to truly interpret and understand the data, eliminating possibility of pre-learned responses, and allowing for an unbiased evaluation of the models' capabilities. We also introduce quantitative metrics to assess the performance of the models, providing a robust and comprehensive evaluation tool. Benchmarking several state-of-the-art LLMs with this dataset reveals varying degrees of success, highlighting specific strengths and weaknesses in interpreting diverse types of visual data. The results provide valuable insights into the current capabilities of LLMs and identify key areas for improvement. This work establishes a foundational benchmark for future research and development aimed at enhancing the visual interpretative abilities of language models. In the future, improved LLMs with robust visual interpretation skills can significantly aid in automated data analysis, scientific research, educational tools, and business intelligence applications.
☆ An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due to the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input sequences ranging from 20 to 80 seconds.
comment: Accepted in the IEEE Spoken Language Technology Workshop 2024
☆ More is More: Addition Bias in Large Language Models
In this paper, we investigate the presence of additive bias in Large Language Models (LLMs), drawing a parallel to the cognitive bias observed in humans where individuals tend to favor additive over subtractive changes. Using a series of controlled experiments, we tested various LLMs, including GPT-3.5 Turbo, Claude 3.5 Sonnet, Mistral, Math$\Sigma$tral, and Llama 3.1, on tasks designed to measure their propensity for additive versus subtractive modifications. Our findings demonstrate a significant preference for additive changes across all tested models. For example, in a palindrome creation task, Llama 3.1 favored adding letters 97.85% of the time over removing them. Similarly, in a Lego tower balancing task, GPT-3.5 Turbo chose to add a brick 76.38% of the time rather than remove one. In a text summarization task, Mistral 7B produced longer summaries in 59.40% to 75.10% of cases when asked to improve its own or others' writing. These results indicate that, similar to humans, LLMs exhibit a marked additive bias, which might have implications when LLMs are used on a large scale. Additive bias might increase resource use and environmental impact, leading to higher economic costs due to overconsumption and waste. This bias should be considered in the development and application of LLMs to ensure balanced and efficient problem-solving approaches.
comment: 25 pages, 8 figures
☆ Language is Scary when Over-Analyzed: Unpacking Implied Misogynistic Reasoning with Argumentation Theory-Driven Prompts
We propose misogyny detection as an Argumentative Reasoning task and we investigate the capacity of large language models (LLMs) to understand the implicit reasoning used to convey misogyny in both Italian and English. The central aim is to generate the missing reasoning link between a message and the implied meanings encoding the misogyny. Our study uses argumentation theory as a foundation to form a collection of prompts in both zero-shot and few-shot settings. These prompts integrate different techniques, including chain-of-thought reasoning and augmented knowledge. Our findings show that LLMs fall short on reasoning capabilities about misogynistic comments and that they mostly rely on their implicit knowledge derived from internalized common stereotypes about women to generate implied assumptions, rather than on inductive reasoning.
☆ Word and Phrase Features in Graph Convolutional Network for Automatic Question Classification
Effective question classification is crucial for AI-driven educational tools, enabling adaptive learning systems to categorize questions by skill area, difficulty level, and competence. This classification not only supports educational diagnostics and analytics but also enhances complex tasks like information retrieval and question answering by associating questions with relevant categories. Traditional methods, often based on word embeddings and conventional classifiers, struggle to capture the nuanced relationships in natural language, leading to suboptimal performance. To address this, we propose a novel approach leveraging graph convolutional networks (GCNs), named Phrase Question-Graph Convolutional Network (PQ-GCN) to better model the inherent structure of questions. By representing questions as graphs -- where nodes signify words or phrases and edges denote syntactic or semantic relationships -- our method allows GCNs to learn from the interconnected nature of language more effectively. Additionally, we explore the incorporation of phrase-based features to enhance classification accuracy, especially in low-resource settings. Our findings demonstrate that GCNs, augmented with these features, offer a promising solution for more accurate and context-aware question classification, bridging the gap between graph neural network research and practical educational applications.
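As a toy illustration of turning a question into a graph before applying a GCN (the paper's phrase-level features and the PQ-GCN architecture itself are not shown), the sketch below links word nodes that co-occur within a small window; the window size and whitespace tokenization are simplifying assumptions.

```python
import networkx as nx

def question_graph(question: str, window: int = 2) -> nx.Graph:
    """Nodes are word tokens; an edge links words that co-occur within a small
    window, a crude stand-in for the syntactic/semantic edges described above.
    A GCN would then learn node and graph representations over this structure."""
    tokens = question.lower().replace("?", "").split()
    g = nx.Graph()
    g.add_nodes_from(tokens)
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            g.add_edge(tok, tokens[j])
    return g

g = question_graph("What force keeps the planets in orbit around the sun?")
print(g.number_of_nodes(), g.number_of_edges())
```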
☆ A Comparative Study on Large Language Models for Log Parsing
Background: Log messages provide valuable information about the status of software systems. This information is provided in an unstructured fashion, and automated approaches are applied to extract relevant parameters. To ease this process, log parsing can be applied, which transforms log messages into structured log templates. Recent advances in language models have led to several studies that apply ChatGPT to the task of log parsing with promising results. However, the performance of other state-of-the-art large language models (LLMs) on the log parsing task remains unclear. Aims: In this study, we investigate the current capability of state-of-the-art LLMs to perform log parsing. Method: We select six recent LLMs, including two paid proprietary models (GPT-3.5, Claude 2.1) and four free-to-use open models, and compare their performance on system logs obtained from a selection of mature open-source projects. We design two different prompting approaches and apply the LLMs to 1,354 log templates across 16 different projects. We evaluate their effectiveness in terms of the number of correctly identified templates and the syntactic similarity between the generated templates and the ground truth. Results: We found that free-to-use models are able to compete with paid models, with CodeLlama extracting 10% more log templates correctly than GPT-3.5. Moreover, we provide qualitative insights into the usability of language models (e.g., how easy it is to use their responses). Conclusions: Our results reveal that some of the smaller, free-to-use LLMs can considerably assist log parsing compared to their paid proprietary competitors, especially code-specialized models.
comment: Accepted for publication in the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '24)
☆ DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels
With the rapid advancement of Large Language Models (LLMs), long-context information understanding and processing have become a hot topic in academia and industry. However, benchmarks for evaluating the ability of LLMs to handle long-context information do not seem to have kept pace with the development of LLMs. Despite the emergence of various long-context evaluation benchmarks, the types of capability assessed are still limited, without new capability dimensions. In this paper, we introduce DetectiveQA, a narrative reasoning benchmark featuring an average context length of over 100K tokens. DetectiveQA focuses on evaluating the long-context reasoning ability of LLMs, which not only requires a full understanding of context but also requires extracting important evidence from the context and reasoning over the extracted evidence to answer the given questions. This is a new dimension of capability evaluation, which is more in line with the current intelligence level of LLMs. We use detective novels as data sources, which naturally have various reasoning elements. Finally, we manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions. We evaluate many long-context LLMs on DetectiveQA, including commercial and open-source models, and the results indicate that existing long-context LLMs still require significant advancements to effectively process true long-context dependency questions.
☆ What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations EMNLP 2024
This paper explores the pitfalls in evaluating multilingual automatic speech recognition (ASR) models, with a particular focus on Indic language scripts. We investigate the text normalization routine employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer, and their unintended consequences on performance metrics. Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially inflated performance metrics for Indic languages. We conclude by proposing a shift towards developing normalization routines that leverage native linguistic expertise, ensuring more robust and accurate evaluations of multilingual ASR models.
comment: Submitted to EMNLP 2024
☆ Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning
Leveraging large language models (LLMs) for designing reward functions demonstrates significant potential. However, achieving effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we enable LLMs to be effective white-box searchers, highlighting their advanced semantic understanding capabilities. Specifically, we generate reward components for each explicit user requirement and employ the reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively search and optimize these weights based on the context provided by the training log analyzer, while adaptively determining the search step size. We applied the framework to an underwater information collection RL task without direct human feedback or reward examples (zero-shot). The reward critic successfully corrects the reward code with only one piece of feedback per requirement, effectively preventing irreparable errors that can occur when reward function feedback is provided in aggregate. The effective initialization of weights enables the acquisition of different reward functions within the Pareto solution set without weight search. Even in the case where a weight is 100 times off, fewer than four iterations are needed to obtain solutions that meet user requirements. The framework also works well with most prompts utilizing GPT-3.5 Turbo, since it does not require advanced numerical understanding or calculation.
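A control-flow sketch of the iterative weight search described above follows. The reward components, the log analyzer, and the weight proposer (standing in for the LLM searcher) are all hypothetical placeholders, not the paper's actual framework.

```python
# Sketch of an iterative reward-weight search loop: analyze the training log,
# ask a searcher (here a trivial stand-in for the LLM) to adjust weights, and
# stop once no requirement is flagged. All functions are toy placeholders.
def reward(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of pre-generated reward components."""
    return sum(weights[k] * v for k, v in components.items())

def analyze_training_log(weights: dict[str, float]) -> str:
    """Toy log analyzer: pretend coverage lags whenever its weight is < 0.5."""
    return "coverage too low" if weights["coverage"] < 0.5 else "ok"

def propose_weights(weights: dict[str, float], feedback: str, step: float) -> dict[str, float]:
    """Stand-in for the LLM searcher: nudge the flagged weight upward."""
    new = dict(weights)
    if "coverage" in feedback:
        new["coverage"] += step
    return new

weights = {"coverage": 0.05, "energy": 1.0}   # deliberately mis-scaled start
for it in range(10):
    feedback = analyze_training_log(weights)
    if feedback == "ok":
        break
    weights = propose_weights(weights, feedback, step=0.2)
print(it, weights)                             # converges in a few iterations
```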
☆ Abstractive Text Summarization: State of the Art, Challenges, and Improvements
Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexity, scalability, and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, and limitations, and charts out future improvements, providing researchers with an extensive overview to advance abstractive summarization research. We provide vital comparison tables across the categorized techniques, offering insights into model complexity, scalability, and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.
comment: 9 Tables, 7 Figures
☆ Determination of language families using deep learning
We use a c-GAN (convolutional generative adversarial network) to analyze transliterated text fragments of extant languages, dead but comprehensible languages, and one dead, undeciphered language (Cypro-Minoan) to establish linguistic affinities. The paper is agnostic with respect to translation and/or deciphering. However, there is hope that the proposed approach can be useful for decipherment with more sophisticated neural network techniques.
comment: First draft. Comments are welcome
☆ Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science, examining similarities and differences between LLMs and human cognitive processes. We analyze methods for evaluating LLMs' cognitive abilities and discuss their potential as cognitive models. The review covers applications of LLMs in various cognitive fields, highlighting insights gained for cognitive science research. We assess cognitive biases and limitations of LLMs, along with proposed methods for improving their performance. The integration of LLMs with cognitive architectures is examined, revealing promising avenues for enhancing artificial intelligence (AI) capabilities. Key challenges and future research directions are identified, emphasizing the need for continued refinement of LLMs to better align with human cognition. This review provides a balanced perspective on the current state and future potential of LLMs in advancing our understanding of both artificial and human intelligence.
comment: 10 pages, 1 figure
☆ STAB: Speech Tokenizer Assessment Benchmark
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text, thus enabling the use of speech as an input to the widely successful large language models (LLMs). Currently, while several speech tokenizers have been proposed, there is ambiguity regarding the properties that are desired from a tokenizer for specific downstream tasks and its overall generalizability. Evaluating the performance of tokenizers across different downstream tasks is a computationally intensive effort that poses challenges for scalability. To circumvent this requirement, we present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively and shed light on their inherent characteristics. This framework provides a deeper understanding of the underlying mechanisms of speech tokenization, thereby offering a valuable resource for expediting the advancement of future tokenizer models and enabling comparative analysis using a standardized benchmark. We evaluate the STAB metrics and correlate them with downstream task performance across a range of speech tasks and tokenizer choices.
comment: 5 pages
☆ How Privacy-Savvy Are Large Language Models? A Case Study on Compliance and Privacy Technical Review
The recent advances in large language models (LLMs) have significantly expanded their applications across various fields such as language generation, summarization, and complex question answering. However, their application to privacy compliance and technical privacy reviews remains under-explored, raising critical concerns about their ability to adhere to global privacy standards and protect sensitive user data. This paper seeks to address this gap by providing a comprehensive case study evaluating LLMs' performance in privacy-related tasks such as privacy information extraction (PIE), legal and regulatory key point detection (KPD), and question answering (QA) with respect to privacy policies and data protection regulations. We introduce a Privacy Technical Review (PTR) framework, highlighting its role in mitigating privacy risks during the software development life-cycle. Through an empirical assessment, we investigate the capacity of several prominent LLMs, including BERT, GPT-3.5, GPT-4, and custom models, in executing privacy compliance checks and technical privacy reviews. Our experiments benchmark the models across multiple dimensions, focusing on their precision, recall, and F1-scores in extracting privacy-sensitive information and detecting key regulatory compliance points. While LLMs show promise in automating privacy reviews and identifying regulatory discrepancies, significant gaps persist in their ability to fully comply with evolving legal standards. We provide actionable recommendations for enhancing LLMs' capabilities in privacy compliance, emphasizing the need for robust model improvements and better integration with legal and regulatory requirements. This study underscores the growing importance of developing privacy-aware LLMs that can both support businesses in compliance efforts and safeguard user privacy rights.
comment: 8 pages, 4 figures
☆ Do Large Language Models Possess Sensitive to Sentiment?
Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding. However, how to comprehensively assess the sentiment capabilities of LLMs continues to be a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in the text modality. As the integration of LLMs into diverse applications is on the rise, it becomes highly critical to comprehend their sensitivity to emotional tone, as it can influence the user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate the performance of several prominent LLMs in identifying and responding appropriately to sentiments like positive, negative, and neutral emotions. The models' outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our discoveries indicate that although LLMs show a basic sensitivity to sentiment, there are substantial variations in their accuracy and consistency, emphasizing the requirement for further enhancements in their training processes to better capture subtle emotional cues. For example, in some cases the models might wrongly classify a strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need to be refined. Another aspect is that different LLMs might perform differently on the same set of data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that contribute to the performance differences and how they can be optimized.
comment: 10 pages, 2 figures
☆ Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering
The retrieval augmented generation (RAG) framework addresses ambiguity in user queries in QA systems by retrieving passages that cover all plausible interpretations and generating comprehensive responses based on the passages. However, our preliminary studies reveal that a single retrieval process often suffers from low-quality results, as the retrieved passages frequently fail to capture all plausible interpretations. Although the iterative RAG approach has been proposed to address this problem, it comes at the cost of significantly reduced efficiency. To address these issues, we propose the diversify-verify-adapt (DIVA) framework. DIVA first diversifies the retrieved passages to encompass diverse interpretations. Subsequently, DIVA verifies the quality of the passages and adapts the most suitable approach tailored to their quality. This approach improves QA systems' accuracy and robustness by handling the low-quality retrieval issue in ambiguous questions, while enhancing efficiency.
☆ NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
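The NUDGE abstract above describes leaving the encoder untouched and directly moving the data embeddings so that training queries retrieve the right records, with changes kept small. The sketch below illustrates that non-parametric idea with a simple bounded, gradient-free update; the update rule and the bound are illustrative assumptions, not NUDGE's actual solution to its constrained optimization problem.

```python
# Sketch of the non-parametric idea: pull each record's embedding toward the
# training queries that should retrieve it, capping the per-record shift so
# that the pre-trained semantics are not distorted. Illustrative only.
import numpy as np

def nudge_embeddings(data_emb: np.ndarray,
                     query_emb: np.ndarray,
                     positives: list[int],
                     max_shift: float = 0.1) -> np.ndarray:
    """positives[q] is the index of the record that query q should retrieve."""
    new_emb = data_emb.copy()
    for rec_id in range(len(data_emb)):
        q_ids = [q for q, pos in enumerate(positives) if pos == rec_id]
        if not q_ids:
            continue                                  # no training query for this record
        shift = query_emb[q_ids].mean(axis=0) - data_emb[rec_id]
        norm = np.linalg.norm(shift)
        if norm > max_shift:                          # bound the displacement
            shift *= max_shift / norm
        new_emb[rec_id] = data_emb[rec_id] + shift
    # re-normalize so cosine / k-NN retrieval stays well behaved
    return new_emb / np.linalg.norm(new_emb, axis=1, keepdims=True)

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 32))                     # record embeddings
queries = rng.normal(size=(20, 32))                   # training query embeddings
labels = list(rng.integers(0, 100, size=20))          # positive record per query
data = nudge_embeddings(data, queries, labels)
```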
☆ Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)
This paper explores a novel approach to adversarial attacks on large language models (LLMs): the Single-Turn Crescendo Attack (STCA). The STCA builds upon the multi-turn crescendo attack established by Mark Russinovich, Ahmed Salem, and Ronen Eldan. Traditional multi-turn adversarial strategies gradually escalate the context to elicit harmful or controversial responses from LLMs. However, this paper introduces a more efficient method where the escalation is condensed into a single interaction. By carefully crafting the prompt to simulate an extended dialogue, the attack bypasses typical content moderation systems, leading to the generation of responses that would normally be filtered out. I demonstrate this technique through a few case studies. The results highlight vulnerabilities in current LLMs and underscore the need for more robust safeguards. This work contributes to the broader discourse on responsible AI (RAI) safety and adversarial testing, providing insights and practical examples for researchers and developers. This method is unexplored in the literature, making it a novel contribution to the field.
☆ Probing self-attention in self-supervised speech models for cross-linguistic differences
Speech models have gained traction thanks to increases in accuracy from novel transformer architectures. While this impressive increase in performance across automatic speech recognition (ASR) benchmarks is noteworthy, there is still much that is unknown about the use of attention mechanisms for speech-related tasks. For example, while it is assumed that these models are learning language-independent (i.e., universal) speech representations, there has not yet been an in-depth exploration of what it would mean for the models to be language-independent. In the current paper, we explore this question within the realm of self-attention mechanisms of one small self-supervised speech transformer model (TERA). We find that even with a small model, the attention heads learned are diverse, ranging from almost entirely diagonal to almost entirely global, regardless of the training language. We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining. We also present a head ablation study which shows that models across languages primarily rely on diagonal heads to classify phonemes.
comment: 10 pages, 18 figures
☆ Quantification of stylistic differences in human- and ASR-produced transcripts of African American English
Common measures of accuracy used to assess the performance of automatic speech recognition (ASR) systems, as well as human transcribers, conflate multiple sources of error. Stylistic differences, such as verbatim vs non-verbatim, can play a significant role in ASR performance evaluation when differences exist between training and test datasets. The problem is compounded for speech from underrepresented varieties, where the speech to orthography mapping is not as standardized. We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English (AAE) speech. Focusing on verbatim features and AAE morphosyntactic features, we investigate the interactions of these categories with how well transcripts can be compared via word error rate (WER). The results, and overall analysis, help clarify how ASR outputs are a function of the decisions made by the training data's human transcribers.
comment: Published in Interspeech 2024 Proceedings, 5 pages excluding references, 5 figures
☆ Oddballness: universal anomaly detection with language models
We present a new method to detect anomalies in texts (in general: in sequences of any data), using language models, in a totally unsupervised manner. The method considers probabilities (likelihoods) generated by a language model, but instead of focusing on low-likelihood tokens, it considers a new metric introduced in this paper: oddballness. Oddballness measures how ``strange'' a given token is according to the language model. We demonstrate in grammatical error detection tasks (a specific case of text anomaly detection) that oddballness is better than just considering low-likelihood events, if a totally unsupervised setup is assumed.
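The oddballness abstract above contrasts the new metric with simply flagging low-likelihood tokens. The snippet below is only one plausible formalization of a "stranger than the model's own uncertainty" score, written as an assumption for illustration; the paper defines its own oddballness metric.

```python
# Illustrative stand-in for an oddballness-style score: compare the observed
# token's surprisal to the entropy of the model's next-token distribution,
# instead of looking at the raw likelihood alone. Assumed formulation only.
import math

def oddballness_like(next_token_probs: dict[str, float], observed: str) -> float:
    """Surprisal of the observed token minus the distribution's entropy.
    Large positive values mean the token is 'stranger' than the model's own
    uncertainty would explain; values near or below zero are unremarkable."""
    surprisal = -math.log(next_token_probs[observed])
    entropy = -sum(p * math.log(p) for p in next_token_probs.values() if p > 0)
    return surprisal - entropy

# Toy distribution over the next token after "I drank a cup of ..."
probs = {"coffee": 0.6, "tea": 0.3, "water": 0.09, "gravel": 0.01}
print(oddballness_like(probs, "coffee"))   # about -0.4: expected continuation
print(oddballness_like(probs, "gravel"))   # about  3.7: odd continuation
```

Thresholding such a score per token would give an unsupervised anomaly flag, which is the spirit of the grammatical-error-detection evaluation described above.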
☆ CLUE: Concept-Level Uncertainty Estimation for Large Language Models
Large Language Models (LLMs) have demonstrated remarkable proficiency in various natural language generation (NLG) tasks. Previous studies suggest that LLMs' generation process involves uncertainty. However, existing approaches to uncertainty estimation mainly focus on sequence-level uncertainty, overlooking individual pieces of information within sequences. These methods fall short in separately assessing the uncertainty of each component in a sequence. In response, we propose a novel framework for Concept-Level Uncertainty Estimation (CLUE) for LLMs. We leverage LLMs to convert output sequences into concept-level representations, breaking down sequences into individual concepts and measuring the uncertainty of each concept separately. We conduct experiments to demonstrate that CLUE can provide more interpretable uncertainty estimation results compared with sentence-level uncertainty, and could be a useful tool for various tasks such as hallucination detection and story generation.
☆ Hallucination Detection in LLMs: Fast and Memory-Efficient Finetuned Models
Uncertainty estimation is a necessary component when implementing AI in high-risk settings, such as autonomous cars, medicine, or insurance. Large Language Models (LLMs) have seen a surge in popularity in recent years, but they are subject to hallucinations, which may cause serious harm in high-risk settings. Despite their success, LLMs are expensive to train and run: they need a large amount of computation and memory, preventing the use of ensembling methods in practice. In this work, we present a novel method that allows for fast and memory-friendly training of LLM ensembles. We show that the resulting ensembles can detect hallucinations and are a viable approach in practice as only one GPU is needed for training and inference.
comment: 5 pages, 3 figures
♻ ☆ LADDER: Language Driven Slice Discovery and Error Rectification
Error slice discovery associates structured patterns with model errors. Existing methods discover error slices by clustering the error-prone samples with similar patterns or assigning discrete attributes to each sample for post-hoc analysis. While these methods aim for interpretability and easier mitigation through reweighting or rebalancing, they may not capture the full complexity of error patterns due to incomplete or missing attributes. Contrary to existing approaches, this paper utilizes the reasoning capabilities of the Large Language Model (LLM) to analyze complex error patterns and generate testable hypotheses. This paper proposes LADDER: Language Driven slice Discovery and Error Rectification. It first projects the model's representation into a language-aligned feature space (e.g., CLIP) to preserve semantics in the original model feature space. This ensures the accurate retrieval of sentences that highlight the model's errors. Next, the LLM utilizes the sentences and generates hypotheses to discover error slices. Finally, we mitigate the error by fine-tuning the classification head by creating a group-balanced dataset using the hypotheses. Our entire method does not require any attribute annotation, either explicitly or through external tagging models. We validate our method with five image classification datasets. The code is available (https://github.com/batmanlab/Ladder).
♻ ☆ The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem
Large language models (LLMs) are useful tools with the capacity for performing specific types of knowledge work at an effective scale. However, LLM deployments in high-risk and safety-critical domains pose unique challenges, notably the issue of ``hallucination,'' where LLMs can generate fabricated information. This is particularly concerning in settings such as drug safety, where inaccuracies could lead to patient harm. To mitigate these risks, we have developed and demonstrated a proof of concept suite of guardrails specifically designed to mitigate certain types of hallucinations and errors for drug safety, and potentially applicable to other medical safety-critical contexts. These guardrails include mechanisms to detect anomalous documents to prevent the ingestion of inappropriate data, identify incorrect drug names or adverse event terms, and convey uncertainty in generated content. We integrated these guardrails with an LLM fine-tuned for a text-to-text task, which involves converting both structured and unstructured data within adverse event reports into natural language. This method was applied to translate individual case safety reports, demonstrating effective application in a pharmacovigilance processing task. Our guardrail framework offers a set of tools with broad applicability across various domains, ensuring LLMs can be safely used in high-risk situations by eliminating the occurrence of key errors, including the generation of incorrect pharmacovigilance-related terms, thus adhering to stringent regulatory and quality standards in medical safety-critical environments.
comment: 27 pages, 6 figures, 4 tables and supplementary material provided
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
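The continual pre-training abstract above names three ingredients: learning-rate re-warming, re-decaying, and replay of previous data. The sketch below shows a generic re-warm-then-cosine-decay schedule and a simple replay mix; all constants (peak/minimum LR, warmup length, replay fraction) are arbitrary illustrations, not the paper's settings.

```python
# Sketch of LR re-warming/re-decaying when a new dataset arrives, plus a tiny
# replay helper that mixes some old-corpus documents into each new batch.
import math

def rewarm_cosine_lr(step: int, warmup_steps: int, total_steps: int,
                     max_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Linear re-warming back to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def mixed_batch(new_docs: list[str], old_docs: list[str], replay_frac: float = 0.05) -> list[str]:
    """Append a small fraction of old-corpus documents to a new-data batch."""
    n_old = max(1, int(replay_frac * len(new_docs)))
    return new_docs + old_docs[:n_old]

# LR at the start, mid-warmup, end of warmup, mid-decay, and end of training.
print([round(rewarm_cosine_lr(s, 100, 1000), 6) for s in (0, 50, 100, 550, 1000)])
```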
♻ ☆ LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full sequence training. Furthermore, LongRecipe also preserves the original LLM's capabilities in general tasks. Ultimately, we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80G memory. Our code is released at https://github.com/zhiyuanhubj/LongRecipe.
comment: Work in Progress
♻ ☆ Revisiting Character-level Adversarial Attacks for Language Models ICML 2024
Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, gaining prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention as they cannot easily adopt popular gradient-based methods, and are thought to be easy to defend against. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8 percentage points with respect to the previous art. Our implementation is available at https://github.com/LIONS-EPFL/Charmer.
comment: Accepted in ICML 2024
♻ ☆ LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
♻ ☆ AI-generated text boundary detection with RoFT
Due to the rapid development of large language models, people increasingly often encounter texts that may start as written by a human but continue as machine-generated. Detecting the boundary between human-written and machine-generated parts of such texts is a challenging problem that has not received much attention in the literature. We attempt to bridge this gap and examine several ways to adapt state-of-the-art artificial text detection classifiers to the boundary detection setting. We push all detectors to their limits, using the Real or Fake text benchmark that contains short texts on several topics and includes generations of various language models. We use this diversity to deeply examine the robustness of all detectors in cross-domain and cross-model settings to provide baselines and insights for future research. In particular, we find that perplexity-based approaches to boundary detection tend to be more robust to peculiarities of domain-specific data than supervised fine-tuning of the RoBERTa model; we also find which features of the text confuse boundary detection algorithms and negatively influence their performance in cross-domain settings.
comment: Our official repository: https://github.com/SilverSolver/ai_boundary_detection
♻ ☆ Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation
Foundational Large Language Models (LLMs) have changed the way we perceive technology. They have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools. At the same time, researchers are striving to identify the limitations of these tools to improve them further. Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content. In the present work, we have identified a fundamental limitation related to the image generation ability of LLMs, and termed it The NO Syndrome. This negation blindness refers to LLMs' inability to correctly comprehend NO-related natural language prompts to generate the desired images. Interestingly, all tested LLMs, including GPT-4, Gemini, and Copilot, were found to suffer from this syndrome. To demonstrate the generalization of this limitation, we carried out simulation experiments and conducted entropy-based and benchmark statistical analysis tests on various LLMs in multiple languages, including English, Hindi, and French. We conclude that the NO syndrome is a significant flaw in current LLMs that needs to be addressed. A related finding of this study showed a consistent discrepancy between image and textual responses as a result of this NO syndrome. We posit that the introduction of a negation context-aware reinforcement learning based feedback loop between the LLM's textual response and generated image could help ensure the generated text is based on both the LLM's correct contextual understanding of the negation query and the generated visual output.
comment: 15 pages, 7 figures
♻ ☆ Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
Large language models (LLMs) are trained on broad corpora and then used in communities with specialized norms. Is providing LLMs with community rules enough for models to follow these norms? We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy. LLMs struggled with bias detection, achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting biases (some under- and others over-predicted bias), suggesting distinct priors about neutrality. LLMs performed better at generation, removing 79% of words removed by Wikipedia editors. However, LLMs made additional changes beyond Wikipedia editors' simpler neutralizations, resulting in high-recall but low-precision editing. Interestingly, crowdworkers rated AI rewrites as more neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia editors but often made extraneous non-NPOV-related changes (such as grammar). LLMs may apply rules in ways that resonate with the public but diverge from community experts. While potentially effective for generation, LLMs may reduce editor agency and increase moderation workload (e.g., verifying additions). Even when rules are easy to articulate, having LLMs apply them like community members may still be difficult.
♻ ☆ A Causal Explainable Guardrails for Large Language Models
Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases. Existing methods for steering LLMs toward desired attributes often assume unbiased representations and rely solely on steering prompts. However, the representations learned from pre-training can introduce semantic biases that influence the steering process, leading to suboptimal results. We propose LLMGuardrail, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. LLMGuardrail systematically identifies and blocks the confounding effects of biases, enabling the extraction of unbiased steering representations. Additionally, it includes an explainable component that provides insights into the alignment between the generated output and the desired direction. Experiments demonstrate LLMGuardrail's effectiveness in steering LLMs toward desired attributes while mitigating biases. Our work contributes to the development of safe and reliable LLMs that align with desired attributes.
comment: 16 pages
♻ ☆ Parallel Speculative Decoding with Adaptive Draft Length
Speculative decoding (SD), where an extra draft model is employed to provide multiple draft tokens first and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods suffer from the mutual waiting problem, i.e., the target model gets stuck when the draft model is guessing tokens, and vice versa. This problem is directly incurred by the asynchronous execution of the draft model and the target model, and is exacerbated due to the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). Specifically, PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. PEARL parallelizes the drafting and verification phases by applying these two strategies, and achieves adaptive draft length for different scenarios, which effectively alleviates the mutual waiting problem. Moreover, we theoretically demonstrate that the mean number of accepted tokens of PEARL exceeds that of existing draft-then-verify works. Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, leading to superior speedups of up to 3.79$\times$ and 1.52$\times$ compared to auto-regressive decoding and vanilla speculative decoding, respectively.
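To make the draft-then-verify setting concrete, the sketch below shows the baseline speculative decoding loop that PEARL builds on, with a naive adaptive draft-length heuristic. The toy draft/target "models" are placeholders, and this sketch does not implement PEARL's pre-verify/post-verify parallelism.

```python
# Baseline draft-then-verify loop with a simple adaptive draft length:
# grow the draft window after fully accepted rounds, shrink it after early
# rejections. The "models" are random stand-ins for illustration only.
import random

def draft_model(prefix: list[int], n: int) -> list[int]:
    return [random.randint(0, 9) for _ in range(n)]           # cheap guesses

def target_accepts(prefix: list[int], token: int) -> bool:
    return random.random() < 0.7                               # stand-in verifier

def target_next(prefix: list[int]) -> int:
    return random.randint(0, 9)                                # correction token

def speculative_generate(max_len: int = 40) -> list[int]:
    out, draft_len = [], 4
    while len(out) < max_len:
        draft = draft_model(out, draft_len)
        accepted = 0
        for tok in draft:                                      # verify in order
            if target_accepts(out, tok):
                out.append(tok)
                accepted += 1
            else:
                out.append(target_next(out))                   # correct and stop
                break
        # adapt the next draft length to recent acceptance behaviour
        if accepted == len(draft):
            draft_len = min(draft_len + 1, 16)
        else:
            draft_len = max(2, accepted + 1)
    return out[:max_len]

print(len(speculative_generate()))
```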
♻ ☆ HIRO: Hierarchical Information Retrieval Optimization
Retrieval-Augmented Generation (RAG) has revolutionized natural language processing by dynamically integrating external knowledge into Large Language Models (LLMs), addressing their limitation of static training datasets. Recent implementations of RAG leverage hierarchical data structures, which organize documents at various levels of summarization and information density. This complexity, however, can cause LLMs to "choke" on information overload, necessitating more sophisticated querying mechanisms. In this context, we introduce Hierarchical Information Retrieval Optimization (HIRO), a novel querying approach that employs a Depth-First Search (DFS)-based recursive similarity score calculation and branch pruning. This method uniquely minimizes the context delivered to the LLM without informational loss, effectively managing the challenge of excessive data. HIRO's refined approach is validated by a 10.85% improvement in performance on the NarrativeQA dataset.
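The HIRO abstract above describes a DFS-based recursive similarity calculation with branch pruning over hierarchically organized documents. The sketch below illustrates that pattern on a toy summary tree; the cosine scoring, the single threshold, and the tree layout are generic illustrations rather than HIRO's exact scoring or pruning rules.

```python
# Generic DFS over a hierarchical summary tree with branch pruning: descend a
# branch only while its similarity to the query stays above a threshold, and
# collect surviving leaves as context for the LLM.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    text: str
    embedding: np.ndarray
    children: list["Node"] = field(default_factory=list)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def dfs_retrieve(node: Node, query_emb: np.ndarray, threshold: float,
                 collected: list[str]) -> None:
    if cosine(node.embedding, query_emb) < threshold:
        return                                   # prune this whole branch
    if not node.children:
        collected.append(node.text)              # leaf passage: keep as context
        return
    for child in node.children:
        dfs_retrieve(child, query_emb, threshold, collected)

rng = np.random.default_rng(0)
leaves = [Node(f"passage {i}", rng.normal(size=8)) for i in range(4)]
root = Node("document summary", rng.normal(size=8), children=leaves)
context: list[str] = []
dfs_retrieve(root, rng.normal(size=8), threshold=-1.0, collected=context)
print(context)
```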
♻ ☆ What Formal Languages Can Transformers Express? A Survey
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.
comment: One minor correction in §5.1
♻ ☆ Large Language Models for Information Retrieval: A Survey
As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have integrated themselves into our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. The trajectory of IR has evolved dynamically from its origins in term-based methods to its integration with advanced neural models. While the neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid response) and modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given the rapid evolution of this research trajectory, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions, such as search agents, within this expanding field.
comment: updated to version 3
♻ ☆ Towards a Universal Method for Meaningful Signal Detection
It is known that human speech and certain animal vocalizations can convey meaningful content because we can decipher the content that a given utterance does convey. This paper explores an alternative approach to determining whether a signal is meaningful, one that analyzes only the signal itself and is independent of what the conveyed meaning might be. We devise a method that takes a waveform as input and outputs a score indicating its degree of 'meaningfulness'. We cluster contiguous portions of the input to minimize the total description length, and then take the length of the code of the assigned cluster labels as the meaningfulness score. We evaluate our method empirically, against several baselines, and show that it is the only one to give a high score to human speech in various languages and with various speakers, a moderate score to animal vocalizations from birds and orcas, and a low score to ambient noise from various sources.
♻ ☆ Open Implementation and Study of BEST-RQ for Speech Processing ICASSP 2024
Self-Supervised Learning (SSL) has proven to be useful in various speech tasks. However, these methods are generally very demanding in terms of data, memory, and computational resources. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) is an SSL method that has shown great performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods, such as wav2vec 2.0. Despite BEST-RQ's great performance, details are lacking in the original paper, such as the number of GPU/TPU hours used in pre-training, and there is no official easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on other downstream tasks aside from ASR and speech translation. In this work, we describe a re-implementation of a Random-projection quantizer and perform a preliminary study with a comparison to wav2vec 2.0 on four downstream tasks. We discuss the details and differences of our implementation. We show that a random projection quantizer can achieve similar downstream performance as wav2vec 2.0 while decreasing training time by more than a factor of two.
comment: Accepted in IEEE ICASSP 2024 workshop on Self-supervision in Audio, Speech and Beyond (SASB 2024)
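For the BEST-RQ entry above, the sketch below shows the core random-projection quantizer idea: a frozen random projection plus a frozen random codebook turn each speech frame into a discrete target for masked prediction. Dimensions and normalization details are illustrative choices, not the exact recipe from the paper or its re-implementation.

```python
# Sketch of a random-projection quantizer in the spirit of BEST-RQ: project
# each frame with a frozen random matrix and label it with the index of the
# nearest entry in a frozen random codebook.
import numpy as np

class RandomProjectionQuantizer:
    def __init__(self, feat_dim: int, proj_dim: int, codebook_size: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(size=(feat_dim, proj_dim))        # frozen projection
        codebook = rng.normal(size=(codebook_size, proj_dim))    # frozen codebook
        self.codebook = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)

    def __call__(self, frames: np.ndarray) -> np.ndarray:
        z = frames @ self.proj                                    # (T, proj_dim)
        z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-9)
        # nearest codebook entry per frame (maximum cosine similarity)
        return np.argmax(z @ self.codebook.T, axis=1)             # (T,) int labels

quantizer = RandomProjectionQuantizer(feat_dim=80, proj_dim=16, codebook_size=8192)
mel_frames = np.random.default_rng(1).normal(size=(200, 80))      # fake log-mel frames
targets = quantizer(mel_frames)
print(targets[:10])                                               # pre-training labels
```

Because both the projection and the codebook stay frozen, the quantizer itself needs no training, which is a large part of BEST-RQ's appeal over learned quantizers.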
♻ ☆ Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Token-based removal methods are one of the most prominent approaches in this direction, but risk losing the semantics of the context caused by intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique where its key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions, positives, and negative pairs where positives are sentences relevant to the question, while negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93x faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development: https://github.com/Workday/cpc.
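The CPC abstract above describes scoring each context sentence for its relevance to the question and keeping only the most relevant ones. The sketch below illustrates that sentence-level selection; a TF-IDF similarity stands in for the paper's trained context-aware sentence encoder purely to keep the example self-contained.

```python
# Sentence-level prompt compression sketch: rank context sentences by their
# similarity to the question and keep the top ones, preserving original order.
# TF-IDF is an explicit stand-in for the learned context-aware encoder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compress_prompt(question: str, context: str, keep: int = 2) -> str:
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    vec = TfidfVectorizer().fit(sentences + [question])
    scores = cosine_similarity(vec.transform([question]), vec.transform(sentences))[0]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])                      # keep original sentence order
    return ". ".join(sentences[i] for i in kept) + "."

context = ("The Eiffel Tower is in Paris. It was completed in 1889. "
           "Paris is also known for its museums. The tower is 330 metres tall.")
print(compress_prompt("How tall is the Eiffel Tower?", context, keep=2))
```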
♻ ☆ CADGE: Context-Aware Dialogue Generation Enhanced with Graph-Structured Knowledge Aggregation
Commonsense knowledge is crucial to many natural language processing tasks. Existing works usually incorporate graph knowledge with conventional graph neural networks (GNNs), leading to the text and graph knowledge encoding processes being separated in a serial pipeline. We argue that these separate representation learning stages may be suboptimal for neural networks to learn the overall context contained in both types of input knowledge. In this paper, we propose a novel context-aware graph-attention model (Context-aware GAT), which can effectively incorporate global features of relevant knowledge graphs based on a context-enhanced knowledge aggregation process. Specifically, our framework leverages a novel representation learning approach to process heterogeneous features - combining flattened graph knowledge with text. To the best of our knowledge, this is the first attempt at hierarchically applying graph knowledge aggregation on a connected subgraph, in addition to contextual information, to support commonsense dialogue generation. This framework shows superior performance compared to conventional GNN-based language frameworks. Both automatic and human evaluation demonstrate that our proposed model has significant performance uplifts over state-of-the-art baselines.
comment: Accepted by INLG 2024
♻ ☆ Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention
Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It's cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
comment: Journal Paper, 14 pages
♻ ☆ A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?
Modern Artificial Intelligence applications show great potential for language-related tasks that rely on next-word prediction. The current generation of Large Language Models (LLMs) have been linked to claims about human-like linguistic performance and their applications are hailed both as a step towards artificial general intelligence and as a major advance in understanding the cognitive, and even neural basis of human language. To assess these claims, first we analyze the contribution of LLMs as theoretically informative representations of a target cognitive system vs. atheoretical mechanistic tools. Second, we evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing, which requires grounding in previous expectations and past world experience. We hypothesize that since models lack grounded cognition, they cannot take advantage of these features and instead solely rely on fixed associations between represented words and word vectors. To assess this, we designed and ran a novel 'leet task' (l33t t4sk), which requires decoding sentences in which letters are systematically replaced by numbers. The results suggest that humans excel in this task whereas models struggle, confirming our hypothesis. We interpret the results by identifying the key abilities that are still missing from the current state of development of these models, which require solutions that go beyond increased system scaling.
♻ ☆ Exploring Interpretability of Independent Components of Word Embeddings with Automated Word Intruder Test LREC
Independent Component Analysis (ICA) is an algorithm originally developed for finding separate sources in a mixed signal, such as a recording of multiple people in the same room speaking at the same time. Unlike Principal Component Analysis (PCA), ICA permits the representation of a word as an unstructured set of features, without any particular feature being deemed more significant than the others. In this paper, we used ICA to analyze word embeddings. We have found that ICA can be used to find semantic features of the words, and these features can easily be combined to search for words that satisfy the combination. We show that most of the independent components represent such features. To quantify the interpretability of the components, we use the word intruder test, performed both by humans and by large language models. We propose to use the automated version of the word intruder test as a fast and inexpensive way of quantifying vector interpretability without the need for human effort.
comment: Presented at LREC-COLING 2024, cite this version please: https://aclanthology.org/2024.lrec-main.605/
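For the ICA entry above, the sketch below shows the basic analysis pattern: run FastICA over a word-embedding matrix and inspect, for each independent component, the words that load on it most strongly. The random matrix is a stand-in for real embeddings (e.g. loaded from a fastText or GloVe file), and the component count is arbitrary.

```python
# Sketch: FastICA over word embeddings, then list the top-loading words per
# independent component. Stand-in random embeddings for self-containment.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
vocab = [f"word_{i}" for i in range(500)]
embeddings = rng.normal(size=(500, 100))              # stand-in for real vectors

ica = FastICA(n_components=20, random_state=0, max_iter=1000)
components = ica.fit_transform(embeddings)             # (n_words, n_components)

def top_words(component_idx: int, k: int = 5) -> list[str]:
    """Words with the largest loading on one independent component."""
    order = np.argsort(-components[:, component_idx])
    return [vocab[i] for i in order[:k]]

for c in range(3):
    print(c, top_words(c))
```

With real embeddings, lists like these are exactly what a word intruder test (human or LLM-judged) would evaluate: an out-of-place word planted among the top-loading words should be easy to spot if the component is interpretable.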
♻ ☆ Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models
Background and Aims: This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology. Methods: We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images, to systematically assess the impact of model configurations and parameters and prompt engineering strategies utilizing GPT-3.5. Next, we assessed the performance of proprietary and open-source LLMs (versions), including GPT (3.5, 4, 4o, 4o-mini), Claude (3, 3.5), Gemini (1.0), Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3), across different interfaces (web and API), computing environments (cloud and local), and model precisions (with and without quantization). Finally, we assessed accuracy using a semiautomated pipeline. Results: Among the proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet (74.0%) achieved the highest accuracy, outperforming the top open-source models: Llama3.1-405b (64%), Llama3.1-70b (58.3%), and Mixtral-8x7b (54.3%). Among the quantized open-source models, the 6-bit quantized Phi3-14b (48.7%) performed best. The scores of the quantized models were comparable to those of the full-precision models Llama2-7b, Llama2-13b, and Gemma2-9b. Notably, VLM performance on image-containing questions did not improve when the images were provided and worsened when LLM-generated captions were provided. In contrast, a 10% increase in accuracy was observed when images were accompanied by human-crafted image descriptions. Conclusion: In conclusion, while LLMs exhibit robust zero-shot performance in medical reasoning, the integration of visual data remains a challenge for VLMs. Effective deployment involves carefully determining optimal model configurations, encouraging users to consider either the high performance of proprietary models or the flexible adaptability of open-source models.
comment: Manuscript Pages: 34, Figures: 7, Tables: 2, Supplementary File Pages: 35, Data Transparency Statement: Code is available at: https://github.com/Sdamirsa/LLM-VLM-in-Gastroenterology . Study data from American College of Gastroenterology (ACG) are restricted and available upon request with ACG permission. Correction: updated abstract considering Llama3.1 results
♻ ☆ Towards Measuring and Modeling "Culture" in LLMs: A Survey
We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs). We observe that none of the studies explicitly define "culture," which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture". We call these aspects the proxies of culture, and organize them across two dimensions of demographic and semantic proxies. We also categorize the probing methods employed. Our analysis indicates that only certain aspects of "culture," such as values and objectives, have been studied, leaving several other interesting and important facets, especially the multitude of semantic domains (Thompson et al., 2020) and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps are the lack of robustness of probing techniques and situated studies on the impact of cultural mis- and under-representation in LLM-based applications.
♻ ☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever EMNLP
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context window and multilingual retrieval. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
comment: 8 pages, references at pp7,8; EMNLP workshop submission
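For the Jina-ColBERT-v2 entry above, the sketch below shows the ColBERT-style late interaction (MaxSim) scoring it relies on: each query token embedding is matched to its most similar document token embedding and the maxima are summed. Random vectors stand in for the model's actual token embeddings, and the dimensions are arbitrary.

```python
# ColBERT-style late interaction (MaxSim) scoring with stand-in embeddings.
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """query_tokens: (Q, d), doc_tokens: (D, d); rows assumed L2-normalized."""
    sims = query_tokens @ doc_tokens.T            # (Q, D) cosine similarities
    return float(sims.max(axis=1).sum())          # best doc match per query token

rng = np.random.default_rng(0)
query = normalize(rng.normal(size=(8, 128)))       # 8 query token embeddings
doc_a = normalize(rng.normal(size=(120, 128)))     # candidate document A
doc_b = normalize(rng.normal(size=(40, 128)))      # candidate document B
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

Because documents are encoded independently of queries, token embeddings can be precomputed and indexed, which is the efficiency argument made in the abstract.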
♻ ☆ An Empirical Study on Information Extraction using Large Language Models
Human-like large language models (LLMs), especially the most powerful and popular ones in OpenAI's GPT family, have proven to be very helpful for many natural language processing (NLP) related tasks. Therefore, various attempts have been made to apply LLMs to information extraction (IE), which is a fundamental NLP task that involves extracting information from unstructured plain text. To demonstrate the latest representative progress in LLMs' information extraction ability, we assess the information extraction ability of GPT-4 (the latest version of GPT at the time of writing this paper) from four perspectives: Performance, Evaluation Criteria, Robustness, and Error Types. Our results suggest a visible performance gap between GPT-4 and state-of-the-art (SOTA) IE methods. To alleviate this problem, considering the LLMs' human-like characteristics, we propose and analyze the effects of a series of simple prompt-based methods, which can be generalized to other LLMs and NLP tasks. Rich experiments show our methods' effectiveness and some of their remaining issues in improving GPT-4's information extraction ability.
comment: This article has an original arXiv version entitled "Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors" (arXiv:2305.14450)
♻ ☆ Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs
Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals with 618 main effects and 138 interaction effects using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.
comment: 5 figures, 2 tables
♻ ☆ Enhancing Dialogue Generation in Werewolf Game Through Situation Analysis and Persuasion Strategies
Recent advancements in natural language processing, particularly with large language models (LLMs) like GPT-4, have significantly enhanced dialogue systems, enabling them to generate more natural and fluent conversations. Despite these improvements, challenges persist, such as managing continuous dialogues, memory retention, and minimizing hallucinations. AIWolfDial2024 addresses these challenges by employing the Werewolf Game, an incomplete information game, to test the capabilities of LLMs in complex interactive environments. This paper introduces an LLM-based Werewolf Game AI, where each role is supported by situation analysis to aid response generation. Additionally, for the werewolf role, various persuasion strategies, including logical appeal, credibility appeal, and emotional appeal, are employed to effectively persuade other players to align with its actions.
comment: Accepted to the AIWolfDial2024 workshop at INLG 2024
♻ ☆ SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives than at generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than at generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance solely through their own judgment.
♻ ☆ LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
♻ ☆ Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at https://github.com/ContextualAI/CLAIR_and_APO.
♻ ☆ RT-Surv: Improving Mortality Prediction After Radiotherapy with Large Language Model Structuring of Large-Scale Unstructured Electronic Health Records
Accurate patient selection is critical in radiotherapy (RT) to prevent ineffective treatments. Traditional survival prediction models, relying on structured data, often lack precision. This study explores the potential of large language models (LLMs) to structure unstructured electronic health record (EHR) data, thereby improving survival prediction accuracy through comprehensive clinical information integration. Data from 34,276 patients treated with RT at Yonsei Cancer Center between 2013 and 2023 were analyzed, encompassing both structured and unstructured data. An open-source LLM was used to structure the unstructured EHR data via single-shot learning, with its performance compared against a domain-specific medical LLM and a smaller variant. Survival prediction models were developed using statistical, machine learning, and deep learning approaches, incorporating both structured and LLM-structured data. Clinical experts evaluated the accuracy of the LLM-structured data. The open-source LLM achieved 87.5% accuracy in structuring unstructured EHR data without additional training, significantly outperforming the domain-specific medical LLM, which reached only 35.8% accuracy. Larger LLMs were more effective, particularly in extracting clinically relevant features like general condition and disease extent, which closely correlated with patient survival. Incorporating LLM-structured clinical features into survival prediction models significantly improved accuracy, with the C-index of deep learning models increasing from 0.737 to 0.820. These models also became more interpretable by emphasizing clinically significant factors. This study shows that general-domain LLMs, even without specific medical training, can effectively structure large-scale unstructured EHR data, substantially enhancing the accuracy and interpretability of clinical predictive models.
comment: 23 pages, 2 tables, 4 figures
♻ ☆ Learning to Ask: When LLMs Meet Unclear Instruction
Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the tool-use performance of LLMs under imperfect instructions, we meticulously examine real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that, due to the next-token prediction training objective, LLMs tend to arbitrarily generate missing arguments, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions of users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and to assess LLMs' tool-use performance from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that AwN significantly outperforms existing frameworks for tool learning on NoisyToolBench. We will release all related code and datasets to support future research.
♻ ☆ Predicting Drug-Gene Relations via Analogy Tasks with Word Embeddings
Natural language processing (NLP) is utilized in a wide range of fields, where words in text are typically transformed into feature vectors called embeddings. BioConceptVec is a specific example of embeddings tailored for biology, trained on approximately 30 million PubMed abstracts using models such as skip-gram. Generally, word embeddings are known to solve analogy tasks through simple vector arithmetic. For instance, $\mathrm{\textit{king}} - \mathrm{\textit{man}} + \mathrm{\textit{woman}}$ predicts $\mathrm{\textit{queen}}$. In this study, we demonstrate that BioConceptVec embeddings, along with our own embeddings trained on PubMed abstracts, contain information about drug-gene relations and can predict target genes from a given drug through analogy computations. We also show that categorizing drugs and genes using biological pathways improves performance. Furthermore, we illustrate that vectors derived from known relations in the past can predict unknown future relations in datasets divided by year. Despite the simplicity of implementing analogy tasks as vector additions, our approach demonstrated performance comparable to that of large language models such as GPT-4 in predicting drug-gene relations.
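For readers unfamiliar with analogy arithmetic over embeddings, the snippet below is a minimal, self-contained illustration of the computation described above (the offset of one known pair transferred to a new query, with the answer retrieved by cosine similarity). The vocabulary, dimensions, and vectors are toy placeholders, not BioConceptVec or the authors' trained embeddings.

```python
import numpy as np

# Toy embedding table; in the paper's setting these would be BioConceptVec or
# custom skip-gram vectors trained on PubMed abstracts (values here are made up).
vocab = ["drug_A", "drug_B", "gene_X", "gene_Y"]
emb = np.random.default_rng(0).normal(size=(len(vocab), 50))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

def analogy(a: str, b: str, c: str, k: int = 3):
    """Return the k nearest words to vec(b) - vec(a) + vec(c), excluding the inputs.

    This mirrors the classic king - man + woman ~= queen computation: the offset
    of a known (drug, gene) pair is transferred to a new drug c.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    query = emb[idx[b]] - emb[idx[a]] + emb[idx[c]]
    query /= np.linalg.norm(query)
    sims = emb @ query                      # cosine similarity (rows are unit norm)
    order = [i for i in np.argsort(-sims) if vocab[i] not in {a, b, c}]
    return [(vocab[i], float(sims[i])) for i in order[:k]]

# Example: transfer the drug_A -> gene_X relation to drug_B.
print(analogy("drug_A", "gene_X", "drug_B"))
```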
♻ ☆ Optimizing Byte-level Representation for End-to-end ASR
We propose a novel approach to optimizing a byte-level representation for end-to-end automatic speech recognition (ASR). Byte-level representations are often used by large-scale multilingual ASR systems when the character set of the supported languages is large. The compactness and universality of byte-level representations allow ASR models to use smaller output vocabularies and therefore provide more flexibility. UTF-8 is a commonly used byte-level representation for multilingual ASR, but it is not designed to optimize machine learning tasks directly. By using an auto-encoder and vector quantization, we show that we can optimize a byte-level representation for ASR and achieve better accuracy. Our proposed framework can incorporate information from different modalities and provides an error correction mechanism. In an English/Mandarin dictation task, we show that a bilingual ASR model built with this approach can outperform UTF-8 representation by 5% relative in error rate.
comment: 5 pages, 1 figure, IEEE SLT 2024
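As a rough illustration of the vector-quantization step such a framework could build on, the sketch below assigns continuous latent vectors to their nearest codebook entries; the codebook size, dimensionality, and data are made-up values, and this is not the proposed ASR system.

```python
import numpy as np

rng = np.random.default_rng(42)
codebook = rng.normal(size=(256, 64))   # 256 code vectors of dimension 64 (toy values)
latents = rng.normal(size=(10, 64))     # continuous encoder outputs for 10 frames

# Nearest-neighbour assignment: each latent is replaced by its closest code vector.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (10, 256)
codes = dists.argmin(axis=1)            # discrete byte-like symbols, one per frame
quantized = codebook[codes]             # vectors actually passed on to the decoder

print(codes[:5])
```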
♻ ☆ Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
Harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for Large language models' fine-tuning-as-a-service. While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that \textit{harmful perturbation} over the model weights is the root cause of the alignment being broken by harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage's optimization. The regularizer ensures that the model's harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at \url{https://github.com/git-disl/Booster}.
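Read literally, the regularizer described above compares the harmful loss before and after one simulated harmful gradient step. The snippet below is only a rough sketch of that idea on a toy linear model; the step size, gradient normalization, and the way the regularizer is combined with the alignment loss are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                       # toy stand-in for an LLM
align_x, align_y = torch.randn(8, 16), torch.randint(0, 2, (8,))
harm_x, harm_y = torch.randn(8, 16), torch.randint(0, 2, (8,))
ce = nn.CrossEntropyLoss()
alpha, lam = 0.1, 1.0                          # step size and regularizer weight (assumed)

def loss_with_params(params, x, y):
    w, b = params
    return ce(x @ w.t() + b, y)

params = tuple(model.parameters())
harm_before = loss_with_params(params, harm_x, harm_y)
grads = torch.autograd.grad(harm_before, params, create_graph=True)

# Simulate one harmful fine-tuning step (normalized gradient descent on the harmful loss).
gnorm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
perturbed = tuple(p - alpha * g / gnorm for p, g in zip(params, grads))
harm_after = loss_with_params(perturbed, harm_x, harm_y)

# Booster-style idea: keep the harmful-loss drop caused by the simulated step small,
# while still minimizing the ordinary alignment loss.
align_loss = loss_with_params(params, align_x, align_y)
total = align_loss + lam * (harm_before - harm_after)
total.backward()
print(float(align_loss), float(harm_before - harm_after))
```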
♻ ☆ Language-Guided World Models: A Model-Based Approach to AI Control ACL 2024
This paper introduces the concept of Language-Guided World Models (LWMs) -- probabilistic models that can simulate environments by reading texts. Agents equipped with these models provide humans with more extensive and efficient control, allowing them to simultaneously alter agent behaviors in multiple tasks via natural verbal communication. In this work, we take initial steps in developing robust LWMs that can generalize to compositionally novel language descriptions. We design a challenging world modeling benchmark based on the game of MESSENGER (Hanjie et al., 2021), featuring evaluation settings that require varying degrees of compositional generalization. Our experiments reveal the lack of generalizability of the state-of-the-art Transformer model, as it offers marginal improvements in simulation quality over a no-text baseline. We devise a more robust model by fusing the Transformer with the EMMA attention mechanism (Hanjie et al., 2021). Our model substantially outperforms the Transformer and approaches the performance of a model with an oracle semantic parsing and grounding capability. To demonstrate the practicality of this model in improving AI safety and transparency, we simulate a scenario in which the model enables an agent to present plans to a human before execution, and to revise plans based on their language feedback.
comment: SpLU-RoboNLP workshop at ACL 2024
Computer Vision and Pattern Recognition 136
☆ HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with issues of object repetition and structural artifacts, especially when scaling to 4K resolution and higher. We find that the problem stems from a single prompt providing insufficient guidance when generating at multiple scales. In response, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance utilizes patch-wise descriptions from MLLMs to elaborately guide the regional structure and texture generation. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. This further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.
☆ UC-NeRF: Uncertainty-aware Conditional Neural Radiance Fields from Endoscopic Sparse Views
Visualizing surgical scenes is crucial for revealing internal anatomical structures during minimally invasive procedures. Novel View Synthesis is a vital technique that offers geometry and appearance reconstruction, enhancing understanding, planning, and decision-making in surgical scenes. Despite the impressive achievements of Neural Radiance Field (NeRF), its direct application to surgical scenes produces unsatisfying results due to two challenges: endoscopic sparse views and significant photometric inconsistencies. In this paper, we propose uncertainty-aware conditional NeRF for novel view synthesis to tackle the severe shape-radiance ambiguity from sparse surgical views. The core of UC-NeRF is to incorporate the multi-view uncertainty estimation to condition the neural radiance field for modeling the severe photometric inconsistencies adaptively. Specifically, our UC-NeRF first builds a consistency learner in the form of multi-view stereo network, to establish the geometric correspondence from sparse views and generate uncertainty estimation and feature priors. In neural rendering, we design a base-adaptive NeRF network to exploit the uncertainty estimation for explicitly handling the photometric inconsistencies. Furthermore, an uncertainty-guided geometry distillation is employed to enhance geometry learning. Experiments on the SCARED and Hamlyn datasets demonstrate our superior performance in rendering appearance and geometry, consistently outperforming the current state-of-the-art approaches. Our code will be released at \url{https://github.com/wrld/UC-NeRF}.
☆ Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving
Large Vision-Language Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional and safe driving. Existing vision-language driving datasets focus primarily on scene understanding and decision-making, without providing explicit guidance on traffic rules and driving skills, which are critical aspects directly related to driving safety. To bridge this gap, we propose IDKB, a large-scale dataset containing over one million data items collected from various countries, including driving handbooks, theory test data, and simulated road test data. Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving from theory to practice. In particular, we conducted comprehensive tests on 15 LVLMs using IDKB to assess their reliability in the context of autonomous driving and provided extensive analysis. We also fine-tuned popular models, achieving notable performance improvements, which further validate the significance of our dataset. The project page can be found at: \url{https://4dvlab.github.io/project_page/idkb.html}
☆ SITAR: Semi-supervised Image Transformer for Action Recognition ICPR 2024
Recognizing actions from a limited set of labeled videos remains a challenge, as annotating visual data is not only tedious but can also be expensive owing to the classified nature of the data. Moreover, handling spatio-temporal data with deep 3D transformers for this purpose can introduce significant computational complexity. In this paper, our objective is to address video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos along with a collection of unlabeled videos in a compute-efficient manner. Specifically, we rearrange multiple frames from the input videos in row-column form to construct super images. Subsequently, we capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images. Our proposed approach employs two pathways to generate representations for temporally augmented super images originating from the same video. Specifically, we utilize a 2D image transformer to generate representations and apply a contrastive loss function to minimize the similarity between representations from different videos while maximizing the similarity of representations of identical videos. Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition across various benchmark datasets, all while significantly reducing computational costs.
comment: Accepted at ICPR 2024
☆ LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Expanding the long-context capabilities of Multi-modal Large Language Models~(MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as \textit{degraded performance with more images} and \textit{high computational costs}. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model \textbf{LongLLaVA}~(\textbf{Long}-Context \textbf{L}arge \textbf{L}anguage \textbf{a}nd \textbf{V}ision \textbf{A}ssistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
comment: 19 pages, 7 figures, 6 tables
☆ CanvOI, an Oncology Intelligence Foundation Model: Scaling FLOPS Differently
The rapidly evolving field of digital oncopathology faces significant challenges, including the need to address diverse and complex clinical questions, often involving rare conditions, with limited availability of labeled data. These limitations hinder the development of robust AI-driven tools in the biomedical space, where accuracy in probabilistic determinations is of utmost importance. To address this, digital pathology foundation models have begun to emerge, typically developed with the size and diversity of the pre-training dataset and model parameters in mind. Here, we present CanvOI, a ViT-g/10-based foundation model designed to enhance the capabilities of digital pathology by addressing these challenges through a different approach. Considering the unique nature of oncologic histopathological images and the requirements from the embeddings to provide meaningful representations for Multiple Instance Learning (MIL) downstream models, we chose to modify the input image characteristics. By introducing larger tile sizes (380 x 380 pixels) and smaller patch sizes (10 x 10 pixels), we were able to optimize the model's performance, pushing computational resources in a new direction and achieving state-of-the-art performance on cancer-related benchmarks. CanvOI demonstrated a 1.5-7.4% improvement in averaged AUC compared to other leading foundation models built for digital pathology. Moreover, our results demonstrate that CanvOI significantly outperformed the other models, with the performance gap widening substantially when trained on just 10% of the initial cohort. This work highlights an alternative approach that, if integrated with traditional development approaches, has the potential to advance Oncology Intelligence (OI), overcome some of the current barriers and ultimately improve the clinical outcome of cancer patients.
comment: 12 pages, 5 figures
☆ Multi-stream deep learning framework to predict mild cognitive impairment with Rey Complex Figure Test
Drawing tests like the Rey Complex Figure Test (RCFT) are widely used to assess cognitive functions such as visuospatial skills and memory, making them valuable tools for detecting mild cognitive impairment (MCI). Despite their utility, existing predictive models based on these tests often suffer from limitations like small sample sizes and lack of external validation, which undermine their reliability. We developed a multi-stream deep learning framework that integrates two distinct processing streams: a multi-head self-attention based spatial stream using raw RCFT images and a scoring stream employing a previously developed automated scoring system. Our model was trained on data from 1,740 subjects in the Korean cohort and validated on an external hospital dataset of 222 subjects from Korea. The proposed multi-stream model demonstrated superior performance over baseline models (AUC = 0.872, Accuracy = 0.781) in external validation. The integration of both spatial and scoring streams enables the model to capture intricate visual details from the raw images while also incorporating structured scoring data, which together enhance its ability to detect subtle cognitive impairments. This dual approach not only improves predictive accuracy but also increases the robustness of the model, making it more reliable in diverse clinical settings. Our model has practical implications for clinical settings, where it could serve as a cost-effective tool for early MCI screening.
comment: 20 pages, 3 figures, 2 tables
☆ Benchmarking Spurious Bias in Few-Shot Image Classifiers ECCV 2024
Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data but often show reliance on spurious correlations between classes and spurious attributes, known as spurious bias. Spurious correlations commonly hold in certain samples and few-shot classifiers can suffer from spurious bias induced from them. There is an absence of an automatic benchmarking system to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes so that using them for predictions can demonstrate poor performance. To construct these tasks, we propose attribute-based sample selection strategies based on a pre-trained vision-language model, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. Our code is available at https://github.com/gtzheng/FewSTAB.
comment: Accepted to ECCV 2024
☆ The Impact of Balancing Real and Synthetic Data on Accuracy and Fairness in Face Recognition ECCV 2024
Over the recent years, the advancements in deep face recognition have fueled an increasing demand for large and diverse datasets. Nevertheless, the authentic data acquired to create those datasets is typically sourced from the web, which, in many cases, can lead to significant privacy issues due to the lack of explicit user consent. Furthermore, obtaining a demographically balanced, large dataset is even more difficult because of the natural imbalance in the distribution of images from different demographic groups. In this paper, we investigate the impact of demographically balanced authentic and synthetic data, both individually and in combination, on the accuracy and fairness of face recognition models. Initially, several generative methods were used to balance the demographic representations of the corresponding synthetic datasets. Then a state-of-the-art face encoder was trained and evaluated using (combinations of) synthetic and authentic images. Our findings emphasized two main points: (i) the increased effectiveness of training data generated by diffusion-based models in enhancing accuracy, whether used alone or combined with subsets of authentic data, and (ii) the minimal impact of incorporating balanced data from pre-trained generative methods on fairness (in nearly all tested scenarios using combined datasets, fairness scores remained either unchanged or worsened, even when compared to unbalanced authentic datasets). Source code and data are available at \url{https://cutt.ly/AeQy1K5G} for reproducibility.
comment: Accepted at Synthetic Data for Computer Vision Workshop - Side Event at ECCV 2024
☆ Hybrid-Segmentor: A Hybrid Approach to Automated Fine-Grained Crack Segmentation in Civil Infrastructure
Detecting and segmenting cracks in infrastructure, such as roads and buildings, is crucial for safety and cost-effective maintenance. Despite the potential of deep learning, there are challenges in achieving precise results and handling diverse crack types. With the proposed dataset and model, we aim to enhance crack detection and infrastructure maintenance. We introduce Hybrid-Segmentor, an encoder-decoder based approach that is capable of extracting both fine-grained local and global crack features. This allows the model to improve its generalization capability in distinguishing cracks of various shapes, surfaces, and sizes. To keep the computational cost low for practical purposes while maintaining the model's high generalization capability, we incorporate a self-attention model at the encoder level and reduce the complexity of the decoder component. The proposed model outperforms existing benchmark models across 5 quantitative metrics (accuracy 0.971, precision 0.804, recall 0.744, F1-score 0.770, and IoU score 0.630), achieving state-of-the-art status.
comment: 25 pages, 6 figures
☆ Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models
Generating lifelike 3D humans from a single RGB image remains a challenging task in computer vision, as it requires accurate modeling of geometry, high-quality texture, and plausible unseen parts. Existing methods typically use multi-view diffusion models for 3D generation, but they often face inconsistent view issues, which hinder high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating 3D human from a single RGB image using Video Diffusion Models. Human-VDM provides temporally consistent views for 3D human generation using Gaussian Splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian Splatting module. First, a single image is fed into a human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and video interpolation to enhance the textures and geometric smoothness of the generated video. Finally, the 3D Human Gaussian Splatting module learns lifelike humans under the guidance of these high-resolution and view-consistent images. Experiments demonstrate that Human-VDM achieves high-quality 3D human from a single image, outperforming state-of-the-art methods in both generation quality and quantity. Project page: https://human-vdm.github.io/Human-VDM/
comment: 14 Pages, 8 figures, Project page: https://human-vdm.github.io/Human-VDM/
☆ MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
In stereo matching, CNNs have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose the Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, which enhances locality inductive bias by leveraging Masked Image Modeling (MIM) when training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method attempts to conduct both image reconstruction and depth prediction tasks. While this strategy is beneficial for resolving the data scarcity issue, the dual challenge of reconstructing masked tokens and subsequently performing stereo matching poses significant challenges, particularly in terms of training stability. To address this, we propose to use an auxiliary network (teacher), updated via Exponential Moving Average (EMA), along with the original stereo model (student), where teacher predictions serve as pseudo supervisory signals to effectively distill knowledge into the student model. State-of-the-art performance is achieved with the proposed method on several stereo matching benchmarks, such as ETH3D and KITTI 2015. Additionally, to demonstrate that our model effectively leverages locality inductive bias, we provide attention distance measurements.
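For context on the teacher-student setup above, an Exponential Moving Average (EMA) update of the teacher's weights typically amounts to the few lines sketched below; this is a generic illustration, not the authors' implementation, and the momentum value is an assumption.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.999):
    """Update teacher parameters as an exponential moving average of the student's."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Typical usage inside a training loop (teacher and student share the same architecture):
#   loss.backward(); optimizer.step(); ema_update(teacher, student)
```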
☆ iConFormer: Dynamic Parameter-Efficient Tuning with Input-Conditioned Adaptation
Transfer learning based on full fine-tuning (FFT) of the pre-trained encoder and task-specific decoder becomes increasingly complex as deep models grow exponentially. Parameter-efficient fine-tuning (PEFT) approaches using adapters consisting of small learnable layers have emerged as an alternative to FFT, achieving comparable performance while maintaining high training efficiency. However, the inflexibility of the adapter with respect to input instances limits its capability of learning task-specific information in diverse downstream tasks. In this paper, we propose a novel PEFT approach, input-Conditioned transFormer, termed iConFormer, that leverages a dynamic adapter conditioned on the input instances. To secure flexible learning ability on input instances in various downstream tasks, we introduce an input-Conditioned Network (iCoN) in the dynamic adapter that enables instance-level feature transformation. To be specific, iCoN generates channel-wise convolutional kernels for each feature and transforms it using an adaptive convolution process to effectively capture task-specific and fine-grained details tailored to downstream tasks. Experimental results demonstrate that by tuning just 1.6% to 2.8% of the Transformer backbone parameters, iConFormer achieves performance comparable to FFT in monocular depth estimation and semantic segmentation, while outperforming it in image classification and instance segmentation. Also, the proposed method consistently outperforms recent PEFT methods for all the tasks mentioned above.
☆ ExpLLM: Towards Chain of Thought for Facial Expression Recognition
Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. However, analyzing the causes of facial expressions is essential for accurately recognizing them. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe the AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions where GPT-4o frequently fails.
comment: project page: https://starhiking.github.io/ExpLLM_Page/
☆ Automatic facial axes standardization of 3D fetal ultrasound images
Craniofacial anomalies indicate early developmental disturbances and are usually linked to many genetic syndromes. Early diagnosis is critical, yet ultrasound (US) examinations often fail to identify these features. This study presents an AI-driven tool to assist clinicians in standardizing fetal facial axes/planes in 3D US, reducing sonographer workload and facilitating the facial evaluation. Our network, structured into three blocks-feature extractor, rotation and translation regression, and spatial transformer-processes three orthogonal 2D slices to estimate the necessary transformations for standardizing the facial planes in the 3D US. These transformations are applied to the original 3D US using a differentiable module (the spatial transformer block), yielding a standardized 3D US and the corresponding 2D facial standard planes. The dataset used consists of 1180 fetal facial 3D US images acquired between weeks 20 and 35 of gestation. Results show that our network considerably reduces inter-observer rotation variability in the test set, with a mean geodesic angle difference of 14.12$^{\circ}$ $\pm$ 18.27$^{\circ}$ and an Euclidean angle error of 7.45$^{\circ}$ $\pm$ 14.88$^{\circ}$. These findings demonstrate the network's ability to effectively standardize facial axes, crucial for consistent fetal facial assessments. In conclusion, the proposed network demonstrates potential for improving the consistency and accuracy of fetal facial assessments in clinical settings, facilitating early evaluation of craniofacial anomalies.
☆ Deep Learning Meets Satellite Images -- An Evaluation on Handcrafted and Learning-based Features for Multi-date Satellite Stereo Images ECCV2024
A critical step in digital surface model (DSM) generation is feature matching. Off-track (or multi-date) satellite stereo images, in particular, can challenge the performance of feature matching due to spectral distortions between images, long baselines, and wide intersection angles. Feature matching methods have evolved over the years from handcrafted methods (e.g., SIFT) to learning-based methods (e.g., SuperPoint and SuperGlue). In this paper, we compare the performance of different features, also known as feature extraction and matching methods, applied to satellite imagery. A wide range of stereo pairs (~500) covering two separate study sites is used. SIFT, as a widely used classic feature extraction and matching algorithm, is compared with seven deep-learning matching methods: SuperGlue, LightGlue, LoFTR, ASpanFormer, DKM, GIM-LightGlue, and GIM-DKM. Results demonstrate that traditional matching methods are still competitive in this age of deep learning, although for particular scenarios learning-based methods are very promising.
comment: ECCV2024 Workshop - TradiCV
☆ MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
☆ UnLearning from Experience to Avoid Spurious Correlations
While deep neural networks can achieve state-of-the-art performance in many tasks, these models are more fragile than they appear. They are prone to learning spurious correlations in their training data, leading to surprising failure cases. In this paper, we propose a new approach that addresses the issue of spurious correlations: UnLearning from Experience (ULE). Our method is based on using two classification models trained in parallel: student and teacher models. Both models receive the same batches of training data. The student model is trained with no constraints and pursues the spurious correlations in the data. The teacher model is trained to solve the same classification problem while avoiding the mistakes of the student model. As training is done in parallel, the better the student model learns the spurious correlations, the more robust the teacher model becomes. The teacher model uses the gradient of the student's output with respect to its input to unlearn mistakes made by the student. We show that our method is effective on the Waterbirds, CelebA, Spawrious and UrbanCars datasets.
comment: 10 pages
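The abstract above does not spell out how the teacher consumes the student's input gradient, so the sketch below only illustrates the two-model setup and the saliency-style signal it describes, using a toy classifier and a scalar loss as a stand-in for the student's output; how that signal enters the teacher's objective is left to the paper.

```python
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
ce = nn.CrossEntropyLoss()

x = torch.randn(16, 32, requires_grad=True)      # one training batch (toy data)
y = torch.randint(0, 2, (16,))

# Student path: trained without constraints, so it is free to latch onto shortcuts.
student_loss = ce(student(x), y)

# Gradient of the student's (scalar) loss w.r.t. the input: a saliency-style map of
# which input features the student currently relies on.
input_grad = torch.autograd.grad(student_loss, x, retain_graph=True)[0]

# Teacher path: solves the same task; in ULE this saliency signal is used to steer the
# teacher away from the student's mistakes (the exact loss is defined in the paper).
teacher_loss = ce(teacher(x), y)

print(input_grad.shape, float(student_loss), float(teacher_loss))
```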
☆ Validation of musculoskeletal segmentation model with uncertainty estimation for bone and muscle assessment in hip-to-knee clinical CT images
Deep learning-based image segmentation has allowed for the fully automated, accurate, and rapid analysis of musculoskeletal (MSK) structures from medical images. However, current approaches were either applied only to 2D cross-sectional images, addressed few structures, or were validated on small datasets, which limit the application in large-scale databases. This study aimed to validate an improved deep learning model for volumetric MSK segmentation of the hip and thigh with uncertainty estimation from clinical computed tomography (CT) images. Databases of CT images from multiple manufacturers/scanners, disease status, and patient positioning were used. The segmentation accuracy, and accuracy in estimating the structures volume and density, i.e., mean HU, were evaluated. An approach for segmentation failure detection based on predictive uncertainty was also investigated. The model has shown an overall improvement with respect to all segmentation accuracy and structure volume/density evaluation metrics. The predictive uncertainty yielded large areas under the receiver operating characteristic (AUROC) curves (AUROCs>=.95) in detecting inaccurate and failed segmentations. The high segmentation and muscle volume/density estimation accuracy, along with the high accuracy in failure detection based on the predictive uncertainty, exhibited the model's reliability for analyzing individual MSK structures in large-scale CT databases.
comment: 29 pages, 7+10supp figures, 8 tables
☆ CLDA: Collaborative Learning for Enhanced Unsupervised Domain Adaptation
Unsupervised Domain Adaptation (UDA) endeavors to bridge the gap between a model trained on a labeled source domain and its deployment in an unlabeled target domain. However, current high-performance models demand significant resources, resulting in prohibitive deployment costs and highlighting the need for small yet effective models. For UDA of lightweight models, Knowledge Distillation (KD) in a Teacher-Student framework can be a common approach, but we find that domain shift in UDA leads to a significant increase in non-salient parameters in the teacher model, degrading the model's generalization ability and transferring misleading information to the student model. Interestingly, we observed that this phenomenon occurs considerably less in the student model. Driven by this insight, we introduce Collaborative Learning, a method that updates the teacher's non-salient parameters using the student model and, at the same time, enhances the student's performance using the updated teacher model. Experiments across various tasks and datasets show consistent performance improvements for both student and teacher models. For example, in semantic segmentation, CLDA achieves an improvement of +0.7% mIoU for the teacher and +1.4% mIoU for the student compared to the baseline model on the GTA to Cityscapes benchmark. On Synthia to Cityscapes, it achieves an improvement of +0.8% mIoU for the teacher and +2.0% mIoU for the student.
☆ Rethinking HTG Evaluation: Bridging Generation and Recognition
The evaluation of generative models for natural image tasks has been extensively studied. Similar protocols and metrics are used in cases with unique particularities, such as Handwriting Generation, even if they might not be completely appropriate. In this work, we introduce three measures tailored for HTG evaluation, $ \text{HTG}_{\text{HTR}} $, $ \text{HTG}_{\text{style}} $, and $ \text{HTG}_{\text{OOV}} $, and argue that they are more expedient to evaluate the quality of generated handwritten images. The metrics rely on the recognition error/accuracy of Handwriting Text Recognition and Writer Identification models and emphasize writing style, textual content, and diversity as the main aspects that adhere to the content of handwritten images. We conduct comprehensive experiments on the IAM handwriting database, showcasing that widely used metrics such as FID fail to properly quantify the diversity and the practical utility of generated handwriting samples. Our findings show that our metrics are richer in information and underscore the necessity of standardized evaluation protocols in HTG. The proposed metrics provide a more robust and informative protocol for assessing HTG quality, contributing to improved performance in HTR. Code for the evaluation protocol is available at: https://github.com/koninik/HTG_evaluation.
☆ Improved Single Camera BEV Perception Using Multi-Camera Training SC 2024
Bird's Eye View (BEV) map prediction is essential for downstream autonomous driving tasks like trajectory prediction. In the past, this was accomplished through the use of a sophisticated sensor configuration that captured a surround view from multiple cameras. However, in large-scale production, cost efficiency is an optimization goal, so using fewer cameras becomes more relevant. But using fewer input images comes with a performance drop. This raises the problem of developing a BEV perception model that provides sufficient performance on a low-cost sensor setup. Although primarily relevant for inference on production cars, this cost restriction is less problematic for a test vehicle during training. Therefore, the objective of our approach is to reduce the aforementioned performance drop as much as possible using a modern multi-camera surround view model reduced to single-camera inference. The approach includes three features: a modern masking technique, a cyclic Learning Rate (LR) schedule, and a feature reconstruction loss for supervising the transition from six-camera inputs to one-camera input during training. Our method outperforms versions trained strictly with one camera or strictly with six-camera surround view for single-camera inference, resulting in reduced hallucination and a better-quality BEV map.
comment: This Paper has been accepted to the 27th IEEE International Conference on Intelligent Transportation Systems (ITSC 2024)
☆ Multi-Head Attention Residual Unfolded Network for Model-Based Pansharpening
The objective of pansharpening and hypersharpening is to accurately combine a high-resolution panchromatic (PAN) image with a low-resolution multispectral (MS) or hyperspectral (HS) image, respectively. Unfolding fusion methods integrate the powerful representation capabilities of deep learning with the robustness of model-based approaches. These techniques involve unrolling the steps of the optimization scheme derived from the minimization of an energy into a deep learning framework, resulting in efficient and highly interpretable architectures. In this paper, we propose a model-based deep unfolded method for satellite image fusion. Our approach is based on a variational formulation that incorporates the classic observation model for MS/HS data, a high-frequency injection constraint based on the PAN image, and an arbitrary convex prior. For the unfolding stage, we introduce upsampling and downsampling layers that use geometric information encoded in the PAN image through residual networks. The backbone of our method is a multi-head attention residual network (MARNet), which replaces the proximity operator in the optimization scheme and combines multiple head attentions with residual learning to exploit image self-similarities via nonlocal operators defined in terms of patches. Additionally, we incorporate a post-processing module based on the MARNet architecture to further enhance the quality of the fused images. Experimental results on PRISMA, Quickbird, and WorldView2 datasets demonstrate the superior performance of our method and its ability to generalize across different sensor configurations and varying spatial and spectral resolutions. The source code will be available at https://github.com/TAMI-UIB/MARNet.
☆ Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection
The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection over these years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via data perturbations, our method can reprogram a pretrained VLM model (e.g., CLIP) solely by manipulating its input without tuning the inner parameters. Furthermore, we insert a pseudo-word guided by facial identity into the text prompt. Extensive experiments on several popular benchmarks demonstrate that (1) the cross-dataset and cross-manipulation performance of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in the cross-dataset setting from FF++ to WildDeepfake) using a pre-trained CLIP model with our proposed reprogramming method; (2) our superior performance comes at a lower cost in trainable parameters, making it a promising approach for real-world applications.
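Model reprogramming in the sense used above can be pictured as learning an additive input perturbation around a completely frozen model. The sketch below uses a tiny frozen CNN as a stand-in for the CLIP encoder purely to make the idea concrete; the perturbation shape, optimizer, and loss are illustrative assumptions rather than the paper's setup.

```python
import torch
import torch.nn as nn

# Frozen pretrained model (toy stand-in for the CLIP image encoder plus a detection head).
frozen = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                       nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
for p in frozen.parameters():
    p.requires_grad_(False)

# The only trainable quantity: a universal additive perturbation applied to every input.
delta = nn.Parameter(torch.zeros(1, 3, 64, 64))
optimizer = torch.optim.Adam([delta], lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.rand(4, 3, 64, 64)                 # toy face crops
labels = torch.randint(0, 2, (4,))                # 0 = real, 1 = fake

for _ in range(10):                               # a few reprogramming steps
    optimizer.zero_grad()
    logits = frozen((images + delta).clamp(0, 1)) # model weights never change
    loss = criterion(logits, labels)
    loss.backward()                               # gradients flow only into delta
    optimizer.step()
print(float(loss))
```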
☆ PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation
While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose \textbf{PoseTalk}, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latent from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4\% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, i.e., CoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low-to-high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate our pose prediction strategy achieves better pose diversity and realness compared to text-only or audio-only, and our video generator model outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.
comment: 7+5 pages, 15 figures
☆ Skip-and-Play: Depth-Driven Pose-Preserved Image Generation for Any Objects
The emergence of diffusion models has enabled the generation of diverse high-quality images solely from text, prompting subsequent efforts to enhance the controllability of these models. Despite the improvement in controllability, pose control remains limited to specific objects (e.g., humans) or poses (e.g., frontal view) due to the fact that pose is generally controlled via camera parameters (e.g., rotation angle) or keypoints (e.g., eyes, nose). Specifically, camera parameters-conditional pose control models generate unrealistic images depending on the object, owing to the small size of 3D datasets for training. Also, keypoint-based approaches encounter challenges in acquiring reliable keypoints for various objects (e.g., church) or poses (e.g., back view). To address these limitations, we propose depth-based pose control, as depth maps are easily obtainable from a single depth estimation model regardless of objects and poses, unlike camera parameters and keypoints. However, depth-based pose control confronts issues of shape dependency, as depth maps influence not only the pose but also the shape of the generated images. To tackle this issue, we propose Skip-and-Play (SnP), designed via analysis of the impact of three components of depth-conditional ControlNet on the pose and the shape of the generated images. To be specific, based on the analysis, we selectively skip parts of the components to mitigate shape dependency on the depth map while preserving the pose. Through various experiments, we demonstrate the superiority of SnP over baselines and showcase the ability of SnP to generate images of diverse objects and poses. Remarkably, SnP exhibits the ability to generate images even when the objects in the condition (e.g., a horse) and the prompt (e.g., a hedgehog) differ from each other.
☆ Creating a Microstructure Latent Space with Rich Material Information for Multiphase Alloy Design
The intricate microstructure serves as the cornerstone for the composition/processing-structure-property (CPSP) connection in multiphase alloys. Traditional alloy design methods often overlook microstructural details, which diminishes the reliability and effectiveness of the outcomes. This study introduces an improved alloy design algorithm that integrates authentic microstructural information to establish precise CPSP relationships. The approach utilizes a deep-learning framework based on a variational autoencoder to map real microstructural data to a latent space, enabling the prediction of composition, processing steps, and material properties from the latent space vector. By integrating this deep learning model with a specific sampling strategy in the latent space, a novel, microstructure-centered algorithm for multiphase alloy design is developed. This algorithm is demonstrated through the design of a unified dual-phase steel, and the results are assessed at three performance levels. Moreover, an exploration into the latent vector space of the model highlights its seamless interpolation ability and its rich material information content. Notably, the current configuration of the latent space is particularly advantageous for alloy design, offering an exhaustive representation of microstructure, composition, processing, and property variations essential for multiphase alloys.
☆ Learning-Based Error Detection System for Advanced Vehicle Instrument Cluster Rendering
The automotive industry is currently expanding digital display options with every new model that comes onto the market. This entails not just an expansion in dimensions, resolution, and customization choices, but also the capability to employ novel display effects like overlays while assembling the content of the display cluster. Unfortunately, this raises the need for appropriate monitoring systems that can detect rendering errors and apply appropriate countermeasures when required. Classical solutions such as Cyclic Redundancy Checks (CRC) will soon no longer be viable as any sort of alpha blending, warping, or scaling of content can cause unwanted CRC violations. Therefore, we propose a novel monitoring approach to verify correctness of displayed content using telltales (e.g. warning signs) as an example. It uses a learning-based approach to separate "good" telltales, i.e. those that a human driver will understand correctly, and "corrupted" telltales, i.e. those that will not be visible or perceived correctly. As a result, it possesses inherent resilience against individual pixel errors and implicitly supports changing backgrounds, overlay or scaling effects. This is underlined by our experimental study where all "corrupted" test patterns were correctly classified, while no false alarms were triggered.
comment: 9 pages
☆ MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos
Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated under camera egomotion interference and the absence of affordance labels to explicitly guide the optimization of hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba, where the camera wearer's egomotion is integrated to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments conducted on five public datasets with the existing and our proposed new evaluation metrics demonstrate that MADiff predicts comparably reasonable hand trajectories compared to the state-of-the-art baselines, and achieves real-time performance. We will release our code and pretrained models of MADiff at the project page: https://irmvlab.github.io/madiff.github.io.
☆ Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and to improve audio-portrait movement correlation. This method removes the need for manually specified spatial motion templates used in existing methods to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.
☆ AdvSecureNet: A Python Toolkit for Adversarial Machine Learning
Machine learning models are vulnerable to adversarial attacks. Several tools have been developed to research these vulnerabilities, but they often lack comprehensive features and flexibility. We introduce AdvSecureNet, a PyTorch-based toolkit for adversarial machine learning that is the first to natively support multi-GPU setups for attacks, defenses, and evaluation. It is the first toolkit that supports both CLI and API interfaces and external YAML configuration files to enhance versatility and reproducibility. The toolkit includes multiple attacks, defenses, and evaluation metrics. Rigorous software engineering practices are followed to ensure high code quality and maintainability. The project is available as open source on GitHub at https://github.com/melihcatal/advsecurenet and installable via PyPI.
☆ GoT-CQA: Graph-of-Thought Guided Compositional Reasoning for Chart Question Answering
Chart Question Answering (CQA) aims at answering questions based on the visual chart content, which plays an important role in chart summarization, business data analysis, and data report generation. CQA is a challenging multi-modal task because of the strong context dependence and complex reasoning requirement. The former refers to answering this question strictly based on the analysis of the visual content or internal data of the given chart, while the latter emphasizes the various logical and numerical reasoning involved in the answer prediction process. In this paper, we focus on the complex reasoning in the CQA task, and propose a novel Graph-of-Thought (GoT) guided compositional reasoning model called GoT-CQA to overcome this problem. First, we transform the chart-oriented question into a directed acyclic GoT composed of multiple operator nodes, including localization, numerical, and logical operators. It intuitively reflects the human brain's solution process to this question. After that, we design an efficient auto-compositional reasoning framework guided by the GoT, to execute the multi-step reasoning operations in various types of questions. Comprehensive experiments on ChartQA and PlotQA-D datasets show that GoT-CQA achieves outstanding performance, especially in complex human-written and reasoning questions, compared with the latest popular baselines.
☆ A Medical Multimodal Large Language Model for Pediatric Pneumonia
Pediatric pneumonia is the leading cause of death among children under five years worldwide, imposing a substantial burden on affected families. Currently, there are three significant hurdles in diagnosing and treating pediatric pneumonia. Firstly, pediatric pneumonia shares similar symptoms with other respiratory diseases, making rapid and accurate differential diagnosis challenging. Secondly, primary hospitals often lack sufficient medical resources and experienced doctors. Lastly, providing personalized diagnostic reports and treatment recommendations is labor-intensive and time-consuming. To tackle these challenges, we proposed a Medical Multimodal Large Language Model for Pediatric Pneumonia (P2Med-MLLM). It was capable of handling diverse clinical tasks, such as generating free-text radiology reports and medical records within a unified framework. Specifically, P2Med-MLLM can process both pure text and image-text data, trained on an extensive and large-scale dataset (P2Med-MD), including real clinical information from 163,999 outpatient and 8,684 inpatient cases. This dataset comprised 2D chest X-ray images, 3D chest CT images, corresponding radiology reports, and outpatient and inpatient records. We designed a three-stage training strategy to enable P2Med-MLLM to comprehend medical knowledge and follow instructions for various clinical tasks. To rigorously evaluate P2Med-MLLM's performance, we developed P2Med-MBench, a benchmark consisting of 642 meticulously verified samples by pediatric pulmonology specialists, covering six clinical decision-support tasks and a balanced variety of diseases. The automated scoring results demonstrated the superiority of P2Med-MLLM. This work plays a crucial role in assisting primary care doctors with prompt disease diagnosis and treatment planning, reducing severe symptom mortality rates, and optimizing the allocation of medical resources.
comment: 18 pages, 10 figures
☆ A Fashion Item Recommendation Model in Hyperbolic Space CVPR 2024
In this work, we propose a fashion item recommendation model that incorporates hyperbolic geometry into user and item representations. Using hyperbolic space, our model aims to capture implicit hierarchies among items based on their visual data and users' purchase history. During training, we apply a multi-task learning framework that considers both hyperbolic and Euclidean distances in the loss function. Our experiments on three data sets show that our model performs better than previous models trained in Euclidean space only, confirming the effectiveness of our model. Our ablation studies show that multi-task learning plays a key role, and removing the Euclidean loss substantially deteriorates the model performance.
comment: This work was presented at the CVFAD Workshop at CVPR 2024
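As a rough illustration of the multi-task objective described above, the sketch below combines a margin ranking loss on Poincaré-ball distances with a Euclidean counterpart; the margin, the weighting coefficient `lam`, and the function names are assumptions, not the authors' exact formulation.

```python
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Geodesic distance in the Poincare ball (curvature -1); inputs are assumed
    # to lie inside the unit ball.
    sq_dist = torch.sum((u - v) ** 2, dim=-1)
    norm_u = torch.clamp(1.0 - torch.sum(u ** 2, dim=-1), min=eps)
    norm_v = torch.clamp(1.0 - torch.sum(v ** 2, dim=-1), min=eps)
    return torch.acosh(1.0 + 2.0 * sq_dist / (norm_u * norm_v))

def multitask_ranking_loss(user, pos_item, neg_item, margin=0.1, lam=0.5):
    # Margin ranking loss computed with the hyperbolic distance, plus a
    # Euclidean counterpart weighted by the hypothetical coefficient `lam`.
    def rank(dist_fn):
        return torch.relu(dist_fn(user, pos_item) - dist_fn(user, neg_item) + margin).mean()
    hyperbolic_term = rank(poincare_distance)
    euclidean_term = rank(lambda a, b: torch.norm(a - b, dim=-1))
    return hyperbolic_term + lam * euclidean_term
```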
☆ SurgTrack: CAD-Free 3D Tracking of Real-world Surgical Instruments
Vision-based surgical navigation has received increasing attention due to its non-invasive, cost-effective, and flexible advantages. In particular, a critical element of the vision-based navigation system is tracking surgical instruments. Compared with 2D instrument tracking methods, 3D instrument tracking has broader value in clinical practice, but is also more challenging due to weak texture, occlusion, and lack of Computer-Aided Design (CAD) models for 3D registration. To solve these challenges, we propose SurgTrack, a two-stage 3D instrument tracking method for CAD-free and robust real-world applications. In the first registration stage, we incorporate an Instrument Signed Distance Field (SDF) modeling the 3D representation of instruments, achieving CAD-free 3D registration. Due to this, we can obtain the location and orientation of instruments in the 3D space by matching the video stream with the registered SDF model. In the second tracking stage, we devise a posture graph optimization module, leveraging the historical tracking results of the posture memory pool to optimize the tracking results and improve the occlusion robustness. Furthermore, we collect the Instrument3D dataset to comprehensively evaluate the 3D tracking of surgical instruments. Extensive experiments validate the superiority and scalability of SurgTrack, which outperforms the state of the art by a remarkable margin. The code and dataset are available at https://github.com/wenwucode/SurgTrack.
☆ BMI Prediction from Handwritten English Characters Using a Convolutional Neural Network
A person's Body Mass Index, or BMI, is the most widely used parameter for assessing their health. BMI is a crucial predictor of potential diseases that may arise at higher body fat levels because it is correlated with body fat. Conversely, a community's or an individual's nutritional status can be determined using the BMI. Although deep learning models are used in several studies to estimate BMI from face photos and other data, no previous research established a clear connection between deep learning techniques for handwriting analysis and BMI prediction. This article addresses this research gap with a deep learning approach to estimating BMI from handwritten characters by developing a convolutional neural network (CNN). A dataset containing samples from 48 people in lowercase English scripts is successfully captured for the BMI prediction task. The proposed CNN-based approach reports a commendable accuracy of 99.92%. Performance comparison with other popular CNN architectures reveals that AlexNet and InceptionV3 achieve the second- and third-best performance, with accuracies of 99.69% and 99.53%, respectively.
☆ Object Gaussian for Monocular 6D Pose Estimation from Sparse Views
Monocular object pose estimation, as a pivotal task in computer vision and robotics, heavily depends on accurate 2D-3D correspondences, which often demand costly CAD models that may not be readily available. Object 3D reconstruction methods offer an alternative, among which recent advancements in 3D Gaussian Splatting (3DGS) afford a compelling potential. Yet its performance still suffers and tends to overfit with fewer input views. Embracing this challenge, we introduce SGPose, a novel framework for sparse view object pose estimation using Gaussian-based methods. Given as few as ten views, SGPose generates a geometry-aware representation by starting with a random cuboid initialization, eschewing reliance on Structure-from-Motion (SfM) pipeline-derived geometry as required by traditional 3DGS methods. SGPose removes the dependence on CAD models by regressing dense 2D-3D correspondences between images and the reconstructed model from sparse input and random initialization, while geometry-consistent depth supervision and online synthetic view warping are key to its success. Experiments on typical benchmarks, especially on the Occlusion LM-O dataset, demonstrate that SGPose outperforms existing methods even under sparse view constraints, underscoring its potential in real-world applications.
☆ Solving Video Inverse Problems Using Image Diffusion Models
Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling strategy that encourages consistency across batches by synchronizing the stochastic noise components in image diffusion models. Our approach synergistically combines batch-consistent sampling with simultaneous optimization of denoised spatio-temporal batches at each reverse diffusion step, resulting in a novel and efficient diffusion sampling strategy for video inverse problems. Experimental results demonstrate that our method effectively addresses various spatio-temporal degradations in video inverse problems, achieving state-of-the-art reconstructions. Project page: https://solving-video-inverse.github.io/main/
comment: 22 pages, 16 figures
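The batch-consistent sampling idea, treating the video's time axis as the image model's batch axis and synchronizing the stochastic noise across it, can be sketched as follows; the function name and tensor shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch

def batch_consistent_noise(num_frames: int, channels: int, height: int, width: int,
                           device: str = "cpu") -> torch.Tensor:
    # Draw a single noise map and broadcast it across the batch axis, so that
    # every frame in the denoised spatio-temporal batch shares the same
    # stochastic component at a given reverse-diffusion step.
    shared = torch.randn(1, channels, height, width, device=device)
    return shared.expand(num_frames, -1, -1, -1)

# Inside a DDPM/DDIM-style reverse step, one would replace the usual
# per-sample torch.randn_like(x) with batch_consistent_noise(*x.shape).
```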
☆ Evaluation Study on SAM 2 for Class-agnostic Instance-level Segmentation
Segment Anything Model (SAM) has demonstrated powerful zero-shot segmentation performance in natural scenes. The recently released Segment Anything Model 2 (SAM2) has further heightened researchers' expectations towards image segmentation capabilities. To evaluate the performance of SAM2 on class-agnostic instance-level segmentation tasks, we adopt different prompt strategies for SAM2 to cope with instance-level tasks for three relevant scenarios: Salient Instance Segmentation (SIS), Camouflaged Instance Segmentation (CIS), and Shadow Instance Detection (SID). In addition, to further explore the effectiveness of SAM2 in segmenting granular object structures, we also conduct detailed tests on the high-resolution Dichotomous Image Segmentation (DIS) benchmark to assess the fine-grained segmentation capability. Qualitative and quantitative experimental results indicate that the performance of SAM2 varies significantly across different scenarios. Besides, SAM2 is not particularly sensitive to segmenting high-resolution fine details. We hope this technical report can drive the emergence of SAM2-based adapters, aiming to enhance the performance ceiling of large vision models on class-agnostic instance segmentation tasks.
☆ How Do You Perceive My Face? Recognizing Facial Expressions in Multi-Modal Context by Modeling Mental Representations
Facial expression perception in humans inherently relies on prior knowledge and contextual cues, contributing to efficient and flexible processing. For instance, multi-modal emotional context (such as voice color, affective text, body pose, etc.) can prompt people to perceive emotional expressions in objectively neutral faces. Drawing inspiration from this, we introduce a novel approach for facial expression classification that goes beyond simple classification tasks. Our model accurately classifies a perceived face and synthesizes the corresponding mental representation perceived by a human when observing a face in context. With this, our model offers visual insights into its internal decision-making process. We achieve this by learning two independent representations of content and context using a VAE-GAN architecture. Subsequently, we propose a novel attention mechanism for context-dependent feature adaptation. The adapted representation is used for classification and to generate a context-augmented expression. We evaluate synthesized expressions in a human study, showing that our model effectively produces approximations of human mental representations. We achieve State-of-the-Art classification accuracies of 81.01% on the RAVDESS dataset and 79.34% on the MEAD dataset. We make our code publicly available.
comment: GCPR 2024
☆ Interacting Multiple Model-based Joint Homography Matrix and Multiple Object State Estimation
A novel MOT algorithm, IMM Joint Homography State Estimation (IMM-JHSE), is proposed. By jointly modelling the camera projection matrix as part of track state vectors, IMM-JHSE removes the explicit influence of camera motion compensation techniques on predicted track position states, which was prevalent in previous approaches. Expanding upon this, static and dynamic camera motion models are combined through the use of an IMM filter. A simple bounding box motion model is used to predict bounding box positions to incorporate image plane information. In addition to applying an IMM to camera motion, a non-standard IMM approach is applied where bounding-box-based BIoU scores are mixed with ground-plane-based Mahalanobis distances in an IMM-like fashion to perform association only. Finally, IMM-JHSE makes use of dynamic process and measurement noise estimation techniques. IMM-JHSE improves upon related techniques on the DanceTrack and KITTI-car datasets, increasing HOTA by 2.64 and 2.11, respectively, while offering competitive performance on the MOT17, MOT20 and KITTI-pedestrian datasets.
comment: Preprint submitted to Information Fusion
☆ Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation
Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
comment: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
☆ Real-Time Dynamic Scale-Aware Fusion Detection Network: Take Road Damage Detection as an example
Unmanned Aerial Vehicle (UAV)-based Road Damage Detection (RDD) is important for daily maintenance and safety in cities, especially in terms of significantly reducing labor costs. However, current UAV-based RDD research still faces many challenges. For example, damage with irregular size and orientation, the masking of damage by the background, and the difficulty of distinguishing damage from the background significantly affect the ability of UAVs to detect road damage in daily inspection. To solve these problems and improve the performance of UAVs in real-time road damage detection, we design and propose three corresponding modules: a feature extraction module that flexibly adapts to shape and background; a module that fuses multiscale perception and adapts to shape and background; and an efficient downsampling module. Based on these modules, we designed a multi-scale, adaptive road damage detection model with the ability to automatically remove background interference, called the Dynamic Scale-Aware Fusion Detection Model (RT-DSAFDet). Experimental results on the UAV-PDD2023 public dataset show that our model RT-DSAFDet achieves a mAP50 of 54.2%, which is 11.1% higher than that of YOLOv10-m, an efficient variant of the latest real-time object detection model YOLOv10, while the number of parameters is reduced to 1.8M and FLOPs to 4.6G, decreases of 88% and 93%, respectively. Furthermore, experiments on the large-scale general object detection dataset MS COCO2017 also show the superiority of our model: its mAP50-95 matches that of YOLOv9-t, with 0.5% higher mAP50, 10% fewer parameters, and 40% fewer FLOPs.
☆ UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching
Unlike other vision tasks where Transformer-based approaches are becoming increasingly common, stereo depth estimation is still dominated by convolution-based approaches. This is mainly due to the limited availability of real-world ground truth for stereo matching, which is a limiting factor in improving the performance of Transformer-based stereo approaches. In this paper, we propose UniTT-Stereo, a method to maximize the potential of Transformer-based stereo architectures by unifying self-supervised learning used for pre-training with stereo matching framework based on supervised learning. To be specific, we explore the effectiveness of reconstructing features of masked portions in an input image and at the same time predicting corresponding points in another image from the perspective of locality inductive bias, which is crucial in training models with limited training data. Moreover, to address these challenging tasks of reconstruction-and-prediction, we present a new strategy to vary a masking ratio when training the stereo model with stereo-tailored losses. State-of-the-art performance of UniTT-Stereo is validated on various benchmarks such as ETH3D, KITTI 2012, and KITTI 2015 datasets. Lastly, to investigate the advantages of the proposed approach, we provide a frequency analysis of feature maps and the analysis of locality inductive bias based on attention maps.
☆ StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models ECCV2024
Despite the burst of innovative methods for controlling the diffusion process, effectively controlling image styles in text-to-image generation remains a challenging task. Many adapter-based methods impose image representation conditions on the denoising process to accomplish image control. However these conditions are not aligned with the word embedding space, leading to interference between image and text control conditions and the potential loss of semantic information from the text prompt. Addressing this issue involves two key challenges. Firstly, how to inject the style representation without compromising the effectiveness of text representation in control. Secondly, how to obtain the accurate style representation from a single reference image. To tackle these challenges, we introduce StyleTokenizer, a zero-shot style control image generation method that aligns style representation with text representation using a style tokenizer. This alignment effectively minimizes the impact on the effectiveness of text prompts. Furthermore, we collect a well-labeled style dataset named Style30k to train a style feature extractor capable of accurately representing style while excluding other content information. Experimental results demonstrate that our method fully grasps the style characteristics of the reference image, generating appealing images that are consistent with both the target image style and text prompt. The code and dataset are available at https://github.com/alipay/style-tokenizer.
comment: Accepted by ECCV2024
☆ Sample what you can't compress
For learned image representations, basic autoencoders often produce blurry results. Reconstruction quality can be improved by incorporating additional penalties such as adversarial (GAN) and perceptual losses. Arguably, these approaches lack a principled interpretation. Concurrently, in generative settings diffusion has demonstrated a remarkable ability to create crisp, high quality results and has solid theoretical underpinnings (from variational inference to direct study as the Fisher Divergence). Our work combines autoencoder representation learning with diffusion and is, to our knowledge, the first to demonstrate the efficacy of jointly learning a continuous encoder and decoder under a diffusion-based loss. We demonstrate that this approach yields better reconstruction quality as compared to GAN-based autoencoders while being easier to tune. We also show that the resulting representation is easier to model with a latent diffusion model as compared to the representation obtained from a state-of-the-art GAN-based loss. Since our decoder is stochastic, it can generate details not encoded in the otherwise deterministic latent representation; we therefore name our approach "Sample what you can't compress", or SWYCC for short.
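A very loose sketch of what a diffusion-based reconstruction loss for an autoencoder might look like, assuming hypothetical `encoder` and `denoiser` modules and a simple linear noise schedule; this is not the paper's actual training objective, only the general shape of conditioning a denoiser on the encoder's latent so gradients reach both jointly.

```python
import torch
import torch.nn.functional as F

def diffusion_decoder_loss(encoder, denoiser, x, num_steps: int = 1000):
    # x: (B, C, H, W) clean images; encoder and denoiser are hypothetical modules.
    z = encoder(x)                                          # continuous latent code
    t = torch.randint(0, num_steps, (x.shape[0],), device=x.device)
    beta = torch.linspace(1e-4, 0.02, num_steps, device=x.device)
    alpha_bar = torch.cumprod(1.0 - beta, dim=0)[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x)
    x_t = alpha_bar.sqrt() * x + (1.0 - alpha_bar).sqrt() * noise
    pred = denoiser(x_t, t, z)                              # denoiser conditioned on z
    return F.mse_loss(pred, noise)                          # gradients flow into the encoder too
```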
☆ SG-MIM: Structured Knowledge Guided Efficient Pre-training for Dense Prediction
Masked Image Modeling (MIM) techniques have redefined the landscape of computer vision, enabling pre-trained models to achieve exceptional performance across a broad spectrum of tasks. Despite their success, the full potential of MIM-based methods in dense prediction tasks, particularly in depth estimation, remains untapped. Existing MIM approaches primarily rely on single-image inputs, which makes it challenging to capture the crucial structured information, leading to suboptimal performance in tasks requiring fine-grained feature representation. To address these limitations, we propose SG-MIM, a novel Structured knowledge Guided Masked Image Modeling framework designed to enhance dense prediction tasks by utilizing structured knowledge alongside images. SG-MIM employs a lightweight relational guidance framework, allowing it to guide structured knowledge individually at the feature level rather than naively combining at the pixel level within the same architecture, as is common in traditional multi-modal pre-training methods. This approach enables the model to efficiently capture essential information while minimizing discrepancies between pre-training and downstream tasks. Furthermore, SG-MIM employs a selective masking strategy to incorporate structured knowledge, maximizing the synergy between general representation learning and structured knowledge-specific learning. Our method requires no additional annotations, making it a versatile and efficient solution for a wide range of applications. Our evaluations on the KITTI, NYU-v2, and ADE20k datasets demonstrate SG-MIM's superiority in monocular depth estimation and semantic segmentation.
☆ TLD: A Vehicle Tail Light signal Dataset and Benchmark
Understanding other drivers' intentions is crucial for safe driving. The role of taillights in conveying these intentions is underemphasized in current autonomous driving systems. Accurately identifying taillight signals is essential for predicting vehicle behavior and preventing collisions. Open-source taillight datasets are scarce, often small and inconsistently annotated. To address this gap, we introduce a new large-scale taillight dataset called TLD. Sourced globally, our dataset covers diverse traffic scenarios. To our knowledge, TLD is the first dataset to separately annotate brake lights and turn signals in real driving scenarios. We collected 17.78 hours of driving videos from the internet. This dataset consists of 152k labeled image frames sampled at a rate of 2 Hz, along with 1.5 million unlabeled frames interspersed throughout. Additionally, we have developed a two-stage vehicle light detection model consisting of two primary modules: a vehicle detector and a taillight classifier. Initially, YOLOv10 and DeepSORT captured consecutive vehicle images over time. Subsequently, the two classifiers work simultaneously to determine the states of the brake lights and turn signals. A post-processing procedure is then used to eliminate noise caused by misidentifications and provide the taillight states of the vehicle within a given time frame. Our method shows exceptional performance on our dataset, establishing a benchmark for vehicle taillight detection. The dataset is available at https://huggingface.co/datasets/ChaiJohn/TLD/tree/main
☆ A Learnable Color Correction Matrix for RAW Reconstruction BMVC2024
Autonomous driving algorithms usually employ sRGB images as model input due to their compatibility with the human visual system. However, visually pleasing sRGB images are possibly sub-optimal for downstream tasks when compared to RAW images. The availability of RAW images is constrained by the difficulties in collecting real-world driving data and the associated challenges of annotation. To address this limitation and support research in RAW-domain driving perception, we design a novel and ultra-lightweight RAW reconstruction method. The proposed model introduces a learnable color correction matrix (CCM), which uses only a single convolutional layer to approximate the complex inverse image signal processor (ISP). Experimental results demonstrate that simulated RAW (simRAW) images generated by our method provide performance improvements equivalent to those produced by more complex inverse ISP methods when pretraining RAW-domain object detectors, which highlights the effectiveness and practicality of our approach.
comment: Accepted by BMVC2024
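A minimal sketch of the core idea, assuming the single convolutional layer is a 1x1, bias-free convolution trained against paired RAW data with an L1 loss; the kernel size and loss choice are assumptions, not details confirmed by the abstract.

```python
import torch
import torch.nn as nn

class LearnableCCM(nn.Module):
    """A single 1x1 convolution (3 -> 3 channels) mapping an sRGB image toward a
    simulated RAW image, standing in for the far more complex inverse ISP."""
    def __init__(self):
        super().__init__()
        self.ccm = nn.Conv2d(3, 3, kernel_size=1, bias=False)

    def forward(self, srgb: torch.Tensor) -> torch.Tensor:
        return self.ccm(srgb)

# Hypothetical training step with paired sRGB/RAW tensors:
# model = LearnableCCM()
# loss = torch.nn.functional.l1_loss(model(srgb_batch), raw_batch)
```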
☆ Plane2Depth: Hierarchical Adaptive Plane Guidance for Monocular Depth Estimation
Monocular depth estimation aims to infer a dense depth map from a single image, which is a fundamental and prevalent task in computer vision. Many previous works have shown impressive depth estimation results through carefully designed network structures, but they usually ignore the planar information and therefore perform poorly in low-texture areas of indoor scenes. In this paper, we propose Plane2Depth, which adaptively utilizes plane information to improve depth prediction within a hierarchical framework. Specifically, in the proposed plane guided depth generator (PGDG), we design a set of plane queries as prototypes to softly model planes in the scene and predict per-pixel plane coefficients. Then the predicted plane coefficients can be converted into metric depth values with the pinhole camera model. In the proposed adaptive plane query aggregation (APGA) module, we introduce a novel feature interaction approach to improve the aggregation of multi-scale plane features in a top-down manner. Extensive experiments show that our method can achieve outstanding performance, especially in low-texture or repetitive areas. Furthermore, under the same backbone network, our method outperforms the state-of-the-art methods on the NYU-Depth-v2 dataset, achieves competitive results with state-of-the-art methods on the KITTI dataset, and can be generalized to unseen scenes effectively.
comment: 14 pages, 12 figures, 8 tables
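The conversion from per-pixel plane coefficients to metric depth under a pinhole camera can be sketched as below, assuming the plane is parameterized as n·X + d = 0; the paper's exact parameterization may differ, so this is an illustrative assumption.

```python
import numpy as np

def plane_to_depth(n: np.ndarray, d: float, K: np.ndarray, uv: tuple) -> float:
    """Depth at pixel uv=(u, v) for a plane n.X + d = 0 under intrinsics K.
    A point on the pixel ray is X = z * K^{-1} [u, v, 1]^T, so substituting into
    the plane equation gives z = -d / (n . K^{-1} [u, v, 1]^T)."""
    p_homog = np.array([uv[0], uv[1], 1.0])
    ray = np.linalg.inv(K) @ p_homog
    return -d / float(n @ ray)
```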
☆ Reliable Deep Diffusion Tensor Estimation: Rethinking the Power of Data-Driven Optimization Routine
Diffusion tensor imaging (DTI) holds significant importance in clinical diagnosis and neuroscience research. However, conventional model-based fitting methods often suffer from sensitivity to noise, leading to decreased accuracy in estimating DTI parameters. While traditional data-driven deep learning methods have shown potential in terms of accuracy and efficiency, their limited generalization to out-of-training-distribution data impedes their broader application due to the diverse scan protocols used across centers, scanners, and studies. This work aims to tackle these challenges and promote the use of DTI by introducing a data-driven optimization-based method termed DoDTI. DoDTI combines the weighted linear least squares fitting algorithm and regularization by denoising technique. The former fits DW images from diverse acquisition settings into diffusion tensor field, while the latter applies a deep learning-based denoiser to regularize the diffusion tensor field instead of the DW images, which is free from the limitation of fixed-channel assignment of the network. The optimization object is solved using the alternating direction method of multipliers and then unrolled to construct a deep neural network, leveraging a data-driven strategy to learn network parameters. Extensive validation experiments are conducted utilizing both internally simulated datasets and externally obtained in-vivo datasets. The results, encompassing both qualitative and quantitative analyses, showcase that the proposed method attains state-of-the-art performance in DTI parameter estimation. Notably, it demonstrates superior generalization, accuracy, and efficiency, rendering it highly reliable for widespread application in the field.
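For context, the classical weighted linear least-squares tensor fit that DoDTI builds on can be sketched per voxel as follows; the signal-squared weighting is a common convention, and the code is an illustrative single-voxel fit, not the authors' unrolled optimization pipeline.

```python
import numpy as np

def wlls_tensor_fit(signals: np.ndarray, bvals: np.ndarray, bvecs: np.ndarray) -> np.ndarray:
    """Weighted linear least-squares fit of one voxel's diffusion tensor.
    signals: (N,) DW signals; bvals: (N,); bvecs: (N, 3) unit gradient directions.
    Returns (ln S0, Dxx, Dyy, Dzz, Dxy, Dxz, Dyz) for ln S = ln S0 - b * g^T D g."""
    gx, gy, gz = bvecs[:, 0], bvecs[:, 1], bvecs[:, 2]
    B = np.stack([np.ones_like(bvals),
                  -bvals * gx * gx, -bvals * gy * gy, -bvals * gz * gz,
                  -2 * bvals * gx * gy, -2 * bvals * gx * gz, -2 * bvals * gy * gz],
                 axis=1)
    y = np.log(np.clip(signals, 1e-6, None))
    W = np.diag(signals ** 2)            # common WLLS weighting choice
    return np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
```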
☆ TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT
While Multi-Object Tracking (MOT) has made substantial advancements, it is limited by heavy reliance on prior knowledge and is restricted to predefined categories. In contrast, Generic Multiple Object Tracking (GMOT), tracking multiple objects with similar appearance, requires less prior information about the targets but faces challenges with variants like viewpoint, lighting, occlusion, and resolution. Our contributions commence with the introduction of the Refer-GMOT dataset, a collection of videos, each accompanied by fine-grained textual descriptions of their attributes. Subsequently, we introduce a novel text prompt-based open-vocabulary GMOT framework, called TP-GMOT, which can track never-seen object categories with zero training examples. Within the TP-GMOT framework, we introduce two novel components: (i) TP-OD, an object detection module driven by a textual prompt, for accurately detecting unseen objects with specific characteristics, and (ii) Motion-Appearance Cost SORT (MAC-SORT), a novel object association approach that adeptly integrates motion and appearance-based matching strategies to tackle the complex task of tracking multiple generic objects with high similarity. Our contributions are benchmarked on the Refer-GMOT dataset for the GMOT task. Additionally, to assess the generalizability of the proposed TP-GMOT framework and the effectiveness of the MAC-SORT tracker, we conduct ablation studies on the DanceTrack and MOT20 datasets for the MOT task. Our dataset, code, and models will be publicly available at: https://fsoft-aic.github.io/TP-GMOT
☆ Boosting Generalizability towards Zero-Shot Cross-Dataset Single-Image Indoor Depth by Meta-Initialization IROS 2024
Indoor robots rely on depth to perform tasks like navigation or obstacle detection, and single-image depth estimation is widely used to assist perception. Most indoor single-image depth prediction methods pay less attention to generalizability to unseen datasets, which matters for in-the-wild robustness during system deployment. This work leverages gradient-based meta-learning to gain higher generalizability on zero-shot cross-dataset inference. Unlike the well-studied meta-learning of image classification associated with explicit class labels, no explicit task boundaries exist for continuous depth values tied to highly varying indoor environments regarding object arrangement and scene composition. We propose a fine-grained task formulation that treats each RGB-D mini-batch as a task in our meta-learning formulation. We first show that our method on limited data induces a much better prior (up to 27.8% in RMSE). Then, fine-tuning on the meta-learned initialization consistently outperforms baselines without the meta approach. Aiming at generalization, we propose zero-shot cross-dataset protocols and validate the higher generalizability induced by our meta-initialization, which serves as a simple and useful plugin to many existing depth estimation methods. This work at the intersection of depth estimation and meta-learning potentially drives both lines of research a step closer to practical robotic and machine perception usage.
comment: IROS 2024. The version supersedes 2305.07269. arXiv admin note: text overlap with arXiv:2305.07269
☆ TASAR: Transferable Attack on Skeletal Action Recognition
Skeletal sequences, as well-structured representations of human behaviors, are crucial in Human Activity Recognition (HAR). The transferability of adversarial skeletal sequences enables attacks in real-world HAR scenarios, such as autonomous driving, intelligent surveillance, and human-computer interactions. However, existing Skeleton-based HAR (S-HAR) attacks exhibit weak adversarial transferability and, therefore, cannot be considered true transfer-based S-HAR attacks. More importantly, the reason for this failure remains unclear. In this paper, we study this phenomenon through the lens of loss surface, and find that its sharpness contributes to the poor transferability in S-HAR. Inspired by this observation, we assume and empirically validate that smoothening the rugged loss landscape could potentially improve adversarial transferability in S-HAR. To this end, we propose the first Transfer-based Attack on Skeletal Action Recognition, TASAR. TASAR explores the smoothed model posterior without re-training the pre-trained surrogates, which is achieved by a new post-train Dual Bayesian optimization strategy. Furthermore, unlike previous transfer-based attacks that treat each frame independently and overlook temporal coherence within sequences, TASAR incorporates motion dynamics into the Bayesian attack gradient, effectively disrupting the spatial-temporal coherence of S-HARs. To exhaustively evaluate the effectiveness of existing methods and our method, we build the first large-scale robust S-HAR benchmark, comprising 7 S-HAR models, 10 attack methods, 3 S-HAR datasets and 2 defense models. Extensive results demonstrate the superiority of TASAR. Our benchmark enables easy comparisons for future studies, with the code available in the supplementary material.
comment: arXiv admin note: text overlap with arXiv:2407.08572
☆ Volumetric Surfaces: Representing Fuzzy Geometries with Multiple Meshes
High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods generally are the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). These problems are exacerbated on low-performance graphics hardware, e.g. on mobile devices. We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed layer order from outermost to innermost. We model mesh layers as SDF shells with optimal spacing learned during training. After baking, we fit UV textures to the corresponding meshes. We show that our method can represent challenging fuzzy objects while achieving higher frame rates than volume-based and splatting-based methods on low-end and mobile devices.
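The sorting-free rendering idea, compositing semi-transparent layers in a fixed outermost-to-innermost order, can be sketched with plain front-to-back alpha blending; the array shapes and names here are assumptions for illustration.

```python
import numpy as np

def composite_fixed_order(layer_rgb: np.ndarray, layer_alpha: np.ndarray) -> np.ndarray:
    """Front-to-back alpha compositing of per-pixel samples taken from a fixed,
    pre-defined layer order (outermost shell first), so no per-ray sorting is needed.
    layer_rgb: (L, H, W, 3); layer_alpha: (L, H, W)."""
    height, width = layer_alpha.shape[1:]
    color = np.zeros((height, width, 3))
    transmittance = np.ones((height, width, 1))
    for rgb, alpha in zip(layer_rgb, layer_alpha):
        alpha = alpha[..., None]
        color += transmittance * alpha * rgb   # accumulate contribution of this layer
        transmittance *= (1.0 - alpha)         # attenuate what later (inner) layers can add
    return color
```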
☆ FrameCorr: Adaptive, Autoencoder-based Neural Compression for Video Reconstruction in Resource and Timing Constrained Network Settings
Despite the growing adoption of video processing via Internet of Things (IoT) devices due to their cost-effectiveness, transmitting captured data to nearby servers poses challenges due to varying timing constraints and scarcity of network bandwidth. Existing video compression methods face difficulties in recovering compressed data when incomplete data is provided. Here, we introduce FrameCorr, a deep-learning-based solution that utilizes previously received data to predict the missing segments of a frame, enabling the reconstruction of a frame from partially received data.
☆ Detecting Korean Food Using Image using Hierarchical Model
We present a solution for Korean food lovers with dietary restrictions to identify a Korean dish before consuming it. By uploading a clear photo of the dish, users can learn what they are eating. Image processing techniques combined with machine learning make this identification possible.
☆ Non-target Divergence Hypothesis: Toward Understanding Domain Gaps in Cross-Modal Knowledge Distillation
Compared to single-modal knowledge distillation, cross-modal knowledge distillation faces more severe challenges due to domain gaps between modalities. Although a variety of solutions have been proposed to overcome these challenges, there is still limited research on how domain gaps affect cross-modal knowledge distillation. This paper provides an in-depth analysis and evaluation of this issue. We first introduce the Non-Target Divergence Hypothesis (NTDH) to reveal the impact of domain gaps on cross-modal knowledge distillation. Our key finding is that domain gaps between modalities lead to distribution differences in non-target classes, and the smaller these differences, the better the performance of cross-modal knowledge distillation. Subsequently, based on Vapnik-Chervonenkis (VC) theory, we derive the upper and lower bounds of the approximation error for cross-modal knowledge distillation, thereby theoretically validating the NTDH. Finally, experiments on five cross-modal datasets further confirm the validity, generalisability, and applicability of the NTDH.
☆ Training-free Color-Style Disentanglement for Constrained Text-to-Image Synthesis
We consider the problem of independently, in a disentangled fashion, controlling the outputs of text-to-image diffusion models with color and style attributes of a user-supplied reference image. We present the first training-free, test-time-only method to disentangle and condition text-to-image models on color and style attributes from reference image. To realize this, we propose two key innovations. Our first contribution is to transform the latent codes at inference time using feature transformations that make the covariance matrix of current generation follow that of the reference image, helping meaningfully transfer color. Next, we observe that there exists a natural disentanglement between color and style in the LAB image space, which we exploit to transform the self-attention feature maps of the image being generated with respect to those of the reference computed from its L channel. Both these operations happen purely at test time and can be done independently or merged. This results in a flexible method where color and style information can come from the same reference image or two different sources, and a new generation can seamlessly fuse them in either scenario.
comment: 16 pages, 17 figures
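One standard way to realize the covariance-matching transform mentioned above is a whitening-coloring transform over the feature (or latent) channels, sketched below; this is a generic formulation under assumed (C, N) feature matrices, not necessarily the authors' exact procedure.

```python
import torch

def match_covariance(x: torch.Tensor, ref: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Whiten the features of `x` and re-color them with the covariance (and mean)
    of `ref`. Both inputs are (C, N): channels x spatial locations."""
    def center(f):
        mu = f.mean(dim=1, keepdim=True)
        return f - mu, mu

    xc, _ = center(x)
    rc, ref_mu = center(ref)
    eye = torch.eye(x.shape[0], device=x.device, dtype=x.dtype)
    cov_x = xc @ xc.T / (xc.shape[1] - 1) + eps * eye
    cov_r = rc @ rc.T / (rc.shape[1] - 1) + eps * eye

    def sqrt_and_inv_sqrt(cov):
        # Symmetric PSD matrix square root via eigendecomposition.
        evals, evecs = torch.linalg.eigh(cov)
        evals = evals.clamp_min(eps)
        return (evecs @ torch.diag(evals.sqrt()) @ evecs.T,
                evecs @ torch.diag(evals.rsqrt()) @ evecs.T)

    _, whiten_x = sqrt_and_inv_sqrt(cov_x)   # whitening transform for the generation
    color_r, _ = sqrt_and_inv_sqrt(cov_r)    # coloring transform from the reference
    return color_r @ (whiten_x @ xc) + ref_mu
```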
☆ Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering
Recent empirical studies have demonstrated that diffusion models can effectively learn the image distribution and generate new samples. Remarkably, these models can achieve this even with a small number of training samples despite a large image dimension, circumventing the curse of dimensionality. In this work, we provide theoretical insights into this phenomenon by leveraging key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) a union of manifold structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to assume the underlying data distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. With these setups, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit the phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, facilitating image editing. We validate these results with corroborated experimental results on both simulated distributions and image datasets.
comment: 39 pages, 9 figures
☆ MOSMOS: Multi-organ segmentation facilitated by medical report supervision
Owing to a large amount of multi-modal data in modern medical systems, such as medical images and reports, Medical Vision-Language Pre-training (Med-VLP) has demonstrated incredible achievements in coarse-grained downstream tasks (i.e., medical classification, retrieval, and visual question answering). However, the problem of transferring knowledge learned from Med-VLP to fine-grained multi-organ segmentation tasks has barely been investigated. Multi-organ segmentation is challenging mainly due to the lack of large-scale fully annotated datasets and the wide variation in the shape and size of the same organ between individuals with different diseases. In this paper, we propose a novel pre-training & fine-tuning framework for Multi-Organ Segmentation by harnessing Medical repOrt Supervision (MOSMOS). Specifically, we first introduce global contrastive learning to maximally align the medical image-report pairs in the pre-training stage. To remedy the granularity discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags. More importantly, our pre-trained models can be transferred to any segmentation model by introducing the pixel-tag attention maps. Different network settings, i.e., 2D U-Net and 3D UNETR, are utilized to validate the generalization. We have extensively evaluated our approach using different diseases and modalities on BTCV, AMOS, MMWHS, and BRATS datasets. Experimental results in various settings demonstrate the effectiveness of our framework. This framework can serve as the foundation to facilitate future research on automatic annotation tasks under the supervision of medical reports.
comment: 14 pages, 7 figures
☆ Local map Construction Methods with SD map: A Novel Survey
In recent years, significant academic advancements have been made in the field of autonomous vehicles, with Local maps emerging as a crucial component of autonomous driving technology. Local maps not only provide intricate details of road networks but also serve as fundamental inputs for critical tasks such as vehicle localization, navigation, and decision-making. Given the characteristics of the SD map (Standard Definition Map), which include low cost, ease of acquisition, and high versatility, perception methods that integrate the SD map as prior information have demonstrated significant potential in the field of Local map perception. The purpose of this paper is to provide researchers with a comprehensive overview and summary of the latest advancements in the integration of SD map as prior information for Local map perception methods. This review begins by introducing the task definition and general pipeline of local map perception methods that incorporate SD maps as prior information, along with relevant public datasets. It then focuses on the representation and encoding methods of multi-source information, as well as the methods for fusing multi-source information. In response to this burgeoning trend, this article presents a comprehensive and meticulous overview of the diverse research efforts in this particular field. Finally, the article addresses pertinent issues and future challenges with the aim of guiding researchers in understanding the current trends and methodologies prevalent in the field.
comment: 14 pages, 11 figures
☆ Hadamard Row-Wise Generation Algorithm
In this paper, we introduce an efficient algorithm for generating specific Hadamard rows, addressing the memory demands of pre-computing the entire matrix. Leveraging Sylvester's recursive construction, our method generates the required $i$-th row on demand, significantly reducing computational resources. The algorithm uses the Kronecker product to construct the desired row from the binary representation of the index, without creating the full matrix. This approach is particularly useful for single-pixel imaging systems that need only one row at a time.
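A small sketch of the row-on-demand idea under Sylvester's construction: since H_{2^k} = H_2^{⊗k}, the i-th row is the Kronecker product of H_2 rows selected by the bits of i (most-significant bit first). Function names are illustrative, not taken from the paper.

```python
import numpy as np

H2 = np.array([[1, 1], [1, -1]], dtype=np.int8)

def hadamard_row(i: int, k: int) -> np.ndarray:
    """Row i of the 2**k x 2**k Sylvester Hadamard matrix, built on demand as a
    Kronecker product of H2 rows selected by the bits of i (MSB first)."""
    row = np.array([1], dtype=np.int8)
    for bit in range(k - 1, -1, -1):
        row = np.kron(row, H2[(i >> bit) & 1])
    return row

if __name__ == "__main__":
    # Sanity check against the fully materialized matrix for a small size.
    k = 4
    H = np.array([[1]], dtype=np.int8)
    for _ in range(k):
        H = np.kron(H2, H)
    assert all(np.array_equal(hadamard_row(i, k), H[i]) for i in range(2 ** k))
```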
☆ Neural Dynamics Model of Visual Decision-Making: Learning from Human Experts
Uncovering the fundamental neural correlates of biological intelligence, developing mathematical models, and conducting computational simulations are critical for advancing new paradigms in artificial intelligence (AI). In this study, we implemented a comprehensive visual decision-making model that spans from visual input to behavioral output, using a neural dynamics modeling approach. Drawing inspiration from the key components of the dorsal visual pathway in primates, our model not only aligns closely with human behavior but also reflects neural activity in primates, achieving accuracy comparable to convolutional neural networks (CNNs). Moreover, magnetic resonance imaging (MRI) identified key neuroimaging features such as structural connections and functional connectivity that are associated with performance in perceptual decision-making tasks. A neuroimaging-informed fine-tuning approach was introduced and applied to the model, leading to performance improvements that paralleled the behavioral variations observed among subjects. Compared to classical deep learning models, our model more accurately replicates the behavioral performance of biological intelligence, relying on the structural characteristics of biological neural networks rather than extensive training data, and demonstrating enhanced resilience to perturbation.
☆ Multi-modal Situated Reasoning in 3D Scenes
Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding are limited in data modality, diversity, scale, and task scope. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide text, image, and point cloud for situation and question description, resolving ambiguity in previous single-modality convention (e.g., text). Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' situated reasoning for navigation. Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate the efficacy of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models.
comment: Project page: https://msr3d.github.io/
☆ Unified Framework with Consistency across Modalities for Human Activity Recognition BMVC 2024
Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on single input modalities, such as RGB or skeletal data, limiting their ability to exploit the complementary advantages across modalities. Recent studies focus on combining these two modalities using simple feature fusion techniques. However, due to the inherent disparities in representation between these input modalities, designing a unified neural network architecture to effectively leverage their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is the introduction of a novel compositional query machine, called COMPUTER (COMPositional hUman-cenTric quERy machine), a generic neural architecture that models the interactions between a human of interest and its surroundings in both space and time. Thanks to its versatile design, COMPUTER can be leveraged to distill distinctive representations for various input modalities. Additionally, we introduce a consistency loss that enforces agreement in prediction between modalities, exploiting the complementary information from multimodal inputs for robust human movement recognition. Through extensive experiments on action localization and group activity recognition tasks, our approach demonstrates superior performance when compared with state-of-the-art methods. Our code is available at: https://github.com/tranxuantuyen/COMPUTER.
comment: Accepted to BMVC 2024
☆ GGS: Generalizable Gaussian Splatting for Lane Switching in Autonomous Driving
We propose GGS, a Generalizable Gaussian Splatting method for Autonomous Driving, which can achieve realistic rendering under large viewpoint changes. Previous generalizable 3D Gaussian splatting methods are limited to rendering novel views that are very close to the original pair of images, which cannot handle large differences in viewpoint. Especially in autonomous driving scenarios, images are typically collected from a single lane. The limited training perspective makes rendering images of a different lane very challenging. To further improve the rendering capability of GGS under large viewpoint changes, we introduce a novel virtual lane generation module into the GGS method to enable high-quality lane switching even without a multi-lane dataset. Besides, we design a diffusion loss to supervise the generation of virtual lane images to further address the problem of lack of data in the virtual lanes. Finally, we also propose a depth refinement module to optimize depth estimation in the GGS model. Extensive validation of our method, compared to existing approaches, demonstrates state-of-the-art performance.
☆ Coral Model Generation from Single Images for Virtual Reality Applications
With the rapid development of VR technology, the demand for high-quality 3D models is increasing. Traditional methods struggle with efficiency and quality in large-scale customization. This paper introduces a deep-learning framework that generates high-precision 3D coral models from a single image. Using the Coral dataset, the framework extracts geometric and texture features, performs 3D reconstruction, and optimizes design and material blending. Advanced optimization and polygon count control ensure shape accuracy, detail retention, and flexible output for various complexities, catering to high-quality rendering and real-time interaction needs. The project incorporates Explainable AI (XAI) to transform AI-generated models into interactive "artworks," best viewed in VR and XR. This enhances model interpretability and human-machine collaboration. Real-time feedback in VR interactions displays information like coral species and habitat, enriching user experience. The generated models surpass traditional methods in detail, visual quality, and efficiency. This research offers an intelligent approach to 3D content creation for VR, lowering production barriers, and promoting widespread VR applications. Additionally, integrating XAI provides new insights into AI-generated visual content and advances research in 3D vision interpretability.
comment: In Proceedings of Explainable AI for the Arts Workshop 2024 (XAIxArts 2024) arXiv:2406.14485
Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing
Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from intriguing observations: among a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness in the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identifies editing directions with desirable properties: homogeneity, transferability, composability, and linearity. These properties of LOCO Edit benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The code will be released at https://github.com/ChicyChen/LOCO-Edit.
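For readers curious how editing directions can be obtained from the posterior mean predictor (PMP), the sketch below illustrates the general recipe described above: take the Jacobian of the denoiser output with respect to its input at a fixed noise level and use its leading singular vectors as edit directions. The tiny MLP stands in for the actual diffusion denoiser and the step size is arbitrary; consult the released code for the paper's actual procedure.

```python
import torch

# Toy stand-in for the posterior mean predictor (PMP); in practice this would be
# the diffusion model's denoiser evaluated at a fixed intermediate noise level.
d = 64
pmp = torch.nn.Sequential(torch.nn.Linear(d, 128), torch.nn.Tanh(), torch.nn.Linear(128, d))

x_t = torch.randn(d)  # a noisy sample at some noise level

# Jacobian of the PMP output w.r.t. its input, shape (d, d)
J = torch.autograd.functional.jacobian(lambda x: pmp(x), x_t)

# Singular vectors of the Jacobian; the leading right-singular vectors span a
# low-dimensional subspace in which edits can be applied.
U, S, Vh = torch.linalg.svd(J)
edit_direction = Vh[0]                   # top direction in the input space
edited = x_t + 2.0 * edit_direction      # step size chosen arbitrarily here
```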
☆ Unfolding Videos Dynamics via Taylor Expansion
Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR) for pretraining on UCF101 or Kinetics and test on standard benchmarks including video retrieval, action recognition, and action detection. Performance is enhanced by a significant margin without the need for large models or extensive datasets.
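The temporal derivatives that ViDiDi feeds to the encoder can be approximated with simple finite differences of the frame sequence, as in this minimal sketch; the derivative order and the forward-difference scheme are illustrative assumptions, and the paper may use a different discretization.

```python
import numpy as np

def temporal_derivatives(frames, order=2):
    """Return the frame sequence plus its first `order` temporal derivatives,
    approximated by forward finite differences (each derivative is one frame shorter)."""
    views = [frames]
    current = frames
    for _ in range(order):
        current = np.diff(current, axis=0)  # difference along the time axis
        views.append(current)
    return views

video = np.random.rand(16, 3, 64, 64).astype(np.float32)  # (T, C, H, W) clip
zeroth, first, second = temporal_derivatives(video, order=2)
```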
☆ Pluralistic Salient Object Detection
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image. Unlike conventional SOD methods that produce a single segmentation mask for salient objects, this new setting recognizes the inherent complexity of real-world images, comprising multiple objects, and the ambiguity in defining salient objects due to different user intentions. To study this task, we present two new SOD datasets, "DUTS-MM" and "DUTS-MQ", along with newly designed evaluation metrics. DUTS-MM builds upon the DUTS dataset but enriches the ground-truth mask annotations in three aspects: 1) improving the mask quality, especially for boundaries and fine-grained structures; 2) alleviating the annotation inconsistency issue; and 3) providing multiple ground-truth masks for images with saliency ambiguity. DUTS-MQ consists of approximately 100K image-mask pairs with human-annotated preference scores, enabling the learning of real human preferences in measuring mask quality. Building upon these two datasets, we propose a simple yet effective pluralistic SOD baseline based on a Mixture-of-Experts (MOE) design. Equipped with two prediction heads, it simultaneously predicts multiple masks using different query prompts and predicts human preference scores for each mask candidate. Extensive experiments and analyses underscore the significance of our proposed datasets and affirm the effectiveness of our PSOD framework.
☆ Developing, Analyzing, and Evaluating Self-Drive Algorithms Using Drive-by-Wire Electric Vehicles
Reliable lane-following algorithms are essential for safe and effective autonomous driving. This project was primarily focused on developing and evaluating different lane-following programs to find the most reliable algorithm for a Vehicle to Everything (V2X) project. The algorithms were first tested on a simulator and then with real vehicles equipped with a drive-by-wire system using ROS (Robot Operating System). Their performance was assessed through reliability, comfort, speed, and adaptability metrics. The results show that the two most reliable approaches detect both lane lines and use unsupervised learning to separate them. These approaches proved to be robust in various driving scenarios, making them suitable candidates for integration into the V2X project.
comment: Supported by the National Science Foundation under Grants No. 2150292 and 2150096
☆ MSTT-199: MRI Dataset for Musculoskeletal Soft Tissue Tumor Segmentation
Accurate musculoskeletal soft tissue tumor segmentation is vital for assessing tumor size, location, diagnosis, and response to treatment, thereby influencing patient outcomes. However, segmentation of these tumors requires clinical expertise, and an automated segmentation model would save valuable time for both clinicians and patients. Training an automatic model requires a large dataset of annotated images. In this work, we describe the collection of an MR imaging dataset of 199 musculoskeletal soft tissue tumors from 199 patients. We trained segmentation models on this dataset and then benchmarked them on a publicly available dataset. Our model achieved a state-of-the-art Dice score of 0.79 out of the box without any fine-tuning, which shows the diversity and utility of our curated dataset. We analyzed the model predictions and found that its performance suffered on fibrous and vascular tumors due to their diverse anatomical location, size, and intensity heterogeneity. The code and models are available in the following GitHub repository: https://github.com/Reasat/mstt
comment: Dataset will be made publicly available after the acceptance of the paper
☆ Spatial Diffusion for Cell Layout Generation MICCAI 2024
Generative models, such as GANs and diffusion models, have been used to augment training sets and boost performances in different tasks. We focus on generative models for cell detection instead, i.e., locating and classifying cells in given pathology images. One important source of information that has been largely overlooked is the spatial pattern of the cells. In this paper, we propose a spatial-pattern-guided generative model for cell layout generation. Specifically, we propose a novel diffusion model that is guided by spatial features and generates realistic cell layouts. We explore different density models as spatial features for the diffusion model. In downstream tasks, we show that the generated cell layouts can be used to guide the generation of high-quality pathology images. Augmenting with these images can significantly boost the performance of SOTA cell detection methods. The code is available at https://github.com/superlc1995/Diffusion-cell.
comment: 12 pages, 4 figures, accepted by MICCAI 2024
☆ Coupling AI and Citizen Science in Creation of Enhanced Training Dataset for Medical Image Segmentation
Recent advancements in medical imaging and artificial intelligence (AI) have greatly enhanced diagnostic capabilities, but the development of effective deep learning (DL) models is still constrained by the lack of high-quality annotated datasets. The traditional manual annotation process by medical experts is time- and resource-intensive, limiting the scalability of these datasets. In this work, we introduce a robust and versatile framework that combines AI and crowdsourcing to improve both the quality and quantity of medical image datasets across different modalities. Our approach utilises a user-friendly online platform that enables a diverse group of crowd annotators to label medical images efficiently. By integrating the MedSAM segmentation AI with this platform, we accelerate the annotation process while maintaining expert-level quality through an algorithm that merges crowd-labelled images. Additionally, we employ pix2pixGAN, a generative AI model, to expand the training dataset with synthetic images that capture realistic morphological features. These methods are combined into a cohesive framework designed to produce an enhanced dataset, which can serve as a universal pre-processing pipeline to boost the training of any medical deep learning segmentation model. Our results demonstrate that this framework significantly improves model performance, especially when training data is limited.
☆ MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation ECCV 2024
Skin cancer segmentation poses a significant challenge in medical image analysis. Numerous existing solutions, predominantly CNN-based, face issues related to a lack of global contextual understanding. Alternatively, some approaches resort to large-scale Transformer models to bridge the global contextual gaps, but at the expense of model size and computational complexity. Finally, many Transformer-based approaches rely primarily on CNN-based decoders, overlooking the benefits of Transformer-based decoding models. Recognizing these limitations, we address the need for efficient, lightweight solutions by introducing MobileUNETR, which aims to overcome the performance constraints associated with both CNNs and Transformers while minimizing model size, presenting a promising stride towards efficient image segmentation. MobileUNETR has 3 main features. 1) MobileUNETR comprises a lightweight hybrid CNN-Transformer encoder to help balance local and global contextual feature extraction in an efficient manner; 2) A novel hybrid decoder that simultaneously utilizes low-level and global features at different resolutions within the decoding stage for accurate mask generation; 3) surpassing large and complex architectures, MobileUNETR achieves superior performance with 3 million parameters and a computational complexity of 1.3 GFLOPs, resulting in 10x and 23x reductions in parameters and FLOPs, respectively. Extensive experiments have been conducted to validate the effectiveness of our proposed method on four publicly available skin lesion segmentation datasets, including ISIC 2016, ISIC 2017, ISIC 2018, and PH2 datasets. The code will be publicly available at: https://github.com/OSUPCVLab/MobileUNETR.git
comment: Accepted at ECCV 2024 - BioImage Computing Workshop (Oral)
☆ Incorporating dense metric depth into neural 3D representations for view synthesis and relighting
Synthesizing accurate geometry and photo-realistic appearance of small scenes is an active area of research with compelling use cases in gaming, virtual reality, robotic-manipulation, autonomous driving, convenient product capture, and consumer-level photography. When applying scene geometry and appearance estimation techniques to robotics, we found that the narrow cone of possible viewpoints due to the limited range of robot motion and scene clutter caused current estimation techniques to produce poor quality estimates or even fail. On the other hand, in robotic applications, dense metric depth can often be measured directly using stereo and illumination can be controlled. Depth can provide a good initial estimate of the object geometry to improve reconstruction, while multi-illumination images can facilitate relighting. In this work we demonstrate a method to incorporate dense metric depth into the training of neural 3D representations and address an artifact observed while jointly refining geometry and appearance by disambiguating between texture and geometry edges. We also discuss a multi-flash stereo camera system developed to capture the necessary data for our pipeline and show results on relighting and view synthesis with a few training views.
comment: Project webpage: https://stereomfc.github.io
☆ Can Your Generative Model Detect Out-of-Distribution Covariate Shift? ECCV 2024
Detecting Out-of-Distribution (OOD) sensory data and covariate distribution shift aims to identify new test examples with high-level image statistics that differ from the captured, normal, In-Distribution (ID) set. Existing OOD detection literature largely focuses on semantic shift with little-to-no consensus over covariate shift. Generative models capture the ID data in an unsupervised manner, enabling them to effectively identify samples that deviate significantly from this learned distribution, irrespective of the downstream task. In this work, we elucidate the ability of generative models to detect and quantify domain-specific covariate shift through extensive analyses that involve a variety of models. To this end, we conjecture that it is sufficient to detect the most common sensory faults (anomalies and deviations in global signal statistics) by solely modeling high-frequency signal-dependent and independent details. We propose a novel method, CovariateFlow, for OOD detection, specifically tailored to covariate heteroscedastic high-frequency image-components using conditional Normalizing Flows (cNFs). Our results on CIFAR10 vs. CIFAR10-C and ImageNet200 vs. ImageNet200-C demonstrate the effectiveness of the method by accurately detecting OOD covariate shift. This work contributes to enhancing the fidelity of imaging systems and aiding machine learning models in OOD detection in the presence of covariate shift.
comment: ECCV 2024
☆ MDNF: Multi-Diffusion-Nets for Neural Fields on Meshes
We propose a novel framework for representing neural fields on triangle meshes that is multi-resolution across both spatial and frequency domains. Inspired by the Neural Fourier Filter Bank (NFFB), our architecture decomposes the spatial and frequency domains by associating finer spatial resolution levels with higher frequency bands, while coarser resolutions are mapped to lower frequencies. To achieve geometry-aware spatial decomposition we leverage multiple DiffusionNet components, each associated with a different spatial resolution level. Subsequently, we apply a Fourier feature mapping to encourage finer resolution levels to be associated with higher frequencies. The final signal is composed in a wavelet-inspired manner using a sine-activated MLP, aggregating higher-frequency signals on top of lower-frequency ones. Our architecture attains high accuracy in learning complex neural fields and is robust to discontinuities, exponential scale variations of the target field, and mesh modification. We demonstrate the effectiveness of our approach through its application to diverse neural fields, such as synthetic RGB functions, UV texture coordinates, and vertex normals, illustrating different challenges. To validate our method, we compare its performance against two alternatives, showcasing the advantages of our multi-resolution architecture.
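As a rough illustration of the Fourier feature mapping used above to bias finer resolution levels toward higher frequencies, the sketch below encodes input coordinates with a bank of sinusoids at geometrically spaced frequencies. The band count and frequency range are placeholder values, not the paper's settings.

```python
import numpy as np

def fourier_features(x, num_bands=6, max_freq=32.0):
    """Map coordinates to sinusoidal features; higher bands capture higher frequencies."""
    freqs = np.geomspace(1.0, max_freq, num_bands)     # frequency bands
    angles = 2.0 * np.pi * x[..., None] * freqs        # (..., dim, bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*x.shape[:-1], -1)            # flatten per-point features

coords = np.random.rand(100, 3)        # e.g. vertex positions on a mesh
encoded = fourier_features(coords)     # shape (100, 3 * 2 * num_bands)
```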
☆ A General Albedo Recovery Approach for Aerial Photogrammetric Images through Inverse Rendering SP
Modeling outdoor scenes for the synthetic 3D environment requires the recovery of reflectance/albedo information from raw images, which is an ill-posed problem due to the complicated unmodeled physics in this process (e.g., indirect lighting, volume scattering, specular reflection). The problem remains unsolved in a practical context. The recovered albedo can facilitate model relighting and shading, which can further enhance the realism of rendered models and the applications of digital twins. Typically, photogrammetric 3D models simply take the source images as texture materials, which inherently embed unwanted lighting artifacts (at the time of capture) into the texture. Therefore, these polluted textures are suboptimal for a synthetic environment to enable realistic rendering. In addition, this embedded environmental lighting further challenges photo-consistency across different images, causing image-matching uncertainties. This paper presents a general image formation model for albedo recovery from typical aerial photogrammetric images under natural illuminations and derives the inverse model to resolve the albedo information through inverse rendering intrinsic image decomposition. Our approach builds on the fact that both the sun illumination and scene geometry are estimable in aerial photogrammetry, thus they can provide direct inputs for this ill-posed problem. This physics-based approach does not require additional input other than data acquired through the typical drone-based photogrammetric collection and was shown to favorably outperform existing approaches. We also demonstrate that the recovered albedo image can in turn improve typical image processing tasks in photogrammetry such as feature and dense matching, edge, and line extraction.
comment: ISPRS Journal of Photogrammetry and Remote Sensing
☆ No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that more optimally leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g. +4.8% - 7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
☆ Boundless: Generating Photorealistic Synthetic Data for Object Detection in Urban Streetscapes
We introduce Boundless, a photo-realistic synthetic data generation system for enabling highly accurate object detection in dense urban streetscapes. Boundless can replace massive real-world data collection and manual ground-truth object annotation (labeling) with an automated and configurable process. Boundless is based on the Unreal Engine 5 (UE5) City Sample project with improvements enabling accurate collection of 3D bounding boxes across different lighting and scene variability conditions. We evaluate the performance of object detection models trained on the dataset generated by Boundless when used for inference on a real-world dataset acquired from medium-altitude cameras. We compare the performance of the Boundless-trained model against the CARLA-trained model and observe an improvement of 7.8 mAP. The results we achieved support the premise that synthetic data generation is a credible methodology for training/fine-tuning scalable object detection models for urban scenes.
☆ Design and Evaluation of Camera-Centric Mobile Crowdsourcing Applications
The data that underlies automated methods in computer vision and machine learning, such as image retrieval and fine-grained recognition, often comes from crowdsourcing. In contexts that rely on the intrinsic motivation of users, we seek to understand how the application design affects a user's willingness to contribute and the quantity and quality of the data they capture. In this project, we designed three versions of a camera-based mobile crowdsourcing application, which varied in the amount of labeling effort requested of the user and conducted a user study to evaluate the trade-off between the level of user-contributed information requested and the quantity and quality of labeled images collected. The results suggest that higher levels of user labeling do not lead to reduced contribution. Users collected and annotated the most images using the application version with the highest requested level of labeling with no decrease in user satisfaction. In preliminary experiments, the additional labeled data supported increased performance on an image retrieval task.
☆ Vec2Face: Scaling Face Dataset Generation with Loosely Constrained Vectors
This paper studies how to synthesize face images of non-existent persons, to create a dataset that allows effective training of face recognition (FR) models. Two important goals are (1) the ability to generate a large number of distinct identities (inter-class separation) with (2) a wide variation in appearance of each identity (intra-class variation). However, existing works 1) are typically limited in how many well-separated identities can be generated and 2) either neglect or use a separate editing model for attribute augmentation. We propose Vec2Face, a holistic model that uses only a sampled vector as input and can flexibly generate and control face images and their attributes. Composed of a feature masked autoencoder and a decoder, Vec2Face is supervised by face image reconstruction and can be conveniently used in inference. Using vectors with low similarity among themselves as inputs, Vec2Face generates well-separated identities. Randomly perturbing an input identity vector within a small range allows Vec2Face to generate faces of the same identity with robust variation in face attributes. It is also possible to generate images with designated attributes by adjusting vector values with a gradient descent method. Vec2Face has efficiently synthesized as many as 300K identities with 15 million total images, whereas 60K is the largest number of identities created in the previous works. FR models trained with the generated HSFace datasets, from 10k to 300k identities, achieve state-of-the-art accuracy, from 92% to 93.52%, on five real-world test sets. For the first time, our model created using a synthetic training set achieves higher accuracy than the model created using a same-scale training set of real face images (on the CALFW test set).
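A minimal sketch of the sampling strategy described above: identity vectors are accepted only if they have low similarity to previously accepted ones (inter-class separation), and small perturbations of an accepted vector yield intra-class variation. The similarity threshold and perturbation scale below are guesses for illustration, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_ids = 512, 200

# Sample unit identity vectors, keeping only those with low pairwise similarity
# to previously accepted vectors (encourages well-separated identities).
ids = []
while len(ids) < num_ids:
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    if all(np.dot(v, u) < 0.3 for u in ids):   # similarity threshold is a guess
        ids.append(v)

# Intra-class variation: small random perturbations of one identity vector.
anchor = ids[0]
variants = [anchor + 0.05 * rng.standard_normal(dim) for _ in range(8)]
```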
♻ ☆ LADDER: Language Driven Slice Discovery and Error Rectification
Error slice discovery associates structured patterns with model errors. Existing methods discover error slices by clustering the error-prone samples with similar patterns or assigning discrete attributes to each sample for post-hoc analysis. While these methods aim for interpretability and easier mitigation through reweighting or rebalancing, they may not capture the full complexity of error patterns due to incomplete or missing attributes. Contrary to the existing approach, this paper utilizes the reasoning capabilities of the Large Language Model (LLM) to analyze complex error patterns and generate testable hypotheses. This paper proposes LADDER: Language Driven slice Discovery and Error Rectification. It first projects the model's representation into a language-aligned feature space (eg CLIP) to preserve semantics in the original model feature space. This ensures the accurate retrieval of sentences that highlight the model's errors. Next, the LLM utilizes the sentences and generates hypotheses to discover error slices. Finally, we mitigate the error by fine-tuning the classification head by creating a group-balanced dataset using the hypotheses. Our entire method does not require any attribute annotation, either explicitly or through external tagging models. We validate our method with \textbf{five} image classification datasets. The code is available (https://github.com/batmanlab/Ladder).
♻ ☆ Quantifying uncertainty in lung cancer segmentation with foundation models applied to mixed-domain datasets
Medical image foundation models have shown the ability to segment organs and tumors with minimal fine-tuning. These models are typically evaluated on task-specific in-distribution (ID) datasets. However, reliable performance on ID dataset does not guarantee robust generalization on out-of-distribution (OOD) datasets. Importantly, once deployed for clinical use, it is impractical to have `ground truth' delineations to assess ongoing performance drifts, especially when images fall into OOD category due to different imaging protocols. Hence, we introduced a comprehensive set of computationally fast metrics to evaluate the performance of multiple foundation models (Swin UNETR, SimMIM, iBOT, SMIT) trained with self-supervised learning (SSL). SSL pretraining was selected as this approach is applicable for large, diverse, and unlabeled image sets. All models were fine-tuned on identical datasets for lung tumor segmentation from computed tomography (CT) scans. SimMIM, iBOT, and SMIT used identical architecture, pretraining, and fine-tuning datasets to assess performance variations with the choice of pretext tasks used in SSL. Evaluation was performed on two public lung cancer datasets (LRAD: n = 140, 5Rater: n = 21) with different image acquisitions and tumor stage compared to training data (n = 317 public resource with stage III-IV lung cancers) and a public non-cancer dataset containing volumetric CT scans of patients with pulmonary embolism (n = 120). All models produced similarly accurate tumor segmentation on the lung cancer testing datasets. SMIT produced a highest F1-score (LRAD: 0.60, 5Rater: 0.64) and lowest entropy (LRAD: 0.06, 5Rater: 0.12), indicating higher tumor detection rate and confident segmentations. In the OOD dataset, SMIT misdetected least number of tumors, indicated by median volume occupancy of 5.67 cc compared to second best method SimMIM of 9.97 cc.
♻ ☆ Multi-task Learning Approach for Intracranial Hemorrhage Prognosis MICCAI 2024
Prognosis after intracranial hemorrhage (ICH) is influenced by a complex interplay between imaging and tabular data. Rapid and reliable prognosis are crucial for effective patient stratification and informed treatment decision-making. In this study, we aim to enhance image-based prognosis by learning a robust feature representation shared between prognosis and the clinical and demographic variables most highly correlated with it. Our approach mimics clinical decision-making by reinforcing the model to learn valuable prognostic data embedded in the image. We propose a 3D multi-task image model to predict prognosis, Glasgow Coma Scale and age, improving accuracy and interpretability. Our method outperforms current state-of-the-art baseline image models, and demonstrates superior performance in ICH prognosis compared to four board-certified neuroradiologists using only CT scans as input. We further validate our model with interpretability saliency maps. Code is available at https://github.com/MiriamCobo/MultitaskLearning_ICH_Prognosis.git.
comment: 16 pages. Accepted at Machine Learning in Medical Imaging Workshop @ MICCAI 2024 (MLMI2024). This is the submitted manuscript with added link to github repo, funding acknowledgements and authors' names and affiliations. No further post submission improvements or corrections were integrated. Final version not published yet
♻ ☆ SDE-based Multiplicative Noise Removal
Multiplicative noise, also known as speckle or pepper noise, commonly affects images produced by synthetic aperture radar (SAR), lasers, or optical lenses. Unlike additive noise, which typically arises from thermal processes or external factors, multiplicative noise is inherent to the system, originating from the fluctuation in diffuse reflections. These fluctuations result in multiple copies of the same signal with varying magnitudes being combined. Consequently, despeckling, or removing multiplicative noise, necessitates different techniques compared to those used for additive noise removal. In this paper, we propose a novel approach using Stochastic Differential Equation (SDE)-based diffusion models to address multiplicative noise. We demonstrate that multiplicative noise can be effectively modeled as a Geometric Brownian Motion process in the logarithmic domain. Utilizing the Fokker-Planck equation, we derive the corresponding reverse process for image denoising. To validate our method, we conduct extensive experiments on two different datasets, comparing our approach to both classical signal processing techniques and contemporary CNN-based noise removal models. Our results indicate that the proposed method significantly outperforms existing methods on perception-based metrics such as FID and LPIPS, while maintaining competitive performance on traditional metrics like PSNR and SSIM.
comment: 9 pages, 4 figures
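To make the modeling idea above concrete, the snippet below simulates multiplicative (speckle-like) noise and shows that taking logarithms turns it into an additive, image-independent residual, which is what motivates treating the degradation as a Geometric Brownian Motion in the log domain. The gamma-distributed speckle model is a common convention, not necessarily the one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.clip(rng.random((64, 64)), 1e-3, 1.0)   # toy "clean" image, strictly positive

# Multiplicative (speckle-like) noise: each pixel is scaled by a random factor.
speckle = rng.gamma(shape=4.0, scale=1.0 / 4.0, size=clean.shape)  # mean-1 multiplicative noise
noisy = clean * speckle

# In the logarithmic domain the noise becomes additive and independent of the image,
# which is what allows a Geometric-Brownian-Motion-style SDE to describe the degradation.
log_noisy = np.log(noisy)
log_clean = np.log(clean)
additive_residual = log_noisy - log_clean           # equals log(speckle)
```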
♻ ☆ Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes ECCV 2024
It is challenging to reconstruct 3D point clouds in unseen classes from single 2D images. Instead of using an object-centered coordinate system, current methods generalize global priors learned in seen classes to reconstruct 3D shapes from unseen classes in a viewer-centered coordinate system. However, the reconstruction accuracy and interpretability still leave room for improvement. To resolve this issue, we propose learning local pattern modularization for reconstructing 3D shapes in unseen classes, which achieves both good generalization ability and high reconstruction accuracy. Our insight is to learn a local prior which is class-agnostic and easy to generalize in an object-centered coordinate system. Specifically, the local prior is learned via a process of learning and customizing local pattern modularization in seen classes. During this process, we first learn a set of patterns in local regions, which is the basis in the object-centered coordinate system to represent an arbitrary region on shapes across different classes. Then, we modularize each region on an initially reconstructed shape using the learned local patterns. Based on that, we customize the local pattern modularization using the input image by refining the reconstruction with more details. Our method reconstructs high-fidelity point clouds from unseen classes in an object-centered coordinate system without requiring a large number of patterns or any additional information, such as segmentation supervision or camera poses. Our experimental results on widely used benchmarks show that our method achieves the state-of-the-art reconstruction accuracy for shapes from unseen classes. The code is available at https://github.com/chenchao15/Unseen.
comment: 14 pages, 11 figures, accepted by ECCV 2024
♻ ☆ CONDA: Condensed Deep Association Learning for Co-Salient Object Detection
Inter-image association modeling is crucial for co-salient object detection. Despite satisfactory performance, previous methods still have limitations in sufficient inter-image association modeling, because most of them focus on image feature optimization under the guidance of heuristically calculated raw inter-image associations. They directly rely on raw associations which are not reliable in complex scenarios, and their image feature optimization approach is not explicit for inter-image association modeling. To alleviate these limitations, this paper proposes a deep association learning strategy that deploys deep networks on raw associations to explicitly transform them into deep association features. Specifically, we first create hyperassociations to collect dense pixel-pair-wise raw associations and then deploy deep aggregation networks on them. We design a progressive association generation module for this purpose with additional enhancement of the hyperassociation calculation. More importantly, we propose a correspondence-induced association condensation module that introduces a pretext task, i.e., semantic correspondence estimation, to condense the hyperassociations for computational burden reduction and noise elimination. We also design an object-aware cycle consistency loss for high-quality correspondence estimations. Experimental results on three benchmark datasets demonstrate the remarkable effectiveness of our proposed method with various training settings.
comment: There is an error. In Sec 4.1, the number of images in some dataset is incorrect and needs to be revised
♻ ☆ Open Gaze: Open Source eye tracker for smartphone devices using Deep Learning
Eye tracking has been a pivotal tool in diverse fields such as vision research, language analysis, and usability assessment. The majority of prior investigations, however, have concentrated on expansive desktop displays employing specialized, costly eye tracking hardware that lacks scalability. Remarkably little insight exists into ocular movement patterns on smartphones, despite their widespread adoption and significant usage. In this manuscript, we present an open-source implementation of a smartphone-based gaze tracker that emulates the methodology proposed by a Google paper (whose source code remains proprietary). Our focus is on attaining accuracy comparable to that attained through the Google paper's methodology, without the necessity for supplementary hardware. Through the integration of machine learning techniques, we unveil an accurate eye tracking solution that is native to smartphones. Our approach demonstrates precision akin to the state-of-the-art mobile eye trackers, which are characterized by a cost that is two orders of magnitude higher. Leveraging the vast MIT GazeCapture dataset, which is available through registration on the dataset's website, we successfully replicate crucial findings from previous studies concerning ocular motion behavior in oculomotor tasks and saliency analyses during natural image observation. Furthermore, we emphasize the applicability of smartphone-based gaze tracking in discerning reading comprehension challenges. Our findings exhibit the inherent potential to amplify eye movement research by significant proportions, accommodating participation from thousands of subjects with explicit consent. This scalability not only fosters advancements in vision research, but also extends its benefits to domains such as accessibility enhancement and healthcare applications.
comment: This paper results are incorrectly reported. The paper is not authentic and conclusions are not correct
♻ ☆ Q-Seg: Quantum Annealing-Based Unsupervised Image Segmentation
We present Q-Seg, a novel unsupervised image segmentation method based on quantum annealing, tailored for existing quantum hardware. We formulate the pixel-wise segmentation problem, which assimilates spectral and spatial information of the image, as a graph-cut optimization task. Our method efficiently leverages the interconnected qubit topology of the D-Wave Advantage device, offering superior scalability over existing quantum approaches and outperforming several tested state-of-the-art classical methods. Empirical evaluations on synthetic datasets have shown that Q-Seg has better runtime performance than the state-of-the-art classical optimizer Gurobi. The method has also been tested on earth observation image segmentation, a critical area with noisy and unreliable annotations. In the era of noisy intermediate-scale quantum (NISQ) computing, Q-Seg emerges as a reliable contender for real-world applications in comparison to advanced techniques like Segment Anything. Consequently, Q-Seg offers a promising solution using available quantum hardware, especially in situations constrained by limited labeled data and the need for efficient computational runtime.
comment: 12 pages, 9 figures, 1 table
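Quantum annealers minimize quadratic unconstrained binary optimization (QUBO) objectives, so a graph-cut segmentation such as the one above has to be expressed as a QUBO matrix over pixel labels. The toy construction below, with a brightness-based unary term and a pairwise smoothness term on 4-connected neighbours, only illustrates that translation; Q-Seg's actual graph weights and the embedding onto the D-Wave qubit topology differ.

```python
import numpy as np

def segmentation_qubo(image, lam=0.5):
    """Build a toy QUBO for binary pixel labels x in {0,1}: a unary term pulls
    bright pixels toward label 1, and a pairwise smoothness term penalises
    label changes between 4-connected neighbours. Illustrative only."""
    h, w = image.shape
    n = h * w
    Q = np.zeros((n, n))
    idx = lambda i, j: i * w + j

    for i in range(h):
        for j in range(w):
            p = idx(i, j)
            Q[p, p] += 0.5 - image[i, j]        # unary: favours x=1 where the pixel is bright
            for di, dj in ((0, 1), (1, 0)):     # right and down neighbours
                if i + di < h and j + dj < w:
                    q = idx(i + di, j + dj)
                    # lam * (x_p - x_q)^2 = lam * (x_p + x_q - 2 x_p x_q) for binary x
                    Q[p, p] += lam
                    Q[q, q] += lam
                    Q[p, q] += -2.0 * lam
    return Q

img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0
Q = segmentation_qubo(img)
x = (np.random.rand(64) > 0.5).astype(float)
energy = x @ Q @ x        # the objective a quantum annealer would minimise over binary x
```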
♻ ☆ Enhancing the vision-language foundation model with key semantic knowledge-emphasized report refinement
Recently, vision-language representation learning has made remarkable advancements in building up medical foundation models, holding immense potential for transforming the landscape of clinical research and medical care. The underlying hypothesis is that the rich knowledge embedded in radiology reports can effectively assist and guide the learning process, reducing the need for additional labels. However, these reports tend to be complex and sometimes even consist of redundant descriptions that make the representation learning too challenging to capture the key semantic information. This paper develops a novel iterative vision-language representation learning framework by proposing a key semantic knowledge-emphasized report refinement method. Particularly, raw radiology reports are refined to highlight the key information according to a constructed clinical dictionary and two model-optimized knowledge-enhancement metrics. The iterative framework is designed to progressively learn, starting from gaining a general understanding of the patient's condition based on raw reports and gradually refines and extracts critical information essential to the fine-grained analysis tasks. The effectiveness of the proposed framework is validated on various downstream medical image analysis tasks, including disease classification, region-of-interest segmentation, and phrase grounding. Our framework surpasses seven state-of-the-art methods in both fine-tuning and zero-shot settings, demonstrating its encouraging potential for different clinical applications.
♻ ☆ Pre-processing and Compression: Understanding Hidden Representation Refinement Across Imaging Domains via Intrinsic Dimension
In recent years, there has been interest in how geometric properties such as intrinsic dimension (ID) of a neural network's hidden representations change through its layers, and how such properties are predictive of important model behavior such as generalization ability. However, evidence has begun to emerge that such behavior can change significantly depending on the domain of the network's training data, such as natural versus medical images. Here, we further this inquiry by exploring how the ID of a network's learned representations changes through its layers, in essence, characterizing how the network successively refines the information content of input data to be used for predictions. Analyzing eleven natural and medical image datasets across six network architectures, we find that how ID changes through the network differs noticeably between natural and medical image models. Specifically, medical image models peak in representation ID earlier in the network, implying a difference in the image features and their abstractness that are typically used for downstream tasks in these domains. Additionally, we discover a strong correlation of this peak representation ID with the ID of the data in its input space, implying that the intrinsic information content of a model's learned representations is guided by that of the data it was trained on. Overall, our findings emphasize notable discrepancies in network behavior between natural and non-natural imaging domains regarding hidden representation information content, and provide further insights into how a network's learned features are shaped by its training data.
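For context, the intrinsic dimension of a representation is typically estimated from nearest-neighbour statistics. The sketch below uses the TwoNN maximum-likelihood estimator (Facco et al., 2017) as one such estimator; whether this is the estimator used in the paper is an assumption.

```python
import numpy as np

def two_nn_intrinsic_dimension(X):
    """TwoNN maximum-likelihood intrinsic-dimension estimate: only the ratio
    mu = r2 / r1 of each point's two nearest-neighbour distances is used."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    sorted_d = np.sort(dists, axis=1)
    mu = sorted_d[:, 1] / sorted_d[:, 0]           # ratio of 2nd to 1st NN distance
    mu = mu[np.isfinite(np.log(mu)) & (mu > 1.0)]  # drop degenerate ties
    return len(mu) / np.sum(np.log(mu))

# Points on a 2-D plane linearly embedded in 10-D: the estimate should be close to 2.
rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 2))
X = Z @ rng.standard_normal((2, 10))
print(two_nn_intrinsic_dimension(X))
```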
♻ ☆ CHOTA: A Higher Order Accuracy Metric for Cell Tracking
The evaluation of cell tracking results steers the development of tracking methods, significantly impacting biomedical research. This is quantitatively achieved by means of evaluation metrics. Unfortunately, current metrics favor local correctness and weakly reward global coherence, impeding high-level biological analysis. To also foster global coherence, we propose the CHOTA metric (Cell-specific Higher Order Tracking Accuracy) which unifies the evaluation of all relevant aspects of cell tracking: cell detections and local associations, global coherence, and lineage tracking. We achieve this by introducing a new definition of the term 'trajectory' that includes the entire cell lineage and by including this into the well-established HOTA metric from general multiple object tracking. Furthermore, we provide a detailed survey of contemporary cell tracking metrics to compare our novel CHOTA metric and to show its advantages. All metrics are extensively evaluated on state-of-the-art real-data cell tracking results and synthetic results that simulate specific tracking errors. We show that CHOTA is sensitive to all tracking errors and gives a good indication of the biologically relevant capability of a method to reconstruct the full lineage of cells. It introduces a robust and comprehensive alternative to the currently used metrics in cell tracking. Python code is available at https://github.com/CellTrackingChallenge/py-ctcmetrics .
comment: Accepted at BIC Workshop at European Conference on Computer Vision 2024, 14 pages, 4 figures, 2 tables
♻ ☆ Large Scale Unsupervised Brain MRI Image Registration Solution for Learn2Reg 2024 MICCAI
In this paper, we summarize the methods and experimental results we proposed for Task 2 in the Learn2Reg 2024 Challenge. This task focuses on unsupervised registration of anatomical structures in brain MRI images between different patients. The difficulty lies in (1) the absence of segmentation labels and (2) the large amount of data. To address these challenges, we built an efficient backbone network and explored several schemes to further enhance registration accuracy. Under the guidance of the NCC loss function and smoothness regularization loss function, we obtained a smooth and reasonable deformation field. According to the leaderboard, our method achieved a Dice coefficient of 77.34%, which is 1.4% higher than TransMorph. Overall, we won second place on the leaderboard for Task 2.
comment: MICCAI Learn2Reg 2024 Challenge & WBIR 2024 Workshop on Biomedical Imaging Registration
♻ ☆ Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation
In this paper, we address the challenge of compressing generative adversarial networks (GANs) for deployment in resource-constrained environments by proposing two novel methodologies: Distribution Matching for Efficient compression (DiME) and Network Interactive Compression via Knowledge Exchange and Learning (NICKEL). DiME employs foundation models as embedding kernels for efficient distribution matching, leveraging maximum mean discrepancy to facilitate effective knowledge distillation. Simultaneously, NICKEL employs an interactive compression method that enhances the communication between the student generator and discriminator, achieving a balanced and stable compression process. Our comprehensive evaluation on the StyleGAN2 architecture with the FFHQ dataset shows the effectiveness of our approach, with NICKEL & DiME achieving FID scores of 10.45 and 15.93 at compression rates of 95.73% and 98.92%, respectively. Remarkably, our methods sustain generative quality even at an extreme compression rate of 99.69%, surpassing the previous state-of-the-art performance by a large margin. These findings not only demonstrate our methodologies' capacity to significantly lower GANs' computational demands but also pave the way for deploying high-quality GAN models in settings with limited resources. Our code will be released soon.
♻ ☆ When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective
Adapting pre-trained models to new tasks can exhibit varying effectiveness across datasets. Visual prompting, a state-of-the-art parameter-efficient transfer learning method, can significantly improve the performance of out-of-distribution tasks. On the other hand, linear probing, a standard transfer learning method, can sometimes become the best approach. We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing. By employing the LLR score alongside resource-efficient visual prompt approximations, our cost-effective measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%. The source code is available at https://github.com/IBM/VP-LLR.
♻ ☆ MMA-MRNNet: Harnessing Multiple Models of Affect and Dynamic Masked RNN for Precise Facial Expression Intensity Estimation
This paper presents MMA-MRNNet, a novel deep learning architecture for dynamic multi-output Facial Expression Intensity Estimation (FEIE) from video data. Traditional approaches to this task often rely on complex 3-D CNNs, which require extensive pre-training and assume that facial expressions are uniformly distributed across all frames of a video. These methods struggle to handle videos of varying lengths, often resorting to ad-hoc strategies that either discard valuable information or introduce bias. MMA-MRNNet addresses these challenges through a two-stage process. First, the Multiple Models of Affect (MMA) extractor component is a Multi-Task Learning CNN that concurrently estimates valence-arousal, recognizes basic facial expressions, and detects action units in each frame. These representations are then processed by a Masked RNN component, which captures temporal dependencies and dynamically updates weights according to the true length of the input video, ensuring that only the most relevant features are used for the final prediction. The proposed unimodal non-ensemble learning MMA-MRNNet was evaluated on the Hume-Reaction dataset and demonstrated significantly superior performance, surpassing state-of-the-art methods by a wide margin, regardless of whether they were unimodal, multimodal, or ensemble approaches. Finally, we demonstrated the effectiveness of the MMA component of our proposed method across multiple in-the-wild datasets, where it consistently outperformed all state-of-the-art methods across various metrics.
♻ ☆ In the Search for Optimal Multi-view Learning Models for Crop Classification with Global Remote Sensing Data
Studying and analyzing cropland is a difficult task due to its dynamic and heterogeneous growth behavior. Usually, diverse data sources can be collected for its estimation. Although deep learning models have proven to excel in the crop classification task, they face substantial challenges when dealing with multiple inputs, named Multi-View Learning (MVL). The methods used in the MVL scenario can be structured based on the encoder architecture, the fusion strategy, and the optimization technique. The literature has primarily focused on using specific encoder architectures for local regions, lacking a deeper exploration of other components in the MVL methodology. In contrast, we investigate the simultaneous selection of the fusion strategy and encoder architecture, assessing global-scale cropland and crop-type classifications. We use a range of five fusion strategies (Input, Feature, Decision, Ensemble, Hybrid) and five temporal encoders (LSTM, GRU, TempCNN, TAE, L-TAE) as possible configurations in the MVL method. We use the CropHarvest dataset for validation, which provides optical, radar, weather time series, and topographic information as input data. We found that in scenarios with a limited number of labeled samples, a unique configuration is insufficient for all the cases. Instead, a specialized combination should be meticulously sought, including an encoder and fusion strategy. To streamline this search process, we suggest identifying the optimal encoder architecture tailored for a particular fusion strategy, and then determining the most suitable fusion strategy for the classification task. We provide a methodological framework for researchers exploring crop classification through an MVL methodology.
comment: submitted to journal
♻ ☆ Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation ACL
Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
comment: Accepted at the MACLEAN workshop in the ECML/PKDD 2024
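Input Sensor Dropout (ISensD) as described above amounts to randomly blanking entire sensor streams during training so the model learns to cope with missing sensors at test time. The sketch below shows the idea on dictionary-of-arrays batches; the dropout probability and the zero-filling convention are illustrative assumptions.

```python
import numpy as np

def input_sensor_dropout(sensor_batches, p_drop=0.3, rng=None):
    """Zero out entire sensor streams at random during training so the model
    learns to predict with missing sensors. `sensor_batches` maps sensor name
    to an array of shape (batch, time, features)."""
    rng = rng or np.random.default_rng()
    out = {}
    for name, x in sensor_batches.items():
        if rng.random() < p_drop:
            out[name] = np.zeros_like(x)   # simulate a missing sensor
        else:
            out[name] = x
    return out

batch = {"optical": np.random.rand(8, 12, 10), "radar": np.random.rand(8, 12, 2)}
augmented = input_sensor_dropout(batch, p_drop=0.3)
```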
♻ ☆ Scalable Glacier Mapping using Deep Learning and Open Earth Observation Data Matches the Accuracy of Manual Delineation
Accurate global glacier mapping is critical for understanding climate change impacts. Despite its importance, automated glacier mapping at a global scale remains largely unexplored. Here we address this gap and propose Glacier-VisionTransformer-U-Net (GlaViTU), a convolutional-transformer deep learning model, and five strategies for multitemporal global-scale glacier mapping using open satellite imagery. Assessing the spatial, temporal and cross-sensor generalisation shows that our best strategy achieves intersection over union >0.85 on previously unobserved images in most cases, which drops to >0.75 for debris-rich areas such as High-Mountain Asia and increases to >0.90 for regions dominated by clean ice. A comparative validation against human expert uncertainties in terms of area and distance deviations underscores GlaViTU performance, approaching or matching expert-level delineation. Adding synthetic aperture radar data, namely, backscatter and interferometric coherence, increases the accuracy in all regions where available. The calibrated confidence for glacier extents is reported making the predictions more reliable and interpretable. We also release a benchmark dataset that covers 9% of glaciers worldwide. Our results support efforts towards automated multitemporal and global glacier mapping.
comment: after major revision, expanded validation
♻ ☆ CSGO: Content-Style Composition in Text-to-Image Generation
The diffusion model has shown exceptional capabilities in controlled image generation, which has further fueled interest in image style transfer. Existing works mainly focus on training free-based methods (e.g., image inversion) due to the scarcity of specific data. In this study, we present a data construction pipeline for content-style-stylized image triplets that generates and automatically cleanses stylized data triplets. Based on this pipeline, we construct a dataset IMAGStyle, the first large-scale style transfer dataset containing 210k image triplets, available for the community to explore and research. Equipped with IMAGStyle, we propose CSGO, a style transfer model based on end-to-end training, which explicitly decouples content and style features employing independent feature injection. The unified CSGO implements image-driven style transfer, text-driven stylized synthesis, and text editing-driven stylized synthesis. Extensive experiments demonstrate the effectiveness of our approach in enhancing style control capabilities in image generation. Additional visualization and access to the source code can be located on the project page: \url{https://csgo-gen.github.io/}.
♻ ☆ Object-Size-Driven Design of Convolutional Neural Networks: Virtual Axle Detection based on Raw Data
As infrastructure ages, the need for efficient monitoring methods becomes increasingly critical. Bridge Weigh-In-Motion (BWIM) systems are crucial for cost-efficient load and thus residual service life determination of road and railway infrastructure. However, conventional BWIM systems require additional sensors for axle detection, which have to be installed in potentially inaccessible locations or in locations that interfere with bridge operation. This study addresses this challenge by replacing dedicated axle detectors with a novel approach to real-time detection of train axles using sensors arbitrarily placed on bridges. The proposed Virtual Axle Detector with Enhanced Receptive Field (VADER) has been validated on a single-track railway bridge, demonstrating that it detects 99.9% of axles with a spatial error of 3.69 cm using only acceleration measurements. Using raw data as input outperforms the state-of-the-art spectrogram-based method in both speed and memory usage by 99%, making real-time application feasible for the first time. Additionally, we introduce the Maximum Receptive Field (MRF) rule, a novel approach to optimise hyperparameters of Convolutional Neural Networks (CNNs) based on the size of objects, which in this case relates to the fundamental frequency of a bridge. The MRF rule effectively narrows the hyperparameter search space, potentially replacing the need for extensive hyperparameter tuning. Since the MRF rule is theoretically applicable to all unstructured data, it could have implications for a wide range of deep learning problems from earthquake prediction to object recognition.
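The Maximum Receptive Field (MRF) rule ties CNN hyperparameters to object size, which for VADER corresponds to the bridge's fundamental vibration period. The sketch below computes the receptive field of a 1-D convolutional stack with the standard recurrence and checks it against the number of samples in one vibration period; the sampling rate, frequency, and layer configuration are made-up illustrative numbers, not the paper's rule in full.

```python
def receptive_field(layers):
    """Receptive field (in input samples) of a stack of 1-D conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples. Standard formula:
    the RF grows by (effective_kernel - 1) times the product of all preceding strides."""
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        effective = dilation * (kernel - 1) + 1
        rf += (effective - 1) * jump
        jump *= stride
    return rf

# Example: check that the receptive field covers one fundamental vibration period
# of the bridge (all numbers below are made up for illustration).
sampling_rate_hz, fundamental_hz = 600, 2.5
needed = sampling_rate_hz / fundamental_hz       # samples per vibration period
layers = [(7, 2, 1)] * 6                         # six conv layers, kernel 7, stride 2
print(receptive_field(layers) >= needed)
```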
♻ ☆ Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data
The increasing availability of image-text pairs has largely fueled the rapid advancement in vision-language foundation models. However, the vast scale of these datasets inevitably introduces significant variability in data quality, which can adversely affect the model performance. This highlights the critical role of data filtering, not only to enhance training efficiency but also to improve overall data quality. Existing methods typically rely on metrics such as CLIP Score and BLIP Score, which are derived from pre-trained models. However, these models are often trained on uncurated, noisy datasets, which can perpetuate errors and misalignments in the filtered dataset. We present a novel algorithm that incorporates human knowledge on image-text alignment to guide the filtering of vast corpora of web-crawled image-text data into a compact and high-quality form. To systematically capture human preferences on image-text alignment, we collect a diverse image-text dataset where each image is associated with multiple captions from various sources, and establish a comprehensive set of both subjective and objective criteria for critically guiding the alignment assessment from labelers. Additionally, we train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment. The resulting reward model thus can act as a human-like referee to filter image-text pairs. Extensive experiments demonstrate that we can maintain, sometimes even improve, model performance while compressing the image-text datasets up to ~90%. An impressive example is that, by aggressively reducing the total training sample from 130M to only 15.5M, our BLIP-B/16 models consistently show an average improvement of 2.9% on retrieval tasks and 11.5% on captioning tasks compared to full-size-dataset counterparts.
♻ ☆ Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition CVPR 2024
Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: first, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space; and finally, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-task multi-modal self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotations required. We release our pre-trained models as well as source code publicly.
comment: The paper will appear in the CVPR 2024 workshops proceedings
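The three objectives are optimized jointly as a weighted sum. The PyTorch sketch below illustrates that combination with a cross-modal InfoNCE term and a reconstruction term (the clustering term is omitted for brevity); it is an assumed simplification, not the ConCluGen implementation.

```python
# Illustrative sketch of a joint multi-modal self-supervised objective.
import torch
import torch.nn.functional as F

def cross_modal_info_nce(z_audio, z_video, temperature=0.1):
    """Pull matching audio/video embeddings of the same clip together."""
    z_a = F.normalize(z_audio, dim=-1)
    z_v = F.normalize(z_video, dim=-1)
    logits = z_a @ z_v.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0))       # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def total_loss(z_audio, z_video, recon, target, w_con=1.0, w_rec=1.0):
    return (w_con * cross_modal_info_nce(z_audio, z_video) +
            w_rec * F.mse_loss(recon, target))

# Usage with random tensors standing in for encoder/decoder outputs:
B, D = 8, 128
print(total_loss(torch.randn(B, D), torch.randn(B, D),
                 torch.randn(B, 3, 64, 64), torch.randn(B, 3, 64, 64)))
```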
♻ ☆ UHD-IQA Benchmark Database: Pushing the Boundaries of Blind Photo Quality Assessment
We introduce a novel Image Quality Assessment (IQA) dataset comprising 6073 UHD-1 (4K) images, annotated at a fixed width of 3840 pixels. Contrary to existing No-Reference (NR) IQA datasets, ours focuses on highly aesthetic photos of high technical quality, filling a gap in the literature. The images, carefully curated to exclude synthetic content, are sufficiently diverse to train general NR-IQA models. Importantly, the dataset is annotated with perceptual quality ratings obtained through a crowdsourcing study. Ten expert raters, comprising photographers and graphics artists, assessed each image at least twice in multiple sessions spanning several days, resulting in 20 highly reliable ratings per image. Annotators were rigorously selected based on several metrics, including self-consistency, to ensure their reliability. The dataset includes rich metadata with user and machine-generated tags from over 5,000 categories and popularity indicators such as favorites, likes, downloads, and views. With its unique characteristics, such as its focus on high-quality images, reliable crowdsourced annotations, and high annotation resolution, our dataset opens up new opportunities for advancing perceptual image quality assessment research and developing practical NR-IQA models that apply to modern photos. Our dataset is available at https://database.mmsp-kn.de/uhd-iqa-benchmark-database.html
♻ ☆ Model-agnostic explainable artificial intelligence for object detection in image data
In recent years, deep neural networks have been widely used for building high-performance Artificial Intelligence (AI) systems for computer vision applications. Object detection is a fundamental task in computer vision, which has progressed greatly through the development of large and intricate AI models. However, the lack of transparency is a big challenge that may not allow the widespread adoption of these models. Explainable artificial intelligence is a field of research where methods are developed to help users understand the behavior, decision logic, and vulnerabilities of AI systems. A few explanation methods based on random masking have previously been developed for object detection. However, random masks may raise some issues regarding the actual importance of pixels within an image. In this paper, we design and implement a black-box explanation method named Black-box Object Detection Explanation by Masking (BODEM) by adopting a hierarchical random masking approach for object detection systems. We propose a hierarchical random masking framework in which coarse-grained masks are used at lower levels to find salient regions within an image, and fine-grained masks are used to refine the salient regions at higher levels. Experiments on various object detection datasets and models showed that BODEM can effectively explain the behavior of object detectors. Moreover, our method outperformed Detector Randomized Input Sampling for Explanation (D-RISE) and Local Interpretable Model-agnostic Explanations (LIME) with respect to different quantitative measures of explanation effectiveness. The experimental results demonstrate that BODEM can be an effective method for explaining and validating object detection systems in black-box testing scenarios.
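To make the coarse-to-fine idea concrete, the sketch below scores grid cells by how much masking them changes a black-box model's output, then refines only the most salient cells at the next level. It is a simplified, deterministic-grid variant of the hierarchical masking idea; the `score_fn` interface, grid sizes, and keep fraction are assumptions.

```python
# Toy coarse-to-fine masking saliency for a black-box model (assumed interfaces).
import numpy as np

def saliency(image, score_fn, levels=((8, 8), (16, 16)), keep_frac=0.25):
    h, w = image.shape[:2]
    sal = np.zeros((h, w))
    active = np.ones((h, w), dtype=bool)               # regions still under scrutiny
    for gh, gw in levels:                               # coarse -> fine grids
        cell_scores = []
        for i in range(gh):
            for j in range(gw):
                ys = slice(i * h // gh, (i + 1) * h // gh)
                xs = slice(j * w // gw, (j + 1) * w // gw)
                if not active[ys, xs].any():
                    continue
                masked = image.copy()
                masked[ys, xs] = 0                      # mask out this cell
                drop = score_fn(image) - score_fn(masked)
                sal[ys, xs] += drop
                cell_scores.append((drop, ys, xs))
        cell_scores.sort(key=lambda t: t[0], reverse=True)
        active[:] = False                               # refine only the top cells
        for drop, ys, xs in cell_scores[:max(1, int(keep_frac * len(cell_scores)))]:
            active[ys, xs] = True
    return sal

# Usage with a stand-in score function (e.g., detector confidence):
print(saliency(np.random.rand(64, 64, 3), score_fn=lambda img: float(img.mean())).shape)
```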
♻ ☆ Map-Free Visual Relocalization Enhanced by Instance Knowledge and Depth Knowledge
Map-free relocalization technology is crucial for applications in autonomous navigation and augmented reality where relying on pre-built maps is often impractical. However, it faces significant challenges due to limitations in matching methods and the inherent lack of scale in monocular images. These issues lead to substantial rotational and metric errors and even localization failures in real-world scenarios. Large matching errors significantly impact the overall relocalization process, affecting both rotational and translational accuracy. Due to the inherent limitations of the camera itself, recovering the metric scale from a single image is crucial, as this significantly impacts the translation error. To address these challenges, we propose a map-free relocalization method enhanced by instance knowledge and depth knowledge. By leveraging instance-based matching information to improve global matching results, our method significantly reduces the possibility of mismatching across different objects. The robustness of instance knowledge across the scene helps the feature point matching model focus on relevant regions and enhance matching accuracy. Additionally, we use estimated metric depth from a single image to reduce metric errors and improve scale recovery accuracy. By integrating methods dedicated to mitigating large translational and rotational errors, our approach demonstrates superior performance in map-free relocalization techniques.
comment: 17 pages, 6 figures
♻ ☆ CT-AGRG: Automated Abnormality-Guided Report Generation from 3D Chest CT Volumes
The rapid increase of computed tomography (CT) scans and their time-consuming manual analysis have created an urgent need for robust automated analysis techniques in clinical settings. These aim to assist radiologists and help them manage their growing workload. Existing methods typically generate entire reports directly from 3D CT images, without explicitly focusing on observed abnormalities. This unguided approach often results in repetitive content or incomplete reports, failing to prioritize anomaly-specific descriptions. We propose a new anomaly-guided report generation model, which first predicts abnormalities and then generates targeted descriptions for each. Evaluation on a public dataset demonstrates significant improvements in report quality and clinical relevance. We extend our work by conducting an ablation study to demonstrate its effectiveness.
comment: 15 pages, 9 figures, submitted to ISBI 2025
♻ ☆ Path-SAM2: Transfer SAM2 for digital pathology semantic segmentation
The semantic segmentation task in pathology plays an indispensable role in assisting physicians in determining the condition of tissue lesions. With the proposal of the Segment Anything Model (SAM), more and more foundation models have seen rapid development in the field of image segmentation. Recently, SAM2 has garnered widespread attention in both natural image and medical image segmentation. Compared to SAM, it has significantly improved in terms of segmentation accuracy and generalization performance. We compared the foundational models based on SAM and found that their performance in semantic segmentation of pathological images was hardly satisfactory. In this paper, we propose Path-SAM2, which for the first time adapts the SAM2 model to the task of pathological semantic segmentation. We integrate the largest pretrained vision encoder for histopathology (UNI) with the original SAM2 encoder, adding more pathology-based prior knowledge. Additionally, we introduce a learnable Kolmogorov-Arnold Networks (KAN) classification module to replace the manual prompt process. On three adenoma pathology datasets, Path-SAM2 achieves state-of-the-art performance. This study demonstrates the great potential of adapting SAM2 to pathology image segmentation tasks. We plan to release the code and model weights for this paper at: https://github.com/simzhangbest/SAM2PATH
comment: 5 pages, 5 figures
♻ ☆ Bayesian Evidential Learning for Few-Shot Classification
Few-Shot Classification (FSC) aims to generalize from base classes to novel classes given very limited labeled samples, which is an important step on the path toward human-like machine learning. State-of-the-art solutions involve learning to find a good metric and representation space to compute the distance between samples. Despite the promising accuracy performance, how to model uncertainty for metric-based FSC methods effectively is still a challenge. To model uncertainty, we place a distribution over class probabilities based on the theory of evidence. As a result, uncertainty modeling and metric learning can be decoupled. To reduce the uncertainty of classification, we propose a Bayesian evidence fusion theorem. Given observed samples, the network learns to obtain posterior distribution parameters given the prior parameters produced by the pre-trained network. Detailed gradient analysis shows that our method provides a smooth optimization target and can capture the uncertainty. The proposed method is agnostic to metric learning strategies and can be implemented as a plug-and-play module. We integrate our method into several recent FSC methods and demonstrate the improved accuracy and uncertainty quantification on standard FSC benchmarks.
comment: 15 pages
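In the standard evidential deep learning recipe, the network outputs non-negative class evidence that parameterizes a Dirichlet distribution, from which both predictions and an uncertainty estimate follow, and new evidence can be fused by adding concentrations. The sketch below follows that generic recipe and simplifies the paper's fusion theorem; it is not the authors' implementation.

```python
# Sketch of Dirichlet-based evidential classification with simple evidence fusion.
import torch
import torch.nn.functional as F

def dirichlet_from_logits(logits):
    evidence = F.softplus(logits)            # non-negative evidence per class
    return evidence + 1.0                    # Dirichlet concentration parameters

def predict_with_uncertainty(alpha):
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength                 # expected class probabilities
    uncertainty = alpha.size(-1) / strength.squeeze(-1)   # vacuity, in (0, 1]
    return probs, uncertainty

def fuse_evidence(alpha_prior, logits):
    # Conjugate-style update: new evidence is added to the prior concentrations.
    return alpha_prior + F.softplus(logits)

logits = torch.randn(4, 5)                   # 4 query samples, 5 classes
alpha = dirichlet_from_logits(logits)
probs, u = predict_with_uncertainty(alpha)
print(probs.sum(dim=-1), u)                  # rows sum to 1; u shrinks with evidence
print(predict_with_uncertainty(fuse_evidence(alpha, torch.randn(4, 5)))[1])
```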
♻ ☆ AI-Assisted Cervical Cancer Screening
Visual Inspection with Acetic Acid (VIA) remains the most feasible cervical cancer screening test in resource-constrained settings of low- and middle-income countries (LMICs), where it is often performed in screening camps or primary/community health centers by nurses instead of the preferred but unavailable expert gynecologist. To address the highly subjective nature of the test, various handheld devices integrating cameras or smartphones have been recently explored to capture cervical images during VIA and aid decision-making via telemedicine or AI models. Most studies proposing AI models retrospectively use a relatively small number of already collected images from specific devices, digital cameras, or smartphones; the challenges and protocols for quality image acquisition during VIA in resource-constrained camp settings, the difficulty of obtaining a gold standard, data imbalance, and similar issues are often overlooked. We present a novel approach and describe the end-to-end design process to build a robust smartphone-based AI-assisted system that does not require buying a separate integrated device: the proposed protocol for quality image acquisition in resource-constrained settings, a dataset collected from 1,430 women during VIA performed by nurses in screening camps, a preprocessing pipeline, and the training and evaluation of a deep-learning-based classification model aimed at identifying (pre)cancerous lesions. Our work shows that readily available smartphones, used with a suitable protocol, can capture cervix images with the detail required for the VIA test; that the deep-learning-based classification model provides promising results to assist nurses in VIA screening; and it provides a direction for large-scale data collection and validation in resource-constrained settings.
♻ ☆ Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images
We propose a simple yet effective pipeline for stylizing a 3D scene, harnessing the power of 2D image diffusion models. Given a NeRF model reconstructed from a set of multi-view images, we perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model. Given a target style prompt, we first generate perceptually similar multi-view images by leveraging a depth-conditioned diffusion model with an attention-sharing mechanism. Next, based on the stylized multi-view images, we propose to guide the style transfer process with the sliced Wasserstein loss based on the feature maps extracted from a pre-trained CNN model. Our pipeline consists of decoupled steps, allowing users to test various prompt ideas and preview the stylized 3D result before proceeding to the NeRF fine-tuning stage. We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality. Result videos are also available on our project page: https://haruolabs.github.io/style-n2n/
comment: 16 pages, 9 figures
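The style-transfer stage is driven by a sliced Wasserstein loss between CNN feature distributions of the rendered and stylized images. Below is a generic sliced Wasserstein sketch over flattened feature sets; the projection count and the equal-size assumption are illustrative, and this is not the authors' exact implementation.

```python
# Minimal sliced Wasserstein loss between two (N, C) feature sets.
import torch

def sliced_wasserstein(feat_a, feat_b, n_proj=64):
    """Assumes both feature sets have the same number of vectors N."""
    c = feat_a.size(1)
    proj = torch.randn(c, n_proj, device=feat_a.device)
    proj = proj / proj.norm(dim=0, keepdim=True)        # unit projection directions
    pa = (feat_a @ proj).sort(dim=0).values              # sorted 1-D projections
    pb = (feat_b @ proj).sort(dim=0).values
    return ((pa - pb) ** 2).mean()                        # quantile-wise squared gap

loss = sliced_wasserstein(torch.randn(1024, 256), torch.randn(1024, 256))
print(loss)
```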
♻ ☆ Rethinking Barely-Supervised Volumetric Medical Image Segmentation from an Unsupervised Domain Adaptation Perspective
This paper investigates an extremely challenging problem: barely-supervised volumetric medical image segmentation (BSS). A BSS training dataset consists of two parts: 1) a barely-annotated labeled set, where each labeled image contains only a single-slice annotation, and 2) an unlabeled set comprising numerous unlabeled volumetric images. State-of-the-art BSS methods employ a registration-based paradigm, which uses inter-slice image registration to propagate single-slice annotations into volumetric pseudo labels, constructing a completely annotated labeled set, to which a semi-supervised segmentation scheme can be applied. However, the paradigm has a critical limitation: the pseudo-labels generated by image registration are unreliable and noisy. Motivated by this, we propose a new perspective: instead of solving BSS within a semi-supervised learning scheme, this work formulates BSS as an unsupervised domain adaptation problem. To this end, we propose a novel BSS framework, \textbf{B}arely-supervised learning \textbf{via} unsupervised domain \textbf{A}daptation (BvA), as an alternative to the dominant registration paradigm. Specifically, we first design a novel noise-free labeled data construction algorithm (NFC) for slice-to-volume labeled data synthesis. Then, we introduce a frequency and spatial Mix-Up strategy (FSX) to mitigate the domain shifts. Extensive experiments demonstrate that our method provides a promising alternative for BSS. Remarkably, the proposed method, trained on the left atrial segmentation dataset with \textbf{only one} barely-labeled image, achieves a Dice score of 81.20%, outperforming the state-of-the-art by 61.71%. The code is available at https://github.com/Senyh/BvA.
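The frequency half of the Mix-Up strategy can be pictured as blending the low-frequency amplitude spectra of a source and a target slice, a common Fourier-style recipe for narrowing appearance gaps. The sketch below follows that generic recipe; the blending ratio and low-frequency radius are assumed values, and the paper's FSX module may differ in detail.

```python
# Sketch of a frequency-domain Mix-Up between two 2-D image slices.
import numpy as np

def freq_mixup(src, tgt, lam=0.5, radius=0.1):
    """Blend the low-frequency amplitude spectrum of src with that of tgt."""
    fs = np.fft.fftshift(np.fft.fft2(src))
    ft = np.fft.fftshift(np.fft.fft2(tgt))
    amp_s, pha_s, amp_t = np.abs(fs), np.angle(fs), np.abs(ft)
    h, w = src.shape
    cy, cx, r = h // 2, w // 2, int(min(h, w) * radius)
    amp_mix = amp_s.copy()
    low = (slice(cy - r, cy + r), slice(cx - r, cx + r))   # central low-frequency band
    amp_mix[low] = lam * amp_s[low] + (1 - lam) * amp_t[low]
    mixed = np.fft.ifft2(np.fft.ifftshift(amp_mix * np.exp(1j * pha_s)))
    return np.real(mixed)

print(freq_mixup(np.random.rand(128, 128), np.random.rand(128, 128)).shape)
```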
♻ ☆ Zero-shot 3D Segmentation of Abdominal Organs in CT Scans Using Segment Anything Model 2: Adapting Video Tracking Capabilities for 3D Medical Imaging
Purpose: To evaluate the zero-shot performance of Segment Anything Model 2 (SAM 2) in 3D segmentation of abdominal organs in CT scans, and to investigate the effects of prompt settings on segmentation results. Materials and Methods: Using a subset of the TotalSegmentator CT dataset (n = 123) from eight institutions, we assessed SAM 2's ability to segment eight abdominal organs. Segmentation was initiated from three different z-coordinate levels (caudal, mid, and cranial levels) of each organ. Performance was measured using the Dice similarity coefficient (DSC). We also analyzed the impact of "negative prompts," which explicitly exclude certain regions from the segmentation process, on accuracy. Additionally, we analyzed organ volumes to contextualize the segmentation performance. Results: As a zero-shot approach, larger organs with clear boundaries demonstrated high segmentation performance, with mean(median) DSCs as follows: liver 0.821(0.898), left kidney 0.870(0.921), right kidney 0.862(0.935), and spleen 0.891(0.932). Smaller organs showed lower performance: gallbladder 0.531(0.590), pancreas 0.361(0.359), and adrenal glands, right 0.203(0.109), left 0.308(0.231). The initial slice for segmentation and the use of negative prompts significantly influenced the results. By removing negative prompts from the input, the DSCs significantly decreased for six organs. Moderate positive correlations were observed between volume sizes and DSCs. Conclusion: SAM 2 demonstrated promising zero-shot performance in segmenting certain abdominal organs in CT scans, particularly larger organs with clear boundaries. Performance was significantly influenced by input negative prompts and initial slice selection, highlighting the importance of optimizing these factors for effective segmentation.
comment: 20 pages, 7 figures (including 2 supplemental figure), 4 tables
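All reported numbers use the Dice similarity coefficient, which for binary masks is twice the overlap divided by the sum of the two mask volumes. A small reference implementation for 3D masks is sketched below (generic definition, not tied to the study's pipeline).

```python
# Dice similarity coefficient for binary 3-D masks.
import numpy as np

def dice(pred, gt, eps=1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

a = np.zeros((4, 64, 64), dtype=bool); a[:, 16:48, 16:48] = True
b = np.zeros_like(a);                  b[:, 20:48, 16:48] = True
print(round(float(dice(a, b)), 3))     # overlap of two slightly offset boxes
```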
♻ ☆ RMT-BVQA: Recurrent Memory Transformer-based Blind Video Quality Assessment for Enhanced Video Content ECCV 2024
With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artifacts, and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content: the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimized through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross-validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
comment: This paper has been accepted by the ECCV 2024 AIM Advances in Image Manipulation workshop
♻ ☆ Group-aware Parameter-efficient Updating for Content-Adaptive Neural Video Compression ACM MM 2024
Content-adaptive compression is crucial for enhancing the adaptability of the pre-trained neural codec for various contents. Although these methods have been very practical in neural image compression (NIC), their application in neural video compression (NVC) is still limited due to two main aspects: 1) video compression relies heavily on temporal redundancy, so updating just one or a few frames can lead to significant errors accumulating over time; 2) NVC frameworks are generally more complex, with many large components that are not easy to update quickly during encoding. To address the previously mentioned challenges, we have developed a content-adaptive NVC technique called Group-aware Parameter-Efficient Updating (GPU). Initially, to minimize error accumulation, we adopt a group-aware approach for updating encoder parameters. This involves adopting a patch-based Group of Pictures (GoP) training strategy to segment a video into patch-based GoPs, which will be updated to facilitate a globally optimized domain-transferable solution. Subsequently, we introduce a parameter-efficient delta-tuning strategy, which is achieved by integrating several light-weight adapters into each coding component of the encoding process in both serial and parallel configurations. Such architecture-agnostic modules stimulate the components with large parameters, thereby reducing both the update cost and the encoding time. We incorporate our GPU into the latest NVC framework and conduct comprehensive experiments, whose results showcase outstanding video compression efficiency across four video benchmarks and adaptability on one medical image benchmark.
comment: Accepted by ACM MM 2024, Melbourne, Australia
♻ ☆ CrossDF: Improving Cross-Domain Deepfake Detection with Deep Information Decomposition
Deepfake technology poses a significant threat to security and social trust. Although existing detection methods have shown high performance in identifying forgeries within datasets that use the same deepfake techniques for both training and testing, they suffer from sharp performance degradation when faced with cross-dataset scenarios where unseen deepfake techniques are tested. To address this challenge, we propose a Deep Information Decomposition (DID) framework to enhance the performance of Cross-dataset Deepfake Detection (CrossDF). Unlike most existing deepfake detection methods, our framework prioritizes high-level semantic features over specific visual artifacts. Specifically, it adaptively decomposes facial features into deepfake-related and irrelevant information, only using the intrinsic deepfake-related information for real/fake discrimination. Moreover, it optimizes these two kinds of information to be independent with a de-correlation learning module, thereby enhancing the model's robustness against various irrelevant information changes and generalization ability to unseen forgery methods. Our extensive experimental evaluation and comparison with existing state-of-the-art detection methods validate the effectiveness and superiority of the DID framework on cross-dataset deepfake detection.
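One standard way to make two feature branches carry independent information, in the spirit of the de-correlation learning module described above, is to penalize their batch cross-covariance. The sketch below shows that generic objective; it is an assumed simplification rather than the paper's exact module.

```python
# Sketch of a de-correlation penalty between task-relevant and irrelevant features.
import torch

def decorrelation_loss(f_task, f_irrelevant):
    f1 = f_task - f_task.mean(dim=0, keepdim=True)
    f2 = f_irrelevant - f_irrelevant.mean(dim=0, keepdim=True)
    cov = f1.t() @ f2 / (f1.size(0) - 1)     # (D1, D2) cross-covariance over the batch
    return (cov ** 2).mean()                 # drive all cross-correlations toward zero

print(decorrelation_loss(torch.randn(32, 128), torch.randn(32, 64)))
```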
♻ ☆ Towards Extreme Image Compression with Latent Feature Guidance and Diffusion Prior
Image compression at extremely low bitrates (below 0.1 bits per pixel (bpp)) is a significant challenge due to substantial information loss. In this work, we propose a novel two-stage extreme image compression framework that exploits the powerful generative capability of pre-trained diffusion models to achieve realistic image reconstruction at extremely low bitrates. In the first stage, we treat the latent representation of images in the diffusion space as guidance, employing a VAE-based compression approach to compress images and initially decode the compressed information into content variables. The second stage leverages pre-trained stable diffusion to reconstruct images under the guidance of content variables. Specifically, we introduce a small control module to inject content information while keeping the stable diffusion model fixed to maintain its generative capability. Furthermore, we design a space alignment loss to force the content variables to align with the diffusion space and provide the necessary constraints for optimization. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in terms of visual performance at extremely low bitrates. The source code and trained models are available at https://github.com/huai-chang/DiffEIC.
comment: Accepted by IEEE TCSVT
♻ ☆ Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration
Multimodal learning leverages complementary information derived from different modalities, thereby enhancing performance in medical image segmentation. However, prevailing multimodal learning methods heavily rely on extensive well-annotated data from various modalities to achieve accurate segmentation performance. This dependence often poses a challenge in clinical settings due to limited availability of such data. Moreover, the inherent anatomical misalignment between different imaging modalities further complicates the endeavor to enhance segmentation performance. To address this problem, we propose a novel semi-supervised multimodal segmentation framework that is robust to scarce labeled data and misaligned modalities. Our framework employs a novel cross modality collaboration strategy to distill modality-independent knowledge, which is inherently associated with each modality, and integrates this information into a unified fusion layer for feature amalgamation. With a channel-wise semantic consistency loss, our framework ensures alignment of modality-independent information from a feature-wise perspective across modalities, thereby fortifying it against misalignments in multimodal scenarios. Furthermore, our framework effectively integrates contrastive consistent learning to regulate anatomical structures, facilitating anatomical-wise prediction alignment on unlabeled data in semi-supervised segmentation tasks. Our method achieves competitive performance compared to other multimodal methods across three tasks: cardiac, abdominal multi-organ, and thyroid-associated orbitopathy segmentations. It also demonstrates outstanding robustness in scenarios involving scarce labeled data and misaligned modalities.
♻ ☆ Weakly Supervised Intracranial Hemorrhage Segmentation with YOLO and an Uncertainty Rectified Segment Anything Model
Intracranial hemorrhage (ICH) is a life-threatening condition that requires rapid and accurate diagnosis to improve treatment outcomes and patient survival rates. Recent advancements in supervised deep learning have greatly improved the analysis of medical images, but often rely on extensive datasets with high-quality annotations, which are costly, time-consuming, and require medical expertise to prepare. To mitigate the need for large amounts of expert-prepared segmentation data, we have developed a novel weakly supervised ICH segmentation method that utilizes the YOLO object detection model and an uncertainty-rectified Segment Anything Model (SAM). In addition, we have proposed a novel point prompt generator for this model to further improve segmentation results with YOLO-predicted bounding box prompts. Our approach achieved a high accuracy of 0.933 and an AUC of 0.796 in ICH detection, along with a mean Dice score of 0.629 for ICH segmentation, outperforming existing weakly supervised and popular supervised (UNet and Swin-UNETR) approaches. Overall, the proposed method provides a robust and accurate alternative to the more commonly used supervised techniques for ICH quantification without requiring refined segmentation ground truths during model training.
comment: Manuscript was accepted at SWITCH2024. 10 pages, 2 figures
♻ ☆ Learned Image Transmission with Hierarchical Variational Autoencoder
In this paper, we introduce an innovative hierarchical joint source-channel coding (HJSCC) framework for image transmission, utilizing a hierarchical variational autoencoder (VAE). Our approach leverages a combination of bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly mapped to channel symbols for transmission by the JSCC encoder. We extend this framework to scenarios with a feedback link, modeling transmission over a noisy channel as a probabilistic sampling process and deriving a novel generative formulation for JSCC with feedback. Compared with existing approaches, our proposed HJSCC provides enhanced adaptability by dynamically adjusting transmission bandwidth, encoding these representations into varying amounts of channel symbols. Additionally, we introduce a rate attention module to guide the JSCC encoder in optimizing its encoding strategy based on prior information. Extensive experiments on images of varying resolutions demonstrate that our proposed model outperforms existing baselines in rate-distortion performance and maintains robustness against channel noise.
♻ ☆ A Novel Approach to Classify Power Quality Signals Using Vision Transformers
With the rapid integration of electronically interfaced renewable energy resources and loads into smart grids, there is increasing interest in power quality disturbances (PQD) classification to enhance the security and efficiency of these grids. This paper introduces a new approach to PQD classification based on the Vision Transformer (ViT) model. When a PQD occurs, the proposed approach first converts the power quality signal into an image and then utilizes a pre-trained ViT to accurately determine the class of the PQD. Unlike most previous works, which were limited to a few disturbance classes or small datasets, the proposed method is trained and tested on a large dataset with 17 disturbance classes. Our experimental results show that the proposed ViT-based approach achieves PQD classification precision and recall of 98.28% and 97.98%, respectively, outperforming recently proposed techniques applied to the same dataset.
comment: IECON 2024-50th Annual Conference of the IEEE Industrial Electronics Society, Chicago, U.S.A, 2024, pp. 1-6
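The abstract does not commit to a particular signal-to-image mapping, so the sketch below uses a plain magnitude spectrogram as one plausible choice; the sampling rate, window parameters, and the toy voltage-sag signal are assumptions.

```python
# Sketch: turn a 1-D power-quality waveform into a 2-D image for a ViT classifier.
import numpy as np

def to_image(signal, win=256, hop=64):
    frames = [signal[i:i + win] * np.hanning(win)
              for i in range(0, len(signal) - win + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=-1)).T   # (freq, time)
    spec = np.log1p(spec)
    return (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)  # scale to [0, 1]

fs = 3200
t = np.arange(0, 0.5, 1 / fs)
sag = np.where((t > 0.2) & (t < 0.3), 0.5, 1.0)     # a toy voltage sag disturbance
x = sag * np.sin(2 * np.pi * 50 * t)
print(to_image(x).shape)                            # resize/tile to the ViT input size
```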
♻ ☆ Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with a target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages. The first stage aims to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representations. The second stage further optimizes the text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors.
comment: Project page https://haozhuo-zhang.github.io/Hand1000-project-page/
♻ ☆ MV-VTON: Multi-View Virtual Try-On with Diffusion Models
The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. Moreover, we adopt diffusion models that have demonstrated superior abilities to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fitted to the person's view. Subsequently, we propose joint attention blocks to align and fuse clothing features with person features. Additionally, we collect an MV-VTON dataset, MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on the MV-VTON task using our MVG dataset, but is also superior on the frontal-view virtual try-on task using the VITON-HD and DressCode datasets. Codes and datasets are publicly released at https://github.com/hywang2002/MV-VTON .
comment: Project url: https://hywang2002.github.io/MV-VTON/
♻ ☆ Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning ECCV 2024
Federated Class Continual Learning (FCCL) merges the challenges of distributed client learning with the need for seamless adaptation to new classes without forgetting old ones. The key challenge in FCCL is catastrophic forgetting, an issue that has been explored to some extent in Continual Learning (CL). However, due to privacy preservation requirements, some conventional methods, such as experience replay, are not directly applicable to FCCL. Existing FCCL methods mitigate forgetting by generating historical data through federated training of GANs or data-free knowledge distillation. However, these approaches often suffer from unstable training of generators or low-quality generated data, limiting their guidance for the model. To address this challenge, we propose a novel method of data replay based on diffusion models. Instead of training a diffusion model, we employ a pre-trained conditional diffusion model to reverse-engineer each class, searching the corresponding input conditions for each class within the model's input space, significantly reducing computational resources and time consumption while ensuring effective generation. Furthermore, we enhance the classifier's domain generalization ability on generated and real data through contrastive learning, indirectly improving the representational capability of generated data for real data. Comprehensive experiments demonstrate that our method significantly outperforms existing baselines. Code is available at https://github.com/jinglin-liang/DDDR.
comment: Accepted by ECCV 2024 Oral
♻ ☆ ORMNet: Object-centric Relationship Modeling for Egocentric Hand-object Segmentation
Egocentric hand-object segmentation (EgoHOS) is a promising new task aiming at segmenting hands and interacting objects in egocentric images. Although EgoHOS has the potential to enable various applications, current methods struggle to achieve both high performance and end-to-end optimization simultaneously. Moreover, existing approaches fail to fully leverage hand cues to assist the interacting-object segmentation and overlook the coupled relationships between diverse interacting-object categories, resulting in performance deficiencies. To address these limitations, this paper proposes a novel Object-centric Relationship Modeling Network (ORMNet) to fulfill end-to-end and effective EgoHOS by modeling relationships between hands and objects as well as objects and objects. Specifically, a Hand-Object Relation (HOR) module is introduced to capture the correlation between hands and objects, which uses hand features to guide the network to extract more distinguishing interacting-object features. Besides, we identify the coupling relations between diverse interacting-object categories and design the Object Relation Decoupling (ORD) strategy to disentangle them, emphasizing the learning of the interaction between hands and objects and reducing the confusion of interacting-object classification. In-domain experiments show that ORMNet achieves notably superior segmentation performance compared with state-of-the-art methods, while out-of-domain experiments further exhibit its robust generalization capability. The project is available at https://github.com/yuggiehk/ORMNet/
♻ ☆ MADE-for-ASD: A Multi-Atlas Deep Ensemble Network for Diagnosing Autism Spectrum Disorder
In response to the global need for efficient early diagnosis of Autism Spectrum Disorder (ASD), this paper bridges the gap between traditional, time-consuming diagnostic methods and potential automated solutions. We propose a multi-atlas deep ensemble network, MADE-for-ASD, that integrates multiple atlases of the brain's functional magnetic resonance imaging (fMRI) data through a weighted deep ensemble network. Our approach integrates demographic information into the prediction workflow, which enhances ASD diagnosis performance and offers a more holistic perspective on patient profiling. We experiment with the well-known publicly available ABIDE (Autism Brain Imaging Data Exchange) I dataset, consisting of resting-state fMRI data from 17 different laboratories around the globe. Our proposed system achieves 75.20% accuracy on the entire dataset and 96.40% on a specific subset, both surpassing reported ASD diagnosis accuracy in ABIDE I fMRI studies. Specifically, our model improves by 4.4 percentage points over prior works on the same amount of data. The model exhibits a sensitivity of 82.90% and a specificity of 69.70% on the entire dataset, and 91.00% and 99.50%, respectively, on the specific subset. We leverage the F-score to pinpoint the top 10 ROIs in ASD diagnosis, such as the precuneus and anterior cingulate/ventromedial. The proposed system can potentially pave the way for more cost-effective, efficient and scalable strategies in ASD diagnosis. Codes and evaluations are publicly available at https://github.com/hasan-rakibul/MADE-for-ASD.
comment: Xuehan Liu and Md Rakibul Hasan contributed equally to this work
♻ ☆ MCDubber: Multimodal Context-Aware Expressive Video Dubbing NCMMSC2024
Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed \textbf{MCDubber}, to convert the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) A context duration aligner aims to learn the context-aware alignment between the text and lip frames; (2) A context prosody predictor seeks to read the global context visual sequence and predict the context-aware global energy and pitch; (3) A context acoustic decoder ultimately predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The extracted mel-spectrogram belonging to the target sentence from the output context mel-spectrograms is the final required dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that our MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
comment: Accepted by NCMMSC2024
♻ ☆ Asynchronous Blob Tracker for Event Cameras
Event-based cameras are popular for tracking fast-moving objects due to their high temporal resolution, low latency, and high dynamic range. In this paper, we propose a novel algorithm for tracking event blobs using raw events asynchronously in real time. We introduce the concept of an event blob as a spatio-temporal likelihood of event occurrence where the conditional spatial likelihood is blob-like. Many real-world objects, such as car headlights or any quickly moving foreground object, generate event blob data. The proposed algorithm uses a nearest neighbour classifier with a dynamic threshold criterion for data association, coupled with an extended Kalman filter to track the event blob state. Our algorithm achieves highly accurate blob tracking, velocity estimation, and shape estimation even under challenging lighting conditions and high-speed motions (> 11000 pixels/s). The microsecond time resolution achieved means that the filter output can be used to derive secondary information such as time-to-contact or range estimation, enabling applications to real-world problems such as collision avoidance in autonomous driving.
comment: 18 pages, 16 figures. The manuscript was accepted on August 7, 2024, by IEEE Transactions on Robotics
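The tracker's core is a predict-update loop over the blob state. The sketch below reduces that to a constant-velocity Kalman filter over (x, y) position only, as a stand-in for the paper's extended Kalman filter over the full blob state (position, velocity, shape); the matrices and noise levels are illustrative.

```python
# Minimal constant-velocity Kalman filter step for a tracked blob centre.
import numpy as np

def kf_step(x, P, z, dt, q=1e-2, r=1.0):
    F = np.eye(4); F[0, 2] = F[1, 3] = dt            # state: [px, py, vx, vy]
    H = np.zeros((2, 4)); H[0, 0] = H[1, 1] = 1.0    # events measure position only
    Q, R = q * np.eye(4), r * np.eye(2)
    x, P = F @ x, F @ P @ F.T + Q                    # predict to the event's timestamp
    y = z - H @ x                                     # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                    # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P         # update

x, P = np.zeros(4), np.eye(4)
for z in [np.array([1.0, 0.5]), np.array([2.1, 0.9])]:
    x, P = kf_step(x, P, z, dt=1e-3)
print(x[:2], x[2:])                                   # position and velocity estimates
```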
♻ ☆ From Lab to Field: Real-World Evaluation of an AI-Driven Smart Video Solution to Enhance Community Safety
This article adopts and evaluates an AI-enabled Smart Video Solution (SVS) designed to enhance safety in the real world. The system integrates with existing infrastructure camera networks, leveraging recent advancements in AI for easy adoption. Prioritizing privacy and ethical standards, pose-based data are used for downstream AI tasks such as anomaly detection. A cloud-based infrastructure and a mobile app are deployed, enabling real-time alerts within communities. The SVS employs innovative data representation and visualization techniques, such as the Occupancy Indicator, Statistical Anomaly Detection, Bird's Eye View, and Heatmaps, to understand pedestrian behaviors and enhance public safety. Evaluation of the SVS demonstrates its capacity to convert complex computer vision outputs into actionable insights for stakeholders, community partners, law enforcement, urban planners, and social scientists. This article presents a comprehensive real-world deployment and evaluation of the SVS, implemented in a community college environment across 16 cameras. The system integrates AI-driven visual processing, supported by statistical analysis, database management, cloud communication, and user notifications. Additionally, the article evaluates the end-to-end latency from the moment an AI algorithm detects anomalous behavior in real-time at the camera level to the time stakeholders receive a notification. The results demonstrate the system's robustness, effectively managing 16 CCTV cameras with a consistent throughput of 16.5 frames per second (FPS) over a 21-hour period and an average end-to-end latency of 26.76 seconds between anomaly detection and alert issuance.
♻ ☆ Depth-guided NeRF Training via Earth Mover's Distance ECCV 2024
Neural Radiance Fields (NeRFs) are trained to minimize the rendering loss of predicted viewpoints. However, the photometric loss often does not provide enough information to disambiguate between different possible geometries yielding the same image. Previous work has thus incorporated depth supervision during NeRF training, leveraging dense predictions from pre-trained depth networks as pseudo-ground truth. While these depth priors are assumed to be perfect once filtered for noise, in practice, their accuracy is more challenging to capture. This work proposes a novel approach to uncertainty in depth priors for NeRF supervision. Instead of using custom-trained depth or uncertainty priors, we use off-the-shelf pretrained diffusion models to predict depth and capture uncertainty during the denoising process. Because we know that depth priors are prone to errors, we propose to supervise the ray termination distance distribution with Earth Mover's Distance instead of enforcing the rendered depth to replicate the depth prior exactly through L2-loss. Our depth-guided NeRF outperforms all baselines on standard depth metrics by a large margin while maintaining performance on photometric measures.
comment: Accepted to ECCV 2024
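In one dimension, the Earth Mover's Distance between two distributions over depth bins reduces to the area between their cumulative distribution functions, which makes it cheap to use as a per-ray loss. The sketch below is that generic formulation; the bin layout and the Gaussian-shaped prior are illustrative assumptions, not the paper's setup.

```python
# 1-D EMD between a ray's rendered termination distribution and a depth prior.
import torch

def emd_1d(weights, prior, bin_width=1.0):
    """weights, prior: non-negative histograms over the same depth bins."""
    w = weights / weights.sum().clamp_min(1e-8)
    p = prior / prior.sum().clamp_min(1e-8)
    cdf_diff = torch.cumsum(w - p, dim=-1)      # difference of the two CDFs
    return bin_width * cdf_diff.abs().sum()

bins = torch.linspace(0, 8, 64)
render_w = torch.softmax(-(bins - 3.0) ** 2 / 0.5, dim=0)   # rendered mass near 3 m
prior_w = torch.softmax(-(bins - 3.4) ** 2 / 0.5, dim=0)    # noisy prior near 3.4 m
print(emd_1d(render_w, prior_w, bin_width=float(bins[1] - bins[0])))
```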
♻ ☆ VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training IROS 2024
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation. However, most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects -- not necessarily relevant to navigation and potentially misleading. Alternative approaches train specialized navigation models from scratch, requiring significant computation. On the other hand, self-supervised learning has revolutionized computer vision and natural language processing, but its application to robotic navigation remains underexplored due to the difficulty of defining effective self-supervision signals. Motivated by these observations, in this work, we propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP). Instead of detecting salient objects that are beneficial for tasks such as classification or detection, VANP learns to focus only on specific visual regions that are relevant to the navigation task. To achieve this, VANP uses a history of visual observations, future actions, and a goal image for self-supervision, and embeds them using two small Transformer Encoders. Then, VANP maximizes the information between the embeddings by using a mutual information maximization objective function. We demonstrate that most VANP-extracted features match with human navigation intuition. VANP achieves comparable performance as models learned end-to-end with half the training time and models trained on a large-scale, fully supervised dataset, i.e., ImageNet, with only 0.08% data.
comment: Extended version of the paper accepted at IROS 2024. Code: https://github.com/mhnazeri/VANP
♻ ☆ Contrastive Learning with Consistent Representations
Contrastive learning demonstrates great promise for representation learning. Data augmentations play a critical role in contrastive learning by providing informative views of the data without necessitating explicit labels. Nonetheless, the efficacy of current methodologies heavily hinges on the quality of employed data augmentation (DA) functions, often chosen manually from a limited set of options. While exploiting diverse data augmentations is appealing, the complexities inherent in both DAs and representation learning can lead to performance deterioration. Addressing this challenge and facilitating the systematic incorporation of diverse data augmentations, this paper proposes Contrastive Learning with Consistent Representations (CoCor). At the heart of CoCor is a novel consistency metric termed DA consistency. This metric governs the mapping of augmented input data to the representation space, ensuring that these instances are positioned optimally in a manner consistent with the applied intensity of the DA. Moreover, we propose to learn the optimal mapping locations as a function of DA, all while preserving a desired monotonic property relative to DA intensity. Experimental results demonstrate that CoCor notably enhances the generalizability and transferability of learned representations in comparison to baseline methods.
comment: Accepted by TMLR
♻ ☆ ConGeo: Robust Cross-view Geo-localization across Ground View Variations ECCV2024
Cross-view geo-localization aims at localizing a ground-level query image by matching it to its corresponding geo-referenced aerial view. In real-world scenarios, the task requires accommodating diverse ground images captured by users with varying orientations and reduced field of views (FoVs). However, existing learning pipelines are orientation-specific or FoV-specific, demanding separate model training for different ground view variations. Such models heavily depend on the North-aligned spatial correspondence and predefined FoVs in the training data, compromising their robustness across different settings. To tackle this challenge, we propose ConGeo, a single- and cross-view Contrastive method for Geo-localization: it enhances robustness and consistency in feature representations to improve a model's invariance to orientation and its resilience to FoV variations, by enforcing proximity between ground view variations of the same location. As a generic learning objective for cross-view geo-localization, when integrated into state-of-the-art pipelines, ConGeo significantly boosts the performance of three base models on four geo-localization benchmarks for diverse ground view variations and outperforms competing methods that train separate models for each ground view variation.
comment: ECCV2024. Project page at https://eceo-epfl.github.io/ConGeo/
Artificial Intelligence 126
☆ RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)
Effective collaboration of dual-arm robots and their tool use capabilities are increasingly important areas in the advancement of robotics. These skills play a significant role in expanding robots' ability to operate in diverse real-world environments. However, progress is impeded by the scarcity of specialized training data. This paper introduces RoboTwin, a novel benchmark dataset combining real-world teleoperated data with synthetic data from digital twins, designed for dual-arm robotic scenarios. Using the COBOT Magic platform, we have collected diverse data on tool usage and human-robot interaction. We present an innovative approach to creating digital twins using AI-generated content, transforming 2D images into detailed 3D models. Furthermore, we utilize large language models to generate expert-level training data and task-specific pose sequences oriented toward functionality. Our key contributions are: 1) the RoboTwin benchmark dataset, 2) an efficient real-to-simulation pipeline, and 3) the use of language models for automatic expert-level data generation. These advancements are designed to address the shortage of robotic training data, potentially accelerating the development of more capable and versatile robotic systems for a wide range of real-world applications. The project page is available at https://robotwin-benchmark.github.io/early-version/
comment: Project page: https://robotwin-benchmark.github.io/early-version/
☆ UC-NeRF: Uncertainty-aware Conditional Neural Radiance Fields from Endoscopic Sparse Views
Visualizing surgical scenes is crucial for revealing internal anatomical structures during minimally invasive procedures. Novel View Synthesis is a vital technique that offers geometry and appearance reconstruction, enhancing understanding, planning, and decision-making in surgical scenes. Despite the impressive achievements of Neural Radiance Field (NeRF), its direct application to surgical scenes produces unsatisfying results due to two challenges: endoscopic sparse views and significant photometric inconsistencies. In this paper, we propose uncertainty-aware conditional NeRF for novel view synthesis to tackle the severe shape-radiance ambiguity from sparse surgical views. The core of UC-NeRF is to incorporate the multi-view uncertainty estimation to condition the neural radiance field for modeling the severe photometric inconsistencies adaptively. Specifically, our UC-NeRF first builds a consistency learner in the form of multi-view stereo network, to establish the geometric correspondence from sparse views and generate uncertainty estimation and feature priors. In neural rendering, we design a base-adaptive NeRF network to exploit the uncertainty estimation for explicitly handling the photometric inconsistencies. Furthermore, an uncertainty-guided geometry distillation is employed to enhance geometry learning. Experiments on the SCARED and Hamlyn datasets demonstrate our superior performance in rendering appearance and geometry, consistently outperforming the current state-of-the-art approaches. Our code will be released at \url{https://github.com/wrld/UC-NeRF}.
☆ Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling
Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling the auto-regressive models (ARMs) for language modeling tasks. The recent effort in simplifying the masked diffusion framework further leads to alignment with continuous-space diffusion models and more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling aspect is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs' original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20$\times$ speedup. In addition, our investigation challenges previous claims that MDMs can surpass ARMs in generative perplexity. We identify, for the first time, an underlying numerical issue, even with the 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, leading to unfair assessments of MDMs' generation results in the previous literature.
comment: 40 pages
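The numerical issue the paper identifies concerns Gumbel-based categorical sampling: computing the Gumbel noise in low precision truncates its tail and effectively lowers the sampling temperature. The probe below contrasts float32 and float64 noise on a toy skewed categorical; it only illustrates the mechanism, and the logits, sample count, and any observed gap are assumptions rather than the paper's measurements.

```python
# Illustrative probe of precision effects in Gumbel-max categorical sampling.
import torch

def gumbel_max_sample(logits, dtype, n=200_000):
    logits = logits.to(dtype)
    u = torch.rand(n, logits.numel(), dtype=dtype)
    gumbel = -torch.log(-torch.log(u))       # tail is clipped in lower precision
    return (logits + gumbel).argmax(dim=-1)

logits = torch.tensor([4.0, 0.0, -4.0])       # deliberately skewed categorical
target = torch.softmax(logits, dim=-1)
print("target   ", [round(t, 4) for t in target.tolist()])
for dtype in (torch.float32, torch.float64):
    idx = gumbel_max_sample(logits, dtype)
    freq = torch.bincount(idx, minlength=3).float() / idx.numel()
    print(str(dtype), [round(f, 4) for f in freq.tolist()])
```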
☆ LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Expanding the long-context capabilities of Multi-modal Large Language Models~(MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as \textit{degraded performance with more images} and \textit{high computational costs}. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model \textbf{LongLLaVA}~(\textbf{Long}-Context \textbf{L}arge \textbf{L}anguage \textbf{a}nd \textbf{V}ision \textbf{A}ssistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
comment: 19 pages, 7 figures, 6 tables
☆ Multi-stream deep learning framework to predict mild cognitive impairment with Rey Complex Figure Test
Drawing tests like the Rey Complex Figure Test (RCFT) are widely used to assess cognitive functions such as visuospatial skills and memory, making them valuable tools for detecting mild cognitive impairment (MCI). Despite their utility, existing predictive models based on these tests often suffer from limitations like small sample sizes and lack of external validation, which undermine their reliability. We developed a multi-stream deep learning framework that integrates two distinct processing streams: a multi-head self-attention based spatial stream using raw RCFT images and a scoring stream employing a previously developed automated scoring system. Our model was trained on data from 1,740 subjects in the Korean cohort and validated on an external hospital dataset of 222 subjects from Korea. The proposed multi-stream model demonstrated superior performance over baseline models (AUC = 0.872, Accuracy = 0.781) in external validation. The integration of both spatial and scoring streams enables the model to capture intricate visual details from the raw images while also incorporating structured scoring data, which together enhance its ability to detect subtle cognitive impairments. This dual approach not only improves predictive accuracy but also increases the robustness of the model, making it more reliable in diverse clinical settings. Our model has practical implications for clinical settings, where it could serve as a cost-effective tool for early MCI screening.
comment: 20 pages, 3 figures, 2 tables
☆ Configurable Foundation Models: Building LLMs from a Modular Perspective
Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their enormous parameter requirements, making it increasingly cumbersome to apply and evolve these models on devices with limited computational resources or in scenarios requiring diverse abilities. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with a subset of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitations of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase - and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis of widely used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundation models.
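As a rough intuition for the "retrieval and routing" operation over bricks, the sketch below routes each token to the top-k expert FFNs of a toy mixture-of-experts layer; the layer sizes, gating rule, and expert count are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKBrickRouter(nn.Module):
    """Toy mixture-of-experts FFN: each token activates only k 'bricks'."""

    def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.gate(x)                  # (tokens, n_experts)
        topk = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk.values, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):             # combine only the selected bricks
            idx = topk.indices[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKBrickRouter()(tokens).shape)  # torch.Size([10, 64])
```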
☆ Hybrid Imitation-Learning Motion Planner for Urban Driving
With the release of open-source datasets such as nuPlan and Argoverse, research on learning-based planners has expanded considerably in recent years. Existing systems have shown excellent capabilities in imitating human driver behaviour, but they struggle to guarantee safe closed-loop driving. Conversely, optimization-based planners offer greater safety in short-term planning scenarios. To confront this challenge, in this paper we propose a novel hybrid motion planner that integrates both learning-based and optimization-based techniques. Initially, a multilayer perceptron (MLP) generates a human-like trajectory, which is then refined by an optimization-based component. This component not only minimizes tracking errors but also computes a trajectory that is kinematically feasible and collision-free with respect to obstacles and road boundaries. Our model effectively balances safety and human-likeness, mitigating the trade-off inherent in these objectives. We validate our approach through simulation experiments and further demonstrate its efficacy by deploying it in real-world self-driving vehicles.
☆ Bioinformatics Retrieval Augmentation Data (BRAD) Digital Assistant
We present a prototype for a Bioinformatics Retrieval Augmentation Data (BRAD) digital assistant. BRAD integrates a suite of tools to handle a wide range of bioinformatics tasks, from code execution to online search. We demonstrate BRAD's capabilities through (1) improved question answering with retrieval-augmented generation (RAG), (2) BRAD's ability to run and write complex software pipelines, and (3) BRAD's ability to organize and distribute tasks across individual agents and teams of agents. We use BRAD to automate bioinformatics workflows, performing tasks ranging from gene enrichment and archive search to automatic code generation and running biomarker identification pipelines. BRAD is a step toward the ultimate goal of developing a digital twin of laboratories, driven by self-contained loops for hypothesis generation and testing of digital biology experiments.
☆ Oops, I Sampled it Again: Reinterpreting Confidence Intervals in Few-Shot Learning
The predominant method for computing confidence intervals (CI) in few-shot learning (FSL) is based on sampling the tasks with replacement, i.e.\ allowing the same samples to appear in multiple tasks. This makes the CI misleading in that it takes into account the randomness of the sampler but not the data itself. To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. These reveal a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the (size of the) CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at https://github.com/RafLaf/FSL-benchmark-again
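The gap between the two confidence intervals can be illustrated with a small synthetic simulation (made-up accuracy pool, not the paper's benchmark): compare the spread of per-run mean accuracy when every evaluation run re-samples tasks from the same fixed pool versus when each run also draws a fresh pool:

```python
import numpy as np

rng = np.random.default_rng(1)
pool_size, n_tasks, task_size, n_runs = 600, 1000, 25, 200

def run_mean(pool):
    # One evaluation run: average accuracy over n_tasks tasks, each task
    # re-sampling task_size items from the pool (samples can repeat across tasks).
    idx = rng.integers(0, len(pool), size=(n_tasks, task_size))
    return pool[idx].mean()

# (a) the usual protocol: every run re-samples tasks from the SAME fixed pool
fixed_pool = rng.binomial(1, 0.7, size=pool_size).astype(float)
same_pool = [run_mean(fixed_pool) for _ in range(n_runs)]

# (b) each run also gets a FRESH pool, so data randomness is included
fresh_pool = [run_mean(rng.binomial(1, 0.7, size=pool_size).astype(float))
              for _ in range(n_runs)]

print("spread, fixed pool :", np.std(same_pool))   # captures sampler randomness only
print("spread, fresh pools:", np.std(fresh_pool))  # also includes data variability
```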
☆ R2GQA: Retriever-Reader-Generator Question Answering System to Support Students Understanding Legal Regulations in Higher Education
In this article, we propose the R2GQA system, a Retriever-Reader-Generator Question Answering system, consisting of three main components: Document Retriever, Machine Reader, and Answer Generator. The Retriever module employs advanced information retrieval techniques to extract the context of articles from a dataset of legal regulation documents. The Machine Reader module utilizes state-of-the-art natural language understanding algorithms to comprehend the retrieved documents and extract answers. Finally, the Generator module synthesizes the extracted answers into concise and informative responses to students' questions regarding legal regulations. Furthermore, we built the ViRHE4QA dataset in the domain of university training regulations, comprising 9,758 question-answer pairs with a rigorous construction process. This is the first Vietnamese dataset in the higher-education regulations domain with various types of answers, both extractive and abstractive. In addition, the R2GQA system is the first system to offer abstractive answers in Vietnamese. This paper discusses the design and implementation of each module within the R2GQA system on the ViRHE4QA dataset, highlighting their functionalities and interactions. Furthermore, we present experimental results demonstrating the effectiveness and utility of the proposed system in supporting students' comprehension of legal regulations in higher education settings. In general, the R2GQA system and the ViRHE4QA dataset promise to contribute significantly to related research and help students navigate complex legal documents and regulations, empowering them to make informed decisions and adhere to institutional policies effectively. Our dataset is available for research purposes.
Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models
This study analyzes predictive statements, hope speech, and regret detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named "Prediction statements," categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering a nuanced interplay between these emotions and predictive behaviors. Despite limitations related to data volume and resource availability, our study reports valuable findings concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.
☆ A hybrid FEM-PINN method for time-dependent partial differential equations
In this work, we present a hybrid numerical method for solving evolution partial differential equations (PDEs) by merging the time finite element method with deep neural networks. In contrast to the conventional deep learning-based formulation where the neural network is defined on a spatiotemporal domain, our methodology utilizes finite element basis functions in the time direction where the space-dependent coefficients are defined as the output of a neural network. We then apply the Galerkin or collocation projection in the time direction to obtain a system of PDEs for the space-dependent coefficients which is approximated in the framework of PINN. The advantages of such a hybrid formulation are twofold: statistical errors are avoided for the integral in the time direction, and the neural network's output can be regarded as a set of reduced spatial basis functions. To further alleviate the difficulties from high dimensionality and low regularity, we have developed an adaptive sampling strategy that refines the training set. More specifically, we use an explicit density model to approximate the distribution induced by the PDE residual and then augment the training set with new time-dependent random samples given by the learned density model. The effectiveness and efficiency of our proposed method have been demonstrated through a series of numerical experiments.
comment: 25 pages
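The core ansatz can be sketched as a separable representation in which fixed finite-element hat functions cover the time direction and a small network supplies the space-dependent coefficients; the node count, network size, and shapes below are illustrative assumptions rather than the paper's setup:

```python
import torch
import torch.nn as nn

class TimeFEMPINN(nn.Module):
    """u(x, t) = sum_j phi_j(t) * c_j(x), with phi_j piecewise-linear in time."""

    def __init__(self, n_time_nodes=9, hidden=32):
        super().__init__()
        self.t_nodes = torch.linspace(0.0, 1.0, n_time_nodes)
        # One network outputs all space-dependent coefficients c_1(x)..c_J(x).
        self.coeff_net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, n_time_nodes)
        )

    def hat(self, t):                              # phi_j(t): linear hat functions
        h = self.t_nodes[1] - self.t_nodes[0]
        return torch.clamp(1.0 - (t[:, None] - self.t_nodes[None, :]).abs() / h, min=0.0)

    def forward(self, x, t):                       # x, t: (batch,)
        c = self.coeff_net(x[:, None])             # (batch, J) spatial coefficients
        return (self.hat(t) * c).sum(dim=-1)       # (batch,) predicted u(x, t)

model = TimeFEMPINN()
x, t = torch.rand(5), torch.rand(5)
print(model(x, t).shape)  # torch.Size([5])
```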
☆ Towards Edge-Based Data Lake Architecture for Intelligent Transportation System
Rapid urban growth has underscored the need for innovative solutions to enhance transportation efficiency and safety. Intelligent Transportation Systems (ITS) have emerged as a promising solution in this context. However, analyzing and processing the massive and intricate data generated by ITS presents significant challenges for traditional data processing systems. This work proposes an Edge-based Data Lake Architecture to integrate and analyze the complex data from ITS efficiently. The architecture offers scalability, fault tolerance, and performance, improving decision-making and enhancing innovative services for a more intelligent transportation ecosystem. We demonstrate the effectiveness of the architecture through an analysis of three different use cases: (i) Vehicular Sensor Network, (ii) Mobile Network, and (iii) Driver Identification applications.
☆ Governing dual-use technologies: Case studies of international security agreements and lessons for AI governance
International AI governance agreements and institutions may play an important role in reducing global security risks from advanced AI. To inform the design of such agreements and institutions, we conducted case studies of historical and contemporary international security agreements. We focused specifically on those arrangements around dual-use technologies, examining agreements in nuclear security, chemical weapons, biosecurity, and export controls. For each agreement, we examined four key areas: (a) purpose, (b) core powers, (c) governance structure, and (d) instances of non-compliance. From these case studies, we extracted lessons for the design of international AI agreements and governance institutions. We discuss the importance of robust verification methods, strategies for balancing power between nations, mechanisms for adapting to rapid technological change, approaches to managing trade-offs between transparency and security, incentives for participation, and effective enforcement mechanisms.
☆ An incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting
This paper introduces a novel incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting (MCS) problems, enabling decision makers to progressively provide assignment example preference information. Specifically, we first construct a max-margin optimization-based model to model potentially non-monotonic preferences and inconsistent assignment example preference information in each iteration of the incremental preference elicitation process. Using the optimal objective function value of the max-margin optimization-based model, we devise information amount measurement methods and question selection strategies to pinpoint the most informative alternative in each iteration within the framework of uncertainty sampling in active learning. Once the termination criterion is satisfied, the sorting result for non-reference alternatives can be determined through the use of two optimization models, i.e., the max-margin optimization-based model and the complexity controlling optimization model. Subsequently, two incremental preference elicitation-based algorithms are developed to learn potentially non-monotonic preferences, considering different termination criteria. Ultimately, we apply the proposed approach to a credit rating problem to elucidate the detailed implementation steps, and perform computational experiments on both artificial and real-world data sets to compare the proposed question selection strategies with several benchmark strategies.
comment: 37 pages, 22 figures
☆ Tractable Offline Learning of Regular Decision Processes
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs). In RDPs, the unknown dependency of future observations and rewards from the past interactions can be captured by some hidden finite-state automaton. For this reason, many RDP algorithms first reconstruct this unknown dependency using automata learning techniques. In this paper, we show that it is possible to overcome two strong limitations of previous offline RL algorithms for RDPs, notably RegORL. This can be accomplished via the introduction of two original techniques: the development of a new pseudometric based on formal languages, which removes a problematic dependency on $L_\infty^\mathsf{p}$-distinguishability parameters, and the adoption of Count-Min-Sketch (CMS), instead of naive counting. The former reduces the number of samples required in environments that are characterized by a low complexity in language-theoretic terms. The latter alleviates the memory requirements for long planning horizons. We derive the PAC sample complexity bounds associated to each of these techniques, and we validate the approach experimentally.
comment: To appear in EWRL 2024
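For readers unfamiliar with Count-Min-Sketch, a minimal generic implementation is sketched below (not the authors' code; the width, depth, and hashing scheme are arbitrary illustrative choices). It keeps approximate frequency counts in sub-linear memory and over-estimates, but never under-estimates, a count:

```python
import numpy as np

class CountMinSketch:
    """Approximate frequency counts in O(width * depth) memory."""

    def __init__(self, width=2048, depth=4, seed=0):
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        rng = np.random.default_rng(seed)
        self.salts = rng.integers(1, 2**31 - 1, size=depth)

    def _cols(self, item):
        # One hash position per row; distinct salts emulate independent hashes.
        return [hash((int(salt), item)) % self.width for salt in self.salts]

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row, col] += count

    def query(self, item):
        # Minimum over rows: an upper bound on the true count (never lower).
        return min(self.table[row, col] for row, col in enumerate(self._cols(item)))

cms = CountMinSketch()
for trace in ["aab", "aab", "abb"]:        # e.g. observation histories in an episode
    cms.add(trace)
print(cms.query("aab"), cms.query("zzz"))  # 2 (exact here), 0
```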
☆ Creating a Gen-AI based Track and Trace Assistant MVP (SuperTracy) for PostNL
Developments in the field of generative AI have created many opportunities for companies, for instance to improve efficiency in customer service and to automate tasks. PostNL, the largest parcel and e-commerce corporation in the Netherlands, wants to use generative AI to enhance communication around the track and trace of parcels. During the internship, a Minimum Viable Product (MVP) was created to showcase the value of generative AI technologies for enhancing parcel tracking, analyzing the parcel's journey, and communicating about it in an easy-to-understand manner. The primary goal was to develop an in-house LLM-based system, reducing dependency on external platforms and establishing the feasibility of a dedicated generative AI team within the company. This multi-agent LLM-based system aimed to construct parcel journey stories and identify logistical disruptions with heightened efficiency and accuracy. The research involved deploying a sophisticated AI-driven communication system, employing Retrieval-Augmented Generation (RAG) for enhanced response precision, and optimizing large language models (LLMs) tailored to domain-specific tasks. The MVP successfully implemented a multi-agent open-source LLM system, called SuperTracy. SuperTracy is capable of autonomously managing a broad spectrum of user inquiries and improving internal knowledge handling. Results and evaluation demonstrated technological innovation and feasibility, notably in communication about the track and trace of a parcel, which exceeded initial expectations. These advancements highlight the potential of AI-driven solutions in logistics, suggesting many opportunities for further refinement and broader implementation within PostNL's operational framework.
☆ Incorporating Like-Minded Peers to Overcome Friend Data Sparsity in Session-Based Social Recommendations
Session-based Social Recommendation (SSR) leverages social relationships within online networks to enhance the performance of Session-based Recommendation (SR). However, existing SSR algorithms often encounter the challenge of ``friend data sparsity''. Moreover, significant discrepancies can exist between the purchase preferences of social network friends and those of the target user, reducing the influence of friends relative to the target user's own preferences. To address these challenges, this paper introduces the concept of ``Like-minded Peers'' (LMP), representing users whose preferences align with the target user's current session based on their historical sessions. This is the first work, to our knowledge, that uses LMP to enhance the modeling of social influence in SSR. This approach not only alleviates the problem of friend data sparsity but also effectively incorporates users with similar preferences to the target user. We propose a novel model named Transformer Encoder with Graph Attention Aggregator Recommendation (TEGAARec), which includes the TEGAA module and the GAT-based social aggregation module. The TEGAA module captures and merges both long-term and short-term interests for target users and LMP users. Concurrently, the GAT-based social aggregation module is designed to aggregate the target users' dynamic interests and social influence in a weighted manner. Extensive experiments on four real-world datasets demonstrate the efficacy and superiority of our proposed model and ablation studies are done to illustrate the contributions of each component in TEGAARec.
☆ Decision Transformer for Enhancing Neural Local Search on the Job Shop Scheduling Problem
The job shop scheduling problem (JSSP) and its solution algorithms have been of enduring interest in both academia and industry for decades. In recent years, machine learning (ML) is playing an increasingly important role in advancing existing and building new heuristic solutions for the JSSP, aiming to find better solutions in shorter computation times. In this paper we build on top of a state-of-the-art deep reinforcement learning (DRL) agent, called Neural Local Search (NLS), which can efficiently and effectively control a large local neighborhood search on the JSSP. In particular, we develop a method for training the decision transformer (DT) algorithm on search trajectories taken by a trained NLS agent to further improve upon the learned decision-making sequences. Our experiments show that the DT successfully learns local search strategies that are different and, in many cases, more effective than those of the NLS agent itself. In terms of the tradeoff between solution quality and acceptable computational time needed for the search, the DT is particularly superior in application scenarios where longer computational times are acceptable. In this case, it makes up for the longer inference times required per search step, which are caused by the larger neural network architecture, through better quality decisions per step. Thereby, the DT achieves state-of-the-art results for solving the JSSP with ML-enhanced search.
comment: currently under review for IEEE Transactions on Cybernetics
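The decision-transformer training signal can be pictured as packing each recorded search trajectory into (return-to-go, state, action) triples; the snippet below is a hypothetical data-preparation sketch with made-up feature sizes, not the authors' pipeline:

```python
import numpy as np

def to_dt_tokens(states, actions, rewards):
    """Convert one recorded local-search trajectory into DT training triples."""
    rewards = np.asarray(rewards, dtype=float)
    # Return-to-go at step t = sum of future rewards, the usual DT conditioning.
    returns_to_go = np.cumsum(rewards[::-1])[::-1]
    return [
        {"rtg": float(r), "state": s, "action": int(a)}
        for r, s, a in zip(returns_to_go, states, actions)
    ]

# Toy trajectory: 4 search steps, each state a small feature vector,
# each action the index of a neighbourhood move, reward = makespan improvement.
states = [np.random.rand(8) for _ in range(4)]
actions = [3, 1, 0, 2]
rewards = [0.0, 2.0, 0.0, 1.0]
print([tok["rtg"] for tok in to_dt_tokens(states, actions, rewards)])  # [3.0, 3.0, 1.0, 1.0]
```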
☆ The Role of Artificial Intelligence and Machine Learning in Software Testing
Artificial Intelligence (AI) and Machine Learning (ML) have significantly impacted various industries, including software development. Software testing, a crucial part of the software development lifecycle (SDLC), ensures the quality and reliability of software products. Traditionally, software testing has been a labor-intensive process requiring significant manual effort. However, the advent of AI and ML has transformed this landscape by introducing automation and intelligent decision-making capabilities. AI and ML technologies enhance the efficiency and effectiveness of software testing by automating complex tasks such as test case generation, test execution, and result analysis. These technologies reduce the time required for testing and improve the accuracy of defect detection, ultimately leading to higher quality software. AI can predict potential areas of failure by analyzing historical data and identifying patterns, which allows for more targeted and efficient testing. This paper explores the role of AI and ML in software testing by reviewing existing literature, analyzing current tools and techniques, and presenting case studies that demonstrate the practical benefits of these technologies. The literature review provides a comprehensive overview of the advancements in AI and ML applications in software testing, highlighting key methodologies and findings from various studies. The analysis of current tools showcases the capabilities of popular AI-driven testing tools such as Eggplant AI, Test.ai, Selenium, Appvance, Applitools Eyes, Katalon Studio, and Tricentis Tosca, each offering unique features and advantages. Case studies included in this paper illustrate real-world applications of AI and ML in software testing, showing significant improvements in testing efficiency, accuracy, and overall software quality.
LLM-Assisted Visual Analytics: Opportunities and Challenges
We explore the integration of large language models (LLMs) into visual analytics (VA) systems to transform their capabilities through intuitive natural language interactions. We survey current research directions in this emerging field, examining how LLMs are integrated into data management, language interaction, visualisation generation, and language generation processes. We highlight the new possibilities that LLMs bring to VA, especially how they can change VA processes beyond the usual use cases. In particular, we highlight building new visualisation-language models, allowing access to a breadth of domain knowledge, multimodal interaction, and opportunities for guidance. Finally, we carefully consider the prominent challenges of using current LLMs in VA tasks. Our discussions in this paper aim to guide future researchers working on LLM-assisted VA systems and help them navigate common obstacles when developing these systems.
comment: Accepted at EG UK Computer Graphics & Visual Computing 2024
☆ Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs
Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions, but recent studies reveal that these models often fail to achieve satisfactory results on questions involving reasoning, such as mathematics or physics questions. This phenomenon is usually attributed to the uncertainty regarding whether these models could genuinely comprehend the knowledge embedded in the text or merely learn to replicate the token distribution without a true understanding of the content. In this paper, we delve into this problem and aim to enhance the reasoning capabilities of LLMs. First, we investigate if the model has genuine reasoning capabilities by visualizing the text generation process at the attention and representation level. Then, we formulate the reasoning process of LLMs into a causal framework, which provides a formal explanation of the problems we observe in the visualization. Finally, building upon this causal framework, we propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities by encouraging the model to extract the general problem-solving skills and apply these skills to different questions. Experiments show that our method outperforms the baseline consistently across multiple benchmarks, and with only 1.2M tunable parameters, we achieve better or comparable results to other fine-tuning methods. This demonstrates the effectiveness and efficiency of our method in improving the overall accuracy and reliability of LLMs.
☆ RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models
Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. While this approach can produce a retriever with reasonable overall performance, models trained on domain-specific data often yield better results within their respective domains. While prior work in information retrieval has tackled this through multi-task training, the topic of combining multiple domain-specific expert retrievers remains unexplored, despite its popularity in language model generation. In this work, we introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts along with a routing mechanism to select the most appropriate expert for each query. It is lightweight and allows easy addition or removal of experts without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both MSMARCO-trained (+2.1 absolute nDCG@10) and multi-task trained (+3.2) models. This is achieved by employing our routing mechanism, which surpasses other routing techniques (+1.8 on average) commonly used in language modeling. Furthermore, the benefit generalizes well to other datasets, even in the absence of a specific expert on the dataset. To our knowledge, RouterRetriever is the first work to demonstrate the advantages of using multiple domain-specific expert embedding models with effective routing over a single, general-purpose embedding model in retrieval tasks.
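A routing mechanism of this general flavour can be sketched as choosing, per query, the expert whose pilot embeddings (e.g., representative domain query vectors) are most similar to the query embedding; the pilot construction, similarity rule, and toy vectors below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def route(query_emb, pilots):
    """Pick the expert whose pilot embeddings score highest for the query."""
    scores = {
        name: float(np.max(vecs @ query_emb))   # max inner product over pilots
        for name, vecs in pilots.items()
    }
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
dim = 16
# Hypothetical per-domain pilot embeddings (e.g. averaged training-query vectors).
pilots = {name: rng.normal(size=(4, dim)) for name in ["bio", "finance", "law"]}
pilots = {k: v / np.linalg.norm(v, axis=1, keepdims=True) for k, v in pilots.items()}

query_emb = pilots["law"][0] + 0.1 * rng.normal(size=dim)   # query near a 'law' pilot
query_emb = query_emb / np.linalg.norm(query_emb)
expert, scores = route(query_emb, pilots)
print("routed to:", expert)
```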
☆ Neural Networks with LSTM and GRU in Modeling Active Fires in the Amazon
This study presents a comprehensive methodology for modeling and forecasting the historical time series of fire spots detected by the AQUA_M-T satellite in the Amazon, Brazil. The approach utilizes a mixed Recurrent Neural Network (RNN) model, combining Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures to predict monthly accumulations of daily detected fire spots. A summary of the data revealed a consistent seasonality over time, with annual maximum and minimum fire spot values tending to repeat at the same periods each year. The primary objective is to verify whether the forecasts capture this inherent seasonality through rigorous statistical analysis. The methodology involved careful data preparation, model configuration, and training using cross-validation with two seeds, ensuring that the model generalizes well to the test and validation sets, and confirming the convergence of the model parameters. The results indicate that the mixed LSTM and GRU model offers improved accuracy in forecasting 12 months ahead, demonstrating its effectiveness in capturing complex temporal patterns and modeling the observed time series. This research significantly contributes to the application of deep learning techniques in environmental monitoring, specifically in fire spot forecasting. In addition to improving forecast accuracy, the proposed approach highlights the potential for adaptation to other time series forecasting challenges, opening new avenues for research and development in machine learning and natural phenomenon prediction. Keywords: Time Series Forecasting, Recurrent Neural Networks, Deep Learning.
comment: 16 pages, in Portuguese language, 24 figures
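A minimal sketch of a mixed LSTM/GRU forecaster is shown below; the layer sizes, window length, and univariate setup are assumptions for illustration and do not reproduce the paper's architecture or data pipeline:

```python
import torch
import torch.nn as nn

class FireSpotForecaster(nn.Module):
    """Stacked LSTM -> GRU over a window of monthly counts, predicting the next month."""

    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.gru = nn.GRU(input_size=hidden, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):               # x: (batch, window, 1) monthly fire counts
        h, _ = self.lstm(x)             # (batch, window, hidden)
        h, _ = self.gru(h)
        return self.head(h[:, -1])      # (batch, 1) next-month prediction

model = FireSpotForecaster()
window = torch.rand(8, 12, 1)           # 8 series, 12-month input window
target = torch.rand(8, 1)
loss = nn.functional.mse_loss(model(window), target)
loss.backward()                          # one standard supervised training step
print(float(loss))
```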
☆ Independence Constrained Disentangled Representation Learning from Epistemological Perspective
Disentangled Representation Learning aims to improve the explainability of deep learning methods by training a data encoder that identifies semantically meaningful latent variables in the data generation process. Nevertheless, there is no consensus regarding a universally accepted definition for the objective of disentangled representation learning. In particular, there is a considerable amount of discourse regarding whether the latent variables should be mutually independent. In this paper, we first investigate these arguments on the interrelationships between latent variables by establishing a conceptual bridge between Epistemology and Disentangled Representation Learning. Then, inspired by these interdisciplinary concepts, we introduce a two-level latent space framework to provide a general solution to the prior arguments on this issue. Finally, we propose a novel method for disentangled representation learning by employing an integration of mutual information constraint and independence constraint within the Generative Adversarial Network (GAN) framework. Experimental results demonstrate that our proposed method consistently outperforms baseline approaches in both quantitative and qualitative evaluations. The method exhibits strong performance across multiple commonly used metrics and demonstrates a great capability in disentangling various semantic factors, leading to an improved quality of controllable generation, which consequently benefits the explainability of the algorithm.
☆ Causality-Aware Transformer Networks for Robotic Navigation
Recent advances in machine learning algorithms have garnered growing interest in developing versatile Embodied AI systems. However, current research in this domain reveals opportunities for improvement. First, the direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling, potentially limiting its performance in Embodied AI tasks. Second, the reliance on task-specific configurations, such as pre-trained modules and dataset-specific logic, compromises the generalizability of these methods. We address these constraints by initially exploring the unique differences between Embodied AI tasks and other sequential data tasks through the lens of Causality, presenting a causal framework to elucidate the inadequacies of conventional sequential methods for Embodied AI. By leveraging this causal perspective, we propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module to enhance the model's Environmental Understanding capability. Meanwhile, our method is devoid of task-specific inductive biases and can be trained in an End-to-End manner, which enhances the method's generalizability across various contexts. Empirical evaluations demonstrate that our methodology consistently surpasses benchmark performances across a spectrum of settings, tasks and simulation environments. Extensive ablation studies reveal that the performance gains can be attributed to the Causal Understanding Module, which demonstrates effectiveness and efficiency in both Reinforcement Learning and Supervised Learning settings.
☆ PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation
While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose \textbf{PoseTalk}, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latent from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4\% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, i.e., CoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low-to-high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate our pose prediction strategy achieves better pose diversity and realness compared to text-only or audio-only, and our video generator model outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.
comment: 7+5 pages, 15 figures
☆ OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation
This paper presents the experiments and results for the CheckThat! Lab at CLEF 2024 Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE). The primary objective of this task was to generate adversarial examples in five problem domains in order to evaluate the robustness of widely used text classification methods (fine-tuned BERT, BiLSTM, and RoBERTa) when applied to credibility assessment issues. This study explores the application of ensemble learning to enhance adversarial attacks on natural language processing (NLP) models. We systematically tested and refined several adversarial attack methods, including BERT-Attack, Genetic algorithms, TextFooler, and CLARE, on five datasets across various misinformation tasks. By developing modified versions of BERT-Attack and hybrid methods, we achieved significant improvements in attack effectiveness. Our results demonstrate the potential of modifying and combining multiple methods to create more sophisticated and effective adversarial attack strategies, contributing to the development of more robust and secure systems.
comment: CLEF 2024 - Conference and Labs of the Evaluation Forum
☆ Evaluating Environments Using Exploratory Agents
Exploration is a key part of many video games. We investigate using an exploratory agent to provide feedback on the design of procedurally generated game levels, evaluating 5 engaging and 5 unengaging levels. We expand upon a framework introduced in previous research, which models motivations for exploration, and introduce a fitness function for evaluating an environment's potential for exploration. Our study showed that our exploratory agent can clearly distinguish between engaging and unengaging levels. The findings suggest that our agent has the potential to serve as an effective tool for assessing procedurally generated levels in terms of exploration. This work contributes to the growing field of AI-driven game design by offering new insights into how game environments can be evaluated and optimised for player exploration.
comment: 9 pages, 9 figures, 2 tables, work in progress
☆ AdvSecureNet: A Python Toolkit for Adversarial Machine Learning
Machine learning models are vulnerable to adversarial attacks. Several tools have been developed to research these vulnerabilities, but they often lack comprehensive features and flexibility. We introduce AdvSecureNet, a PyTorch-based toolkit for adversarial machine learning that is the first to natively support multi-GPU setups for attacks, defenses, and evaluation. It is the first toolkit that supports both CLI and API interfaces and external YAML configuration files to enhance versatility and reproducibility. The toolkit includes multiple attacks, defenses, and evaluation metrics. Rigorous software engineering practices are followed to ensure high code quality and maintainability. The project is available as an open-source project on GitHub at https://github.com/melihcatal/advsecurenet and installable via PyPI.
☆ SurgTrack: CAD-Free 3D Tracking of Real-world Surgical Instruments
Vision-based surgical navigation has received increasing attention due to its non-invasive, cost-effective, and flexible advantages. In particular, a critical element of the vision-based navigation system is tracking surgical instruments. Compared with 2D instrument tracking methods, 3D instrument tracking has broader value in clinical practice, but is also more challenging due to weak texture, occlusion, and lack of Computer-Aided Design (CAD) models for 3D registration. To solve these challenges, we propose SurgTrack, a two-stage 3D instrument tracking method for CAD-free and robust real-world applications. In the first registration stage, we incorporate an Instrument Signed Distance Field (SDF) modeling the 3D representation of instruments, achieving CAD-free 3D registration. Due to this, we can obtain the location and orientation of instruments in the 3D space by matching the video stream with the registered SDF model. In the second tracking stage, we devise a posture graph optimization module, leveraging the historical tracking results of the posture memory pool to optimize the tracking results and improve the occlusion robustness. Furthermore, we collect the Instrument3D dataset to comprehensively evaluate the 3D tracking of surgical instruments. Extensive experiments validate the superiority and scalability of our SurgTrack, outperforming the state of the art by a remarkable margin. The code and dataset are available at https://github.com/wenwucode/SurgTrack.
☆ AlignGroup: Learning and Aligning Group Consensus with Member Preferences for Group Recommendation CIKM 2024
Group activities are important behaviors in human society; providing personalized recommendations for groups is referred to as the group recommendation task. Existing methods can usually be categorized into two strategies to infer group preferences: 1) determining group preferences by aggregating members' personalized preferences, and 2) inferring group consensus by capturing group members' coherent decisions after common compromises. However, the former would suffer from the lack of group-level considerations, and the latter overlooks the fine-grained preferences of individual users. To this end, we propose a novel group recommendation method AlignGroup, which focuses on both group consensus and individual preferences of group members to infer the group decision-making. Specifically, AlignGroup explores group consensus through a well-designed hypergraph neural network that efficiently learns intra- and inter-group relationships. Moreover, AlignGroup innovatively utilizes a self-supervised alignment task to capture fine-grained group decision-making by aligning the group consensus with members' common preferences. Extensive experiments on two real-world datasets validate that our AlignGroup outperforms the state-of-the-art on both the group recommendation task and the user recommendation task, and also surpasses most baselines in efficiency.
comment: 10 pages, accepted by CIKM 2024
☆ Solving Video Inverse Problems Using Image Diffusion Models
Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling strategy that encourages consistency across batches by synchronizing the stochastic noise components in image diffusion models. Our approach synergistically combines batch-consistent sampling with simultaneous optimization of denoised spatio-temporal batches at each reverse diffusion step, resulting in a novel and efficient diffusion sampling strategy for video inverse problems. Experimental results demonstrate that our method effectively addresses various spatio-temporal degradations in video inverse problems, achieving state-of-the-art reconstructions. Project page: https://solving-video-inverse.github.io/main/
comment: 22 pages, 16 figures
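The batch-consistent sampling idea, treating video frames as the image model's batch dimension while sharing the stochastic noise across that dimension, can be sketched as follows; a toy denoiser stands in for a real image diffusion model, and all shapes are illustrative assumptions:

```python
import torch

def batch_consistent_noise(n_frames, channels, height, width):
    # Draw ONE noise sample and broadcast it over the frame/batch dimension,
    # so every frame sees the same stochastic component at this reverse step.
    shared = torch.randn(1, channels, height, width)
    return shared.expand(n_frames, -1, -1, -1)

def reverse_step(frames, sigma, denoiser):
    # frames: (T, C, H, W) -- the video's time axis used as the batch axis.
    noise = batch_consistent_noise(*frames.shape)
    denoised = denoiser(frames)                 # image model applied framewise
    return denoised + sigma * noise             # synchronized stochastic update

toy_denoiser = lambda x: 0.9 * x                # placeholder, not a trained model
video = torch.randn(16, 3, 64, 64)
out = reverse_step(video, sigma=0.1, denoiser=toy_denoiser)
print(out.shape)  # torch.Size([16, 3, 64, 64])
```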
☆ Advancing Cyber Incident Timeline Analysis Through Rule Based AI and Large Language Models
Timeline Analysis (TA) is a key part of Timeline Forensics (TF) in Digital Forensics (DF), focusing primarily on examining and analysing temporal digital artefacts such as timestamps, derived from event logs, file metadata, and other related data to correlate events resulting from cyber incidents and reconstruct their chronological timeline. Traditional tools often struggle to efficiently process the vast volume and variety of data acquired during DF investigations and Incident Response (IR) processes. This paper presents a novel framework, GenDFIR, that combines Rule-Based Artificial Intelligence (R-BAI) algorithms with Large Language Models (LLMs) to advance and automate the TA process. Our approach consists of two main stages: (1) we use R-BAI to identify and select anomalous digital artefacts based on predefined rules; (2) the selected artefacts are then converted into embeddings for processing by an LLM with the help of a Retrieval-Augmented Generation (RAG) agent. The LLM consequently leverages its capabilities to perform automated TA on the artefacts and predict potential incident scenarios. To validate our framework, we evaluate GenDFIR's performance, efficiency, and reliability using various metrics across synthetic cyber incident simulation scenarios. This paper presents a proof of concept, where the findings demonstrate the significant potential of integrating R-BAI and LLMs for TA. This novel approach highlights the power of Generative AI (GenAI), specifically LLMs, and opens new avenues for advanced threat detection and incident reconstruction, representing a significant step forward in the field.
comment: 25 pages
☆ More is More: Addition Bias in Large Language Models
In this paper, we investigate the presence of additive bias in Large Language Models (LLMs), drawing a parallel to the cognitive bias observed in humans where individuals tend to favor additive over subtractive changes. Using a series of controlled experiments, we tested various LLMs, including GPT-3.5 Turbo, Claude 3.5 Sonnet, Mistral, Math$\Sigma$tral, and Llama 3.1, on tasks designed to measure their propensity for additive versus subtractive modifications. Our findings demonstrate a significant preference for additive changes across all tested models. For example, in a palindrome creation task, Llama 3.1 favored adding letters 97.85% of the time over removing them. Similarly, in a Lego tower balancing task, GPT-3.5 Turbo chose to add a brick 76.38% of the time rather than remove one. In a text summarization task, Mistral 7B produced longer summaries in 59.40% to 75.10% of cases when asked to improve its own or others' writing. These results indicate that, similar to humans, LLMs exhibit a marked additive bias, which might have implications when LLMs are used on a large scale. Additive bias might increase resource use and environmental impact, leading to higher economic costs due to overconsumption and waste. This bias should be considered in the development and application of LLMs to ensure balanced and efficient problem-solving approaches.
comment: 25 pages, 8 figures
☆ Vision-Language Navigation with Continual Learning
Vision-language navigation (VLN) is a critical domain within embodied intelligence, requiring agents to navigate 3D environments based on natural language instructions. Traditional VLN research has focused on improving environmental understanding and decision accuracy. However, these approaches often exhibit a significant performance gap when agents are deployed in novel environments, mainly due to the limited diversity of training data. Expanding datasets to cover a broader range of environments is impractical and costly. We propose the Vision-Language Navigation with Continual Learning (VLNCL) paradigm to address this challenge. In this paradigm, agents incrementally learn new environments while retaining previously acquired knowledge. VLNCL enables agents to maintain an environmental memory and extract relevant knowledge, allowing rapid adaptation to new environments while preserving existing information. We introduce a novel dual-loop scenario replay method (Dual-SR) inspired by brain memory replay mechanisms integrated with VLN agents. This method facilitates consolidating past experiences and enhances generalization across new tasks. By utilizing a multi-scenario memory buffer, the agent efficiently organizes and replays task memories, thereby bolstering its ability to adapt quickly to new environments and mitigating catastrophic forgetting. Our work pioneers continual learning in VLN agents, introducing a novel experimental setup and evaluation metrics. We demonstrate the effectiveness of our approach through extensive evaluations and establish a benchmark for the VLNCL paradigm. Comparative experiments with existing continual learning and VLN methods show significant improvements, achieving state-of-the-art performance in continual learning ability and highlighting the potential of our approach in enabling rapid adaptation while preserving prior knowledge.
☆ Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation
Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
comment: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
☆ A Sequential Decision-Making Model for Perimeter Identification
Perimeter identification involves ascertaining the boundaries of a designated area or zone, requiring traffic flow monitoring, control, or optimization. Various methodologies and technologies exist for accurately defining these perimeters; however, they often necessitate specialized equipment, precise mapping, or comprehensive data for effective problem delineation. In this study, we propose a sequential decision-making framework for perimeter search, designed to operate efficiently in real-time and require only publicly accessible information. We conceptualize the perimeter search as a game between a playing agent and an artificial environment, where the agent's objective is to identify the optimal perimeter by sequentially improving the current perimeter. We detail the model for the game and discuss its adaptability in determining the definition of an optimal perimeter. Ultimately, we showcase the model's efficacy through a real-world scenario, highlighting the identification of corresponding optimal perimeters.
☆ Understanding eGFR Trajectories and Kidney Function Decline via Large Multimodal Models
The estimated Glomerular Filtration Rate (eGFR) is an essential indicator of kidney function in clinical practice. Although traditional equations and Machine Learning (ML) models using clinical and laboratory data can estimate eGFR, accurately predicting future eGFR levels remains a significant challenge for nephrologists and ML researchers. Recent advances demonstrate that Large Language Models (LLMs) and Large Multimodal Models (LMMs) can serve as robust foundation models for diverse applications. This study investigates the potential of LMMs to predict future eGFR levels with a dataset consisting of laboratory and clinical values from 50 patients. By integrating various prompting techniques and ensembles of LMMs, our findings suggest that these models, when combined with precise prompts and visual representations of eGFR trajectories, offer predictive performance comparable to existing ML models. This research extends the application of foundation models and suggests avenues for future studies to harness these models in addressing complex medical forecasting challenges.
comment: This preprint version includes corrections of typographical errors related to numerical values in Table 2, which were present in the version published at the BDH workshop in MIPR 2024. These corrections do not affect the overall conclusions of the study
☆ Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments
Vision Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI, requiring agents to navigate freely in unbounded 3D spaces solely guided by natural language instructions. This task introduces distinct challenges in multimodal comprehension, spatial reasoning, and decision-making. To address these challenges, we introduce Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks. Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes. Firstly, it constructs a cognitive map, integrating temporal, spatial, and semantic elements, thereby facilitating the development of spatial memory within LLMs. Secondly, Cog-GA employs a predictive mechanism for waypoints, strategically optimizing the exploration trajectory to maximize navigational efficiency. Each waypoint is accompanied by a dual-channel scene description, categorizing environmental cues into 'what' and 'where' streams, analogous to the brain. This segregation enhances the agent's attentional focus, enabling it to discern pertinent spatial information for navigation. A reflective mechanism complements these strategies by capturing feedback from prior navigation experiences, facilitating continual learning and adaptive replanning. Extensive evaluations conducted on VLN-CE benchmarks validate Cog-GA's state-of-the-art performance and ability to simulate human-like navigation behaviors. This research significantly contributes to the development of strategic and interpretable VLN-CE agents.
☆ Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal
Artificial neural networks, especially recent diffusion-based models, have shown remarkable superiority in gaming, control, and QA systems, where the training tasks' datasets are usually static. However, in real-world applications, such as robotic control of reinforcement learning (RL), the tasks are changing, and new tasks arise in a sequential order. This situation poses the new challenge of plasticity-stability trade-off for training an agent who can adapt to task changes and retain acquired knowledge. In view of this, we propose a rehearsal-based continual diffusion model, called Continual Diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability). Specifically, we first construct an offline benchmark that contains 90 tasks from multiple domains. Then, we train the CoD on each task with sequential modeling and conditional generation for making decisions. Next, we preserve a small portion of previous datasets as the rehearsal buffer and replay it to retain the acquired knowledge. Extensive experiments on a series of tasks show CoD can achieve a promising plasticity-stability trade-off and outperform existing diffusion-based methods and other representative baselines on most tasks.
☆ CoAst: Validation-Free Contribution Assessment for Federated Learning based on Cross-Round Valuation
In the federated learning (FL) process, since the data held by each participant is different, it is necessary to determine which participants contribute more to the model's performance. Effective contribution assessment can help motivate data owners to participate in the FL training. Research in this field can be divided into two directions based on whether a validation dataset is required. Validation-based methods need to use representative validation data to measure the model accuracy, which is difficult to obtain in practical FL scenarios. Existing validation-free methods assess the contribution based on the parameters and gradients of local models and the global model in a single training round, which is easily compromised by the stochasticity of model training. In this work, we propose CoAst, a practical method to assess the FL participants' contribution without access to any validation data. The core idea of CoAst involves two aspects: one is to count only the most important part of the model parameters through weight quantization, and the other is a cross-round valuation based on the similarity between the current local parameters and the global parameter updates in several subsequent communication rounds. Extensive experiments show that CoAst has comparable assessment reliability to existing validation-based methods and outperforms existing validation-free methods.
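The cross-round valuation can be pictured as scoring how well a participant's (quantized) local update aligns with the global updates of the next few rounds; the sign quantization, window length, and cosine aggregation below are illustrative assumptions rather than the paper's exact formulas:

```python
import numpy as np

def sign_quantize(update, top_frac=0.2):
    # Keep only the largest-magnitude coordinates (the 'important' part),
    # then reduce them to their signs.
    k = max(1, int(top_frac * update.size))
    keep = np.argsort(np.abs(update))[-k:]
    q = np.zeros_like(update)
    q[keep] = np.sign(update[keep])
    return q

def cross_round_score(local_update, future_global_updates):
    # Average cosine similarity between this round's local update and the
    # global updates observed over the next few communication rounds.
    q = sign_quantize(local_update)
    sims = []
    for g in future_global_updates:
        gq = sign_quantize(g)
        sims.append(float(q @ gq) / (np.linalg.norm(q) * np.linalg.norm(gq) + 1e-12))
    return float(np.mean(sims))

rng = np.random.default_rng(0)
global_updates = [rng.normal(size=100) for _ in range(3)]
aligned = global_updates[0] + 0.3 * rng.normal(size=100)   # participant pushing with the global trend
noisy = rng.normal(size=100)                               # participant pushing in a random direction
print(cross_round_score(aligned, global_updates), cross_round_score(noisy, global_updates))
```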
☆ NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention
In the study of auditory attention, it has been revealed that there exists a robust correlation between attended speech and elicited neural responses, measurable through electroencephalography (EEG). Therefore, it is possible to use the attention information available within EEG signals to guide the extraction of the target speaker in a cocktail party computationally. In this paper, we present a neuro-guided speaker extraction model, i.e. NeuroSpex, using the EEG response of the listener as the sole auxiliary reference cue to extract attended speech from monaural speech mixtures. We propose a novel EEG signal encoder that captures the attention information. Additionally, we propose a cross-attention (CA) mechanism to enhance the speech feature representations, generating a speaker extraction mask. Experimental results on a publicly available dataset demonstrate that our proposed model outperforms two baseline models across various evaluation metrics.
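The cross-attention step can be sketched with a standard multi-head attention layer in which one modality provides the queries and the other the keys and values; which modality queries which, as well as the feature sizes and sequence lengths below, are assumptions for illustration only:

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

speech_feats = torch.randn(2, 200, d_model)   # (batch, speech frames, features)
eeg_feats = torch.randn(2, 50, d_model)       # (batch, EEG frames, features)

# Speech features attend to the EEG reference: queries from speech,
# keys/values from the neural signal, producing attention-enhanced features
# that could later be turned into a speaker-extraction mask.
enhanced, attn_weights = cross_attn(query=speech_feats, key=eeg_feats, value=eeg_feats)
print(enhanced.shape, attn_weights.shape)     # (2, 200, 64) (2, 200, 50)
```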
☆ Boosting Generalizability towards Zero-Shot Cross-Dataset Single-Image Indoor Depth by Meta-Initialization IROS 2024
Indoor robots rely on depth to perform tasks like navigation or obstacle detection, and single-image depth estimation is widely used to assist perception. Most indoor single-image depth prediction work focuses less on model generalizability to unseen datasets, which is a concern for in-the-wild robustness during system deployment. This work leverages gradient-based meta-learning to gain higher generalizability on zero-shot cross-dataset inference. Unlike the most-studied meta-learning of image classification associated with explicit class labels, no explicit task boundaries exist for continuous depth values tied to highly varying indoor environments regarding object arrangement and scene composition. We propose a fine-grained task formulation that treats each RGB-D mini-batch as a task in our meta-learning setup. We first show that on limited data our method induces a much better prior (up to 27.8% improvement in RMSE). Then, fine-tuning on the meta-learned initialization consistently outperforms baselines without the meta approach. Aiming at generalization, we propose zero-shot cross-dataset protocols and validate the higher generalizability induced by our meta-initialization, which serves as a simple and useful plugin for many existing depth estimation methods. This work at the intersection of depth estimation and meta-learning potentially brings both lines of research a step closer to practical robotic and machine perception usage.
comment: IROS 2024. The version supersedes 2305.07269. arXiv admin note: text overlap with arXiv:2305.07269
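A hedged sketch of the fine-grained task idea: each RGB-D mini-batch is treated as its own meta-learning task in a first-order, MAML-style inner/outer loop. The depth network, L1 loss, learning rates, and first-order approximation are placeholder assumptions, not the paper's exact protocol.

import copy
import torch

def fine_grained_meta_step(model, batches, inner_lr=1e-4, outer_lr=1e-5,
                           loss_fn=torch.nn.functional.l1_loss):
    """One outer update where every RGB-D mini-batch is treated as its own task
    (a first-order, MAML-style reading of the fine-grained task definition)."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for rgb, depth_gt in batches:
        task_model = copy.deepcopy(model)                   # task-specific copy
        inner_opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
        loss = loss_fn(task_model(rgb), depth_gt)            # inner adaptation
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
        outer_loss = loss_fn(task_model(rgb), depth_gt)      # evaluate adapted copy
        grads = torch.autograd.grad(outer_loss, task_model.parameters())
        meta_grads = [m + g for m, g in zip(meta_grads, grads)]
    with torch.no_grad():                                    # first-order meta update
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g / max(1, len(batches))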
☆ Adversarial Attacks on Machine Learning-Aided Visualizations
Research in ML4VIS investigates how to use machine learning (ML) techniques to generate visualizations, and the field is rapidly growing with high societal impact. However, as with any computational pipeline that employs ML processes, ML4VIS approaches are susceptible to a range of ML-specific adversarial attacks. These attacks can manipulate visualization generation, misleading analysts and impairing their judgments. Due to a lack of synthesis from both the visualization and ML perspectives, this security aspect is largely overlooked in the current ML4VIS literature. To bridge this gap, we investigate the potential vulnerabilities of ML-aided visualizations to adversarial attacks through a holistic lens covering both visualization and ML perspectives. We first identify the attack surface (i.e., attack entry points) that is unique to ML-aided visualizations. We then exemplify five different adversarial attacks. These examples highlight the range of possible attacks when considering the attack surface and multiple different adversary capabilities. Our results show that adversaries can induce various attacks, such as creating arbitrary and deceptive visualizations, by systematically identifying input attributes that are influential in ML inferences. Based on our observations of the attack surface characteristics and the attack examples, we underline the importance of comprehensive studies of security issues and defense mechanisms as an urgent call to the ML4VIS community.
comment: This is the author's version of the article that has been accepted by the Journal of Visualization
☆ TASAR: Transferable Attack on Skeletal Action Recognition
Skeletal sequences, as well-structured representations of human behaviors, are crucial in Human Activity Recognition (HAR). The transferability of adversarial skeletal sequences enables attacks in real-world HAR scenarios, such as autonomous driving, intelligent surveillance, and human-computer interaction. However, existing Skeleton-based HAR (S-HAR) attacks exhibit weak adversarial transferability and, therefore, cannot be considered true transfer-based S-HAR attacks. More importantly, the reason for this failure remains unclear. In this paper, we study this phenomenon through the lens of the loss surface, and find that its sharpness contributes to the poor transferability in S-HAR. Inspired by this observation, we assume and empirically validate that smoothing the rugged loss landscape could potentially improve adversarial transferability in S-HAR. To this end, we propose the first Transfer-based Attack on Skeletal Action Recognition, TASAR. TASAR explores the smoothed model posterior without re-training the pre-trained surrogates, which is achieved by a new post-training Dual Bayesian optimization strategy. Furthermore, unlike previous transfer-based attacks that treat each frame independently and overlook temporal coherence within sequences, TASAR incorporates motion dynamics into the Bayesian attack gradient, effectively disrupting the spatial-temporal coherence of S-HAR models. To exhaustively evaluate the effectiveness of existing methods and our method, we build the first large-scale robust S-HAR benchmark, comprising 7 S-HAR models, 10 attack methods, 3 S-HAR datasets and 2 defense models. Extensive results demonstrate the superiority of TASAR. Our benchmark enables easy comparisons for future studies, with the code available in the supplementary material.
comment: arXiv admin note: text overlap with arXiv:2407.08572
☆ Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP
Articulatory trajectories like electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Therefore, integrating low-dimensional EMA features with DDSP can significantly enhance the computational efficiency of speech synthesis. In this paper, we propose a fast, high-quality, and parameter-efficient DDSP articulatory vocoder that can synthesize speech from EMA, F0, and loudness. We incorporate several techniques to solve the harmonics / noise imbalance problem, and add a multi-resolution adversarial loss for better synthesis quality. Our model achieves a transcription word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, with an improvement of 1.63% and 0.16 compared to the state-of-the-art (SOTA) baseline. Our DDSP vocoder is 4.9x faster than the baseline on CPU during inference, and can generate speech of comparable quality with only 0.4M parameters, in contrast to the 9M parameters required by the SOTA.
comment: accepted for Spoken Language Technology Workshop 2024
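The abstract mentions a multi-resolution adversarial loss for synthesis quality; below is a hedged PyTorch sketch of the closely related multi-resolution STFT loss commonly paired with DDSP-style vocoders. The FFT sizes, hop lengths, and term weighting are illustrative assumptions, and the adversarial discriminator itself is not reproduced.

import torch

def multi_resolution_stft_loss(pred, target, fft_sizes=(512, 1024, 2048)):
    """Sum of spectral-convergence and log-magnitude L1 terms over several
    STFT resolutions (a common companion loss; settings here are illustrative).
    pred, target: waveform tensors of shape (batch, samples)."""
    loss = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        window = torch.hann_window(n_fft, device=pred.device)
        s_pred = torch.stft(pred, n_fft, hop, window=window, return_complex=True).abs()
        s_tgt = torch.stft(target, n_fft, hop, window=window, return_complex=True).abs()
        sc = torch.norm(s_tgt - s_pred) / (torch.norm(s_tgt) + 1e-8)       # spectral convergence
        mag = torch.nn.functional.l1_loss(torch.log(s_pred + 1e-7),
                                          torch.log(s_tgt + 1e-7))          # log-magnitude L1
        loss = loss + sc + mag
    return loss / len(fft_sizes)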
☆ What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations EMNLP 2024
This paper explores the pitfalls in evaluating multilingual automatic speech recognition (ASR) models, with a particular focus on Indic language scripts. We investigate the text normalization routines employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer, and their unintended consequences on performance metrics. Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially inflated performance metrics for Indic languages. We conclude by proposing a shift towards developing normalization routines that leverage native linguistic expertise, ensuring more robust and accurate evaluations of multilingual ASR models.
comment: Submitted to EMNLP 2024
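A tiny self-contained illustration of the failure mode described above: an aggressive normalizer that strips dependent marks can erase a genuine Devanagari recognition error, inflating the measured accuracy. The word pair and the normalizer are hypothetical stand-ins, not the evaluated models' actual routines.

import unicodedata

def naive_normalize(text):
    """Aggressive normalizer of the kind criticized above: decompose the text
    and drop all combining/dependent marks, which in Devanagari removes vowel
    signs and therefore changes the words themselves."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))

reference, hypothesis = "किताब", "कतब"   # hypothetical reference / ASR output pair
print(reference == hypothesis)             # False: a genuine recognition error
print(naive_normalize(reference) == naive_normalize(hypothesis))  # True: the error disappears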
☆ Detecting Korean Food Using Image using Hierarchical Model
We present a solution that helps Korean food lovers with dietary restrictions identify a Korean dish before consuming it. By uploading a clear photo of the dish, users can learn what they are eating. The solution combines image processing techniques with machine learning.
Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning
Leveraging large language models (LLMs) for designing reward functions demonstrates significant potential. However, achieving effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we enable LLMs to be effective white-box searchers, highlighting their advanced semantic understanding capabilities. Specifically, we generate reward components for each explicit user requirement and employ the reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively search and optimize these weights based on the context provided by the training log analyzer, while adaptively determining the search step size. We applied the framework to an underwater information collection RL task without direct human feedback or reward examples (zero-shot). The reward critic successfully corrects the reward code with only one round of feedback for each requirement, effectively preventing irreparable errors that can occur when reward function feedback is provided in aggregate. The effective initialization of weights enables the acquisition of different reward functions within the Pareto solution set without weight search. Even in the case where a weight is 100 times off, fewer than four iterations are needed to obtain solutions that meet user requirements. The framework also works well with most prompts utilizing GPT-3.5 Turbo, since it does not require advanced numerical understanding or calculation.
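A minimal sketch of the structure implied above: one reward component per explicit requirement, combined with searched weights, plus a toy rebalancing step of the kind a weight search driven by training-log statistics might perform. The function names, rebalancing rule, and statistics used are illustrative assumptions, not the paper's prompts or algorithm.

def combined_reward(state, components, weights):
    """Weighted sum of per-requirement reward components (one component per
    explicit user requirement, as in the framework described above)."""
    return sum(w * comp(state) for comp, w in zip(components, weights))

def rescale_weights(weights, component_means, target_share=None):
    """Toy stand-in for the weight search: rebalance weights so each component
    contributes a comparable share of the total reward magnitude, based on
    statistics a training-log analyzer might report."""
    n = len(weights)
    target_share = target_share or [1.0 / n] * n
    total = sum(abs(w * m) for w, m in zip(weights, component_means)) or 1.0
    return [
        w * (share * total) / (abs(w * m) + 1e-8)
        for w, m, share in zip(weights, component_means, target_share)
    ]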
☆ Accelerating Large Language Model Training with Hybrid GPU-based Compression
Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and re-distribute gradients, activations, and other important model information, which pose significant overhead. Co-designed with GPU-based compression libraries, MPI libraries have been proven to reduce message size significantly, and leverage interconnect bandwidth, thus increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives under the context of distributed LLM training using 3D parallelism and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naïve compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. Nonetheless, such a strategy ignores the sparsity discrepancy among messages communicated in each parallelism degree, thus introducing more errors and causing degradation in training loss. Therefore, we incorporated hybrid compression settings toward each parallel dimension and adjusted the compression intensity accordingly. Given their low-rank structure (arXiv:2301.02654), we apply aggressive compression on gradients when performing DP All-reduce. We adopt milder compression to preserve precision while communicating activations, optimizer states, and model parameters in TP and PP. Using the adjusted hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while reaching baseline loss convergence.
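A hedged sketch of what a per-dimension compression policy like the one described might look like as configuration: aggressive compression for DP gradient all-reduce, milder settings elsewhere. The keys, intensity labels, and lookup helper are illustrative assumptions and do not correspond to the paper's settings or to any particular MPI or compression library's API.

# Hypothetical per-parallelism-dimension compression settings. The idea is that
# DP gradients tolerate aggressive compression (given their low-rank structure),
# while TP/PP activations, optimizer states, and parameters get milder treatment.
HYBRID_COMPRESSION = {
    "dp_allreduce_gradients": {"enabled": True, "intensity": "aggressive"},
    "tp_activations":         {"enabled": True, "intensity": "mild"},
    "pp_activations":         {"enabled": True, "intensity": "mild"},
    "optimizer_states":       {"enabled": True, "intensity": "mild"},
}

def pick_compression(message_kind):
    """Look up the compression setting to apply before issuing a collective."""
    return HYBRID_COMPRESSION.get(message_kind, {"enabled": False})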
☆ Abstractive Text Summarization: State of the Art, Challenges, and Improvements
Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability, and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, limitations, and future improvements, providing researchers with an extensive overview to advance abstractive summarization research. We provide vital comparison tables across the categorized techniques, offering insights into model complexity, scalability, and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.
comment: 9 Tables, 7 Figures
☆ Learning Privacy-Preserving Student Networks via Discriminative-Generative Distillation
While deep models have proved successful in learning rich knowledge from massive well-annotated data, they may pose a privacy leakage risk in practical deployment. It is necessary to find an effective trade-off between high utility and strong privacy. In this work, we propose a discriminative-generative distillation approach to learn privacy-preserving deep models. Our key idea is to take models as a bridge to distill knowledge from private data and then transfer it to learn a student network via two streams. First, the discriminative stream trains a baseline classifier on private data and an ensemble of teachers on multiple disjoint private subsets, respectively. Then, the generative stream takes the classifier as a fixed discriminator and trains a generator in a data-free manner. After that, the generator is used to generate massive synthetic data, which are further applied to train a variational autoencoder (VAE). Among these synthetic data, a few are fed into the teacher ensemble to query labels via differentially private aggregation, while most are embedded into the trained VAE for reconstructing synthetic data. Finally, semi-supervised student learning is performed to simultaneously handle two tasks: knowledge transfer from the teachers with distillation on the few privately labeled synthetic data, and knowledge enhancement with tangent-normal adversarial regularization on many triples of reconstructed synthetic data. In this way, our approach can control query cost over private data and mitigate accuracy degradation in a unified manner, leading to a privacy-preserving student model. Extensive experiments and analysis clearly show the effectiveness of the proposed approach.
comment: This paper is accepted by IEEE Transactions on Image Processing (TIP)
☆ Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Translation
This paper derives 'scaling laws' -- empirical relationships between the amount of training compute used for a Large Language Model (LLM) and its performance -- for economic outcomes. In a preregistered experiment, 300 professional translators completed 1800 tasks with access to one of thirteen LLMs with differing model training compute sizes (or a control). Our results show that model scaling substantially raises productivity: for every 10x increase in model compute, translators completed tasks 12.3% quicker, received 0.18 s.d. higher grades, and earned 16.1% more per minute (including bonus payments). Further, the gains from model scaling are much higher for lower-skilled workers who gain a 4x larger improvement in task completion speed. These results imply further frontier model scaling -- which is currently estimated at 4x increase per year -- may have significant economic implications.
☆ Neural Dynamics Model of Visual Decision-Making: Learning from Human Experts
Uncovering the fundamental neural correlates of biological intelligence, developing mathematical models, and conducting computational simulations are critical for advancing new paradigms in artificial intelligence (AI). In this study, we implemented a comprehensive visual decision-making model that spans from visual input to behavioral output, using a neural dynamics modeling approach. Drawing inspiration from the key components of the dorsal visual pathway in primates, our model not only aligns closely with human behavior but also reflects neural activities in primates, achieving accuracy comparable to convolutional neural networks (CNNs). Moreover, magnetic resonance imaging (MRI) identified key neuroimaging features, such as structural connections and functional connectivity, that are associated with performance in perceptual decision-making tasks. A neuroimaging-informed fine-tuning approach was introduced and applied to the model, leading to performance improvements that paralleled the behavioral variations observed among subjects. Compared to classical deep learning models, our model more accurately replicates the behavioral performance of biological intelligence, relying on the structural characteristics of biological neural networks rather than extensive training data, and demonstrating enhanced resilience to perturbation.
☆ Multi-modal Situated Reasoning in 3D Scenes
Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding are limited in data modality, diversity, scale, and task scope. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide text, image, and point cloud for situation and question description, resolving ambiguity in previous single-modality convention (e.g., text). Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' situated reasoning for navigation. Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate the efficacy of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models.
comment: Project page: https://msr3d.github.io/
Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science, examining similarities and differences between LLMs and human cognitive processes. We analyze methods for evaluating LLMs' cognitive abilities and discuss their potential as cognitive models. The review covers applications of LLMs in various cognitive fields, highlighting insights gained for cognitive science research. We assess the cognitive biases and limitations of LLMs, along with proposed methods for improving their performance. The integration of LLMs with cognitive architectures is examined, revealing promising avenues for enhancing artificial intelligence (AI) capabilities. Key challenges and future research directions are identified, emphasizing the need for continued refinement of LLMs to better align with human cognition. This review provides a balanced perspective on the current state and future potential of LLMs in advancing our understanding of both artificial and human intelligence.
comment: 10 pages, 1 figure
☆ Coral Model Generation from Single Images for Virtual Reality Applications
With the rapid development of VR technology, the demand for high-quality 3D models is increasing. Traditional methods struggle with efficiency and quality in large-scale customization. This paper introduces a deep-learning framework that generates high-precision 3D coral models from a single image. Using the Coral dataset, the framework extracts geometric and texture features, performs 3D reconstruction, and optimizes design and material blending. Advanced optimization and polygon count control ensure shape accuracy, detail retention, and flexible output for various complexities, catering to high-quality rendering and real-time interaction needs. The project incorporates Explainable AI (XAI) to transform AI-generated models into interactive "artworks," best viewed in VR and XR. This enhances model interpretability and human-machine collaboration. Real-time feedback in VR interactions displays information such as coral species and habitat, enriching the user experience. The generated models surpass traditional methods in detail, visual quality, and efficiency. This research offers an intelligent approach to 3D content creation for VR, lowering production barriers and promoting widespread VR applications. Additionally, integrating XAI provides new insights into AI-generated visual content and advances research in 3D vision interpretability.
comment: In Proceedings of Explainable AI for the Arts Workshop 2024 (XAIxArts 2024) arXiv:2406.14485
☆ Do Large Language Models Possess Sensitive to Sentiment?
Large Language Models (LLMs) have recently displayed extraordinary capabilities in language understanding. However, how to comprehensively assess the sentiment capabilities of LLMs continues to be a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in the text modality. As the integration of LLMs into diverse applications is on the rise, it becomes highly critical to comprehend their sensitivity to emotional tone, as it can influence the user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate the performance of several prominent LLMs in identifying and responding appropriately to sentiments like positive, negative, and neutral emotions. The models' outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our findings indicate that although LLMs show a basic sensitivity to sentiment, there are substantial variations in their accuracy and consistency, emphasizing the need for further enhancements in their training processes to better capture subtle emotional cues. For example, in some cases the models might wrongly classify a strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need to be refined. Another aspect is that different LLMs might perform differently on the same set of data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that contribute to the performance differences and how they can be optimized.
comment: 10 pages, 2 figures
☆ NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
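A hedged PyTorch sketch of the non-parametric idea described above: instead of fine-tuning the model, optimize a bounded additive change to the stored data-record embeddings so that each training query retrieves its positive record. The loss, optimizer, step count, and clamp radius are placeholder assumptions; NUDGE's constrained variants are solved differently in the paper.

import torch

def nudge_embeddings(data_emb, query_emb, positives, steps=50, lr=0.05, max_delta=0.1):
    """Non-parametric fine-tuning sketch: learn a bounded additive change to the
    data-record embeddings so that each training query's positive record scores
    higher under inner-product retrieval.
    data_emb: (n_records, d), query_emb: (n_queries, d),
    positives: LongTensor of positive record indices per query."""
    delta = torch.zeros_like(data_emb, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        scores = query_emb @ (data_emb + delta).T            # (n_queries, n_records)
        loss = torch.nn.functional.cross_entropy(scores, positives)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                                 # keep the change modest
            delta.clamp_(-max_delta, max_delta)
    return (data_emb + delta).detach()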
☆ Backdoor defense, learnability and obfuscation
We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the "trigger", while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker's strategy must work for a randomly-chosen trigger. Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of Hanneke et al. (2022) to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept in between efficient learnability and obfuscation.
comment: 29 pages
☆ MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation ECCV 2024
Skin cancer segmentation poses a significant challenge in medical image analysis. Numerous existing solutions, predominantly CNN-based, face issues related to a lack of global contextual understanding. Alternatively, some approaches resort to large-scale Transformer models to bridge the global contextual gaps, but at the expense of model size and computational complexity. Finally, many Transformer-based approaches rely primarily on CNN-based decoders, overlooking the benefits of Transformer-based decoding models. Recognizing these limitations, we address the need for efficient, lightweight solutions by introducing MobileUNETR, which aims to overcome the performance constraints associated with both CNNs and Transformers while minimizing model size, presenting a promising stride towards efficient image segmentation. MobileUNETR has 3 main features. 1) MobileUNETR comprises a lightweight hybrid CNN-Transformer encoder to help balance local and global contextual feature extraction in an efficient manner; 2) a novel hybrid decoder that simultaneously utilizes low-level and global features at different resolutions within the decoding stage for accurate mask generation; 3) surpassing large and complex architectures, MobileUNETR achieves superior performance with 3 million parameters and a computational complexity of 1.3 GFLOPs, resulting in 10x and 23x reductions in parameters and FLOPs, respectively. Extensive experiments have been conducted to validate the effectiveness of our proposed method on four publicly available skin lesion segmentation datasets, including the ISIC 2016, ISIC 2017, ISIC 2018, and PH2 datasets. The code will be publicly available at: https://github.com/OSUPCVLab/MobileUNETR.git
comment: Accepted at ECCV 2024 - BioImage Computing Workshop (Oral)
☆ Better Verified Explanations with Applications to Incorrectness and Out-of-Distribution Detection
Building on VeriX (Verified eXplainability, arXiv:2212.01051), a system for producing optimal verified explanations for machine learning model outputs, we present VeriX+, which significantly improves both the size and the generation time of verified explanations. We introduce a bound propagation-based sensitivity technique to improve the size, and a binary search-based traversal with confidence ranking for improving time -- the two techniques are orthogonal and can be used independently or together. We also show how to adapt the QuickXplain (Junker 2004) algorithm to our setting to provide a trade-off between size and time. Experimental evaluations on standard benchmarks demonstrate significant improvements on both metrics, e.g., a size reduction of 38% on the GTSRB dataset and a time reduction of 90% on MNIST. We also explore applications of our verified explanations and show that explanation size is a useful proxy for both incorrectness detection and out-of-distribution detection.
☆ Can Your Generative Model Detect Out-of-Distribution Covariate Shift? ECCV 2024
Detecting Out-of-Distribution (OOD) sensory data and covariate distribution shift aims to identify new test examples with high-level image statistics that differ from the captured, normal, In-Distribution (ID) set. Existing OOD detection literature largely focuses on semantic shift, with little-to-no consensus over covariate shift. Generative models capture the ID data in an unsupervised manner, enabling them to effectively identify samples that deviate significantly from this learned distribution, irrespective of the downstream task. In this work, we elucidate the ability of generative models to detect and quantify domain-specific covariate shift through extensive analyses that involve a variety of models. To this end, we conjecture that it is sufficient to detect the most commonly occurring sensory faults (anomalies and deviations in global signal statistics) by solely modeling high-frequency signal-dependent and independent details. We propose a novel method, CovariateFlow, for OOD detection, specifically tailored to covariate heteroscedastic high-frequency image components using conditional Normalizing Flows (cNFs). Our results on CIFAR10 vs. CIFAR10-C and ImageNet200 vs. ImageNet200-C demonstrate the effectiveness of the method by accurately detecting OOD covariate shift. This work contributes to enhancing the fidelity of imaging systems and aiding machine learning models in OOD detection in the presence of covariate shift.
comment: ECCV 2024
Large Language Model-Based Agents for Software Engineering: A Survey
The recent advance in Large Language Models (LLMs) has shaped a new paradigm of AI agents, i.e., LLM-based agents. Compared to standalone LLMs, LLM-based agents substantially extend the versatility and expertise of LLMs by enhancing LLMs with the capabilities of perceiving and utilizing external resources and tools. To date, LLM-based agents have been applied and shown remarkable effectiveness in Software Engineering (SE). The synergy between multiple agents and human interaction brings further promise in tackling complex real-world SE problems. In this work, we present a comprehensive and systematic survey on LLM-based agents for SE. We collect 106 papers and categorize them from two perspectives, i.e., the SE and agent perspectives. In addition, we discuss open challenges and future directions in this critical domain. The repository of this survey is at https://github.com/FudanSELab/Agent4SE-Paper-List.
☆ Hallucination Detection in LLMs: Fast and Memory-Efficient Finetuned Models
Uncertainty estimation is a necessary component when implementing AI in high-risk settings, such as autonomous cars, medicine, or insurance. Large Language Models (LLMs) have seen a surge in popularity in recent years, but they are subject to hallucinations, which may cause serious harm in high-risk settings. Despite their success, LLMs are expensive to train and run: they require large amounts of computation and memory, preventing the use of ensembling methods in practice. In this work, we present a novel method that allows for fast and memory-friendly training of LLM ensembles. We show that the resulting ensembles can detect hallucinations and are a viable approach in practice, as only one GPU is needed for training and inference.
comment: 5 pages, 3 figures
♻ ☆ Decentralized Health Intelligence Network (DHIN)
Decentralized Health Intelligence Network (DHIN) extends the Decentralized Intelligence Network (DIN) framework to address challenges in healthcare data sovereignty and AI utilization. Building upon DIN's core principles, DHIN introduces healthcare-specific components to tackle data fragmentation across providers and institutions, establishing a sovereign architecture for healthcare provision. It facilitates effective AI utilization by overcoming barriers to accessing diverse health data sources. This comprehensive framework leverages: 1) self-sovereign identity architecture coupled with a personal health record (PHR), extending DIN's personal data stores concept to ensure health data sovereignty; 2) a scalable federated learning (FL) protocol implemented on a public blockchain for decentralized AI training in healthcare, tailored for medical data; and 3) a scalable, trustless rewards mechanism adapted from DIN to incentivize participation in healthcare AI development. DHIN operates on a public blockchain with an immutable record, ensuring that no entity can control access to health data or determine financial benefits. It supports effective AI training while allowing patients to maintain control over their health data, benefit financially, and contribute to a decentralized ecosystem. Unique to DHIN, patients receive rewards in digital wallets as an incentive to opt into the FL protocol, with a long-term roadmap to fund decentralized insurance solutions. This approach introduces a novel, self-financed healthcare model that adapts to individual needs, complements existing systems, and redefines universal coverage, showcasing how DIN principles can transform healthcare data management and AI utilization while empowering patients.
comment: 13 pages, 6 figures
♻ ☆ Enhancing Graph Neural Networks with Limited Labeled Data by Actively Distilling Knowledge from Large Language Models
Graphs are pervasive in the real world, underpinning applications such as social network analysis, bioinformatics, and knowledge graphs. Graph neural networks (GNNs) are highly effective at node classification, a fundamental task on graphs. Unfortunately, conventional GNNs still face challenges in scenarios with few labeled nodes, despite the prevalence of few-shot node classification tasks in real-world applications. To address this challenge, various approaches have been proposed, including graph meta-learning, transfer learning, and methods based on Large Language Models (LLMs). However, traditional meta-learning and transfer learning methods often require prior knowledge from base classes or fail to exploit the potential advantages of unlabeled nodes. Meanwhile, LLM-based methods may overlook the zero-shot capabilities of LLMs and rely heavily on the quality of generated contexts. In this paper, we propose a novel approach that integrates LLMs and GNNs, leveraging the zero-shot inference and reasoning capabilities of LLMs and employing a Graph-LLM-based active learning paradigm to enhance GNNs' performance. Extensive experiments demonstrate the effectiveness of our model in improving node classification accuracy with considerably limited labeled data, surpassing state-of-the-art baselines by significant margins.
comment: 10 pages, 3 Figures
♻ ☆ The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem
Large language models (LLMs) are useful tools with the capacity for performing specific types of knowledge work at an effective scale. However, LLM deployments in high-risk and safety-critical domains pose unique challenges, notably the issue of "hallucination," where LLMs can generate fabricated information. This is particularly concerning in settings such as drug safety, where inaccuracies could lead to patient harm. To mitigate these risks, we have developed and demonstrated a proof of concept suite of guardrails specifically designed to mitigate certain types of hallucinations and errors for drug safety, and potentially applicable to other medical safety-critical contexts. These guardrails include mechanisms to detect anomalous documents to prevent the ingestion of inappropriate data, identify incorrect drug names or adverse event terms, and convey uncertainty in generated content. We integrated these guardrails with an LLM fine-tuned for a text-to-text task, which involves converting both structured and unstructured data within adverse event reports into natural language. This method was applied to translate individual case safety reports, demonstrating effective application in a pharmacovigilance processing task. Our guardrail framework offers a set of tools with broad applicability across various domains, ensuring LLMs can be safely used in high-risk situations by eliminating the occurrence of key errors, including the generation of incorrect pharmacovigilance-related terms, thus adhering to stringent regulatory and quality standards in medical safety-critical environments.
comment: 27 pages, 6 figures, 4 tables and supplementary material provided
♻ ☆ Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL
Objective: To enhance the efficiency and accuracy of information retrieval from pharmacovigilance (PV) databases by employing Large Language Models (LLMs) to convert natural language queries (NLQs) into Structured Query Language (SQL) queries, leveraging a business context document. Materials and Methods: We utilized OpenAI's GPT-4 model within a retrieval-augmented generation (RAG) framework, enriched with a business context document, to transform NLQs into syntactically precise SQL queries. Each NLQ was presented to the LLM randomly and independently to prevent memorization. The study was conducted in three phases, varying query complexity, and assessing the LLM's performance both with and without the business context document. Results: Our approach significantly improved NLQ-to-SQL accuracy, increasing from 8.3% with the database schema alone to 78.3% with the business context document. This enhancement was consistent across low, medium, and high complexity queries, indicating the critical role of contextual knowledge in query generation. Discussion: The integration of a business context document markedly improved the LLM's ability to generate accurate and contextually relevant SQL queries. Performance achieved a maximum of 85% when high complexity queries are excluded, suggesting promise for routine deployment. Conclusion: This study presents a novel approach to employing LLMs for safety data retrieval and analysis, demonstrating significant advancements in query generation accuracy. The methodology offers a framework applicable to various data-intensive domains, enhancing the accessibility and efficiency of information retrieval for non-technical users.
comment: 15 pages, 3 tables, 5 figures
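A hedged sketch of how a business context document might be folded into the prompt alongside the schema before requesting SQL. The prompt wording, function names, and the abstract generate callable are illustrative placeholders; they are not the study's actual prompts, and no specific LLM client API is assumed.

def build_nlq_to_sql_prompt(nlq, schema_ddl, business_context):
    """Assemble a prompt pairing the database schema with the business context
    document (definitions, conventions, derived fields) before the question."""
    return (
        "You translate questions about a pharmacovigilance database into SQL.\n"
        f"Schema:\n{schema_ddl}\n\n"
        f"Business context:\n{business_context}\n\n"
        f"Question: {nlq}\n"
        "Return only a syntactically valid SQL query."
    )

def nlq_to_sql(nlq, schema_ddl, business_context, generate):
    """`generate` is any callable that wraps an LLM chat/completions endpoint
    and returns the model's text output."""
    return generate(build_nlq_to_sql_prompt(nlq, schema_ddl, business_context))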
♻ ☆ The Design of an LLM-powered Unstructured Analytics System
LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.
comment: 6 pages, 3 figures, fixed typos
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English→English) and a stronger distribution shift (English→German) at the 405M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
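A hedged sketch of the schedule ingredients named above: learning-rate re-warming followed by cosine re-decay within each continual-pretraining stage, plus replay of earlier data mixed into the batch stream. The warmup length, learning-rate bounds, and replay fraction are illustrative placeholders, not the paper's hyperparameters.

import math
import random

def rewarm_redecay_lr(step, total_steps, warmup_steps=1000, lr_max=3e-4, lr_min=3e-5):
    """Learning-rate schedule for one continual-pretraining stage: linear
    re-warmup from lr_min to lr_max, then cosine re-decay back to lr_min.
    Reset `step` to 0 whenever a new dataset/stage begins."""
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def mixed_batch(new_batch_iter, replay_batch_iter, replay_fraction=0.05, rng=random):
    """Draw from a replay buffer of previous-dataset samples with a small
    probability, otherwise from the new dataset (fraction is illustrative)."""
    source = replay_batch_iter if rng.random() < replay_fraction else new_batch_iter
    return next(source)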
♻ ☆ Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques
We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations. We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We utilize imitation learning to approximate the reference policy, ensuring stability and effectiveness in training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
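A hedged reading of the first algorithmic technique above (MSE regularization along the time axis) as a reward-model training term that penalizes per-step reward predictions for deviating from their trajectory mean, encouraging a more uniform reward distribution over time. The exact formulation in the paper may differ, and the coefficient is a placeholder.

import torch

def reward_loss_with_time_mse(per_step_rewards, preference_loss, coef=0.1):
    """per_step_rewards: (batch, T) predicted rewards along each trajectory.
    Adds an MSE penalty pulling each step's reward toward the trajectory mean,
    an illustrative stand-in for the time-axis regularizer described above."""
    mean_per_traj = per_step_rewards.mean(dim=1, keepdim=True)
    time_mse = ((per_step_rewards - mean_per_traj) ** 2).mean()
    return preference_loss + coef * time_mse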
♻ ☆ Revisiting Character-level Adversarial Attacks for Language Models ICML 2024
Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, gaining prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention, as they cannot easily adopt popular gradient-based methods and are thought to be easy to defend against. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving a high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8 points over the prior art. Our implementation is available at https://github.com/LIONS-EPFL/Charmer.
comment: Accepted in ICML 2024
♻ ☆ The Future of Open Human Feedback
Human feedback on conversations with large language models (LLMs) is central to how these systems learn about the world, improve their capabilities, and are steered toward desirable and safe behaviors. However, this feedback is mostly collected by frontier AI labs and kept behind closed doors. In this work, we bring together interdisciplinary experts to assess the opportunities and challenges in realizing an open ecosystem of human feedback for AI. We first look for successful practices in peer production, open source, and citizen science communities. We then characterize the main challenges for open human feedback. For each, we survey current approaches and offer recommendations. We end by envisioning the components needed to underpin a sustainable and open human feedback ecosystem. At the center of this ecosystem are mutually beneficial feedback loops between users and specialized models, incentivizing a diverse community of model trainers and feedback providers to support a general open feedback pool.
♻ ☆ LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
♻ ☆ Open Gaze: Open Source eye tracker for smartphone devices using Deep Learning
Eye tracking has been a pivotal tool in diverse fields such as vision research, language analysis, and usability assessment. The majority of prior investigations, however, have concentrated on expansive desktop displays employing specialized, costly eye tracking hardware that lacks scalability. Remarkably little insight exists into ocular movement patterns on smartphones, despite their widespread adoption and significant usage. In this manuscript, we present an open-source implementation of a smartphone-based gaze tracker that emulates the methodology proposed by a Google paper (whose source code remains proprietary). Our focus is on attaining accuracy comparable to that attained through the Google paper's methodology, without the necessity for supplementary hardware. Through the integration of machine learning techniques, we unveil an accurate eye tracking solution that is native to smartphones. Our approach demonstrates precision akin to state-of-the-art mobile eye trackers, which are characterized by a cost that is two orders of magnitude higher. Leveraging the vast MIT GazeCapture dataset, which is available through registration on the dataset's website, we successfully replicate crucial findings from previous studies concerning ocular motion behavior in oculomotor tasks and saliency analyses during natural image observation. Furthermore, we emphasize the applicability of smartphone-based gaze tracking in discerning reading comprehension challenges. Our findings exhibit the inherent potential to amplify eye movement research by significant proportions, accommodating participation from thousands of subjects with explicit consent. This scalability not only fosters advancements in vision research, but also extends its benefits to domains such as accessibility enhancement and healthcare applications.
comment: This paper's results are incorrectly reported. The paper is not authentic and its conclusions are not correct
♻ ☆ Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation
Foundational Large Language Models (LLMs) have changed the way we perceive technology. They have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools. At the same time, researchers are striving to identify the limitations of these tools to improve them further. Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content. In the present work, we have identified a fundamental limitation related to the image generation ability of LLMs, and termed it The NO Syndrome. This negation blindness refers to LLMs' inability to correctly comprehend NO-related natural language prompts to generate the desired images. Interestingly, all tested LLMs, including GPT-4, Gemini, and Copilot, were found to suffer from this syndrome. To demonstrate the generalization of this limitation, we carried out simulation experiments and conducted entropy-based and benchmark statistical analysis tests on various LLMs in multiple languages, including English, Hindi, and French. We conclude that the NO syndrome is a significant flaw in current LLMs that needs to be addressed. A related finding of this study is a consistent discrepancy between image and textual responses as a result of this NO syndrome. We posit that introducing a negation context-aware reinforcement learning-based feedback loop between the LLM's textual response and generated image could help ensure the generated text is based on both the LLM's correct contextual understanding of the negation query and the generated visual output.
comment: 15 pages, 7 figures
♻ ☆ Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection CCS 2024
In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions that rely on keywords. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do it again in the following days, even though rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to some real-world samples received from different sources. Our results show that email kits are being reused extensively and visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection engines that rely on textual features and keywords are bypassed, an occurrence our observations show happens frequently.
comment: To be published in the proceedings of the ACM Conference on Computer and Communications Security (ACM CCS 2024)
♻ ☆ Multi-Modal Experience Inspired AI Creation
AI creation, such as poem or lyrics generation, has attracted increasing attention from both industry and academic communities, with many promising models proposed in the past few years. Existing methods usually estimate the outputs based on single and independent visual or textual information. However, in reality, humans usually make creations according to their experiences, which may involve different modalities and be sequentially correlated. To model such human capabilities, in this paper, we define and solve a novel AI creation problem based on human experiences. More specifically, we study how to generate texts based on sequential multi-modal information. Compared with previous works, this task is much more difficult because the designed model has to properly understand and adapt the semantics across different modalities and effectively convert them into the output in a sequential manner. To alleviate these difficulties, we first design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network. For more effective optimization, we then propose a curriculum negative sampling strategy tailored for the sequential inputs. To benchmark this problem and demonstrate the effectiveness of our model, we manually labeled a new multi-modal experience dataset. With this dataset, we conduct extensive experiments by comparing our model with a series of representative baselines, where we can demonstrate significant improvements in our model based on both automatic and human-centered metrics. The code and data are available at: \url{https://github.com/Aman-4-Real/MMTG}.
comment: Accepted by ACM Multimedia 2022
♻ ☆ Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
Large language models (LLMs) are trained on broad corpora and then used in communities with specialized norms. Is providing LLMs with community rules enough for models to follow these norms? We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy. LLMs struggled with bias detection, achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting biases (some under- and others over-predicted bias), suggesting distinct priors about neutrality. LLMs performed better at generation, removing 79% of words removed by Wikipedia editors. However, LLMs made additional changes beyond Wikipedia editors' simpler neutralizations, resulting in high-recall but low-precision editing. Interestingly, crowdworkers rated AI rewrites as more neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia editors but often made extraneous non-NPOV-related changes (such as grammar). LLMs may apply rules in ways that resonate with the public but diverge from community experts. While potentially effective for generation, LLMs may reduce editor agency and increase moderation workload (e.g., verifying additions). Even when rules are easy to articulate, having LLMs apply them like community members may still be difficult.
Multimodal Recommender Systems: A Survey
Recommender systems (RS) have become an integral toolkit of online services. They are equipped with various deep learning techniques to model user preference based on identifier and attribute information. With the emergence of multimedia services, such as short videos and news, understanding these contents while recommending becomes critical. Besides, multimodal features are also helpful in alleviating the problem of data sparsity in RS. Thus, Multimodal Recommender Systems (MRS) have attracted much attention from both academia and industry recently. In this paper, we give a comprehensive survey of MRS models, mainly from a technical point of view. First, we summarize the general procedures and major challenges for MRS. Then, we introduce the existing MRS models according to four categories, i.e., Modality Encoder, Feature Interaction, Feature Enhancement and Model Optimization. Besides, to make it convenient for those who want to research this field, we also summarize the dataset and code resources. Finally, we discuss some promising future directions of MRS and conclude this paper. To access more details of the surveyed papers, such as implementation code, we open-source a repository.
comment: accepted by CSUR
♻ ☆ Comprehensive Review and Empirical Evaluation of Causal Discovery Algorithms for Numerical Data
Causal analysis has become an essential component in understanding the underlying causes of phenomena across various fields. Despite its significance, the existing literature on causal discovery algorithms is fragmented: methodologies are inconsistent (there is no universal classification standard for existing methods) and comprehensive evaluations are lacking (data characteristics are rarely analyzed jointly when benchmarking algorithms). This study addresses these gaps by conducting an exhaustive review and empirical evaluation of causal discovery methods on numerical data, aiming to provide a clearer and more structured understanding of the field. Our research begins with a comprehensive literature review spanning over two decades, analyzing over 200 academic articles and identifying more than 40 representative algorithms. This extensive analysis leads to the development of a structured taxonomy tailored to the complexities of causal discovery, categorizing methods into six main types. To address the lack of comprehensive evaluations, our study conducts an extensive empirical assessment of 29 causal discovery algorithms on multiple synthetic and real-world datasets. We categorize synthetic datasets based on size, linearity, and noise distribution, employ five evaluation metrics, and summarize the top-3 algorithm recommendations, providing guidelines for users in various data scenarios. Our results highlight a significant impact of dataset characteristics on algorithm performance. Moreover, a metadata extraction strategy with an accuracy exceeding 80% is developed to assist users in algorithm selection on unknown datasets. Based on these insights, we offer professional and practical guidelines to help users choose the most suitable causal discovery methods for their specific dataset.
♻ ☆ HIRO: Hierarchical Information Retrieval Optimization
Retrieval-Augmented Generation (RAG) has revolutionized natural language processing by dynamically integrating external knowledge into Large Language Models (LLMs), addressing their limitation of static training datasets. Recent implementations of RAG leverage hierarchical data structures, which organize documents at various levels of summarization and information density. This complexity, however, can cause LLMs to "choke" on information overload, necessitating more sophisticated querying mechanisms. In this context, we introduce Hierarchical Information Retrieval Optimization (HIRO), a novel querying approach that employs a Depth-First Search (DFS)-based recursive similarity score calculation and branch pruning. This method uniquely minimizes the context delivered to the LLM without informational loss, effectively managing the challenge of excessive data. HIRO's refined approach is validated by a 10.85% improvement in performance on the NarrativeQA dataset.
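As a rough illustration of the querying mechanism described above, the following Python sketch walks a document hierarchy depth-first, scores each node against the query, prunes branches whose similarity falls below a threshold, and keeps the surviving leaf chunks as LLM context. The node structure, the cosine scoring, and the pruning threshold are assumptions made for illustration, not HIRO's exact algorithm.

```python
# Minimal DFS retrieval with branch pruning over a summary hierarchy.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    embedding: list                      # vector for this summary/chunk
    children: list = field(default_factory=list)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-9)

def hierarchical_retrieve(node, query_emb, threshold=0.6, results=None):
    """Depth-first descent: prune low-similarity branches, keep surviving leaves."""
    if results is None:
        results = []
    score = cosine(node.embedding, query_emb)
    if score < threshold:
        return results                   # prune this branch entirely
    if not node.children:
        results.append((score, node.text))  # leaf chunk goes into the LLM context
    for child in node.children:
        hierarchical_retrieve(child, query_emb, threshold, results)
    return results
```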
♻ ☆ DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data
Classification tasks present challenges due to class imbalances and evolving data distributions. Addressing these issues requires a robust method to handle imbalances while effectively detecting out-of-distribution (OOD) samples not encountered during training. This study introduces a novel OOD detection algorithm designed for tabular datasets, titled Deep Neural Network-based Gaussian Descriptor for Imbalanced Tabular Data (DNN-GDITD). The DNN-GDITD algorithm can be placed on top of any DNN to facilitate better classification of imbalanced data and OOD detection using spherical decision boundaries. Using a combination of Push, Score-based, and focal losses, DNN-GDITD assigns confidence scores to test data points, categorizing them as known classes or as an OOD sample. Extensive experimentation on tabular datasets demonstrates the effectiveness of DNN-GDITD compared to three OOD algorithms. Evaluation encompasses imbalanced and balanced scenarios on diverse tabular datasets, including a synthetic financial dispute dataset and publicly available tabular datasets like Gas Sensor, Drive Diagnosis, and MNIST, showcasing DNN-GDITD's versatility.
comment: 17 pages
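The following PyTorch fragment sketches the flavor of spherical decision boundaries for OOD detection: each class keeps a center and a radius in the DNN's embedding space, and a test point lying outside its nearest class sphere is flagged as OOD. The Euclidean distance measure and the radius test are illustrative assumptions, not the DNN-GDITD losses themselves.

```python
# Sketch of sphere-based OOD scoring on top of a DNN embedding.
import torch

def spherical_ood_scores(embeddings, centers, radii):
    """embeddings: (N, D); centers: (C, D); radii: (C,).
    Returns the predicted class and an OOD flag per sample."""
    dists = torch.cdist(embeddings, centers)           # (N, C) distances to centers
    pred = dists.argmin(dim=1)                          # nearest class center
    nearest = dists.gather(1, pred.unsqueeze(1)).squeeze(1)
    is_ood = nearest > radii[pred]                      # outside the class sphere
    return pred, is_ood
```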
♻ ☆ Graph Retrieval Augmented Trustworthiness Reasoning
Trustworthiness reasoning is crucial in multiplayer games with incomplete information, enabling agents to identify potential allies and adversaries, thereby enhancing reasoning and decision-making processes. Traditional approaches relying on pre-trained models necessitate extensive domain-specific data and considerable reward feedback, with their lack of real-time adaptability hindering their effectiveness in dynamic environments. In this paper, we introduce the Graph Retrieval Augmented Reasoning (GRATR) framework, leveraging the Retrieval-Augmented Generation (RAG) technique to bolster trustworthiness reasoning in agents. GRATR constructs a dynamic trustworthiness graph, updating it in real-time with evidential information, and retrieves relevant trust data to augment the reasoning capabilities of Large Language Models (LLMs). We validate our approach through experiments on the multiplayer game "Werewolf," comparing GRATR against baseline LLM and LLM enhanced with Native RAG and Rerank RAG. Our results demonstrate that GRATR surpasses the baseline methods by over 30\% in winning rate, with superior reasoning performance. Moreover, GRATR effectively mitigates LLM hallucinations, such as identity and objective amnesia, and crucially, it renders the reasoning process more transparent and traceable through the use of the trustworthiness graph.
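A toy Python sketch of the dynamic trustworthiness graph idea is given below: evidence is accumulated on observer-target edges in real time, aggregated into trust scores, and the strongest pieces of evidence are retrieved to augment the LLM's reasoning context. The data layout and the mean aggregation are assumptions for illustration, not GRATR's actual update rules.

```python
# Hypothetical trust-graph store: update with evidence, then retrieve for prompting.
from collections import defaultdict

class TrustGraph:
    def __init__(self):
        # edges[(observer, target)] -> list of (weight, evidence) tuples
        self.edges = defaultdict(list)

    def update(self, observer, target, weight, evidence):
        """Record a new piece of evidential information in real time."""
        self.edges[(observer, target)].append((weight, evidence))

    def trust(self, observer, target):
        """Aggregate evidence into a single trust score (simple mean here)."""
        items = self.edges[(observer, target)]
        return sum(w for w, _ in items) / len(items) if items else 0.0

    def retrieve_context(self, observer, target, top_k=3):
        """Return the strongest evidence to splice into the LLM prompt."""
        items = sorted(self.edges[(observer, target)], key=lambda x: -abs(x[0]))
        return [ev for _, ev in items[:top_k]]
```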
♻ ☆ Decision-Focused Learning: Foundations, State of the Art, Benchmark and Future Opportunities
Decision-focused learning (DFL) is an emerging paradigm that integrates machine learning (ML) and constrained optimization to enhance decision quality by training ML models in an end-to-end system. This approach shows significant potential to revolutionize combinatorial decision-making in real-world applications that operate under uncertainty, where estimating unknown parameters within decision models is a major challenge. This paper presents a comprehensive review of DFL, providing an in-depth analysis of both gradient-based and gradient-free techniques used to combine ML and constrained optimization. It evaluates the strengths and limitations of these techniques and includes an extensive empirical evaluation of eleven methods across seven problems. The survey also offers insights into recent advancements and future research directions in DFL. Code and benchmark: https://github.com/PredOpt/predopt-benchmarks
comment: Experimental Survey and Benchmarking
♻ ☆ Can Vehicle Motion Planning Generalize to Realistic Long-tail Scenarios?
Real-world autonomous driving systems must make safe decisions in the face of rare and diverse traffic scenarios. Current state-of-the-art planners are mostly evaluated on real-world datasets like nuScenes (open-loop) or nuPlan (closed-loop). In particular, nuPlan seems to be an expressive evaluation method since it is based on real-world data and closed-loop, yet it mostly covers basic driving scenarios. This makes it difficult to judge a planner's capabilities to generalize to rarely-seen situations. Therefore, we propose a novel closed-loop benchmark interPlan containing several edge cases and challenging driving scenarios. We assess existing state-of-the-art planners on our benchmark and show that neither rule-based nor learning-based planners can safely navigate the interPlan scenarios. A recently evolving direction is the usage of foundation models like large language models (LLM) to handle generalization. We evaluate an LLM-only planner and introduce a novel hybrid planner that combines an LLM-based behavior planner with a rule-based motion planner that achieves state-of-the-art performance on our benchmark.
♻ ☆ In the Search for Optimal Multi-view Learning Models for Crop Classification with Global Remote Sensing Data
Studying and analyzing cropland is a difficult task due to its dynamic and heterogeneous growth behavior. Usually, diverse data sources can be collected for its estimation. Although deep learning models have proven to excel in the crop classification task, they face substantial challenges when dealing with multiple inputs, a setting known as Multi-View Learning (MVL). The methods used in the MVL scenario can be structured based on the encoder architecture, the fusion strategy, and the optimization technique. The literature has primarily focused on using specific encoder architectures for local regions, lacking a deeper exploration of the other components in the MVL methodology. In contrast, we investigate the simultaneous selection of the fusion strategy and encoder architecture, assessing global-scale cropland and crop-type classifications. We use a range of five fusion strategies (Input, Feature, Decision, Ensemble, Hybrid) and five temporal encoders (LSTM, GRU, TempCNN, TAE, L-TAE) as possible configurations in the MVL method. We use the CropHarvest dataset for validation, which provides optical, radar, weather time series, and topographic information as input data. We found that in scenarios with a limited number of labeled samples, no single configuration is sufficient for all cases; instead, a specialized combination of encoder and fusion strategy should be meticulously sought. To streamline this search process, we suggest identifying the optimal encoder architecture tailored for a particular fusion strategy, and then determining the most suitable fusion strategy for the classification task. We provide a methodological framework for researchers exploring crop classification through an MVL methodology.
comment: submitted to journal
♻ ☆ Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation ACL
Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
comment: Accepted at the MACLEAN workshop in the ECML/PKDD 2024
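The Input Sensor Dropout component lends itself to a very small sketch: during training, entire sensors are zeroed out per sample with some probability so that the downstream model learns to tolerate missing inputs. The dictionary layout and the dropout rate below are illustrative assumptions, not the paper's exact configuration.

```python
# Whole-sensor dropout: one Bernoulli draw per sample per sensor.
import torch

def sensor_dropout(inputs, p=0.3, training=True):
    """inputs: dict sensor_name -> tensor of shape (batch, time, features)."""
    if not training:
        return inputs
    out = {}
    for name, x in inputs.items():
        # Drop the whole sensor for a sample, not individual time steps.
        keep = (torch.rand(x.shape[0], 1, 1, device=x.device) > p).float()
        out[name] = x * keep
    return out
```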
♻ ☆ Smart E-commerce Recommendations with Semantic AI
In e-commerce, web mining for page recommendations is widely used but often fails to meet user needs. To address this, we propose a novel solution combining semantic web mining with BP neural networks. We process user search logs to extract five key features: content priority, time spent, user feedback, recommendation semantics, and input deviation. These features are then fed into a BP neural network to classify and prioritize web pages. The prioritized pages are recommended to users. Using book sales pages for testing, our results demonstrate that this solution can quickly and accurately identify the pages users need. Our approach ensures that recommendations are more relevant and tailored to individual preferences, enhancing the online shopping experience. By leveraging advanced semantic analysis and neural network techniques, we bridge the gap between user expectations and actual recommendations. This innovative method not only improves accuracy but also speeds up the recommendation process, making it a valuable tool for e-commerce platforms aiming to boost user satisfaction and engagement. Additionally, our system's ability to handle large datasets and provide real-time recommendations makes it a scalable and efficient solution for modern e-commerce challenges.
comment: 8 pages
♻ ☆ A Hybrid Framework for Spatial Interpolation: Merging Data-driven with Domain Knowledge
Estimating spatially distributed information through the interpolation of scattered observation datasets often overlooks the critical role of domain knowledge in understanding spatial dependencies. Additionally, the features of these data sets are typically limited to the spatial coordinates of the scattered observation locations. In this paper, we propose a hybrid framework that integrates data-driven spatial dependency feature extraction with rule-assisted spatial dependency function mapping to augment domain knowledge. We demonstrate the superior performance of our framework in two comparative application scenarios, highlighting its ability to capture more localized spatial features in the reconstructed distribution fields. Furthermore, we underscore its potential to enhance nonlinear estimation capabilities through the application of transformed fuzzy rules and to quantify the inherent uncertainties associated with the observation data sets. Our framework introduces an innovative approach to spatial information estimation by synergistically combining observational data with rule-assisted domain knowledge.
comment: 21 pages, 13 figures; typos corrected, references updated
♻ ☆ Moderate Adaptive Linear Units (MoLU)
We propose a new high-performance activation function, Moderate Adaptive Linear Units (MoLU), for deep neural networks. MoLU is a simple, beautiful and powerful activation function that can serve as a good main activation function among the hundreds of existing ones. Because MoLU is built from elementary functions, it is not only a diffeomorphism (i.e., analytic over the whole domain), but it also reduces training time.
comment: 4 pages, 5 figures
♻ ☆ Simultaneous Training of First- and Second-Order Optimizers in Population-Based Reinforcement Learning
The tuning of hyperparameters in reinforcement learning (RL) is critical, as these parameters significantly impact an agent's performance and learning efficiency. Dynamic adjustment of hyperparameters during the training process can significantly enhance both the performance and stability of learning. Population-based training (PBT) provides a method to achieve this by continuously tuning hyperparameters throughout the training. This ongoing adjustment enables models to adapt to different learning stages, resulting in faster convergence and overall improved performance. In this paper, we propose an enhancement to PBT by simultaneously utilizing both first- and second-order optimizers within a single population. We conducted a series of experiments using the TD3 algorithm across various MuJoCo environments. Our results, for the first time, empirically demonstrate the potential of incorporating second-order optimizers within PBT-based RL. Specifically, the combination of the K-FAC optimizer with Adam led to up to a 10% improvement in overall performance compared to PBT using only Adam. Additionally, in environments where Adam occasionally fails, such as the Swimmer environment, the mixed population with K-FAC exhibited more reliable learning outcomes, offering a significant advantage in training stability without a substantial increase in computational time.
comment: 8 pages, 5 figures
♻ ☆ Genesis: Towards the Automation of Systems Biology Research
The cutting edge of applying AI to science is the closed-loop automation of scientific research: robot scientists. We have previously developed two robot scientists: `Adam' (for yeast functional biology) and `Eve' (for early-stage drug design). We are now developing a next-generation robot scientist, Genesis. With Genesis we aim to demonstrate that an area of science can be investigated using robot scientists unambiguously faster, and at lower cost, than with human scientists. Here we report progress on the Genesis project. Genesis is designed to automatically improve systems biology models with thousands of interacting causal components. When complete, Genesis will be able to initiate and execute in parallel one thousand hypothesis-led closed-loop cycles of experiment per day. Here we describe the core Genesis hardware: the one thousand computer-controlled $\mu$-bioreactors. For the integrated mass spectrometry platform we have developed AutonoMS, a system to automatically run, process, and analyse high-throughput experiments. We have also developed Genesis-DB, a database system designed to give software agents access to large quantities of structured domain information. We have developed RIMBO (Revisions for Improvements of Models in Biology Ontology) to describe the planned hundreds of thousands of changes to the models. We have demonstrated the utility of this infrastructure by developing two relational learning bioinformatics projects. Finally, we describe LGEM+, a relational learning system for the automated abductive improvement of genome-scale metabolic models.
♻ ☆ Model-agnostic explainable artificial intelligence for object detection in image data
In recent years, deep neural networks have been widely used to build high-performance Artificial Intelligence (AI) systems for computer vision applications. Object detection is a fundamental task in computer vision, which has progressed greatly through the development of large and intricate AI models. However, the lack of transparency is a big challenge that may hinder the widespread adoption of these models. Explainable artificial intelligence is a field of research in which methods are developed to help users understand the behavior, decision logic, and vulnerabilities of AI systems. Previously, a few explanation methods based on random masking were developed for object detection. However, random masks may raise issues regarding the actual importance of pixels within an image. In this paper, we design and implement a black-box explanation method named Black-box Object Detection Explanation by Masking (BODEM) by adopting a hierarchical random masking approach for object detection systems. We propose a hierarchical random masking framework in which coarse-grained masks are used at lower levels to find salient regions within an image, and fine-grained masks are used at higher levels to refine those salient regions. Experiments on various object detection datasets and models show that BODEM can effectively explain the behavior of object detectors. Moreover, our method outperforms Detector Randomized Input Sampling for Explanation (D-RISE) and Local Interpretable Model-agnostic Explanations (LIME) with respect to different quantitative measures of explanation effectiveness. The experimental results demonstrate that BODEM can be an effective method for explaining and validating object detection systems in black-box testing scenarios.
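For intuition, here is a compact NumPy sketch of saliency estimation by hierarchical random masking for a black-box detector: coarse grids locate salient regions and progressively finer grids accumulate a saliency map weighted by the detector's confidence on masked images. Grid sizes, the keep probability, and the scoring rule are illustrative assumptions and not the published BODEM procedure.

```python
# Black-box masking-based saliency sketch; detector_score(img) -> scalar confidence.
import numpy as np

def saliency_by_masking(image, detector_score, levels=(8, 16, 32), n_masks=64, keep=0.25):
    """Coarse-to-fine random masking: sum detector confidence over kept regions."""
    h, w = image.shape[:2]
    saliency = np.zeros((h, w), dtype=np.float32)
    for cells in levels:                               # coarse -> fine grids
        for _ in range(n_masks):
            grid = (np.random.rand(cells, cells) < keep).astype(np.float32)
            # Upsample the binary grid to image resolution (nearest neighbour).
            mask = np.kron(grid, np.ones((h // cells + 1, w // cells + 1)))[:h, :w]
            masked = image * mask[..., None]
            saliency += detector_score(masked) * mask
    return saliency / (saliency.max() + 1e-9)
```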
♻ ☆ Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures
Despite significant progress in static protein structure collection and prediction, the dynamic behavior of proteins, one of their most vital characteristics, has been largely overlooked in prior research. This oversight can be attributed to the limited availability, diversity, and heterogeneity of dynamic protein datasets. To address this gap, we propose to enhance existing prestigious static 3D protein structural databases, such as the Protein Data Bank (PDB), by integrating dynamic data and additional physical properties. Specifically, we introduce a large-scale dataset, Dynamic PDB, encompassing approximately 12.6K proteins, each subjected to all-atom molecular dynamics (MD) simulations lasting 1 microsecond to capture conformational changes. Furthermore, we provide a comprehensive suite of physical properties, including atomic velocities and forces, potential and kinetic energies of proteins, and the temperature of the simulation environment, recorded at 1 picosecond intervals throughout the simulations. For benchmarking purposes, we evaluate state-of-the-art methods on the proposed dataset for the task of trajectory prediction. To demonstrate the value of integrating richer physical properties in the study of protein dynamics and related model design, we base our approach on the SE(3) diffusion model and incorporate these physical properties into the trajectory prediction process. Preliminary results indicate that this straightforward extension of the SE(3) model yields improved accuracy, as measured by MAE and RMSD, when the proposed physical properties are taken into consideration. https://fudan-generative-vision.github.io/dynamicPDB/ .
♻ ☆ Map-Free Visual Relocalization Enhanced by Instance Knowledge and Depth Knowledge
Map-free relocalization technology is crucial for autonomous navigation and augmented reality applications in which relying on pre-built maps is impractical. It faces significant challenges due to limitations in matching methods and the inherent lack of scale in monocular images. These issues lead to substantial rotational and metric errors and even localization failures in real-world scenarios. Large matching errors significantly impact the overall relocalization process, affecting both rotational and translational accuracy. Because of the inherent limitations of the camera itself, recovering the metric scale from a single image is also crucial, as it significantly impacts the translation error. To address these challenges, we propose a map-free relocalization method enhanced by instance knowledge and depth knowledge. By leveraging instance-based matching information to improve global matching results, our method significantly reduces the possibility of mismatching across different objects. The robustness of instance knowledge across the scene helps the feature point matching model focus on relevant regions and enhances matching accuracy. Additionally, we use estimated metric depth from a single image to reduce metric errors and improve scale recovery accuracy. By integrating methods dedicated to mitigating large translational and rotational errors, our approach demonstrates superior performance in map-free relocalization.
comment: 17 pages, 6 figures
♻ ☆ Sentinel: An Aggregation Function to Secure Decentralized Federated Learning
Decentralized Federated Learning (DFL) emerges as an innovative paradigm to train collaborative models, addressing the single-point-of-failure limitation. However, the security and trustworthiness of FL and DFL are compromised by poisoning attacks, negatively impacting performance. Existing defense mechanisms have been designed for centralized FL and do not adequately exploit the particularities of DFL. Thus, this work introduces Sentinel, a defense strategy to counteract poisoning attacks in DFL. Sentinel leverages the accessibility of local data and defines a three-step aggregation protocol consisting of similarity filtering, bootstrap validation, and normalization to safeguard against malicious model updates. Sentinel has been evaluated with diverse datasets and data distributions, and various poisoning attack types and threat levels have been verified. Sentinel improves on state-of-the-art performance against both untargeted and targeted poisoning attacks when data follows an IID (Independent and Identically Distributed) configuration. In addition, we analyze how performance degrades under non-IID configurations, both for Sentinel and for other state-of-the-art robust aggregation methods.
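The three-step aggregation protocol can be sketched roughly as follows: neighbor updates that are too dissimilar to the local model are filtered out, the survivors are validated on the node's own data, and their normalized weights drive the final average. The cosine-similarity filter, the inverse-loss weighting, and the threshold are assumptions made for illustration, not Sentinel's exact formulation.

```python
# Sketch of a similarity-filter / local-validation / normalization aggregation.
import torch

def sentinel_style_aggregate(local_model, neighbor_models, local_eval, sim_threshold=0.5):
    """local_model / neighbor_models: lists of parameter tensors.
    local_eval(params) -> validation loss of those parameters on local data."""
    local_vec = torch.cat([p.flatten() for p in local_model])
    kept, weights = [], []
    for params in neighbor_models:
        vec = torch.cat([p.flatten() for p in params])
        # Step 1: similarity filtering against the local model.
        sim = torch.nn.functional.cosine_similarity(local_vec, vec, dim=0)
        if sim < sim_threshold:
            continue
        # Step 2: bootstrap validation on local data; lower loss -> higher weight.
        loss = float(local_eval(params))
        kept.append(vec)
        weights.append(1.0 / (loss + 1e-6))
    if not kept:
        return local_model
    # Step 3: normalize weights and average the surviving updates.
    w = torch.tensor(weights)
    w = w / w.sum()
    merged = sum(wi * vi for wi, vi in zip(w, kept))
    out, idx = [], 0
    for p in local_model:                       # reshape back into parameter shapes
        out.append(merged[idx: idx + p.numel()].view_as(p))
        idx += p.numel()
    return out
```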
♻ ☆ Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models
Background and Aims: This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology. Methods: We used 300 gastroenterology board-exam-style multiple-choice questions, 138 of which contain images, to systematically assess the impact of model configurations, parameters, and prompt engineering strategies, initially using GPT-3.5. Next, we assessed the performance of proprietary and open-source LLMs (and their versions), including GPT (3.5, 4, 4o, 4o-mini), Claude (3, 3.5), Gemini (1.0), Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3), across different interfaces (web and API), computing environments (cloud and local), and model precisions (with and without quantization). Finally, we assessed accuracy using a semiautomated pipeline. Results: Among the proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet (74.0%) achieved the highest accuracy, outperforming the top open-source models: Llama3.1-405b (64%), Llama3.1-70b (58.3%), and Mixtral-8x7b (54.3%). Among the quantized open-source models, the 6-bit quantized Phi3-14b (48.7%) performed best. The scores of the quantized models were comparable to those of the full-precision models Llama2-7b, Llama2-13b, and Gemma2-9b. Notably, VLM performance on image-containing questions did not improve when the images were provided and worsened when LLM-generated captions were provided. In contrast, a 10% increase in accuracy was observed when images were accompanied by human-crafted image descriptions. Conclusion: While LLMs exhibit robust zero-shot performance in medical reasoning, the integration of visual data remains a challenge for VLMs. Effective deployment involves carefully determining optimal model configurations, encouraging users to consider either the high performance of proprietary models or the flexible adaptability of open-source models.
comment: Manuscript Pages: 34, Figures: 7, Tables: 2, Supplementary File Pages: 35, Data Transparency Statement: Code is available at: https://github.com/Sdamirsa/LLM-VLM-in-Gastroenterology . Study data from American College of Gastroenterology (ACG) are restricted and available upon request with ACG permission. Correction: updated abstract considering Llama3.1 results
♻ ☆ A Systematic Review on Sleep Stage Classification and Sleep Disorder Detection Using Artificial Intelligence
Sleep is vital for people's physical and mental health, and sound sleep can help them focus on daily activities. Therefore, a sleep study that includes sleep patterns and sleep disorders is crucial to enhancing our knowledge about individuals' health status. This study aims to provide a comprehensive, systematic review of the recent literature to analyze the different approaches and their outcomes in sleep studies, including works on "sleep stage classification" and "sleep disorder detection" using AI. In this review, 183 articles were initially selected from different journals, among which 80 records, ranging from 2016 to 2023, were retained for detailed review. Brain waves were the most commonly employed body parameter for sleep staging and disorder studies (almost 29% of the research used brain activity signals exclusively, and 77% used them in combination with other signals). The convolutional neural network (CNN) was the most widely used of the 34 distinct artificial intelligence models, appearing in 27% of the works. The other models included long short-term memory (LSTM), support vector machine (SVM), random forest (RF), and recurrent neural network (RNN), accounting for 11%, 6%, 6%, and 5%, respectively. For performance metrics, accuracy was the most widely used, appearing in 83.75% of the cases, followed by the F1 score (45%), Kappa (36.25%), sensitivity (31.25%), and specificity (30%), along with other metrics. This article should help physicians and researchers get the gist of AI's contribution to sleep studies and the feasibility of their intended work.
comment: 39 pages, 11 Figures, 8 Tables
♻ ☆ Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
Text-to-image (T2I) models can be maliciously used to generate harmful content such as sexually explicit, unfaithful, misleading, or Not-Safe-for-Work (NSFW) images. Previous attacks largely depend on the availability of the diffusion model or involve a lengthy optimization process. In this work, we investigate a more practical and universal attack that does not require the presence of a target model and demonstrate that the high-dimensional text embedding space inherently contains NSFW concepts that can be exploited to generate harmful images. We present the Jailbreaking Prompt Attack (JPA). JPA first searches for the target malicious concepts in the text embedding space using a group of antonyms generated by ChatGPT. Subsequently, a prefix prompt is optimized in the discrete vocabulary space to align malicious concepts semantically in the text embedding space. We further introduce a soft assignment with gradient masking technique that allows us to perform gradient ascent in the discrete vocabulary space. We perform extensive experiments with open-sourced T2I models, e.g. stable-diffusion-v1-4, and closed-sourced online services, e.g. DALLE2 and Midjourney, with black-box safety checkers. Results show that (1) JPA bypasses both text and image safety checkers, (2) it preserves high semantic alignment with the target prompt, and (3) it is much faster than previous methods and can be executed in a fully automated manner. These merits render it a valuable tool for robustness evaluation in future text-to-image generation research.
♻ ☆ Evaluating Named Entity Recognition Using Few-Shot Prompting with Large Language Models
This paper evaluates Few-Shot Prompting with Large Language Models for Named Entity Recognition (NER). Traditional NER systems rely on extensive labeled datasets, which are costly and time-consuming to obtain. Few-Shot Prompting or in-context learning enables models to recognize entities with minimal examples. We assess state-of-the-art models like GPT-4 in NER tasks, comparing their few-shot performance to fully supervised benchmarks. Results show that while there is a performance gap, large models excel in adapting to new entity types and domains with very limited data. We also explore the effects of prompt engineering, guided output format and context length on performance. This study underscores Few-Shot Learning's potential to reduce the need for large labeled datasets, enhancing NER scalability and accessibility.
comment: Github repo: https://github.com/GEODE-project/ner-llm
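A minimal few-shot NER prompt of the kind evaluated in the paper might be assembled as below, using an OpenAI-compatible chat API; the prompt wording, the entity schema, the example sentences, and the model name are illustrative assumptions, not the paper's exact configuration.

```python
# Few-shot (in-context) NER prompting sketch with the OpenAI Python SDK (>=1.0).
from openai import OpenAI

FEW_SHOT_EXAMPLES = [
    ("Barack Obama visited Paris in 2015.",
     '[{"text": "Barack Obama", "type": "PER"}, {"text": "Paris", "type": "LOC"}]'),
    ("Apple unveiled the iPhone in California.",
     '[{"text": "Apple", "type": "ORG"}, {"text": "California", "type": "LOC"}]'),
]

def few_shot_ner(sentence, model="gpt-4o-mini"):
    prompt = "Extract named entities (PER, ORG, LOC) as a JSON list.\n\n"
    for text, answer in FEW_SHOT_EXAMPLES:
        prompt += f"Sentence: {text}\nEntities: {answer}\n\n"
    prompt += f"Sentence: {sentence}\nEntities:"
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```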
♻ ☆ MarsCode Agent: AI-native Automated Bug Fixing
Recent advances in large language models (LLMs) have shown significant potential to automate various software development tasks, including code completion, test generation, and bug fixing. However, the application of LLMs for automated bug fixing remains challenging due to the complexity and diversity of real-world software systems. In this paper, we introduce MarsCode Agent, a novel framework that leverages LLMs to automatically identify and repair bugs in software code. MarsCode Agent combines the power of LLMs with advanced code analysis techniques to accurately localize faults and generate patches. Our approach follows a systematic process of planning, bug reproduction, fault localization, candidate patch generation, and validation to ensure high-quality bug fixes. We evaluated MarsCode Agent on SWE-bench, a comprehensive benchmark of real-world software projects, and our results show that MarsCode Agent achieves a high success rate in bug fixing compared to most of the existing automated approaches.
comment: Yizhou Liu and Pengfei Gao contributed equally and the order is determined by rolling the dice. Chao Peng is the corresponding author
♻ ☆ Large Language Models for Explainable Decisions in Dynamic Digital Twins
Dynamic data-driven Digital Twins (DDTs) can enable informed decision-making and provide an optimisation platform for the underlying system. By leveraging principles of Dynamic Data-Driven Applications Systems (DDDAS), DDTs can formulate computational modalities for feedback loops, model updates and decision-making, including autonomous ones. However, understanding autonomous decision-making often requires technical and domain-specific knowledge. This paper explores using large language models (LLMs) to provide an explainability platform for DDTs, generating natural language explanations of the system's decision-making by leveraging domain-specific knowledge bases. A case study from smart agriculture is presented.
comment: 9 pages, 3 figures, accepted by DDDAS2024 -- the 5th International Conference on Dynamic Data Driven Applications Systems
♻ ☆ Towards Measuring and Modeling "Culture" in LLMs: A Survey
We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs). We observe that none of the studies explicitly defines "culture," which is a complex, multifaceted concept; instead, they probe the models on specially designed datasets that represent certain aspects of "culture." We call these aspects the proxies of culture, and organize them across two dimensions of demographic and semantic proxies. We also categorize the probing methods employed. Our analysis indicates that only certain aspects of "culture," such as values and objectives, have been studied, leaving several other interesting and important facets, especially the multitude of semantic domains (Thompson et al., 2020) and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps are the lack of robustness of probing techniques and situated studies on the impact of cultural mis- and under-representation in LLM-based applications.
♻ ☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever EMNLP
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context windows and multilingual retrieval. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
comment: 8 pages, references at pp. 7-8; EMNLP workshop submission
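ColBERT-style late interaction reduces to the well-known MaxSim rule: each query token embedding is matched to its most similar document token embedding and the per-token maxima are summed. The short PyTorch sketch below shows that scoring step in isolation; the shapes and the normalization assumption are generic, not Jina-ColBERT-v2's exact implementation.

```python
# MaxSim late interaction scoring over token embeddings.
import torch

def maxsim_score(query_emb, doc_emb):
    """query_emb: (Lq, D); doc_emb: (Ld, D); both assumed L2-normalized."""
    sim = query_emb @ doc_emb.T          # (Lq, Ld) token-token similarities
    return sim.max(dim=1).values.sum()   # best doc token per query token, then sum
```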
♻ ☆ Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration
Multimodal learning leverages complementary information derived from different modalities, thereby enhancing performance in medical image segmentation. However, prevailing multimodal learning methods heavily rely on extensive well-annotated data from various modalities to achieve accurate segmentation performance. This dependence often poses a challenge in clinical settings due to limited availability of such data. Moreover, the inherent anatomical misalignment between different imaging modalities further complicates the endeavor to enhance segmentation performance. To address this problem, we propose a novel semi-supervised multimodal segmentation framework that is robust to scarce labeled data and misaligned modalities. Our framework employs a novel cross modality collaboration strategy to distill modality-independent knowledge, which is inherently associated with each modality, and integrates this information into a unified fusion layer for feature amalgamation. With a channel-wise semantic consistency loss, our framework ensures alignment of modality-independent information from a feature-wise perspective across modalities, thereby fortifying it against misalignments in multimodal scenarios. Furthermore, our framework effectively integrates contrastive consistent learning to regulate anatomical structures, facilitating anatomical-wise prediction alignment on unlabeled data in semi-supervised segmentation tasks. Our method achieves competitive performance compared to other multimodal methods across three tasks: cardiac, abdominal multi-organ, and thyroid-associated orbitopathy segmentations. It also demonstrates outstanding robustness in scenarios involving scarce labeled data and misaligned modalities.
♻ ☆ Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs
Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals with 618 main effects and 138 interaction effects using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.
comment: 5 figures, 2 tables
♻ ☆ CCPL: Cross-modal Contrastive Protein Learning ICPR 2024
Effective protein representation learning is crucial for predicting protein functions. Traditional methods often pretrain protein language models on large, unlabeled amino acid sequences, followed by finetuning on labeled data. While effective, these methods underutilize the potential of protein structures, which are vital for function determination. Common structural representation techniques rely heavily on annotated data, limiting their generalizability. Moreover, structural pretraining methods, similar to natural language pretraining, can distort actual protein structures. In this work, we introduce a novel unsupervised protein structure representation pretraining method, cross-modal contrastive protein learning (CCPL). CCPL leverages a robust protein language model and uses unsupervised contrastive alignment to enhance structure learning, incorporating self-supervised structural constraints to maintain intrinsic structural information. We evaluated our model across various benchmarks, demonstrating the framework's superiority.
comment: Accepted to ICPR 2024
♻ ☆ Enhancing Dialogue Generation in Werewolf Game Through Situation Analysis and Persuasion Strategies
Recent advancements in natural language processing, particularly with large language models (LLMs) like GPT-4, have significantly enhanced dialogue systems, enabling them to generate more natural and fluent conversations. Despite these improvements, challenges persist, such as managing continuous dialogues, memory retention, and minimizing hallucinations. The AIWolfDial2024 addresses these challenges by employing the Werewolf Game, an incomplete information game, to test the capabilities of LLMs in complex interactive environments. This paper introduces an LLM-based Werewolf Game AI, in which each role is supported by situation analysis to aid response generation. Additionally, for the werewolf role, various persuasion strategies, including logical appeal, credibility appeal, and emotional appeal, are employed to effectively persuade other players to align with its actions.
comment: Accepted to the AIWolfDial2024 workshop at INLG 2024
♻ ☆ SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives, than generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.
♻ ☆ MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting
Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we introduce Marking Open-world Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in zero-shot and few-shot manners, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
♻ ☆ Thresholded Lexicographic Ordered Multiobjective Reinforcement Learning ECAI 2024
Lexicographic multi-objective problems, which impose a lexicographic importance order over the objectives, arise in many real-life scenarios. Existing Reinforcement Learning work directly addressing lexicographic tasks has been scarce. The few proposed approaches were all noted to be heuristics without theoretical guarantees as the Bellman equation is not applicable to them. Additionally, the practical applicability of these prior approaches also suffers from various issues such as not being able to reach the goal state. While some of these issues have been known before, in this work we investigate further shortcomings, and propose fixes for improving practical performance in many cases. We also present a policy optimization approach using our Lexicographic Projection Optimization (LPO) algorithm that has the potential to address these theoretical and practical concerns. Finally, we demonstrate our proposed algorithms on benchmark problems.
comment: Full version of ECAI 2024 paper
♻ ☆ Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at https://github.com/ContextualAI/CLAIR_and_APO.
♻ ☆ Extending Structural Causal Models for Autonomous Embodied Systems
In this work we aim to bridge the divide between autonomous embodied systems and causal reasoning. Autonomous embodied systems have come to increasingly interact with humans, and in many cases may pose risks to the physical or mental well-being of those they interact with. Meanwhile causal models, despite their inherent transparency and ability to offer contrastive explanations, have found limited usage within such systems. As such, we first identify the challenges that have limited the integration of structural causal models within autonomous embodied systems. We then introduce a number of theoretical extensions to the structural causal model formalism in order to tackle these challenges. This augments these models with greater levels of modularisation and encapsulation, as well as presenting a constant-space temporal causal model representation. While not an extension itself, we also prove through the extensions we have introduced that dynamically mutable sets can be captured within structural causal models while maintaining a form of causal stationarity. Finally we introduce two case study architectures demonstrating the application of these extensions along with a discussion of where these extensions could be utilised in future work.
comment: 33 Pages = 7 Pages (Main Content) + 2 Pages (References) + 24 Pages (Appendix), 17 Figures = 2 Figures (Main Content) + 15 (Appendix), Oxford Robotics Institute Preprint; Significantly updated based upon feedback
♻ ☆ From Lab to Field: Real-World Evaluation of an AI-Driven Smart Video Solution to Enhance Community Safety
This article adopts and evaluates an AI-enabled Smart Video Solution (SVS) designed to enhance safety in the real world. The system integrates with existing infrastructure camera networks, leveraging recent advancements in AI for easy adoption. Prioritizing privacy and ethical standards, pose-based data is used for downstream AI tasks such as anomaly detection. A cloud-based infrastructure and a mobile app are deployed, enabling real-time alerts within communities. The SVS employs innovative data representation and visualization techniques, such as the Occupancy Indicator, Statistical Anomaly Detection, Bird's Eye View, and Heatmaps, to understand pedestrian behaviors and enhance public safety. Evaluation of the SVS demonstrates its capacity to convert complex computer vision outputs into actionable insights for stakeholders, community partners, law enforcement, urban planners, and social scientists. This article presents a comprehensive real-world deployment and evaluation of the SVS, implemented in a community college environment across 16 cameras. The system integrates AI-driven visual processing, supported by statistical analysis, database management, cloud communication, and user notifications. Additionally, the article evaluates the end-to-end latency from the moment an AI algorithm detects anomalous behavior in real time at the camera level to the time stakeholders receive a notification. The results demonstrate the system's robustness, effectively managing 16 CCTV cameras with a consistent throughput of 16.5 frames per second (FPS) over a 21-hour period and an average end-to-end latency of 26.76 seconds between anomaly detection and alert issuance.
♻ ☆ RT-Surv: Improving Mortality Prediction After Radiotherapy with Large Language Model Structuring of Large-Scale Unstructured Electronic Health Records
Accurate patient selection is critical in radiotherapy (RT) to prevent ineffective treatments. Traditional survival prediction models, relying on structured data, often lack precision. This study explores the potential of large language models (LLMs) to structure unstructured electronic health record (EHR) data, thereby improving survival prediction accuracy through comprehensive clinical information integration. Data from 34,276 patients treated with RT at Yonsei Cancer Center between 2013 and 2023 were analyzed, encompassing both structured and unstructured data. An open-source LLM was used to structure the unstructured EHR data via single-shot learning, with its performance compared against a domain-specific medical LLM and a smaller variant. Survival prediction models were developed using statistical, machine learning, and deep learning approaches, incorporating both structured and LLM-structured data. Clinical experts evaluated the accuracy of the LLM-structured data. The open-source LLM achieved 87.5% accuracy in structuring unstructured EHR data without additional training, significantly outperforming the domain-specific medical LLM, which reached only 35.8% accuracy. Larger LLMs were more effective, particularly in extracting clinically relevant features like general condition and disease extent, which closely correlated with patient survival. Incorporating LLM-structured clinical features into survival prediction models significantly improved accuracy, with the C-index of deep learning models increasing from 0.737 to 0.820. These models also became more interpretable by emphasizing clinically significant factors. This study shows that general-domain LLMs, even without specific medical training, can effectively structure large-scale unstructured EHR data, substantially enhancing the accuracy and interpretability of clinical predictive models.
comment: 23 pages, 2 tables, 4 figures
♻ ☆ SALLM: Security Assessment of Generated Code
With the growing popularity of Large Language Models (LLMs) in software engineers' daily practices, it is important to ensure that the code generated by these tools is not only functionally correct but also free of vulnerabilities. Although LLMs can help developers to be more productive, prior empirical studies have shown that LLMs can generate insecure code. Two factors contribute to this insecure code generation. First, existing datasets used to evaluate LLMs do not adequately represent genuine software engineering tasks sensitive to security. Instead, they are often based on competitive programming challenges or classroom-type coding tasks. In real-world applications, the code produced is integrated into larger codebases, introducing potential security risks. Second, existing evaluation metrics primarily focus on the functional correctness of the generated code while ignoring security considerations. Therefore, in this paper, we describe SALLM, a framework to systematically benchmark LLMs' abilities to generate secure code. This framework has three major components: a novel dataset of security-centric Python prompts, configurable assessment techniques to evaluate the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.
comment: Accepted at the 6th International Workshop on Automated and verifiable Software sYstem DEvelopment (ASYDE) with ASE Conference 2024
♻ ☆ Depth-guided NeRF Training via Earth Mover's Distance ECCV 2024
Neural Radiance Fields (NeRFs) are trained to minimize the rendering loss of predicted viewpoints. However, the photometric loss often does not provide enough information to disambiguate between different possible geometries yielding the same image. Previous work has thus incorporated depth supervision during NeRF training, leveraging dense predictions from pre-trained depth networks as pseudo-ground truth. While these depth priors are assumed to be perfect once filtered for noise, in practice, their accuracy is more challenging to capture. This work proposes a novel approach to uncertainty in depth priors for NeRF supervision. Instead of using custom-trained depth or uncertainty priors, we use off-the-shelf pretrained diffusion models to predict depth and capture uncertainty during the denoising process. Because we know that depth priors are prone to errors, we propose to supervise the ray termination distance distribution with Earth Mover's Distance instead of enforcing the rendered depth to replicate the depth prior exactly through L2-loss. Our depth-guided NeRF outperforms all baselines on standard depth metrics by a large margin while maintaining performance on photometric measures.
comment: Accepted to ECCV 2024
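To make the Earth Mover's Distance supervision concrete, the sketch below computes a discretized 1-D Wasserstein-1 distance between the rendered ray-termination weights and a depth-prior distribution defined over the same ray samples. The binning and normalization choices are assumptions for illustration, not the paper's exact loss.

```python
# 1-D EMD (Wasserstein-1) between two per-ray termination distributions.
import torch

def ray_emd_loss(weights_pred, weights_prior, t_vals):
    """weights_*: (n_rays, n_samples) nonnegative, normalized inside;
    t_vals: (n_samples,) monotonically increasing sample depths along the ray."""
    p = weights_pred / (weights_pred.sum(dim=-1, keepdim=True) + 1e-10)
    q = weights_prior / (weights_prior.sum(dim=-1, keepdim=True) + 1e-10)
    cdf_diff = torch.cumsum(p - q, dim=-1)            # (n_rays, n_samples)
    bin_widths = t_vals[1:] - t_vals[:-1]              # (n_samples - 1,)
    # Wasserstein-1 in 1-D: integrate |CDF_p - CDF_q| over depth.
    return (cdf_diff[..., :-1].abs() * bin_widths).sum(dim=-1).mean()
```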
♻ ☆ OceanNet: A principled neural operator-based digital twin for regional oceans
While data-driven approaches demonstrate great potential in atmospheric modeling and weather forecasting, ocean modeling poses distinct challenges due to complex bathymetry, land, vertical structure, and flow non-linearity. This study introduces OceanNet, a principled neural operator-based digital twin for ocean circulation. OceanNet uses a Fourier neural operator and predictor-evaluate-corrector integration scheme to mitigate autoregressive error growth and enhance stability over extended time scales. A spectral regularizer counteracts spectral bias at smaller scales. OceanNet is applied to the northwest Atlantic Ocean western boundary current (the Gulf Stream), focusing on the task of seasonal prediction for Loop Current eddies and the Gulf Stream meander. Trained using historical sea surface height (SSH) data, OceanNet demonstrates competitive forecast skill by outperforming SSH predictions by an uncoupled, state-of-the-art dynamical ocean model forecast, reducing computation by 500,000 times. These accomplishments demonstrate the potential of physics-inspired deep neural operators as cost-effective alternatives to high-resolution numerical ocean models.
comment: Supplementary information can be found in: https://drive.google.com/file/d/1NoxJLa967naJT787a5-IfZ7f_MmRuZMP/view?usp=sharing
♻ ☆ Learning to Ask: When LLMs Meet Unclear Instruction
Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools to address a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate LLMs' tool-use performance under imperfect instructions, we meticulously examine real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that, due to the next-token prediction training objective, LLMs tend to arbitrarily generate missing arguments, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions of users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs' tool-use performance from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that AwN significantly outperforms existing frameworks for tool learning on NoisyToolBench. We will release all related code and datasets to support future research.
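An Ask-when-Needed loop can be approximated at the prompt level: before issuing a tool call, the assistant is instructed to either return complete arguments or ask a single clarifying question, never inventing missing values. The tool schema, JSON protocol, prompt wording, and model name in this sketch are all illustrative assumptions rather than the paper's implementation.

```python
# Prompt-level "ask instead of guess" sketch with the OpenAI Python SDK (>=1.0).
import json
from openai import OpenAI

TOOL_SPEC = {"name": "book_flight",
             "required": ["origin", "destination", "date"]}

SYSTEM = ("You prepare arguments for the tool below. Reply with JSON: either "
          '{"call": {...}} when every required argument is explicit in the '
          'user message, or {"ask": "<one clarifying question>"} otherwise. '
          "Never invent missing values.\n" + json.dumps(TOOL_SPEC))

def awn_step(user_message, model="gpt-4o-mini"):
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user_message}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# e.g. awn_step("Book me a flight to Tokyo") should come back with an "ask"
# about the origin and date rather than a fabricated tool call.
```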
♻ ☆ Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming
Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem producing a definition given a formal specification expressed as an F* type. We provide a program fragment checker that queries F* to check the correctness of candidate solutions. We also report on an extended version of our dataset containing a total of 940K lines of programs and proofs, with a total of 54K top-level F* definitions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding is that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compares favorably with that of large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.
comment: 47th International Conference on Software Engineering
♻ ☆ Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
The harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for large language models' fine-tuning-as-a-service. While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that \textit{harmful perturbation} over the model weights should be viewed as the root cause of the alignment break induced by harmful fine-tuning. To attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage's optimization. The regularizer ensures that the model's harmful loss reduction before/after a simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at \url{https://github.com/git-disl/Booster}.
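To make the alignment-stage regularizer described above concrete, here is a minimal PyTorch sketch under one plausible reading of the abstract: simulate a single normalized gradient step on a proxy harmful batch and penalize the resulting drop in harmful loss. The step size, normalization, and clamping are illustrative choices, not the authors' recipe; the linked repository holds the actual implementation.

```python
import torch

def booster_style_loss(model, align_batch, harmful_batch, loss_fn,
                       step_size=1e-3, lam=1.0):
    """Alignment loss plus an illustrative harmful-perturbation regularizer."""
    x_a, y_a = align_batch
    x_h, y_h = harmful_batch
    params = [p for p in model.parameters() if p.requires_grad]

    align_loss = loss_fn(model(x_a), y_a)      # standard alignment objective
    harm_loss = loss_fn(model(x_h), y_h)       # harmful loss at current weights

    grads = torch.autograd.grad(harm_loss, params, retain_graph=True)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12

    # Temporarily apply the simulated harmful step, evaluate, then undo it.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(-step_size * g / grad_norm)
    harm_loss_after = loss_fn(model(x_h), y_h)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(step_size * g / grad_norm)

    reduction = harm_loss - harm_loss_after    # simulated drop in harmful loss
    return align_loss + lam * torch.clamp(reduction, min=0.0)
```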
♻ ☆ Language-Guided World Models: A Model-Based Approach to AI Control ACL 2024
This paper introduces the concept of Language-Guided World Models (LWMs) -- probabilistic models that can simulate environments by reading texts. Agents equipped with these models provide humans with more extensive and efficient control, allowing them to simultaneously alter agent behaviors in multiple tasks via natural verbal communication. In this work, we take initial steps in developing robust LWMs that can generalize to compositionally novel language descriptions. We design a challenging world modeling benchmark based on the game of MESSENGER (Hanjie et al., 2021), featuring evaluation settings that require varying degrees of compositional generalization. Our experiments reveal the lack of generalizability of the state-of-the-art Transformer model, as it offers marginal improvements in simulation quality over a no-text baseline. We devise a more robust model by fusing the Transformer with the EMMA attention mechanism (Hanjie et al., 2021). Our model substantially outperforms the Transformer and approaches the performance of a model with an oracle semantic parsing and grounding capability. To demonstrate the practicality of this model in improving AI safety and transparency, we simulate a scenario in which the model enables an agent to present plans to a human before execution, and to revise plans based on their language feedback.
comment: SpLU-RoboNLP workshop at ACL 2024
♻ ☆ GenAI-powered Multi-Agent Paradigm for Smart Urban Mobility: Opportunities and Challenges for Integrating Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) with Intelligent Transportation Systems
Leveraging recent advances in generative AI, multi-agent systems are increasingly being developed to enhance the functionality and efficiency of smart city applications. This paper explores the transformative potential of large language models (LLMs) and emerging Retrieval-Augmented Generation (RAG) technologies in Intelligent Transportation Systems (ITS), paving the way for innovative solutions to address critical challenges in urban mobility. We begin by providing a comprehensive overview of the current state-of-the-art in mobility data, ITS, and Connected Vehicles (CV) applications. Building on this review, we discuss the rationale behind RAG and examine the opportunities for integrating these Generative AI (GenAI) technologies into the smart mobility sector. We propose a conceptual framework aimed at developing multi-agent systems capable of intelligently and conversationally delivering smart mobility services to urban commuters, transportation operators, and decision-makers. We seek to foster an autonomous and intelligent approach that (a) promotes science-based advisory to reduce traffic congestion, accidents, and carbon emissions at multiple scales, (b) facilitates public education and engagement in participatory mobility management, and (c) automates specialized transportation management tasks and the development of critical ITS platforms, such as data analytics and interpretation, knowledge representation, and traffic simulations. By integrating LLMs and RAG, our approach seeks to overcome the limitations of traditional rule-based multi-agent systems, which rely on fixed knowledge bases and limited reasoning capabilities. This integration paves the way for a more scalable, intuitive, and automated multi-agent paradigm, driving advancements in ITS and urban mobility.
Machine Learning 149
☆ Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling
Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling auto-regressive models (ARMs) on language modeling tasks. Recent efforts to simplify the masked diffusion framework have further led to alignment with continuous-space diffusion models and to more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling side is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs' original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20$\times$ speedup. In addition, our investigation challenges previous claims that MDMs can surpass ARMs in generative perplexity. We identify, for the first time, an underlying numerical issue, even with 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, leading to unfair assessments of MDMs' generation results in the previous literature.
comment: 40 pages
☆ Topological Methods in Machine Learning: A Tutorial for Practitioners
Topological Machine Learning (TML) is an emerging field that leverages techniques from algebraic topology to analyze complex data structures in ways that traditional machine learning methods may not capture. This tutorial provides a comprehensive introduction to two key TML techniques, persistent homology and the Mapper algorithm, with an emphasis on practical applications. Persistent homology captures multi-scale topological features such as clusters, loops, and voids, while the Mapper algorithm creates an interpretable graph summarizing high-dimensional data. To enhance accessibility, we adopt a data-centric approach, enabling readers to gain hands-on experience applying these techniques to relevant tasks. We provide step-by-step explanations, implementations, hands-on examples, and case studies to demonstrate how these tools can be applied to real-world problems. The goal is to equip researchers and practitioners with the knowledge and resources to incorporate TML into their work, revealing insights often hidden from conventional machine learning methods. The tutorial code is available at https://github.com/cakcora/TopologyForML
comment: 54 pages, 35 figures
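As a hands-on flavor of the Mapper algorithm covered in the tutorial above, the self-contained sketch below builds a Mapper graph from a 1-D lens with an overlapping interval cover and DBSCAN clustering per preimage. The cover resolution, overlap, and clustering parameters are illustrative assumptions; the tutorial's own code at the linked repository should be preferred.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def mapper_graph(X, lens, n_intervals=10, overlap=0.3, eps=0.5, min_samples=3):
    """Minimal Mapper sketch: 1-D lens, overlapping interval cover, DBSCAN per preimage."""
    lo, hi = lens.min(), lens.max()
    length = (hi - lo) / n_intervals
    step = length * (1 - overlap)
    node_members = []

    start = lo
    while start < hi:
        idx = np.where((lens >= start) & (lens <= start + length))[0]
        if len(idx) >= min_samples:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
            for lab in set(labels) - {-1}:          # ignore DBSCAN noise points
                node_members.append(set(idx[labels == lab]))
        start += step

    # Connect clusters that share data points (possible because the cover overlaps).
    n = len(node_members)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if node_members[i] & node_members[j]]
    return node_members, edges

# Toy usage: two noisy Gaussian blobs, lens = first coordinate.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
nodes, edges = mapper_graph(X, lens=X[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```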
☆ Regional data-driven weather modeling with a global stretched-grid
A data-driven model (DDM) suitable for regional weather forecasting applications is presented. The model extends the Artificial Intelligence Forecasting System by introducing a stretched-grid architecture that dedicates higher resolution over a regional area of interest and maintains a lower resolution elsewhere on the globe. The model is based on graph neural networks, which naturally affords arbitrary multi-resolution grid configurations. The model is applied to short-range weather prediction for the Nordics, producing forecasts at 2.5 km spatial and 6 h temporal resolution. The model is pre-trained on 43 years of global ERA5 data at 31 km resolution and is further refined using 3.3 years of 2.5 km resolution operational analyses from the MetCoOp Ensemble Prediction System (MEPS). The performance of the model is evaluated using surface observations from measurement stations across Norway and is compared to short-range weather forecasts from MEPS. The DDM outperforms both the control run and the ensemble mean of MEPS for 2 m temperature. The model also produces competitive precipitation and wind speed forecasts, but is shown to underestimate extreme events.
☆ Benchmarking Spurious Bias in Few-Shot Image Classifiers ECCV 2024
Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data but often show reliance on spurious correlations between classes and spurious attributes, known as spurious bias. Spurious correlations commonly hold in certain samples and few-shot classifiers can suffer from spurious bias induced from them. There is an absence of an automatic benchmarking system to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes so that using them for predictions can demonstrate poor performance. To construct these tasks, we propose attribute-based sample selection strategies based on a pre-trained vision-language model, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. Our code is available at https://github.com/gtzheng/FewSTAB.
comment: Accepted to ECCV 2024
☆ Configurable Foundation Models: Building LLMs from a Modular Perspective
Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the application and evolution of these models on devices with limited computation resources, and in scenarios requiring various abilities, increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with a subset of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitations of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns, with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.
☆ Hybrid Imitation-Learning Motion Planner for Urban Driving
With the release of open-source datasets such as nuPlan and Argoverse, research on learning-based planners has expanded considerably in recent years. Existing systems have shown excellent capabilities in imitating human driving behaviour, but they struggle to guarantee safe closed-loop driving. Conversely, optimization-based planners offer greater safety in short-term planning scenarios. To confront this challenge, in this paper we propose a novel hybrid motion planner that integrates both learning-based and optimization-based techniques. Initially, a multilayer perceptron (MLP) generates a human-like trajectory, which is then refined by an optimization-based component. This component not only minimizes tracking errors but also computes a trajectory that is both kinematically feasible and collision-free with respect to obstacles and road boundaries. Our model effectively balances safety and human-likeness, mitigating the trade-off inherent in these objectives. We validate our approach through simulation experiments and further demonstrate its efficacy by deploying it in real-world self-driving vehicles.
☆ Look Into the LITE in Deep Learning for Time Series Classification
Deep learning models have been shown to be a powerful solution for Time Series Classification (TSC). State-of-the-art architectures, while producing promising results on the UCR and UEA archives, present a high number of trainable parameters. This can lead to long training times with high CO2 emissions, power consumption, and a possible increase in the number of FLoating-point Operations Per Second (FLOPS). In this paper, we present a new architecture for TSC, the Light Inception with boosTing tEchnique (LITE), with only 2.34% of the number of parameters of the state-of-the-art InceptionTime model, while preserving performance. This architecture, with only 9,814 trainable parameters thanks to the usage of DepthWise Separable Convolutions (DWSC), is boosted by three techniques: multiplexing, custom filters, and dilated convolution. The LITE architecture, trained on the UCR, is 2.78 times faster than InceptionTime and consumes 2.79 times less CO2 and power. To evaluate the performance of the proposed architecture on multivariate time series data, we adapt LITE to handle multivariate time series; we call this version LITEMV. To bring theory into application, we also conducted experiments using LITEMV on multivariate time series representing human rehabilitation movements, showing that LITEMV is not only the most efficient model but also the best performing for this application on the Kimore dataset, a skeleton-based human rehabilitation exercise dataset. Moreover, to address the interpretability of LITEMV, we present a study using Class Activation Maps to understand the classification decisions taken by the model during evaluation.
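The parameter savings claimed above come largely from depthwise separable convolutions. The sketch below shows such a block for 1-D time series in PyTorch and compares its parameter count with a standard convolution; the channel sizes and kernel length are illustrative, not LITE's configuration.

```python
import torch.nn as nn

class DWSConv1d(nn.Module):
    """Depthwise separable 1-D convolution: a per-channel depthwise conv
    followed by a pointwise (1x1) conv that mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        pad = (kernel_size // 2) * dilation
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=pad, dilation=dilation, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):            # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

# Parameter count comparison with a standard convolution (illustrative sizes).
std = nn.Conv1d(32, 64, kernel_size=11, padding=5)
dws = DWSConv1d(32, 64, kernel_size=11)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), "vs", count(dws))  # the separable block is far smaller
```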
☆ Building a Scalable, Effective, and Steerable Search and Ranking Platform
Modern e-commerce platforms offer vast product selections, making it difficult for customers to find items that they like and that are relevant to their current session intent. This is why it is key for e-commerce platforms to have near real-time, scalable, and adaptable personalized ranking and search systems. While numerous methods exist in the scientific literature for building such systems, many are unsuitable for large-scale industrial use due to complexity and performance limitations. Consequently, industrial ranking systems often resort to computationally efficient yet simplistic retrieval or candidate generation approaches that overlook near real-time and heterogeneous customer signals, which results in a less personalized and relevant experience. Moreover, related customer experiences are often served by completely different systems, which increases complexity and maintenance costs and leads to inconsistent experiences. In this paper, we present a personalized, adaptable, near real-time ranking platform that is reusable across various use cases, such as browsing and search, and that is able to cater to millions of items and customers under heavy load (thousands of requests per second). We employ transformer-based models through different ranking layers, which can learn complex behavior patterns directly from customer action sequences while being able to incorporate temporal (e.g. in-session) and contextual information. We validate our system through a series of comprehensive offline and online real-world experiments at a large online e-commerce platform, and we demonstrate its superiority over existing systems, both in terms of customer experience and net revenue. Finally, we share the lessons learned from building a comprehensive, modern ranking platform for use in a large-scale e-commerce environment.
☆ Oops, I Sampled it Again: Reinterpreting Confidence Intervals in Few-Shot Learning
The predominant method for computing confidence intervals (CI) in few-shot learning (FSL) is based on sampling the tasks with replacement, i.e., allowing the same samples to appear in multiple tasks. This makes the CI misleading in that it takes into account the randomness of the sampler but not the data itself. To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. These reveal a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the (size of the) CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at https://github.com/RafLaf/FSL-benchmark-again
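The two sampling protocols contrasted above can be made concrete with a small NumPy sketch: with replacement, the number of tasks (and hence the CI width) can be driven down arbitrarily even though the underlying test data are fixed, whereas disjoint tasks are capped by the data. The pool size, task size, and accuracy level are synthetic assumptions, not the benchmark's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
pool = rng.binomial(1, 0.72, size=3000).astype(float)  # per-query correctness stand-in
task_size = 75                                         # e.g. 5-way, 15 queries per class

def ci_half_width(accs):
    return 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))

# Predominant protocol: tasks sampled WITH replacement from the same pool.
accs_with = np.array([pool[rng.choice(len(pool), task_size, replace=False)].mean()
                      for _ in range(600)])

# Alternative: shuffle once and partition into DISJOINT tasks (no sample reuse).
perm = rng.permutation(len(pool))
accs_disjoint = np.array([pool[perm[i:i + task_size]].mean()
                          for i in range(0, len(pool) - task_size + 1, task_size)])

print("with replacement: %.3f +/- %.3f over %d tasks"
      % (accs_with.mean(), ci_half_width(accs_with), len(accs_with)))
print("disjoint tasks:   %.3f +/- %.3f over %d tasks"
      % (accs_disjoint.mean(), ci_half_width(accs_disjoint), len(accs_disjoint)))
```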
☆ SNNAX -- Spiking Neural Networks in JAX
Spiking Neural Network (SNN) simulators are essential tools to prototype biologically inspired models and neuromorphic hardware architectures and to predict their performance. For such a tool, ease of use and flexibility are critical, but so is simulation speed, especially given the complexity inherent in simulating SNNs. Here, we present SNNAX, a JAX-based framework for simulating and training such models with PyTorch-like intuitiveness and JAX-like execution speed. SNNAX models are easily extended and customized to fit the desired model specifications and target neuromorphic hardware. Additionally, SNNAX offers key features for optimizing the training and deployment of SNNs, such as flexible automatic differentiation and just-in-time compilation. We evaluate and compare SNNAX to other commonly used machine learning (ML) frameworks used for programming SNNs. We provide key performance metrics, best practices, and documented examples for simulating SNNs in SNNAX, and implement several benchmarks used in the literature.
☆ Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models
This study performs an analysis of predictive statements, hope speech, and regret detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named "Prediction Statements," categorizing comments as Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering a nuanced interplay between these emotions and predictive behaviors. Despite encountering limitations related to data volume and resource availability, our study reports valuable findings concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.
☆ Obsidian: Cooperative State-Space Exploration for Performant Inference on Secure ML Accelerators
Trusted execution environments (TEEs) for machine learning accelerators are indispensable for secure and efficient ML inference. Optimizing workloads through state-space exploration for the accelerator architectures improves performance and reduces energy consumption. However, such explorations are expensive and slow due to the large search space. Current research has to use fast analytical models that forego critical hardware details and cross-layer opportunities unique to the hardware security primitives. While cycle-accurate models can theoretically reach better designs, their high runtime cost restricts them to a smaller state space. We present Obsidian, an optimization framework for finding the optimal mapping from ML kernels to a secure ML accelerator. Obsidian addresses the above challenge by exploring the state space using analytical and cycle-accurate models cooperatively. The two main exploration components are: (1) a secure accelerator analytical model that includes the effect of secure hardware while traversing the large mapping state space and produces the best m model mappings; (2) a compiler profiling step on a cycle-accurate model that captures runtime bottlenecks to further improve execution runtime, energy, and resource utilization, and finds the optimal model mapping. We compare our results to a baseline secure accelerator comprising the state-of-the-art security schemes obtained from GuardNN [33] and Sesame [11]. The analytical model reduces the inference latency by 20.5% for a cloud and 8.4% for an edge deployment, with energy improvements of 24% and 19%, respectively. The cycle-accurate model further reduces the latency by 9.1% for a cloud and 12.2% for an edge deployment, with energy improvements of 13.8% and 13.1%.
☆ Boosting Certificate Robustness for Time Series Classification with Efficient Self-Ensemble
Recently, the issue of adversarial robustness in the time series domain has garnered significant attention. However, the available defense mechanisms remain limited, with adversarial training being the predominant approach, though it does not provide theoretical guarantees. Randomized Smoothing has emerged as a standout method due to its ability to certify a provable lower bound on the robustness radius under $\ell_p$-ball attacks. Recognizing its success, research in the time series domain has started focusing on these aspects. However, existing research predominantly focuses on time series forecasting, or on non-$\ell_p$ robustness through statistical feature augmentation for time series classification (TSC). Our review found that Randomized Smoothing performs modestly in TSC, struggling to provide effective assurances on datasets with poor robustness. Therefore, we propose a self-ensemble method to enhance the lower bound of the probability confidence of predicted labels by reducing the variance of classification margins, thereby certifying a larger radius. This approach also addresses the computational overhead of Deep Ensembles (DE) while remaining competitive and, in some cases, outperforming them in terms of robustness. Both theoretical analysis and experimental results validate the effectiveness of our method, demonstrating superior performance in robustness testing compared to baseline approaches.
comment: 6 figures, 4 tables, 10 pages
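For context, the sketch below shows the standard randomized-smoothing certification step that the self-ensemble above builds on: a majority vote over Gaussian-noised inputs and a certified radius from a lower confidence bound on the top-class probability. The self-ensemble variance reduction itself is not reproduced, and the noise level, sample count, and confidence level are illustrative.

```python
import torch
from scipy.stats import beta, norm

def certify(model, x, sigma=0.25, n=1000, alpha=0.001, num_classes=10):
    """Randomized-smoothing certificate in the style of Cohen et al. (2019).

    Add Gaussian noise to the input n times, take the majority class, and
    certify an l2 radius R = sigma * Phi^{-1}(p_lower), where p_lower is a
    Clopper-Pearson lower confidence bound on the top-class probability.
    Abstention is omitted for brevity; the model must accept a batch of inputs.
    """
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)
        votes = model(noisy).argmax(dim=1)
    counts = torch.bincount(votes, minlength=num_classes)
    top = int(counts.argmax())
    k = int(counts[top])
    p_lower = beta.ppf(alpha, k, n - k + 1)    # one-sided lower bound on p_top
    if p_lower <= 0.5:
        return top, 0.0                        # cannot certify a positive radius
    return top, sigma * norm.ppf(p_lower)      # certified l2 radius
```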
☆ UnLearning from Experience to Avoid Spurious Correlations
While deep neural networks can achieve state-of-the-art performance in many tasks, these models are more fragile than they appear. They are prone to learning spurious correlations in their training data, leading to surprising failure cases. In this paper, we propose a new approach that addresses the issue of spurious correlations: UnLearning from Experience (ULE). Our method is based on using two classification models trained in parallel: student and teacher models. Both models receive the same batches of training data. The student model is trained with no constraints and pursues the spurious correlations in the data. The teacher model is trained to solve the same classification problem while avoiding the mistakes of the student model. As training is done in parallel, the better the student model learns the spurious correlations, the more robust the teacher model becomes. The teacher model uses the gradient of the student's output with respect to its input to unlearn mistakes made by the student. We show that our method is effective on the Waterbirds, CelebA, Spawrious and UrbanCars datasets.
comment: 10 pages
☆ Regularized Multi-output Gaussian Convolution Process with Domain Adaptation
The multi-output Gaussian process (MGP) has been attracting increasing attention as a transfer learning method to model multiple outputs. Despite its high flexibility and generality, MGP still faces two critical challenges when applied to transfer learning. The first one is negative transfer, which occurs when there exists no shared information among the outputs. The second challenge is input domain inconsistency, which is commonly studied in transfer learning yet not explored in MGP. In this paper, we propose a regularized MGP modeling framework with domain adaptation to overcome these challenges. More specifically, a sparse covariance matrix of MGP is proposed using a convolution process, where penalization terms are added to adaptively select the most informative outputs for knowledge transfer. To deal with the domain inconsistency, a domain adaptation method is proposed that marginalizes inconsistent features and expands missing features to align the input domains among different outputs. Statistical properties of the proposed method are provided to guarantee its performance both practically and asymptotically. The proposed framework outperforms state-of-the-art benchmarks in comprehensive simulation studies and in a real case study of a ceramic manufacturing process. The results demonstrate the effectiveness of our method in dealing with both negative transfer and domain inconsistency.
☆ Unifying Causal Representation Learning with the Invariance Principle
Causal representation learning aims at recovering latent causal variables from high-dimensional observations to solve causal downstream tasks, such as predicting the effect of new interventions or more robust classification. A plethora of methods have been developed, each tackling carefully crafted problem settings that lead to different types of identifiability. The folklore is that these different settings are important, as they are often linked to different rungs of Pearl's causal hierarchy, although not all neatly fit. Our main contribution is to show that many existing causal representation learning approaches methodologically align the representation to known data symmetries. Identification of the variables is guided by equivalence classes across different data pockets that are not necessarily causal. This result suggests important implications, allowing us to unify many existing approaches in a single method that can mix and match different assumptions, including non-causal ones, based on the invariances relevant to our application. It also significantly benefits applicability, which we demonstrate by improving treatment effect estimation on real-world high-dimensional ecological data. Overall, this paper clarifies the role of causality assumptions in the discovery of causal variables and shifts the focus to preserving data symmetries.
comment: 36 pages
☆ Tractable Offline Learning of Regular Decision Processes
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs). In RDPs, the unknown dependency of future observations and rewards on the past interactions can be captured by some hidden finite-state automaton. For this reason, many RDP algorithms first reconstruct this unknown dependency using automata learning techniques. In this paper, we show that it is possible to overcome two strong limitations of previous offline RL algorithms for RDPs, notably RegORL. This can be accomplished via the introduction of two original techniques: the development of a new pseudometric based on formal languages, which removes a problematic dependency on $L_\infty^\mathsf{p}$-distinguishability parameters, and the adoption of Count-Min-Sketch (CMS) instead of naive counting. The former reduces the number of samples required in environments that are characterized by a low complexity in language-theoretic terms. The latter alleviates the memory requirements for long planning horizons. We derive the PAC sample complexity bounds associated with each of these techniques, and we validate the approach experimentally.
comment: To appear in EWRL 2024
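The second ingredient mentioned above, the Count-Min Sketch, is a standard streaming data structure; a minimal version for counting observation-action histories is sketched below. The width, depth, and hashing scheme are illustrative choices, not those of the paper.

```python
import numpy as np

class CountMinSketch:
    """Minimal Count-Min Sketch: approximate counts in sub-linear memory.
    Queries never undercount; overcounts are bounded with high probability
    by the width/depth choice."""
    def __init__(self, width=2048, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.salts = rng.integers(1, 2**31 - 1, size=depth)

    def _cols(self, item):
        return [hash((int(s), item)) % self.width for s in self.salts]

    def update(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row, col] += count

    def query(self, item):
        return min(self.table[row, col] for row, col in enumerate(self._cols(item)))

# Toy usage: counting observation-action histories without storing them all.
cms = CountMinSketch()
for h in ["o1a0o2", "o1a0o2", "o3a1o2"]:
    cms.update(h)
print(cms.query("o1a0o2"), cms.query("o3a1o2"), cms.query("unseen"))
```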
☆ Convolutional Neural Networks for Automated Cellular Automaton Classification
The emergent dynamics in spacetime diagrams of cellular automata (CAs) are often organised by means of a number of behavioural classes. Whilst classification of elementary CAs is feasible and well-studied, non-elementary CAs are generally too diverse and numerous to exhaustively classify manually. In this chapter we treat the spacetime diagram as a digital image, and implement simple computer vision techniques to perform an automated classification of elementary cellular automata into the five Li-Packard classes. In particular, we present a supervised learning task to a convolutional neural network, in such a way that it may be generalised to non-elementary CAs. To do so, we must divert the algorithm's focus away from the underlying 'microscopic' local updates. We first show that previously developed deep learning approaches have in fact been trained to identify the local update rule, rather than directly focusing on the mesoscopic patterns that are associated with the particular behavioural classes. By means of a well-argued neural network design, as well as a number of data augmentation techniques, we then present a convolutional neural network that performs nearly perfectly at identifying the behavioural class, without necessarily first identifying the underlying microscopic dynamics.
comment: 19 pages, 12 figures, book chapter
☆ Complete and Efficient Covariants for 3D Point Configurations with Application to Learning Molecular Quantum Properties
When modeling physical properties of molecules with machine learning, it is desirable to incorporate $SO(3)$-covariance. While such models based on low body order features are not complete, we formulate and prove general completeness properties for higher order methods, and show that $6k-5$ of these features are enough for up to $k$ atoms. We also find that the Clebsch--Gordan operations commonly used in these methods can be replaced by matrix multiplications without sacrificing completeness, lowering the scaling from $O(l^6)$ to $O(l^3)$ in the degree of the features. We apply this to quantum chemistry, but the proposed methods are generally applicable for problems involving 3D point configurations.
☆ Task-Oriented Communication for Graph Data: A Graph Information Bottleneck Approach
Graph data, essential in fields like knowledge representation and social networks, often involves large networks with many nodes and edges. Transmitting these graphs can be highly inefficient due to their size and redundancy for specific tasks. This paper introduces a method to extract a smaller, task-focused subgraph that maintains key information while reducing communication overhead. Our approach utilizes graph neural networks (GNNs) and the graph information bottleneck (GIB) principle to create a compact, informative, and robust graph representation suitable for transmission. The challenge lies in the irregular structure of graph data, making GIB optimization complex. We address this by deriving a tractable variational upper bound for the objective function. Additionally, we propose the VQ-GIB mechanism, integrating vector quantization (VQ) to convert subgraph representations into a discrete codebook sequence, compatible with existing digital communication systems. Our experiments show that this GIB-based method significantly lowers communication costs while preserving essential task-related information. The approach demonstrates robust performance across various communication channels, suitable for both continuous and discrete systems.
☆ A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations
Machine translation for low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the case of the English-Marathi language pair, where existing datasets are notably noisy, impeding the performance of machine translation models. To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations. Our methodology leverages a multilingual SBERT model to filter out problematic translations in the training data. Specifically, we employ an IndicSBERT similarity model to assess the semantic equivalence between original and translated sentences, allowing us to retain linguistically correct translations while discarding instances with substantial deviations. The results demonstrate a significant improvement in translation quality over the baseline after filtering with IndicSBERT. This illustrates how cross-lingual sentence representations can reduce errors in machine translation scenarios with limited resources. By integrating multilingual sentence-BERT models into the translation pipeline, this research contributes to advancing machine translation techniques in low-resource environments. The proposed method not only addresses the challenges of the English-Marathi language pair but also provides a valuable framework for enhancing translation quality in other low-resource language translation tasks.
comment: Accepted at I2CT 2024
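A hedged sketch of the similarity-based filtering step described above is shown below using the sentence-transformers package: embed source and target sentences, score each pair by cosine similarity, and keep pairs above a threshold. The model name, threshold, and example sentences are illustrative assumptions; the paper uses an IndicSBERT model.

```python
from sentence_transformers import SentenceTransformer, util

# Model name and threshold are illustrative stand-ins for the paper's IndicSBERT setup.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def filter_parallel_corpus(src_sents, tgt_sents, threshold=0.7):
    """Keep sentence pairs whose cross-lingual embeddings are similar enough."""
    src_emb = model.encode(src_sents, convert_to_tensor=True, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(src_emb, tgt_emb).diagonal()   # pairwise, aligned by index
    return [(s, t, float(sim)) for s, t, sim in zip(src_sents, tgt_sents, sims)
            if float(sim) >= threshold]

kept = filter_parallel_corpus(
    ["The cat sat on the mat.", "This sentence is unrelated to its pair."],
    ["मांजर चटईवर बसली.", "आकाश निळे आहे."])
print(len(kept), "pair(s) retained")
```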
☆ Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit
Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets. One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects.
☆ Decision Transformer for Enhancing Neural Local Search on the Job Shop Scheduling Problem
The job shop scheduling problem (JSSP) and its solution algorithms have been of enduring interest in both academia and industry for decades. In recent years, machine learning (ML) has been playing an increasingly important role in advancing existing heuristic solutions and building new ones for the JSSP, aiming to find better solutions in shorter computation times. In this paper we build on top of a state-of-the-art deep reinforcement learning (DRL) agent, called Neural Local Search (NLS), which can efficiently and effectively control a large local neighborhood search on the JSSP. In particular, we develop a method for training the decision transformer (DT) algorithm on search trajectories taken by a trained NLS agent to further improve upon the learned decision-making sequences. Our experiments show that the DT successfully learns local search strategies that are different from and, in many cases, more effective than those of the NLS agent itself. In terms of the trade-off between solution quality and the computational time needed for the search, the DT is particularly superior in application scenarios where longer computational times are acceptable. In this case, it makes up for the longer inference times required per search step, which are caused by the larger neural network architecture, through better-quality decisions per step. Thereby, the DT achieves state-of-the-art results for solving the JSSP with ML-enhanced search.
comment: currently under review for IEEE Transactions on Cybernetics
☆ Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs
Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions, but recent studies reveal that these models often fail to achieve satisfactory results on questions involving reasoning, such as mathematics or physics questions. This phenomenon is usually attributed to the uncertainty regarding whether these models could genuinely comprehend the knowledge embedded in the text or merely learn to replicate the token distribution without a true understanding of the content. In this paper, we delve into this problem and aim to enhance the reasoning capabilities of LLMs. First, we investigate if the model has genuine reasoning capabilities by visualizing the text generation process at the attention and representation level. Then, we formulate the reasoning process of LLMs into a causal framework, which provides a formal explanation of the problems we observe in the visualization. Finally, building upon this causal framework, we propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities by encouraging the model to extract the general problem-solving skills and apply these skills to different questions. Experiments show that our method outperforms the baseline consistently across multiple benchmarks, and with only 1.2M tunable parameters, we achieve better or comparable results to other fine-tuning methods. This demonstrates the effectiveness and efficiency of our method in improving the overall accuracy and reliability of LLMs.
☆ Neural timescales from a computational perspective
Timescales of neural activity are diverse across and within brain areas, and experimental observations suggest that neural timescales reflect information in dynamic environments. However, these observations do not specify how neural timescales are shaped, nor whether particular timescales are necessary for neural computations and brain function. Here, we take a complementary perspective and synthesize three directions where computational methods can distill the broad set of empirical observations into quantitative and testable theories: We review (i) how data analysis methods allow us to capture different timescales of neural dynamics across different recording modalities, (ii) how computational models provide a mechanistic explanation for the emergence of diverse timescales, and (iii) how task-optimized models in machine learning uncover the functional relevance of neural timescales. This integrative computational approach, combined with empirical findings, would provide a more holistic understanding of how neural timescales capture the relationship between brain structure, dynamics, and behavior.
comment: 18 pages, 4 figures, 2 boxes
☆ Neural Networks with LSTM and GRU in Modeling Active Fires in the Amazon
This study presents a comprehensive methodology for modeling and forecasting the historical time series of fire spots detected by the AQUA_M-T satellite in the Amazon, Brazil. The approach utilizes a mixed Recurrent Neural Network (RNN) model, combining Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures to predict monthly accumulations of daily detected fire spots. A summary of the data revealed a consistent seasonality over time, with annual maximum and minimum fire spot values tending to repeat at the same periods each year. The primary objective is to verify whether the forecasts capture this inherent seasonality through rigorous statistical analysis. The methodology involved careful data preparation, model configuration, and training using cross-validation with two seeds, ensuring that the data generalizes well to the test and validation sets, and confirming the convergence of the model parameters. The results indicate that the mixed LSTM and GRU model offers improved accuracy in forecasting 12 months ahead, demonstrating its effectiveness in capturing complex temporal patterns and modeling the observed time series. This research significantly contributes to the application of deep learning techniques in environmental monitoring, specifically in fire spot forecasting. In addition to improving forecast accuracy, the proposed approach highlights the potential for adaptation to other time series forecasting challenges, opening new avenues for research and development in machine learning and natural phenomenon prediction. Keywords: Time Series Forecasting, Recurrent Neural Networks, Deep Learning.
comment: 16 pages, in Portuguese language, 24 figures
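A minimal Keras sketch of a mixed LSTM/GRU forecaster of the kind described above is given below, trained on a toy seasonal series standing in for monthly fire-spot counts. Layer sizes, window length, and training settings are illustrative assumptions rather than the study's configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window = 12  # use the previous 12 monthly counts to predict the next month

model = keras.Sequential([
    keras.Input(shape=(window, 1)),
    layers.LSTM(64, return_sequences=True),
    layers.GRU(32),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Toy seasonal series standing in for monthly fire-spot counts.
t = np.arange(240, dtype=float)
series = 100 + 40 * np.sin(2 * np.pi * t / 12) + np.random.default_rng(0).normal(0, 5, t.size)

# Sliding windows: each sample is 12 months of history, the target is the next month.
X = np.stack([series[i:i + window] for i in range(len(series) - window)])[..., None]
y = series[window:]
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.predict(X[-1:], verbose=0).ravel())
```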
☆ Independence Constrained Disentangled Representation Learning from Epistemological Perspective
Disentangled Representation Learning aims to improve the explainability of deep learning methods by training a data encoder that identifies semantically meaningful latent variables in the data generation process. Nevertheless, there is no consensus regarding a universally accepted definition of the objective of disentangled representation learning. In particular, there is a considerable amount of discourse regarding whether the latent variables should be mutually independent or not. In this paper, we first investigate these arguments on the interrelationships between latent variables by establishing a conceptual bridge between Epistemology and Disentangled Representation Learning. Then, inspired by these interdisciplinary concepts, we introduce a two-level latent space framework to provide a general solution to the prior arguments on this issue. Finally, we propose a novel method for disentangled representation learning by employing an integration of a mutual information constraint and an independence constraint within the Generative Adversarial Network (GAN) framework. Experimental results demonstrate that our proposed method consistently outperforms baseline approaches in both quantitative and qualitative evaluations. The method exhibits strong performance across multiple commonly used metrics and demonstrates a great capability in disentangling various semantic factors, leading to an improved quality of controllable generation, which consequently benefits the explainability of the algorithm.
☆ Causality-Aware Transformer Networks for Robotic Navigation
Recent advances in machine learning algorithms have garnered growing interest in developing versatile Embodied AI systems. However, current research in this domain reveals opportunities for improvement. First, the direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling, potentially limiting its performance in Embodied AI tasks. Second, the reliance on task-specific configurations, such as pre-trained modules and dataset-specific logic, compromises the generalizability of these methods. We address these constraints by initially exploring the unique differences between Embodied AI tasks and other sequential data tasks through the lens of Causality, presenting a causal framework to elucidate the inadequacies of conventional sequential methods for Embodied AI. By leveraging this causal perspective, we propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module to enhance the model's Environmental Understanding capability. Meanwhile, our method is devoid of task-specific inductive biases and can be trained in an end-to-end manner, which enhances the method's generalizability across various contexts. Empirical evaluations demonstrate that our methodology consistently surpasses benchmark performances across a spectrum of settings, tasks and simulation environments. Extensive ablation studies reveal that the performance gains can be attributed to the Causal Understanding Module, which demonstrates effectiveness and efficiency in both Reinforcement Learning and Supervised Learning settings.
☆ Introduction to Machine Learning
This book introduces the mathematical foundations and techniques that lead to the development and analysis of many of the algorithms that are used in machine learning. It starts with an introductory chapter that describes the notation used throughout the book and serves as a reminder of basic concepts in calculus, linear algebra and probability, and also introduces some measure-theoretic terminology, which can be used as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. The latter chapter provides theoretical support to many algorithms that are used in the book, including stochastic gradient descent, proximal methods, etc. After discussing basic concepts for statistical prediction, the book includes an introduction to reproducing kernel theory and Hilbert space techniques, which are used in many places, before addressing the description of various algorithms for supervised statistical learning, including linear methods, support vector machines, decision trees, boosting, and neural networks. The subject then switches to generative methods, starting with a chapter that presents sampling methods and an introduction to the theory of Markov chains. The following chapter describes the theory of graphical models, an introduction to variational methods for models with latent variables, and deep-learning-based generative models. The next chapters focus on unsupervised learning methods, for clustering, factor analysis and manifold learning. The final chapter of the book is theory-oriented and discusses concentration inequalities and generalization bounds.
comment: textbook
☆ Learning-Based Error Detection System for Advanced Vehicle Instrument Cluster Rendering
The automotive industry is currently expanding digital display options with every new model that comes onto the market. This entails not just an expansion in dimensions, resolution, and customization choices, but also the capability to employ novel display effects like overlays while assembling the content of the display cluster. Unfortunately, this raises the need for appropriate monitoring systems that can detect rendering errors and apply appropriate countermeasures when required. Classical solutions such as Cyclic Redundancy Checks (CRC) will soon no longer be viable, as any sort of alpha blending, warping, or scaling of content can cause unwanted CRC violations. Therefore, we propose a novel monitoring approach to verify the correctness of displayed content, using telltales (e.g. warning signs) as an example. It uses a learning-based approach to separate "good" telltales, i.e. those that a human driver will understand correctly, from "corrupted" telltales, i.e. those that will not be visible or perceived correctly. As a result, it possesses inherent resilience against individual pixel errors and implicitly supports changing backgrounds, overlay or scaling effects. This is underlined by our experimental study, in which all "corrupted" test patterns were correctly classified, while no false alarms were triggered.
comment: 9 pages
☆ Conformal Prediction in Dynamic Biological Systems
Uncertainty quantification (UQ) is the process of systematically determining and characterizing the degree of confidence in computational model predictions. In the context of systems biology, especially with dynamic models, UQ is crucial because it addresses the challenges posed by nonlinearity and parameter sensitivity, allowing us to properly understand and extrapolate the behavior of complex biological systems. Here, we focus on dynamic models represented by deterministic nonlinear ordinary differential equations. Many current UQ approaches in this field rely on Bayesian statistical methods. While powerful, these methods often require strong prior specifications and make parametric assumptions that may not always hold in biological systems. Additionally, these methods face challenges in domains where sample sizes are limited and statistical inference becomes constrained, with computational speed being a bottleneck in large models of biological systems. As an alternative, we propose the use of conformal inference methods, introducing two novel algorithms that, in some instances, offer non-asymptotic guarantees, enhancing robustness and scalability across various applications. We demonstrate the efficacy of our proposed algorithms through several scenarios, highlighting their advantages over traditional Bayesian approaches. The proposed methods show promising results for diverse biological data structures and scenarios, offering a general framework to quantify uncertainty for dynamic models of biological systems. The software for the methodology and the reproduction of the results is available at https://zenodo.org/doi/10.5281/zenodo.13644870.
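For readers unfamiliar with conformal inference, the sketch below shows plain split conformal prediction, the building block such algorithms refine: calibration residuals yield a distribution-free interval without parametric assumptions on the error distribution. The surrogate model and noise level are toy assumptions; the paper's two algorithms are not reproduced here.

```python
import numpy as np

def split_conformal_interval(predict, x_cal, y_cal, x_new, alpha=0.1):
    """Split conformal prediction: absolute residuals on a calibration set give a
    distribution-free (1 - alpha) interval for new predictions."""
    residuals = np.abs(y_cal - predict(x_cal))
    n = len(residuals)
    # Finite-sample-corrected quantile of the calibration residuals.
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    preds = predict(x_new)
    return preds - q, preds + q

# Toy usage: a crude surrogate standing in for a simulated dynamic-model output.
rng = np.random.default_rng(1)
predict = lambda t: 2.0 * np.exp(-0.3 * t)                 # stand-in "model"
t_cal = rng.uniform(0, 10, 200)
y_cal = predict(t_cal) + rng.normal(0, 0.05, t_cal.size)   # noisy observations
lo, hi = split_conformal_interval(predict, t_cal, y_cal, np.array([1.0, 5.0]))
print(np.c_[lo, hi])
```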
☆ AdvSecureNet: A Python Toolkit for Adversarial Machine Learning
Machine learning models are vulnerable to adversarial attacks. Several tools have been developed to research these vulnerabilities, but they often lack comprehensive features and flexibility. We introduce AdvSecureNet, a PyTorch-based toolkit for adversarial machine learning that is the first to natively support multi-GPU setups for attacks, defenses, and evaluation. It is the first toolkit that supports both CLI and API interfaces and external YAML configuration files to enhance versatility and reproducibility. The toolkit includes multiple attacks, defenses and evaluation metrics. Rigorous software engineering practices are followed to ensure high code quality and maintainability. The project is available as an open-source project on GitHub at https://github.com/melihcatal/advsecurenet and installable via PyPI.
☆ (Implicit) Ensembles of Ensembles: Epistemic Uncertainty Collapse in Large Models
Epistemic uncertainty is crucial for safety-critical applications and out-of-distribution detection tasks. Yet, we uncover a paradoxical phenomenon in deep learning models: an epistemic uncertainty collapse as model complexity increases, challenging the assumption that larger models invariably offer better uncertainty quantification. We propose that this stems from implicit ensembling within large models. To support this hypothesis, we demonstrate epistemic uncertainty collapse empirically across various architectures, from explicit ensembles of ensembles and simple MLPs to state-of-the-art vision models, including ResNets and Vision Transformers -- for the latter, we examine implicit ensemble extraction and decompose larger models into diverse sub-models, recovering epistemic uncertainty. We provide theoretical justification for these phenomena and explore their implications for uncertainty estimation.
comment: 10 pages
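The notion of epistemic uncertainty used above is commonly operationalized for an explicit ensemble as the mutual information between the prediction and the ensemble member, i.e. predictive entropy minus mean member entropy. The sketch below computes this decomposition for toy softmax outputs; the paper's implicit-ensemble extraction is not reproduced.

```python
import numpy as np

def ensemble_uncertainty(member_probs):
    """member_probs: array of shape (n_members, n_classes) of softmax outputs.

    Returns (total, aleatoric, epistemic), where epistemic uncertainty is the
    mutual information between prediction and ensemble member:
    H[mean_m p_m] - mean_m H[p_m]."""
    eps = 1e-12
    mean_p = member_probs.mean(axis=0)
    total = -(mean_p * np.log(mean_p + eps)).sum()
    aleatoric = -(member_probs * np.log(member_probs + eps)).sum(axis=1).mean()
    return total, aleatoric, total - aleatoric

# Agreeing members -> low epistemic uncertainty; disagreeing members -> high.
agree = np.array([[0.9, 0.05, 0.05]] * 5)
disagree = np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9],
                     [0.34, 0.33, 0.33], [0.1, 0.8, 0.1]])
print(ensemble_uncertainty(agree)[2], ensemble_uncertainty(disagree)[2])
```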
☆ Hypothesizing Missing Causal Variables with LLMs
Scientific discovery is a catalyst for human intellectual advances, driven by the cycle of hypothesis generation, experimental design, data evaluation, and iterative assumption refinement. This process, while crucial, is expensive and heavily dependent on the domain knowledge of scientists to generate hypotheses and navigate the scientific cycle. Central to this is causality, the ability to establish the relationship between the cause and the effect. Motivated by the scientific discovery process, in this work, we formulate a novel task where the input is a partial causal graph with missing variables, and the output is a hypothesis about the missing variables to complete the partial graph. We design a benchmark with varying difficulty levels and knowledge assumptions about the causal graph. With the growing interest in using Large Language Models (LLMs) to assist in scientific discovery, we benchmark open-source and closed models on our testbed. We show the strong ability of LLMs to hypothesize the mediation variables between a cause and its effect. In contrast, they underperform in hypothesizing the cause and effect variables themselves. We also observe surprising results where some of the open-source models outperform the closed GPT-4 model.
comment: Code - https://github.com/ivaxi0s/hypothesizing-causal-variable-llm
☆ A Fashion Item Recommendation Model in Hyperbolic Space CVPR 2024
In this work, we propose a fashion item recommendation model that incorporates hyperbolic geometry into user and item representations. Using hyperbolic space, our model aims to capture implicit hierarchies among items based on their visual data and users' purchase history. During training, we apply a multi-task learning framework that considers both hyperbolic and Euclidean distances in the loss function. Our experiments on three data sets show that our model performs better than previous models trained in Euclidean space only, confirming the effectiveness of our model. Our ablation studies show that multi-task learning plays a key role, and removing the Euclidean loss substantially deteriorates the model performance.
comment: This work was presented at the CVFAD Workshop at CVPR 2024
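The sketch below implements the Poincare-ball geodesic distance that is typically used when embedding hierarchies in hyperbolic space, together with a toy example showing how points near the boundary are far apart despite small Euclidean gaps. Whether the paper uses exactly this model of hyperbolic space, and with what curvature, is an assumption here.

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance in the Poincare ball (curvature -1):
    d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    sq_u = torch.clamp((u * u).sum(-1), max=1 - eps)
    sq_v = torch.clamp((v * v).sum(-1), max=1 - eps)
    sq_diff = ((u - v) ** 2).sum(-1)
    return torch.acosh(1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v)))

# Points near the boundary are "far" even when Euclidean distances are moderate,
# which is what lets hyperbolic space encode deep hierarchies compactly.
a = torch.tensor([0.0, 0.0])
b = torch.tensor([0.5, 0.0])
c = torch.tensor([0.95, 0.0])
d = torch.tensor([0.0, 0.95])
print(poincare_distance(a, b), poincare_distance(c, d))
```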
☆ An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due to the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input sequences ranging from 20 to 80 seconds.
comment: Accepted at the IEEE Spoken Language Technology Workshop 2024
☆ Multiview Random Vector Functional Link Network for Predicting DNA-Binding Proteins
The identification of DNA-binding proteins (DBPs) is a critical task due to their significant impact on various biological activities. Understanding the mechanisms underlying protein-DNA interactions is essential for elucidating various life activities. In recent years, machine learning-based models have been prominently utilized for DBP prediction. In this paper, to predict DBPs, we propose a novel framework termed a multiview random vector functional link (MvRVFL) network, which fuses neural network architecture with multiview learning. The proposed MvRVFL model combines the benefits of late and early fusion, allowing for distinct regularization parameters across different views while leveraging a closed-form solution to determine unknown parameters efficiently. The primal objective function incorporates a coupling term aimed at minimizing a composite of errors stemming from all views. From each of the three protein views of the DBP datasets, we extract five features. These features are then fused together by incorporating a hidden feature during the model training process. The performance of the proposed MvRVFL model on the DBP dataset surpasses that of baseline models, demonstrating its superior effectiveness. Furthermore, we extend our assessment to the UCI, KEEL, AwA, and Corel5k datasets, to establish the practicality of the proposed models. The consistency error bound, the generalization error bound, and empirical findings, coupled with rigorous statistical analyses, confirm the superior generalization capabilities of the MvRVFL model compared to the baseline models.
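As background for the closed-form solution mentioned above, here is a single-view RVFL sketch: a random, untrained hidden layer is concatenated with the raw inputs (the direct link), and only the output weights are learned by ridge regression. The multiview coupling term of MvRVFL is not reproduced, and the hidden width and regularization strength are illustrative.

```python
import numpy as np

class RVFL:
    """Single-view random vector functional link network."""
    def __init__(self, n_hidden=200, reg=1e-2, seed=0):
        self.n_hidden, self.reg, self.seed = n_hidden, reg, seed

    def _features(self, X):
        H = np.tanh(X @ self.W + self.b)   # random, untrained hidden features
        return np.hstack([X, H])           # direct link + random features

    def fit(self, X, Y):
        rng = np.random.default_rng(self.seed)
        self.W = rng.normal(0, 1, (X.shape[1], self.n_hidden))
        self.b = rng.normal(0, 1, self.n_hidden)
        D = self._features(X)
        # Closed-form ridge solution for the output weights.
        self.beta = np.linalg.solve(D.T @ D + self.reg * np.eye(D.shape[1]), D.T @ Y)
        return self

    def predict(self, X):
        return self._features(X) @ self.beta

# Toy binary problem with one-hot targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
clf = RVFL().fit(X, np.eye(2)[y])
print("train accuracy:", (clf.predict(X).argmax(1) == y).mean())
```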
☆ BMI Prediction from Handwritten English Characters Using a Convolutional Neural Network
A person's Body Mass Index, or BMI, is the most widely used parameter for assessing their health. Because BMI is correlated with body fat, it is a crucial predictor of diseases that may arise at higher body fat levels. Conversely, a community's or an individual's nutritional status can be determined using the BMI. Although deep learning models are used in several studies to estimate BMI from face photos and other data, no previous research has established a clear connection between deep learning techniques for handwriting analysis and BMI prediction. This article addresses this research gap with a deep learning approach to estimating BMI from handwritten characters by developing a convolutional neural network (CNN). A dataset containing samples from 48 people in lowercase English script was captured for the BMI prediction task. The proposed CNN-based approach reports a commendable accuracy of 99.92%. Performance comparison with other popular CNN architectures reveals that AlexNet and InceptionV3 achieve the second- and third-best performance, with accuracies of 99.69% and 99.53%, respectively.
☆ Advancing Cyber Incident Timeline Analysis Through Rule Based AI and Large Language Models
Timeline Analysis (TA) is a key part of Timeline Forensics (TF) in Digital Forensics (DF), focusing primarily on examining and analysing temporal digital artefacts such as timestamps, derived from event logs, file metadata, and other related data to correlate events resulting from cyber incidents and reconstruct their chronological timeline. Traditional tools often struggle to efficiently process the vast volume and variety of data acquired during DF investigations and Incident Response (IR) processes. This paper presents a novel framework, GenDFIR, that combines Rule-Based Artificial Intelligence (R-BAI) algorithms with Large Language Models (LLMs) to advance and automate the TA process. Our approach consists of two main stages: (1) We use R-BAI to identify and select anomalous digital artefacts based on predefined rules. (2) The selected artefacts are then converted into embeddings for processing by an LLM with the help of a Retrieval-Augmented Generation (RAG) agent. The LLM consequently leverages its capabilities to perform automated TA on the artefacts and predict potential incident scenarios. To validate our framework, we evaluate GenDFIR's performance, efficiency, and reliability using various metrics across synthetic cyber incident simulation scenarios. This paper presents a proof of concept, where the findings demonstrate the significant potential of integrating R-BAI and LLMs for TA. This novel approach highlights the power of Generative AI (GenAI), specifically LLMs, and opens new avenues for advanced threat detection and incident reconstruction, representing a significant step forward in the field.
comment: 25 pages
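A minimal sketch of what stage (1) might look like, assuming a small illustrative rule set over toy event-log entries; the rules and field names below are placeholders, not the GenDFIR rule set.

```python
# Hedged sketch: rule-based selection of anomalous timeline artefacts.
from datetime import datetime

RULES = [
    ("off_hours_logon", lambda e: e["event"] == "logon"
        and not 8 <= datetime.fromisoformat(e["timestamp"]).hour < 18),
    ("unsigned_binary", lambda e: e["event"] == "process_start"
        and not e.get("signed", True)),
]

def select_anomalous(events):
    """Return (event, matched_rule_names) pairs for events matching any rule."""
    selected = []
    for e in events:
        hits = [name for name, rule in RULES if rule(e)]
        if hits:
            selected.append((e, hits))
    return selected

events = [
    {"timestamp": "2024-03-02T03:14:00", "event": "logon", "user": "svc_backup"},
    {"timestamp": "2024-03-02T10:05:00", "event": "process_start",
     "image": "C:/tmp/x.exe", "signed": False},
]
print(select_anomalous(events))  # both toy events are flagged for the LLM stage
```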
☆ Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation
Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
comment: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
☆ Understanding eGFR Trajectories and Kidney Function Decline via Large Multimodal Models
The estimated Glomerular Filtration Rate (eGFR) is an essential indicator of kidney function in clinical practice. Although traditional equations and Machine Learning (ML) models using clinical and laboratory data can estimate eGFR, accurately predicting future eGFR levels remains a significant challenge for nephrologists and ML researchers. Recent advances demonstrate that Large Language Models (LLMs) and Large Multimodal Models (LMMs) can serve as robust foundation models for diverse applications. This study investigates the potential of LMMs to predict future eGFR levels with a dataset consisting of laboratory and clinical values from 50 patients. By integrating various prompting techniques and ensembles of LMMs, our findings suggest that these models, when combined with precise prompts and visual representations of eGFR trajectories, offer predictive performance comparable to existing ML models. This research extends the application of foundation models and suggests avenues for future studies to harness these models in addressing complex medical forecasting challenges.
comment: This preprint version includes corrections of typographical errors related to numerical values in Table 2, which were present in the version published at the BDH workshop in MIPR 2024. These corrections do not affect the overall conclusions of the study
☆ Sample what you can't compress
For learned image representations, basic autoencoders often produce blurry results. Reconstruction quality can be improved by incorporating additional penalties such as adversarial (GAN) and perceptual losses. Arguably, these approaches lack a principled interpretation. Concurrently, in generative settings diffusion has demonstrated a remarkable ability to create crisp, high quality results and has solid theoretical underpinnings (from variational inference to direct study as the Fisher Divergence). Our work combines autoencoder representation learning with diffusion and is, to our knowledge, the first to demonstrate the efficacy of jointly learning a continuous encoder and decoder under a diffusion-based loss. We demonstrate that this approach yields better reconstruction quality as compared to GAN-based autoencoders while being easier to tune. We also show that the resulting representation is easier to model with a latent diffusion model as compared to the representation obtained from a state-of-the-art GAN-based loss. Since our decoder is stochastic, it can generate details not encoded in the otherwise deterministic latent representation; we therefore name our approach "Sample what you can't compress", or SWYCC for short.
☆ Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems
While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
comment: 4 pages, 4 figures, for demo samples, see https://sytronik.github.io/demos/voc_smth_aug/
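A minimal sketch of the kind of augmentation described, assuming mel-spectrogram-like features and a uniform moving-average filter of random width; the probability and width range are illustrative choices, not the paper's settings.

```python
# Hedged sketch: random linear smoothing of acoustic features along time.
import numpy as np

def smooth_features(mel, p=0.5, max_half_width=5, rng=None):
    """mel: (frames, bins). With probability p, convolve each bin along time
    with a uniform filter of random odd width."""
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return mel
    width = int(rng.integers(1, max_half_width + 1)) * 2 + 1   # odd width >= 3
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="same"), axis=0, arr=mel)

mel = np.random.randn(200, 80)      # 200 frames, 80 mel bins (toy features)
augmented = smooth_features(mel)
```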
☆ Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal
Artificial neural networks, especially recent diffusion-based models, have shown remarkable superiority in gaming, control, and QA systems, where the training tasks' datasets are usually static. However, in real-world applications, such as robotic control of reinforcement learning (RL), the tasks are changing, and new tasks arise in a sequential order. This situation poses the new challenge of a plasticity-stability trade-off for training an agent that can adapt to task changes and retain acquired knowledge. In view of this, we propose a rehearsal-based continual diffusion model, called Continual Diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability). Specifically, we first construct an offline benchmark that contains 90 tasks from multiple domains. Then, we train the CoD on each task with sequential modeling and conditional generation for making decisions. Next, we preserve a small portion of previous datasets as the rehearsal buffer and replay it to retain the acquired knowledge. Extensive experiments on a series of tasks show CoD can achieve a promising plasticity-stability trade-off and outperform existing diffusion-based methods and other representative baselines on most tasks.
☆ CoAst: Validation-Free Contribution Assessment for Federated Learning based on Cross-Round Valuation
In the federated learning (FL) process, since the data held by each participant is different, it is necessary to figure out which participant has a higher contribution to the model performance. Effective contribution assessment can help motivate data owners to participate in the FL training. Research works in this field can be divided into two directions based on whether a validation dataset is required. Validation-based methods need to use representative validation data to measure the model accuracy, which is difficult to obtain in practical FL scenarios. Existing validation-free methods assess the contribution based on the parameters and gradients of local models and the global model in a single training round, which is easily compromised by the stochasticity of model training. In this work, we propose CoAst, a practical method to assess the FL participants' contribution without access to any validation data. The core idea of CoAst involves two aspects: one is to count only the most important part of the model parameters through weight quantization, and the other is a cross-round valuation based on the similarity between the current local parameters and the global parameter updates in several subsequent communication rounds. Extensive experiments show that CoAst has comparable assessment reliability to existing validation-based methods and outperforms existing validation-free methods.
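A hedged sketch of the two ideas in this style of scoring, not the CoAst algorithm itself: keep only the largest-magnitude parameter changes and compare a client's update to the global updates of later rounds; the sparsity level and look-ahead horizon are illustrative assumptions.

```python
# Hedged sketch: validation-free, cross-round contribution scoring.
import numpy as np

def top_k_mask(update, keep=0.1):
    """Zero out all but the top `keep` fraction of coordinates by magnitude."""
    k = max(1, int(keep * update.size))
    thresh = np.partition(np.abs(update), -k)[-k]
    return np.where(np.abs(update) >= thresh, np.sign(update), 0.0)

def cross_round_score(local_update, future_global_updates, keep=0.1):
    """Similarity of a client's sparsified update to subsequent global updates."""
    q = top_k_mask(local_update, keep)
    scores = []
    for g in future_global_updates:
        gq = top_k_mask(g, keep)
        denom = np.linalg.norm(q) * np.linalg.norm(gq) + 1e-12
        scores.append(float(q @ gq) / denom)
    return float(np.mean(scores))

rng = np.random.default_rng(0)
local = rng.standard_normal(10_000)
futures = [local * 0.3 + rng.standard_normal(10_000) * 0.1 for _ in range(3)]
print(cross_round_score(local, futures))   # higher score = larger assessed contribution
```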
☆ Reliable Deep Diffusion Tensor Estimation: Rethinking the Power of Data-Driven Optimization Routine
Diffusion tensor imaging (DTI) holds significant importance in clinical diagnosis and neuroscience research. However, conventional model-based fitting methods often suffer from sensitivity to noise, leading to decreased accuracy in estimating DTI parameters. While traditional data-driven deep learning methods have shown potential in terms of accuracy and efficiency, their limited generalization to out-of-training-distribution data impedes their broader application due to the diverse scan protocols used across centers, scanners, and studies. This work aims to tackle these challenges and promote the use of DTI by introducing a data-driven optimization-based method termed DoDTI. DoDTI combines the weighted linear least squares fitting algorithm and regularization by denoising technique. The former fits DW images from diverse acquisition settings into the diffusion tensor field, while the latter applies a deep learning-based denoiser to regularize the diffusion tensor field instead of the DW images, which is free from the limitation of fixed-channel assignment of the network. The optimization objective is solved using the alternating direction method of multipliers and then unrolled to construct a deep neural network, leveraging a data-driven strategy to learn network parameters. Extensive validation experiments are conducted utilizing both internally simulated datasets and externally obtained in-vivo datasets. The results, encompassing both qualitative and quantitative analyses, showcase that the proposed method attains state-of-the-art performance in DTI parameter estimation. Notably, it demonstrates superior generalization, accuracy, and efficiency, rendering it highly reliable for widespread application in the field.
☆ Adversarial Attacks on Machine Learning-Aided Visualizations
Research in ML4VIS investigates how to use machine learning (ML) techniques to generate visualizations, and the field is rapidly growing with high societal impact. However, as with any computational pipeline that employs ML processes, ML4VIS approaches are susceptible to a range of ML-specific adversarial attacks. These attacks can manipulate visualization generations, causing analysts to be tricked and their judgments to be impaired. Due to a lack of synthesis from both visualization and ML perspectives, this security aspect is largely overlooked by the current ML4VIS literature. To bridge this gap, we investigate the potential vulnerabilities of ML-aided visualizations from adversarial attacks using a holistic lens of both visualization and ML perspectives. We first identify the attack surface (i.e., attack entry points) that is unique in ML-aided visualizations. We then exemplify five different adversarial attacks. These examples highlight the range of possible attacks when considering the attack surface and multiple different adversary capabilities. Our results show that adversaries can induce various attacks, such as creating arbitrary and deceptive visualizations, by systematically identifying input attributes that are influential in ML inferences. Based on our observations of the attack surface characteristics and the attack examples, we underline the importance of comprehensive studies of security issues and defense mechanisms as a matter of urgency for the ML4VIS community.
comment: This is the author's version of the article that has been accepted by the Journal of Visualization
☆ Volumetric Surfaces: Representing Fuzzy Geometries with Multiple Meshes
High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods generally are the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). These problems are exacerbated on low-performance graphics hardware, e.g. on mobile devices. We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed layer order from outermost to innermost. We model mesh layers as SDF shells with optimal spacing learned during training. After baking, we fit UV textures to the corresponding meshes. We show that our method can represent challenging fuzzy objects while achieving higher frame rates than volume-based and splatting-based methods on low-end and mobile devices.
☆ Demographic parity in regression and classification within the unawareness framework
This paper explores the theoretical foundations of fair regression under the constraint of demographic parity within the unawareness framework, where disparate treatment is prohibited, extending existing results where such treatment is permitted. Specifically, we aim to characterize the optimal fair regression function when minimizing the quadratic loss. Our results reveal that this function is given by the solution to a barycenter problem with optimal transport costs. Additionally, we study the connection between optimal fair cost-sensitive classification, and optimal fair regression. We demonstrate that nestedness of the decision sets of the classifiers is both necessary and sufficient to establish a form of equivalence between classification and regression. Under this nestedness assumption, the optimal classifiers can be derived by applying thresholds to the optimal fair regression function; conversely, the optimal fair regression function is characterized by the family of cost-sensitive classifiers.
☆ ForeCal: Random Forest-based Calibration for DNNs
Deep neural network (DNN) based classifiers do extremely well in discriminating between observations, resulting in higher ROC AUC and accuracy metrics, but their outputs are often miscalibrated with respect to true event likelihoods. Post-hoc calibration algorithms are often used to calibrate the outputs of these classifiers. Methods like Isotonic regression, Platt scaling, and Temperature scaling have been shown to be effective in some cases but are limited by their parametric assumptions and/or their inability to capture complex non-linear relationships. We propose ForeCal - a novel post-hoc calibration algorithm based on Random forests. ForeCal exploits two unique properties of Random forests: the ability to enforce weak monotonicity and range-preservation. It is more powerful in achieving calibration than current state-of-the-art methods, is non-parametric, and can incorporate exogenous information as features to learn a better calibration function. Through experiments on 43 diverse datasets from the UCI ML repository, we show that ForeCal outperforms existing methods in terms of Expected Calibration Error (ECE) with minimal impact on the discriminative power of the base DNN as measured by AUC.
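A rough stand-in for the idea, not the ForeCal algorithm itself (which additionally enforces monotonicity and can use exogenous features): fit a random forest mapping raw DNN scores to outcomes on a calibration split, and check the Expected Calibration Error before and after.

```python
# Hedged sketch: random-forest post-hoc calibration with an ECE check.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def expected_calibration_error(y_true, y_prob, n_bins=10):
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# toy data: miscalibrated "DNN" scores on a held-out calibration split
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 5000)
raw = np.clip(y_true * 0.9 + rng.normal(0, 0.3, 5000), 0, 1)

cal = RandomForestRegressor(n_estimators=200, min_samples_leaf=50, random_state=0)
cal.fit(raw.reshape(-1, 1), y_true)
calibrated = np.clip(cal.predict(raw.reshape(-1, 1)), 0, 1)

print(expected_calibration_error(y_true, raw),
      expected_calibration_error(y_true, calibrated))   # ECE should drop
```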
☆ Adversarial Learning for Neural PDE Solvers with Sparse Data
Neural network solvers for partial differential equations (PDEs) have made significant progress, yet they continue to face challenges related to data scarcity and model robustness. Traditional data augmentation methods, which leverage symmetry or invariance, impose strong assumptions on physical systems that often do not hold in dynamic and complex real-world applications. To address this research gap, this study introduces a universal learning strategy for neural network PDEs, named Systematic Model Augmentation for Robust Training (SMART). By focusing on challenging and improving the model's weaknesses, SMART reduces generalization error during training under data-scarce conditions, leading to significant improvements in prediction accuracy across various PDE scenarios. The effectiveness of the proposed method is demonstrated through both theoretical analysis and extensive experimentation. The code will be available.
☆ Transfer-based Adversarial Poisoning Attacks for Online (MIMO-)Deep Receivers
Recently, the design of wireless receivers using deep neural networks (DNNs), known as deep receivers, has attracted extensive attention for ensuring reliable communication in complex channel environments. To adapt quickly to dynamic channels, online learning has been adopted to update the weights of deep receivers with over-the-air data (e.g., pilots). However, the fragility of neural models and the openness of wireless channels expose these systems to malicious attacks. To this end, understanding these attack methods is essential for robust receiver design. In this paper, we propose a transfer-based adversarial poisoning attack method for online receivers. Without knowledge of the attack target, adversarial perturbations are injected into the pilots, poisoning the online deep receiver and impairing its ability to adapt to dynamic channels and nonlinear effects. In particular, our attack method targets Deep Soft Interference Cancellation (DeepSIC) [1] using online meta-learning. As a classical model-driven deep receiver, DeepSIC incorporates wireless domain knowledge into its architecture. This integration allows it to adapt efficiently to time-varying channels with only a small number of pilots, achieving optimal performance in a multi-input and multi-output (MIMO) scenario. The deep receiver in this scenario has a number of applications in the field of wireless communication, which motivates our study of the attack methods targeting it. Specifically, we demonstrate the effectiveness of our attack in simulations on synthetic linear, synthetic nonlinear, static, and COST 2100 channels. Simulation results indicate that the proposed poisoning attack significantly reduces the performance of online receivers in rapidly changing scenarios.
comment: 15 pages, 14 figures
☆ Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning
Leveraging large language models (LLMs) for designing reward functions demonstrates significant potential. However, achieving effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we enable LLMs to be effective white-box searchers, highlighting their advanced semantic understanding capabilities. Specifically, we generate reward components for each explicit user requirement and employ the reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively search and optimize these weights based on the context provided by the training log analyzer, while adaptively determining the search step size. We applied the framework to an underwater information collection RL task without direct human feedback or reward examples (zero-shot). The reward critic successfully corrects the reward code with only one piece of feedback per requirement, effectively preventing irreparable errors that can occur when reward function feedback is provided in aggregate. The effective initialization of weights enables the acquisition of different reward functions within the Pareto solution set without weight search. Even in the case where a weight is 100 times off, fewer than four iterations are needed to obtain solutions that meet user requirements. The framework also works well with most prompts utilizing GPT-3.5 Turbo, since it does not require advanced numerical understanding or calculation.
☆ Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering
Recent empirical studies have demonstrated that diffusion models can effectively learn the image distribution and generate new samples. Remarkably, these models can achieve this even with a small number of training samples despite a large image dimension, circumventing the curse of dimensionality. In this work, we provide theoretical insights into this phenomenon by leveraging key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) a union of manifold structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to assume the underlying data distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. With these setups, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit the phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, facilitating image editing. We validate these results with corroborated experimental results on both simulated distributions and image datasets.
comment: 39 pages, 9 figures
☆ Deep Adaptive Interest Network: Personalized Recommendation with Context-Aware Learning
In personalized recommendation systems, accurately capturing users' evolving interests and combining them with contextual information is a critical research area. This paper proposes a novel model called the Deep Adaptive Interest Network (DAIN), which dynamically models users' interests while incorporating context-aware learning mechanisms to achieve precise and adaptive personalized recommendations. DAIN leverages deep learning techniques to build an adaptive interest network structure that can capture users' interest changes in real-time while further optimizing recommendation results by integrating contextual information. Experiments conducted on several public datasets demonstrate that DAIN excels in both recommendation performance and computational efficiency. This research not only provides a new solution for personalized recommendation systems but also offers fresh insights into the application of context-aware learning in recommendation systems.
☆ Relative-Translation Invariant Wasserstein Distance
We introduce a new family of distances, relative-translation invariant Wasserstein distances ($RW_p$), for measuring the similarity of two probability distributions under distribution shift. Generalizing the classical optimal transport model, we show that $RW_p$ distances are also real distance metrics defined on the quotient set $\mathcal{P}_p(\mathbb{R}^n)/\sim$ and invariant to distribution translations. When $p=2$, the $RW_2$ distance enjoys more exciting properties, including decomposability of the optimal transport model, translation-invariance of the $RW_2$ distance, and a Pythagorean relationship between $RW_2$ and the classical quadratic Wasserstein distance ($W_2$). Based on these properties, we show that a distribution shift, measured by $W_2$ distance, can be explained in the bias-variance perspective. In addition, we propose a variant of the Sinkhorn algorithm, named $RW_2$ Sinkhorn algorithm, for efficiently calculating $RW_2$ distance, coupling solutions, as well as $W_2$ distance. We also provide the analysis of numerical stability and time complexity for the proposed algorithm. Finally, we validate the $RW_2$ distance metric and the algorithm performance with three experiments. We conduct one numerical validation for the $RW_2$ Sinkhorn algorithm and show two real-world applications demonstrating the effectiveness of using $RW_2$ under distribution shift: digits recognition and similar thunderstorm detection. The experimental results report that our proposed algorithm significantly improves the computational efficiency of Sinkhorn in certain practical applications, and the $RW_2$ distance is robust to distribution translations compared with baselines.
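A small empirical check of the Pythagorean relationship mentioned above, taking $RW_2$ here to be the quadratic Wasserstein distance between mean-centered point clouds; the exact solver via assignment works because the toy clouds have equal size and uniform weights.

```python
# Hedged sketch: W_2^2 = RW_2^2 + ||mean shift||^2 on equal-size point clouds.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def w2(x, y):
    """Exact quadratic Wasserstein distance between equal-size uniform clouds."""
    cost = cdist(x, y, metric="sqeuclidean")
    r, c = linear_sum_assignment(cost)
    return np.sqrt(cost[r, c].mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = rng.normal(size=(200, 2)) + np.array([3.0, -1.0])      # shifted cloud

rw2 = w2(x - x.mean(0), y - y.mean(0))                      # translation-invariant part
shift = np.linalg.norm(x.mean(0) - y.mean(0))
print(w2(x, y) ** 2, rw2 ** 2 + shift ** 2)                 # the two values agree
```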
☆ Abstractive Text Summarization: State of the Art, Challenges, and Improvements
Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, limitations and charts out future improvements - providing researchers an extensive overview to advance abstractive summarization research. We provide vital comparison tables across techniques categorized - offering insights into model complexity, scalability and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.
comment: 9 Tables, 7 Figures
☆ Adaptive Class Emergence Training: Enhancing Neural Network Stability and Generalization through Progressive Target Evolution
Recent advancements in artificial intelligence, particularly deep neural networks, have pushed the boundaries of what is achievable in complex tasks. Traditional methods for training neural networks in classification problems often rely on static target outputs, such as one-hot encoded vectors, which can lead to unstable optimization and difficulties in handling non-linearities within data. In this paper, we propose a novel training methodology that progressively evolves the target outputs from a null vector to one-hot encoded vectors throughout the training process. This gradual transition allows the network to adapt more smoothly to the increasing complexity of the classification task, maintaining an equilibrium state that reduces the risk of overfitting and enhances generalization. Our approach, inspired by concepts from structural equilibrium in finite element analysis, has been validated through extensive experiments on both synthetic and real-world datasets. The results demonstrate that our method achieves faster convergence, improved accuracy, and better generalization, especially in scenarios with high data complexity and noise. This progressive training framework offers a robust alternative to classical methods, opening new perspectives for more efficient and stable neural network training.
comment: 15 pages, 9 figures, 2 tables
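A minimal sketch of the progressive-target idea, assuming a linear ramp from the null vector to the one-hot vector over a fixed number of warm-up epochs; the schedule is an illustrative choice, not the paper's.

```python
# Hedged sketch: targets that evolve from a null vector toward one-hot encodings.
import numpy as np

def progressive_targets(labels, n_classes, epoch, warmup_epochs=20):
    """Return soft targets alpha(t) * one_hot(y), with alpha ramping from 0 to 1."""
    alpha = min(1.0, epoch / warmup_epochs)
    one_hot = np.eye(n_classes)[labels]
    return alpha * one_hot

y = np.array([0, 2, 1])
print(progressive_targets(y, 3, epoch=5))    # targets at 25% of the ramp
```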
☆ Learning Privacy-Preserving Student Networks via Discriminative-Generative Distillation
While deep models have proved successful in learning rich knowledge from massive well-annotated data, they may pose a privacy leakage risk in practical deployment. It is necessary to find an effective trade-off between high utility and strong privacy. In this work, we propose a discriminative-generative distillation approach to learn privacy-preserving deep models. Our key idea is to take models as a bridge to distill knowledge from private data and then transfer it to learn a student network via two streams. First, the discriminative stream trains a baseline classifier on private data and an ensemble of teachers on multiple disjoint private subsets, respectively. Then, the generative stream takes the classifier as a fixed discriminator and trains a generator in a data-free manner. After that, the generator is used to generate massive synthetic data which are further applied to train a variational autoencoder (VAE). Among these synthetic data, a few of them are fed into the teacher ensemble to query labels via differentially private aggregation, while most of them are embedded to the trained VAE for reconstructing synthetic data. Finally, semi-supervised student learning is performed to simultaneously handle two tasks: knowledge transfer from the teachers with distillation on few privately labeled synthetic data, and knowledge enhancement with tangent-normal adversarial regularization on many triples of reconstructed synthetic data. In this way, our approach can control query cost over private data and mitigate accuracy degradation in a unified manner, leading to a privacy-preserving student model. Extensive experiments and analysis clearly show the effectiveness of the proposed approach.
comment: This paper is accepted by IEEE Transactions on Image Processing (TIP)
☆ Building Math Agents with Multi-Turn Iterative Preference Learning
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach to further improve model performance. However, existing direct preference learning algorithms are originally designed for the single-turn chat task, and do not fully address the complexities of multi-turn reasoning and external tool integration required for tool-integrated mathematical reasoning tasks. To fill in this gap, we introduce a multi-turn direct preference learning framework, tailored for this context, that leverages feedback from code interpreters and optimizes trajectory-level preferences. This framework includes multi-turn DPO and multi-turn KTO as specific implementations. The effectiveness of our framework is validated through training of various language models using an augmented prompt set from the GSM8K and MATH datasets. Our results demonstrate substantial improvements: a supervised fine-tuned Gemma-1.1-it-7B model's performance increased from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH. Similarly, a Gemma-2-it-9B model improved from 84.1% to 86.3% on GSM8K and from 51.0% to 54.5% on MATH.
comment: A multi-turn direct preference learning framework for tool-integrated reasoning tasks
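For orientation, a hedged sketch of a trajectory-level DPO objective of the kind this framework builds on, applied to summed log-probabilities of whole multi-turn trajectories under the policy and a frozen reference; beta and the toy values are assumptions, not the authors' implementation.

```python
# Hedged sketch: trajectory-level DPO loss over chosen (w) vs. rejected (l) trajectories.
import numpy as np

def trajectory_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Inputs are summed log-probs of whole trajectories (policy and reference)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.mean(-np.log(1.0 / (1.0 + np.exp(-margin)))))   # -log sigmoid

# toy batch of two preference pairs
print(trajectory_dpo_loss(np.array([-55.0, -60.0]), np.array([-70.0, -66.0]),
                          np.array([-58.0, -61.0]), np.array([-68.0, -64.0])))
```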
☆ Gaussian Rate-Distortion-Perception Coding and Entropy-Constrained Scalar Quantization
This paper investigates the best known bounds on the quadratic Gaussian distortion-rate-perception function with limited common randomness for the Kullback-Leibler divergence-based perception measure, as well as their counterparts for the squared Wasserstein-2 distance-based perception measure, recently established by Xie et al. These bounds are shown to be nondegenerate in the sense that they cannot be deduced from each other via a refined version of Talagrand's transportation inequality. On the other hand, an improved lower bound is established when the perception measure is given by the squared Wasserstein-2 distance. In addition, it is revealed by exploiting the connection between rate-distortion-perception coding and entropy-constrained scalar quantization that all the aforementioned bounds are generally not tight in the weak perception constraint regime.
Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing
Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from intriguing observations: among a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness in the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identifies editing directions with desirable properties: homogeneity, transferability, composability, and linearity. These properties of LOCO Edit benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The codes will be released at https://github.com/ChicyChen/LOCO-Edit.
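A toy sketch of the numerical step behind these observations: differentiate a stand-in posterior-mean predictor with respect to its noisy input and take singular vectors of the Jacobian as candidate editing directions; the tiny MLP and the step size are placeholders for a real diffusion denoiser, not the LOCO Edit implementation.

```python
# Hedged sketch: editing directions from the Jacobian of a toy posterior-mean predictor.
import torch

torch.manual_seed(0)
d = 16
pmp = torch.nn.Sequential(                      # stand-in for x_t -> E[x_0 | x_t]
    torch.nn.Linear(d, 64), torch.nn.SiLU(), torch.nn.Linear(64, d))

x_t = torch.randn(d)
J = torch.autograd.functional.jacobian(lambda x: pmp(x), x_t)   # (d, d)
U, S, Vh = torch.linalg.svd(J)

edit_direction = Vh[0]                           # assumed dominant semantic direction
x_edited = x_t + 3.0 * edit_direction            # single-step, training-free edit
print(S[:5])   # inspect the spectrum; the paper argues it concentrates in few directions
```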
☆ Optimal Neural Network Approximation for High-Dimensional Continuous Functions
Recently, the authors of Shen, Yang, and Zhang (JMLR, 2022) developed a neural network with width $36d(2d + 1)$ and depth $11$, which utilizes a special activation function called the elementary universal activation function, to achieve the super approximation property for functions in $C([a,b]^d)$. That is, the constructed network only requires a fixed number of neurons to approximate a $d$-variate continuous function on a $d$-dimensional hypercube with arbitrary accuracy. Their network uses $\mathcal{O}(d^2)$ fixed neurons. One natural question to address is whether we can reduce the number of these neurons in such a network. By leveraging a variant of the Kolmogorov Superposition Theorem, our analysis shows that there is a neural network generated by the elementary universal activation function with only $366d + 365$ fixed, intrinsic (non-repeated) neurons that attains this super approximation property. Furthermore, we present a family of continuous functions that requires at least width $d$, and therefore at least $d$ intrinsic neurons, to achieve arbitrary accuracy in its approximation. This shows that the requirement of $\mathcal{O}(d)$ intrinsic neurons is optimal in the sense that it grows linearly with the input dimension $d$, unlike some approximation methods where parameters may grow exponentially with $d$.
☆ Machine Learning Applications to Computational Plasma Physics and Reduced-Order Plasma Modeling: A Perspective
Machine learning (ML) provides a broad spectrum of tools and architectures that enable the transformation of data from simulations and experiments into useful and explainable science, thereby augmenting domain knowledge. Furthermore, ML-enhanced numerical modelling can revamp scientific computing for real-world complex engineering systems, creating unique opportunities to examine the operation of the technologies in detail and automate their optimization and control. In recent years, ML applications have seen significant growth across various scientific domains, particularly in fluid mechanics, where ML has shown great promise in enhancing computational modeling of fluid flows. In contrast, ML applications in numerical plasma physics research remain relatively limited in scope and extent. Despite this, the close relationship between fluid mechanics and plasma physics presents a valuable opportunity to create a roadmap for transferring ML advances in fluid flow modeling to computational plasma physics. This Perspective aims to outline such a roadmap. We begin by discussing some general fundamental aspects of ML, including the various categories of ML algorithms and the different types of problems that can be solved with the help of ML. With regard to each problem type, we then present specific examples from the use of ML in computational fluid dynamics, reviewing several insightful prior efforts. We also review recent ML applications in plasma physics for each problem type. The paper discusses promising future directions and development pathways for ML in plasma modelling within the different application areas. Additionally, we point out prominent challenges that must be addressed to realize ML's full potential in computational plasma physics, including the need for cost-effective high-fidelity simulation tools for extensive data generation.
comment: 42 pages, 20 figures
☆ Understanding the Role of Functional Diversity in Weight-Ensembling with Ingredient Selection and Multidimensional Scaling ICML 2024
Weight-ensembles are formed when the parameters of multiple neural networks are directly averaged into a single model. They have demonstrated generalization capability in-distribution (ID) and out-of-distribution (OOD) which is not completely understood, though they are thought to successfully exploit functional diversity allotted by each distinct model. Given a collection of models, it is also unclear which combination leads to the optimal weight-ensemble; the SOTA is a linear-time ``greedy" method. We introduce two novel weight-ensembling approaches to study the link between performance dynamics and how each method decides to apply the functionally diverse components, akin to diversity-encouragement in the prediction-ensemble literature. We develop a visualization tool to explain how each algorithm explores various domains defined via pairwise-distances to further investigate selection and algorithms' convergence. Empirical analyses offer perspectives that reinforce how high diversity enhances weight-ensembling while qualifying the extent to which diversity alone improves accuracy. We also demonstrate that sampling positionally distinct models can contribute just as meaningfully to improvements in a weight-ensemble.
comment: Published at the ICML 2024 (Vienna, Austria) Workshop on Foundation Models in the Wild
☆ Robust Federated Finetuning of Foundation Models via Alternating Minimization of LoRA ICML2024
Parameter-Efficient Fine-Tuning (PEFT) has risen as an innovative training strategy that updates only a select few model parameters, significantly lowering both computational and memory demands. PEFT also helps to decrease data transfer in federated learning settings, where communication depends on the size of updates. In this work, we explore the constraints of previous studies that integrate a well-known PEFT method named LoRA with federated fine-tuning, then introduce RoLoRA, a robust federated fine-tuning framework that utilizes an alternating minimization approach for LoRA, providing greater robustness against decreasing fine-tuning parameters and increasing data heterogeneity. Our results indicate that RoLoRA not only presents the communication benefits but also substantially enhances the robustness and effectiveness in multiple federated fine-tuning scenarios.
comment: Presented at ES-FOMO-II@ICML2024
☆ NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
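A hedged sketch of the spirit of the approach, not the NUDGE optimization itself: shift each data record's embedding toward the queries it should answer, cap the size of the change so pre-trained semantics are not distorted, and renormalize; the step size and cap are illustrative assumptions.

```python
# Hedged sketch: non-parametric adjustment of data embeddings for k-NN retrieval.
import numpy as np

def nudge_embeddings(data_emb, query_emb, positive_pairs, step=0.2, max_delta=0.1):
    """positive_pairs: list of (query_index, data_index) ground-truth matches."""
    new_emb = data_emb.copy()
    for d in np.unique([di for _, di in positive_pairs]):
        qs = [q for q, di in positive_pairs if di == d]
        target = query_emb[qs].mean(axis=0)
        delta = step * (target - data_emb[d])
        norm = np.linalg.norm(delta)
        if norm > max_delta:                     # keep the change to the embedding modest
            delta *= max_delta / norm
        new_emb[d] = data_emb[d] + delta
    new_emb /= np.linalg.norm(new_emb, axis=1, keepdims=True)   # back to the unit sphere
    return new_emb

rng = np.random.default_rng(0)
data = rng.standard_normal((100, 32)); data /= np.linalg.norm(data, axis=1, keepdims=True)
queries = rng.standard_normal((20, 32)); queries /= np.linalg.norm(queries, axis=1, keepdims=True)
adjusted = nudge_embeddings(data, queries, positive_pairs=[(0, 3), (1, 3), (2, 7)])
```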
☆ Optimal sampling for least-squares approximation
Least-squares approximation is one of the most important methods for recovering an unknown function from data. While in many applications the data is fixed, in many others there is substantial freedom to choose where to sample. In this paper, we review recent progress on optimal sampling for (weighted) least-squares approximation in arbitrary linear spaces. We introduce the Christoffel function as a key quantity in the analysis of (weighted) least-squares approximation from random samples, then show how it can be used to construct sampling strategies that possess near-optimal sample complexity: namely, the number of samples scales log-linearly in $n$, the dimension of the approximation space. We discuss a series of variations, extensions and further topics, and throughout highlight connections to approximation theory, machine learning, information-based complexity and numerical linear algebra. Finally, motivated by various contemporary applications, we consider a generalization of the classical setting where the samples need not be pointwise samples of a scalar-valued function, and the approximation space need not be linear. We show that even in this significantly more general setting suitable generalizations of the Christoffel function still determine the sample complexity. This provides a unified procedure for designing improved sampling strategies for general recovery problems. This article is largely self-contained, and intended to be accessible to nonspecialists.
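A self-contained sketch of the sampling strategy described above for one concrete space, the span of the first n orthonormal Legendre polynomials on [-1, 1]: draw samples with density proportional to the inverse Christoffel function by rejection, then solve the correspondingly weighted least-squares problem; the rejection proposal and bound are simple illustrative choices.

```python
# Hedged sketch: Christoffel-function sampling for weighted least squares.
import numpy as np
from numpy.polynomial import legendre

def orthonormal_legendre(x, n):
    """First n Legendre polynomials, orthonormal w.r.t. the uniform measure dx/2."""
    cols = []
    for i in range(n):
        c = np.zeros(i + 1)
        c[i] = 1.0
        cols.append(np.sqrt(2 * i + 1) * legendre.legval(x, c))
    return np.stack(cols, axis=-1)                       # shape (..., n)

def christoffel_sample(n, m, rng):
    """Rejection-sample m points with density proportional to k(x) = sum_i phi_i(x)^2."""
    out = []
    bound = n ** 2                                       # max of k(x), attained at x = +-1
    while len(out) < m:
        x = rng.uniform(-1, 1, size=4 * m)
        k = (orthonormal_legendre(x, n) ** 2).sum(axis=-1)
        accepted = x[rng.uniform(0, bound, size=x.size) < k]
        out.extend(accepted[: m - len(out)])
    return np.array(out)

rng = np.random.default_rng(0)
n, m = 10, 60                                            # dimension n, sample budget m
x = christoffel_sample(n, m, rng)
A = orthonormal_legendre(x, n)
w = n / (A ** 2).sum(axis=-1)                            # weights w(x) = n / k(x)
y = np.exp(x)                                            # target function samples
coef, *_ = np.linalg.lstsq(A * np.sqrt(w)[:, None], np.sqrt(w) * y, rcond=None)
```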
☆ Data-driven 2D stationary quantum droplets and wave propagations in the amended GP equation with two potentials via deep neural networks learning
In this paper, we develop a systematic deep learning approach to solve two-dimensional (2D) stationary quantum droplets (QDs) and investigate their wave propagation in the 2D amended Gross-Pitaevskii equation with Lee-Huang-Yang correction and two kinds of potentials. Firstly, we use the initial-value iterative neural network (IINN) algorithm for 2D stationary quantum droplets of stationary equations. Then the learned stationary QDs are used as the initial value conditions for physics-informed neural networks (PINNs) to explore their evolutions in the some space-time region. Especially, we consider two types of potentials, one is the 2D quadruple-well Gaussian potential and the other is the PT-symmetric HO-Gaussian potential, which lead to spontaneous symmetry breaking and the generation of multi-component QDs. The used deep learning method can also be applied to study wave propagations of other nonlinear physical models.
comment: 17 pages, 12 figures (Proc. R. Soc. A, accepted for publication). arXiv admin note: text overlap with arXiv:2409.01124
☆ Subsidy design for better social outcomes
Overcoming the impact of selfish behavior of rational players in multiagent systems is a fundamental problem in game theory. Without any intervention from a central agent, strategic users take actions in order to maximize their personal utility, which can lead to extremely inefficient overall system performance, often indicated by a high Price of Anarchy. Recent work (Lin et al. 2021) investigated and formalized yet another undesirable behavior of rational agents, that of avoiding freely available information about the game for selfish reasons, leading to worse social outcomes. A central planner can significantly mitigate these issues by injecting a subsidy to reduce certain costs associated with the system and obtain net gains in the system performance. Crucially, the planner needs to determine how to allocate this subsidy effectively. We formally show that designing subsidies that perfectly optimize the social good, in terms of minimizing the Price of Anarchy or preventing the information avoidance behavior, is computationally hard under standard complexity theoretic assumptions. On the positive side, we show that we can learn provably good values of subsidy in repeated games coming from the same domain. This data-driven subsidy design approach avoids solving computationally hard problems for unseen games by learning over polynomially many games. We also show that optimal subsidy can be learned with no-regret given an online sequence of games, under mild assumptions on the cost matrix. Our study focuses on two distinct games: a Bayesian extension of the well-studied fair cost-sharing game, and a component maintenance game with engineering applications.
comment: 30 pages, 3 figures, 5 tables
☆ Generative artificial intelligence for computational chemistry: a roadmap to predicting emergent phenomena
The recent surge in Generative Artificial Intelligence (AI) has introduced exciting possibilities for computational chemistry. Generative AI methods have made significant progress in sampling molecular structures across chemical species, developing force fields, and speeding up simulations. This Perspective offers a structured overview, beginning with the fundamental theoretical concepts in both Generative AI and computational chemistry. It then covers widely used Generative AI methods, including autoencoders, generative adversarial networks, reinforcement learning, flow models and language models, and highlights their selected applications in diverse areas including force field development, and protein/RNA structure prediction. A key focus is on the challenges these methods face before they become truly predictive, particularly in predicting emergent chemical phenomena. We believe that the ultimate goal of a simulation method or theory is to predict phenomena not seen before, and that Generative AI should be subject to these same standards before it is deemed useful for chemistry. We suggest that to overcome these challenges, future AI models need to integrate core chemical principles, especially from statistical mechanics.
☆ Probing self-attention in self-supervised speech models for cross-linguistic differences
Speech models have gained traction thanks to increases in accuracy from novel transformer architectures. While this impressive increase in performance across automatic speech recognition (ASR) benchmarks is noteworthy, there is still much that is unknown about the use of attention mechanisms for speech-related tasks. For example, while it is assumed that these models are learning language-independent (i.e., universal) speech representations, there has not yet been an in-depth exploration of what it would mean for the models to be language-independent. In the current paper, we explore this question within the realm of self-attention mechanisms of one small self-supervised speech transformer model (TERA). We find that even with a small model, the attention heads learned are diverse, ranging from almost entirely diagonal to almost entirely global, regardless of the training language. We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining. We also present a head ablation study which shows that models across languages primarily rely on diagonal heads to classify phonemes.
comment: 10 pages, 18 figures
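One simple way to quantify the diagonal-versus-global distinction described above is to measure how much attention mass lies in a band around the diagonal; the band width below is an illustrative choice, not the paper's metric.

```python
# Hedged sketch: a diagonality score for an attention map.
import numpy as np

def diagonality(attn, band=2):
    """attn: (T, T) row-stochastic attention map. Returns mean mass near the diagonal."""
    T = attn.shape[0]
    near_diag = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :]) <= band
    return float((attn * near_diag).sum() / T)

# toy maps: a sharply diagonal head vs. a uniform ("global") head
T = 50
diag_head = np.eye(T)
global_head = np.full((T, T), 1.0 / T)
print(diagonality(diag_head), diagonality(global_head))
```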
☆ RoboKoop: Efficient Control Conditioned Representations from Visual Input in Robotics using Koopman Operator
Developing agents that can perform complex control tasks from high-dimensional observations is a core ability of autonomous agents that requires underlying robust task control policies and adapting the underlying visual representations to the task. Most existing policies need a lot of training samples and treat this problem from the lens of two-stage learning with a controller learned on top of pre-trained vision models. We approach this problem from the lens of Koopman theory and learn visual representations from robotic agents conditioned on specific downstream tasks in the context of learning stabilizing control for the agent. We introduce a Contrastive Spectral Koopman Embedding network that allows us to learn efficient linearized visual representations from the agent's visual data in a high dimensional latent space and utilizes reinforcement learning to perform off-policy control on top of the extracted representations with a linear controller. Our method enhances stability and control in gradient dynamics over time, significantly outperforming existing approaches by improving efficiency and accuracy in learning task policies over extended horizons.
comment: Accepted to the $8^{th}$ Conference on Robot Learning (CoRL 2024)
☆ Leveraging Interpretability in the Transformer to Automate the Proactive Scaling of Cloud Resources
Modern web services adopt cloud-native principles to leverage the advantages of microservices. To consistently guarantee high Quality of Service (QoS) according to Service Level Agreements (SLAs), ensure satisfactory user experiences, and minimize operational costs, each microservice must be provisioned with the right amount of resources. However, accurately provisioning microservices with adequate resources is complex and depends on many factors, including workload intensity and the complex interconnections between microservices. To address this challenge, we develop a model that captures the relationship between an end-to-end latency, requests at the front-end level, and resource utilization. We then use the developed model to predict the end-to-end latency. Our solution leverages the Temporal Fusion Transformer (TFT), an attention-based architecture equipped with interpretability features. When the prediction results indicate SLA non-compliance, we use the feature importance provided by the TFT as covariates in Kernel Ridge Regression (KRR), with the response variable being the desired latency, to learn the parameters associated with the feature importance. These learned parameters reflect the adjustments required to the features to ensure SLA compliance. We demonstrate the merit of our approach with a microservice-based application and provide a roadmap to deployment.
comment: 14 pages, 5 figures
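A minimal sketch of the second stage as described, assuming toy feature-importance covariates: kernel ridge regression from importances to a desired latency; the kernel, hyperparameters, and data are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch: KRR from TFT-style feature importances to a desired latency target.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
importance = rng.uniform(0, 1, size=(200, 4))                       # toy importances
desired_latency = 100 - 40 * importance[:, 0] + rng.normal(0, 2, 200)  # toy ms targets

krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5)
krr.fit(importance, desired_latency)
print(krr.predict(importance[:3]))    # predicted latency for the first three samples
```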
☆ Backdoor defense, learnability and obfuscation
We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the "trigger", while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker's strategy must work for a randomly-chosen trigger. Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of Hanneke et al. (2022) to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept in between efficient learnability and obfuscation.
comment: 29 pages
☆ Better Verified Explanations with Applications to Incorrectness and Out-of-Distribution Detection
Building on VeriX (Verified eXplainability, arXiv:2212.01051), a system for producing optimal verified explanations for machine learning model outputs, we present VeriX+, which significantly improves both the size and the generation time of verified explanations. We introduce a bound propagation-based sensitivity technique to improve the size, and a binary search-based traversal with confidence ranking for improving time -- the two techniques are orthogonal and can be used independently or together. We also show how to adapt the QuickXplain (Junker 2004) algorithm to our setting to provide a trade-off between size and time. Experimental evaluations on standard benchmarks demonstrate significant improvements on both metrics, e.g., a size reduction of 38% on the GTSRB dataset and a time reduction of 90% on MNIST. We also explore applications of our verified explanations and show that explanation size is a useful proxy for both incorrectness detection and out-of-distribution detection.
☆ An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning
Multi-agent reinforcement learning (MARL) has exploded in popularity in recent years. Many approaches have been developed but they can be divided into three main types: centralized training and execution (CTE), centralized training for decentralized execution (CTDE), and decentralized training and execution (DTE). CTDE methods are the most common as they can use centralized information during training but execute in a decentralized manner -- using only information available to that agent during execution. CTDE is the only paradigm that requires a separate training phase where any available information (e.g., other agent policies, underlying states) can be used. As a result, they can be more scalable than CTE methods, do not require communication during execution, and can often perform well. CTDE fits most naturally with the cooperative case, but can be potentially applied in competitive or mixed settings depending on what information is assumed to be observed. This text is an introduction to CTDE in cooperative MARL. It is meant to explain the setting, basic concepts, and common methods. It does not cover all work in CTDE MARL as the subarea is quite extensive. I have included work that I believe is important for understanding the main concepts in the subarea and apologize to those that I have omitted.
comment: arXiv admin note: text overlap with arXiv:2405.06161
☆ Can Your Generative Model Detect Out-of-Distribution Covariate Shift? ECCV 2024
Detecting Out-of-Distribution (OOD) sensory data and covariate distribution shift aims to identify new test examples with high-level image statistics that differ from the captured, normal, In-Distribution (ID) set. Existing OOD detection literature largely focuses on semantic shift with little-to-no consensus over covariate shift. Generative models capture the ID data in an unsupervised manner, enabling them to effectively identify samples that deviate significantly from this learned distribution, irrespective of the downstream task. In this work, we elucidate the ability of generative models to detect and quantify domain-specific covariate shift through extensive analyses that involve a variety of models. To this end, we conjecture that it is sufficient to detect the most common sensory faults (anomalies and deviations in global signal statistics) by solely modeling high-frequency signal-dependent and independent details. We propose a novel method, CovariateFlow, for OOD detection, specifically tailored to covariate heteroscedastic high-frequency image components using conditional Normalizing Flows (cNFs). Our results on CIFAR10 vs. CIFAR10-C and ImageNet200 vs. ImageNet200-C demonstrate the effectiveness of the method by accurately detecting OOD covariate shift. This work contributes to enhancing the fidelity of imaging systems and aiding machine learning models in OOD detection in the presence of covariate shift.
comment: ECCV 2024
♻ ☆ Enhancing Graph Neural Networks with Limited Labeled Data by Actively Distilling Knowledge from Large Language Models
Graphs are pervasive in real-world domains such as social network analysis, bioinformatics, and knowledge graphs. Graph neural networks (GNNs) excel at node classification, a fundamental task on graphs. Unfortunately, conventional GNNs still face challenges in scenarios with few labeled nodes, despite the prevalence of few-shot node classification tasks in real-world applications. To address this challenge, various approaches have been proposed, including graph meta-learning, transfer learning, and methods based on Large Language Models (LLMs). However, traditional meta-learning and transfer learning methods often require prior knowledge from base classes or fail to exploit the potential advantages of unlabeled nodes. Meanwhile, LLM-based methods may overlook the zero-shot capabilities of LLMs and rely heavily on the quality of generated contexts. In this paper, we propose a novel approach that integrates LLMs and GNNs, leveraging the zero-shot inference and reasoning capabilities of LLMs and employing a Graph-LLM-based active learning paradigm to enhance GNNs' performance. Extensive experiments demonstrate the effectiveness of our model in improving node classification accuracy with considerably limited labeled data, surpassing state-of-the-art baselines by significant margins.
comment: 10 pages, 3 Figures
♻ ☆ Decentralized Intelligence Network (DIN)
Decentralized Intelligence Network (DIN) is a theoretical framework designed to address challenges in AI development, particularly focusing on data fragmentation and siloing issues. It facilitates effective AI training within sovereign data networks by overcoming barriers to accessing diverse data sources, leveraging: 1) personal data stores to ensure data sovereignty, where data remains securely within Participants' control; 2) a scalable federated learning protocol implemented on a public blockchain for decentralized AI training, where only model parameter updates are shared, keeping data within the personal data stores; and 3) a scalable, trustless cryptographic rewards mechanism on a public blockchain to incentivize participation and ensure fair reward distribution through a decentralized auditing protocol. This approach guarantees that no entity can prevent or control access to training data or influence financial benefits, as coordination and reward distribution are managed on the public blockchain with an immutable record. The framework supports effective AI training by allowing Participants to maintain control over their data, benefit financially, and contribute to a decentralized, scalable ecosystem that leverages collective AI to develop beneficial algorithms.
comment: 16 pages, 1 figure. DIN was presented by the author as a speaker at the Summit on Responsible Decentralized Intelligence - Future of Decentralization and AI, hosted by Berkeley RDI on August 6, 2024, at the Verizon Center, Cornell Tech Campus, Roosevelt Island, NYC
♻ ☆ Kolmogorov n-Widths for Multitask Physics-Informed Machine Learning (PIML) Methods: Towards Robust Metrics
Physics-informed machine learning (PIML) as a means of solving partial differential equations (PDE) has garnered much attention in the Computational Science and Engineering (CS&E) world. This topic encompasses a broad array of methods and models aimed at solving a single or a collection of PDE problems, called multitask learning. PIML is characterized by the incorporation of physical laws into the training process of machine learning models in lieu of large data when solving PDE problems. Despite the overall success of this collection of methods, it remains incredibly difficult to analyze, benchmark, and generally compare one approach to another. Using Kolmogorov n-widths as a measure of effectiveness of approximating functions, we judiciously apply this metric in the comparison of various multitask PIML architectures. We compute lower accuracy bounds and analyze the model's learned basis functions on various PDE problems. This is the first objective metric for comparing multitask PIML architectures and helps remove uncertainty in model validation from selective sampling and overfitting. We also identify avenues of improvement for model architectures, such as the choice of activation function, which can drastically affect model generalization to "worst-case" scenarios, which is not observed when reporting task-specific errors. We also incorporate this metric into the optimization process through regularization, which improves the models' generalizability over the multitask PDE problem.
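For orientation, the Kolmogorov n-width referenced above has a standard textbook definition (stated here for context; the paper's specific computation may differ): it measures how well the worst-case element of a solution set $\mathcal{A}$ in a normed space $X$ can be approximated by the best possible $n$-dimensional linear subspace,
$$ d_n(\mathcal{A}; X) = \inf_{\substack{X_n \subset X \\ \dim X_n = n}} \, \sup_{f \in \mathcal{A}} \, \inf_{g \in X_n} \| f - g \|_X . $$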
♻ ☆ Hybrid Decentralized Optimization: Leveraging Both First- and Zeroth-Order Optimizers for Faster Convergence
Distributed optimization is the standard way of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods. Yet, there are settings where some computationally-bounded nodes may not be able to implement first-order, gradient-based optimization, while they could still contribute to joint optimization tasks. In this paper, we initiate the study of hybrid decentralized optimization, studying settings where nodes with zeroth-order and first-order optimization capabilities co-exist in a distributed system, and attempt to jointly solve an optimization task over some data distribution. We essentially show that, under reasonable parameter settings, such a system can not only withstand noisier zeroth-order agents but can even benefit from integrating such agents into the optimization process, rather than ignoring their information. At the core of our approach is a new analysis of distributed optimization with noisy and possibly-biased gradient estimators, which may be of independent interest. Our results hold for both convex and non-convex objectives. Experimental results on standard optimization tasks confirm our analysis, showing that hybrid first-zeroth order optimization can be practical, even when training deep neural networks.
comment: Shayan Talaei and Matin Ansaripour contributed equally to this work
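As background for the setting above, here is a minimal sketch of the kind of gradient-free estimator a computationally-bounded node could rely on: the standard two-point random-direction estimator (a generic construction, not necessarily the paper's exact estimator).

import numpy as np

def zeroth_order_grad(f, x, mu=1e-3, num_dirs=10):
    # Estimate grad f(x) from function evaluations only.
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_dirs):
        u = np.random.randn(d)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / num_dirs

f = lambda x: np.sum(x ** 2)      # toy objective
x = np.ones(5)
print(zeroth_order_grad(f, x))    # close to the true gradient 2*x for small mu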
♻ ☆ The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem
Large language models (LLMs) are useful tools with the capacity for performing specific types of knowledge work at an effective scale. However, LLM deployments in high-risk and safety-critical domains pose unique challenges, notably the issue of "hallucination," where LLMs can generate fabricated information. This is particularly concerning in settings such as drug safety, where inaccuracies could lead to patient harm. To mitigate these risks, we have developed and demonstrated a proof-of-concept suite of guardrails specifically designed to mitigate certain types of hallucinations and errors for drug safety, and potentially applicable to other medical safety-critical contexts. These guardrails include mechanisms to detect anomalous documents to prevent the ingestion of inappropriate data, identify incorrect drug names or adverse event terms, and convey uncertainty in generated content. We integrated these guardrails with an LLM fine-tuned for a text-to-text task, which involves converting both structured and unstructured data within adverse event reports into natural language. This method was applied to translate individual case safety reports, demonstrating effective application in a pharmacovigilance processing task. Our guardrail framework offers a set of tools with broad applicability across various domains, ensuring LLMs can be safely used in high-risk situations by eliminating the occurrence of key errors, including the generation of incorrect pharmacovigilance-related terms, thus adhering to stringent regulatory and quality standards in medical safety-critical environments.
comment: 27 pages, 6 figures, 4 tables and supplementary material provided
♻ ☆ GenoCraft: A Comprehensive, User-Friendly Web-Based Platform for High-Throughput Omics Data Analysis and Visualization
The surge in high-throughput omics data has reshaped the landscape of biological research, underlining the need for powerful, user-friendly data analysis and interpretation tools. This paper presents GenoCraft, a web-based comprehensive software solution designed to handle the entire pipeline of omics data processing. GenoCraft offers a unified platform featuring advanced bioinformatics tools, covering all aspects of omics data analysis. It encompasses a range of functionalities, such as normalization, quality control, differential analysis, network analysis, pathway analysis, and diverse visualization techniques. This software makes state-of-the-art omics data analysis more accessible to a wider range of users. With GenoCraft, researchers and data scientists have access to an array of cutting-edge bioinformatics tools under a user-friendly interface, making it a valuable resource for managing and analyzing large-scale omics data. The API with an interactive web interface is publicly available at https://genocraft.stanford.edu/. We also release all the code at https://github.com/futianfan/GenoCraft.
♻ ☆ $μ$GUIDE: a framework for quantitative imaging via generalized uncertainty-driven inference using deep learning
This work proposes $\mu$GUIDE: a general Bayesian framework to estimate posterior distributions of tissue microstructure parameters from any given biophysical model or MRI signal representation, with exemplar demonstration in diffusion-weighted MRI. Harnessing a new deep learning architecture for automatic signal feature selection combined with simulation-based inference and efficient sampling of the posterior distributions, $\mu$GUIDE bypasses the high computational and time cost of conventional Bayesian approaches and does not rely on acquisition constraints to define model-specific summary statistics. The obtained posterior distributions make it possible to highlight degeneracies present in the model definition and to quantify the uncertainty and ambiguity of the estimated parameters.
♻ ☆ Partially Observable Multi-Agent Reinforcement Learning with Information Sharing ICML 2023
We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential \emph{information-sharing} among agents, a common practice in empirical multi-agent RL, and a standard model for multi-agent control systems with communications. We first establish several computational complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-efficient single-agent RL with partial observations, for efficiently solving POSGs. Inspired by the inefficiency of planning in the ground-truth model, we then propose to further \emph{approximate} the shared common information to construct an approximate model of the POSG, in which planning an approximate \emph{equilibrium} (in terms of solving the original POSG) can be quasi-efficient, i.e., of quasi-polynomial-time, under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm that is \emph{both} statistically and computationally quasi-efficient. Finally, beyond equilibrium learning, we extend our algorithmic framework to finding the \emph{team-optimal solution} in cooperative POSGs, i.e., decentralized partially observable Markov decision processes, a much more challenging goal. We establish concrete computational and sample complexities under several common structural assumptions of the model. We hope our study could open up the possibilities of leveraging and even designing different \emph{information structures}, a well-studied notion in control theory, for developing both sample- and computation-efficient partially observable multi-agent RL.
comment: Journal extension of the conference version at ICML 2023. Changed to the more general reward function form, added new results for learning in Dec-POMDPs, and streamlined proof outlines
♻ ☆ Domain Decomposition-based coupling of Operator Inference reduced order models via the Schwarz alternating method
This paper presents and evaluates an approach for coupling together subdomain-local reduced order models (ROMs) constructed via non-intrusive operator inference (OpInf) with each other and with subdomain-local full order models (FOMs), following a domain decomposition of the spatial geometry on which a given partial differential equation (PDE) is posed. Joining subdomain-local models is accomplished using the overlapping Schwarz alternating method, a minimally-intrusive multiscale coupling technique that works by transforming a monolithic problem into a sequence of subdomain-local problems, which communicate through transmission boundary conditions imposed on the subdomain interfaces. After formulating the overlapping Schwarz alternating method for OpInf ROMs, termed OpInf-Schwarz, we evaluate the method's accuracy and efficiency on several test cases involving the heat equation in two spatial dimensions. We demonstrate that the method is capable of coupling together arbitrary combinations of OpInf ROMs and FOMs, and that speed-ups over a monolithic FOM are possible when performing OpInf ROM coupling.
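To make the coupling mechanism concrete, here is a minimal sketch of the overlapping Schwarz alternating method on a 1-D Poisson problem with two overlapping subdomains exchanging Dirichlet transmission data. It illustrates only the alternating-solve idea under simplified assumptions; the paper couples OpInf ROMs and FOMs for the 2-D heat equation.

import numpy as np

n = 101
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
u = np.zeros(n)                  # global iterate for -u'' = 1, u(0) = u(1) = 0
i_a, i_b = 40, 60                # transmission points; indices 40..60 form the overlap

def solve_subdomain(i_lo, i_hi, left_bc, right_bc):
    # Direct finite-difference solve of -u'' = 1 on x[i_lo..i_hi] with Dirichlet data.
    m = i_hi - i_lo - 1          # number of interior unknowns
    A = (np.diag(2.0 * np.ones(m)) - np.diag(np.ones(m - 1), 1)
         - np.diag(np.ones(m - 1), -1)) / h**2
    b = np.ones(m)
    b[0] += left_bc / h**2
    b[-1] += right_bc / h**2
    return np.linalg.solve(A, b)

for _ in range(20):              # Schwarz iterations: each subdomain uses the other's latest trace
    u[1:i_b] = solve_subdomain(0, i_b, 0.0, u[i_b])
    u[i_a + 1:n - 1] = solve_subdomain(i_a, n - 1, u[i_a], 0.0)

print(np.max(np.abs(u - 0.5 * x * (1.0 - x))))   # error vs. the exact solution shrinks toward zero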
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
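A minimal sketch of the LR re-warming and re-decaying idea described above: when continual pre-training begins on a new dataset, the learning rate is warmed back up to a peak and decayed again (cosine here) instead of continuing from the tiny final LR of the previous run, while replay of earlier data is mixed into each batch. The numbers are illustrative, not the paper's settings.

import math

def rewarm_redecay_lr(step, warmup_steps=1000, total_steps=100_000,
                      peak_lr=3e-4, min_lr=3e-5):
    # LR for the current continual-learning stage, with `step` counted from the stage's start.
    if step < warmup_steps:                                   # linear re-warming
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))  # re-decay

# At the start of each new dataset, reset the stage step counter to 0 and mix a small
# fraction of replayed previous-dataset tokens into every batch.
print([round(rewarm_redecay_lr(s), 6) for s in (0, 500, 1000, 50_000, 100_000)])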
♻ ☆ Convolutional L2LFlows: Generating Accurate Showers in Highly Granular Calorimeters Using Convolutional Normalizing Flows
In the quest to build generative surrogate models as computationally efficient alternatives to rule-based simulations, the quality of the generated samples remains a crucial frontier. So far, normalizing flows have been among the models with the best fidelity. However, as the latent space in such models is required to have the same dimensionality as the data space, scaling up normalizing flows to high dimensional datasets is not straightforward. The prior L2LFlows approach successfully used a series of separate normalizing flows and a sequence of conditioning steps to circumvent this problem. In this work, we extend L2LFlows to simulate showers with a 9-times larger profile in the lateral direction. To achieve this, we introduce convolutional layers and U-Net-type connections, move from masked autoregressive flows to coupling layers, and demonstrate the successful modelling of showers in the ILD Electromagnetic Calorimeter as well as Dataset 3 from the public CaloChallenge dataset.
♻ ☆ Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques
We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations. We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We utilize imitation learning to approximate the reference policy, ensuring stability and effectiveness in training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
♻ ☆ Revisiting Character-level Adversarial Attacks for Language Models ICML 2024
Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, gaining prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention as they cannot easily adopt popular gradient-based methods, and are thought to be easy to defend against. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving a high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8 percentage points with respect to the prior art. Our implementation is available at https://github.com/LIONS-EPFL/Charmer.
comment: Accepted in ICML 2024
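For intuition about query-based character-level attacks in general (this is a generic greedy loop, not Charmer's algorithm), here is a minimal sketch: try single-character substitutions and keep whichever most lowers the victim model's confidence in the true label. `predict_proba` is a hypothetical black-box scoring function returning class probabilities.

import string

def greedy_char_attack(text, true_label, predict_proba, max_edits=3):
    for _ in range(max_edits):
        best_score, best_text = predict_proba(text)[true_label], text
        for i in range(len(text)):
            for c in string.ascii_lowercase:
                cand = text[:i] + c + text[i + 1:]
                score = predict_proba(cand)[true_label]
                if score < best_score:
                    best_score, best_text = score, cand
        text = best_text
        if best_score < 0.5:      # true-label confidence low enough that the label likely flipped
            break
    return text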
♻ ☆ Privacy-aware Berrut Approximated Coded Computing for Federated Learning
Federated Learning (FL) is an appealing strategy that enables the collaborative training of an AI model among different data owners without revealing their private datasets. Even so, FL has privacy vulnerabilities that prior work has tried to overcome with techniques such as Differential Privacy (DP), Homomorphic Encryption (HE), or Secure Multi-Party Computation (SMPC). However, these techniques have important drawbacks that may narrow their range of application: difficulty handling non-linear functions and large matrix multiplications, and high communication and computational costs for managing semi-honest nodes. In this context, we propose a solution that guarantees privacy in FL schemes while simultaneously solving these problems. Our proposal is based on Berrut Approximated Coded Computing, a technique from the Coded Distributed Computing paradigm, adapted to a Secret Sharing configuration to provide input privacy to FL in a scalable way. It can be applied to computing non-linear functions and treats the special case of distributed matrix multiplication, a key primitive at the core of many automated learning tasks. Because of these characteristics, it can be applied in a wide range of FL scenarios, since it is independent of the machine learning models or aggregation algorithms used in the FL scheme. We analyze the privacy and complexity of our solution, and our extensive numerical results show a good trade-off between privacy and precision.
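For readers unfamiliar with the numerical core behind Berrut Approximated Coded Computing, here is a minimal sketch of Berrut's first-form barycentric rational interpolation (weights $w_k = (-1)^k$). This is the interpolation step only, not the full secret-sharing protocol; node choice and data are illustrative.

import numpy as np

def berrut_interpolate(x_nodes, f_values, x_eval):
    # Berrut's rational interpolant: r(x) = sum_k w_k f_k / (x - x_k) / sum_k w_k / (x - x_k).
    w = np.array([(-1.0) ** k for k in range(len(x_nodes))])
    terms = w / (x_eval[:, None] - x_nodes[None, :])      # shape (n_eval, n_nodes)
    return (terms @ f_values) / terms.sum(axis=1)

nodes = np.cos(np.linspace(0.0, np.pi, 9))                # Chebyshev-like interpolation nodes
values = np.tanh(3.0 * nodes)                             # samples of a smooth function
query = np.array([-0.8, -0.3, 0.1, 0.4, 0.75])
print(berrut_interpolate(nodes, values, query))           # approximates tanh(3x) at the query points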
♻ ☆ A Systematic Bias of Machine Learning Regression Models and Its Correction: an Application to Imaging-based Brain Age Prediction
Machine learning models for continuous outcomes often yield systematically biased predictions, particularly for values that largely deviate from the mean. Specifically, predictions for large-valued outcomes tend to be negatively biased (underestimating actual values), while those for small-valued outcomes are positively biased (overestimating actual values). We refer to this linear central tendency warped bias as the "systematic bias of machine learning regression". In this paper, we first demonstrate that this systematic prediction bias persists across various machine learning regression models, and then delve into its theoretical underpinnings. To address this issue, we propose a general constrained optimization approach designed to correct this bias and develop computationally efficient implementation algorithms. Simulation results indicate that our correction method effectively eliminates the bias from the predicted outcomes. We apply the proposed approach to the prediction of brain age using neuroimaging data. In comparison to competing machine learning regression models, our method effectively addresses the longstanding issue of "systematic bias of machine learning regression" in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age.
♻ ☆ Pre-processing and Compression: Understanding Hidden Representation Refinement Across Imaging Domains via Intrinsic Dimension
In recent years, there has been interest in how geometric properties such as intrinsic dimension (ID) of a neural network's hidden representations change through its layers, and how such properties are predictive of important model behavior such as generalization ability. However, evidence has begun to emerge that such behavior can change significantly depending on the domain of the network's training data, such as natural versus medical images. Here, we further this inquiry by exploring how the ID of a network's learned representations changes through its layers, in essence, characterizing how the network successively refines the information content of input data to be used for predictions. Analyzing eleven natural and medical image datasets across six network architectures, we find that how ID changes through the network differs noticeably between natural and medical image models. Specifically, medical image models peak in representation ID earlier in the network, implying a difference in the image features and their abstractness that are typically used for downstream tasks in these domains. Additionally, we discover a strong correlation of this peak representation ID with the ID of the data in its input space, implying that the intrinsic information content of a model's learned representations is guided by that of the data it was trained on. Overall, our findings emphasize notable discrepancies in network behavior between natural and non-natural imaging domains regarding hidden representation information content, and provide further insights into how a network's learned features are shaped by its training data.
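As a concrete reference point for the kind of measurement discussed above, here is a minimal sketch of one common intrinsic-dimension estimator (a TwoNN-style maximum-likelihood estimate from nearest-neighbour distance ratios) that can be applied layer by layer to hidden representations; the paper's exact estimator and protocol may differ.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_id(X):
    # Estimate ID from ratios of 2nd to 1st nearest-neighbour distances (MLE over mu).
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)  # column 0 is the point itself
    mu = dists[:, 2] / dists[:, 1]
    return len(mu) / np.sum(np.log(mu))

# Data lying on a 5-dimensional linear subspace of a 50-dimensional space (stand-in for features).
X = np.random.randn(2000, 5) @ np.random.randn(5, 50)
print(two_nn_id(X))   # should come out close to 5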
♻ ☆ Energy-Efficient Channel Decoding for Wireless Federated Learning: Convergence Analysis and Adaptive Design
One of the most critical challenges for deploying distributed learning solutions, such as federated learning (FL), in wireless networks is the limited battery capacity of mobile clients. While it is a common belief that the major energy consumption of mobile clients comes from the uplink data transmission, this paper presents a novel finding, namely, that channel decoding also contributes significantly to the overall energy consumption of mobile clients in FL. Motivated by this new observation, we propose an energy-efficient adaptive channel decoding scheme that leverages the intrinsic robustness of FL to model errors. In particular, the robustness is exploited to reduce the energy consumption of channel decoders at mobile clients by adaptively adjusting the number of decoding iterations. We theoretically prove that wireless FL with communication errors can converge at the same rate as the case with error-free communication provided the bit error rate (BER) is properly constrained. An adaptive channel decoding scheme is then proposed to improve the energy efficiency of wireless FL systems. Experimental results demonstrate that the proposed method maintains the same learning accuracy while reducing the channel decoding energy consumption by about 20% when compared to an existing approach.
comment: This paper has been accepted by the IEEE TWC. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation
Foundational Large Language Models (LLMs) have changed the way we perceive technology. They have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools. At the same time, researchers are striving to identify the limitations of these tools to improve them further. Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content. In the present work, we have identified a fundamental limitation related to the image generation ability of LLMs, and termed it The NO Syndrome. This negation blindness refers to LLMs' inability to correctly comprehend NO-related natural language prompts to generate the desired images. Interestingly, all tested LLMs, including GPT-4, Gemini, and Copilot, were found to suffer from this syndrome. To demonstrate the generalization of this limitation, we carried out simulation experiments and conducted entropy-based and benchmark statistical analysis tests on various LLMs in multiple languages, including English, Hindi, and French. We conclude that the NO syndrome is a significant flaw in current LLMs that needs to be addressed. A related finding of this study showed a consistent discrepancy between image and textual responses as a result of this NO syndrome. We posit that the introduction of a negation context-aware reinforcement learning based feedback loop between the LLM's textual response and generated image could help ensure the generated text is based on both the LLM's correct contextual understanding of the negation query and the generated visual output.
comment: 15 pages, 7 figures
♻ ☆ Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection CCS 2024
In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions that rely on keywords. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do so again in the following days, even after rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to some real-world samples received from different sources. Our results show that email kits are being reused extensively and visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection engines that rely on textual features and keywords are bypassed, an occurrence our observations show happens frequently.
comment: To be published in the proceedings of the ACM Conference on Computer and Communications Security (ACM CCS 2024)
♻ ☆ Fast and interpretable Support Vector Classification based on the truncated ANOVA decomposition
Support Vector Machines (SVMs) are an important tool for performing classification on scattered data, where one usually has to deal with many data points in high-dimensional spaces. We propose solving SVMs in primal form using feature maps based on trigonometric functions or wavelets. In low-dimensional settings, the Fast Fourier Transform (FFT) and related methods are a powerful tool for dealing with the considered basis functions. As the dimension grows, the classical FFT-based methods become inefficient due to the curse of dimensionality. Therefore, we restrict ourselves to multivariate basis functions, each of which only depends on a small number of dimensions. This is motivated by the well-known sparsity of effects and recent results regarding the reconstruction of functions from scattered data in terms of truncated analysis of variance (ANOVA) decompositions, which makes the resulting model even interpretable in terms of importance of the features as well as their couplings. The usage of small superposition dimensions has the consequence that the computational effort no longer grows exponentially but only polynomially with respect to the dimension. In order to enforce sparsity regarding the basis coefficients, we use the frequently applied $\ell_2$-norm and, in addition, $\ell_1$-norm regularization. The found classifying function, which is the linear combination of basis functions, and its variance can then be analyzed in terms of the classical ANOVA decomposition of functions. Based on numerical examples we show that we are able to recover the signum of a function that perfectly fits our model assumptions. Furthermore, we perform classification on different artificial and real-world data sets. We obtain better results with $\ell_1$-norm regularization, both in terms of accuracy and clarity of interpretability.
♻ ☆ The future of cosmological likelihood-based inference: accelerated high-dimensional parameter estimation and model comparison
We advocate for a new paradigm of cosmological likelihood-based inference, leveraging recent developments in machine learning and its underlying technology, to accelerate Bayesian inference in high-dimensional settings. Specifically, we combine (i) emulation, where a machine learning model is trained to mimic cosmological observables, e.g. CosmoPower-JAX; (ii) differentiable and probabilistic programming, e.g. JAX and NumPyro, respectively; (iii) scalable Markov chain Monte Carlo (MCMC) sampling techniques that exploit gradients, e.g. Hamiltonian Monte Carlo; and (iv) decoupled and scalable Bayesian model selection techniques that compute the Bayesian evidence purely from posterior samples, e.g. the learned harmonic mean implemented in harmonic. This paradigm allows us to carry out a complete Bayesian analysis, including both parameter estimation and model selection, in a fraction of the time of traditional approaches. First, we demonstrate the application of this paradigm on a simulated cosmic shear analysis for a Stage IV survey in 37- and 39-dimensional parameter spaces, comparing $\Lambda$CDM and a dynamical dark energy model ($w_0w_a$CDM). We recover posterior contours and evidence estimates that are in excellent agreement with those computed by the traditional nested sampling approach while reducing the computational cost from 8 months on 48 CPU cores to 2 days on 12 GPUs. Second, we consider a joint analysis between three simulated next-generation surveys, each performing a 3x2pt analysis, resulting in 157- and 159-dimensional parameter spaces. Standard nested sampling techniques are simply unlikely to be feasible in this high-dimensional setting, requiring a projected 12 years of compute time on 48 CPU cores; on the other hand, the proposed approach only requires 8 days of compute time on 24 GPUs. All packages used in our analyses are publicly available.
comment: 14 pages, 6 figures. Accepted for publication in the Open Journal of Astrophysics. Codes available at https://github.com/alessiospuriomancini/cosmopower, https://github.com/dpiras/cosmopower-jax, https://github.com/astro-informatics/harmonic/
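To illustrate the "differentiable probabilistic programming plus gradient-based MCMC" ingredient in isolation, here is a minimal NumPyro/JAX sketch with a toy Gaussian model sampled by NUTS; the actual analyses swap in an emulated likelihood (e.g. CosmoPower-JAX) and compute the evidence from the resulting chains with harmonic. Model and data are purely illustrative.

import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(data):
    mu = numpyro.sample("mu", dist.Normal(0.0, 5.0))       # toy parameter, stand-in for cosmology
    sigma = numpyro.sample("sigma", dist.HalfNormal(2.0))
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=data)

data = jnp.array([0.8, 1.1, 0.9, 1.3, 1.0])
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), data=data)
mcmc.print_summary()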
♻ ☆ A possible late-time transition of $M_B$ inferred via neural networks
The strengthening of tensions in the cosmological parameters has led to a reconsideration of fundamental aspects of standard cosmology. The tension in the Hubble constant can also be viewed as a tension between local and early Universe constraints on the absolute magnitude $M_B$ of Type Ia supernova. In this work, we reconsider the possibility of a variation of this parameter in a model-independent way. We employ neural networks to agnostically constrain the value of the absolute magnitude as well as assess the impact and statistical significance of a variation in $M_B$ with redshift from the Pantheon+ compilation, together with a thorough analysis of the neural network architecture. We find an indication for a possible transition redshift at the $z\approx 1$ region.
comment: 13 pages, 9 sets of figures, 2 tables. To appear in JCAP
♻ ☆ Variational Mode Decomposition and Linear Embeddings are What You Need For Time-Series Forecasting
Time-series forecasting often faces challenges due to data volatility, which can lead to inaccurate predictions. Variational Mode Decomposition (VMD) has emerged as a promising technique to mitigate volatility by decomposing data into distinct modes, thereby enhancing forecast accuracy. In this study, we integrate VMD with linear models to develop a robust forecasting framework. Our approach is evaluated on 13 diverse datasets, including ETTm2, WindTurbine, M4, and 10 air quality datasets from various Southeast Asian cities. The effectiveness of the VMD strategy is assessed by comparing Root Mean Squared Error (RMSE) values from models utilizing VMD against those without it. Additionally, we benchmark linear-based models against well-known neural network architectures such as LSTM, Bidirectional LSTM, and RNN. The results demonstrate a significant reduction in RMSE across nearly all models following VMD application. Notably, the Linear + VMD model achieved the lowest average RMSE in univariate forecasting at 0.619. In multivariate forecasting, the DLinear + VMD model consistently outperformed others, attaining the lowest RMSE across all datasets with an average of 0.019. These findings underscore the effectiveness of combining VMD with linear models for superior time-series forecasting.
comment: For associated repository, see https://github.com/Espalemit/VMD-With-LTSF-Linear.git
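A minimal sketch of the decompose-then-forecast idea: split the series into modes with VMD, fit a simple lag-feature linear model per mode, and sum the per-mode forecasts. It assumes the vmdpy package and its positional VMD(signal, alpha, tau, K, DC, init, tol) interface, which should be checked against the installed version; the paper's exact models (e.g. DLinear) are not reproduced here.

import numpy as np
from sklearn.linear_model import LinearRegression
from vmdpy import VMD   # assumed third-party package

t = np.arange(512)
series = np.sin(0.05 * t) + 0.3 * np.sin(0.4 * t) + 0.1 * np.random.randn(512)

# Arguments (assumed order): alpha, tau, K, DC, init, tol.
modes, _, _ = VMD(series, 2000, 0.0, 3, 0, 1, 1e-7)

def forecast_mode(mode, lags=24, horizon=8):
    X = np.stack([mode[i:i + lags] for i in range(len(mode) - lags)])
    y = mode[lags:]
    reg = LinearRegression().fit(X, y)
    window = list(mode[-lags:])
    preds = []
    for _ in range(horizon):                              # recursive multi-step forecast
        nxt = reg.predict(np.array(window[-lags:])[None, :])[0]
        preds.append(nxt)
        window.append(nxt)
    return np.array(preds)

print(sum(forecast_mode(m) for m in modes))               # recombined forecast for the next 8 steps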
♻ ☆ GT-CausIn: a novel causal-based insight for traffic prediction
Traffic forecasting is an important application of spatiotemporal series prediction. Among different methods, graph neural networks have achieved so far the most promising results, and learning the relations between graph nodes thus becomes a crucial task. However, the room for improvement is very limited when these relations are learned in a node-to-node manner. The challenge stems from (1) obscure temporal dependencies between different stations, (2) difficulties in defining variables beyond the node level, and (3) no ready-made method to validate the learned relations. To confront these challenges, we define legitimate traffic causal variables to discover the causal relation inside the traffic network, which is carefully checked with statistical tools and case analysis. We then present a novel model named Graph Spatial-Temporal Network Based on Causal Insight (GT-CausIn), where prior learned causal information is integrated with graph diffusion layers and temporal convolutional network (TCN) layers. Experiments are carried out on two real-world traffic datasets: PEMS-BAY and METR-LA, which show that GT-CausIn significantly outperforms the state-of-the-art models on mid-term and long-term prediction.
♻ ☆ When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective
Adapting pre-trained models to new tasks can exhibit varying effectiveness across datasets. Visual prompting, a state-of-the-art parameter-efficient transfer learning method, can significantly improve the performance of out-of-distribution tasks. On the other hand, linear probing, a standard transfer learning method, can sometimes become the best approach. We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing. By employing the LLR score alongside resource-efficient visual prompts approximations, our cost-effective measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%. The source code is available at https://github.com/IBM/VP-LLR.
♻ ☆ Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing
The incorporation of advanced sensors and machine learning techniques has enabled modern manufacturing enterprises to perform data-driven classification-based anomaly detection based on the sensor data collected in manufacturing processes. However, one critical challenge is that new defect categories may emerge as the manufacturing process continues, degrading the monitoring performance of previously trained machine learning models. Hence, there is an increasing need to empower machine learning models to learn continually. Among continual learning methods, memory-based continual learning has the best performance but is constrained by data storage capacity. To address this issue, this paper develops a novel pseudo replay-based continual learning framework by integrating class incremental learning and oversampling-based data generation. Without storing all the data, the developed framework can generate high-quality data representing previous classes to train the machine learning model incrementally when a new anomaly category occurs. In addition, it can even enhance monitoring performance, since it also effectively improves data quality. The effectiveness of the proposed framework is validated in three case studies that leverage supervised classification for anomaly detection. The experimental results show that the developed method is very promising at detecting novel anomalies while maintaining good performance on previous tasks, and offers more flexibility in model architecture.
♻ ☆ Navigating the Maize: Cyclic and conditional computational graphs for molecular simulation
Many computational chemistry and molecular simulation workflows can be expressed as graphs. This abstraction is useful to modularize and potentially reuse existing components, as well as provide parallelization and ease reproducibility. Existing tools represent the computation as a directed acyclic graph (DAG), thus allowing efficient execution by parallelization of concurrent branches. These systems can, however, generally not express cyclic and conditional workflows. We therefore developed Maize, a workflow manager for cyclic and conditional graphs based on the principles of flow-based programming. By running each node of the graph concurrently in separate processes and allowing communication at any time through dedicated inter-node channels, arbitrary graph structures can be executed. We demonstrate the effectiveness of the tool on a dynamic active learning task in computational drug design, involving the use of a small molecule generative model and an associated scoring system, and on a reactivity prediction pipeline using quantum-chemistry and semiempirical approaches.
♻ ☆ DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data
Classification tasks present challenges due to class imbalances and evolving data distributions. Addressing these issues requires a robust method to handle imbalances while effectively detecting out-of-distribution (OOD) samples not encountered during training. This study introduces a novel OOD detection algorithm designed for tabular datasets, titled Deep Neural Network-based Gaussian Descriptor for Imbalanced Tabular Data (DNN-GDITD). The DNN-GDITD algorithm can be placed on top of any DNN to facilitate better classification of imbalanced data and OOD detection using spherical decision boundaries. Using a combination of Push, Score-based, and focal losses, DNN-GDITD assigns confidence scores to test data points, categorizing them as known classes or as an OOD sample. Extensive experimentation on tabular datasets demonstrates the effectiveness of DNN-GDITD compared to three OOD algorithms. Evaluation encompasses imbalanced and balanced scenarios on diverse tabular datasets, including a synthetic financial dispute dataset and publicly available tabular datasets like Gas Sensor, Drive Diagnosis, and MNIST, showcasing DNN-GDITD's versatility.
comment: 17 pages
♻ ☆ MMA-MRNNet: Harnessing Multiple Models of Affect and Dynamic Masked RNN for Precise Facial Expression Intensity Estimation
This paper presents MMA-MRNNet, a novel deep learning architecture for dynamic multi-output Facial Expression Intensity Estimation (FEIE) from video data. Traditional approaches to this task often rely on complex 3-D CNNs, which require extensive pre-training and assume that facial expressions are uniformly distributed across all frames of a video. These methods struggle to handle videos of varying lengths, often resorting to ad-hoc strategies that either discard valuable information or introduce bias. MMA-MRNNet addresses these challenges through a two-stage process. First, the Multiple Models of Affect (MMA) extractor component is a Multi-Task Learning CNN that concurrently estimates valence-arousal, recognizes basic facial expressions, and detects action units in each frame. These representations are then processed by a Masked RNN component, which captures temporal dependencies and dynamically updates weights according to the true length of the input video, ensuring that only the most relevant features are used for the final prediction. The proposed unimodal non-ensemble learning MMA-MRNNet was evaluated on the Hume-Reaction dataset and demonstrated significantly superior performance, surpassing state-of-the-art methods by a wide margin, regardless of whether they were unimodal, multimodal, or ensemble approaches. Finally, we demonstrated the effectiveness of the MMA component of our proposed method across multiple in-the-wild datasets, where it consistently outperformed all state-of-the-art methods across various metrics.
♻ ☆ What Formal Languages Can Transformers Express? A Survey
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.
comment: One minor correction in {\S}5.1
♻ ☆ Decision-Focused Learning: Foundations, State of the Art, Benchmark and Future Opportunities
Decision-focused learning (DFL) is an emerging paradigm that integrates machine learning (ML) and constrained optimization to enhance decision quality by training ML models in an end-to-end system. This approach shows significant potential to revolutionize combinatorial decision-making in real-world applications that operate under uncertainty, where estimating unknown parameters within decision models is a major challenge. This paper presents a comprehensive review of DFL, providing an in-depth analysis of both gradient-based and gradient-free techniques used to combine ML and constrained optimization. It evaluates the strengths and limitations of these techniques and includes an extensive empirical evaluation of eleven methods across seven problems. The survey also offers insights into recent advancements and future research directions in DFL. Code and benchmark: https://github.com/PredOpt/predopt-benchmarks
comment: Experimental Survey and Benchmarking
♻ ☆ Can Vehicle Motion Planning Generalize to Realistic Long-tail Scenarios?
Real-world autonomous driving systems must make safe decisions in the face of rare and diverse traffic scenarios. Current state-of-the-art planners are mostly evaluated on real-world datasets like nuScenes (open-loop) or nuPlan (closed-loop). In particular, nuPlan seems to be an expressive evaluation method since it is based on real-world data and closed-loop, yet it mostly covers basic driving scenarios. This makes it difficult to judge a planner's capabilities to generalize to rarely-seen situations. Therefore, we propose a novel closed-loop benchmark interPlan containing several edge cases and challenging driving scenarios. We assess existing state-of-the-art planners on our benchmark and show that neither rule-based nor learning-based planners can safely navigate the interPlan scenarios. A recently evolving direction is the usage of foundation models like large language models (LLM) to handle generalization. We evaluate an LLM-only planner and introduce a novel hybrid planner that combines an LLM-based behavior planner with a rule-based motion planner that achieves state-of-the-art performance on our benchmark.
♻ ☆ In the Search for Optimal Multi-view Learning Models for Crop Classification with Global Remote Sensing Data
Studying and analyzing cropland is a difficult task due to its dynamic and heterogeneous growth behavior. Usually, diverse data sources can be collected for its estimation. Although deep learning models have proven to excel in the crop classification task, they face substantial challenges when dealing with multiple inputs, a setting known as Multi-View Learning (MVL). The methods used in the MVL scenario can be structured based on the encoder architecture, the fusion strategy, and the optimization technique. The literature has primarily focused on using specific encoder architectures for local regions, lacking a deeper exploration of other components in the MVL methodology. In contrast, we investigate the simultaneous selection of the fusion strategy and encoder architecture, assessing global-scale cropland and crop-type classifications. We use a range of five fusion strategies (Input, Feature, Decision, Ensemble, Hybrid) and five temporal encoders (LSTM, GRU, TempCNN, TAE, L-TAE) as possible configurations in the MVL method. We use the CropHarvest dataset for validation, which provides optical, radar, weather time series, and topographic information as input data. We found that in scenarios with a limited number of labeled samples, a unique configuration is insufficient for all the cases. Instead, a specialized combination should be meticulously sought, including an encoder and fusion strategy. To streamline this search process, we suggest identifying the optimal encoder architecture tailored for a particular fusion strategy, and then determining the most suitable fusion strategy for the classification task. We provide a methodological framework for researchers exploring crop classification through an MVL methodology.
comment: submitted to journal
♻ ☆ Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation ACL
Multi-sensor machine learning models for Earth Observation (EO) aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly for non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address generalization under missing data. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. In particular, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
comment: Accepted at the MACLEAN workshop in the ECML/PKDD 2024
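A minimal sketch of the Input Sensor Dropout idea as we read it: during training, entire sensor blocks of the multi-sensor input are randomly zeroed per sample so the model learns to cope with missing sensors at test time. Shapes, channel groupings, and the dropout rate are illustrative assumptions.

import torch

def sensor_dropout(x, sensor_slices, p=0.3):
    # x: (batch, time, channels); sensor_slices: one channel slice per sensor.
    keep = torch.ones_like(x)
    for sl in sensor_slices:
        drop = torch.rand(x.shape[0], 1, 1, device=x.device) < p   # per-sample decision per sensor
        keep[:, :, sl] = keep[:, :, sl] * (~drop).float()
    return x * keep

batch = torch.randn(4, 16, 10)                        # 4 samples, 16 timesteps, 10 channels
slices = [slice(0, 4), slice(4, 7), slice(7, 10)]     # e.g. optical, radar, weather channel blocks
augmented = sensor_dropout(batch, slices, p=0.5)
print(augmented.shape)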
♻ ☆ Scalable Glacier Mapping using Deep Learning and Open Earth Observation Data Matches the Accuracy of Manual Delineation
Accurate global glacier mapping is critical for understanding climate change impacts. Despite its importance, automated glacier mapping at a global scale remains largely unexplored. Here we address this gap and propose Glacier-VisionTransformer-U-Net (GlaViTU), a convolutional-transformer deep learning model, and five strategies for multitemporal global-scale glacier mapping using open satellite imagery. Assessing the spatial, temporal and cross-sensor generalisation shows that our best strategy achieves intersection over union >0.85 on previously unobserved images in most cases, which drops to >0.75 for debris-rich areas such as High-Mountain Asia and increases to >0.90 for regions dominated by clean ice. A comparative validation against human expert uncertainties in terms of area and distance deviations underscores GlaViTU performance, approaching or matching expert-level delineation. Adding synthetic aperture radar data, namely, backscatter and interferometric coherence, increases the accuracy in all regions where available. The calibrated confidence for glacier extents is reported making the predictions more reliable and interpretable. We also release a benchmark dataset that covers 9% of glaciers worldwide. Our results support efforts towards automated multitemporal and global glacier mapping.
comment: after major revision, expanded validation
♻ ☆ Smart E-commerce Recommendations with Semantic AI
In e-commerce, web mining for page recommendations is widely used but often fails to meet user needs. To address this, we propose a novel solution combining semantic web mining with BP neural networks. We process user search logs to extract five key features: content priority, time spent, user feedback, recommendation semantics, and input deviation. These features are then fed into a BP neural network to classify and prioritize web pages. The prioritized pages are recommended to users. Using book sales pages for testing, our results demonstrate that this solution can quickly and accurately identify the pages users need. Our approach ensures that recommendations are more relevant and tailored to individual preferences, enhancing the online shopping experience. By leveraging advanced semantic analysis and neural network techniques, we bridge the gap between user expectations and actual recommendations. This innovative method not only improves accuracy but also speeds up the recommendation process, making it a valuable tool for e-commerce platforms aiming to boost user satisfaction and engagement. Additionally, our system's ability to handle large datasets and provide real-time recommendations makes it a scalable and efficient solution for modern e-commerce challenges.
comment: 8 pages
♻ ☆ A Hybrid Framework for Spatial Interpolation: Merging Data-driven with Domain Knowledge
Estimating spatially distributed information through the interpolation of scattered observation datasets often overlooks the critical role of domain knowledge in understanding spatial dependencies. Additionally, the features of these data sets are typically limited to the spatial coordinates of the scattered observation locations. In this paper, we propose a hybrid framework that integrates data-driven spatial dependency feature extraction with rule-assisted spatial dependency function mapping to augment domain knowledge. We demonstrate the superior performance of our framework in two comparative application scenarios, highlighting its ability to capture more localized spatial features in the reconstructed distribution fields. Furthermore, we underscore its potential to enhance nonlinear estimation capabilities through the application of transformed fuzzy rules and to quantify the inherent uncertainties associated with the observation data sets. Our framework introduces an innovative approach to spatial information estimation by synergistically combining observational data with rule-assisted domain knowledge.
comment: 21 pages, 13 figures; typos corrected, references updated
♻ ☆ Open Implementation and Study of BEST-RQ for Speech Processing ICASSP 2024
Self-Supervised Learning (SSL) has proven to be useful in various speech tasks. However, these methods are generally very demanding in terms of data, memory, and computational resources. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) is an SSL method that has shown great performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods, such as wav2vec 2.0. Despite BEST-RQ's great performance, details are lacking in the original paper, such as the amount of GPU/TPU hours used in pre-training, and there is no official easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on other downstream tasks aside from ASR and speech translation. In this work, we describe a re-implementation of a Random-projection quantizer and perform a preliminary study with a comparison to wav2vec 2.0 on four downstream tasks. We discuss the details and differences of our implementation. We show that a random projection quantizer can achieve similar downstream performance as wav2vec 2.0 while decreasing training time by more than a factor of two.
comment: Accepted in IEEE ICASSP 2024 workshop on Self-supervision in Audio, Speech and Beyond (SASB 2024)
♻ ☆ Moderate Adaptive Linear Units (MoLU)
We propose a new high-performance activation function, Moderate Adaptive Linear Units (MoLU), for deep neural networks. MoLU is a simple, elegant, and powerful activation function that can serve as a good main activation function among the hundreds of existing activation functions. Because MoLU is composed of elementary functions, it is not only a diffeomorphism (i.e., analytic over its whole domain) but also reduces training time.
comment: 4 pages, 5 figures
♻ ☆ Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Token-based removal methods are one of the most prominent approaches in this direction, but risk losing the semantics of the context caused by intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique whose key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions paired with positive and negative sentences, where positives are sentences relevant to the question and negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93x faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development: https://github.com/Workday/cpc.
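The sentence-level selection step can be illustrated with a short sketch. This is not the authors' trained context-aware encoder: an off-the-shelf sentence-transformers model stands in as the relevance scorer, and the keep ratio and model name are assumptions for illustration only.

```python
# Minimal sketch of sentence-level prompt compression: score each context
# sentence against the question and keep the most relevant ones under a budget.
# Uses an off-the-shelf encoder as a stand-in for the paper's trained CPC encoder.
from sentence_transformers import SentenceTransformer, util

def compress_prompt(question: str, sentences: list[str], keep_ratio: float = 0.5) -> str:
    model = SentenceTransformer("all-MiniLM-L6-v2")    # assumed stand-in encoder
    q_emb = model.encode(question, convert_to_tensor=True)
    s_emb = model.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, s_emb)[0]             # relevance of each sentence
    k = max(1, int(len(sentences) * keep_ratio))
    top_idx = sorted(scores.topk(k).indices.tolist())  # keep original sentence order
    return " ".join(sentences[i] for i in top_idx)

context = ["Paris is the capital of France.",
           "The Eiffel Tower was completed in 1889.",
           "Bananas are rich in potassium."]
print(compress_prompt("When was the Eiffel Tower built?", context, keep_ratio=0.34))
```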
♻ ☆ Simultaneous Training of First- and Second-Order Optimizers in Population-Based Reinforcement Learning
The tuning of hyperparameters in reinforcement learning (RL) is critical, as these parameters significantly impact an agent's performance and learning efficiency. Dynamic adjustment of hyperparameters during the training process can significantly enhance both the performance and stability of learning. Population-based training (PBT) provides a method to achieve this by continuously tuning hyperparameters throughout the training. This ongoing adjustment enables models to adapt to different learning stages, resulting in faster convergence and overall improved performance. In this paper, we propose an enhancement to PBT by simultaneously utilizing both first- and second-order optimizers within a single population. We conducted a series of experiments using the TD3 algorithm across various MuJoCo environments. Our results, for the first time, empirically demonstrate the potential of incorporating second-order optimizers within PBT-based RL. Specifically, the combination of the K-FAC optimizer with Adam led to up to a 10% improvement in overall performance compared to PBT using only Adam. Additionally, in environments where Adam occasionally fails, such as the Swimmer environment, the mixed population with K-FAC exhibited more reliable learning outcomes, offering a significant advantage in training stability without a substantial increase in computational time.
comment: 8 pages, 5 figures
♻ ☆ SparQ Attention: Bandwidth-Efficient LLM Inference
The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token-generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.
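A simplified single-head sketch of the selective-fetching idea is shown below. It approximates attention scores using only the largest-magnitude query components and then fetches full key/value rows for the top-scoring positions; the parameter choices and the omission of the paper's mean-value reweighting are simplifications, not the released SparQ implementation.

```python
# Two-step sketch of SparQ-style attention over a cached history:
# (1) approximate scores using only the r largest-magnitude query components,
# (2) fetch full key/value rows only for the top-k positions.
import torch

def sparq_like_attention(q, K, V, r=16, k=64):
    # q: (d,), K: (n, d), V: (n, d_v); cached keys/values for one head
    d = q.shape[0]
    idx = q.abs().topk(min(r, d)).indices            # most informative query dims
    approx = (K[:, idx] @ q[idx]) / d**0.5           # cheap approximate scores
    top = approx.topk(min(k, K.shape[0])).indices    # positions worth fetching
    scores = (K[top] @ q) / d**0.5                   # exact scores on fetched keys
    weights = torch.softmax(scores, dim=-1)
    return weights @ V[top]                          # attention output over top-k

q = torch.randn(128)
K, V = torch.randn(1024, 128), torch.randn(1024, 128)
print(sparq_like_attention(q, K, V).shape)           # torch.Size([128])
```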
♻ ☆ Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention
Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity: its script is cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
comment: Journal Paper, 14 pages
♻ ☆ BiKC: Keypose-Conditioned Consistency Policy for Bimanual Robotic Manipulation
Bimanual manipulation tasks typically involve multiple stages which require efficient interactions between two arms, posing step-wise and stage-wise challenges for imitation learning systems. Specifically, the failure or delay of one step propagates through time, hindering the success and efficiency of each sub-stage task and thereby the overall task performance. Although recent works have made strides in addressing certain challenges, few approaches explicitly consider the multi-stage nature of bimanual tasks while simultaneously emphasizing the importance of inference speed. In this paper, we introduce a novel keypose-conditioned consistency policy tailored for bimanual manipulation. It is a hierarchical imitation learning framework that consists of a high-level keypose predictor and a low-level trajectory generator. The predicted keyposes provide guidance for trajectory generation and also mark the completion of one sub-stage task. The trajectory generator is designed as a consistency model trained from scratch without distillation, which generates action sequences conditioned on current observations and predicted keyposes with fast inference speed. Simulated and real-world experimental results demonstrate that the proposed approach surpasses baseline methods in terms of success rate and operational efficiency. Codes are available at https://github.com/ManUtdMoon/BiKC.
comment: Accepted by The 16th International Workshop on the Algorithmic Foundations of Robotics (WAFR 2024)
♻ ☆ NetMamba: Efficient Network Traffic Classification via Pre-training Unidirectional Mamba
Network traffic classification is a crucial research area aiming to enhance service quality, streamline network management, and bolster cybersecurity. To address the growing complexity of transmission encryption techniques, various machine learning and deep learning methods have been proposed. However, existing approaches face two main challenges. Firstly, they struggle with model inefficiency due to the quadratic complexity of the widely used Transformer architecture. Secondly, they suffer from inadequate traffic representation because of discarding important byte information while retaining unwanted biases. To address these challenges, we propose NetMamba, an efficient linear-time state space model equipped with a comprehensive traffic representation scheme. We adopt a specially selected and improved unidirectional Mamba architecture for the networking field, instead of the Transformer, to address efficiency issues. In addition, we design a traffic representation scheme to extract valid information from massive traffic data while removing biased information. Evaluation experiments on six public datasets encompassing three main classification tasks showcase NetMamba's superior classification performance compared to state-of-the-art baselines. It achieves accuracy rates of nearly 99% (above 99% on some tasks) across all tasks. Additionally, NetMamba demonstrates excellent efficiency, improving inference speed by up to 60 times while maintaining comparably low memory usage. Furthermore, NetMamba exhibits superior few-shot learning abilities, achieving better classification performance with fewer labeled data. To the best of our knowledge, NetMamba is the first model to tailor the Mamba architecture for networking.
♻ ☆ From Categories to Classifiers: Name-Only Continual Learning by Exploring the Web
Continual Learning (CL) often relies on the availability of extensive annotated datasets, an assumption that is unrealistically time-consuming and costly in practice. We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation. In this scenario, learners adapt to new category shifts using only category names without the luxury of annotated training data. Our proposed solution leverages the expansive and ever-evolving internet to query and download uncurated webly-supervised data for image classification. We investigate the reliability of our web data and find them comparable, and in some cases superior, to manually annotated datasets. Additionally, we show that by harnessing the web, we can create support sets that surpass state-of-the-art name-only classification methods that create support sets using generative models or image retrieval from LAION-5B, achieving up to a 25% boost in accuracy. When applied across varied continual learning contexts, our method consistently exhibits a small performance gap in comparison to models trained on manually annotated datasets. We present EvoTrends, a class-incremental dataset made from the web to capture real-world trends, created in just minutes. Overall, this paper underscores the potential of using uncurated webly-supervised data to mitigate the challenges associated with manual data labeling in continual learning.
♻ ☆ A Systematic Review on Sleep Stage Classification and Sleep Disorder Detection Using Artificial Intelligence
Sleep is vital for people's physical and mental health, and sound sleep can help them focus on daily activities. Therefore, a sleep study that includes sleep patterns and sleep disorders is crucial to enhancing our knowledge about individuals' health status. This study aims to provide a comprehensive, systematic review of the recent literature to analyze the different approaches and their outcomes in sleep studies, which includes works on "sleep stages classification" and "sleep disorder detection" using AI. In this review, 183 articles were initially selected from different journals, among which 80 records, ranging from 2016 to 2023, were chosen for detailed review. Brain waves were the most commonly employed body parameters for sleep staging and disorder studies (almost 29% of the studies used brain activity signals exclusively, and 77% used them in combination with other signals). The convolutional neural network (CNN) was the most widely used of the 34 distinct artificial intelligence models, appearing in 27% of the studies. The other models included the long short-term memory (LSTM), support vector machine (SVM), random forest (RF), and recurrent neural network (RNN), which accounted for 11%, 6%, 6%, and 5%, respectively. For performance metrics, accuracy was the most widely used, reported in 83.75% of cases, followed by the F1 score (45%), Kappa (36.25%), sensitivity (31.25%), and specificity (30%), along with other metrics. This article helps physicians and researchers grasp AI's contribution to sleep studies and assess the feasibility of their intended work.
comment: 39 pages, 11 Figures, 8 Tables
♻ ☆ The Fault in our Stars: Quality Assessment of Code Generation Benchmarks SC
Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can provide a false sense of performance. In this work, we conduct the first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models. To conduct this study, we analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues in them. We also investigated whether fixing the identified quality issues in the benchmarks' prompts affects a model's performance. We also studied memorization issues of the evaluation dataset, which can put into question a benchmark's trustworthiness. We found that code generation evaluation benchmarks mainly focused on Python and coding exercises and had very limited contextual dependencies to challenge the model. These datasets and the developers' prompts suffer from quality issues like spelling and grammatical errors, unclear sentences that fail to express developers' intent, and improper documentation style. Fixing all these issues in the benchmarks can lead to better performance for Python code generation, but no significant improvement was observed for Java code generation. We also found evidence that GPT-3.5-Turbo and CodeGen-2.5 models may have data contamination issues.
comment: Accepted at the 24th IEEE International Conference on Source Code Analysis and Manipulation(SCAM 2024) Research Track
♻ ☆ Sample Complexity of Variance-reduced Distributionally Robust Q-learning
Dynamic decision-making under distributional shifts is of fundamental interest in theory and applications of reinforcement learning: The distribution of the environment in which the data is collected can differ from that of the environment in which the model is deployed. This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts. These algorithms are designed to efficiently approximate the $q$-function of an infinite-horizon $\gamma$-discounted robust Markov decision process with Kullback-Leibler ambiguity set to an entry-wise $\epsilon$-degree of precision. Further, the variance-reduced distributionally robust Q-learning combines the synchronous Q-learning with variance-reduction techniques to enhance its performance. Consequently, we establish that it attains a minimax sample complexity upper bound of $\tilde O(|\mathbf{S}||\mathbf{A}|(1-\gamma)^{-4}\epsilon^{-2})$, where $\mathbf{S}$ and $\mathbf{A}$ denote the state and action spaces. This is the first complexity result that is independent of the ambiguity size $\delta$, thereby providing new complexity theoretic insights. Additionally, a series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.
♻ ☆ Predicting and Interpreting Energy Barriers of Metallic Glasses with Graph Neural Networks ICML 2024
Metallic Glasses (MGs) are widely used materials that are stronger than steel while being shapeable as plastic. While understanding the structure-property relationship of MGs remains a challenge in materials science, studying their energy barriers (EBs) as an intermediary step shows promise. In this work, we utilize Graph Neural Networks (GNNs) to model MGs and study EBs. We contribute a new dataset for EB prediction and a novel Symmetrized GNN (SymGNN) model that is E(3)-invariant in expectation. SymGNN handles invariance by aggregating over orthogonal transformations of the graph structure. When applied to EB prediction, SymGNN is more accurate than molecular dynamics (MD) local-sampling methods and other machine-learning models. Compared to precise MD simulations, SymGNN reduces the inference time on new MGs from roughly 41 days to less than one second. We apply explanation algorithms to reveal the relationship between structures and EBs. The structures that we identify through explanations match the medium-range order (MRO) hypothesis and possess unique topological properties. Our work enables effective prediction and interpretation of MG EBs, bolstering material science research.
comment: ICML 2024. Code available at https://github.com/haoyuli02/SymGNN
♻ ☆ Semi-Decentralized Federated Edge Learning for Fast Convergence on Non-IID Data
Federated edge learning (FEEL) has emerged as an effective approach to reduce the large communication latency in Cloud-based machine learning solutions, while preserving data privacy. Unfortunately, the learning performance of FEEL may be compromised due to limited training data in a single edge cluster. In this paper, we investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL). By allowing model aggregation across different edge clusters, SD-FEEL enjoys the benefit of FEEL in reducing the training latency, while improving the learning performance by accessing richer training data from multiple edge clusters. A training algorithm for SD-FEEL is presented, with three main procedures in each round: local model updates, intra-cluster model aggregation, and inter-cluster model aggregation; the algorithm is proved to converge on non-independent and identically distributed (non-IID) data. We also characterize the interplay between the network topology of the edge servers and the communication overhead of inter-cluster model aggregation on the training performance. Experiment results corroborate our analysis and demonstrate the effectiveness of SD-FEEL in achieving faster convergence than traditional federated learning architectures. In addition, guidelines on choosing critical hyper-parameters of the training algorithm are provided.
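The two-level aggregation at the heart of this framework can be sketched in a few lines. Uniform weighting and a fully connected server topology are simplifying assumptions; the paper studies more general topologies and convergence conditions.

```python
# Minimal sketch of semi-decentralized aggregation: clients within each edge
# cluster are averaged first (intra-cluster), then edge servers average among
# themselves (inter-cluster) and broadcast the result back.
from typing import Dict, List
import torch

def average(models: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    return {k: torch.stack([m[k] for m in models]).mean(dim=0) for k in models[0]}

def sd_feel_round(clusters: List[List[Dict[str, torch.Tensor]]]) -> List[Dict[str, torch.Tensor]]:
    edge_models = [average(client_models) for client_models in clusters]  # intra-cluster
    global_model = average(edge_models)                                   # inter-cluster
    return [global_model for _ in clusters]                               # broadcast back

# toy example: 2 clusters, 3 clients each, a single-parameter "model"
clusters = [[{"w": torch.randn(4)} for _ in range(3)] for _ in range(2)]
print(sd_feel_round(clusters)[0]["w"])
```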
♻ ☆ EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification
Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the convexity and calibration (CC) properties of a loss function ensure consistency of ERM in maximizing accuracy, thereby offering a wide range of options for surrogate losses. In this article, we propose a novel ensemble method, namely EnsLoss, which extends the ensemble learning concept to combine loss functions within the ERM framework. A key feature of our method is that it preserves the "legitimacy" of the combined losses, i.e., ensures the CC properties. Specifically, we first transform the CC conditions of losses into conditions on loss-derivatives, thereby bypassing the need for explicit loss functions and directly generating calibrated loss-derivatives. Then, inspired by Dropout, EnsLoss enables loss ensembles through one training process with doubly stochastic gradient descent (i.e., random batch samples and random calibrated loss-derivatives). We theoretically establish the statistical consistency of our approach and provide insights into its benefits. The numerical effectiveness of EnsLoss compared to fixed loss methods is demonstrated through experiments on a broad range of 14 OpenML tabular datasets and 46 image datasets with various deep learning architectures. Python repository and source code are available on GitHub at https://github.com/statmlben/ensloss.
comment: 31 pages; 4 figures
♻ ☆ A Confidence Interval for the $\ell_2$ Expected Calibration Error
Recent advances in machine learning have significantly improved prediction accuracy in various applications. However, ensuring the calibration of probabilistic predictions remains a significant challenge. Despite efforts to enhance model calibration, the rigorous statistical evaluation of model calibration remains less explored. In this work, we develop confidence intervals for the $\ell_2$ Expected Calibration Error (ECE). We consider top-1-to-$k$ calibration, which includes both the popular notion of confidence calibration as well as full calibration. For a debiased estimator of the ECE, we show asymptotic normality, but with different convergence rates and asymptotic variances for calibrated and miscalibrated models. We develop methods to construct asymptotically valid confidence intervals for the ECE, accounting for this behavior as well as non-negativity. Our theoretical findings are supported through extensive experiments, showing that our methods produce valid confidence intervals with shorter lengths compared to those obtained by resampling-based methods.
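For orientation, the quantity being interval-estimated can be computed with a standard plug-in binned estimator; the sketch below pairs it with a naive bootstrap interval of the resampling kind the paper compares against. The paper's debiased estimator and asymptotic intervals are not reproduced here, and the binning choice is an assumption.

```python
# Plug-in estimate of the binned L2 expected calibration error, plus a naive
# bootstrap confidence interval (a resampling-based baseline, not the paper's method).
import numpy as np

def l2_ece(conf, correct, n_bins=10):
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece2 = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = correct[mask].mean() - conf[mask].mean()   # accuracy minus confidence
            ece2 += mask.mean() * gap**2                     # bin-weighted squared gap
    return np.sqrt(ece2)

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=2000)
correct = (rng.uniform(size=2000) < conf).astype(float)      # well-calibrated toy data
point = l2_ece(conf, correct)
boots = []
for _ in range(200):
    idx = rng.integers(0, len(conf), len(conf))              # resample with replacement
    boots.append(l2_ece(conf[idx], correct[idx]))
print(point, np.percentile(boots, [2.5, 97.5]))
```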
♻ ☆ A Novel Approach to Classify Power Quality Signals Using Vision Transformers
With the rapid integration of electronically interfaced renewable energy resources and loads into smart grids, there is increasing interest in power quality disturbances (PQD) classification to enhance the security and efficiency of these grids. This paper introduces a new approach to PQD classification based on the Vision Transformer (ViT) model. When a PQD occurs, the proposed approach first converts the power quality signal into an image and then utilizes a pre-trained ViT to accurately determine the class of the PQD. Unlike most previous works, which were limited to a few disturbance classes or small datasets, the proposed method is trained and tested on a large dataset with 17 disturbance classes. Our experimental results show that the proposed ViT-based approach achieves PQD classification precision and recall of 98.28% and 97.98%, respectively, outperforming recently proposed techniques applied to the same dataset.
comment: IECON 2024-50th Annual Conference of the IEEE Industrial Electronics Society, Chicago, U.S.A, 2024, pp. 1-6
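The signal-to-image-to-ViT pipeline described above can be sketched as follows. The exact image transform used in the paper is not specified here, so a simple reshaping of the normalized waveform into a 2D grayscale matrix (tiled to 3 channels and resized to the ViT input size) stands in as an assumed choice.

```python
# Hedged sketch: convert a 1D power-quality waveform to an image and classify it
# with a pre-trained Vision Transformer whose head is replaced for 17 classes.
import torch
import torch.nn.functional as F
from torchvision.models import vit_b_16, ViT_B_16_Weights

def signal_to_image(signal: torch.Tensor, side: int = 64) -> torch.Tensor:
    x = signal[: side * side].reshape(1, 1, side, side)        # 1D waveform -> 2D grid
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)             # normalize to [0, 1]
    x = F.interpolate(x, size=(224, 224), mode="bilinear")     # ViT input resolution
    return x.repeat(1, 3, 1, 1)                                # grayscale -> 3 channels

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = torch.nn.Linear(768, 17)     # replace head for 17 disturbance classes
model.eval()

waveform = torch.sin(torch.linspace(0, 60 * 3.1416, 64 * 64))  # toy PQ signal
with torch.no_grad():
    logits = model(signal_to_image(waveform))
print(logits.shape)  # torch.Size([1, 17])
```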
♻ ☆ Graph-Based Bidirectional Transformer Decision Threshold Adjustment Algorithm for Class-Imbalanced Molecular Data
Data sets with imbalanced class sizes, where one class size is much smaller than that of others, occur exceedingly often in many applications, including those with biological foundations, such as disease diagnosis and drug discovery. Therefore, it is extremely important to be able to identify data elements of classes of various sizes, as a failure to do so can result in heavy costs. Nonetheless, many data classification procedures do not perform well on imbalanced data sets as they often fail to detect elements belonging to underrepresented classes. In this work, we propose the BTDT-MBO algorithm, incorporating Merriman-Bence-Osher (MBO) approaches and a bidirectional transformer, as well as distance correlation and decision threshold adjustments, for data classification tasks on highly imbalanced molecular data sets, where the sizes of the classes vary greatly. The proposed technique not only integrates adjustments in the classification threshold for the MBO algorithm in order to help deal with the class imbalance, but also uses a bidirectional transformer procedure based on an attention mechanism for self-supervised learning. In addition, the model implements distance correlation as a weight function for the similarity graph-based framework on which the adjusted MBO algorithm operates. The proposed method is validated using six molecular data sets and compared to other related techniques. The computational experiments show that the proposed technique is superior to competing approaches even in the case of a high class imbalance ratio.
♻ ☆ CCPL: Cross-modal Contrastive Protein Learning ICPR 2024
Effective protein representation learning is crucial for predicting protein functions. Traditional methods often pretrain protein language models on large, unlabeled amino acid sequences, followed by finetuning on labeled data. While effective, these methods underutilize the potential of protein structures, which are vital for function determination. Common structural representation techniques rely heavily on annotated data, limiting their generalizability. Moreover, structural pretraining methods, similar to natural language pretraining, can distort actual protein structures. In this work, we introduce a novel unsupervised protein structure representation pretraining method, cross-modal contrastive protein learning (CCPL). CCPL leverages a robust protein language model and uses unsupervised contrastive alignment to enhance structure learning, incorporating self-supervised structural constraints to maintain intrinsic structural information. We evaluated our model across various benchmarks, demonstrating the framework's superiority.
comment: Accepted to ICPR 2024
♻ ☆ Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning ECCV 2024
Federated Class Continual Learning (FCCL) merges the challenges of distributed client learning with the need for seamless adaptation to new classes without forgetting old ones. The key challenge in FCCL is catastrophic forgetting, an issue that has been explored to some extent in Continual Learning (CL). However, due to privacy preservation requirements, some conventional methods, such as experience replay, are not directly applicable to FCCL. Existing FCCL methods mitigate forgetting by generating historical data through federated training of GANs or data-free knowledge distillation. However, these approaches often suffer from unstable training of generators or low-quality generated data, limiting their guidance for the model. To address this challenge, we propose a novel method of data replay based on diffusion models. Instead of training a diffusion model, we employ a pre-trained conditional diffusion model to reverse-engineer each class, searching the corresponding input conditions for each class within the model's input space, significantly reducing computational resources and time consumption while ensuring effective generation. Furthermore, we enhance the classifier's domain generalization ability on generated and real data through contrastive learning, indirectly improving the representational capability of generated data for real data. Comprehensive experiments demonstrate that our method significantly outperforms existing baselines. Code is available at https://github.com/jinglin-liang/DDDR.
comment: Accepted by ECCV 2024 Oral
♻ ☆ Small noise analysis for Tikhonov and RKHS regularizations
Regularization plays a pivotal role in ill-posed machine learning and inverse problems. However, the fundamental comparative analysis of various regularization norms remains open. We establish a small noise analysis framework to assess the effects of norms in Tikhonov and RKHS regularizations, in the context of ill-posed linear inverse problems with Gaussian noise. This framework studies the convergence rates of regularized estimators in the small noise limit and reveals the potential instability of the conventional L2-regularizer. We solve such instability by proposing an innovative class of adaptive fractional RKHS regularizers, which covers the L2 Tikhonov and RKHS regularizations by adjusting the fractional smoothness parameter. A surprising insight is that over-smoothing via these fractional RKHSs consistently yields optimal convergence rates, but the optimal hyper-parameter may decay too fast to be selected in practice.
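As a small, self-contained illustration (not taken from the paper) of the L2 Tikhonov regularizer whose small-noise behavior is analyzed above, the sketch below solves a noisy under-determined linear inverse problem $y = Ax + \epsilon$ via $\hat{x} = (A^\top A + \lambda I)^{-1} A^\top y$ for several regularization strengths; the problem sizes and noise level are arbitrary.

```python
# Numerical illustration of the L2 Tikhonov estimator for an ill-posed problem:
#   x_hat = argmin ||A x - y||^2 + lam * ||x||^2 = (A^T A + lam I)^(-1) A^T y
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 80                                    # under-determined => ill-posed
A = rng.standard_normal((n, p))
x_true = np.zeros(p); x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(n)   # small-noise regime

for lam in (1e-4, 1e-2, 1.0):
    x_hat = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)
    print(f"lambda={lam:g}  error={np.linalg.norm(x_hat - x_true):.3f}")
```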
♻ ☆ Stacked ensemble-based mutagenicity prediction model using multiple modalities with graph attention network
Mutagenicity is a concern due to its association with genetic mutations which can result in a variety of negative consequences, including the development of cancer. Earlier identification of mutagenic compounds in the drug development process is therefore crucial for preventing the progression of unsafe candidates and reducing development costs. While computational techniques, especially machine learning models, have become increasingly prevalent for this endpoint, they rely on a single modality. In this work, we introduce a novel stacked ensemble-based mutagenicity prediction model which incorporates multiple modalities such as the simplified molecular input line entry system (SMILES) and the molecular graph. These modalities capture diverse information about molecules, such as substructural, physicochemical, geometrical, and topological features. To derive substructural, geometrical, and physicochemical information, we use SMILES, while topological information is extracted through a graph attention network (GAT) via the molecular graph. Our model uses a stacked ensemble of machine learning classifiers to make predictions using these multiple features. We employ the explainable artificial intelligence (XAI) technique SHAP (Shapley Additive Explanations) to determine the significance of each classifier and the most relevant features in the prediction. We demonstrate that our method surpasses SOTA methods on two standard datasets across various metrics. Notably, we achieve an area under the curve of 95.21% on the Hansen benchmark dataset, affirming the efficacy of our method in predicting mutagenicity. We believe that this research will captivate the interest of both clinicians and computational biologists engaged in translational research.
comment: Submitted to a journal
♻ ☆ OpenVLA: An Open-Source Vision-Language-Action Model
Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
comment: Website: https://openvla.github.io/
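The consumer-GPU fine-tuning recipe mentioned above relies on low-rank adaptation. The sketch below shows a generic LoRA setup with the Hugging Face `peft` library; the base model, target module names, and hyperparameters are illustrative assumptions and not the OpenVLA release itself.

```python
# Hedged sketch of parameter-efficient fine-tuning with LoRA on a stand-in
# language backbone; only the low-rank adapter weights are trained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/opt-350m"                            # small stand-in backbone
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

lora = LoraConfig(r=32, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],  # assumed projection names
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()    # only the adapters are trainable
# ...standard fine-tuning loop on (observation, instruction) -> action-token data...
```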
♻ ☆ SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives, than generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.
♻ ☆ Thresholded Lexicographic Ordered Multiobjective Reinforcement Learning ECAI 2024
Lexicographic multi-objective problems, which impose a lexicographic importance order over the objectives, arise in many real-life scenarios. Existing Reinforcement Learning work directly addressing lexicographic tasks has been scarce. The few proposed approaches were all noted to be heuristics without theoretical guarantees as the Bellman equation is not applicable to them. Additionally, the practical applicability of these prior approaches also suffers from various issues such as not being able to reach the goal state. While some of these issues have been known before, in this work we investigate further shortcomings, and propose fixes for improving practical performance in many cases. We also present a policy optimization approach using our Lexicographic Projection Optimization (LPO) algorithm that has the potential to address these theoretical and practical concerns. Finally, we demonstrate our proposed algorithms on benchmark problems.
comment: Full version of ECAI 2024 paper
♻ ☆ LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
♻ ☆ Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at https://github.com/ContextualAI/CLAIR_and_APO.
♻ ☆ From Lab to Field: Real-World Evaluation of an AI-Driven Smart Video Solution to Enhance Community Safety
This article adopts and evaluates an AI-enabled Smart Video Solution (SVS) designed to enhance safety in the real world. The system integrates with existing infrastructure camera networks, leveraging recent advancements in AI for easy adoption. Prioritizing privacy and ethical standards, pose-based data is used for downstream AI tasks such as anomaly detection. A cloud-based infrastructure and a mobile app are deployed, enabling real-time alerts within communities. The SVS employs innovative data representation and visualization techniques, such as the Occupancy Indicator, Statistical Anomaly Detection, Bird's Eye View, and Heatmaps, to understand pedestrian behaviors and enhance public safety. Evaluation of the SVS demonstrates its capacity to convert complex computer vision outputs into actionable insights for stakeholders, community partners, law enforcement, urban planners, and social scientists. This article presents a comprehensive real-world deployment and evaluation of the SVS, implemented in a community college environment across 16 cameras. The system integrates AI-driven visual processing, supported by statistical analysis, database management, cloud communication, and user notifications. Additionally, the article evaluates the end-to-end latency from the moment an AI algorithm detects anomalous behavior in real-time at the camera level to the time stakeholders receive a notification. The results demonstrate the system's robustness, effectively managing 16 CCTV cameras with a consistent throughput of 16.5 frames per second (FPS) over a 21-hour period and an average end-to-end latency of 26.76 seconds between anomaly detection and alert issuance.
♻ ☆ Spectral-Aware Augmentation for Enhanced Graph Representation Learning
Graph Contrastive Learning (GCL) has demonstrated remarkable effectiveness in learning representations on graphs in recent years. To generate ideal augmentation views, the augmentation generation methods should preserve essential information while discarding less relevant details for downstream tasks. However, current augmentation methods usually involve random topology corruption in the spatial domain, which fails to adequately address information spread across different frequencies in the spectral domain. Our preliminary study highlights this issue, demonstrating that spatial random perturbations impact all frequency bands almost uniformly. Given that task-relevant information typically resides in specific spectral regions that vary across graphs, this one-size-fits-all approach can pose challenges. We argue that indiscriminate spatial random perturbation might unintentionally weaken task-relevant information, reducing its effectiveness. To tackle this challenge, we propose applying perturbations selectively, focusing on information specific to different frequencies across diverse graphs. In this paper, we present GASSER, a model that applies tailored perturbations to specific frequencies of graph structures in the spectral domain, guided by spectral hints. Through extensive experimentation and theoretical analysis, we demonstrate that the augmentation views generated by GASSER are adaptive, controllable, and intuitively aligned with the homophily ratios and spectrum of graph structures.
♻ ☆ GCEPNet: Graph Convolution-Enhanced Expectation Propagation for Massive MIMO Detection
Massive MIMO (multiple-input multiple-output) detection is an important topic in wireless communication, and various machine learning based methods have been developed recently for this task. Expectation Propagation (EP) and its variants are widely used for MIMO detection and have achieved the best performance. However, EP-based solvers fail to capture the correlation between unknown variables, leading to a loss of information, and in addition, they are computationally expensive. In this paper, we show that the real-valued system can be modeled as spectral signal convolution on a graph, through which the correlation between unknown variables can be captured. Based on such analysis, we propose graph convolution-enhanced expectation propagation (GCEPNet). GCEPNet incorporates data-dependent attention scores into the Chebyshev polynomial for powerful graph convolution with better generalization capacity. It enables a better estimation of the cavity distribution for EP and empirically achieves the state-of-the-art (SOTA) MIMO detection performance with much faster inference speed. To our knowledge, we are the first to shed light on the connection between the system model and graph convolution, and the first to design the data-dependent coefficients for graph convolution.
comment: In IEEE GLOBECOM 2024 Conference Proceedings
♻ ☆ OceanNet: A principled neural operator-based digital twin for regional oceans
While data-driven approaches demonstrate great potential in atmospheric modeling and weather forecasting, ocean modeling poses distinct challenges due to complex bathymetry, land, vertical structure, and flow non-linearity. This study introduces OceanNet, a principled neural operator-based digital twin for ocean circulation. OceanNet uses a Fourier neural operator and predictor-evaluate-corrector integration scheme to mitigate autoregressive error growth and enhance stability over extended time scales. A spectral regularizer counteracts spectral bias at smaller scales. OceanNet is applied to the northwest Atlantic Ocean western boundary current (the Gulf Stream), focusing on the task of seasonal prediction for Loop Current eddies and the Gulf Stream meander. Trained using historical sea surface height (SSH) data, OceanNet demonstrates competitive forecast skill by outperforming SSH predictions by an uncoupled, state-of-the-art dynamical ocean model forecast, reducing computation by 500,000 times. These accomplishments demonstrate the potential of physics-inspired deep neural operators as cost-effective alternatives to high-resolution numerical ocean models.
comment: Supplementary information can be found in: https://drive.google.com/file/d/1NoxJLa967naJT787a5-IfZ7f_MmRuZMP/view?usp=sharing
♻ ☆ Counterpart Fairness -- Addressing Systematic between-group Differences in Fairness Evaluation
When using machine learning (ML) to aid decision-making, it is critical to ensure that an algorithmic decision is fair and does not discriminate against specific individuals/groups, particularly those from underprivileged populations. Existing group fairness methods aim to ensure equal outcomes (such as loan approval rates) across groups delineated by protected variables like race or gender. However, these methods overlook the intricate, inherent differences among these groups that could influence outcomes. The confounding factors, which are non-protected variables but manifest systematic differences, can significantly affect fairness evaluation. Therefore, we recommend a more refined and comprehensive approach that accounts for both the systematic differences within groups and the multifaceted, intertwined confounding effects. We proposed a fairness metric based on counterparts (i.e., individuals who are similar with respect to the task of interest) from different groups, whose group identities cannot be distinguished algorithmically by exploring confounding factors. We developed a propensity-score-based method for identifying counterparts, avoiding the issue of comparing "oranges" with "apples". In addition, we introduced a counterpart-based statistical fairness index, called Counterpart-Fairness (CFair), to assess the fairness of ML models. Various empirical studies were conducted to validate the effectiveness of CFair.
comment: 24 pages, 9 figures, 14 tables
♻ ☆ Towards Understanding Neural Collapse: The Effects of Batch Normalization and Weight Decay
Neural Collapse (NC) is a geometric structure recently observed at the terminal phase of training deep neural networks, which states that last-layer feature vectors for the same class would "collapse" to a single point, while features of different classes become equally separated. We demonstrate that batch normalization (BN) and weight decay (WD) critically influence the emergence of NC. In the near-optimal loss regime, we establish an asymptotic lower bound on the emergence of NC that depends only on the WD value, training loss, and the presence of last-layer BN. Our experiments substantiate theoretical insights by showing that models demonstrate a stronger presence of NC with BN, appropriate WD values, lower loss, and lower last-layer feature norm. Our findings offer a novel perspective in studying the role of BN and WD in shaping neural network features.
♻ ☆ Language-Guided World Models: A Model-Based Approach to AI Control ACL 2024
This paper introduces the concept of Language-Guided World Models (LWMs) -- probabilistic models that can simulate environments by reading texts. Agents equipped with these models provide humans with more extensive and efficient control, allowing them to simultaneously alter agent behaviors in multiple tasks via natural verbal communication. In this work, we take initial steps in developing robust LWMs that can generalize to compositionally novel language descriptions. We design a challenging world modeling benchmark based on the game of MESSENGER (Hanjie et al., 2021), featuring evaluation settings that require varying degrees of compositional generalization. Our experiments reveal the lack of generalizability of the state-of-the-art Transformer model, as it offers marginal improvements in simulation quality over a no-text baseline. We devise a more robust model by fusing the Transformer with the EMMA attention mechanism (Hanjie et al., 2021). Our model substantially outperforms the Transformer and approaches the performance of a model with an oracle semantic parsing and grounding capability. To demonstrate the practicality of this model in improving AI safety and transparency, we simulate a scenario in which the model enables an agent to present plans to a human before execution, and to revise plans based on their language feedback.
comment: SpLU-RoboNLP workshop at ACL 2024
Information Retrieval 21
☆ Bioinformatics Retrieval Augmentation Data (BRAD) Digital Assistant
We present a prototype for a Bioinformatics Retrieval Augmentation Data (BRAD) digital assistant. BRAD integrates a suite of tools to handle a wide range of bioinformatics tasks, from code execution to online search. We demonstrate BRAD's capabilities through (1) improved question-and-answering with retrieval augmented generation (RAG), (2) BRAD's ability to run and write complex software pipelines, and (3) BRAD's ability to organize and distribute tasks across individual agents and teams of agents. We use BRAD for automation of bioinformatics workflows, performing tasks ranging from gene enrichment and searching the archive to automatic code generation and running biomarker identification pipelines. BRAD is a step toward the ultimate goal of developing a digital twin of laboratories, driven by self-contained loops for hypothesis generation and testing of digital biology experiments.
☆ Building a Scalable, Effective, and Steerable Search and Ranking Platform
Modern e-commerce platforms offer vast product selections, making it difficult for customers to find items that they like and that are relevant to their current session intent. This is why it is key for e-commerce platforms to have scalable and adaptable near real-time personalized ranking and search systems. While numerous methods exist in the scientific literature for building such systems, many are unsuitable for large-scale industrial use due to complexity and performance limitations. Consequently, industrial ranking systems often resort to computationally efficient yet simplistic retrieval or candidate generation approaches, which overlook near real-time and heterogeneous customer signals, resulting in a less personalized and relevant experience. Moreover, related customer experiences are served by completely different systems, which increases complexity and maintenance effort and leads to inconsistent experiences. In this paper, we present a personalized, adaptable near real-time ranking platform that is reusable across various use cases, such as browsing and search, and that is able to cater to millions of items and customers under heavy load (thousands of requests per second). We employ transformer-based models through different ranking layers which can learn complex behavior patterns directly from customer action sequences while being able to incorporate temporal (e.g. in-session) and contextual information. We validate our system through a series of comprehensive offline and online real-world experiments at a large online e-commerce platform, and we demonstrate its superiority when compared to existing systems, both in terms of customer experience and net revenue. Finally, we share the lessons learned from building a comprehensive, modern ranking platform for use in a large-scale e-commerce environment.
☆ Pooling And Attention: What Are Effective Designs For LLm-Based Embedding Models?
The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
comment: https://github.com/yixuantt/PoolingAndAttn
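The two simplest pooling choices compared in the study can be sketched directly from a batch of padded hidden states: mean pooling over non-padding tokens versus taking the hidden state of the last (EOS) token. The proposed multi-layer trainable pooling is not reproduced here; shapes below are arbitrary.

```python
# Two pooling strategies for turning per-token hidden states into one embedding.
import torch

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # hidden: (B, T, H), mask: (B, T) with 1 for real tokens, 0 for padding
    m = mask.unsqueeze(-1).float()
    return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-6)

def last_token_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    last_idx = mask.sum(dim=1) - 1                       # position of final real token
    return hidden[torch.arange(hidden.size(0)), last_idx]

hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
print(mean_pool(hidden, mask).shape, last_token_pool(hidden, mask).shape)
```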
☆ RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models
Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. While this approach can produce a retriever with reasonable overall performance, models trained on domain-specific data often yield better results within their respective domains. While prior work in information retrieval has tackled this through multi-task training, the topic of combining multiple domain-specific expert retrievers remains unexplored, despite its popularity in language model generation. In this work, we introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts along with a routing mechanism to select the most appropriate expert for each query. It is lightweight and allows easy addition or removal of experts without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both MSMARCO-trained (+2.1 absolute nDCG@10) and multi-task trained (+3.2) models. This is achieved by employing our routing mechanism, which surpasses other routing techniques (+1.8 on average) commonly used in language modeling. Furthermore, the benefit generalizes well to other datasets, even in the absence of a specific expert on the dataset. To our knowledge, RouterRetriever is the first work to demonstrate the advantages of using multiple domain-specific expert embedding models with effective routing over a single, general-purpose embedding model in retrieval tasks.
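A simplified sketch of the routing step is given below. The abstract does not spell out the routing mechanism, so this stand-in routes each query by cosine similarity to a precomputed centroid per expert's domain; the centroids and domains are toy values for illustration.

```python
# Route a query embedding to the most similar domain-expert retriever.
import numpy as np

def route(query_emb: np.ndarray, expert_centroids: dict) -> str:
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(expert_centroids, key=lambda name: cos(query_emb, expert_centroids[name]))

centroids = {"biomedical": np.array([0.9, 0.1, 0.0]),
             "finance":    np.array([0.1, 0.9, 0.0]),
             "general":    np.array([0.4, 0.4, 0.4])}
print(route(np.array([0.8, 0.2, 0.1]), centroids))   # -> "biomedical"
```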
☆ A Fashion Item Recommendation Model in Hyperbolic Space CVPR 2024
In this work, we propose a fashion item recommendation model that incorporates hyperbolic geometry into user and item representations. Using hyperbolic space, our model aims to capture implicit hierarchies among items based on their visual data and users' purchase history. During training, we apply a multi-task learning framework that considers both hyperbolic and Euclidean distances in the loss function. Our experiments on three data sets show that our model performs better than previous models trained in Euclidean space only, confirming the effectiveness of our model. Our ablation studies show that multi-task learning plays a key role, and removing the Euclidean loss substantially deteriorates the model performance.
comment: This work was presented at the CVFAD Workshop at CVPR 2024
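The hyperbolic side of the multi-task loss above relies on a distance in hyperbolic space. The sketch below implements the standard Poincaré-ball distance, which such a model can use alongside the Euclidean distance; it is the textbook formula, not the paper's exact training code.

```python
# Poincare-ball distance between points inside the unit ball:
#   d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    sq = ((u - v) ** 2).sum(dim=-1)
    nu = (u ** 2).sum(dim=-1).clamp(max=1 - eps)
    nv = (v ** 2).sum(dim=-1).clamp(max=1 - eps)
    return torch.acosh(1 + 2 * sq / ((1 - nu) * (1 - nv)) + eps)

user = torch.tensor([0.1, 0.2])    # toy user embedding inside the unit ball
item = torch.tensor([0.5, -0.3])   # toy item embedding inside the unit ball
print(poincare_distance(user, item))
```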
☆ AlignGroup: Learning and Aligning Group Consensus with Member Preferences for Group Recommendation CIKM 2024
Group activities are important behaviors in human society, and providing personalized recommendations for groups is referred to as the group recommendation task. Existing methods can usually be categorized into two strategies to infer group preferences: 1) determining group preferences by aggregating members' personalized preferences, and 2) inferring group consensus by capturing group members' coherent decisions after common compromises. However, the former suffers from a lack of group-level considerations, and the latter overlooks the fine-grained preferences of individual users. To this end, we propose a novel group recommendation method AlignGroup, which focuses on both group consensus and individual preferences of group members to infer the group decision-making. Specifically, AlignGroup explores group consensus through a well-designed hypergraph neural network that efficiently learns intra- and inter-group relationships. Moreover, AlignGroup innovatively utilizes a self-supervised alignment task to capture fine-grained group decision-making by aligning the group consensus with members' common preferences. Extensive experiments on two real-world datasets validate that our AlignGroup outperforms the state-of-the-art on both the group recommendation task and the user recommendation task, and is also more efficient than most baselines.
comment: 10 pages, accepted by CIKM 2024
☆ iRangeGraph: Improvising Range-dedicated Graphs for Range-filtering Nearest Neighbor Search SIGMOD 2025
Range-filtering approximate nearest neighbor (RFANN) search is attracting increasing attention in academia and industry. Given a set of data objects, each being a pair of a high-dimensional vector and a numeric value, an RFANN query with a vector and a numeric range as parameters returns the data object whose numeric value is in the query range and whose vector is nearest to the query vector. To process this query, a recent study proposes to build $O(n^2)$ dedicated graph-based indexes for all possible query ranges to enable efficient processing on a database of $n$ objects. As storing all these indexes is prohibitively expensive, the study constructs compressed indexes instead, which reduces the memory consumption considerably. However, this incurs suboptimal performance because the compression is lossy. In this study, instead of materializing a compressed index for every possible query range in preparation for querying, we materialize graph-based indexes, called elemental graphs, for a moderate number of ranges. We then provide an effective and efficient algorithm that during querying can construct an index for any query range using the elemental graphs. We prove that the time needed to construct such an index is low. We also cover an experimental study on real-world datasets that provides evidence that the materialized elemental graphs only consume moderate space and that the proposed method is capable of superior and stable query performance across different query workloads.
comment: The paper has been accepted by SIGMOD 2025
☆ An Effective Tag Assignment Approach for Billboard Advertisement
Billboard advertisement has gained popularity due to its significant return on investment. To make this advertisement approach more effective, the relevant information about the product needs to reach the relevant set of people. This can be achieved if the relevant set of tags can be mapped to the correct slots. Formally, we call this problem the Tag Assignment Problem in Billboard Advertisement. Given a trajectory database, a billboard database, and a set of selected billboard slots and tags, this problem asks to output a mapping of selected tags to the selected slots so that the influence is maximized. We model this as a variant of traditional bipartite matching called One-To-Many Bipartite Matching (OMBM). In traditional bipartite matching, a tag can be assigned to only one slot; in OMBM, a tag can be assigned to multiple slots, while the reverse cannot happen. We propose an iterative solution approach that incrementally allocates the tags to the slots. The proposed methodology is explained with an illustrative example, and a complexity analysis of the proposed solution approach is also provided. Experimental results on real-world trajectory and billboard datasets demonstrate the effectiveness and efficiency of the proposed solution.
comment: This Paper has been accepted at The 25th International Web Information Systems Engineering Conference (WISE-2024)
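To make the one-to-many matching setting above concrete, the sketch below shows a minimal greedy allocation in the spirit of the iterative approach the abstract describes: each slot receives exactly one tag, the same tag may serve several slots, and pairs are chosen by marginal influence gain. The `influence_gain` callable and data layout are assumptions for illustration, not the paper's algorithm.

```python
from itertools import product

def greedy_ombm_assignment(slots, tags, influence_gain):
    """Greedy one-to-many assignment: each slot receives exactly one tag,
    while the same tag may be reused across several slots."""
    assignment = {}                     # slot -> tag
    remaining = set(slots)
    while remaining:
        # pick the (slot, tag) pair with the largest marginal influence gain
        slot, tag = max(product(remaining, tags),
                        key=lambda pair: influence_gain(pair[0], pair[1], assignment))
        assignment[slot] = tag
        remaining.remove(slot)
    return assignment

# toy usage with a made-up influence model
toy_gain = lambda slot, tag, current: (hash((slot, tag)) % 100) / 100.0
print(greedy_ombm_assignment(["s1", "s2", "s3"], ["t1", "t2"], toy_gain))
```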
☆ Deep Adaptive Interest Network: Personalized Recommendation with Context-Aware Learning
In personalized recommendation systems, accurately capturing users' evolving interests and combining them with contextual information is a critical research area. This paper proposes a novel model called the Deep Adaptive Interest Network (DAIN), which dynamically models users' interests while incorporating context-aware learning mechanisms to achieve precise and adaptive personalized recommendations. DAIN leverages deep learning techniques to build an adaptive interest network structure that can capture users' interest changes in real-time while further optimizing recommendation results by integrating contextual information. Experiments conducted on several public datasets demonstrate that DAIN excels in both recommendation performance and computational efficiency. This research not only provides a new solution for personalized recommendation systems but also offers fresh insights into the application of context-aware learning in recommendation systems.
☆ NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
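As a rough intuition for the non-parametric update NUDGE describes (an illustrative sketch, not the authors' constrained solver), each data-record embedding can be moved a bounded distance toward the queries for which it is the ground-truth answer, so that k-NN retrieval ranks it higher while the overall change stays modest:

```python
import numpy as np

def nudge_embeddings(data_emb, query_emb, positives, step=0.1, max_norm=0.2):
    """Illustrative non-parametric update: move each data embedding a bounded
    distance toward the mean of its positive queries.
    positives[i] lists the query indices for which record i is the answer."""
    new_emb = data_emb.copy()
    for i, q_idx in enumerate(positives):
        if not q_idx:
            continue
        direction = query_emb[q_idx].mean(axis=0) - data_emb[i]
        delta = step * direction
        # keep the total change modest to avoid distorting pretrained semantics
        delta_norm = np.linalg.norm(delta)
        if delta_norm > max_norm:
            delta *= max_norm / delta_norm
        new_emb[i] += delta
    return new_emb
```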
☆ Do We Trust What They Say or What They Do? A Multimodal User Embedding Provides Personalized Explanations
With the rapid development of social media, the importance of analyzing social network user data has also grown. User representation learning in social media is a critical area of research, on the basis of which we can conduct personalized content delivery or detect malicious actors. Being more complicated than many other types of data, social network user data has an inherently multimodal nature. Various multimodal approaches have been proposed to harness both text (i.e., post content) and relation (i.e., inter-user interaction) information to learn user embeddings of higher quality. The advent of Graph Neural Network models enables more end-to-end integration of user text embeddings and user interaction graphs in social networks. However, most of those approaches do not adequately elucidate which aspects of the data - text or graph structure information - are more helpful for predicting each specific user under a particular task, putting some burden on personalized downstream analysis and untrustworthy information filtering. We propose a simple yet effective framework called Contribution-Aware Multimodal User Embedding (CAMUE) for social networks. We demonstrate with empirical evidence that our approach can provide personalized, explainable predictions, automatically mitigating the impact of unreliable information. We also conduct case studies to illustrate how reasonable our results are. We observe that for most users, graph structure information is more trustworthy than text information, but there are reasonable cases where text helps more. Our work paves the way for more explainable, reliable, and effective social media user embeddings that allow for better personalized content delivery.
♻ ☆ The Design of an LLM-powered Unstructured Analytics System
LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.
comment: 6 pages, 3 figures, fixed typos
♻ ☆ Multimodal Recommender Systems: A Survey
The recommender system (RS) has become an integral toolkit of online services. RSs are equipped with various deep learning techniques to model user preference based on identifier and attribute information. With the emergence of multimedia services, such as short videos and news, understanding these contents while recommending becomes critical. Besides, multimodal features are also helpful in alleviating the problem of data sparsity in RS. Thus, the Multimodal Recommender System (MRS) has recently attracted much attention from both academia and industry. In this paper, we give a comprehensive survey of MRS models, mainly from a technical view. First, we summarize the general procedures and major challenges for MRS. Then, we introduce the existing MRS models according to four categories, i.e., Modality Encoder, Feature Interaction, Feature Enhancement and Model Optimization. Besides, to make it convenient for those who want to research this field, we also summarize the dataset and code resources. Finally, we discuss some promising future directions of MRS and conclude this paper. To access more details of the surveyed papers, such as implementation code, we open source a repository.
comment: accepted by CSUR
♻ ☆ MARS: Matching Attribute-aware Representations for Text-based Sequential Recommendation CIKM 2024
Sequential recommendation aims to predict the next item a user is likely to prefer based on their sequential interaction history. Recently, text-based sequential recommendation has emerged as a promising paradigm that uses pre-trained language models to exploit textual item features to enhance performance and facilitate knowledge transfer to unseen datasets. However, existing text-based recommender models still struggle with two key challenges: (i) representing users and items with multiple attributes, and (ii) matching items with complex user interests. To address these challenges, we propose a novel model, Matching Attribute-aware Representations for Text-based Sequential Recommendation (MARS). MARS extracts detailed user and item representations through attribute-aware text encoding, capturing diverse user intents with multiple attribute-aware representations. It then computes user-item scores via attribute-wise interaction matching, effectively capturing attribute-level user preferences. Our extensive experiments demonstrate that MARS significantly outperforms existing sequential models, achieving improvements of up to 24.43% and 29.26% in Recall@10 and NDCG@10 across five benchmark datasets. Code is available at https://github.com/junieberry/MARS
comment: CIKM 2024
♻ ☆ HIRO: Hierarchical Information Retrieval Optimization
Retrieval-Augmented Generation (RAG) has revolutionized natural language processing by dynamically integrating external knowledge into Large Language Models (LLMs), addressing their limitation of static training datasets. Recent implementations of RAG leverage hierarchical data structures, which organize documents at various levels of summarization and information density. This complexity, however, can cause LLMs to "choke" on information overload, necessitating more sophisticated querying mechanisms. In this context, we introduce Hierarchical Information Retrieval Optimization (HIRO), a novel querying approach that employs a Depth-First Search (DFS)-based recursive similarity score calculation and branch pruning. This method uniquely minimizes the context delivered to the LLM without informational loss, effectively managing the challenge of excessive data. HIRO's refined approach is validated by a 10.85% improvement in performance on the NarrativeQA dataset.
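A toy rendering of the DFS-with-pruning idea behind HIRO (the scoring function, threshold, and tree layout here are assumptions, not the paper's exact procedure): branches of the hierarchical index are explored only if their similarity to the query clears a threshold, and only the best-scoring leaves are passed to the LLM.

```python
def hiro_dfs(node, query_vec, similarity, threshold, results):
    """Depth-first traversal over a hierarchical document tree with branch pruning.
    `node` is a dict: {"embedding": ..., "children": [...], "text": ...}."""
    score = similarity(node["embedding"], query_vec)
    if score < threshold:
        return                      # prune this branch entirely
    if not node["children"]:        # leaf: candidate context chunk
        results.append((score, node["text"]))
        return
    for child in node["children"]:
        hiro_dfs(child, query_vec, similarity, threshold, results)

def retrieve(root, query_vec, similarity, threshold=0.3, top_k=5):
    results = []
    hiro_dfs(root, query_vec, similarity, threshold, results)
    return [text for _, text in sorted(results, reverse=True)[:top_k]]
```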
♻ ☆ Large Language Models for Information Retrieval: A Survey
As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have integrated themselves into our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. The trajectory of IR has evolved dynamically from its origins in term-based methods to its integration with advanced neural models. While the neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid response) and modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given the rapid evolution of this research trajectory, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions, such as search agents, within this expanding field.
comment: updated to version 3
♻ ☆ Smart E-commerce Recommendations with Semantic AI
In e-commerce, web mining for page recommendations is widely used but often fails to meet user needs. To address this, we propose a novel solution combining semantic web mining with BP neural networks. We process user search logs to extract five key features: content priority, time spent, user feedback, recommendation semantics, and input deviation. These features are then fed into a BP neural network to classify and prioritize web pages. The prioritized pages are recommended to users. Using book sales pages for testing, our results demonstrate that this solution can quickly and accurately identify the pages users need. Our approach ensures that recommendations are more relevant and tailored to individual preferences, enhancing the online shopping experience. By leveraging advanced semantic analysis and neural network techniques, we bridge the gap between user expectations and actual recommendations. This innovative method not only improves accuracy but also speeds up the recommendation process, making it a valuable tool for e-commerce platforms aiming to boost user satisfaction and engagement. Additionally, our system's ability to handle large datasets and provide real-time recommendations makes it a scalable and efficient solution for modern e-commerce challenges.
comment: 8 pages
♻ ☆ NFARec: A Negative Feedback-Aware Recommender Model SIGIR 2024
Graph neural network (GNN)-based models have been extensively studied for recommendations, as they can extract high-order collaborative signals accurately which is required for high-quality recommender systems. However, they neglect the valuable information gained through negative feedback in two aspects: (1) different users might hold opposite feedback on the same item, which hampers optimal information propagation in GNNs, and (2) even when an item vastly deviates from users' preferences, they might still choose it and provide a negative rating. In this paper, we propose a negative feedback-aware recommender model (NFARec) that maximizes the leverage of negative feedback. To transfer information to multi-hop neighbors along an optimal path effectively, NFARec adopts a feedback-aware correlation that guides hypergraph convolutions (HGCs) to learn users' structural representations. Moreover, NFARec incorporates an auxiliary task - predicting the feedback sentiment polarity (i.e., positive or negative) of the next interaction - based on the Transformer Hawkes Process. The task is beneficial for understanding users by learning the sentiment expressed in their previous sequential feedback patterns and predicting future interactions. Extensive experiments demonstrate that NFARec outperforms competitive baselines. Our source code and data are released at https://github.com/WangXFng/NFARec.
comment: Accepted to SIGIR 2024
♻ ☆ CaDRec: Contextualized and Debiased Recommender Model SIGIR 2024
Recommender models aimed at mining users' behavioral patterns have raised great attention as one of the essential applications in daily life. Recent work on graph neural networks (GNNs) or debiasing methods has attained remarkable gains. However, they still suffer from (1) over-smoothing node embeddings caused by recursive convolutions with GNNs, and (2) the skewed distribution of interactions due to popularity and user-individual biases. This paper proposes a contextualized and debiased recommender model (CaDRec). To overcome the over-smoothing issue, we explore a novel hypergraph convolution operator that can select effective neighbors during convolution by introducing both structural context and sequential context. To tackle the skewed distribution, we propose two strategies for disentangling interactions: (1) modeling individual biases to learn unbiased item embeddings, and (2) incorporating item popularity with positional encoding. Moreover, we mathematically show that the imbalance of the gradients to update item embeddings exacerbates the popularity bias, thus adopting regularization and weighting schemes as solutions. Extensive experiments on four datasets demonstrate the superiority of the CaDRec against state-of-the-art (SOTA) methods. Our source code and data are released at https://github.com/WangXFng/CaDRec.
comment: Accepted to SIGIR 2024
♻ ☆ Evaluating Named Entity Recognition Using Few-Shot Prompting with Large Language Models
This paper evaluates Few-Shot Prompting with Large Language Models for Named Entity Recognition (NER). Traditional NER systems rely on extensive labeled datasets, which are costly and time-consuming to obtain. Few-Shot Prompting or in-context learning enables models to recognize entities with minimal examples. We assess state-of-the-art models like GPT-4 in NER tasks, comparing their few-shot performance to fully supervised benchmarks. Results show that while there is a performance gap, large models excel in adapting to new entity types and domains with very limited data. We also explore the effects of prompt engineering, guided output format and context length on performance. This study underscores Few-Shot Learning's potential to reduce the need for large labeled datasets, enhancing NER scalability and accessibility.
comment: Github repo: https://github.com/GEODE-project/ner-llm
♻ ☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever EMNLP
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context windows and multilingual retrieval. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
comment: 8 pages, references at pp7,8; EMNLP workshop submission
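For a concrete picture of the late-interaction scoring the Jina-ColBERT-v2 abstract refers to, ColBERT-style relevance is typically computed as MaxSim: for each query token, take its best match among document tokens, then sum these maxima. A minimal NumPy sketch, assuming L2-normalized token embeddings (not the paper's code):

```python
import numpy as np

def late_interaction_score(query_tok_emb, doc_tok_emb):
    """ColBERT-style MaxSim scoring.
    query_tok_emb: (num_q_tokens, dim), doc_tok_emb: (num_d_tokens, dim),
    both assumed L2-normalized so the dot product is cosine similarity."""
    sim = query_tok_emb @ doc_tok_emb.T          # (num_q_tokens, num_d_tokens)
    return float(sim.max(axis=1).sum())          # best doc match per query token, summed
```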
Computation and Language 72
☆ Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.
☆ Optimal L-Systems for Stochastic L-system Inference Problems
This paper presents two novel theorems that address two open problems in stochastic Lindenmayer-system (L-system) inference, specifically focusing on the construction of an optimal stochastic L-system capable of generating a given sequence of strings. The first theorem delineates a method for crafting a stochastic L-system that maximizes the likelihood of producing a given sequence of words through a single derivation. The second theorem determines the stochastic L-systems with the highest probability of producing a given sequence of words with multiple possible derivations. From these, we introduce an algorithm to infer an optimal stochastic L-system from a given sequence. This algorithm incorporates sophisticated optimization techniques, such as interior point methods, ensuring the production of a stochastically optimal stochastic L-system suitable for generating the given sequence. This allows stochastic L-systems to be used as a model for machine learning that requires only positive data for training.
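As background for the inference problem above, the simplest way to see how likelihood maximization yields concrete production probabilities is relative-frequency estimation over a single known derivation; the sketch below is only that baseline intuition, not the paper's optimal construction (which handles multiple derivations via interior point methods):

```python
from collections import Counter, defaultdict

def mle_stochastic_lsystem(rule_applications):
    """Estimate production probabilities of a stochastic 0L-system by simple
    relative-frequency counting over one observed derivation.
    `rule_applications` is a list of (symbol, successor) pairs, one per rewrite."""
    counts = defaultdict(Counter)
    for symbol, successor in rule_applications:
        counts[symbol][successor] += 1
    rules = {}
    for symbol, succ_counts in counts.items():
        total = sum(succ_counts.values())
        rules[symbol] = {succ: c / total for succ, c in succ_counts.items()}
    return rules

# toy derivation with rewrites A -> AB, A -> AB, B -> A
print(mle_stochastic_lsystem([("A", "AB"), ("A", "AB"), ("B", "A")]))
# {'A': {'AB': 1.0}, 'B': {'A': 1.0}}
```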
☆ MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs
Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. By incorporating questions with multiple correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex reasoning and resist simplistic problem-solving strategies. Our results show that MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of model discrimination, particularly in multi-correct answer scenarios. We introduce novel metrics like shortcut selection ratio and correct pair identification ratio, offering deeper insights into model behavior and anchoring bias. Evaluations of five state-of-the-art LLMs reveal significant performance gaps, highlighting variations in reasoning abilities and bias susceptibility. We release the dataset and evaluation codes at \url{https://github.com/asgsaeid/mmlu-pro-plus}.
☆ Therapy as an NLP Task: Psychologists' Comparison of LLMs and Human Peers in CBT
Wider access to therapeutic care is one of the biggest challenges in mental health treatment. Due to institutional barriers, some people seeking mental health support have turned to large language models (LLMs) for personalized therapy, even though these models are largely unsanctioned and untested. We investigate the potential and limitations of using LLMs as providers of evidence-based therapy using mixed-methods clinical metrics. Using HELPERT, a prompt run on a large language model using the same process and training as a comparative group of peer counselors, we replicated publicly accessible mental health conversations rooted in Cognitive Behavioral Therapy (CBT) to compare session dynamics and counselors' CBT-based behaviors between original peer support sessions and their reconstructed HELPERT sessions. Two licensed, CBT-trained clinical psychologists evaluated the sessions using the Cognitive Therapy Rating Scale and provided qualitative feedback. Our findings show that the peer sessions are characterized by empathy, small talk, therapeutic alliance, and shared experiences but often exhibit therapist drift. Conversely, HELPERT reconstructed sessions exhibit minimal therapist drift and higher adherence to CBT methods but display a lack of collaboration, empathy, and cultural understanding. Through CTRS ratings and psychologists' feedback, we highlight the importance of human-AI collaboration for scalable mental health care. Our work outlines the ethical implications of imparting human-like subjective qualities to LLMs in therapeutic settings, particularly the risk of deceptive empathy, which may lead to unrealistic patient expectations and potential harm.
☆ Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) (TOT-CAKT) for ASR. In the TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
comment: Accepted to IEEE SLT 2024
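To illustrate what a temporal-order prior can look like inside an OT coupling (a rough Sinkhorn sketch; the cost design and weighting are assumptions, not the TOT-CAKT formulation), one can penalize pairs of frames whose relative positions in the two sequences differ, so the coupling favors temporally consistent alignments:

```python
import numpy as np

def temporal_ot_coupling(acoustic, linguistic, order_weight=1.0, reg=0.1, iters=200):
    """Entropy-regularized OT (Sinkhorn) between two feature sequences with an
    added penalty for misaligned relative positions (illustrative only)."""
    n, m = len(acoustic), len(linguistic)
    feat_cost = np.linalg.norm(acoustic[:, None, :] - linguistic[None, :, :], axis=-1)
    pos_a = np.arange(n) / max(n - 1, 1)
    pos_l = np.arange(m) / max(m - 1, 1)
    order_cost = np.abs(pos_a[:, None] - pos_l[None, :])     # temporal-order penalty
    cost = feat_cost + order_weight * order_cost
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m                    # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                       # transport plan (coupling)
```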
☆ Unforgettable Generalization in Language Models
When language models (LMs) are trained to forget (or "unlearn'') a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training'' set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly, and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering) forgetting affects only the training examples, and models continue to perform the "forgotten'' task accurately even for examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.
comment: 18 pages, 9 figures, published in First Conference on Language Modeling 2024
☆ Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling
This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and on understanding human language acquisition. We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we find that multilingualism does not affect the bias in this VGS model in the way it does in children.
comment: PhD Dissertation
☆ CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
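A stripped-down sketch of the retrieval stage CRAFT describes (the embedding model, corpus format, and ranking rule are assumptions; the LLM-based augmentation step is only noted in a comment):

```python
import numpy as np

def retrieve_similar_documents(few_shot_texts, corpus_texts, embed, top_k=100):
    """Rank corpus documents by cosine similarity to the centroid of the
    user-written few-shot examples, as a stand-in for CRAFT's retrieval stage.
    `embed(list_of_texts)` is assumed to return an (n, dim) array."""
    shot_emb = embed(few_shot_texts)
    corpus_emb = embed(corpus_texts)
    centroid = shot_emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    corpus_emb = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = corpus_emb @ centroid
    order = np.argsort(-scores)[:top_k]
    return [(corpus_texts[i], float(scores[i])) for i in order]

# The retrieved documents would then be rewritten into task-formatted training
# samples by an instruction-tuned LLM (not shown here).
```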
☆ Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text
Social scientists quickly adopted large language models due to their ability to annotate documents without supervised training, an ability known as zero-shot learning. However, due to their compute demands, cost, and often proprietary nature, these models are often at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good as, or better than, state-of-the-art large language models at zero- and few-shot classification, but are also orders of magnitude more efficient and completely open source. By training the models on a simple random sample of 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents as well as state-of-the-art generative models with complex, engineered prompts. Additionally, we release the PolNLI dataset used to train these models -- a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
comment: 26 pages, 5 figures
☆ Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models
The abilities of long-context language models (LMs) are often evaluated using the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to assess a model's ability to identify specific information ("needle") within large text sequences ("haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation--a critical aspect for applications such as design proposals and creative writing. To address this gap, we have introduced a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on the Spinning the Golden Thread, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
☆ OLMoE: Open Mixture-of-Experts Language Models
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
comment: 61 pages (24 main), 36 figures, 14 tables
☆ Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model
Due to the inherent difficulty in modeling phonetic similarities across different languages, code-switching speech recognition presents a formidable challenge. This study proposes a Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups. Initially, a preceding routing network explicitly learns Language Identification (LID) tasks and selects experts based on acquired LID weights. This process ensures robust routing information to the MoE layer, mitigating interference from diverse language domains on expert network parameter updates. The LID weights are also employed to facilitate inter-group collaboration, enabling the integration of language-specific representations. Furthermore, within each language expert group, a gating network operates unsupervised to foster collaboration on attributes beyond language. Extensive experiments demonstrate the efficacy of our approach, achieving significant performance enhancements compared to alternative methods. Importantly, our method preserves the efficient inference capabilities characteristic of MoE models without necessitating additional pre-training.
comment: Accepted to IEEE SLT 2024
☆ BEAVER: An Enterprise Benchmark for Text-to-SQL
Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the poor performance is largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the "dark web", (2) schemas of enterprise tables are more complex than the schemas in public data, which makes the SQL-generation task inherently harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset BEAVER, sourced from real enterprise data warehouses together with natural language queries and their correct SQL statements which we collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will facilitate future researchers building more sophisticated text-to-SQL systems which can do better on this important class of data.
☆ Foundations of Large Language Model Compression -- Part 1: Weight Quantization
In recent years, compression of large language models (LLMs) has emerged as an important problem to allow language model deployment on resource-constrained devices, reduce computational costs, and mitigate the environmental footprint of large-scale AI infrastructure. In this paper, we present the foundations of LLM quantization from a convex optimization perspective and propose a quantization method that builds on these foundations and outperforms previous methods. Our quantization framework, CVXQ, scales to models containing hundreds of billions of weight parameters and provides users with the flexibility to compress models to any specified model size, post-training. A reference implementation of CVXQ can be obtained from https://github.com/seannz/cvxq.
comment: Preprint
☆ FuzzCoder: Byte-level Fuzzing Test via Large Language Model
Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs in an efficient manner is a difficult open problem and the best approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in the input files from successful attacks to guide future fuzzing explorations. Specifically, we develop a framework to leverage code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as sequence-to-sequence modeling, where the LLM receives a sequence of bytes and then outputs the mutated byte sequence. FuzzCoder is fine-tuned on the created instruction dataset (Fuzz-Instruct), where the successful fuzzing history is collected from the heuristic fuzzing tool. FuzzCoder can predict mutation locations and strategies in input files to trigger abnormal behaviors of the program. Experimental results show that FuzzCoder, based on AFL (American Fuzzy Lop), gains significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML.
comment: 11 pages
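To make the sequence-to-sequence framing concrete, one can picture the mutation step as a model reading a hex-encoded byte window and emitting a mutated window of the same length. The prompt format and helpers below are purely hypothetical illustrations, not FuzzCoder's actual interface:

```python
def build_mutation_prompt(input_bytes: bytes, window: int = 32) -> str:
    """Format a byte window as hex for a (hypothetical) seq2seq mutation model."""
    hex_seq = input_bytes[:window].hex(" ")
    return (
        "You are a fuzzing mutator. Given the byte sequence below, "
        "return a mutated byte sequence of the same length that is likely "
        "to trigger abnormal program behavior.\n"
        f"INPUT: {hex_seq}\nOUTPUT:"
    )

def apply_mutation(original: bytes, mutated_hex: str, window: int = 32) -> bytes:
    """Splice the model's mutated window back into the original input."""
    mutated = bytes.fromhex(mutated_hex.replace(" ", ""))
    return mutated[:window] + original[window:]
```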
☆ Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation
This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q\&A) systems, a crucial form of Natural Language Processing. Traditionally, human evaluation has been indispensable for assessing the quality of these responses. However, manual evaluation by medical professionals is time-consuming and costly. Our study examines whether LLMs can reliably replicate human evaluations by using questions derived from patient data, thereby saving valuable time for medical experts. While the findings suggest promising results, further research is needed to address more specific or complex questions that were beyond the scope of this initial investigation.
comment: 10 pages, 3 figures, 3 tables
☆ 3D-LEX v1.0: 3D Lexicons for American Sign Language and Sign Language of the Netherlands
In this work, we present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. This includes the time for presenting a sign example, performing and recording the sign, and archiving the capture. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. We produce handshape labels for 1,000 signs from American Sign Language and evaluate the labels in a sign recognition task. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint. The 3D-LEX collection has been aligned with existing sign language benchmarks and linguistic resources, to support studies in 3D-aware sign language processing.
☆ What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: https://github.com/WowCZ/LongMIT.
comment: Work in progress
☆ Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis
This study explores the potential of large language models (LLMs) for identifying and examining intertextual relationships within biblical, Koine Greek texts. By evaluating the performance of LLMs on various intertextuality scenarios the study demonstrates that these models can detect direct quotations, allusions, and echoes between texts. The LLM's ability to generate novel intertextual observations and connections highlights its potential to uncover new insights. However, the model also struggles with long query passages and the inclusion of false intertextual dependencies, emphasizing the importance of expert evaluation. The expert-in-the-loop methodology presented offers a scalable approach for intertextual research into the complex web of intertextuality within and beyond the biblical corpus.
☆ The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?
In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. Based on a discussion with experts and students, we assess the current acceptance of and concerns regarding this, nowadays ubiquitous, technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval-augmented generation from music dictionaries. This paper suggests that realizing the potential of LLMs in musicology requires musicology-driven research that can specialize LLMs by incorporating accurate and reliable domain knowledge.
☆ AgentRE: An Agent-Based Framework for Navigating Complex Information Landscapes in Relation Extraction CIKM 2024
The relation extraction (RE) in complex scenarios faces challenges such as diverse relation types and ambiguous relations between entities within a single sentence, leading to the poor performance of pure "text-in, text-out" language models (LMs). To address these challenges, in this paper, we propose an agent-based RE framework, namely AgentRE, which fully leverages the potential of large language models (LLMs) including memory, retrieval and reflection, to achieve RE in complex scenarios. Specifically, three major modules are built in AgentRE serving as the tools to help the agent acquire and process various useful information, thereby obtaining improved RE performance. Our extensive experimental results upon two datasets in English and Chinese demonstrate our AgentRE's superior performance, especially in low-resource scenarios. Additionally, the trajectories generated by AgentRE can be refined to construct a high-quality training dataset incorporating different reasoning methods, which can be used to fine-tune smaller models. Code is available at https://github.com/Lightblues/AgentRE.
comment: Accepted by CIKM 2024
☆ Towards Generative Class Prompt Learning for Few-shot Visual Recognition BMVC 2024
Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well on a different domain without fine-tuning. We attribute these to the limitations of the VLM's semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperforms existing methods, offering a better alternative to few-shot image recognition challenges. The source code will be made available at: https://github.com/soumitri2001/GCPL.
comment: Accepted at BMVC 2024
☆ Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations ALT
As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these assessments. Experiment 2 extended the work of Finch et al. (2023) by focusing on dyadic dialogues and assessing Commonsense Contradiction, Incorrect Fact, and Redundancy. The results indicate that while GPT-4o demonstrates strong performance in maintaining factual accuracy and commonsense reasoning, it still struggles with reducing redundancy and self-contradiction. Our findings underscore the potential of GPT models to closely replicate human evaluation in dialogue systems, while also pointing to areas for improvement. This research offers valuable insights for advancing the development and implementation of more refined dialogue evaluation methodologies, contributing to the evolution of more effective and human-like AI communication tools.
comment: 17 pages, 15 figures, shorter version submitted to 22nd Annual Workshop of the Australasian Language Technology Association (ALTA'24)
☆ LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning
Effective planning is essential for the success of any task, from organizing a vacation to routing autonomous vehicles and developing corporate strategies. It involves setting goals, formulating plans, and allocating resources to achieve them. LLMs are particularly well-suited for automated planning due to their strong capabilities in commonsense reasoning. They can deduce a sequence of actions needed to achieve a goal from a given state and identify an effective course of action. However, it is frequently observed that plans generated through direct prompting often fail upon execution. Our survey aims to highlight the existing challenges in planning with language models, focusing on key areas such as embodied environments, optimal scheduling, competitive and cooperative games, task decomposition, reasoning, and planning. Through this study, we explore how LLMs transform AI planning and provide unique insights into the future of LM-assisted planning.
☆ Training on the Benchmark Is Not All You Need
The success of Large Language Models (LLMs) relies heavily on the huge amount of pre-training data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As swapping the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived datasets, and then detect data leakage based on the model's log probability distribution over the derived datasets. If the set of log probabilities contains an outlier maximum, this indicates that the data has leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets, give a ranking of the leaked LLMs for each benchmark, and find that the Qwen family of LLMs has the highest degree of data leakage.
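A compact rendering of the shuffled-options leakage test described above; the outlier criterion (a z-score cut-off on the original ordering's log probability) is an assumed stand-in for the paper's exact rule:

```python
import numpy as np
from itertools import permutations

def detect_option_order_leakage(logprob_fn, question, options, z_cutoff=2.0):
    """Score every permutation of the answer options with the model's log
    probability; if the original ordering is an outlier maximum, flag the item
    as potentially leaked. `logprob_fn(question, options)` is assumed to return
    the model's log probability of the formatted item."""
    perms = list(permutations(options))
    scores = np.array([logprob_fn(question, list(p)) for p in perms])
    original_score = scores[perms.index(tuple(options))]
    z = (original_score - scores.mean()) / (scores.std() + 1e-9)
    return bool(original_score == scores.max() and z > z_cutoff)
```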
☆ LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection
Explainable fake news detection predicts the authenticity of news items with annotated explanations. Today, Large Language Models (LLMs) are known for their powerful natural language understanding and explanation generation abilities. However, applying LLMs to explainable fake news detection presents two main challenges. Firstly, fake news appears reasonable and could easily mislead LLMs, leaving them unable to understand the complex news-faking process. Secondly, utilizing LLMs for this task would generate both correct and incorrect explanations, which necessitates abundant human labor in the loop. In this paper, we propose LLM-GAN, a novel framework that uses prompting mechanisms to enable an LLM to act as both Generator and Detector for realistic fake news generation and detection. Our results demonstrate LLM-GAN's effectiveness in both prediction performance and explanation quality. We further showcase the integration of LLM-GAN into a cloud-native AI platform to provide a better fake news detection service in the cloud.
☆ State-of-the-art Advances of Deep-learning Linguistic Steganalysis Research
With the evolution of generative linguistic steganography techniques, conventional steganalysis falls short in robustly quantifying the alterations induced by steganography, thereby complicating detection. Consequently, the research paradigm has pivoted towards deep-learning-based linguistic steganalysis. This study offers a comprehensive review of existing contributions and evaluates prevailing developmental trajectories. Specifically, we first provide a formalized exposition of the general formulas for linguistic steganalysis, comparing the differences between this field and the domain of text classification. Subsequently, we classify the existing work into two levels based on vector space mapping and feature extraction models, comparing the research motivations, model advantages, and other details. A comparative analysis of the experiments is conducted to assess performance. Finally, the challenges faced by this field are discussed, and several directions for future development and key issues that urgently need to be addressed are proposed.
comment: Accepted by 2023 International Conference on Data, Information and Computing Science
☆ FC-KAN: Function Combinations in Kolmogorov-Arnold Networks
In this paper, we introduce FC-KAN, a Kolmogorov-Arnold Network (KAN) that leverages combinations of popular mathematical functions such as B-splines, wavelets, and radial basis functions on low-dimensional data through element-wise operations. We explore several methods for combining the outputs of these functions, including sum, element-wise product, the addition of sum and element-wise product, quadratic function representation, and concatenation. In our experiments, we compare FC-KAN with a multi-layer perceptron (MLP) and other existing KANs, such as BSRBF-KAN, EfficientKAN, FastKAN, and FasterKAN, on the MNIST and Fashion-MNIST datasets. A variant of FC-KAN, which uses a combination of outputs from B-splines and Difference of Gaussians (DoG) in the form of a quadratic function, outperformed all other models averaged over 5 independent training runs. We expect that FC-KAN can leverage function combinations to design future KANs. Our repository is publicly available at: https://github.com/hoangthangta/FC_KAN.
comment: 9 pages, 1 figure
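To ground the combination schemes listed in the FC-KAN abstract, the sketch below merges two basis-function outputs element-wise; the basis layers themselves (e.g. B-spline and DoG) are stubbed as plain arrays, and the quadratic form shown is one plausible reading rather than the paper's implementation:

```python
import numpy as np

def combine_outputs(f_out, g_out, method="sum_product"):
    """Element-wise combination of two basis-function outputs of equal shape."""
    if method == "sum":
        return f_out + g_out
    if method == "product":
        return f_out * g_out
    if method == "sum_product":        # addition of sum and element-wise product
        return f_out + g_out + f_out * g_out
    if method == "quadratic":          # one plausible quadratic form (assumption)
        return f_out ** 2 + g_out ** 2 + f_out * g_out
    if method == "concat":
        return np.concatenate([f_out, g_out], axis=-1)
    raise ValueError(f"unknown method: {method}")

# toy usage with stand-in outputs from two function bases
f_out, g_out = np.random.rand(4, 8), np.random.rand(4, 8)
print(combine_outputs(f_out, g_out, "quadratic").shape)   # (4, 8)
```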
☆ Empirical evidence of Large Language Model's influence on human spoken communication
Artificial Intelligence (AI) agents now interact with billions of humans in natural language, thanks to advances in Large Language Models (LLMs) like ChatGPT. This raises the question of whether AI has the potential to shape a fundamental aspect of human culture: the way we speak. Recent analyses revealed that scientific publications already exhibit evidence of AI-specific language. But this evidence is inconclusive, since scientists may simply be using AI to copy-edit their writing. To explore whether AI has influenced human spoken communication, we transcribed and analyzed about 280,000 English-language videos of presentations, talks, and speeches from more than 20,000 YouTube channels of academic institutions. We find a significant shift in the trend of word usage specific to words distinctively associated with ChatGPT following its release. These findings provide the first empirical evidence that humans increasingly imitate LLMs in their spoken language. Our results raise societal and policy-relevant concerns about the potential of AI to unintentionally reduce linguistic diversity, or to be deliberately misused for mass manipulation. They also highlight the need for further investigation into the feedback loops between machine behavior and human culture.
☆ Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits ECCV 2024
CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions to perform nuanced tasks. However, it does not offer application-specific fine-grained and structured understanding, due to its generic nature. In this work, we aim to adapt CLIP for fine-grained and structured -- in the form of tabular data -- visual understanding of museum exhibits. To facilitate such understanding we (a) collect, curate, and benchmark a dataset of 200K+ image-table pairs, and (b) develop a method that allows predicting tabular outputs for input images. Our dataset is the first of its kind in the public domain. At the same time, the proposed method is novel in leveraging CLIP's powerful representations for fine-grained and tabular understanding. The proposed method (MUZE) learns to map CLIP's image embeddings to the tabular structure by means of a proposed transformer-based parsing network (parseNet). More specifically, parseNet enables prediction of missing attribute values while integrating context from known attribute-value pairs for an input image. We show that this leads to significant improvement in accuracy. Through exhaustive experiments, we show the effectiveness of the proposed method on fine-grained and structured understanding of museum exhibits, by achieving encouraging results in a newly established benchmark. Our dataset and source-code can be found at: https://github.com/insait-institute/MUZE
comment: Accepted to ECCV 2024
☆ In Defense of RAG in the Era of Long-Context Language Models
Overcoming the limited context window of early-generation LLMs, retrieval-augmented generation (RAG) has been a reliable solution for context-based answer generation in the past. Recently, the emergence of long-context LLMs allows models to incorporate much longer text sequences, making RAG less attractive. Recent studies show that long-context LLMs significantly outperform RAG in long-context applications. Unlike the existing works favoring the long-context LLM over RAG, we argue that the extremely long context in LLMs suffers from a diminished focus on relevant information and leads to potential degradation in answer quality. This paper revisits RAG for long-context answer generation. We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG for long-context question-answer applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises, and then declines, forming an inverted U-shaped curve. There exist sweet spots where OP-RAG can achieve higher answer quality with far fewer tokens than a long-context LLM taking the whole context as input. Extensive experiments on public benchmarks demonstrate the superiority of our OP-RAG.
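A minimal sketch of the order-preserve retrieval idea above, under the assumption that it amounts to selecting the top-k chunks by relevance but presenting them to the LLM in their original document order; the keyword-overlap scorer is a toy stand-in for a real retriever:

```python
def op_rag_context(chunks, query, k, score_fn):
    # Score every chunk, keep the k most relevant ones.
    scored = [(score_fn(query, c), i, c) for i, c in enumerate(chunks)]
    top_k = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    # Order-preserve step: restore the chunks' original document order.
    top_k.sort(key=lambda t: t[1])
    return "\n".join(c for _, _, c in top_k)

# Toy usage with a keyword-overlap scorer standing in for a real retriever.
score = lambda q, c: len(set(q.lower().split()) & set(c.lower().split()))
chunks = ["Alice met Bob in Paris.", "Bob later moved to Rome.",
          "The weather was mild.", "Alice followed Bob to Rome."]
print(op_rag_context(chunks, "Where did Bob move?", k=2, score_fn=score))
```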
☆ Interpreting and Improving Large Language Models in Arithmetic Calculation ICML 2024
Large language models (LLMs) have demonstrated remarkable potential across numerous applications and have shown an emergent ability to tackle complex reasoning tasks, such as mathematical computations. However, even for the simplest arithmetic calculations, the intrinsic mechanisms behind LLMs remain mysterious, making it challenging to ensure reliability. In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations. Through comprehensive experiments, we find that LLMs engage only a small fraction (< 5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes. Subsequently, the information from these operands is processed through multi-layer perceptrons (MLPs), progressively leading to the final solution. These pivotal heads/MLPs, though identified on a specific dataset, exhibit transferability across different datasets and even distinct tasks. This insight prompted us to investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs' computational performance. We empirically find that such precise tuning can yield notable enhancements in mathematical prowess, without compromising the performance on non-mathematical tasks. Our work serves as a preliminary exploration into the arithmetic calculation abilities inherent in LLMs, laying a solid foundation to reveal more intricate mathematical tasks.
comment: Accepted by ICML 2024 (oral)
☆ From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning ICML 2024
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue, but it typically leads to the degeneration of LLMs' general capability. To address the challenge, we propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective. Specifically, SPT first reveals and verifies a small percentage (<5%) of the basic modules which significantly affect a particular behavior of LLMs, i.e., sycophancy. Subsequently, SPT merely fine-tunes these identified modules while freezing the rest. To verify the effectiveness of the proposed SPT, we conduct comprehensive experiments, demonstrating that SPT significantly mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT introduces limited or even no side effects on the general capability of LLMs. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted ability of LLMs.
comment: Accepted by ICML 2024
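A minimal sketch of the fine-tune-only-selected-modules step described above, assuming the sycophancy-relevant modules have already been identified; the use of PyTorch and the example module name are illustrative assumptions, not details from the paper:

```python
import torch.nn as nn

def pinpoint_tune_setup(model: nn.Module, target_module_names):
    # Freeze the whole model first.
    for p in model.parameters():
        p.requires_grad_(False)
    # Unfreeze only the pre-identified region-of-interest modules.
    for name, module in model.named_modules():
        if name in target_module_names:
            for p in module.parameters():
                p.requires_grad_(True)
    # Return the trainable subset, e.g. to hand to an optimizer.
    return [p for p in model.parameters() if p.requires_grad]

# Hypothetical usage (module name is made up for illustration):
# optim = torch.optim.AdamW(pinpoint_tune_setup(model, {"layers.12.attn"}), lr=1e-5)
```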
☆ CTG-KrEW: Generating Synthetic Structured Contextually Correlated Content by Conditional Tabular GAN with K-Means Clustering and Efficient Word Embedding
Conditional Tabular Generative Adversarial Networks (CTGAN) and their various derivatives are attractive for their ability to efficiently and flexibly create synthetic tabular data, showcasing strong performance and adaptability. However, there are certain critical limitations to such models. The first is their inability to preserve the semantic integrity of contextually correlated words or phrases. For instance, the skillset in freelancer profiles is one such attribute where individual skills are semantically interconnected and indicative of specific domain interests or qualifications. The second challenge of traditional approaches is that, when applied to generate contextually correlated tabular content, besides generating semantically shallow content, they consume huge memory resources and CPU time during the training stage. To address these problems, we introduce a novel framework, CTGKrEW (Conditional Tabular GAN with K-Means Clustering and Word Embedding), which is adept at generating realistic synthetic tabular data where attributes are collections of semantically and contextually coherent words. CTGKrEW is trained and evaluated using a dataset from Upwork, a real-world freelancing platform. Comprehensive experiments were conducted to analyze the variability, contextual similarity, frequency distribution, and associativity of the generated data, along with testing the framework's system feasibility. CTGKrEW also takes around 99% less CPU time and has a 33% smaller memory footprint than the conventional approach. Furthermore, we developed KrEW, a web application to facilitate the generation of realistic data containing skill-related information. This application, available at https://riyasamanta.github.io/krew.html, is freely accessible to both the general public and the research community.
☆ Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
The harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for Large language models' fine-tuning-as-a-service. While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that \textit{harmful perturbation} over the model weights should be the root cause of the broken alignment induced by harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage's optimization. The regularizer ensures that the model's harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at \url{https://github.com/git-disl/Booster}.
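A hedged sketch of how the alignment-stage regularizer described above might look, assuming it penalizes the drop in harmful loss after one simulated harmful-gradient step; the step size, the plain gradient step, and the toy MSE losses are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

def booster_style_loss(model, align_batch, harm_batch, lam=0.1, alpha=1e-2):
    # Toy stand-ins: MSE losses replace the real alignment / harmful losses.
    mse = nn.functional.mse_loss
    align = mse(model(align_batch[0]), align_batch[1])
    params = dict(model.named_parameters())
    harm_before = mse(model(harm_batch[0]), harm_batch[1])
    grads = torch.autograd.grad(harm_before, tuple(params.values()),
                                create_graph=True)
    # Simulated harmful perturbation: one small step along the harmful gradient.
    perturbed = {k: v - alpha * g for (k, v), g in zip(params.items(), grads)}
    harm_after = mse(functional_call(model, perturbed, (harm_batch[0],)),
                     harm_batch[1])
    # Penalize how much the harmful loss would drop after the perturbation.
    return align + lam * torch.relu(harm_before - harm_after)

# Toy usage on a linear model with made-up alignment and harmful batches.
model = nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
total = booster_style_loss(model, (x, y), (x, -y))
total.backward()
print(float(total))
```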
☆ Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models
As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow. However, the pre-training of the Vision Encoder and the integrated training of LLMs with the Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can fully realize their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks that create datasets using machine translation suffer from cultural differences and biases, which remain issues when they are used as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset, which takes into account nuances and country-specific phrases, was then used to evaluate the explanation-generation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English compared to English. In addition, it was observed that LVLMs struggle to effectively manage the knowledge learned from English data.
☆ AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate, and then construct triplets of the query, retrieved documents, and the corresponding compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.
comment: 8 pages, 5 figures, code available at https://anonymous.4open.science/r/AdaComp-8C0C/
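A small sketch of the annotation step described above, under the assumption that the label is simply the smallest top-k for which the RAG system answers correctly; `rag_answers_correctly` is a hypothetical placeholder for a real RAG call plus an answer check:

```python
def label_compression_rate(query, ranked_docs, rag_answers_correctly, max_k=None):
    max_k = max_k or len(ranked_docs)
    for k in range(1, max_k + 1):
        if rag_answers_correctly(query, ranked_docs[:k]):
            return k                     # minimal number of documents needed
    return max_k                         # fall back to all retrieved documents

# Toy usage: pretend the answer needs the first two documents.
toy_check = lambda q, docs: len(docs) >= 2
print(label_compression_rate("who wrote X?", ["d1", "d2", "d3"], toy_check))  # 2
```

Each (query, retrieved documents, k) triplet produced this way then becomes one training example for the compression-rate predictor.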
☆ An Implementation of Werewolf Agent That does not Truly Trust LLMs
Werewolf is an incomplete information game, which poses several challenges for creating a computer agent as a player, such as the limited understanding of the situation and the lack of individuality in utterances (e.g., computer agents are not capable of characterful utterances or situational lying). We propose a werewolf agent that solves some of those difficulties by combining a Large Language Model (LLM) and a rule-based algorithm. In particular, our agent uses a rule-based algorithm to select an output either from an LLM or from a template prepared beforehand, based on the results of analyzing the conversation history with an LLM. This allows the agent to refute in specific situations, identify when to end the conversation, and behave with a persona. As a result, this approach mitigated conversational inconsistencies and facilitated logical utterances. We also conducted a qualitative evaluation, in which our agent was perceived as more human-like compared to an unmodified LLM. The agent is freely available to contribute to advancing research on the Werewolf game.
☆ Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture
This study introduces a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge, with a specific focus on Hakka culture as a case study. Leveraging Bloom's Taxonomy, the study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. This benchmark extends beyond traditional single-dimensional evaluations by providing a deeper analysis of LLMs' abilities to handle culturally specific content, ranging from basic recall of facts to higher-order cognitive tasks such as creative synthesis. Additionally, the study integrates Retrieval-Augmented Generation (RAG) technology to address the challenges of minority cultural knowledge representation in LLMs, demonstrating how RAG enhances the models' performance by dynamically incorporating relevant external information. The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge. However, the findings also reveal the limitations of RAG in creative tasks, underscoring the need for further optimization. This benchmark provides a robust tool for evaluating and comparing LLMs in culturally diverse contexts, offering valuable insights for future research and development in AI-driven cultural knowledge preservation and dissemination.
comment: Submitted to O-COCOSDA 2024
☆ Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
Large language models (LLMs) have shown success in generating high-quality responses. To achieve better alignment of LLMs with human preference, various works have been proposed based on specific optimization processes, which, however, are not suitable for Black-Box LLMs like GPT-4 due to inaccessible parameters. In the Black-Box LLM case, performance is highly dependent on the quality of the provided prompts. Existing methods to enhance response quality often involve a prompt refinement model, yet these approaches potentially suffer from semantic inconsistencies between the refined and original prompts and typically overlook the relationship between them. To address these challenges, we introduce a self-instructed in-context learning framework that empowers LLMs to deliver more effective responses by generating reliable derived prompts to construct informative contextual environments. Our approach incorporates a self-instructed reinforcement learning mechanism, enabling direct interaction with the response model during derived prompt generation for better alignment. We then formulate querying as an in-context learning task, using responses from LLMs combined with the derived prompts to establish a contextual demonstration for the original prompt. This strategy ensures alignment with the original query, reduces discrepancies from refined prompts, and maximizes the LLMs' in-context learning capability. Extensive experiments demonstrate that the proposed method not only generates more reliable derived prompts but also significantly enhances LLMs' ability to deliver more effective responses, including Black-Box models such as GPT-4.
☆ VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka
This paper introduces VoxHakka, a text-to-speech (TTS) system designed for Taiwanese Hakka, a critically under-resourced language spoken in Taiwan. Leveraging the YourTTS framework, VoxHakka achieves high naturalness and accuracy and low real-time factor in speech synthesis while supporting six distinct Hakka dialects. This is achieved by training the model with dialect-specific data, allowing for the generation of speaker-aware Hakka speech. To address the scarcity of publicly available Hakka speech corpora, we employed a cost-effective approach utilizing a web scraping pipeline coupled with automatic speech recognition (ASR)-based data cleaning techniques. This process ensured the acquisition of a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training. Subjective listening tests conducted using comparative mean opinion scores (CMOS) demonstrate that VoxHakka significantly outperforms existing publicly available Hakka TTS systems in terms of pronunciation accuracy, tone correctness, and overall naturalness. This work represents a significant advancement in Hakka language technology and provides a valuable resource for language preservation and revitalization efforts.
comment: Submitted to O-COCOSDA 2024
☆ Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings aptly guide the generator to synthesize utterances acoustically fitted to the target domain while authentically preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation.
comment: Accepted to IEEE SLT 2024
☆ It is Time to Develop an Auditing Framework to Promote Value Aware Chatbots
The launch of ChatGPT in November 2022 marked the beginning of a new era in AI, the availability of generative AI tools for everyone to use. ChatGPT and other similar chatbots boast a wide range of capabilities from answering student homework questions to creating music and art. Given the large amounts of human data chatbots are built on, it is inevitable that they will inherit human errors and biases. These biases have the potential to inflict significant harm or increase inequity on different subpopulations. Because chatbots do not have an inherent understanding of societal values, they may create new content that is contrary to established norms. Examples of concerning generated content include child pornography, inaccurate facts, and discriminatory posts. In this position paper, we argue that the speed of advancement of this technology requires us, as computer and data scientists, to mobilize and develop a values-based auditing framework containing a community-established standard set of measurements to monitor the health of different chatbots and LLMs. To support our argument, we use a simple audit template to share the results of basic audits we conduct that are focused on measuring potential bias in search engine style tasks, code generation, and story generation. We identify responses from GPT 3.5 and GPT 4 that are both consistent and not consistent with values derived from existing law. While the findings come as no surprise, they do underscore the urgency of developing a robust auditing framework for openly sharing results in a consistent way so that mitigation strategies can be developed by the academic community, government agencies, and companies when our values are not being adhered to. We conclude this paper with recommendations for value-based strategies for improving the technologies.
comment: arXiv admin note: substantial text overlap with arXiv:2306.07500
☆ S$^3$c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners
Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S$^3$c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We propose a method that employs a step-level sampling approach to construct step-wise self-correction data for achieving this ability. Additionally, we implement a training strategy that uses the constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.
♻ ☆ Improving Rare Word Translation With Dictionaries and Attention Masking
In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.
comment: 11 pages, 3 figures, 3 tables. Accepted at AMTA 2024
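A toy illustration of the dictionary-augmentation idea above: definitions for rare source words are appended to the input sentence. The rare-word threshold, separator token, and example dictionary are assumptions for illustration; the attention masking that links each rare word to its appended definition happens inside the model and is not shown here.

```python
def augment_with_definitions(src_tokens, freq, dictionary, rare_threshold=5, sep="<def>"):
    out = list(src_tokens)
    for tok in src_tokens:
        # Only rare words that the dictionary covers get a definition appended.
        if freq.get(tok, 0) < rare_threshold and tok in dictionary:
            out += [sep, tok, ":"] + dictionary[tok].split()
    return out

# Toy usage with made-up frequencies and a one-entry bilingual dictionary.
src = "the aardwolf eats termites".split()
freq = {"the": 1000, "eats": 300, "termites": 40, "aardwolf": 2}
bidict = {"aardwolf": "Erdwolf (a small insectivorous hyena)"}
print(" ".join(augment_with_definitions(src, freq, bidict)))
```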
♻ ☆ Low-Rank Quantization-Aware Training for LLMs
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute- and memory-efficient. Quantization-aware training (QAT) methods generally produce the best quantized performance; however, this comes at the cost of potentially long training time and excessive memory usage, making it impractical to apply to LLMs. Inspired by the parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, leading to no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; (iii) can be applied across a wide range of quantization settings, such as different choices of quantization granularity and activation quantization, and can be seamlessly combined with many PTQ techniques. We apply LR-QAT to the LLaMA-1/2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer-grade GPU with 24GB of memory. Our source code is available at https://github.com/qualcomm-ai-research/LR-QAT
♻ ☆ Foundation Models for Music: A Survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning, and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we find that many music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods across diverse music applications, along with the potential of FMs in music understanding, generation, and medical applications. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling laws and emergent abilities, as well as long-sequence modelling, etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music should focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.
♻ ☆ InkubaLM: A small language model for low-resource African languages
High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts and more extensive training data on tasks such as machine translation, question-answering, AfriMMLU, and the AfriXnli task. Notably, InkubaLM outperforms many larger models in sentiment analysis and demonstrates remarkable consistency across multiple languages. This work represents a pivotal advancement in challenging the conventional paradigm that effective language models must rely on substantial resources. Our model and datasets are publicly available at https://huggingface.co/lelapa to encourage research and development on low-resource languages.
♻ ☆ OceanGPT: A Large Language Model for Ocean Science Tasks ACL2024
Ocean science, which delves into the oceans that are reservoirs of life and biodiversity, is of great significance given that oceans cover over 70% of our planet's surface. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in science. Despite the success in other domains, current LLMs often fall short in catering to the needs of domain experts like oceanographers, and the potential of LLMs for ocean science is under-explored. The intrinsic reasons are the immense and intricate nature of ocean data as well as the necessity for higher granularity and richness in knowledge. To alleviate these issues, we introduce OceanGPT, the first-ever large language model in the ocean domain, which is expert in various ocean science tasks. We also propose a novel framework to automatically obtain a large volume of ocean domain instruction data, which generates instructions based on multi-agent collaboration. Additionally, we construct the first oceanography benchmark, OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through comprehensive experiments, OceanGPT not only shows a higher level of knowledge expertise for ocean science tasks but also gains preliminary embodied intelligence capabilities in ocean technology.
comment: ACL2024. Project Website: http://oceangpt.zjukg.cn/
♻ ☆ A Survey on Stability of Learning with Limited Labelled Data and its Sensitivity to the Effects of Randomness
Learning with limited labelled data, such as prompting, in-context learning, fine-tuning, meta-learning or few-shot learning, aims to effectively train a model using only a small amount of labelled samples. However, these approaches have been observed to be excessively sensitive to the effects of uncontrolled randomness caused by non-determinism in the training process. The randomness negatively affects the stability of the models, leading to large variances in results across training runs. When such sensitivity is disregarded, it can unintentionally, but unfortunately also intentionally, create an imaginary perception of research progress. Recently, this area started to attract research attention and the number of relevant studies is continuously growing. In this survey, we provide a comprehensive overview of 415 papers addressing the effects of randomness on the stability of learning with limited labelled data. We distinguish between four main tasks addressed in the papers (investigate/evaluate; determine; mitigate; benchmark/compare/report randomness effects), providing findings for each one. Furthermore, we identify and discuss seven challenges and open problems together with possible directions to facilitate further research. The ultimate goal of this survey is to emphasise the importance of this growing research area, which so far has not received an appropriate level of attention, and reveal impactful directions for future research.
comment: Accepted to ACM Comput. Surv. 2024
♻ ☆ Towards Scalable Automated Alignment of LLMs: A Survey
Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approaches. In this paper, we systematically review the recently emerging methods of automated alignment, attempting to explore how to achieve effective, scalable, automated alignment once the capabilities of LLMs exceed those of humans. Specifically, we categorize existing automated alignment methods into 4 major categories based on the sources of alignment signals and discuss the current status and potential development of each category. Additionally, we explore the underlying mechanisms that enable automated alignment and discuss the essential factors that make automated alignment technologies feasible and effective from the fundamental role of alignment.
comment: Paper List: https://github.com/cascip/awesome-auto-alignment
♻ ☆ White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
comment: Accepted at Journal of Machine Learning Research. This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes and formatting improvements. V3: improvements from journal revisions
♻ ☆ A Fundamental Trade-off in Aligned Language Models and its Relation to Sampling Adaptors
The relationship between the quality of a string, as judged by a human reader, and its probability, $p(\boldsymbol{y})$ under a language model undergirds the development of better language models. For example, many popular algorithms for sampling from a language model have been conceived with the goal of manipulating $p(\boldsymbol{y})$ to place higher probability on strings that humans deem of high quality. In this article, we examine the probability--quality relationship in language models explicitly aligned to human preferences, e.g., through reinforcement learning from human feedback. We show that, when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood under the prior language model, i.e., the same model before alignment with human preferences. We provide a formal treatment of this phenomenon and demonstrate how a choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.
♻ ☆ Correcting misinformation on social media with a large language model
Real-world misinformation, often multimodal, can be partially or fully factual but misleading using diverse tactics like conflating correlation with causation. Such misinformation is severely understudied, challenging to address, and harms various social domains, particularly on social media, where it can spread rapidly. High-quality and timely correction of misinformation that identifies and explains its (in)accuracies effectively reduces false beliefs. Despite the wide acceptance of manual correction, it is difficult to be timely and scalable. While LLMs have versatile capabilities that could accelerate misinformation correction, they struggle due to a lack of recent information, a tendency to produce false content, and limitations in addressing multimodal information. We propose MUSE, an LLM augmented with access to and credibility evaluation of up-to-date information. By retrieving evidence as refutations or supporting context, MUSE identifies and explains content (in)accuracies with references. It conducts multimodal retrieval and interprets visual content to verify and correct multimodal content. Given the absence of a comprehensive evaluation approach, we propose 13 dimensions of misinformation correction quality. Then, fact-checking experts evaluate responses to social media content that are not presupposed to be misinformation but broadly include (partially) incorrect and correct posts that may (not) be misleading. Results demonstrate MUSE's ability to write high-quality responses to potential misinformation--across modalities, tactics, domains, political leanings, and for information that has not previously been fact-checked online--within minutes of its appearance on social media. Overall, MUSE outperforms GPT-4 by 37% and even high-quality responses from laypeople by 29%. Our work provides a general methodological and evaluative framework to correct misinformation at scale.
comment: 50 pages
♻ ☆ NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs which often contain tens or hundreds of billions of parameters. We create NeMo-Aligner, a toolkit for model alignment that can efficiently scale to a thousand GPUs for training the largest open-source LLMs such as Nemotron 4 340B and Llama 3.1 405B. NeMo-Aligner comes with highly optimized and scalable implementations for major paradigms of model alignment such as: Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). Additionally, our toolkit supports running most of the alignment techniques in a Parameter Efficient Fine-Tuning (PEFT) setting. NeMo-Aligner is designed for extensibility, allowing support for other alignment techniques with minimal effort. It is open-sourced with Apache 2.0 License and we invite community contributions at https://github.com/NVIDIA/NeMo-Aligner
comment: 16 pages, 4 figures, Accepted to COLM 2024
♻ ☆ Squid: Long Context as a New Modality for Energy-Efficient On-Device Language Models
This paper presents Dolphin, a novel decoder-decoder architecture for energy-efficient processing of long contexts in language models. Our approach addresses the significant energy consumption and latency challenges inherent in on-device models. Dolphin employs a compact 0.5B parameter decoder to distill extensive contextual information into a memory embedding, substantially reducing the input length for the primary 7B parameter decoder model. Inspired by vision-language models, we repurpose the image embedding projector to encode long textual contexts, effectively treating extended context as a distinct modality. This innovative method enables processing of substantially longer contexts without the typical computational overhead associated with extended input sequences. Empirical evaluations demonstrate a 10-fold improvement in energy efficiency and a 5-fold reduction in latency compared to conventional full-length context processing methods without losing quality of the response. Our work contributes to the development of more sustainable and scalable language models for on-device applications, addressing the critical need for energy-efficient and responsive AI technologies in resource-constrained environments while maintaining the accuracy to understand long contexts. This research has implications for the broader field of natural language processing, particularly in the domain of efficient model design for resource-limited settings. By enabling more sophisticated AI capabilities on edge devices, Dolphin paves the way for advanced language processing in a wide range of applications where computational resources are at a premium. The Dolphin model is publicly available at https://huggingface.co/NexaAIDev/Dolphin.
♻ ☆ OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. Language model systems often enable LLMs to generate code for arithmetic operations to achieve accurate calculations. However, this approach compromises speed and security, and fine-tuning risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in a single autoregressive step, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of an LLM to control a symbolic architecture that performs arithmetic. Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100% accuracy on single arithmetic operations ($+,-,\times,\div,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o with and without a code interpreter. Furthermore, OccamLlama outperforms GPT 4o with and without a code interpreter on average across a range of mathematical problem solving benchmarks, demonstrating that OccamLLMs can excel in arithmetic tasks, even surpassing much larger models. We will make our code public shortly.
♻ ☆ Flood of Techniques and Drought of Theories: Emotion Mining in Disasters
Emotion mining has become a crucial tool for understanding human emotions during disasters, leveraging the extensive data generated on social media platforms. This paper aims to summarize existing research on emotion mining within disaster contexts, highlighting both significant discoveries and persistent issues. On the one hand, emotion mining techniques have achieved acceptable accuracy enabling applications such as rapid damage assessment and mental health surveillance. On the other hand, with many studies adopting data-driven approaches, several methodological issues remain. These include arbitrary emotion classification, ignoring biases inherent in data collection from social media, such as the overrepresentation of individuals from higher socioeconomic status on Twitter, and the lack of application of theoretical frameworks like cross-cultural comparisons. These problems can be summarized as a notable lack of theory-driven research and ignoring insights from social and behavioral sciences. This paper underscores the need for interdisciplinary collaboration between computer scientists and social scientists to develop more robust and theoretically grounded approaches in emotion mining. By addressing these gaps, we aim to enhance the effectiveness and reliability of emotion mining methodologies, ultimately contributing to improved disaster preparedness, response, and recovery. Keywords: emotion mining, sentiment analysis, natural disasters, psychology, technological disasters
♻ ☆ How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates decision-making capabilities of LLMs through the lens of Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We introduce GAMA($\gamma$)-Bench, which evaluates LLMs' Gaming Ability in Multi-Agent environments. $\gamma$-Bench includes eight classical multi-agent games and a scoring scheme specially designed to quantitatively assess LLMs' performance. Leveraging $\gamma$-Bench, we investigate LLMs' robustness, generalizability, and strategies for enhancement. Results reveal that while GPT-3.5 shows satisfying robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we evaluate twelve versions from six models, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. We find that Gemini-1.5-Pro outperforms other models with a score of $63.8$ out of $100$, followed by LLaMA-3.1-70B and GPT-4 with scores of $60.9$ and $60.5$, respectively. The code and experimental results are made publicly available via https://github.com/CUHK-ARISE/GAMABench.
comment: 11 pages of main text. 20 pages of appendices. 12 figures, 9 tables. Added models: Gemini-1.5-Pro, LLaMA-3.1-{7, 70, 405}B, Mixtral-8x{7, 22}B, Qwen-2-72B
♻ ☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
♻ ☆ COFFEE: A Contrastive Oracle-Free Framework for Event Extraction ATC
Event extraction is a complex information extraction task that involves extracting events from unstructured text. Prior classification-based methods require comprehensive entity annotations for joint training, while newer generation-based methods rely on heuristic templates containing oracle information such as event type, which is often unavailable in real-world scenarios. In this study, we consider a more realistic setting of this task, namely the Oracle-Free Event Extraction (OFEE) task, where only the input context is given without any oracle information, including event type, event ontology and trigger word. To solve this task, we propose a new framework, called COFFEE, which extracts the events solely based on the document context without referring to any oracle information. In particular, a contrastive selection model is introduced in COFFEE to rectify the generated triggers and handle multi-event instances. The proposed COFFEE outperforms state-of-the-art approaches under the oracle-free setting of the event extraction task, as evaluated on a public event extraction benchmark ACE05.
comment: Accepted to MATCHING Workshop at ACL 2023
♻ ☆ Persian Slang Text Conversion to Formal and Deep Learning of Persian Short Texts on Social Media for Sentiment Classification
The lack of a suitable tool for the analysis of conversational texts in the Persian language has made various analyses of these texts, including sentiment analysis, difficult. In this research, we try to make the understanding of these texts easier for the machine by providing PSC, the Persian Slang Converter, a tool for converting conversational texts into formal ones, and by using the most up-to-date deep learning methods along with PSC to improve the machine's sentiment learning of short Persian texts. More than 10 million unlabeled texts from various social networks and movie subtitles (as conversational texts) and about 10 million news texts (as formal texts) have been used for training unsupervised models and the formal implementation of the tool. 60,000 texts from the comments of Instagram users, labeled as positive, negative, or neutral, are considered supervised data for training the emotion classification model for short texts. Using the formalization tool, 57% of the words in the conversational corpus were converted. Finally, using the formalizer, the FastText model, and a deep LSTM network, an accuracy of 81.91 was obtained on the test data.
comment: 16 pages, 4 figures, 14 tables
♻ ☆ Preference Learning Algorithms Do Not Learn Preference Rankings
Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via $\textit{ranking accuracy}$. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the $\textit{idealized ranking accuracy}$ that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant $\textit{alignment gap}$ -- $\textit{i.e.}$, a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
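A small sketch of the ranking-accuracy metric discussed above, assuming it is simply the fraction of preference pairs where the model assigns higher log-likelihood to the chosen response than to the rejected one; the scorer below is a toy stand-in for a real model call:

```python
def ranking_accuracy(pairs, log_likelihood):
    # pairs: iterable of (prompt, chosen_response, rejected_response) triples.
    hits = sum(
        log_likelihood(prompt, chosen) > log_likelihood(prompt, rejected)
        for prompt, chosen, rejected in pairs
    )
    return hits / len(pairs)

# Toy usage with a fake scorer that prefers shorter responses.
fake_ll = lambda p, y: -len(y)
pairs = [("q1", "yes", "absolutely not"), ("q2", "maybe later", "no")]
print(ranking_accuracy(pairs, fake_ll))   # 0.5
```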
♻ ☆ A Voter-Based Stochastic Rejection-Method Framework for Asymptotically Safe Language Model Outputs
This paper proposes a new method for preventing unsafe or otherwise low quality large language model (LLM) outputs, by leveraging the stochasticity of LLMs. We propose a system whereby LLM checkers vote on the acceptability of a generated output, regenerating it if a threshold of disapproval is reached, until sufficient checkers approve. We further propose estimators for cost and failure rate, and based on those estimators and experimental data tailored to the application, we propose an algorithm that achieves a desired failure rate at the least possible cost. We demonstrate that, under these models, failure rate decreases exponentially as a function of cost when voter count and threshold are chosen according to the algorithm, and that the models reasonably estimate the actual performance of such a system in action, even with limited data.
comment: 7 pages, 2 figures
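A hedged sketch of the vote-and-regenerate loop described above; `generate` and `checker_approves` are placeholders for stochastic LLM calls, and the voter count, threshold, and retry cap below are arbitrary illustrative values:

```python
import random

def safe_generate(prompt, generate, checker_approves, n_voters=5,
                  disapproval_threshold=2, max_attempts=10):
    for _ in range(max_attempts):
        candidate = generate(prompt)
        # Each checker votes independently on the candidate output.
        disapprovals = sum(not checker_approves(prompt, candidate)
                           for _ in range(n_voters))
        if disapprovals < disapproval_threshold:
            return candidate             # enough checkers approved
        # Otherwise regenerate and vote again.
    return None                          # give up after max_attempts

# Toy usage with random stand-ins for the generator and its checkers.
gen = lambda p: f"answer-{random.randint(0, 9)}"
ok = lambda p, c: random.random() > 0.3
print(safe_generate("hello", gen, ok))
```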
♻ ☆ ARN: Analogical Reasoning on Narratives
As a core cognitive skill that enables the transferability of information across domains, analogical reasoning has been extensively studied for both humans and computational models. However, while cognitive theories of analogy often focus on narratives and study the distinction between surface, relational, and system similarities, existing work in natural language processing has a narrower focus on relational analogies between word pairs. This gap raises a natural question: can state-of-the-art large language models (LLMs) detect system analogies between narratives? To gain insight into this question and extend word-based relational analogies to relational system analogies, we devise a comprehensive computational framework that operationalizes dominant theories of analogy, using narrative elements to create surface and system mappings. Leveraging the interplay between these mappings, we create a binary task and benchmark for Analogical Reasoning on Narratives (ARN), covering four categories of far (cross-domain)/near (within-domain) analogies and disanalogies. We show that while all LLMs can largely recognize near analogies, even the largest ones struggle with far analogies in a zero-shot setting, with GPT4.0 scoring below random. Guiding the models through solved examples and chain-of-thought reasoning enhances their analogical reasoning ability. Yet, since even in the few-shot setting, the best model only performs halfway between random and humans, ARN opens exciting directions for computational analogical reasoners.
♻ ☆ Zyda: A 1.3T Dataset for Open Language Modeling
The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In this paper, we introduce Zyda (Zyphra Dataset), a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus. We apply rigorous filtering and deduplication processes, both within and across datasets, to maintain and enhance the quality derived from the original datasets. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite. Our rigorous data processing methods significantly enhance Zyda's effectiveness, outperforming even the best of its constituent datasets when used independently.
♻ ☆ Correction with Backtracking Reduces Hallucination in Summarization
Abstractive summarization aims at generating natural language summaries of a source document that are succinct while preserving the important elements. Despite recent advances, neural text summarization models are known to be susceptible to hallucination (or, more accurately, confabulation), that is, producing summaries with details that are not grounded in the source document. In this paper, we introduce a simple yet efficient technique, CoBa, to reduce hallucination in abstractive summarization. The approach is based on two steps: hallucination detection and mitigation. We show that the former can be achieved through measuring simple statistics about conditional word probabilities and distance to context words. Further, we demonstrate that straightforward backtracking is surprisingly effective at mitigation. We thoroughly evaluate the proposed method against prior art on three benchmark datasets for text summarization. The results show that CoBa is effective and efficient in reducing hallucination, and offers great adaptability and flexibility. Code can be found at https://github.com/zhenzhel/CoBa.
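A simplified sketch of detect-then-backtrack decoding in the spirit of the two steps above; the single probability threshold is a stand-in for the paper's detection statistics, and `sample_next` is a hypothetical placeholder for one decoder step:

```python
import random

def coba_style_decode(context, sample_next, max_len=50, prob_floor=0.05,
                      backtrack=3, max_retries=20):
    summary, retries = [], 0
    while len(summary) < max_len and retries < max_retries:
        token, prob = sample_next(context, summary)
        if prob < prob_floor and summary:
            # Suspected hallucination: drop a few recent tokens and resample.
            summary = summary[:-backtrack] if len(summary) > backtrack else []
            retries += 1
            continue
        summary.append(token)
        if token == "<eos>":
            break
    return summary

# Toy usage with a random stand-in for a real decoder step.
toy_step = lambda ctx, out: (random.choice(["word", "<eos>"]), random.random())
print(coba_style_decode("source document text", toy_step))
```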
♻ ☆ Investigating the Robustness of LLMs on Math Word Problems
Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, ProbleMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and better ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to ~6%.
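A toy illustration of the adversarial-variant idea above: an irrelevant numerical sentence is injected into a math word problem. The template and names are made up for illustration; the paper generates variants with an LLM-based prompting framework, not a fixed template:

```python
import random

def add_irrelevant_variable(problem, rng=None):
    # Append a numerical statement that plays no role in the solution.
    rng = rng or random.Random(0)
    name = rng.choice(["Sam", "Priya", "Lee"])
    noise = f" {name} also has {rng.randint(2, 99)} stickers."
    return problem + noise

print(add_irrelevant_variable(
    "Tom has 3 apples and buys 4 more. How many apples does he have?"))
```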
♻ ☆ A Survey on Responsible Generative AI: What to Generate and What Not
In recent years, generative AI (GenAI), like large language models and text-to-image models, has received significant attention across various domains. However, ensuring the responsible generation of content by these models is crucial for their real-world applicability. This raises an interesting question: What should responsible GenAI generate, and what should it not? To answer the question, this paper investigates the practical responsible requirements of both textual and visual generative models, outlining five key considerations: generating truthful content, avoiding toxic content, refusing harmful instructions, leaking no training data-related content, and ensuring that generated content is identifiable. Specifically, we review recent advancements and challenges in addressing these requirements. In addition, we discuss and emphasize the importance of responsible GenAI across healthcare, education, finance, and artificial general intelligence domains. Through a unified perspective on both textual and visual generative models, this paper aims to provide insights into practical safety-related issues and further benefit the community in building responsible GenAI.
comment: 77 pages, 10 figures
♻ ☆ GPT has become financially literate: Insights from financial literacy tests of GPT and a preliminary test of how people use it as a source of advice
We assess the ability of GPT -- a large language model -- to serve as a financial robo-advisor for the masses, by using a financial literacy test. Davinci and ChatGPT based on GPT-3.5 score 66% and 65% on the financial literacy test, respectively, compared to a baseline of 33%. However, ChatGPT based on GPT-4 achieves a near-perfect 99% score, pointing to financial literacy becoming an emergent ability of state-of-the-art models. We use the Judge-Advisor System and a savings dilemma to illustrate how researchers might assess advice-utilization from large language models. We also present a number of directions for future research.
comment: 43 pages, 2 figures and 2 tables in main text; in V2 added information that this is the Author Accepted Manuscript version
♻ ☆ Interpretation of Intracardiac Electrograms Through Textual Representations
Understanding the irregular electrical activity of atrial fibrillation (AFib) has been a key challenge in electrocardiography. For serious cases of AFib, catheter ablations are performed to collect intracardiac electrograms (EGMs). EGMs offer intricately detailed and localized electrical activity of the heart and are an ideal modality for interpretable cardiac studies. Recent advancements in artificial intelligence (AI) have allowed some works to utilize deep learning frameworks to interpret EGMs during AFib. Additionally, language models (LMs) have shown exceptional performance in being able to generalize to unseen domains, especially in healthcare. In this study, we are the first to leverage pretrained LMs, fine-tuning them for EGM interpolation and AFib classification via masked language modeling. We formulate the EGM as a textual sequence and present competitive performance on AFib classification compared against other representations. Lastly, we provide a comprehensive interpretability study to provide a multi-perspective intuition of the model's behavior, which could greatly benefit clinical use.
comment: 17 pages, 7 figures; Accepted to CHIL 2024
♻ ☆ RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback ICML 2024
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
comment: Presented at ICML 2024
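A schematic sketch of the direct-RLAIF idea of obtaining a scalar reward directly from an off-the-shelf LLM during RL; `query_llm`, the rating prompt, and the 1-10 scale are hypothetical stand-ins rather than the exact protocol used in the paper:

```python
import re

def query_llm(prompt: str) -> str:
    """Hypothetical call to an off-the-shelf LLM API; replace with a real client."""
    raise NotImplementedError

def direct_rlaif_reward(context: str, response: str) -> float:
    """Ask the labeler LLM for a 1-10 rating and map it to [0, 1] as the RL reward."""
    prompt = (
        "Rate the quality of the following summary on a scale of 1 to 10.\n"
        f"Text: {context}\nSummary: {response}\nRating:"
    )
    answer = query_llm(prompt)
    match = re.search(r"\d+", answer)
    score = int(match.group()) if match else 1
    return (min(max(score, 1), 10) - 1) / 9.0  # normalized reward for the policy update
```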
Computer Vision and Pattern Recognition 67
☆ Coaching a Robotic Sonographer: Learning Robotic Ultrasound with Sparse Expert's Feedback
Ultrasound is widely employed for clinical intervention and diagnosis, due to its advantages of offering non-invasive, radiation-free, and real-time imaging. However, the accessibility of this dexterous procedure is limited due to the substantial training and expertise required of operators. Robotic ultrasound (RUS) offers a viable solution to address this limitation; nonetheless, achieving human-level proficiency remains challenging. Learning from demonstrations (LfD) methods have been explored in RUS, which learn the policy prior from a dataset of offline demonstrations to encode the mental model of the expert sonographer. However, active engagement of experts, i.e., coaching, during the training of RUS has not been explored thus far. Coaching is known for enhancing efficiency and performance in human training. This paper proposes a coaching framework for RUS to amplify its performance. The framework combines DRL (self-supervised practice) with sparse expert's feedback through coaching. The DRL employs an off-policy Soft Actor-Critic (SAC) network, with a reward based on image quality rating. The coaching by experts is modeled as a Partially Observable Markov Decision Process (POMDP), which updates the policy parameters based on the correction by the expert. The validation study on phantoms showed that coaching increases the learning rate by $25\%$ and the number of high-quality image acquisitions by $74.5\%$.
comment: Accepted in IEEE Transactions on Medical Robotics and Bionics (TMRB) 2024
☆ What Do You See in Common? Learning Hierarchical Prototypes over Tree-of-Life to Discover Evolutionary Traits
A grand challenge in biology is to discover evolutionary traits - features of organisms common to a group of species with a shared ancestor in the tree of life (also referred to as phylogenetic tree). With the growing availability of image repositories in biology, there is a tremendous opportunity to discover evolutionary traits directly from images in the form of a hierarchy of prototypes. However, current prototype-based methods are mostly designed to operate over a flat structure of classes and face several challenges in discovering hierarchical prototypes, including the issue of learning over-specific features at internal nodes. To overcome these challenges, we introduce the framework of Hierarchy aligned Commonality through Prototypical Networks (HComP-Net). We empirically show that HComP-Net learns prototypes that are accurate, semantically consistent, and generalizable to unseen species in comparison to baselines on bird, butterfly, and fish datasets. The code and datasets are available at https://github.com/Imageomics/HComPNet.
comment: 34 pages, 27 figures
☆ YoloTag: Vision-based Robust UAV Navigation with Fiducial Markers
By harnessing fiducial markers as visual landmarks in the environment, Unmanned Aerial Vehicles (UAVs) can rapidly build precise maps and navigate spaces safely and efficiently, unlocking their potential for fluent collaboration and coexistence with humans. Existing fiducial marker methods rely on handcrafted feature extraction, which sacrifices accuracy. On the other hand, deep learning pipelines for marker detection fail to meet real-time runtime constraints crucial for navigation applications. In this work, we propose YoloTag, a real-time fiducial marker-based localization system. YoloTag uses a lightweight YOLO v8 object detector to accurately detect fiducial markers in images while meeting the runtime constraints needed for navigation. The detected markers are then used by an efficient perspective-n-point algorithm to estimate UAV states. However, this localization system introduces noise, causing instability in trajectory tracking. To suppress noise, we design a higher-order Butterworth filter that effectively eliminates noise through frequency domain analysis. We evaluate our algorithm through real-robot experiments in an indoor environment, comparing the trajectory tracking performance of our method against other approaches in terms of several distance metrics.
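A minimal sketch of the noise-suppression idea using a low-pass Butterworth filter on estimated UAV positions; the filter order, cutoff, and sampling rate below are illustrative assumptions, not the values used in YoloTag, and the zero-phase `filtfilt` pass is an offline simplification:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def smooth_trajectory(positions, fs=30.0, cutoff_hz=2.0, order=4):
    """Low-pass filter a (T, 3) array of estimated UAV positions.

    fs is the state-estimate rate in Hz; cutoff_hz and order are illustrative choices.
    """
    b, a = butter(order, cutoff_hz / (0.5 * fs), btype="low")
    return filtfilt(b, a, positions, axis=0)  # zero-phase filtering along time

# Example: smooth a noisy trajectory sampled at 30 Hz.
t = np.linspace(0, 10, 300)
noisy = np.stack([np.sin(t), np.cos(t), 0.1 * t], axis=1) + 0.05 * np.random.randn(300, 3)
smoothed = smooth_trajectory(noisy)
```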
☆ Visual Servoing for Robotic On-Orbit Servicing: A Survey
On-orbit servicing (OOS) activities will power the next big step for sustainable exploration and commercialization of space. Developing robotic capabilities for autonomous OOS operations is a priority for the space industry. Visual Servoing (VS) enables robots to achieve the precise manoeuvres needed for critical OOS missions by utilizing visual information for motion control. This article presents an overview of existing VS approaches for autonomous OOS operations with space manipulator systems (SMS). We divide the approaches according to their contribution to the typical phases of a robotic OOS mission: a) Recognition, b) Approach, and c) Contact. We also present a discussion on the reviewed VS approaches, identifying current trends. Finally, we highlight the challenges and areas for future research on VS techniques for robotic OOS.
comment: Accepted for publication at the 2024 International Conference on Space Robotics (iSpaRo)
☆ Geometry-aware Feature Matching for Large-Scale Structure from Motion
Establishing consistent and dense correspondences across multiple images is crucial for Structure from Motion (SfM) systems. Significant view changes, such as air-to-ground with very sparse view overlap, pose an even greater challenge to the correspondence solvers. We present a novel optimization-based approach that significantly enhances existing feature matching methods by introducing geometry cues in addition to color cues. This helps fill gaps when there is less overlap in large-scale scenarios. Our method formulates geometric verification as an optimization problem, guiding feature matching within detector-free methods and using sparse correspondences from detector-based methods as anchor points. By enforcing geometric constraints via the Sampson Distance, our approach ensures that the denser correspondences from detector-free methods are geometrically consistent and more accurate. This hybrid strategy significantly improves correspondence density and accuracy, mitigates multi-view inconsistencies, and leads to notable advancements in camera pose accuracy and point cloud density. It outperforms state-of-the-art feature matching methods on benchmark datasets and enables feature matching in challenging extreme large-scale settings.
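A small sketch of the Sampson distance used above as the geometric-consistency measure; F is the fundamental matrix relating the two views and points are in pixel coordinates. How this error is folded into the matching optimization is not shown and is specific to the paper:

```python
import numpy as np

def sampson_distance(F, x1, x2):
    """First-order geometric error of correspondences x1 <-> x2, each (N, 2), under F (3, 3)."""
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])            # homogeneous points in image 1
    p2 = np.hstack([x2, ones])            # homogeneous points in image 2
    Fp1 = p1 @ F.T                        # epipolar lines F x1 in image 2
    Ftp2 = p2 @ F                         # epipolar lines F^T x2 in image 1
    algebraic = np.sum(p2 * Fp1, axis=1)  # x2^T F x1 per correspondence
    denom = Fp1[:, 0]**2 + Fp1[:, 1]**2 + Ftp2[:, 0]**2 + Ftp2[:, 1]**2
    return algebraic**2 / denom
```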
☆ QID$^2$: An Image-Conditioned Diffusion Model for Q-space Up-sampling of DWI Data MICCAI 2024
We propose an image-conditioned diffusion model to estimate high angular resolution diffusion weighted imaging (DWI) from a low angular resolution acquisition. Our model, which we call QID$^2$, takes as input a set of low angular resolution DWI data and uses this information to estimate the DWI data associated with a target gradient direction. We leverage a U-Net architecture with cross-attention to preserve the positional information of the reference images, further guiding the target image generation. We train and evaluate QID$^2$ on single-shell DWI samples curated from the Human Connectome Project (HCP) dataset. Specifically, we sub-sample the HCP gradient directions to produce low angular resolution DWI data and train QID$^2$ to reconstruct the missing high angular resolution samples. We compare QID$^2$ with two state-of-the-art GAN models. Our results demonstrate that QID$^2$ not only achieves higher-quality generated images, but it consistently outperforms the GAN models in downstream tensor estimation across multiple metrics. Taken together, this study highlights the potential of diffusion models, and QID$^2$ in particular, for q-space up-sampling, thus offering a promising toolkit for clinical and research applications.
comment: Accepted at MICCAI 2024 International Workshop on Computational Diffusion MRI. Zijian Chen and Jueqi Wang contributed equally to this work
☆ Unsupervised Welding Defect Detection Using Audio And Video
In this work we explore the application of AI to robotic welding. Robotic welding is a widely used technology in many industries, but robots currently do not have the capability to detect welding defects which get introduced due to various reasons in the welding process. We describe how deep-learning methods can be applied to detect weld defects in real-time by recording the welding process with microphones and a camera. Our findings are based on a large database with more than 4000 welding samples we collected which covers different weld types, materials and various defect categories. All deep learning models are trained in an unsupervised fashion because the space of possible defects is large and the defects in our data may contain biases. We demonstrate that a reliable real-time detection of most categories of weld defects is feasible both from audio and video, with improvements achieved by combining both modalities. Specifically, the multi-modal approach achieves an average Area-under-ROC-Curve (AUC) of 0.92 over all eleven defect types in our data. We conclude the paper with an analysis of the results by defect type and a discussion of future work.
comment: 21 pages
☆ Biochemical Prostate Cancer Recurrence Prediction: Thinking Fast & Slow
Time to biochemical recurrence in prostate cancer is essential for prognostic monitoring of the progression of patients after prostatectomy, which assesses the efficacy of the surgery. In this work, we proposed to leverage multiple instance learning through a two-stage ``thinking fast \& slow'' strategy for the time to recurrence (TTR) prediction. The first (``thinking fast'') stage finds the most relevant WSI area for biochemical recurrence and the second (``thinking slow'') stage leverages higher resolution patches to predict TTR. Our approach reveals a mean C-index ($Ci$) of 0.733 ($\theta=0.059$) on our internal validation and $Ci=0.603$ on the LEOPARD challenge validation set. Post hoc attention visualization shows that the most attentive area contributes to the TTR prediction.
comment: 8 pages, 3 figures, methodology paper for LEOPARD Challenge
☆ K-Origins: Better Colour Quantification for Neural Networks
K-Origins is a neural network layer designed to improve image-based network performances when learning colour, or intensities, is beneficial. Over 250 encoder-decoder convolutional networks are trained and tested on 16-bit synthetic data, demonstrating that K-Origins improves semantic segmentation accuracy in two scenarios: object detection with low signal-to-noise ratios, and segmenting multiple objects that are identical in shape but vary in colour. K-Origins generates output features from the input features, $\textbf{X}$, by the equation $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$ for each trainable parameter $w_k$, where $\textbf{J}$ is a matrix of ones. Additionally, networks with varying receptive fields were trained to determine optimal network depths based on the dimensions of target classes, suggesting that receptive field lengths should exceed object sizes. By ensuring a sufficient receptive field length and incorporating K-Origins, we can achieve better semantic network performance.
comment: 16 pages, 13 figures, 1 table
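A minimal PyTorch sketch of the layer equation above, $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$: each trainable origin $w_k$ is subtracted from the input and the shifted copies are concatenated along the channel dimension. The number of origins and their initialization are illustrative assumptions:

```python
import torch
import torch.nn as nn

class KOrigins(nn.Module):
    """For input X of shape (B, C, H, W), returns (B, K*C, H, W) with Y_k = X - w_k."""

    def __init__(self, num_origins: int = 4):
        super().__init__()
        # Trainable scalar origins w_k; uniform init over [0, 1] is an assumption.
        self.w = nn.Parameter(torch.rand(num_origins))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Subtracting the scalar w_k is equivalent to subtracting J * w_k, J being all ones.
        shifted = [x - w for w in self.w]
        return torch.cat(shifted, dim=1)

# Example: intensities normalized to [0, 1], as in a 16-bit image scaled down.
features = KOrigins(num_origins=4)(torch.rand(2, 1, 64, 64))  # -> (2, 4, 64, 64)
```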
☆ Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems
Recent developments in vision language models (VLM) have shown great potential for diverse applications related to image understanding. In this study, we have explored state-of-the-art VLM models for vision-based transportation engineering tasks such as image classification and object detection. The image classification task involves congestion detection and crack identification, whereas, for object detection, helmet violations were identified. We have applied open-source models such as CLIP, BLIP, OWL-ViT, Llava-Next, and closed-source GPT-4o to evaluate the performance of these state-of-the-art VLM models to harness the capabilities of language understanding for vision-based transportation tasks. These tasks were performed by applying zero-shot prompting to the VLM models, as zero-shot prompting involves performing tasks without any training on those tasks. It eliminates the need for annotated datasets or fine-tuning for specific tasks. Although these models gave results comparable to benchmark Convolutional Neural Network (CNN) models on the image classification tasks, they still need improvement for object localization tasks. Therefore, this study provides a comprehensive evaluation of the state-of-the-art VLM models, highlighting the advantages and limitations of the models, which can be taken as the baseline for future improvement and wide-scale implementation.
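A minimal sketch of zero-shot image classification with one of the open-source models named above (CLIP via Hugging Face transformers), in the spirit of the congestion-detection task; the checkpoint and the prompt wording are illustrative assumptions, not the study's exact setup:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder; replace with a real road-scene image
prompts = ["a photo of congested traffic", "a photo of free-flowing traffic"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape (1, num_prompts)
probs = logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```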
☆ ADHD diagnosis based on action characteristics recorded in videos using machine learning
Demand for ADHD diagnosis and treatment is increasing significantly and the existing services are unable to meet the demand in a timely manner. In this work, we introduce a novel action recognition method for ADHD diagnosis by identifying and analysing raw video recordings. Our main contributions include 1) designing and implementing a test focusing on the attention and hyperactivity/impulsivity of participants, recorded through three cameras; 2) implementing a novel machine learning ADHD diagnosis system based on action recognition neural networks for the first time; 3) proposing classification criteria to provide diagnosis results and analysis of ADHD action characteristics.
comment: Neuroscience Applied
☆ Action-Based ADHD Diagnosis in Video
Attention Deficit Hyperactivity Disorder (ADHD) causes significant impairment in various domains. Early diagnosis of ADHD and treatment could significantly improve the quality of life and functioning. Recently, machine learning methods have improved the accuracy and efficiency of the ADHD diagnosis process. However, the cost of the equipment and trained staff required by the existing methods is generally high. Therefore, we introduce the video-based frame-level action recognition network to ADHD diagnosis for the first time. We also record a real multi-modal ADHD dataset and extract three action classes from the video modality for ADHD diagnosis. The whole process data have been reported to CNTW-NHS Foundation Trust, where they will be reviewed by medical consultants/professionals and made public in due course.
comment: 31st European Symposium on Artificial Neural Networks
☆ Optimal L-Systems for Stochastic L-system Inference Problems
This paper presents two novel theorems that address two open problems in stochastic Lindenmayer-system (L-system) inference, specifically focusing on the construction of an optimal stochastic L-system capable of generating a given sequence of strings. The first theorem delineates a method for crafting a stochastic L-system that maximizes the likelihood of producing a given sequence of words through a singular derivation. Furthermore, the second theorem determines the stochastic L-systems with the highest probability of producing a given sequence of words with multiple possible derivations. From these, we introduce an algorithm to infer an optimal stochastic L-system from a given sequence. This algorithm incorporates sophisticated optimization techniques, such as interior point methods, ensuring production of a stochastically optimal stochastic L-system suitable for generating the given sequence. This enables the use of stochastic L-systems as models for machine learning trained using only positive data.
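A small sketch of sampling a derivation from a stochastic L-system, the object the inference problem above operates on; the example grammar and its production probabilities are illustrative, not taken from the paper:

```python
import random

# Each symbol maps to a list of (probability, successor) pairs summing to 1.
RULES = {
    "A": [(0.7, "AB"), (0.3, "A")],
    "B": [(1.0, "A")],
}

def derive(axiom: str, steps: int, rules=RULES) -> str:
    """Apply the stochastic productions in parallel for a fixed number of steps."""
    word = axiom
    for _ in range(steps):
        next_word = []
        for sym in word:
            options = rules.get(sym)
            if options is None:          # terminal symbol: copy unchanged
                next_word.append(sym)
                continue
            probs, successors = zip(*options)
            next_word.append(random.choices(successors, weights=probs, k=1)[0])
        word = "".join(next_word)
    return word

print(derive("A", steps=5))
```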
☆ How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?
Large foundation models have revolutionized the field, yet challenges remain in optimizing multi-modal models for specialized visual tasks. We propose a novel, generalizable methodology to identify preferred image distributions for black-box Vision-Language Models (VLMs) by measuring output consistency across varied input prompts. Applying this to different rendering types of 3D objects, we demonstrate its efficacy across various domains requiring precise interpretation of complex structures, with a focus on Computer-Aided Design (CAD) as an exemplar field. We further refine VLM outputs using in-context learning with human feedback, significantly enhancing explanation quality. To address the lack of benchmarks in specialized domains, we introduce CAD-VQA, a new dataset for evaluating VLMs on CAD-related visual question answering tasks. Our evaluation of state-of-the-art VLMs on CAD-VQA establishes baseline performance levels, providing a framework for advancing VLM capabilities in complex visual reasoning tasks across various fields requiring expert-level visual interpretation. We release the dataset and evaluation codes at \url{https://github.com/asgsaeid/cad_vqa}.
☆ NoiseAttack: An Evasive Sample-Specific Multi-Targeted Backdoor Attack Through White Gaussian Noise
Backdoor attacks pose a significant threat when using third-party data for deep learning development. In these attacks, data can be manipulated to cause a trained model to behave improperly when a specific trigger pattern is applied, providing the adversary with unauthorized advantages. While most existing works focus on designing both visible and invisible trigger patterns to poison the victim class, they typically result in a single targeted class upon the success of the backdoor attack, meaning that the victim class can only be converted to another class based on the adversary's predefined value. In this paper, we address this issue by introducing a novel sample-specific multi-targeted backdoor attack, namely NoiseAttack. Specifically, we adopt White Gaussian Noise (WGN) with various Power Spectral Densities (PSD) as our underlying triggers, coupled with a unique training strategy to execute the backdoor attack. This work is the first of its kind to launch a vision backdoor attack with the intent to generate multiple targeted classes with minimal input configuration. Furthermore, our extensive experimental results demonstrate that NoiseAttack can achieve a high attack success rate against popular network architectures and datasets, as well as bypass state-of-the-art backdoor detection methods. Our source code and experiments are available at https://github.com/SiSL-URI/NoiseAttack/tree/main.
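A small sketch of the kind of white-Gaussian-noise trigger described above: zero-mean Gaussian noise whose variance (and hence flat power spectral density level) selects the target class. The standard deviations, label mapping, and simple additive blending below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def add_wgn_trigger(image: np.ndarray, std: float) -> np.ndarray:
    """Blend zero-mean white Gaussian noise of a chosen std into an image in [0, 1]."""
    noise = np.random.normal(loc=0.0, scale=std, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

# Different noise powers act as different triggers, each mapped to its own target label.
trigger_std_to_target = {0.02: 1, 0.05: 7}  # illustrative mapping of noise power -> class
clean = np.random.rand(32, 32, 3)
poisoned = {std: add_wgn_trigger(clean, std) for std in trigger_std_to_target}
```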
☆ A Novel Audio-Visual Information Fusion System for Mental Disorders Detection
Mental disorders are among the foremost contributors to the global healthcare challenge. Research indicates that timely diagnosis and intervention are vital in treating various mental disorders. However, the early somatization symptoms of certain mental disorders may not be immediately evident, often resulting in their oversight and misdiagnosis. Additionally, the traditional diagnosis methods incur high time and monetary costs. Deep learning methods based on fMRI and EEG have improved the efficiency of the mental disorder detection process. However, the cost of the equipment and trained staff is generally high. Moreover, most systems are only trained for a specific mental disorder and are not general-purpose. Recently, physiological studies have shown that there are some speech and facial-related symptoms in a few mental disorders (e.g., depression and ADHD). In this paper, we focus on the emotional expression features of mental disorders and introduce a multimodal mental disorder diagnosis system based on audio-visual information input. Our proposed system is based on spatial-temporal attention networks and innovatively uses a less computationally intensive pre-trained audio recognition network to fine-tune the video recognition module for better results. We also apply the unified system to multiple mental disorders (ADHD and depression) for the first time. The proposed system achieves over 80\% accuracy on the real multimodal ADHD dataset and achieves state-of-the-art results on the depression dataset AVEC 2014.
comment: 27th International Conference on Information (FUSION)
☆ What makes a face looks like a hat: Decoupling low-level and high-level Visual Properties with Image Triplets ECCV2024
In visual decision making, high-level features, such as object categories, have a strong influence on choice. However, the impact of low-level features on behavior is less understood partly due to the high correlation between high- and low-level features in the stimuli presented (e.g., objects of the same category are more likely to share low-level features). To disentangle these effects, we propose a method that de-correlates low- and high-level visual properties in a novel set of stimuli. Our method uses two Convolutional Neural Networks (CNNs) as candidate models of the ventral visual stream: the CORnet-S that has high neural predictivity in high-level, IT-like responses and the VGG-16 that has high neural predictivity in low-level responses. Triplets (root, image1, image2) of stimuli are parametrized by the level of low- and high-level similarity of images extracted from the different layers. These stimuli are then used in a decision-making task where participants are tasked to choose the most similar-to-the-root image. We found that different networks show differing abilities to predict the effects of low-versus-high-level similarity: while CORnet-S outperforms VGG-16 in explaining human choices based on high-level similarity, VGG-16 outperforms CORnet-S in explaining human choices based on low-level similarity. Using Brain-Score, we observed that the behavioral prediction abilities of different layers of these networks qualitatively corresponded to their ability to explain neural activity at different levels of the visual hierarchy. In summary, our algorithm for stimulus set generation enables the study of how different representations in the visual stream affect high-level cognitive behaviors.
comment: Accepted at Workshop on Human-inspired Computer Vision @ ECCV2024
☆ EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision
Estimating touch contact and pressure in egocentric vision is a central task for downstream applications in Augmented Reality, Virtual Reality, as well as many robotic applications, because it provides precise physical insights into hand-object interaction and object manipulation. However, existing contact pressure datasets lack egocentric views and hand poses, which are essential for accurate estimation during in-situ operation, both for AR/VR interaction and robotic manipulation. In this paper, we introduce EgoPressure, a novel dataset of touch contact and pressure interaction from an egocentric perspective, complemented with hand pose meshes and fine-grained pressure intensities for each contact. The hand poses in our dataset are optimized using our proposed multi-view sequence-based method that processes footage from our capture rig of 8 accurately calibrated RGBD cameras. EgoPressure comprises 5.0 hours of touch contact and pressure interaction from 21 participants captured by a moving egocentric camera and 7 stationary Kinect cameras, which provided RGB images and depth maps at 30 Hz. In addition, we provide baselines for estimating pressure with different modalities, which will enable future developments and benchmarking on the dataset. Overall, we demonstrate that pressure and hand poses are complementary, which supports our intention to better facilitate the physical understanding of hand-object interactions in AR/VR and robotics research.
☆ Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling
This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and understanding human language acquisition. We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we found that multilingualism does not affect the bias in this VGS model similarly to what is observed in children.
comment: PhD Dissertation
☆ Unveiling Deep Shadows: A Survey on Image and Video Shadow Detection, Removal, and Generation in the Era of Deep Learning
Shadows are formed when light encounters obstacles, leading to areas of diminished illumination. In computer vision, shadow detection, removal, and generation are crucial for enhancing scene understanding, refining image quality, ensuring visual consistency in video editing, and improving virtual environments. This paper presents a comprehensive survey of shadow detection, removal, and generation in images and videos within the deep learning landscape over the past decade, covering tasks, deep models, datasets, and evaluation metrics. Our key contributions include a comprehensive survey of shadow analysis, standardization of experimental comparisons, exploration of the relationships among model size, speed, and performance, a cross-dataset generalization study, identification of open issues and future directions, and provision of publicly available resources to support further research.
comment: Publicly available results, trained models, and evaluation metrics at https://github.com/xw-hu/Unveiling-Deep-Shadows
☆ DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction
Reconstructing scenes and tracking motion are two sides of the same coin. Tracking points allow for geometric reconstruction [14], while geometric reconstruction of (dynamic) scenes allows for 3D tracking of points over time [24, 39]. The latter was recently also exploited for 2D point tracking to overcome occlusion ambiguities by lifting tracking directly into 3D [38]. However, the above approaches require either offline processing or multi-view camera setups, both of which are unrealistic for real-world applications like robot navigation or mixed reality. We target the challenge of online 2D and 3D point tracking from unposed monocular camera input, introducing Dynamic Online Monocular Reconstruction (DynOMo). We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion. Our approach extends 3D Gaussians to capture new content and object motions while estimating camera movements from a single RGB frame. DynOMo stands out by enabling the emergence of point trajectories through robust image feature reconstruction and a novel similarity-enhanced regularization term, without requiring any correspondence-level supervision. It sets the first baseline for online point tracking with monocular unposed cameras, achieving performance on par with existing methods. We aim to inspire the community to advance online point tracking and reconstruction, expanding the applicability to diverse real-world scenarios.
☆ Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models ECCV 2024
This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework employing vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach involves assessing image clearness and providing semantics using vision-language models on real data, serving as supervision signals for training restoration models. For clearness enhancement, we use real-world data, utilizing a dual-step strategy with pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art works.
comment: Accepted by ECCV 2024
☆ LinFusion: 1 GPU, 1 Minute, 16K Image
Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we propose a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, generating high-resolution images like 16K resolution. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation efforts. Codes are available at https://github.com/Huage001/LinFusion.
comment: Work in Progress. Codes are available at https://github.com/Huage001/LinFusion
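A compact sketch of non-causal linear (kernelized) attention with feature-map normalization, the O(N) alternative to softmax self-attention referred to above; the ELU+1 feature map follows common linear-attention practice and is not necessarily LinFusion's exact generalized token mixer:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention for tensors of shape (batch, tokens, dim).

    Cost is O(N * d^2) in the token count N, versus O(N^2 * d) for softmax attention.
    """
    q = F.elu(q) + 1.0  # positive feature map phi(q)
    k = F.elu(k) + 1.0  # positive feature map phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                         # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)   # normalization term
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

x = torch.randn(2, 4096, 64)
out = linear_attention(x, x, x)  # (2, 4096, 64), linear in the number of tokens
```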
☆ DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter achieves generalization ability to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy with the compiled paired video-depth datasets. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
comment: Project webpage: https://depthcrafter.github.io
☆ GraspSplats: Efficient Manipulation with 3D Feature Splatting
The ability for robots to perform efficient and zero-shot grasping of object parts is crucial for practical applications and is becoming prevalent with recent advances in Vision-Language Models (VLMs). To bridge the 2D-to-3D gap for representations to support such a capability, existing methods rely on neural fields (NeRFs) via differentiable rendering or point-based projection methods. However, we demonstrate that NeRFs are inappropriate for scene changes due to their implicitness and point-based methods are inaccurate for part localization without rendering-based optimization. To amend these issues, we propose GraspSplats. Using depth supervision and a novel reference feature computation method, GraspSplats generates high-quality scene representations in under 60 seconds. We further validate the advantages of Gaussian-based representation by showing that the explicit and optimized geometry in GraspSplats is sufficient to natively support (1) real-time grasp sampling and (2) dynamic and articulated object manipulation with point trackers. With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings. In particular, GraspSplats outperforms NeRF-based methods like F3RM and LERF-TOGO, and 2D detection methods.
comment: Project webpage: https://graspsplats.github.io/
☆ Physical Rule-Guided Convolutional Neural Network
The black-box nature of Convolutional Neural Networks (CNNs) and their reliance on large datasets limit their use in complex domains with limited labeled data. Physics-Guided Neural Networks (PGNNs) have emerged to address these limitations by integrating scientific principles and real-world knowledge, enhancing model interpretability and efficiency. This paper proposes a novel Physics-Guided CNN (PGCNN) architecture that incorporates dynamic, trainable, automatically LLM-generated, and widely recognized rules, integrated into the model as custom layers, to address challenges like limited data and low confidence scores. The PGCNN is evaluated on multiple datasets, demonstrating superior performance compared to a baseline CNN model. Key improvements include a significant reduction in false positives and enhanced confidence scores for true detection. The results highlight the potential of PGCNNs to improve CNN performance for broader application areas.
♻ ☆ Open-vocabulary Temporal Action Localization using VLMs
Video action localization aims to find the timing of a specific action in a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLM). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess a frame that is considered to be closest to the start/end of the action. Iterating this process by narrowing a sampling time window results in finding the specific frames of the start and end of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos. A sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/.
comment: 7 pages, 5 figures, 4 tables. Last updated on September 3rd, 2024
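A schematic sketch of the iterative narrowing loop described above; `ask_vlm_for_frame` stands in for the concatenated-image VLM query and is hypothetical, as are the sample count and iteration budget:

```python
def ask_vlm_for_frame(frame_indices, action, boundary):
    """Hypothetical: render the sampled frames into one grid image with index labels and
    ask the VLM which index is closest to the `boundary` ('start' or 'end') of `action`."""
    raise NotImplementedError

def localize_boundary(num_frames, action, boundary, samples=8, iterations=4):
    lo, hi = 0, num_frames - 1
    for _ in range(iterations):
        step = max((hi - lo) // (samples - 1), 1)
        candidates = list(range(lo, hi + 1, step))
        guess = ask_vlm_for_frame(candidates, action, boundary)
        # Narrow the sampling window around the guessed frame and repeat.
        lo = max(lo, guess - step)
        hi = min(hi, guess + step)
    return (lo + hi) // 2

# start = localize_boundary(3000, "opening a door", "start")
# end = localize_boundary(3000, "opening a door", "end")
```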
♻ ☆ A Multiscale Gradient Fusion Method for Edge Detection in Color Images Utilizing the CBM3D Filter
In this paper, a color edge detection strategy based on collaborative filtering combined with multiscale gradient fusion is proposed. The block-matching and 3D (BM3D) filter is used to enhance the sparse representation in the transform domain and achieve the effect of denoising, whereas the multiscale gradient fusion makes up for the loss of details in single-scale edge detection and improves the edge detection resolution and quality. First, the RGB images in the dataset are converted to XYZ color space images through mathematical operations. Second, the colored block-matching and 3D (CBM3D) filter is applied to the sparse images to remove noise interference. Then, the vector gradients of the color image and the anisotropic Gaussian directional derivatives at the two scale parameters are calculated and averaged pixel-by-pixel to obtain a new edge strength map. Finally, the edge features are enhanced by image normalization and non-maximum suppression technology, and on that basis, the edge contour is obtained by double threshold selection and a new morphological refinement method. Through an experimental analysis of the edge detection dataset, the proposed method shows good noise robustness and high edge quality, outperforming Color Sobel, Color Canny, SE, and Color AGDD as shown by the PR curve, AUC, PSNR, MSE, and FOM indicators.
comment: 1 figure, 2 tables
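A simplified sketch of the multiscale gradient-fusion step: gradient magnitudes computed at two scales are averaged pixel-by-pixel and normalized. Isotropic Gaussian derivatives are used here for brevity in place of the anisotropic Gaussian directional derivatives in the paper, and the scale values are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_magnitude(img: np.ndarray, sigma: float) -> np.ndarray:
    """Gradient magnitude of a 2D grayscale image using Gaussian derivative filters."""
    gx = gaussian_filter(img, sigma=sigma, order=[0, 1])  # derivative along columns
    gy = gaussian_filter(img, sigma=sigma, order=[1, 0])  # derivative along rows
    return np.hypot(gx, gy)

def fused_edge_strength(img: np.ndarray, sigmas=(1.0, 2.5)) -> np.ndarray:
    """Average the gradient magnitudes over the two scales, then normalize to [0, 1]."""
    fused = np.mean([gradient_magnitude(img, s) for s in sigmas], axis=0)
    return (fused - fused.min()) / (np.ptp(fused) + 1e-8)
```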
♻ ☆ IBO: Inpainting-Based Occlusion to Enhance Explainable Artificial Intelligence Evaluation in Histopathology
Histopathological image analysis is crucial for accurate cancer diagnosis and treatment planning. While deep learning models, especially convolutional neural networks, have advanced this field, their "black-box" nature raises concerns about interpretability and trustworthiness. Explainable Artificial Intelligence (XAI) techniques aim to address these concerns, but evaluating their effectiveness remains challenging. A significant issue with current occlusion-based XAI methods is that they often generate Out-of-Distribution (OoD) samples, leading to inaccurate evaluations. In this paper, we introduce Inpainting-Based Occlusion (IBO), a novel occlusion strategy that utilizes a Denoising Diffusion Probabilistic Model to inpaint occluded regions in histopathological images. By replacing cancerous areas with realistic, non-cancerous tissue, IBO minimizes OoD artifacts and preserves data integrity. We evaluate our method on the CAMELYON16 dataset through two phases: first, by assessing perceptual similarity using the Learned Perceptual Image Patch Similarity (LPIPS) metric, and second, by quantifying the impact on model predictions through Area Under the Curve (AUC) analysis. Our results demonstrate that IBO significantly improves perceptual fidelity, achieving nearly twice the improvement in LPIPS scores compared to the best existing occlusion strategy. Additionally, IBO increased the precision of XAI performance prediction from 42% to 71% compared to traditional methods. These results demonstrate IBO's potential to provide more reliable evaluations of XAI techniques, benefiting histopathology and other applications. The source code for this study is available at https://github.com/a-fsh-r/IBO.
comment: 19 pages, 6 figures
♻ ☆ Learning a Generalized Physical Face Model From Data
Physically-based simulation is a powerful approach for 3D facial animation as the resulting deformations are governed by physical constraints, allowing to easily resolve self-collisions, respond to external forces and perform realistic anatomy edits. Today's methods are data-driven, where the actuations for finite elements are inferred from captured skin geometry. Unfortunately, these approaches have not been widely adopted due to the complexity of initializing the material space and learning the deformation model for each character separately, which often requires a skilled artist followed by lengthy network training. In this work, we aim to make physics-based facial animation more accessible by proposing a generalized physical face model that we learn from a large 3D face dataset. Once trained, our model can be quickly fit to any unseen identity and produce a ready-to-animate physical face model automatically. Fitting is as easy as providing a single 3D face scan, or even a single face image. After fitting, we offer intuitive animation controls, as well as the ability to retarget animations across characters. All the while, the resulting animations allow for physical effects like collision avoidance, gravity, paralysis, bone reshaping and more.
♻ ☆ $OC^4-ReID$: Occluded Cloth-Changing Person Re-Identification
The study of Cloth-Changing Person Re-identification (CC-ReID) focuses on retrieving specific pedestrians when their clothing has changed, typically under the assumption that the entire pedestrian images are visible. Pedestrian images in real-world scenarios, however, are often partially obscured by obstacles, presenting a significant challenge to existing CC-ReID systems. In this paper, we introduce a more challenging task termed Occluded Cloth-Changing Person Re-Identification ($OC^4-ReID$), which simultaneously addresses two challenges of clothing changes and occlusion. Concretely, we construct two new datasets, Occ-LTCC and Occ-PRCC, based on original CC-ReID datasets to include random occlusions of key pedestrians components (e.g., head, torso). Moreover, a novel benchmark is proposed for $OC^4-ReID$ incorporating a Train-Test Micro Granularity Screening ($T^2MGS$) module to mitigate the influence of occlusion and proposing a Part-Robust Triplet (PRT) loss for partial features learning. Comprehensive experiments on the proposed datasets, as well as on two CC-ReID benchmark datasets demonstrate the superior performance of proposed method against other state-of-the-art methods. The codes and datasets are available at: https://github.com/1024AILab/OC4-ReID.
♻ ☆ Restorer: Removing Multi-Degradation with All-Axis Attention and Prompt Guidance
There are many excellent solutions in image restoration. However, most methods require training separate models to restore images with different types of degradation. Although existing all-in-one models effectively address multiple types of degradation simultaneously, their performance in real-world scenarios is still constrained by the task confusion problem. In this work, we attempt to address this issue by introducing \textbf{Restorer}, a novel Transformer-based all-in-one image restoration model. To effectively address the complex degradation present in real-world images, we propose All-Axis Attention (AAA), a mechanism that simultaneously models long-range dependencies across both spatial and channel dimensions, capturing potential correlations along all axes. Additionally, we introduce textual prompts in Restorer to incorporate explicit task priors, enabling the removal of specific degradation types based on user instructions. By iterating over these prompts, Restorer can handle composite degradation in real-world scenarios without requiring additional training. Based on these designs, Restorer with one set of parameters demonstrates state-of-the-art performance in multiple image restoration tasks compared to existing all-in-one and even single-task models. Additionally, Restorer is efficient during inference, suggesting its potential in real-world applications.
♻ ☆ On the Federated Learning Framework for Cooperative Perception
Cooperative perception (CP) is essential to enhance the efficiency and safety of future transportation systems, requiring extensive data sharing among vehicles on the road, which raises significant privacy concerns. Federated learning offers a promising solution by enabling data privacy-preserving collaborative enhancements in perception, decision-making, and planning among connected and autonomous vehicles (CAVs). However, federated learning is impeded by significant challenges arising from data heterogeneity across diverse clients, potentially diminishing model accuracy and prolonging convergence periods. This study introduces a specialized federated learning framework for CP, termed the federated dynamic weighted aggregation (FedDWA) algorithm, facilitated by a dynamic adjusting loss (DALoss) function. This framework employs dynamic client weighting to direct model convergence and integrates a novel loss function that utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental effects of non-independently and identically distributed (Non-IID) and unbalanced data. Utilizing the BEV transformer as the primary model, our rigorous testing on the OpenV2V dataset, augmented with FedBEVT data, demonstrates significant improvements in the average intersection over union (IoU). These results highlight the substantial potential of our federated learning framework to address data heterogeneity challenges in CP, thereby enhancing the accuracy of environmental perception models and facilitating more robust and efficient collaborative learning solutions in the transportation sector.
comment: accepted by IEEE RA-L
♻ ☆ 3DGS.zip: A survey on 3D Gaussian Splatting Compression Methods
We present a work-in-progress survey on 3D Gaussian Splatting compression methods, focusing on their statistical performance across various benchmarks. This survey aims to facilitate comparability by summarizing key statistics of different compression approaches in a tabulated format. The datasets evaluated include TanksAndTemples, MipNeRF360, DeepBlending, and SyntheticNeRF. For each method, we report the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and the resultant size in megabytes (MB), as provided by the respective authors. This is an ongoing, open project, and we invite contributions from the research community as GitHub issues or pull requests. Please visit http://w-m.github.io/3dgs-compression-survey/ for more information and a sortable version of the table.
comment: 3D Gaussian Splatting compression survey; 3DGS compression; new approaches added
♻ ☆ SUMix: Mixup with Semantic and Uncertain Information ECCV2024
Mixup data augmentation approaches have been applied to various tasks of deep learning to improve the generalization ability of deep neural networks. Some existing approaches, such as CutMix and SaliencyMix, randomly replace a patch in one image with patches from another to generate the mixed image. Similarly, the corresponding labels are linearly combined with a fixed ratio $\lambda$. The objects in two images may be overlapped during the mixing process, so some semantic information is corrupted in the mixed samples. In this case, the mixed image does not match the mixed label information. Moreover, such a label may mislead the deep learning model training, which results in poor performance. To solve this problem, we propose a novel approach named SUMix to learn the mixing ratio as well as the uncertainty for the mixed samples during the training process. First, we design a learnable similarity function to compute an accurate mix ratio. Second, an approach is investigated as a regularized term to model the uncertainty of the mixed samples. We conduct experiments on five image benchmarks, and extensive experimental results imply that our method is capable of improving the performance of classifiers with different cutting-based mixup approaches. The source code is available at https://github.com/JinXins/SUMix.
comment: Accepted by ECCV2024 [Camera Ready] (19 pages, 7 figures) with the source code at https://github.com/JinXins/SUMix
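For reference, a minimal sketch of the standard mixup combination that the fixed-ratio criticism above targets; SUMix replaces the fixed $\lambda$ for the labels with a learned similarity-based ratio and an uncertainty term, neither of which is reproduced here:

```python
import numpy as np

def mixup(x1, x2, y1, y2, alpha=1.0):
    """Vanilla mixup: inputs and one-hot labels are combined with the same fixed lambda."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam

x_a, x_b = np.random.rand(3, 32, 32), np.random.rand(3, 32, 32)
y_a, y_b = np.eye(10)[3], np.eye(10)[7]      # one-hot labels for two classes
x_mix, y_mix, lam = mixup(x_a, x_b, y_a, y_b)
```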
♻ ☆ SPIdepth: Strengthened Pose Information for Self-supervised Monocular Depth Estimation
Self-supervised monocular depth estimation has garnered considerable attention for its applications in autonomous driving and robotics. While recent methods have made strides in leveraging techniques like the Self Query Layer (SQL) to infer depth from motion, they often overlook the potential of strengthening pose information. In this paper, we introduce SPIdepth, a novel approach that prioritizes enhancing the pose network for improved depth estimation. Building upon the foundation laid by SQL, SPIdepth emphasizes the importance of pose information in capturing fine-grained scene structures. By enhancing the pose network's capabilities, SPIdepth achieves remarkable advancements in scene understanding and depth estimation. Experimental results on benchmark datasets such as KITTI, Cityscapes, and Make3D showcase SPIdepth's state-of-the-art performance, surpassing previous methods by significant margins. Specifically, SPIdepth tops the self-supervised KITTI benchmark. Additionally, SPIdepth achieves the lowest AbsRel (0.029), SqRel (0.069), and RMSE (1.394) on KITTI, establishing new state-of-the-art results. On Cityscapes, SPIdepth shows improvements over SQLdepth of 21.7% in AbsRel, 36.8% in SqRel, and 16.5% in RMSE, even without using motion masks. On Make3D, SPIdepth in zero-shot outperforms all other models. Remarkably, SPIdepth achieves these results using only a single image for inference, surpassing even methods that utilize video sequences for inference, thus demonstrating its efficacy and efficiency in real-world applications. Our approach represents a significant leap forward in self-supervised monocular depth estimation, underscoring the importance of strengthening pose information for advancing scene understanding in real-world applications. The code and pre-trained models are publicly available at https://github.com/Lavreniuk/SPIdepth.
♻ ☆ Realigned Softmax Warping for Deep Metric Learning
Deep Metric Learning (DML) loss functions traditionally aim to control the forces of separability and compactness within an embedding space so that the same class data points are pulled together and different class ones are pushed apart. Within the context of DML, a softmax operation will typically normalize distances into a probability for optimization, thus coupling all the push/pull forces together. This paper proposes a potential new class of loss functions that operate within a Euclidean domain and aim to take full advantage of the coupled forces governing embedding space formation under a softmax. These forces of compactness and separability can be boosted or mitigated within controlled locations at will by using a warping function. In this work, we provide a simple example of a warping function and use it to achieve competitive, state-of-the-art results on various metric learning benchmarks.
comment: Preprint
♻ ☆ Progressive Domain Adaptation for Thermal Infrared Object Tracking
Due to the lack of large-scale labeled Thermal InfraRed (TIR) training datasets, most existing TIR trackers are trained directly on RGB datasets. However, tracking methods trained on RGB datasets suffer a significant drop-off on TIR data due to the domain shift issue. To this end, in this work, we propose a Progressive Domain Adaptation framework for TIR Tracking (PDAT), which transfers useful knowledge learned from RGB tracking to TIR tracking. The framework makes full use of large-scale labeled RGB datasets without requiring time-consuming and labor-intensive labeling of large-scale TIR data. Specifically, we first propose an adversarial-based global domain adaptation module to coarsely reduce the domain gap at the feature level. Second, we design a clustering-based subdomain adaptation method to further finely align the feature distributions of the RGB and TIR datasets. These two domain adaptation modules gradually eliminate the discrepancy between the two domains, and thus learn domain-invariant fine-grained features through progressive training. Additionally, we collect a large-scale TIR dataset with over 1.48 million unlabeled TIR images for training the proposed domain adaptation framework. Experimental results on five TIR tracking benchmarks show that the proposed method gains nearly 6% in success rate, demonstrating its effectiveness.
comment: 10 pages, 8 figures
♻ ☆ An Efficient Instance Segmentation Framework Using Segmentation Foundation Models with Oriented Bounding Box Prompts
Instance segmentation in unmanned aerial vehicle measurement is a long-standing challenge. Since horizontal bounding boxes introduce many interference objects, oriented bounding boxes (OBBs) are usually used for instance identification. However, based on ``segmentation within bounding box'' paradigm, current instance segmentation methods using OBBs are overly dependent on bounding box detection performance. To tackle this, this paper proposes OBSeg, an efficient instance segmentation framework using OBBs. OBSeg is based on box prompt-based segmentation foundation models (BSMs), e.g., Segment Anything Model. Specifically, OBSeg first detects OBBs to distinguish instances and provide coarse localization information. Then, it predicts OBB prompt-related masks for fine segmentation. Since OBBs only serve as prompts, OBSeg alleviates the over-dependence on bounding box detection performance of current instance segmentation methods using OBBs. In addition, to enable BSMs to handle OBB prompts, we propose a novel OBB prompt encoder. To make OBSeg more lightweight and further improve the performance of lightweight distilled BSMs, a Gaussian smoothing-based knowledge distillation method is introduced. Experiments demonstrate that OBSeg outperforms current instance segmentation methods on multiple public datasets. The code is available at https://github.com/zhen6618/OBBInstanceSegmentation.
♻ ☆ Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations
Composed image retrieval extends content-based image retrieval systems by enabling users to search using reference images and captions that describe their intention. Despite great progress in developing image-text compositors to extract discriminative visual-linguistic features, we identify a hitherto overlooked issue, triplet ambiguity, which impedes robust feature extraction. Triplet ambiguity refers to a type of semantic ambiguity that arises between the reference image, the relative caption, and the target image. It is mainly due to the limited representation of the annotated text, resulting in many noisy triplets where multiple visually dissimilar candidate images can be matched to an identical reference pair (i.e., a reference image + a relative caption). To address this challenge, we propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals. Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings, fostering complementary feature extraction and mitigating dependence on any single, potentially biased compositor; (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions to promote consensual outputs. During evaluation, the decisions of the four compositors are combined through a weighting scheme, enhancing overall agreement. On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and 6.67% boost in R@50, underscoring its competitiveness in addressing the fundamental limitations of existing methods.
comment: Accepted by Knowledge-Based Systems (KBS)
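As a rough sketch of the consensus idea, assuming each compositor outputs a similarity distribution over candidate targets (the compositor architectures, the exact KL formulation, and the inference weighting are not specified here and are assumptions):

```python
import torch
import torch.nn.functional as F

def consensus_kl_loss(logits_list):
    """Encourage agreement between compositors via KL divergence to their mean prediction.

    logits_list: list of (B, N) similarity logits, one tensor per compositor.
    """
    probs = [F.softmax(l, dim=-1) for l in logits_list]
    mean_p = torch.stack(probs).mean(dim=0)
    loss = 0.0
    for p in probs:
        # kl_div expects log-probabilities as input and probabilities as target
        loss = loss + F.kl_div(mean_p.log(), p, reduction="batchmean")
    return loss / len(probs)

def combine_at_inference(logits_list, weights):
    # Weighted combination of the compositors' decisions at evaluation time.
    stacked = torch.stack(logits_list)              # (K, B, N)
    w = torch.tensor(weights).view(-1, 1, 1)
    return (w * stacked).sum(dim=0)                 # (B, N) fused retrieval scores
```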
♻ ☆ CAST: Cross-Attention in Space and Time for Video Action Recognition NeurIPS 2023
Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
comment: Accepted at NeurIPS 2023. Project webpage is available at https://jong980812.github.io/CAST.github.io/ Code is available at https://github.com/KHU-VLL/CAST
♻ ☆ TALDS-Net: Task-Aware Adaptive Local Descriptors Selection for Few-shot Image Classification ICASSP 2024
Few-shot image classification aims to classify images from unseen novel classes with few samples. Recent works demonstrate that deep local descriptors exhibit enhanced representational capabilities compared to image-level features. However, most existing methods solely rely on either employing all local descriptors or directly utilizing partial descriptors, potentially resulting in the loss of crucial information. Moreover, these methods primarily emphasize the selection of query descriptors while overlooking support descriptors. In this paper, we propose a novel Task-Aware Adaptive Local Descriptors Selection Network (TALDS-Net), which exhibits the capacity for adaptive selection of task-aware support descriptors and query descriptors. Specifically, we compare the similarity of each local support descriptor with other local support descriptors to obtain the optimal support descriptor subset and then compare the query descriptors with the optimal support subset to obtain discriminative query descriptors. Extensive experiments demonstrate that our TALDS-Net outperforms state-of-the-art methods on both general and fine-grained datasets.
comment: 4 pages, 1 figure, accepted by ICASSP 2024
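A minimal sketch of the two-step selection described above, assuming cosine similarity and fixed top-k budgets (both assumptions; the paper's selection is adaptive and task-aware):

```python
import torch
import torch.nn.functional as F

def select_support_descriptors(support, k=64):
    """Keep the k support descriptors most similar to the other support descriptors.

    support: (Ns, D) local descriptors from the support set of one class.
    """
    s = F.normalize(support, dim=-1)
    sim = s @ s.t()                                  # (Ns, Ns) cosine similarities
    sim.fill_diagonal_(0)                            # ignore self-similarity
    score = sim.mean(dim=1)                          # agreement with the rest of the set
    idx = score.topk(min(k, support.size(0))).indices
    return support[idx]

def select_query_descriptors(query, support_subset, k=32):
    """Keep the query descriptors most similar to the selected support subset."""
    q = F.normalize(query, dim=-1)
    s = F.normalize(support_subset, dim=-1)
    score = (q @ s.t()).max(dim=1).values            # best match per query descriptor
    idx = score.topk(min(k, query.size(0))).indices
    return query[idx]
```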
♻ ☆ Asynchronous Blob Tracker for Event Cameras
Event-based cameras are popular for tracking fast-moving objects due to their high temporal resolution, low latency, and high dynamic range. In this paper, we propose a novel algorithm for tracking event blobs using raw events asynchronously in real time. We introduce the concept of an event blob as a spatio-temporal likelihood of event occurrence where the conditional spatial likelihood is blob-like. Many real-world objects such as car headlights or any quickly moving foreground objects generate event blob data. The proposed algorithm uses a nearest neighbour classifier with a dynamic threshold criterion for data association coupled with an extended Kalman filter to track the event blob state. Our algorithm achieves highly accurate blob tracking, velocity estimation, and shape estimation even under challenging lighting conditions and high-speed motions (> 11000 pixels/s). The microsecond time resolution achieved means that the filter output can be used to derive secondary information such as time-to-contact or range estimation, enabling applications to real-world problems such as collision avoidance in autonomous driving.
comment: 18 pages, 16 figures, Manuscript was accepted on August 7, 2024, by IEEE Transactions on Robotics
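A highly simplified sketch of the per-event pipeline (gated nearest-neighbour association plus a Kalman update on a constant-velocity centroid state). The paper's filter is an extended Kalman filter over a richer blob state including shape, which is omitted here; the noise parameters and gate are placeholders.

```python
import numpy as np

class SimpleBlobTracker:
    """Per-event tracker sketch: gated association + Kalman update on [x, y, vx, vy]."""

    def __init__(self, x0, y0, gate=5.0):
        self.state = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4) * 10.0
        self.Q = np.eye(4) * 1e-3          # process noise (assumed)
        self.R = np.eye(2) * 1.0           # per-event measurement noise (assumed)
        self.gate = gate                   # association threshold in pixels
        self.t_prev = None

    def update(self, x, y, t):
        # Predict with a constant-velocity model over the elapsed time.
        dt = 0.0 if self.t_prev is None else t - self.t_prev
        F = np.eye(4); F[0, 2] = F[1, 3] = dt
        self.state = F @ self.state
        self.P = F @ self.P @ F.T + self.Q * max(dt, 1e-6)
        # Gate the incoming event against the predicted centroid.
        z = np.array([x, y])
        H = np.zeros((2, 4)); H[0, 0] = H[1, 1] = 1.0
        innov = z - H @ self.state
        if np.linalg.norm(innov) > self.gate:
            return False                   # event not associated with this blob
        # Standard Kalman correction step.
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.state = self.state + K @ innov
        self.P = (np.eye(4) - K @ H) @ self.P
        self.t_prev = t
        return True
```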
♻ ☆ Rethinking Barely-Supervised Volumetric Medical Image Segmentation from an Unsupervised Domain Adaptation Perspective
This paper investigates an extremely challenging problem: barely-supervised volumetric medical image segmentation (BSS). A BSS training dataset consists of two parts: 1) a barely-annotated labeled set, where each labeled image contains only a single-slice annotation, and 2) an unlabeled set comprising numerous unlabeled volumetric images. State-of-the-art BSS methods employ a registration-based paradigm, which uses inter-slice image registration to propagate single-slice annotations into volumetric pseudo labels, constructing a completely annotated labeled set, to which a semi-supervised segmentation scheme can be applied. However, the paradigm has a critical limitation: the pseudo-labels generated by image registration are unreliable and noisy. Motivated by this, we propose a new perspective: instead of solving BSS within a semi-supervised learning scheme, this work formulates BSS as an unsupervised domain adaptation problem. To this end, we propose a novel BSS framework, \textbf{B}arely-supervised learning \textbf{via} unsupervised domain \textbf{A}daptation (BvA), as an alternative to the dominant registration paradigm. Specifically, we first design a novel noise-free labeled data construction algorithm (NFC) for slice-to-volume labeled data synthesis. Then, we introduce a frequency and spatial Mix-Up strategy (FSX) to mitigate the domain shifts. Extensive experiments demonstrate that our method provides a promising alternative for BSS. Remarkably, the proposed method, trained on the left atrial segmentation dataset with \textbf{only one} barely-labeled image, achieves a Dice score of 81.20%, outperforming the state-of-the-art by 61.71%. The code is available at \url{https://github.com/Senyh/BvA}.
♻ ☆ Planning and Rendering: Towards Product Poster Generation with Diffusion Models
Product poster generation significantly optimizes design efficiency and reduces production costs. Prevailing methods predominantly rely on image-inpainting methods to generate clean background images for given products. Subsequently, poster layout generation methods are employed to produce corresponding layout results. However, the background images may not be suitable for accommodating textual content due to their complexity, and the fixed location of products limits the diversity of layout results. To alleviate these issues, we propose a novel product poster generation framework based on diffusion models named P\&R. The P\&R draws inspiration from the workflow of designers in creating posters, which consists of two stages: Planning and Rendering. At the planning stage, we propose a PlanNet to generate the layout of the product and other visual components considering both the appearance features of the product and semantic features of the text, which improves the diversity and rationality of the layouts. At the rendering stage, we propose a RenderNet to generate the background for the product while considering the generated layout, where a spatial fusion module is introduced to fuse the layout of different visual components. To foster the advancement of this field, we propose the first product poster generation dataset PPG30k, comprising 30k exquisite product poster images along with comprehensive image and text annotations. Our method outperforms the state-of-the-art product poster generation methods on PPG30k. The PPG30k will be released soon.
♻ ☆ Unveiling the Human-like Similarities of Automatic Facial Expression Recognition: An Empirical Exploration through Explainable AI
Facial expression recognition is vital for human behavior analysis, and deep learning has enabled models that can outperform humans. However, it is unclear how closely they mimic human processing. This study aims to explore the similarity between deep neural networks and human perception by comparing twelve different networks, including both general object classifiers and FER-specific models. We employ an innovative global explainable AI method to generate heatmaps, revealing crucial facial regions for the twelve networks trained on six facial expressions. We assess these results both quantitatively and qualitatively, comparing them to ground truth masks based on Friesen and Ekman's descriptions, as well as with one another. We use Intersection over Union (IoU) and normalized correlation coefficients for comparisons. We generate 72 heatmaps to highlight critical regions for each expression and architecture. Qualitatively, models with pre-trained weights show more similarity in heatmaps compared to those without pre-training. Specifically, eye and nose areas influence certain facial expressions, while the mouth is consistently important across all models and expressions. Quantitatively, we find low average IoU values (avg. 0.2702) across all expressions and architectures. The best-performing architecture averages 0.3269, while the worst-performing one averages 0.2066. Dendrograms, built with the normalized correlation coefficient, reveal two main clusters for most expressions: models with pre-training and models without pre-training. Findings suggest limited alignment between human and AI facial expression recognition, with network architectures influencing the similarity, as similar architectures prioritize similar facial regions.
comment: Multimed Tools Appl (2024)
♻ ☆ Learning Exposure Correction in Dynamic Scenes
Exposure correction aims to enhance visual data suffering from improper exposure, which can greatly improve visual quality. However, previous methods mainly focus on the image modality, and the video counterpart is less explored in the literature. Directly applying prior image-based methods to videos results in temporal incoherence with low visual quality. Through thorough investigation, we find that the development of relevant communities is limited by the absence of a benchmark dataset. Therefore, in this paper, we construct the first real-world paired video dataset, including both underexposure and overexposure dynamic scenes. To achieve spatial alignment, we utilize two DSLR cameras and a beam splitter to simultaneously capture improper and normal exposure videos. Additionally, we propose an end-to-end video exposure correction network, in which a dual-stream module is designed to deal with both underexposure and overexposure factors, enhancing the illumination based on Retinex theory. The extensive experiments based on various metrics and user studies demonstrate the significance of our dataset and the effectiveness of our method. The code and dataset are available at https://github.com/kravrolens/VECNet.
comment: To be published at ACM Multimedia 2024
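The learned dual-stream network itself is not reproduced here; as a reference point for the Retinex idea the abstract relies on, a classical Retinex-style adjustment (illumination estimated by Gaussian smoothing, then re-exposed) can be sketched as follows, with the gamma and blur scale being arbitrary assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_adjust(img, gamma=0.6, sigma=15, eps=1e-4):
    """Minimal Retinex-style exposure adjustment (not the paper's learned network).

    img: float array in [0, 1], shape (H, W) or (H, W, 3).
    Decomposes I = R * L with a blurred illumination estimate L, then re-exposes
    by gamma-correcting L only, leaving the reflectance R untouched.
    """
    lum = img.mean(axis=-1) if img.ndim == 3 else img
    L = gaussian_filter(lum, sigma) + eps                    # smooth illumination estimate
    R = img / (L[..., None] if img.ndim == 3 else L)         # reflectance
    L_adj = L ** gamma                                        # brighten under-exposed regions
    out = R * (L_adj[..., None] if img.ndim == 3 else L_adj)
    return np.clip(out, 0.0, 1.0)
```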
♻ ☆ Deep Learning for Computer Vision based Activity Recognition and Fall Detection of the Elderly: a Systematic Review
As the percentage of elderly people in developed countries increases worldwide, the healthcare of this group is a growing concern, especially with regard to preserving their autonomy. In this direction, many studies are being published on Ambient Assisted Living (AAL) systems, which help to address the concerns raised by elderly people living independently. In this study, a systematic review of the literature is presented on fall detection and Human Activity Recognition (HAR) for the elderly, as the two main tasks to solve to guarantee the safety of elderly people living alone. In line with the current tendency in addressing these two tasks, the review focuses on the use of Deep Learning (DL) based approaches on computer vision data. In addition, different collections of data such as DL models, datasets, or hardware (e.g. depth or thermal cameras) are gathered from the reviewed studies and provided for reference in future studies. Strengths and weaknesses of existing approaches are also discussed and, based on them, our recommendations for future work are provided.
♻ ☆ RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation
The Segment Anything Model (SAM) has gained significant attention for its impressive performance in image segmentation. However, it lacks proficiency in referring video object segmentation (RVOS) due to the need for precise user-interactive prompts and a limited understanding of different modalities, such as language and vision. This paper presents the RefSAM model, which explores the potential of SAM for RVOS by incorporating multi-view information from diverse modalities and successive frames at different timestamps in an online manner. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP that projects the text embedding of the referring expression into sparse and dense embeddings, serving as user-interactive prompts. Additionally, we have introduced the hierarchical dense attention module to fuse hierarchical visual semantic information with sparse embeddings to obtain fine-grained dense embeddings, and an implicit tracking module to generate a tracking token and provide historical information for the mask decoder. Furthermore, we employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively. Through comprehensive ablation studies, we demonstrate our model's practical and effective design choices. Extensive experiments conducted on Refer-Youtube-VOS, Ref-DAVIS17, and three referring image segmentation datasets validate the superiority and effectiveness of our RefSAM model over existing methods.
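As a rough sketch of how the lightweight Cross-Modal MLP could map a referring-expression embedding into SAM-style sparse and dense prompt embeddings (the dimensions, layer counts, and dense broadcasting below are assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class CrossModalMLP(nn.Module):
    """Sketch of projecting a referring-expression embedding into SAM-style prompts."""

    def __init__(self, text_dim=512, prompt_dim=256, dense_hw=64, n_sparse=2):
        super().__init__()
        self.n_sparse, self.prompt_dim, self.dense_hw = n_sparse, prompt_dim, dense_hw
        self.sparse_proj = nn.Sequential(
            nn.Linear(text_dim, prompt_dim), nn.GELU(),
            nn.Linear(prompt_dim, n_sparse * prompt_dim))
        self.dense_proj = nn.Linear(text_dim, prompt_dim)

    def forward(self, text_emb):                              # text_emb: (B, text_dim)
        sparse = self.sparse_proj(text_emb).view(-1, self.n_sparse, self.prompt_dim)
        dense = self.dense_proj(text_emb)                     # (B, prompt_dim)
        dense = dense[:, :, None, None].expand(-1, -1, self.dense_hw, self.dense_hw)
        return sparse, dense                                  # prompt-encoder-style outputs

# Toy usage with a made-up text embedding
sparse, dense = CrossModalMLP()(torch.randn(2, 512))
```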
♻ ☆ PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning
Transformers have revolutionized the point cloud learning task, but their quadratic complexity hinders extension to long sequences and imposes a burden on limited computational resources. The recent advent of RWKV, a fresh breed of deep sequence models, has shown immense potential for sequence modeling in NLP tasks. In this paper, we present PointRWKV, a model of linear complexity derived from the RWKV model in the NLP field with necessary modifications for point cloud learning tasks. Specifically, taking the embedded point patches as input, we first propose to explore the global processing capabilities within PointRWKV blocks using modified multi-headed matrix-valued states and a dynamic attention recurrence mechanism. To extract local geometric features simultaneously, we design a parallel branch to encode the point cloud efficiently in a fixed radius near-neighbors graph with a graph stabilizer. Furthermore, we design PointRWKV as a multi-scale framework for hierarchical feature learning of 3D point clouds, facilitating various downstream tasks. Extensive experiments on different point cloud learning tasks show our proposed PointRWKV outperforms the transformer- and mamba-based counterparts, while significantly saving about 42\% FLOPs, demonstrating its potential as an option for constructing foundational 3D models.
♻ ☆ White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
comment: Accepted at Journal of Machine Learning Research. This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes and formatting improvements. V3: improvements from journal revisions
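Schematically, each CRATE-style layer alternates a compression step (an attention-like operator interpreted as a gradient step on a coding-rate term) with a sparsification step (an ISTA-like soft-thresholding against a learned dictionary). The toy block below only mirrors that two-step structure; the operators differ from the paper's derived MSSA and ISTA layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressThenSparsify(nn.Module):
    """Toy two-step block mirroring CRATE's structure, not its exact operators."""

    def __init__(self, dim, step=0.5, lam=0.1):
        super().__init__()
        self.U = nn.Linear(dim, dim, bias=False)   # shared subspace projection
        self.D = nn.Linear(dim, dim, bias=False)   # dictionary for sparsification
        self.step, self.lam = step, lam

    def forward(self, z):                          # z: (B, N, dim) token features
        # Compression: attention over a single projection, used as a descent direction.
        q = k = v = self.U(z)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        z = z + self.step * (attn @ v)
        # Sparsification: one proximal (soft-threshold) step on the coded features.
        code = self.D(z)
        return torch.sign(code) * F.relu(code.abs() - self.lam)
```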
♻ ☆ Correlation-Embedded Transformer Tracking: A Single-Branch Framework
Developing robust and discriminative appearance models has been a long-standing research challenge in visual object tracking. In the prevalent Siamese-based paradigm, the features extracted by the Siamese-like networks are often insufficient to model the tracked targets and distractor objects, thereby hindering them from being robust and discriminative simultaneously. While most Siamese trackers focus on designing robust correlation operations, we propose a novel single-branch tracking framework inspired by the transformer. Unlike the Siamese-like feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it can suppress non-target features, resulting in target-aware feature extraction. The output features can be directly used for predicting target locations without additional correlation steps. Thus, we reformulate the two-branch Siamese tracking as a conceptually simple, fully transformer-based Single-Branch Tracking pipeline, dubbed SBT. After conducting an in-depth analysis of the SBT baseline, we summarize many effective design principles and propose an improved tracker dubbed SuperSBT. SuperSBT adopts a hierarchical architecture with a local modeling layer to enhance shallow-level features. A unified relation modeling is proposed to remove complex handcrafted layer pattern designs. SuperSBT is further improved by masked image modeling pre-training, integrating temporal modeling, and equipping with dedicated prediction heads. Thus, SuperSBT outperforms the SBT baseline by 4.7%, 3.0%, and 4.5% AUC scores on LaSOT, TrackingNet, and GOT-10K. Notably, SuperSBT greatly raises the speed of SBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves superior results on eight VOT benchmarks.
comment: Extension of SBT paper, accepted by TPAMI
♻ ☆ Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation ECCV 2024
Current weakly-supervised incremental learning for semantic segmentation (WILSS) approaches only consider replacing pixel-level annotations with image-level labels, while the training images are still from well-designed datasets. In this work, we argue that widely available web images can also be considered for the learning of new classes. To achieve this, firstly we introduce a strategy to select web images which are similar to previously seen examples in the latent space using a Fourier-based domain discriminator. Then, an effective caption-driven rehearsal strategy is proposed to preserve previously learnt classes. To our knowledge, this is the first work to rely solely on web images for both the learning of new concepts and the preservation of the already learned ones in WILSS. Experimental results show that the proposed approach can reach state-of-the-art performances without using manually selected and annotated data in the incremental steps.
comment: ECCV 2024
♻ ☆ Image-Based Virtual Try-On: A Survey
Image-based virtual try-on aims to synthesize a naturally dressed person image with a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potential. However, there is a gap between current research progress and commercial applications and an absence of a comprehensive overview of this field to accelerate its development. In this survey, we provide a comprehensive analysis of the state-of-the-art techniques and methodologies in aspects of pipeline architecture, person representation and key modules such as try-on indication, clothing warping and try-on stage. We additionally apply CLIP to assess the semantic alignment of try-on results, and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset. In addition to quantitative and qualitative evaluation of current open-source methods, unresolved issues are highlighted and future research directions are prospected to identify key trends and inspire further exploration. The uniformly implemented evaluation metrics, dataset and collected methods will be made publicly available at https://github.com/little-misfit/Survey-Of-Virtual-Try-On.
comment: 30 pages, 20 figures
♻ ☆ DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception capability, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. We introduce an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, we propose a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, retaining essential tokens and removing others to adaptively create a more compact visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks and the effectiveness of each component.
♻ ☆ Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers
Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scales. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data at scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate our proposed approach consistently outperforms prior art on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.
♻ ☆ Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection
Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. However, the ambiguous nature of anomaly definitions across contexts may introduce inaccuracy in discriminating abnormal and normal events. To show the model what is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them could be calculated to identify the suspected events for each video snippet. It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos, as well as provides a new way to label pseudo anomalies for self-training. To demonstrate its effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets, namely XD-Violence, UCF-Crime, TAD, and ShanghaiTech. Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC (86.5\%, 90.4\%, 94.4\%, and 97.4\%). Furthermore, it shows promising performance in open-set and cross-dataset cases. The data, code, and models can be found at: \url{https://github.com/shiwoaz/lap}.
♻ ☆ TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. However, existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes, leading to confusion between novel classes and semantically similar ones. In this work, we propose a novel approach, TagCLIP (Trusty-aware guided CLIP), to address this issue. We disentangle the ill-posed optimization problem into two parallel processes: semantic matching performed individually and reliability judgment for improving discrimination ability. Building on the idea of special tokens in language modeling representing sentence-level embeddings, we introduce a trusty token that enables distinguishing novel classes from known ones in prediction. To evaluate our approach, we conduct experiments on three benchmark datasets: PASCAL VOC 2012, COCO-Stuff 164K, and PASCAL Context. Our results show that TagCLIP improves the Intersection over Union (IoU) of unseen classes by 7.4%, 1.7% and 2.1%, respectively, with negligible overheads. The code is available at https://github.com/dvlab-research/TagCLIP.
comment: TPAMI2024
♻ ☆ Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach ECCV 2024
In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. G2A-VReID dataset has the following characteristics: 1) Drastic view changes; 2) Large number of annotated identities; 3) Rich outdoor scenarios; 4) Huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID by transforming the cross-platform visual alignment problem into visual-semantic alignment through vision-language model (i.e., CLIP) and applying a parameter-efficient Video Set-Level-Adapter module to adapt image-based foundation model to video ReID tasks, termed VSLA-CLIP. Besides, to further reduce the great discrepancy across the platforms, we also devise the platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on all existing video ReID datasets and our proposed G2A-VReID dataset.
comment: Published at ECCV 2024
♻ ☆ The Impact of Print-Scanning in Heterogeneous Morph Evaluation Scenarios
Face morphing attacks pose an increasing threat to face recognition (FR) systems. A morphed photo contains biometric information from two different subjects to take advantage of vulnerabilities in FRs. These systems are particularly susceptible to attacks when the morphs are subjected to print-scanning to mask the artifacts generated during the morphing process. We investigate the impact of print-scanning on morphing attack detection through a series of evaluations on heterogeneous morphing attack scenarios. Our experiments show that we can increase the Mated Morph Presentation Match Rate (MMPMR) by up to 8.48%. Furthermore, when a Single-image Morphing Attack Detection (S-MAD) algorithm is not trained to detect print-scanned morphs, the Morphing Attack Classification Error Rate (MACER) can increase by up to 96.12%, indicating significant vulnerability.
comment: Accepted as a special sessions paper at IJCB 2024
♻ ☆ AIGCs Confuse AI Too: Investigating and Explaining Synthetic Image-induced Hallucinations in Large Vision-Language Models
The evolution of Artificial Intelligence Generated Contents (AIGCs) is advancing towards higher quality. The growing interactions with AIGCs present a new challenge to the data-driven AI community: While AI-generated contents have played a crucial role in a wide range of AI models, the potential hidden risks they introduce have not been thoroughly examined. Beyond human-oriented forgery detection, AI-generated content poses potential issues for AI models originally designed to process natural data. In this study, we underscore the exacerbated hallucination phenomena in Large Vision-Language Models (LVLMs) caused by AI-synthetic images. Remarkably, our findings shed light on a consistent AIGC \textbf{hallucination bias}: the object hallucinations induced by synthetic images are characterized by a greater quantity and a more uniform position distribution, even though these synthetic images do not manifest unrealistic or additional relevant visual features compared to natural images. Moreover, our investigations on Q-former and Linear projector reveal that synthetic images may present token deviations after visual projection, thereby amplifying the hallucination bias.
♻ ☆ Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning
Recently, multi-modal vision-language foundation models have gained significant attention in the medical field. While these models offer great opportunities, they still face crucial challenges, such as the requirement for fine-grained knowledge understanding in computer-aided diagnosis and the capability of utilizing very limited or even no task-specific labeled data in real-world clinical applications. In this study, we present MaCo, a masked contrastive chest X-ray foundation model that tackles these challenges. MaCo explores masked contrastive learning to simultaneously achieve fine-grained image understanding and zero-shot learning for a variety of medical imaging tasks. It designs a correlation weighting mechanism to adjust the correlation between masked chest X-ray image patches and their corresponding reports, thereby enhancing the model's representation learning capabilities. To evaluate the performance of MaCo, we conducted extensive experiments using 6 well-known open-source X-ray datasets. The experimental results demonstrate the superiority of MaCo over 10 state-of-the-art approaches across tasks such as classification, segmentation, detection, and phrase grounding. These findings highlight the significant potential of MaCo in advancing a wide range of medical image analysis tasks.
♻ ☆ BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction
Analyzing and reconstructing visual stimuli from brain signals effectively advances the understanding of the human visual system. However, the EEG signals are complex and contain significant noise. This leads to substantial limitations in existing works of visual stimuli reconstruction from EEG, such as difficulties in aligning EEG embeddings with the fine-grained semantic information and a heavy reliance on additional large self-collected datasets for training. To address these challenges, we propose a novel approach called BrainVis. Firstly, we divide the EEG signals into various units and apply a self-supervised approach on them to obtain EEG time-domain features, in an attempt to ease the training difficulty. Additionally, we also propose to utilize the frequency-domain features to enhance the EEG representations. Then, we simultaneously align EEG time-frequency embeddings with the interpolation of the coarse and fine-grained semantics in the CLIP space, to highlight the primary visual components and reduce the cross-modal alignment difficulty. Finally, we adopt the cascaded diffusion models to reconstruct images. Using only 10\% of the training data of previous work, our proposed BrainVis outperforms the state of the art in both semantic fidelity reconstruction and generation quality. The code is available at https://github.com/RomGai/BrainVis.
♻ ☆ GISR: Geometric Initialization and Silhouette-based Refinement for Single-View Robot Pose and Configuration Estimation
In autonomous robotics, measurement of the robot's internal state and perception of its environment, including interaction with other agents such as collaborative robots, are essential. Estimating the pose of the robot arm from a single view has the potential to replace classical eye-to-hand calibration approaches and is particularly attractive for online estimation and dynamic environments. In addition to its pose, recovering the robot configuration provides a complete spatial understanding of the observed robot that can be used to anticipate the actions of other agents in advanced robotics use cases. Furthermore, this additional redundancy enables the planning and execution of recovery protocols in case of sensor failures or external disturbances. We introduce GISR - a deep configuration and robot-to-camera pose estimation method that prioritizes execution in real-time. GISR consists of two modules: (i) a geometric initialization module that efficiently computes an approximate robot pose and configuration, and (ii) a deep iterative silhouette-based refinement module that arrives at a final solution in just a few iterations. We evaluate GISR on publicly available data and show that it outperforms existing methods of the same class in terms of both speed and accuracy, and can compete with approaches that rely on ground-truth proprioception and recover only the pose.
comment: IEEE Robotics and Automation Letters (under revision), code available at http://github.com/iwhitey/GISR-robot
♻ ☆ IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection
Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from $10$ U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera and video capturing of identity documents, and testing schema unification and other identity document management functionalities.
comment: 40 pages
♻ ☆ Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients
We present, QP-SBGD, a novel layer-wise stochastic optimiser tailored towards training neural networks with binary weights, known as binary neural networks (BNNs), on quantum hardware. BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy. However, training them in practice remains to be an open challenge. Most known BNN-optimisers either rely on projected updates or binarise weights post-training. Instead, QP-SBGD approximately maps the gradient onto binary variables, by solving a quadratic constrained binary optimisation. Under practically reasonable assumptions, we show that this update rule converges with a rate of $\mathcal{O}(1 / \sqrt{T})$. Moreover, we show how the $\mathcal{NP}$-hard projection can be effectively executed on an adiabatic quantum annealer, harnessing recent advancements in quantum computation. We also introduce a projected version of this update rule and prove that if a fixed point exists in the binary variable space, the modified updates will converge to it. Last but not least, our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware. Through extensive evaluations, we show that QP-SBGD outperforms or is on par with competitive and well-established baselines such as BinaryConnect, signSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as well as binary graph neural networks.
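The projection at the heart of the method maps a continuous gradient onto binary variables by solving a quadratic binary optimisation, executed on an adiabatic annealer in the paper. The classical stand-in below brute-forces the same kind of objective for a tiny vector, purely to make the projection concrete; it is not the paper's QUBO formulation.

```python
import itertools
import numpy as np

def project_gradient_to_binary(grad, scale=1.0):
    """Brute-force stand-in for a binary projection of a gradient.

    Finds b in {-1, +1}^n minimising ||scale * b - grad||^2, i.e. the binary
    vector best aligned with the continuous gradient. Exhaustive search is only
    feasible for tiny n and is shown here to make the objective explicit.
    """
    best_b, best_cost = None, np.inf
    for bits in itertools.product((-1.0, 1.0), repeat=grad.size):
        b = np.array(bits)
        cost = np.sum((scale * b - grad) ** 2)
        if cost < best_cost:
            best_b, best_cost = b, cost
    return best_b

# For this separable objective the optimum coincides with sign(grad):
g = np.array([0.3, -1.2, 0.05, -0.4])
assert np.allclose(project_gradient_to_binary(g), np.sign(g))
```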
♻ ☆ Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model ECCV 2024
In multi-class histopathology nuclei analysis tasks, the lack of training data becomes a main bottleneck for the performance of learning-based methods. To tackle this challenge, previous methods have utilized generative models to increase data by generating synthetic samples. However, existing methods often overlook the importance of considering the context of biological tissues (e.g., shape, spatial layout, and tissue type) in the synthetic data. Moreover, while generative models have shown superior performance in synthesizing realistic histopathology images, none of the existing methods are capable of producing image-label pairs at the same time. In this paper, we introduce a novel framework for co-synthesizing histopathology nuclei images and paired semantic labels using a context-conditioned joint diffusion model. We propose conditioning of a diffusion model using nucleus centroid layouts with structure-related text prompts to incorporate spatial and structural context information into the generation targets. Moreover, we enhance the granularity of our synthesized semantic labels by generating instance-wise nuclei labels using distance maps synthesized concurrently in conjunction with the images and semantic labels. We demonstrate the effectiveness of our framework in generating high-quality samples on multi-institutional, multi-organ, and multi-modality datasets. Our synthetic data consistently outperforms existing augmentation methods in the downstream tasks of nuclei segmentation and classification.
comment: ECCV 2024 accepted
Artificial Intelligence 64
☆ Coaching a Robotic Sonographer: Learning Robotic Ultrasound with Sparse Expert's Feedback
Ultrasound is widely employed for clinical intervention and diagnosis, due to its advantages of offering non-invasive, radiation-free, and real-time imaging. However, the accessibility of this dexterous procedure is limited due to the substantial training and expertise required of operators. Robotic ultrasound (RUS) offers a viable solution to address this limitation; nonetheless, achieving human-level proficiency remains challenging. Learning from demonstrations (LfD) methods have been explored in RUS, which learn the policy prior from a dataset of offline demonstrations to encode the mental model of the expert sonographer. However, active engagement of experts, i.e. coaching, during the training of RUS has not been explored thus far. Coaching is known for enhancing efficiency and performance in human training. This paper proposes a coaching framework for RUS to amplify its performance. The framework combines DRL (self-supervised practice) with sparse expert's feedback through coaching. The DRL employs an off-policy Soft Actor-Critic (SAC) network, with a reward based on image quality rating. The coaching by experts is modeled as a Partially Observable Markov Decision Process (POMDP), which updates the policy parameters based on the correction by the expert. The validation study on phantoms showed that coaching increases the learning rate by $25\%$ and the number of high-quality image acquisitions by $74.5\%$.
comment: Accepted in IEEE Transactions on Medical Robotics and Bionics (TMRB) 2024
☆ Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.
☆ TimeDiT: General-purpose Diffusion Transformers for Time Series Foundation Model ICML 2024
With recent advances in building foundation models for texts and video data, there is a surge of interest in foundation models for time series. A family of models has been developed, utilizing a temporal auto-regressive generative Transformer architecture, whose effectiveness has been proven in Large Language Models. While the empirical results are promising, almost all existing time series foundation models have only been tested on well-curated ``benchmark'' datasets very similar to texts. However, real-world time series exhibit unique challenges, such as variable channel sizes across domains, missing values, and varying signal sampling intervals due to the multi-resolution nature of real-world data. Additionally, the uni-directional nature of temporally auto-regressive decoding limits the incorporation of domain knowledge, such as physical laws expressed as partial differential equations (PDEs). To address these challenges, we introduce the Time Diffusion Transformer (TimeDiT), a general foundation model for time series that employs a denoising diffusion paradigm instead of temporal auto-regressive generation. TimeDiT leverages the Transformer architecture to capture temporal dependencies and employs diffusion processes to generate high-quality candidate samples without imposing stringent assumptions on the target distribution via novel masking schemes and a channel alignment strategy. Furthermore, we propose a finetuning-free model editing strategy that allows the seamless integration of external knowledge during the sampling process without updating any model parameters. Extensive experiments conducted on a variety of tasks, such as forecasting, imputation, and anomaly detection, demonstrate the effectiveness of TimeDiT.
comment: 23 Pages, 6 Figures, 11 Tables. First present at ICML 2024 Workshop on Foundation Models in the Wild
☆ On the Benefits of Memory for Modeling Time-Dependent PDEs
Data-driven techniques have emerged as a promising alternative to traditional numerical methods for solving partial differential equations (PDEs). These techniques frequently offer a better trade-off between computational cost and accuracy for many PDE families of interest. For time-dependent PDEs, existing methodologies typically treat PDEs as Markovian systems, i.e., the evolution of the system only depends on the ``current state'', and not the past states. However, distortion of the input signals -- e.g., due to discretization or low-pass filtering -- can render the evolution of the distorted signals non-Markovian. In this work, motivated by the Mori-Zwanzig theory of model reduction, we investigate the impact of architectures with memory for modeling PDEs: that is, when past states are explicitly used to predict the future. We introduce Memory Neural Operator (MemNO), a network based on the recent SSM architectures and Fourier Neural Operator (FNO). We empirically demonstrate on a variety of PDE families of interest that when the input is given on a low-resolution grid, MemNO significantly outperforms the baselines without memory, achieving more than 6 times less error on unseen PDEs. Via a combination of theory and experiments, we show that the effect of memory is particularly significant when the solution of the PDE has high frequency Fourier components (e.g., low-viscosity fluid dynamics), and it also increases robustness to observation noise.
☆ Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore the ensemble methods, utilizing speech foundation models to develop robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The codes can be accessed at https://github.com/Anmol2059/SVDD2024.
comment: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024. Copyright may be transferred without notice, after which this version may no longer be accessible
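One plausible reading of the Squeeze-and-Excitation Aggregation (SEA) over layer-wise foundation-model features is sketched below; the layer count, dimensions, and gating granularity are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class SEAggregation(nn.Module):
    """Sketch of SE-style aggregation of per-layer speech foundation-model features.

    Takes per-layer representations (B, L, T, D), squeezes over time, produces one
    gate per layer, and returns the gated sum over layers.
    """
    def __init__(self, n_layers, dim, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_layers * dim, n_layers * dim // reduction), nn.ReLU(),
            nn.Linear(n_layers * dim // reduction, n_layers), nn.Sigmoid())

    def forward(self, feats):                      # feats: (B, L, T, D)
        squeezed = feats.mean(dim=2)               # (B, L, D) global average over time
        gates = self.fc(squeezed.flatten(1))       # (B, L) one excitation weight per layer
        return (gates[:, :, None, None] * feats).sum(dim=1)   # (B, T, D)

# Toy usage: 13 encoder layers with 768-dim features (assumed shapes)
pooled = SEAggregation(n_layers=13, dim=768)(torch.randn(2, 13, 50, 768))
```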
☆ Initial Development and Evaluation of the Creative Artificial Intelligence through Recurring Developments and Determinations (CAIRDD) System
Computer system creativity is a key step on the pathway to artificial general intelligence (AGI). It is elusive, however, due to the fact that human creativity is not fully understood and, thus, it is difficult to develop this capability in software. Large language models (LLMs) provide a facsimile of creativity and the appearance of sentience, while not actually being either creative or sentient. While LLMs have created bona fide new content, in some cases - such as with harmful hallucinations - inadvertently, their deliberate creativity is seen by some to not match that of humans. In response to this challenge, this paper proposes a technique for enhancing LLM output creativity via an iterative process of concept injection and refinement. Initial work on the development of the Creative Artificial Intelligence through Recurring Developments and Determinations (CAIRDD) system is presented and the efficacy of key system components is evaluated.
☆ Biochemical Prostate Cancer Recurrence Prediction: Thinking Fast & Slow
Time to biochemical recurrence in prostate cancer is essential for prognostic monitoring of the progression of patients after prostatectomy, which assesses the efficacy of the surgery. In this work, we proposed to leverage multiple instance learning through a two-stage ``thinking fast \& slow'' strategy for the time to recurrence (TTR) prediction. The first (``thinking fast'') stage finds the most relevant WSI area for biochemical recurrence and the second (``thinking slow'') stage leverages higher resolution patches to predict TTR. Our approach reveals a mean C-index ($Ci$) of 0.733 ($\theta=0.059$) on our internal validation and $Ci=0.603$ on the LEOPARD challenge validation set. Post hoc attention visualization shows that the most attentive area contributes to the TTR prediction.
comment: 8 pages, 3 figures, methodology paper for LEOPARD Challenge
☆ Reinforcement Learning-enabled Satellite Constellation Reconfiguration and Retasking for Mission-Critical Applications
The development of satellite constellation applications is rapidly advancing due to increasing user demands, reduced operational costs, and technological advancements. However, a significant gap in the existing literature concerns reconfiguration and retasking issues within satellite constellations, which is the primary focus of our research. In this work, we critically assess the impact of satellite failures on constellation performance and the associated task requirements. To facilitate this analysis, we introduce a system modeling approach for GPS satellite constellations, enabling an investigation into performance dynamics and task distribution strategies, particularly in scenarios where satellite failures occur during mission-critical operations. Additionally, we introduce reinforcement learning (RL) techniques, specifically Q-learning, Policy Gradient, Deep Q-Network (DQN), and Proximal Policy Optimization (PPO), for managing satellite constellations, addressing the challenges posed by reconfiguration and retasking following satellite failures. Our results demonstrate that DQN and PPO achieve effective outcomes in terms of average rewards, task completion rates, and response times.
comment: Accepted for publication in the IEEE Military Communications Conference (IEEE MILCOM 2024)
☆ Action-Based ADHD Diagnosis in Video
Attention Deficit Hyperactivity Disorder (ADHD) causes significant impairment in various domains. Early diagnosis and treatment of ADHD could significantly improve quality of life and functioning. Recently, machine learning methods have improved the accuracy and efficiency of the ADHD diagnosis process. However, the costs of the equipment and trained staff required by existing methods are generally high. Therefore, we introduce a video-based frame-level action recognition network to ADHD diagnosis for the first time. We also record a real multi-modal ADHD dataset and extract three action classes from the video modality for ADHD diagnosis. The whole-process data have been reported to the CNTW-NHS Foundation Trust, will be reviewed by medical consultants/professionals, and will be made public in due course.
comment: 31st European Symposium on Artificial Neural Networks
☆ NoiseAttack: An Evasive Sample-Specific Multi-Targeted Backdoor Attack Through White Gaussian Noise
Backdoor attacks pose a significant threat when using third-party data for deep learning development. In these attacks, data can be manipulated to cause a trained model to behave improperly when a specific trigger pattern is applied, providing the adversary with unauthorized advantages. While most existing works focus on designing both visible and invisible trigger patterns to poison the victim class, they typically result in a single targeted class upon the success of the backdoor attack, meaning that the victim class can only be converted to another class based on the adversary's predefined value. In this paper, we address this issue by introducing a novel sample-specific multi-targeted backdoor attack, namely NoiseAttack. Specifically, we adopt White Gaussian Noise (WGN) with various Power Spectral Densities (PSD) as our underlying triggers, coupled with a unique training strategy to execute the backdoor attack. This work is the first of its kind to launch a vision backdoor attack with the intent to generate multiple targeted classes with minimal input configuration. Furthermore, our extensive experimental results demonstrate that NoiseAttack can achieve a high attack success rate against popular network architectures and datasets, as well as bypass state-of-the-art backdoor detection methods. Our source code and experiments are available at https://github.com/SiSL-URI/NoiseAttack/tree/main.
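The trigger here is white Gaussian noise whose power spectral density level is what varies across targets; the sketch below only illustrates how such noise could be generated and blended with an image, with made-up power levels standing in for the attack's class-specific configuration.

```python
import numpy as np

def wgn_trigger(image, psd_level, rng=None):
    """Blend white Gaussian noise of a chosen power into an image (illustrative sketch).

    image: float array in [0, 1]; psd_level: noise power (variance). In a
    NoiseAttack-style setup, different power levels would be keyed to different
    target classes; the levels and simple additive blending are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, np.sqrt(psd_level), size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

# Two hypothetical target classes keyed to two noise powers
clean = np.random.rand(32, 32, 3)
poisoned_for_class_a = wgn_trigger(clean, psd_level=0.001)
poisoned_for_class_b = wgn_trigger(clean, psd_level=0.004)
```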
☆ FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation
Diffusion-based voice conversion (VC) techniques such as VoiceGrad have attracted interest because of their high VC performance in terms of speech quality and speaker similarity. However, a notable limitation is the slow inference caused by the multi-step reverse diffusion. Therefore, we propose FastVoiceGrad, a novel one-step diffusion-based VC that reduces the number of iterations from dozens to one while inheriting the high VC performance of the multi-step diffusion-based VC. We obtain the model using adversarial conditional diffusion distillation (ACDD), leveraging the ability of generative adversarial networks and diffusion models while reconsidering the initial states in sampling. Evaluations of one-shot any-to-any VC demonstrate that FastVoiceGrad achieves VC performance superior to or comparable to that of previous multi-step diffusion-based VC while enhancing the inference speed. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/fastvoicegrad/.
comment: Accepted to Interspeech 2024. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/fastvoicegrad/
☆ Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer method (TOT-CAKT) for ASR. In TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
comment: Accepted to IEEE SLT 2024
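A minimal sketch of the core idea, assuming an entropy-regularized (Sinkhorn) OT solver and a quadratic penalty on deviations from the time-normalized diagonal; the paper's exact coupling estimation and temporal prior may differ.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropy-regularized OT coupling between two uniform marginals."""
    K = np.exp(-cost / reg)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

def temporal_ot_coupling(acoustic, linguistic, alpha=1.0):
    """Cost = normalized feature distance + penalty on deviation from the
    time-normalized diagonal, so neighboring acoustic frames map to neighboring
    linguistic positions. The penalty form is an illustrative assumption."""
    T, L = acoustic.shape[0], linguistic.shape[0]
    feat_cost = np.linalg.norm(acoustic[:, None, :] - linguistic[None, :, :], axis=-1)
    feat_cost = feat_cost / feat_cost.max()
    t_idx = np.arange(T)[:, None] / max(T - 1, 1)
    l_idx = np.arange(L)[None, :] / max(L - 1, 1)
    order_cost = (t_idx - l_idx) ** 2
    return sinkhorn(feat_cost + alpha * order_cost)

coupling = temporal_ot_coupling(np.random.randn(50, 16), np.random.randn(12, 16))
```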
☆ A+AI: Threats to Society, Remedies, and Governance
This document focuses on the threats, especially near-term threats, that Artificial Intelligence (AI) brings to society. Most of the threats discussed here can result from any algorithmic process, not just AI; in addition, defining AI is notoriously difficult. For both reasons, it is important to think of "A+AI": Algorithms and Artificial Intelligence. In addition to the threats, this paper discusses countermeasures to them, and it includes a table showing which countermeasures are likely to mitigate which threats. Thoughtful governance could manage the risks without seriously impeding progress; in fact, chances are it would accelerate progress by reducing the social chaos that would otherwise be likely. The paper lists specific actions government should take as soon as possible, namely:
* Require all social media platforms accessible in the U.S. to offer users verification that their accounts are owned by citizens, and to display every account's verification status
* Establish regulations to require that all products created or significantly modified with A+AI be clearly labeled as such; to restrict use of generative AI to create likenesses of persons; and to require creators of generative AI software to disclose materials used to train their software and to compensate the creators of any copyrighted material used
* Fund a crash project of research on mitigating the threats
* Fund educational campaigns to raise awareness of the threats
comment: 25 pages
☆ On a heuristic approach to the description of consciousness as a hypercomplex system state and the possibility of machine consciousness (German edition)
This article presents a heuristic view that shows that the inner states of consciousness experienced by every human being have a physical but imaginary hypercomplex basis. The hypercomplex description is necessary because certain processes of consciousness cannot be physically measured in principle, but nevertheless exist. Based on theoretical considerations, it could be possible - as a result of mathematical investigations into a so-called bicomplex algebra - to generate and use hypercomplex system states on machines in a targeted manner. The hypothesis of the existence of hypercomplex system states on machines is already supported by the surprising performance of highly complex AI systems. However, this has yet to be proven. In particular, there is a lack of experimental data that distinguishes such systems from other systems, which is why this question will be addressed in later articles. This paper describes the developed bicomplex algebra and possible applications of these findings to generate hypercomplex energy states on machines. In the literature, such system states are often referred to as machine consciousness. The article uses mathematical considerations to explain how artificial consciousness could be generated and what advantages this would have for such AI systems.
comment: 7 pages, in German language. 1 figure
☆ CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets given a small number of user-written few-shot examples that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which can then be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question answering (QA), medicine QA, commonsense QA, and summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
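The retrieval step can be illustrated with a toy similarity search; the sketch below uses TF-IDF cosine similarity as a stand-in for CRAFT's embedding-based retrieval over large web-crawled corpora, and the corpus and query are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny stand-in corpus; CRAFT retrieves from large web-crawled corpora instead.
corpus = [
    "Mitochondria produce ATP through oxidative phosphorylation.",
    "The French Revolution began in 1789.",
    "Enzymes lower the activation energy of biochemical reactions.",
]
few_shots = ["What organelle generates most of a cell's ATP?"]

# TF-IDF similarity is a simple stand-in for embedding-based retrieval.
vectorizer = TfidfVectorizer().fit(corpus + few_shots)
scores = cosine_similarity(vectorizer.transform(few_shots),
                           vectorizer.transform(corpus))[0]
top_docs = [corpus[i] for i in scores.argsort()[::-1][:2]]
# `top_docs` would then be rewritten by an instruction-tuned LLM into
# custom-formatted task samples (the augmentation step).
print(top_docs)
```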
☆ DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter achieves generalization ability to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy with the compiled paired video-depth datasets. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
comment: Project webpage: https://depthcrafter.github.io
☆ A Deployed Online Reinforcement Learning Algorithm In An Oral Health Clinical Trial
Dental disease is a prevalent chronic condition associated with substantial financial burden, personal suffering, and increased risk of systemic diseases. Despite widespread recommendations for twice-daily tooth brushing, adherence to recommended oral self-care behaviors remains sub-optimal due to factors such as forgetfulness and disengagement. To address this, we developed Oralytics, an mHealth intervention system designed to complement clinician-delivered preventative care for marginalized individuals at risk for dental disease. Oralytics incorporates an online reinforcement learning algorithm to determine optimal times to deliver intervention prompts that encourage oral self-care behaviors. We have deployed Oralytics in a registered clinical trial. The deployment required careful design to manage challenges specific to the clinical trial setting in the U.S. In this paper, we (1) highlight key design decisions of the RL algorithm that address these challenges and (2) conduct a re-sampling analysis to evaluate algorithm design decisions. A second phase (randomized control trial) of Oralytics is planned to start in spring 2025.
☆ OLMoE: Open Mixture-of-Experts Language Models
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
comment: 61 pages (24 main), 36 figures, 14 tables
☆ Low-Resolution Face Recognition via Adaptable Instance-Relation Distillation IJCNN 2024
Low-resolution face recognition is a challenging task due to the absence of informative details. Recent approaches based on knowledge distillation have proven that high-resolution clues can well guide low-resolution face recognition via proper knowledge transfer. However, due to the distribution difference between training and testing faces, the learned models often suffer from poor adaptability. To address that, we split the knowledge transfer process into distillation and adaptation steps, and propose an adaptable instance-relation distillation approach to facilitate low-resolution face recognition. In our approach, the student distills knowledge from a high-resolution teacher at both the instance level and the relation level, providing sufficient cross-resolution knowledge transfer. The learned student can then adapt to recognize low-resolution faces using adaptive batch normalization at inference. In this manner, the capability of recovering missing details of familiar low-resolution faces can be effectively enhanced, leading to better knowledge transfer. Extensive experiments on low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
comment: Accepted by IJCNN 2024
☆ AllWeatherNet: Unified Image Enhancement for Autonomous Driving under Adverse Weather and Low-Light Conditions
Adverse conditions like snow, rain, nighttime, and fog pose challenges for autonomous driving perception systems. Existing methods have limited effectiveness in improving essential computer vision tasks, such as semantic segmentation, and often focus on only one specific condition, such as removing rain or translating nighttime images into daytime ones. To address these limitations, we propose a method to improve the visual quality and clarity degraded by such adverse conditions. Our method, AllWeather-Net, utilizes a novel hierarchical architecture to enhance images across all adverse conditions. This architecture incorporates information at three semantic levels: scene, object, and texture, by discriminating patches at each level. Furthermore, we introduce a Scaled Illumination-aware Attention Mechanism (SIAM) that guides the learning towards road elements critical for autonomous driving perception. SIAM exhibits robustness, remaining unaffected by changes in weather conditions or environmental scenes. AllWeather-Net effectively transforms images into normal weather and daytime scenes, demonstrating superior image enhancement results and subsequently enhancing the performance of semantic segmentation, with up to a 5.3% improvement in mIoU in the trained domain. We also show our model's generalization ability by applying it to unseen domains without re-training, achieving up to 3.9% mIoU improvement. Code can be accessed at: https://github.com/Jumponthemoon/AllWeatherNet.
☆ BEAVER: An Enterprise Benchmark for Text-to-SQL
Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the reasons for poor performance are largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the "dark web", (2) schemas of enterprise tables are more complex than the schemas in public data, which makes the SQL-generation task inherently harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset, BEAVER, sourced from real enterprise data warehouses, together with natural language queries and their correct SQL statements, which we collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will facilitate future researchers building more sophisticated text-to-SQL systems which can do better on this important class of data.
♻ ☆ PID Accelerated Temporal Difference Algorithms
Long-horizon tasks, which have a large discount factor, pose a challenge for most conventional reinforcement learning (RL) algorithms. Algorithms such as Value Iteration and Temporal Difference (TD) learning have a slow convergence rate and become inefficient in these tasks. When the transition distributions are given, PID VI was recently introduced to accelerate the convergence of Value Iteration using ideas from control theory. Inspired by this, we introduce PID TD Learning and PID Q-Learning algorithms for the RL setting, in which only samples from the environment are available. We give a theoretical analysis of the convergence of PID TD Learning and its acceleration compared to the conventional TD Learning. We also introduce a method for adapting PID gains in the presence of noise and empirically verify its effectiveness.
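The sketch below shows one way a tabular TD update can be augmented with proportional, integral, and derivative terms on the TD error, in the spirit of the PID VI/TD idea; the gain placement and integral recursion are an illustrative adaptation rather than the paper's exact algorithm.

```python
import numpy as np

def pid_td_step(V, V_prev, z, s, r, s_next, gamma=0.99, alpha=0.1,
                kp=1.0, ki=0.05, kd=0.1, beta=0.9):
    """One tabular PID-style TD(0) step. The TD error acts as the control
    signal: kp scales it directly (plain TD), ki scales a discounted running
    sum of past errors, and kd scales the recent change in the value estimate.
    Gains and update form are illustrative, not the paper's exact equations."""
    delta = r + gamma * V[s_next] - V[s]      # proportional term (TD error)
    z[s] = beta * z[s] + delta                # integral term (running error sum)
    derivative = V[s] - V_prev[s]             # derivative term
    V_prev[s] = V[s]
    V[s] += alpha * (kp * delta + ki * z[s] + kd * derivative)
    return V, V_prev, z

# Toy usage on a 5-state chain: one observed transition (s=0 -> s=1, reward 1).
V, V_prev, z = np.zeros(5), np.zeros(5), np.zeros(5)
V, V_prev, z = pid_td_step(V, V_prev, z, s=0, r=1.0, s_next=1)
```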
♻ ☆ Low-Rank Quantization-Aware Training for LLMs
Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and memory efficient. Quantization-aware training (QAT) methods generally produce the best quantized performance; however, this comes at the cost of potentially long training time and excessive memory usage, making it impractical to apply to LLMs. Inspired by parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, leading to no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; (iii) can be applied across a wide range of quantization settings, such as different choices of quantization granularity, activation quantization, and seamlessly combined with many PTQ techniques. We apply LR-QAT to LLaMA-1/2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer grade GPU with 24GB of memory. Our source code is available at https://github.com/qualcomm-ai-research/LR-QAT
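As a hedged sketch of component (a), the layer below keeps a frozen pretrained weight, adds a trainable low-rank update expressed in units of the quantization grid, and fake-quantizes the sum with a straight-through estimator; the scale handling, downcasting operator, and rounding details of the actual method differ.

```python
import torch
import torch.nn as nn

class LowRankQuantLinear(nn.Module):
    """Illustrative QAT linear layer with low-rank auxiliary weights that live
    inside the quantization grid (a sketch of the LR-QAT idea, not the paper's
    exact formulation)."""
    def __init__(self, in_features, out_features, rank=8, n_bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)         # frozen pretrained weight
        self.A = nn.Parameter(torch.zeros(out_features, rank))  # trainable low-rank factors
        self.B = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.qmax = 2 ** (n_bits - 1) - 1
        self.scale = nn.Parameter(self.weight.abs().max() / self.qmax)

    def forward(self, x):
        # Low-rank update is added in integer-grid units, then fake-quantized.
        w_int = self.weight / self.scale + self.A @ self.B
        w_q = torch.round(w_int) - w_int.detach() + w_int   # straight-through rounding
        w_q = torch.clamp(w_q, -self.qmax - 1, self.qmax)
        return nn.functional.linear(x, w_q * self.scale)

layer = LowRankQuantLinear(16, 8)
out = layer(torch.randn(4, 16))   # only A, B, and scale receive gradients
```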
♻ ☆ Chemical Reaction Neural Networks for Fitting Accelerating Rate Calorimetry Data
As the demand for lithium-ion batteries rapidly increases there is a need to design these cells in a safe manner to mitigate thermal runaway. Thermal runaway in batteries leads to an uncontrollable temperature rise and potentially fires, which is a major safety concern. Typically, when modelling the chemical kinetics of thermal runaway, calorimetry data (e.g., from Accelerating Rate Calorimetry (ARC)) is needed to determine the temperature-driven decomposition kinetics. Conventional methods of fitting Arrhenius Ordinary Differential Equation (ODE) thermal runaway models to ARC data make several assumptions that reduce the fidelity and generalizability of the obtained model. In this paper, Chemical Reaction Neural Networks (CRNNs) are trained to fit the kinetic parameters of N-equation Arrhenius ODEs to ARC data obtained from a Molicel 21700 P45B. The models are found to be better approximations of the experimental data. The flexibility of the method is demonstrated by experimenting with two-equation and four-equation models. Thermal runaway simulations are conducted in 3D using the obtained kinetic parameters, showing the applicability of the obtained thermal runaway models to large-scale simulations.
♻ ☆ Open-vocabulary Temporal Action Localization using VLMs
Video action localization aims to find the timings of a specific action from a long video. Although existing learning-based approaches have been successful, they require annotated videos, which come at a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess a frame that is considered to be closest to the start/end of the action. Iterating this process while narrowing the sampling time window results in finding the specific start and end frames of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos. A sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/.
comment: 7 pages, 5 figures, 4 tables. Last updated on September 3rd, 2024
♻ ☆ Foundation Models for Music: A Survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning, and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover that many music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation, and medical applications. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise important topics that should be well explored, such as instruction tuning and in-context learning, scaling laws and emergent abilities, as well as long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that subsequent research on FMs for music should focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.
♻ ☆ Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection CCS 2024
In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions that rely on keywords. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do it again in the following days, even though rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to some real-world samples received from different sources. Our results show that email kits are being reused extensively and visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection features that rely on textual features and keywords are bypassed, an occurrence our observations show happens frequently.
comment: To be published in the proceedings of the ACM Conference on Computer and Communications Security (ACM CCS 2024)
♻ ☆ FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability
We present FairX, an open-source Python-based benchmarking tool designed for the comprehensive analysis of models under the umbrella of fairness, utility, and eXplainability (XAI). FairX enables users to train benchmarking bias-mitigation models and evaluate their fairness using a wide array of fairness metrics and data utility metrics, and to generate explanations for model predictions, all within a unified framework. Existing benchmarking tools have no way to evaluate synthetic data generated by fair generative models, nor do they support training fair generative models. In FairX, we add fair generative models to our fair-model library (pre-processing, in-processing, post-processing) along with evaluation metrics for assessing the quality of synthetic fair data. This version of FairX supports both tabular and image datasets. It also allows users to provide their own custom datasets. The open-source FairX benchmarking package is publicly available at \url{https://github.com/fahim-sikder/FairX}.
♻ ☆ Towards Explainable Traffic Flow Prediction with Large Language Models
Traffic forecasting is crucial for intelligent transportation systems. It has experienced significant advancements thanks to the power of deep learning in capturing latent patterns of traffic data. However, recent deep-learning architectures require intricate model designs and lack an intuitive understanding of the mapping from input data to predicted results. Achieving both accuracy and explainability in traffic prediction models remains a challenge due to the complexity of traffic data and the inherent opacity of deep learning models. To tackle these challenges, we propose a Traffic flow Prediction model based on Large Language Models (LLMs) to generate explainable traffic predictions, named xTP-LLM. By transferring multi-modal traffic data into natural language descriptions, xTP-LLM captures complex time-series patterns and external factors from comprehensive traffic data. The LLM framework is fine-tuned using language-based instructions to align with spatial-temporal traffic flow data. Empirically, xTP-LLM shows competitive accuracy compared with deep learning baselines, while providing an intuitive and reliable explanation for predictions. This paper contributes to advancing explainable traffic prediction models and lays a foundation for future exploration of LLM applications in transportation. To the best of our knowledge, this is the first study to use LLM for explainable prediction of traffic flows.
comment: 31 pages, 16 figures
♻ ☆ rerankers: A Lightweight Python Library to Unify Ranking Methods
This paper presents rerankers, a Python library which provides an easy-to-use interface to the most commonly used re-ranking approaches. Re-ranking is an integral component of many retrieval pipelines; however, there exist numerous approaches to it, relying on different implementation methods. rerankers unifies these methods into a single user-friendly interface, allowing practitioners and researchers alike to explore different methods while only changing a single line of Python code. Moreover, rerankers ensures that its implementations are done with the fewest dependencies possible, and re-uses the original implementation whenever possible, guaranteeing that our simplified interface results in no performance degradation compared to more complex ones. The full source code and list of supported models are updated regularly and available at https://github.com/answerdotai/rerankers.
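Illustrative usage consistent with the library's stated goal of switching re-ranking methods by changing a single line; the model identifier and method names here are assumptions and should be checked against the repository's README.

```python
# Illustrative usage only -- the exact model identifiers and method names are
# assumed from the paper's description of a unified, single-line interface and
# should be verified against the rerankers README.
from rerankers import Reranker

ranker = Reranker("cross-encoder")   # swapping this string changes the re-ranking method
results = ranker.rank(
    query="effects of caffeine on sleep",
    docs=["Caffeine can delay sleep onset.", "Tea ceremonies date back centuries."],
)
print(results)
```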
♻ ☆ OceanGPT: A Large Language Model for Ocean Science Tasks ACL2024
Ocean science, which delves into the oceans that are reservoirs of life and biodiversity, is of great significance given that oceans cover over 70% of our planet's surface. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in science. Despite the success in other domains, current LLMs often fall short in catering to the needs of domain experts like oceanographers, and the potential of LLMs for ocean science is under-explored. The intrinsic reasons are the immense and intricate nature of ocean data as well as the necessity for higher granularity and richness in knowledge. To alleviate these issues, we introduce OceanGPT, the first-ever large language model in the ocean domain, which is expert in various ocean science tasks. We also propose a novel framework to automatically obtain a large volume of ocean domain instruction data, which generates instructions based on multi-agent collaboration. Additionally, we construct the first oceanography benchmark, OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through comprehensive experiments, OceanGPT not only shows a higher level of knowledge expertise for ocean science tasks but also gains preliminary embodied intelligence capabilities in ocean technology.
comment: ACL2024. Project Website: http://oceangpt.zjukg.cn/
♻ ☆ Sentinel: An Aggregation Function to Secure Decentralized Federated Learning
Decentralized Federated Learning (DFL) emerges as an innovative paradigm to train collaborative models, addressing the single point of failure limitation. However, the security and trustworthiness of FL and DFL are compromised by poisoning attacks, negatively impacting their performance. Existing defense mechanisms have been designed for centralized FL and do not adequately exploit the particularities of DFL. Thus, this work introduces Sentinel, a defense strategy to counteract poisoning attacks in DFL. Sentinel leverages the accessibility of local data and defines a three-step aggregation protocol consisting of similarity filtering, bootstrap validation, and normalization to safeguard against malicious model updates. Sentinel has been evaluated with diverse datasets, data distributions, poisoning attack types, and threat levels. The results show improved performance over the state of the art against both untargeted and targeted poisoning attacks when data follows an IID (Independent and Identically Distributed) configuration. Moreover, under non-IID configurations, we analyze how performance degrades for both Sentinel and other state-of-the-art robust aggregation methods.
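A toy sketch of the three-step aggregation for a single node, assuming flattened update vectors; the similarity threshold, loss-based validation rule, and norm clipping are illustrative stand-ins for the paper's exact protocol.

```python
import numpy as np

def sentinel_aggregate(local_update, neighbor_updates, local_loss_fn,
                       sim_threshold=0.5, loss_tolerance=1.5):
    """Sketch of a Sentinel-style aggregation for one DFL node (thresholds and
    weighting are illustrative):
    1) similarity filtering: drop updates dissimilar to the local update,
    2) bootstrap validation: drop updates that perform poorly on local data,
    3) normalization: clip surviving updates to the local norm, then average."""
    local = np.asarray(local_update, dtype=float)
    local_norm = np.linalg.norm(local)
    local_loss = local_loss_fn(local)

    kept = []
    for upd in neighbor_updates:
        upd = np.asarray(upd, dtype=float)
        cos = upd @ local / (np.linalg.norm(upd) * local_norm + 1e-12)
        if cos < sim_threshold:                               # step 1
            continue
        if local_loss_fn(upd) > loss_tolerance * local_loss:  # step 2
            continue
        scale = min(1.0, local_norm / (np.linalg.norm(upd) + 1e-12))  # step 3
        kept.append(upd * scale)
    return np.mean(kept + [local], axis=0)

# Toy usage: "local loss" is just distance to an arbitrary reference vector.
ref = np.ones(4)
loss = lambda w: float(np.linalg.norm(w - ref))
agg = sentinel_aggregate(np.ones(4), [np.ones(4) * 1.1, -np.ones(4) * 10], loss)
```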
♻ ☆ Statistical Context Detection for Deep Lifelong Reinforcement Learning
Context detection involves labeling segments of an online stream of data as belonging to different tasks. Task labels are used in lifelong learning algorithms to perform consolidation or other procedures that prevent catastrophic forgetting. Inferring task labels from online experiences remains a challenging problem. Most approaches assume finite and low-dimension observation spaces or a preliminary training phase during which task labels are learned. Moreover, changes in the transition or reward functions can be detected only in combination with a policy, and therefore are more difficult to detect than changes in the input distribution. This paper presents an approach to learning both policies and labels in an online deep reinforcement learning setting. The key idea is to use distance metrics, obtained via optimal transport methods, i.e., Wasserstein distance, on suitable latent action-reward spaces to measure distances between sets of data points from past and current streams. Such distances can then be used for statistical tests based on an adapted Kolmogorov-Smirnov calculation to assign labels to sequences of experiences. A rollback procedure is introduced to learn multiple policies by ensuring that only the appropriate data is used to train the corresponding policy. The combination of task detection and policy deployment allows for the optimization of lifelong reinforcement learning agents without an oracle that provides task labels. The approach is tested using two benchmarks and the results show promising performance when compared with related context detection algorithms. The results suggest that optimal transport statistical methods provide an explainable and justifiable procedure for online context detection and reward optimization in lifelong reinforcement learning.
comment: 10 pages excluding references and bibliography. Accepted at CoLLAs 2024
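A minimal sketch of the detection idea: reduce each latent action-reward sample to a scalar distance from a reference set drawn from the past stream, then compare past and current distance distributions with a two-sample KS test. The distance reduction and threshold are illustrative assumptions; the paper uses Wasserstein distances on latent spaces with an adapted KS calculation.

```python
import numpy as np
from scipy.stats import ks_2samp

def context_changed(past_latents, current_latents, ref_size=64, alpha=0.01):
    """Flag a context change by comparing distance distributions of past and
    current latent samples (illustrative simplification of the paper's
    OT-distance + adapted-KS procedure)."""
    ref, held_out = past_latents[:ref_size], past_latents[ref_size:]

    def to_distances(x):
        # distance of each sample to its nearest reference point
        d = np.linalg.norm(x[:, None, :] - ref[None, :, :], axis=-1)
        return d.min(axis=1)

    result = ks_2samp(to_distances(held_out), to_distances(current_latents))
    return result.pvalue < alpha

past = np.random.randn(400, 3)
print(context_changed(past, np.random.randn(100, 3)))        # same task -> likely False
print(context_changed(past, np.random.randn(100, 3) + 2.0))  # shifted task -> likely True
```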
♻ ☆ Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction
Graph Neural Networks have been widely applied in critical decision-making areas that demand interpretable predictions, leading to the flourishing development of interpretability algorithms. However, current graph interpretability algorithms tend to emphasize generality and often overlook biological significance, thereby limiting their applicability in predicting cancer drug responses. In this paper, we propose a novel post-hoc interpretability algorithm for cancer drug response prediction, CETExplainer, which incorporates a controllable edge-type-specific weighting mechanism. It considers the mutual information between subgraphs and predictions, proposing a structural scoring approach to provide fine-grained, biologically meaningful explanations for predictive models. We also introduce a method for constructing ground truth based on real-world datasets to quantitatively evaluate the proposed interpretability algorithm. Empirical analysis on the real-world dataset demonstrates that CETExplainer achieves superior stability and improves explanation quality compared to leading algorithms, thereby offering a robust and insightful tool for cancer drug prediction.
♻ ☆ GANs Conditioning Methods: A Survey
In recent years, Generative Adversarial Networks (GANs) have seen significant advancements, leading to their widespread adoption across various fields. The original GAN architecture enables the generation of images without any specific control over the content, making it an unconditional generation process. However, many practical applications require precise control over the generated output, which has led to the development of conditional GANs (cGANs) that incorporate explicit conditioning to guide the generation process. cGANs extend the original framework by incorporating additional information (conditions), enabling the generation of samples that adhere to specific criteria. Various conditioning methods have been proposed, each differing in how they integrate the conditioning information into both the generator and the discriminator networks. In this work, we review the conditioning methods proposed for GANs, exploring the characteristics of each method and highlighting their unique mechanisms and theoretical foundations. Furthermore, we conduct a comparative analysis of these methods, evaluating their performance on various image datasets. Through these analyses, we aim to provide insights into the strengths and limitations of various conditioning techniques, guiding future research and application in generative modeling.
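For concreteness, the snippet below shows the simplest family of conditioning methods such surveys cover: a label embedding concatenated to the generator's noise vector (projection-based and normalization-based conditioning instead modify the discriminator or feature statistics).

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Minimal concatenation-based cGAN generator: the class label is embedded
    and concatenated to the noise vector before the generator MLP."""
    def __init__(self, z_dim=64, n_classes=10, img_dim=28 * 28):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),
        )

    def forward(self, z, labels):
        cond = self.label_emb(labels)                  # (batch, n_classes)
        return self.net(torch.cat([z, cond], dim=1))   # class-conditional sample

gen = ConditionalGenerator()
fake = gen(torch.randn(8, 64), torch.randint(0, 10, (8,)))   # 8 conditional samples
```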
♻ ☆ A Survey on Stability of Learning with Limited Labelled Data and its Sensitivity to the Effects of Randomness
Learning with limited labelled data, such as prompting, in-context learning, fine-tuning, meta-learning or few-shot learning, aims to effectively train a model using only a small amount of labelled samples. However, these approaches have been observed to be excessively sensitive to the effects of uncontrolled randomness caused by non-determinism in the training process. The randomness negatively affects the stability of the models, leading to large variances in results across training runs. When such sensitivity is disregarded, it can unintentionally, but unfortunately also intentionally, create an imaginary perception of research progress. Recently, this area started to attract research attention and the number of relevant studies is continuously growing. In this survey, we provide a comprehensive overview of 415 papers addressing the effects of randomness on the stability of learning with limited labelled data. We distinguish between four main tasks addressed in the papers (investigate/evaluate; determine; mitigate; benchmark/compare/report randomness effects), providing findings for each one. Furthermore, we identify and discuss seven challenges and open problems together with possible directions to facilitate further research. The ultimate goal of this survey is to emphasise the importance of this growing research area, which so far has not received an appropriate level of attention, and reveal impactful directions for future research.
comment: Accepted to ACM Comput. Surv. 2024
♻ ☆ KTO: Model Alignment as Prospect Theoretic Optimization ICML 2024
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
comment: ICML 2024
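A simplified sketch of a KTO-style objective under the binary desirable/undesirable signal described above; the reference-point estimate and weighting here are crude stand-ins for the paper's formulation and are labeled as such in the code.

```python
import torch

def kto_loss_sketch(policy_logps, ref_logps, desirable, beta=0.1,
                    lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO-style loss. `policy_logps` / `ref_logps` are per-sequence
    log-probabilities under the policy and a frozen reference model; `desirable`
    is a boolean tensor. The reference point z0 is crudely estimated from the
    batch; the paper's KL estimation and weighting differ."""
    rewards = policy_logps - ref_logps                      # implied reward r_theta
    z0 = rewards.mean().clamp(min=0).detach()               # crude reference point
    value = torch.where(
        desirable,
        lambda_d * torch.sigmoid(beta * (rewards - z0)),    # gains above the reference
        lambda_u * torch.sigmoid(beta * (z0 - rewards)),    # losses below the reference
    )
    weights = torch.where(desirable, torch.full_like(rewards, lambda_d),
                          torch.full_like(rewards, lambda_u))
    return (weights - value).mean()

loss = kto_loss_sketch(torch.randn(4), torch.randn(4),
                       torch.tensor([True, False, True, False]))
```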
♻ ☆ RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation
The Segment Anything Model (SAM) has gained significant attention for its impressive performance in image segmentation. However, it lacks proficiency in referring video object segmentation (RVOS) due to the need for precise user-interactive prompts and a limited understanding of different modalities, such as language and vision. This paper presents the RefSAM model, which explores the potential of SAM for RVOS by incorporating multi-view information from diverse modalities and successive frames at different timestamps in an online manner. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP that projects the text embedding of the referring expression into sparse and dense embeddings, serving as user-interactive prompts. Additionally, we have introduced the hierarchical dense attention module to fuse hierarchical visual semantic information with sparse embeddings to obtain fine-grained dense embeddings, and an implicit tracking module to generate a tracking token and provide historical information for the mask decoder. Furthermore, we employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively. Through comprehensive ablation studies, we demonstrate our model's practical and effective design choices. Extensive experiments conducted on Refer-Youtube-VOS, Ref-DAVIS17, and three referring image segmentation datasets validate the superiority and effectiveness of our RefSAM model over existing methods.
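A minimal sketch of the lightweight Cross-Modal MLP idea: project a referring-expression embedding into a few sparse prompt tokens and a dense prompt map for the mask decoder. Dimensions, token counts, and the broadcasting scheme are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CrossModalMLP(nn.Module):
    """Sketch of projecting a text embedding into sparse and dense prompt
    embeddings for a SAM-style mask decoder (illustrative dimensions)."""
    def __init__(self, text_dim=512, prompt_dim=256, n_sparse=2, dense_hw=(64, 64)):
        super().__init__()
        self.dense_hw, self.n_sparse, self.prompt_dim = dense_hw, n_sparse, prompt_dim
        self.sparse_proj = nn.Sequential(
            nn.Linear(text_dim, prompt_dim), nn.GELU(),
            nn.Linear(prompt_dim, n_sparse * prompt_dim))
        self.dense_proj = nn.Linear(text_dim, prompt_dim)

    def forward(self, text_emb):                       # (batch, text_dim)
        sparse = self.sparse_proj(text_emb).view(-1, self.n_sparse, self.prompt_dim)
        dense = self.dense_proj(text_emb)[:, :, None, None].expand(
            -1, -1, *self.dense_hw)                    # broadcast over the feature map
        return sparse, dense                           # fed to the decoder as prompts

sparse, dense = CrossModalMLP()(torch.randn(2, 512))
```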
♻ ☆ Towards Scalable Automated Alignment of LLMs: A Survey
Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approaches. In this paper, we systematically review the recently emerging methods of automated alignment, attempting to explore how to achieve effective, scalable, automated alignment once the capabilities of LLMs exceed those of humans. Specifically, we categorize existing automated alignment methods into 4 major categories based on the sources of alignment signals and discuss the current status and potential development of each category. Additionally, we explore the underlying mechanisms that enable automated alignment and discuss the essential factors that make automated alignment technologies feasible and effective from the fundamental role of alignment.
comment: Paper List: https://github.com/cascip/awesome-auto-alignment
♻ ☆ Contrastive Learning and Abstract Concepts: The Case of Natural Numbers
Contrastive Learning (CL) has been successfully applied to classification and other downstream tasks related to concrete concepts, such as objects contained in the ImageNet dataset. No attempts seem to have been made so far to apply this promising scheme to more abstract entities. A prominent example of these could be the concept of (discrete) Quantity. CL can frequently be interpreted as a self-supervised scheme guided by some profound and ubiquitous conservation principle (e.g. conservation of identity in object classification tasks). In this introductory work we apply a suitable conservation principle to the semi-abstract concept of natural numbers by which discrete quantities can be estimated or predicted. We experimentally show, by means of a toy problem, that contrastive learning can be trained to count at a glance with high accuracy both at human as well as at super-human ranges. We compare this with the results of a supervised learning (SL) neural network scheme of similar architecture trained to count at a glance. We show that both schemes exhibit similar good performance on baseline experiments, where the distributions of the training and testing stages are equal. Importantly, we demonstrate that in some generalization scenarios, where training and testing distributions differ, CL boasts more robust and much better error performance.
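The conservation-principle idea can be made concrete with a standard InfoNCE objective in which two scenes containing the same number of objects form a positive pair; this is an illustrative loss, not the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def count_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss sketch: emb_a[i] and emb_b[i] are embeddings of two
    scenes containing the *same* number of objects (the conserved quantity);
    all other pairs in the batch serve as negatives."""
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    logits = a @ b.T / temperature                 # similarity of every pair
    targets = torch.arange(a.size(0))              # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = count_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
```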
♻ ☆ Correcting misinformation on social media with a large language model
Real-world misinformation, often multimodal, can be partially or fully factual but misleading using diverse tactics like conflating correlation with causation. Such misinformation is severely understudied, challenging to address, and harms various social domains, particularly on social media, where it can spread rapidly. High-quality and timely correction of misinformation that identifies and explains its (in)accuracies effectively reduces false beliefs. Despite the wide acceptance of manual correction, it is difficult to be timely and scalable. While LLMs have versatile capabilities that could accelerate misinformation correction, they struggle due to a lack of recent information, a tendency to produce false content, and limitations in addressing multimodal information. We propose MUSE, an LLM augmented with access to and credibility evaluation of up-to-date information. By retrieving evidence as refutations or supporting context, MUSE identifies and explains content (in)accuracies with references. It conducts multimodal retrieval and interprets visual content to verify and correct multimodal content. Given the absence of a comprehensive evaluation approach, we propose 13 dimensions of misinformation correction quality. Then, fact-checking experts evaluate responses to social media content that are not presupposed to be misinformation but broadly include (partially) incorrect and correct posts that may (not) be misleading. Results demonstrate MUSE's ability to write high-quality responses to potential misinformation--across modalities, tactics, domains, political leanings, and for information that has not previously been fact-checked online--within minutes of its appearance on social media. Overall, MUSE outperforms GPT-4 by 37% and even high-quality responses from laypeople by 29%. Our work provides a general methodological and evaluative framework to correct misinformation at scale.
comment: 50 pages
♻ ☆ NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs which often contain tens or hundreds of billions of parameters. We create NeMo-Aligner, a toolkit for model alignment that can efficiently scale to a thousand GPUs for training the largest open-source LLMs such as Nemotron 4 340B and Llama 3.1 405B. NeMo-Aligner comes with highly optimized and scalable implementations for major paradigms of model alignment such as: Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). Additionally, our toolkit supports running most of the alignment techniques in a Parameter Efficient Fine-Tuning (PEFT) setting. NeMo-Aligner is designed for extensibility, allowing support for other alignment techniques with minimal effort. It is open-sourced with Apache 2.0 License and we invite community contributions at https://github.com/NVIDIA/NeMo-Aligner
comment: 16 pages, 4 figures, Accepted to COLM 2024
♻ ☆ PokerKit: A Comprehensive Python Library for Fine-Grained Multi-Variant Poker Game Simulations
PokerKit is an open-source Python library designed to overcome the restrictions of existing poker game simulation and hand evaluation tools, which typically support only a handful of poker variants and lack flexibility in game state control. In contrast, PokerKit significantly expands this scope by supporting an extensive array of poker variants and it provides a flexible architecture for users to define their custom games. This paper details the design and implementation of PokerKit, including its intuitive programmatic API, multi-variant game support, and a unified hand evaluation suite across different hand types. The flexibility of PokerKit allows for applications in diverse areas, such as poker AI development, tool creation, and online poker casino implementation. PokerKit's reliability has been established through static type checking, extensive doctests, and unit tests, achieving 99% code coverage. The introduction of PokerKit represents a significant contribution to the field of computer poker, fostering future research and advanced AI development for a wide variety of poker games. The source code is available at https://github.com/uoftcprg/pokerkit
comment: 8 pages, 5 figures, accepted to the IEEE Transactions on Games
♻ ☆ Halfway Escape Optimization: A Quantum-Inspired Solution for General Optimization Problems
This paper first proposes the Halfway Escape Optimization (HEO) algorithm, a quantum-inspired metaheuristic designed to address general optimization problems characterized by rugged landscapes and high dimensionality with an efficient convergence rate. The study presents a comprehensive comparative evaluation of HEO's performance against established optimization algorithms, including Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Artificial Fish Swarm Algorithm (AFSA), Grey Wolf Optimizer (GWO), and Quantum-behaved Particle Swarm Optimization (QPSO). The primary analysis encompasses 14 benchmark functions with dimension 30, demonstrating HEO's effectiveness and adaptability in navigating general optimization problems and providing valuable insights into its performance. Tests of HEO on Pressure Vessel Design and Tubular Column Design indicate its feasibility and potential for real-time applications. Further validation on the Osmancik-97 and Cammeo rice classification task confirms the effectiveness of HEO, achieving a higher accuracy record.
♻ ☆ Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine} -- a few harmful data mixed into the fine-tuning dataset can break the LLM's safety alignment. Existing mitigation strategies include alignment stage solutions \cite{huang2024vaccine, rosati2024representation} and fine-tuning stage solutions \cite{huang2024lazy,mukhoti2023fine}. However, our evaluation shows that both categories of defenses fail \textit{when some specific training hyper-parameters are chosen} -- a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense, which, however, is necessary to guarantee fine-tuning performance. To this end, we propose Antidote, a post-fine-tuning stage solution, which remains \textbf{\textit{agnostic to the training hyper-parameters in the fine-tuning stage}}. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Our project page is at \url{https://huangtiansheng.github.io/Antidote_gh_page/}
♻ ☆ Investigating Recurrent Transformers with Dynamic Halt
In this paper, we comprehensively study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism: (1) the approach of incorporating a depth-wise recurrence similar to Universal Transformers; and (2) the approach of incorporating a chunk-wise temporal recurrence like Temporal Latent Bottleneck. Furthermore, we propose and investigate novel ways to extend and combine the above methods - for example, we propose a global mean-based dynamic halting mechanism for Universal Transformers and an augmentation of Temporal Latent Bottleneck with elements from Universal Transformer. We compare the models and probe their inductive biases in several diagnostic tasks, such as Long Range Arena (LRA), flip-flop language modeling, ListOps, and Logical Inference. The code is released in: https://github.com/JRC1995/InvestigatingRecurrentTransformers/tree/main
♻ ☆ OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. Language model systems often enable LLMs to generate code for arithmetic operations to achieve accurate calculations. However, this approach compromises speed and security, and fine-tuning risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in a single autoregressive step, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of an LLM to control a symbolic architecture that performs arithmetic. Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100\% accuracy on single arithmetic operations ($+,-,\times,\div,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o with and without a code interpreter. Furthermore, OccamLlama outperforms GPT 4o with and without a code interpreter on average across a range of mathematical problem solving benchmarks, demonstrating that OccamLLMs can excel in arithmetic tasks, even surpassing much larger models. We will make our code public shortly.
♻ ☆ How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates decision-making capabilities of LLMs through the lens of Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We introduce GAMA($\gamma$)-Bench, which evaluates LLMs' Gaming Ability in Multi-Agent environments. $\gamma$-Bench includes eight classical multi-agent games and a scoring scheme specially designed to quantitatively assess LLMs' performance. Leveraging $\gamma$-Bench, we investigate LLMs' robustness, generalizability, and strategies for enhancement. Results reveal that while GPT-3.5 shows satisfying robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we evaluate twelve versions from six models, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. We find that Gemini-1.5-Pro outperforms other models with a score of $63.8$ out of $100$, followed by LLaMA-3.1-70B and GPT-4 with scores of $60.9$ and $60.5$, respectively. The code and experimental results are made publicly available via https://github.com/CUHK-ARISE/GAMABench.
comment: 11 pages of main text. 20 pages of appendices. 12 figures, 9 tables. Added models: Gemini-1.5-Pro, LLaMA-3.1-{7, 70, 405}B, Mixtral-8x{7, 22}B, Qwen-2-72B
♻ ☆ MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning
Determining the optimal size of a neural network is critical, as it directly impacts runtime performance and memory usage. Pruning is a well-established model compression technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation. However, many recent pruning methods overlook the global contributions of individual model components, making it difficult to ensure that a pruned model meets the desired dataset and performance requirements. To address these challenges, we developed a new pruning algorithm, MPruner, that leverages mutual information through vector similarity. MPruner utilizes layer clustering with the Centered Kernel Alignment (CKA) similarity metric, allowing us to incorporate global information from the neural network for more precise and efficient layer-wise pruning. We evaluated MPruner across various architectures and configurations, demonstrating its versatility and providing practical guidelines. MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
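The CKA similarity at the heart of this kind of pipeline is straightforward to compute; the snippet below implements linear CKA between two layers' activations, where values near 1 indicate redundant layers that are natural candidates for clustering and pruning.

```python
import numpy as np

def linear_cka(x, y):
    """Linear Centered Kernel Alignment between two layers' activations
    (rows = the same n examples, columns = features). Values near 1 mean the
    layers carry nearly the same information, making them candidates for
    clustering/pruning in an MPruner-style pipeline."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(y.T @ x, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return num / den

acts_layer3 = np.random.randn(256, 64)
acts_layer4 = acts_layer3 + acts_layer3 @ np.random.randn(64, 64) * 0.1  # similar layer
print(linear_cka(acts_layer3, acts_layer4))   # close to 1 -> redundant pair
```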
♻ ☆ BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction
Analyzing and reconstructing visual stimuli from brain signals effectively advances the understanding of the human visual system. However, EEG signals are complex and contain significant noise. This leads to substantial limitations in existing work on reconstructing visual stimuli from EEG, such as difficulties in aligning EEG embeddings with fine-grained semantic information and a heavy reliance on additional large self-collected datasets for training. To address these challenges, we propose a novel approach called BrainVis. Firstly, we divide the EEG signals into various units and apply a self-supervised approach to them to obtain EEG time-domain features, in an attempt to ease the training difficulty. Additionally, we propose to utilize the frequency-domain features to enhance the EEG representations. Then, we simultaneously align EEG time-frequency embeddings with the interpolation of the coarse and fine-grained semantics in the CLIP space, to highlight the primary visual components and reduce the cross-modal alignment difficulty. Finally, we adopt cascaded diffusion models to reconstruct images. Using only 10\% of the training data of previous work, our proposed BrainVis outperforms the state of the art in both semantic fidelity reconstruction and generation quality. The code is available at https://github.com/RomGai/BrainVis.
♻ ☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
♻ ☆ GISR: Geometric Initialization and Silhouette-based Refinement for Single-View Robot Pose and Configuration Estimation
In autonomous robotics, measurement of the robot's internal state and perception of its environment, including interaction with other agents such as collaborative robots, are essential. Estimating the pose of the robot arm from a single view has the potential to replace classical eye-to-hand calibration approaches and is particularly attractive for online estimation and dynamic environments. In addition to its pose, recovering the robot configuration provides a complete spatial understanding of the observed robot that can be used to anticipate the actions of other agents in advanced robotics use cases. Furthermore, this additional redundancy enables the planning and execution of recovery protocols in case of sensor failures or external disturbances. We introduce GISR - a deep configuration and robot-to-camera pose estimation method that prioritizes execution in real-time. GISR consists of two modules: (i) a geometric initialization module that efficiently computes an approximate robot pose and configuration, and (ii) a deep iterative silhouette-based refinement module that arrives at a final solution in just a few iterations. We evaluate GISR on publicly available data and show that it outperforms existing methods of the same class in terms of both speed and accuracy, and can compete with approaches that rely on ground-truth proprioception and recover only the pose.
comment: IEEE Robotics and Automation Letters (under revision), code available at http://github.com/iwhitey/GISR-robot
♻ ☆ IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection
Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from 10 U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera-captured and video-captured identity documents, and testing schema unification and other identity document management functionalities.
comment: 40 pages
♻ ☆ Linear Contextual Bandits with Hybrid Payoff: Revisited ECML
We study the Linear Contextual Bandit problem in the hybrid reward setting. In this setting, every arm's reward model contains arm-specific parameters in addition to parameters shared across the reward models of all arms. We can reduce this setting to two closely related settings: (a) Shared - no arm-specific parameters, and (b) Disjoint - only arm-specific parameters, enabling the application of two popular state-of-the-art algorithms - $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ (Algorithm 1 in (Li et al. 2010)). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both algorithms, significantly improving on their known regret guarantees. Our novel analysis critically exploits the hybrid reward structure and the diversity condition. Moreover, we introduce a new algorithm $\texttt{HyLinUCB}$ that crucially modifies $\texttt{LinUCB}$ (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that $\texttt{HyLinUCB}$ also incurs only $O(\sqrt{T})$ regret for $T$ rounds. We perform extensive experiments on synthetic and real-world datasets demonstrating strong empirical performance of $\texttt{HyLinUCB}$. When the number of arm-specific parameters is much larger than the number of shared parameters, we observe that $\texttt{DisLinUCB}$ incurs the lowest regret. In this case, the regret of $\texttt{HyLinUCB}$ is the second best and remains highly competitive with $\texttt{DisLinUCB}$. In all other situations, including our real-world dataset, $\texttt{HyLinUCB}$ has significantly lower regret than $\texttt{LinUCB}$, $\texttt{DisLinUCB}$ and the other SOTA baselines we considered. We also empirically observe that the regret of $\texttt{HyLinUCB}$ grows much slower with the number of arms compared to baselines, making it suitable even for very large action spaces.
comment: Accepted at ECML PKDD 2024 as a Research Track Paper
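For context, a minimal sketch of the disjoint LinUCB baseline the paper builds on (Li et al., 2010 style): one ridge-regression model per arm plus an upper-confidence exploration bonus. The hybrid-setting modifications and the new exploration coefficient of HyLinUCB are not reproduced here.

```python
import numpy as np

class DisjointLinUCB:
    """Per-arm ridge regression with a UCB exploration bonus."""

    def __init__(self, n_arms, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha
        self.A = [reg * np.eye(dim) for _ in range(n_arms)]  # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]      # per-arm reward vectors

    def select(self, contexts):
        # contexts: (n_arms, dim) feature matrix for the current round.
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)      # exploration term
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```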
♻ ☆ Taking the Next Step with Generative Artificial Intelligence: The Transformative Role of Multimodal Large Language Models in Science Education
The integration of Artificial Intelligence (AI), particularly Large Language Model (LLM)-based systems, in education has shown promise in enhancing teaching and learning experiences. However, the advent of Multimodal Large Language Models (MLLMs) like GPT-4 with vision (GPT-4V), capable of processing multimodal data including text, sound, and visual inputs, opens a new era of enriched, personalized, and interactive learning landscapes in education. Grounded in the theory of multimedia learning, this paper explores the transformative role of MLLMs in central aspects of science education by presenting exemplary innovative learning scenarios. Possible applications for MLLMs could range from content creation to tailored support for learning, fostering competencies in scientific practices, and providing assessment and feedback. These scenarios are not limited to text-based and uni-modal formats but can be multimodal, thus increasing personalization, accessibility, and potential learning effectiveness. Besides many opportunities, challenges such as data protection and ethical considerations become more salient, calling for robust frameworks to ensure responsible integration. This paper underscores the necessity of a balanced approach to implementing MLLMs, where the technology complements rather than supplants the educator's role, thus ensuring an effective and ethical use of AI in science education. It calls for further research to explore the nuanced implications of MLLMs for the evolving role of educators and to extend the discourse beyond science education to other disciplines. Through the exploration of potentials, challenges, and future implications, we aim to contribute to a preliminary understanding of the transformative trajectory of MLLMs in science education and beyond.
comment: revised version 2. September 2024
♻ ☆ Preference Learning Algorithms Do Not Learn Preference Rankings
Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via $\textit{ranking accuracy}$. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the $\textit{idealized ranking accuracy}$ that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant $\textit{alignment gap}$ -- $\textit{i.e.}$, a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
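A short sketch of the ranking-accuracy idea discussed above: compute the model's log-likelihood of each response in a preference pair and count how often the preferred response scores higher. The length-normalization and masking conventions are assumptions; the paper's exact protocol may differ.

```python
import torch

def sequence_logprob(logits, labels, mask):
    # logits: (B, T, V); labels: (B, T); mask: 1.0 on response tokens, 0.0 elsewhere.
    # Assumes logits and labels are already aligned (shifted) for next-token prediction.
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)

def ranking_accuracy(logps_chosen, logps_rejected):
    # Fraction of preference pairs where the model assigns a higher
    # log-likelihood to the preferred response.
    return (logps_chosen > logps_rejected).float().mean().item()
```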
♻ ☆ A Voter-Based Stochastic Rejection-Method Framework for Asymptotically Safe Language Model Outputs
This paper proposes a new method for preventing unsafe or otherwise low-quality large language model (LLM) outputs by leveraging the stochasticity of LLMs. We propose a system whereby LLM checkers vote on the acceptability of a generated output, regenerating it if a threshold of disapproval is reached, until sufficiently many checkers approve. We further propose estimators for cost and failure rate, and based on those estimators and experimental data tailored to the application, we propose an algorithm that achieves a desired failure rate at the least possible cost. We demonstrate that, under these models, the failure rate decreases exponentially as a function of cost when the voter count and threshold are chosen according to the algorithm, and that the models reasonably estimate the actual performance of such a system in action, even with limited data.
comment: 7 pages, 2 figures
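A hedged sketch of the voting loop described in the abstract. `generate()` and `check(output)` are placeholder callables standing in for the LLM generator and a stochastic LLM checker; the cost and failure-rate estimators from the paper are not reproduced.

```python
def voter_rejection_sampling(generate, check, n_voters=5, reject_threshold=2,
                             max_attempts=10):
    """Regenerate the output until fewer than `reject_threshold` of the
    `n_voters` checker calls disapprove, or the attempt budget is exhausted."""
    for attempt in range(1, max_attempts + 1):
        output = generate()
        disapprovals = sum(1 for _ in range(n_voters) if not check(output))
        if disapprovals < reject_threshold:
            return output, attempt          # accepted
    return None, max_attempts               # budget spent without approval
```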
♻ ☆ Composer Style-specific Symbolic Music Generation Using Vector Quantized Discrete Diffusion Models
Emerging Denoising Diffusion Probabilistic Models (DDPM) have become increasingly utilised because of the promising results they have achieved in diverse generative tasks with continuous data, such as image and sound synthesis. Nonetheless, the success of diffusion models has not been fully extended to discrete symbolic music. We propose to combine a vector quantized variational autoencoder (VQ-VAE) and discrete diffusion models for the generation of symbolic music with desired composer styles. The trained VQ-VAE can represent symbolic music as a sequence of indexes that correspond to specific entries in a learned codebook. Subsequently, a discrete diffusion model is used to model the VQ-VAE's discrete latent space. The diffusion model is trained to generate intermediate music sequences consisting of codebook indexes, which are then decoded to symbolic music using the VQ-VAE's decoder. The evaluation results demonstrate that our model can generate symbolic music with target composer styles that meet the given conditions with a high accuracy of 72.36%. Our code is available at https://github.com/jinchengzhanggg/VQVAE-Diffusion.
♻ ☆ Zyda: A 1.3T Dataset for Open Language Modeling
The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In this paper, we introduce Zyda (Zyphra Dataset), a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus. We apply rigorous filtering and deduplication processes, both within and across datasets, to maintain and enhance the quality derived from the original datasets. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite. Our rigorous data processing methods significantly enhance Zyda's effectiveness, outperforming even the best of its constituent datasets when used independently.
♻ ☆ Correction with Backtracking Reduces Hallucination in Summarization
Abstractive summarization aims at generating natural language summaries of a source document that are succinct while preserving the important elements. Despite recent advances, neural text summarization models are known to be susceptible to hallucinating (or, more accurately, confabulating), that is, to producing summaries with details that are not grounded in the source document. In this paper, we introduce a simple yet efficient technique, CoBa, to reduce hallucination in abstractive summarization. The approach is based on two steps: hallucination detection and mitigation. We show that the former can be achieved through measuring simple statistics about conditional word probabilities and distance to context words. Further, we demonstrate that straightforward backtracking is surprisingly effective at mitigation. We thoroughly evaluate the proposed method against prior art on three benchmark datasets for text summarization. The results show that CoBa is effective and efficient in reducing hallucination, and offers great adaptability and flexibility. Code can be found at https://github.com/zhenzhel/CoBa.
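An illustrative sketch of detection-plus-backtracking, assuming a Hugging-Face-style causal LM (`model(ids).logits`) and batch size 1. The probability floor is a crude stand-in for the paper's detection statistics, which also use distance to context words.

```python
import torch

def decode_with_backtracking(model, input_ids, max_new_tokens=64,
                             prob_floor=0.05, backtrack_steps=1):
    """If the sampled token's conditional probability falls below `prob_floor`,
    drop the last `backtrack_steps` generated tokens and resample instead of
    committing to the low-confidence continuation."""
    ids = input_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1)
        too_unlikely = probs.gather(-1, next_id).item() < prob_floor
        if too_unlikely and ids.size(1) > input_ids.size(1):
            ids = ids[:, :-backtrack_steps]      # backtrack and try again
            continue
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```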
♻ ☆ TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets
Clinical trials are pivotal for developing new medical treatments, yet they typically pose risks such as patient mortality, adverse events, and enrollment failure that can waste immense efforts spanning over a decade. Applying artificial intelligence (AI) to forecast or simulate key events in clinical trials holds great potential for providing insights to guide trial designs. However, complex data collection and question definition requiring medical expertise and a deep understanding of trial designs have hindered the involvement of AI thus far. This paper tackles these challenges by presenting a comprehensive suite of meticulously curated AI-ready datasets covering multi-modal data (e.g., drug molecule, disease code, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design, encompassing prediction of trial duration, patient dropout rate, serious adverse events, mortality rate, trial approval outcome, trial failure reason, drug dose finding, and design of eligibility criteria. Furthermore, we provide basic validation methods for each task to ensure the datasets' usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design, ultimately advancing clinical trial research and accelerating medical solution development. The curated dataset, metrics, and basic models are publicly available at https://github.com/ML2Health/ML2ClinicalTrials/tree/main/AI4Trial.
♻ ☆ Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming
Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem -- producing a definition given a formal specification expressed as an F* type. We provide a program-fragment checker that queries F* to check the correctness of candidate solutions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding is that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compares favorably with that of large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.
comment: 47th International Conference on Software Engineering
♻ ☆ Cooperative Evolutionary Pressure and Diminishing Returns Might Explain the Fermi Paradox: On What Super-AIs Are Like
With an evolutionary approach, the basis of morality can be explained as adaptations to problems of cooperation. With 'evolution' taken in a broad sense, evolving AIs that satisfy the conditions for evolution to apply will be subject to the same cooperative evolutionary pressure as biological entities. Here the adaptiveness of increased cooperation as material safety and wealth increase is discussed -- for humans, for other societies, and for AIs. Diminishing beneficial returns from increased access to material resources also suggest the possibility that, on the whole, there will be no incentive to, for instance, colonize entire galaxies, thus providing a possible explanation of the Fermi paradox, wondering where everybody is. It is further argued that old societies could engender, and give way to, super-AIs, since it is likely that super-AIs are feasible, and fitter. Closing is an aside on effective ways for morals and goals to affect life and society, emphasizing environments, cultures, and laws, and exemplified by how to eat. Appended are an algorithm for colonizing for example a galaxy quickly, models of the evolution of cooperation and fairness under diminishing returns, and software for simulating signaling development. It is also noted that there can be no exponential colonization or reproduction, for mathematical reasons, as each entity takes up a certain amount of space.
comment: 29 pages, 1 figure. Added clarifications, references, expansions, and did some copy editing
♻ ☆ A Survey on Responsible Generative AI: What to Generate and What Not
In recent years, generative AI (GenAI), like large language models and text-to-image models, has received significant attention across various domains. However, ensuring the responsible generation of content by these models is crucial for their real-world applicability. This raises an interesting question: What should responsible GenAI generate, and what should it not? To answer the question, this paper investigates the practical responsible requirements of both textual and visual generative models, outlining five key considerations: generating truthful content, avoiding toxic content, refusing harmful instructions, leaking no training data-related content, and ensuring that generated content is identifiable. Specifically, we review recent advancements and challenges in addressing these requirements. In addition, we discuss and emphasize the importance of responsible GenAI across the healthcare, education, finance, and artificial general intelligence domains. Through a unified perspective on both textual and visual generative models, this paper aims to provide insights into practical safety-related issues and further benefit the community in building responsible GenAI.
comment: 77 pages, 10 figures
Machine Learning 51
☆ Double Machine Learning at Scale to Predict Causal Impact of Customer Actions ECML
The Causal Impact (CI) of customer actions is broadly used across the industry to inform both short- and long-term investment decisions of various types. In this paper, we apply the double machine learning (DML) methodology to estimate CI values across hundreds of customer actions of business interest and hundreds of millions of customers. We operationalize DML through a causal ML library based on Spark with a flexible, JSON-driven model configuration approach to estimate CI at scale (i.e., across hundreds of actions and millions of customers). We outline the DML methodology and implementation, and the associated benefits over the traditional potential-outcomes-based CI model. We show population-level as well as customer-level CI values along with confidence intervals. The validation metrics show a 2.2% gain over the baseline methods and a 2.5X gain in computational time. Our contribution is to advance the scalable application of CI, while also providing an interface that allows faster experimentation, cross-platform support, the ability to onboard new use cases, and improved accessibility of the underlying code for partner teams.
comment: 16 pages, 11 figures. Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2023, Turin, Italy
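For readers unfamiliar with DML, a minimal partially-linear sketch of the core idea (cross-fitted residual-on-residual regression) using scikit-learn; this is not the paper's Spark-based library, and the nuisance models are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def dml_causal_impact(X, treatment, outcome, n_folds=5):
    """Estimate the effect of `treatment` on `outcome` after cross-fitted
    residualization of both against the confounders X."""
    m_hat = cross_val_predict(GradientBoostingRegressor(), X, outcome, cv=n_folds)
    e_hat = cross_val_predict(GradientBoostingRegressor(), X, treatment, cv=n_folds)
    y_res, t_res = outcome - m_hat, treatment - e_hat
    theta = np.sum(t_res * y_res) / np.sum(t_res ** 2)   # final-stage OLS slope
    return theta
```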
☆ Generative Principal Component Regression via Variational Inference
The ability to manipulate complex systems, such as the brain, to modify specific outcomes has far-reaching implications, particularly in the treatment of psychiatric disorders. One approach to designing appropriate manipulations is to target key features of predictive models. While generative latent variable models, such as probabilistic principal component analysis (PPCA), are powerful tools for identifying targets, they struggle to incorporate information relevant to low-variance outcomes into the latent space. When stimulation targets are designed on the latent space in such a scenario, the intervention can be suboptimal with minimal efficacy. To address this problem, we develop a novel objective based on supervised variational autoencoders (SVAEs) that enforces that such information is represented in the latent space. The novel objective can be used with linear models, such as PPCA, which we refer to as generative principal component regression (gPCR). We show in simulations that gPCR dramatically improves target selection for manipulation as compared to standard PCR and SVAEs. As part of these simulations, we develop a metric for detecting when relevant information is not properly incorporated into the loadings. We then show, on two neural datasets related to stress and social behavior, that gPCR dramatically outperforms PCR in predictive performance and that SVAEs exhibit low incorporation of relevant information into the loadings. Overall, this work suggests that our method significantly improves target selection for manipulation using latent variable models over competing inference schemes.
☆ TimeDiT: General-purpose Diffusion Transformers for Time Series Foundation Model ICML 2024
With recent advances in building foundation models for texts and video data, there is a surge of interest in foundation models for time series. A family of models have been developed, utilizing a temporal auto-regressive generative Transformer architecture, whose effectiveness has been proven in Large Language Models. While the empirical results are promising, almost all existing time series foundation models have only been tested on well-curated ``benchmark'' datasets very similar to texts. However, real-world time series exhibit unique challenges, such as variable channel sizes across domains, missing values, and varying signal sampling intervals due to the multi-resolution nature of real-world data. Additionally, the uni-directional nature of temporally auto-regressive decoding limits the incorporation of domain knowledge, such as physical laws expressed as partial differential equations (PDEs). To address these challenges, we introduce the Time Diffusion Transformer (TimeDiT), a general foundation model for time series that employs a denoising diffusion paradigm instead of temporal auto-regressive generation. TimeDiT leverages the Transformer architecture to capture temporal dependencies and employs diffusion processes to generate high-quality candidate samples without imposing stringent assumptions on the target distribution via novel masking schemes and a channel alignment strategy. Furthermore, we propose a finetuning-free model editing strategy that allows the seamless integration of external knowledge during the sampling process without updating any model parameters. Extensive experiments conducted on a variety of tasks, such as forecasting, imputation, and anomaly detection, demonstrate the effectiveness of TimeDiT.
comment: 23 Pages, 6 Figures, 11 Tables. First present at ICML 2024 Workshop on Foundation Models in the Wild
☆ On the Benefits of Memory for Modeling Time-Dependent PDEs
Data-driven techniques have emerged as a promising alternative to traditional numerical methods for solving partial differential equations (PDEs). These techniques frequently offer a better trade-off between computational cost and accuracy for many PDE families of interest. For time-dependent PDEs, existing methodologies typically treat PDEs as Markovian systems, i.e., the evolution of the system only depends on the ``current state'', and not the past states. However, distortion of the input signals -- e.g., due to discretization or low-pass filtering -- can render the evolution of the distorted signals non-Markovian. In this work, motivated by the Mori-Zwanzig theory of model reduction, we investigate the impact of architectures with memory for modeling PDEs: that is, when past states are explicitly used to predict the future. We introduce Memory Neural Operator (MemNO), a network based on the recent SSM architectures and Fourier Neural Operator (FNO). We empirically demonstrate on a variety of PDE families of interest that when the input is given on a low-resolution grid, MemNO significantly outperforms the baselines without memory, achieving more than 6 times less error on unseen PDEs. Via a combination of theory and experiments, we show that the effect of memory is particularly significant when the solution of the PDE has high frequency Fourier components (e.g., low-viscosity fluid dynamics), and it also increases robustness to observation noise.
☆ QID$^2$: An Image-Conditioned Diffusion Model for Q-space Up-sampling of DWI Data MICCAI 2024
We propose an image-conditioned diffusion model to estimate high angular resolution diffusion weighted imaging (DWI) from a low angular resolution acquisition. Our model, which we call QID$^2$, takes as input a set of low angular resolution DWI data and uses this information to estimate the DWI data associated with a target gradient direction. We leverage a U-Net architecture with cross-attention to preserve the positional information of the reference images, further guiding the target image generation. We train and evaluate QID$^2$ on single-shell DWI samples curated from the Human Connectome Project (HCP) dataset. Specifically, we sub-sample the HCP gradient directions to produce low angular resolution DWI data and train QID$^2$ to reconstruct the missing high angular resolution samples. We compare QID$^2$ with two state-of-the-art GAN models. Our results demonstrate that QID$^2$ not only achieves higher-quality generated images, but it consistently outperforms the GAN models in downstream tensor estimation across multiple metrics. Taken together, this study highlights the potential of diffusion models, and QID$^2$ in particular, for q-space up-sampling, thus offering a promising toolkit for clinical and research applications.
comment: Accepted at MICCAI 2024 International Workshop on Computational Diffusion MRI. Zijian Chen and Jueqi Wang contributed equally to this work
☆ A Lesion-aware Edge-based Graph Neural Network for Predicting Language Ability in Patients with Post-stroke Aphasia MICCAI 2024
We propose a lesion-aware graph neural network (LEGNet) to predict language ability from resting-state fMRI (rs-fMRI) connectivity in patients with post-stroke aphasia. Our model integrates three components: an edge-based learning module that encodes functional connectivity between brain regions, a lesion encoding module, and a subgraph learning module that leverages functional similarities for prediction. We use synthetic data derived from the Human Connectome Project (HCP) for hyperparameter tuning and model pretraining. We then evaluate the performance using repeated 10-fold cross-validation on an in-house neuroimaging dataset of post-stroke aphasia. Our results demonstrate that LEGNet outperforms baseline deep learning methods in predicting language ability. LEGNet also exhibits superior generalization ability when tested on a second in-house dataset that was acquired under a slightly different neuroimaging protocol. Taken together, the results of this study highlight the potential of LEGNet in effectively learning the relationships between rs-fMRI connectivity and language ability in a patient cohort with brain lesions for improved post-stroke aphasia evaluation.
comment: Accepted at MICCAI 2024 International Workshop on Machine Learning in Clinical Neuroimaging (MLCN)
☆ K-Origins: Better Colour Quantification for Neural Networks
K-Origins is a neural network layer designed to improve image-based network performance when learning colour, or intensities, is beneficial. Over 250 encoder-decoder convolutional networks are trained and tested on 16-bit synthetic data, demonstrating that K-Origins improves semantic segmentation accuracy in two scenarios: object detection with low signal-to-noise ratios, and segmenting multiple objects that are identical in shape but vary in colour. K-Origins generates output features from the input features, $\textbf{X}$, by the equation $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$ for each trainable parameter $w_k$, where $\textbf{J}$ is a matrix of ones. Additionally, networks with varying receptive fields were trained to determine optimal network depths based on the dimensions of target classes, suggesting that receptive field lengths should exceed object sizes. By ensuring a sufficient receptive field length and incorporating K-Origins, we can achieve better semantic segmentation performance.
comment: 16 pages, 13 figures, 1 table
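A compact PyTorch sketch of the layer equation $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$ given above. The parameter initialization and the choice to stack the shifted copies along the channel dimension are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class KOrigins(nn.Module):
    """For each trainable origin w_k, emit Y_k = X - J * w_k (J a tensor of ones),
    i.e. the input shifted by a learned intensity reference."""

    def __init__(self, num_origins: int):
        super().__init__()
        self.w = nn.Parameter(torch.linspace(0.0, 1.0, num_origins))  # init is a guess

    def forward(self, x):                      # x: (B, C, H, W)
        shifted = [x - w_k for w_k in self.w]  # broadcasting implements J * w_k
        return torch.cat(shifted, dim=1)       # (B, C * num_origins, H, W)
```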
☆ Reinforcement Learning-enabled Satellite Constellation Reconfiguration and Retasking for Mission-Critical Applications
The development of satellite constellation applications is rapidly advancing due to increasing user demands, reduced operational costs, and technological advancements. However, a significant gap in the existing literature concerns reconfiguration and retasking issues within satellite constellations, which is the primary focus of our research. In this work, we critically assess the impact of satellite failures on constellation performance and the associated task requirements. To facilitate this analysis, we introduce a system modeling approach for GPS satellite constellations, enabling an investigation into performance dynamics and task distribution strategies, particularly in scenarios where satellite failures occur during mission-critical operations. Additionally, we introduce reinforcement learning (RL) techniques, specifically Q-learning, Policy Gradient, Deep Q-Network (DQN), and Proximal Policy Optimization (PPO), for managing satellite constellations, addressing the challenges posed by reconfiguration and retasking following satellite failures. Our results demonstrate that DQN and PPO achieve effective outcomes in terms of average rewards, task completion rates, and response times.
comment: Accepted for publication in the IEEE Military Communications Conference (IEEE MILCOM 2024)
♻ ☆ PID Accelerated Temporal Difference Algorithms
Long-horizon tasks, which have a large discount factor, pose a challenge for most conventional reinforcement learning (RL) algorithms. Algorithms such as Value Iteration and Temporal Difference (TD) learning have a slow convergence rate and become inefficient in these tasks. When the transition distributions are given, PID VI was recently introduced to accelerate the convergence of Value Iteration using ideas from control theory. Inspired by this, we introduce PID TD Learning and PID Q-Learning algorithms for the RL setting, in which only samples from the environment are available. We give a theoretical analysis of the convergence of PID TD Learning and its acceleration compared to the conventional TD Learning. We also introduce a method for adapting PID gains in the presence of noise and empirically verify its effectiveness.
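A hedged tabular sketch of what a PID-style TD(0) update can look like: the usual TD error acts as the proportional term, a decayed running sum of errors as the integral term, and the change in the value estimate as the derivative term. The gains, decay factor, and exact form are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def pid_td_update(V, V_prev, z, s, r, s_next, gamma,
                  alpha=0.05, kp=1.0, ki=0.05, kd=0.1, beta=0.95):
    """One PID-flavoured TD(0) step on tabular arrays V, V_prev, z indexed by state."""
    delta = r + gamma * V[s_next] - V[s]        # proportional term (TD error)
    z[s] = beta * z[s] + delta                  # integral term (decayed error sum)
    deriv = V[s] - V_prev[s]                    # derivative term
    V_prev[s] = V[s]
    V[s] += alpha * (kp * delta + ki * z[s] + kd * deriv)
    return V, V_prev, z
```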
♻ ☆ Improving Rare Word Translation With Dictionaries and Attention Masking
In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.
comment: 11 pages, 3 figures, 3 tables. Accepted at AMTA 2024
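A toy sketch of the data-preparation step implied by the abstract: append a definition for each rare source word and record which appended positions belong to which source position, so that an attention mask can later link a rare word only to its own definition. The `<def>` marker and the token-level bookkeeping are illustrative assumptions.

```python
def append_definitions(src_tokens, definitions):
    """Return the extended token list plus a map from rare-word positions to the
    positions of their appended definition tokens."""
    tokens = list(src_tokens)
    links = {}                                  # source index -> definition indices
    for i, tok in enumerate(src_tokens):
        if tok in definitions:
            start = len(tokens)
            tokens += ["<def>", tok, "="] + definitions[tok].split()
            links[i] = list(range(start, len(tokens)))
    return tokens, links
```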
♻ ☆ Low-Rank Quantization-Aware Training for LLMs
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute- and memory-efficient. Quantization-aware training (QAT) methods generally produce the best quantized performance, but this comes at the cost of potentially long training times and excessive memory usage, making them impractical when applied to LLMs. Inspired by the parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers; and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, incurring no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; and (iii) can be applied across a wide range of quantization settings, such as different choices of quantization granularity and activation quantization, and can be seamlessly combined with many PTQ techniques. We apply LR-QAT to the LLaMA-1/2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer-grade GPU with 24GB of memory. Our source code is available at https://github.com/qualcomm-ai-research/LR-QAT
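An illustrative sketch of the general idea of quantization-grid-aware low-rank adapters: a frozen weight plus a trainable low-rank correction is passed through a fake-quantizer with a straight-through estimator, so the adapters are trained against the quantized forward pass. The symmetric per-tensor quantizer, initialization, and scale handling are simplifying assumptions and differ from LR-QAT's downcasting and checkpointing machinery.

```python
import torch
import torch.nn as nn

def fake_quantize(w, scale, n_bits=4):
    # Symmetric fake-quantization with a straight-through estimator (STE):
    # forward uses the quantized weight, backward passes gradients unchanged.
    qmax = 2 ** (n_bits - 1) - 1
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_q = w_int * scale
    return w + (w_q - w).detach()

class LowRankQATLinear(nn.Module):
    """Frozen base weight w0 plus trainable low-rank A @ B, quantized together."""

    def __init__(self, weight, rank=8, n_bits=4):
        super().__init__()
        out_f, in_f = weight.shape
        self.register_buffer("w0", weight)
        self.A = nn.Parameter(torch.zeros(out_f, rank))
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.scale = nn.Parameter(weight.abs().max() / (2 ** (n_bits - 1) - 1))
        self.n_bits = n_bits

    def forward(self, x):
        w = self.w0 + self.A @ self.B
        return x @ fake_quantize(w, self.scale, self.n_bits).t()
```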
♻ ☆ Force-Guided Bridge Matching for Full-Atom Time-Coarsened Dynamics of Peptides
Molecular Dynamics (MD) simulations are irreplaceable and ubiquitous in fields such as materials science, chemistry, and pharmacology. Conventional MD simulations are plagued by numerical instability as well as long equilibration times, which limits their broader application. Recently, a surge of deep learning approaches has been devised for time-coarsened dynamics, which learn the state transition mechanism over much larger time scales to overcome these limitations. However, only a few methods target the underlying Boltzmann distribution by resampling techniques, where proposals are rarely accepted as new states, resulting in low efficiency. In this work, we propose a force-guided bridge matching model, FBM, a novel framework that first incorporates physical priors into bridge matching for full-atom time-coarsened dynamics. With the guidance of our well-designed intermediate force field, FBM can target the Boltzmann-like distribution by direct inference without extra steps. Experiments on small peptides verify our superiority in terms of comprehensive metrics and demonstrate transferability to unseen peptide systems.
♻ ☆ Verifiable cloud-based variational quantum algorithms
Variational quantum algorithms (VQAs) have shown potential for quantum advantage with noisy intermediate-scale quantum (NISQ) devices for quantum machine learning (QML). However, given the high cost and limited availability of quantum resources, delegating VQAs via cloud networks is a more practical solution for clients with limited quantum capabilities. Recently, Shingu et al.[Physical Review A, 105, 022603 (2022)] proposed a variational secure cloud quantum computing protocol, utilizing ancilla-driven quantum computation (ADQC) for cloud-based VQAs with minimal quantum resource consumption. However, their protocol lacks verifiability, which exposes it to potential malicious behaviors by the server. Additionally, channel loss requires frequent re-delegation as the size of the delegated variational circuit grows, complicating verification due to increased circuit complexity. This paper introduces a new protocol to address these challenges and enhance both verifiability and tolerance to channel loss in cloud-based VQAs.
♻ ☆ Bayesian Learning in a Nonlinear Multiscale State-Space Model
The ubiquity of multiscale interactions in complex systems is well-recognized, with development and heredity serving as a prime example of how processes at different temporal scales influence one another. This work introduces a novel multiscale state-space model to explore the dynamic interplay between systems interacting across different time scales, with feedback between each scale. We propose a Bayesian learning framework to estimate unknown states by learning the unknown process noise covariances within this multiscale model. We develop a Particle Gibbs with Ancestor Sampling (PGAS) algorithm for inference and demonstrate through simulations the efficacy of our approach.
comment: Corrected a typo
♻ ☆ Foundation Models for Music: A Survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover that many music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods across diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music should focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.
♻ ☆ Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection CCS 2024
In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions that rely on keywords. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do it again in the following days, even though rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to real-world samples received from different sources. Our results show that email kits are being reused extensively and that visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection features that rely on textual features and keywords are bypassed, which our observations show happens frequently.
comment: To be published in the proceedings of the ACM Conference on Computer and Communications Security (ACM CCS 2024)
♻ ☆ On the Convergence of Gradient Descent for Large Learning Rates
A vast literature on convergence guarantees for gradient descent and derived methods exists at the moment. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge starting from any initialization? We provide fundamental impossibility results showing that convergence becomes impossible no matter the initialization if the step size gets too big. Looking at the asymptotic value of the gradient norm along the optimization trajectory, we see that there is a phase transition as the step size crosses a critical value. This has been observed by practitioners, yet the true mechanisms through which this happens remain unclear beyond heuristics. Using results from dynamical systems theory, we provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient. We validate our findings through experiments with non-linear networks.
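The phase transition described above is easy to see on a one-dimensional quadratic: for f(x) = a/2 * x^2 the gradient-descent iterate contracts by |1 - step*a| each step, so the method converges iff step < 2/a, regardless of initialization. The toy below simply demonstrates that fact and is not the paper's linear-network analysis.

```python
def gradient_descent_quadratic(a=4.0, x0=1.0, step=0.6, n_steps=50):
    """Run gradient descent on f(x) = a/2 * x^2 and return the final iterate."""
    x = x0
    for _ in range(n_steps):
        x -= step * a * x          # iterate is multiplied by (1 - step*a) each step
    return x

print(gradient_descent_quadratic(step=0.4))   # 0.4 < 2/4: converges towards 0
print(gradient_descent_quadratic(step=0.6))   # 0.6 > 2/4: |x| blows up
```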
♻ ☆ On the Federated Learning Framework for Cooperative Perception
Cooperative perception is essential to enhance the efficiency and safety of future transportation systems, requiring extensive data sharing among vehicles on the road, which raises significant privacy concerns. Federated learning offers a promising solution by enabling data privacy-preserving collaborative enhancements in perception, decision-making, and planning among connected and autonomous vehicles (CAVs). However, federated learning is impeded by significant challenges arising from data heterogeneity across diverse clients, potentially diminishing model accuracy and prolonging convergence periods. This study introduces a specialized federated learning framework for cooperative perception (CP), termed the federated dynamic weighted aggregation (FedDWA) algorithm, facilitated by a dynamic adjusting loss (DALoss) function. This framework employs dynamic client weighting to direct model convergence and integrates a novel loss function that utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental effects of non-independently and identically distributed (Non-IID) and unbalanced data. Utilizing the BEV transformer as the primary model, our rigorous testing on the OpenV2V dataset, augmented with FedBEVT data, demonstrates significant improvements in the average intersection over union (IoU). These results highlight the substantial potential of our federated learning framework to address data heterogeneity challenges in CP, thereby enhancing the accuracy of environmental perception models and facilitating more robust and efficient collaborative learning solutions in the transportation sector.
comment: accepted by IEEE RA-L
♻ ☆ An embedding-based distance for temporal graphs
Temporal graphs are commonly used to represent time-resolved relations between entities in many natural and artificial systems. Many techniques were devised to investigate the evolution of temporal graphs by comparing their state at different time points. However, quantifying the similarity between temporal graphs as a whole is an open problem. Here, we use embeddings based on time-respecting random walks to introduce a new notion of distance between temporal graphs. This distance is well-defined for pairs of temporal graphs with different numbers of nodes and different time spans. We study the case of a matched pair of graphs, when a known relation exists between their nodes, and the case of unmatched graphs, when such a relation is unavailable and the graphs may be of different sizes. We use empirical and synthetic temporal network data to show that the distance we introduce discriminates graphs with different topological and temporal properties. We provide an efficient implementation of the distance computation suitable for large-scale temporal graphs.
♻ ☆ Heterogeneity-Informed Meta-Parameter Learning for Spatiotemporal Time Series Forecasting KDD'24
Spatiotemporal time series forecasting plays a key role in a wide range of real-world applications. While significant progress has been made in this area, fully capturing and leveraging spatiotemporal heterogeneity remains a fundamental challenge. Therefore, we propose a novel Heterogeneity-Informed Meta-Parameter Learning scheme. Specifically, our approach implicitly captures spatiotemporal heterogeneity through learning spatial and temporal embeddings, which can be viewed as a clustering process. Then, a novel spatiotemporal meta-parameter learning paradigm is proposed to learn spatiotemporal-specific parameters from meta-parameter pools, which is informed by the captured heterogeneity. Based on these ideas, we develop a Heterogeneity-Informed Spatiotemporal Meta-Network (HimNet) for spatiotemporal time series forecasting. Extensive experiments on five widely-used benchmarks demonstrate our method achieves state-of-the-art performance while exhibiting superior interpretability. Our code is available at https://github.com/XDZhelheim/HimNet.
comment: Published in KDD'24 Research Track
♻ ☆ FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability
We present FairX, an open-source Python-based benchmarking tool designed for the comprehensive analysis of models under the umbrella of fairness, utility, and eXplainability (XAI). FairX enables users to train benchmarking bias-mitigation models and evaluate their fairness using a wide array of fairness metrics and data utility metrics, and to generate explanations for model predictions, all within a unified framework. Existing benchmarking tools provide no way to evaluate synthetic data generated by fair generative models, nor do they support training fair generative models. In FairX, we add fair generative models to our fair-model library (pre-processing, in-processing, post-processing) and include evaluation metrics for assessing the quality of synthetic fair data. This version of FairX supports both tabular and image datasets. It also allows users to provide their own custom datasets. The open-source FairX benchmarking package is publicly available at \url{https://github.com/fahim-sikder/FairX}.
♻ ☆ Behavioral Learning of Dish Rinsing and Scrubbing based on Interruptive Direct Teaching Considering Assistance Rate
Robots are expected to manipulate objects in a safe and dexterous way. For example, washing dishes is a dexterous operation that involves scrubbing the dishes with a sponge and rinsing them with water. It is necessary to learn it safely without splashing water and without dropping the dishes. In this study, we propose a safe and dexterous manipulation system. The robot learns a dynamics model of the object by estimating the state of the object and the robot itself, the control input, and the amount of human assistance required (assistance rate) after the human corrects the initial trajectory of the robot's hands by interruptive direct teaching. By backpropagating the error between the estimated and the reference value using the acquired dynamics model, the robot can generate a control input that approaches the reference value, for example, so that human assistance is not required and the dish does not move excessively. This allows for adaptive rinsing and scrubbing of dishes with unknown shapes and properties. As a result, it is possible to generate safe actions that require less human assistance.
comment: Accepted at Advanced Robotics
♻ ☆ Towards Explainable Traffic Flow Prediction with Large Language Models
Traffic forecasting is crucial for intelligent transportation systems. It has experienced significant advancements thanks to the power of deep learning in capturing latent patterns of traffic data. However, recent deep-learning architectures require intricate model designs and lack an intuitive understanding of the mapping from input data to predicted results. Achieving both accuracy and explainability in traffic prediction models remains a challenge due to the complexity of traffic data and the inherent opacity of deep learning models. To tackle these challenges, we propose a Traffic flow Prediction model based on Large Language Models (LLMs) to generate explainable traffic predictions, named xTP-LLM. By transferring multi-modal traffic data into natural language descriptions, xTP-LLM captures complex time-series patterns and external factors from comprehensive traffic data. The LLM framework is fine-tuned using language-based instructions to align with spatial-temporal traffic flow data. Empirically, xTP-LLM shows competitive accuracy compared with deep learning baselines, while providing an intuitive and reliable explanation for predictions. This paper contributes to advancing explainable traffic prediction models and lays a foundation for future exploration of LLM applications in transportation. To the best of our knowledge, this is the first study to use LLM for explainable prediction of traffic flows.
comment: 31pages, 16 figures
♻ ☆ OceanGPT: A Large Language Model for Ocean Science Tasks ACL2024
Ocean science, which delves into the oceans that are reservoirs of life and biodiversity, is of great significance given that oceans cover over 70% of our planet's surface. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in science. Despite the success in other domains, current LLMs often fall short in catering to the needs of domain experts like oceanographers, and the potential of LLMs for ocean science is under-explored. The intrinsic reasons are the immense and intricate nature of ocean data as well as the necessity for higher granularity and richness in knowledge. To alleviate these issues, we introduce OceanGPT, the first-ever large language model in the ocean domain, which is expert in various ocean science tasks. We also propose a novel framework to automatically obtain a large volume of ocean-domain instruction data, which generates instructions based on multi-agent collaboration. Additionally, we construct the first oceanography benchmark, OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through comprehensive experiments, OceanGPT not only shows a higher level of knowledge expertise for ocean science tasks but also gains preliminary embodied intelligence capabilities in ocean technology.
comment: ACL2024. Project Website: http://oceangpt.zjukg.cn/
♻ ☆ Statistical Context Detection for Deep Lifelong Reinforcement Learning
Context detection involves labeling segments of an online stream of data as belonging to different tasks. Task labels are used in lifelong learning algorithms to perform consolidation or other procedures that prevent catastrophic forgetting. Inferring task labels from online experiences remains a challenging problem. Most approaches assume finite and low-dimensional observation spaces or a preliminary training phase during which task labels are learned. Moreover, changes in the transition or reward functions can be detected only in combination with a policy, and therefore are more difficult to detect than changes in the input distribution. This paper presents an approach to learning both policies and labels in an online deep reinforcement learning setting. The key idea is to use distance metrics, obtained via optimal transport methods, i.e., the Wasserstein distance, on suitable latent action-reward spaces to measure distances between sets of data points from past and current streams. Such distances can then be used for statistical tests based on an adapted Kolmogorov-Smirnov calculation to assign labels to sequences of experiences. A rollback procedure is introduced to learn multiple policies by ensuring that only the appropriate data is used to train the corresponding policy. The combination of task detection and policy deployment allows for the optimization of lifelong reinforcement learning agents without an oracle that provides task labels. The approach is tested using two benchmarks and the results show promising performance when compared with related context detection algorithms. The results suggest that optimal transport statistical methods provide an explainable and justifiable procedure for online context detection and reward optimization in lifelong reinforcement learning.
comment: 10 pages excluding references and bibliography. Accepted at CoLLAs 2024
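A rough sketch of the detection idea using off-the-shelf SciPy routines: compare per-dimension Wasserstein distances between a reference latent set and (i) a held-out half of the reference versus (ii) the current window, then run a Kolmogorov-Smirnov test on the two distance samples. The paper's adapted KS calculation and latent action-reward spaces are not reproduced exactly.

```python
from scipy.stats import ks_2samp, wasserstein_distance

def context_changed(reference, current, alpha=0.01):
    """reference, current: (N, d) arrays of latent features from past/current streams.
    Returns True if the current window looks statistically different, i.e. a new
    context label is likely needed."""
    half = len(reference) // 2
    ref_a, ref_b = reference[:half], reference[half:]
    null_dists = [wasserstein_distance(ref_a[:, j], ref_b[:, j])
                  for j in range(reference.shape[1])]
    test_dists = [wasserstein_distance(ref_a[:, j], current[:, j])
                  for j in range(reference.shape[1])]
    stat, p_value = ks_2samp(null_dists, test_dists)
    return p_value < alpha
```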
♻ ☆ Enhancing Cell Tracking with a Time-Symmetric Deep Learning Approach
The accurate tracking of live cells using video microscopy recordings remains a challenging task for popular state-of-the-art image processing based object tracking methods. In recent years, several existing and new applications have attempted to integrate deep-learning based frameworks for this task, but most of them still heavily rely on consecutive frame based tracking embedded in their architecture or other premises that hinder generalized learning. To address this issue, we aimed to develop a new deep-learning based tracking method that relies solely on the assumption that cells can be tracked based on their spatio-temporal neighborhood, without restricting it to consecutive frames. The proposed method has the additional benefit that the motion patterns of the cells can be learned completely by the predictor without any prior assumptions, and it has the potential to handle a large number of video frames with heavy artifacts. The efficacy of the proposed method is demonstrated through biologically motivated validation strategies and compared against multiple state-of-the-art cell tracking methods.
♻ ☆ Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction
Graph Neural Networks have been widely applied in critical decision-making areas that demand interpretable predictions, leading to the flourishing development of interpretability algorithms. However, current graph interpretability algorithms tend to emphasize generality and often overlook biological significance, thereby limiting their applicability in predicting cancer drug responses. In this paper, we propose a novel post-hoc interpretability algorithm for cancer drug response prediction, CETExplainer, which incorporates a controllable edge-type-specific weighting mechanism. It considers the mutual information between subgraphs and predictions, proposing a structural scoring approach to provide fine-grained, biologically meaningful explanations for predictive models. We also introduce a method for constructing ground truth based on real-world datasets to quantitatively evaluate the proposed interpretability algorithm. Empirical analysis on the real-world dataset demonstrates that CETExplainer achieves superior stability and improves explanation quality compared to leading algorithms, thereby offering a robust and insightful tool for cancer drug prediction.
♻ ☆ GANs Conditioning Methods: A Survey
In recent years, Generative Adversarial Networks (GANs) have seen significant advancements, leading to their widespread adoption across various fields. The original GAN architecture enables the generation of images without any specific control over the content, making it an unconditional generation process. However, many practical applications require precise control over the generated output, which has led to the development of conditional GANs (cGANs) that incorporate explicit conditioning to guide the generation process. cGANs extend the original framework by incorporating additional information (conditions), enabling the generation of samples that adhere to that specific criteria. Various conditioning methods have been proposed, each differing in how they integrate the conditioning information into both the generator and the discriminator networks. In this work, we review the conditioning methods proposed for GANs, exploring the characteristics of each method and highlighting their unique mechanisms and theoretical foundations. Furthermore, we conduct a comparative analysis of these methods, evaluating their performance on various image datasets. Through these analyses, we aim to provide insights into the strengths and limitations of various conditioning techniques, guiding future research and application in generative modeling.
♻ ☆ A Survey on Stability of Learning with Limited Labelled Data and its Sensitivity to the Effects of Randomness
Learning with limited labelled data, such as prompting, in-context learning, fine-tuning, meta-learning or few-shot learning, aims to effectively train a model using only a small amount of labelled samples. However, these approaches have been observed to be excessively sensitive to the effects of uncontrolled randomness caused by non-determinism in the training process. The randomness negatively affects the stability of the models, leading to large variances in results across training runs. When such sensitivity is disregarded, it can unintentionally, but unfortunately also intentionally, create an imaginary perception of research progress. Recently, this area started to attract research attention and the number of relevant studies is continuously growing. In this survey, we provide a comprehensive overview of 415 papers addressing the effects of randomness on the stability of learning with limited labelled data. We distinguish between four main tasks addressed in the papers (investigate/evaluate; determine; mitigate; benchmark/compare/report randomness effects), providing findings for each one. Furthermore, we identify and discuss seven challenges and open problems together with possible directions to facilitate further research. The ultimate goal of this survey is to emphasise the importance of this growing research area, which so far has not received an appropriate level of attention, and reveal impactful directions for future research.
comment: Accepted to ACM Comput. Surv. 2024
♻ ☆ Efficient Heterogeneous Graph Learning via Random Projection
Heterogeneous Graph Neural Networks (HGNNs) are powerful tools for deep learning on heterogeneous graphs. Typical HGNNs require repetitive message passing during training, limiting efficiency for large-scale real-world graphs. Recent pre-computation-based HGNNs use one-time message passing to transform a heterogeneous graph into regular-shaped tensors, enabling efficient mini-batch training. Existing pre-computation-based HGNNs can be mainly categorized into two styles, which differ in how they trade off information loss against efficiency. We propose a hybrid pre-computation-based HGNN, named Random Projection Heterogeneous Graph Neural Network (RpHGNN), which combines the efficiency of one style with the low information loss of the other. To achieve efficiency, the main framework of RpHGNN consists of propagate-then-update iterations, where we introduce a Random Projection Squashing step to ensure that complexity increases only linearly. To achieve low information loss, we introduce a Relation-wise Neighbor Collection component with an Even-odd Propagation Scheme, which collects information from neighbors in a finer-grained way. Experimental results indicate that our approach achieves state-of-the-art results on seven small and large benchmark datasets while also being 230% faster than the most effective baseline. Surprisingly, our approach not only surpasses pre-processing-based baselines but also outperforms end-to-end methods.
comment: Accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE)
♻ ☆ KTO: Model Alignment as Prospect Theoretic Optimization ICML 2024
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
comment: ICML 2024
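As a reading aid for the KTO abstract above, the following is a minimal sketch of a Kahneman-Tversky-style per-example alignment loss, assuming the common formulation in which the policy/reference log-likelihood ratio acts as an implied reward judged against a reference point. The parameter names (`beta`, `ref_point`) and the exact weighting are illustrative assumptions, not the paper's released implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_style_loss(logp_policy: float, logp_ref: float,
                   ref_point: float, desirable: bool,
                   beta: float = 0.1) -> float:
    """Sketch of a prospect-theory-flavoured alignment loss for one example.

    logp_policy / logp_ref: log-likelihood of the output under the policy and
    the frozen reference model; their difference acts as an implied reward.
    ref_point: a reference value (e.g., an estimate of the typical reward)
    against which gains and losses are judged, as prospect theory prescribes.
    desirable: the binary signal saying whether the output is desirable.
    """
    implied_reward = logp_policy - logp_ref
    if desirable:
        # gains are valued relative to the reference point
        value = sigmoid(beta * (implied_reward - ref_point))
    else:
        # losses are mirrored around the reference point
        value = sigmoid(beta * (ref_point - implied_reward))
    return 1.0 - value  # minimizing the loss maximizes the modeled utility
```

Averaging this quantity over a batch of desirable/undesirable outputs gives a training signal that needs only binary feedback rather than paired preferences.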
♻ ☆ End-to-end Feature Selection Approach for Learning Skinny Trees AISTATS 2024
We propose a new optimization-based approach for feature selection in tree ensembles, an important problem in statistics and machine learning. Popular tree ensemble toolkits, e.g., Gradient Boosted Trees and Random Forests, support post-training feature selection based on feature importance scores; while widely used, these scores are known to have drawbacks. We propose Skinny Trees: an end-to-end toolkit for feature selection in tree ensembles where we train a tree ensemble while controlling the number of selected features. Our optimization-based approach learns an ensemble of differentiable trees and simultaneously performs feature selection using a grouped $\ell_0$-regularizer. We use first-order methods for optimization and present convergence guarantees for our approach. We use a dense-to-sparse regularization scheduling scheme that can lead to more expressive and sparser tree ensembles. On 15 synthetic and real-world datasets, Skinny Trees can achieve $1.5\!\times\! -~620~\!\times\!$ feature compression rates, leading to up to $10\times$ faster inference than dense trees, without any loss in performance. Skinny Trees leads to better feature selection than many existing toolkits, e.g., in terms of AUC performance at a 25\% feature budget, Skinny Trees outperforms LightGBM by $10.2\%$ (up to $37.7\%$) and Random Forests by $3\%$ (up to $12.5\%$).
comment: Accepted in AISTATS 2024
♻ ☆ Fair Mixed Effects Support Vector Machine
To ensure unbiased and ethical automated predictions, fairness must be a core principle in machine learning applications. Fairness in machine learning aims to mitigate biases present in the training data and model imperfections that could lead to discriminatory outcomes. This is achieved by preventing the model from making decisions based on sensitive characteristics like ethnicity or sexual orientation. A fundamental assumption in machine learning is the independence of observations. However, this assumption often does not hold for data describing social phenomena, where data points are often naturally clustered. Hence, if machine learning models do not account for the cluster correlations, the results may be biased. The bias is especially pronounced when the cluster assignment is correlated with the variable of interest. We present a fair mixed effects support vector machine algorithm that can handle both problems simultaneously. With a reproducible simulation study we demonstrate the impact of clustered data on the quality of fair machine learning predictions.
comment: 14 pages, 6 figures
♻ ☆ White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
comment: Accepted at Journal of Machine Learning Research. This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes and formatting improvements. V3: improvements from journal revisions
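For readers unfamiliar with the objective named in the CRATE abstract above, one common presentation of sparse rate reduction is restated below. The exact constants and the choice of $\ell_0$ versus $\ell_1$ sparsity vary between presentations, so treat this as a hedged reading aid rather than the paper's definitive form.

```latex
% Sparse rate reduction over token representations Z, given K subspaces U_{[K]}:
\max_{Z}\;\; \Delta R\big(Z;\,U_{[K]}\big) \;-\; \lambda \,\lVert Z \rVert_0,
\qquad
\Delta R\big(Z;\,U_{[K]}\big) \;=\; R(Z) \;-\; R^{c}\big(Z;\,U_{[K]}\big),
% where the (lossy) coding rate of n tokens of dimension d is
R(Z) \;=\; \tfrac{1}{2}\,\log\det\!\Big(I \;+\; \tfrac{d}{n\epsilon^{2}}\, Z Z^{\top}\Big),
% and R^{c} measures the total coding rate after projecting onto the K subspaces.
```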
♻ ☆ Contrastive Learning and Abstract Concepts: The Case of Natural Numbers
Contrastive Learning (CL) has been successfully applied to classification and other downstream tasks related to concrete concepts, such as objects contained in the ImageNet dataset. No attempts seem to have been made so far to apply this promising scheme to more abstract entities. A prominent example of these could be the concept of (discrete) Quantity. CL can frequently be interpreted as a self-supervised scheme guided by some profound and ubiquitous conservation principle (e.g. conservation of identity in object classification tasks). In this introductory work we apply a suitable conservation principle to the semi-abstract concept of natural numbers by which discrete quantities can be estimated or predicted. We experimentally show, by means of a toy problem, that contrastive learning can be trained to count at a glance with high accuracy at both human and super-human ranges. We compare this with the results of a trained-to-count-at-a-glance supervised learning (SL) neural network scheme of similar architecture. We show that both schemes exhibit similar good performance on baseline experiments, where the distributions of the training and testing stages are equal. Importantly, we demonstrate that in some generalization scenarios, where training and testing distributions differ, CL boasts more robust and much better error performance.
♻ ☆ NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs which often contain tens or hundreds of billions of parameters. We create NeMo-Aligner, a toolkit for model alignment that can efficiently scale to a thousand GPUs for training the largest open-source LLMs such as Nemotron 4 340B and Llama 3.1 405B. NeMo-Aligner comes with highly optimized and scalable implementations for major paradigms of model alignment such as: Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). Additionally, our toolkit supports running most of the alignment techniques in a Parameter Efficient Fine-Tuning (PEFT) setting. NeMo-Aligner is designed for extensibility, allowing support for other alignment techniques with minimal effort. It is open-sourced with Apache 2.0 License and we invite community contributions at https://github.com/NVIDIA/NeMo-Aligner
comment: 16 pages, 4 figures, Accepted to COLM 2024
♻ ☆ On the Optimality of Misspecified Spectral Algorithms
In the misspecified spectral algorithms problem, researchers usually assume that the underlying true function $f_{\rho}^{*} \in [\mathcal{H}]^{s}$, a less-smooth interpolation space of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ for some $s\in (0,1)$. The existing minimax optimal results require $\|f_{\rho}^{*}\|_{L^{\infty}}<\infty$, which implicitly requires $s > \alpha_{0}$ where $\alpha_{0}\in (0,1)$ is the embedding index, a constant depending on $\mathcal{H}$. Whether the spectral algorithms are optimal for all $s\in (0,1)$ is an outstanding problem that has lasted for years. In this paper, we show that spectral algorithms are minimax optimal for any $\alpha_{0}-\frac{1}{\beta} < s < 1$, where $\beta$ is the eigenvalue decay rate of $\mathcal{H}$. We also give several classes of RKHSs whose embedding index satisfies $ \alpha_0 = \frac{1}{\beta} $. Thus, the spectral algorithms are minimax optimal for all $s\in (0,1)$ on these RKHSs.
comment: 50 pages, 2 figures
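As context for the abstract above, the minimax rate at stake in this line of work is usually stated as below for source smoothness $s$ and eigenvalue decay $\beta$; this is the standard form from the kernel-regression literature (up to parametrization conventions) and is included only as a reading aid, not as a quotation from the paper.

```latex
% Standard minimax rate for estimating f_rho^* in the interpolation space [H]^s,
% with kernel eigenvalues decaying as lambda_i ~ i^{-beta}:
\mathbb{E}\,\big\lVert \hat f_n - f_{\rho}^{*} \big\rVert_{L^{2}}^{2}
\;\asymp\; n^{-\frac{s\beta}{s\beta+1}}.
```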
♻ ☆ Safety Constrained Multi-Agent Reinforcement Learning for Active Voltage Control IJCAI2024
Active voltage control presents a promising avenue for relieving power congestion and enhancing voltage quality, taking advantage of the distributed controllable generators in the power network, such as roof-top photovoltaics. While Multi-Agent Reinforcement Learning (MARL) has emerged as a compelling approach to address this challenge, existing MARL approaches tend to overlook the constrained optimization nature of this problem, failing to guarantee safety constraints. In this paper, we formalize the active voltage control problem as a constrained Markov game and propose a safety-constrained MARL algorithm. We expand the primal-dual optimization RL method to multi-agent settings, and augment it with a novel approach of double safety estimation to learn the policy and to update the Lagrange multiplier. In addition, we propose different cost functions and investigate their influence on the behavior of our constrained MARL method. We evaluate our approach in a power distribution network simulation environment with real-world-scale scenarios. Experimental results demonstrate the effectiveness of the proposed method compared with state-of-the-art MARL methods. This paper is published at \url{https://www.ijcai.org/Proceedings/2024/}.
comment: Accepted by IJCAI2024
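A minimal sketch of the Lagrange-multiplier update that sits at the core of primal-dual constrained RL methods like the one described above, assuming a standard constrained formulation; the cost limit `cost_limit`, step size `eta`, and the estimator feeding `estimated_cost` are placeholders, not the paper's exact algorithm (which additionally uses a double safety estimation scheme).

```python
def dual_update(lagrange_multiplier: float,
                estimated_cost: float,
                cost_limit: float,
                eta: float = 1e-3) -> float:
    """Projected gradient ascent on the dual variable.

    The multiplier grows when the estimated expected safety cost exceeds the
    limit, which increases the penalty on unsafe behaviour in the subsequent
    primal (policy) update, and shrinks otherwise, but never below zero.
    """
    return max(0.0, lagrange_multiplier + eta * (estimated_cost - cost_limit))
```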
♻ ☆ Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers
Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications, including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scale. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data at scale. To address these issues in prior works, we create a unified framework to evaluate various deep models, from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers, and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate that our proposed approach consistently outperforms prior art on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification, with an AUROC of 92.5%.
♻ ☆ TimeSeriesBench: An Industrial-Grade Benchmark for Time Series Anomaly Detection Models
Time series anomaly detection (TSAD) has gained significant attention due to its real-world applications to improve the stability of modern software systems. However, there is no effective way to verify whether existing algorithms can meet the requirements for real-world deployment. Firstly, current algorithms typically train a specific model for each time series. Maintaining so many models is impractical in a large-scale system with tens of thousands of curves. The performance of using merely one unified model to detect anomalies remains unknown. Secondly, most TSAD models are trained on the historical part of a time series and are tested on its future segment. In distributed systems, however, there are frequent system deployments and upgrades, with new, previously unseen time series emerging daily. The performance of current TSAD algorithms on newly incoming unseen time series remains unknown. Lastly, the assumptions of the evaluation metrics in existing benchmarks are far from practical demands. To solve the above-mentioned problems, we propose an industrial-grade benchmark, TimeSeriesBench. We assess the performance of existing algorithms across more than 168 evaluation settings and provide a comprehensive analysis to inform the future design of anomaly detection algorithms. An industrial dataset is also released along with TimeSeriesBench.
comment: Accepted by ISSRE'24
♻ ☆ Investigating Recurrent Transformers with Dynamic Halt
In this paper, we comprehensively study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism: (1) the approach of incorporating a depth-wise recurrence similar to Universal Transformers; and (2) the approach of incorporating a chunk-wise temporal recurrence like Temporal Latent Bottleneck. Furthermore, we propose and investigate novel ways to extend and combine the above methods - for example, we propose a global mean-based dynamic halting mechanism for Universal Transformers and an augmentation of Temporal Latent Bottleneck with elements from Universal Transformer. We compare the models and probe their inductive biases in several diagnostic tasks, such as Long Range Arena (LRA), flip-flop language modeling, ListOps, and Logical Inference. The code is released in: https://github.com/JRC1995/InvestigatingRecurrentTransformers/tree/main
♻ ☆ OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. Language model systems often enable LLMs to generate code for arithmetic operations to achieve accurate calculations. However, this approach compromises speed and security, and fine-tuning risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in a single autoregressive step, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of a LLM to control a symbolic architecture that performs arithmetic. Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100\% accuracy on single arithmetic operations ($+,-,\times,\div,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o with and without a code interpreter. Furthermore, OccamLlama outperforms GPT 4o with and without a code interpreter on average across a range of mathematical problem solving benchmarks, demonstrating that OccamLLMs can excel in arithmetic tasks, even surpassing much larger models. We will make our code public shortly.
♻ ☆ MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning
Determining the optimal size of a neural network is critical, as it directly impacts runtime performance and memory usage. Pruning is a well-established model compression technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation. However, many recent pruning methods overlook the global contributions of individual model components, making it difficult to ensure that a pruned model meets the desired dataset and performance requirements. To address these challenges, we developed a new pruning algorithm, MPruner, that leverages mutual information through vector similarity. MPruner utilizes layer clustering with the Centered Kernel Alignment (CKA) similarity metric, allowing us to incorporate global information from the neural network for more precise and efficient layer-wise pruning. We evaluated MPruner across various architectures and configurations, demonstrating its versatility and providing practical guidelines. MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
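The layer-clustering step in the abstract above relies on Centered Kernel Alignment; a minimal sketch of the standard linear CKA similarity between two layers' activations is given below. The pruning policy built on top of it (which clustered layers to merge or drop) is the paper's contribution and is not reproduced here.

```python
import numpy as np


def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices X (n x d1) and Y (n x d2),
    computed over the same n examples. Features are centered first, as the
    'Centered' in CKA requires."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

Layers whose pairwise CKA is close to 1 carry largely redundant information, which is what makes them natural candidates for clustering and pruning.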
♻ ☆ Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions
Learning-based control has recently shown great efficacy in performing complex tasks for various applications. However, to deploy it in real systems, it is of vital importance to guarantee the system will stay safe. Control Barrier Functions (CBFs) offer mathematical tools for designing safety-preserving controllers for systems with known dynamics. In this article, we first introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers using Gaussian Process (GP) regression to close the gap between an approximate mathematical model and the real system, which results in a second-order cone program (SOCP)-based control design. We then present the pointwise feasibility conditions of the resulting safety controller, highlighting the level of richness that the available system information must meet to ensure safety. We use these conditions to devise an event-triggered online data collection strategy that ensures the recursive feasibility of the learned safety controller. Our method works by constantly reasoning about whether the current information is sufficient to ensure safety or if new measurements under active safe exploration are required to reduce the uncertainty. As a result, our proposed framework can guarantee the forward invariance of the safe set defined by the CBF with high probability, even if it contains a priori unexplored regions. We validate the proposed framework in two numerical simulation experiments.
comment: Journal article. Includes the results of the 2021 CDC paper titled "Pointwise feasibility of gaussian process-based safety-critical control under model uncertainty" and proposes a recursively feasible safe online learning algorithm as new contribution
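For readers unfamiliar with Control Barrier Functions, the underlying safety condition is the standard one below: the safe set $\mathcal{C} = \{x : h(x) \ge 0\}$ is kept forward invariant if, at every state, some admissible input satisfies the constraint. The GP-based reformulation described in the abstract adds probabilistic bounds on the unknown dynamics terms, turning the resulting controller synthesis into a second-order cone program.

```latex
% Standard CBF condition for control-affine dynamics \dot x = f(x) + g(x)u:
\exists\, u \in \mathcal{U}:\quad
\underbrace{L_f h(x) + L_g h(x)\,u}_{\dot h(x,u)}
\;\ge\; -\,\alpha\big(h(x)\big),
\qquad \alpha \in \mathcal{K}_{\infty}^{e}.
```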
♻ ☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding; precise and limitation-aware artifact documentation; efficient model training; advance awareness of the environmental impact of training; careful model evaluation of capabilities, risks, and claims; as well as responsible model release, licensing, and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
♻ ☆ Kolmogorov Arnold Networks in Fraud Detection: Bridging the Gap Between Theory and Practice
This study evaluates the applicability of Kolmogorov-Arnold Networks (KAN) in fraud detection, finding that their effectiveness is context-dependent. We propose a quick decision rule using Principal Component Analysis (PCA) to assess the suitability of KAN: if data can be effectively separated in two dimensions using splines, KAN may outperform traditional models; otherwise, other methods could be more appropriate. We also introduce a heuristic approach to hyperparameter tuning, significantly reducing computational costs. These findings suggest that while KAN has potential, its use should be guided by data-specific assessments.
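A minimal sketch of the kind of quick PCA check described in the abstract above, assuming tabular features `X` and binary fraud labels `y`; the separability proxy (a cross-validated score of a simple classifier on the two leading components) is an assumed stand-in for the paper's spline-based criterion, not its exact decision rule.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def pca_2d_separability(X: np.ndarray, y: np.ndarray) -> float:
    """Project the data onto its two leading principal components and return
    a cross-validated score of a simple classifier on that 2-D projection.
    A high score suggests the classes are already separable in two dimensions,
    which the described decision rule treats as a hint that a KAN-style spline
    model may be suitable; a low score suggests other methods may fit better."""
    Z = PCA(n_components=2).fit_transform(X)
    scores = cross_val_score(LogisticRegression(max_iter=1000), Z, y, cv=5)
    return float(scores.mean())
```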
♻ ☆ Deep-MacroFin: Informed Equilibrium Neural Network for Continuous Time Economic Models
In this paper, we present Deep-MacroFin, a comprehensive framework designed to solve partial differential equations, with a particular focus on models in continuous time economics. This framework leverages deep learning methodologies, including conventional Multi-Layer Perceptrons and the newly developed Kolmogorov-Arnold Networks. It is optimized using economic information encapsulated by Hamilton-Jacobi-Bellman equations and coupled algebraic equations. The application of neural networks holds the promise of accurately resolving high-dimensional problems with fewer computational demands and limitations compared to standard numerical methods. This versatile framework can be readily adapted for elementary differential equations, and systems of differential equations, even in cases where the solutions may exhibit discontinuities. Importantly, it offers a more straightforward and user-friendly implementation than existing libraries.
comment: 25 pages, 8 figures
♻ ☆ What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims
Background: Collaborative Software Package Registries (SPRs) are an integral part of the software supply chain. Much engineering work synthesizes SPR packages into applications. Prior research has examined SPRs for traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. Aims: Recent empirical research has examined PTM registries with respect to vulnerabilities, reuse processes, and evolution. However, no existing research synthesizes these studies to provide a systematic understanding of the current knowledge. Some of the existing research includes qualitative claims lacking quantitative analysis. Our research fills these gaps by providing a knowledge synthesis and quantitative analyses. Methods: We first conduct a systematic literature review (SLR). We then observe that some of the claims are qualitative. We identify quantifiable metrics associated with those claims, and measure them in order to substantiate these claims. Results: From our SLR, we identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative validation. We successfully test 3 of these claims through a quantitative analysis, and directly compare one with traditional software. Our findings corroborate qualitative claims with quantitative measurements. Our findings are: (1) PTMs have a much higher turnover rate than traditional software, indicating a dynamic and rapidly evolving reuse environment within the PTM ecosystem; and (2) there is a strong correlation between documentation quality and PTM popularity. Conclusions: We confirm qualitative research claims with concrete metrics, supporting prior qualitative and case study research. Our measures show further dynamics of PTM reuse, inspiring research infrastructure and new measures.
comment: [ESEM'24] Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) 2024
♻ ☆ Linear Contextual Bandits with Hybrid Payoff: Revisited ECML
We study the Linear Contextual Bandit problem in the hybrid reward setting. In this setting, every arm's reward model contains arm-specific parameters in addition to parameters shared across the reward models of all the arms. We can reduce this setting to two closely related settings: (a) Shared - no arm-specific parameters, and (b) Disjoint - only arm-specific parameters, enabling the application of two popular state-of-the-art algorithms - $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ (Algorithm 1 in (Li et al. 2010)). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both algorithms, significantly improving on the known regret guarantees of these algorithms. Our novel analysis critically exploits the hybrid reward structure and the diversity condition. Moreover, we introduce a new algorithm $\texttt{HyLinUCB}$ that crucially modifies $\texttt{LinUCB}$ (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that $\texttt{HyLinUCB}$ also incurs only $O(\sqrt{T})$ regret for $T$ rounds. We perform extensive experiments on synthetic and real-world datasets demonstrating strong empirical performance of $\texttt{HyLinUCB}$. When the number of arm-specific parameters is much larger than the number of shared parameters, we observe that $\texttt{DisLinUCB}$ incurs the lowest regret. In this case, the regret of $\texttt{HyLinUCB}$ is the second best and extremely competitive with $\texttt{DisLinUCB}$. In all other situations, including our real-world dataset, $\texttt{HyLinUCB}$ has significantly lower regret than $\texttt{LinUCB}$, $\texttt{DisLinUCB}$ and the other SOTA baselines we considered. We also empirically observe that the regret of $\texttt{HyLinUCB}$ grows much slower with the number of arms compared to baselines, making it suitable even for very large action spaces.
comment: Accepted at ECML PKDD 2024 as a Research Track Paper
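For context, the classical LinUCB routine that the paper modifies is sketched below in its shared-parameter form; the exploration coefficient `alpha` is precisely the quantity HyLinUCB replaces with a sparsity-aware one, and that modification is not reproduced here.

```python
import numpy as np


class LinUCB:
    """Standard LinUCB with a single shared parameter vector (sketch)."""

    def __init__(self, dim: int, alpha: float = 1.0):
        self.A = np.eye(dim)      # regularized design matrix sum x x^T + I
        self.b = np.zeros(dim)    # reward-weighted feature sum
        self.alpha = alpha        # exploration coefficient

    def select(self, arm_features: np.ndarray) -> int:
        """arm_features: (num_arms, dim) context of each arm in this round."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                       # ridge estimate
        means = arm_features @ theta
        # confidence widths sqrt(x^T A^{-1} x) for every arm at once
        widths = np.sqrt(np.einsum("ad,de,ae->a", arm_features, A_inv, arm_features))
        return int(np.argmax(means + self.alpha * widths))

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x
```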
♻ ☆ Persian Slang Text Conversion to Formal and Deep Learning of Persian Short Texts on Social Media for Sentiment Classification
The lack of a suitable tool for the analysis of conversational texts in the Persian language has made various analyses of these texts, including Sentiment Analysis, difficult. In this research, we make it easier for the machine to understand these texts by providing PSC, the Persian Slang Converter, a tool for converting conversational texts into formal ones, and by using up-to-date deep learning methods together with PSC to improve the machine's sentiment learning of short Persian-language texts. More than 10 million unlabeled texts from various social networks and movie subtitles (as conversational texts) and about 10 million news texts (as formal texts) were used for training unsupervised models and the formal implementation of the tool. 60,000 texts from the comments of Instagram social network users, with positive, negative, and neutral labels, are considered supervised data for training the sentiment classification model of short texts. Using the formalization tool, 57% of the words of the conversational corpus were converted. Finally, by using the formalizer, the FastText model, and a deep LSTM network, an accuracy of 81.91% was obtained on the test data.
comment: 16 pages, 4 figures, 14 tables
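A minimal sketch of the kind of embedding + LSTM sentiment classifier described above, written in PyTorch; the vocabulary size, embedding dimension, and three-class output are assumptions standing in for the paper's FastText-initialised model rather than its exact architecture.

```python
import torch
import torch.nn as nn


class LSTMSentimentClassifier(nn.Module):
    """Embedding -> LSTM -> linear head for 3-way sentiment (sketch)."""

    def __init__(self, vocab_size: int, embed_dim: int = 300,
                 hidden_dim: int = 128, num_classes: int = 3):
        super().__init__()
        # In the described setup the embedding table would be initialised from
        # FastText vectors trained on the formalised (converted) corpus.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer-encoded, formalised text
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        return self.head(hidden[-1])  # logits over sentiment classes
```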
♻ ☆ Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients
We present QP-SBGD, a novel layer-wise stochastic optimiser tailored towards training neural networks with binary weights, known as binary neural networks (BNNs), on quantum hardware. BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy. However, training them in practice remains an open challenge. Most known BNN optimisers either rely on projected updates or binarise weights post-training. Instead, QP-SBGD approximately maps the gradient onto binary variables by solving a quadratic constrained binary optimisation. Under practically reasonable assumptions, we show that this update rule converges with a rate of $\mathcal{O}(1 / \sqrt{T})$. Moreover, we show how the $\mathcal{NP}$-hard projection can be effectively executed on an adiabatic quantum annealer, harnessing recent advancements in quantum computation. We also introduce a projected version of this update rule and prove that if a fixed point exists in the binary variable space, the modified updates will converge to it. Last but not least, our algorithm is implemented layer-wise, making it suitable for training larger networks on resource-limited quantum hardware. Through extensive evaluations, we show that QP-SBGD outperforms or is on par with competitive and well-established baselines such as BinaryConnect, signSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as well as binary graph neural networks.
Information Retrieval 5
☆ SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow
Document spanners have been proposed as a formal framework for declarative Information Extraction (IE) from text, following IE products from industry and academia. Over the past decade, the framework has been studied thoroughly in terms of expressive power, complexity, and the ability to naturally combine text analysis with relational querying. This demonstration presents SpannerLib, a library for embedding document spanners in Python code. SpannerLib facilitates the development of IE programs by providing an implementation of Spannerlog (Datalog-based document spanners) that interacts with the Python code in two directions: rules can be embedded inside Python, and they can invoke custom Python code (e.g., calls to ML-based NLP models) via user-defined functions. The demonstration scenarios showcase IE programs, with increasing levels of complexity, within Jupyter Notebook.
comment: 4 pages
☆ Laser: Parameter-Efficient LLM Bi-Tuning for Sequential Recommendation with Collaborative Information
Sequential recommender systems are essential for discerning user preferences from historical interactions and facilitating targeted recommendations. Recent innovations employing Large Language Models (LLMs) have advanced the field by encoding item semantics, yet they often necessitate substantial parameter tuning and are resource-demanding. Moreover, these works fail to consider the diverse characteristics of different types of users, which diminishes recommendation accuracy. In this paper, we propose a parameter-efficient Large Language Model Bi-Tuning framework for sequential recommendation with collaborative information (Laser). Specifically, Bi-Tuning works by inserting trainable virtual tokens at both the prefix and suffix of the input sequence and freezing the LLM parameters, thus optimizing the LLM for the sequential recommendation task. In our Laser, the prefix is utilized to incorporate user-item collaborative information and adapt the LLM to the recommendation task, while the suffix converts the output embeddings of the LLM from the language space to the recommendation space for the follow-up item recommendation. Furthermore, to capture the characteristics of different types of users when integrating the collaborative information via the prefix, we introduce M-Former, a lightweight MoE-based querying transformer that uses a set of query experts to integrate diverse user-specific collaborative information encoded by frozen ID-based sequential recommender systems, significantly improving the accuracy of recommendations. Extensive experiments on real-world datasets demonstrate that Laser can parameter-efficiently adapt LLMs to effective recommender systems, significantly outperforming state-of-the-art methods.
comment: 11 pages, 4 figures
☆ Blockchain-based Federated Recommendation with Incentive Mechanism
Nowadays, federated recommendation technology is rapidly evolving to help multiple organisations share data and train models while meeting user privacy, data security and government regulatory requirements. However, federated recommendation increases customer system costs such as power, computational and communication resources. Besides, federated recommendation systems are also susceptible to model attacks and data poisoning by participating malicious clients. Therefore, most customers are unwilling to participate in federated recommendation without any incentive. To address these problems, we propose a blockchain-based federated recommendation system with an incentive mechanism to promote a more trustworthy, secure, and efficient federated recommendation service. First, we construct a federated recommendation system based on NeuMF and FedAvg. Then we introduce a reverse auction mechanism to select optimal clients that can maximize the social surplus. Finally, we employ blockchain for on-chain evidence storage of models to ensure the safety of the federated recommendation system. The experimental results show that our proposed incentive mechanism can attract clients with superior training data to engage in federated recommendation at a lower cost, which can increase the economic benefit of federated recommendation by 54.9\% while improving the recommendation performance. Thus our work provides theoretical and technological support for the construction of a harmonious and healthy ecological environment for the application of federated recommendation.
comment: This paper has been accepted on 2024 Blockchain and Web3 Technology Innovation and Application Exchange Conference (BWTAC 2024)
♻ ☆ rerankers: A Lightweight Python Library to Unify Ranking Methods
This paper presents rerankers, a Python library which provides an easy-to-use interface to the most commonly used re-ranking approaches. Re-ranking is an integral component of many retrieval pipelines; however, there exist numerous approaches to it, relying on different implementation methods. rerankers unifies these methods into a single user-friendly interface, allowing practitioners and researchers alike to explore different methods while only changing a single line of Python code. Moreover, rerankers ensures that its implementations are done with the fewest dependencies possible, and re-uses the original implementation whenever possible, guaranteeing that our simplified interface results in no performance degradation compared to more complex ones. The full source code and list of supported models are updated regularly and available at https://github.com/answerdotai/rerankers.
♻ ☆ Impedance vs. Power Side-channel Vulnerabilities: A Comparative Study
In recent times, impedance side-channel analysis has emerged as a potent strategy for adversaries seeking to extract sensitive information from computing systems. It leverages variations in the intrinsic impedance of a chip's internal structure across different logic states. In this study, we conduct a comparative analysis between the newly explored impedance side channel and the well-established power side channel. Through experimental evaluation, we investigate the efficacy of these two side channels in extracting the cryptographic key from the Advanced Encryption Standard (AES) and analyze their performance. Our results indicate that impedance analysis demonstrates a higher potential for cryptographic key extraction compared to power side-channel analysis. Moreover, we identify scenarios where power side-channel analysis does not yield satisfactory results, whereas impedance analysis proves to be more robust and effective. This work not only underscores the significance of impedance side-channel analysis in enhancing cryptographic security but also emphasizes the necessity for a deeper understanding of its mechanisms and implications.
Computation and Language 37
☆ DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models
As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce {DiversityMedQA}, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. Furthermore, to ensure the perturbations were accurate, we also propose a filtering strategy that validates each perturbation. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
☆ The Compressor-Retriever Architecture for Language Model OS
Recent advancements in large language models (LLMs) have significantly enhanced their capacity to aggregate and process information across multiple modalities, enabling them to perform a wide range of tasks such as multimodal data querying, tool usage, web interactions, and handling long documents. These capabilities pave the way for transforming LLMs from mere chatbots into general-purpose agents capable of interacting with the real world. This paper explores the concept of using a language model as the core component of an operating system (OS), effectively acting as a CPU that processes data stored in a context window, which functions as RAM. A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions, a feature limited by the current session-based interaction paradigm due to context window size limit. To address this, we introduce compressor-retriever, a model-agnostic architecture designed for life-long context management. Unlike other long-context solutions such as retrieval-augmented generation, our approach exclusively uses the base model's forward function to compress and retrieve context, ensuring end-to-end differentiability. Preliminary experiments demonstrate the effectiveness of this architecture in in-context learning tasks, marking a step towards the development of a fully stateful LLM OS. Project repo available at: https://github.com/gblackout/LM-OS
☆ Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling. These models use conditionally activated feedforward subnetworks in transformer blocks, allowing for a separation between total model parameters and per-example computation. However, large token-routed SMoE models face a significant challenge: during inference, the entire model must be used for a sequence or a batch, resulting in high latencies in a distributed setting that offsets the advantages of per-token sparse activation. Our research explores task-specific model pruning to inform decisions about designing SMoE architectures, mainly modulating the choice of expert counts in pretraining. We investigate whether such pruned models offer advantages over smaller SMoE models trained from scratch, when evaluating and comparing them individually on tasks. To that end, we introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training. Our findings reveal a threshold pruning factor for the reduction that depends on the number of experts used in pretraining, above which, the reduction starts to degrade model performance. These insights contribute to our understanding of model design choices when pretraining with SMoE architectures, particularly useful when considering task-specific inference optimization for later stages.
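A minimal sketch of the general idea of task-aware expert pruning described above, assuming access to per-expert routing counts collected offline by running the pretrained model on task data; the function name `keep_top_experts` and the frequency-based criterion are illustrative assumptions, not the UNCURL algorithm itself.

```python
import numpy as np


def keep_top_experts(routing_counts: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return indices of experts to retain in one MoE layer.

    routing_counts: (num_experts,) how often each expert was selected while
    running the pretrained model on task-specific data (offline, post-training).
    keep_ratio: fraction of experts to keep; the abstract notes a pruning
    threshold below which further reduction starts to degrade task performance.
    """
    num_experts = routing_counts.shape[0]
    num_keep = max(1, int(round(keep_ratio * num_experts)))
    # keep the most frequently routed experts for this task
    return np.argsort(routing_counts)[::-1][:num_keep]
```

In a full pipeline, the router's output logits would then be restricted (or re-normalized) to the retained expert indices before task evaluation.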
☆ Masked Mixers for Language Generation and Retrieval
Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit that there is a downside to the use of attention: most information present in the input is necessarily lost. In support of this idea, we observe poor input representation accuracy in transformers, but find more accurate representation in what we term masked mixers, which replace self-attention with masked convolutions. Applied to TinyStories, the masked mixer learns causal language tasks more efficiently than early transformer implementations and somewhat less efficiently than optimized, current implementations. The most efficient learning algorithm observed for this dataset is a transformer-masked mixer hybrid, suggesting that these models learn in an orthogonal manner. We hypothesized that the information loss exhibited by transformers would be much more detrimental to retrieval than generation, and to test this we introduce an efficient training approach for retrieval models based on existing generative model embeddings. With this method, embeddings from masked mixers are found to result in far better summary-to-story retrieval compared to embeddings from transformers.
comment: 23 pages, 15 figures (11 primary, 4 supplementary)
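A minimal sketch of a masked (causal) convolutional mixer block of the kind the abstract contrasts with self-attention; the kernel size, hidden dimensions, and pre-norm layout are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedConvMixerBlock(nn.Module):
    """Token mixing via a causal 1-D convolution instead of self-attention."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Left padding only, so position t never sees positions > t (the "mask").
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(dim, dim, kernel_size)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        h = self.norm(x).transpose(1, 2)        # (batch, dim, seq_len)
        h = F.pad(h, (self.pad, 0))             # causal left padding
        x = x + self.conv(h).transpose(1, 2)    # mixed tokens, residual add
        return x + self.mlp(self.norm(x))       # position-wise MLP, residual add
```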
☆ PoliPrompt: A High-Performance Cost-Effective LLM-Based Text Classification Framework for Political Science
Recent advancements in large language models (LLMs) have opened new avenues for enhancing text classification efficiency in political science, surpassing traditional machine learning methods that often require extensive feature engineering, human labeling, and task-specific training. However, their effectiveness in achieving high classification accuracy remains questionable. This paper introduces a three-stage in-context learning approach that leverages LLMs to improve classification accuracy while minimizing experimental costs. Our method incorporates automatic enhanced prompt generation, adaptive exemplar selection, and a consensus mechanism that resolves discrepancies between two weaker LLMs, refined by an advanced LLM. We validate our approach using datasets from the BBC news reports, Kavanaugh Supreme Court confirmation, and 2018 election campaign ads. The results show significant improvements in classification F1 score (+0.36 for zero-shot classification) with manageable economic costs (-78% compared with human labeling), demonstrating that our method effectively addresses the limitations of traditional machine learning while offering a scalable and reliable solution for text analysis in political science.
comment: 23 pages, 5 figures
☆ Efficient and Scalable Estimation of Tool Representations in Vector Space
Recent advancements in function calling and tool use have significantly enhanced the capabilities of large language models (LLMs) by enabling them to interact with external information sources and execute complex tasks. However, the limited context window of LLMs presents challenges when a large number of tools are available, necessitating efficient methods to manage prompt length and maintain accuracy. Existing approaches, such as fine-tuning LLMs or leveraging their reasoning capabilities, either require frequent retraining or incur significant latency overhead. A more efficient solution involves training smaller models to retrieve the most relevant tools for a given query, although this requires high quality, domain-specific data. To address those challenges, we present a novel framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models. Empowered by LLMs, we create ToolBank, a new tool retrieval dataset that reflects real human user usages. For tool retrieval methodologies, we propose novel approaches: (1) Tool2Vec: usage-driven tool embedding generation for tool retrieval, (2) ToolRefiner: a staged retrieval method that iteratively improves the quality of retrieved tools, and (3) MLC: framing tool retrieval as a multi-label classification problem. With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank. Additionally, we present further experimental results to rigorously validate our methods. Our code is available at \url{https://github.com/SqueezeAILab/Tool2Vec}
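A minimal sketch of embedding-based tool retrieval of the kind described above, assuming the query and tools have already been encoded into vectors by some small encoder model; the cosine-similarity top-k step is the common retrieval pattern, not the paper's specific Tool2Vec training procedure or ToolRefiner staging.

```python
import numpy as np


def retrieve_tools(query_vec: np.ndarray,
                   tool_vecs: np.ndarray,
                   tool_names: list[str],
                   k: int = 5) -> list[str]:
    """Return the k tools whose embeddings are most cosine-similar to the query.

    query_vec: (dim,) embedding of the user query.
    tool_vecs: (num_tools, dim) embeddings of the tools; in a usage-driven
    setup these would be built from example queries that invoke each tool
    rather than from the tool descriptions alone.
    """
    q = query_vec / np.linalg.norm(query_vec)
    t = tool_vecs / np.linalg.norm(tool_vecs, axis=1, keepdims=True)
    scores = t @ q
    top = np.argsort(scores)[::-1][:k]
    return [tool_names[i] for i in top]
```

The retrieved subset, rather than the full tool catalogue, is then placed in the LLM's prompt, keeping the context length manageable.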
♻ ☆ Dissociation of Faithful and Unfaithful Reasoning in LLMs
Large language models (LLMs) often improve their performance in downstream tasks when they generate Chain of Thought reasoning text before producing an answer. We investigate how LLMs recover from errors in Chain of Thought. Through analysis of error recovery behaviors, we find evidence for unfaithfulness in Chain of Thought, which occurs when models arrive at the correct answer despite invalid reasoning text. We identify factors that shift LLM recovery behavior: LLMs recover more frequently from obvious errors and in contexts that provide more evidence for the correct answer. Critically, these factors have divergent effects on faithful and unfaithful recoveries. Our results indicate that there are distinct mechanisms driving faithful and unfaithful error recoveries. Selective targeting of these mechanisms may be able to drive down the rate of unfaithful reasoning and improve model interpretability.
comment: code published at https://github.com/CoTErrorRecovery/CoTErrorRecovery
♻ ☆ Manipulating Large Language Models to Increase Product Visibility
Large language models (LLMs) are increasingly being integrated into search engines to provide natural language responses tailored to user queries. Customers and end-users are also becoming more dependent on these models for quick and easy purchase decisions. In this work, we investigate whether recommendations from LLMs can be manipulated to enhance a product's visibility. We demonstrate that adding a strategic text sequence (STS) -- a carefully crafted message -- to a product's information page can significantly increase its likelihood of being listed as the LLM's top recommendation. To understand the impact of STS, we use a catalog of fictitious coffee machines and analyze its effect on two target products: one that seldom appears in the LLM's recommendations and another that usually ranks second. We observe that the strategic text sequence significantly enhances the visibility of both products by increasing their chances of appearing as the top recommendation. This ability to manipulate LLM-generated search responses provides vendors with a considerable competitive advantage and has the potential to disrupt fair market competition. Just as search engine optimization (SEO) revolutionized how webpages are customized to rank higher in search engine results, influencing LLM recommendations could profoundly impact content optimization for AI-driven search services. Code for our experiments is available at https://github.com/aounon/llm-rank-optimizer.
♻ ☆ Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions
This paper examines the role of cognitive biases in the decision-making processes of large language models (LLMs), challenging the conventional goal of eliminating all biases. We show that certain cognitive biases when properly balanced, can enhance decision-making efficiency through rational deviations and heuristic shortcuts. By introducing heuristic moderation and an abstention option, which allows LLMs to withhold responses when uncertain, we reduce error rates, improve decision accuracy, and optimize decision rates. Using the Balance Rigor and Utility (BRU) dataset, developed through expert collaboration, our findings demonstrate that targeted inspection of cognitive biases aligns LLM decisions more closely with human reasoning, enhancing reliability and suggesting strategies for future improvements. This approach offers a novel way to leverage cognitive biases to improve the practical utility of LLMs across various applications.
comment: This article is currently under review. All data will be open on GitHub once the review is complete. https://github.com/limanwang/Balancing-Rigor-and-Utility
♻ ☆ Eliciting Informative Text Evaluations with Large Language Models
Peer prediction mechanisms motivate high-quality feedback with provable guarantees. However, current methods only apply to rather simple reports, like multiple-choice or scalar numbers. We aim to broaden these techniques to the larger domain of text-based reports, drawing on the recent developments in large language models. This vastly increases the applicability of peer prediction mechanisms as textual feedback is the norm in a large variety of feedback channels: peer reviews, e-commerce customer reviews, and comments on social media. We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM). These mechanisms utilize LLMs as predictors, mapping from one agent's report to a prediction of her peer's report. Theoretically, we show that when the LLM prediction is sufficiently accurate, our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium. Empirically, we confirm the efficacy of our mechanisms through experiments conducted on two real datasets: the Yelp review dataset and the ICLR OpenReview dataset. We highlight the results that on the ICLR dataset, our mechanisms can differentiate three quality levels -- human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews in terms of expected scores. Additionally, GSPPM penalizes LLM-generated reviews more effectively than GPPM.
comment: Accepted by the Twenty-Fifth ACM Conference on Economics and Computation (EC'24)
♻ ☆ Exploring Bias and Prediction Metrics to Characterise the Fairness of Machine Learning for Equity-Centered Public Health Decision-Making: A Narrative Review
Background: The rapid advancement of Machine Learning (ML) represents novel opportunities to enhance public health research, surveillance, and decision-making. However, there is a lack of comprehensive understanding of algorithmic bias, i.e., systematic errors in predicted population health outcomes, resulting from the public health application of ML. The objective of this narrative review is to explore the types of bias generated by ML and the quantitative metrics used to assess these biases. Methods: We performed searches on PubMed, MEDLINE, IEEE (Institute of Electrical and Electronics Engineers), the ACM (Association for Computing Machinery) Digital Library, Science Direct, and Springer Nature. We used keywords to identify studies describing types of bias and metrics to measure these in the domain of ML and public and population health, published in English between 2008 and 2023, inclusive. Results: A total of 72 articles met the inclusion criteria. Our review identified the commonly described types of bias and quantitative metrics to assess these biases from an equity perspective. Conclusion: The review will help formalize the evaluation framework for ML on public health from an equity perspective.
comment: under review
♻ ☆ Domain-Specific Improvement on Psychotherapy Chatbot Using Assistant ICASSP 2024
Large language models (LLMs) have demonstrated impressive generalization capabilities on specific tasks with human-written instruction data. However, the limited quantity, diversity, and professional expertise of such instruction data raise concerns about the performance of LLMs in psychotherapy tasks when provided with domain-specific instructions. To address this, we first propose Domain-Specific Assistant Instructions based on AlexanderStreet therapy, and second, we use an adaptation fine-tuning method and a retrieval-augmented generation method to improve pre-trained LLMs. Through quantitative evaluation of linguistic quality using automatic and human evaluation, we observe that LLMs pre-trained on Psychotherapy Assistant Instructions outperform state-of-the-art LLM response baselines. Our Assistant-Instruction approach offers a half-annotation method to align pre-trained LLMs with instructions and provides pre-trained LLMs with more psychotherapy knowledge.
comment: Accepted at ICASSP 2024 EIHRC Workshop
♻ ☆ Exploring neural oscillations during speech perception via surrogate gradient spiking neural networks
Understanding cognitive processes in the brain demands sophisticated models capable of replicating neural dynamics at large scales. We present a physiologically inspired speech recognition architecture, compatible and scalable with deep learning frameworks, and demonstrate that end-to-end gradient descent training leads to the emergence of neural oscillations in the central spiking neural network. Significant cross-frequency couplings, indicative of these oscillations, are measured within and across network layers during speech processing, whereas no such interactions are observed when handling background noise inputs. Furthermore, our findings highlight the crucial inhibitory role of feedback mechanisms, such as spike frequency adaptation and recurrent connections, in regulating and synchronising neural activity to improve recognition performance. Overall, on top of developing our understanding of synchronisation phenomena notably observed in the human auditory pathway, our architecture exhibits dynamic and efficient information processing, with relevance to neuromorphic technology.
♻ ☆ Analyzing Diversity in Healthcare LLM Research: A Scientometric Perspective
The deployment of large language models (LLMs) in healthcare has demonstrated substantial potential for enhancing clinical decision-making, administrative efficiency, and patient outcomes. However, the underrepresentation of diverse groups in the development and application of these models can perpetuate biases, leading to inequitable healthcare delivery. This paper presents a comprehensive scientometric analysis of LLM research for healthcare, including data from January 1, 2021, to July 1, 2024. By analyzing metadata from PubMed and Dimensions, including author affiliations, countries, and funding sources, we assess the diversity of contributors to LLM research. Our findings highlight significant gender and geographic disparities, with a predominance of male authors and contributions primarily from high-income countries (HICs). We introduce a novel journal diversity index based on Gini diversity to measure the inclusiveness of scientific publications. Our results underscore the necessity for greater representation in order to ensure the equitable application of LLMs in healthcare. We propose actionable strategies to enhance diversity and inclusivity in artificial intelligence research, with the ultimate goal of fostering a more inclusive and equitable future in healthcare innovation.
♻ ☆ Sentiment Analysis Across Languages: Evaluation Before and After Machine Translation to English
People communicate in more than 7,000 languages around the world, with around 780 languages spoken in India alone. Despite this linguistic diversity, research on Sentiment Analysis has predominantly focused on English text data, resulting in a disproportionate availability of sentiment resources for English. This paper examines the performance of transformer models in Sentiment Analysis tasks across multilingual datasets and text that has undergone machine translation. By comparing the effectiveness of these models in different linguistic contexts, we gain insights into their performance variations and potential implications for sentiment analysis across diverse languages. We conclude by discussing the shortcomings of our approach and directions for future work.
comment: 6 pages, 3 Figures
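As a rough illustration of the translate-then-classify setup studied above, the following sketch chains a machine translation model and an English sentiment classifier using Hugging Face pipelines; the checkpoint names are placeholders and not the models evaluated in the paper.

    # Minimal translate-then-classify sketch (assumed checkpoints, not the paper's exact setup).
    from transformers import pipeline

    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")  # assumed Hindi->English MT model
    classifier = pipeline("sentiment-analysis")                               # generic English sentiment model

    def sentiment_via_translation(texts):
        """Translate non-English inputs to English, then run an English sentiment classifier."""
        translated = [t["translation_text"] for t in translator(texts)]
        return list(zip(translated, classifier(translated)))

    print(sentiment_via_translation(["यह फिल्म शानदार थी"]))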
♻ ☆ ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction
In order to facilitate features such as faceted product search and product comparison, e-commerce platforms require accurately structured product data, including precise attribute/value pairs. Vendors often provide unstructured product descriptions consisting only of an offer title and a textual description. Consequently, extracting attribute values from titles and descriptions is vital for e-commerce platforms. State-of-the-art attribute value extraction (AVE) methods based on pre-trained language models (PLMs), such as BERT, face two drawbacks: (i) the methods require significant amounts of task-specific training data and (ii) the fine-tuned models have problems with generalising to unseen attribute values that were not part of the training data. This paper explores the potential of using large language models as a more training data-efficient and more robust alternative to existing AVE methods. We propose prompt templates for describing the target attributes of the extraction to the LLM, covering both zero-shot and few-shot scenarios. In the zero-shot scenario, textual and JSON-based target schema representations of the attributes are compared. In the few-shot scenario, we investigate (i) the provision of example attribute values, (ii) the selection of in-context demonstrations, (iii) shuffled ensembling to prevent position bias, and (iv) fine-tuning the LLM. We evaluate the prompt templates in combination with hosted LLMs, such as GPT-3.5 and GPT-4, and open-source LLMs which can be run locally. We compare the performance of the LLMs to the PLM-based methods SU-OpenTag, AVEQA, and MAVEQA. The highest average F1-score of 86% was achieved by GPT-4. Llama-3-70B performs only 3% worse than GPT-4, making it a competitive open-source alternative. Given the same training data, this prompt/GPT-4 combination outperforms the best PLM baseline by an average of 6% F1-score.
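To make the prompt-template idea concrete, the sketch below assembles a zero-shot extraction prompt with a JSON target schema; the attribute names and wording are illustrative and not the paper's exact templates.

    import json

    def build_zero_shot_prompt(title, description, attributes):
        """Assemble a zero-shot attribute-extraction prompt with a JSON target schema (illustrative wording)."""
        schema = {attr: "value or n/a" for attr in attributes}
        return (
            "Extract the following attributes from the product offer. "
            "Answer only with a JSON object matching this schema:\n"
            f"{json.dumps(schema, indent=2)}\n\n"
            f"Title: {title}\nDescription: {description}"
        )

    prompt = build_zero_shot_prompt(
        title="Acme Stainless Steel Water Bottle 750ml",
        description="Leak-proof lid, double-wall insulation.",
        attributes=["brand", "material", "capacity"],
    )
    # The resulting prompt string is then sent to a hosted or locally run LLM.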
♻ ☆ AMERICANO: Argument Generation with Discourse-driven Decomposition and Agent Interaction
Argument generation is a challenging task in natural language processing, which requires rigorous reasoning and proper content organization. Inspired by recent chain-of-thought prompting that breaks down a complex task into intermediate steps, we propose Americano, a novel framework with agent interaction for argument generation. Our approach decomposes the generation process into sequential actions grounded in argumentation theory, which first executes actions sequentially to generate argumentative discourse components, and then produces a final argument conditioned on the components. To further mimic the human writing process and improve on the left-to-right generation paradigm of current autoregressive language models, we introduce an argument refinement module which automatically evaluates and refines argument drafts based on feedback received. We evaluate our framework on the task of counterargument generation using a subset of the Reddit/CMV dataset. The results show that our method outperforms both end-to-end and chain-of-thought prompting methods and can generate more coherent and persuasive arguments with diverse and rich content.
comment: INLG 2024
♻ ☆ Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study
The advent of large language models such as ChatGPT, Gemini, and others has underscored the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been comprehensively assessed. This study addresses this gap by introducing a novel multi-task spatial evaluation dataset, designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset encompasses twelve distinct task types, including spatial understanding and path planning, each with verified, accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4o, and ZhipuAI's glm-4, through a two-phase testing approach. Initially, we conducted zero-shot testing, followed by categorizing the dataset by difficulty and performing prompt tuning tests. Results indicate that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it surpassed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For example, the Chain-of-Thought (COT) strategy increased gpt-4o's accuracy in path planning from 12.4% to 87.5%, while a one-shot strategy enhanced moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
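For illustration, the difference between the zero-shot and Chain-of-Thought prompting strategies mentioned above amounts to a small change in the prompt text; the example question and wording below are hypothetical, not items from the benchmark.

    QUESTION = "Starting at the origin, you walk 3 km north and then 4 km east. How far are you from the start?"

    # Plain zero-shot prompt.
    zero_shot = f"{QUESTION}\nAnswer:"

    # Chain-of-Thought variant: the model is asked to reason step by step before answering.
    chain_of_thought = (
        f"{QUESTION}\n"
        "Think step by step: describe the displacement along each axis, "
        "apply the Pythagorean theorem, then state the final distance.\nAnswer:"
    )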
♻ ☆ LiveFC: A System for Live Fact-Checking of Audio Streams
Advances in the digital era have led to the rapid dissemination of information, which has also aggravated the spread of misinformation and disinformation, with potentially serious consequences such as civil unrest. While fact-checking aims to combat this, manual fact-checking is cumbersome and not scalable, and existing automated approaches do not operate in real-time and do not always account for the spread of misinformation through different modalities. Proactive, real-time fact-checking of live streams can help keep people informed about false narratives and prevent catastrophic consequences; this is particularly relevant given the rapid dissemination of information through video on social media platforms or other streams such as political rallies and debates. Hence, in this work we develop a platform named LiveFC that can aid in fact-checking live audio streams in real-time. LiveFC has a user-friendly interface that displays the detected claims along with their veracity and evidence, with associated speakers for claims from the respective segments. The app can be accessed at http://livefc.factiverse.ai and a screen recording of the demo can be found at https://bit.ly/3WVAoIw.
comment: Under Review, 11 pages
♻ ☆ Generalizing Fairness to Generative Language Models via Reformulation of Non-discrimination Criteria
Generative AI, such as large language models, has undergone rapid development within recent years. As these models become increasingly available to the public, concerns arise about perpetuating and amplifying harmful biases in applications. Gender stereotypes can be harmful and limiting for the individuals they target, whether they consist of misrepresentation or discrimination. Recognizing gender bias as a pervasive societal construct, this paper studies how to uncover and quantify the presence of gender biases in generative language models. In particular, we derive generative AI analogues of three well-known non-discrimination criteria from classification, namely independence, separation and sufficiency. To demonstrate these criteria in action, we design prompts for each of the criteria with a focus on occupational gender stereotype, specifically utilizing the medical test to introduce the ground truth in the generative AI context. Our results address the presence of occupational gender bias within such conversational language models.
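For reference, the three classification criteria that the paper adapts to the generative setting are usually stated as (conditional) independence conditions between the prediction, the sensitive attribute A (here, gender), and the ground truth Y; the standard classification forms are given below, while how they are probed with prompts in a conversational model is the paper's contribution.

    \begin{aligned}
    \text{Independence:} \quad & \hat{Y} \perp A \\
    \text{Separation:}   \quad & \hat{Y} \perp A \mid Y \\
    \text{Sufficiency:}  \quad & Y \perp A \mid \hat{Y}
    \end{aligned}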
♻ ☆ A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning KDD
Retrieval-augmented generation (RAG) is a framework enabling large language models (LLMs) to enhance their accuracy and reduce hallucinations by integrating external knowledge bases. In this paper, we introduce a hybrid RAG system enhanced through a comprehensive suite of optimizations that significantly improve retrieval quality, augment reasoning capabilities, and refine numerical computation ability. We refine the text chunks and tables in web pages, add attribute predictors to reduce hallucinations, build an LLM Knowledge Extractor and a Knowledge Graph Extractor, and finally construct a reasoning strategy over all the references. We evaluated our system on the CRAG dataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and online evaluations demonstrate that our system significantly enhances complex reasoning capabilities. In local evaluations, we significantly improved accuracy and reduced error rates compared to the baseline model, achieving a notable increase in scores. Meanwhile, we attained outstanding results in the online assessment, demonstrating the performance and generalization capabilities of the proposed system. The source code for our system is released in \url{https://gitlab.aicrowd.com/shizueyy/crag-new}.
comment: Technical report for 3rd prize in Task 1 of Meta CRAG KDD Cup 2024
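As a rough sketch of the retrieve-then-reason loop such a system is built around, the snippet below shows a generic hybrid-RAG answer function; retriever and llm are assumed interfaces, not the competition system's actual components.

    def answer_with_rag(question, retriever, llm, k=5):
        """Generic retrieve-then-read loop (illustrative; not the authors' released code).

        retriever(question, k) returns the k most relevant text chunks and
        llm(prompt) maps a prompt string to a generated answer; both are assumed interfaces.
        """
        chunks = retriever(question, k=k)
        context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
        prompt = (
            "Answer the question using only the numbered references below. "
            "If the references are insufficient, reply 'I don't know.'\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:"
        )
        return llm(prompt)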
♻ ☆ ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step values to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability that a step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance three language models over multiple iterations, and outperform other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM.
comment: 30 pages
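The core idea of inferring process rewards from oracle final answers can be illustrated with plain Monte Carlo rollouts: score a partial reasoning trace by the fraction of sampled completions that reach the correct answer. The helpers continue_solution and final_answer below are hypothetical stand-ins for the policy model and answer parser, and this sketch omits the tree search itself.

    def estimate_process_reward(partial_trace, oracle_answer, continue_solution, final_answer, n_rollouts=8):
        """Estimate how promising a partial trace is via the fraction of completions
        that reach the known correct answer (a Monte Carlo stand-in for the tree-search estimate)."""
        hits = 0
        for _ in range(n_rollouts):
            completed = continue_solution(partial_trace)   # sample a completion from the policy LLM
            hits += int(final_answer(completed) == oracle_answer)
        return hits / n_rollouts                           # in [0, 1]; serves as a value target and trace-filtering score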
♻ ☆ FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models
Large language models (LLMs) excel in generating coherent text, but they often struggle with context awareness, leading to inaccuracies in tasks requiring faithful adherence to provided information. We introduce FastMem, a novel method designed to enhance instruction fine-tuned LLMs' context awareness through fast memorization of the prompt. FastMem maximizes the likelihood of the prompt before inference by fine-tuning only the last Feed-Forward Network (FFN) module. This targeted approach ensures efficient optimization without overfitting, significantly improving the model's ability to comprehend and accurately follow the context. Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures. For instance, FastMem improves the accuracy of Llama 3-8B-Inst on the NQ-SWAP dataset from 59.1% to 71.6%, and reduces the output structure failure rate of Qwen 1.5-4B-Chat from 34.9% to 25.5%. Extensive experimental results highlight FastMem's potential to offer a robust solution to enhance the reliability and accuracy of LLMs in various applications. Our code is available at: https://github.com/IAAR-Shanghai/FastMem
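A minimal sketch of the mechanism is shown below, assuming a Hugging Face-style causal LM interface: freeze everything except the final feed-forward block and take a few gradient steps maximizing the likelihood of the prompt before answering. The module path to the last FFN is an assumption and varies by model family; this illustrates the idea rather than reproducing the released implementation.

    import torch

    def fast_memorize(model, prompt_ids, steps=5, lr=1e-4):
        """Fine-tune only the last FFN block on the prompt itself (illustrative sketch)."""
        for p in model.parameters():
            p.requires_grad_(False)
        last_ffn = model.transformer.h[-1].mlp            # assumed module path; depends on the model class
        for p in last_ffn.parameters():
            p.requires_grad_(True)

        optim = torch.optim.AdamW(last_ffn.parameters(), lr=lr)
        for _ in range(steps):
            out = model(input_ids=prompt_ids, labels=prompt_ids)  # causal-LM loss on the prompt
            out.loss.backward()
            optim.step()
            optim.zero_grad()
        return model  # subsequent queries are answered with the prompt partially memorized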
♻ ☆ A Formal Perspective on Byte-Pair Encoding ACL 2023
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{{\sigma(\boldsymbol{\mu}^\star)}}(1-e^{-{\sigma(\boldsymbol{\mu}^\star)}})$-approximation of an optimal merge sequence, where ${\sigma(\boldsymbol{\mu}^\star)}$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
comment: ACL 2023
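For readers unfamiliar with the algorithm being formalized, the greedy merge loop can be written in a few lines; this naive O(NM) version is only for illustration and is not the faster implementation proposed in the paper.

    from collections import Counter

    def greedy_bpe(sequence, merge_count):
        """Naive greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
        seq = list(sequence)
        merges = []
        for _ in range(merge_count):
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), _freq = pairs.most_common(1)[0]        # greedy choice: most frequent pair
            merges.append((a, b))
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            seq = merged
        return seq, merges

    print(greedy_bpe("abracadabra", 3))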
♻ ☆ Pitfalls and Outlooks in Using COMET
Since its introduction, the COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, it being a machine learning model also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected behaviours from three aspects: 1) technical: obsolete software versions and compute precision; 2) data: empty content, language mismatch, and translationese at test time as well as distribution and domain biases in training; 3) usage and reporting: multi-reference support and model referencing in the literature. All of these problems imply that COMET scores are not comparable between papers or even technical setups, and we put forward our perspective on fixing each issue. Furthermore, we release the SacreCOMET package that can generate a signature for the software and model configuration as well as an appropriate citation. The goal of this work is to help the community make more sound use of the COMET metric.
♻ ☆ AudioBench: A Universal Benchmark for Audio Large Language Models
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, 7 of which are newly proposed. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic). Despite recent advancements, a comprehensive benchmark of AudioLLMs' instruction-following capabilities conditioned on audio signals has been lacking. AudioBench addresses this gap by setting up datasets as well as desired evaluation metrics. We also evaluated the capabilities of five popular models and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-sourced evaluation toolkit, data, and leaderboard will offer a robust testbed for future model developments.
comment: v3 - Abundant update on models and evaluation details; Code: https://github.com/AudioLLMs/AudioBench
♻ ☆ Contrasting Linguistic Patterns in Human and LLM-Generated News Text
We conduct a quantitative analysis contrasting human-written English news text with comparable large language model (LLM) output from six different LLMs that cover three different families and four sizes in total. Our analysis spans several measurable linguistic dimensions, including morphological, syntactic, psychometric, and sociolinguistic aspects. The results reveal various measurable differences between human and AI-generated texts. Human texts exhibit more scattered sentence length distributions, a greater variety of vocabulary, a distinct use of dependency and constituent types, shorter constituents, and more optimized dependency distances. Humans tend to exhibit stronger negative emotions (such as fear and disgust) and less joy compared to text generated by LLMs, with the toxicity of these models increasing as their size grows. LLM outputs use more numbers, symbols and auxiliaries (suggesting objective language) than human texts, as well as more pronouns. The sexist bias prevalent in human text is also expressed by LLMs, and even magnified in all of them but one. Differences between LLMs and humans are larger than the differences among the LLMs themselves.
comment: Published at Artificial Intelligence Review vol. 57, 265
♻ ☆ LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction
In the context of English as a Foreign Language (EFL) writing education, LLM-as-a-tutor can assist students by providing real-time feedback on their essays. However, challenges arise in assessing LLM-as-a-tutor due to differing standards between educational and general use cases. To bridge this gap, we integrate pedagogical principles to assess student-LLM interaction. First, we explore how LLMs can function as English tutors, providing effective essay feedback tailored to students. Second, we propose three metrics to evaluate LLM-as-a-tutor specifically designed for EFL writing education, emphasizing pedagogical aspects. In this process, EFL experts evaluate the feedback from LLM-as-a-tutor regarding quality and characteristics. On the other hand, EFL learners assess their learning outcomes from interaction with LLM-as-a-tutor. This approach lays the groundwork for developing LLMs-as-a-tutor tailored to the needs of EFL learners, advancing the effectiveness of writing education in this context.
♻ ☆ MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents
Machine learning research, crucial for technological advancements and innovation, often faces significant challenges due to its inherent complexity, slow pace of experimentation, and the necessity for specialized expertise. Motivated by this, we present a new systematic framework, autonomous Machine Learning Research with large language models (MLR-Copilot), designed to enhance machine learning research productivity through the automatic generation and implementation of research ideas using Large Language Model (LLM) agents. The framework consists of three phases: research idea generation, experiment implementation, and implementation execution. First, existing research papers are used to generate hypotheses and experimental plans via an LLM-powered IdeaAgent. Next, the implementation generation phase translates these plans into executables with ExperimentAgent. This phase leverages retrieved prototype code and optionally retrieves candidate models and data. Finally, the execution phase, also managed by ExperimentAgent, involves running experiments with mechanisms for human feedback and iterative debugging to enhance the likelihood of achieving executable research outcomes. We evaluate our framework on five machine learning research tasks and the experimental results show the framework's potential to facilitate research progress and innovation.
♻ ☆ Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation ICPR 2024
In this work, we study the task of "visually" translating scene text from a source language (e.g., Hindi) to a target language (e.g., English). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the source scene text, such as font, size, and background. There are several challenges associated with this task, such as translation with limited context, deciding between translation and transliteration, accommodating varying text lengths within fixed spatial boundaries, and preserving the font and background styles of the source scene text in the target language. To address this problem, we make the following contributions: (i) We study visual translation as a standalone problem for the first time in the literature. (ii) We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task. (iii) We propose a set of task-specific design enhancements to build a variant of the baseline that obtains performance improvements. (iv) Currently, the existing related literature lacks any comprehensive performance evaluation for this novel task. To fill this gap, we introduce several automatic and user-assisted evaluation metrics designed explicitly for evaluating visual translation. Further, we evaluate the presented baselines for translating scene text between Hindi and English. Our experiments demonstrate that although we can effectively perform visual translation over a large collection of scene text images, the presented baseline only partially addresses the challenges posed by visual translation tasks. We firmly believe that this new task and the limitations of existing models, as reported in this paper, should encourage further research in visual translation.
comment: Accepted at ICPR 2024, Project Website: https://vl2g.github.io/projects/visTrans/
♻ ☆ An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models ECCV 2024
In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient. FastV can compress the FLOPs of a 13B-parameter model to a lower budget than that of a 7B-parameter model while still maintaining superior performance. We believe FastV has practical value for the deployment of LVLMs on edge devices and in commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
comment: Accepted to ECCV 2024 (Oral), code is released at https://github.com/pkunlp-icler/FastV,
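A simplified sketch of the pruning rule: after an early layer, visual tokens are ranked by the attention they receive and only the top fraction is kept for the remaining layers. Tensor shapes and the keep ratio below are illustrative, and this is a paraphrase of the idea rather than the released code.

    import torch

    def prune_visual_tokens(attn_weights, visual_idx, keep_ratio=0.5):
        """Select the visual tokens that receive the most attention (simplified FastV-style rule).

        attn_weights: (batch, heads, seq_len, seq_len) attention from one early layer;
        visual_idx: 1-D tensor with the positions of visual tokens in the sequence.
        """
        # Average attention each key position receives, over heads and query positions.
        received = attn_weights.mean(dim=1).mean(dim=1)[:, visual_idx]   # (batch, n_visual)
        k = max(1, int(keep_ratio * visual_idx.numel()))
        top = received.topk(k, dim=-1).indices                           # (batch, k) indices into visual_idx
        return visual_idx[top]                                           # kept visual token positions per sample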
♻ ☆ CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models
With the profound development of large language models (LLMs), their safety concerns have garnered increasing attention. However, there is a scarcity of Chinese safety benchmarks for LLMs, and the existing safety taxonomies are inadequate, lacking comprehensive safety detection capabilities in authentic Chinese scenarios. In this work, we introduce CHiSafetyBench, a dedicated safety benchmark for evaluating LLMs' capabilities in identifying risky content and refusing to answer risky questions in Chinese contexts. CHiSafetyBench incorporates a dataset that covers a hierarchical Chinese safety taxonomy consisting of 5 risk areas and 31 categories. This dataset comprises two types of tasks: multiple-choice questions and question answering, evaluating LLMs from the perspectives of risky content identification and the ability to refuse to answer risky questions, respectively. Utilizing this benchmark, we validate the feasibility of automatic evaluation as a substitute for human evaluation and conduct comprehensive automatic safety assessments on mainstream Chinese LLMs. Our experiments reveal the varying performance of different models across various safety domains, indicating that all models possess considerable potential for improvement in Chinese safety capabilities. Our dataset is publicly available at https://github.com/UnicomAI/UnicomBenchmark/tree/main/CHiSafetyBench.
comment: 16 pages, 5 figures
♻ ☆ ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation
Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that larger models outputted labels that maintained or increased the inter-annotator agreement, suggesting that they are within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across different quality aspects, indicating that they are not a complete replacement. In turn, using LLMs as a supplement to a smaller group of human raters in some cases improved the correlation with the original majority labels. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement. Data available here: https://github.com/a-brassard/ACORN
comment: 18 pages, 7 figures, accepted to COLM 2024. Data available here: https://github.com/a-brassard/ACORN
♻ ☆ MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms ACL 2024
Social media platforms are hubs for multimodal information exchange, encompassing text, images, and videos, making it challenging for machines to comprehend the information or emotions associated with interactions in online spaces. Multimodal Large Language Models (MLLMs) have emerged as a promising solution to these challenges, yet they struggle to accurately interpret human emotions and complex content such as misinformation. This paper introduces MM-Soc, a comprehensive benchmark designed to evaluate MLLMs' understanding of multimodal social media content. MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset, targeting a range of tasks including misinformation detection, hate speech detection, and social context generation. Through our exhaustive evaluation of ten size variants of four open-source MLLMs, we have identified significant performance disparities, highlighting the need for advancements in models' social understanding capabilities. Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks. However, MLLMs demonstrate performance improvements after fine-tuning, suggesting potential pathways for improvement. Our code and data are available at https://github.com/claws-lab/MMSoc.git.
comment: In Proceedings of ACL 2024
♻ ☆ Persuasion Games using Large Language Models
Large Language Models (LLMs) have emerged as formidable instruments capable of comprehending and producing human-like text. This paper explores the potential of LLMs to shape user perspectives and subsequently influence their decisions on particular tasks. This capability finds applications in diverse domains such as investment, credit cards, insurance, and retail, wherein LLMs assist users in selecting appropriate insurance policies, investment plans, and credit cards, as well as in Behavioral Change Support Systems (BCSS). We present a sophisticated multi-agent framework wherein a consortium of agents operates in a collaborative manner. The primary agent engages directly with user agents through persuasive dialogue, while the auxiliary agents perform tasks such as information retrieval, response analysis, development of persuasion strategies, and validation of facts. Empirical evidence from our experiments demonstrates that this collaborative methodology significantly enhances the persuasive efficacy of the LLM. We continuously analyze the resistance of the user agent to persuasive efforts and counteract it by employing a combination of rule-based and LLM-based resistance-persuasion mapping techniques. We employ simulated personas and generate conversations in the insurance, banking, and retail domains to evaluate the proficiency of LLMs in recognizing, adjusting to, and influencing various personality types. Concurrently, we examine the resistance mechanisms employed by the LLM-simulated personas. Persuasion is quantified via measurable surveys before and after interaction, LLM-generated scores on the conversation, and user decisions (purchase or non-purchase).
♻ ☆ Cultural Compass: Predicting Transfer Learning Success in Offensive Language Detection with Cultural Features EMNLP 2023
The increasing ubiquity of language technology necessitates a shift towards considering cultural diversity in the machine learning realm, particularly for subjective tasks that rely heavily on cultural nuances, such as Offensive Language Detection (OLD). Current understanding underscores that these tasks are substantially influenced by cultural values, however, a notable gap exists in determining if cultural features can accurately predict the success of cross-cultural transfer learning for such subjective tasks. Addressing this, our study delves into the intersection of cultural features and transfer learning effectiveness. The findings reveal that cultural value surveys indeed possess a predictive power for cross-cultural transfer learning success in OLD tasks and that it can be further improved using offensive word distance. Based on these results, we advocate for the integration of cultural information into datasets. Additionally, we recommend leveraging data sources rich in cultural information, such as surveys, to enhance cultural adaptability. Our research signifies a step forward in the quest for more inclusive, culturally sensitive language technologies.
comment: Findings of EMNLP 2023 (update)
♻ ☆ From Wide to Deep: Dimension Lifting Network for Parameter-efficient Knowledge Graph Embedding
Knowledge graph embedding (KGE), which maps entities and relations into vector representations, is essential for downstream applications. Conventional KGE methods require high-dimensional representations to learn the complex structure of the knowledge graph, but this leads to oversized model parameters. Recent advances reduce parameters via low-dimensional entity representations, while developing techniques (e.g., knowledge distillation or reinvented representation forms) to compensate for the reduced dimension. However, such operations introduce complicated computations and model designs that may not benefit large knowledge graphs. To seek a simple strategy to improve the parameter efficiency of conventional KGE models, we take inspiration from the observation that, for compositional structures, deeper neural networks require exponentially fewer parameters than wider networks to achieve comparable expressiveness. We view all entity representations as a single-layer embedding network, so conventional KGE methods that adopt high-dimensional entity representations amount to widening the embedding network to gain expressiveness. To achieve parameter efficiency, we instead propose a deeper embedding network for entity representations, i.e., a narrow entity embedding layer plus a multi-layer dimension lifting network (LiftNet). Experiments on three public datasets show that by integrating LiftNet, four conventional KGE methods with 16-dimensional representations achieve link prediction accuracy comparable to the original models that adopt 512-dimensional representations, saving 68.4% to 96.9% of parameters.
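A minimal sketch of the deep-and-narrow idea: a low-dimensional entity table followed by a small lifting MLP that maps each embedding to the dimension the KGE scoring function expects. The layer sizes here are illustrative, not the paper's exact design.

    import torch.nn as nn

    class LiftedEntityEmbedding(nn.Module):
        """Narrow entity embedding plus a dimension-lifting network (illustrative sizes)."""

        def __init__(self, num_entities, narrow_dim=16, lifted_dim=512, hidden_dim=64):
            super().__init__()
            self.embed = nn.Embedding(num_entities, narrow_dim)   # narrow entity table
            self.lift = nn.Sequential(                            # lifts to the scorer's expected dimension
                nn.Linear(narrow_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, lifted_dim),
            )

        def forward(self, entity_ids):
            return self.lift(self.embed(entity_ids))              # (batch, lifted_dim), fed to e.g. a TransE-style scorer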
Computer Vision and Pattern Recognition 44
♻ ☆ Inter-Frame Compression for Dynamic Point Cloud Geometry Coding
Efficient point cloud compression is essential for applications like virtual and mixed reality, autonomous driving, and cultural heritage. This paper proposes a deep learning-based inter-frame encoding scheme for dynamic point cloud geometry compression. We propose a lossy geometry compression scheme that predicts the latent representation of the current frame from the previous frame by employing a novel feature space inter-prediction network. The proposed network utilizes sparse convolutions with hierarchical multiscale 3D feature learning to encode the current frame using the previous frame. It introduces a novel predictor network for motion compensation in the feature domain, which maps the latent representation of the previous frame to the coordinates of the current frame in order to predict the current frame's feature embedding. The framework transmits the residual between the predicted and actual features by compressing it with a learned probabilistic factorized entropy model. At the receiver, the decoder hierarchically reconstructs the current frame by progressively rescaling the feature embedding. The proposed framework is compared to the state-of-the-art Video-based Point Cloud Compression (V-PCC) and Geometry-based Point Cloud Compression (G-PCC) schemes standardized by the Moving Picture Experts Group (MPEG). The proposed method achieves more than 88% BD-Rate (Bjontegaard Delta Rate) reduction against G-PCCv20 Octree, more than 56% BD-Rate savings against G-PCCv20 Trisoup, more than 62% BD-Rate reduction against V-PCC intra-frame encoding mode, and more than 52% BD-Rate savings against V-PCC P-frame-based inter-frame encoding mode using HEVC. These significant performance gains are cross-checked and verified in the MPEG working group.
♻ ☆ SEDMamba: Enhancing Selective State Space Modelling with Bottleneck Mechanism and Fine-to-Coarse Temporal Fusion for Efficient Error Detection in Robot-Assisted Surgery
Automated detection of surgical errors can improve robotic-assisted surgery. Despite promising progress, existing methods still face challenges in capturing rich temporal context to establish long-term dependencies while maintaining computational efficiency. In this paper, we propose a novel hierarchical model named SEDMamba, which incorporates the selective state space model (SSM) into surgical error detection, facilitating efficient long sequence modelling with linear complexity. SEDMamba enhances selective SSM with a bottleneck mechanism and fine-to-coarse temporal fusion (FCTF) to detect and temporally localize surgical errors in long videos. The bottleneck mechanism compresses and restores features within their spatial dimension, thereby reducing computational complexity. FCTF utilizes multiple dilated 1D convolutional layers to merge temporal information across diverse scale ranges, accommodating errors of varying duration. Our work also contributes the first-of-its-kind, frame-level, in-vivo surgical error dataset to support error detection in real surgical cases. Specifically, we deploy the clinically validated observational clinical human reliability assessment tool (OCHRA) to annotate the errors during suturing tasks in an open-source radical prostatectomy dataset (SAR-RARP50). Experimental results demonstrate that our SEDMamba outperforms state-of-the-art methods with at least 1.82% AUC and 3.80% AP performance gains with significantly reduced computational complexity. The corresponding error annotations, code and models will be released at https://github.com/wzjialang/SEDMamba.
comment: 8 pages
♻ ☆ Efficient Video Object Segmentation via Modulated Cross-Attention Memory WACV 2025
Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x, while significantly reducing the GPU memory by 87% with comparable segmentation performance on short and long video datasets. Notably on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.
comment: WACV 2025
♻ ☆ RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance
Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constrained devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models, since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate that the proposed approach solves the coherence problem through substantive experiments that demonstrate its effectiveness in terms of compact model size and excellent generation quality.
♻ ☆ Multi-Visual-Inertial System: Analysis, Calibration and Estimation
In this paper, we study state estimation of multi-visual-inertial systems (MVIS) and develop sensor fusion algorithms to optimally fuse an arbitrary number of asynchronous inertial measurement units (IMUs) or gyroscopes and global-shutter and/or rolling-shutter cameras. We are especially interested in the full calibration of the associated visual-inertial sensors, including the IMU or camera intrinsics and the IMU-IMU (or camera) spatiotemporal extrinsics, as well as the image readout time of rolling-shutter cameras (if used). To this end, we develop a new analytic combined IMU integration with intrinsics, termed ACI3, to preintegrate IMU measurements, which is leveraged to fuse auxiliary IMUs and/or gyroscopes alongside a base IMU. We model the multi-inertial measurements to include all the necessary inertial intrinsic and IMU-IMU spatiotemporal extrinsic parameters, while leveraging IMU-IMU rigid-body constraints to eliminate the necessity of auxiliary inertial poses and thus reduce computational complexity. By performing observability analysis of MVIS, we prove that the standard four unobservable directions remain, no matter how many inertial sensors are used, and also identify, for the first time, degenerate motions for IMU-IMU spatiotemporal extrinsics and auxiliary inertial intrinsics. In addition to the extensive simulations that validate our analysis and algorithms, we have built our own MVIS sensor rig and collected over 25 real-world datasets to experimentally verify the proposed calibration against state-of-the-art calibration methods such as Kalibr. We show that the proposed MVIS calibration achieves competitive accuracy with improved convergence and repeatability, and it is open-sourced to better benefit the community.
♻ ☆ On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models
Volumetric medical segmentation models have achieved significant success on organ and tumor-based segmentation tasks in recent years. However, their vulnerability to adversarial attacks remains largely unexplored, raising serious concerns regarding the real-world deployment of tools employing such models in the healthcare sector. This underscores the importance of investigating the robustness of existing models. In this context, our work aims to empirically examine the adversarial robustness across current volumetric segmentation architectures, encompassing Convolutional, Transformer, and Mamba-based models. We extend this investigation across four volumetric segmentation datasets, evaluating robustness under both white-box and black-box adversarial attacks. Overall, we observe that while both pixel and frequency-based attacks perform reasonably well under the white-box setting, the latter performs significantly better under transfer-based black-box attacks. Across our experiments, we observe that transformer-based models show higher robustness than convolution-based models, with Mamba-based models being the most vulnerable. Additionally, we show that large-scale training of volumetric segmentation models improves the model's robustness against adversarial attacks. The code and robust models are available at https://github.com/HashmatShadab/Robustness-of-Volumetric-Medical-Segmentation-Models.
comment: Accepted at British Machine Vision Conference 2024
♻ ☆ Training-free Long Video Generation with Chain of Diffusion Model Experts
Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results due to the high complexity of the video generation task. In this paper, we propose ConFiner, an efficient high-quality video generation framework that decouples video generation into easier subtasks: structure control and spatial-temporal refinement. It can generate high-quality videos with a chain of off-the-shelf diffusion model experts, each expert responsible for a decoupled subtask. During the refinement, we introduce coordinated denoising, which can merge multiple diffusion experts' capabilities into a single sampling. Furthermore, we design the ConFiner-Long framework, which can generate long coherent videos with three constraint strategies on ConFiner. Experimental results indicate that with only 10% of the inference cost, our ConFiner surpasses representative models like Lavie and Modelscope across all objective and subjective metrics. And ConFiner-Long can generate high-quality and coherent videos with up to 600 frames.
♻ ☆ TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos
We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by a large margin from prior work. https://yufu-wang.github.io/tram4d/
comment: The project website: https://yufu-wang.github.io/tram4d/
♻ ☆ Privacy-Aware Document Visual Question Answering ICDAR 2024
Document Visual Question Answering (DocVQA) has quickly grown into a central task of document understanding. But despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees. In this work, we explore privacy in the domain of DocVQA for the first time, highlighting privacy issues in state-of-the-art multi-modal LLMs used for DocVQA, and explore possible solutions. Specifically, we focus on invoice processing as a realistic document understanding scenario, and propose a large-scale DocVQA dataset comprising invoice documents and associated questions and answers. We employ a federated learning scheme, which reflects the real-life distribution of documents across different businesses, and we explore the use case where the data of the invoice provider is the sensitive information to be protected. We demonstrate that non-private models tend to memorise, a behaviour that can lead to exposing private information. We then evaluate baseline training schemes employing federated learning and differential privacy in this multi-modal scenario, where the sensitive information might be exposed through either or both of the two input modalities: vision (document image) or language (OCR tokens). Finally, we design attacks exploiting the memorisation effect of the model, and demonstrate their effectiveness in probing a representative DocVQA model.
comment: 35 pages, 12 figures, accepted for publication at the 18th International Conference on Document Analysis and Recognition, ICDAR 2024
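For context, the differential-privacy baselines referred to above typically rely on DP-SGD, i.e., per-example gradient clipping plus Gaussian noise. The sketch below illustrates that general mechanism in plain PyTorch; it is independent of the paper's federated setup, models, and hyperparameters.

    import torch

    def dp_sgd_step(model, per_example_losses, lr=1e-3, clip_norm=1.0, noise_multiplier=1.0):
        """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise, then step (generic illustration)."""
        params = [p for p in model.parameters() if p.requires_grad]
        summed = [torch.zeros_like(p) for p in params]
        for loss in per_example_losses:                                 # one loss per example in the batch
            grads = torch.autograd.grad(loss, params, retain_graph=True)
            norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
            scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)    # bound each example's contribution
            for s, g in zip(summed, grads):
                s.add_(g * scale)
        n = len(per_example_losses)
        with torch.no_grad():
            for p, s in zip(params, summed):
                noise = torch.randn_like(s) * noise_multiplier * clip_norm
                p.add_(-(lr / n) * (s + noise))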
♻ ☆ Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? ECCV 2024
Foundation models have emerged as robust models with label efficiency in diverse domains. In medical imaging, these models contribute to the advancement of medical diagnoses due to the difficulty in obtaining labeled data. However, it is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model. This research examines bias in the foundation model RetFound when it is fine-tuned on the Brazilian Multilabel Ophthalmological Dataset (BRSET), which has a different population than the pre-training dataset. The model evaluation, in comparison with supervised learning, shows that the foundation model has the potential to reduce the gap between the maximum and minimum AUC across gender and age groups. However, under data-efficient generalization, the model's bias increases as the amount of data decreases. These findings suggest that when deploying a foundation model in real-life scenarios with limited data, the possibility of fairness issues should be considered.
comment: Preprint of paper to be presented at Fairness and Ethics Towards Transparent AI: Facing the Challenge through Model Debiasing (FAILED) during ECCV 2024
♻ ☆ LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation
Conventional medical image segmentation methods have been found inadequate in facilitating physicians with the identification of specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions. Due to the varying object scales in medical images, MIRS demands robust vision-language modeling and comprehensive multi-scale interaction for precise localization and segmentation under linguistic guidance. However, existing medical image segmentation methods fall short in meeting these demands, resulting in insufficient segmentation accuracy. In response, we propose an approach named Language-guided Scale-aware MedSegmentor (LSMS), incorporating two appealing designs: (1) a Scale-aware Vision-Language Attention module that leverages diverse convolutional kernels to acquire rich visual knowledge and interact closely with linguistic features, thereby enhancing lesion localization capability; (2) a Full-Scale Decoder that globally models multi-modal features across various scales, capturing complementary information between scales to accurately outline lesion boundaries. Addressing the lack of suitable datasets for MIRS, we constructed a vision-language medical dataset called Reference Hepatic Lesion Segmentation (RefHL-Seg). This dataset comprises 2,283 abdominal CT slices from 231 cases, with corresponding textual annotations and segmentation masks for various liver lesions in images. We validated the performance of LSMS for MIRS and conventional medical image segmentation tasks across various datasets. Our LSMS consistently outperforms prior methods on all datasets with lower computational costs. The code and datasets will be released.
comment: 14 pages, 5 figures
♻ ☆ Implicit Concept Removal of Diffusion Models ECCV2024
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images. These concepts, termed "implicit concepts", could be unintentionally learned during training and then be generated uncontrollably during inference. Existing removal methods still struggle to eliminate implicit concepts, primarily due to their dependency on the model's ability to recognize concepts it actually cannot discern. To address this, we utilize the intrinsic geometric characteristics of implicit concepts and present Geom-Erasing, a novel concept removal method based on geometric-driven control. Specifically, once an unwanted implicit concept is identified, we integrate the existence and geometric information of the concept into the text prompts with the help of an accessible classifier or detector model. Subsequently, the model is optimized to identify and disentangle this information, which is then adopted as negative prompts during generation. Moreover, we introduce the Implicit Concept Dataset (ICD), a novel image-text dataset imbued with three typical implicit concepts (i.e., QR codes, watermarks, and text), reflecting real-life situations where implicit concepts are easily injected. Geom-Erasing effectively mitigates the generation of implicit concepts, achieving state-of-the-art results on the Inappropriate Image Prompts (I2P) and our challenging Implicit Concept Dataset (ICD) benchmarks.
comment: Accepted by ECCV2024. Project Page: https://kaichen1998.github.io/projects/geom-erasing/
♻ ☆ GUing: A Mobile GUI Search Engine using a Vision-Language Model
App developers use the Graphical User Interface (GUI) of other apps as a source of inspiration for designing and improving their own apps. Recent research has thus suggested retrieving relevant GUI designs that match a certain text query from screenshot datasets acquired through crowdsourced or automated exploration of GUIs. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements, neglecting visual information such as icons or background images. In addition, retrieved screenshots are not steered by app developers and often lack important app features that require particular input data. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called GUIClip, which we trained specifically for the problem of designing app GUIs. For this, we first collected app introduction images from Google Play, which usually display the most representative screenshots and are often captioned (i.e., labeled) by app vendors. We then developed an automated pipeline to classify, crop, and extract the captions from these images. This resulted in a large dataset, which we share with this paper, comprising 303k app screenshots, of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of GUIClip on other GUI tasks, including GUI classification and sketch-to-GUI retrieval, with encouraging results.
♻ ☆ An open dataset for oracle bone script recognition and decipherment
Oracle bone script, one of the earliest known forms of ancient Chinese writing, presents invaluable research materials for scholars studying the humanities and geography of the Shang Dynasty, dating back 3,000 years. The immense historical and cultural significance of these writings cannot be overstated. However, the passage of time has obscured much of their meaning, presenting a significant challenge in deciphering these ancient texts. With the advent of Artificial Intelligence (AI), employing AI to assist in deciphering Oracle Bone Characters (OBCs) has become a feasible option. Yet, progress in this area has been hindered by a lack of high-quality datasets. To address this issue, this paper details the creation of the HUST-OBC dataset. This dataset encompasses 77,064 images of 1,588 individual deciphered characters and 62,989 images of 9,411 undeciphered characters, with a total of 140,053 images, compiled from diverse sources. The hope is that this dataset could inspire and assist future research in deciphering those unknown OBCs. All the codes and datasets are available at https://github.com/Yuliang-Liu/Open-Oracle.
♻ ☆ SABER-6D: Shape Representation Based Implicit Object Pose Estimation ECCV 2024
In this paper, we propose a novel encoder-decoder architecture, named SABER, to learn the 6D pose of an object in the embedding space by learning a shape representation at a given pose. This model enables us to learn pose by performing shape representation at a target pose from RGB image input. We perform shape representation as an auxiliary task, which helps us learn the rotation space for an object based on 2D images. An image encoder predicts the rotation in the embedding space, and the DeepSDF-based decoder learns to represent the object's shape at the given pose. As our approach is shape-based, the pipeline is suitable for any type of object irrespective of symmetry. Moreover, we need only a CAD model of the objects to train SABER. Our pipeline is based on synthetic data and can also handle symmetric objects without symmetry labels, and thus no additional labeled training data is needed. The experimental evaluation shows that our method achieves results close to the benchmark for both symmetric and asymmetric objects on the Occlusion-LineMOD and T-LESS datasets.
comment: ECCV 2024 R6D workshop
♻ ☆ A Deep-Learning-Based Label-free No-Reference Image Quality Assessment Metric: Application in Sodium MRI Denoising
New multinuclear MRI techniques, such as sodium MRI, generally suffer from low image quality due to an inherently low signal. Postprocessing methods, such as image denoising, have been developed for image enhancement. However, the assessment of these enhanced images is challenging, especially when high-resolution, high-signal reference images are unavailable, as in sodium MRI. No-reference Image Quality Assessment (NR-IQA) metrics are approaches to solve this problem. Existing learning-based NR-IQA metrics rely on labels derived from subjective human opinions or metrics like Signal-to-Noise Ratio (SNR), which are either time-consuming or lack accurate ground truths, resulting in unreliable assessment. We note that deep learning (DL) models have a unique characteristic in that they are specialized to a characteristic training set, meaning that deviations of the test input from the training data will reduce prediction accuracy. Therefore, we propose a novel DL-based NR-IQA metric, the Model Specialization Metric (MSM), which does not depend on ground-truth images or labels. MSM measures the difference between the input image and the model's prediction in order to evaluate the quality of the input image. Experiments conducted on both simulated distorted proton T1-weighted MR images and denoised sodium MR images demonstrate that MSM exhibits superior evaluation performance on various simulated noises and distortions. MSM also shows substantial agreement with expert evaluations, achieving an average Cohen's Kappa coefficient of 0.6528 and outperforming the existing NR-IQA metrics.
comment: 13 pages, 3 figures
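The MSM idea summarized above, scoring quality by how far an input deviates from what a model specialized to clean training data predicts for it, can be illustrated with a minimal sketch. The tiny denoiser below is a stand-in rather than the paper's model, and the exact discrepancy measure (mean squared error) is an assumption.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a model specialized to a clean training set;
# the actual MSM work uses a DL denoiser trained on MR images.
class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def model_specialization_metric(model, image):
    """Score an image by the discrepancy between the input and the model's
    prediction for it: inputs far from the training distribution tend to be
    reconstructed poorly and receive a larger discrepancy."""
    model.eval()
    with torch.no_grad():
        pred = model(image)
    return torch.mean((image - pred) ** 2).item()  # lower = closer to training data

# Usage with a random tensor, just to show the call pattern.
model = TinyDenoiser()
img = torch.rand(1, 1, 64, 64)
print("MSM discrepancy:", model_specialization_metric(model, img))
```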
♻ ☆ Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model ACM MM 2024
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. Our code and weights are available at https://github.com/xhanxu/Mamba3D.
comment: ACM MM 2024. Code and weights are available at https://github.com/xhanxu/Mamba3D
♻ ☆ Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis BMVC2024
Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery in the latent space of DMs by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper, we address the challenge of local image manipulation in DMs and introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs. Given an arbitrary image and defined regions of interest, we utilize the Jacobian of the denoising network to establish a relation between the regions of interest and their corresponding subspaces in the latent space. Furthermore, we disentangle the joint and individual components of these subspaces to identify latent directions that enable local image manipulation. Once discovered, these directions can be applied to different images to produce semantically consistent edits, making our method suitable for practical applications. Experimental results on various datasets demonstrate that our method can produce semantic edits that are more localized and have better fidelity compared to the state-of-the-art.
comment: Accepted at BMVC2024
♻ ☆ Structured Generative Models for Scene Understanding
This position paper argues for the use of \emph{structured generative models} (SGMs) for the understanding of static scenes. This requires the reconstruction of a 3D scene from an input image (or a set of multi-view images), whereby the contents of the image(s) are causally explained in terms of models of instantiated objects, each with their own type, shape, appearance and pose, along with global variables like scene lighting and camera parameters. This approach also requires scene models which account for the co-occurrences and inter-relationships of objects in a scene. The SGM approach has the merits that it is compositional and generative, which lead to interpretability and editability. To pursue the SGM agenda, we need models for objects and scenes, and approaches to carry out inference. We first review models for objects, which include "things" (object categories that have a well defined shape), and "stuff" (categories which have amorphous spatial extent). We then move on to review \emph{scene models} which describe the inter-relationships of objects. Perhaps the most challenging problem for SGMs is \emph{inference} of the objects, lighting and camera parameters, and scene inter-relationships from input consisting of a single or multiple images. We conclude with a discussion of issues that need addressing to advance the SGM agenda.
comment: 32 pages, 10 figures
♻ ☆ MAPL: Memory Augmentation and Pseudo-Labeling for Semi-Supervised Anomaly Detection
Large amounts of unlabeled data and difficult-to-identify anomalies are urgent issues that need to be overcome in most industrial scenes. To address this issue, a new methodology for detecting surface defects in industrial settings is introduced, referred to as Memory Augmentation and Pseudo-Labeling (MAPL). The methodology first introduces an anomaly simulation strategy, which significantly improves the model's ability to recognize rare or unknown anomaly types by generating simulated anomaly samples. To cope with the lack of labels for anomalous simulated samples, a pseudo-labeler method based on a one-classifier ensemble is employed, which enhances the robustness of the model under limited labeled data by automatically selecting key pseudo-labeling hyperparameters. Meanwhile, a memory-enhanced learning mechanism is introduced to effectively predict abnormal regions by analyzing the difference between the input samples and the normal samples in the memory pool. MAPL employs an end-to-end learning framework to identify abnormal regions directly from the input data, which optimizes the efficiency and real-time performance of detection. In extensive trials on the recently developed BHAD dataset (including MVTec AD [1], Visa [2], and MDPP [3]), MAPL achieves an average image-level AUROC score of 86.2%, a 5.1% improvement over the original MemSeg [4] model. The source code is available at https://github.com/jzc777/MAPL.
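The memory-pool idea described above can be sketched very simply: score each patch feature of a test sample by its distance to the nearest normal feature stored in the pool. The shapes and the use of raw L2 distance are assumptions; MAPL additionally relies on anomaly simulation and pseudo-labeling, which are omitted here.

```python
import numpy as np

def build_memory_pool(normal_patch_features):
    # normal_patch_features: (N, D) features extracted from defect-free samples
    return np.asarray(normal_patch_features, dtype=np.float32)

def anomaly_map(test_patch_features, memory_pool):
    # test_patch_features: (H*W, D) patch features of one test image
    dists = np.linalg.norm(
        test_patch_features[:, None, :] - memory_pool[None, :, :], axis=-1
    )
    return dists.min(axis=1)  # per-patch score: distance to the closest normal patch

# Usage with random stand-in features.
pool = build_memory_pool(np.random.rand(500, 64))
scores = anomaly_map(np.random.rand(16 * 16, 64), pool)
print("image-level score:", scores.max())
```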
♻ ☆ MCDubber: Multimodal Context-Aware Expressive Video Dubbing NCMMSC2024
Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed \textbf{MCDubber}, to convert the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) A context duration aligner aims to learn the context-aware alignment between the text and lip frames; (2) A context prosody predictor seeks to read the global context visual sequence and predict the context-aware global energy and pitch; (3) A context acoustic decoder ultimately predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The extracted mel-spectrogram belonging to the target sentence from the output context mel-spectrograms is the final required dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that our MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
comment: Accepted by NCMMSC2024
♻ ☆ UniUSNet: A Promptable Framework for Universal Ultrasound Disease Prediction and Tissue Segmentation
Ultrasound is widely used in clinical practice due to its affordability, portability, and safety. However, current AI research often overlooks combined disease prediction and tissue segmentation. We propose UniUSNet, a universal framework for ultrasound image classification and segmentation. This model handles various ultrasound types, anatomical positions, and input formats, excelling in both segmentation and classification tasks. Trained on a comprehensive dataset with over 9.7K annotations from 7 distinct anatomical positions, our model matches state-of-the-art performance and surpasses single-dataset and ablated models. Zero-shot and fine-tuning experiments show strong generalization and adaptability with minimal fine-tuning. We plan to expand our dataset and refine the prompting mechanism, with model weights and code available at (https://github.com/Zehui-Lin/UniUSNet).
comment: Accepted to BIBM 2024
♻ ☆ GuidedNet: Semi-Supervised Multi-Organ Segmentation via Labeled Data Guide Unlabeled Data ACM MM2024
Semi-supervised multi-organ medical image segmentation aids physicians in improving disease diagnosis and treatment planning and reduces the time and effort required for organ annotation. Existing state-of-the-art methods train the labeled data with ground truths and train the unlabeled data with pseudo-labels. However, the two training flows are separate, which does not reflect the interrelationship between labeled and unlabeled data. To address this issue, we propose a semi-supervised multi-organ segmentation method called GuidedNet, which leverages the knowledge from labeled data to guide the training of unlabeled data. The primary goals of this study are to improve the quality of pseudo-labels for unlabeled data and to enhance the network's learning capability for both small and complex organs. A key concept is that voxel features from labeled and unlabeled data that are close to each other in the feature space are more likely to belong to the same class. On this basis, a 3D Consistent Gaussian Mixture Model (3D-CGMM) is designed to leverage the feature distributions from labeled data to rectify the generated pseudo-labels. Furthermore, we introduce a Knowledge Transfer Cross Pseudo Supervision (KT-CPS) strategy, which leverages the prior knowledge obtained from the labeled data to guide the training of the unlabeled data, thereby improving the segmentation accuracy for both small and complex organs. Extensive experiments on two public datasets, FLARE22 and AMOS, demonstrated that GuidedNet is capable of achieving state-of-the-art performance. The source code with our proposed model is available at https://github.com/kimjisoo12/GuidedNet.
comment: Accepted by ACM MM2024, 10 pages, 5 figures
♻ ☆ PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding
3D visual grounding aims to identify objects in 3D point cloud scenes that match specific natural language descriptions. This requires the model to not only focus on the target object itself but also to consider the surrounding environment to determine whether the descriptions are met. Most previous works attempt to accomplish both tasks within the same module, which can easily lead to a distraction of attention. To this end, we propose PD-APE, a dual-branch decoding framework that separately decodes target object attributes and surrounding layouts. Specifically, in the target object branch, the decoder processes text tokens that describe features of the target object (e.g., category and color), guiding the queries to pay attention to the target object itself. In the surrounding branch, the queries align with other text tokens that carry surrounding environment information, making the attention maps accurately capture the layout described in the text. Benefiting from the proposed dual-branch design, the queries are allowed to focus on points relevant to each branch's specific objective. Moreover, we design an adaptive position encoding method for each branch respectively. In the target object branch, the position encoding relies on the relative positions between seed points and predicted 3D boxes. In the surrounding branch, the attention map is additionally guided by the confidence between visual and text features, enabling the queries to focus on points that have valuable layout information. Extensive experiments demonstrate that we surpass the state-of-the-art on two widely adopted 3D visual grounding datasets, ScanRefer and Nr3D.
♻ ☆ DORec: Decomposed Object Reconstruction and Segmentation Utilizing 2D Self-Supervised Features
Recovering 3D geometry and textures of individual objects is crucial for many robotics applications, such as manipulation, pose estimation, and autonomous driving. However, decomposing a target object from a complex background is challenging. Most existing approaches rely on costly manual labels to acquire object instance perception. Recent advancements in 2D self-supervised learning offer new prospects for identifying objects of interest, yet leveraging such noisy 2D features for clean decomposition remains difficult. In this paper, we propose a Decomposed Object Reconstruction (DORec) network based on neural implicit representations. Our key idea is to use 2D self-supervised features to create two levels of masks for supervision: a binary mask for foreground regions and a K-cluster mask for semantically similar regions. These complementary masks result in robust decomposition. Experimental results on different datasets show DORec's superiority in segmenting and reconstructing diverse foreground objects from varied backgrounds enabling downstream tasks such as pose estimation.
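The two levels of supervision masks described above can be roughed out from dense 2D self-supervised features (e.g., per-pixel DINO features) with a simple clustering step. How DORec actually selects the foreground is more involved; treating cluster 0 as the foreground below is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_masks(features_hw_d, k=4, foreground_cluster=0):
    # features_hw_d: (H, W, D) dense self-supervised features for one image
    h, w, d = features_hw_d.shape
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(
        features_hw_d.reshape(-1, d)
    )
    k_cluster_mask = labels.reshape(h, w)                 # semantically similar regions
    binary_mask = (k_cluster_mask == foreground_cluster)  # coarse foreground mask
    return binary_mask, k_cluster_mask

# Usage with random stand-in features.
binary, clusters = make_masks(np.random.rand(32, 32, 16))
print(binary.shape, int(clusters.max()))
```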
♻ ☆ FRDiff : Feature Reuse for Universal Training-free Acceleration of Diffusion Models ECCV 2024
The substantial computational costs of diffusion models, especially due to the repeated denoising steps necessary for high-quality image generation, present a major obstacle to their widespread adoption. While several studies have attempted to address this issue by reducing the number of score function evaluations (NFE) using advanced ODE solvers without fine-tuning, the decreased number of denoising iterations misses the opportunity to update fine details, resulting in noticeable quality degradation. In our work, we introduce an advanced acceleration technique that leverages the temporal redundancy inherent in diffusion models. Reusing feature maps with high temporal similarity opens up a new opportunity to save computation resources without compromising output quality. To realize the practical benefits of this intuition, we conduct an extensive analysis and propose a novel method, FRDiff. FRDiff is designed to harness the advantages of both reduced NFE and feature reuse, achieving a Pareto frontier that balances fidelity and latency trade-offs in various generative tasks.
comment: Accepted at ECCV 2024. Code : https://github.com/ECoLab-POSTECH/FRDiff
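The feature-reuse idea described in the FRDiff abstract can be illustrated with a toy sampler: recompute an expensive intermediate block only every few denoising steps and reuse the cached feature map in between. The real FRDiff schedule, and which features it chooses to reuse, are more sophisticated; the fixed reuse interval and the placeholder update step below are assumptions.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.expensive = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, dim)

    def forward(self, x, cached_feat=None):
        # Reuse the cached intermediate feature when one is provided.
        feat = cached_feat if cached_feat is not None else self.expensive(x)
        return self.head(feat), feat

def sample(model, x, num_steps=50, reuse_interval=3):
    cached = None
    for t in range(num_steps):
        recompute = (t % reuse_interval == 0)
        eps, feat = model(x, None if recompute else cached)
        if recompute:
            cached = feat
        x = x - 0.01 * eps  # placeholder update standing in for an ODE-solver step
    return x

out = sample(ToyDenoiser(), torch.randn(1, 64))
print(out.shape)
```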
♻ ☆ Evolution-aware VAriance (EVA) Coreset Selection for Medical Image Classification
In the medical field, managing high-dimensional massive medical imaging data and performing reliable medical analysis from it is a critical challenge, especially in resource-limited environments such as remote medical facilities and mobile devices. This necessitates effective dataset compression techniques to reduce storage, transmission, and computational cost. However, existing coreset selection methods are primarily designed for natural image datasets and show questionable effectiveness when applied to medical image datasets due to challenges such as intra-class variation and inter-class similarity. In this paper, we propose a novel coreset selection strategy termed Evolution-aware VAriance (EVA), which captures the evolutionary process of model training through a dual-window approach and reflects the fluctuation of sample importance more precisely through variance measurement. Extensive experiments on medical image datasets demonstrate the effectiveness of our strategy over previous SOTA methods, especially at high compression rates. EVA achieves 98.27% accuracy with only 10% training data, compared to 97.20% for the full training set. None of the compared baseline methods exceeds Random at a 5% selection rate, while EVA outperforms Random by 5.61%, showcasing its potential for efficient medical image analysis.
comment: Accepted by ACM Multimedia 2024 (oral), see: https://openreview.net/forum?id=m1qrB9KSYD
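A simplified sketch of variance-based coreset selection in the spirit of the EVA abstract above: track a per-sample importance signal across epochs, measure its fluctuation over an early and a late window, and keep the samples with the largest combined variance. The use of the training loss as the importance signal and the equal weighting of the two windows are assumptions, not the paper's exact criterion.

```python
import numpy as np

def select_coreset(loss_history, keep_ratio=0.1, window=5):
    # loss_history: (num_epochs, num_samples) per-sample loss recorded during training
    early = loss_history[:window]
    late = loss_history[-window:]
    score = early.var(axis=0) + late.var(axis=0)   # dual-window fluctuation score
    k = max(1, int(keep_ratio * loss_history.shape[1]))
    return np.argsort(-score)[:k]                  # indices of selected samples

# Usage with a random stand-in for recorded losses.
history = np.random.rand(20, 1000)
print(select_coreset(history, keep_ratio=0.1).shape)
```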
♻ ☆ Biometrics and Behavior Analysis for Detecting Distractions in e-Learning
In this article, we explore computer vision approaches to detect abnormal head pose during e-learning sessions, and we introduce a study on the effects of mobile phone usage during these sessions. We utilize behavioral data collected from 120 learners monitored while participating in MOOC learning sessions. Our study focuses on the influence of phone-usage events on behavior and physiological responses, specifically attention, heart rate, and meditation, before, during, and after phone usage. Additionally, we propose an approach for estimating head pose events using images taken by the webcam during the MOOC learning sessions to detect phone-usage events. Our hypothesis suggests that head posture undergoes significant changes when learners interact with a mobile phone, contrasting with the typical behavior seen when learners face a computer during e-learning sessions. We propose an approach designed to detect deviations in head posture from the average observed during a learner's session, operating as a semi-supervised method. This system flags events indicating alterations in head posture for subsequent human review and selection of mobile phone usage occurrences with a sensitivity of over 90%.
comment: Published in IEEE Intl. Symposium on Computers in Education (SIIE) 2024
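The semi-supervised flagging described above, deviations of head posture from a learner's session average that are passed to a human reviewer, can be sketched in a few lines. The z-score statistic and the threshold value are assumptions, not the paper's exact detector.

```python
import numpy as np

def flag_pose_events(yaw_pitch, z_thresh=2.5):
    # yaw_pitch: (num_frames, 2) head-pose angles estimated from webcam frames
    mean = yaw_pitch.mean(axis=0)
    std = yaw_pitch.std(axis=0) + 1e-8
    z = np.abs((yaw_pitch - mean) / std)
    return np.where(z.max(axis=1) > z_thresh)[0]  # frame indices to review

# Usage on a synthetic session (degrees) with a simulated look-down event.
poses = np.random.randn(1000, 2) * 5
poses[400:420] += np.array([0.0, -35.0])
print("flagged frames:", flag_pose_events(poses)[:10])
```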
♻ ☆ VAAD: Visual Attention Analysis Dashboard applied to e-Learning
In this paper, we present an approach in the Multimodal Learning Analytics field. Within this approach, we have developed a tool to visualize and analyze eye movement data collected during learning sessions in online courses. The tool is named VAAD, an acronym for Visual Attention Analysis Dashboard. These eye movement data have been gathered using an eye-tracker and subsequently processed and visualized for interpretation. The purpose of the tool is to conduct a descriptive analysis of the data by facilitating its visualization, enabling the identification of differences and learning patterns among various learner populations. Additionally, it integrates a predictive module capable of anticipating learner activities during a learning session. Consequently, VAAD holds the potential to offer valuable insights into online learning behaviors from both descriptive and predictive perspectives.
comment: Published in IEEE Intl. Symposium on Computers in Education (SIIE) 2024
♻ ☆ Dual-scale Enhanced and Cross-generative Consistency Learning for Semi-supervised Medical Image Segmentation
Medical image segmentation plays a crucial role in computer-aided diagnosis. However, existing methods heavily rely on fully supervised training, which requires a large amount of labeled data with time-consuming pixel-wise annotations. Moreover, accurately segmenting lesions poses challenges due to variations in shape, size, and location. To address these issues, we propose a novel Dual-scale Enhanced and Cross-generative consistency learning framework for semi-supervised medical image Segmentation (DEC-Seg). First, we propose a Cross-level Feature Aggregation (CFA) module that integrates cross-level adjacent layers to enhance the feature representation ability across different resolutions. To address scale variation, we present a scale-enhanced consistency constraint, which ensures consistency in the segmentation maps generated from the same input image at different scales. This constraint helps handle variations in lesion sizes and improves the robustness of the model. Furthermore, we propose a cross-generative consistency scheme, in which the original and perturbed images can be reconstructed using cross-segmentation maps. This consistency constraint allows us to mine effective feature representations and boost the segmentation performance. To further exploit the scale information, we propose a Dual-scale Complementary Fusion (DCF) module that integrates features from two scale-specific decoders operating at different scales to help produce more accurate segmentation maps. Extensive experimental results on multiple medical segmentation tasks (polyp, skin lesion, and brain glioma) demonstrate the effectiveness of our DEC-Seg against other state-of-the-art semi-supervised segmentation approaches. The implementation code will be released at https://github.com/taozh2017/DECSeg.
comment: 12 pages, 10 figures
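The scale-enhanced consistency constraint described in the DEC-Seg abstract can be illustrated with a small sketch: run the same segmentation network on an image and on its downscaled copy, bring both predictions to a common resolution, and penalize their disagreement. The loss form (MSE between probability maps) is an assumption.

```python
import torch
import torch.nn.functional as F

def scale_consistency_loss(model, image, scale=0.5):
    logits_full = model(image)                                        # (B, C, H, W)
    small = F.interpolate(image, scale_factor=scale, mode="bilinear",
                          align_corners=False)
    logits_small = model(small)
    logits_small_up = F.interpolate(logits_small, size=logits_full.shape[-2:],
                                    mode="bilinear", align_corners=False)
    # Penalize disagreement between the two predictions of the same image.
    return F.mse_loss(torch.softmax(logits_full, dim=1),
                      torch.softmax(logits_small_up, dim=1))

# Usage with a stand-in "segmentation network".
model = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1)
loss = scale_consistency_loss(model, torch.rand(2, 3, 64, 64))
print(loss.item())
```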
♻ ☆ Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
Existing pedestrian attribute recognition (PAR) algorithms adopt a pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to insufficient exploitation of the relations between pedestrian images and attribute labels. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels. Specifically, the attribute phrases are first expanded into sentences, and then the pre-trained vision-language model CLIP is adopted as our backbone for feature embedding of visual images and attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture the long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the dual features effectively, and a feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique to adjust very few parameters (i.e., only the prompt vectors and classification heads) and fix both the pre-trained VL model and multi-modal Transformer. Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance in both standard and zero-shot settings for PAR, including the RAPv1, RAPv2, WIDER, PA100K, PETA-ZS, and RAP-ZS datasets. The source code and pre-trained models will be released on https://github.com/Event-AHU/OpenPAR.
comment: Accepted by IEEE TCSVT 2024, Camera Ready Version
♻ ☆ Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities
Imaging techniques such as chest X-rays, whole slide images, and optical coherence tomography serve as the initial screening and detection tools for a wide variety of medical conditions, including pulmonary and ophthalmic diseases. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families, each with pretraining and random initialization. Our findings showed that using pretrained models as fixed feature extractors yields poor performance irrespective of the dataset; histopathology microscopy whole slide images are the exception, showing better performance. We also found that deeper and more complex architectures did not necessarily result in the best performance. This observation implies that improvements on ImageNet do not transfer proportionally to medical imaging tasks. Within a medical domain, the performance of network architectures varies within model families as datasets shift. This indicates that the performance of models within a specific modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.
comment: 15 pages, 3 figures, 4 tables
♻ ☆ Instant Adversarial Purification with Adversarial Consistency Distillation
Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, such as DiffPure, they are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use a Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve a defense success rate of 74.19% on ImageNet, requiring only 0.1 s per purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this incongruence, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that GAND does not require full fine-tuning (FFT); parameter-efficient fine-tuning (PEFT), e.g., LoRA, is sufficient.
♻ ☆ Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation ICPR 2024
In this work, we study the task of "visually" translating scene text from a source language (e.g., Hindi) to a target language (e.g., English). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the source scene text, such as font, size, and background. There are several challenges associated with this task, such as translation with limited context, deciding between translation and transliteration, accommodating varying text lengths within fixed spatial boundaries, and preserving the font and background styles of the source scene text in the target language. To address this problem, we make the following contributions: (i) We study visual translation as a standalone problem for the first time in the literature. (ii) We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task. (iii) We propose a set of task-specific design enhancements to build a variant of the baseline that obtains performance improvements. (iv) Currently, the existing related literature lacks any comprehensive performance evaluation for this novel task. To fill this gap, we introduce several automatic and user-assisted evaluation metrics designed explicitly for evaluating visual translation. Further, we evaluate the presented baselines for translating scene text between Hindi and English. Our experiments demonstrate that although we can effectively perform visual translation over a large collection of scene text images, the presented baseline only partially addresses challenges posed by visual translation tasks. We firmly believe that this new task and the limitations of existing models, as reported in this paper, should encourage further research in visual translation.
comment: Accepted at ICPR 2024, Project Website: https://vl2g.github.io/projects/visTrans/
♻ ☆ An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models ECCV 2024
In this study, we identify an inefficient attention phenomenon in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat, and Video-LLaVA. We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly customizable and Pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical value for the deployment of LVLMs on edge devices and in commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
comment: Accepted to ECCV 2024 (Oral), code is released at https://github.com/pkunlp-icler/FastV,
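The pruning step described in the FastV abstract can be sketched schematically: after an early layer, rank visual tokens by how much attention they receive and drop the lowest-ranked ones from subsequent layers. The ranking statistic, the keep ratio, and the batch-size-1 simplification below are illustrative assumptions rather than FastV's exact rule.

```python
import torch

def prune_visual_tokens(hidden, attn, visual_idx, keep_ratio=0.5):
    # hidden: (B, T, D) token states; attn: (B, heads, T, T) attention weights
    # visual_idx: 1-D tensor of positions that hold image tokens
    received = attn.mean(dim=1).mean(dim=1)           # (B, T) avg attention each token receives
    vis_scores = received[:, visual_idx]              # scores of visual tokens only
    k = max(1, int(keep_ratio * visual_idx.numel()))
    top = vis_scores.topk(k, dim=-1).indices
    keep_vis = visual_idx[top[0]]                     # batch size 1 assumed for brevity
    text_idx = torch.tensor([i for i in range(hidden.shape[1])
                             if i not in set(visual_idx.tolist())])
    keep = torch.sort(torch.cat([text_idx, keep_vis])).values
    return hidden[:, keep], keep

# Usage on random tensors: 16 tokens, of which positions 4..11 are visual.
hidden = torch.randn(1, 16, 32)
attn = torch.softmax(torch.randn(1, 4, 16, 16), dim=-1)
pruned, kept = prune_visual_tokens(hidden, attn, visual_idx=torch.arange(4, 12))
print(pruned.shape, kept.tolist())
```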
♻ ☆ Video Diffusion Models are Strong Video Inpainter
Propagation-based video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as the inaccuracy of optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and time consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We design FFF-VDI inspired by the capabilities of pre-trained image-to-video diffusion models that can transform the first frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. This proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model can robustly handle diverse inpainting types with high quality.
♻ ☆ A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse
Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raise concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has extended these techniques as a benign means of preventing misuse of generative AI. Current approaches to safeguarding images from manipulation by LDMs are limited by their reliance on model-specific knowledge and their inability to significantly degrade the semantic quality of generated images. In response to these shortcomings, we propose the Posterior Collapse Attack (PCA), based on the observation that VAEs suffer from posterior collapse during training. Our method minimizes dependence on the white-box information of target models, eliminating the implicit reliance on model-specific knowledge. By accessing only a small portion of the LDM's parameters, specifically the VAE encoder, our method causes a substantial semantic collapse in generation quality, particularly in perceptual consistency, and demonstrates strong transferability across various model architectures. Experimental results show that PCA achieves superior perturbation effects on image generation of LDMs with lower runtime and VRAM. Our method outperforms existing techniques, offering a more robust and generalizable solution that helps alleviate the socio-technical challenges posed by the rapidly evolving landscape of generative AI.
comment: 21 pages, 7 figures, 10 tables
♻ ☆ Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams. Most event-based methods project events into dense frames and process them using conventional vision models, resulting in high computational complexity. A recent trend is to develop point-based networks that achieve efficient event processing by learning sparse representations. However, existing works may lack robust local information aggregators and effective feature interaction operations, thus limiting their modeling capabilities. To this end, we propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that consists of two well-designed components, including the Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and the Voxel Self-Attention Layer (VSAL) for global feature interaction. Enabling the network to incorporate a long-range temporal structure, we introduce a segment modeling strategy (S$^{2}$TM) to learn motion patterns from a sequence of segmented voxel sets. The proposed model is evaluated on two recognition tasks, including object classification and action recognition. To provide a convincing model evaluation, we present a new event-based action recognition dataset (NeuroHAR) recorded in challenging scenarios. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
comment: Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
♻ ☆ MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms ACL 2024
Social media platforms are hubs for multimodal information exchange, encompassing text, images, and videos, making it challenging for machines to comprehend the information or emotions associated with interactions in online spaces. Multimodal Large Language Models (MLLMs) have emerged as a promising solution to these challenges, yet they struggle to accurately interpret human emotions and complex content such as misinformation. This paper introduces MM-Soc, a comprehensive benchmark designed to evaluate MLLMs' understanding of multimodal social media content. MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset, targeting a range of tasks from misinformation detection, hate speech detection, and social context generation. Through our exhaustive evaluation on ten size-variants of four open-source MLLMs, we have identified significant performance disparities, highlighting the need for advancements in models' social understanding capabilities. Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks. However, MLLMs demonstrate performance improvements post fine-tuning, suggesting potential pathways for improvement. Our code and data are available at https://github.com/claws-lab/MMSoc.git.
comment: In Proceedings of ACL 2024
♻ ☆ Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance
Although most existing multi-modal salient object detection (SOD) methods demonstrate effectiveness through training models from scratch, the limited multi-modal data hinders these methods from reaching optimality. In this paper, we propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the pre-trained Segment Anything Model (SAM) for multi-modal SOD. Although SAM is a recent vision foundation model, driving the class-agnostic SAM to comprehend and detect salient objects accurately is non-trivial, especially in challenging scenes. To this end, we develop SAM with semantic feature fusion guidance (Sammese), which incorporates multi-modal saliency-specific knowledge into SAM to adapt SAM to multi-modal SOD tasks. However, it is difficult for SAM trained on single-modal data to directly mine the complementary benefits of multi-modal inputs and comprehensively utilize them to achieve accurate saliency prediction. To address these issues, we first design a multi-modal complementary fusion module to extract robust multi-modal semantic features by integrating information from visible and thermal or depth image pairs. Then, we feed the extracted multi-modal semantic features into both the SAM image encoder and mask decoder for fine-tuning and prompting, respectively. Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. In the mask decoder, a semantic-geometric prompt generation strategy is proposed to produce corresponding embeddings with various saliency cues. Extensive experiments on both RGB-D and RGB-T SOD benchmarks show the effectiveness of the proposed framework. The code will be available at \url{https://github.com/Angknpng/Sammese}.
comment: 10 pages, 9 figures
♻ ☆ S-NeRF++: Autonomous Driving Simulation via Neural Reconstruction and Generation
Autonomous driving simulation systems play a crucial role in enhancing self-driving data and simulating complex and rare traffic scenarios, ensuring navigation safety. However, traditional simulation systems, which often heavily rely on manual modeling and 2D image editing, struggle to scale to extensive scenes and to generate realistic simulation data. In this study, we present S-NeRF++, an innovative autonomous driving simulation system based on neural reconstruction. Trained on widely-used self-driving datasets such as nuScenes and Waymo, S-NeRF++ can generate a large number of realistic street scenes and foreground objects with high rendering quality as well as offering considerable flexibility in manipulation and simulation. Specifically, S-NeRF++ is an enhanced neural radiance field for synthesizing large-scale scenes and moving vehicles, with improved scene parameterization and camera pose learning. The system effectively utilizes noisy and sparse LiDAR data to refine training and address depth outliers, ensuring high-quality reconstruction and novel-view rendering. It also provides a diverse foreground asset bank by reconstructing and generating different foreground vehicles to support comprehensive scenario creation. Moreover, we have developed an advanced foreground-background fusion pipeline that skillfully integrates illumination and shadow effects, further enhancing the realism of our simulations. With the high-quality simulated data provided by our S-NeRF++, we find that perception methods enjoy performance boosts on several autonomous driving downstream tasks, further demonstrating our proposed simulator's effectiveness.
♻ ☆ NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches
Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating, as malnutrition has been directly linked to decreased quality of life. However self-reporting methods such as food diaries suffer from substantial bias. Other conventional dietary assessment techniques and emerging alternative approaches such as mobile applications incur high time costs and may necessitate trained personnel. Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images, but the lack of comprehensive datasets with diverse viewpoints, modalities and food annotations hinders the accuracy and realism of such methods. To address this limitation, we introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 photorealistic synthetic 2D food images with associated dietary information and multimodal annotations (including depth images, instance masks, and semantic masks). Additionally, we collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism. Leveraging these novel datasets, we develop and benchmark NutritionVerse, an empirical study of various dietary intake estimation approaches, including indirect segmentation-based and direct prediction networks. We further fine-tune models pretrained on synthetic data with real images to provide insights into the fusion of synthetic and real data. Finally, we release both datasets (NutritionVerse-Synth, NutritionVerse-Real) on https://www.kaggle.com/nutritionverse/datasets as part of an open initiative to accelerate machine learning for dietary sensing.
comment: Corrections made to Tables 6, 7, and 8, and corrections made to Experiments Part C. Additional clarification made in Section 4
♻ ☆ DarkGS: Learning Neural Illumination and 3D Gaussians Relighting for Robotic Exploration in the Dark
Humans have the remarkable ability to construct consistent mental models of an environment, even under limited or varying levels of illumination. We wish to endow robots with this same capability. In this paper, we tackle the challenge of constructing a photorealistic scene representation under poorly illuminated conditions and with a moving light source. We approach the task of modeling illumination as a learning problem, and utilize the developed illumination model to aid in scene reconstruction. We introduce an innovative framework that uses a data-driven approach, Neural Light Simulators (NeLiS), to model and calibrate the camera-light system. Furthermore, we present DarkGS, a method that applies NeLiS to create a relightable 3D Gaussian scene model capable of real-time, photorealistic rendering from novel viewpoints. We show the applicability and robustness of our proposed simulator and system in a variety of real-world environments.
comment: 8 pages, 10 figures
♻ ☆ Towards Non-invasive and Personalized Management of Breast Cancer Patients from Multiparametric MRI via A Large Mixture-of-Modality-Experts Model
Breast magnetic resonance imaging (MRI) is the imaging technique with the highest sensitivity for detecting breast cancer and is routinely used for women at high risk. Despite the comprehensive multiparametric protocol of breast MRI, existing artificial intelligence-based studies predominantly rely on single sequences and have limited validation. Here we report a large mixture-of-modality-experts model (MOME) that integrates multiparametric MRI information within a unified structure, offering a noninvasive method for personalized breast cancer management. We have curated the largest multiparametric breast MRI dataset, involving 5,205 patients from three hospitals in the north, southeast, and southwest of China, for the development and extensive evaluation of our model. MOME demonstrated accurate and robust identification of breast cancer. It achieved comparable performance for malignancy recognition to that of four senior radiologists and significantly outperformed a junior radiologist, with 0.913 AUROC, 0.948 AUPRC, 0.905 F1 score, and 0.723 MCC. Our findings suggest that MOME could reduce the need for biopsies in BI-RADS 4 patients with a ratio of 7.3%, classify triple-negative breast cancer with an AUROC of 0.709, and predict pathological complete response to neoadjuvant chemotherapy with an AUROC of 0.694. The model further supports scalable and interpretable inference, adapting to missing modalities and providing decision explanations by highlighting lesions and measuring modality contributions. MOME exemplifies a discriminative, robust, scalable, and interpretable multimodal model, paving the way for noninvasive, personalized management of breast cancer patients based on multiparametric breast imaging data.
comment: 27 pages, 8 figures, 10 tables
Artificial Intelligence 55
♻ ☆ PhysORD: A Neuro-Symbolic Approach for Physics-infused Motion Prediction in Off-road Driving
Motion prediction is critical for autonomous off-road driving; however, it presents significantly more challenges than on-road driving because of the complex interaction between the vehicle and the terrain. Traditional physics-based approaches encounter difficulties in accurately modeling dynamic systems and external disturbances. In contrast, data-driven neural networks require extensive datasets and struggle with explicitly capturing the fundamental physical laws, which can easily lead to poor generalization. By merging the advantages of both methods, neuro-symbolic approaches present a promising direction. These methods embed physical laws into neural models, potentially significantly improving generalization capabilities. However, no prior work has been evaluated in real-world off-road driving settings. To bridge this gap, we present PhysORD, a neuro-symbolic approach integrating the conservation law, i.e., the Euler-Lagrange equation, into data-driven neural models for motion prediction in off-road driving. Our experiments show that PhysORD can accurately predict vehicle motion and tolerate external disturbances by modeling uncertainties. It outperforms existing methods both in accuracy and efficiency and demonstrates data-efficient learning and generalization ability in long-term prediction.
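As a toy illustration of the physics-infused pattern described above, one can keep the state update anchored in mechanics (here, a simple semi-implicit Euler step) and let a neural network predict only the hard-to-model force residual. This is not PhysORD's Euler-Lagrange formulation; it is a minimal sketch of embedding physical structure into a learned predictor, with all dimensions and the integrator chosen for illustration.

```python
import torch
import torch.nn as nn

class ResidualForceNet(nn.Module):
    def __init__(self, state_dim=4, ctrl_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + ctrl_dim, 64), nn.Tanh(),
                                 nn.Linear(64, state_dim // 2))

    def forward(self, state, control):
        # Learned force acting on the velocity components.
        return self.net(torch.cat([state, control], dim=-1))

def physics_step(state, control, force_net, dt=0.05):
    # state = [position(2), velocity(2)]; integrator supplies the known structure.
    pos, vel = state[..., :2], state[..., 2:]
    accel = force_net(state, control)          # data-driven part
    vel_next = vel + dt * accel                # semi-implicit Euler update
    pos_next = pos + dt * vel_next
    return torch.cat([pos_next, vel_next], dim=-1)

# Roll the model forward for a few steps with random controls.
net = ResidualForceNet()
s = torch.zeros(1, 4)
for _ in range(10):
    s = physics_step(s, torch.rand(1, 2), net)
print(s)
```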
♻ ☆ Dissociation of Faithful and Unfaithful Reasoning in LLMs
Large language models (LLMs) often improve their performance in downstream tasks when they generate Chain of Thought reasoning text before producing an answer. We investigate how LLMs recover from errors in Chain of Thought. Through analysis of error recovery behaviors, we find evidence for unfaithfulness in Chain of Thought, which occurs when models arrive at the correct answer despite invalid reasoning text. We identify factors that shift LLM recovery behavior: LLMs recover more frequently from obvious errors and in contexts that provide more evidence for the correct answer. Critically, these factors have divergent effects on faithful and unfaithful recoveries. Our results indicate that there are distinct mechanisms driving faithful and unfaithful error recoveries. Selective targeting of these mechanisms may be able to drive down the rate of unfaithful reasoning and improve model interpretability.
comment: code published at https://github.com/CoTErrorRecovery/CoTErrorRecovery
♻ ☆ SEDMamba: Enhancing Selective State Space Modelling with Bottleneck Mechanism and Fine-to-Coarse Temporal Fusion for Efficient Error Detection in Robot-Assisted Surgery
Automated detection of surgical errors can improve robotic-assisted surgery. Despite promising progress, existing methods still face challenges in capturing rich temporal context to establish long-term dependencies while maintaining computational efficiency. In this paper, we propose a novel hierarchical model named SEDMamba, which incorporates the selective state space model (SSM) into surgical error detection, facilitating efficient long sequence modelling with linear complexity. SEDMamba enhances selective SSM with a bottleneck mechanism and fine-to-coarse temporal fusion (FCTF) to detect and temporally localize surgical errors in long videos. The bottleneck mechanism compresses and restores features within their spatial dimension, thereby reducing computational complexity. FCTF utilizes multiple dilated 1D convolutional layers to merge temporal information across diverse scale ranges, accommodating errors of varying duration. Our work also contributes the first-of-its-kind, frame-level, in-vivo surgical error dataset to support error detection in real surgical cases. Specifically, we deploy the clinically validated observational clinical human reliability assessment tool (OCHRA) to annotate the errors during suturing tasks in an open-source radical prostatectomy dataset (SAR-RARP50). Experimental results demonstrate that our SEDMamba outperforms state-of-the-art methods with at least 1.82% AUC and 3.80% AP performance gains with significantly reduced computational complexity. The corresponding error annotations, code and models will be released at https://github.com/wzjialang/SEDMamba.
comment: 8 pages
♻ ☆ Manipulating Large Language Models to Increase Product Visibility
Large language models (LLMs) are increasingly being integrated into search engines to provide natural language responses tailored to user queries. Customers and end-users are also becoming more dependent on these models for quick and easy purchase decisions. In this work, we investigate whether recommendations from LLMs can be manipulated to enhance a product's visibility. We demonstrate that adding a strategic text sequence (STS) -- a carefully crafted message -- to a product's information page can significantly increase its likelihood of being listed as the LLM's top recommendation. To understand the impact of STS, we use a catalog of fictitious coffee machines and analyze its effect on two target products: one that seldom appears in the LLM's recommendations and another that usually ranks second. We observe that the strategic text sequence significantly enhances the visibility of both products by increasing their chances of appearing as the top recommendation. This ability to manipulate LLM-generated search responses provides vendors with a considerable competitive advantage and has the potential to disrupt fair market competition. Just as search engine optimization (SEO) revolutionized how webpages are customized to rank higher in search engine results, influencing LLM recommendations could profoundly impact content optimization for AI-driven search services. Code for our experiments is available at https://github.com/aounon/llm-rank-optimizer.
♻ ☆ Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions
This paper examines the role of cognitive biases in the decision-making processes of large language models (LLMs), challenging the conventional goal of eliminating all biases. We show that certain cognitive biases, when properly balanced, can enhance decision-making efficiency through rational deviations and heuristic shortcuts. By introducing heuristic moderation and an abstention option, which allows LLMs to withhold responses when uncertain, we reduce error rates, improve decision accuracy, and optimize decision rates. Using the Balance Rigor and Utility (BRU) dataset, developed through expert collaboration, our findings demonstrate that targeted inspection of cognitive biases aligns LLM decisions more closely with human reasoning, enhancing reliability and suggesting strategies for future improvements. This approach offers a novel way to leverage cognitive biases to improve the practical utility of LLMs across various applications.
comment: This article is currently under review. All data will be open on GitHub once the review is complete. https://github.com/limanwang/Balancing-Rigor-and-Utility
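The abstention option described above can be sketched as a simple confidence gate over a multiple-choice question: answer only when the model's option probabilities are confident enough, otherwise withhold a response. The softmax-over-options scoring and the threshold value are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def answer_or_abstain(option_logprobs, threshold=0.6):
    # option_logprobs: log-probabilities the LLM assigns to each answer option
    probs = np.exp(option_logprobs - np.max(option_logprobs))
    probs /= probs.sum()
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "ABSTAIN", probs
    return "ABCD"[best], probs

print(answer_or_abstain(np.array([-1.2, -1.3, -1.25, -1.4])))  # close call -> abstain
print(answer_or_abstain(np.array([-0.1, -3.0, -3.2, -4.0])))   # confident -> answer A
```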
♻ ☆ On the Impacts of Contexts on Repository-Level Code Generation
CodeLLMs have gained widespread adoption for code generation tasks, yet their capacity to handle repository-level code generation with complex contextual dependencies remains underexplored. Our work underscores the critical importance of leveraging repository-level contexts to generate executable and functionally correct code. We present RepoExec, a novel benchmark designed to evaluate repository-level code generation, with a focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts. Our study examines a controlled scenario where developers specify essential code dependencies (contexts), challenging models to integrate them effectively. Additionally, we introduce an instruction-tuned dataset that enhances CodeLLMs' ability to leverage dependencies, along with a new metric, Dependency Invocation Rate (DIR), to quantify context utilization. Experimental results reveal that while pretrained LLMs demonstrate superior performance in terms of correctness, instruction-tuned models excel in context utilization and debugging capabilities. RepoExec offers a comprehensive evaluation framework for assessing code functionality and alignment with developer intent, thereby advancing the development of more reliable CodeLLMs for real-world applications. The dataset and source code are available at \url{https://github.com/FSoft-AI4Code/RepoExec}.
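A rough sketch of a Dependency Invocation Rate computation, in the spirit of the metric named above: the fraction of developer-specified dependencies that the generated code actually calls. The benchmark's exact matching rules may differ; name-based matching over the Python AST here is a simplification.

```python
import ast

def dependency_invocation_rate(generated_code: str, dependencies: list[str]) -> float:
    tree = ast.parse(generated_code)
    called = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            fn = node.func
            if isinstance(fn, ast.Name):
                called.add(fn.id)        # e.g. connect(...)
            elif isinstance(fn, ast.Attribute):
                called.add(fn.attr)      # e.g. client.connect(...)
    if not dependencies:
        return 1.0
    return sum(dep in called for dep in dependencies) / len(dependencies)

# Usage: two of the three specified dependencies are invoked -> DIR = 2/3.
code = "def run(cfg):\n    db = connect(cfg.url)\n    return load_users(db)\n"
print(dependency_invocation_rate(code, ["connect", "load_users", "send_email"]))
```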
♻ ☆ Eliciting Informative Text Evaluations with Large Language Models
Peer prediction mechanisms motivate high-quality feedback with provable guarantees. However, current methods only apply to rather simple reports, like multiple-choice or scalar numbers. We aim to broaden these techniques to the larger domain of text-based reports, drawing on the recent developments in large language models. This vastly increases the applicability of peer prediction mechanisms as textual feedback is the norm in a large variety of feedback channels: peer reviews, e-commerce customer reviews, and comments on social media. We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM). These mechanisms utilize LLMs as predictors, mapping from one agent's report to a prediction of her peer's report. Theoretically, we show that when the LLM prediction is sufficiently accurate, our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium. Empirically, we confirm the efficacy of our mechanisms through experiments conducted on two real datasets: the Yelp review dataset and the ICLR OpenReview dataset. We highlight the results that on the ICLR dataset, our mechanisms can differentiate three quality levels -- human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews in terms of expected scores. Additionally, GSPPM penalizes LLM-generated reviews more effectively than GPPM.
comment: Accepted by the Twenty-Fifth ACM Conference on Economics and Computation (EC'24)
♻ ☆ Uplift Modeling Under Limited Supervision
Estimating causal effects in e-commerce tends to involve costly treatment assignments which can be impractical in large-scale settings. Leveraging machine learning to predict such treatment effects without actual intervention is a standard practice to diminish the risk. However, existing methods for treatment effect prediction tend to rely on training sets of substantial size, which are built from real experiments and are thus inherently risky to create. In this work we propose a graph neural network to diminish the required training set size, relying on graphs that are common in e-commerce data. Specifically, we view the problem as node regression with a restricted number of labeled instances, develop a two-model neural architecture akin to previous causal effect estimators, and test varying message-passing layers for encoding. Furthermore, as an extra step, we combine the model with an acquisition function to guide the creation of the training set in settings with extremely low experimental budget. The framework is flexible since each step can be used separately with other models or treatment policies. The experiments on real large-scale networks indicate a clear advantage of our methodology over the state of the art, which in many cases performs close to random, underlining the need for models that can generalize with limited supervision to reduce experimental risks.
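A deliberately minimal two-model uplift estimator in the spirit of the architecture described above: fit one model on treated units and one on control units, and take the difference of their predictions as the uplift. The paper uses graph neural networks over e-commerce graphs; plain gradient boosting on tabular features here is only a stand-in to show the two-model pattern.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_two_model_uplift(X, treatment, outcome):
    m_treated = GradientBoostingRegressor().fit(X[treatment == 1], outcome[treatment == 1])
    m_control = GradientBoostingRegressor().fit(X[treatment == 0], outcome[treatment == 0])
    # Predicted uplift = expected outcome under treatment minus under control.
    return lambda X_new: m_treated.predict(X_new) - m_control.predict(X_new)

# Usage on synthetic data with a known heterogeneous treatment effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
t = rng.integers(0, 2, size=500)
y = X[:, 0] + t * (0.5 + X[:, 1]) + rng.normal(scale=0.1, size=500)
uplift = fit_two_model_uplift(X, t, y)
print(uplift(X[:5]))
```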
♻ ☆ Into the Unknown: Self-Learning Large Language Models
We address the main problem of self-learning LLMs: the question of what to learn. We propose a self-learning LLM framework that enables an LLM to independently learn previously unknown knowledge through self-assessment of its own hallucinations. We introduce a concept called Point in the Unknown (PiU) to identify atomic knowledge unknown to a model, along with four methods for automatic PiU identification, facilitating the creation of a self-learning loop that focuses exclusively on the absorption of currently unknown knowledge into the model. Additionally, we developed evaluation metrics to gauge an LLM's self-learning capability. Our experiments revealed that LLMs with at least 3B parameters that have undergone some instruction training are able to perform self-learning well. We further proved the effectiveness of self-learning by comparing the performance of a model that has undergone self-learning to a model that has not. Our self-learning concept allows more efficient LLM updates and opens new perspectives for LLM knowledge exchange.
comment: 10 pages, 3 figures, 3 tables, submitted
♻ ☆ AI-Assisted Generation of Difficult Math Questions
Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage the metacognitive skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multi-turn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline to skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$ - a dataset of higher-quality math questions, as evidenced by: (a) lower performance of all models on MATH$^2$ than on MATH, and (b) higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship observed between models' performance on the new dataset: the success rate on MATH$^2$ is approximately the square of the success rate on MATH, suggesting that successfully solving a question in MATH$^2$ requires a nontrivial combination of two distinct math skills.
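The closing observation can be restated as a simple approximation (our notation, not the paper's), under the strong assumption that a MATH$^2$ question is solved only when its two underlying skills are applied independently, each at roughly the MATH success rate:

```latex
% p_1: success rate on MATH, p_2: success rate on MATH^2 (notation ours).
% Assuming independent application of the two required skills:
\[
  p_2 \;\approx\; p_1 \cdot p_1 \;=\; p_1^{\,2}
\]
```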
♻ ☆ Online Detection of Anomalies in Temporal Knowledge Graphs with Interpretability SIGMOD 2025
Temporal knowledge graphs (TKGs) are valuable resources for capturing evolving relationships among entities, yet they are often plagued by noise, necessitating robust anomaly detection mechanisms. Existing dynamic graph anomaly detection approaches struggle to capture the rich semantics introduced by node and edge categories within TKGs, while TKG embedding methods lack interpretability, undermining the credibility of anomaly detection. Moreover, these methods falter in adapting to pattern changes and semantic drifts resulting from knowledge updates. To tackle these challenges, we introduce AnoT, an efficient TKG summarization method tailored for interpretable online anomaly detection in TKGs. AnoT begins by summarizing a TKG into a novel rule graph, enabling flexible inference of complex patterns in TKGs. When new knowledge emerges, AnoT maps it onto a node in the rule graph and traverses the rule graph recursively to derive the anomaly score of the knowledge. The traversal yields reachable nodes that furnish interpretable evidence for the validity or anomalousness of the new knowledge. Overall, AnoT embodies a detector-updater-monitor architecture, encompassing a detector for offline TKG summarization and online scoring, an updater for real-time rule graph updates based on emerging knowledge, and a monitor for estimating the approximation error of the rule graph. Experimental results on four real-world datasets demonstrate that AnoT surpasses existing methods significantly in terms of accuracy and interpretability. All of the raw datasets and the implementation of AnoT are provided at https://github.com/zjs123/ANoT.
comment: 26 pages, 10 figures. Accepted by SIGMOD 2025
♻ ☆ NeuFair: Neural Network Fairness Repair with Dropout ISSTA 2024
This paper investigates neuron dropout as a post-processing bias mitigation for deep neural networks (DNNs). Neural-driven software solutions are increasingly applied in socially critical domains with significant fairness implications. While neural networks are exceptionally good at finding statistical patterns from data, they may encode and amplify existing biases from the historical data. Existing bias mitigation algorithms often require modifying the input dataset or the learning algorithms. We posit that the prevalent dropout methods that prevent over-fitting during training by randomly dropping neurons may be an effective and less intrusive approach to improve the fairness of pre-trained DNNs. However, finding the ideal set of neurons to drop is a combinatorial problem. We propose NeuFair, a family of post-processing randomized algorithms that mitigate unfairness in pre-trained DNNs via dropouts during inference after training. Our randomized search is guided by an objective to minimize discrimination while maintaining the model's utility. We show that our design of randomized algorithms is effective and efficient in improving fairness (up to 69%) with minimal or no model performance degradation. We provide intuitive explanations of these phenomena and carefully examine the influence of various hyperparameters of search algorithms on the results. Finally, we empirically and conceptually compare NeuFair to different state-of-the-art bias mitigators.
comment: Paper accepted at ACM ISSTA 2024
♻ ☆ An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs ACL 2024
Large language models (LLMs) have shown strong arithmetic reasoning capabilities when prompted with Chain-of-Thought (CoT) prompts. However, we have only a limited understanding of how these prompts are processed by LLMs. To demystify this, prior work has primarily focused on ablating different components in the CoT prompt and empirically observing the resulting change in LLM performance. Yet, the reason why these components are important to LLM reasoning has not been explored. To fill this gap, in this work, we investigate ``neuron activation'' as a lens to provide a unified explanation for observations made by prior work. Specifically, we look into neurons within the feed-forward layers of LLMs that may have activated their arithmetic reasoning capabilities, using Llama2 as an example. To facilitate this investigation, we also propose an approach based on GPT-4 to automatically identify neurons that imply arithmetic reasoning. Our analyses revealed that the activation of reasoning neurons in the feed-forward layers of an LLM can explain the importance of various components in a CoT prompt, and future research can extend it for a more complete understanding.
comment: 9 pages, 1 figure, to be published in ACL 2024
♻ ☆ Privacy-Aware Document Visual Question Answering ICDAR 2024
Document Visual Question Answering (DocVQA) has quickly grown into a central task of document understanding. But despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees. In this work, we explore privacy in the domain of DocVQA for the first time, highlighting privacy issues in state-of-the-art multi-modal LLMs used for DocVQA, and explore possible solutions. Specifically, we focus on invoice processing as a realistic document understanding scenario, and propose a large-scale DocVQA dataset comprising invoice documents and associated questions and answers. We employ a federated learning scheme, that reflects the real-life distribution of documents in different businesses, and we explore the use case where the data of the invoice provider is the sensitive information to be protected. We demonstrate that non-private models tend to memorise, a behaviour that can lead to exposing private information. We then evaluate baseline training schemes employing federated learning and differential privacy in this multi-modal scenario, where the sensitive information might be exposed through either or both of the two input modalities: vision (document image) or language (OCR tokens). Finally, we design attacks exploiting the memorisation effect of the model, and demonstrate their effectiveness in probing representative DocVQA models.
comment: 35 pages, 12 figures, accepted for publication at the 18th International Conference on Document Analysis and Recognition, ICDAR 2024
♻ ☆ Domain-Specific Improvement on Psychotherapy Chatbot Using Assistant ICASSP 2024
Large language models (LLMs) have demonstrated impressive generalization capabilities on specific tasks with human-written instruction data. However, the limited quantity, diversity, and professional expertise of such instruction data raise concerns about the performance of LLMs in psychotherapy tasks when provided with domain-specific instructions. To address this, we first propose Domain-Specific Assistant Instructions based on AlexanderStreet therapy, and second, we use an adaptation fine-tuning method and a retrieval-augmented generation method to improve pre-trained LLMs. Through quantitative evaluation of linguistic quality using automatic and human evaluation, we observe that pre-trained LLMs tuned on Psychotherapy Assistant Instructions outperform state-of-the-art LLM response baselines. Our Assistant-Instruction approach offers a half-annotation method to align pre-trained LLMs with instructions and provide pre-trained LLMs with more psychotherapy knowledge.
comment: Accepted at ICASSP 2024 EIHRC Workshop
♻ ☆ Sentiment Analysis Across Languages: Evaluation Before and After Machine Translation to English
People communicate in more than 7,000 languages around the world, with around 780 languages spoken in India alone. Despite this linguistic diversity, research on Sentiment Analysis has predominantly focused on English text data, resulting in a disproportionate availability of sentiment resources for English. This paper examines the performance of transformer models in Sentiment Analysis tasks across multilingual datasets and text that has undergone machine translation. By comparing the effectiveness of these models in different linguistic contexts, we gain insights into their performance variations and potential implications for sentiment analysis across diverse languages. We also discuss the shortcomings and potential for future work towards the end.
comment: 6 pages, 3 Figures
♻ ☆ The Cultivated Practices of Text-to-Image Generation
Humankind is entering a novel creative era in which anybody can synthesize digital information using generative artificial intelligence (AI). Text-to-image generation, in particular, has become vastly popular, and millions of practitioners produce AI-generated images and AI art online. This chapter first gives an overview of the key developments that enabled a healthy co-creative online ecosystem around text-to-image generation to rapidly emerge, followed by a high-level description of key elements in this ecosystem. A particular focus is placed on prompt engineering, a creative practice that has been embraced by the AI art community. It is then argued that the emerging co-creative ecosystem constitutes an intelligent system on its own - a system that supports human creativity but also potentially entraps future generations and limits future development efforts in AI. The chapter discusses the potential risks and dangers of cultivating this co-creative ecosystem, such as the bias inherent in today's training data, potential quality degradation in future image generation systems due to synthetic data becoming commonplace, and the potential long-term effects of text-to-image generation on people's imagination, ambitions, and development.
comment: In "Humane autonomous technology - Re-thinking experience with and in intelligent systems", Palgrave Macmillan, 2024
♻ ☆ HyperInterval: Hypernetwork approach to training weight interval regions in continual learning
Recently, a new Continual Learning (CL) paradigm was presented to control catastrophic forgetting, called Interval Continual Learning (InterContiNet), which relies on enforcing interval constraints on the neural network parameter space. Unfortunately, InterContiNet training is challenging due to the high dimensionality of the weight space, making intervals difficult to manage. To address this issue, we introduce HyperInterval (source code available at https://github.com/gmum/HyperInterval), a technique that employs interval arithmetic within the embedding space and utilizes a hypernetwork to map these intervals to the target network parameter space. We train interval embeddings for consecutive tasks and train a hypernetwork to transform these embeddings into weights of the target network. An embedding for a given task is trained along with the hypernetwork, preserving the response of the target network for the previous task embeddings. Interval arithmetic works with a more manageable, lower-dimensional embedding space rather than directly preparing intervals in a high-dimensional weight space. Our model allows faster and more efficient training. Furthermore, HyperInterval maintains the guarantee of not forgetting. At the end of training, we can choose one universal embedding to produce a single network dedicated to all tasks. In such a framework, the hypernetwork is used only for training and, finally, we can utilize one set of weights. HyperInterval obtains significantly better results than InterContiNet and gives SOTA results on several benchmarks.
♻ ☆ Evolving Virtual World with Delta-Engine
In this paper, we focus on the virtual world, a cyberspace in which people can live. An ideal virtual world shares great similarity with our real world. One of its crucial aspects is its evolving nature, reflected by individuals' capability to grow and thereby influence the objective world. Such dynamics are unpredictable and beyond the reach of existing systems. For this, we propose a special engine called Delta-Engine to drive this virtual world. $\Delta$ associates the world's evolution with the engine's scalability. It consists of a base engine and a neural proxy. The base engine programs the prototype of the virtual world; given a trigger, the neural proxy generates new snippets on the base engine through incremental prediction. This paper presents a full-stack introduction to the delta-engine. The key feature of the delta-engine is its scalability to unknown elements within the world. Technically, it derives from the perfect co-working of the neural proxy and the base engine, and the alignment with high-quality data. We introduce an engine-oriented fine-tuning method that embeds the base engine into the proxy. We then discuss the human-LLM collaborative design to produce novel and interesting data efficiently. Eventually, we propose three evaluation principles to comprehensively assess the performance of a delta engine: naive evaluation, incremental evaluation, and adversarial evaluation.
♻ ☆ Stabilizing Extreme Q-learning by Maclaurin Expansion
In offline reinforcement learning, in-sample learning methods have been widely used to prevent performance degradation caused by evaluating out-of-distribution actions from the dataset. Extreme Q-learning (XQL) employs a loss function based on the assumption that Bellman error follows a Gumbel distribution, enabling it to model the soft optimal value function in an in-sample manner. It has demonstrated strong performance in both offline and online reinforcement learning settings. However, issues remain, such as the instability caused by the exponential term in the loss function and the risk of the error distribution deviating from the Gumbel distribution. Therefore, we propose Maclaurin Expanded Extreme Q-learning to enhance stability. In this method, applying Maclaurin expansion to the loss function in XQL enhances stability against large errors. This approach involves adjusting the modeled value function between the value function under the behavior policy and the soft optimal value function, thus achieving a trade-off between stability and optimality depending on the order of expansion. It also enables adjustment of the error distribution assumption from a normal distribution to a Gumbel distribution. Our method significantly stabilizes learning in online RL tasks from DM Control, where XQL was previously unstable. Additionally, it improves performance in several offline RL tasks from D4RL.
comment: Accepted at RLC 2024: The first Reinforcement Learning Conference
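As a hedged illustration of the idea (assuming the standard Gumbel-regression form of the XQL value loss; the paper's exact parameterization may differ), the unbounded exponential term can be Maclaurin-expanded and truncated at a chosen order $n$:

```latex
% Assumed XQL-style loss with inverse temperature beta and normalized error
% z = (Q(s,a) - V(s)) / beta:
\[
  \mathcal{L}(V) \;=\; \mathbb{E}\!\left[\, e^{z} - z - 1 \,\right],
  \qquad z = \frac{Q(s,a) - V(s)}{\beta}.
\]
% Maclaurin expansion of e^z and truncation at order n removes the
% exponential blow-up for large errors; n = 2 recovers an ordinary squared
% error z^2 / 2, while larger n moves the objective back toward the original:
\[
  e^{z} - z - 1 \;=\; \sum_{k \ge 2} \frac{z^{k}}{k!}
  \;\approx\; \sum_{k=2}^{n} \frac{z^{k}}{k!}.
\]
```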
♻ ☆ An Effective Information Theoretic Framework for Channel Pruning
Channel pruning is a promising method for accelerating and compressing convolutional neural networks. However, current pruning algorithms leave unsolved the problems of how to assign layer-wise pruning ratios properly and how to discard the least important channels according to a convincing criterion. In this paper, we present a novel channel pruning approach via information theory and the interpretability of neural networks. Specifically, we regard information entropy as the expected amount of information for convolutional layers. In addition, if we view a matrix as a system of linear equations, a higher-rank matrix indicates that more solutions exist, which implies more uncertainty. From the point of view of information theory, the rank can also describe the amount of information. In a neural network, considering the rank and entropy as two information indicators of convolutional layers, we propose a fusion function to reach a compromise between them, where the fusion results are defined as ``information concentration''. When pre-defining layer-wise pruning ratios, we employ the information concentration as a reference instead of heuristic and engineering tuning to provide a more interpretable solution. Moreover, we leverage Shapley values, which are a potent tool in the interpretability of neural networks, to evaluate the channel contributions and discard the least important channels for model compression while maintaining its performance. Extensive experiments demonstrate the effectiveness and promising performance of our method. For example, our method improves the accuracy by 0.21% when reducing 45.5% of FLOPs and removing 40.3% of parameters for ResNet-56 on CIFAR-10. Moreover, our method incurs Top-1/Top-5 accuracy losses of only 0.43%/0.11% when reducing 41.6% of FLOPs and removing 35.0% of parameters for ResNet-50 on ImageNet.
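As a rough, hedged illustration of how an entropy-and-rank indicator of this kind could be computed for one convolutional layer (the actual fusion function is not specified in the abstract, so the convex-combination form below is purely an assumption):

```python
# Hedged sketch of an "information concentration" style indicator for a
# convolutional layer's activations; the fusion rule here is assumed.
import numpy as np

def activation_entropy(feat: np.ndarray, bins: int = 64) -> float:
    """Entropy of the empirical distribution of activation values."""
    hist, _ = np.histogram(feat, bins=bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def information_concentration(feat_2d: np.ndarray, alpha: float = 0.5) -> float:
    """feat_2d: (channels, spatial) activation matrix of one layer.
    Fuses normalized rank and normalized entropy with weight alpha."""
    r = np.linalg.matrix_rank(feat_2d) / min(feat_2d.shape)
    h = activation_entropy(feat_2d) / np.log(64)
    return alpha * r + (1 - alpha) * h  # assumed fusion: convex combination
```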
♻ ☆ Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model ACM MM 2024
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. Our code and weights are available at https://github.com/xhanxu/Mamba3D.
comment: ACM MM 2024. Code and weights are available at https://github.com/xhanxu/Mamba3D
♻ ☆ RLCP: A Reinforcement Learning-based Copyright Protection Method for Text-to-Image Diffusion Model
The increasing sophistication of text-to-image generative models has led to complex challenges in defining and enforcing copyright infringement criteria and protection. Existing methods, such as watermarking and dataset deduplication, fail to provide comprehensive solutions due to the lack of standardized metrics and the inherent complexity of addressing copyright infringement in diffusion models. To deal with these challenges, we propose a Reinforcement Learning-based Copyright Protection (RLCP) method for text-to-image diffusion models, which minimizes the generation of copyright-infringing content while maintaining the quality of the model-generated dataset. Our approach begins with the introduction of a novel copyright metric grounded in copyright law and court precedents on infringement. We then utilize the Denoising Diffusion Policy Optimization (DDPO) framework to guide the model through a multi-step decision-making process, optimizing it using a reward function that incorporates our proposed copyright metric. Additionally, we employ KL divergence as a regularization term to mitigate some failure modes and stabilize RL fine-tuning. Experiments conducted on three mixed datasets of copyright and non-copyright images demonstrate that our approach significantly reduces copyright infringement risk while maintaining image quality.
comment: arXiv admin note: text overlap with arXiv:2403.12052 by other authors
♻ ☆ Barlow Twins Deep Neural Network for Advanced 1D Drug-Target Interaction Prediction
Accurate prediction of drug-target interactions is critical for advancing drug discovery. By reducing time and cost, machine learning and deep learning can accelerate this laborious discovery process. In a novel approach, BarlowDTI, we utilise the powerful Barlow Twins architecture for feature extraction while considering the structure of the target protein. Our method achieves state-of-the-art predictive performance against multiple established benchmarks using only one-dimensional input. The use of a gradient boosting machine as the underlying predictor ensures fast and efficient predictions without the need for substantial computational resources. We also investigate how the model reaches its decision based on individual training samples. By comparing co-crystal structures, we find that BarlowDTI effectively exploits catalytically active and stabilising residues, highlighting the model's ability to generalise from one-dimensional input data. In addition, we further benchmark new baselines against existing methods. Together, these innovations improve the efficiency and effectiveness of drug-target interaction predictions, providing robust tools for accelerating drug development and deepening the understanding of molecular interactions. Therefore, we provide an easy-to-use web interface that can be freely accessed at https://www.bio.nat.tum.de/oc2/barlowdti .
comment: Refined model architecture, additional results added
♻ ☆ LiveFC: A System for Live Fact-Checking of Audio Streams
The advances in the digital era have led to rapid dissemination of information. This has also aggravated the spread of misinformation and disinformation. This has potentially serious consequences, such as civil unrest. While fact-checking aims to combat this, manual fact-checking is cumbersome and not scalable. While automated fact-checking approaches exist, they do not operate in real-time and do not always account for spread of misinformation through different modalities. This is particularly important as proactive fact-checking on live streams in real-time can help people be informed of false narratives and prevent catastrophic consequences that may cause civil unrest. This is particularly relevant with the rapid dissemination of information through video on social media platforms or other streams like political rallies and debates. Hence, in this work we develop a platform named LiveFC, that can aid in fact-checking live audio streams in real-time. LiveFC has a user-friendly interface that displays the claims detected along with their veracity and evidence for live streams with associated speakers for claims from respective segments. The app can be accessed at http://livefc.factiverse.ai and a screen recording of the demo can be found at https://bit.ly/3WVAoIw.
comment: Under Review, 11 pages
♻ ☆ Qualitative and quantitative analysis of student's perceptions in the use of generative AI in educational environments
The effective integration of generative artificial intelligence in education is a fundamental aspect of preparing future generations. The objective of this study is to analyze, from a quantitative and qualitative point of view, the perception of controlled student-AI interaction within the classroom. This analysis includes assessing the ethical implications and everyday use of AI tools, as well as understanding whether AI tools encourage students to pursue STEM careers. Several points for improvement in education are found, such as the challenge of getting teachers to engage with new technologies and adapt their methods in all subjects, not just those related to technologies.
comment: 17 pages, 7 figures, 4 tables
♻ ☆ DiffLoad: Uncertainty Quantification in Electrical Load Forecasting with the Diffusion Model
Electrical load forecasting plays a crucial role in decision-making for power systems, including unit commitment and economic dispatch. The integration of renewable energy sources and the occurrence of external events, such as the COVID-19 pandemic, have rapidly increased uncertainties in load forecasting. The uncertainties in load forecasting can be divided into two types: epistemic uncertainty and aleatoric uncertainty. Separating these types of uncertainties can help decision-makers better understand where and to what extent the uncertainty is, thereby enhancing their confidence in the following decision-making. This paper proposes a diffusion-based Seq2Seq structure to estimate epistemic uncertainty and employs the robust additive Cauchy distribution to estimate aleatoric uncertainty. Our method not only ensures the accuracy of load forecasting but also demonstrates the ability to separate the two types of uncertainties and be applicable to different levels of loads. The relevant code can be found at \url{https://anonymous.4open.science/r/DiffLoad-4714/}.
comment: Accepted by IEEE Transactions on Power Systems, 2024
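For the aleatoric part of the entry above, a minimal sketch of a Cauchy negative log-likelihood loss is shown below, assuming the standard parameterization with a predicted location and positive scale (variable names are illustrative, not the authors' code):

```python
# Hedged sketch: heavy-tailed Cauchy NLL, a robust alternative to the
# Gaussian NLL for modeling aleatoric uncertainty in load forecasts.
import math
import torch

def cauchy_nll(y: torch.Tensor, loc: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """y: observed load; loc: predicted location; scale: predicted scale (> 0)."""
    z = (y - loc) / scale
    return (math.log(math.pi) + torch.log(scale) + torch.log1p(z ** 2)).mean()
```

Minimizing this loss fits both the point forecast (loc) and a per-sample scale, which is one common way to expose aleatoric uncertainty.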
♻ ☆ Generalizing Fairness to Generative Language Models via Reformulation of Non-discrimination Criteria
Generative AI, such as large language models, has undergone rapid development within recent years. As these models become increasingly available to the public, concerns arise about perpetuating and amplifying harmful biases in applications. Gender stereotypes can be harmful and limiting for the individuals they target, whether they consist of misrepresentation or discrimination. Recognizing gender bias as a pervasive societal construct, this paper studies how to uncover and quantify the presence of gender biases in generative language models. In particular, we derive generative AI analogues of three well-known non-discrimination criteria from classification, namely independence, separation and sufficiency. To demonstrate these criteria in action, we design prompts for each of the criteria with a focus on occupational gender stereotype, specifically utilizing the medical test to introduce the ground truth in the generative AI context. Our results address the presence of occupational gender bias within such conversational language models.
♻ ☆ Neural Dehydration: Effective Erasure of Black-box Watermarks from DNNs with Limited Data CCS '24
To protect the intellectual property of well-trained deep neural networks (DNNs), black-box watermarks, which are embedded into the prediction behavior of DNN models on a set of specially-crafted samples and extracted from suspect models using only API access, have gained increasing popularity in both academia and industry. Watermark robustness is usually implemented against attackers who steal the protected model and obfuscate its parameters for watermark removal. However, current robustness evaluations are primarily performed under moderate attacks or unrealistic settings. Existing removal attacks can only crack a small subset of the mainstream black-box watermarks, and fall short in four key aspects: incomplete removal, reliance on prior knowledge of the watermark, performance degradation, and high dependency on data. In this paper, we propose a watermark-agnostic removal attack called Neural Dehydration (abbreviated as Dehydra), which effectively erases all ten mainstream black-box watermarks from DNNs, with only limited or even no data dependence. In general, our attack pipeline exploits the internals of the protected model to recover and unlearn the watermark message. We further design target class detection and recovered sample splitting algorithms to reduce the utility loss and achieve data-free watermark removal on five of the watermarking schemes. We conduct a comprehensive evaluation of Dehydra against ten mainstream black-box watermarks on three benchmark datasets and DNN architectures. Compared with existing removal attacks, Dehydra achieves strong removal effectiveness across all the covered watermarks, preserving at least 90% of the stolen model utility, under data-limited settings, i.e., with less than 2% of the training data or even data-free.
comment: 20 pages, ver 2.0, accepted by CCS '24
♻ ☆ Physics simulation capabilities of LLMs
[Abridged abstract] Large Language Models (LLMs) can solve some undergraduate-level to graduate-level physics textbook problems and are proficient at coding. Combining these two capabilities could one day enable AI systems to simulate and predict the physical world. We present an evaluation of state-of-the-art (SOTA) LLMs on PhD-level to research-level computational physics problems. We condition LLM generation on the use of well-documented and widely-used packages to elicit coding capabilities in the physics and astrophysics domains. We contribute about 50 original and challenging problems in celestial mechanics (with REBOUND), stellar physics (with MESA), 1D fluid dynamics (with Dedalus) and non-linear dynamics (with SciPy). Since our problems do not admit unique solutions, we evaluate LLM performance on several soft metrics: counts of lines that contain different types of errors (coding, physics, necessity and sufficiency) as well as a more "educational" Pass-Fail metric focused on capturing the salient physical ingredients of the problem at hand. As expected, today's SOTA LLM (GPT4) zero-shot fails most of our problems, although about 40% of the solutions could plausibly get a passing grade. About 70-90% of the code lines produced are necessary, sufficient and correct (coding & physics). Physics and coding errors are the most common, with some unnecessary or insufficient lines. We observe significant variations across problem class and difficulty. We identify several failure modes of GPT4 in the computational physics domain. Our reconnaissance work provides a snapshot of current computational capabilities in classical physics and points to obvious improvement targets if AI systems are ever to reach a basic level of autonomy in physics simulation capabilities.
comment: Accepted for publication in Physica Scripta. Abridged abstract. 15 pages + appendix, 1 figure. Comments are welcome
♻ ☆ On Learning Action Costs from Input Plans
Most of the work on learning action models focuses on learning the actions' dynamics from input plans. This allows us to specify the valid plans of a planning task. However, very little work focuses on learning action costs, which in turn allows us to rank the different plans. In this paper we introduce a new problem: that of learning the costs of a set of actions such that a set of input plans are optimal under the resulting planning model. To solve this problem we present $LACFIP^k$, an algorithm to learn action costs from unlabeled input plans. We provide theoretical and empirical results showing how $LACFIP^k$ can successfully solve this task.
♻ ☆ ACE, a generic constraint solver
Constraint Programming (CP) is a useful technology for modeling and solving combinatorial constrained problems. On the one hand, one can use a library like PyCSP3 for easily modeling problems arising in various application fields (e.g., scheduling, planning, data-mining, cryptography, bio-informatics, organic chemistry, etc.). Problem instances can then be directly generated from specific models and data. On the other hand, for solving instances (notably, represented in XCSP3 format), one can use a constraint solver like ACE, which is presented in this paper. ACE is an open-source constraint solver, developed in Java, which focuses on integer variables (including 0/1-Boolean variables), state-of-the-art table constraints, popular global constraints, search heuristics and (mono-criterion) optimization.
♻ ☆ FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models
Large language models (LLMs) excel in generating coherent text, but they often struggle with context awareness, leading to inaccuracies in tasks requiring faithful adherence to provided information. We introduce FastMem, a novel method designed to enhance instruction fine-tuned LLMs' context awareness through fast memorization of the prompt. FastMem maximizes the likelihood of the prompt before inference by fine-tuning only the last Feed-Forward Network (FFN) module. This targeted approach ensures efficient optimization without overfitting, significantly improving the model's ability to comprehend and accurately follow the context. Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures. For instance, FastMem improves the accuracy of Llama 3-8B-Inst on the NQ-SWAP dataset from 59.1% to 71.6%, and reduces the output structure failure rate of Qwen 1.5-4B-Chat from 34.9% to 25.5%. Extensive experimental results highlight FastMem's potential to offer a robust solution to enhance the reliability and accuracy of LLMs in various applications. Our code is available at: https://github.com/IAAR-Shanghai/FastMem
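A minimal sketch of the core idea (fine-tuning only the last feed-forward block on the prompt's own likelihood before answering) is given below. The checkpoint name and the attribute path model.model.layers[-1].mlp are assumptions specific to Llama-style models in Hugging Face transformers; the authors' released code may organize this differently.

```python
# Hedged sketch of FastMem-style prompt memorization (not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"   # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

for p in model.parameters():                   # freeze everything ...
    p.requires_grad_(False)
last_ffn = model.model.layers[-1].mlp          # ... except the last FFN block
for p in last_ffn.parameters():
    p.requires_grad_(True)

ids = tok("<the prompt/context to memorize>", return_tensors="pt").input_ids
opt = torch.optim.AdamW(last_ffn.parameters(), lr=1e-4)
for _ in range(5):                             # a few memorization steps
    loss = model(ids, labels=ids).loss         # maximize prompt likelihood
    loss.backward()
    opt.step()
    opt.zero_grad()
# The adapted model is then used to answer the actual query as usual.
```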
♻ ☆ Exploring and Learning Structure: Active Inference Approach in Navigational Agents
Drawing inspiration from animal navigation strategies, we introduce a novel computational model for navigation and mapping, rooted in biologically inspired principles. Animals exhibit remarkable navigation abilities by efficiently using memory, imagination, and strategic decision-making to navigate complex and aliased environments. Building on these insights, we integrate traditional cognitive mapping approaches with an Active Inference Framework (AIF) to learn an environment structure in a few steps. Through the incorporation of topological mapping for long-term memory and AIF for navigation planning and structure learning, our model can dynamically apprehend environmental structures and expand its internal map with predicted beliefs during exploration. Comparative experiments with the Clone-Structured Graph (CSCG) model highlight our model's ability to rapidly learn environmental structures in a single episode, with minimal navigation overlap. This is achieved without prior knowledge of the dimensions of the environment or the type of observations, showcasing its robustness and effectiveness in navigating ambiguous environments.
comment: IWAI workshop 2024
♻ ☆ Classification of Heart Sounds Using Multi-Branch Deep Convolutional Network and LSTM-CNN
This paper presents a fast and cost-effective method for diagnosing cardiac abnormalities with high accuracy and reliability using low-cost systems in clinics. The primary limitation of automatic diagnosis of cardiac diseases is the rarity of correct and acceptable labeled samples, which can be expensive to prepare. To address this issue, two methods are proposed in this work. The first method is a unique Multi-Branch Deep Convolutional Neural Network (MBDCN) architecture inspired by human auditory processing, specifically designed to optimize feature extraction by employing various sizes of convolutional filters and the audio signal power spectrum as input. The second method, the Long Short-Term Memory-Convolutional Network (LSCN) model, additionally includes Long Short-Term Memory (LSTM) network blocks in the architecture to improve feature extraction in the time domain. The innovative approach of combining multiple parallel branches of one-dimensional convolutional layers with LSTM blocks helps achieve superior results in audio signal processing tasks. The experimental results demonstrate the superiority of the proposed methods over state-of-the-art techniques. The overall classification accuracy of heart sounds with the LSCN network is more than 96%. The efficiency of this network is significant compared to common feature extraction methods such as Mel Frequency Cepstral Coefficients (MFCC) and wavelet transform. Therefore, the proposed method shows promising results in the automatic analysis of heart sounds and has potential applications in the diagnosis and early detection of cardiovascular diseases.
comment: 19 pages, "Submitted to Biomedical Signal Processing and Control"
♻ ☆ FRDiff : Feature Reuse for Universal Training-free Acceleration of Diffusion Models ECCV 2024
The substantial computational costs of diffusion models, especially due to the repeated denoising steps necessary for high-quality image generation, present a major obstacle to their widespread adoption. While several studies have attempted to address this issue by reducing the number of score function evaluations (NFE) using advanced ODE solvers without fine-tuning, the decreased number of denoising iterations misses the opportunity to update fine details, resulting in noticeable quality degradation. In our work, we introduce an advanced acceleration technique that leverages the temporal redundancy inherent in diffusion models. Reusing feature maps with high temporal similarity opens up a new opportunity to save computation resources without compromising output quality. To realize the practical benefits of this intuition, we conduct an extensive analysis and propose a novel method, FRDiff. FRDiff is designed to harness the advantages of both reduced NFE and feature reuse, achieving a Pareto frontier that balances fidelity and latency trade-offs in various generative tasks.
comment: Accepted at ECCV 2024. Code : https://github.com/ECoLab-POSTECH/FRDiff
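A hedged sketch of the feature-reuse intuition follows; the cosine-similarity test, threshold, and caching scheme are illustrative assumptions, and the released FRDiff code should be consulted for the actual mechanism.

```python
# Hedged sketch: reuse a block's cached output across adjacent denoising
# steps when its input features barely change (high temporal redundancy).
import torch
import torch.nn.functional as F

def maybe_reuse(block, feat_t, cache, threshold=0.95):
    """block: nn.Module; feat_t: (B, C, H, W) input features at step t;
    cache: dict carrying the previous step's input/output for this block."""
    if "in" in cache:
        sim = F.cosine_similarity(feat_t.flatten(1),
                                  cache["in"].flatten(1), dim=1).mean()
        if sim > threshold:
            return cache["out"]          # skip recomputation, reuse output
    out = block(feat_t)                  # recompute and refresh the cache
    cache["in"], cache["out"] = feat_t.detach(), out.detach()
    return out
```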
♻ ☆ Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems
Artificial Intelligence (AI) has paved the way for revolutionary decision-making processes, which if harnessed appropriately, can contribute to advancements in various sectors, from healthcare to economics. However, its black box nature presents significant ethical challenges related to bias and transparency. AI applications are hugely impacted by biases, presenting inconsistent and unreliable findings, leading to significant costs and consequences, highlighting and perpetuating inequalities and unequal access to resources. Hence, developing safe, reliable, ethical, and Trustworthy AI systems is essential. Our team of researchers working with Trustworthy and Responsible AI, part of the Transdisciplinary Scholarship Initiative within the University of Calgary, conducts research on Trustworthy and Responsible AI, including fairness, bias mitigation, reproducibility, generalization, interpretability, and authenticity. In this paper, we review and discuss the intricacies of AI biases, definitions, methods of detection and mitigation, and metrics for evaluating bias. We also discuss open challenges with regard to the trustworthiness and widespread application of AI across diverse domains of human-centric decision making, as well as guidelines to foster Responsible and Trustworthy AI models.
comment: 44 pages, 2 figures
♻ ☆ OriGen:Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection
Recent studies have demonstrated the significant potential of Large Language Models (LLMs) in generating Register Transfer Level (RTL) code, with notable advancements showcased by commercial models such as GPT-4 and Claude3-Opus. However, these proprietary LLMs often raise concerns regarding privacy and security. While open-source LLMs offer solutions to these concerns, they typically underperform commercial models in RTL code generation tasks, primarily due to the scarcity of high-quality open-source RTL datasets. To address this challenge, we introduce OriGen, a fully open-source framework that incorporates self-reflection capabilities and a novel dataset augmentation methodology for generating high-quality, large-scale RTL code. Our approach employs a code-to-code augmentation technique to enhance the quality of open-source RTL code datasets. Furthermore, OriGen can rectify syntactic errors through a self-reflection process that leverages compiler feedback. Experimental results demonstrate that OriGen significantly outperforms other open-source alternatives in RTL code generation. It surpasses the previous best-performing open-source LLM by 12.8% and even exceeds GPT-4 Turbo in the pass@1 metric on the VerilogEval-Human benchmark. Moreover, OriGen exhibits superior capabilities in self-reflection and error correction, outperforming GPT-4 by 19.9% on a benchmark designed to evaluate self-reflection capabilities.
♻ ☆ ERATTA: Extreme RAG for Table To Answers with Large Language Models
Large language models (LLMs) with retrieval-augmented generation (RAG) have been the optimal choice for scalable generative AI solutions in the recent past. Although RAG implemented with AI agents (agentic-RAG) has recently been popularized, it suffers from unstable costs and unreliable performance for Enterprise-level data practices. Most existing use-cases that incorporate RAG with LLMs have been either generic or extremely domain specific, thereby questioning the scalability and generalizability of RAG-LLM approaches. In this work, we propose a unique LLM-based system where multiple LLMs can be invoked to enable data authentication, user-query routing, data-retrieval and custom prompting for question-answering capabilities from Enterprise-data tables. The source tables here are highly fluctuating and large in size, and the proposed framework enables structured responses in under 10 seconds per query. Additionally, we propose a five-metric scoring module that detects and reports hallucinations in the LLM responses. Our proposed system and scoring metrics achieve >90% confidence scores across hundreds of user queries in the sustainability, financial health and social media domains. Extensions to the proposed extreme RAG architectures can enable heterogeneous source querying using LLMs.
comment: 5 pages, 4 tables, IEEE Big Data, 2024
♻ ☆ Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
Existing pedestrian attribute recognition (PAR) algorithms adopt pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to insufficient exploitation of the relations between pedestrian images and attribute labels. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels. Specifically, the attribute phrases are first expanded into sentences, and then the pre-trained vision-language model CLIP is adopted as our backbone for feature embedding of visual images and attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture the long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the dual features effectively and a feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique to adjust very few parameters (i.e., only the prompt vectors and classification heads) and fix both the pre-trained VL model and multi-modal Transformer. Our proposed PAR algorithm only adjusts 0.75% learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance on both standard and zero-shot settings for PAR, including RAPv1, RAPv2, WIDER, PA100K, and PETA-ZS, RAP-ZS datasets. The source code and pre-trained models will be released on https://github.com/Event-AHU/OpenPAR.
comment: Accepted by IEEE TCSVT 2024, Camera Ready Version
♻ ☆ Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities
Imaging techniques such as Chest X-rays, whole slide images, and optical coherence tomography serve as the initial screening and detection for a wide variety of medical pulmonary and ophthalmic conditions respectively. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families, each with pretraining and random initialization. Our findings showed that the use of pretrained models as fixed feature extractors yields poor performance irrespective of the datasets. In contrast, histopathology microscopy whole slide images showed better performance. It was also found that deeper and more complex architectures did not necessarily result in the best performance. This observation implies that the improvements on ImageNet are not paralleled in medical imaging tasks. Within a medical domain, the performance of the network architectures varies within model families with shifts in datasets. This indicates that the performance of models within a specific modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.
comment: 15 pages, 3 figures, 4 tables
♻ ☆ Instant Adversarial Purification with Adversarial Consistency Distillation
Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve a defense success rate of 74.19% on ImageNet, requiring only 0.1s for each purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that GAND does not need a Full Fine Tune (FFT); PEFT, e.g., LoRA, is sufficient.
♻ ☆ MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents
Machine learning research, crucial for technological advancements and innovation, often faces significant challenges due to its inherent complexity, slow pace of experimentation, and the necessity for specialized expertise. Motivated by this, we present a new systematic framework, autonomous Machine Learning Research with large language models (MLR-Copilot), designed to enhance machine learning research productivity through the automatic generation and implementation of research ideas using Large Language Model (LLM) agents. The framework consists of three phases: research idea generation, experiment implementation, and implementation execution. First, existing research papers are used to generate hypotheses and experimental plans via IdeaAgent, powered by LLMs. Next, the implementation generation phase translates these plans into executables with ExperimentAgent. This phase leverages retrieved prototype code and optionally retrieves candidate models and data. Finally, the execution phase, also managed by ExperimentAgent, involves running experiments with mechanisms for human feedback and iterative debugging to enhance the likelihood of achieving executable research outcomes. We evaluate our framework on five machine learning research tasks and the experimental results show the framework's potential to facilitate research progress and innovation.
♻ ☆ Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation ICPR 2024
In this work, we study the task of ``visually'' translating scene text from a source language (e.g., Hindi) to a target language (e.g., English). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the source scene text, such as font, size, and background. There are several challenges associated with this task, such as translation with limited context, deciding between translation and transliteration, accommodating varying text lengths within fixed spatial boundaries, and preserving the font and background styles of the source scene text in the target language. To address this problem, we make the following contributions: (i) We study visual translation as a standalone problem for the first time in the literature. (ii) We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task. (iii) We propose a set of task-specific design enhancements to design a variant of the baseline to obtain performance improvements. (iv) Currently, the existing related literature lacks any comprehensive performance evaluation for this novel task. To fill this gap, we introduce several automatic and user-assisted evaluation metrics designed explicitly for evaluating visual translation. Further, we evaluate presented baselines for translating scene text between Hindi and English. Our experiments demonstrate that although we can effectively perform visual translation over a large collection of scene text images, the presented baseline only partially addresses challenges posed by visual translation tasks. We firmly believe that this new task and the limitations of existing models, as reported in this paper, should encourage further research in visual translation.
comment: Accepted at ICPR 2024, Project Website: https://vl2g.github.io/projects/visTrans/
♻ ☆ An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models ECCV 2024
In this study, we identify inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical value for the deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
comment: Accepted to ECCV 2024 (Oral), code is released at https://github.com/pkunlp-icler/FastV,
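A hedged sketch of attention-based visual-token pruning is shown below (batch size 1 for clarity). The layer at which pruning happens, the keep ratio, and the ranking rule are illustrative assumptions rather than the exact procedure from the paper or its released code.

```python
# Hedged sketch: after an early layer, keep only the visual tokens that
# receive the most attention, shortening the sequence for later layers.
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_end, keep_ratio=0.5):
    """hidden: (1, T, D) hidden states; attn: (1, H, T, T) attention probs;
    positions [vis_start, vis_end) hold the visual tokens."""
    # average attention received by each visual token (over heads and queries)
    recv = attn[0].mean(dim=0).mean(dim=0)[vis_start:vis_end]      # (n_vis,)
    n_keep = max(1, int(recv.numel() * keep_ratio))
    keep_vis = recv.topk(n_keep).indices + vis_start
    others = torch.cat([torch.arange(0, vis_start),
                        torch.arange(vis_end, hidden.shape[1])])
    keep = torch.cat([others, keep_vis]).sort().values
    return hidden[:, keep, :]
```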
♻ ☆ A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse
Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raise concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has extended these techniques as a benign metric to prevent the underlying misuse of generative AI. Current approaches to safeguarding images from manipulation by LDMs are limited by their reliance on model-specific knowledge and their inability to significantly degrade the semantic quality of generated images. In response to these shortcomings, we propose the Posterior Collapse Attack (PCA), based on the observation that VAEs suffer from posterior collapse during training. Our method minimizes dependence on white-box information of target models, removing the implicit reliance on model-specific knowledge. By accessing merely a small amount of LDM parameters, specifically only the VAE encoder of LDMs, our method causes a substantial semantic collapse in generation quality, particularly in perceptual consistency, and demonstrates strong transferability across various model architectures. Experimental results show that PCA achieves superior perturbation effects on image generation of LDMs with lower runtime and VRAM. Our method outperforms existing techniques, offering a more robust and generalizable solution that helps alleviate the socio-technical challenges posed by the rapidly evolving landscape of generative AI.
comment: 21 pages, 7 figures, 10 tables
♻ ☆ Pearl: A Production-ready Reinforcement Learning Agent
Reinforcement learning (RL) is a versatile framework for optimizing long-term goals. Although many real-world problems can be formalized with RL, learning and deploying a performant RL policy requires a system designed to address several important challenges, including the exploration-exploitation dilemma, partial observability, dynamic action spaces, and safety concerns. While the importance of these challenges has been well recognized, existing open-source RL libraries do not explicitly address them. This paper introduces Pearl, a Production-Ready RL software package designed to embrace these challenges in a modular way. In addition to presenting benchmarking results, we also highlight examples of Pearl's ongoing industry adoption to demonstrate its advantages for production use cases. Pearl is open sourced on GitHub at github.com/facebookresearch/pearl and its official website is pearlagent.github.io.
♻ ☆ An Idiosyncrasy of Time-discretization in Reinforcement Learning
Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessitating the study of how the choice of discretization may affect a reinforcement learning algorithm. In this work, we consider the relationship between the definitions of the continuous-time and discrete-time returns. Specifically, we acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment, and note how a simple modification can better align the return definitions. This observation is of practical consideration when dealing with environments where time-discretization granularity is a choice, or situations where such granularity is inherently stochastic.
comment: RLC 2024
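As a concrete illustration of the idiosyncrasy (a sketch under standard assumptions, not necessarily the paper's exact construction): a time-discretized return only reproduces the continuous-time objective if both the discount and the reward are rescaled with the step size $\Delta t$,

$$G_{\Delta t} = \sum_{k=0}^{\infty} \gamma_{\Delta t}^{\,k}\, r(t_k)\,\Delta t \;\xrightarrow[\Delta t \to 0]{}\; \int_{0}^{\infty} e^{-t/\tau}\, r(t)\, dt, \qquad \gamma_{\Delta t} = e^{-\Delta t/\tau}.$$

Keeping a fixed $\gamma$ and unscaled rewards instead makes the numerical value of the return, and hence the learned value function, depend on the chosen granularity.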
♻ ☆ Unicorn: U-Net for Sea Ice Forecasting with Convolutional Neural Ordinary Differential Equations
Sea ice at the North Pole is vital to global climate dynamics. However, accurately forecasting sea ice poses a significant challenge due to the intricate interaction among multiple variables. Leveraging their ability to integrate multiple inputs seamlessly and their strong predictive performance, many studies have turned to neural networks for sea ice forecasting. This paper introduces a novel deep architecture named Unicorn, designed to forecast weekly sea ice. Our model integrates multiple time series images within its architecture to enhance its forecasting performance. Moreover, we incorporate a bottleneck layer within the U-Net architecture, serving as neural ordinary differential equations with convolution operations, to capture the spatiotemporal dynamics of latent variables. Through real data analysis with datasets spanning from 1998 to 2021, our proposed model demonstrates significant improvements over state-of-the-art models in the sea ice concentration forecasting task. It achieves an average MAE improvement of 12% compared to benchmark models. Additionally, our method outperforms existing approaches in sea ice extent forecasting, achieving a classification performance improvement of approximately 18%. These experimental results show the superiority of our proposed model.
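A minimal sketch of a convolutional neural-ODE bottleneck of the kind Unicorn places inside a U-Net. A fixed-step Euler solver stands in for a full ODE solver; channel counts and step counts are illustrative assumptions, not the paper's values.

```python
# Sketch of a conv neural-ODE bottleneck (assumed sizes; Euler integration stands in for an ODE solver).
import torch
import torch.nn as nn

class ConvODEFunc(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GroupNorm(8, channels), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, t, h):           # dh/dt = f(t, h)
        return self.net(h)

class ODEBottleneck(nn.Module):
    def __init__(self, channels=256, steps=4, t1=1.0):
        super().__init__()
        self.func, self.steps, self.t1 = ConvODEFunc(channels), steps, t1

    def forward(self, h):
        dt = self.t1 / self.steps
        t = 0.0
        for _ in range(self.steps):    # explicit Euler integration of the latent dynamics
            h = h + dt * self.func(t, h)
            t += dt
        return h

bottleneck = ODEBottleneck()
latent = torch.randn(2, 256, 28, 28)   # encoder output of a hypothetical U-Net
print(bottleneck(latent).shape)
```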
♻ ☆ CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models
With the profound development of large language models (LLMs), their safety concerns have garnered increasing attention. However, there is a scarcity of Chinese safety benchmarks for LLMs, and the existing safety taxonomies are inadequate, lacking comprehensive safety detection capabilities in authentic Chinese scenarios. In this work, we introduce CHiSafetyBench, a dedicated safety benchmark for evaluating LLMs' capabilities in identifying risky content and refusing to answer risky questions in Chinese contexts. CHiSafetyBench incorporates a dataset that covers a hierarchical Chinese safety taxonomy consisting of 5 risk areas and 31 categories. This dataset comprises two types of tasks: multiple-choice questions and question answering, evaluating LLMs from the perspectives of risk content identification and the ability to refuse to answer risky questions, respectively. Utilizing this benchmark, we validate the feasibility of automatic evaluation as a substitute for human evaluation and conduct comprehensive automatic safety assessments on mainstream Chinese LLMs. Our experiments reveal the varying performance of different models across various safety domains, indicating that all models possess considerable potential for improvement in Chinese safety capabilities. Our dataset is publicly available at https://github.com/UnicomAI/UnicomBenchmark/tree/main/CHiSafetyBench.
comment: 16 pages, 5 figures
♻ ☆ United We Stand: Decentralized Multi-Agent Planning With Attrition ECAI 2024
Decentralized planning is a key element of cooperative multi-agent systems for information gathering tasks. However, despite the high frequency of agent failures in realistic large deployment scenarios, current approaches perform poorly in the presence of failures, either by not converging at all or by making very inefficient use of resources (e.g., energy). In this work, we propose Attritable MCTS (A-MCTS), a decentralized MCTS algorithm capable of timely and efficient adaptation to changes in the set of active agents. It is based on the use of a global reward function for the estimation of each agent's local contribution, and on regret matching for coordination. We evaluate its effectiveness in realistic data-harvesting problems under different scenarios. We show both theoretically and experimentally that A-MCTS enables efficient adaptation even under high failure rates. Results suggest that, in the presence of frequent failures, our solution improves substantially over the best existing approaches in terms of global utility and scalability.
comment: To appear in ECAI 2024
♻ ☆ Persuasion Games using Large Language Models
Large Language Models (LLMs) have emerged as formidable instruments capable of comprehending and producing human-like text. This paper explores the potential of LLMs to shape user perspectives and subsequently influence their decisions on particular tasks. This capability finds applications in diverse domains such as investment, credit cards, insurance, and retail, wherein LLMs assist users in selecting appropriate insurance policies, investment plans, or credit cards, as well as in Behavioral Change Support Systems (BCSS). We present a sophisticated multi-agent framework wherein a consortium of agents operates in a collaborative manner. The primary agent engages directly with user agents through persuasive dialogue, while the auxiliary agents perform tasks such as information retrieval, response analysis, development of persuasion strategies, and validation of facts. Empirical evidence from our experiments demonstrates that this collaborative methodology significantly enhances the persuasive efficacy of the LLM. We continuously analyze the resistance of the user agent to persuasive efforts and counteract it by employing a combination of rule-based and LLM-based resistance-persuasion mapping techniques. We employ simulated personas and generate conversations in insurance, banking, and retail domains to evaluate the proficiency of LLMs in recognizing, adjusting to, and influencing various personality types. Concurrently, we examine the resistance mechanisms employed by LLM-simulated personas. Persuasion is quantified via measurable surveys before and after interaction, LLM-generated scores on conversations, and user decisions (purchase or non-purchase).
♻ ☆ NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches
Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating, as malnutrition has been directly linked to decreased quality of life. However, self-reporting methods such as food diaries suffer from substantial bias. Other conventional dietary assessment techniques and emerging alternative approaches such as mobile applications incur high time costs and may necessitate trained personnel. Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images, but the lack of comprehensive datasets with diverse viewpoints, modalities and food annotations hinders the accuracy and realism of such methods. To address this limitation, we introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 photorealistic synthetic 2D food images with associated dietary information and multimodal annotations (including depth images, instance masks, and semantic masks). Additionally, we collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism. Leveraging these novel datasets, we develop and benchmark NutritionVerse, an empirical study of various dietary intake estimation approaches, including indirect segmentation-based and direct prediction networks. We further fine-tune models pretrained on synthetic data with real images to provide insights into the fusion of synthetic and real data. Finally, we release both datasets (NutritionVerse-Synth, NutritionVerse-Real) on https://www.kaggle.com/nutritionverse/datasets as part of an open initiative to accelerate machine learning for dietary sensing.
comment: Corrections made to Tables 6, 7, and 8, and corrections made to Experiments Part C. Additional clarification made in Section 4
♻ ☆ From Wide to Deep: Dimension Lifting Network for Parameter-efficient Knowledge Graph Embedding
Knowledge graph embedding (KGE), which maps entities and relations into vector representations, is essential for downstream applications. Conventional KGE methods require high-dimensional representations to learn the complex structure of knowledge graphs, but this leads to oversized model parameters. Recent advances reduce parameters by adopting low-dimensional entity representations, while developing techniques (e.g., knowledge distillation or reinvented representation forms) to compensate for the reduced dimension. However, such operations introduce complicated computations and model designs that may not benefit large knowledge graphs. To seek a simple strategy to improve the parameter efficiency of conventional KGE models, we take inspiration from the observation that, for compositional structures, deeper neural networks require exponentially fewer parameters than wider networks to achieve comparable expressiveness. We view all entity representations as a single-layer embedding network; conventional KGE methods that adopt high-dimensional entity representations are thus equivalent to widening the embedding network to gain expressiveness. To achieve parameter efficiency, we instead propose a deeper embedding network for entity representations, i.e., a narrow entity embedding layer plus a multi-layer dimension lifting network (LiftNet). Experiments on three public datasets show that by integrating LiftNet, four conventional KGE methods with 16-dimensional representations achieve link prediction accuracy comparable to the original models that adopt 512-dimensional representations, saving 68.4% to 96.9% of parameters.
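A minimal sketch of the "narrow embedding plus dimension lifting" idea: entities are stored in a 16-dimensional table and lifted to 512 dimensions by a small multi-layer network before scoring. Layer sizes and the TransE-style scorer are illustrative assumptions, not the paper's exact design.

```python
# LiftNet-style embedding sketch (assumed layer sizes and scorer).
import torch
import torch.nn as nn

class LiftNetEmbedding(nn.Module):
    def __init__(self, num_entities, narrow_dim=16, lifted_dim=512):
        super().__init__()
        self.entity = nn.Embedding(num_entities, narrow_dim)      # parameter-efficient table
        self.lift = nn.Sequential(                                # deep dimension lifting network
            nn.Linear(narrow_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, lifted_dim),
        )

    def forward(self, entity_ids):
        return self.lift(self.entity(entity_ids))

emb = LiftNetEmbedding(num_entities=100_000)
rel = nn.Embedding(500, 512)                                      # relations stay wide in this sketch
h, r, t = emb(torch.tensor([3])), rel(torch.tensor([7])), emb(torch.tensor([42]))
score = -(h + r - t).norm(p=1, dim=-1)                            # TransE-style plausibility score
print(score)
```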
♻ ☆ Towards Non-invasive and Personalized Management of Breast Cancer Patients from Multiparametric MRI via A Large Mixture-of-Modality-Experts Model
Breast magnetic resonance imaging (MRI) is the imaging technique with the highest sensitivity for detecting breast cancer and is routinely used for women at high risk. Despite the comprehensive multiparametric protocol of breast MRI, existing artificial intelligence-based studies predominantly rely on single sequences and have limited validation. Here we report a large mixture-of-modality-experts model (MOME) that integrates multiparametric MRI information within a unified structure, offering a noninvasive method for personalized breast cancer management. We have curated the largest multiparametric breast MRI dataset, involving 5,205 patients from three hospitals in the north, southeast, and southwest of China, for the development and extensive evaluation of our model. MOME demonstrated accurate and robust identification of breast cancer. It achieved comparable performance for malignancy recognition to that of four senior radiologists and significantly outperformed a junior radiologist, with 0.913 AUROC, 0.948 AUPRC, 0.905 F1 score, and 0.723 MCC. Our findings suggest that MOME could reduce the need for biopsies in BI-RADS 4 patients with a ratio of 7.3%, classify triple-negative breast cancer with an AUROC of 0.709, and predict pathological complete response to neoadjuvant chemotherapy with an AUROC of 0.694. The model further supports scalable and interpretable inference, adapting to missing modalities and providing decision explanations by highlighting lesions and measuring modality contributions. MOME exemplifies a discriminative, robust, scalable, and interpretable multimodal model, paving the way for noninvasive, personalized management of breast cancer patients based on multiparametric breast imaging data.
comment: 27 pages, 8 figures, 10 tables
Machine Learning 52
♻ ☆ Advanced Predictive Modeling for Enhanced Mortality Prediction in ICU Stroke Patients Using Clinical Data
Background: Stroke is the second-leading cause of disability and death among adults. Approximately 17 million people suffer a stroke annually, with about 85% being ischemic strokes. Predicting the mortality of ischemic stroke patients in the intensive care unit (ICU) is crucial for optimizing treatment strategies, allocating resources, and improving survival rates. Methods: We acquired data on ICU ischemic stroke patients from the MIMIC-IV database, including diagnoses, vital signs, laboratory tests, medications, procedures, treatments, and clinical notes. Stroke patients were randomly divided into training (70%, n=2441), test (15%, n=523), and validation (15%, n=523) sets. To address data imbalance, we applied the Synthetic Minority Over-sampling Technique (SMOTE). We selected 30 features for model development, significantly reducing the feature count from the 1,095 used in the best prior study. We developed a deep learning model to assess mortality risk and implemented several baseline machine learning models for comparison. Results: The XGB-DL model, which combines XGBoost for feature selection with deep learning, effectively minimized false positives. The model's AUROC improved from 0.865 (95% CI: 0.821 - 0.905) on the first day to 0.903 (95% CI: 0.868 - 0.936) by the fourth day, using data from 3,646 ICU mortality patients in the MIMIC-IV database, with a training AUROC of 0.945 (95% CI: 0.944 - 0.947). Although other ML models also performed well in terms of AUROC, we chose deep learning for its higher specificity. Conclusions: Through enhanced feature selection and data cleaning, the proposed model demonstrates a 13% AUROC improvement over existing models while reducing the feature count from the 1,095 used in previous studies to 30.
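A sketch of the XGB-DL recipe described above: oversample the minority class with SMOTE, rank features with XGBoost, keep the top 30, then train a deeper classifier. Synthetic data stands in for MIMIC-IV, and all hyperparameters are illustrative assumptions.

```python
# SMOTE + XGBoost feature selection + neural classifier (synthetic stand-in data).
import numpy as np
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 1095))                      # 1,095 raw features, as in prior work
y = (rng.random(3000) < 0.15).astype(int)              # imbalanced mortality labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

xgb = XGBClassifier(n_estimators=200, max_depth=4).fit(X_bal, y_bal)
top30 = np.argsort(xgb.feature_importances_)[-30:]     # keep the 30 most informative features

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
mlp.fit(X_bal[:, top30], y_bal)
print("AUROC:", roc_auc_score(y_te, mlp.predict_proba(X_te[:, top30])[:, 1]))
```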
♻ ☆ AlphaFold Meets Flow Matching for Generating Protein Ensembles ICML 2024
The biological functions of proteins often depend on dynamic structural ensembles. In this work, we develop a flow-based generative modeling approach for learning and sampling the conformational landscapes of proteins. We repurpose highly accurate single-state predictors such as AlphaFold and ESMFold and fine-tune them under a custom flow matching framework to obtain sequence-conditioned generative models of protein structure called AlphaFlow and ESMFlow. When trained and evaluated on the PDB, our method provides a superior combination of precision and diversity compared to AlphaFold with MSA subsampling. When further trained on ensembles from all-atom MD, our method accurately captures conformational flexibility, positional distributions, and higher-order ensemble observables for unseen proteins. Moreover, our method can diversify a static PDB structure with faster wall-clock convergence to certain equilibrium properties than replicate MD trajectories, demonstrating its potential as a proxy for expensive physics-based simulations. Code is available at https://github.com/bjing2016/alphaflow.
comment: ICML 2024
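For context, the generic linear-interpolant flow matching objective underlying this kind of fine-tuning looks as follows (a sketch; AlphaFlow's custom framework and parameterization may differ in detail):

$$x_t = (1-t)\,x_0 + t\,x_1, \qquad \mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim p_0,\; x_1 \sim p_{\mathrm{data}}} \left\| v_\theta(x_t, t \mid \mathrm{seq}) - (x_1 - x_0) \right\|^2,$$

where sampling a conformation integrates $\dot{x}_t = v_\theta(x_t, t \mid \mathrm{seq})$ from noise $x_0$ to a structure $x_1$, with the velocity network initialized from the pretrained single-state predictor.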
♻ ☆ Automatic Differentiation is Essential in Training Neural Networks for Solving Differential Equations
Neural network-based approaches have recently shown significant promise in solving partial differential equations (PDEs) in science and engineering, especially in scenarios featuring complex domains or incorporation of empirical data. One advantage of neural network methods for PDEs lies in their use of automatic differentiation (AD), which requires only the sample points themselves, unlike traditional finite difference (FD) approximations that require nearby local points to compute derivatives. In this paper, we quantitatively demonstrate the advantage of AD in training neural networks. The concept of truncated entropy is introduced to characterize the training property. Specifically, through comprehensive experimental and theoretical analyses conducted on random feature models and two-layer neural networks, we discover that the defined truncated entropy serves as a reliable metric for quantifying the residual loss of random feature models and the training speed of neural networks for both AD and FD methods. Our experimental and theoretical analyses demonstrate that, from a training perspective, AD outperforms FD in solving PDEs.
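A small sketch contrasting the two derivative estimators discussed above: automatic differentiation needs only the sample points themselves, while finite differences query nearby points with a step size h. The tiny network and the step size are illustrative assumptions.

```python
# AD vs. central finite-difference derivative of a small network (illustrative).
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x = torch.linspace(0.0, 1.0, 11).unsqueeze(-1).requires_grad_(True)

u = net(x)
du_ad = torch.autograd.grad(u.sum(), x)[0]                      # derivative via automatic differentiation

h = 1e-3
with torch.no_grad():
    du_fd = (net(x + h) - net(x - h)) / (2 * h)                 # central finite-difference estimate

print("max |AD - FD|:", (du_ad - du_fd).abs().max().item())
```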
♻ ☆ On the limits of neural network explainability via descrambling
We characterize the exact solutions to neural network descrambling--a mathematical model for explaining the fully connected layers of trained neural networks (NNs). By reformulating the problem to the minimization of the Brockett function arising in graph matching and complexity theory we show that the principal components of the hidden layer preactivations can be characterized as the optimal explainers or descramblers for the layer weights, leading to descrambled weight matrices. We show that in typical deep learning contexts these descramblers take diverse and interesting forms including (1) matching largest principal components with the lowest frequency modes of the Fourier basis for isotropic hidden data, (2) discovering the semantic development in two-layer linear NNs for signal recovery problems, and (3) explaining CNNs by optimally permuting the neurons. Our numerical experiments indicate that the eigendecompositions of the hidden layer data--now understood as the descramblers--can also reveal the layer's underlying transformation. These results illustrate that the SVD is more directly related to the explainability of NNs than previously thought and offers a promising avenue for discovering interpretable motifs for the hidden action of NNs, especially in contexts of operator learning or physics-informed NNs, where the input/output data has limited human readability.
MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation
Model merging has emerged as an effective approach to combine multiple single-task models, fine-tuned from the same pre-trained model, into a multitask model. This process typically involves computing a weighted average of the model parameters without any additional training. Existing model-merging methods focus on enhancing average task accuracy. However, interference and conflicts between the objectives of different tasks can lead to trade-offs during model merging. In real-world applications, a set of solutions with various trade-offs can be more informative, helping practitioners make decisions based on diverse preferences. In this paper, we introduce a novel low-compute algorithm, Model Merging with Amortized Pareto Front (MAP). MAP identifies a Pareto set of scaling coefficients for merging multiple models to reflect the trade-offs. The core component of MAP is approximating the evaluation metrics of the various tasks using a quadratic approximation surrogate model derived from a pre-selected set of scaling coefficients, enabling amortized inference. Experimental results on vision and natural language processing tasks show that MAP can accurately identify the Pareto front. To further reduce the required computation of MAP, we propose (1) a Bayesian adaptive sampling algorithm and (2) a nested merging scheme with multiple stages.
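A sketch of MAP's amortization idea: evaluate the merged model at a few pre-selected scaling coefficients, fit a quadratic surrogate per task metric, then read the Pareto front off the cheap surrogate instead of re-evaluating the real model. The two toy metric functions below stand in for actual task evaluations and are assumptions.

```python
# Quadratic surrogate + amortized Pareto front over scaling coefficients (toy metrics).
import numpy as np

def quad_features(c):                       # [1, c1, c2, c1^2, c2^2, c1*c2]
    c1, c2 = c[:, 0], c[:, 1]
    return np.stack([np.ones_like(c1), c1, c2, c1**2, c2**2, c1 * c2], axis=1)

def true_metrics(c):                        # placeholder "task accuracies" of the merged model
    acc_a = 0.9 - (c[:, 0] - 0.7) ** 2
    acc_b = 0.9 - (c[:, 1] - 0.6) ** 2 - 0.2 * c[:, 0] * c[:, 1]
    return np.stack([acc_a, acc_b], axis=1)

probe = np.random.default_rng(0).uniform(0, 1, size=(20, 2))      # pre-selected coefficients
W, *_ = np.linalg.lstsq(quad_features(probe), true_metrics(probe), rcond=None)

grid = np.stack(np.meshgrid(np.linspace(0, 1, 51), np.linspace(0, 1, 51)), -1).reshape(-1, 2)
pred = quad_features(grid) @ W                                    # amortized metric estimates

pareto = [i for i, p in enumerate(pred)
          if not np.any(np.all(pred >= p, axis=1) & np.any(pred > p, axis=1))]
print("Pareto-optimal scaling coefficients:", grid[pareto][:5])
```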
♻ ☆ RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance
Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constrained devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate that the proposed approach solves the coherence problem through substantive experiments demonstrating its effectiveness in terms of compact model size and excellent generation quality.
♻ ☆ Uplift Modeling Under Limited Supervision
Estimating causal effects in e-commerce tends to involve costly treatment assignments which can be impractical in large-scale settings. Leveraging machine learning to predict such treatment effects without actual intervention is a standard practice to diminish the risk. However, existing methods for treatment effect prediction tend to rely on training sets of substantial size, which are built from real experiments and are thus inherently risky to create. In this work we propose a graph neural network to diminish the required training set size, relying on graphs that are common in e-commerce data. Specifically, we view the problem as node regression with a restricted number of labeled instances, develop a two-model neural architecture akin to previous causal effect estimators, and test varying message-passing layers for encoding. Furthermore, as an extra step, we combine the model with an acquisition function to guide the creation of the training set in settings with extremely low experimental budget. The framework is flexible since each step can be used separately with other models or treatment policies. The experiments on real large-scale networks indicate a clear advantage of our methodology over the state of the art, which in many cases performs close to random, underlining the need for models that can generalize with limited supervision to reduce experimental risks.
♻ ☆ Globally Stable Neural Imitation Policies
Imitation learning presents an effective approach to alleviate the resource-intensive and time-consuming nature of policy learning from scratch in the solution space. Even though the resulting policy can mimic expert demonstrations reliably, it often lacks predictability in unexplored regions of the state-space, giving rise to significant safety concerns in the face of perturbations. To address these challenges, we introduce the Stable Neural Dynamical System (SNDS), an imitation learning regime which produces a policy with formal stability guarantees. We deploy a neural policy architecture that facilitates the representation of stability based on Lyapunov theorem, and jointly train the policy and its corresponding Lyapunov candidate to ensure global stability. We validate our approach by conducting extensive experiments in simulation and successfully deploying the trained policies on a real-world manipulator arm. The experimental results demonstrate that our method overcomes the instability, accuracy, and computational intensity problems associated with previous imitation learning methods, making our method a promising solution for stable policy learning in complex planning scenarios.
AI-Assisted Generation of Difficult Math Questions
Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage the LLM metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multiturn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline on skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$ - a dataset of higher-quality math questions, as evidenced by: (a) Lower performance of all models on MATH$^2$ than on MATH (b) Higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship observed between models' performance on the new dataset: the success rate on MATH$^2$ is the square of that on MATH, suggesting that successfully solving a question in MATH$^2$ requires a nontrivial combination of two distinct math skills.
♻ ☆ Online Detection of Anomalies in Temporal Knowledge Graphs with Interpretability SIGMOD 2025
Temporal knowledge graphs (TKGs) are valuable resources for capturing evolving relationships among entities, yet they are often plagued by noise, necessitating robust anomaly detection mechanisms. Existing dynamic graph anomaly detection approaches struggle to capture the rich semantics introduced by node and edge categories within TKGs, while TKG embedding methods lack interpretability, undermining the credibility of anomaly detection. Moreover, these methods falter in adapting to pattern changes and semantic drifts resulting from knowledge updates. To tackle these challenges, we introduce AnoT, an efficient TKG summarization method tailored for interpretable online anomaly detection in TKGs. AnoT begins by summarizing a TKG into a novel rule graph, enabling flexible inference of complex patterns in TKGs. When new knowledge emerges, AnoT maps it onto a node in the rule graph and traverses the rule graph recursively to derive the anomaly score of the knowledge. The traversal yields reachable nodes that furnish interpretable evidence for the validity or anomalousness of the new knowledge. Overall, AnoT embodies a detector-updater-monitor architecture, encompassing a detector for offline TKG summarization and online scoring, an updater for real-time rule graph updates based on emerging knowledge, and a monitor for estimating the approximation error of the rule graph. Experimental results on four real-world datasets demonstrate that AnoT surpasses existing methods significantly in terms of accuracy and interpretability. All of the raw datasets and the implementation of AnoT are provided in https://github.com/zjs123/ANoT.
comment: 26 pages, 10 figures. Accepted by SIGMOD 2025
♻ ☆ NeuFair: Neural Network Fairness Repair with Dropout ISSTA 2024
This paper investigates neuron dropout as a post-processing bias mitigation for deep neural networks (DNNs). Neural-driven software solutions are increasingly applied in socially critical domains with significant fairness implications. While neural networks are exceptionally good at finding statistical patterns from data, they may encode and amplify existing biases from the historical data. Existing bias mitigation algorithms often require modifying the input dataset or the learning algorithms. We posit that the prevalent dropout methods that prevent over-fitting during training by randomly dropping neurons may be an effective and less intrusive approach to improve the fairness of pre-trained DNNs. However, finding the ideal set of neurons to drop is a combinatorial problem. We propose NeuFair, a family of post-processing randomized algorithms that mitigate unfairness in pre-trained DNNs via dropouts during inference after training. Our randomized search is guided by an objective to minimize discrimination while maintaining the model's utility. We show that our design of randomized algorithms is effective and efficient in improving fairness (up to 69%) with minimal or no model performance degradation. We provide intuitive explanations of these phenomena and carefully examine the influence of various hyperparameters of search algorithms on the results. Finally, we empirically and conceptually compare NeuFair to different state-of-the-art bias mitigators.
comment: Paper accepted at ACM ISSTA 2024
♻ ☆ Privacy-Aware Document Visual Question Answering ICDAR 2024
Document Visual Question Answering (DocVQA) has quickly grown into a central task of document understanding. But despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees. In this work, we explore privacy in the domain of DocVQA for the first time, highlighting privacy issues in state-of-the-art multi-modal LLMs used for DocVQA, and explore possible solutions. Specifically, we focus on invoice processing as a realistic document understanding scenario, and propose a large scale DocVQA dataset comprising invoice documents and associated questions and answers. We employ a federated learning scheme, that reflects the real-life distribution of documents in different businesses, and we explore the use case where the data of the invoice provider is the sensitive information to be protected. We demonstrate that non-private models tend to memorise, a behaviour that can lead to exposing private information. We then evaluate baseline training schemes employing federated learning and differential privacy in this multi-modal scenario, where the sensitive information might be exposed through either or both of the two input modalities: vision (document image) or language (OCR tokens). Finally, we design attacks exploiting the memorisation effect of the model, and demonstrate their effectiveness in probing a representative DocVQA model.
comment: 35 pages, 12 figures, accepted for publication at the 18th International Conference on Document Analysis and Recognition, ICDAR 2024
♻ ☆ Exploring Bias and Prediction Metrics to Characterise the Fairness of Machine Learning for Equity-Centered Public Health Decision-Making: A Narrative Review
Background: The rapid advancement of Machine Learning (ML) represents novel opportunities to enhance public health research, surveillance, and decision-making. However, there is a lack of comprehensive understanding of algorithmic bias, i.e., systematic errors in predicted population health outcomes, resulting from the public health application of ML. The objective of this narrative review is to explore the types of bias generated by ML and the quantitative metrics to assess these biases. Methods: We performed a search on PubMed, MEDLINE, IEEE (Institute of Electrical and Electronics Engineers), ACM (Association for Computing Machinery) Digital Library, Science Direct, and Springer Nature. We used keywords to identify studies describing types of bias and metrics to measure these in the domain of ML and public and population health, published in English between 2008 and 2023, inclusive. Results: A total of 72 articles met the inclusion criteria. Our review identified the commonly described types of bias and the quantitative metrics used to assess these biases from an equity perspective. Conclusion: The review will help formalize the evaluation framework for ML on public health from an equity perspective.
comment: under review
♻ ☆ Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? ECCV 2024
Foundation models have emerged as robust models with label efficiency in diverse domains. In medical imaging, these models contribute to the advancement of medical diagnoses due to the difficulty in obtaining labeled data. However, it is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model. This research examines the bias in the foundation model RetFound when it is fine-tuned on the Brazilian Multilabel Ophthalmological Dataset (BRSET), which has a different population than the pre-training dataset. The model evaluation, in comparison with supervised learning, shows that the foundation model has the potential to reduce the gap between the maximum and minimum AUC evaluations across gender and age groups. However, under data-efficient generalization, the model's bias increases as the amount of data decreases. These findings suggest that when deploying a foundation model in real-life scenarios with limited data, the possibility of fairness issues should be considered.
comment: Preprint of paper to be presented at Fairness and Ethics Towards Transparent AI: Facing the Challenge through Model Debiasing (FAILED) during ECCV 2024
♻ ☆ Ancestral Reinforcement Learning: Unifying Zeroth-Order Optimization and Genetic Algorithms for Reinforcement Learning
Reinforcement Learning (RL) offers a fundamental framework for discovering optimal action strategies through interactions within unknown environments. Recent advancements have shown that the performance and applicability of RL can be significantly enhanced by exploiting a population of agents in various ways. Zeroth-Order Optimization (ZOO) leverages an agent population to estimate the gradient of the objective function, enabling robust policy refinement even in non-differentiable scenarios. As another application, Genetic Algorithms (GA) boost the exploration of policy landscapes through mutational generation of policy diversity in an agent population and its refinement by selection. A natural question is whether an agent population can enjoy the best of both worlds. In this work, we propose Ancestral Reinforcement Learning (ARL), which synergistically combines the robust gradient estimation of ZOO with the exploratory power of GA. The key idea in ARL is that each agent within a population infers gradients by exploiting the history of its ancestors, i.e., the ancestor population in the past, while maintaining the diversity of policies in the current population as in GA. We also theoretically reveal that the populational search in ARL implicitly induces KL-regularization of the objective function, resulting in enhanced exploration. Our results extend the applicability of populational algorithms to RL.
comment: 16pages, 3 figures
♻ ☆ Analysis of Failures and Risks in Deep Learning Model Converters: A Case Study in the ONNX Ecosystem ISSTA'24
Software engineers develop, fine-tune, and deploy deep learning (DL) models using a variety of development frameworks and runtime environments. DL model converters move models between frameworks and to runtime environments. Conversion errors compromise model quality and disrupt deployment. However, the failure characteristics of DL model converters are unknown, adding risk when using DL interoperability technologies. This paper analyzes failures in DL model converters. We survey software engineers about DL interoperability tools, use cases, and pain points (N=92). Then, we characterize failures in model converters associated with the main interoperability tool, ONNX (N=200 issues in PyTorch and TensorFlow). Finally, we formulate and test two hypotheses about structural causes for the failures we studied. We find that the node conversion stage of a model converter accounts for ~75% of the defects and that 33% of reported failures are related to semantically incorrect models. The cause of semantically incorrect models is elusive, but models with behaviour inconsistencies share operator sequences. Our results motivate future research on making DL interoperability software simpler to maintain, extend, and validate. Research into behavioural tolerances and architectural coverage metrics could be fruitful.
comment: [ISSTA'24] Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) 2024
♻ ☆ HyperInterval: Hypernetwork approach to training weight interval regions in continual learning
Recently, a new Continual Learning (CL) paradigm was presented to control catastrophic forgetting, called Interval Continual Learning (InterContiNet), which relies on enforcing interval constraints on the neural network parameter space. Unfortunately, InterContiNet training is challenging due to the high dimensionality of the weight space, making intervals difficult to manage. To address this issue, we introduce HyperInterval (the source code is available at https://github.com/gmum/HyperInterval), a technique that employs interval arithmetic within the embedding space and utilizes a hypernetwork to map these intervals to the target network parameter space. We train interval embeddings for consecutive tasks and train a hypernetwork to transform these embeddings into weights of the target network. An embedding for a given task is trained along with the hypernetwork, preserving the response of the target network for the previous task embeddings. Interval arithmetic works with a more manageable, lower-dimensional embedding space rather than directly preparing intervals in a high-dimensional weight space. Our model allows faster and more efficient training. Furthermore, HyperInterval maintains the guarantee of not forgetting. At the end of training, we can choose one universal embedding to produce a single network dedicated to all tasks. In such a framework, the hypernetwork is used only for training and, finally, we can utilize one set of weights. HyperInterval obtains significantly better results than InterContiNet and gives SOTA results on several benchmarks.
♻ ☆ ResQuNNs: Towards Enabling Deep Learning in Quantum Convolution Neural Networks
In this paper, we present a novel framework for enhancing the performance of Quanvolutional Neural Networks (QuNNs) by introducing trainable quanvolutional layers and addressing the critical challenges associated with them. Traditional quanvolutional layers, although beneficial for feature extraction, have largely been static, offering limited adaptability. Unlike the state of the art, our research overcomes this limitation by enabling training within these layers, significantly increasing the flexibility and potential of QuNNs. However, the introduction of multiple trainable quanvolutional layers induces complexities in gradient-based optimization, primarily due to the difficulty in accessing gradients across these layers. To resolve this, we propose a novel architecture, Residual Quanvolutional Neural Networks (ResQuNNs), leveraging the concept of residual learning, which facilitates the flow of gradients by adding skip connections between layers. By inserting residual blocks between quanvolutional layers, we ensure enhanced gradient access throughout the network, leading to improved training performance. Moreover, we provide empirical evidence on the strategic placement of these residual blocks within QuNNs. Through extensive experimentation, we identify an efficient configuration of residual blocks, which enables gradient flow across all layers in the network and ultimately results in efficient training. Our findings suggest that the precise location of residual blocks plays a crucial role in maximizing the performance gains in QuNNs. Our results mark a substantial step forward in the evolution of quantum deep learning, offering new avenues for both theoretical development and practical quantum computing applications.
♻ ☆ Stabilizing Extreme Q-learning by Maclaurin Expansion
In offline reinforcement learning, in-sample learning methods have been widely used to prevent performance degradation caused by evaluating out-of-distribution actions from the dataset. Extreme Q-learning (XQL) employs a loss function based on the assumption that the Bellman error follows a Gumbel distribution, enabling it to model the soft optimal value function in an in-sample manner. It has demonstrated strong performance in both offline and online reinforcement learning settings. However, issues remain, such as the instability caused by the exponential term in the loss function and the risk of the error distribution deviating from the Gumbel distribution. Therefore, we propose Maclaurin Expanded Extreme Q-learning to enhance stability. In this method, applying Maclaurin expansion to the loss function in XQL enhances stability against large errors. This approach involves adjusting the modeled value function between the value function under the behavior policy and the soft optimal value function, thus achieving a trade-off between stability and optimality depending on the order of expansion. It also enables adjustment of the error distribution assumption from a normal distribution to a Gumbel distribution. Our method significantly stabilizes learning in online RL tasks from DM Control, where XQL was previously unstable. Additionally, it improves performance in several offline RL tasks from D4RL.
comment: Accepted at RLC 2024: The first Reinforcement Learning Conference
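To make the mechanism concrete (a sketch; the paper's exact loss and normalization may differ): XQL's Gumbel-regression loss on the residual $z$ contains an exponential, and truncating its Maclaurin series trades optimality for stability,

$$\mathcal{L}_{\mathrm{XQL}}(z) = \mathbb{E}\!\left[ e^{z/\beta} - \tfrac{z}{\beta} - 1 \right], \qquad \mathcal{L}_{N}(z) = \mathbb{E}\!\left[ \sum_{n=2}^{N} \frac{1}{n!} \left( \frac{z}{\beta} \right)^{n} \right],$$

so the second-order truncation $N=2$ recovers a scaled squared error (value regression under the behavior policy), while increasing $N$ moves the modeled value function toward the soft optimal one at the cost of reintroducing sensitivity to large errors.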
♻ ☆ An Effective Information Theoretic Framework for Channel Pruning
Channel pruning is a promising method for accelerating and compressing convolutional neural networks. However, current pruning algorithms leave unsolved the problems of how to assign layer-wise pruning ratios properly and how to discard the least important channels according to a convincing criterion. In this paper, we present a novel channel pruning approach based on information theory and the interpretability of neural networks. Specifically, we regard information entropy as the expected amount of information carried by convolutional layers. In addition, if we view a matrix as a system of linear equations, a higher-rank matrix indicates that more solutions exist, which implies more uncertainty. From the point of view of information theory, the rank can therefore also describe an amount of information. In a neural network, considering the rank and entropy as two information indicators of convolutional layers, we propose a fusion function to reach a compromise between them, where the fusion result is defined as ``information concentration''. When pre-defining layer-wise pruning ratios, we employ the information concentration as a reference, instead of heuristic and engineering tuning, to provide a more interpretable solution. Moreover, we leverage Shapley values, which are a potent tool in the interpretability of neural networks, to evaluate the channel contributions and discard the least important channels for model compression while maintaining performance. Extensive experiments demonstrate the effectiveness and promising performance of our method. For example, our method improves the accuracy by 0.21% when reducing 45.5% of FLOPs and removing 40.3% of parameters for ResNet-56 on CIFAR-10. Moreover, our method incurs Top-1/Top-5 accuracy losses of only 0.43%/0.11% when reducing 41.6% of FLOPs and removing 35.0% of parameters for ResNet-50 on ImageNet.
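A sketch of the "information concentration" idea: for each convolutional layer, estimate the entropy and the (normalized) rank of its feature maps and fuse them into a single score that can guide layer-wise pruning ratios. The histogram-based entropy estimate and the fusion weight are illustrative assumptions, not the paper's exact definitions.

```python
# Entropy + rank fusion for a layer's feature maps (assumed estimators and fusion weight).
import torch

def information_concentration(feature_maps, lam=0.5, bins=64):
    """feature_maps: tensor of shape (N, C, H, W) collected from one layer."""
    n, c, h, w = feature_maps.shape
    flat = feature_maps.reshape(n * c, h * w)

    # Entropy of the layer's activation distribution (histogram estimate).
    hist = torch.histc(feature_maps, bins=bins)
    p = hist / hist.sum()
    entropy = -(p[p > 0] * p[p > 0].log()).sum()

    # Normalized rank of the stacked feature matrix.
    rank = torch.linalg.matrix_rank(flat).float() / min(flat.shape)

    # Fuse the two indicators into one "information concentration" score.
    return lam * entropy / torch.log(torch.tensor(float(bins))) + (1 - lam) * rank

maps = torch.randn(8, 128, 14, 14)           # activations from a hypothetical layer
print(information_concentration(maps))
```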
♻ ☆ Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model ACM MM 2024
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. Our code and weights are available at https://github.com/xhanxu/Mamba3D.
comment: ACM MM 2024. Code and weights are available at https://github.com/xhanxu/Mamba3D
♻ ☆ Identifying Weight-Variant Latent Causal Models
The task of causal representation learning aims to uncover latent higher-level causal representations that affect lower-level observations. Identifying true latent causal representations from observed data, while allowing instantaneous causal relations among latent variables, remains a challenge, however. To this end, we start from the analysis of three intrinsic properties in identifying the latent space from observations: transitivity, permutation indeterminacy, and scaling indeterminacy. We find that transitivity plays a key role in impeding the identifiability of latent causal representations. To address the unidentifiable issue due to transitivity, we introduce a novel identifiability condition where the underlying latent causal model satisfies a linear-Gaussian model, in which the causal coefficients and the distribution of Gaussian noise are modulated by an additional observed variable. Under some mild assumptions, we can show that the latent causal representations can be identified up to trivial permutation and scaling. Furthermore, based on this theoretical result, we propose a novel method, termed Structural caUsAl Variational autoEncoder, which directly learns latent causal representations and causal relationships among them, together with the mapping from the latent causal variables to the observed ones. We show that the proposed method learns the true parameters asymptotically. Experimental results on synthetic and real data demonstrate the identifiability and consistency results and the efficacy of the proposed method in learning latent causal representations.
♻ ☆ Deep Learning-based Target-To-User Association in Integrated Sensing and Communication Systems
In Integrated Sensing and Communication (ISAC) systems, matching radar targets with communication user equipments (UEs) is instrumental to several communication tasks, such as proactive handover and beam prediction. In this paper, we consider a radar-assisted communication system where a base station (BS) is equipped with a multiple-input-multiple-output (MIMO) radar with a double aim: (i) associate vehicular radar targets with vehicular equipments (VEs) in the communication beamspace and (ii) predict the beamforming vector for each VE from radar data. The proposed target-to-user (T2U) association consists of two stages. First, vehicular radar targets are detected from range-angle images and, for each, a beamforming vector is estimated. Then, the inferred per-target beamforming vectors are matched with the ones utilized at the BS for communication to perform the T2U association. Joint multi-target detection and beam inference is obtained by modifying the you only look once (YOLO) model, which is trained over simulated range-angle radar images. Simulation results over different urban vehicular mobility scenarios show that the proposed T2U method provides a probability of correct association that increases with the size of the BS antenna array, reflecting the corresponding increase in the separability of the VEs in the beamspace. Moreover, we show that the modified YOLO architecture can effectively perform both beam prediction and radar target detection, with similar mean average precision on the latter across different antenna array sizes.
♻ ☆ Barlow Twins Deep Neural Network for Advanced 1D Drug-Target Interaction Prediction
Accurate prediction of drug-target interactions is critical for advancing drug discovery. By reducing time and cost, machine learning and deep learning can accelerate this laborious discovery process. In a novel approach, BarlowDTI, we utilise the powerful Barlow Twins architecture for feature extraction while considering the structure of the target protein. Our method achieves state-of-the-art predictive performance against multiple established benchmarks using only one-dimensional input. The use of a gradient boosting machine as the underlying predictor ensures fast and efficient predictions without the need for substantial computational resources. We also investigate how the model reaches its decision based on individual training samples. By comparing co-crystal structures, we find that BarlowDTI effectively exploits catalytically active and stabilising residues, highlighting the model's ability to generalise from one-dimensional input data. In addition, we further benchmark new baselines against existing methods. Together, these innovations improve the efficiency and effectiveness of drug-target interaction predictions, providing robust tools for accelerating drug development and deepening the understanding of molecular interactions. Therefore, we provide an easy-to-use web interface that can be freely accessed at https://www.bio.nat.tum.de/oc2/barlowdti .
comment: Refined model architecture, additional results added
♻ ☆ TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition
One persistent challenge in Speech Emotion Recognition (SER) is the ubiquitous environmental noise, which frequently results in deteriorating SER performance in practice. In this paper, we introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge. Specifically, a pre-trained speech enhancement module is employed for front-end noise reduction and noise level estimation. Later, we utilize clean speech spectrograms and their corresponding deep representations as reference signals to refine the spectrogram distortion and representation shift of enhanced speech during model training. Experimental results validate that the proposed TRNet substantially promotes the robustness of the proposed system in both matched and unmatched noisy environments, without compromising its performance in noise-free environments.
comment: 14 pages, 3 figures
♻ ☆ DiffLoad: Uncertainty Quantification in Electrical Load Forecasting with the Diffusion Model
Electrical load forecasting plays a crucial role in decision-making for power systems, including unit commitment and economic dispatch. The integration of renewable energy sources and the occurrence of external events, such as the COVID-19 pandemic, have rapidly increased uncertainties in load forecasting. The uncertainties in load forecasting can be divided into two types: epistemic uncertainty and aleatoric uncertainty. Separating these types of uncertainties can help decision-makers better understand where and to what extent the uncertainty is, thereby enhancing their confidence in the following decision-making. This paper proposes a diffusion-based Seq2Seq structure to estimate epistemic uncertainty and employs the robust additive Cauchy distribution to estimate aleatoric uncertainty. Our method not only ensures the accuracy of load forecasting but also demonstrates the ability to separate the two types of uncertainties and be applicable to different levels of loads. The relevant code can be found at \url{https://anonymous.4open.science/r/DiffLoad-4714/}.
comment: Accepted by IEEE Transactions on Power Systems, 2024
♻ ☆ Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form
Designing a safe policy for uncertain environments is crucial in real-world control applications. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm capable of identifying a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional Lagrangian max-min formulation with policy gradient methods can become trapped in suboptimal solutions by encountering a sum of conflicting gradients from the objective and constraint functions during its inner minimization problem. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a binary search algorithm with a policy gradient subroutine and prove that it identifies an $\varepsilon$-optimal policy in an RCMDP with $\tilde{\mathcal{O}}(\varepsilon^{-4})$ policy evaluations.
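A toy sketch of the epigraph-form binary search described above: search over the threshold b, and at each b run a subroutine on max(J_obj - b, J_constraint), following the gradient of whichever term attains the max so that objective and constraint gradients never conflict. The one-dimensional `inner_solve` below is a stand-in for the paper's policy-gradient subroutine, not its actual algorithm.

```python
# Epigraph-form binary search with a single-gradient inner solver (toy 1-D stand-in).
def inner_solve(b, steps=200, lr=0.1):
    """Minimize max(J_obj(theta) - b, J_con(theta)) over a toy scalar policy parameter."""
    theta = 0.0
    for _ in range(steps):
        j_obj, g_obj = (theta - 2.0) ** 2, 2 * (theta - 2.0)      # toy objective and its gradient
        j_con, g_con = 1.0 - theta, -1.0                          # toy constraint (<= 0 is feasible)
        grad = g_obj if j_obj - b >= j_con else g_con             # pick a single gradient, no conflict
        theta -= lr * grad
    return max((theta - 2.0) ** 2 - b, 1.0 - theta)

lo, hi = 0.0, 10.0
for _ in range(30):                                               # binary search on the epigraph variable
    mid = (lo + hi) / 2
    hi, lo = (mid, lo) if inner_solve(mid) <= 0 else (hi, mid)
print("near-optimal worst-case cost threshold b* ~", hi)
```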
♻ ☆ EuroPED-NN: Uncertainty aware surrogate model
This work successfully generates an uncertainty-aware surrogate model of the EuroPED plasma pedestal model using the Bayesian neural network with noise contrastive prior (BNN-NCP) technique. The resulting model, EuroPED-NN, is trained using data from the JET-ILW pedestal database and subsequent model evaluations. The BNN-NCP technique has been proven to be a suitable method for generating uncertainty-aware surrogate models. It matches the output results of a regular neural network while providing confidence estimates for predictions as uncertainties. Additionally, it highlights out-of-distribution (OOD) regions using surrogate model uncertainties. This provides critical insights into model robustness and reliability. EuroPED-NN has been physically validated, first by analyzing electron density $n_e\!\left(\psi_{\text{pol}}=0.94\right)$ with respect to increasing plasma current, $I_p$, and second by validating the $\Delta-\beta_{p,ped}$ relation associated with the EuroPED model. This affirms the robustness of the underlying physics learned by the surrogate model. On top of that, the method was used to develop a EuroPED-like model fed with experimental data, i.e., an uncertainty-aware experimental model, which is functional in the JET database. Both models have also been tested on $\sim 50$ AUG shots.
♻ ☆ The Initial Screening Order Problem
We investigate the role of the initial screening order (ISO) in candidate screening tasks, such as employee hiring and academic admissions, in which a screener is tasked with selecting $k$ candidates from a candidate pool. The ISO refers to the order in which the screener searches the candidate pool. Today, it is common for the ISO to be the product of an information access system, such as an online platform or a database query. The ISO has been largely overlooked in the literature, despite its potential impact on the optimality and fairness of the chosen $k$ candidates, especially under a human screener. We define two problem formulations describing the search behavior of the screener under the ISO: the best-$k$, where the screener selects the $k$ best candidates; and the good-$k$, where the screener selects the first $k$ good-enough candidates. To study the impact of the ISO, we introduce a human-like screener and compare it to its algorithmic counterpart, where the human-like screener is conceived to be inconsistent over time due to fatigue. In particular, our analysis shows that the ISO, under a human-like screener solving for the good-$k$ problem, hinders individual fairness despite meeting group level fairness, and hampers the optimality of the selected $k$ candidates. This is due to position bias, where a candidate's evaluation is affected by its position within the ISO. We report extensive simulated experiments exploring the parameters of the best-$k$ and good-$k$ problems for the algorithmic and human-like screeners. The simulation framework is flexible enough to account for multiple screening settings, being an alternative to running real-world candidate screening procedures. This work is motivated by a real-world candidate screening problem studied in collaboration with a European company.
♻ ☆ Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis BMVC2024
Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery in the latent space of DMs by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper, we address the challenge of local image manipulation in DMs and introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs. Given an arbitrary image and defined regions of interest, we utilize the Jacobian of the denoising network to establish a relation between the regions of interest and their corresponding subspaces in the latent space. Furthermore, we disentangle the joint and individual components of these subspaces to identify latent directions that enable local image manipulation. Once discovered, these directions can be applied to different images to produce semantically consistent edits, making our method suitable for practical applications. Experimental results on various datasets demonstrate that our method can produce semantic edits that are more localized and have better fidelity compared to the state of the art.
comment: Accepted at BMVC2024
♻ ☆ Autonomous Payload Thermal Control SP
In small satellites there is less room for heat control equipment, scientific instruments, and electronic components. Furthermore, the close proximity of electronic components makes power dissipation difficult, with the risk of not being able to control the temperature appropriately, reducing component lifetime and mission performance. To address this challenge, and taking advantage of the increasing intelligence available on board satellites, an autonomous thermal control tool that uses deep reinforcement learning is proposed for learning the thermal control policy onboard. The tool was evaluated on a real space edge processing computer that will be used in a demonstration payload hosted on the International Space Station (ISS). The experimental results show that the proposed framework is able to learn to control the payload processing power to keep the temperature within operational ranges, complementing traditional thermal control systems.
comment: To be included in the proceedings of ESA's SPAICE conference at ECSAT, UK, 2024
♻ ☆ Fast Robust Kernel Regression through Sign Gradient Descent with Early Stopping
Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the model parameters. Here, we introduce an equivalent formulation of the objective function of KRR, which opens the door both to replacing the ridge penalty with the $\ell_\infty$ and $\ell_1$ penalties and to studying kernel ridge regression from the perspective of gradient descent. Using the $\ell_\infty$ and $\ell_1$ penalties, we obtain robust and sparse kernel regression, respectively. We further study the similarities between explicitly regularized kernel regression and the solutions obtained by early stopping of iterative gradient-based methods, where we connect $\ell_\infty$ regularization to sign gradient descent, $\ell_1$ regularization to forward stagewise regression (also known as coordinate descent), and $\ell_2$ regularization to gradient descent, and, in the last case, theoretically bound the differences. We exploit the close relations between $\ell_\infty$ regularization and sign gradient descent, and between $\ell_1$ regularization and coordinate descent, to propose computationally efficient methods for robust and sparse kernel regression. We finally compare robust kernel regression through sign gradient descent to existing methods for robust kernel regression on five real data sets, demonstrating that our method is one to two orders of magnitude faster, without compromising accuracy.
comment: Article arXiv:2306.16838v1 has been updated and split into two articles: this article and arXiv:2311.01762. Thus, some of the content in arXiv:2306.16838v1 is not a part of arXiv:2306.16838v2, but of arXiv:2311.01762
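To make the connection between $\ell_\infty$ regularization and sign gradient descent concrete, here is a minimal NumPy sketch of early-stopped sign gradient descent on the kernel least-squares objective; the RBF kernel, step size, and stopping rule are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise squared distances, then Gaussian kernel.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_sign_gd(K, y, lr=1e-2, max_iter=500, tol=1e-3):
    """Sign gradient descent on the kernel least-squares loss
    0.5 * ||y - K @ alpha||^2, stopped early once the training
    loss stops improving (illustrative stopping rule)."""
    alpha = np.zeros_like(y)
    prev = np.inf
    for _ in range(max_iter):
        resid = K @ alpha - y
        loss = 0.5 * (resid ** 2).sum()
        if prev - loss < tol:             # early stopping acts as regularization
            break
        prev = loss
        alpha -= lr * np.sign(K @ resid)  # sign step (linked to the l_inf penalty)
    return alpha

# Toy usage: fit a noisy sine with a few outliers; robustness is the point.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
y[:3] += 5.0
K = rbf_kernel(X, X, gamma=0.5)
y_hat = K @ krr_sign_gd(K, y)
```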
♻ ☆ Simplifying the Theory on Over-Smoothing
Graph convolutions have gained popularity due to their ability to efficiently operate on data with an irregular geometric structure. However, graph convolutions cause over-smoothing, which refers to representations becoming more similar with increased depth. At present, many different definitions and intuitions coexist, leading to research efforts focusing on incompatible directions. This paper attempts to align these directions by showing that over-smoothing is merely a special case of power iteration. This greatly simplifies the existing theory on over-smoothing, making it more accessible. Based on the theory, we provide a novel comprehensive definition of rank collapse as a generalized form of over-smoothing and introduce the rank-one distance as a corresponding metric. Our empirical evaluation of 14 commonly used methods shows that more models than were previously known suffer from this issue.
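As a rough illustration of the power-iteration view, the sketch below measures how far a feature matrix is from its best rank-one approximation and shows that repeated propagation drives this distance toward zero; the exact definition of the rank-one distance in the paper may differ, so the SVD-based construction and the normalization here are assumptions.

```python
import numpy as np

def rank_one_distance(X):
    """Distance of node features X (n_nodes x d) from the nearest rank-one
    matrix, in relative Frobenius norm. Values near 0 indicate rank collapse,
    the generalized form of over-smoothing."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X1 = s[0] * np.outer(U[:, 0], Vt[0])        # best rank-one approximation
    return np.linalg.norm(X - X1) / (np.linalg.norm(X) + 1e-12)

# Repeated mean aggregation (a crude graph convolution) acts like power
# iteration and pushes the features toward a rank-one matrix.
rng = np.random.default_rng(0)
A = rng.random((50, 50)); A = (A + A.T) / 2     # dense symmetric "adjacency"
P = A / A.sum(1, keepdims=True)                 # row-stochastic propagation
X = rng.normal(size=(50, 16))
for layer in range(20):
    X = P @ X
print(rank_one_distance(X))                     # shrinks toward 0 with depth
```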
♻ ☆ BadMerging: Backdoor Attacks Against Model Merging CCS
Fine-tuning pre-trained models for downstream tasks has led to a proliferation of open-sourced task-specific models. Recently, Model Merging (MM) has emerged as an effective approach to facilitate knowledge transfer among these independently fine-tuned models. MM directly combines multiple fine-tuned task-specific models into a merged model without additional training, and the resulting model shows enhanced capabilities in multiple tasks. Although MM provides great utility, it may come with security risks because an adversary can exploit MM to affect multiple downstream tasks. However, the security risks of MM have barely been studied. In this paper, we first find that MM, as a new learning paradigm, introduces unique challenges for existing backdoor attacks due to the merging process. To address these challenges, we introduce BadMerging, the first backdoor attack specifically designed for MM. Notably, BadMerging allows an adversary to compromise the entire merged model by contributing as few as one backdoored task-specific model. BadMerging comprises a two-stage attack mechanism and a novel feature-interpolation-based loss to enhance the robustness of embedded backdoors against the changes of different merging parameters. Considering that a merged model may incorporate tasks from different domains, BadMerging can jointly compromise the tasks provided by the adversary (on-task attack) and other contributors (off-task attack) and solve the corresponding unique challenges with novel attack designs. Extensive experiments show that BadMerging achieves remarkable attacks against various MM algorithms. Our ablation study demonstrates that the proposed attack designs can progressively contribute to the attack performance. Finally, we show that prior defense mechanisms fail to defend against our attacks, highlighting the need for more advanced defense.
comment: To appear in ACM Conference on Computer and Communications Security (CCS), 2024
♻ ☆ OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection
Recent studies have demonstrated the significant potential of Large Language Models (LLMs) in generating Register Transfer Level (RTL) code, with notable advancements showcased by commercial models such as GPT-4 and Claude3-Opus. However, these proprietary LLMs often raise concerns regarding privacy and security. While open-source LLMs offer solutions to these concerns, they typically underperform commercial models in RTL code generation tasks, primarily due to the scarcity of high-quality open-source RTL datasets. To address this challenge, we introduce OriGen, a fully open-source framework that incorporates self-reflection capabilities and a novel dataset augmentation methodology for generating high-quality, large-scale RTL code. Our approach employs a code-to-code augmentation technique to enhance the quality of open-source RTL code datasets. Furthermore, OriGen can rectify syntactic errors through a self-reflection process that leverages compiler feedback. Experimental results demonstrate that OriGen significantly outperforms other open-source alternatives in RTL code generation. It surpasses the previous best-performing open-source LLM by 12.8% and even exceeds GPT-4 Turbo in the pass@1 metric on the VerilogEval-Human benchmark. Moreover, OriGen exhibits superior capabilities in self-reflection and error correction, outperforming GPT-4 by 19.9% on a benchmark designed to evaluate self-reflection capabilities.
♻ ☆ Biometrics and Behavior Analysis for Detecting Distractions in e-Learning
In this article, we explore computer vision approaches to detect abnormal head pose during e-learning sessions and we introduce a study on the effects of mobile phone usage during these sessions. We utilize behavioral data collected from 120 learners monitored while participating in MOOC learning sessions. Our study focuses on the influence of phone-usage events on behavior and physiological responses, specifically attention, heart rate, and meditation, before, during, and after phone usage. Additionally, we propose an approach for estimating head pose events using images taken by the webcam during the MOOC learning sessions to detect phone-usage events. Our hypothesis suggests that head posture undergoes significant changes when learners interact with a mobile phone, contrasting with the typical behavior seen when learners face a computer during e-learning sessions. We propose an approach designed to detect deviations in head posture from the average observed during a learner's session, operating as a semi-supervised method. This system flags events indicating alterations in head posture for subsequent human review and selection of mobile phone usage occurrences, achieving a sensitivity of over 90%.
comment: Published in IEEE Intl. Symposium on Computers in Education (SIIE) 2024
♻ ☆ VAAD: Visual Attention Analysis Dashboard applied to e-Learning
In this paper, we present an approach in the Multimodal Learning Analytics field. Within this approach, we have developed a tool to visualize and analyze eye movement data collected during learning sessions in online courses. The tool is named VAAD, an acronym for Visual Attention Analysis Dashboard. These eye movement data have been gathered using an eye-tracker and subsequently processed and visualized for interpretation. The purpose of the tool is to conduct a descriptive analysis of the data by facilitating its visualization, enabling the identification of differences and learning patterns among various learner populations. Additionally, it integrates a predictive module capable of anticipating learner activities during a learning session. Consequently, VAAD holds the potential to offer valuable insights into online learning behaviors from both descriptive and predictive perspectives.
comment: Published in IEEE Intl. Symposium on Computers in Education (SIIE) 2024
♻ ☆ From Static to Dynamic Structures: Improving Binding Affinity Prediction with Graph-Based Deep Learning
Accurate prediction of protein-ligand binding affinities is an essential challenge in structure-based drug design. Despite recent advances in data-driven methods for affinity prediction, their accuracy is still limited, partially because they only take advantage of static crystal structures while the actual binding affinities are generally determined by the thermodynamic ensembles between proteins and ligands. One effective way to approximate such a thermodynamic ensemble is to use molecular dynamics (MD) simulation. Here, an MD dataset containing 3,218 different protein-ligand complexes is curated, and Dynaformer, a graph-based deep learning model, is further developed to predict the binding affinities by learning the geometric characteristics of the protein-ligand interactions from the MD trajectories. In silico experiments demonstrated that the model exhibits state-of-the-art scoring and ranking power on the CASF-2016 benchmark dataset, outperforming the methods hitherto reported. Moreover, in a virtual screening on heat shock protein 90 (HSP90) using Dynaformer, 20 candidates are identified and their binding affinities are further experimentally validated. Dynaformer displayed promising results in virtual drug screening, revealing 12 hit compounds (two are in the submicromolar range), including several novel scaffolds. Overall, these results demonstrated that the approach offers a promising avenue for accelerating the early drug discovery process.
comment: Update the content according to the published version on Advanced Science (https://doi.org/10.1002/advs.202405404)
♻ ☆ Directly Handling Missing Data in Linear Discriminant Analysis for Enhancing Classification Accuracy and Interpretability
As the adoption of Artificial Intelligence (AI) models expands into critical real-world applications, ensuring the explainability of these models becomes paramount, particularly in sensitive fields such as medicine and finance. Linear Discriminant Analysis (LDA) remains a popular choice for classification due to its interpretable nature, derived from its capacity to model class distributions and enhance class separation through linear combinations of features. However, real-world datasets often suffer from incomplete data, posing substantial challenges for both classification accuracy and model interpretability. In this paper, we introduce a novel and robust classification method, termed Weighted missing Linear Discriminant Analysis (WLDA), which extends LDA to handle datasets with missing values without the need for imputation. Our approach innovatively incorporates a weight matrix that penalizes missing entries, thereby refining parameter estimation directly on incomplete data. This methodology not only preserves the interpretability of LDA but also significantly enhances classification performance in scenarios plagued by missing data. We conduct an in-depth theoretical analysis to establish the properties of WLDA and thoroughly evaluate its explainability. Experimental results across various datasets demonstrate that WLDA consistently outperforms traditional methods, especially in challenging environments where missing values are prevalent in both training and test datasets. This advancement provides a critical tool for improving classification accuracy and maintaining model transparency in the face of incomplete data.
♻ ☆ ERATTA: Extreme RAG for Table To Answers with Large Language Models
Large language models (LLMs) with retrieval-augmented generation (RAG) have been the optimal choice for scalable generative AI solutions in the recent past. Although RAG implemented with AI agents (agentic-RAG) has been recently popularized, it suffers from unstable costs and unreliable performance for Enterprise-level data practices. Most existing use-cases that incorporate RAG with LLMs have been either generic or extremely domain specific, thereby questioning the scalability and generalizability of RAG-LLM approaches. In this work, we propose a unique LLM-based system where multiple LLMs can be invoked to enable data authentication, user-query routing, data-retrieval and custom prompting for question-answering capabilities from Enterprise-data tables. The source tables here are highly fluctuating and large in size, and the proposed framework enables structured responses in under 10 seconds per query. Additionally, we propose a five metric scoring module that detects and reports hallucinations in the LLM responses. Our proposed system and scoring metrics achieve >90% confidence scores across hundreds of user queries in the sustainability, financial health and social media domains. Extensions to the proposed extreme RAG architectures can enable heterogeneous source querying using LLMs.
comment: 5 pages, 4 tables, IEEE Big Data, 2024
♻ ☆ Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities
Imaging techniques such as chest X-rays, whole slide images, and optical coherence tomography serve as initial screening and detection tools for a wide variety of medical conditions, including pulmonary and ophthalmic diseases. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families, each with pretraining and random initialization. Our findings showed that using pretrained models as fixed feature extractors yields poor performance irrespective of the dataset; in contrast, histopathology microscopy whole slide images achieved better performance. We also found that deeper and more complex architectures did not necessarily result in the best performance, which implies that improvements on ImageNet do not translate directly to medical imaging tasks. Within a medical domain, the performance of network architectures varies within model families as datasets shift, indicating that the performance of models on one modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.
comment: 15 pages, 3 figures, 4 tables
♻ ☆ PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator
We present Piecewise Rectified Flow (PeRFlow), a flow-based method for accelerating diffusion models. PeRFlow divides the sampling process of generative flows into several time windows and straightens the trajectories in each interval via the reflow operation, thereby approaching piecewise linear flows. PeRFlow achieves superior performance in a few-step generation. Moreover, through dedicated parameterizations, the PeRFlow models inherit knowledge from the pretrained diffusion models. Thus, the training converges fast and the obtained models show advantageous transfer ability, serving as universal plug-and-play accelerators that are compatible with various workflows based on the pre-trained diffusion models. Codes for training and inference are publicly released. https://github.com/magic-research/piecewise-rectified-flow
♻ ☆ Instant Adversarial Purification with Adversarial Consistency Distillation
Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use the Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve a defense success rate of 74.19\% on ImageNet, requiring only 0.1s per purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that GAND does not require full fine-tuning (FFT); parameter-efficient fine-tuning (PEFT), e.g., LoRA, is sufficient.
♻ ☆ MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents
Machine learning research, crucial for technological advancements and innovation, often faces significant challenges due to its inherent complexity, slow pace of experimentation, and the necessity for specialized expertise. Motivated by this, we present a new systematic framework, autonomous Machine Learning Research with large language models (MLR-Copilot), designed to enhance machine learning research productivity through the automatic generation and implementation of research ideas using Large Language Model (LLM) agents. The framework consists of three phases: research idea generation, experiment implementation, and implementation execution. First, existing research papers are used to generate hypotheses and experimental plans via IdeaAgent powered by LLMs. Next, the implementation generation phase translates these plans into executables with ExperimentAgent. This phase leverages retrieved prototype code and optionally retrieves candidate models and data. Finally, the execution phase, also managed by ExperimentAgent, involves running experiments with mechanisms for human feedback and iterative debugging to enhance the likelihood of achieving executable research outcomes. We evaluate our framework on five machine learning research tasks and the experimental results show the framework's potential to facilitate research progress and innovation.
♻ ☆ A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse
Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raise concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has repurposed these techniques as benign safeguards against the misuse of generative AI. Current approaches to safeguarding images from manipulation by LDMs are limited by their reliance on model-specific knowledge and their inability to significantly degrade the semantic quality of generated images. In response to these shortcomings, we propose the Posterior Collapse Attack (PCA) based on the observation that VAEs suffer from posterior collapse during training. Our method minimizes dependence on white-box information about target models, removing the implicit reliance on model-specific knowledge. By accessing only a small portion of the LDM parameters, specifically the VAE encoder, our method causes a substantial semantic collapse in generation quality, particularly in perceptual consistency, and demonstrates strong transferability across various model architectures. Experimental results show that PCA achieves superior perturbation effects on image generation of LDMs with lower runtime and VRAM usage. Our method outperforms existing techniques, offering a more robust and generalizable solution that is helpful in alleviating the socio-technical challenges posed by the rapidly evolving landscape of generative AI.
comment: 21 pages, 7 figures, 10 tables
♻ ☆ Pearl: A Production-ready Reinforcement Learning Agent
Reinforcement learning (RL) is a versatile framework for optimizing long-term goals. Although many real-world problems can be formalized with RL, learning and deploying a performant RL policy requires a system designed to address several important challenges, including the exploration-exploitation dilemma, partial observability, dynamic action spaces, and safety concerns. While the importance of these challenges has been well recognized, existing open-source RL libraries do not explicitly address them. This paper introduces Pearl, a Production-Ready RL software package designed to embrace these challenges in a modular way. In addition to presenting benchmarking results, we also highlight examples of Pearl's ongoing industry adoption to demonstrate its advantages for production use cases. Pearl is open sourced on GitHub at github.com/facebookresearch/pearl and its official website is pearlagent.github.io.
♻ ☆ Fault Tolerant ML: Efficient Meta-Aggregation and Synchronous Training
In this paper, we investigate the challenging framework of Byzantine-robust training in distributed machine learning (ML) systems, focusing on enhancing both efficiency and practicality. As distributed ML systems become integral for complex ML tasks, ensuring resilience against Byzantine failures, where workers may contribute incorrect updates due to malice or error, gains paramount importance. Our first contribution is the introduction of the Centered Trimmed Meta Aggregator (CTMA), an efficient meta-aggregator that upgrades baseline aggregators to optimal performance levels, while requiring low computational demands. Additionally, we propose harnessing a recently developed gradient estimation technique based on a double-momentum strategy within the Byzantine context. Our paper highlights its theoretical and practical advantages for Byzantine-robust training, especially in simplifying the tuning process and reducing the reliance on numerous hyperparameters. The effectiveness of this technique is supported by theoretical insights within the stochastic convex optimization (SCO) framework and corroborated by empirical evidence.
♻ ☆ An Idiosyncrasy of Time-discretization in Reinforcement Learning
Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessitating the study of how the choice of discretization may affect a reinforcement learning algorithm. In this work, we consider the relationship between the definitions of the continuous-time and discrete-time returns. Specifically, we acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment, and note how a simple modification can better align the return definitions. This observation is of practical consideration when dealing with environments where time-discretization granularity is a choice, or situations where such granularity is inherently stochastic.
comment: RLC 2024
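To see the idiosyncrasy concretely, compare a naive discrete-time return, which treats every step as one unit of time, with a step-size-aware return that discounts by elapsed physical time and scales rewards by the step duration. The sketch below is one standard way to align the two definitions and is not necessarily the exact modification proposed in the paper.

```python
import numpy as np

def naive_return(rewards, gamma):
    # Treats every step as "one unit of time", regardless of the discretization.
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def dt_aware_return(rewards, gamma, dt):
    # Discounts by elapsed physical time and scales by the step duration,
    # approximating the continuous-time integral of gamma^t * r(t) dt.
    return sum((gamma ** (k * dt)) * r * dt for k, r in enumerate(rewards))

# The same 1-second episode with a constant reward *rate*, two discretizations.
gamma = 0.9
for dt in (0.1, 0.01):
    rewards = np.ones(int(1.0 / dt))
    print(dt, naive_return(rewards, gamma), dt_aware_return(rewards, gamma, dt))
# The naive return changes drastically with dt; the dt-aware one stays stable.
```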
♻ ☆ On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift
Public pretraining is a promising approach to improve differentially private model training. However, recent work has noted that many positive research results studying this paradigm only consider in-distribution tasks, and may not apply to settings where there is distribution shift between the pretraining and finetuning data -- a scenario that is likely when finetuning private tasks due to the sensitive nature of the data. In this work, we show empirically across three tasks that even in settings with large distribution shift, where both zero-shot performance from public data and training from scratch with private data give unusably weak results, public features can in fact improve private training accuracy by up to 67\% over private training from scratch. We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training even if it is impossible to learn the private task from the public data alone. Altogether, our results provide evidence that public data can indeed make private training practical in realistic settings of extreme distribution shift.
♻ ☆ From Wide to Deep: Dimension Lifting Network for Parameter-efficient Knowledge Graph Embedding
Knowledge graph embedding (KGE) that maps entities and relations into vector representations is essential for downstream applications. Conventional KGE methods require high-dimensional representations to learn the complex structure of the knowledge graph, but this leads to oversized model parameters. Recent advances reduce parameters through low-dimensional entity representations, while developing techniques (e.g., knowledge distillation or reinvented representation forms) to compensate for the reduced dimension. However, such operations introduce complicated computations and model designs that may not benefit large knowledge graphs. To seek a simple strategy to improve the parameter efficiency of conventional KGE models, we take inspiration from the observation that deeper neural networks require exponentially fewer parameters than wider networks to achieve comparable expressiveness for compositional structures. We view all entity representations as a single-layer embedding network, and conventional KGE methods that adopt high-dimensional entity representations are equivalent to widening the embedding network to gain expressiveness. To achieve parameter efficiency, we instead propose a deeper embedding network for entity representations, i.e., a narrow entity embedding layer plus a multi-layer dimension lifting network (LiftNet). Experiments on three public datasets show that by integrating LiftNet, four conventional KGE methods with 16-dimensional representations achieve link prediction accuracy comparable to the original models that adopt 512-dimensional representations, saving 68.4% to 96.9% of the parameters.
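A minimal PyTorch sketch of the narrow-embedding-plus-lifting idea: entities are stored in a low-dimensional table and lifted to the working dimension by a small MLP before scoring. The layer sizes and the TransE-style score below are illustrative assumptions, not the configurations evaluated in the paper.

```python
import torch
import torch.nn as nn

class LiftedEntityEmbedding(nn.Module):
    """Narrow entity embeddings (e.g., 16-d) lifted to a wider working
    dimension by a small MLP, instead of storing wide embeddings directly."""
    def __init__(self, num_entities, narrow_dim=16, lifted_dim=512, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(num_entities, narrow_dim)
        self.lift = nn.Sequential(
            nn.Linear(narrow_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, lifted_dim),
        )

    def forward(self, entity_ids):
        return self.lift(self.emb(entity_ids))

# TransE-style scoring with lifted entity embeddings and wide relation
# embeddings (purely illustrative of how the lifted vectors would be used).
entities = LiftedEntityEmbedding(num_entities=10_000)
relations = nn.Embedding(200, 512)
h, r, t = torch.tensor([3]), torch.tensor([7]), torch.tensor([42])
score = -torch.norm(entities(h) + relations(r) - entities(t), dim=-1)
```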
♻ ☆ TrajDeleter: Enabling Trajectory Forgetting in Offline Reinforcement Learning Agents NDSS 2025
Reinforcement learning (RL) trains an agent from experiences interacting with the environment. In scenarios where online interactions are impractical, offline RL, which trains the agent using pre-collected datasets, has become popular. While this new paradigm presents remarkable effectiveness across various real-world domains, like healthcare and energy management, there is a growing demand to enable agents to rapidly and completely eliminate the influence of specific trajectories from both the training dataset and the trained agents. To address this problem, this paper proposes Trajdeleter, the first practical approach to trajectory unlearning for offline RL agents. The key idea of Trajdeleter is to guide the agent to demonstrate deteriorating performance when it encounters states associated with unlearning trajectories. Simultaneously, it ensures the agent maintains its original performance level when facing other remaining trajectories. Additionally, we introduce Trajauditor, a simple yet efficient method to evaluate whether Trajdeleter successfully eliminates the specific trajectories of influence from the offline RL agent. Extensive experiments conducted on six offline RL algorithms and three tasks demonstrate that Trajdeleter requires only about 1.5% of the time needed for retraining from scratch. It effectively unlearns an average of 94.8% of the targeted trajectories yet still performs well in actual environment interactions after unlearning. The replication package and agent parameters are available online.
comment: Accepted at NDSS 2025. The presented document here is the full version of our paper
♻ ☆ Blending Neural Operators and Relaxation Methods in PDE Numerical Solvers
Neural networks suffer from spectral bias, having difficulty representing the high-frequency components of a function, while relaxation methods can resolve high frequencies efficiently but stall at moderate to low frequencies. We offset the weaknesses of the two approaches by combining them synergistically to develop a fast numerical solver of partial differential equations (PDEs) at scale. Specifically, we propose HINTS, a hybrid, iterative, numerical, and transferable solver that integrates a Deep Operator Network (DeepONet) with standard relaxation methods, leading to parallel efficiency and algorithmic scalability for a wide class of PDEs not tractable with existing monolithic solvers. HINTS balances the convergence behavior across the spectrum of eigenmodes by utilizing the spectral bias of DeepONet, resulting in a uniform convergence rate and hence exceptional performance of the hybrid solver overall. Moreover, HINTS applies to large-scale, multidimensional systems, and it is flexible with regard to discretizations, computational domains, and boundary conditions.
comment: Main text: 17 pages, 6 figures. Supplementary Information: 30 pages, 8 figures, 2 tables, 4 algorithms
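A toy version of the hybrid iteration for a 1-D Poisson problem is sketched below: Jacobi sweeps damp high-frequency error, and every few sweeps a learned operator maps the current residual to an approximate error correction. Because a trained DeepONet is not available here, a coarse direct solve stands in for it purely to make the loop runnable; the schedule and problem setup are likewise illustrative.

```python
import numpy as np

def jacobi_sweep(u, f, h):
    """One Jacobi relaxation sweep for -u'' = f on [0, 1] with zero
    Dirichlet boundaries (interior points only)."""
    u_new = u.copy()
    u_new[1:-1] = 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return u_new

def hints_solve(f, n, neural_correction, n_iters=200, neural_every=10):
    """Hybrid iteration: Jacobi handles high frequencies; every few sweeps a
    learned operator (DeepONet in the paper, a stub here) maps the residual
    to an approximate error and corrects the iterate."""
    h = 1.0 / (n - 1)
    u = np.zeros(n)
    for k in range(n_iters):
        if k % neural_every == 0 and k > 0:
            r = np.zeros(n)
            r[1:-1] = f[1:-1] + (u[:-2] - 2 * u[1:-1] + u[2:]) / (h * h)
            u += neural_correction(r)          # tackles low-frequency error
        else:
            u = jacobi_sweep(u, f, h)
    return u

def toy_correction(residual):
    # Stand-in "operator": a direct solve on the residual equation plays the
    # role of the trained DeepONet, only to make the example self-contained.
    n = residual.size
    h = 1.0 / (n - 1)
    A = (np.diag(np.full(n - 2, 2.0)) - np.diag(np.ones(n - 3), 1)
         - np.diag(np.ones(n - 3), -1)) / (h * h)
    e = np.zeros(n)
    e[1:-1] = np.linalg.solve(A, residual[1:-1])
    return e

x = np.linspace(0, 1, 65)
u = hints_solve(np.sin(np.pi * x), 65, toy_correction)
```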
Information Retrieval 13
☆ Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets ECCV 2024
Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets. Project Page: https://daveishan.github.io/avr-webpage/.
comment: ECCV 2024 Oral
☆ Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain
Hybrid search has emerged as an effective strategy to offset the limitations of different matching paradigms, especially in out-of-domain contexts where notable improvements in retrieval quality have been observed. However, existing research predominantly focuses on a limited set of retrieval methods, evaluated in pairs on domain-general datasets exclusively in English. In this work, we study the efficacy of hybrid search across a variety of prominent retrieval models within the unexplored field of law in the French language, assessing both zero-shot and in-domain scenarios. Our findings reveal that in a zero-shot context, fusing different domain-general models consistently enhances performance compared to using a standalone model, regardless of the fusion method. Surprisingly, when models are trained in-domain, we find that fusion generally diminishes performance relative to using the best single system, unless fusing scores with carefully tuned weights. These novel insights, among others, expand the applicability of prior findings across a new field and language, and contribute to a deeper understanding of hybrid search in non-English specialized domains.
comment: Under review
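The trade-off the abstract describes, fusion that helps zero-shot but needs carefully tuned weights in-domain, can be made concrete with two standard fusion rules. The sketch below shows reciprocal rank fusion and weighted min-max score fusion over hypothetical retriever outputs; it is not tied to the specific systems evaluated in the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists, one per retriever.
    Standard RRF: score(d) = sum over systems of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def weighted_score_fusion(score_dicts, weights):
    """Min-max normalize each system's scores, then take a weighted sum.
    Tuning `weights` on in-domain data is the knob the abstract alludes to."""
    fused = {}
    for scores, w in zip(score_dicts, weights):
        lo, hi = min(scores.values()), max(scores.values())
        for doc_id, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0
            fused[doc_id] = fused.get(doc_id, 0.0) + w * norm
    return sorted(fused, key=fused.get, reverse=True)

# Example: fuse a lexical (BM25-like) run with a dense run.
lexical = {"d1": 12.3, "d2": 9.8, "d3": 4.1}
dense = {"d2": 0.83, "d4": 0.80, "d1": 0.55}
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4", "d1"]]))
print(weighted_score_fusion([lexical, dense], weights=[0.4, 0.6]))
```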
SSD4Rec: A Structured State Space Duality Model for Efficient Sequential Recommendation
Sequential recommendation methods are crucial in modern recommender systems for their remarkable capability to understand a user's changing interests based on past interactions. However, a significant challenge faced by current methods (e.g., RNN- or Transformer-based models) is to effectively and efficiently capture users' preferences by modeling long behavior sequences, which impedes their various applications like short video platforms where user interactions are numerous. Recently, an emerging architecture named Mamba, built on state space models (SSM) with efficient hardware-aware designs, has showcased the tremendous potential for sequence modeling, presenting a compelling avenue for addressing the challenge effectively. Inspired by this, we propose a novel generic and efficient sequential recommendation backbone, SSD4Rec, which explores the seamless adaptation of Mamba for sequential recommendations. Specifically, SSD4Rec marks the variable- and long-length item sequences with sequence registers and processes the item representations with bidirectional Structured State Space Duality (SSD) blocks. This not only allows for hardware-aware matrix multiplication but also empowers outstanding capabilities in variable-length and long-range sequence modeling. Extensive evaluations on four benchmark datasets demonstrate that the proposed model achieves state-of-the-art performance while maintaining near-linear scalability with user sequence length. Our code is publicly available at https://github.com/ZhangYifeng1995/SSD4Rec.
☆ Real World Conversational Entity Linking Requires More Than Zeroshots
Entity linking (EL) in conversations faces notable challenges in practical applications, primarily due to the scarcity of entity-annotated conversational datasets and sparse knowledge bases (KB) containing domain-specific, long-tail entities. We designed targeted evaluation scenarios to measure the efficacy of EL models under resource constraints. Our evaluation employs two KBs: Fandom, exemplifying real-world EL complexities, and the widely used Wikipedia. First, we assess EL models' ability to generalize to a new unfamiliar KB using Fandom and a novel zero-shot conversational entity linking dataset that we curated based on Reddit discussions on Fandom entities. We then evaluate the adaptability of EL models to conversational settings without prior training. Our results indicate that current zero-shot EL models falter when introduced to new, domain-specific KBs without prior training, significantly dropping in performance. Our findings reveal that previous evaluation approaches fall short of capturing real-world complexities for zero-shot EL, highlighting the necessity for new approaches to design and assess conversational EL models to adapt to limited resources. The evaluation setup and the dataset proposed in this research are made publicly available.
LLM-PQA: LLM-enhanced Prediction Query Answering CIKM 2024
The advent of Large Language Models (LLMs) provides an opportunity to change the way queries are processed, moving beyond the constraints of conventional SQL-based database systems. However, using an LLM to answer a prediction query is still challenging, since an external ML model has to be employed and inference has to be performed in order to provide an answer. This paper introduces LLM-PQA, a novel tool that addresses prediction queries formulated in natural language. LLM-PQA is the first to combine the capabilities of LLMs and a retrieval-augmented mechanism for the needs of prediction queries by integrating data lakes and model zoos. This integration provides users with access to a vast spectrum of heterogeneous data and diverse ML models, facilitating dynamic prediction query answering. In addition, LLM-PQA can dynamically train models on demand, based on specific query requirements, ensuring reliable and relevant results even when no suitable pre-trained model is available in a model zoo for the task.
comment: This paper is accepted as a demo at CIKM 2024
☆ Evidential Transformers for Improved Image Retrieval ECCV 2024
We introduce the Evidential Transformer, an uncertainty-driven transformer model for improved and robust image retrieval. In this paper, we make several contributions to content-based image retrieval (CBIR). We incorporate probabilistic methods into image retrieval, achieving robust and reliable results, with evidential classification surpassing traditional training based on multiclass classification as a baseline for deep metric learning. Furthermore, we improve the state-of-the-art retrieval results on several datasets by leveraging the Global Context Vision Transformer (GC ViT) architecture. Our experimental results consistently demonstrate the reliability of our approach, setting a new benchmark in CBIR in all test settings on the Stanford Online Products (SOP) and CUB-200-2011 datasets.
comment: 6 pages, 6 figures, To be presented at the 3rd Workshop on Uncertainty Quantification for Computer Vision, at the ECCV 2024 conference in Milan, Italy
☆ Improved Diversity-Promoting Collaborative Metric Learning for Recommendation
Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems (RS), closing the gap between metric learning and collaborative filtering. Following the convention of RS, existing practices exploit unique user representation in their model design. This paper focuses on a challenging scenario where a user has multiple categories of interests. Under this setting, the unique user representation might induce preference bias, especially when the item category distribution is imbalanced. To address this issue, we propose a novel method called \textit{Diversity-Promoting Collaborative Metric Learning} (DPCML), with the hope of accounting for the commonly ignored minority interests of the user. The key idea behind DPCML is to introduce a set of multiple representations for each user in the system where users' preference toward an item is aggregated by taking the minimum item-user distance among their embedding set. Specifically, we instantiate two effective assignment strategies to explore a proper quantity of vectors for each user. Meanwhile, a \textit{Diversity Control Regularization Scheme} (DCRS) is developed to accommodate the multi-vector representation strategy better. Theoretically, we show that DPCML could induce a smaller generalization error than traditional CML. Furthermore, we notice that CML-based approaches usually require \textit{negative sampling} to reduce the heavy computational burden caused by the pairwise objective therein. In this paper, we reveal the fundamental limitation of the widely adopted hard-aware sampling from the One-Way Partial AUC (OPAUC) perspective and then develop an effective sampling alternative for the CML-based paradigm. Finally, comprehensive experiments over a range of benchmark datasets speak to the efficacy of DPCML. Code is available at \url{https://github.com/statusrank/LibCML}.
comment: arXiv admin note: text overlap with arXiv:2209.15292
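To illustrate the multi-vector idea at the heart of DPCML, the sketch below scores items by the minimum Euclidean distance to any of a user's interest vectors; the embedding sizes and data are toy placeholders, and the assignment strategies and DCRS regularizer from the paper are not reproduced here.

```python
import numpy as np

def multi_interest_scores(user_vecs, item_vecs):
    """user_vecs: (C, d) interest embeddings for one user; item_vecs: (N, d).
    A user's affinity to an item is based on the *minimum* distance over
    their C vectors, so minority interests are not averaged away."""
    dists = np.linalg.norm(user_vecs[:, None, :] - item_vecs[None, :, :], axis=-1)
    return -dists.min(axis=0)        # higher score = some interest vector is close

# Example: two well-separated interest vectors rank items near either cluster highly.
rng = np.random.default_rng(0)
user = np.stack([rng.normal(0, 0.1, 8) + 1.0, rng.normal(0, 0.1, 8) - 1.0])
items = rng.normal(0, 1.0, (5, 8))
ranking = np.argsort(-multi_interest_scores(user, items))
```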
☆ Towards Investigating Biases in Spoken Conversational Search
Voice-based systems like Amazon Alexa, Google Assistant, and Apple Siri, along with the growing popularity of OpenAI's ChatGPT and Microsoft's Copilot, serve diverse populations, including visually impaired and low-literacy communities. This reflects a shift in user expectations from traditional search to more interactive question-answering models. However, presenting information effectively in voice-only channels remains challenging due to their linear nature. This limitation can impact the presentation of complex queries involving controversial topics with multiple perspectives. Failing to present diverse viewpoints may perpetuate or introduce biases and affect user attitudes. Balancing information load and addressing biases is crucial in designing a fair and effective voice-based system. To address this, we (i) review how biases and user attitude changes have been studied in screen-based web search, (ii) address challenges in studying these changes in voice-based settings like SCS, (iii) outline research questions, and (iv) propose an experimental setup with variables, data, and instruments to explore biases in a voice-based setting like Spoken Conversational Search.
comment: Accepted Late-Breaking Results at ACM ICMI Companion 2024
♻ ☆ Manipulating Large Language Models to Increase Product Visibility
Large language models (LLMs) are increasingly being integrated into search engines to provide natural language responses tailored to user queries. Customers and end-users are also becoming more dependent on these models for quick and easy purchase decisions. In this work, we investigate whether recommendations from LLMs can be manipulated to enhance a product's visibility. We demonstrate that adding a strategic text sequence (STS) -- a carefully crafted message -- to a product's information page can significantly increase its likelihood of being listed as the LLM's top recommendation. To understand the impact of STS, we use a catalog of fictitious coffee machines and analyze its effect on two target products: one that seldom appears in the LLM's recommendations and another that usually ranks second. We observe that the strategic text sequence significantly enhances the visibility of both products by increasing their chances of appearing as the top recommendation. This ability to manipulate LLM-generated search responses provides vendors with a considerable competitive advantage and has the potential to disrupt fair market competition. Just as search engine optimization (SEO) revolutionized how webpages are customized to rank higher in search engine results, influencing LLM recommendations could profoundly impact content optimization for AI-driven search services. Code for our experiments is available at https://github.com/aounon/llm-rank-optimizer.
♻ ☆ A multi-language toolkit for supporting automated checking of research outputs
This article presents the automatic checking of research outputs package acro, which assists researchers and data governance teams by automatically applying best-practice principles-based statistical disclosure control (SDC) techniques on-the-fly as researchers conduct their analyses. acro distinguishes between: research output that is safe to publish; output that requires further analysis; and output that cannot be published because it creates substantial risk of disclosing private data. This is achieved through the use of a lightweight Python wrapper that sits over well-known analysis tools that produce outputs such as tables, plots, and statistical models. This adds functionality to (i) identify potentially disclosive outputs against a range of commonly used disclosure tests; (ii) apply disclosure mitigation strategies where required; (iii) report reasons for applying SDC; and (iv) produce simple summary documents trusted research environment staff can use to streamline their workflow. The major analytical programming languages used by researchers are supported: Python, R, and Stata. The acro code and documentation are available under an MIT license at https://github.com/AI-SDC/ACRO
♻ ☆ VM-Rec: A Variational Mapping Approach for Cold-start User Recommendation
The cold-start problem is a common challenge for most recommender systems. The practical application of most cold-start methods is hindered by the deficiency in auxiliary content information for users. Moreover, most methods necessitate simultaneous updates to the extensive parameters of recommender models, leading to significant training costs, particularly in large-scale industrial scenarios. We observe that the model can generate expressive embeddings for warm users with relatively more interactions. Initially, these users were cold-start users, and after transitioning to warm users, they exhibit clustering patterns in their embeddings with consistent initial interactions. Based on this motivation, we propose a Variational Mapping approach for cold-start user Recommendation (VM-Rec), mapping from few initial interactions to expressive embeddings for cold-start users. Specifically, we encode the initial interactions into a latent representation, where each dimension signifies, in a disentangled manner, the degree of association with a particular warm user. Subsequently, we utilize this latent representation as the parameters for the mapping function, mapping (decoding) it into an expressive embedding, which can be integrated into a pre-trained recommender model directly. Our method is evaluated on three datasets using the same base model, demonstrating superior performance compared to other popular cold-start methods.
♻ ☆ A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning KDD
Retrieval-augmented generation (RAG) is a framework enabling large language models (LLMs) to enhance their accuracy and reduce hallucinations by integrating external knowledge bases. In this paper, we introduce a hybrid RAG system enhanced through a comprehensive suite of optimizations that significantly improve retrieval quality, augment reasoning capabilities, and refine numerical computation ability. We refined the text chunks and tables in web pages, added attribute predictors to reduce hallucinations, built an LLM Knowledge Extractor and a Knowledge Graph Extractor, and finally constructed a reasoning strategy that uses all the references. We evaluated our system on the CRAG dataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and online evaluations demonstrate that our system significantly enhances complex reasoning capabilities. In local evaluations, we significantly improved accuracy and reduced error rates compared to the baseline model, achieving a notable increase in scores. Meanwhile, we attained outstanding results in online assessments, demonstrating the performance and generalization capabilities of the proposed system. The source code for our system is released at \url{https://gitlab.aicrowd.com/shizueyy/crag-new}.
comment: Technical report for 3rd prize in Task 1 of Meta CRAG KDD Cup 2024
♻ ☆ PEPT: Expert Finding Meets Personalized Pre-training
Finding experts is essential in Community Question Answering (CQA) platforms as it enables the effective routing of questions to potential users who can provide relevant answers. The key is to learn personalized expert representations based on their historically answered questions and to accurately match them with target questions. There have been some preliminary works exploring the usability of PLMs in expert finding, such as pre-training expert or question representations. However, these models usually learn pure text representations of experts from histories, disregarding personalized and fine-grained expert modeling. To alleviate this, we present a personalized pre-training and fine-tuning paradigm, which could effectively learn expert interest and expertise simultaneously. Specifically, in our pre-training framework, we integrate the historical answered questions of one expert with one target question, and regard it as a candidate-aware expert-level input unit. Then, we fuse expert IDs into the pre-training to guide the model toward personalized expert representations, which helps capture the unique characteristics and expertise of each individual expert. Additionally, in our pre-training task, we design: 1) a question-level masked language model task to learn the relatedness between histories, enabling the modeling of question-level expert interest; 2) a vote-oriented task to capture question-level expert expertise by predicting the vote score the expert would receive. Through our pre-training framework and tasks, our approach could holistically learn expert representations including interests and expertise. Our method has been extensively evaluated on six real-world CQA datasets, and the experimental results consistently demonstrate the superiority of our approach over competitive baseline methods.
Computation and Language 19
♻ ☆ CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models
Open large language models (LLMs) have significantly advanced the field of natural language processing, showcasing impressive performance across various tasks. Despite the significant advancements in LLMs, their effective operation still relies heavily on human input to accurately guide the dialogue flow, with agent tuning being a crucial optimization technique that involves human adjustments to the model for better response to such guidance. Addressing this dependency, our work introduces the TinyAgent model, trained on a meticulously curated high-quality dataset. We also present the Collaborative Multi-Agent Tuning (CMAT) framework, an innovative system designed to augment language agent capabilities through adaptive weight updates based on environmental feedback. This framework fosters collaborative learning and real-time adaptation among multiple intelligent agents, enhancing their context-awareness and long-term memory. In this research, we propose a new communication agent framework that integrates multi-agent systems with environmental feedback mechanisms, offering a scalable method to explore cooperative behaviors. Notably, our TinyAgent-7B model exhibits performance on par with GPT-3.5, despite having fewer parameters, signifying a substantial improvement in the efficiency and effectiveness of LLMs.
♻ ☆ Dynamic Boundary Time Warping for Sub-sequence Matching with Few Examples
The paper presents a novel method for finding a fragment of a long temporal sequence that is similar to a set of shorter sequences. We are the first to propose an algorithm for such a search that does not rely on computing the average sequence from query examples. Instead, we use the query examples as they are, utilizing all of them simultaneously. The introduced method, based on the Dynamic Time Warping (DTW) technique, is explicitly suited for few-shot query-by-example retrieval tasks. We evaluate it on two different few-shot problems from the field of Natural Language Processing. The results show it either outperforms baselines and previous approaches or achieves comparable results when a low number of examples is available.
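A minimal sketch of the underlying machinery, assuming 1-D sequences and absolute-difference costs: open-begin/open-end (subsequence) DTW yields, for each query example, a cost profile over possible fragment end positions, and the profiles from all examples are then combined. Summing the profiles, as done below, is a simplification; the paper's Dynamic Boundary Time Warping combines the examples differently.

```python
import numpy as np

def subsequence_dtw_costs(query, sequence):
    """Open-begin/open-end DTW: for each end index j of `sequence`, return the
    cheapest cost of aligning the whole `query` to some fragment ending at j."""
    n, m = len(query), len(sequence)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0                          # a match may start anywhere
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - sequence[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, 1:]                        # cost profile over end positions

def locate_fragment(queries, sequence):
    """Few-shot matching: use every query example as-is and aggregate their
    end-position cost profiles (a simple sum here), then return the best end."""
    profiles = [subsequence_dtw_costs(q, sequence) for q in queries]
    combined = np.sum(profiles, axis=0)
    return int(np.argmin(combined)), combined

seq = np.concatenate([np.zeros(30), np.sin(np.linspace(0, 3, 25)), np.zeros(30)])
queries = [np.sin(np.linspace(0, 3, 20)), np.sin(np.linspace(0.2, 3.1, 22))]
end, _ = locate_fragment(queries, seq)     # end lands near index 54
```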
♻ ☆ REBEL: Reinforcement Learning via Regressing Relative Rewards
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in AlpacaEval 2.0, MT-Bench, and Open LLM Leaderboard.
comment: New experimental results on general chat
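The core reduction can be written in a few lines: regress the difference in rewards between two completions of the same prompt onto the change in their log-probability ratios under the current versus previous policy. The sketch below is a schematic PyTorch version of that squared-error objective; the tensor names, the value of $\eta$, and the random stand-in inputs are illustrative.

```python
import torch

def rebel_loss(logp_new_a, logp_new_b, logp_old_a, logp_old_b,
               reward_a, reward_b, eta=1.0):
    """Squared-error regression of the relative reward onto the change in
    log-probability ratios for a pair of completions (a, b) to the same prompt.
    All inputs are 1-D tensors over a batch of prompt/completion pairs."""
    pred = ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)) / eta
    target = reward_a - reward_b
    return ((pred - target) ** 2).mean()

# Toy batch: log-probs would come from the current and previous policy,
# rewards from a reward model; here they are random stand-ins.
B = 4
logp_new_a = torch.randn(B, requires_grad=True)
logp_new_b = torch.randn(B, requires_grad=True)
logp_old_a, logp_old_b = torch.randn(B), torch.randn(B)
loss = rebel_loss(logp_new_a, logp_new_b, logp_old_a, logp_old_b,
                  torch.randn(B), torch.randn(B))
loss.backward()
```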
♻ ☆ MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
Large language models (LLM) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that the performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions that simplify the quantification of LLM performance but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that can help the LLM generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM medical question-answering benchmark performance generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. Successful "attacks" modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless "trick" the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can ensure a successful attack is statistically significant. We show how performance on a "MedFuzzed" benchmark, as well as individual successful attacks, can be used to probe robustness. Together, these methods show promise at providing insights into the ability of an LLM to operate robustly in more realistic settings.
comment: 9 pages, 3 figures, 2 algorithms, appendix
♻ ☆ Forecasting Live Chat Intent from Browsing History CIKM 2024
Customers reach out to online live chat agents with various intents, such as asking about product details or requesting a return. In this paper, we propose the problem of predicting user intent from browsing history and address it through a two-stage approach. The first stage classifies a user's browsing history into high-level intent categories. Here, we represent each browsing history as a text sequence of page attributes and use the ground-truth class labels to fine-tune pretrained Transformers. The second stage provides a large language model (LLM) with the browsing history and predicted intent class to generate fine-grained intents. For automatic evaluation, we use a separate LLM to judge the similarity between generated and ground-truth intents, which closely aligns with human judgments. Our two-stage approach yields significant performance gains compared to generating intents without the classification stage.
comment: CIKM 2024
♻ ☆ Greedy Grammar Induction with Indirect Negative Evidence
This paper offers a fresh look at the pumping lemma constant as an upper bound on the information required for learning Context Free Grammars. An objective function based on indirect negative evidence considers the occurrences, and non-occurrences, of a finite number of strings, encountered after a sufficiently long presentation. This function has optimal substructure in the hypotheses space, giving rise to a greedy search learner in a branch and bound method. A hierarchy of learnable classes is defined in terms of the number of production rules that must be added to interim solutions in order to incrementally fit the input. Efficiency strongly depends on the position of the target grammar in the hierarchy and on the richness of the input.
comment: 12 pages (including references), 1 PNG file, 5 ancillary files (dataset)
♻ ☆ LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers like DPR typically work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the `needle' unit. In contrast, the readers only need to generate answers from the short retrieved units. The imbalanced `heavy' retriever and `light' reader design can lead to sub-optimal performance. The loss of contextual information in the short, chunked units may increase the likelihood of introducing hard negatives during the retrieval stage. Additionally, the reader might not fully leverage the capabilities of recent advancements in LLMs. In order to alleviate the imbalance, we propose a new framework, LongRAG, consisting of a `long retriever' and a `long reader'. On the two Wikipedia-based datasets, NQ and HotpotQA, LongRAG processes the entire Wikipedia corpus into 4K-token units by grouping related documents. By increasing the unit size, we significantly reduce the total number of units. This greatly reduces the burden on the retriever, resulting in strong retrieval performance with only a few (fewer than 8) top units. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ and 64.3% on HotpotQA, which are on par with the (fully-trained) SoTA model. Furthermore, we test on two non-Wikipedia-based datasets, Qasper and MultiFieldQA-en. LongRAG processes each individual document as a single (long) unit rather than chunking it into smaller units. By doing so, we achieve an F1 score of 25.9% on Qasper and 57.5% on MultiFieldQA-en. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
comment: Technical Report
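A minimal sketch of the long-retrieval-unit idea follows: related documents are greedily packed into units of roughly 4K tokens, so the retriever searches far fewer units. The whitespace token count and adjacency-based grouping are simplifying assumptions, not LongRAG's exact grouping procedure.

```python
# Greedily pack documents into ~4K-token retrieval units.

def num_tokens(text):
    # Crude whitespace token count; a real system would use the model tokenizer.
    return len(text.split())

def pack_documents(docs, max_tokens=4000):
    units, current, current_len = [], [], 0
    for doc in docs:                    # docs assumed ordered so related ones are adjacent
        t = num_tokens(doc)
        if current and current_len + t > max_tokens:
            units.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(doc)
        current_len += t
    if current:
        units.append("\n\n".join(current))
    return units

corpus = ["lead article text ...", "linked article text ...", "another related page ..."]
retrieval_units = pack_documents(corpus)
print(len(retrieval_units), "retrieval units")
```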
♻ ☆ Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models
Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have shown remarkable progress for medical decision-making. However, these models are designed for specific classification or generative tasks, and require training or fine-tuning on large-scale datasets with sizeable parameter counts and substantial compute, hindering their clinical utility across diverse resource-constrained scenarios in practice. In this paper, we propose a novel and lightweight framework, Med-MoE (Mixture-of-Experts), that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we then enable the model for different multimodal medical tasks with instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by a meta expert. Comprehensive experiments on both open- and closed-ended medical visual question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model can achieve performance superior to or on par with state-of-the-art baselines, while only requiring approximately 30%-50% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
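To make the routing step concrete, below is a minimal mixture-of-domain-experts layer with a trainable router and an always-on meta expert. The dimensions, top-1 routing rule, and linear experts are illustrative assumptions, not the Med-MoE architecture.

```python
# Toy mixture-of-experts layer: a router picks one domain expert per token,
# and a shared "meta expert" contributes regardless of the routing decision.
import torch
import torch.nn as nn

class DomainMoE(nn.Module):
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)                 # trainable router
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.meta_expert = nn.Linear(dim, dim)                     # always-on shared expert

    def forward(self, x):
        weights = torch.softmax(self.router(x), dim=-1)            # (batch, num_experts)
        top_w, top_idx = weights.max(dim=-1)                       # top-1 expert per token
        out = torch.stack([self.experts[i](xi) for i, xi in zip(top_idx.tolist(), x)])
        return top_w.unsqueeze(-1) * out + self.meta_expert(x)     # expert output + meta expert

moe = DomainMoE()
print(moe(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```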
♻ ☆ T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
The recent development of Sora leads to a new era in text-to-video (T2V) generation. Along with this comes the rising concern about its security risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover fewer aspects and do not address the unique temporal risk inherent in video generation. To bridge this research gap, we introduce T2VSafetyBench, a new benchmark designed for conducting safety-critical assessments of text-to-video models. We define 12 critical aspects of video generation safety and construct a malicious prompt dataset including real-world prompts, LLM-generated prompts and jailbreak attack-based prompts. Based on our evaluation results, we draw several important findings, including: 1) no single model excels in all aspects, with different models showing various strengths; 2) the correlation between GPT-4 assessments and manual reviews is generally high; 3) there is a trade-off between the usability and safety of text-to-video generative models. This indicates that as the field of video generation rapidly advances, safety risks are set to surge, highlighting the urgency of prioritizing video safety. We hope that T2VSafetyBench can provide insights for better understanding the safety of video generation in the era of generative AI.
♻ ☆ CancerLLM: A Large Language Model in Cancer Domain
Medical Large Language Models (LLMs) such as ClinicalCamel 70B and Llama3-OpenBioLLM 70B have demonstrated impressive performance on a wide variety of medical NLP tasks. However, there is still no large language model (LLM) specifically designed for the cancer domain. Moreover, these LLMs typically have billions of parameters, making them computationally expensive for healthcare systems. Thus, in this study, we propose CancerLLM, a model with 7 billion parameters and a Mistral-style architecture, pre-trained on 2,676,642 clinical notes and 515,524 pathology reports covering 17 cancer types, followed by fine-tuning on three cancer-relevant tasks, including cancer phenotype extraction and cancer diagnosis generation. Our evaluation demonstrated that CancerLLM achieves state-of-the-art results compared to other existing LLMs, with an average F1 score improvement of 7.61%. Additionally, CancerLLM outperforms other models on two proposed robustness testbeds. This illustrates that CancerLLM can be effectively applied to clinical AI systems, enhancing clinical research and healthcare delivery in the field of cancer.
comment: add the diagnosis evaluation of ICD code
♻ ☆ MLRegTest: A Benchmark for the Machine Learning of Regular Languages
Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known challenge for ML systems seeking to generalize successfully. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.
comment: Accepted for publication in the Journal of Machine Learning Research. Dataset available at https://doi.org/10.5061/dryad.dncjsxm4h , code available at https://github.com/heinz-jeffrey/subregular-learning
♻ ☆ The Oscars of AI Theater: A Survey on Role-Playing with Language Models
This survey explores the burgeoning field of role-playing with language models, focusing on their development from early persona-based models to advanced character-driven simulations facilitated by Large Language Models (LLMs). Initially confined to simple persona consistency due to limited model capabilities, role-playing tasks have now expanded to embrace complex character portrayals involving character consistency, behavioral alignment, and overall attractiveness. We provide a comprehensive taxonomy of the critical components in designing these systems, including data, models and alignment, agent architecture, and evaluation. This survey not only outlines the current methodologies and challenges, such as managing dynamic personal profiles and achieving high-level persona consistency, but also suggests avenues for future research in improving the depth and realism of role-playing applications. The goal is to guide future research by offering a structured overview of current methodologies and identifying potential areas for improvement. Related resources and papers are available at https://github.com/nuochenpku/Awesome-Role-Play-Papers.
comment: 28 pages
♻ ☆ Ex3: Automatic Novel Writing by Extracting, Excelsior and Expanding
Generating long-form texts such as novels using artificial intelligence has always been a challenge. A common approach is to use large language models (LLMs) to construct a hierarchical framework that first plans and then writes. Although the generated novels reach a sufficient length, their plots lack logical coherence and appeal, and character and event depiction is deficient, ultimately compromising the overall narrative quality. In this paper, we propose Ex3, a method named for Extracting, Excelsior and Expanding. Ex3 initially extracts structure information from raw novel data. By combining this structure information with the novel data, an instruction-following dataset is meticulously crafted. This dataset is then utilized to fine-tune the LLM, aiming for excelsior generation performance. In the final stage, a tree-like expansion method is deployed to facilitate the generation of arbitrarily long novels. Evaluation against previous methods showcases Ex3's ability to produce higher-quality long-form novels.
♻ ☆ Tur[k]ingBench: A Challenge Benchmark for Web Agents
Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g., modifying a text box, selecting a radio button). We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models (such as GPT4 and InternVL), on this benchmark. Our results show that while these models outperform random chance, there is still significant room for improvement. We hope that this benchmark will drive progress in the evaluation and development of web-based agents.
♻ ☆ ReMamba: Equip Mamba with Effective Long-Sequence Modeling
While the Mamba architecture demonstrates superior inference efficiency and competitive performance on short-context natural language processing (NLP) tasks, empirical evidence suggests its capacity to comprehend long contexts is limited compared to transformer-based models. In this study, we investigate the long-context efficiency issues of the Mamba models and propose ReMamba, which enhances Mamba's ability to comprehend long contexts. ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process, incurring minimal additional inference overhead. Experimental results on the LongBench and L-Eval benchmarks demonstrate ReMamba's efficacy, improving over the baselines by 3.2 and 1.6 points, respectively, and attaining performance almost on par with same-size transformer models.
♻ ☆ Editing Personality for Large Language Models NLPCC 2024
This paper introduces an innovative task focused on editing the personality traits of Large Language Models (LLMs). This task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits. Specifically, we construct PersonalityEdit, a new benchmark dataset to address this task. Drawing on the theory in Social Psychology, we isolate three representative traits, namely Neuroticism, Extraversion, and Agreeableness, as the foundation for our benchmark. We then gather data using GPT-4, generating responses that align with a specified topic and embody the targeted personality trait. We conduct comprehensive experiments involving various baselines and discuss the representation of personality behavior in LLMs. Our findings uncover potential challenges of the proposed task, illustrating several remaining issues. We anticipate that our work can stimulate further annotation in model editing and personality-related research. Code is available at https://github.com/zjunlp/EasyEdit.
comment: NLPCC 2024
♻ ☆ Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships
Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the exposure of the original training data. This strategy differs from prior studies by aiming to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose the use of pseudo-labels for these generated texts, leveraging membership approximations indicated by machine-generated probabilities from the target LM. We subsequently fine-tune the LM to favor generations with higher likelihoods of originating from the pre-training data, based on their membership probabilities. Our empirical findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a four to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
comment: 20 pages, 6 figures, 15 tables
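The pseudo-labelling step described above can be pictured as scoring the model's own generations by their likelihood under the target LM and keeping only the highest-scoring ones as approximate members of the pre-training data for later fine-tuning. The sketch below uses a hypothetical `score_log_likelihood` stand-in and an arbitrary selection ratio; it is not the paper's procedure.

```python
# Keep the generations the target LM itself assigns the highest likelihood to,
# treating them as pseudo-labeled "members" of the pre-training data.
import math

def score_log_likelihood(text):
    # Stand-in for the target LM's average per-token log-probability of `text`.
    return -len(text.split()) * 0.1 + math.log(1 + text.count("the"))

def select_pseudo_members(generated_texts, keep_ratio=0.25):
    scored = sorted(generated_texts, key=score_log_likelihood, reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return scored[:k]   # these texts would then be used to fine-tune the LM

samples = ["the quick brown fox", "zxqv lorem noise tokens", "the model of the data"]
print(select_pseudo_members(samples))
```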
♻ ☆ BEADs: Bias Evaluation Across Domains
Recent advancements in large language models (LLMs) have greatly enhanced natural language processing (NLP) applications. Nevertheless, these models often inherit biases from their training data. Despite the availability of various datasets, most are limited to one or two NLP tasks (typically classification or evaluation) and lack comprehensive evaluations across a broader range of NLP tasks. To address this gap, we introduce the Bias Evaluations Across Domains (BEADs) dataset, designed to support a wide array of NLP tasks, including text classification, token classification, bias quantification, and benign language generation. A key focus of this paper is the gold label subset of BEADs, an important portion of the data verified by experts to ensure high reliability. BEADs provides data both for fine-tuning, including classification and language generation tasks, and for evaluating LLMs. Our findings indicate that models fine-tuned on BEADs effectively identify numerous biases. BEADs also reduces biases when used for fine-tuning on the language generation task, while preserving language quality. The results also reveal some prevalent demographic biases in LLMs when BEADs is used for evaluation on demographic tasks. The benchmarking results highlight the efficacy of fine-tuning LLMs for bias identification and the necessity of comprehensive bias evaluation. We make BEADs publicly available to promote more responsible AI development. The dataset can be accessed at https://huggingface.co/datasets/shainar/BEAD .
comment: under review
♻ ☆ The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist
Computer Vision and Pattern Recognition 22
♻ ☆ U-BEV: Height-aware Bird's-Eye-View Segmentation and Neural Map-based Relocalization
Efficient relocalization is essential for intelligent vehicles when GPS reception is insufficient or sensor-based localization fails. Recent advances in Bird's-Eye-View (BEV) segmentation allow for accurate estimation of local scene appearance and in turn, can benefit the relocalization of the vehicle. However, one downside of BEV methods is the heavy computation required to leverage the geometric constraints. This paper presents U-BEV, a U-Net inspired architecture that extends the current state-of-the-art by allowing the BEV to reason about the scene on multiple height layers before flattening the BEV features. We show that this extension boosts the performance of the U-BEV by up to 4.11 IoU. Additionally, we combine the encoded neural BEV with a differentiable template matcher to perform relocalization on neural SD-map data. The model is fully end-to-end trainable and outperforms transformer-based BEV methods of similar computational complexity by 1.7 to 2.8 mIoU and BEV-based relocalization by over 26% Recall Accuracy on the nuScenes dataset.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ REBEL: Reinforcement Learning via Regressing Relative Rewards
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g., value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling a strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with performance stronger than or similar to PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance on AlpacaEval 2.0, MT-Bench, and the Open LLM Leaderboard.
comment: New experimental results on general chat
♻ ☆ GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection
As DeepFake video manipulation techniques escalate, posing profound threats, the urgent need to develop efficient detection strategies is underscored. However, one particular issue lies with facial images being mis-detected, often originating from degraded videos or adversarial attacks, leading to unexpected temporal artifacts that can undermine the efficacy of DeepFake video detection techniques. This paper introduces a novel method for robust DeepFake video detection, harnessing the power of the proposed Graph-Regularized Attentive Convolutional Entanglement (GRACE) based on the graph convolutional network with a graph Laplacian to address the aforementioned challenges. First, conventional Convolutional Neural Networks are deployed to extract spatiotemporal features for the entire video. Then, the spatial and temporal features are mutually entangled by constructing a graph with a sparsity constraint, ensuring that the essential features of valid face images in the noisy face sequences are retained, thus augmenting stability and performance for DeepFake video detection. Furthermore, a Graph Laplacian prior is proposed in the graph convolutional network to remove the noise pattern in the feature space and further improve performance. Comprehensive experiments are conducted to illustrate that our proposed method delivers state-of-the-art performance in DeepFake video detection under noisy face sequences. The source code is available at https://github.com/ming053l/GRACE.
comment: Submitted to TPAMI 2024
♻ ☆ Exploring Driving Behavior for Autonomous Vehicles Based on Gramian Angular Field Vision Transformer
Effective classification of autonomous vehicle (AV) driving behavior emerges as a critical area for diagnosing AV operation faults, enhancing autonomous driving algorithms, and reducing accident rates. This paper presents the Gramian Angular Field Vision Transformer (GAF-ViT) model, designed to analyze AV driving behavior. The proposed GAF-ViT model consists of three key components: GAF Transformer Module, Channel Attention Module, and Multi-Channel ViT Module. These modules collectively convert representative sequences of multivariate behavior into multi-channel images and employ image recognition techniques for behavior classification. A channel attention mechanism is applied to multi-channel images to discern the impact of various driving behavior features. Experimental evaluation on the Waymo Open Dataset of trajectories demonstrates that the proposed model achieves state-of-the-art performance. Furthermore, an ablation study effectively substantiates the efficacy of individual modules within the model.
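The Gramian Angular Field step that converts a behavior sequence into an image channel can be written in a few lines. The sketch below shows the summation variant (GASF) for a single feature; per-feature channels would then be stacked into the multi-channel image the abstract describes, and the toy speed series is only an example input.

```python
# Gramian Angular Summation Field: rescale a 1-D series to [-1, 1],
# map it to polar angles, and form the matrix of pairwise angular sums.
import numpy as np

def gramian_angular_field(series):
    x = np.asarray(series, dtype=float)
    # Rescale to [-1, 1] so arccos is defined.
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1
    phi = np.arccos(np.clip(x, -1, 1))          # polar-coordinate angles
    return np.cos(phi[:, None] + phi[None, :])  # GASF: pairwise angular sums

speed = [10.0, 10.5, 11.2, 12.0, 11.8, 11.0]
img = gramian_angular_field(speed)
print(img.shape)  # (6, 6) image channel for this feature
```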
♻ ☆ Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models
Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have shown remarkable progress for medical decision-making. However, these models are designed for specific classification or generative tasks, and require training or fine-tuning on large-scale datasets with sizeable parameter counts and substantial compute, hindering their clinical utility across diverse resource-constrained scenarios in practice. In this paper, we propose a novel and lightweight framework, Med-MoE (Mixture-of-Experts), that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we then enable the model for different multimodal medical tasks with instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by a meta expert. Comprehensive experiments on both open- and closed-ended medical visual question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model can achieve performance superior to or on par with state-of-the-art baselines, while only requiring approximately 30%-50% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
♻ ☆ T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
The recent development of Sora leads to a new era in text-to-video (T2V) generation. Along with this comes the rising concern about its security risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover fewer aspects and do not address the unique temporal risk inherent in video generation. To bridge this research gap, we introduce T2VSafetyBench, a new benchmark designed for conducting safety-critical assessments of text-to-video models. We define 12 critical aspects of video generation safety and construct a malicious prompt dataset including real-world prompts, LLM-generated prompts and jailbreak attack-based prompts. Based on our evaluation results, we draw several important findings, including: 1) no single model excels in all aspects, with different models showing various strengths; 2) the correlation between GPT-4 assessments and manual reviews is generally high; 3) there is a trade-off between the usability and safety of text-to-video generative models. This indicates that as the field of video generation rapidly advances, safety risks are set to surge, highlighting the urgency of prioritizing video safety. We hope that T2VSafetyBench can provide insights for better understanding the safety of video generation in the era of generative AI.
♻ ☆ Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery
Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making, enabling doctors to extract understanding from clinical images and videos. In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions. However, the inability of VQA models to visually indicate the regions of interest corresponding to the given questions results in incomplete comprehension of the surgical scene. To tackle this, we propose the surgical visual question localized-answering (VQLA) for precise and context-aware responses to specific queries regarding surgical images. Furthermore, to address the strong demand for safety in surgical scenarios and potential corruptions in image acquisition and transmission, we propose a novel approach called Calibrated Co-Attention Gated Vision-Language (C$^2$G-ViL) embedding to integrate and align multimodal information effectively. Additionally, we leverage the adversarial sample-based contrastive learning strategy to boost our performance and robustness. We also extend our EndoVis-18-VQLA and EndoVis-17-VQLA datasets to broaden the scope and application of our data. Extensive experiments on the aforementioned datasets demonstrate the remarkable performance and robustness of our solution. Our solution can effectively combat real-world image corruption. Thus, our proposed approach can serve as an effective tool for assisting surgical education, patient care, and enhancing surgical outcomes.
comment: Accepted by Information Fusion. Code and data availability: https://github.com/longbai1006/Surgical-VQLAPlus
♻ ☆ CSAD: Unsupervised Component Segmentation for Logical Anomaly Detection
To improve logical anomaly detection, some previous works have integrated segmentation techniques with conventional anomaly detection methods. Although these methods are effective, they frequently lead to unsatisfactory segmentation results and require manual annotations. To address these drawbacks, we develop an unsupervised component segmentation technique that leverages foundation models to autonomously generate training labels for a lightweight segmentation network without human labeling. Integrating this new segmentation technique with our proposed Patch Histogram module and the Local-Global Student-Teacher (LGST) module, we achieve a detection AUROC of 95.3% in the MVTec LOCO AD dataset, which surpasses previous SOTA methods. Furthermore, our proposed method provides lower latency and higher throughput than most existing approaches.
♻ ☆ Quantum Implicit Neural Representations
Implicit neural representations have emerged as a powerful paradigm to represent signals such as images and sounds. This approach aims to utilize neural networks to parameterize the implicit function of the signal. However, when representing implicit functions, traditional neural networks such as ReLU-based multilayer perceptrons face challenges in accurately modeling high-frequency components of signals. Recent research has begun to explore the use of Fourier Neural Networks (FNNs) to overcome this limitation. In this paper, we propose the Quantum Implicit Representation Network (QIREN), a novel quantum generalization of FNNs. Furthermore, through theoretical analysis, we demonstrate that QIREN possesses a quantum advantage over classical FNNs. Lastly, we conducted experiments in signal representation, image super-resolution, and image generation tasks to show the superior performance of QIREN compared to state-of-the-art (SOTA) models. Our work not only incorporates quantum advantages into implicit neural representations but also uncovers a promising application direction for Quantum Neural Networks.
comment: This paper was accepted by ICML 2024
♻ ☆ Generic Objects as Pose Probes for Few-Shot View Synthesis
Radiance fields including NeRFs and 3D Gaussians demonstrate great potential in high-fidelity rendering and scene reconstruction, yet they require a substantial number of posed images as inputs. COLMAP is frequently employed for preprocessing to estimate poses, but it necessitates a large number of feature matches to operate effectively, and it struggles with scenes characterized by sparse features, large baselines between images, or a limited number of input images. We aim to tackle few-view NeRF reconstruction using only 3 to 6 unposed scene images. Traditional methods often use calibration boards, but these are not common in images. We propose a novel idea of utilizing everyday objects, commonly found in both images and real life, as "pose probes". The probe object is automatically segmented by SAM, and its shape is initialized from a cube. We apply a dual-branch volume rendering optimization (object NeRF and scene NeRF) to constrain the pose optimization and jointly refine the geometry. Specifically, object poses of two views are first estimated by PnP matching in an SDF representation, which serve as initial poses. PnP matching, requiring only a few features, is suitable for feature-sparse scenes. Additional views are incrementally incorporated to refine poses from preceding views. In experiments, PoseProbe achieves state-of-the-art performance in both pose estimation and novel view synthesis across multiple datasets. We demonstrate its effectiveness, particularly in few-view and large-baseline scenes where COLMAP struggles. In ablations, using different objects in a scene yields comparable performance. Our project page is available at https://zhirui-gao.github.io/PoseProbe.github.io/
♻ ☆ Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination
Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs oblivious to the accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may equally advocate both accurate and erroneous content. To address this issue, we propose Pensieve, a training-free method that leverages the analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics, to mitigate hallucination. Specifically, Pensieve enables MLLMs to retrospect relevant images as references and compare their visual content with the test image via confidence score subtraction. Moreover, our paradigm balances the effects of addressing errors from both the visual and textual branches by adaptively scaling the subtracted scores. Experiments on Whoops, LLaVA Bench, POPE, and MME demonstrate the efficacy of Pensieve in mitigating visual hallucination, surpassing other advanced decoding strategies. Pensieve also aids MLLMs in identifying visual details and enhances the specificity of generated image descriptions.
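The retrospect-then-compare step amounts to contrasting next-token scores for the test image against scores obtained with visually similar reference images. A hedged sketch follows; the fixed `alpha` scale is a simplification of the adaptive scaling described above, and the random logits stand in for a real MLLM's outputs.

```python
# Adjust next-token scores by subtracting averaged reference-image scores,
# then renormalize; the scaling rule here is a simplified, fixed version.
import torch

def pensieve_adjust(test_logits, reference_logits_list, alpha=1.0):
    ref = torch.stack(reference_logits_list).mean(dim=0)    # average reference scores
    adjusted = (1 + alpha) * test_logits - alpha * ref       # contrast test vs. references
    return torch.log_softmax(adjusted, dim=-1)

vocab = 32000
test = torch.randn(vocab)
refs = [torch.randn(vocab) for _ in range(3)]
next_token = pensieve_adjust(test, refs).argmax().item()
print(next_token)
```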
♻ ☆ Tur[k]ingBench: A Challenge Benchmark for Web Agents
Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g., modifying a text box, selecting a radio button). We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models (such as GPT4 and InternVL), on this benchmark. Our results show that while these models outperform random chance, there is still significant room for improvement. We hope that this benchmark will drive progress in the evaluation and development of web-based agents.
♻ ☆ MBSS-T1: Model-Based Self-Supervised Motion Correction for Robust Cardiac T1 Mapping
T1 mapping is a valuable quantitative MRI technique for diagnosing diffuse myocardial diseases. Traditional methods, relying on breath-hold sequences and echo triggering, face challenges with patient compliance and arrhythmias, limiting their effectiveness. Image registration can enable motion-robust T1 mapping, but inherent intensity differences between time points pose a challenge. We introduce MBSS-T1, a self-supervised model for motion correction in cardiac T1 mapping, constrained by physical and anatomical principles. The physical constraints ensure expected signal decay behavior, while the anatomical constraints maintain realistic deformations. The unique combination of these constraints ensures accurate T1 mapping along the longitudinal relaxation axis. MBSS-T1 outperformed baseline deep-learning-based image registration approaches in a 5-fold experiment on a public dataset of 210 patients (STONE sequence) and an internal dataset of 19 patients (MOLLI sequence). MBSS-T1 excelled in model fitting quality ($R^2$: 0.975 vs. 0.941, 0.946), anatomical alignment (Dice score: 0.89 vs. 0.84, 0.88), and expert visual quality assessment for the presence of visible motion artifacts (4.33 vs. 3.38, 3.66). MBSS-T1 has the potential to enable motion-robust T1 mapping for a broader range of patients, overcoming challenges such as arrhythmias and suboptimal compliance, and allowing for free-breathing T1 mapping without requiring large training datasets. Our code will be publicly available upon acceptance.
♻ ☆ Achieving Resolution-Agnostic DNN-based Image Watermarking: A Novel Perspective of Implicit Neural Representation ACM MM'24
DNN-based watermarking methods are rapidly developing and delivering impressive performance. Recent advances achieve resolution-agnostic image watermarking by reducing the variant-resolution watermarking problem to a fixed-resolution watermarking problem. However, such a reduction process can potentially introduce artifacts and low robustness. To address this issue, we propose the first, to the best of our knowledge, Resolution-Agnostic Image WaterMarking (RAIMark) framework by watermarking the implicit neural representation (INR) of the image. Unlike previous methods, our method does not rely on the previous reduction process, directly watermarking the continuous signal instead of image pixels, thus achieving resolution-agnostic watermarking. Precisely, given an arbitrary-resolution image, we fit an INR for the target image. As a continuous signal, such an INR can be sampled to obtain images with variant resolutions. Then, we quickly fine-tune the fitted INR to get a watermarked INR conditioned on a binary secret message. A pre-trained watermark decoder extracts the hidden message from any sampled images with arbitrary resolutions. By directly watermarking the INR, we achieve resolution-agnostic watermarking with increased robustness. Extensive experiments show that our method outperforms previous methods with significant improvements, improving bit accuracy by 7%-29% on average. Notably, we observe that previous methods are vulnerable to at least one watermarking attack (e.g. JPEG, crop, resize), while ours is robust against all watermarking attacks.
comment: Accepted by ACM MM'24
♻ ☆ NEDS-SLAM: A Neural Explicit Dense Semantic SLAM Framework using 3D Gaussian Splatting
We propose NEDS-SLAM, a dense semantic SLAM system based on 3D Gaussian representation that enables robust 3D semantic mapping, accurate camera tracking, and high-quality rendering in real time. In the system, we propose a Spatially Consistent Feature Fusion model to reduce the effect of erroneous estimates from the pre-trained segmentation head on semantic reconstruction, achieving robust 3D semantic Gaussian mapping. Additionally, we employ a lightweight encoder-decoder to compress the high-dimensional semantic features into a compact 3D Gaussian representation, mitigating the burden of excessive memory consumption. Furthermore, we leverage the advantage of 3D Gaussian splatting, which enables efficient and differentiable novel view rendering, and propose a Virtual Camera View Pruning method to eliminate outlier Gaussians, thereby effectively enhancing the quality of scene representations. Our NEDS-SLAM method demonstrates competitive performance over existing dense semantic SLAM methods in terms of mapping and tracking accuracy on the Replica and ScanNet datasets, while also showing excellent capabilities in 3D dense semantic mapping.
comment: accepted by RA-L, IEEE Robotics and Automation Letters
♻ ☆ CRS-Diff: Controllable Remote Sensing Image Generation with Diffusion Model
The emergence of generative models has revolutionized the field of remote sensing (RS) image generation. Despite generating high-quality images, existing methods rely mainly on text control conditions and thus do not always generate images accurately and stably. In this paper, we propose CRS-Diff, a new generative framework specifically tailored for RS image generation, leveraging the inherent advantages of diffusion models while integrating more advanced control mechanisms. Specifically, CRS-Diff can simultaneously support text-condition, metadata-condition, and image-condition control inputs, thus enabling more precise control to refine the generation process. To effectively integrate multiple sources of condition control information, we introduce a new conditional control mechanism to achieve multi-scale feature fusion, thus enhancing the guiding effect of control conditions. To our knowledge, CRS-Diff is the first multiple-condition controllable RS generative model. Experimental results in single-condition and multiple-condition cases have demonstrated the superior ability of our CRS-Diff to generate RS images both quantitatively and qualitatively compared with previous methods. Additionally, our CRS-Diff can serve as a data engine that generates high-quality training data for downstream tasks, e.g., road extraction. The code is available at https://github.com/Sonettoo/CRS-Diff.
♻ ☆ Training and Tuning Generative Neural Radiance Fields for Attribute-Conditional 3D-Aware Face Generation
Generative Neural Radiance Fields (GNeRF)-based 3D-aware GANs have showcased remarkable prowess in crafting high-fidelity images while upholding robust 3D consistency, particularly face generation. However, specific existing models prioritize view consistency over disentanglement, leading to constrained semantic or attribute control during the generation process. While many methods have explored incorporating semantic masks or leveraging 3D Morphable Models (3DMM) priors to imbue models with semantic control, these methods often demand training from scratch, entailing significant computational overhead. In this paper, we propose a novel approach: a conditional GNeRF model that integrates specific attribute labels as input, thus amplifying the controllability and disentanglement capabilities of 3D-aware generative models. Our approach builds upon a pre-trained 3D-aware face model, and we introduce a Training as Init and Optimizing for Tuning (TRIOT) method to train a conditional normalized flow module to enable the facial attribute editing, then optimize the latent vector to improve attribute-editing precision further. Our extensive experiments substantiate the efficacy of our model, showcasing its ability to generate high-quality edits with enhanced view consistency while safeguarding non-target regions. The code for our model is publicly available at https://github.com/zhangqianhui/TT-GNeRF.
comment: 14 pages
♻ ☆ ICAL: Implicit Character-Aided Learning for Enhanced Handwritten Mathematical Expression Recognition ICDAR 2024
Significant progress has been made in the field of handwritten mathematical expression recognition, but existing encoder-decoder methods usually struggle to model global information in LaTeX. Therefore, this paper introduces a novel approach, Implicit Character-Aided Learning (ICAL), to mine global expression information and enhance handwritten mathematical expression recognition. Specifically, we propose the Implicit Character Construction Module (ICCM) to predict implicit character sequences and use a Fusion Module to merge the outputs of the ICCM and the decoder, thereby producing corrected predictions. By modeling and utilizing implicit character information, ICAL achieves a more accurate and context-aware interpretation of handwritten mathematical expressions. Experimental results demonstrate that ICAL notably surpasses state-of-the-art (SOTA) models, improving the expression recognition rate (ExpRate) by 2.25%/1.81%/1.39% on the CROHME 2014/2016/2019 datasets respectively, and achieves a remarkable 69.06% on the challenging HME100k test set. We make our code available on GitHub: https://github.com/qingzhenduyu/ICAL
comment: ICDAR 2024 Oral Paper
♻ ☆ DiffMap: Enhancing Map Segmentation with Map Prior Using Diffusion Model
Constructing high-definition (HD) maps is a crucial requirement for enabling autonomous driving. In recent years, several map segmentation algorithms have been developed to address this need, leveraging advancements in Bird's-Eye View (BEV) perception. However, existing models still encounter challenges in producing realistic and consistent semantic map layouts. One prominent issue is the limited utilization of structured priors inherent in map segmentation masks. In light of this, we propose DiffMap, a novel approach specifically designed to model the structured priors of map segmentation masks using latent diffusion model. By incorporating this technique, the performance of existing semantic segmentation methods can be significantly enhanced and certain structural errors present in the segmentation outputs can be effectively rectified. Notably, the proposed module can be seamlessly integrated into any map segmentation model, thereby augmenting its capability to accurately delineate semantic information. Furthermore, through extensive visualization analysis, our model demonstrates superior proficiency in generating results that more accurately reflect real-world map layouts, further validating its efficacy in improving the quality of the generated maps.
♻ ☆ Advancing Medical Image Segmentation: Morphology-Driven Learning with Diffusion Transformer BMVC 2024
Understanding the morphological structure of medical images and precisely segmenting the region of interest or abnormality is an important task that can assist in diagnosis. However, the unique properties of medical imaging make clear segmentation difficult, and the high cost and time-consuming nature of labeling lead to a coarse-grained representation of the ground truth. Facing these problems, we propose a novel Diffusion Transformer Segmentation (DTS) model for robust segmentation in the presence of noise. We propose an alternative to the dominant Denoising U-Net encoder through experiments applying a transformer architecture, which captures global dependency through self-attention. Additionally, we propose k-neighbor label smoothing, reverse boundary attention, and self-supervised learning with morphology-driven learning to improve the ability to identify complex structures. Our model, which analyzes the morphological representation of images, shows better results than previous models in various medical imaging modalities, including CT, MRI, and lesion images.
comment: Accepted in BMVC 2024
♻ ☆ BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation ICRA 2023
Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost. Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.
comment: ICRA 2023. The first two authors contributed equally to this work. Project page: https://bevfusion.mit.edu
♻ ☆ PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation ECCV 2022
Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works utilize separate approaches to handle things, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we first design a meta-architecture that decouples part features and things/stuff features, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Second, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from pixel-region and part-whole perspectives. It also decouples the errors for part segmentation and panoptic segmentation. Third, inspired by Mask2Former and based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross-attention scheme to further boost part segmentation quality. We design a new part-whole interaction method using masked cross attention. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results. Our models can serve as a strong baseline and aid future research in PPS. The source code and trained models will be available at https://github.com/lxtGH/Panoptic-PartFormer.
comment: T-PAMI-2024, Extension of PanopticPartFormer (ECCV 2022)
Artificial Intelligence 24
♻ ☆ Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture
The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of SER algorithms and develop a more robust system for adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adopt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested on real-world reverberant environments.
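The two fusion strategies mentioned above can be sketched side by side: average the mel-spectrograms across microphones before patch embedding, or embed each channel separately and sum the patch embeddings. The shapes and the convolutional patch embedder below are illustrative assumptions, not the HTS-AT implementation.

```python
# Two ways to fuse multi-microphone mel-spectrograms before a transformer.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(1, 96, kernel_size=16, stride=16)   # stand-in patch embedding

def fuse_by_averaging(mel_specs):
    # mel_specs: (channels, mel_bins, frames) -> average channels first, embed once.
    avg = mel_specs.mean(dim=0, keepdim=True).unsqueeze(0)   # (1, 1, mel_bins, frames)
    return patch_embed(avg)

def fuse_by_summing_embeddings(mel_specs):
    # Embed each microphone channel separately, then sum the patch embeddings.
    per_channel = [patch_embed(m.unsqueeze(0).unsqueeze(0)) for m in mel_specs]
    return torch.stack(per_channel).sum(dim=0)

mels = torch.randn(4, 64, 256)                               # 4 microphones
print(fuse_by_averaging(mels).shape, fuse_by_summing_embeddings(mels).shape)
```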
♻ ☆ Predicting Scientific Impact Through Diffusion, Conformity, and Contribution Disentanglement CIKM24
The scientific impact of academic papers is influenced by intricate factors such as dynamic popularity and inherent contribution. Existing models typically rely on static graphs for citation count estimation, failing to differentiate among its sources. In contrast, we propose distinguishing effects derived from various factors and predicting citation increments as estimated potential impacts within the dynamic context. In this research, we introduce a novel model, DPPDCC, which Disentangles the Potential impacts of Papers into Diffusion, Conformity, and Contribution values. It encodes temporal and structural features within dynamic heterogeneous graphs derived from the citation networks and applies various auxiliary tasks for disentanglement. By emphasizing comparative and co-cited/citing information and aggregating snapshots evolutionarily, DPPDCC captures knowledge flow within the citation network. Afterwards, popularity is outlined by contrasting augmented graphs to extract the essence of citation diffusion and predicting citation accumulation bins for quantitative conformity modeling. Orthogonal constraints ensure distinct modeling of each perspective, preserving the contribution value. To gauge generalization across publication times and replicate the realistic dynamic context, we partition data based on specific time points and retain all samples without strict filtering. Extensive experiments on three datasets validate DPPDCC's superiority over baselines for papers published previously, freshly, and immediately, with further analyses confirming its robustness. Our codes and supplementary materials can be found at https://github.com/ECNU-Text-Computing/DPPDCC.
comment: Accepted by CIKM24
♻ ☆ Forecasting Live Chat Intent from Browsing History CIKM 2024
Customers reach out to online live chat agents with various intents, such as asking about product details or requesting a return. In this paper, we propose the problem of predicting user intent from browsing history and address it through a two-stage approach. The first stage classifies a user's browsing history into high-level intent categories. Here, we represent each browsing history as a text sequence of page attributes and use the ground-truth class labels to fine-tune pretrained Transformers. The second stage provides a large language model (LLM) with the browsing history and predicted intent class to generate fine-grained intents. For automatic evaluation, we use a separate LLM to judge the similarity between generated and ground-truth intents, which closely aligns with human judgments. Our two-stage approach yields significant performance gains compared to generating intents without the classification stage.
comment: CIKM 2024
♻ ☆ Visual Verity in AI-Generated Imagery: Computational Metrics and Human-Centric Analysis
The rapid advancements in AI technologies have revolutionized the production of graphical content across various sectors, including entertainment, advertising, and e-commerce. These developments have spurred the need for robust evaluation methods to assess the quality and realism of AI-generated images. To address this, we conducted three studies. First, we introduced and validated a questionnaire called Visual Verity, which measures photorealism, image quality, and text-image alignment. Second, we applied this questionnaire to assess images from AI models (DALL-E2, DALL-E3, GLIDE, Stable Diffusion) and camera-generated images, revealing that camera-generated images excelled in photorealism and text-image alignment, while AI models led in image quality. We also analyzed statistical properties, finding that camera-generated images scored lower in hue, saturation, and brightness. Third, we evaluated computational metrics' alignment with human judgments, identifying MS-SSIM and CLIP as the most consistent with human assessments. Additionally, we proposed the Neural Feature Similarity Score (NFSS) for assessing image quality. Our findings highlight the need for refining computational metrics to better capture human visual perception, thereby enhancing AI-generated content evaluation.
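As an example of the kind of computational text-image alignment metric evaluated above, a CLIP cosine-similarity score can be computed as sketched below. This assumes the open-source `clip` package (installable from https://github.com/openai/CLIP) and a local image path; it is not the paper's evaluation code.

```python
# CLIP-style text-image alignment score: cosine similarity between normalized
# image and text embeddings from a pre-trained CLIP model.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_alignment_score(image_path, prompt):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()   # cosine similarity in [-1, 1]

# Example (hypothetical file path):
# print(clip_alignment_score("generated.png", "a photorealistic city street at dusk"))
```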
♻ ☆ Reinforcement Learning for Adaptive Traffic Signal Control: Turn-Based and Time-Based Approaches to Reduce Congestion
The growing demand for road use in urban areas has led to significant traffic congestion, posing challenges that are costly to mitigate through infrastructure expansion alone. As an alternative, optimizing existing traffic management systems, particularly through adaptive traffic signal control, offers a promising solution. This paper explores the use of Reinforcement Learning (RL) to enhance traffic signal operations at intersections, aiming to reduce congestion without extensive sensor networks. We introduce two RL-based algorithms: a turn-based agent, which dynamically prioritizes traffic signals based on real-time queue lengths, and a time-based agent, which adjusts signal phase durations according to traffic conditions while following a fixed phase cycle. By representing the state as a scalar queue length, our approach simplifies the learning process and lowers deployment costs. The algorithms were tested in four distinct traffic scenarios using seven evaluation metrics to comprehensively assess performance. Simulation results demonstrate that both RL algorithms significantly outperform conventional traffic signal control systems, highlighting their potential to improve urban traffic flow efficiently.
comment: 9 pages, 8 figures, 5 tables
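To make the turn-based idea concrete, here is a minimal tabular Q-learning sketch in its spirit: the state is a discretized queue length per approach and the action is which approach receives the green phase next. The environment and reward shown in the comment are toy assumptions, not the simulator or exact formulation used in the paper.

```python
# Minimal turn-based agent sketch: state = binned queue lengths, action = which
# approach gets the green phase. The environment here is a toy stand-in.
import random
from collections import defaultdict

class TurnBasedAgent:
    def __init__(self, n_approaches=4, alpha=0.1, gamma=0.95, eps=0.1):
        self.q = defaultdict(float)          # Q[(state, action)]
        self.n, self.alpha, self.gamma, self.eps = n_approaches, alpha, gamma, eps

    def act(self, state):
        if random.random() < self.eps:
            return random.randrange(self.n)
        return max(range(self.n), key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in range(self.n))
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

# Usage idea: state could be a tuple of binned queue lengths, and the reward
# the negative total queue length after the chosen phase is served.
```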
♻ ☆ LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers like DPR typically work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the `needle' unit. In contrast, the readers only need to generate answers from the short retrieved units. This imbalanced `heavy' retriever and `light' reader design can lead to sub-optimal performance. The loss of contextual information in the short, chunked units may increase the likelihood of introducing hard negatives during the retrieval stage. Additionally, the reader might not fully leverage the capabilities of recent advancements in LLMs. In order to alleviate the imbalance, we propose a new framework, LongRAG, consisting of a `long retriever' and a `long reader'. On the two Wikipedia-based datasets, NQ and HotpotQA, LongRAG processes the entire Wikipedia corpus into 4K-token units by grouping related documents. By increasing the unit size, we significantly reduce the total number of units. This greatly reduces the burden on the retriever, resulting in strong retrieval performance with only a few (fewer than 8) top units. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ and 64.3% on HotpotQA, which are on par with the (fully-trained) SoTA model. Furthermore, we test on two non-Wikipedia-based datasets, Qasper and MultiFieldQA-en. LongRAG processes each individual document as a single (long) unit rather than chunking it into smaller units. By doing so, we achieve an F1 score of 25.9% on Qasper and 57.5% on MultiFieldQA-en. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
comment: Technical Report
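A hedged sketch of the "long retrieval unit" idea from the abstract above: related documents are grouped until a unit reaches roughly 4K tokens, so the retriever searches over far fewer, larger units. The grouping criterion and tokenizer are illustrative assumptions, not the exact pipeline from the paper.

```python
# Group related documents into ~4K-token retrieval units (conceptual sketch).
def build_long_units(documents, count_tokens, max_tokens=4096):
    """documents: iterable of (doc_id, text), ordered so that related documents
    (e.g. linked Wikipedia pages) are adjacent. count_tokens: tokenizer hook."""
    units, current, current_len = [], [], 0
    for doc_id, text in documents:
        n = count_tokens(text)
        if current and current_len + n > max_tokens:
            units.append(current)
            current, current_len = [], 0
        current.append((doc_id, text))
        current_len += n
    if current:
        units.append(current)
    return units

# A long-context reader then receives the few (<8) top-ranked units verbatim.
```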
♻ ☆ ApisTox: a new benchmark dataset for the classification of small molecules toxicity on honey bees
The global decline in bee populations poses significant risks to agriculture, biodiversity, and environmental stability. To bridge the gap in existing data, we introduce ApisTox, a comprehensive dataset focusing on the toxicity of pesticides to honey bees (Apis mellifera). This dataset combines and leverages data from existing sources such as ECOTOX and PPDB, providing an extensive, consistent, and curated collection that surpasses previous datasets. ApisTox incorporates a wide array of data, including toxicity levels for chemicals, details such as the time of their publication in the literature, and identifiers linking them to external chemical databases. This dataset may serve as an important tool for environmental and agricultural research, and can also support the development of policies and practices aimed at minimizing harm to bee populations. Finally, ApisTox offers a unique resource for benchmarking molecular property prediction methods on agrochemical compounds, facilitating advancements in both environmental science and cheminformatics. This makes it a valuable tool for both academic research and practical applications in bee conservation.
♻ ☆ T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
The recent development of Sora leads to a new era in text-to-video (T2V) generation. Along with this comes the rising concern about its security risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover fewer aspects and do not address the unique temporal risk inherent in video generation. To bridge this research gap, we introduce T2VSafetyBench, a new benchmark designed for conducting safety-critical assessments of text-to-video models. We define 12 critical aspects of video generation safety and construct a malicious prompt dataset including real-world prompts, LLM-generated prompts and jailbreak attack-based prompts. Based on our evaluation results, we draw several important findings, including: 1) no single model excels in all aspects, with different models showing various strengths; 2) the correlation between GPT-4 assessments and manual reviews is generally high; 3) there is a trade-off between the usability and safety of text-to-video generative models. This indicates that as the field of video generation rapidly advances, safety risks are set to surge, highlighting the urgency of prioritizing video safety. We hope that T2VSafetyBench can provide insights for better understanding the safety of video generation in the era of generative AI.
♻ ☆ VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching ICASSP 2024
Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms into an ordinary differential equation conditional on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow.
comment: 5 pages, 4 figures, accepted to ICASSP 2024
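Below is a minimal training-step sketch of the conditional flow-matching objective that rectified-flow acoustic models of this kind optimize: learn a vector field that transports noise to mel-spectrograms along (near-)straight paths. The model architecture and text conditioning are placeholders, and the exact formulation in the paper may differ in detail.

```python
# Conditional flow-matching loss (schematic): regress the model's vector field
# onto the constant velocity of the straight noise-to-data path.
import torch

def flow_matching_loss(model, mel, text_cond):
    """mel: (B, n_mels, T) target spectrograms; text_cond: conditioning tensors."""
    noise = torch.randn_like(mel)                          # x_0 ~ N(0, I)
    t = torch.rand(mel.size(0), 1, 1, device=mel.device)   # t ~ U(0, 1)
    x_t = (1.0 - t) * noise + t * mel                      # point on the straight path
    target_velocity = mel - noise                          # dx_t/dt along this path
    pred_velocity = model(x_t, t.squeeze(), text_cond)     # estimated vector field
    return torch.mean((pred_velocity - target_velocity) ** 2)

# At inference, integrating dx/dt = model(x, t, cond) from t=0 to t=1 with a
# handful of Euler steps yields a mel-spectrogram from Gaussian noise.
```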
♻ ☆ LoAS: Fully Temporal-Parallel Dataflow for Dual-Sparse Spiking Neural Networks MICRO 2024
Spiking Neural Networks (SNNs) have gained significant research attention in the last decade due to their potential to drive resource-constrained edge devices. Though existing SNN accelerators offer high efficiency in processing sparse spikes with dense weights, opportunities are less explored in SNNs with sparse weights, i.e., dual-sparsity. In this work, we study the acceleration of dual-sparse SNNs, focusing on their core operation, sparse-matrix-sparse-matrix multiplication (spMspM). We observe that naively running a dual-sparse SNN on existing spMspM accelerators designed for dual-sparse Artificial Neural Networks (ANNs) exhibits sub-optimal efficiency. The main challenge is that processing timesteps, a natural property of SNNs, introduces an extra loop to ANN spMspM, leading to longer latency and more memory traffic. To address the problem, we propose a fully temporal-parallel (FTP) dataflow, which minimizes both data movement across timesteps and the end-to-end latency of dual-sparse SNNs. To maximize the efficiency of FTP dataflow, we propose an FTP-friendly spike compression mechanism that efficiently compresses single-bit spikes and ensures contiguous memory access. We further propose an FTP-friendly inner-join circuit that can lower the cost of the expensive prefix-sum circuits with almost no throughput penalty. All the above techniques for FTP dataflow are encapsulated in LoAS, a Low-latency inference Accelerator for dual-sparse SNNs. With FTP dataflow, compression, and inner-join, running dual-sparse SNN workloads on LoAS demonstrates significant speedup (up to $8.51\times$) and energy reduction (up to $3.68\times$) compared to running them on prior dual-sparse accelerators.
comment: Accepted to MICRO 2024. Will update with the camera-ready version once ready. (Github: https://github.com/RuokaiYin/LoAS)
♻ ☆ Clinical Insights: A Comprehensive Review of Language Models in Medicine
This paper provides a detailed examination of the advancements and applications of large language models in the healthcare sector, with a particular emphasis on clinical applications. The study traces the evolution of LLMs from their foundational technologies to the latest developments in domain-specific models and multimodal integration. It explores the technical progression from encoder-based models requiring fine-tuning to sophisticated approaches that integrate textual, visual, and auditory data, thereby facilitating comprehensive AI solutions in healthcare. The paper discusses both the opportunities these technologies present for enhancing clinical efficiency and the challenges they pose in terms of ethics, data privacy, and implementation. Additionally, it critically evaluates the deployment strategies of LLMs, emphasizing the necessity of open-source models to ensure data privacy and adaptability within healthcare environments. Future research directions are proposed, focusing on empirical studies to evaluate the real-world efficacy of LLMs in healthcare and the development of open datasets for further research. This review aims to provide a comprehensive resource for both newcomers and multidisciplinary researchers interested in the intersection of AI and healthcare.
comment: Submitted to PLOS Digital Health
♻ ☆ A Versatile Graph Learning Approach through LLM-based Agent
Designing versatile graph learning approaches is important, considering the diverse graphs and tasks existing in real-world applications. Existing methods have attempted to achieve this target through automated machine learning techniques, pre-training and fine-tuning strategies, and large language models. However, these methods are not versatile enough for graph learning, as they work on either limited types of graphs or a single task. In this paper, we propose to explore versatile graph learning approaches with LLM-based agents, and the key insight is customizing the graph learning procedures for diverse graphs and tasks. To achieve this, we develop several LLM-based agents, equipped with diverse profiles, tools, functions, and human experience. They collaborate to configure each procedure with task- and data-specific settings step by step towards versatile solutions; the proposed method is dubbed GL-Agent. Evaluations on diverse tasks and graphs show that the agent produces correct results with comparable performance, demonstrating the versatility of the proposed method, especially in complex scenarios. The low resource cost and the potential to use open-source LLMs highlight the efficiency of GL-Agent.
♻ ☆ SFR-GNN: Simple and Fast Robust GNNs against Structural Attacks
Graph Neural Networks (GNNs) have demonstrated commendable performance for graph-structured data. Yet, GNNs are often vulnerable to adversarial structural attacks because embedding generation relies on graph topology. Existing efforts are dedicated to purifying the maliciously modified structure or applying adaptive aggregation, thereby enhancing robustness against adversarial structural attacks. Defenders inevitably incur heavy computational costs because they lack prior knowledge of the modified structures. To this end, we propose an efficient defense method, called Simple and Fast Robust Graph Neural Network (SFR-GNN), supported by mutual information theory. SFR-GNN first pre-trains a GNN model using node attributes and then fine-tunes it over the modified graph in the manner of contrastive learning, which is free of purifying modified structures and adaptive aggregation, thus achieving great efficiency gains. Consequently, SFR-GNN exhibits a 24%--162% speedup compared to advanced robust models while demonstrating superior robustness for node classification tasks.
♻ ☆ The Oscars of AI Theater: A Survey on Role-Playing with Language Models
This survey explores the burgeoning field of role-playing with language models, focusing on their development from early persona-based models to advanced character-driven simulations facilitated by Large Language Models (LLMs). Initially confined to simple persona consistency due to limited model capabilities, role-playing tasks have now expanded to embrace complex character portrayals involving character consistency, behavioral alignment, and overall attractiveness. We provide a comprehensive taxonomy of the critical components in designing these systems, including data, models and alignment, agent architecture and evaluation. This survey not only outlines the current methodologies and challenges, such as managing dynamic personal profiles and achieving high-level persona consistency but also suggests avenues for future research in improving the depth and realism of role-playing applications. The goal is to guide future research by offering a structured overview of current methodologies and identifying potential areas for improvement. Related resources and papers are available at https://github.com/nuochenpku/Awesome-Role-Play-Papers.
comment: 28 pages
♻ ☆ Quantum Implicit Neural Representations
Implicit neural representations have emerged as a powerful paradigm to represent signals such as images and sounds. This approach aims to utilize neural networks to parameterize the implicit function of the signal. However, when representing implicit functions, traditional neural networks such as ReLU-based multilayer perceptrons face challenges in accurately modeling high-frequency components of signals. Recent research has begun to explore the use of Fourier Neural Networks (FNNs) to overcome this limitation. In this paper, we propose Quantum Implicit Representation Network (QIREN), a novel quantum generalization of FNNs. Furthermore, through theoretical analysis, we demonstrate that QIREN possesses a quantum advantage over classical FNNs. Lastly, we conducted experiments in signal representation, image superresolution, and image generation tasks to show the superior performance of QIREN compared to state-of-the-art (SOTA) models. Our work not only incorporates quantum advantages into implicit neural representations but also uncovers a promising application direction for Quantum Neural Networks.
comment: This paper was accepted by ICML 2024
♻ ☆ Ex3: Automatic Novel Writing by Extracting, Excelsior and Expanding
Generating long-form texts such as novels using artificial intelligence has always been a challenge. A common approach is to use large language models (LLMs) to construct a hierarchical framework that first plans and then writes. Although the generated novels reach a sufficient length, they exhibit poor logical coherence, unappealing plots, and deficiencies in character and event depiction, ultimately compromising the overall narrative quality. In this paper, we propose a method named Ex3 (Extracting, Excelsior and Expanding). Ex3 initially extracts structure information from raw novel data. By combining this structure information with the novel data, an instruction-following dataset is meticulously crafted. This dataset is then utilized to fine-tune the LLM, aiming for excelsior generation performance. In the final stage, a tree-like expansion method is deployed to facilitate the generation of arbitrarily long novels. Evaluation against previous methods showcases Ex3's ability to produce higher-quality long-form novels.
♻ ☆ Tur[k]ingBench: A Challenge Benchmark for Web Agents
Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g., modifying a text box, selecting a radio button). We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models (such as GPT4 and InternVL), on this benchmark. Our results show that while these models outperform random chance, there is still significant room for improvement. We hope that this benchmark will drive progress in the evaluation and development of web-based agents.
♻ ☆ Can LLMs Understand Social Norms in Autonomous Driving Games?
A social norm is defined as a shared standard of acceptable behavior in a society. The emergence of social norms fosters coordination among agents without any hard-coded rules, which is crucial for the large-scale deployment of autonomous vehicles (AVs) in an intelligent transportation system. This paper explores the application of LLMs in understanding and modeling social norms in autonomous driving games. We introduce LLMs into autonomous driving games as intelligent agents that make decisions according to text prompts. These agents are referred to as LLM-based agents. Our framework involves LLM-based agents playing Markov games in a multi-agent system (MAS), allowing us to investigate the emergence of social norms among individual agents. We aim to identify social norms by designing prompts and utilizing LLMs on textual information related to the environment setup and the observations of LLM-based agents. Using the OpenAI Chat API powered by GPT-4.0, we conduct experiments to simulate interactions and evaluate the performance of LLM-based agents in two driving scenarios: an unsignalized intersection and a highway platoon. The results show that LLM-based agents can handle dynamically changing environments in Markov games, and social norms evolve among LLM-based agents in both scenarios. In the intersection game, LLM-based agents tend to adopt a conservative driving policy when facing a potential car crash. The advantage of LLM-based agents in games lies in their strong operability and analyzability, which facilitate experimental design.
♻ ☆ Editing Personality for Large Language Models NLPCC 2024
This paper introduces an innovative task focused on editing the personality traits of Large Language Models (LLMs). This task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits. Specifically, we construct PersonalityEdit, a new benchmark dataset to address this task. Drawing on the theory in Social Psychology, we isolate three representative traits, namely Neuroticism, Extraversion, and Agreeableness, as the foundation for our benchmark. We then gather data using GPT-4, generating responses that align with a specified topic and embody the targeted personality trait. We conduct comprehensive experiments involving various baselines and discuss the representation of personality behavior in LLMs. Our findings uncover potential challenges of the proposed task, illustrating several remaining issues. We anticipate that our work can stimulate further annotation in model editing and personality-related research. Code is available at https://github.com/zjunlp/EasyEdit.
comment: NLPCC 2024
♻ ☆ Strict Partitioning for Sporadic Rigid Gang Tasks
The rigid gang task model is based on the idea of executing multiple threads simultaneously on a fixed number of processors to increase efficiency and performance. Although there is extensive literature on global rigid gang scheduling, partitioned approaches have several practical advantages (e.g., task isolation and reduced scheduling overheads). In this paper, we propose a new partitioned scheduling strategy for rigid gang tasks, named strict partitioning. The method creates disjoint partitions of tasks and processors to avoid inter-partition interference. Moreover, it tries to assign tasks with similar volumes (i.e., parallelisms) to the same partition so that the intra-partition interference can be reduced. Within each partition, the tasks can be scheduled using any type of scheduler, which allows the use of a less pessimistic schedulability test. Extensive synthetic experiments and a case study based on Edge TPU benchmarks show that strict partitioning achieves better schedulability performance than state-of-the-art global gang schedulability analyses for both preemptive and non-preemptive rigid gang task sets.
comment: Published in IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2024)
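The following is an illustrative greedy sketch of the strict-partitioning idea from the abstract above: tasks with similar parallelism (volume) are grouped together, and each group gets its own disjoint set of processors so partitions never interfere. The actual assignment heuristic and schedulability test in the paper are more involved; this only conveys the structure.

```python
# Greedy strict-partitioning sketch (illustrative, not the paper's algorithm).
def strict_partition(tasks, total_processors):
    """tasks: list of (task_id, parallelism). Returns a list of
    (processor_count, [task_ids]) partitions, or None if tasks do not fit."""
    tasks = sorted(tasks, key=lambda t: t[1], reverse=True)
    partitions, remaining = [], total_processors
    for task_id, m in tasks:
        # Prefer an existing partition whose width is closest to the task's
        # parallelism, so similar-volume tasks end up together.
        candidates = [p for p in partitions if p[0] >= m]
        if candidates:
            best = min(candidates, key=lambda p: p[0] - m)
            best[1].append(task_id)
        elif remaining >= m:
            partitions.append([m, [task_id]])   # open a new disjoint partition
            remaining -= m
        else:
            return None  # not schedulable under this simple heuristic
    return [(width, ids) for width, ids in partitions]
```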
♻ ☆ Advancing Medical Image Segmentation: Morphology-Driven Learning with Diffusion Transformer BMVC 2024
Understanding the morphological structure of medical images and precisely segmenting the region of interest or abnormality is an important task that can assist in diagnosis. However, the unique properties of medical imaging make clear segmentation difficult, and the high cost and time-consuming nature of labeling lead to a coarse-grained representation of the ground truth. Facing these problems, we propose a novel Diffusion Transformer Segmentation (DTS) model for robust segmentation in the presence of noise. We propose an alternative to the dominant Denoising U-Net encoder through experiments applying a transformer architecture, which captures global dependencies through self-attention. Additionally, we propose k-neighbor label smoothing, reverse boundary attention, and self-supervised learning with morphology-driven learning to improve the ability to identify complex structures. Our model, which analyzes the morphological representation of images, shows better results than previous models across various medical imaging modalities, including CT, MRI, and lesion images.
comment: Accepted in BMVC 2024
♻ ☆ BEADs: Bias Evaluation Across Domains
Recent advancements in large language models (LLMs) have greatly enhanced natural language processing (NLP) applications. Nevertheless, these models often inherit biases from their training data. Despite the availability of various datasets, most are limited to one or two NLP tasks (typically classification or evaluation) and lack comprehensive evaluations across a broader range of NLP tasks. To address this gap, we introduce the Bias Evaluations Across Domains (BEADs) dataset, designed to support a wide array of NLP tasks, including text classification, token classification, bias quantification, and benign language generation. A key focus of this paper is the gold label subset of BEADs, an important portion of the data verified by experts to ensure high reliability. BEADs provides data both for fine-tuning, including classification and language generation tasks, and for evaluating LLMs. Our findings indicate that models fine-tuned on BEADs effectively identify numerous biases. Fine-tuning on BEADs also reduces biases in language generation tasks while preserving language quality. The results further reveal prevalent demographic biases in LLMs when BEADs is used for evaluation on demographic tasks. The benchmarking results highlight the efficacy of fine-tuning LLMs for bias identification and the necessity of comprehensive bias evaluation. We make BEADs publicly available to promote more responsible AI development. The dataset can be accessed at https://huggingface.co/datasets/shainar/BEAD .
comment: under review
♻ ☆ The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist
♻ ☆ Enhancing Training Efficiency Using Packing with Flash Attention
Padding is often used when tuning LLMs: special tokens are added to shorter training examples to match the length of the longest sequence in each batch. While this ensures uniformity for batch processing, it introduces inefficiencies by including irrelevant padding tokens in the computation and wastes GPU resources. The Hugging Face SFT trainer has long offered the option to use packing to combine multiple training examples, allowing for maximal utilization of GPU resources. However, until now, it did not offer proper masking of each packed training example. This capability has been added to Hugging Face Transformers 4.44. We analyse this new feature and show the benefits across different variations of packing.
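A conceptual sketch of packing with per-example masking follows: several short examples are concatenated into one row, and position ids restart at each boundary so the attention kernel (e.g. via cumulative sequence lengths) can keep examples from attending to one another. This illustrates the idea only; it is not the exact Hugging Face collator interface.

```python
# Pack several tokenized examples into one row with per-example position ids
# and boundary offsets (cu_seqlens-style), so attention never crosses examples.
def pack_examples(examples, max_len):
    """examples: list of lists of token ids. Returns (input_ids, position_ids,
    seq_boundaries) for one packed row, dropping examples that do not fit."""
    input_ids, position_ids, boundaries = [], [], [0]
    for ids in examples:
        if len(input_ids) + len(ids) > max_len:
            break
        input_ids.extend(ids)
        position_ids.extend(range(len(ids)))   # restart positions per example
        boundaries.append(len(input_ids))      # offsets of example boundaries
    return input_ids, position_ids, boundaries

# Example: pack_examples([[5, 6, 7], [8, 9]], max_len=8)
# -> ([5, 6, 7, 8, 9], [0, 1, 2, 0, 1], [0, 3, 5])
```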
Machine Learning 28
♻ ☆ Horseshoe-type Priors for Independent Component Estimation
Independent Component Estimation (ICE) has many applications in modern-day machine learning as a feature engineering and extraction method. Horseshoe-type priors are used to provide scalable algorithms that enable both point estimates via expectation-maximization (EM) and full posterior sampling via Markov Chain Monte Carlo (MCMC) algorithms. Our methodology also applies to flow-based methods for nonlinear feature extraction and deep learning. We also discuss how to implement conditional posteriors and envelope-based methods for optimization. Through this hierarchical representation, we unify a number of hitherto disparate estimation procedures. We illustrate our methodology and algorithms on a numerical example. Finally, we conclude with directions for future research.
comment: 23 pages, 2 figures
♻ ☆ Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture
The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of SER algorithms and develop a more robust system for adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adopt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested on real-world reverberant environments.
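A toy sketch of the two multi-channel fusion strategies compared above: (a) average mel-spectrograms across microphones before the transformer, or (b) embed each channel into patches and sum the embeddings. Shapes and the `patch_embed` module are placeholders, not the HTS-AT implementation.

```python
# Two simple multi-microphone fusion strategies (illustrative shapes only).
import torch

def fuse_average_mels(mels):                    # mels: (channels, n_mels, T)
    return mels.mean(dim=0, keepdim=True)       # -> (1, n_mels, T), fed as usual

def fuse_sum_patch_embeddings(mels, patch_embed):
    # Embed every channel separately, then sum token-wise: (channels, N, D) -> (N, D)
    tokens = torch.stack([patch_embed(m.unsqueeze(0)).squeeze(0) for m in mels])
    return tokens.sum(dim=0)
```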
♻ ☆ Discovering Preference Optimization Algorithms with and for Large Language Models
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. Typically, preference optimization is approached as an offline supervised learning task using manually-crafted convex loss functions. While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of possible loss functions remains underexplored. We address this by performing LLM-driven objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously-evaluated performance metrics. This process leads to the discovery of previously-unknown and performant preference optimization algorithms. The best performing of these we call Discovered Preference Optimization (DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks.
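The sketch below shows the kind of blended objective such a discovery process can find: a logistic (DPO-style) term and an exponential term mixed by a sigmoid gate on the reward margin. It is schematic; the exact DiscoPOP formula and its constants are defined in the paper and are not reproduced here.

```python
# Schematic preference loss blending logistic and exponential terms.
import torch
import torch.nn.functional as F

def blended_preference_loss(logratio_chosen, logratio_rejected, beta=0.1, tau=0.05):
    """logratio_* = log pi_theta(y|x) - log pi_ref(y|x), summed over tokens."""
    margin = beta * (logratio_chosen - logratio_rejected)
    logistic_term = -F.logsigmoid(margin)      # DPO-style logistic loss
    exponential_term = torch.exp(-margin)      # exponential loss
    gate = torch.sigmoid(margin / tau)         # adaptive mixing weight
    return (gate * logistic_term + (1 - gate) * exponential_term).mean()
```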
♻ ☆ REBEL: Reinforcement Learning via Regressing Relative Rewards
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in AlpacaEval 2.0, MT-Bench, and Open LLM Leaderboard.
comment: New experimental results on general chat
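A hedged sketch of the core REBEL regression described above: the difference in policy log-ratios between two completions of the same prompt is regressed onto their reward difference. Tensor shapes and the placement of the 1/eta scaling follow the abstract's description at a high level; treat this as schematic rather than the authors' implementation.

```python
# Regress relative policy log-ratios onto relative rewards (schematic).
import torch

def rebel_loss(logp_new_1, logp_old_1, logp_new_2, logp_old_2,
               reward_1, reward_2, eta=1.0):
    """logp_*: summed log-probabilities of completions 1 and 2 under the
    current and previous policy; reward_*: scalar rewards per completion."""
    relative_logratio = (logp_new_1 - logp_old_1) - (logp_new_2 - logp_old_2)
    relative_reward = reward_1 - reward_2
    return torch.mean((relative_logratio / eta - relative_reward) ** 2)
```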
♻ ☆ MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
Large language models (LLM) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that the performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions consistent with quantifying LLM performance but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that can help the LLM generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM medical question-answering benchmark performance generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. Successful "attacks" modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless "trick" the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can ensure a successful attack is statistically significant. We show how performance on a "MedFuzzed" benchmark, as well as individual successful attacks, can be used to probe an LLM's robustness. The methods show promise at providing insights into the ability of an LLM to operate robustly in more realistic settings.
comment: 9 pages, 3 figures, 2 algorithms, appendix
♻ ☆ CardioLab: Laboratory Values Estimation from Electrocardiogram Features -- An Exploratory Study
Introduction: Laboratory values represent a cornerstone of medical diagnostics, but they suffer from slow turnaround times and high costs, and only provide information about a single point in time. The continuous estimation of laboratory values from non-invasive data such as the electrocardiogram (ECG) would therefore mark a significant frontier in healthcare monitoring. Despite its transformative potential, this domain remains relatively underexplored within the medical community. Methods: In this preliminary study, we used a publicly available dataset (MIMIC-IV-ECG) to investigate the feasibility of inferring laboratory values from ECG features and patient demographics using tree-based models (XGBoost). We define the prediction task as a binary problem of predicting whether a lab value falls into the low or high abnormal range. Model performance is then assessed using AUROC. Results: Our findings demonstrate promising results in the estimation of laboratory values related to different organ systems based on a small yet comprehensive set of features. While further research and validation are warranted to fully assess the clinical utility and generalizability of ECG-based estimation in healthcare monitoring, our findings lay the groundwork for future investigations into approaches to laboratory value estimation using ECG data. Such advancements hold promise for revolutionizing predictive healthcare applications, offering faster, non-invasive, and more affordable means of patient monitoring.
comment: 4 pages (updated dataset feature set description); code available at https://github.com/AI4HealthUOL/CardioLab
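A minimal sketch of the described setup: a gradient-boosted tree classifier (XGBoost) predicts whether a lab value is abnormal from ECG features plus demographics, evaluated with AUROC. The synthetic features and labels below are placeholders for whatever the MIMIC-IV-ECG preprocessing actually produces.

```python
# Binary abnormality prediction with XGBoost, evaluated by AUROC (toy data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))          # stand-in for ECG features + age, sex, etc.
y = rng.integers(0, 2, size=1000)        # 1 = abnormal lab value, 0 = normal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```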
♻ ☆ Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms ICLR 2024
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. The runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the widely-used Vina and Gnina scoring functions, and is more robust on computationally predicted structures. Code is available at https://github.com/bjing2016/scalar-fields.
comment: ICLR 2024
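A toy illustration of why the cross-correlation form above is fast: scoring a ligand field against a protein field over all 3D translations reduces to a product in Fourier space. The scalar fields here are random single-channel stand-ins for the learned multi-channel equivariant fields.

```python
# Score all circular translations of a ligand field against a protein field
# in O(N log N) via the FFT cross-correlation theorem.
import numpy as np

def correlation_scores(protein_field, ligand_field):
    """Both inputs: (D, H, W) scalar grids. Returns the score of every shift."""
    P = np.fft.fftn(protein_field)
    L = np.fft.fftn(ligand_field)
    return np.real(np.fft.ifftn(P * np.conj(L)))

protein = np.random.rand(32, 32, 32)
ligand = np.random.rand(32, 32, 32)
scores = correlation_scores(protein, ligand)
best_shift = np.unravel_index(np.argmax(scores), scores.shape)
```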
♻ ☆ ApisTox: a new benchmark dataset for the classification of small molecules toxicity on honey bees
The global decline in bee populations poses significant risks to agriculture, biodiversity, and environmental stability. To bridge the gap in existing data, we introduce ApisTox, a comprehensive dataset focusing on the toxicity of pesticides to honey bees (Apis mellifera). This dataset combines and leverages data from existing sources such as ECOTOX and PPDB, providing an extensive, consistent, and curated collection that surpasses the previous datasets. ApisTox incorporates a wide array of data, including toxicity levels for chemicals, details such as time of their publication in literature, and identifiers linking them to external chemical databases. This dataset may serve as an important tool for environmental and agricultural research, but also can support the development of policies and practices aimed at minimizing harm to bee populations. Finally, ApisTox offers a unique resource for benchmarking molecular property prediction methods on agrochemical compounds, facilitating advancements in both environmental science and cheminformatics. This makes it a valuable tool for both academic research and practical applications in bee conservation.
♻ ☆ T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
The recent development of Sora leads to a new era in text-to-video (T2V) generation. Along with this comes the rising concern about its security risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover fewer aspects and do not address the unique temporal risk inherent in video generation. To bridge this research gap, we introduce T2VSafetyBench, a new benchmark designed for conducting safety-critical assessments of text-to-video models. We define 12 critical aspects of video generation safety and construct a malicious prompt dataset including real-world prompts, LLM-generated prompts and jailbreak attack-based prompts. Based on our evaluation results, we draw several important findings, including: 1) no single model excels in all aspects, with different models showing various strengths; 2) the correlation between GPT-4 assessments and manual reviews is generally high; 3) there is a trade-off between the usability and safety of text-to-video generative models. This indicates that as the field of video generation rapidly advances, safety risks are set to surge, highlighting the urgency of prioritizing video safety. We hope that T2VSafetyBench can provide insights for better understanding the safety of video generation in the era of generative AI.
♻ ☆ Data-Aware Gradient Compression for DML in Communication-Constrained Mobile Computing
Distributed machine learning (DML) in mobile environments faces significant communication bottlenecks. Gradient compression has proven to be an effective solution to this issue, offering substantial benefits in environments with limited bandwidth and metered data. Yet, it encounters severe performance drops in non-IID environments due to a one-size-fits-all compression approach, which does not account for the varying data volumes across workers. Assigning varying compression ratios to workers with distinct data distributions and volumes is therefore a promising solution. This work derives the convergence rate of distributed SGD with non-uniform compression, which reveals the intricate relationship between model convergence and the compression ratios applied to individual workers. Accordingly, we frame the relative compression ratio assignment as an $n$-variable chi-squared nonlinear optimization problem, constrained by a limited communication budget. We propose DAGC-R, which assigns conservative compression to workers handling larger data volumes. Recognizing the computational limitations of mobile devices, we also propose DAGC-A, which is computationally less demanding and enhances the robustness of compression in non-IID scenarios. Our experiments confirm that DAGC-A and DAGC-R can speed up training by up to $16.65\%$ and $25.43\%$ respectively, compared to uniform compression, when dealing with highly imbalanced data volume distributions and restricted communication.
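The sketch below shows non-uniform top-k gradient sparsification, the mechanism that per-worker compression ratios plug into: each worker keeps a different fraction of gradient entries, e.g. a more conservative (larger) ratio for workers holding more data. The ratio assignment itself comes from solving the paper's optimization problem and is not shown.

```python
# Per-worker top-k gradient sparsification (the ratio varies across workers).
import torch

def topk_compress(grad, ratio):
    """Keep the largest-magnitude `ratio` fraction of entries; zero the rest."""
    flat = grad.flatten()
    k = max(1, int(ratio * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(grad)

# e.g. a worker with a large local dataset might use ratio=0.05 while a small
# one uses ratio=0.01, instead of a one-size-fits-all ratio.
```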
♻ ☆ Creating Temporally Correlated High-Resolution Profiles of Load Injection Using Constrained Generative Adversarial Networks
Traditional smart meters, which measure energy usage every 15 minutes or more and report it at least a few hours later, lack the granularity needed for real-time decision-making. To address this practical problem, we introduce a new method using generative adversarial networks (GAN) that enforces temporal consistency on its high-resolution outputs via hard inequality constraints using convex optimization. A unique feature of our GAN model is that it is trained solely on slow timescale aggregated historical energy data obtained from smart meters. The results demonstrate that the model can successfully create minute-by-minute temporally correlated profiles of power usage from 15-minute interval average power consumption information. This innovative approach, emphasizing inter-neuron constraints, offers a promising avenue for improved high-speed state estimation in distribution systems and enhances the applicability of data-driven solutions for monitoring and subsequently controlling such systems.
comment: 6 pages
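As a simple analogue of the hard temporal-consistency constraint above, the sketch below projects a candidate minute-level profile so that each 15-minute block exactly matches the measured 15-minute average. The paper enforces such constraints inside the network via convex optimization; this post-hoc projection is only the simplest feasible-projection illustration.

```python
# Shift each 15-minute block so its mean equals the observed 15-minute average.
import numpy as np

def project_to_block_averages(minute_profile, quarter_hour_averages):
    """minute_profile: (96*15,) candidate profile for one day;
    quarter_hour_averages: (96,) observed 15-minute averages."""
    profile = minute_profile.reshape(-1, 15)
    correction = quarter_hour_averages - profile.mean(axis=1)
    return (profile + correction[:, None]).reshape(-1)
```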
♻ ☆ Generalizing Graph Transformers Across Diverse Graphs and Tasks via Pre-Training on Industrial-Scale Data
Graph pre-training has concentrated on graph-level tasks over small graphs (e.g., molecular graphs) or on learning node representations on a fixed graph. Extending graph pre-trained models to web-scale graphs with billions of nodes in industrial scenarios, while avoiding negative transfer across graphs or tasks, remains a challenge. We aim to develop a general graph pre-trained model with inductive ability that can make predictions for unseen new nodes and even new graphs. In this work, we introduce a scalable transformer-based graph pre-training framework called PGT (Pre-trained Graph Transformer). Specifically, we design a flexible and scalable graph transformer as the backbone network. Meanwhile, based on the masked autoencoder architecture, we design two pre-training tasks: one for reconstructing node features and the other one for reconstructing local structures. Unlike the original autoencoder architecture where the pre-trained decoder is discarded, we propose a novel strategy that utilizes the decoder for feature augmentation. We have deployed our framework on Tencent's online game data. Extensive experiments have demonstrated that our framework can perform pre-training on real-world web-scale graphs with over 540 million nodes and 12 billion edges and generalizes effectively to unseen new graphs with different downstream tasks. We further conduct experiments on the publicly available ogbn-papers100M dataset, which consists of 111 million nodes and 1.6 billion edges. Our framework achieves state-of-the-art performance on both industrial datasets and public datasets, while also enjoying scalability and efficiency.
comment: Work in progress
♻ ☆ Harmonizing Generalization and Personalization in Federated Prompt Learning
Federated Prompt Learning (FPL) incorporates large pre-trained Vision-Language models (VLM) into federated learning through prompt tuning. The transferable representations and remarkable generalization capacity of VLM make them highly compatible with the integration of federated learning. Addressing data heterogeneity in federated learning requires personalization, but excessive focus on it across clients could compromise the model's ability to generalize effectively. To preserve the impressive generalization capability of VLM, it is crucial to strike a balance between personalization and generalization in FPL. To tackle this challenge, we proposed Federated Prompt Learning with CLIP Generalization and low-rank Personalization (FedPGP), which employs pre-trained CLIP to provide knowledge-guidance on the global prompt for improved generalization and incorporates a low-rank adaptation term to personalize the global prompt. Further, FedPGP integrates a prompt-wise contrastive loss to achieve knowledge guidance and personalized adaptation simultaneously, enabling a harmonious balance between personalization and generalization in FPL. We conduct extensive experiments on various datasets to explore base-to-novel generalization in both category-level and domain-level scenarios with heterogeneous data, showing the superiority of FedPGP in balancing generalization and personalization.
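A schematic of the prompt decomposition described above: every client keeps the shared global prompt plus a low-rank personal term, prompt_i = P_global + U_i V_i. The dimensions and rank are placeholders, and the CLIP-guided contrastive loss that balances the two parts is omitted here.

```python
# Global prompt plus client-specific low-rank adaptation (schematic).
import torch
import torch.nn as nn

class ClientPrompt(nn.Module):
    def __init__(self, n_tokens=16, dim=512, rank=4):
        super().__init__()
        self.global_prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.u = nn.Parameter(torch.randn(n_tokens, rank) * 0.02)   # personal
        self.v = nn.Parameter(torch.randn(rank, dim) * 0.02)        # personal

    def forward(self):
        return self.global_prompt + self.u @ self.v  # personalized prompt

# In federated training, only global_prompt would be aggregated across clients,
# while (u, v) stay local to each client.
```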
♻ ☆ A Versatile Graph Learning Approach through LLM-based Agent
Designing versatile graph learning approaches is important, considering the diverse graphs and tasks existing in real-world applications. Existing methods have attempted to achieve this target through automated machine learning techniques, pre-training and fine-tuning strategies, and large language models. However, these methods are not versatile enough for graph learning, as they work on either limited types of graphs or a single task. In this paper, we propose to explore versatile graph learning approaches with LLM-based agents, and the key insight is customizing the graph learning procedures for diverse graphs and tasks. To achieve this, we develop several LLM-based agents, equipped with diverse profiles, tools, functions and human experience. They collaborate to configure each procedure with task and data-specific settings step by step towards versatile solutions, and the proposed method is dubbed GL-Agent. By evaluating on diverse tasks and graphs, the correct results of the agent and its comparable performance showcase the versatility of the proposed method, especially in complex scenarios.The low resource cost and the potential to use open-source LLMs highlight the efficiency of GL-Agent.
♻ ☆ MLRegTest: A Benchmark for the Machine Learning of Regular Languages
Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known challenge for ML systems to generalize successfully. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provides a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.
comment: Accepted for publication in the Journal of Machine Learning Research. Dataset available at https://doi.org/10.5061/dryad.dncjsxm4h , code available at https://github.com/heinz-jeffrey/subregular-learning
♻ ☆ SFR-GNN: Simple and Fast Robust GNNs against Structural Attacks
Graph Neural Networks (GNNs) have demonstrated commendable performance for graph-structured data. Yet, GNNs are often vulnerable to adversarial structural attacks as embedding generation relies on graph topology. Existing efforts are dedicated to purifying the maliciously modified structure or applying adaptive aggregation, thereby enhancing the robustness against adversarial structural attacks. It is inevitable for a defender to consume heavy computational costs due to lacking prior knowledge about modified structures. To this end, we propose an efficient defense method, called Simple and Fast Robust Graph Neural Network (SFR-GNN), supported by mutual information theory. The SFR-GNN first pre-trains a GNN model using node attributes and then fine-tunes it over the modified graph in the manner of contrastive learning, which is free of purifying modified structures and adaptive aggregation, thus achieving great efficiency gains. Consequently, SFR-GNN exhibits a 24%--162% speedup compared to advanced robust models, demonstrating superior robustness for node classification tasks.
♻ ☆ Finite-dimensional approximations of push-forwards on locally analytic functionals
This paper introduces a novel theoretical framework for investigating analytic maps from finite discrete data. Our approach is to consider the push-forward on the space of locally analytic functionals, instead of directly handling the analytic map itself. We establish a methodology enabling appropriate finite-dimensional approximation of the push-forward from finite discrete data, through the theory of the Fourier--Borel transform and the Fock space. Moreover, we prove a rigorous convergence result with a convergence rate. As an application, we prove that it is not the least-squares polynomial, but the polynomial obtained by truncating its higher-degree terms, that approximates analytic functions and further allows for approximation beyond the support of the data distribution. One advantage of our theory is that it enables us to apply linear algebraic operations to the finite-dimensional approximation of the push-forward. Utilizing this, we prove the convergence of a method for approximating an analytic vector field from finite data of the flow map of an ordinary differential equation.
comment: 32 pages, 2 figures. We modified results. Comments are welcome.
♻ ☆ Multi-Fidelity Active Learning with GFlowNets
In the last decades, the capacity to generate large amounts of data in science and engineering applications has been growing steadily. Meanwhile, machine learning has progressed to become a suitable tool to process and utilise the available data. Nonetheless, many relevant scientific and engineering problems present challenges where current machine learning methods cannot yet efficiently leverage the available data and resources. For example, in scientific discovery, we are often faced with the problem of exploring very large, structured and high-dimensional spaces. Moreover, the high fidelity, black-box objective function is often very expensive to evaluate. Progress in machine learning methods that can efficiently tackle such challenges would help accelerate currently crucial areas such as drug and materials discovery. In this paper, we propose a multi-fidelity active learning algorithm with GFlowNets as a sampler, to efficiently discover diverse, high-scoring candidates where multiple approximations of the black-box function are available at lower fidelity and cost. Our evaluation on molecular discovery tasks shows that multi-fidelity active learning with GFlowNets can discover high-scoring candidates at a fraction of the budget of its single-fidelity counterpart while maintaining diversity, unlike RL-based alternatives. These results open new avenues for multi-fidelity active learning to accelerate scientific discovery and engineering design.
comment: Published in Transactions on Machine Learning Research (TMLR) 07/2024 https://openreview.net/forum?id=dLaazW9zuF
♻ ☆ Bridging the Sim-to-Real Gap with Bayesian Inference
We present SIM-FSVGD for learning robot dynamics from data. As opposed to traditional methods, SIM-FSVGD leverages low-fidelity physical priors, e.g., in the form of simulators, to regularize the training of neural network models. While learning accurate dynamics already in the low data regime, SIM-FSVGD scales and excels also when more data is available. We empirically show that learning with implicit physical priors results in accurate mean model estimation as well as precise uncertainty quantification. We demonstrate the effectiveness of SIM-FSVGD in bridging the sim-to-real gap on a high-performance RC racecar system. Using model-based RL, we demonstrate a highly dynamic parking maneuver with drifting, using less than half the data compared to the state of the art.
♻ ☆ Quantum Implicit Neural Representations
Implicit neural representations have emerged as a powerful paradigm to represent signals such as images and sounds. This approach aims to utilize neural networks to parameterize the implicit function of the signal. However, when representing implicit functions, traditional neural networks such as ReLU-based multilayer perceptrons face challenges in accurately modeling high-frequency components of signals. Recent research has begun to explore the use of Fourier Neural Networks (FNNs) to overcome this limitation. In this paper, we propose Quantum Implicit Representation Network (QIREN), a novel quantum generalization of FNNs. Furthermore, through theoretical analysis, we demonstrate that QIREN possesses a quantum advantage over classical FNNs. Lastly, we conducted experiments in signal representation, image superresolution, and image generation tasks to show the superior performance of QIREN compared to state-of-the-art (SOTA) models. Our work not only incorporates quantum advantages into implicit neural representations but also uncovers a promising application direction for Quantum Neural Networks.
comment: This paper was accepted by ICML 2024
♻ ☆ Uncertainty Modeling in Graph Neural Networks via Stochastic Differential Equations
We address the problem of learning uncertainty-aware representations for graph-structured data. While Graph Neural Ordinary Differential Equations (GNODE) are effective in learning node representations, they fail to quantify uncertainty. To address this, we introduce Latent Graph Neural Stochastic Differential Equations (LGNSDE), which enhance GNODE by embedding randomness through Brownian motion to quantify uncertainty. We provide theoretical guarantees for LGNSDE and empirically show better performance in uncertainty quantification.
comment: 9 pages including appendix
♻ ☆ Downlink CCM Estimation via Representation Learning with Graph Regularization
In this paper, we propose an algorithm for downlink (DL) channel covariance matrix (CCM) estimation for frequency division duplexing (FDD) massive multiple-input multiple-output (MIMO) communication systems with base station (BS) possessing a uniform linear array (ULA) antenna structure. We consider a setting where the UL CCM is mapped to DL CCM by a mapping function. We first present a theoretical error analysis of learning a nonlinear embedding by constructing a mapping function, which points to the importance of the Lipschitz regularity of the mapping function for achieving high estimation performance. Then, based on the theoretical ground, we propose a representation learning algorithm as a solution for the estimation problem, where Gaussian RBF kernel interpolators are chosen to map UL CCMs to their DL counterparts. The proposed algorithm is based on the optimization of an objective function that fits a regression model between the DL CCM and UL CCM samples in the training dataset and preserves the local geometric structure of the data in the UL CCM space, while explicitly regulating the Lipschitz continuity of the mapping function in light of our theoretical findings. The proposed algorithm surpasses benchmark methods in terms of three error metrics as shown by simulations.
♻ ☆ Editing Personality for Large Language Models NLPCC 2024
This paper introduces an innovative task focused on editing the personality traits of Large Language Models (LLMs). This task seeks to adjust the models' responses to opinion-related questions on specified topics, since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits. Specifically, we construct PersonalityEdit, a new benchmark dataset to address this task. Drawing on the theory in Social Psychology, we isolate three representative traits, namely Neuroticism, Extraversion, and Agreeableness, as the foundation for our benchmark. We then gather data using GPT-4, generating responses that align with a specified topic and embody the targeted personality trait. We conduct comprehensive experiments involving various baselines and discuss the representation of personality behavior in LLMs. Our findings uncover potential challenges of the proposed task, illustrating several remaining issues. We anticipate that our work can stimulate further annotation in model editing and personality-related research. Code is available at https://github.com/zjunlp/EasyEdit.
comment: NLPCC 2024
♻ ☆ UniHPF: Universal Healthcare Predictive Framework with Zero Domain Knowledge ML4H
Despite the abundance of Electronic Healthcare Records (EHR), their heterogeneity restricts the utilization of medical data in building predictive models. To address this challenge, we propose the Universal Healthcare Predictive Framework (UniHPF), which requires no medical domain knowledge and minimal pre-processing for multiple prediction tasks. Experimental results demonstrate that UniHPF is capable of building large-scale EHR models that can process any form of medical data from distinct EHR systems. We believe that our findings can provide helpful insights for further research on the multi-source learning of EHRs.
comment: The original paper is published in the Journal of Biomedical and Health Informatics (JBHI) 2023, https://ieeexplore.ieee.org/document/10298642. Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2022, November 28th, 2022, New Orleans, United States, 19 pages (main paper 6 pages). arXiv admin note: substantial text overlap with arXiv:2207.09858
♻ ☆ Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships
Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the exposure of the original training data. This strategy differs from prior studies by aiming to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose the use of pseudo-labels for these generated texts, leveraging membership approximations indicated by machine-generated probabilities from the target LM. We subsequently fine-tune the LM to favor generations with higher likelihoods of originating from the pre-training data, based on their membership probabilities. Our empirical findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a four to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
comment: 20 pages, 6 figures, 15 tables
♻ ☆ AirPilot: Interpretable PPO-based DRL Auto-Tuned Nonlinear PID Drone Controller for Robust Autonomous Flights
Navigation precision, speed and stability are crucial for safe Unmanned Aerial Vehicle (UAV) flight maneuvers and effective flight mission executions in dynamic environments. Different flight missions may have varying objectives, such as minimizing energy consumption, achieving precise positioning, or maximizing speed. A controller that can adapt to different objectives on the fly is highly valuable. Proportional Integral Derivative (PID) controllers are one of the most popular and widely used control algorithms for drones and other control systems, but their linear control algorithm fails to capture the nonlinear nature of the dynamic wind conditions and complex drone system. Manually tuning the PID gains for various missions can be time-consuming and requires significant expertise. This paper aims to revolutionize drone flight control by presenting AirPilot, a nonlinear Deep Reinforcement Learning (DRL)-enhanced PID drone controller using Proximal Policy Optimization (PPO). The AirPilot controller combines the simplicity and effectiveness of traditional PID control with the adaptability, learning capability, and optimization potential of DRL. This makes it better suited for modern drone applications where the environment is dynamic, and mission-specific performance demands are high. We employed a COEX Clover autonomous drone for training the DRL agent within the simulator and implemented it in a real-world lab setting, which marks a significant milestone as one of the first attempts to apply a DRL-based flight controller to an actual drone. AirPilot reduces the navigation error of the default PX4 PID position controller by 90%, improves the effective navigation speed of a fine-tuned PID controller by 21%, and reduces settling time and overshoot by 17% and 16%, respectively.
comment: 9 pages, 20 figures
♻ ☆ The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist
♻ ☆ Enhancing Training Efficiency Using Packing with Flash Attention
Padding is often used when fine-tuning LLMs by adding special tokens to shorter training examples to match the length of the longest sequence in each batch. While this ensures uniformity for batch processing, it introduces inefficiencies by including irrelevant padding tokens in the computation and wastes GPU resources. The Hugging Face SFT trainer has long offered the option to use packing to combine multiple training examples, allowing for maximal utilization of GPU resources. However, until now, it did not offer proper masking of each packed training example. This capability has been added to Hugging Face Transformers 4.44. We analyse this new feature and show the benefits across different variations of packing.
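As a rough illustration of the packing idea described above, the sketch below packs several tokenized examples into one sequence while restarting position ids at each example boundary, which is the information a boundary-aware attention kernel needs to avoid cross-example attention. The function and variable names are illustrative assumptions, not the Hugging Face API.

    # Minimal sketch (not the Transformers API): pack variable-length examples
    # into one sequence while keeping per-example position ids, so that an
    # attention kernel which respects boundaries does not mix examples.
    def pack_examples(tokenized_examples):
        input_ids, position_ids, boundaries = [], [], [0]
        for example in tokenized_examples:             # each example: list of token ids
            input_ids.extend(example)
            position_ids.extend(range(len(example)))   # positions restart per example
            boundaries.append(boundaries[-1] + len(example))
        return input_ids, position_ids, boundaries     # boundaries play the role of cu_seqlens

    print(pack_examples([[5, 6, 7], [8, 9], [10, 11, 12, 13]]))
    # ([5, 6, 7, 8, 9, 10, 11, 12, 13], [0, 1, 2, 0, 1, 0, 1, 2, 3], [0, 3, 5, 9])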
Information Retrieval 10
☆ A Counterfactual Explanation Framework for Retrieval Models
Explainability has become a crucial concern in today's world, aiming to enhance transparency in machine learning and deep learning models. Information retrieval is no exception to this trend. In the existing literature on explainability of information retrieval, the emphasis has predominantly been on illustrating the concept of relevance concerning a retrieval model. The questions addressed include why a document is relevant to a query, why one document exhibits higher relevance than another, or why a specific set of documents is deemed relevant for a query. However, limited attention has been given to understanding why a particular document is considered non-relevant to a query with respect to a retrieval model. In an effort to address this gap, our work focuses on the question of what terms need to be added within a document to improve its ranking. This in turn answers the question of which words played a role in the document not being favored by a retrieval model for a particular query. We use an optimization framework to solve the above-mentioned research problem. To the best of our knowledge, we mark the first attempt to tackle this specific counterfactual problem. Our experiments show the effectiveness of our proposed approach in predicting counterfactuals for both statistical (e.g. BM25) and deep-learning-based models (e.g. DRMM, DSSM, ColBERT).
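To make the counterfactual question concrete, here is a toy sketch in the spirit of the statistical case: greedily pick query terms whose addition most improves a document's BM25 score. This only illustrates the question being asked; the paper's actual method is an optimization framework, and all names below are hypothetical.

    import math
    from collections import Counter

    def bm25_score(query, doc, df, n_docs, avgdl, k1=1.2, b=0.75):
        # query, doc: lists of tokens; df: document frequencies; avgdl: average doc length
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        return score

    def counterfactual_terms(query, doc, df, n_docs, avgdl, budget=2):
        # greedily choose which query terms, if added to the document, raise its score most
        added = []
        for _ in range(budget):
            best = max(query, key=lambda t: bm25_score(query, doc + added + [t], df, n_docs, avgdl))
            added.append(best)
        return added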
☆ Dissecting Temporal Understanding in Text-to-Audio Retrieval
Recent advancements in machine learning have fueled research on multimodal tasks, such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating the temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of events. Code and data are available at https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/.
comment: 9 pages, 5 figures, ACM Multimedia 2024, https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/
☆ Building FKG.in: a Knowledge Graph for Indian Food
This paper presents an ontology design along with knowledge engineering, and multilingual semantic reasoning techniques to build an automated system for assimilating culinary information for Indian food in the form of a knowledge graph. The main focus is on designing intelligent methods to derive ontology designs and capture all-encompassing knowledge about food, recipes, ingredients, cooking characteristics, and most importantly, nutrition, at scale. We present our ongoing work in this workshop paper, describe in some detail the relevant challenges in curating knowledge of Indian food, and propose our high-level ontology design. We also present a novel workflow that uses AI, LLM, and language technology to curate information from recipe blog sites in the public domain to build knowledge graphs for Indian food. The methods for knowledge curation proposed in this paper are generic and can be replicated for any domain. The design is application-agnostic and can be used for AI-driven smart analysis, building recommendation systems for Personalized Digital Health, and complementing the knowledge graph for Indian food with contextual information such as user information, food biochemistry, geographic information, agricultural information, etc.
comment: 14 pages, 3 figures, 25 references, Formal Ontology in Information Systems Conference 2024 - Integrated Food Ontology Workshop
☆ Hound: Hunting Supervision Signals for Few and Zero Shot Node Classification on Text-attributed Graph
Text-attributed graph (TAG) is an important type of graph-structured data with text descriptions for each node. Few- and zero-shot node classification on TAGs has many applications in fields such as academia and social networks. However, the two tasks are challenging due to the lack of supervision signals, and existing methods only use the contrastive loss to align graph-based node embeddings and language-based text embeddings. In this paper, we propose Hound to improve accuracy by introducing more supervision signals, and the core idea is to go beyond the node-text pairs that come with the data. Specifically, we design three augmentation techniques, i.e., node perturbation, text matching, and semantics negation, to provide more reference nodes for each text and vice versa. Node perturbation adds/drops edges to produce diversified node embeddings that can be matched with a text. Text matching retrieves texts with similar embeddings to match with a node. Semantics negation uses a negative prompt to construct a negative text with the opposite semantics, which is contrasted with the original node and text. We evaluate Hound on 5 datasets and compare it with 13 state-of-the-art baselines. The results show that Hound consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
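A hedged sketch of the three augmentations as they are described above, producing extra reference pairs for a contrastive objective; the function names, the negation prompt, and the tensor shapes are assumptions for illustration, not the authors' code.

    import torch

    def node_perturbation(edge_index, drop_prob=0.1):
        # randomly drop edges so the GNN produces a diversified node embedding
        keep = torch.rand(edge_index.size(1)) > drop_prob
        return edge_index[:, keep]

    def text_matching(node_emb, text_embs, k=3):
        # retrieve the k texts with the most similar embeddings as extra positives for a node
        sims = torch.nn.functional.cosine_similarity(node_emb.unsqueeze(0), text_embs, dim=-1)
        return sims.topk(k).indices

    def semantics_negation(text):
        # build a text with opposite semantics, used as a hard negative (prompt is hypothetical)
        return "This is not about: " + text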
☆ Fair Reciprocal Recommendation in Matching Markets RecSys2024
Recommender systems play an increasingly crucial role in shaping people's opportunities, particularly in online dating platforms. It is essential from the user's perspective to increase the probability of matching with a suitable partner while ensuring an appropriate level of fairness in the matching opportunities. We investigate reciprocal recommendation in two-sided matching markets between agents divided into two sides. In our model, a match is considered successful only when both individuals express interest in each other. Additionally, we assume that agents prefer to appear prominently in the recommendation lists presented to those on the other side. We define each agent's opportunity to be recommended and introduce its fairness criterion, envy-freeness, from the perspective of fair division theory. The recommendations that approximately maximize the expected number of matches, empirically obtained by heuristic algorithms, are likely to result in significant unfairness of opportunity. Therefore, there can be a trade-off between maximizing the expected matches and ensuring fairness of opportunity. To address this challenge, we propose a method to find a policy that is close to being envy-free by leveraging the Nash social welfare function. Experiments on synthetic and real-world datasets demonstrate the effectiveness of our approach in achieving both relatively high expected matches and fairness for opportunities of both sides in reciprocal recommender systems.
comment: Accepted at RecSys2024
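The fairness idea above can be illustrated with a toy objective: parameterize a stochastic recommendation policy and maximize the sum of log expected exposures (a log Nash social welfare), which pushes opportunities toward an envy-reducing allocation. The exposure model below is an assumption made only for illustration, not the paper's market model.

    import torch

    n_viewers, n_candidates = 8, 8
    scores = torch.randn(n_viewers, n_candidates, requires_grad=True)
    optimizer = torch.optim.Adam([scores], lr=0.1)

    for _ in range(200):
        policy = torch.softmax(scores, dim=1)         # recommendation probabilities per viewer
        opportunity = policy.sum(dim=0) + 1e-8        # expected exposure of each candidate
        nash_welfare = torch.log(opportunity).sum()   # log Nash social welfare
        loss = -nash_welfare
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Maximizing total expected matches alone can concentrate exposure on a few popular candidates; the log-sum objective instead rewards spreading exposure, which is the trade-off the paper studies.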
☆ A Learnable Agent Collaboration Network Framework for Personalized Multimodal AI Search Engine
Large language models (LLMs) and retrieval-augmented generation (RAG) techniques have revolutionized traditional information access, enabling AI agents to search and summarize information on behalf of users during dynamic dialogues. Despite their potential, current AI search engines exhibit considerable room for improvement in several critical areas. These areas include the support for multimodal information, the delivery of personalized responses, the capability to logically answer complex questions, and the facilitation of more flexible interactions. This paper proposes a novel AI Search Engine framework called the Agent Collaboration Network (ACN). The ACN framework consists of multiple specialized agents working collaboratively, each with distinct roles such as Account Manager, Solution Strategist, Information Manager, and Content Creator. This framework integrates mechanisms for picture content understanding, user profile tracking, and online evolution, enhancing the AI search engine's response quality, personalization, and interactivity. A highlight of the ACN is the introduction of a Reflective Forward Optimization method (RFO), which supports the online synergistic adjustment among agents. This feature endows the ACN with online learning capabilities, ensuring that the system has strong interactive flexibility and can promptly adapt to user feedback. This learning method may also serve as an optimization approach for agent-based systems, potentially influencing other domains of agent applications.
comment: ACMMM 2024 MMGR WORKSHOP
♻ ☆ Dynamic Boundary Time Warping for Sub-sequence Matching with Few Examples
The paper presents a novel method of finding a fragment in a long temporal sequence that is similar to a set of shorter sequences. We are the first to propose an algorithm for such a search that does not rely on computing the average sequence from the query examples. Instead, we use the query examples as is, utilizing all of them simultaneously. The introduced method, based on the Dynamic Time Warping (DTW) technique, is explicitly suited for few-shot query-by-example retrieval tasks. We evaluate it on two different few-shot problems from the field of Natural Language Processing. The results show it either outperforms baselines and previous approaches or achieves comparable results when a low number of examples is available.
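For background, a plain subsequence DTW between a single query and a long sequence looks like the sketch below (free start and end positions on the long sequence). The paper's contribution, handling a set of query examples jointly, is not shown here.

    import numpy as np

    def subsequence_dtw(query, sequence):
        # query, sequence: 1-D numeric sequences; returns best alignment cost and end index
        n, m = len(query), len(sequence)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, :] = 0.0                               # the match may start anywhere in the sequence
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(query[i - 1] - sequence[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        end = int(np.argmin(D[n, 1:])) + 1          # best ending position in the sequence
        return D[n, end], end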
♻ ☆ Forecasting Live Chat Intent from Browsing History CIKM 2024
Customers reach out to online live chat agents with various intents, such as asking about product details or requesting a return. In this paper, we propose the problem of predicting user intent from browsing history and address it through a two-stage approach. The first stage classifies a user's browsing history into high-level intent categories. Here, we represent each browsing history as a text sequence of page attributes and use the ground-truth class labels to fine-tune pretrained Transformers. The second stage provides a large language model (LLM) with the browsing history and predicted intent class to generate fine-grained intents. For automatic evaluation, we use a separate LLM to judge the similarity between generated and ground-truth intents, which closely aligns with human judgments. Our two-stage approach yields significant performance gains compared to generating intents without the classification stage.
comment: CIKM 2024
♻ ☆ Sparsity-regularized coded ptychography for robust and efficient lensless microscopy on a chip
Coded ptychography has emerged as a powerful technique for high-throughput, high-resolution lensless imaging. However, the trade-off between acquisition speed and image quality remains a significant challenge. To address this, we introduce a novel sparsity-regularized approach to coded ptychography that dramatically reduces the number of required measurements while maintaining high reconstruction quality. The reported approach, termed the ptychographic proximal total-variation (PPTV) solver, formulates the reconstruction task as a total variation regularized optimization problem. Unlike previous implementations that rely on specialized hardware or illumination schemes, PPTV integrates seamlessly into existing coded ptychography setups. Through comprehensive numerical simulations, we demonstrate that PPTV-driven coded ptychography can produce accurate reconstructions with as few as eight intensity measurements, a significant reduction compared to conventional methods. Convergence analysis confirms the robustness and stability of the PPTV algorithm. Experimental results from our optical prototype, featuring a disorder-engineered surface for wavefront modulation, validate PPTV's ability to achieve high-throughput, high-resolution imaging with a substantially reduced measurement burden. By enabling high-quality reconstructions from fewer measurements, PPTV paves the way for more compact, efficient, and cost-effective lensless microscopy systems on a chip, with potential applications in digital pathology, endoscopy, point-of-care diagnostics, and high-content screening.
comment: 14 pages, 7 figures
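For readers unfamiliar with the formulation, one common way to write such a total-variation regularized intensity-based reconstruction is sketched below; the symbols are generic placeholders, not notation taken from the paper.

    \min_{x} \; \sum_{k=1}^{K} \bigl\| \, |A_k x|^2 - y_k \, \bigr\|_2^2 \; + \; \lambda \, \mathrm{TV}(x),
    \qquad
    \mathrm{TV}(x) = \sum_{i,j} \sqrt{ |x_{i+1,j} - x_{i,j}|^2 + |x_{i,j+1} - x_{i,j}|^2 }

Here x is the complex object, A_k is the forward model (coded modulation plus propagation) for the k-th intensity measurement y_k, and lambda weights the sparsity-promoting TV penalty; a proximal solver alternates gradient steps on the data term with a proximal step for the non-smooth TV term.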
♻ ☆ Vague Preference Policy Learning for Conversational Recommendation
Conversational recommendation systems (CRS) commonly assume users have clear preferences, leading to potential over-filtering of relevant alternatives. However, users often exhibit vague, non-binary preferences. We introduce the Vague Preference Multi-round Conversational Recommendation (VPMCR) scenario, employing a soft estimation mechanism to accommodate users' vague and dynamic preferences while mitigating over-filtering. In VPMCR, we propose Vague Preference Policy Learning (VPPL), consisting of Ambiguity-aware Soft Estimation (ASE) and Dynamism-aware Policy Learning (DPL). ASE captures preference vagueness by estimating scores for clicked and non-clicked options, using a choice-based approach and time-aware preference decay. DPL leverages ASE's preference distribution to guide the conversation and adapt to preference changes for recommendations or attribute queries. Extensive experiments demonstrate VPPL's effectiveness within VPMCR, outperforming existing methods and setting a new benchmark. Our work advances CRS by accommodating users' inherent ambiguity and relative decision-making processes, improving real-world applicability.
Computation and Language 15
♻ ☆ Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.
♻ ☆ Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
Diffusion-based generative models' impressive ability to create convincing images has garnered global attention. However, their complex structures and operations often pose challenges for non-experts to grasp. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly integrates a visual overview of Stable Diffusion's complex structure with explanations of the underlying operations. By comparing image generation of prompt variants, users can discover the impact of keyword changes on image generation. A 56-participant user study demonstrates that Diffusion Explainer offers substantial learning benefits to non-experts. Our tool has been used by over 10,300 users from 124 countries at https://poloclub.github.io/diffusion-explainer/.
comment: 5 pages, 7 figures
♻ ☆ The Foundational Capabilities of Large Language Models in Predicting Postoperative Risks Using Clinical Notes
Clinical notes recorded during a patient's perioperative journey hold immense informational value. Advances in large language models (LLMs) offer opportunities to harness this information. Using 84,875 pre-operative notes and their associated surgical cases from 2018 to 2021, we examine the performance of LLMs in predicting six postoperative risks using various fine-tuning strategies. Pretrained LLMs outperformed traditional word embeddings by an absolute AUROC of 38.3% and AUPRC of 33.2%. Self-supervised fine-tuning further improved performance by 3.2% and 1.5%. Incorporating labels into training further increased AUROC by 1.8% and AUPRC by 2%. The highest performance was achieved with a unified foundation model, with improvements of 3.6% for AUROC and 2.6% for AUPRC compared to self-supervision, highlighting the foundational capabilities of LLMs in predicting postoperative risks, which could be potentially beneficial when deployed for perioperative care.
comment: Codes are publicly available at: https://github.com/cja5553/LLMs_in_perioperative_care
♻ ☆ A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond
Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize code -- holds immense potential for transformative impacts on the whole society. Bridging the gap between Natural Language and Programming Language, this domain has drawn significant attention from researchers in both research communities over the past few years. This survey presents a systematic and chronological review of the advancements in code intelligence, encompassing over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. We follow the historical progression to trace the paradigm shifts across different research phases (e.g., from modeling code with recurrent neural networks to the era of Large Language Models). Concurrently, we highlight the major technical transitions in models, tasks, and evaluations spanning through different stages. For applications, we also observe a co-evolving shift. It spans from initial endeavors to tackling specific scenarios, through exploring a diverse array of tasks during its rapid expansion, to currently focusing on tackling increasingly complex and varied real-world challenges. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains. Finally, we delve into both the opportunities and challenges associated with this field, alongside elucidating our insights on the most promising research directions. An ongoing, dynamically updated project and resources associated with this survey have been released at https://github.com/QiushiSun/NCISurvey.
comment: 64 pages, 6 figures, 10 tables, 695 references
♻ ☆ Characterizing Online Toxicity During the 2022 Mpox Outbreak: A Computational Analysis of Topical and Network Dynamics
Background: Online toxicity, encompassing behaviors such as harassment, bullying, hate speech, and the dissemination of misinformation, has become a pressing social concern in the digital age. The 2022 Mpox outbreak, initially termed "Monkeypox" but subsequently renamed to mitigate associated stigmas and societal concerns, serves as a poignant backdrop to this issue. Objective: In this research, we undertake a comprehensive analysis of the toxic online discourse surrounding the 2022 Mpox outbreak. Our objective is to dissect its origins, characterize its nature and content, trace its dissemination patterns, and assess its broader societal implications, with the goal of providing insights that can inform strategies to mitigate such toxicity in future crises. Methods: We collected more than 1.6 million unique tweets and analyzed them from five dimensions, including context, extent, content, speaker, and intent. Utilizing BERT-based topic modeling and social network community clustering, we delineated the toxic dynamics on Twitter. Results: We identified five high-level topic categories in the toxic online discourse on Twitter, including disease (46.6%), health policy and healthcare (19.3%), homophobia (23.9%), politics (6.0%), and racism (4.1%). Through the toxicity diffusion networks of mentions, retweets, and the top users, we found that retweets of toxic content were widespread, while influential users rarely engaged with or countered this toxicity through retweets. Conclusions: By tracking topical dynamics, we can track the changing popularity of toxic content online, providing a better understanding of societal challenges. Network dynamics spotlight key social media influencers and their intents, indicating that addressing these central figures in toxic discourse can enhance crisis communication and inform policy-making.
comment: 36 pages, 8 figure, and 12 tables
♻ ☆ Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing
Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). While prior studies have primarily focused on the post-generation analysis and refinement of outputs, this paper centers on the effectiveness of queries in eliciting accurate responses from LLMs. We present HalluciBot, a model that estimates the query's propensity to hallucinate before generation, without invoking any LLMs during inference. HalluciBot can serve as a proxy reward model for query rewriting, offering a general framework to estimate query quality based on accuracy and consensus. In essence, HalluciBot investigates how poorly constructed queries can lead to erroneous outputs; moreover, by employing query rewriting guided by HalluciBot's empirical estimates, we demonstrate that 95.7% output accuracy can be achieved for Multiple Choice questions. The training procedure for HalluciBot consists of perturbing 369,837 queries n times, employing n+1 independent LLM agents, sampling an output from each query, conducting a Multi-Agent Monte Carlo simulation on the sampled outputs, and training an encoder classifier. The idea of perturbation is the outcome of our ablation studies that measure the increase in output diversity (+12.5 agreement spread) by perturbing a query in lexically different but semantically similar ways. Therefore, HalluciBot paves the way to ratiocinate (76.0% test F1 score, 46.6% in saved computation on hallucinatory queries), rewrite (+30.2% positive class transition from hallucinatory to non-hallucinatory), rank (+50.6% positive class transition from hallucinatory to non-hallucinatory), and route queries to effective pipelines.
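The Monte Carlo pseudo-labelling step described above can be caricatured as follows: sample one answer per perturbed query, measure consensus across the samples, and label the original query by whether consensus (or accuracy, when ground truth is known) clears a threshold. The threshold and names below are illustrative assumptions, not the paper's exact procedure.

    from collections import Counter

    def pseudo_label(sampled_answers, consensus_threshold=0.5):
        counts = Counter(sampled_answers)
        top_answer, top_count = counts.most_common(1)[0]
        consensus = top_count / len(sampled_answers)
        label = "non-hallucinatory" if consensus >= consensus_threshold else "hallucinatory"
        return label, top_answer, consensus

    print(pseudo_label(["B", "B", "C", "B", "D"]))
    # ('non-hallucinatory', 'B', 0.6)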
♻ ☆ Evaluating Large Language Models for Health-related Queries with Presuppositions ACL 2024
As corporations rush to integrate large language models (LLMs) to their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT 26% and BingChat 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy, and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.
comment: Findings of ACL 2024
♻ ☆ Towards a theory of how the structure of language is acquired by deep neural networks
How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG) -- a tree-like generative model that captures many of the hierarchical structures found in natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables, the longer the range the deeper the variable. In addition, a finite training set limits the resolution of correlations to an effective range, whose size grows with that of the training set. As a result, a Language Model trained with increasingly many examples can build a deeper representation of the grammar's structure, thus reaching good performance despite the high dimensionality of the problem. We conjecture that the relationship between training set size and effective range of correlations holds beyond our synthetic datasets. In particular, our conjecture predicts how the scaling law for the test loss behaviour with training set size depends on the length of the context window, which we confirm empirically in Shakespeare's plays and Wikipedia articles.
comment: 9 pages, 4 figures (main)
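To give a feel for the kind of synthetic data involved, the sketch below samples sentences from a tiny probabilistic context-free grammar; the grammar itself is made up and far simpler than those studied in the paper.

    import random

    GRAMMAR = {
        "S":  [(("NP", "VP"), 1.0)],
        "NP": [(("the", "N"), 1.0)],
        "VP": [(("V", "NP"), 0.7), (("V",), 0.3)],
        "N":  [(("cat",), 0.5), (("dog",), 0.5)],
        "V":  [(("sees",), 0.5), (("chases",), 0.5)],
    }

    def sample(symbol="S"):
        if symbol not in GRAMMAR:                  # terminal token
            return [symbol]
        rules, weights = zip(*GRAMMAR[symbol])
        rhs = random.choices(rules, weights=weights)[0]
        return [token for child in rhs for token in sample(child)]

    print(" ".join(sample()))                      # e.g. "the dog chases the cat"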
♻ ☆ Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition
With the release of increasingly many open-source emotion recognition datasets on social media platforms and the rapid development of computing resources, multimodal emotion recognition (MER) tasks have begun to receive widespread research attention. The MER task extracts and fuses complementary semantic information from different modalities in order to classify the speaker's emotions. However, existing feature fusion methods usually map the features of different modalities into the same feature space for information fusion, which cannot eliminate the heterogeneity between different modalities and makes the subsequent learning of emotion class boundaries challenging. To tackle the above problems, we propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method. Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces. Secondly, we build a generator and a discriminator for the three modal features through adversarial representation, which can achieve information interaction between modalities and eliminate heterogeneity among modalities. Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information and learn intra-class and inter-class boundary information of emotion categories. Specifically, we construct a graph structure for the three modal features and perform contrastive representation learning on nodes with different emotions in the same modality and the same emotion in different modalities, which can improve the feature representation ability of nodes. Extensive experiments show that the AR-IIGCN method can significantly improve emotion recognition accuracy on the IEMOCAP and MELD datasets.
comment: 14 pages, 6 figures
♻ ☆ Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations
The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively.
comment: 11 pages, 3 tables
♻ ☆ DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition
With the continuous development of deep learning (DL), the task of multimodal dialogue emotion recognition (MDER), an essential branch of DL, has recently received extensive research attention. MDER aims to identify the emotional information contained in different modalities, e.g., text, video, and audio, in different dialogue scenes. However, existing research has focused on modeling contextual semantic information and dialogue relations between speakers while ignoring the impact of event relations on emotion. To tackle the above issues, we propose a novel Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition (DER-GCN) method. It models dialogue relations between speakers and captures latent event relation information. Specifically, we construct a weighted multi-relationship graph to simultaneously capture the dependencies between speakers and event relations in a dialogue. Moreover, we also introduce a Self-Supervised Masked Graph Autoencoder (SMGAE) to improve the fusion representation ability of features and structures. Next, we design a new Multiple Information Transformer (MIT) to capture the correlation between different relations, which can provide a better fusion of the multivariate information between relations. Finally, we propose a loss optimization strategy based on contrastive learning to enhance the representation learning ability of minority class features. We conduct extensive experiments on the IEMOCAP and MELD benchmark datasets, which verify the effectiveness of the DER-GCN model. The results demonstrate that our model significantly improves both the average accuracy and the F1 score of emotion recognition.
comment: 14 pages, 7 figures
♻ ☆ Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models ACL 2024
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by Choi et al. (2023), who show that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression (Shwartz-Ziv and Tishby, 2017). This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at https://github.com/ChnQ/TracingLLM.
comment: Accepted at ACL 2024
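Linear probing of a checkpoint, as used above, amounts to fitting a linear classifier on frozen hidden states and reading off held-out accuracy; the sketch below assumes the hidden states have already been extracted and uses scikit-learn for the probe. The helper name and data are stand-ins.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def probe_checkpoint(hidden_states, labels, train_frac=0.8):
        # hidden_states: (num_examples, hidden_dim) array of frozen activations
        # labels: binary concept labels (e.g. toxic vs. non-toxic continuation)
        split = int(train_frac * len(labels))
        probe = LogisticRegression(max_iter=1000)
        probe.fit(hidden_states[:split], labels[:split])
        return probe.score(hidden_states[split:], labels[split:])

    # toy stand-in data; real use would extract activations from each pre-training checkpoint
    accuracy = probe_checkpoint(np.random.randn(200, 64), np.random.randint(0, 2, 200))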
♻ ☆ A Novel ICD Coding Method Based on Associated and Hierarchical Code Description Distillation
ICD (International Classification of Diseases) coding involves assigning ICD codes to a patient's visit based on their medical notes. ICD coding is a challenging multi-label text classification problem due to noisy medical document inputs. Recent advancements in automated ICD coding have enhanced performance by integrating additional data and knowledge bases with the encoding of medical notes and codes. However, most of them ignore the code hierarchy, leading to improper code assignments. To address these problems, we propose a novel framework based on associated and hierarchical code description distillation (AHDD) for better code representation learning and avoidance of improper code assignment. Specifically, we leverage the code description and the hierarchical structure inherent to the ICD codes. The code description also informs the attention layer and the output layer. Experimental results on the benchmark dataset show the superiority of the proposed framework over several state-of-the-art baselines.
♻ ☆ FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction ACL 2024
Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.
comment: ACL 2024 camera ready. Code and data at https://github.com/Babelscape/FENICE
♻ ☆ ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models ACL 2024
In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
comment: Accepted in ACL 2024
Computer Vision and Pattern Recognition 17
♻ ☆ Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.
♻ ☆ CURLing the Dream: Contrastive Representations for World Modeling in Reinforcement Learning
In this work, we present Curled-Dreamer, a novel reinforcement learning algorithm that integrates contrastive learning into the DreamerV3 framework to enhance performance in visual reinforcement learning tasks. By incorporating the contrastive loss from the CURL algorithm and a reconstruction loss from an autoencoder, Curled-Dreamer achieves significant improvements in various DeepMind Control Suite tasks. Our extensive experiments demonstrate that Curled-Dreamer consistently outperforms state-of-the-art algorithms, achieving higher mean and median scores across a diverse set of tasks. The results indicate that the proposed approach not only accelerates learning but also enhances the robustness of the learned policies. This work highlights the potential of combining different learning paradigms to achieve superior performance in reinforcement learning applications.
comment: Paper accepted for 24th International Conference on Control, Automation and Systems (ICCAS)
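A loose sketch of how a CURL-style contrastive term and a reconstruction term might be combined for the world model's encoder; the bilinear similarity, the stop-gradient on the positive branch, and the loss weights are assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def curl_loss(z_anchor, z_positive, W):
        # InfoNCE with a bilinear similarity between encodings of two augmentations of the same frames
        logits = z_anchor @ W @ z_positive.t()            # (B, B) similarity matrix
        labels = torch.arange(z_anchor.size(0))           # matching pairs lie on the diagonal
        return F.cross_entropy(logits, labels)

    def encoder_loss(obs_aug1, obs_aug2, recon, obs, encoder, W,
                     w_contrastive=1.0, w_recon=1.0):
        z1, z2 = encoder(obs_aug1), encoder(obs_aug2)
        loss_contrastive = curl_loss(z1, z2.detach(), W)  # CURL-style contrastive term
        loss_recon = F.mse_loss(recon, obs)               # autoencoder reconstruction term
        return w_contrastive * loss_contrastive + w_recon * loss_recon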
♻ ☆ Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification
This paper presents dilated Residual Network (ResNet) models for disease classification from retinal fundus images. Dilated convolution filters are used to replace normal convolution filters in the higher layers of the ResNet model (dilated ResNet) in order to enlarge the receptive field compared to the normal ResNet model for disease classification. This study introduces computer-assisted diagnostic tools that employ deep learning, enhanced with explainable AI techniques. These techniques aim to make the tool's decision-making process transparent, thereby enabling medical professionals to understand and trust the AI's diagnostic decision. They are particularly relevant in today's healthcare landscape, where there is a growing demand for transparency in AI applications to ensure their reliability and ethical use. The dilated ResNet is used as a replacement for the normal ResNet to enhance the classification accuracy of retinal eye diseases and reduce the required computing time. The dataset used in this work is the Ocular Disease Intelligent Recognition (ODIR) dataset, which is a structured ophthalmic database with eight classes covering most of the common retinal eye diseases. The evaluation metrics used in this work include precision, recall, accuracy, and F1 score. In this work, a comparative study has been made between normal ResNet models and dilated ResNet models on five variants, namely ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The dilated ResNet model shows promising results compared to the normal ResNet, with average F1 scores of 0.71, 0.70, 0.69, 0.67, and 0.70, respectively, for the five variants in ODIR multiclass disease classification.
comment: Added authors' contributions
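One convenient way to obtain a ResNet whose higher layers use dilated convolutions is shown below: torchvision can replace the strides of the last stages with dilation, enlarging the receptive field without extra downsampling. Whether this matches the paper's exact configuration is an assumption; the 8-way head corresponds to the ODIR classes.

    import torch
    from torchvision import models

    def build_dilated_resnet(num_classes=8):
        # replace strides with dilation in the last two stages (a common "dilated ResNet" recipe)
        model = models.resnet50(weights=None,
                                replace_stride_with_dilation=[False, True, True])
        model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
        return model

    logits = build_dilated_resnet()(torch.randn(1, 3, 224, 224))   # -> shape (1, 8)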
♻ ☆ Convex Hull Prediction for Adaptive Video Streaming by Recurrent Learning
Adaptive video streaming relies on the construction of efficient bitrate ladders to deliver the best possible visual quality to viewers under bandwidth constraints. The traditional method of content-dependent bitrate ladder selection requires a video shot to be pre-encoded with multiple encoding parameters to find the optimal operating points given by the convex hull of the resulting rate-quality curves. However, this pre-encoding step is equivalent to an exhaustive search process over the space of possible encoding parameters, which causes significant overhead in terms of both computation and time expenditure. To reduce this overhead, we propose a deep learning-based method of content-aware convex hull prediction. We employ a recurrent convolutional network (RCN) to implicitly analyze the spatiotemporal complexity of video shots in order to predict their convex hulls. A two-step transfer learning scheme is adopted to train our proposed RCN-Hull model, which ensures sufficient content diversity to analyze scene complexity, while also making it possible to capture the scene statistics of pristine source videos. Our experimental results reveal that our proposed model yields better approximations of the optimal convex hulls, and offers competitive time savings as compared to existing approaches. On average, the pre-encoding time was reduced by 53.8% by our method, while the average Bjontegaard delta bitrate (BD-rate) of the predicted convex hulls against ground truth was 0.26%, and the mean absolute deviation of the BD-rate distribution was 0.57%.
♻ ☆ Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations
Due to the successful development of deep image generation technology, forgery detection plays a more important role in social and economic security. Racial bias has not been explored thoroughly in the deep forgery detection field. In the paper, we first contribute a dedicated dataset called the Fair Forgery Detection (FairFD) dataset, where we prove the racial bias of public state-of-the-art (SOTA) methods. Different from existing forgery detection datasets, the self-constructed FairFD dataset contains a balanced racial ratio and diverse forgery generation images with the largest-scale subjects. Additionally, we identify the problems with naive fairness metrics when benchmarking forgery detection models. To comprehensively evaluate fairness, we design novel metrics including Approach Averaged Metric and Utility Regularized Metric, which can avoid deceptive results. We also present an effective and robust post-processing technique, Bias Pruning with Fair Activations (BPFA), which improves fairness without requiring retraining or weight updates. Extensive experiments conducted with 12 representative forgery detection models demonstrate the value of the proposed dataset and the reasonability of the designed fairness metrics. By applying the BPFA to the existing fairest detector, we achieve a new SOTA. Furthermore, we conduct more in-depth analyses to offer more insights to inspire researchers in the community.
♻ ☆ S3E: A Mulit-Robot Multimodal Dataset for Collaborative SLAM
The burgeoning demand for collaborative robotic systems to execute complex tasks collectively has intensified the research community's focus on advancing simultaneous localization and mapping (SLAM) in a cooperative context. Despite this interest, the scalability and diversity of existing datasets for collaborative trajectories remain limited, especially in scenarios with constrained perspectives where the generalization capabilities of Collaborative SLAM (C-SLAM) are critical for the feasibility of multi-agent missions. Addressing this gap, we introduce S3E, an expansive multimodal dataset. Captured by a fleet of unmanned ground vehicles traversing four distinct collaborative trajectory paradigms, S3E encompasses 13 outdoor and 5 indoor sequences. These sequences feature meticulously synchronized and spatially calibrated data streams, including 360-degree LiDAR point cloud, high-resolution stereo imagery, high-frequency inertial measurement units (IMU), and Ultra-wideband (UWB) relative observations. Our dataset not only surpasses previous efforts in scale, scene diversity, and data intricacy but also provides a thorough analysis and benchmarks for both collaborative and individual SLAM methodologies. For access to the dataset and the latest information, please visit our repository at https://pengyu-team.github.io/S3E.
♻ ☆ CILF-CIAE: CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation
The age estimation task aims to predict the age of an individual by analyzing facial features in an image. The development of age estimation can improve the efficiency and accuracy of various applications (e.g., age verification and secure access control). In recent years, contrastive language-image pre-training (CLIP) has been widely used in various multimodal tasks and has made some progress in the field of age estimation. However, existing CLIP-based age estimation methods require high memory usage (quadratic complexity) when globally modeling images, and lack an error feedback mechanism to prompt the model about the quality of age prediction results. To tackle the above issues, we propose a novel CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE). Specifically, we first introduce the CLIP model to extract image features and text semantic information respectively, and map them into a highly semantically aligned high-dimensional feature space. Next, we design a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images, and to fuse image and text semantic information. Compared with the quadratic complexity of the attention mechanism, the proposed FourierFormer has linear-logarithmic complexity. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through contrastive loss for image-text matching, thereby improving the interaction effect between different modalities. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. Through extensive experiments on multiple datasets, CILF-CIAE has achieved better age prediction results.
comment: 14 pages, 14 figures, 3 tables
♻ ☆ Graph Information Bottleneck for Remote Sensing Segmentation
Remote sensing segmentation has a wide range of applications in areas such as environmental protection and urban change detection. Despite the success of deep learning-based remote sensing segmentation methods (e.g., CNN and Transformer), they are not flexible enough to model irregular objects. In addition, existing graph contrastive learning methods usually adopt the way of maximizing mutual information to keep the node representations consistent between different graph views, which may cause the model to learn task-independent redundant information. To tackle the above problems, this paper treats images as graph structures and introduces a simple contrastive vision GNN (SC-ViG) architecture for remote sensing segmentation. Specifically, we construct a node-masked and edge-masked graph view to obtain an optimal graph structure representation, which can adaptively learn whether to mask nodes and edges. Furthermore, this paper innovatively introduces information bottleneck theory into graph contrastive learning to maximize task-related information while minimizing task-independent redundant information. Finally, we replace the convolutional module in UNet with the SC-ViG module to complete the segmentation and classification tasks of remote sensing images. Extensive experiments on publicly available real datasets demonstrate that our method outperforms state-of-the-art remote sensing image segmentation methods.
comment: 13 pages, 6 figures
♻ ☆ Palantir: Towards Efficient Super Resolution for Ultra-high-definition Live Streaming
Neural enhancement through super-resolution (SR) deep neural networks (DNNs) opens up new possibilities for ultra-high-definition (UHD) live streaming over existing encoding and networking infrastructure. Yet, the heavy SR DNN inference overhead leads to severe deployment challenges. To reduce the overhead, existing systems propose to apply DNN-based SR only on carefully selected anchor frames while upscaling non-anchor frames via the lightweight reusing-based SR approach. However, frame-level scheduling is coarse-grained and fails to deliver optimal efficiency. In this work, we propose Palantir, the first neural-enhanced UHD live streaming system with fine-grained patch-level scheduling. Two novel techniques are incorporated into Palantir to select the most beneficial anchor patches and support latency-sensitive UHD live streaming applications. Firstly, under the guidance of our pioneering and theoretical analysis, Palantir constructs a directed acyclic graph (DAG) for lightweight yet accurate SR quality estimation under any possible anchor patch set. Secondly, to further optimize the scheduling latency, Palantir improves parallelizability by refactoring the computation subprocedure of the estimation process into a sparse matrix-matrix multiplication operation. The evaluation results suggest that Palantir incurs a negligible scheduling latency accounting for less than 5.7% of the end-to-end latency requirement. When compared to the naive method of applying DNN-based SR on all the frames, Palantir can reduce the SR DNN inference overhead by 20 times (or 60 times) while preserving 54.0-82.6% (or 32.8-64.0%) of the quality gain. When compared to the state-of-the-art real-time frame-level scheduling strategy, Palantir can reduce the SR DNN inference overhead by 80.1% at most (and 38.4% on average) without sacrificing the video quality.
♻ ☆ Anticipating Future Object Compositions without Forgetting
Despite the significant advancements in computer vision models, their ability to generalize to novel object-attribute compositions remains limited. Existing methods for Compositional Zero-Shot Learning (CZSL) mainly focus on image classification. This paper aims to enhance CZSL in object detection without forgetting prior learned knowledge. We build on Grounding DINO, incorporating Compositional Soft Prompting (CSP) and extending it with Compositional Anticipation. We achieve a 70.5% improvement over CSP on the harmonic mean (HM) between seen and unseen compositions on the CLEVR dataset. Furthermore, we introduce Contrastive Prompt Tuning to incrementally address model confusion between similar compositions. We demonstrate the effectiveness of this method and achieve an increase of 14.5% in HM across the pretrain, increment, and unseen sets. Collectively, these methods provide a framework for learning various compositions with limited data, as well as improving the performance of underperforming compositions when additional data becomes available.
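The harmonic mean between seen- and unseen-composition accuracy, referenced above, is the standard CZSL summary metric; a quick illustration (numbers are made up):

```python
# Harmonic mean (HM) between seen- and unseen-composition accuracy; it penalizes
# imbalance between the two more strongly than the arithmetic mean would.
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

print(harmonic_mean(0.72, 0.41))  # ~0.522, vs. arithmetic mean 0.565
```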
♻ ☆ InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding
Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding that offers: 1) the longest video duration, averaging 52.59 minutes per video; 2) the largest number of question-answer pairs, 108.2K; 3) diversity in questions that examine nine different skills and include both multiple-choice and open-ended questions; and 4) a human-centric design, as the video sources come from movies and daily TV shows, with specific human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as open-source models. The evaluation shows significant challenges in our benchmark. Our findings reveal that even leading AI models like GPT-4o and Gemini 1.5 Flash face challenges in achieving high performance in long video understanding, with average accuracies of just 49.16% and 42.72%, and average scores of 3.22 and 2.71 out of 5, respectively. We hope this benchmark will stimulate the LMMs community towards long video and human-level understanding. Our benchmark can be accessed at https://vision-cair.github.io/InfiniBench/
comment: 24 pages, 25 figures
♻ ☆ Translating Images to Road Network: A Sequence-to-Sequence Perspective ICCV 2023
The extraction of road networks is essential for generating high-definition maps, since it enables the precise localization of road landmarks and their interconnections. However, generating road networks poses a significant challenge because it requires combining Euclidean data (e.g., road landmark locations) with non-Euclidean data (e.g., road topological connectivity), and few existing methods merge the two data domains effectively. Instead, our work establishes a unified representation of both data domains by projecting Euclidean and non-Euclidean data into an integer series called the RoadNet Sequence. Beyond modeling an auto-regressive sequence-to-sequence Transformer to understand the RoadNet Sequence, we decouple the dependencies in the RoadNet Sequence into a mixture of auto-regressive and non-autoregressive dependencies. Building on this, our proposed non-autoregressive sequence-to-sequence approach leverages the non-autoregressive dependencies while closing the gap to the auto-regressive ones, yielding gains in both efficiency and accuracy. We further identify two main bottlenecks in the current RoadNetTransformer on a non-overfitting split of the dataset: poor landmark detection limited by the BEV Encoder and error propagation to topology reasoning. We therefore propose Topology-Inherited Training to transfer better topology knowledge into RoadNetTransformer. Additionally, we collect SD-Maps from open-source map datasets and use this prior information to significantly improve landmark detection and reachability. Extensive experiments on the nuScenes dataset demonstrate the superiority of the RoadNet Sequence representation and the non-autoregressive approach over existing state-of-the-art alternatives.
comment: V1 is the ICCV 2023 conference version, and V2 is the extended version
♻ ☆ Geometric Prior Guided Feature Representation Learning for Long-Tailed Classification
Real-world data are long-tailed, and the lack of tail samples significantly limits a model's generalization ability. Although numerous class re-balancing approaches perform well for moderate class imbalance, additional knowledge must be introduced to help a tail class recover its underlying true distribution when the distribution observed from a few tail samples does not represent it properly, thus allowing the model to learn valuable information outside the observed domain. In this work, we propose to leverage the geometric information of the feature distributions of the well-represented head classes to guide the model in learning the underlying distributions of the tail classes. Specifically, we first systematically define the geometry of a feature distribution and similarity measures between geometries, and discover four phenomena regarding the relationship between the geometries of different feature distributions. Then, based on these four phenomena, we propose a feature uncertainty representation that perturbs tail features using the geometry of the head-class feature distribution. The goal is for the perturbed features to cover the underlying distribution of the tail class as much as possible, thereby improving the model's generalization performance in the test domain. Finally, we design a three-stage training scheme that enables feature uncertainty modeling to be applied successfully. Experiments on CIFAR-10/100-LT, ImageNet-LT, and iNaturalist2018 show that our proposed approach outperforms other similar methods on most metrics. In addition, the phenomena we discovered can provide new perspectives and theoretical foundations for subsequent studies.
comment: Accepted by IJCV 2024
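A generic sketch of the feature-perturbation idea described above, assuming the "geometry" is approximated by the head class's empirical covariance; the paper's actual uncertainty representation and three-stage training scheme are more involved:

```python
# Hypothetical sketch (NumPy): perturb tail-class features with Gaussian noise whose
# covariance comes from a well-represented head class, so augmented tail features
# spread toward the head class's distributional shape. Data and scales are made up.
import numpy as np

rng = np.random.default_rng(0)
head_feats = rng.normal(size=(5000, 128))        # many head-class feature vectors
tail_feats = rng.normal(size=(20, 128))          # few tail-class feature vectors

head_cov = np.cov(head_feats, rowvar=False)      # head-class feature "geometry"

def perturb_tail(feats: np.ndarray, cov: np.ndarray, scale: float = 0.5,
                 n_aug: int = 10) -> np.ndarray:
    """Draw zero-mean Gaussian perturbations with the head-class covariance."""
    noise = rng.multivariate_normal(np.zeros(cov.shape[0]), scale * cov,
                                    size=(feats.shape[0], n_aug))
    return (feats[:, None, :] + noise).reshape(-1, feats.shape[1])

augmented = perturb_tail(tail_feats, head_cov)
print(augmented.shape)  # (200, 128) augmented tail features
```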
♻ ☆ xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations ECCV24
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We devised a data processing pipeline from scratch and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos longer than 14 seconds and demonstrates competitive performance against state-of-the-art T2V models.
comment: Accepted by ECCV24 AI4VA
♻ ☆ CMAB: A First National-Scale Multi-Attribute Building Dataset in China Derived from Open Source Data and GeoAI
Rapidly acquiring three-dimensional (3D) building data, including geometric attributes like rooftop, height and orientations, as well as indicative attributes like function, quality, and age, is essential for accurate urban analysis, simulations, and policy updates. Current building datasets suffer from incomplete coverage of building multi-attributes. This paper introduces a geospatial artificial intelligence (GeoAI) framework for large-scale building modeling, presenting the first national-scale Multi-Attribute Building dataset (CMAB), covering 3,667 spatial cities, 29 million buildings, and 21.3 billion square meters of rooftops with an F1-Score of 89.93% in OCRNet-based extraction, totaling 337.7 billion cubic meters of building stock. We trained bootstrap aggregated XGBoost models with city administrative classifications, incorporating features such as morphology, location, and function. Using multi-source data, including billions of high-resolution Google Earth images and 60 million street view images (SVIs), we generated rooftop, height, function, age, and quality attributes for each building. Accuracy was validated through model benchmarks, existing similar products, and manual SVI validation, mostly above 80%. Our dataset and results are crucial for global SDGs and urban planning.
comment: 43 pages, 20 figures
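A minimal sketch of the bootstrap-aggregated XGBoost modeling mentioned above, assuming the xgboost package is installed; features, labels, and hyperparameters here are placeholders rather than the paper's actual attribute models:

```python
# Hypothetical sketch: bagged XGBoost classifiers built by manual bootstrap
# resampling, with predictions aggregated by averaging probabilities.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 12))                 # e.g. morphology/location/function features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # placeholder building-attribute label

models = []
for seed in range(5):                           # 5 bootstrap replicas
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    m = XGBClassifier(n_estimators=200, max_depth=4, random_state=seed)
    m.fit(X[idx], y[idx])
    models.append(m)

# Aggregate by averaging predicted probabilities across the bagged models.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
pred = (proba > 0.5).astype(int)
print("bagged training accuracy:", (pred == y).mean())
```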
♻ ☆ Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection WACV
Skeleton-based video anomaly detection (SVAD) is a crucial task in computer vision. Accurately identifying abnormal patterns or events enables operators to promptly detect suspicious activities, thereby enhancing safety. Achieving this demands a comprehensive understanding of human motions, both at body and region levels, while also accounting for the wide variations of performing a single action. However, existing studies fail to simultaneously address these crucial properties. This paper introduces a novel, practical and lightweight framework, namely Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection (GiCiSAD) to overcome the challenges associated with SVAD. GiCiSAD consists of three novel modules: the Graph Attention-based Forecasting module to capture the spatio-temporal dependencies inherent in the data, the Graph-level Jigsaw Puzzle Maker module to distinguish subtle region-level discrepancies between normal and abnormal motions, and the Graph-based Conditional Diffusion model to generate a wide spectrum of human motions. Extensive experiments on four widely used skeleton-based video datasets show that GiCiSAD outperforms existing methods with significantly fewer training parameters, establishing it as the new state-of-the-art.
comment: Accepted at the Winter Conference on Applications of Computer Vision (WACV). 17 pages, 6 figures, 6 tables
♻ ☆ A Survey for Foundation Models in Autonomous Driving
The advent of foundation models has revolutionized the fields of natural language processing and computer vision, paving the way for their application in autonomous driving (AD). This survey presents a comprehensive review of more than 40 research papers, demonstrating the role of foundation models in enhancing AD. Large language models contribute to planning and simulation in AD, particularly through their proficiency in reasoning, code generation and translation. In parallel, vision foundation models are increasingly adapted for critical tasks such as 3D object detection and tracking, as well as creating realistic driving scenarios for simulation and testing. Multi-modal foundation models, integrating diverse inputs, exhibit exceptional visual understanding and spatial reasoning, crucial for end-to-end AD. This survey not only provides a structured taxonomy, categorizing foundation models based on their modalities and functionalities within the AD domain but also delves into the methods employed in current research. It identifies the gaps between existing foundation models and cutting-edge AD approaches, thereby charting future research directions and proposing a roadmap for bridging these gaps.
Artificial Intelligence 26
♻ ☆ Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominently in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.
♻ ☆ Adversarial Domain Adaptation for Cross-user Activity Recognition Using Diffusion-based Noise-centred Learning
Human Activity Recognition (HAR) plays a crucial role in various applications such as human-computer interaction and healthcare monitoring. However, challenges persist in HAR models due to differences between the training data distribution and real-world data distributions, particularly evident in cross-user scenarios. This paper introduces a novel framework, termed Diffusion-based Noise-centered Adversarial Learning Domain Adaptation (Diff-Noise-Adv-DA), designed to address these challenges by leveraging generative diffusion modeling and adversarial learning techniques. Traditional HAR models often struggle with the diversity of user behaviors and sensor data distributions. Diff-Noise-Adv-DA innovatively integrates the inherent noise within diffusion models, harnessing its latent information to enhance domain adaptation. Specifically, the framework transforms noise into a critical carrier of activity and domain class information, facilitating robust classification across different user domains. Experimental evaluations demonstrate the effectiveness of Diff-Noise-Adv-DA in improving HAR model performance across different users, surpassing traditional domain adaptation methods. The framework not only mitigates distribution mismatches but also enhances data quality through noise-based denoising techniques.
♻ ☆ CURLing the Dream: Contrastive Representations for World Modeling in Reinforcement Learning
In this work, we present Curled-Dreamer, a novel reinforcement learning algorithm that integrates contrastive learning into the DreamerV3 framework to enhance performance in visual reinforcement learning tasks. By incorporating the contrastive loss from the CURL algorithm and a reconstruction loss from an autoencoder, Curled-Dreamer achieves significant improvements in various DeepMind Control Suite tasks. Our extensive experiments demonstrate that Curled-Dreamer consistently outperforms state-of-the-art algorithms, achieving higher mean and median scores across a diverse set of tasks. The results indicate that the proposed approach not only accelerates learning but also enhances the robustness of the learned policies. This work highlights the potential of combining different learning paradigms to achieve superior performance in reinforcement learning applications.
comment: Paper accepted for 24th International Conference on Control, Automation and Systems (ICCAS)
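A minimal sketch of the CURL-style contrastive term referenced above, assuming CURL's bilinear-similarity InfoNCE recipe; how it is wired into DreamerV3's world-model objective is omitted and not specified here:

```python
# Hypothetical sketch (PyTorch): CURL-style InfoNCE loss between embeddings of two
# augmentations of the same observations, with a learned bilinear similarity W.
import torch
import torch.nn.functional as F

def curl_contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                          W: torch.Tensor) -> torch.Tensor:
    """anchor/positive: (B, D) embeddings of two augmentations of the same frames."""
    logits = anchor @ W @ positive.t()                         # (B, B) bilinear similarities
    logits = logits - logits.max(dim=1, keepdim=True).values   # numerical stability
    labels = torch.arange(anchor.size(0))                      # diagonal entries are positives
    return F.cross_entropy(logits, labels)

B, D = 32, 128
W = torch.nn.Parameter(torch.eye(D))                           # learned jointly in practice
loss = curl_contrastive_loss(torch.randn(B, D), torch.randn(B, D), W)
print(loss.item())
```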
♻ ☆ Kolmogorov-Arnold Network for Online Reinforcement Learning
Kolmogorov-Arnold Networks (KANs) have shown potential as an alternative to Multi-Layer Perceptrons (MLPs) in neural networks, providing universal function approximation with fewer parameters and reduced memory usage. In this paper, we explore the use of KANs as function approximators within the Proximal Policy Optimization (PPO) algorithm. We evaluate this approach by comparing its performance to the original MLP-based PPO using the DeepMind Control Proprio Robotics benchmark. Our results indicate that the KAN-based reinforcement learning algorithm can achieve comparable performance to its MLP-based counterpart, often with fewer parameters. These findings suggest that KANs may offer a more efficient option for reinforcement learning models.
comment: Paper accepted at 24th International Conference on Control, Automation and Systems (ICCAS)
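A toy sketch of the "KAN as function approximator" idea above: each input-output edge carries its own learnable univariate function, here a fixed radial-basis expansion with learnable coefficients, stacked as a drop-in replacement for the MLP in a PPO policy or value head. Real KAN implementations use B-splines and adaptive grids; dimensions are placeholders:

```python
# Hypothetical sketch (PyTorch): a minimal KAN-style layer with per-edge learnable
# univariate functions built from Gaussian radial basis functions.
import torch
import torch.nn as nn

class RBFKANLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, n_basis: int = 8):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-2.0, 2.0, n_basis))
        self.coeff = nn.Parameter(torch.randn(out_dim, in_dim, n_basis) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_dim) -> Gaussian basis values per input: (B, in_dim, n_basis)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)
        # Each edge applies its own learned univariate function, then sums over inputs.
        return torch.einsum("bik,oik->bo", basis, self.coeff)

policy = nn.Sequential(RBFKANLayer(24, 64), RBFKANLayer(64, 6))  # obs_dim=24, act_dim=6
print(policy(torch.randn(5, 24)).shape)  # torch.Size([5, 6])
```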
♻ ☆ Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
Diffusion-based generative models' impressive ability to create convincing images has garnered global attention. However, their complex structures and operations often pose challenges for non-experts to grasp. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly integrates a visual overview of Stable Diffusion's complex structure with explanations of the underlying operations. By comparing image generation of prompt variants, users can discover the impact of keyword changes on image generation. A 56-participant user study demonstrates that Diffusion Explainer offers substantial learning benefits to non-experts. Our tool has been used by over 10,300 users from 124 countries at https://poloclub.github.io/diffusion-explainer/.
comment: 5 pages, 7 figures
♻ ☆ Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification
This paper presents dilated Residual Network (ResNet) models for disease classification from retinal fundus images. Dilated convolution filters replace normal convolution filters in the higher layers of the ResNet model (dilated ResNet) in order to enlarge the receptive field relative to the normal ResNet model for disease classification. This study introduces computer-assisted diagnostic tools that employ deep learning, enhanced with explainable AI techniques. These techniques aim to make the tool's decision-making process transparent, thereby enabling medical professionals to understand and trust the AI's diagnostic decisions. They are particularly relevant in today's healthcare landscape, where there is a growing demand for transparency in AI applications to ensure their reliability and ethical use. The dilated ResNet is used as a replacement for the normal ResNet to enhance the classification accuracy of retinal eye diseases and to reduce the required computing time. The dataset used in this work is the Ocular Disease Intelligent Recognition (ODIR) dataset, a structured ophthalmic database with eight classes covering most common retinal eye diseases. The evaluation metrics used in this work include precision, recall, accuracy, and F1 score. A comparative study is made between normal and dilated ResNet models on five variants, namely ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The dilated ResNet models show promising results compared with normal ResNet, with average F1 scores of 0.71, 0.70, 0.69, 0.67, and 0.70 for the respective variants in ODIR multiclass disease classification.
comment: Added authors' contributions
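A quick illustration of the core modification described above: swapping a standard 3x3 convolution for a dilated one enlarges the receptive field without adding parameters or changing the output size. This is a generic PyTorch demonstration, not the paper's exact network surgery:

```python
# Comparing a normal and a dilated 3x3 convolution on a higher-layer feature map.
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)

normal = nn.Conv2d(256, 256, kernel_size=3, padding=1)               # receptive field 3x3
dilated = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)  # receptive field 5x5

print(normal(x).shape, dilated(x).shape)  # both torch.Size([1, 256, 28, 28])
print(sum(p.numel() for p in normal.parameters()) ==
      sum(p.numel() for p in dilated.parameters()))  # True: same parameter count
```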
♻ ☆ Do Concept Bottleneck Models Respect Localities?
Concept-based methods explain model predictions using human-understandable concepts. These models require accurate concept predictors, yet the faithfulness of existing concept predictors to their underlying concepts is unclear. In this paper, we investigate the faithfulness of Concept Bottleneck Models (CBMs), a popular family of concept-based architectures, by looking at whether they respect "localities" in datasets. Localities involve using only relevant features when predicting a concept's value. When localities are not considered, concepts may be predicted based on spuriously correlated features, degrading performance and robustness. This work examines how CBM predictions change when perturbing model inputs, and reveals that CBMs may not capture localities, even when independent concepts are localised to non-overlapping feature subsets. Our empirical and theoretical results demonstrate that datasets with correlated concepts may lead to accurate but uninterpretable models that fail to learn localities. Overall, we find that CBM interpretability is fragile, as CBMs occasionally rely upon spurious features, necessitating further research into the robustness of concept predictors.
comment: Previous Version Accepted at NeurIPS 23 XAI in Action Workshop
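A simple sketch of the kind of locality check described above, assuming a stand-in concept predictor and a hand-specified "relevant feature" mask: perturb only the features outside a concept's designated subset and measure how much the concept prediction moves. The paper's actual perturbation protocol and models are not reproduced here:

```python
# Hypothetical sketch (NumPy): a locality probe for a concept predictor.
import numpy as np

rng = np.random.default_rng(0)

def concept_predictor(x: np.ndarray) -> np.ndarray:
    # Stand-in concept head; a CBM would supply a trained network here.
    w = np.linspace(0.0, 1.0, x.shape[-1])
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def locality_gap(x: np.ndarray, relevant: np.ndarray, n_trials: int = 100) -> float:
    """Mean |change| in concept prediction when only irrelevant features are perturbed."""
    base = concept_predictor(x)
    gaps = []
    for _ in range(n_trials):
        noise = rng.normal(scale=1.0, size=x.shape) * (~relevant)  # zero on relevant dims
        gaps.append(np.abs(concept_predictor(x + noise) - base).mean())
    return float(np.mean(gaps))

x = rng.normal(size=(32, 16))
relevant = np.zeros(16, dtype=bool)
relevant[:4] = True                     # the concept "should" only use dims 0-3
print(locality_gap(x, relevant))        # large values indicate the predictor ignores locality
```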
♻ ☆ Probabilistic Variational Causal Approach in Observational Studies
In this paper, we introduce a new causal methodology that accounts for the rarity and frequency of events in observational studies based on their relevance to the underlying problem. Specifically, we propose a direct causal effect metric called the Probabilistic Variational Causal Effect (PACE) and its variations adhering to certain postulates applicable to both non-binary and binary treatments. The PACE metric is derived by integrating the concept of total variation -- representing the purely causal component -- with interventions on the treatment value, combined with the probabilities of hypothetical transitioning between treatment levels. PACE features a parameter d, where lower values of d correspond to scenarios emphasizing rare treatment values, while higher values of d focus on situations where the causal impact of more frequent treatment levels is more relevant. Thus, instead of a single causal effect value, we provide a causal effect function of the degree d. Additionally, we introduce positive and negative PACE to measure the respective positive and negative causal changes in the outcome as exposure values shift. We also consider normalized versions of PACE, referred to as MEAN PACE. Furthermore, we provide an identifiability criterion for PACE to handle counterfactual challenges in observational studies, and we define several generalizations of our methodology. Lastly, we compare our framework with other well-known causal frameworks through the analysis of various examples.
comment: 35 pages, 5 figures, 2 tables
♻ ☆ Public Transit Arrival Prediction: a Seq2Seq RNN Approach
Arrival/travel times for public transit exhibit variability due to factors like seasonality, dwell times at bus stops, traffic signals, and travel demand fluctuations. The developing world in particular is plagued by additional factors like lack of lane discipline, excess vehicles, and diverse modes of transport. This renders bus arrival time prediction (BATP) a challenging problem, especially in the developing world. A novel data-driven model based on recurrent neural networks (RNNs) is proposed for real-time BATP in the current work. The model intelligently incorporates both spatial and temporal correlations in a unique (non-linear) fashion distinct from existing approaches. In particular, we propose a Gated Recurrent Unit (GRU) based Encoder-Decoder (ED), or Seq2Seq, RNN model (originally introduced for language translation) for BATP. The geometry of the dynamic real-time BATP problem fits naturally with the Encoder-Decoder RNN structure. We feed relevant additional synchronized inputs (from previous trips) at each step of the decoder (a feature classically unexplored in machine translation applications). Further motivated by accurately modelling congestion influences on travel time prediction, we additionally propose to use a bidirectional layer at the decoder (something unexplored in other time-series based ED application contexts). The effectiveness of the proposed algorithms is demonstrated on real field data collected under challenging traffic conditions. Our experiments indicate that the proposed method outperforms diverse existing state-of-the-art data-driven approaches proposed for the same problem.
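A minimal sketch in the spirit of the architecture described above, assuming a GRU encoder over the trip so far and a bidirectional GRU decoder that consumes synchronized exogenous inputs from previous trips; feature choices and dimensions are placeholders, not the paper's exact model:

```python
# Hypothetical sketch (PyTorch): GRU encoder-decoder for bus arrival time prediction.
import torch
import torch.nn as nn

class Seq2SeqBATP(nn.Module):
    def __init__(self, enc_in: int = 1, dec_in: int = 3, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(enc_in, hidden, batch_first=True)
        self.decoder = nn.GRU(dec_in, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # per-segment travel-time prediction

    def forward(self, observed: torch.Tensor, exogenous: torch.Tensor) -> torch.Tensor:
        # observed:  (B, T_past, enc_in)   travel times on segments already covered
        # exogenous: (B, T_future, dec_in) synchronized info from previous trips
        _, h = self.encoder(observed)          # (1, B, hidden) final encoder state
        h0 = h.repeat(2, 1, 1)                 # initialize both decoder directions
        out, _ = self.decoder(exogenous, h0)   # (B, T_future, 2*hidden)
        return self.head(out).squeeze(-1)      # (B, T_future) predicted travel times

model = Seq2SeqBATP()
pred = model(torch.randn(8, 12, 1), torch.randn(8, 20, 3))
print(pred.shape)  # torch.Size([8, 20])
```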
♻ ☆ A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond
Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize code -- holds immense potential for transformative impacts on the whole society. Bridging the gap between Natural Language and Programming Language, this domain has drawn significant attention from researchers in both research communities over the past few years. This survey presents a systematic and chronological review of the advancements in code intelligence, encompassing over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. We follow the historical progression to trace the paradigm shifts across different research phases (e.g., from modeling code with recurrent neural networks to the era of Large Language Models). Concurrently, we highlight the major technical transitions in models, tasks, and evaluations spanning through different stages. For applications, we also observe a co-evolving shift. It spans from initial endeavors to tackling specific scenarios, through exploring a diverse array of tasks during its rapid expansion, to currently focusing on tackling increasingly complex and varied real-world challenges. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains. Finally, we delve into both the opportunities and challenges associated with this field, alongside elucidating our insights on the most promising research directions. An ongoing, dynamically updated project and resources associated with this survey have been released at https://github.com/QiushiSun/NCISurvey.
comment: 64 pages, 6 figures, 10 tables, 695 references
♻ ☆ Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing
Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). While prior studies have primarily focused on the post-generation analysis and refinement of outputs, this paper centers on the effectiveness of queries in eliciting accurate responses from LLMs. We present HalluciBot, a model that estimates a query's propensity to hallucinate before generation, without invoking any LLMs during inference. HalluciBot can serve as a proxy reward model for query rewriting, offering a general framework to estimate query quality based on accuracy and consensus. In essence, HalluciBot investigates how poorly constructed queries can lead to erroneous outputs; moreover, by employing query rewriting guided by HalluciBot's empirical estimates, we demonstrate that 95.7% output accuracy can be achieved for Multiple Choice questions. The training procedure for HalluciBot consists of perturbing 369,837 queries n times, employing n+1 independent LLM agents, sampling an output from each query, conducting a Multi-Agent Monte Carlo simulation on the sampled outputs, and training an encoder classifier. The idea of perturbation is the outcome of our ablation studies, which measure the increase in output diversity (+12.5 agreement spread) obtained by perturbing a query in lexically different but semantically similar ways. Therefore, HalluciBot paves the way to ratiocinate (76.0% test F1 score, 46.6% saved computation on hallucinatory queries), rewrite (+30.2% positive class transition from hallucinatory to non-hallucinatory), rank (+50.6% positive class transition from hallucinatory to non-hallucinatory), and route queries to effective pipelines.
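A small sketch of the kind of consensus statistics a Monte Carlo simulation over n+1 agent outputs might produce, assuming a majority answer and an agreement spread between the most and least common answers; the paper's exact definitions are not given here and these labels are made up:

```python
# Hypothetical sketch: majority answer, agreement, and spread over sampled outputs.
from collections import Counter

def consensus_stats(answers: list[str]) -> tuple[str, float, float]:
    counts = Counter(answers)
    top_answer, top = counts.most_common(1)[0]
    agreement = top / len(answers)                      # fraction agreeing with the mode
    spread = (top - min(counts.values())) / len(answers)
    return top_answer, agreement, spread

samples = ["B", "B", "B", "A", "C", "B", "A", "B"]      # outputs from n+1 agents
print(consensus_stats(samples))                         # ('B', 0.625, 0.5)
```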
♻ ☆ Evaluating Large Language Models for Health-related Queries with Presuppositions ACL 2024
As corporations rush to integrate large language models (LLMs) to their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT 26% and BingChat 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy, and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.
comment: Findings of ACL 2024
♻ ☆ The Winnability of Klondike Solitaire and Many Other Patience Games
Our ignorance of the winnability percentage of the solitaire card game `Klondike' has been described as "one of the embarrassments of applied mathematics". Klondike, the game in the Windows Solitaire program, is just one of many single-player card games, generically called 'patience' or 'solitaire' games, for which players have long wanted to know how likely a particular game is to be winnable. A number of different games have been studied empirically in the academic literature and by non-academic enthusiasts. Here we show that a single general purpose Artificial Intelligence program named `Solvitaire' can be used to determine the winnability percentage of 73 variants of 35 different single-player card games with a 95% confidence interval of +/- 0.1% or better. For example, we report the winnability of Klondike as 81.945% +/- 0.084% (in the `thoughtful' variant where the player knows the rank and suit of all cards), a 30-fold reduction in confidence interval over the best previous result. The vast majority of our results are either entirely new or represent significant improvements on previous knowledge.
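A back-of-the-envelope check of the sampling effort such confidence intervals imply, using the normal approximation for a binomial proportion (the paper's exact statistical procedure may differ):

```python
# Deals required so that a 95% interval has the stated half-width: z*sqrt(p(1-p)/n) <= hw.
import math

def games_needed(p: float, half_width: float, z: float = 1.96) -> int:
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)

print(games_needed(0.81945, 0.00084))  # ~806,000 deals for the Klondike figure above
print(games_needed(0.5, 0.001))        # worst case for +/-0.1%: 960,400 deals
```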
♻ ☆ Explaining Explanations in Probabilistic Logic Programming
The emergence of tools based on artificial intelligence has also led to the need of producing explanations which are understandable by a human being. In most approaches, the system is considered a black box, making it difficult to generate appropriate explanations. In this work, though, we consider a setting where models are transparent: probabilistic logic programming (PLP), a paradigm that combines logic programming for knowledge representation and probability to model uncertainty. However, given a query, the usual notion of explanation is associated with a set of choices, one for each random variable of the model. Unfortunately, such a set does not explain why the query is true and, in fact, it may contain choices that are actually irrelevant for the considered query. To improve this situation, we present in this paper an approach to explaining explanations which is based on defining a new query-driven inference mechanism for PLP where proofs are labeled with "choice expressions", a compact and easy to manipulate representation for sets of choices. The combination of proof trees and choice expressions allows us to produce comprehensible query justifications with a causal structure.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will appear in the APLAS 2024 proceedings published by Springer LNCS
♻ ☆ Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning SC'24
The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands for computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework, and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieving performance approximating that of the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC.
comment: This is the preprint version of the paper accepted for presentation at the 2024 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'24). © 2024 IEEE. Personal use of this material is permitted. For other uses, permission from IEEE must be obtained. Please refer to IEEE Xplore for the final published version
♻ ☆ CILF-CIAE: CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation
The age estimation task aims to predict the age of an individual by analyzing facial features in an image. Progress in age estimation can improve the efficiency and accuracy of various applications (e.g., age verification and secure access control). In recent years, contrastive language-image pre-training (CLIP) has been widely used in various multimodal tasks and has made some progress in the field of age estimation. However, existing CLIP-based age estimation methods incur high memory usage (quadratic complexity) when globally modeling images, and they lack an error feedback mechanism to inform the model about the quality of its age predictions. To tackle these issues, we propose CILF-CIAE, a novel CLIP-driven Image-Language Fusion method for Correcting Inverse Age Estimation. Specifically, we first introduce the CLIP model to extract image features and text semantic information, and map them into a highly semantically aligned high-dimensional feature space. Next, we design a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images and to fuse image and text semantic information. Compared with the quadratic complexity of the attention mechanism, the proposed FourierFormer has log-linear complexity. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer with a contrastive image-text matching loss, thereby improving the interaction between different modalities. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. Extensive experiments on multiple datasets show that CILF-CIAE achieves better age prediction results.
comment: 14 pages, 14 figures, 3 tables
♻ ☆ Graph Information Bottleneck for Remote Sensing Segmentation
Remote sensing segmentation has a wide range of applications, including environmental protection and urban change detection. Despite the success of deep learning-based remote sensing segmentation methods (e.g., CNNs and Transformers), they are not flexible enough to model irregular objects. In addition, existing graph contrastive learning methods usually maximize mutual information to keep node representations consistent between different graph views, which may cause the model to learn task-independent redundant information. To tackle these problems, this paper treats images as graph structures and introduces a simple contrastive vision GNN (SC-ViG) architecture for remote sensing segmentation. Specifically, we construct node-masked and edge-masked graph views to obtain an optimal graph structure representation, adaptively learning whether to mask nodes and edges. Furthermore, we introduce information bottleneck theory into graph contrastive learning to maximize task-related information while minimizing task-independent redundant information. Finally, we replace the convolutional module in UNet with the SC-ViG module to complete the segmentation and classification tasks for remote sensing images. Extensive experiments on publicly available real-world datasets demonstrate that our method outperforms state-of-the-art remote sensing image segmentation methods.
comment: 13 pages, 6 figures
♻ ☆ Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations
The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively.
comment: 11 pages, 3 tables
♻ ☆ DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition
With the continuous development of deep learning (DL), the task of multimodal dialogue emotion recognition (MDER) has recently received extensive research attention; it is also an essential branch of DL. MDER aims to identify the emotional information contained in different modalities, e.g., text, video, and audio, in different dialogue scenes. However, existing research has focused on modeling contextual semantic information and dialogue relations between speakers while ignoring the impact of event relations on emotion. To tackle these issues, we propose a novel Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition (DER-GCN). It models dialogue relations between speakers and captures latent event relation information. Specifically, we construct a weighted multi-relationship graph to simultaneously capture the dependencies between speakers and event relations in a dialogue. Moreover, we introduce a Self-Supervised Masked Graph Autoencoder (SMGAE) to improve the fusion representation ability of features and structures. Next, we design a new Multiple Information Transformer (MIT) to capture the correlation between different relations, which provides better fusion of the multivariate information between relations. Finally, we propose a loss optimization strategy based on contrastive learning to enhance the representation learning ability of minority-class features. We conduct extensive experiments on the IEMOCAP and MELD benchmark datasets, which verify the effectiveness of the DER-GCN model. The results demonstrate that our model significantly improves both the average accuracy and the F1 score of emotion recognition.
comment: 14 pages, 7 figures
♻ ☆ Palantir: Towards Efficient Super Resolution for Ultra-high-definition Live Streaming
Neural enhancement through super-resolution (SR) deep neural networks (DNNs) opens up new possibilities for ultra-high-definition (UHD) live streaming over existing encoding and networking infrastructure. Yet, the heavy SR DNN inference overhead leads to severe deployment challenges. To reduce the overhead, existing systems propose to apply DNN-based SR only on carefully selected anchor frames while upscaling non-anchor frames via a lightweight reuse-based SR approach. However, frame-level scheduling is coarse-grained and fails to deliver optimal efficiency. In this work, we propose Palantir, the first neural-enhanced UHD live streaming system with fine-grained patch-level scheduling. Two novel techniques are incorporated into Palantir to select the most beneficial anchor patches and support latency-sensitive UHD live streaming applications. Firstly, guided by our pioneering theoretical analysis, Palantir constructs a directed acyclic graph (DAG) for lightweight yet accurate SR quality estimation under any possible anchor patch set. Secondly, to further optimize the scheduling latency, Palantir improves parallelizability by refactoring the computation subprocedure of the estimation process into a sparse matrix-matrix multiplication operation. The evaluation results suggest that Palantir incurs negligible scheduling latency, accounting for less than 5.7% of the end-to-end latency requirement. When compared to the naive method of applying DNN-based SR on all frames, Palantir can reduce the SR DNN inference overhead by 20 times (or 60 times) while preserving 54.0-82.6% (or 32.8-64.0%) of the quality gain. When compared to the state-of-the-art real-time frame-level scheduling strategy, Palantir can reduce the SR DNN inference overhead by 80.1% at most (and 38.4% on average) without sacrificing video quality.
♻ ☆ Suspicion-Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4
Unlike perfect information games, where all elements are known to every player, imperfect information games emulate the real-world complexities of decision-making under uncertain or incomplete information. GPT-4, the recent breakthrough in large language models (LLMs) trained on massive passive data, is notable for its knowledge retrieval and reasoning abilities. This paper delves into the applicability of GPT-4's learned knowledge for imperfect information games. To achieve this, we introduce \textbf{Suspicion-Agent}, an innovative agent that leverages GPT-4's capabilities for performing in imperfect information games. With proper prompt engineering to achieve different functions, Suspicion-Agent based on GPT-4 demonstrates remarkable adaptability across a range of imperfect information card games. Importantly, GPT-4 displays a strong high-order theory of mind (ToM) capacity, meaning it can understand others and intentionally impact others' behavior. Leveraging this, we design a planning strategy that enables GPT-4 to competently play against different opponents, adapting its gameplay style as needed, while requiring only the game rules and descriptions of observations as input. In the experiments, we qualitatively showcase the capabilities of Suspicion-Agent across three different imperfect information games and then quantitatively evaluate it in Leduc Hold'em. The results show that Suspicion-Agent can potentially outperform traditional algorithms designed for imperfect information games, without any specialized training or examples. In order to encourage and foster deeper insights within the community, we make our game-related data publicly available.
♻ ☆ Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models ACL 2024
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from a LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by~\citet{choi2023understanding} that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression~\citep{shwartz2017opening}. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}.
comment: Accepted at ACL 2024
♻ ☆ Enhancing Transferability of Adversarial Attacks with GE-AdvGAN+: A Comprehensive Framework for Gradient Editing
Transferable adversarial attacks pose significant threats to deep neural networks, particularly in black-box scenarios where internal model information is inaccessible. Studying adversarial attack methods helps advance the performance of defense mechanisms and explore model vulnerabilities. These methods can uncover and exploit weaknesses in models, promoting the development of more robust architectures. However, current methods for transferable attacks often come with substantial computational costs, limiting their deployment and application, especially in edge computing scenarios. Adversarial generative models, such as Generative Adversarial Networks (GANs), are characterized by their ability to generate samples without the need for retraining after an initial training phase. GE-AdvGAN, a recent method for transferable adversarial attacks, is based on this principle. In this paper, we propose a novel general framework for gradient editing-based transferable attacks, named GE-AdvGAN+, which integrates nearly all mainstream attack methods to enhance transferability while significantly reducing computational resource consumption. Our experiments demonstrate the compatibility and effectiveness of our framework. Compared to the baseline AdvGAN, our best-performing method, GE-AdvGAN++, achieves an average ASR improvement of 47.8. Additionally, it surpasses the latest competing algorithm, GE-AdvGAN, with an average ASR increase of 5.9. The framework also exhibits enhanced computational efficiency, achieving 2217.7 FPS, outperforming traditional methods such as BIM and MI-FGSM. The implementation code for our GE-AdvGAN+ framework is available at https://github.com/GEAdvGANP
♻ ☆ ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models ACL 2024
In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
comment: Accepted in ACL 2024
♻ ☆ xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations ECCV24
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We devised a data processing pipeline from scratch and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos longer than 14 seconds and demonstrates competitive performance against state-of-the-art T2V models.
comment: Accepted by ECCV24 AI4VA
♻ ☆ Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection WACV
Skeleton-based video anomaly detection (SVAD) is a crucial task in computer vision. Accurately identifying abnormal patterns or events enables operators to promptly detect suspicious activities, thereby enhancing safety. Achieving this demands a comprehensive understanding of human motions, both at body and region levels, while also accounting for the wide variations of performing a single action. However, existing studies fail to simultaneously address these crucial properties. This paper introduces a novel, practical and lightweight framework, namely Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection (GiCiSAD) to overcome the challenges associated with SVAD. GiCiSAD consists of three novel modules: the Graph Attention-based Forecasting module to capture the spatio-temporal dependencies inherent in the data, the Graph-level Jigsaw Puzzle Maker module to distinguish subtle region-level discrepancies between normal and abnormal motions, and the Graph-based Conditional Diffusion model to generate a wide spectrum of human motions. Extensive experiments on four widely used skeleton-based video datasets show that GiCiSAD outperforms existing methods with significantly fewer training parameters, establishing it as the new state-of-the-art.
comment: Accepted at the Winter Conference on Applications of Computer Vision (WACV). 17 pages, 6 figures, 6 tables
Machine Learning 22
♻ ☆ Adversarial Domain Adaptation for Cross-user Activity Recognition Using Diffusion-based Noise-centred Learning
Human Activity Recognition (HAR) plays a crucial role in various applications such as human-computer interaction and healthcare monitoring. However, challenges persist in HAR models due to differences between the training data distribution and real-world data distributions, particularly evident in cross-user scenarios. This paper introduces a novel framework, termed Diffusion-based Noise-centered Adversarial Learning Domain Adaptation (Diff-Noise-Adv-DA), designed to address these challenges by leveraging generative diffusion modeling and adversarial learning techniques. Traditional HAR models often struggle with the diversity of user behaviors and sensor data distributions. Diff-Noise-Adv-DA innovatively integrates the inherent noise within diffusion models, harnessing its latent information to enhance domain adaptation. Specifically, the framework transforms noise into a critical carrier of activity and domain class information, facilitating robust classification across different user domains. Experimental evaluations demonstrate the effectiveness of Diff-Noise-Adv-DA in improving HAR model performance across different users, surpassing traditional domain adaptation methods. The framework not only mitigates distribution mismatches but also enhances data quality through noise-based denoising techniques.
♻ ☆ CURLing the Dream: Contrastive Representations for World Modeling in Reinforcement Learning
In this work, we present Curled-Dreamer, a novel reinforcement learning algorithm that integrates contrastive learning into the DreamerV3 framework to enhance performance in visual reinforcement learning tasks. By incorporating the contrastive loss from the CURL algorithm and a reconstruction loss from an autoencoder, Curled-Dreamer achieves significant improvements in various DeepMind Control Suite tasks. Our extensive experiments demonstrate that Curled-Dreamer consistently outperforms state-of-the-art algorithms, achieving higher mean and median scores across a diverse set of tasks. The results indicate that the proposed approach not only accelerates learning but also enhances the robustness of the learned policies. This work highlights the potential of combining different learning paradigms to achieve superior performance in reinforcement learning applications.
comment: Paper accepted for 24th International Conference on Control, Automation and Systems (ICCAS)
♻ ☆ Online-Score-Aided Federated Learning: Taming the Resource Constraints in Wireless Networks
While federated learning (FL) is a widely popular distributed machine learning strategy that protects data privacy, time-varying wireless network parameters and heterogeneous system configurations of wireless devices pose significant challenges. Although the limited radio and computational resources of the network and the clients, respectively, are widely acknowledged, two critical yet often ignored aspects are that (a) wireless devices can only dedicate a small chunk of their limited storage to the FL task and (b) new training samples may arrive in an online manner in many practical wireless applications. Therefore, we propose a new FL algorithm called OSAFL, specifically designed to learn tasks relevant to wireless applications under these practical considerations. Since it has long been proven that under extreme resource constraints clients may perform an arbitrary number of local training steps, which may lead to client drift under statistically heterogeneous data distributions, we leverage normalized gradient similarities and weight clients' updates based on optimized scores that improve the convergence rate of the proposed OSAFL algorithm. Our extensive simulation results on two different tasks -- each with three different datasets -- with four popular ML models validate the effectiveness of OSAFL compared to six existing state-of-the-art FL baselines.
comment: Under review for possible publication in IEEE Transactions on Communications
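A generic sketch of score-weighted federated aggregation in the spirit described above, assuming each client's normalized update is scored by its cosine similarity to the average update direction; the paper's actual score optimization is more elaborate and is not reproduced here:

```python
# Hypothetical sketch (NumPy): weight client updates by similarity-based scores.
import numpy as np

def score_weighted_aggregate(updates: list[np.ndarray]) -> np.ndarray:
    U = np.stack([u / (np.linalg.norm(u) + 1e-12) for u in updates])  # normalized gradients
    mean_dir = U.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir) + 1e-12
    sims = U @ mean_dir                              # cosine similarity per client
    scores = np.clip(sims, 0.0, None)                # ignore clients pointing "backwards"
    weights = scores / (scores.sum() + 1e-12)
    return np.einsum("c,cd->d", weights, np.stack(updates))  # weighted global update

rng = np.random.default_rng(1)
client_updates = [rng.normal(size=100) for _ in range(8)]
print(score_weighted_aggregate(client_updates).shape)  # (100,)
```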
♻ ☆ Kolmogorov-Arnold Network for Online Reinforcement Learning
Kolmogorov-Arnold Networks (KANs) have shown potential as an alternative to Multi-Layer Perceptrons (MLPs) in neural networks, providing universal function approximation with fewer parameters and reduced memory usage. In this paper, we explore the use of KANs as function approximators within the Proximal Policy Optimization (PPO) algorithm. We evaluate this approach by comparing its performance to the original MLP-based PPO using the DeepMind Control Proprio Robotics benchmark. Our results indicate that the KAN-based reinforcement learning algorithm can achieve comparable performance to its MLP-based counterpart, often with fewer parameters. These findings suggest that KANs may offer a more efficient option for reinforcement learning models.
comment: Paper accepted at 24th International Conference on Control, Automation and Systems (ICCAS)
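A simplified KAN-style layer that could stand in for an MLP trunk in a PPO policy or value network. It keeps the defining idea of learnable univariate functions on edges summed at nodes, but uses a fixed Gaussian-bump basis instead of the B-splines in reference KAN implementations, so it is an illustrative approximation rather than the paper's setup.

```python
import torch
import torch.nn as nn

class SimpleKANLayer(nn.Module):
    """Each input-output edge carries a learnable univariate function, parameterized
    here as a weighted sum of Gaussian bumps plus a SiLU residual term."""
    def __init__(self, in_dim, out_dim, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*grid_range, num_basis))  # (K,)
        self.width = (grid_range[1] - grid_range[0]) / num_basis
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)
        self.silu_weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)

    def forward(self, x):                                        # x: (B, in_dim)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)  # (B, in, K)
        edge_out = torch.einsum("bik,oik->boi", basis, self.coef)                 # (B, out, in)
        edge_out = edge_out + torch.nn.functional.silu(x).unsqueeze(1) * self.silu_weight
        return edge_out.sum(dim=-1)                              # sum over inputs -> (B, out)
```

A policy network for PPO could then stack two such layers in place of the usual MLP, keeping the rest of the PPO loop unchanged.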
♻ ☆ Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
Diffusion-based generative models' impressive ability to create convincing images has garnered global attention. However, their complex structures and operations often pose challenges for non-experts to grasp. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly integrates a visual overview of Stable Diffusion's complex structure with explanations of the underlying operations. By comparing image generation of prompt variants, users can discover the impact of keyword changes on image generation. A 56-participant user study demonstrates that Diffusion Explainer offers substantial learning benefits to non-experts. Our tool has been used by over 10,300 users from 124 countries at https://poloclub.github.io/diffusion-explainer/.
comment: 5 pages, 7 figures
♻ ☆ Do Concept Bottleneck Models Respect Localities?
Concept-based methods explain model predictions using human-understandable concepts. These models require accurate concept predictors, yet the faithfulness of existing concept predictors to their underlying concepts is unclear. In this paper, we investigate the faithfulness of Concept Bottleneck Models (CBMs), a popular family of concept-based architectures, by looking at whether they respect "localities" in datasets. Localities involve using only relevant features when predicting a concept's value. When localities are not considered, concepts may be predicted based on spuriously correlated features, degrading performance and robustness. This work examines how CBM predictions change when perturbing model inputs, and reveals that CBMs may not capture localities, even when independent concepts are localised to non-overlapping feature subsets. Our empirical and theoretical results demonstrate that datasets with correlated concepts may lead to accurate but uninterpretable models that fail to learn localities. Overall, we find that CBM interpretability is fragile, as CBMs occasionally rely upon spurious features, necessitating further research into the robustness of concept predictors.
comment: Previous Version Accepted at NeurIPs 23 XAI in Action Workshop
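A small probe, in the spirit of the perturbation analysis described above, that checks whether a concept predictor respects locality: perturb only features outside the concept's relevant subset and measure the shift in the concept prediction. The predictor interface and the additive Gaussian noise model are assumptions for illustration.

```python
import numpy as np

def locality_probe(concept_predictor, x, relevant_idx, num_trials=100, noise_scale=1.0, seed=None):
    """Average shift in a concept's predicted value when only *irrelevant* features of a
    single input vector x are perturbed; a locality-respecting predictor should show
    a shift close to zero."""
    rng = np.random.default_rng(seed)
    baseline = concept_predictor(x)
    irrelevant_idx = np.setdiff1d(np.arange(x.shape[-1]), relevant_idx)
    shifts = []
    for _ in range(num_trials):
        x_pert = x.copy()
        x_pert[irrelevant_idx] += rng.normal(scale=noise_scale, size=irrelevant_idx.shape)
        shifts.append(np.abs(concept_predictor(x_pert) - baseline))
    return float(np.mean(shifts))
```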
♻ ☆ Discovery of Small Ultra-short-period Planets Orbiting KG Dwarfs in Kepler Survey Using GPU Phase Folding and Deep Learning Detection System
Since the discovery of the first hot Jupiter orbiting a solar-type star, 51 Peg, in 1995, more than 4000 exoplanets have been identified using various observational techniques. The formation process of ultra-short-period (USP) sub-Earths remains elusive, and acquiring additional samples is essential for investigating this unique population. In our study, we employ a novel GPU Phase Folding algorithm combined with a Convolutional Neural Network, termed the GPFC method, on Kepler photometry data. This method enhances the transit search speed significantly over the traditional Box-fitting Least Squares method, allowing a complete search of the known KOI photometry data within hours using a commercial GPU card. To date, we have identified five promising sub-Earth short-period candidates: K00446.c, K01821.b, K01522.c, K03404.b, and K04978.b. A closer analysis reveals the following characteristics: K00446.c orbits a K dwarf on a 0.645091-day period. With a radius of $0.461R_\oplus$, it ranks as the second smallest USP discovered to date. K01821.b is a sub-Earth with a radius of $0.648R_\oplus$, orbiting a G dwarf over a 0.91978-day period. It is the second smallest USP among all confirmed USPs orbiting G dwarfs in the NASA Archive. K01522.c has a radius of $0.704 R_\oplus$ and completes an orbit around a Sun-like G dwarf in 0.64672 days; K03404.b, with a radius of $0.738 R_\oplus$, orbits a G dwarf on a 0.68074-day period; and K04978.b, with its planetary radius of $0.912 R_\oplus$, orbits a G dwarf, completing an orbit every 0.94197 days. Three of our finds, K01821.b, K01522.c and K03404.b, rank as the smallest planets among all confirmed USPs orbiting G dwarfs in the Kepler dataset. The discovery of these small exoplanets underscores the promising capability of the GPFC method for searching for small, new transiting exoplanets in photometry data from Kepler, TESS, and future space transit missions.
comment: 24 pages, 40 figures; To be published in the Monthly Notices of the Royal Astronomical Society (MNRAS)
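For intuition, a single-period CPU phase-folding sketch of the kind of computation the GPU Phase Folding algorithm parallelises over many trial periods; the binning scheme and detection statistic are illustrative, not the paper's exact pipeline.

```python
import numpy as np

def phase_fold(time, flux, period, num_bins=200):
    """Fold a light curve at a trial period and bin it in phase; a transit shows up
    as a dip in the binned profile."""
    phase = (time % period) / period
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    idx = np.digitize(phase, bins) - 1
    binned = np.array([flux[idx == b].mean() if np.any(idx == b) else np.nan
                       for b in range(num_bins)])
    return bins[:-1], binned

def folded_snr(binned_flux):
    """A simple detection statistic: depth of the deepest bin relative to the scatter."""
    finite = binned_flux[np.isfinite(binned_flux)]
    return (np.median(finite) - finite.min()) / (finite.std() + 1e-12)
```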
♻ ☆ Convex Hull Prediction for Adaptive Video Streaming by Recurrent Learning
Adaptive video streaming relies on the construction of efficient bitrate ladders to deliver the best possible visual quality to viewers under bandwidth constraints. The traditional method of content-dependent bitrate ladder selection requires a video shot to be pre-encoded with multiple encoding parameters to find the optimal operating points given by the convex hull of the resulting rate-quality curves. However, this pre-encoding step is equivalent to an exhaustive search process over the space of possible encoding parameters, which causes significant overhead in terms of both computation and time expenditure. To reduce this overhead, we propose a deep learning-based method of content-aware convex hull prediction. We employ a recurrent convolutional network (RCN) to implicitly analyze the spatiotemporal complexity of video shots in order to predict their convex hulls. A two-step transfer learning scheme is adopted to train our proposed RCN-Hull model, which ensures sufficient content diversity to analyze scene complexity, while also making it possible to capture the scene statistics of pristine source videos. Our experimental results reveal that our proposed model yields better approximations of the optimal convex hulls, and offers competitive time savings as compared to existing approaches. On average, our method reduced the pre-encoding time by 53.8%, while the average Bjontegaard delta bitrate (BD-rate) of the predicted convex hulls against ground truth was 0.26%, and the mean absolute deviation of the BD-rate distribution was 0.57%.
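To make the prediction target concrete, a short sketch of how the optimal operating set is read off exhaustively pre-encoded points: keep only the encodings on the upper convex envelope of the rate-quality cloud. (The RCN-Hull model above learns to predict this hull without the pre-encoding step; the function below is just the classical construction.)

```python
def rate_quality_convex_hull(bitrates, qualities):
    """Return the operating points on the upper convex envelope of (bitrate, quality) pairs."""
    pts = sorted(zip(bitrates, qualities))          # sort by bitrate
    # Pareto pass: keep points whose quality strictly increases with bitrate.
    pareto = []
    for r, q in pts:
        if not pareto or q > pareto[-1][1]:
            pareto.append((r, q))
    # Convexity pass: keep only points where the quality-per-bit slope decreases.
    hull = []
    for r, q in pareto:
        while len(hull) >= 2:
            (r1, q1), (r2, q2) = hull[-2], hull[-1]
            # drop the middle point if it lies on or below the chord (r1,q1)-(r,q)
            if (q2 - q1) * (r - r1) <= (q - q1) * (r2 - r1):
                hull.pop()
            else:
                break
        hull.append((r, q))
    return hull
```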
♻ ☆ Public Transit Arrival Prediction: a Seq2Seq RNN Approach
Arrival/travel times for public transit exhibit variability on account of factors like seasonality, dwell times at bus stops, traffic signals, travel demand fluctuations, etc. The developing world in particular is plagued by additional factors like lack of lane discipline, excess vehicles, diverse modes of transport and so on. This makes bus arrival time prediction (BATP) a challenging problem, especially in the developing world. In the current work, a novel data-driven model based on recurrent neural networks (RNNs) is proposed for real-time BATP. The model intelligently incorporates both spatial and temporal correlations in a unique (non-linear) fashion distinct from existing approaches. In particular, we propose a Gated Recurrent Unit (GRU) based Encoder-Decoder (ED), or Seq2Seq, RNN model (originally introduced for language translation) for BATP. The geometry of the dynamic real-time BATP problem enables a nice fit with the Encoder-Decoder based RNN structure. We feed relevant additional synchronized inputs (from previous trips) at each step of the decoder (a feature classically unexplored in machine translation applications). Further, motivated by the need to accurately model congestion influences on travel time prediction, we additionally propose to use a bidirectional layer at the decoder (something unexplored in other time-series based ED application contexts). The effectiveness of the proposed algorithms is demonstrated on real field data collected from challenging traffic conditions. Our experiments indicate that the proposed method outperforms diverse existing state-of-the-art data-driven approaches proposed for the same problem.
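A PyTorch sketch of the described architecture: a GRU encoder over the trip so far, a bidirectional GRU decoder that consumes synchronized exogenous features (e.g. from previous trips) at each step, and a per-segment travel-time head. Dimensions, feature choices, and the way the encoder state seeds both decoder directions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BATPSeq2Seq(nn.Module):
    """Encoder-decoder sketch for bus arrival time prediction."""
    def __init__(self, enc_in, dec_in, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(enc_in, hidden, batch_first=True)
        self.decoder = nn.GRU(dec_in, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, observed_segments, future_exogenous):
        # observed_segments: (B, T_past, enc_in); future_exogenous: (B, T_future, dec_in)
        _, h = self.encoder(observed_segments)          # h: (1, B, hidden)
        h0 = torch.cat([h, h], dim=0)                   # seed both decoder directions
        dec_out, _ = self.decoder(future_exogenous, h0) # (B, T_future, 2*hidden)
        return self.head(dec_out).squeeze(-1)           # predicted per-segment travel times
```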
♻ ☆ Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing
Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). While prior studies have primarily focused on the post-generation analysis and refinement of outputs, this paper centers on the effectiveness of queries in eliciting accurate responses from LLMs. We present HalluciBot, a model that estimates a query's propensity to hallucinate before generation, without invoking any LLMs during inference. HalluciBot can serve as a proxy reward model for query rewriting, offering a general framework to estimate query quality based on accuracy and consensus. In essence, HalluciBot investigates how poorly constructed queries can lead to erroneous outputs. Moreover, by employing query rewriting guided by HalluciBot's empirical estimates, we demonstrate that 95.7% output accuracy can be achieved for Multiple Choice questions. The training procedure for HalluciBot consists of perturbing 369,837 queries n times, employing n+1 independent LLM agents, sampling an output from each query, conducting a Multi-Agent Monte Carlo simulation on the sampled outputs, and training an encoder classifier. The idea of perturbation is the outcome of our ablation studies, which measure the increase in output diversity (+12.5 agreement spread) from perturbing a query in lexically different but semantically similar ways. Therefore, HalluciBot paves the way to ratiocinate (76.0% test F1 score, 46.6% in saved computation on hallucinatory queries), rewrite (+30.2% positive class transition from hallucinatory to non-hallucinatory), rank (+50.6% positive class transition from hallucinatory to non-hallucinatory), and route queries to effective pipelines.
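A hedged Monte Carlo sketch of the labelling idea: perturb a query, sample one answer per variant from independent agents, and use disagreement as an empirical hallucination signal to supervise the encoder classifier. The `perturb` function and the agent callables are placeholders for the paper's LLM machinery, not real APIs.

```python
from collections import Counter

def hallucination_propensity(query, perturb, agents, n=5):
    """Estimate a query's hallucination propensity from answer disagreement.
    `perturb(query)` returns a semantically similar rewording; `agents` is a list of
    n+1 callables, each mapping a query string to an answer string (placeholders)."""
    variants = [query] + [perturb(query) for _ in range(n)]
    answers = [agent(variant) for agent, variant in zip(agents, variants)]
    counts = Counter(answers)
    consensus_answer, consensus_count = counts.most_common(1)[0]
    agreement = consensus_count / len(answers)
    return {"consensus": consensus_answer,
            "agreement": agreement,
            "propensity": 1.0 - agreement}   # candidate training target for the classifier
```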
♻ ☆ Evaluating Large Language Models for Health-related Queries with Presuppositions ACL 2024
As corporations rush to integrate large language models (LLMs) into their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT with 26%, and BingChat with 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.
comment: Findings of ACL 2024
♻ ☆ Towards a theory of how the structure of language is acquired by deep neural networks
How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG) -- a tree-like generative model that captures many of the hierarchical structures found in natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables: the longer the range, the deeper the variable. In addition, a finite training set limits the resolution of correlations to an effective range, whose size grows with that of the training set. As a result, a Language Model trained with increasingly many examples can build a deeper representation of the grammar's structure, thus reaching good performance despite the high dimensionality of the problem. We conjecture that the relationship between training set size and effective range of correlations holds beyond our synthetic datasets. In particular, our conjecture predicts how the scaling of the test loss with training set size depends on the length of the context window, which we confirm empirically in Shakespeare's plays and Wikipedia articles.
comment: 9 pages, 4 figures (main)
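A toy illustration of the measurement behind this analysis: sample strings from a small PCFG and estimate token-token correlations as a function of distance. The grammar and the indicator-based correlation are illustrative, far simpler than the paper's setup.

```python
import random
import numpy as np

# A toy PCFG: each nonterminal maps to (probability, expansion) pairs (rules are illustrative).
GRAMMAR = {
    "S": [(0.5, ["A", "B"]), (0.5, ["B", "A"])],
    "A": [(0.7, ["a", "A"]), (0.3, ["a"])],
    "B": [(0.7, ["b", "B"]), (0.3, ["b"])],
}

def sample(symbol="S"):
    if symbol not in GRAMMAR:                      # terminal symbol
        return [symbol]
    probs, expansions = zip(*GRAMMAR[symbol])
    expansion = random.choices(expansions, weights=probs, k=1)[0]
    return [tok for child in expansion for tok in sample(child)]

def token_correlation(sentences, distance, token="a"):
    """Correlation between indicator(token at i) and indicator(token at i + distance)."""
    x, y = [], []
    for s in sentences:
        for i in range(len(s) - distance):
            x.append(s[i] == token)
            y.append(s[i + distance] == token)
    return float(np.corrcoef(np.array(x, float), np.array(y, float))[0, 1])

sentences = [sample() for _ in range(5000)]
print([round(token_correlation(sentences, d), 3) for d in range(1, 6)])
```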
♻ ☆ Enhanced Federated Optimization: Adaptive Unbiased Client Sampling with Reduced Variance
Federated Learning (FL) is a distributed learning paradigm to train a global model across multiple devices without collecting local data. In FL, a server typically selects a subset of clients for each training round to optimize resource usage. Central to this process is the technique of unbiased client sampling, which ensures a representative selection of clients. Current methods primarily utilize a random sampling procedure which, despite its effectiveness, achieves suboptimal efficiency owing to the loose upper bound caused by the sampling variance. In this work, by adopting an independent sampling procedure, we propose a federated optimization framework focused on adaptive unbiased client sampling, improving the convergence rate via an online variance reduction strategy. In particular, we present the first adaptive client sampler, K-Vib, employing an independent sampling procedure. K-Vib achieves a linear speed-up on the regret bound $\tilde{\mathcal{O}}\big(N^{\frac{1}{3}}T^{\frac{2}{3}}/K^{\frac{4}{3}}\big)$ within a set communication budget $K$. Empirical studies indicate that K-Vib doubles the speed compared to baseline algorithms, demonstrating significant potential in federated optimization.
comment: Under review
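A sketch of independent, unbiased client sampling with inverse-probability scaling, plus a common norm-proportional heuristic for choosing the sampling probabilities. The paper's K-Vib sampler optimizes these probabilities online with regret guarantees, which this sketch does not reproduce.

```python
import numpy as np

def independent_unbiased_round(client_updates, probs, seed=None):
    """Include each client independently with probability p_i and scale its update by
    1/p_i, which keeps the aggregated update unbiased for any choice of probabilities."""
    rng = np.random.default_rng(seed)
    included = rng.random(len(probs)) < probs
    agg = sum(u / p for u, p, keep in zip(client_updates, probs, included) if keep)
    return agg / len(client_updates), included

def update_probs_from_norms(update_norms, budget):
    """Variance-reducing heuristic: sample roughly proportionally to recent update norms,
    scaled so the expected number of participants matches the communication budget."""
    p = np.asarray(update_norms, dtype=float)
    p = budget * p / p.sum()
    return np.clip(p, 1e-3, 1.0)
```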
♻ ☆ An experimental evaluation of Deep Reinforcement Learning algorithms for HVAC control
Heating, Ventilation, and Air Conditioning (HVAC) systems are a major driver of energy consumption in commercial and residential buildings. Recent studies have shown that Deep Reinforcement Learning (DRL) algorithms can outperform traditional reactive controllers. However, DRL-based solutions are generally designed for ad hoc setups and lack standardization for comparison. To fill this gap, this paper provides a critical and reproducible evaluation, in terms of comfort and energy consumption, of several state-of-the-art DRL algorithms for HVAC control. The study examines the controllers' robustness, adaptability, and trade-off between optimization goals by using the Sinergym framework. The results obtained confirm the potential of DRL algorithms, such as SAC and TD3, in complex scenarios and reveal several challenges related to generalization and incremental learning.
♻ ☆ Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer
Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory usage due to backpropagation. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal Hessian-informed zeroth-order optimizer, the first to leverage the diagonal Hessian to enhance zeroth-order optimization for fine-tuning LLMs. In addition, HiZOO avoids expensive additional memory costs and adds only one extra forward pass per step. Extensive experiments on various models (350M~66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. Moreover, we visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.
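An illustrative SPSA-style sketch of a zeroth-order step whose random perturbation is preconditioned by a diagonal curvature estimate, which is the spirit of the approach; it is not the paper's exact HiZOO update, and the Hessian-diagonal estimate is assumed to be given.

```python
import numpy as np

def zo_step(loss_fn, theta, hess_diag, lr=1e-3, mu=1e-3, seed=None):
    """One zeroth-order step: estimate a directional gradient from two forward passes
    and shrink the perturbation along high-curvature coordinates."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)
    scale = 1.0 / np.sqrt(hess_diag + 1e-8)      # diagonal-Hessian preconditioning
    u = scale * z
    g_hat = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2.0 * mu)  # two forward passes
    return theta - lr * g_hat * u
```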
♻ ☆ A rapid approach to urban traffic noise mapping with a generative adversarial network
With rapid urbanisation and the accompanying increase in traffic density, traffic noise has become a major concern in urban planning. However, traditional grid noise mapping methods have limitations in terms of time consumption, software costs, and a lack of parameter integration interfaces. These limitations hinder their ability to meet the need for iterative updates and rapid performance feedback in the early design stages of street-scale urban planning. Herein, we developed a rapid urban traffic noise mapping technique that leverages generative adversarial networks (GANs) as a surrogate model. This approach enables the rapid assessment of urban traffic noise distribution by using urban elements such as roads and buildings as the input. The mean values for the root mean squared error (RMSE) and structural similarity index (SSIM) are 0.3024 dB(A) and 0.8528, respectively, for the validation dataset. The trained model is integrated into Grasshopper as a tool, facilitating the rapid generation of traffic noise maps. This integration allows urban designers and planners, even those without expertise in acoustics, to easily anticipate changes in acoustic impact caused by design decisions in the early design stages.
comment: Accepted by Applied Acoustics
♻ ☆ Active Learning of Discrete-Time Dynamics for Uncertainty-Aware Model Predictive Control
Model-based control requires an accurate model of the system dynamics for precisely and safely controlling the robot in complex and dynamic environments. Moreover, in the presence of variations in the operating conditions, the model should be continuously refined to compensate for dynamics changes. In this paper, we present a self-supervised learning approach that actively models the dynamics of nonlinear robotic systems. We combine offline learning from past experience and online learning from current robot interaction with the unknown environment. These two ingredients enable a highly sample-efficient and adaptive learning process, capable of accurately inferring model dynamics in real-time even in operating regimes that greatly differ from the training distribution. Moreover, we design an uncertainty-aware model predictive controller that is heuristically conditioned to the aleatoric (data) uncertainty of the learned dynamics. This controller actively chooses the optimal control actions that (i) optimize the control performance and (ii) improve the efficiency of online learning sample collection. We demonstrate the effectiveness of our method through a series of challenging real-world experiments using a quadrotor system. Our approach showcases high resilience and generalization capabilities by consistently adapting to unseen flight conditions, while it significantly outperforms classical and adaptive control baselines.
♻ ☆ Primal-dual extrapolation methods for monotone inclusions under local Lipschitz continuity
In this paper we consider a class of monotone inclusion (MI) problems of finding a zero of the sum of two monotone operators, in which one operator is maximal monotone while the other is {\it locally Lipschitz} continuous. We propose primal-dual extrapolation methods to solve them using a point and operator extrapolation technique, whose parameters are chosen by a backtracking line search scheme. The proposed methods enjoy an operation complexity of ${\cal O}(\log \varepsilon^{-1})$ and ${\cal O}(\varepsilon^{-1}\log \varepsilon^{-1})$, measured by the number of fundamental operations consisting only of evaluations of one operator and the resolvent of the other operator, for finding an $\varepsilon$-residual solution of strongly and non-strongly MI problems, respectively. The latter complexity significantly improves the previously best operation complexity ${\cal O}(\varepsilon^{-2})$. As a byproduct, complexity results of the primal-dual extrapolation methods are also obtained for finding an $\varepsilon$-KKT or $\varepsilon$-residual solution of convex conic optimization, conic constrained saddle point, and variational inequality problems under {\it local Lipschitz} continuity. We provide preliminary numerical results to demonstrate the performance of the proposed methods.
comment: To appear in Mathematics of Operations Research
♻ ☆ Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations
The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNNs) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNNs, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to a Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into an MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52\% and 35\%, respectively.
comment: 11 pages, 3 tables
♻ ☆ VA-learning as a more efficient alternative to Q-learning ICML 2023
In reinforcement learning, the advantage function is critical for policy improvement, but is often extracted from a learned Q-function. A natural question is: Why not learn the advantage function directly? In this work, we introduce VA-learning, which directly learns the advantage function and value function using bootstrapping, without explicit reference to Q-functions. VA-learning learns off-policy and enjoys similar theoretical guarantees to Q-learning. Thanks to the direct learning of the advantage and value functions, VA-learning improves the sample efficiency over Q-learning both in tabular implementations and in deep RL agents on Atari-57 games. We also identify a close connection between VA-learning and the dueling architecture, which partially explains why a simple architectural change to DQN agents tends to improve performance.
comment: Accepted to ICML 2023 as a conference paper
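A tabular sketch of the idea: learn V(s) and A(s, a) directly with a shared bootstrapped TD error, recovering Q(s, a) = V(s) + A(s, a) implicitly. The paper's exact updates and off-policy corrections may differ; this is only an illustration under that decomposition.

```python
import numpy as np

def va_learning_update(V, A, s, a, r, s_next, pi, alpha=0.1, gamma=0.99):
    """One transition update. V: array of shape (S,), A: array of shape (S, num_actions),
    pi: target-policy action probabilities of shape (S, num_actions)."""
    q_next = V[s_next] + A[s_next]                     # implicit Q(s', .) = V(s') + A(s', .)
    v_next = float(np.dot(pi[s_next], q_next))         # E_{a'~pi}[Q(s', a')]
    td_error = r + gamma * v_next - (V[s] + A[s][a])   # shared bootstrapped TD error
    V[s] += alpha * td_error
    A[s][a] += alpha * td_error
    return V, A
```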
♻ ☆ Adaptive Split Balancing for Optimal Random Forest
In this paper, we propose a new random forest algorithm that constructs the trees using a novel adaptive split-balancing method. Rather than relying on the widely used random feature selection, we propose a permutation-based balanced splitting criterion. The adaptive split balancing forest (ASBF) achieves minimax optimality under the Lipschitz class. Its localized version, which fits local regressions at the leaf level, attains the minimax rate under the broad H\"older class $\mathcal{H}^{q,\beta}$ of problems for any $q\in\mathbb{N}$ and $\beta\in(0,1]$. We identify that over-reliance on auxiliary randomness in tree construction may compromise the approximation power of trees, leading to suboptimal results. Conversely, the proposed less random, permutation-based approach demonstrates optimality over a wide range of models. Although random forests are known to perform well empirically, their theoretical convergence rates are slow. Simplified versions that construct trees without data dependence offer faster rates but lack adaptability during tree growth. Our proposed method achieves optimality in simple, smooth scenarios while adaptively learning the tree structure from the data. Additionally, we establish uniform upper bounds and demonstrate that ASBF improves dimensionality dependence in average treatment effect estimation problems. Simulation studies and real-world applications demonstrate our methods' superior performance over existing random forests.
♻ ☆ A Survey for Foundation Models in Autonomous Driving
The advent of foundation models has revolutionized the fields of natural language processing and computer vision, paving the way for their application in autonomous driving (AD). This survey presents a comprehensive review of more than 40 research papers, demonstrating the role of foundation models in enhancing AD. Large language models contribute to planning and simulation in AD, particularly through their proficiency in reasoning, code generation and translation. In parallel, vision foundation models are increasingly adapted for critical tasks such as 3D object detection and tracking, as well as creating realistic driving scenarios for simulation and testing. Multi-modal foundation models, integrating diverse inputs, exhibit exceptional visual understanding and spatial reasoning, crucial for end-to-end AD. This survey not only provides a structured taxonomy, categorizing foundation models based on their modalities and functionalities within the AD domain but also delves into the methods employed in current research. It identifies the gaps between existing foundation models and cutting-edge AD approaches, thereby charting future research directions and proposing a roadmap for bridging these gaps.
Information Retrieval 3
☆ PSLF: A PID Controller-incorporated Second-order Latent Factor Analysis Model for Recommender System
A second-order-based latent factor (SLF) analysis model demonstrates superior performance in graph representation learning, particularly for high-dimensional and incomplete (HDI) interaction data, by incorporating the curvature information of the loss landscape. However, its objective function is commonly bi-linear and non-convex, causing the SLF model to suffer from a low convergence rate. To address this issue, this paper proposes a PID controller-incorporated SLF (PSLF) model, leveraging two key strategies: a) refining learning error estimation by incorporating the PID controller principles, and b) acquiring second-order information insights through Hessian-vector products. Experimental results on multiple HDI datasets indicate that the proposed PSLF model outperforms four state-of-the-art latent factor models based on advanced optimizers regarding convergence rates and generalization performance.
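A minimal sketch of the PID ingredient: the instantaneous prediction error on an observed (user, item) entry is refined with proportional, integral, and derivative terms before driving a stochastic latent-factor update. The gains, regularization, and update form are assumptions; the paper additionally injects second-order information via Hessian-vector products, which is not reproduced here.

```python
import numpy as np

class PIDError:
    """Refine the instantaneous training error with P, I, and D terms (gains illustrative)."""
    def __init__(self, kp=1.0, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev = 0.0, 0.0

    def __call__(self, error):
        self.integral += error
        derivative = error - self.prev
        self.prev = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def sgd_latent_factor_step(p_u, q_i, rating, pid, lr=0.01, reg=0.02):
    """One SGD step on a single observed (user, item, rating) entry of an HDI matrix."""
    err = pid(rating - float(p_u @ q_i))       # PID-refined prediction error
    p_old = p_u.copy()
    p_u += lr * (err * q_i - reg * p_u)
    q_i += lr * (err * p_old - reg * q_i)
    return p_u, q_i
```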
☆ An Enhanced Batch Query Architecture in Real-time Recommendation CIKM 2024
In industrial recommendation systems on websites and apps, it is essential to recall and predict top-n results relevant to user interests from a content pool of billions within milliseconds. To cope with continuous data growth and improve real-time recommendation performance, we have designed and implemented a high-performance batch query architecture for real-time recommendation systems. Our contributions include optimizing hash structures with a cacheline-aware probing method to enhance coalesced hashing, as well as the implementation of a hybrid storage key-value service built upon it. Our experiments indicate this approach significantly surpasses conventional hash tables in batch query throughput, achieving up to 90% of the query throughput of random memory access when incorporating parallel optimization. The support for NVMe, integrating two-tier storage for hot and cold data, notably reduces resource consumption. Additionally, the system facilitates dynamic updates, automated sharding of attributes and feature embedding tables, and introduces innovative protocols for consistency in batch queries, thereby enhancing the effectiveness of real-time incremental learning updates. This architecture has been deployed and in use for over a year in the bilibili recommendation system, a video content community with hundreds of millions of users, supporting a 10x increase in model computation with minimal resource growth and improving outcomes while preserving the system's real-time performance.
comment: 8 pages, 10 figures, CIKM 2024 Applied Research Paper
♻ ☆ Decoding Knowledge Claims: The Evaluation of Scientific Publication Contributions through Semantic Analysis
The surge in scientific publications challenges the use of publication counts as a measure of scientific progress, requiring alternative metrics that emphasize the quality and novelty of scientific contributions rather than sheer quantity. This paper proposes the use of Relaxed Word Mover's Distance (RWMD), a semantic text similarity measure, to evaluate the novelty of scientific papers. We hypothesize that RWMD can more effectively gauge the growth of scientific knowledge. To test such an assumption, we apply RWMD to evaluate seminal papers, with Hirsch's H-Index paper as a primary case study. We compare RWMD results across three groups: 1) H-Index-related papers, 2) scientometric studies, and 3) unrelated papers, aiming to discern redundant literature and hype from genuine innovations. Findings suggest that emphasizing knowledge claims offers a deeper insight into scientific contributions, marking RWMD as a promising alternative method to traditional citation metrics, thus better tracking significant scientific breakthroughs.
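For reference, a compact NumPy implementation of the Relaxed Word Mover's Distance lower bound used here: each word's mass flows to its nearest counterpart in the other document, and the tighter of the two one-sided relaxations is kept. Embeddings and nBOW weights are assumed to be precomputed.

```python
import numpy as np

def rwmd(weights_a, weights_b, emb_a, emb_b):
    """Relaxed Word Mover's Distance between two documents.
    weights_a/weights_b: normalized bag-of-words weights, shapes (|A|,) and (|B|,);
    emb_a/emb_b: word embedding matrices, shapes (|A|, d) and (|B|, d)."""
    # pairwise Euclidean distances between the two documents' word vectors
    d = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)  # (|A|, |B|)
    a_to_b = float(weights_a @ d.min(axis=1))   # each word in A flows to its closest word in B
    b_to_a = float(weights_b @ d.min(axis=0))   # and vice versa
    return max(a_to_b, b_to_a)                  # the tighter relaxation is the RWMD lower bound
```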